Behavioral and Brain Sciences commentary (1997)
Information theory: the holy grail of cortical computation?
James V Stone
Behavioral and Brain Sciences commentary on the article "In search of common foundations for cortical computation", by Phillips, W A and Singer, W, 20(4), 698, December 1997.
Simple hypotheses are intrinsically attractive, and, for this reason, need to be formulated with utmost precision if they are to be testable. Unfortunately, it is hard to see how the authors' hypothesis might be unambiguously refuted. Despite this, the authors have provided much evidence consistent with the hypothesis, and have proposed a natural and powerful extension for information theoretic approaches to learning.
Theories: the seductive, the simple, and the true
The idea that all tasks undertaken by the neocortex are executed using a single computational principle is both seductive and compelling. Phillips and Singer have marshalled evidence from diverse sources to provide a cogent case in favour of this idea. However, as with all universal explanations, there is a danger that its elegance is mistaken for evidence of its truth. There is thus all the more reason to treat such ideas with caution. This is not intended as a criticism of the idea itself, but as an antidote to the inevitable temptations associated with such attractive ideas. As noted by Einstein, a theory should be as simple as possible, but not too simple.
Characterising the holy grail of cortical computation
The idea of a common foundation for cortical computation was explored as early as 1970 by Marr (1970), and has been discussed by a number of authors since (Creutzfeldt, 1978; Szentagothai, 1978; Douglas and Martin, 1994; Ebdon, 1993; Stone, 1996a). To date, the idea has remained an intriguing possibility, rather than a testable hypothesis. Despite the efforts of Phillips and Singer to place this hypothesis on a firm footing, it is still not obvious how one would set about falsifying it in the form provided in their paper. That is, the authors have not indicated what type of experimental findings would suffice to refute the hypothesis.
For example, if it were possible to demonstrate that two cortical regions used qualitatively different operations, would this constitute a refutation? For a number of reasons, I think it would not; but this is due to my own interpretation of the hypothesis, and not a logical imperative of the hypothesis as expressed in the paper.
The issue is not whether such a finding would constitute a refutation, but rather, whether any finding is capable of unambiguously refuting the hypothesis. For example, the authors note that, although cortical columns are not central to the hypothesis developed here, criticism of this idea suggests limitations upon anatomical arguments for commonalities (2.1, para 5). This is a fair point, but it is not clear which types of argument, if any, would be immune to similar limitations.
Surely, a necessary first step in searching for the holy grail of cortical computation is to establish what form it might take, and what forms it could not take. Whilst the search for common foundations for cortical computation is undoubtedly a worthwhile endeavour, it would be helpful to establish conditions under which to call off the search, and conditions under which we might shout, Eureka!
Is maximising information enough?
Several authors have devised neural network models which learn by maximising information theoretic measures (Linsker, 1988; Becker, 1996; Stone, 1996b). As noted by Phillips and Singer, such rules operate in the absence of any ethological considerations, so that even variables which are irrelevant to an animal's behaviour would be extracted. This appears to be a fundamental limitation of information theoretic models, which extract either any variables (Linsker, 1988), or only those variables which conform to assumptions implicit in the learning algorithm (Becker, 1996; Stone, 1996a).
However, by introducing cross-stream contextual inputs as constraints on learning, the authors have effectively overcome an important limitation of conventional information theoretic approaches. Whilst others have argued that information maximisation methods are appropriate only for low level sensory processing (presumably because they tend to extract all variables in the input data), the authors argue that contextual information can be used to extract selectively only those variables which are of direct relevance to a particular set of behaviours.
This important insight opens up the possibility that a single principle can be used to account for learning of low level perceptual invariances, as well as higher order variables (such as the association between the shape and taste of a fruit). Moreover, it suggests a natural extension to a whole class of neural network models which learn by explicitly maximising information theoretic quantities (Stone, 1996b; Becker, 1996; Linsker, 1988).
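The selective effect of a contextual constraint can be illustrated with a toy calculation. The Python sketch below is not the learning rule proposed by Phillips and Singer; it simply assumes two hypothetical binary variables (here called `relevant` and `irrelevant`) and a hypothetical `context` stream which is a noisy copy of the first. Both variables carry about the same entropy, so an unconstrained information maximiser has no grounds for preferring one over the other, whereas mutual information with the context distinguishes them.

```python
import math
import random
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of a sequence."""
    n = len(symbols)
    counts = Counter(symbols)
    return -math.fsum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Estimate I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
n = 5000

# Two hypothetical binary variables implicit in the input (assumptions).
relevant = [rng.randint(0, 1) for _ in range(n)]    # e.g. shape of a fruit
irrelevant = [rng.randint(0, 1) for _ in range(n)]  # e.g. unrelated background

# Hypothetical context stream: the relevant variable with 10% of values flipped.
context = [r ^ (rng.random() < 0.1) for r in relevant]

h_relevant = entropy(relevant)
h_irrelevant = entropy(irrelevant)
mi_relevant = mutual_information(relevant, context)
mi_irrelevant = mutual_information(irrelevant, context)

print(f"H(relevant)   = {h_relevant:.3f} bits")
print(f"H(irrelevant) = {h_irrelevant:.3f} bits")
print(f"I(relevant; context)   = {mi_relevant:.3f} bits")
print(f"I(irrelevant; context) = {mi_irrelevant:.3f} bits")
```

Under these assumptions, the two entropies should be almost identical, whilst the mutual information with the context should be roughly half a bit for the relevant variable and close to zero for the irrelevant one.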
Maximising Shannon entropy is hard
On a more technical level, the general approach adopted for generating learning rules shares with others the assumption that information theoretic quantities are suited to extracting variables implicit in sensory data streams. Whilst information theory provides a principled method of deriving learning rules, it is not the only, nor necessarily the best, means to this end.
In its raw form, Shannon entropy takes no account of the temporal ordering of inputs. A number of authors (Foldiak, 1991; Stone, 1996a, 1996b; Becker, 1996; Barlow, 1996) have argued for the use of learning rules based on the tendency of distal variables to vary smoothly over time. Indeed, the BCM learning rule discussed in the paper (3.5, para 3) is important precisely because it takes advantage of the temporal sequence of inputs.
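The insensitivity of entropy to temporal order can be checked directly. The following Python sketch assumes a hypothetical slowly varying signal (a quantised sine wave standing in for a distal variable) and compares it with a shuffled copy of itself. The function `mean_squared_change` is a generic smoothness measure chosen for illustration, and is not the measure used in any of the papers cited above.

```python
import math
import random
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of a sequence."""
    n = len(symbols)
    counts = Counter(symbols)
    return -math.fsum((c / n) * math.log2(c / n) for c in counts.values())


def mean_squared_change(signal):
    """Mean squared difference between successive samples (low means smooth)."""
    steps = zip(signal, signal[1:])
    return sum((b - a) ** 2 for a, b in steps) / (len(signal) - 1)


# Hypothetical distal variable: a slow sine wave quantised to the levels 0..7.
smooth = [round(3.5 + 3.5 * math.sin(2 * math.pi * t / 200)) for t in range(1000)]

# The same values in a random temporal order.
shuffled = smooth[:]
random.Random(0).shuffle(shuffled)

h_smooth, h_shuffled = entropy(smooth), entropy(shuffled)
d_smooth, d_shuffled = mean_squared_change(smooth), mean_squared_change(shuffled)

print(f"entropy: smooth = {h_smooth:.3f}, shuffled = {h_shuffled:.3f} bits")
print(f"mean squared change: smooth = {d_smooth:.3f}, shuffled = {d_shuffled:.3f}")
```

Because shuffling leaves the histogram of values unchanged, the two entropies are the same, whereas the smoothness measure should be much larger for the shuffled sequence. A learning rule based only on entropy therefore cannot distinguish the two sequences.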
More recently, Becker (1996) has used temporal smoothness as an explicit assumption in deriving an information theoretic learning rule. Whilst Phillips and Singer's learning method ignores temporal contiguity of inputs, Becker's method ignores cross-stream constraints. It remains to be seen if these two approaches can be profitably combined, and if the resultant learning method can be related to learning in the neocortex.
References
Barlow, H (1996). Intraneuronal information-processing, directional selectivity and memory for spatiotemporal sequences. Network: Computation in Neural Systems, 7(2), 251-259.
Becker, S (1996). Mutual information maximization: Models of cortical self-organisation. Network: Computation in Neural Systems, 7(1), 7-31.
Creutzfeldt, O D (1978). The neocortical link: Thoughts on the generality of structure and function of the neocortex. In Brazier, M A B and Petsche, H (eds.), Architectonics of the Cerebral Cortex, Raven Press, New York.
Douglas, R J and Martin, K A C (1994). The canonical microcircuit: A co-operative neuronal network for neocortex, 131-141.
Ebdon, M (1993). Is the cerebral neocortex a uniform cognitive architecture? Mind and Language, 8(3), 369-403.
Foldiak, P (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194-200.
Linsker, R (1988). Self-organization in a perceptual network. Computer, 105-117.
Marr, D (1970). A theory for cerebral neocortex. Proceedings Royal Society London (B), 176, 161-234.
Stone, J V (1996a). A canonical microfunction for learning perceptual invariances. Perception, 25(2), 207-220.
Stone, J V (1996b). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463-1492.
Szentagothai, J (1978). The neuron network of the cerebral cortex: a functional approach. Proceedings Royal Society London (B), 201, 219-248.