Cephalization from Below
Brains did not arise to command bodies. They arose because muscles needed coordination, and coordination required information flow. The standard picture (brain as central commander issuing motor orders to obedient periphery) inverts the actual evolutionary sequence: motile animals had muscular coordination for hundreds of millions of years before anything resembling a centralized nervous system existed. Nerve nets evolved first as extensions of the muscular system, providing long-range synchronization among contractile cells. Cephalization (the concentration of neural tissue into a head) happened later, when bilateral body plans created a leading end that lived, computationally speaking, in the animal’s future. The brain is not a boss that acquired a body. It is a knot of neurons that muscles tied at their own front door.
Phase synchronization: coherence without command
The earliest motile animals, dating to at least the Ediacaran (635-539 Mya), had no centralized nervous systems. Jellyfish swim via rhythmic pumping. The freshwater polyp Hydra coordinates its body column through a diffuse nerve net. Both achieve coherent movement through phase synchronization: individual contractile cells align the frequency and phase of their oscillations with their neighbors. No cell is in charge. Coordination is local prediction at its most minimal: each unit adjusts its timing to match the units it can sense.
The same principle operates at the scale of whole organisms. Giant firefly swarms synchronize their flashing across forest clearings, achieving near-unison without any conductor. The mechanism: each firefly adjusts its phase to match the glow of visible neighbors, and because they can see not just adjacent individuals but the aggregate glow of distant ones, synchronization propagates rapidly. The physics is identical to cardiac pacemaker coupling, just at a different scale.
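This kind of leaderless synchronization can be sketched with the Kuramoto model of coupled oscillators, a standard toy model (not from the source text) of fireflies, pacemaker cells, and nerve nets alike. Each oscillator nudges its phase toward its neighbors'; no oscillator is in charge, yet global coherence emerges:

```python
import math
import random

def kuramoto_step(phases, natural_freqs, coupling, dt=0.01):
    """One Euler step of the Kuramoto model: each oscillator nudges its
    phase toward the others' phases (all-to-all coupling, for simplicity)."""
    n = len(phases)
    new = []
    for theta, omega in zip(phases, natural_freqs):
        pull = sum(math.sin(phi - theta) for phi in phases) / n
        new.append(theta + dt * (omega + coupling * pull))
    return new

def coherence(phases):
    """Order parameter r in [0, 1]: r near 1 means near-perfect synchrony."""
    re = sum(math.cos(t) for t in phases) / len(phases)
    im = sum(math.sin(t) for t in phases) / len(phases)
    return math.hypot(re, im)

random.seed(0)
n = 50
phases = [random.uniform(0, 2 * math.pi) for _ in range(n)]  # start incoherent
freqs = [random.gauss(1.0, 0.1) for _ in range(n)]           # similar, not identical

r0 = coherence(phases)
for _ in range(5000):
    phases = kuramoto_step(phases, freqs, coupling=2.0)
r1 = coherence(phases)
# With coupling well above the critical threshold, r climbs toward 1
# from purely local adjustments -- coherence without command.
```

The coupling constant and frequency spread here are arbitrary illustrative choices; the qualitative result (local phase adjustment producing global synchrony) is the point.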
What makes this relevant to consciousness: coherent, unified behavior does not require an integrator. The heart beats without a brain. Peristalsis propagates without central command. Gap junctions (direct electrical coupling between neighboring cells) and decentralized nerve nets provide sufficient infrastructure for coordinated muscular action. This is coherence arising from below, via phase coupling, not imposed from above by an integration operator.
The comb jelly’s “subepithelial nerve net” is particularly revealing: it consists of a fused (syncytial) network of nerve fibers, an undirected, organism-wide highway for long-distance signal transmission. It is not wired for sensing the external world. It is wired for helping muscle cells know what distant muscle cells are doing. The earliest neural nets may be best understood as internal sensory systems for muscles.
The bilaterian revolution: a front end that lives in the future
The transition from radially symmetric organisms (jellyfish, Hydra, coral) to bilaterally symmetric ones (worms, and eventually everything on land) introduced a computational asymmetry that changed everything.
A coral polyp is sessile: it sits on a rock and processes whatever washes over it. A bilaterian moves through the world in a particular direction. It has a leading end. As Peter Godfrey-Smith puts it: “In the sea, animals have various body plans. On land, all animals are bilaterian. There are no terrestrial jellyfish.”
The leading end encounters the future first. Chemical receptors (taste/smell, the oldest and most ubiquitous environmental sensors) at the front detect food or threat before the rest of the body does. “The front end of a worm lives in its future, while its rear end lives in its past.” This spatiotemporal asymmetry is the evolutionary pressure that produced cephalization.
A worm that can steer left or right needs spatially differentiated signals: a “yum” to the right, or a “yuck” to the left, should cause the muscles on the right side (but not the left) to contract. A diffuse nerve net cannot easily convey such spatially specific information. So as muscles all over the body began to wire up selectively to the front end of the animal, the resulting knot of spatially organized neurons became the first brain: cephalization as the front-end aggregation of sensory outposts serving the motor periphery.
The Acoela, an ancient order of small marine worms whose lineage diverged from other animals more than 550 Mya, appear to preserve this intermediate evolutionary stage. They have a distributed nerve net plus a “brain cap,” an aggregation of neurons at the front coinciding with sensors (including a simple eye). They hunt using complex sensory-guided behavior. Yet their brain is not highly organized: cut an acoel in half and each half regenerates into a whole animal. Signaling molecules exchanged among muscle cells (not neurons) orchestrate the patterning and regeneration process.
The “meathead” argument: inverting the control hierarchy
The standard neuroscience narrative places the brain at the top of a command hierarchy: brain decides, motor neurons transmit, muscles obey. Agüera y Arcas argues this picture may be exactly backwards, or at least misleadingly one-sided.
Consider the evidence:
- The heart beats rhythmically without any neural input. A heart removed for transplant continues to beat on its own. Cardiac muscle cells coordinate via gap junctions and intrinsic pacemaker activity.
- Peristalsis (the coordinated squeezing that moves food through the gut) relies on traveling waves of contraction governed by local neural plexuses, not by the brain.
- Reflex arcs bypass the brain entirely. When you touch a hot stove, the retraction happens at the spinal level. The brain is informed after the fact.
- Sea squirts possess a central nervous system during their free-swimming larval stage, but reabsorb most of it when they attach to a substrate and become sessile. There is no point in having a brain for an animal that no longer moves.
The neuron itself illustrates the ambiguity. A motor neuron with its cell body in the lumbar spine sends an axon all the way to the tip of the big toe (three feet in humans, thirty feet in blue whales). We instinctively read the cell body as “agent” and the axon as “patient”: the head decides, the tail obeys. But given that learning requires information about downstream effects to flow backward to the source (however the brain implements this), the target is as much “in charge” as the source. The tail wags the dog.
The theater analogy captures the distinction between efficient and final cause. An usher closes the doors when a counter reaches fifty. The fiftieth person to enter is the efficient cause (they triggered the click). But the final cause is that the theater reached capacity. If the counter breaks, the usher counts on paper. Disrupting the causal chain does not disrupt the outcome, because an intelligent agent reroutes around the disruption. This is the signature of purposive (entensional) systems: they exhibit backward causality not because the future literally causes the present, but because adaptive agents predict the future and act to bring it about.
The implication for neuroscience: the “command stream” interpretation of neural spike trains is not wrong, but it is incomplete. Every brain region is trying to predict every other brain region. No inherent hierarchy determines which one is giving orders.
From chemotaxis to cognition: the neuromodulatory bridge
Long before cephalization produced anything like a brain, bilaterians needed to modulate behavior on timescales longer than individual neural spikes. Even bacteria compute a “batting average” of food concentration over time to decide whether to run or tumble.
The solution: neuromodulators, chemical signals that accumulate and dissipate gradually, affecting entire populations of neurons simultaneously. In P(X,H,O) terms, neuromodulators are the original slow-timescale H-variables: not permanent, but longer-lasting than any momentary input (X) or action (O).
Dopamine and serotonin, which remain critical to human cognition, date back to the earliest bilaterian nervous systems (550+ Mya).
| Neuromodulator | Original function | What it tracks | P(X,H,O) role |
|---|---|---|---|
| Dopamine | “Nearby food” sensors in the worm’s head | Expected future food (anticipation, not pleasure) | Slow H-variable: converts “food-outside” into a time-averaged internal signal |
| Serotonin | Food sensors in the worm’s throat | Food consumed (satiation) | Slow H-variable: “enough, stop seeking” |
Dopamine is often glossed as a “pleasure” signal. This is not quite right, even in worms. A worm in a food-rich environment releases dopamine, triggering continuous turning to exploit the local patch (analogous to a bacterium’s increased tumbling near food). When food declines, the worm turns to reorient toward higher concentration. Dopamine tracks the prediction of future food, not the presence of food. The most apt subjective correlate is anticipation.
Evidence from humans: in a series of ethically dubious 1960s experiments, patients wired to directly stimulate their own dopamine production reported not pleasure but escalating anticipation, “as if building up to a sexual orgasm” that they could never reach, pressing the button frantically. Conversely, rats with destroyed dopamine neurons become passive and starve to death even with food under their noses. If food is placed directly in their mouths, they eat with evident pleasure. Dopamine is not the reward. It is the drive toward expected reward.
Serotonin serves the converse function: it tracks food consumed (sensed in the throat), building up over time to signal satiation and quelling the dopamine-driven impulse to forage. The crude but useful characterization: dopamine = “wanting,” serotonin = “getting.”
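A slow H-variable of this kind can be sketched as a leaky integrator: an exponentially weighted running average that outlasts any momentary stimulus but eventually decays. This is a toy model of my own construction, not the source's, and the decay constant is arbitrary:

```python
def leaky_integrate(signal, decay=0.9):
    """Exponentially weighted running average: a minimal slow H-variable.
    Each observation nudges the state; old observations fade geometrically."""
    state = 0.0
    trace = []
    for x in signal:
        state = decay * state + (1 - decay) * x
        trace.append(state)
    return trace

# Toy worm: the nearby-food sensor fires for a while, then food disappears.
nearby_food = [1.0] * 10 + [0.0] * 20
dopamine = leaky_integrate(nearby_food, decay=0.9)
# The trace rises while food is sensed, and -- crucially -- remains
# elevated for a while after the stimulus ends, before decaying:
# anticipation persisting beyond the moment, then giving way to search.
```

The same machinery with a slower decay would serve for the serotonin-like satiation signal; only the input (food consumed rather than food sensed) and the timescale differ.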
Temporal difference learning: dopamine repurposed
As brains grew more complex, dopamine was repurposed from a simple anticipation signal into something approximating a temporal-difference (TD) learning signal. The Schultz/Dayan/Montague discovery (1990s) established the correspondence:
- Dopamine neurons spike at a moderate background rate
- An unexpected reward produces a dopamine burst
- Once the association between a cue and a reward is learned, the dopamine burst shifts earlier, to the cue itself (the monkey licks its lips when it sees the light, not when it tastes the juice)
- If the expected reward is withheld, dopamine activity drops below baseline: a negative prediction error
This maps precisely onto TD learning’s “critic” signal: not the reward itself, but the difference between predicted and actual reward. The actor-critic architecture (policy function learns from the value function’s predictions, value function learns from actual outcomes) bootstraps itself from naivety to competence.
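The Schultz/Dayan/Montague result can be reproduced in a few lines of tabular TD(0). This is a minimal sketch (my own toy setup, not the book's): a cue arrives unpredictably, a reward follows a fixed number of steps later, and the TD error delta plays the role of the dopamine signal:

```python
def td_conditioning(n_trials=500, alpha=0.1, T=5):
    """Tabular TD(0). Episode states: 0 = cue onset, ..., T-1 = reward time.
    Because the cue arrives unpredictably, the jump from the zero-value
    intertrial baseline to V[0] is itself a prediction error (the burst)."""
    V = [0.0] * (T + 1)              # V[T] is the terminal state, fixed at 0
    history = []
    for _ in range(n_trials):
        deltas = [V[0] - 0.0]        # surprise at cue onset (baseline = 0)
        for t in range(T):
            r = 1.0 if t == T - 1 else 0.0   # juice at the final step
            delta = r + V[t + 1] - V[t]      # TD error (gamma = 1)
            V[t] += alpha * delta
            deltas.append(delta)
        history.append(deltas)
    return history

history = td_conditioning()
early, late = history[0], history[-1]
# Early in training: the error spikes at reward time, not at the cue.
# Late in training: the error has migrated to cue onset, and the
# now-fully-predicted reward produces no error at all.
```

Withholding the reward on a late trial would make the final delta dip below zero, reproducing the below-baseline dopamine pause for omitted rewards.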
The evolutionary path is continuous. In the earliest bilaterians, dopamine is already a prediction of food (nearby food, not food in mouth). Predicting dopamine is therefore a prediction of a prediction of food. As brain structures grew upstream and downstream of the dopamine-releasing neurons, the upstream areas became increasingly sophisticated critics (longer-range forecasts), and the downstream areas became increasingly sophisticated actors (more complex behavioral policies). The TD signal didn’t need to be invented. It emerged from the deepening of a prediction loop that was already running in worms.
Caveats: Agüera y Arcas explicitly warns against over-identifying brain function with TD learning. Real brains transcend TD learning in at least two ways: (1) humans can learn tasks that defeat TD algorithms (complex board games, for instance, required elaborations beyond basic TD), and (2) recent evidence shows dopamine encodes information well beyond a scalar prediction-error signal. As with every computational model in this wiki, TD learning illuminates a corner. It does not illuminate every corner.
Generic modularity: the cortex as a colony
Cephalization produced a brain. The social intelligence explosion (Humphrey 1976, Dunbar 1998) then scaled it up, and the scaling was cheap because the cortex has a generic modular structure.
Cortical columns (loosely defined, debated boundaries) form a repetitive honeycomb. The basic circuit is much the same across brain regions. “Visual cortex” and “auditory cortex” differ mainly in input wiring, not computational architecture. In Sharma, Angelucci, and Sur’s experiment (2000), baby ferrets’ optic nerves were rerouted to auditory cortex; the animals learned to see, developing orientation-sensitivity maps in the rewired “auditory” cortex. Blind humans can learn a limited form of vision via spatially patterned tongue stimulation, or echolocation via “click sonar.” The hardware is generic.
This generic modularity made the intelligence explosion possible in the same way DNA’s modular structure made snakes possible: evolution could expand the cortical sheet by replicating columns without inventing anything fundamentally new. In dolphins and humans, expansion progressed to the point of folding the cortex into dense fissures. Humans cram approximately 0.25 m² of cortical area into the skull.
The cortex is therefore not a single organ but a colony of prediction units that replicated inside skulls in larger and larger numbers through increased cooperation among themselves. This reframes the brain as a population, not a unitary entity, and the intelligence it produces as a collective phenomenon among mutually predicting units. See Theory of Mind Is Mind for how this connects to the social brain hypothesis and consciousness as “swing.”
The octopus: decentralized theory of mind
The octopus appears to challenge the social brain hypothesis (intelligent but antisocial) until you look inside. Three-fifths of its neurons are in the arms, not the head, because mollusc nerve fibers lack myelin sheaths. Long-distance signaling is slow and expensive; centralized control is impossible. Each arm responds to stimuli independently, each sucker has its own prehensile intelligence (touch, taste, photoreceptors, chromatophores), and arms communicate directly via a ring of ganglia bypassing the brain.
Agüera y Arcas proposes the octopus is best understood as a tightly knit community of eight arms sharing a common pair of eyes. The central brain (mostly optic lobes) compresses visual information in service to the arms, not as a central commander. The intelligence explosion that produced the octopus may have been driven by mutual predictive modeling among its arms, under the constraint of limited inter-arm bandwidth. This is structurally identical to human social intelligence: high-fidelity mutual prediction under low-bandwidth communication. Attack autotomy in squid (a severed arm fights a predator while the rest escape) may be no different from a bee’s kamikaze defense of the hive.
This connects cephalization from below to P-005: coherence can arise through mutual prediction among decentralized agents, not only through top-down integration. The octopus achieves coherent swimming, hunting, and escape via the same mechanism as a rowing crew achieving “swing.”
Recurrence as depth-in-time
Cortex is shallow (few layers) but densely recurrent: feedback connections dominate over feedforward. This puzzled AI researchers accustomed to deep convolutional networks (CNNs) with dozens or hundreds of feedforward layers. The resolution: a recurrent neural network (RNN) running T time steps is computationally equivalent to a feedforward network with T layers. Cortex achieves deep processing through temporal iteration, not spatial depth.
This explains why recurrent architecture is ubiquitous across cortical regions, whether “sensory,” “motor,” or “association.” All cortex is doing the same thing: autoregressive sequence prediction that unfolds over time. The “deep” processing of a CNN is a spatial snapshot of what a recurrent circuit does dynamically. Early time steps yield coarse, fast judgments (the “double take”: is that a cat?). Later time steps refine the representation (upon closer examination: a house cat, not a tiger; calico, green eyes, slightly annoyed expression). This dual-speed architecture serves survival: fast reactions to salient stimuli (via early exits) and high-resolution perception for everything else (via continued iteration).
The biological argument for recurrence over feedforward depth is straightforward. Neurons and synapses are slow relative to transistors. An animal that needed visual input to propagate through 100 cortical layers before producing a motor response would be eaten long before layer 50. Skip connections (present in biological cortex and in modern deep learning architectures like ResNets) allow some activations to bypass layers, but the fundamental constraint remains: fast reaction time requires shallow architecture. Recurrence solves this by making depth a function of available time rather than fixed spatial structure.
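The equivalence between running a recurrent circuit for T steps and stacking T weight-tied feedforward layers can be shown directly. The scalar "network" below is a deliberately minimal sketch (a single tanh unit with made-up weights), but the identity it demonstrates holds for full RNNs:

```python
import math

def run_recurrent(x, T, w_hh=0.5, w_xh=1.0):
    """Iterate the SAME circuit T times on a held input: depth in time."""
    h = 0.0
    for _ in range(T):
        h = math.tanh(w_hh * h + w_xh * x)
    return h

def run_feedforward(x, T, w_hh=0.5, w_xh=1.0):
    """The 'unrolled' version: T layers with tied weights. Layer t computes
    exactly what the recurrent circuit computes at time step t."""
    h = 0.0
    for _layer in range(T):
        h = math.tanh(w_hh * h + w_xh * x)
    return h

shallow = run_recurrent(0.7, T=1)    # fast, coarse early exit
deep = run_recurrent(0.7, T=20)      # refined judgment, same hardware
assert run_recurrent(0.7, 20) == run_feedforward(0.7, 20)
```

The "early exit" reading falls out for free: `shallow` is available after one step, while continued iteration drives the state toward a refined fixed point, so the same circuit offers both a fast answer and a slow, better one.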
Evidence that even single neurons are sequence predictors (Saponati and Vinck 2023) and that cortical circuitry implements predictive sequence modeling (Keller and Mrsic-Flogel 2018) supports the picture: prediction is the universal operation, at every scale from synapse to cortical region.
The subbasement: hippocampus and basal ganglia
Below the cortex, older brain structures implement specialized forms of the same prediction-and-learning principle, tuned to different timescales and problem types.
Hippocampus: one-shot sequence learning
The hippocampus (Greek: “seahorse,” for its shape) is a whorl of tissue deep in each cerebral hemisphere. Its original function was likely real-time construction of spatial maps: essential for any animal that moves through a stable environment. It retains this function: hippocampal “place cells” fire when an animal is at specific locations, and rapid bursts of place-cell activity during rest correspond to replay of past trajectories and simulation of future ones.
Henry Molaison (1926-2008), known in the literature as H.M., established the hippocampus’s role in memory formation. After bilateral hippocampal removal to treat intractable epilepsy, Molaison retained his pre-surgery memories, his personality, and his short-term memory, but could form no new episodic memories. If his attention wandered, the interaction might as well have never happened. His last memories dated to 1953.
The interpretation: the hippocampus is a rapid one-shot sequence learner that captures sparse patterns of sequential cortical activity (landmark → landmark → landmark, the way we give directions) in real time. The cortex, which learns slowly but has vastly more capacity and richer associative connections, consolidates these hippocampal recordings through repeated replay during sleep (faster-than-real-time replay of previous experiences has been recorded in sleeping animals). Sleep deprivation impairs memory formation precisely because it interrupts this consolidation pipeline.
This is a clean division of labor: the hippocampus captures quickly but has limited capacity; the cortex integrates slowly but builds deep associative structure. Each learns from the other: the hippocampus does rapid one-shot learning from the cortex in the waking moment, then the cortex elicits replay-based training from the hippocampus during sleep.
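The division of labor can be caricatured in code. The sketch below is my own toy (the class, its methods, and the replay count are all invented for illustration): a "hippocampus" that stores sequences verbatim in one shot, and a "cortex" that learns transition statistics only slowly, trained by replaying stored episodes as during sleep:

```python
import random

class FastSlowLearner:
    """Toy complementary-learning-systems sketch: one-shot episodic capture
    plus slow, replay-driven consolidation of transition statistics."""

    def __init__(self, lr=0.1):
        self.episodes = []        # hippocampus: verbatim sequences, one-shot
        self.transitions = {}     # cortex: slowly accumulated associations
        self.lr = lr              # cortical learning is deliberately slow

    def experience(self, sequence):
        self.episodes.append(list(sequence))   # captured immediately

    def sleep(self, replays=50):
        """Faster-than-real-time replay trains the slow learner offline."""
        for _ in range(replays):
            episode = random.choice(self.episodes)
            for a, b in zip(episode, episode[1:]):
                w = self.transitions.get((a, b), 0.0)
                self.transitions[(a, b)] = w + self.lr * (1.0 - w)

    def predict_next(self, state):
        candidates = [(w, b) for (a, b), w in self.transitions.items()
                      if a == state]
        return max(candidates)[1] if candidates else None

random.seed(0)
brain = FastSlowLearner()
brain.experience(["oak", "bridge", "mill"])    # one walk through the world
before_sleep = brain.predict_next("oak")       # None: not yet consolidated
brain.sleep()
after_sleep = brain.predict_next("oak")        # "bridge": consolidated
```

Sleep deprivation in this caricature is simply never calling `sleep()`: the episode survives in the fast store, but the slow store never learns it.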
Grid cells, positional encoding, and the Transformer parallel
The 2014 Nobel Prize in Physiology or Medicine went to John O’Keefe (discoverer of place cells) and May-Britt and Edvard Moser (discoverers of “grid cells”): neurons in the medial entorhinal cortex that fire in beautiful hexagonal patterns as an animal navigates space, forming an internal positioning system. Growing evidence suggests that the hippocampus’s spatial-mapping and episodic-memory-formation functions may be related or even identical: the ancient “memory palace” technique (memorizing sequences by placing them in an imagined environment) exploits the same spatial-sequential machinery.
The connection to artificial intelligence is striking. The Transformer architecture (Vaswani et al. 2017) requires “positional encoding” to tag token embeddings with information about their ordering; without such tagging, every attention operation would make connections among a disordered bag of words. Whittington et al. (2021) showed that when a Transformer’s positional encoding is learned (rather than hand-specified) during a spatial-navigation task, the resulting activation patterns closely resemble grid cells, along with related patterns like “band cells” and “place cells” also observed in the hippocampus. The similarity is highly suggestive: these patterns are the most natural building blocks for composing spatiotemporal tags, and the brain appears to have hit on the same solution as the Transformer, for the same reason.
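For concreteness, here is the fixed sinusoidal positional encoding from Vaswani et al. (2017). (Whittington et al. studied learned encodings, which is where the grid-cell resemblance emerges; the hand-specified version below just shows the shared idea: tagging positions by phase across many periodic signals at multiple scales, much as grid cells tile space at multiple wavelengths.)

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding (Vaswani et al. 2017): each dimension
    pair oscillates at a different wavelength, so a position is identified
    by its phase pattern across many periodic signals."""
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))   # geometric wavelength ladder
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc

pe0 = positional_encoding(0, 8)   # [0, 1, 0, 1, ...] -- the origin
pe5 = positional_encoding(5, 8)   # a distinct phase pattern
# Attention can compare and offset these tags, which is what makes
# "a disordered bag of words" into an ordered sequence.
```

Using sine/cosine pairs makes relative offsets linear: the encoding of position p + k is a fixed rotation of the encoding of position p, which is exactly the property a periodic, grid-like code provides.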
At a more cellular level, Kozachkov et al. (2023) proposed that interactions between neurons and astrocytes (a type of glial cell comprising more than half the brain’s volume) could implement a Transformer-like attention mechanism. Astrocyte processes ensheath approximately 60% of hippocampal synapses to form “tripartite synapses,” and the way they modulate signal transmission looks suspiciously like the attention dot product (query-key matching followed by value weighting). If confirmed, this would mean that the Transformer’s key architectural innovation, attention, has a biological counterpart that neuroscience has been underestimating by treating astrocytes as mere “support cells.”
The convergence runs in both directions. The Transformer was not designed to mimic hippocampal function, yet its engineering solutions (positional encoding, attention dot products) appear to rediscover computational primitives that evolution found independently. This supports the broader claim of this page: the structure of information processing is determined more by the problem being solved than by the substrate solving it. See Language as Prediction for how these mechanisms underpin language modeling.
Basal ganglia: softmax action selection
The basal ganglia are a collection of nuclei in the brain’s center, present since the earliest vertebrates (Cambrian-era bony fish). They integrate and select among competing activation patterns from other brain areas, mediated by dopamine (see the TD learning section above). Their function is a neural softmax: multiple candidate actions compete, and the winner drives behavior.
Posterior nuclei handle motor skills (“muscle memory,” learned sequences that need “no thinking”). Anterior nuclei handle stimulus-action associations driven by higher-level goals (cravings, habits, addictions). Fish and amphibian behavior appears to be driven mainly by this reinforcement-learning-like mechanism, lacking the higher-level predictive simulation of world, others, and self that appeared with the mammalian neocortex.
The relationship between basal ganglia and cortex illustrates the layered autonomy of the brain: newer or “higher” levels provide long-range prediction and thereby earn their metabolic keep, but augment a largely autonomous underlying architecture. Skilled and low-latency behaviors learned initially by cortex can be “offloaded” to basal ganglia for parallel execution, freeing cortex for slower, more deliberative processing. The division of labor looks like tactics (basal ganglia) versus strategy (cortex). Referring to basal ganglia as “unconscious” or implementing an “autopilot” presumes they aren’t really part of “you,” but that is the homuncular fallacy: what it really means is that the interpreter (see Theory of Mind Is Mind) doesn’t have complete access to these older regions.
Seven takeaways: the distributed prediction architecture
Combining the subbasement tour with the cephalization story above:
- Many brain regions, not just cortical areas, are sequence predictors (even the retina is a sequence predictor).
- Different regions predict over different timescales, with later-evolving regions generally capable of more complex predictions over longer horizons.
- Brain regions actively predict each other and, where connected to sensory inputs or motor outputs, predict those signals.
- How they are wired together largely determines what they predict and which information resources they can marshal.
- Effective mutual prediction involves mutual learning.
- The division of labor is not perfectly clean. One brain area can learn something first, then teach it to others (for lower latency, robustness, parallelism, or greater generality). This wouldn’t be possible if the areas were not capable of sequence learning.
- No part of the brain is the “conscious” part where a homunculus resides. The brain is a village of mutually predicting prediction units, specialized by connectivity, operating at different timescales, all of them “you.”
Toward dynamically stable symbiotic prediction
The chapter culminates in a sketch of what a unified theory of learning might look like. Agüera y Arcas calls it “dynamically stable symbiotic prediction”: not pure reinforcement learning (too narrow, single-reward), not the Bayesian brain alone (too neural-centric), but unsupervised sequence prediction constrained by dynamic stability and symbiosis.
Four desiderata for the unified theory:
- Active prediction of the future given the past as the central problem
- No distinction between learning and inference: prediction must occur over all timescales, not just during a “training phase”
- Synthesis of prediction with thermodynamics in the spirit of dynamic stability
- Mutual prediction between agents leading to collective, nonzero-sum outcomes
Patricia Churchland’s critique of AlphaGo-era AI motivates the move beyond reward: “Maintaining homeostasis often involves competing values and competing opportunities, as well as trade-offs and priorities.” Real organisms are not optimizing for any one thing. Single-mindedness is not conducive to mutualism or survival, hence not dynamically stable. The brain, whatever it is doing, must be doing something more general than reward maximization.
This sketch bears on the wiki’s existing threads: T-003 (running vs. storing) is sharpened by the insistence that prediction must occur over all timescales (point 2 dissolves the learning/inference boundary); T-008 (compositional prior mirrors symbiogenesis) gains further evidence from the observation that neural proliferation itself follows symbiogenetic dynamics (neurons colonize favorable niches like replicators); and P-007 (dynamic stability) may extend from the thermodynamic domain into the learning domain as a unified principle.
Related pages
- The Bayesian Brain: the high-resolution neural implementation of prediction that this page’s evolutionary story explains the origin of; saccadic vision as biological masked autoencoder
- Controlled Hallucination: what the predictive machinery described here produces at the level of conscious experience; the gaze-contingent display experiments directly demonstrate the hallucinated visual field
- Intelligence as Self-Modeling: P(X,H,O) as the formal framework within which neuromodulators function as H-variables; this page fills in the phylogenetic path from bacterium to brain
- Life as Computation: dynamic stability and the thermodynamic ground; the “dynamically stable symbiotic prediction” sketch aims to unify this with learning
- Symbiogenesis: cephalization is itself a symbiogenetic event (nerve cells as non-motile cousins entering symbiotic partnership with muscle cells); neural proliferation follows the same colonize-and-cooperate logic as genomic symbiogenesis
- Theory of Mind Is Mind: the cortical column colony and octopus material extend the cephalization story from “how brains arose” to “how brains scale” and “what consciousness arises from” (mutual prediction among generic prediction units)
- Computational Being (Bach): Bach’s coherence principle (P-005) is complicated by the phase-synchronization evidence here: coherence can precede and operate without a centralized integration operator
- P-001: Perception is inference: cephalization is the evolutionary explanation for why organisms with front ends became inference machines
- P-005: Coherence organizes agency: phase synchronization adds a bottom-up pathway to coherence distinct from top-down integration
- Language as Prediction: grid cells and Transformer positional encoding converge on the same solution; hippocampal one-shot capture + cortical consolidation is the biological architecture that Transformers lack (no long-term memory); the cocktail party problem as hierarchical attention
- P-007: Dynamic stability: the “dynamically stable symbiotic prediction” sketch extends P-007’s thermodynamic principle into the domain of learning
References
- Agüera y Arcas, B. What Is Intelligence? Chapter 4 (Antikythera, 2025)
- Godfrey-Smith, P. Metazoa: Animal Life and the Birth of the Mind (2020/2024)
- Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593-1599.
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
- Burkhardt, P. et al. (2023). Syncytial nerve net in a ctenophore. Science.
- Strogatz, S. (2004). Sync: The Emerging Science of Spontaneous Order. Penguin.
- Agüera y Arcas, B. What Is Intelligence? Chapter 7 (Antikythera, 2025)
- Saponati, M. & Vinck, M. (2023). Sequence learning in single neurons. Nature.
- Keller, G. B. & Mrsic-Flogel, T. D. (2018). Predictive processing: a canonical cortical computation. Neuron, 100(2), 424-435.
- Pfeiffer, B. E. & Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature, 497(7447), 74-79.
- Bennett, M. (2023). A Brief History of Intelligence. Mariner Books.
- Heath, R. G. (1963). Electrical self-stimulation of the brain in man. American Journal of Psychiatry.
- Churchland, P. S. (2016). Motivations and drives are computationally messy.
- Whittington, J. C. R., Warren, T. H., & Behrens, T. E. J. (2021). Relating transformers to models and neural representations of the hippocampal formation. ICLR.
- Kozachkov, L., Kastanenka, K. V., & Krotov, D. (2023). Building transformers from neurons and astrocytes. PNAS.
- Vaswani, A. et al. (2017). Attention is all you need. NeurIPS, 30.