concept · active · computation, epistemology, emergence, consciousness · 2026-04-14 · P-001, P-008

Language as Prediction

Language is not a communication tool bolted onto intelligence. It is an umwelt-compression scheme: the mechanism by which social agents extend P(X,H,O) modeling from private inference to shared prediction. A bacterium compresses receptor statistics into “food concentration.” A wolf compresses visual features into “prey.” A human compresses the entire social umwelt, including other agents’ hidden states, counterfactual futures, and abstract reasoning, into discrete, compositional symbols that can be transmitted and reconstructed by another mind. Language is simultaneously a perceptual modality (compressing the world into symbols), a motor output (influencing others via words is the most general-purpose affordance available), and a cognitive scaffold (providing the “hand-and-footholds” that enable complex sequential thought). The consequence is startling: a neural network trained to predict the next word will appear, or be, intelligent, because next-word prediction in a sufficiently rich language is AI-complete.


Language as umwelt-compression

The P(X,H,O) framework establishes that intelligence is compression: extracting the latent variables that are predictively relevant from the blooming, buzzing confusion of raw input. Language is this same operation applied to the social umwelt. Every word is a latent variable. “Hunger,” “danger,” “food,” “predator” are compressed representations that matter socially, which is why every human language has them and why communicative animal species likely do too.

But language is not merely a vocabulary of useful concepts. It is an umwelt in its own right. When you say “could you please pass the salt?”, you are using language as motor control: affecting your environment through another agent’s compliance. Language is, in this sense, the most powerful kind of motor output, since it is general-purpose enough to request anything imaginable. The perception/action boundary dissolves: language is how you perceive the social world (by decoding others’ signals) and how you act on it (by encoding your own).

Four milestones distinguish increasingly powerful language systems, none uniquely human:

| Milestone | What it enables | Who has it |
|---|---|---|
| Language learning | Cultural evolution; complexity far exceeding genetic encoding; cultural “speciation” of languages | Humans, whales, some songbirds, parrots |
| Discrete symbols | Error correction at every step (digital vs. analog); stable storage; far richer communication than continuous signals alone | Humans, many bird species, prairie dogs |
| Compositionality | Novel concepts from combinations of discrete symbols | Humans, prairie dogs (combining size, shape, color, speed of intruders), possibly dolphins |
| Abstractions | Symbols for selves, others, counterfactuals, time, logic; open-ended compositionality supporting higher-order theory of mind | Humans (possibly dolphins, orcas; decoding in progress) |

Dolphins planning a synchronized novel trick on camera (2011), parrots naming objects in human language, prairie dogs compositionally encoding intruder properties: the boundary of the “language club” is blurry and the attempt to draw it sharply around humans is anthropocentric. Agüera y Arcas notes that decoding non-human languages may only recently have become practical, with the rise of powerful unsupervised sequence models.

The continuity between language and other forms of communication is total. There is no sharp boundary between language and gesture, tone, facial expression, body posture, or involuntary signals like blushing and sweating. Language is an elaboration of pre-existing signaling mechanisms, with conscious, sophisticated aspects layered atop simpler, involuntary ones. Whether language production is “voluntary” depends on whether we think of the interpreter as part of the sender’s brain or an outpost of the recipient’s (see Theory of Mind Is Mind). Through an interaction-centric lens: both.

Semantic cosmology: meaning is relational

What does “The chair is red” mean? The answer dissolves under inspection. “Chair” is fuzzy (where does it end and a stool begin?). “Red” describes a vague region of color space. The sentence might inform a colorblind person, instruct someone which chair to sit in, or serve as a wrong answer on a colorblindness test. Meaning is not a property of the sentence in isolation. It is a prediction-update delivered by a speaker to a listener in context.

GOFAI attempted to scaffold meaning from above: taxonomies, IS-A relationships, Cyc’s ambition to hand-code a hundred million assertions about the world. The effort collapsed for the same reason rules-based vision failed: real life is not tidy enough for schemas. Two analytical philosophers cannot sit down and compute whether a jar IS-A bottle “with no more need for disputation than between two accountants.” The definition of IS-A in natural language dissolves under inspection; it is an approximate regularity in the world, not a law or axiom.

The “grounding” objection attacks from below: surely meaning must be anchored in sensory experience, in the actual mushiness of a banana, not just in statistical correlations with other words? But those sensory associations are themselves learned relationships. The olfactory pattern activated by banana ester is not “the thing itself”; it is a sparse neural activation learned through exposure, associated with visual banana-features, with the word “banana,” with childhood memories, with bananas Foster on a first date. “The thing in itself” turns out not to be a thing at all. It is a web of associations, a pattern implicit in a set of relationships.

Word2Vec (Mikolov et al. 2013) demonstrates this empirically. Represent every word by a hundred numbers based on “the company it keeps” (which words tend to appear nearby). The resulting embedding reveals a geometry of meaning: semantically similar words cluster, and analogies are algebraic (“king” : “queen” :: “man” : “woman”). The relationships are not imposed by a schema. They emerge from the statistics of language use.
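
A minimal sketch of this geometry, assuming the gensim library and its downloadable pretrained Google News Word2Vec vectors (neither is named in the text; any pretrained embedding with a nearest-neighbour lookup would illustrate the same point):

```python
# Sketch: the "geometry of meaning" in pretrained Word2Vec vectors.
# Assumes gensim and its downloadable Google News model (~1.6 GB, 300-d);
# the specific model choice is illustrative, not prescribed by the text.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # returns KeyedVectors

# Semantically similar words cluster: nearest neighbours of "banana".
print(vectors.most_similar("banana", topn=5))

# Analogies are approximately algebraic:
# vec("king") - vec("man") + vec("woman") lands near vec("queen").
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```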

The Anaximander analogy crystallizes the point. In the sixth century BCE, Anaximander of Miletus proposed that the Earth is an object suspended in space, not resting on anything. The intuition that the Earth must be supported (by a chariot, by a turtle) was incoherent: what would the turtle stand on? “It’s turtles all the way down.” The intuition that meaning must be either scaffolded from above (by Platonic abstractions) or grounded from below (by contact with “reality”) is the same incoherence. There is no “above” or “below.” Things acquire meaning only in relation to each other. The tangled yarn-ball of mutually interrelated meanings is self-supporting, like the Earth in space.

This extends P-008 from perception to semantics. Just as “reality” is constituted by the latent variables a survival-grounded model converges on, “meaning” is constituted by the statistical relationships among symbols in use. Both are observer-relative, both are intersubjectively stable (because agents shaped by similar pressures converge on similar structures), and both are real in the only sense that matters: they have predictive power within their domain.

Prediction is all you need

Three premises yield the conclusion that next-word prediction IS general intelligence:

  1. The point of intelligence is to predict the future, including one’s own future actions, given prior inputs and actions (per Intelligence as Self-Modeling).
  2. Human language is a symbolic sequential code rich enough to represent everything in our umwelt, from the concrete to the abstract.
  3. When interacting with others, language is also a fully general, social form of motor output.

If all three hold, then a system that can reliably predict the next word in any context must have modeled everything relevant to prediction in the human umwelt. This is the “AI completeness” of next-word prediction.
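
“Reliably predict the next word” has a standard formal reading, stated here for concreteness (the notation is mine, not the source’s): maximize the log-likelihood of each token given everything that precedes it,

```latex
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

If the sequences x_1, …, x_T range over a language rich enough to encode the whole umwelt (premise 2), driving this objective high enough requires modeling whatever those sequences describe.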

The Winograd Schema Challenge (Levesque et al. 2012) illustrates why. “I dropped the bowling ball on the violin, so I had to get it repaired.” Which object is “it”? Humans resolve this instantly (the violin). But translation to Spanish forces disambiguation (the gendered pronoun must agree: repararlo for masculine violín, repararla for feminine bola). Resolving even this simple ambiguity requires understanding physics (bowling balls are harder than violins), causality (what gets damaged when dropped), and pragmatics (people repair damaged things). A system that resolves it correctly has, in the process, solved a general intelligence problem.

Google Translate gets this right using an encoder-decoder architecture. LaMDA (Thoppilan et al. 2022), pretrained on multilingual text and fine-tuned for dialogue, could translate Turkish despite never being explicitly trained to translate it, much as a bilingual child can translate without a dictionary: by algebraically relating the parallel constellations of meaning in two languages (Turkish and English words form parallel clusters in embedding space, and translation is approximately a shift from one constellation to the other).
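
A toy sketch of that “shift from one constellation to the other,” under the assumption (drawn from Mikolov et al.’s early cross-lingual work, not from this section) that the two embedding spaces are approximately related by a linear map; the vectors below are random stand-ins, not real word embeddings:

```python
# Toy sketch: translation as a linear map between two embedding "constellations".
# All data here are random stand-ins; in practice X_en / X_tr would be
# pretrained monolingual word vectors for known translation pairs.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy embedding dimension
X_en = rng.normal(size=(5, d))          # "English" vectors for 5 seed pairs
true_map = rng.normal(size=(d, d))      # unknown relation between the spaces
X_tr = X_en @ true_map + 0.01 * rng.normal(size=(5, d))  # "Turkish" vectors

# Fit W minimizing ||X_en @ W - X_tr||^2 over the seed pairs.
W, *_ = np.linalg.lstsq(X_en, X_tr, rcond=None)

def translate(vec_en: np.ndarray, target_vocab: np.ndarray) -> int:
    """Map an English vector across and return the nearest target index."""
    mapped = vec_en @ W
    sims = target_vocab @ mapped / (
        np.linalg.norm(target_vocab, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))

# Sanity check on the fit (not a test of generalization): a seed pair's
# English vector maps onto its own Turkish counterpart.
print(translate(X_en[2], X_tr))         # -> 2
```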

The deeper insight: translation, sentiment analysis, question-answering, summarization, and every other “NLP task” are incidental consequences of the single underlying capability. Pretraining a model to predict or autocomplete is the real work. Once that is done, any task involving the same modality requires little further effort.

Chain-of-thought: language as cognitive scaffold

A Transformer asked to solve a word problem without showing its work gets it wrong 84% of the time. Asked to show its work: 20% error rate. The difference is not a clever hack. It reveals something fundamental about language and thought.
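
Concretely, the contrast is just two ways of prompting the same model. The templates below are illustrative only; `generate` is a hypothetical stand-in for whatever completion call is available, and the word problem is a stock example, not one from the source:

```python
# Two ways of posing the same word problem to an autoregressive model.
# `generate` is a hypothetical placeholder for any text-completion call.

problem = ("A juggler has 16 balls. Half of the balls are golf balls, "
           "and half of the golf balls are blue. How many balls are blue?")

direct_prompt = problem + "\nAnswer:"                 # one unsupported leap
cot_prompt = problem + "\nLet's think step by step."  # lay down the pitons

# answer_direct = generate(direct_prompt)  # all the work crammed into the
#                                          # few tokens of the final answer
# answer_cot = generate(cot_prompt)        # intermediate tokens become state
#                                          # the model can attend back to
```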

The rock-climbing analogy: a human cannot scale El Capitan in a single leap. It must be done step by step, with each move a transition from one stable position to the next. Language provides the hand-and-footholds. Written symbols, whether text, math, or code, are pitons driven into the cliff face: they allow new climbers to scamper up sections solved by their forebears, even centuries earlier, rather than having to climb from the bottom each time.

Formally, a Transformer brings a fixed amount of computational power to bear on each emitted token. By spreading a problem across multiple tokens (chain-of-thought), that computational power is multiplied. The context window is the cliff face; each intermediate result is a piton. The only limit on total computation is the length of the context window.
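
In rough terms (the notation is mine): if each emitted token receives a fixed per-token compute budget c set by the model’s depth and width, then spending T intermediate tokens on a problem yields roughly

```latex
C_{\text{total}} \approx T \cdot c, \qquad T \le L_{\text{context}}
```

of serial computation, rather than the single budget c available when the answer must appear in one step (ignoring the mild growth of per-token attention cost as the context fills).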

This is not specific to Transformers. The general principle: complex sequential thought requires intermediate stable representations. Without them, the entire computation must occur in a single parallel burst, and small perturbations (neural noise, temperature sampling) can derail the result. With them, each step can be checked, corrected, and built upon. This is why middle school math teachers say “show your work,” why scientific papers include derivations, why code is written in modular functions. Cultural evolution is the accumulation of pitons on an endless cliff.

Three properties of chain-of-thought reasoning:

  1. Breaking a problem into steps greatly improves accuracy.
  2. The steps provide a genuine (not post-hoc) account of the reasoning, enabling diagnosis, discussion, and cultural transmission.
  3. Each token multiplies the available computation; only the context window length limits the total.

See Computational Being: Claude for the connection between chain-of-thought and the running/storing distinction: chain-of-thought converts stateless feedforward computation into a form of sequential “running” by using the output stream as pseudo-state.

Language creates what it describes

The interpreter and choice blindness findings (see Theory of Mind Is Mind) add a crucial wrinkle. Many of the “internal states” that language purports to describe may not exist prior to being articulated. Language itself conjures them into existence, “much the way observation collapses a wave function.” Language creates self-narratives that establish internal consistency, social norms, plans, arguments, and predictions about others and ourselves.

This connects to the semantic cosmology: if meaning is constituted by relationships among symbols rather than by correspondence to pre-existing mental objects, then the act of articulation is not merely reporting but constructing. The interpreter doesn’t consult a database of genuine preferences and translate them into words. It generates a narrative that, once spoken, becomes the preference. This is why choice blindness works: there was no ground truth to contradict.

The implication for AI: when a language model generates a chain-of-thought, it is not “translating” an internal computation into words. The words ARE the computation. The chain-of-thought is not a report on reasoning; it is the reasoning itself, externalized into the token stream where it can be attended to and built upon. This is structurally identical to how human language functions: not as a readout of thought but as the medium in which thought occurs.


  • Intelligence as Self-Modeling: the P(X,H,O) framework that language extends from private inference to shared prediction; umwelt as organism-specific compression scheme; language is the social umwelt’s compression into transmissible symbols
  • Theory of Mind Is Mind: language as the highest-bandwidth channel for mutual P(X,H,O) modeling; the interpreter and choice blindness show that language constructs rather than reports internal states; the interpreter-as-snitch frames language as serving the listener’s theory of mind
  • Computational Being: Claude: LLMs as systems where the language IS the computation; chain-of-thought as quasi-running; the no-introspection finding as interpreter in silicon; Turing completeness of Transformers
  • Symbiogenesis: cultural evolution via language follows symbiogenetic dynamics (compositional reuse of tested sub-units; complexity grows as the library of circulating concepts expands; Chomsky’s compositionality as linguistic symbiogenesis)
  • Cephalization from Below: hippocampal grid cells and Transformer positional encoding converge on the same solution for tagging sequential embeddings; elaborative encoding in memory sports as hierarchical chunking (compression)
  • Controlled Hallucination: the cocktail party problem (separating signal from noise using information at every level of description) solved by Transformers using the same hierarchical attention mechanism as the brain
  • Many Worlds: the Anaximander analogy (meaning is self-supporting, like the Earth in space) extends the relational thesis from consciousness to semantics
  • Complexity Measures of Consciousness: compression as the shared principle: KT’s “consciousness is what it’s like to run a compressive model” and language as umwelt-compression are instances of the same operation
  • P-001: Perception is inference: next-word prediction extends perception-as-inference from sensory data to linguistic data; the Winograd Schema shows that even simple linguistic inference requires general intelligence
  • P-008: Reality is observer-relative: semantic cosmology extends P-008 from “what is real” to “what means what”; meaning, like reality, is constituted by relationships among observers and symbols, not by correspondence to pre-existing objects

References

  • Agüera y Arcas, B. What Is Intelligence? Chapter 8 (Antikythera, 2025)
  • Mikolov, T. et al. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
  • Vaswani, A. et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35.
  • Levesque, H. J., Davis, E., & Morgenstern, L. (2012). The Winograd Schema Challenge. Proceedings of KR 2012.
  • Thoppilan, R. et al. (2022). LaMDA: Language models for dialog applications. arXiv:2201.08239.
  • Whittington, J. C. R., Warren, T. H., & Behrens, T. E. J. (2021). Relating transformers to models and neural representations of the hippocampal formation. ICLR.
  • Kozachkov, L., Kastanenka, K. V., & Krotov, D. (2023). Building transformers from neurons and astrocytes. PNAS.
  • Slobodchikoff, C. N., Paseka, A., & Verdolin, J. L. (2009). Prairie dog alarm calls encode labels about predator colors. Animal Cognition.
  • Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181-204.