LLMs and Beyond: All Roads Lead to Latent Space
Essential concepts for understanding AI today and prospects for tomorrow. (Bonus footnote: the maths of exponential orthogonality)
Today's AI technologies are based on deep learning,1 yet “AI” is commonly equated with large language models,2 and the fundamental nature of deep learning and LLMs is obscured by talk of using “statistical patterns” to “predict tokens”. This token-focused framing has been a remarkably effective way to misunderstand AI.
The key concept to grasp is “latent space” (not statistics, or tokens, or algorithms). It’s representation and processing in “latent space” that are fundamental to today's AI and to understanding future prospects. This article offers some orientation and perhaps some new perspectives.
A taste of terminology soup:
In machine learning (≈ deep learning ≈ neural networks ≈ “AI”), latent-space related terminology is rich in synonyms, near-synonyms, and distinct names for related concepts in overlapping domains:
Latent space:
Latent space ≈ semantic space ≈ embedding space ≈ representation space ≈ feature space ≈ hidden space ≈ model internal space ≈ Transformer hidden state space ≈ activation space ≈ neural state space…
Latent-space representation:
Latent-space representation ≈ semantic embedding ≈ embedding vector ≈ concept vector ≈ feature vector ≈ hidden state ≈ activation pattern ≈ internal representation ≈ learned representation ≈ distributed representation…
As text, these terms may seem very different. In semantic space, their representation vectors would form tight clusters.3
LLMs don't work by processing tokens
AI is much more than just LLMs,4 but it’s difficult to grasp what AI is about without having at least a general idea of what is inside an LLM. What supposedly informed writers say about LLMs today is profoundly wrong. It’s like describing a jet engine as a new kind of horse.
Are LLMs systems programmed to predict the next token based on statistical patterns in training data? This common description may seem precise and technical, yet it mistakes a training objective (token prediction) for a result (coherent, relevant, well-informed responses), and it suggests that LLMs work by using intricate code to process tokens, which is simply false.5 This description is little more than a word-pattern that’s parroted by writers based on their flawed reading-data.
What’s inside
Studies of trained models reveal mechanisms far more interesting than token-processing: LLMs process representations in high-dimensional vector spaces where meaning is encoded in geometry.6 In these “latent spaces” (≈ “semantic spaces” ≈ “hidden states”, etc.), concepts become directions, conceptual categories become clusters, and reasoning unfolds through mutually informed transformations of sequences of high-dimensional vector patterns.7
In each of the many layers of a Transformer-based model, each token-position in a text sequence gives rise to a latent vector that draws on information from its own token, but also from a selection of previous latent vectors. When inspected using sparse autoencoders, these embedding vectors resolve into dozens of distinct, weighted semantic components (“concept vectors”), each selected from a vector-vocabulary of millions of learned concepts. Beyond the input layer of a Transformer, input tokens merge into continuous semantic flows in a semantic sea, and the wordless semantic vectors resolve into tokens again only at the output layer, where language tasks demand that the output be text. In a model like Stable Diffusion, latent representations would instead produce images.
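For concreteness, here is a toy numpy sketch of that per-layer update, with tiny made-up dimensions and random matrices standing in for learned weights. Each position's updated latent vector is its current vector plus an attention-weighted blend drawn from itself and earlier positions, computed as a fixed sequence of matrix operations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5                       # tiny stand-ins for e.g. 4096 dims

# Latent vectors entering this layer: one per token position.
h = rng.normal(size=(seq_len, d_model))

# "Learned" projections (random here) map latent vectors to queries, keys, values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = h @ W_q, h @ W_k, h @ W_v

# Causal attention: position i may draw only on positions 0..i.
scores = Q @ K.T / np.sqrt(d_model)
scores[np.triu_indices(seq_len, k=1)] = -np.inf
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Updated latent vectors: each position keeps its own vector and adds a
# weighted blend drawn from itself and earlier positions. No token-specific
# logic appears anywhere in this computation.
h_next = h + weights @ V
print(h_next.shape)                           # (5, 8): one vector per position
```

A real Transformer layer adds layer normalization and a feed-forward sublayer, but the shape of the computation is the same.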
Internal, latent-space representations of meaning — based on subtle combinations of concepts, not words — provide the foundation for all that LLMs can do.
Latent space has vast capacity
The expressive power of latent space representations emerges from counter-intuitive properties of high-dimensional geometry. Modern LLMs operate in spaces with thousands of dimensions, and the number of different concept-vectors that can be clearly distinguished in a 4096-dimensional space (as used in Meta's Llama models) far exceeds the number of synapses in the human brain. (See maths and graph in footnote.8)
From tokens to intelligent behaviors
But aren’t LLMs “just” predicting the next token?9 Consider what the token prediction task actually demands: Training a generative model to predict human text creates optimization pressure to model the generative process behind those outputs, which is to say, the process of human thought.10 What humans write depends on what they understand from the past and what they intend for the future. Writers choose words to fit a planned sentence, and they intend the sentence to fit a paragraph that plays a role in, perhaps, a Substack article, taking account of the intended meaning of the article and how their readers may process information.11
Predict the next token? There is no next token until the LLM itself generates that token, usually as an extension of its own output. When training, LLMs predict; when writing, they produce. Modeling human thought improves both predictions and products.
If writing is “matching patterns in training data”, then those patterns are flexible, conceptual, and computational. To call them “statistical” is misleading or wrong.
In other words, commonly repeated explanations of how AI works are simplistic bunk. The actual mechanisms are strange and wonderful.
Understanding latent space representation in LLMs provides a foundation for grasping how modern AI works more broadly, and why it is something new in the world. This same fundamental approach — learning in latent space rather than relying on programming — extends far beyond language.
The expanding universe of AI
Deep learning has quietly transformed domains beyond language. In every area, systems use latent space representations, though of quite different kinds:
Protein structure prediction systems like AlphaFold don't apply explicit physical rules to predict folds from amino acid sequences; instead, they learn latent spaces and transformations that converge on physical results — stable folded structures.
Protein design systems like AlphaProteo likewise don't apply explicit physical rules, yet through latent-space processing they find amino acid sequences that will fold to produce the desired results — stable structures with specified properties.12
Weather forecasting models like GraphCast outperform traditional simulation-based approaches by learning to model atmospheric dynamics from data.13
Image generation systems create both cartoons and photorealistic imagery from descriptions by mapping from textual to visual latent spaces.
Robotic systems based on deep learning can combine linguistic instructions, visual perception, and physical actions through aligned latent representations.
These systems differ profoundly, yet they share the common foundation of deep learning: they represent their domains in continuous latent spaces and transformations shaped by learning, rather than relying on directly-meaningful states, relationships, and programming.
The power of latent space representations extends beyond individual domains. When different modalities map to compatible latent spaces, new capabilities emerge.
Modalities converge in latent space
Latent space representations enable multimodal systems to bridge different forms of information. Systems like CLIP map text and images to aligned latent spaces, showing that text and images can share meanings, that they are often “about the same things” (note the Platonic Representation Hypothesis). Shared representations can enable bidirectional translation — describing images in text, visualizing text as images — generalized from examples.14 Recent studies have shown that text and images can activate the same latent-space concept vectors in LLMs, for example, text and images that describe or indicate abstractions like “security risks”.15 In latent space, modalities converge.
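As a toy illustration (the vectors below are random placeholders, not the output of any real encoder), once a text encoder and an image encoder have been trained to share one latent space, cross-modal retrieval reduces to nearest-neighbor search by cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)

# Pretend outputs of a CLIP-style text encoder, one unit vector per caption,
# living in a shared text-image latent space (random placeholders here).
caption_vecs = {
    "a photo of a dog": unit(rng.normal(size=512)),
    "a red bicycle":    unit(rng.normal(size=512)),
}

# Pretend output of the paired image encoder for a bicycle photo: training
# should place it near the matching caption (faked here as caption + noise).
image_vec = unit(caption_vecs["a red bicycle"] + 0.1 * rng.normal(size=512))

# Cross-modal retrieval: pick the caption whose vector lies nearest the image's.
best = max(caption_vecs, key=lambda c: caption_vecs[c] @ image_vec)
print(best)                                   # "a red bicycle"
```

The same nearest-neighbor logic runs in the other direction: given a caption, retrieve the nearest image embeddings.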
This integration enables capabilities that require multimodality. Robotics systems that can mesh language instructions with vision and motion representations can perform tasks beyond reach of language-only, vision-only, or motion-specialized systems. Multimodal models can reason across domains, using knowledge gained in one form to solve problems in another.16
As systems integrate more modalities, capabilities will grow through combination rather than simple addition. Systems that fuse understanding of language, diagrams, 3D geometry, physics, software, and engineering will help translate human goals into code, physical designs, and production processes, accelerating general implementation capacity.
This kind of integration challenges incremental projections of AI progress: modest improvements in representational quality and multimodal integration can unlock qualitatively new capabilities. Understanding this dynamic is crucial for anticipating future AI developments.
Further, the representational capabilities of multimodal latent-space systems suggest an ambitious goal: creating persistent, updatable knowledge stores in latent space that overcome the limitations of today's LLMs.
Moving knowledge into latent space
As information sources, AI systems are deeply flawed yet widely used. A better approach could be immensely valuable: one where AI systems refine and extend knowledge rather than blurring it through hallucination.
The most knowledgeable AI systems today are LLMs trained on immense text corpora, but conventional language models represent the resulting knowledge in opaque, indivisible blocks of billions of numerical parameters, each meaningless in itself. As a consequence, these models cannot ground assertions in sources or reliably distinguish well-grounded knowledge from speculation, and there is no tractable way to update or correct their knowledge. Efforts to overcome these limitations by tinkering with the neural machinery and training of LLMs have had poor results.
Retrieval-augmented generation (RAG) shows a way forward: Store chunks of knowledge externally and enable LLMs to “read” potentially relevant background material before responding to a prompt. RAG works well enough to be useful, yet chunks of text are a crude way to represent knowledge compared to learned latent-space representations. At every reading, meaning must be reconstructed from blocks of tokens severed from their broader contexts.
Prospective Large Knowledge Models (LKMs) improve on text-based RAG by representing knowledge directly in latent space through explicit, persistent, external information stores. Rather than retrieving chunks of text, proposed LKMs retrieve bundles of latent-space embeddings that represent meaning in AI-native form. These latent-space bundles can integrate smoothly with Transformers, both as inputs and outputs — indeed, they have much in common with internal Transformer representations.
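A minimal sketch of the input side of that interface, with made-up shapes and a hypothetical provenance field (nothing here is a specification): a retrieved knowledge bundle is a short sequence of latent vectors that can be spliced into the Transformer's input alongside the prompt's token embeddings, with no re-tokenization and no reconstruction of meaning from text.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 8                                   # stand-in for a real model width

# Token embeddings for the user's prompt, as the model's embedding layer
# would produce them (random placeholders here).
prompt_embeddings = rng.normal(size=(6, d_model))         # 6 prompt tokens

# A retrieved "knowledge bundle": latent vectors stored externally,
# carrying provenance metadata rather than raw text.
bundle = {
    "source": "doc://example-reference",                  # hypothetical provenance link
    "vectors": rng.normal(size=(4, d_model)),             # 4 embedding vectors
}

# The bundle enters the model as embeddings, spliced in ahead of the prompt,
# ready for the Transformer stack.
model_input = np.concatenate([bundle["vectors"], prompt_embeddings], axis=0)
print(model_input.shape)                                  # (10, 8)
```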
This approach can retain the advantages of latent-space representations while separating (most) knowledge representation from the machinery of reasoning and generation. Models can be smaller, faster, and more transparent, with the role of opaque knowledge significantly narrowed.
LKMs enable:
Explicit provenance — knowledge represented in bundles can be linked to sources,17 and to knowledge regarding those sources.
Incremental update — new information can be added without affecting existing representations.
Knowledge refinement — errors can be filtered, compatible information can be combined, and global context can enrich local content.
Cumulative reasoning — knowledge can build on knowledge through retained, traceable results of reasoning.
Traditional databases retrieve data based on values like names, numbers, and text strings. Vector databases (today used for text-based RAG) instead organize and access semantically rich data through latent-space vectors that represent the query-relevant semantic content of that data (here, bundles).18 These embeddings serve as keys, and the vector database efficiently retrieves values indexed by keys that are near-neighbors to queries in a high-dimensional latent space. Retrieval-by-meaning in vector databases parallels attention in Transformers, but in a more scalable form.19
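The parallel can be made concrete in a few lines of numpy (brute-force scoring stands in for the approximate nearest-neighbor indexes that real vector databases use): both mechanisms score stored items against a query by vector similarity, but attention blends all values with continuous weights, while retrieval returns the top-k items as discrete results.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
keys   = rng.normal(size=(10_000, d))      # stored embedding keys
values = rng.normal(size=(10_000, d))      # data indexed by those keys
query  = rng.normal(size=d)

sims = keys @ query                        # one similarity score per stored item

# Vector-database style: return the k most similar items as discrete results.
top_k = np.argsort(sims)[-5:]
retrieved = values[top_k]                  # shape (5, 64)

# Attention style: blend *all* values, weighted by a softmax of the same scores.
w = np.exp(sims - sims.max())
w /= w.sum()
attended = w @ values                      # shape (64,)
```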
Through shared and translatable latent spaces, LKM knowledge stores could inform a broad range of multimodal systems, both known and yet to be invented.
All roads lead to latent space
Latent-space processing is the foundation and future of AI.
LLMs and other modern AI systems process latent-space representations, not tokens or pixels or statistical data.
Multimodal AI systems process and integrate information through latent-space representations that abstract meaning from modality.
Scalable knowledge models will produce, store, and refine world knowledge in latent-space representations, providing a foundation for increasingly general AI capabilities.
To understand AI prospects we must think in these terms, looking beyond surface capabilities to the mechanisms that make them possible, and to obvious next steps.
Understanding AI through the lens of latent space has broad implications. For researchers, it favors shifting focus toward latent space representation quality and cross-modal integration. For developers, it highlights opportunities to build systems that retain and refine knowledge in AI-native form rather than relying on translation from text. For policymakers and strategists, it suggests that AI capabilities will grow not just through larger models, but through novel combinations of modalities and knowledge in shared latent spaces. The resulting capabilities may arrive more swiftly than linear projections would suggest.
For AI today and tomorrow, all roads lead to latent space.
Deep learning's power comes from differentiable representations and processing — the ability to learn billions of parameters through trillions of small tweaks during training. The older, non-differentiable, symbol-processing approaches to AI provide no similar mechanism for learning. Don’t confuse AI-then with AI-now: They have little in common but aspirations and a name.
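To make “small tweaks” concrete, here is one gradient-descent step on a toy differentiable model (a linear fit standing in for billions of parameters): a forward pass, a gradient, and a slight nudge to every parameter.

```python
import numpy as np

rng = np.random.default_rng(4)
x, y_true = rng.normal(size=(32, 3)), rng.normal(size=32)
w = np.zeros(3)                                    # parameters to be learned

y_pred = x @ w                                     # forward pass
grad = 2 * x.T @ (y_pred - y_true) / len(y_true)   # gradient of mean squared error
w -= 0.01 * grad                                   # one small tweak to every parameter

# Training a modern model repeats a step like this on the order of trillions
# of times, over billions of parameters; no hand-written domain rules appear.
```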
Which (after data curation, fine tuning, and reinforcement learning) are no longer “language models”.
Note that “embedding” typically denotes a persistent latent-space representation, not a transient neural activation, and that hidden states in (for example) convolutional models are not usefully regarded as elements of a vector space.
And even “LLMs” often process both text and images.
Saying that “the code processes information” is trivially true, but suggests that the code treats different inputs differently, that it “sees” and responds to the inputs. But Transformers — the basis for today’s LLMs — don’t work this way: The instructions they execute depend not at all on the semantics of what is being computed; Transformers just turn the crank on a fixed sequence of numerical operations. As a further indication, a fully functional GPT implementation can be written in about 300 lines of code. There is no knowledge, no intelligence, in the code itself. The magic is in the numbers, not the machinery.
Regarding latent representations, there is an important distinction between Transformers and, for example, convolutional networks. While both architectures process information in latent spaces, convolutional networks do not have a clean, corresponding vector space, because their representations are typically lower level and entangled with explicit 2-D spatial organization. Not all latent representations are created equal, whether in size or organization.
See, for example, “Tracing the thoughts of a large language model”.
This follows from geometric properties of high-dimensional spaces. In such spaces, we can pack an enormous number of vectors that are substantially orthogonal to each other (enabling them to represent distinct, separable concepts). For example, in a space with Nd = 4,096 dimensions (a typical value in modern Transformers), we can fit more than Nv = 10^20 vectors with no two vectors having a cosine similarity C greater than 0.15. This provides vast representational capacity.
Note that the assumed vector packing is far from dense: The math models a process of random placement with rejection, and the number of placeable vectors grows exponentially with dimensionality. According to this model, for large values of Nd and small values of C, Nv ≈ exp(Nd·C²/2); with Nd = 4,096 and C = 0.15, this gives roughly 10^20.
There’s plenty of room in embedding space.
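A quick check of that estimate, just evaluating the formula above (the constant factors it ignores don't matter at this scale):

```python
import math

N_d, C = 4096, 0.15
log10_capacity = (N_d * C**2 / 2) / math.log(10)   # log10 of exp(N_d * C**2 / 2)
print(f"~10^{log10_capacity:.0f} nearly orthogonal vectors")   # ~10^20
```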
This formulation already sets aside, for example, reinforcement learning and optimization for reasoning tasks, notably demonstrated by DeepSeek.
Overall, LLMs model thought with mixed results, absurdly weak in some ways, yet superhuman in others: LLMs can show superhuman capabilities in domains requiring broad knowledge and linguistic fluency, yet struggle with tasks that humans find simple, like spatial reasoning and arithmetic.
LLMs are also notorious for giving classic answers to classic puzzles that have been tweaked to make those classic answers wrong. It seems that strong but superficial pattern-matching (not noticing the tweak) can override the deeper, detailed processing that would enable correct reasoning. Keep in mind that suitably trained AI systems can solve competition-level math problems.
I endorse GPT 4.5’s elaboration on this theme:
“Each token prediction generates training signals that flow backward through every parameter, every layer, and every latent representation in the model’s computational history. Consequently, each latent-state vector at each position is optimized not merely for immediate coherence but to anticipate and shape future text—organizing discourse structures, setting up sentence-level syntax, and aligning with intentions that unfold across paragraphs and beyond. Attention mechanisms selectively route information across tokens and layers, allowing the model to refine and maintain these anticipatory structures, embedding implicit plans that extend far beyond the next word.”
Potential building blocks of advanced, atomically precise nanotechnologies.
My apologies if the first three examples read as an advertisement for Google DeepMind’s science team. (I could have added more.)
The ability of LLMs to perform translation between human languages has the same root: They map meaning into latent space from one language, and then out into another.
“The unsafe code feature activates for images of people bypassing security measures, while the [backdoor code] feature activates for images of hidden cameras, hidden audio recorders, advertisements for keyloggers, and jewelry with a hidden USB drive.” (Discussed here)
As distinct data objects, bundles can include references to other bundles and to conventional data types.
Vector databases are poorly-suited to representing large sets of discrete, named entities and relationships, but graph databases do this well and are natural complements. Graph nodes can reference embedding bundles as attributes, while bundle objects can reference graph-node objects as sources. Both vector and graph databases are mature technologies.
A vector database differs in that it retrieves discrete neighbors instead of aggregating outputs through continuous attention-weighted sums, but the basis for selecting relevant information — vector similarity — is essentially the same.