On quanta, distilled
2026-01-14
A couple of days ago, I released a blog post reflecting on a paper from my PhD and its connection to some related work. It took me a year and a half to write that post, and it runs to about 15,000 words. Today I took a shot at distilling the spirit of that post into something shorter. Here it is.
Deep learning is very behaviorist. We train larger networks with better training algorithms on more data, and we observe that the behavior of the resulting networks changes, as measured by a lower pretraining loss and better benchmark performance. And we have a nice, rigorous empirical science of how network behavior depends on various macroscopic inputs, as exemplified by neural scaling laws, the METR time horizon study, etc. This is all well and good.
However, like some others in the field, I want something more. I want us to be able to say something about the internal computation that networks learn that underlies their behavior. When an LLM gets 1.9 nats of cross-entropy on the pretraining corpus, what does that imply about what it is doing internally? How, specifically, is it representing and processing its inputs? How is that computation different in an LLM which gets 1.8 nats of cross-entropy? This line of inquiry presupposes that we can say something interesting about the computation that a network performs beyond giving a description of its architecture and a printout of its learned weights—it presupposes that neural network computation admits some other description that is faithful and that we can nevertheless understand. This is a fundamental assumption of interpretability.
While this assumption may be controversial to some, I think it shouldn't be. Or at least it shouldn't be more controversial than the existence of the field of cognitive science, which has studied the algorithms effectively implemented by biological neural networks for decades.
So what sort of computation do deep neural networks perform internally? For networks trained on real-world data, the answer may be very complex. There are probably a huge number of claims at different levels of analysis that one can make about the computation that a large network performs.
However, a basic assumption that mechanistic interpretability researchers make across the literature, and which I emphasize in my post, paper, and PhD thesis, is that the computation networks perform, whatever it may be, decomposes into simpler parts. In some manner, the big computation that networks do is made up of smaller computations. There is some kind of underlying modularity here.
Another core assumption is that these parts are sparse across the data. On any specific input, only some of the parts are needed in the network's computation, even if across the whole data distribution all of them are needed. This is a nice assumption because, to understand the network's behavior on any given sample, we'd only need to look at the parts that are "active" on that sample. Since these might be a very small fraction of all the network's parts, the amount of computation we'd need to understand at any one time would be small.
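To make this concrete, here is a toy sketch of what sparsity across the data means: each sample activates only a handful of parts, even though every part gets used somewhere in the dataset. All the numbers below are invented for the illustration; real "parts" would have to come from some decomposition of an actual network.

```python
import numpy as np

# Toy illustration of the sparsity assumption. All quantities here are
# made up for the sketch, not measurements of any real network.
rng = np.random.default_rng(0)
n_parts, n_samples, parts_per_sample = 1_000, 10_000, 5

# For each sample, mark a small random subset of parts as "active".
active = np.zeros((n_samples, n_parts), dtype=bool)
for s in range(n_samples):
    active[s, rng.choice(n_parts, size=parts_per_sample, replace=False)] = True

print("mean fraction of parts active per sample:", active.mean(axis=1).mean())  # 0.005
print("fraction of parts used somewhere in data:", active.any(axis=0).mean())   # ~1.0
```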
If you accept these assumptions, then it's natural to study the frequencies at which these different parts are active across the data distribution. Call these frequencies $f_i$. Some computational parts might be very commonly needed by the network, with $f_i \approx 1$, but others might be much more niche and active only very rarely. I think this is especially intuitive for language modeling, where LLMs learn a huge amount of esoteric knowledge that they only very rarely need. We could imagine studying the distribution of the frequencies $f_i$ across all the parts. What shape does that distribution take? I think a very natural assumption, especially for networks trained on language (where statistics are often Zipfian), is that the distribution over frequencies is long-tailed: the $f_i$ follow a power law.
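As a quick numerical illustration of what such a long tail looks like (the exponent and the number of parts below are arbitrary placeholder choices, not estimates of anything):

```python
import numpy as np

# Hypothetical Zipfian frequencies over parts: f_i ∝ i^-(alpha + 1).
# alpha and n_parts are placeholder values, not measured quantities.
alpha, n_parts = 0.5, 100_000
ranks = np.arange(1, n_parts + 1)

freqs = ranks ** -(alpha + 1.0)  # plays the role of the f_i above
freqs /= freqs.sum()             # normalize so the frequencies sum to 1

# A few head parts carry much of the usage; the rest form a very long tail.
for k in (10, 100, 1_000, 10_000):
    print(f"top {k:>6} parts account for {freqs[:k].sum():.1%} of total part usage")
```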
From this, a model of neural scaling immediately follows. We just need to think about these "parts" not only as things that are instantiated in a specific network but as objects that live in the data, computations that any network that performs well on that distribution ought to learn. If the frequencies $f_i$ follow a power law, the "importance" of learning each one to the network—the amount by which learning each one reduces the mean loss—will follow the same power law. Networks would optimally learn as many parts as they have capacity for, in decreasing order of frequency. As networks are scaled up and learn a larger number of increasingly niche parts with power-law distributed importance, the mean loss drops as a power law, because integrating under a power law yields another power law. This is (conjecturally) where neural scaling laws come from.
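To spell out the "integrating under a power law" step (a sketch under the assumptions above, with an exponent convention chosen here just for concreteness): suppose the parts are ranked by frequency, with $f_k \propto k^{-(\alpha+1)}$ for some $\alpha > 0$, and suppose learning part $k$ reduces the expected loss by an amount proportional to $f_k$. A network that has learned the $n$ most frequent parts then has excess loss

$$ L(n) - L_\infty \;\propto\; \sum_{k > n} k^{-(\alpha+1)} \;\approx\; \int_n^\infty x^{-(\alpha+1)} \, dx \;=\; \frac{n^{-\alpha}}{\alpha}, $$

and if the number of parts a network can learn grows roughly linearly with its size $N$, the excess loss falls off as $N^{-\alpha}$: a power law in scale.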
Now, what are these parts? Concretely, interpretability has by now explored several methods for decomposing trained neural networks into sparsely activating parts, such as sparse autoencoders. These methods are each very interesting, and one could have a detailed discussion about them and the decompositions of networks they provide. But here I'll just say that I wish the field had some answer to whether there is any notion of a "right" decomposition of a network. Are there "natural units" that we want our interpretability tools to reveal? Again, what are the parts? If we had a satisfying answer to this question, we could measure the empirical frequencies $f_i$ of these parts on the pretraining corpus, watch the parts form over training, and compare the empirical distribution of the $f_i$ with the slope of the scaling law that we observe on that distribution. But absent a good answer, the model of scaling I've described here isn't really falsifiable. If you gave me a decomposition of a network into parts whose frequencies $f_i$ didn't follow the power law we'd expect, I could just wave my hands and tell you that you hadn't correctly decomposed the network's computation. But this is an uncomfortable place to be.
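For readers who haven't encountered these methods, here is a minimal sketch of the sparse-autoencoder idea: reconstruct a network's internal activations through an overcomplete hidden layer with an L1 penalty, so that only a few latent units fire on any given input. The dimensions, penalty coefficient, and usage below are placeholders, not any particular paper's settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder sketch (illustrative dimensions only)."""

    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse latent code: which "parts" are active
        x_hat = self.decoder(z)          # reconstruction of the original activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).mean()    # reconstruction error
    sparsity = z.abs().mean()            # L1 penalty: encourage few active latents
    return recon + l1_coeff * sparsity

# Toy usage on random stand-ins for a model's activations.
sae = SparseAutoencoder()
acts = torch.randn(32, 512)
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
```

In this framing, the frequencies $f_i$ from the story above would be something like the fraction of inputs on which each latent unit fires, though whether such latents are the "right" parts is exactly the open question.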
For the moment, though, let's just accept the basic picture I've described here, which others in the interpretability literature (Chris Olah, Lee Sharkey, etc.) have also articulated in varying forms: that neural network computation can be decomposed into simpler, sparsely activating parts. What does this say about the world?
Well, if neural networks are learning to reproduce the rules that generated the data they are trained on, it may say a great deal. For what generated humanity's corpus of text across history other than the process of human minds thinking? If the computation that language models do when predicting our text can be decomposed into simpler parts, perhaps human minds and human thinking, the process that generated that text, can be similarly decomposed. Our neural scaling laws would then reflect some fundamental fact about the distribution of ideas across minds.
Why then is it a power law? And what does this say about the process by which ideas are created and spread?
I get the funny feeling that those who study deep learning and interpretability, and those who study cognitive science and psychology, will turn out, in the end, to have been studying the same object from different sides.
Surely the unification of these fields into a mathematical science of Mind will be the scientific legacy of this century.
Thanks to Wes Gurnee for reading a draft of this post.
@misc{michaud2026quanta-distilled,
author = {Michaud, Eric J.},
title = {On quanta, distilled},
year = {2026},
howpublished = {\url{https://ericjmichaud.com/quanta-distilled/}},
}