A short note on interpretability and minds

2026-04-05

I recently put out a blog post reflecting on a paper from my PhD and its connection to some related work on interpretability and neural scaling laws. Writing that post (intermittently) took me about eighteen months, and it runs to about 15,000 words. In this post, I've taken a shot at distilling the spirit of that longer post into something much shorter. It is about the assumptions that mechanistic interpretability researchers make and how those assumptions gesture towards a deeper science—a path for machine learning to teach us something new and important about ourselves. It is a bit whimsical, a bit ungrounded, but I nevertheless feel like it is pointing at something important and true.

Deep learning is very behaviorist. We train larger networks with better training algorithms on more data and we observe that the behavior of the resulting networks changes, as measured by a lower pretraining loss and better benchmark performance. And we have a nice, rigorous empirical science of how network behavior depends on various macroscopic inputs, as exemplified by neural scaling laws[1], the METR time horizon study[2], etc. This is all well and good.

However, like some others in the field, I want something more. I want us to be able to say something about the internal computation that networks learn that underlies their behavior. When an LLM gets 1.9 nats of cross-entropy on the pretraining corpus, what does that imply about what it is doing internally? How, specifically, is it representing and processing its inputs? How is that computation different in an LLM that gets 1.8 nats of cross-entropy? This line of inquiry presupposes that we can say something interesting about the computation that a network performs beyond giving a description of its architecture and a printout of its learned weights—it presupposes that neural network computation admits some other description that is faithful and that we can nevertheless understand. This is a fundamental assumption of interpretability.

While this assumption may be controversial to some, I think it shouldn't be. Or at least it shouldn't be more controversial than the existence of the field of cognitive science, which has sought to describe the algorithms effectively implemented by biological neural networks for decades.[3]

So what sort of computation do deep neural networks perform internally? For networks trained on real-world data, the answer may be very complex. There are probably a huge number of claims at different levels of analysis that one can make about the computation that a large network performs.

However, a basic assumption that mechanistic interpretability researchers make across the literature[4,5,6,7] and that I emphasize in my post, paper[8], and PhD thesis is that the computation that networks perform, whatever that may be, decomposes into simpler parts. In some manner, the big computation that networks do is made up of smaller computations. There is some kind of underlying modularity here.

Another core assumption is that these parts are sparse across the data.[9] On any specific input, only some of these parts are needed in the network's computation, even if across the whole data distribution all the parts are needed. This is a nice assumption: to understand the network's behavior on any given input, we would only need to look at the parts that are "active" on that sample. Since those might be a very small fraction of all the network's parts, the amount of computation we would need to understand at any one time would be small.

If you accept these assumptions, then it's natural to study the frequencies at which these different parts are active across the data distribution. Call these frequencies $f_i$. Some computational parts might be needed very commonly by the network, with $f_i \approx 1$, but others might be much more niche and active only very rarely. I think this is especially intuitive for language modeling, where LLMs learn a huge amount of esoteric knowledge that they only very rarely need. We could imagine studying the distribution of the frequencies $f_i$ across all the parts. What shape does that distribution take? I think a very natural assumption, especially for networks trained on language, where statistics are often Zipfian, is that the frequencies are long-tailed: the $f_i$ follow a power law.

From this, a model of neural scaling immediately follows. We just need to think about these "parts" not only as things that are instantiated in a specific network but as objects that live in the data, computations that any network that performs well on that distribution ought to learn. If the frequencies $f_i$ follow a power law, the "importance" of learning each one to the network—the amount by which learning each one reduces the mean loss—will follow the same power law. Networks would optimally learn as many parts as they have capacity for, in order of decreasing frequency. As networks are scaled up and learn a greater number of increasingly niche parts with power-law distributed importance, the mean loss drops as a power law, because integrating under a power law yields another power law.[10] This is where neural scaling laws come from (conjecturally).[11]
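The argument above is easy to sketch numerically. Below is a toy simulation (my own illustration, not the exact model from the paper): part frequencies follow a Zipfian power law with a made-up exponent, a network with capacity for $N$ parts learns the $N$ most frequent ones, and its leftover loss is taken to be the total frequency of the parts it has not learned.

```python
import numpy as np

# Toy model: part i is needed with frequency f_i ∝ i^(-alpha), a Zipfian
# power law. The exponent and part count here are arbitrary choices.
alpha = 1.5
n_parts = 1_000_000
i = np.arange(1, n_parts + 1)
f = i ** (-alpha)
f /= f.sum()  # normalize to a probability distribution over parts

# Assume each unlearned part contributes loss in proportion to its frequency.
# A network with capacity for N parts learns the N most frequent ones, so its
# excess loss is the total frequency of the parts it has NOT learned.
capacities = np.array([10, 100, 1_000, 10_000, 100_000])
excess_loss = np.array([f[N:].sum() for N in capacities])

# Summing a power-law tail gives another power law:
# sum_{i>N} i^(-alpha) ~ N^(1-alpha), so excess loss ~ N^(-(alpha-1)).
slope = np.polyfit(np.log(capacities), np.log(excess_loss), 1)[0]
print(f"fitted scaling exponent: {slope:.2f} (predicted: {1 - alpha:.2f})")
```

The fitted exponent comes out close to the predicted $1-\alpha$ from integrating under the frequency distribution, with finite-size effects bending the curve slightly at the largest capacities.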

Now what are these parts? Concretely, interpretability has now explored several methods for trying to decompose trained neural networks into sparsely activating parts, such as sparse autoencoders. Each of these methods is interesting in its own right, and one could discuss at length the decompositions of networks they provide. But here I'll just say that I wish the field had some answer to whether there is any notion of a "right" decomposition for a network.[12,13] Are there "natural units" that we want our interpretability tools to reveal?[14] Again, what are the parts? If we had a satisfying answer to this question, we could measure the empirical frequencies $f_i$ at which these parts are active on the pretraining corpus, watch the parts form over training, and compare the empirical $f_i$ distribution with the slope of the scaling law that we observe on that distribution. But absent a good answer, the model of scaling I've described here isn't really falsifiable. If you gave me some decomposition of a network into parts whose frequencies $f_i$ didn't follow the power law we'd expect, I could just wave my hands and tell you that you didn't correctly decompose the computation of the network. But this is an uncomfortable place to be.
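To make "sparse autoencoder" concrete, here is a minimal one in plain NumPy, trained with hand-derived gradients on synthetic sparse data. Everything here (the sizes, the penalty, the learning rate, the axis-aligned ground-truth parts) is an illustrative assumption of mine, not the setup of any cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": each sample is a sparse non-negative combination
# of a few ground-truth directions (axis-aligned here for simplicity).
d, batch = 16, 128

def sample_batch():
    mask = rng.random((batch, d)) < 2 / d          # ~2 active parts per sample
    return mask * rng.uniform(0.5, 1.5, (batch, d))

# Sparse autoencoder: overcomplete ReLU encoder + linear decoder, trained on
# reconstruction error plus an L1 penalty that encourages sparse codes.
m, lam, lr = 32, 0.01, 0.05
W_enc = rng.normal(size=(d, m)) * 0.1
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d)) * 0.1

def recon_loss(x):
    h = np.maximum(x @ W_enc + b_enc, 0.0)
    return float(((h @ W_dec - x) ** 2).sum() / batch)

x0 = sample_batch()
init_loss = recon_loss(x0)

for step in range(5000):
    x = sample_batch()
    h = np.maximum(x @ W_enc + b_enc, 0.0)         # sparse codes
    err = (h @ W_dec - x) / batch
    # Manual gradients of  mean ||x - x_hat||^2 + lam * mean ||h||_1
    dh = 2 * err @ W_dec.T + (lam / batch) * (h > 0)
    dpre = dh * (h > 0)                            # ReLU mask
    W_dec -= lr * (h.T @ (2 * err))
    W_enc -= lr * (x.T @ dpre)
    b_enc -= lr * dpre.sum(0)

final_loss = recon_loss(x0)
h = np.maximum(x0 @ W_enc + b_enc, 0.0)
print(f"reconstruction loss: {init_loss:.3f} -> {final_loss:.3f}")
print(f"fraction of codes active per sample: {(h > 0).mean():.2f}")
```

In the real dictionary-learning work, the inputs are a model's internal activations rather than synthetic vectors, the dictionary is far more overcomplete, and training uses Adam rather than plain SGD; but the objective, reconstruction plus an L1 sparsity penalty on the codes, is the same.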

For the moment though, let's just accept the basic picture that I've described here, and that others in the interpretability literature (Chris Olah, Lee Sharkey, etc.) have articulated in varying forms as well: that neural network computation can be decomposed into simpler, sparsely activating parts. What does this say about the world?

Well, if neural networks are learning to reproduce the rules that generated the data they are trained on[15], it may say a great deal. For what generated humanity's corpus of text across history other than the process of human minds thinking? If the computation that language models do when predicting our text can be decomposed into simpler parts, perhaps human minds and human thinking, the process that generated that text, can be similarly decomposed.[16,17] Our neural scaling laws would then reflect some fundamental fact about the distribution of ideas across minds.

Why then is it a power law? And what does this say about the process by which ideas are created and spread?

I get the funny feeling that those who study deep learning and interpretability, and those who study cognitive science and psychology, will turn out, in the end, to have been studying the same object from different sides.

Surely the unification of these fields into a mathematical science of Mind will be the scientific legacy of this century.[18]

Thanks to Wes Gurnee, Marmik Chaudhari, Daniel Kunin, Lee Sharkey, Adam Shai, Loren Amdahl-Culleton, and Casper L. Christensen for reading drafts of this post.

Notes and References

  1. Kaplan et al. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. [link] ↩︎
  2. Kwa et al. (2025). Measuring AI ability to complete long tasks. arXiv preprint arXiv:2503.14499. [link] ↩︎
  3. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. W.H. Freeman. ↩︎
  4. Olah et al. (2020). Zoom in: An introduction to circuits. Distill. [link] ↩︎
  5. Elhage et al. (2022). Toy models of superposition. arXiv preprint arXiv:2209.10652. [link] ↩︎
  6. Bricken et al. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. [link] ↩︎
  7. Bushnaq et al. (2025). Stochastic parameter decomposition. arXiv preprint arXiv:2506.20790. [link] ↩︎
  8. Michaud, E. J., Liu, Z., Girit, U., & Tegmark, M. (2023). The quantization model of neural scaling. NeurIPS 2023. [link] ↩︎
  9. Olshausen, B. A. & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. [link] ↩︎
  10. Marcus Hutter established this idea as a model of where neural scaling laws come from in:
    Hutter, M. (2021). Learning curve theory. arXiv preprint arXiv:2102.04074. [link] ↩︎
  11. Note however that a couple of recent papers challenge the view that neural scaling laws arise from this kind of power law structure in the data. One is Barkeshli et al. (2026), who train transformers on sequence data generated from random walks on a graph where there isn't obvious power law structure, and nevertheless find that their networks exhibit power law scaling. This should make us cautious about inferring too much about the data from the observation of scaling laws alone. Power law scaling may arise for more generic reasons. In another work, Cagnetta et al. (2026) give an explanation, with impressive empirical support, of where data scaling law exponents come from. In their model, scaling laws still arise from power laws in the data, but of a different kind than the Zipfian distribution over skills/mechanisms that I and others have proposed—their theory uses the power-law decay of the n-gram next-token conditional entropy and the power-law decay of the token-token covariances across context. There may be some way of reconciling these theories, but for now I feel that the origin of neural scaling laws is very much an open theoretical question. ↩︎
  12. Sharkey et al. (2025). Open problems in mechanistic interpretability. Transactions on Machine Learning Research. [link] ↩︎
  13. Engels et al. (2024). Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860. [link] ↩︎
  14. Hoel, E. P., Albantakis, L., & Tononi, G. (2013). Quantifying causal emergence shows that macro can beat micro. Proceedings of the National Academy of Sciences, 110(49), 19790–19795. [link] ↩︎
  15. In order to predict data optimally, networks can learn a computation which is closely related to, but not identical to, the process that generated that data. See Shai et al. (2024) and Shai et al. (2026). ↩︎
  16. Minsky, M. (1986). The society of mind. Simon & Schuster. ↩︎
  17. Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. MIT Press. ↩︎
  18. I like David Bau's framing that our time is a "Copernican Moment" for our understanding of intelligence here:
    Bau, D. (2025). In defense of curiosity. Blog post. [link] ↩︎
@misc{michaud2026short,
  author = {Michaud, Eric J.},
  title = {A short note on interpretability and minds},
  year = {2026},
  howpublished = {\url{https://ericjmichaud.com/interp-and-minds/}},
}