2025
Decomposing Deep Neural Network Minds into Parts
My PhD thesis.
Understanding sparse autoencoder scaling in the presence of feature manifolds
On the creation of narrow AI: hierarchy and nonlocality of neural network skills
An exploration of how neural networks learn circuits, and how these circuits are expressed in the network's weights, with relevance to the problem of creating "narrow" AI systems.
Open Problems in Mechanistic Interpretability
Physics of Skill Learning
2024
Efficient Dictionary Learning with Switch Sparse Autoencoders
The Geometry of Concepts: Sparse Autoencoder Feature Structure
A Physics of Systems that Learn
Not all language model features are one-dimensionally linear
We find that large language models represent some cyclical quantities, such as the days of the week and the months of the year, with a circular geometry in activation space.
Survival of the Fittest Representation: A Case Study with Modular Addition
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code
2023
The Space of LLM Learning Curves
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
The Quantization Model of Neural Scaling
A theory of neural scaling based on the assumption that neural computation decomposes into a variety of atomic parts called "quanta".
2022
Precision Machine Learning
Omnigrok: Grokking Beyond Algorithmic Data
Towards Understanding Grokking: An Effective Theory of Representation Learning
An Analysis of Grokking
2020
Examining the Causal Structures of Deep Neural Networks Using Information Theory
Understanding Learned Reward Functions
Lunar Opportunities for SETI