2025

Decomposing Deep Neural Network Minds into Parts

My PhD thesis.

Understanding sparse autoencoder scaling in the presence of feature manifolds

On the creation of narrow AI: hierarchy and nonlocality of neural network skills

An exploration of how neural networks learn circuits, how these circuits are expressed in the weights of the network, and how this bears on the problem of creating "narrow" AI systems.

Open Problems in Mechanistic Interpretability

Physics of Skill Learning

2024

Efficient Dictionary Learning with Switch Sparse Autoencoders

The Geometry of Concepts: Sparse Autoencoder Feature Structure

A Physics of Systems that Learn

Not All Language Model Features Are One-Dimensionally Linear

We find that large language models represent some cyclical quantities, such as the days of the week and the months of the year, with a circular geometry in activation space.
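As a toy picture of what such a circular representation means, here is a minimal sketch (an illustration in assumed coordinates, not code from the paper): the seven days of the week placed at evenly spaced angles on a circle in a 2D subspace, so that stepping forward one day is a fixed rotation of that plane.

```python
import numpy as np

# Toy illustration (not from the paper): place the seven days of the week
# at evenly spaced angles on a unit circle in a 2D subspace.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angles = 2 * np.pi * np.arange(7) / 7
embeddings = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (7, 2)

# With this geometry, "one day later" is the same rotation everywhere:
theta = 2 * np.pi / 7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
assert np.allclose(embeddings @ rotation.T, np.roll(embeddings, -1, axis=0))
```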

Survival of the Fittest Representation: A Case Study with Modular Addition

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code

2023

The Space of LLM Learning Curves

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

The Quantization Model of Neural Scaling

A theory of neural scaling based on the assumption that neural computation decomposes into a variety of atomic parts called "quanta".
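To make the scaling intuition concrete, here is a toy numerical sketch (a paraphrase in assumed notation, not the paper's code): if quanta are used with Zipfian frequencies p_k proportional to k^-(alpha+1), and a model learns the n most frequently used quanta first, then the residual loss from unlearned quanta falls off roughly as a power law in n.

```python
import numpy as np

# Toy sketch (assumed notation, not the paper's code): quanta are used with
# Zipfian frequencies p_k ~ k^-(alpha + 1), normalized over a large finite pool.
alpha = 0.5
K = 1_000_000                      # size of the toy pool of quanta
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()

def residual_loss(n):
    """Loss contribution of quanta not yet learned, if the n most
    frequent quanta are learned perfectly."""
    return p[n:].sum()

# The tail sum falls off roughly like n^-alpha (up to finite-pool effects),
# so loss versus number of learned quanta traces out a power law.
for n in [100, 1_000, 10_000]:
    print(n, residual_loss(n), n ** -alpha)
```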

2022

Precision Machine Learning

Omnigrok: Grokking Beyond Algorithmic Data

Towards Understanding Grokking: An Effective Theory of Representation Learning

An Analysis of Grokking

2020

Examining the Causal Structures of Deep Neural Networks Using Information Theory

Understanding Learned Reward Functions

Lunar Opportunities for SETI

Archive

Older blog posts (2018-2020)