Scaling Interpretability (Computational Neuroscience)

This was a recent conversation (June 2024) between four researchers from Anthropic’s Interpretability team (Josh, Jonathan, Adly and Tom) discussing the scientific and engineering findings and challenges encountered in scaling interpretability research. I’ll explain what all that means below.

Some background information:

Big Ideas/tl;dr:

  • 00:06:25 Doing computational neuroscience on an artificial mind
  • 00:07:20 Finding the ‘Veganism Feature’
  • 00:08:20 ‘Surprising’ that it worked

📒 Some background information 👉

❓ Who is Anthropic?

Anthropic is a leading AI company with a focus on building AI systems that are ‘safe and aligned with human values’. Anthropic was founded and built by a number of ex-OpenAI people.

❓ What is Claude?

Claude (currently Claude 3.5 Sonnet) is Anthropic’s newest Large Language Model. According to many benchmarks, it’s now the best LLM on the market. As part of their push to build ‘safe and aligned’ AI, Anthropic are recognised as leaders in the field of ‘Interpretability’.

❓ Wtf is an LLM?

On the surface, your preferred LLM can produce the next word in a sentence and engage in a somewhat human-like conversation. Under the hood, billions of adjustable, neuron-like connections established during training allow the model to pass information back and forth (the transformer is the current design of choice). In the process, it learns intricate details and captures contextual meaning between words. It’s very much up for debate whether this contextual understanding is emergent, or simply the product of digesting the entire internet.
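To make ‘produce the next word’ concrete, here’s a minimal sketch of next-token prediction in Python using the Hugging Face transformers library. The small, open-weights GPT-2 model below is purely a stand-in for illustration – it isn’t the model discussed in the conversation.

```python
# Minimal sketch of next-token prediction (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small open model as a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Vegans avoid wearing"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                 # a score for every token in the vocabulary

next_token_id = logits[0, -1].argmax().item()       # the single most likely next token
print(tokenizer.decode(next_token_id))
```

A chatbot is essentially this step on repeat: append the predicted token to the prompt, run the model again, and keep going.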

For more information on this, check out my WTF is a Large Language Model Essay (https://theplebcheck.com/wtf-is-a-large-language-model/).

❓ What is Interpretability (the Black Box and Emergence)?

The billions (or trillions) of neuron-like parameters passing information back and forth make it impossible for us dumb monkeys to fully understand or unpack how the models are able to produce a coherent-ish output. Which is both magical and terrifying.

This is where ‘Interpretability’ comes in.

Interpretability (not to be confused with interoperability) is the attempt to reverse engineer and understand wtf is happening under the hood – by finding patterns in the neurons responsible for creating those outputs.

❓ What are Features, Autoencoders and Dictionary Learning?

Interpretability involves using tools and methods like Autoencoders and Dictionary Learning to ingest information from inside LLMs, break it down and map it into more understandable components (or features).

Features are specific, human-understandable concepts that the model learns and uses to process and output information.

For example, models can map meaning between the words ‘vegan’ and ‘leather’.
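As a rough illustration of how dictionary learning works (not Anthropic’s actual code), here’s a sketch of a sparse autoencoder in PyTorch. The dimensions and the penalty weight are made-up placeholder values:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes a model activation vector into a large set of sparse 'features'."""

    def __init__(self, d_activation=4096, d_features=65536):  # placeholder sizes
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_features)    # activation -> feature strengths
        self.decoder = nn.Linear(d_features, d_activation)    # feature strengths -> reconstruction

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))       # most entries end up at zero
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy usage: reconstruct an activation while an L1 penalty keeps the features sparse.
sae = SparseAutoencoder()
activation = torch.randn(1, 4096)                             # stand-in for a captured activation
reconstruction, features = sae(activation)
loss = ((reconstruction - activation) ** 2).mean() + 1e-3 * features.abs().sum()
```

The sparsity penalty is the key trick: for any given activation only a handful of the learned features fire, which is what makes each feature individually interpretable (e.g. something like a ‘veganism’ direction).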

❓ What is (Mono)semanticity?

If we break down “Mono-Semanticity”, we get something like singular meaning. So when you hear terms like ‘Working Toward Mono-Semanticity’ in AI, it’s just referring to the pursuit of Interpretability – i.e. making sense of the black box by extracting understandable features or patterns.

Compare the words ‘Bank’ and ‘Triangle’.

‘Bank’ could mean a place to store and debase your money, the side of a river, or to tilt an airplane.

‘Triangle’ almost always means a shape with three straight sides and three angles; much further towards monosemanticity. Individual neurons in an LLM tend to behave like ‘bank’ – firing for many unrelated concepts – while the extracted features aim to behave like ‘triangle’, each standing for one thing.

<aside> 💡 Big Ideas/tl;dr

</aside>

Here are three TLDR highlights from the conversation.

1. Doing computational neuroscience on an artificial mind

This, to me, is the most interesting part of Artificial Intelligence. Through understanding how it works, we’re almost certainly going to learn more about human intelligence. You can extend this to other realms too – philosophy of mind, for example (a conversation for another day).

2. Finding the ‘Veganism Feature’

The veganism feature is a demonstration of a language model’s ability to map contextual meaning – i.e. the model is able to connect ‘not eating meat’, ‘not wearing leather’ and ‘factory farming’, tying all of these concepts together.

3. ‘Surprising’ that it worked

Josh is basically explaining the process detailed above – extracting some of the model’s internal working memory (its activations), training an external tool (the sparse autoencoder) on it, and using that tool to recreate specific features. He shares that while the engineering wasn’t necessarily easy, the results were shockingly impressive – i.e. they were able to recover the desired features with accuracy.
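A hypothetical sketch of what that loop could look like, reusing the SparseAutoencoder class from the earlier snippet. The random tensor stands in for activations actually captured from the model, and the optimiser settings and penalty weight are illustrative guesses rather than Anthropic’s published setup:

```python
import torch

cached_activations = torch.randn(10_000, 4096)     # stand-in for activations captured from the model
sae = SparseAutoencoder(d_activation=4096, d_features=65536)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(1_000):
    # Sample a batch of cached activations and try to reconstruct them from sparse features.
    batch = cached_activations[torch.randint(0, len(cached_activations), (256,))]
    reconstruction, features = sae(batch)
    reconstruction_loss = ((reconstruction - batch) ** 2).mean()
    sparsity_penalty = 1e-3 * features.abs().sum(dim=-1).mean()
    loss = reconstruction_loss + sparsity_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# If training works, the autoencoder reproduces the activations accurately while only a
# small fraction of its 65,536 features fire on any given input – those are the 'features'
# the researchers then inspect and label (e.g. the veganism feature above).
```

The ‘surprising’ part is that features recovered this way line up with human-understandable concepts at all.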

Final thoughts

There isn’t some pre-programmed instruction manual for how the model should produce an output, and it’s not simply regurgitating information from training. That much is obvious. There is a very flexible structure, with billions of adjustable parameters, allowing the model to soak in context and meaning by mapping words in a high-dimensional space that we don’t have the capacity to fully understand.

It’s going to be fascinating watching how interpretability research evolves over the coming years.

There is broad consensus that as the models get more powerful and are fed more training data, Interpretability is going to have to innovate and evolve.

We’ll no doubt uncover some wild and whacky shit on the path to dissecting an increasingly intelligent alien brain.

Related resources and references

[1] Mapping the Mind of a Large Language Model – The research involved mapping the patterns of millions of human-like concepts in the neural networks of Claude 3.0 Sonnet, a model similar to recent releases of ChatGPT. They were able to find repeatable patterns (or, in brain-speak, similar neurons firing) when discussing related concepts. https://www.anthropic.com/research/mapping-mind-language-model

[2] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://transformer-circuits.pub/2023/monosemantic-features

[3] Golden Gate Claude https://www.anthropic.com/news/golden-gate-claude

[4] Scaling interpretability (The engineering challenges of scaling interpretability) https://www.anthropic.com/research/engineering-challenges-interpretability

[5] Intro to LLMs – Andrej Karpathy https://drive.google.com/file/d/1pxx_ZI7O-Nwl7ZLNk5hI3WzAsTLwvNU7/view