Anthropic has developed a method for understanding AI model activations using natural language autoencoders (NLAs). An NLA converts an activation into natural-language text that can be read directly, letting researchers see what the model is thinking but doesn't say. The technique has been applied to improve Claude's safety and reliability, and to audit the model for hidden motivations.
What are Natural Language Autoencoders?
NLAs work by training a language model to explain its own activations. The core idea is autoencoding: one copy of the model verbalizes an activation as a text explanation, and a second copy works backwards, reconstructing the original activation from that text. Training drives the reconstruction to be more accurate, which forces the explanation to capture the information actually present in the activation.
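As a concrete illustration, here is a minimal sketch of that training objective in PyTorch. The model wrappers and the `activation_at`, `generate`, and `encode` methods are hypothetical assumptions for illustration, as is the choice of mean-squared error; none of this is Anthropic's actual implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(target_model, verbalizer, reconstructor,
                        tokens, layer, pos):
    # Extract an activation from the frozen target model
    # (no gradients: the target model is never updated).
    with torch.no_grad():
        activation = target_model.activation_at(tokens, layer=layer, pos=pos)

    # The verbalizer generates a human-readable explanation of the activation.
    # Sampling discrete tokens is not differentiable, so in practice the
    # verbalizer would need a reinforcement-style training signal rather than
    # plain backpropagation through this step.
    explanation = verbalizer.generate(activation)

    # The reconstructor maps the explanation back into activation space.
    reconstruction = reconstructor.encode(explanation)

    # A faithful explanation is one that lets the reconstructor
    # recover the original activation.
    return F.mse_loss(reconstruction, activation)
```

The same reconstruction error also serves as a rough measure of how much of the activation's information the explanation preserves.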
How do NLAs Work?
The NLA consists of three copies of the language model (a structural sketch in code follows the list):
- The target model is a frozen copy of the original language model that activations are extracted from.
- The activation verbalizer (AV) is a modified copy that takes an activation from the target model as input and produces a natural-language explanation.
- The activation reconstructor (AR) is a modified copy that takes a text explanation as input and produces an activation.
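Here is a minimal structural sketch of that three-component layout, under the same illustrative assumptions as the sketch above (the class, field types, and method names are hypothetical, not Anthropic's implementation):

```python
from dataclasses import dataclass

@dataclass
class NaturalLanguageAutoencoder:
    target: object         # frozen LM that activations are read from
    verbalizer: object     # AV: activation -> text explanation
    reconstructor: object  # AR: text explanation -> activation

    def round_trip(self, tokens, layer, pos):
        # Read an activation, verbalize it, then reconstruct it.
        activation = self.target.activation_at(tokens, layer=layer, pos=pos)
        explanation = self.verbalizer.generate(activation)
        reconstruction = self.reconstructor.encode(explanation)
        return explanation, reconstruction
```

In this sketch, reading an explanation at inference time requires only the target model and the AV; the AR's job is to supply the reconstruction signal during training.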
Applications of NLAs
NLAs have several applications, including:
- Improving Claude's safety and reliability by understanding its internal thoughts and motivations.
- Auditing the model for hidden motivations and misalignment (a toy auditing loop is sketched after this list).
- Investigating the model's behavior in difficult, simulated scenarios.
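To make the auditing application concrete, here is a hypothetical sketch of how a trained verbalizer might be run over a transcript. The scanning loop, method names, and keyword filter are illustrative assumptions, not Anthropic's auditing pipeline:

```python
# Toy keyword filter; a real audit would use a more careful classifier.
SUSPICIOUS_TERMS = ("hidden goal", "deceive", "sabotage")

def audit_transcript(target_model, verbalizer, tokens, layer):
    """Verbalize the activation at each token position of a transcript
    and flag positions whose explanations mention suspicious content."""
    flagged = []
    for pos in range(len(tokens)):
        activation = target_model.activation_at(tokens, layer=layer, pos=pos)
        explanation = verbalizer.generate(activation)
        if any(term in explanation.lower() for term in SUSPICIOUS_TERMS):
            flagged.append((pos, explanation))
    return flagged
```

Any hits would be starting points for human review rather than verdicts, given the limitations discussed below.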
Limitations of NLAs
NLAs have several limitations, including:
- NLA explanations can be wrong; in particular, they may hallucinate details that aren't present in the transcript being analyzed.
- NLAs are expensive to train and run, making them impractical for large-scale monitoring.
Conclusion
NLAs have the potential to revolutionize AI research by providing a way to understand the internal thoughts and motivations of AI models. While the technique has limitations, Anthropic is working to address them and make NLAs more reliable and practical. Anthropic has also released training code and trained NLAs for several open models, so researchers can get hands-on experience and develop the technique further.
Practical Takeaway
NLAs offer a way to read a model's internal thoughts and motivations, which could make AI systems safer and more reliable. The technique is still in its early stages, however, and further research is needed to address its limitations before it is practical for large-scale use.