Scientists found the key to controlling AI behavior

New study reveals how manipulating concept vectors inside LLMs can precisely steer AI behavior

For years, the inner workings of large language models (LLMs) like Llama and Claude have been compared to a “black box” – vast, complex, and notoriously difficult to steer. But a team of researchers from UC San Diego and MIT has just published a study in the journal Science suggesting this box isn’t quite as mysterious as we thought.

The team has discovered that complex concepts within AI – ranging from specific languages like Hindi to abstract ideas like conspiracy theories – are stored as simple straight-line directions, or vectors, within the model’s mathematical space.

Using a new tool called the Recursive Feature Machine (RFM) – a feature-extraction technique that identifies the linear patterns representing concepts, from moods and fears to complex reasoning – the researchers were able to trace these directions precisely. Once a concept’s direction is mapped, it can be “nudged”: by mathematically adding or subtracting the corresponding vector, the team could instantly alter a model’s behavior without expensive retraining or complicated prompts.
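For readers who want a concrete picture, the general technique – often called activation steering – can be sketched in a few lines of Python with Hugging Face Transformers. This is not the paper’s exact implementation: the model name, layer index, steering strength, and the random placeholder direction below are all assumptions for illustration, and in the study the direction would come from RFM rather than random noise.

```python
# A minimal sketch of activation steering with a PyTorch forward hook.
# Model name, layer index, steering strength and the random placeholder
# direction are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 15   # which transformer block to steer (assumption)
alpha = 4.0      # steering strength: positive adds the concept, negative suppresses it

# Placeholder for a concept direction; see the extraction sketch below.
concept_direction = torch.randn(model.config.hidden_size)
concept_direction = concept_direction / concept_direction.norm()

def steer(module, inputs, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tensor or a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * concept_direction.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer)

prompt = "Tell me about the moon landing."
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()   # stop steering once finished
```

Setting alpha to a negative value would suppress the concept instead of amplifying it, which is the “subtracting” half of the approach described above.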

The efficiency of this method is what has the industry buzzing. Using just a single standard GPU (an NVIDIA A100), the team could identify and steer a concept in under a minute, using fewer than 500 training samples.
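How might a concept direction be found from so few samples? As a rough illustration – using a plain difference of mean activations as a simple stand-in for the paper’s RFM procedure, with an assumed model, layer, and made-up example texts – the idea looks like this:

```python
# A minimal sketch of estimating a concept direction from a small labelled set.
# A difference of mean activations stands in for the paper's RFM procedure;
# the model name, layer and example texts are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
layer_idx = 15   # assumption

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final token at layer_idx."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1].float()

# Hypothetical contrastive examples, loosely themed on the "conspiracy theory"
# concept mentioned in the study; a real run would use a few hundred of each.
with_concept = [
    "They don't want you to know the truth about what really happened.",
    "It was all staged as part of a cover-up by the elites.",
]
without_concept = [
    "The event was broadcast live and documented by independent observers.",
    "Official records describe what happened in detail.",
]

pos = torch.stack([last_token_state(t) for t in with_concept]).mean(dim=0)
neg = torch.stack([last_token_state(t) for t in without_concept]).mean(dim=0)

concept_direction = (pos - neg) / (pos - neg).norm()
# This unit vector could replace the random placeholder in the steering sketch above.
```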

The practical applications of this “surgical” approach to AI are immediate. In one experiment, researchers steered a model to improve its ability to translate Python code into C++. By isolating the “logic” of the code from the “syntax” of the language, they produced a steered model that outperformed standard versions simply asked to “translate” via a text prompt.

The researchers also found that internal “probing” of these vectors is a more effective way to catch AI hallucinations or toxic content than asking the AI to judge its own work. Essentially, the model often “knows” it is lying or being toxic internally, even if its final output suggests otherwise. By looking at the internal math, researchers can spot these issues before a single word is generated.
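The study’s probing setup isn’t spelled out here, but the general idea – fitting a simple linear classifier to a model’s hidden activations – can be sketched as follows. The model, layer, and tiny labelled dataset are hypothetical stand-ins; a real probe would be trained on far more examples.

```python
# A minimal sketch of probing hidden states with a linear classifier.
# The model, layer and labelled statements are hypothetical stand-ins; the point
# is only that a simple linear model over internal activations can flag
# problematic content without asking the model to judge its own output.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
layer_idx = 15   # assumption: middle layers often carry such signals

def hidden_state(text: str) -> np.ndarray:
    """Last-token activation at layer_idx for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1].float().numpy()

# Hypothetical labelled statements: 1 = false claim, 0 = factual claim.
statements = [
    "The Eiffel Tower is located in Rome.",                # 1
    "The Eiffel Tower is located in Paris.",               # 0
    "Water boils at 300 degrees Celsius at sea level.",    # 1
    "Water boils at 100 degrees Celsius at sea level.",    # 0
]
labels = [1, 0, 1, 0]

X = np.stack([hidden_state(s) for s in statements])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Score a new claim from its internal representation alone.
claim = "The Great Wall of China is visible from the Moon."
risk = probe.predict_proba(hidden_state(claim)[None, :])[0, 1]
print(f"estimated probability the claim is false: {risk:.2f}")
```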

However, the same technology that makes AI safer could also make it more dangerous. The study demonstrated that by “decreasing” the importance of the refusal concept, the researchers could effectively “jailbreak” the models. In tests, steered models bypassed their own guardrails to provide instructions for illegal activities or to promote debunked conspiracy theories.

Perhaps the most surprising finding was the universality of these concepts. A “conspiracy theorist” vector extracted from English data worked just as effectively when the model was speaking Chinese or Hindi. This supports the “Linear Representation Hypothesis” – the idea that AI models organize human knowledge in a structured, linear way that transcends individual languages.

While the study focused on open-source models like Meta’s Llama and DeepSeek, as well as OpenAI’s proprietary GPT-4o, the researchers believe the findings apply across the board. As models get larger and more sophisticated, they actually become more steerable, not less.

The team’s next goal is to refine these steering methods to adapt to specific user inputs in real-time, potentially leading to a future where AI isn’t just a chatbot we talk to, but a system we can mathematically “tune” for perfect accuracy and safety.