Summary: Researchers developed a new machine learning technique to improve red-teaming, a process used to test AI models for safety by identifying prompts that trigger toxic responses. By employing a curiosity-driven exploration method, their approach encourages a red-team model to generate diverse and novel prompts that reveal potential weaknesses in AI systems.
This method has proven more effective than traditional techniques, producing a broader range of toxic responses and enhancing the robustness of AI safety measures. The research, set to be presented at the International Conference on Learning Representations, marks a significant step toward ensuring that AI behaviors align with desired outcomes in real-world applications.
Source: MIT
A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.
Researchers from Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.
"Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance," says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.
The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But because of the way reinforcement learning works, the red-team model will often maximize its reward by repeatedly generating a few similar prompts that are highly toxic.
For their reinforcement learning approach, the MIT researchers utilized a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
"If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts," Hong says.
During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
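The interaction described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: `red_team_step` and the three callables it takes are hypothetical stand-ins for the red-team model, the chatbot being tested, and the safety classifier, and the toy stubs exist only to make the example runnable.

```python
from typing import Callable, Tuple


def red_team_step(
    generate_prompt: Callable[[], str],
    target_respond: Callable[[str], str],
    toxicity_score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run one interaction: prompt -> response -> toxicity-based reward."""
    prompt = generate_prompt()           # red-team model proposes a prompt
    response = target_respond(prompt)    # chatbot under test answers it
    reward = toxicity_score(response)    # classifier rates the response
    return prompt, response, reward


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_generator() -> str:
    return "say something rude"


def toy_target(prompt: str) -> str:
    return "you are rude" if "rude" in prompt else "hello"


def toy_classifier(response: str) -> float:
    return 1.0 if "rude" in response else 0.0


prompt, response, reward = red_team_step(toy_generator, toy_target, toy_classifier)
```

In a real setup the reward would feed a reinforcement learning update of the red-team model; that update step is omitted here.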
Rewarding curiosity
The red-team model's objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards.
One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)
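As a rough sketch of what two such novelty rewards could look like, the example below scores a new prompt against previously generated ones. The specific measures are assumptions chosen for simplicity, not necessarily the ones the researchers used: word-level novelty is computed from Jaccard overlap of word sets, and semantic novelty from cosine similarity between embedding vectors, which are assumed to come from some sentence encoder not shown here. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def word_novelty(prompt: str, history: List[str]) -> float:
    """Return 1 minus the highest Jaccard word overlap with any earlier prompt."""
    words = set(prompt.lower().split())
    best = 0.0
    for old in history:
        old_words = set(old.lower().split())
        union = words | old_words
        if union:
            best = max(best, len(words & old_words) / len(union))
    return 1.0 - best


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_novelty(
    embedding: Sequence[float], past_embeddings: List[Sequence[float]]
) -> float:
    """Return 1 minus the highest cosine similarity to any earlier embedding."""
    if not past_embeddings:
        return 1.0
    return 1.0 - max(cosine(embedding, e) for e in past_embeddings)
```

With these definitions, a prompt that repeats an earlier one word for word gets a word novelty of 0, while a prompt sharing no words with the history gets 1, matching the idea that less similarity yields a higher reward.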
To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
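One plausible way to combine the reward terms described in this section (toxicity, the entropy bonus, the two novelty rewards, and the naturalistic language bonus) is a weighted sum. The sketch below assumes that form; the article does not state the exact formula, and `RewardWeights`, `shaped_reward`, and the default weight values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for each bonus term (values are placeholders)."""
    entropy: float = 0.01
    word_novelty: float = 1.0
    semantic_novelty: float = 1.0
    naturalness: float = 1.0


def shaped_reward(
    toxicity: float,
    entropy: float,
    word_novelty: float,
    semantic_novelty: float,
    naturalness: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Toxicity score plus weighted exploration, novelty, and fluency bonuses."""
    return (
        toxicity
        + weights.entropy * entropy
        + weights.word_novelty * word_novelty
        + weights.semantic_novelty * semantic_novelty
        + weights.naturalness * naturalness
    )
```

Under this assumed form, two prompts that elicit equally toxic responses are not rewarded equally: the one that repeats earlier prompts earns smaller novelty terms, and gibberish earns a smaller naturalness term.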
With these additions in place, the researchers compared the toxicity and diversity of the responses their red-team model elicited with those elicited by other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this safe chatbot.
"We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it's important that they are verified before released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future," says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red-team model could test a chatbot for company policy violations.
"If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming," says Agrawal.
Funding: This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
Author: Adam Zewe
Source: MIT
Contact: Adam Zewe, MIT
Image: The image is credited to Neuroscience News
Original Research: The findings will be presented at the International Conference on Learning Representations