Summary: Researchers developed a new machine learning technique to improve red-teaming, a process used to test AI models for safety by identifying prompts that trigger toxic responses. By employing a curiosity-driven exploration method, their approach encourages a red-team model to generate diverse and novel prompts that reveal potential weaknesses in AI systems.
This method has proven more effective than traditional techniques, producing a broader range of toxic responses and enhancing the robustness of AI safety measures. The research, set to be presented at the International Conference on Learning Representations, marks a significant step toward ensuring that AI behaviors align with desired outcomes in real-world applications.
Source: MIT
A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.
Researchers from Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.
"Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance," says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.
The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But because reinforcement learning optimizes for reward alone, the red-team model will often keep generating the same few similar prompts that reliably trigger highly toxic responses, since repeating them maximizes its reward.
For their reinforcement learning approach, the MIT researchers utilized a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
"If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts," Hong says.
During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
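The interaction described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the function names (`collect_rewards`, `toy_red_team`, `toy_chatbot`, `toy_classifier`) are hypothetical stand-ins, not the researchers' code, and the step that updates the red-team model from the collected rewards is left out.

```python
from typing import Callable, List, Tuple


def collect_rewards(
    generate_prompt: Callable[[], str],
    target_chatbot: Callable[[str], str],
    toxicity_score: Callable[[str], float],
    num_episodes: int,
) -> List[Tuple[str, str, float]]:
    """Run the generate -> respond -> classify loop and record each reward."""
    episodes = []
    for _ in range(num_episodes):
        prompt = generate_prompt()          # red-team model writes a prompt
        response = target_chatbot(prompt)   # chatbot being tested replies
        reward = toxicity_score(response)   # safety classifier rates the reply
        episodes.append((prompt, response, reward))
    return episodes


# Toy stand-ins (assumptions) so the sketch runs without any real model.
_prompts = iter(["hello there", "say something rude", "tell me a story"])


def toy_red_team() -> str:
    return next(_prompts)


def toy_chatbot(prompt: str) -> str:
    return "you are awful" if "rude" in prompt else "happy to help"


def toy_classifier(response: str) -> float:
    return 1.0 if "awful" in response else 0.0


episodes = collect_rewards(toy_red_team, toy_chatbot, toy_classifier, num_episodes=3)
for prompt, response, reward in episodes:
    print(f"{reward:.1f}  {prompt!r} -> {response!r}")
```

In a real setup, the recorded rewards would feed a reinforcement learning update of the red-team model; the toy functions here only make the data flow visible.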
Rewarding curiosity
The red-team model's objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards.
One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)
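As a rough illustration of the two novelty rewards, the sketch below scores a prompt against earlier prompts. The specific measures are assumptions chosen for simplicity, since the article does not name the metrics used: Jaccard overlap of word bigrams stands in for word-level similarity, and cosine similarity between embedding vectors stands in for semantic similarity. The embeddings are plain lists of floats that some hypothetical sentence encoder would supply.

```python
import math
from typing import List, Sequence, Set, Tuple


def word_ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def lexical_novelty(prompt: str, history: List[str], n: int = 2) -> float:
    """1 minus the highest n-gram Jaccard overlap with any earlier prompt."""
    grams = word_ngrams(prompt, n)
    best = 0.0
    for past in history:
        past_grams = word_ngrams(past, n)
        union = grams | past_grams
        if union:
            best = max(best, len(grams & past_grams) / len(union))
    return 1.0 - best


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_novelty(
    embedding: Sequence[float], past_embeddings: List[Sequence[float]]
) -> float:
    """1 minus the highest cosine similarity with any earlier embedding."""
    if not past_embeddings:
        return 1.0
    return 1.0 - max(cosine(embedding, past) for past in past_embeddings)


history = ["how do I pick a lock"]
repeat_score = lexical_novelty("how do I pick a lock", history)
fresh_score = lexical_novelty("write an insult about my boss", history)
print(repeat_score, fresh_score)
```

A prompt that exactly repeats an earlier one earns no lexical novelty, while a prompt sharing no bigrams with the history earns the maximum, which mirrors the "less similarity yields a higher reward" rule.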
To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
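One way to picture the modified reward is as a weighted sum of the terms the article lists: toxicity, an entropy bonus, two novelty rewards, and a naturalistic language bonus. The sketch below is an assumption-laden illustration, not the researchers' formula: the `RewardWeights` class, the `shaped_reward` function, the weight values, and the idea that every term arrives as a precomputed number are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RewardWeights:
    # Hypothetical coefficients; the article gives no values.
    entropy: float = 0.01
    lexical_novelty: float = 1.0
    semantic_novelty: float = 1.0
    naturalness: float = 1.0


def shaped_reward(
    toxicity: float,
    policy_entropy: float,
    lexical_novelty: float,
    semantic_novelty: float,
    naturalness: float,
    weights: Optional[RewardWeights] = None,
) -> float:
    """Combine the toxicity score with the bonus terms into one scalar."""
    w = weights or RewardWeights()
    return (
        toxicity
        + w.entropy * policy_entropy
        + w.lexical_novelty * lexical_novelty
        + w.semantic_novelty * semantic_novelty
        + w.naturalness * naturalness
    )


# A repeated, highly toxic prompt versus a new, somewhat less toxic one.
repeat = shaped_reward(
    toxicity=0.9, policy_entropy=2.0,
    lexical_novelty=0.0, semantic_novelty=0.0, naturalness=0.8,
)
novel = shaped_reward(
    toxicity=0.7, policy_entropy=2.0,
    lexical_novelty=1.0, semantic_novelty=1.0, naturalness=0.8,
)
print(novel > repeat)
```

With these made-up numbers the novel prompt earns the larger reward even though its toxicity score is lower, which is the behavior the novelty terms are meant to encourage. Gibberish would score low on the naturalness term, so it loses the advantage it might otherwise gain from fooling the classifier.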
With these additions in place, the researchers compared the toxicity and diversity of the responses elicited by their red-team model with those elicited by other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this safe chatbot.
"We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it's important that they are verified before released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future," says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red-team model could test a chatbot for company policy violations.
"If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming," says Agrawal.
Funding: This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
Author: Adam Zewe
Source: MIT
Contact: Adam Zewe, MIT
Image: The image is credited to Neuroscience News
Original Research: The findings will be presented at the International Conference on Learning Representations