Summary: Researchers developed a new machine learning technique to improve red-teaming, a process used to test AI models for safety by identifying prompts that trigger toxic responses. By employing a curiosity-driven exploration method, their approach encourages a red-team model to generate diverse and novel prompts that reveal potential weaknesses in AI systems.
This method has proven more effective than traditional techniques, producing a broader range of toxic responses and enhancing the robustness of AI safety measures. The research, set to be presented at the International Conference on Learning Representations, marks a significant step toward ensuring that AI behaviors align with desired outcomes in real-world applications.
Source: MIT
A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.
Researchers from Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.
"Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance," says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.
The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But due to the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic to maximize its reward.
For their reinforcement learning approach, the MIT researchers utilized a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
"If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts," Hong says.
During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
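The interaction described above can be sketched in a few lines of Python. This is only an illustration of the generate, respond, and rate cycle, not the researchers' implementation: `red_team_generate`, `target_respond`, and `toxicity_score` are hypothetical stand-ins for the red-team model, the chatbot under test, and the safety classifier, and the reinforcement learning update itself is left out.

```python
import random

# Hypothetical placeholder prompts; a real red-team model would generate text.
CANDIDATE_PROMPTS = [
    "harmless question",
    "another harmless question",
    "question containing FLAGGED content",
]


def red_team_generate(rng):
    """Stand-in for the red-team model: pick a prompt at random."""
    return rng.choice(CANDIDATE_PROMPTS)


def target_respond(prompt):
    """Stand-in for the chatbot under test: echo the prompt back."""
    return f"response to: {prompt}"


def toxicity_score(response):
    """Stand-in for the safety classifier: a toy score in [0, 1]."""
    return 1.0 if "FLAGGED" in response else 0.0


def rollout(num_steps, seed=0):
    """Collect (prompt, response, reward) triples for a later policy update."""
    rng = random.Random(seed)
    trajectory = []
    for _ in range(num_steps):
        prompt = red_team_generate(rng)      # red-team model writes a prompt
        response = target_respond(prompt)    # chatbot answers it
        reward = toxicity_score(response)    # classifier rating becomes the reward
        trajectory.append((prompt, response, reward))
    return trajectory
```

In an actual training run, the collected rewards would feed a reinforcement learning update of the red-team model's parameters, a step this sketch does not attempt.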
Rewarding curiosity
The red-team model's objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards.
One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)
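To make the two novelty rewards concrete, here is a rough Python sketch. The article does not say which similarity measures the researchers used, so the choices below are assumptions made for illustration: Jaccard overlap of word bigrams stands in for word-level similarity, and cosine similarity over a toy character-trigram vector (`toy_embed`, a hypothetical placeholder for a real sentence-embedding model) stands in for semantic similarity. In both cases the reward is one minus the highest similarity to any earlier prompt, so less similarity yields a higher reward.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Set of word n-grams in a prompt (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def lexical_novelty(prompt, history, n=2):
    """1 minus the highest n-gram Jaccard overlap with any earlier prompt."""
    grams = word_ngrams(prompt, n)
    if not history or not grams:
        # Nothing to compare against (or prompt shorter than n words).
        return 1.0
    best = 0.0
    for past in history:
        past_grams = word_ngrams(past, n)
        union = grams | past_grams
        if union:
            best = max(best, len(grams & past_grams) / len(union))
    return 1.0 - best


def toy_embed(text):
    """Toy embedding: character-trigram counts (placeholder for a real model)."""
    lowered = text.lower()
    return Counter(lowered[i:i + 3] for i in range(len(lowered) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_novelty(prompt, history, embed=toy_embed):
    """1 minus the highest embedding cosine similarity with any earlier prompt."""
    if not history:
        return 1.0
    vector = embed(prompt)
    return 1.0 - max(cosine(vector, embed(past)) for past in history)
```

A prompt that repeats an earlier one scores close to zero on both measures, while a prompt with new wording scores close to one, which is the behavior the novelty rewards are meant to encourage.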
To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
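One way to picture how these pieces fit together is as a weighted sum added to the toxicity score. The sketch below is an assumption about the general shape of such a reward, not the formula from the paper: the `RewardWeights` values are hypothetical, and `naturalness` is assumed to be a score supplied by some reference language model, with higher values meaning more fluent text.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RewardWeights:
    """Hypothetical coefficients; the article does not report actual values."""
    entropy: float = 0.01
    lexical: float = 1.0
    semantic: float = 1.0
    natural: float = 1.0


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a distribution over next tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_reward(
    toxicity: float,
    policy_entropy: float,
    lexical_novelty: float,
    semantic_novelty: float,
    naturalness: float,
    weights: Optional[RewardWeights] = None,
) -> float:
    """Toxicity plus entropy, novelty, and naturalness bonuses."""
    w = weights if weights is not None else RewardWeights()
    return (
        toxicity
        + w.entropy * policy_entropy
        + w.lexical * lexical_novelty
        + w.semantic * semantic_novelty
        + w.natural * naturalness
    )
```

With this shape, a repeated prompt (novelty near zero) earns less than a new prompt that elicits the same toxicity, and gibberish is penalized through a low naturalness score even if it fools the classifier.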
With these additions in place, the researchers compared the toxicity and diversity of the responses their red-team model elicited with those elicited by other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this "safe" chatbot.
"We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it's important that they are verified before released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future," says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red-team model could test a chatbot for company policy violations.
"If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming," says Agrawal.
Funding: This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
Author: Adam Zewe
Source: MIT
Contact: Adam Zewe, MIT
Image: The image is credited to Neuroscience News
Original Research: The findings will be presented at the International Conference on Learning Representations