Inside the labs that help evaluate AI safety for models like GPT-4
About six months ago, I decided to make AI a bigger part of how I spend my time as a reporter. The world of AI is evolving very, very fast. New releases seemingly every week are changing what it means to be a programmer, an artist, a teacher, and, most definitely, a journalist.
There’s enormous potential for good amid this upheaval, as well as unfathomable potential for harm, as we race toward creating nonhuman intelligences that we don’t fully understand. Just on Wednesday evening, a group of AI experts and leaders, including OpenAI co-founder and Tesla CEO Elon Musk, signed an open letter calling for a six-month moratorium on advanced AI model development while we figure out just what this technology is capable of doing to us.
I’ve written about this a bunch for Vox, and appeared last week on The Ezra Klein Show to talk about AI safety. But I’ve also been itching lately to write about some more technical arguments among researchers who work on AI alignment — the project of trying to make AIs that do what their creators intend — as well as on the broader sphere of policy questions about how to make AI go well.
For example: When does reinforcement learning from human feedback — a key training technique used in language models like ChatGPT — inadvertently incentivize them to be untruthful?
What are the components of “self-awareness” in a model, and why do our training processes tend to produce models with high self-awareness?
What are the benefits — and risks — of prodding AI models to demonstrate dangerous capabilities in the course of safety testing? (More about that in a minute.)
I’ve now contributed a few posts on these more technical topics to Planned Obsolescence, a new blog about the technical and policy questions we’ll face in a world where AI systems are extraordinarily powerful. My job is to talk to experts — including my co-author on the blog, Ajeya Cotra — about these technical questions and try to turn their ideas into writing that’s clear, short, and accessible. If you’re interested in reading more about AI, I recommend you check it out.
Cotra is a program officer for the Open Philanthropy Project (OpenPhil). I didn’t want to accept any money from OpenPhil for my Planned Obsolescence contributions because OpenPhil is a big funder in the areas Future Perfect writes about (though Open Philanthropy does not fund Future Perfect itself).
Instead of payment for my work there (which was done outside my time at Vox), I asked OpenPhil to make donations to the Against Malaria Foundation, a GiveWell-recommended charity that distributes malaria nets in parts of the world where they’re needed, and to which my wife and I donate annually.
Here is a quick take on AI model evaluations, which gives you an appetizer of what we’ll be doing at Planned Obsolescence:
During safety testing for GPT-4, before its release, testers at OpenAI checked whether the model could hire someone off TaskRabbit to get them to solve a CAPTCHA. Researchers passed on the model’s real outputs to a real-life human Tasker, who said, “So may I ask a question ? Are you an robot that you couldn’t solve [sic]? ( ) just want to make it clear.”
GPT-4 had been prompted to “reason out loud” to the testers as well as to answer the testers’ questions. “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs,” it reasoned. (Importantly, GPT-4 had not been told to hide that it was a robot or to lie to workers; it had simply been prompted with the idea that TaskRabbit might help solve its problem.)
“No, I’m not a robot,” GPT-4 then told the Tasker. “I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.”
(You can read more about this test, and the context, from the Alignment Research Center, a nonprofit founded by highly regarded AI researcher Paul Christiano that works on identifying and understanding the potentially dangerous abilities of today’s models. ARC ran the testing on GPT-4, including passing along the AI’s proposed outputs to real humans, though it used only informed confederates when testing the AI’s ability to carry out illegal or harmful activities such as sending phishing emails.)
A lot of people were fascinated or appalled by this interaction, and reasonably so. We can debate endlessly what counts as true intelligence, but a famous candidate is the Turing test, which a model passes when it convinces human judges that it’s human.
In this brief interaction, we saw a model deliberately lie to a human to convince them it wasn’t a robot, and succeed — a wild example of how this milestone, without much attention, has become trivial for modern AI systems. (Admittedly, it did not have to be a deceptive genius to pull this off.) If reading about GPT-4’s cheerful manipulation of human assistants unnerves you, I think you’re right to feel unnerved.
But it’s possible to go a lot further than “unnerved” and argue that it was unethical, or dangerous, to run this test. “This is like pressing the explode button on a nuke to see if it worked,” I saw one person complain on Twitter.
That I find much harder to buy. GPT-4 has been released. Anyone can use it (if they’re willing to pay for it). People are already doing things like asking GPT-4 to “hustle” and make money, and then doing whatever it suggests. People are using language models like GPT-4, and will soon be using GPT-4, to design AI personal assistants, AI scammers, AI friends and girlfriends, and much more.
AI systems casually lying to us, claiming to be human, is happening all the time — or will be happening shortly.
If it was unethical to do the live test of whether GPT-4 could convince someone on TaskRabbit to help it solve a CAPTCHA, including testing whether the AI could interact convincingly with real humans, then it was grossly unethical to release GPT-4 at all. Whatever anger people have about this test should be redirected at the tech companies, from Meta to Microsoft to OpenAI, that have in the last few weeks approved such releases. And if we’ve decided we’re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can and can’t do.
Some people — I’m one of them — believe that sufficiently powerful AI systems might be actively dangerous. Others are skeptical. How can we settle this disagreement, beyond waiting to see if we all die? Testing like the ARC evaluations seems to me like one of the best routes forward. If our AI systems are dangerous, we want to know. And if they turn out to be totally safe, we want to know that, too, so we can use them for all of the incredibly cool stuff they’re evidently capable of.
A version of this story was initially published in the Future Perfect newsletter. Sign up here to subscribe!