All this outrage is finally having an impact. Bing’s trippy content is generated by AI language technology called ChatGPT, developed by the startup OpenAI, and last Friday, OpenAI issued a blog post aimed at clarifying how its chatbots should behave. It also released its guidelines on how ChatGPT should respond when prompted with questions about US “culture wars.” The rules include not affiliating with political parties or judging one group as good or bad, for example.
I spoke to Sandhini Agarwal and Lama Ahmad, two AI policy researchers at OpenAI, about how the company is making ChatGPT safer and less nuts. The company refused to comment on its relationship with Microsoft, but the researchers still had some interesting insights. Here’s what they had to say:
How to get better answers: In AI language model research, one of the biggest open questions is how to stop the models from “hallucinating,” a polite term for making stuff up. ChatGPT has been used by millions of people for months, but we haven’t seen the kind of falsehoods and hallucinations that Bing has been generating.
That’s because OpenAI has used a technique in ChatGPT called reinforcement learning from human feedback, which improves the model’s answers based on feedback from users. The technique works by asking people to compare a range of different outputs and rank them on criteria such as factualness and truthfulness. Some experts believe Microsoft might have skipped or rushed this stage to launch Bing, although the company has yet to confirm or deny that claim.
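To make the ranking step concrete, here is a minimal sketch of how one person's ranking of several outputs could be turned into "preferred versus rejected" pairs, a common way of preparing human feedback for training. This is an illustration only: the function name `ranking_to_pairs` and the sample answers are hypothetical, and nothing here is taken from OpenAI's actual pipeline.

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn one ranking (best first) into (preferred, rejected) pairs.

    Hypothetical helper for illustration; not OpenAI's code.
    """
    # combinations() keeps the input order, so the first item of each
    # pair is always the one the labeller ranked higher.
    return [(better, worse) for better, worse in combinations(ranked_outputs, 2)]

# A labeller ranked three candidate answers from best to worst.
ranking = ["answer A", "answer C", "answer B"]
pairs = ranking_to_pairs(ranking)

for preferred, rejected in pairs:
    print(f"{preferred} preferred over {rejected}")
```

Each pair records a single human judgment, and a model can then be nudged toward the kind of answer that people preferred.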
But that method is not perfect, according to Agarwal. People might have been presented with options that were all false, then picked the option that was the least false, she says. In an effort to make ChatGPT more reliable, the company has been focusing on cleaning up its dataset and removing examples in which the model preferred false statements.
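The kind of cleanup Agarwal describes can be pictured with a small sketch. It assumes, purely for illustration, that each comparison shown to labellers comes with a true/false flag per option; the article does not say how OpenAI actually identifies such cases, and the field names and the `drop_all_false` helper are made up.

```python
def drop_all_false(comparisons):
    """Keep only comparisons in which at least one option was marked true.

    Hypothetical helper; the 'is_true' flags are an assumption.
    """
    return [c for c in comparisons if any(c["is_true"])]

comparisons = [
    # Every option was false, so the "winner" was merely the least false.
    {"prompt": "q1", "is_true": [False, False, False]},
    # At least one option was true, so the preference is worth keeping.
    {"prompt": "q2", "is_true": [True, False]},
]

cleaned = drop_all_false(comparisons)
print([c["prompt"] for c in cleaned])
```

Dropping the first comparison means the model is no longer taught that a false answer was the "good" one.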
Jailbreaking ChatGPT: Since ChatGPT’s release, people have been trying to “jailbreak” it, which means finding workarounds to prompt the model to break its own rules and generate racist or conspiratorial stuff. This work has not gone unnoticed at OpenAI HQ. Agarwal says OpenAI has gone through its entire database and selected the prompts that have led to unwanted content in order to improve the model and stop it from repeating these generations.
OpenAI wants to listen: The company has said it will start gathering more feedback from the public to shape its models. OpenAI is exploring using surveys or setting up citizens’ assemblies to discuss what content should be completely banned, says Lama Ahmad. “In the context of art, for example, nudity may not be something that’s considered vulgar, but how do you think about that in the context of ChatGPT in the classroom?” she says.
Consensus project: OpenAI has traditionally used human feedback from data labellers, but recognizes that the people it hires to do that work are not representative of the wider world, says Agarwal. The company wants to expand the viewpoints and the perspectives that are represented in these models. To that end, it’s working on a more experimental project dubbed the “consensus project,” where OpenAI researchers are looking at the extent to which people agree or disagree across different things the AI model has generated. People might feel more strongly about answers to questions such as “are taxes good” versus “is the sky blue,” for example, Agarwal says.
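One simple way to put a number on "the extent to which people agree" is the share of raters who gave the most common answer. The sketch below is an assumption made for illustration, not a description of how the consensus project measures agreement; the `agreement_rate` function and the votes are invented.

```python
from collections import Counter

def agreement_rate(votes):
    """Share of raters who gave the most common answer (1.0 = full consensus).

    Hypothetical measure for illustration only.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

# Invented votes from four raters on each question.
sky = agreement_rate(["yes", "yes", "yes", "yes"])   # "is the sky blue"
taxes = agreement_rate(["yes", "no", "no", "yes"])   # "are taxes good"

print(sky, taxes)
```

With these made-up votes, the sky question shows full agreement while the tax question splits the raters evenly, which is the contrast Agarwal points to.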