Called RETRO (for “Retrieval-Enhanced Transformer”), the AI matches the performance of neural networks 25 times its size, cutting the time and cost needed to train very large models. The researchers also claim that the database makes it easier to analyze what the AI has learned, which could help with filtering out bias and toxic language.
“Being able to look things up on the fly instead of having to memorize everything can often be useful, in the same way as it is for humans,” says Jack Rae at DeepMind, who leads the firm’s research in large language models.
Language models generate text by predicting what words come next in a sentence or conversation. The larger a model, the more information about the world it can learn during training, which makes its predictions better. GPT-3 has 175 billion parameters, the values in a neural network that store data and get adjusted as the model learns. Megatron-Turing NLG, a language model built by Microsoft and Nvidia, has 530 billion parameters. But large models also take vast amounts of computing power to train, putting them out of reach of all but the richest organizations.
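To make next-word prediction concrete, here is a minimal sketch in Python. It is a toy bigram counter, not a neural network, and the corpus and the function names (train_bigram, predict_next) are invented for illustration. The table of counts plays the role of the parameters: it is the stored data that gets adjusted as the model sees more text.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word follows each other word.

    The nested count table is this toy model's "parameters".
    (Illustrative only; real language models learn neural network weights.)
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical three-sentence training corpus.
CORPUS = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

model = train_bigram(CORPUS)
print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
print(predict_next(model, "sat"))  # "on" is the only word that follows "sat"
```

A model this small can only repeat patterns it has counted directly, which hints at why adding parameters and training data tends to improve predictions.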
With RETRO, DeepMind has tried to cut the cost of training without reducing the amount the AI learns. The researchers trained the model on a vast data set of news articles, Wikipedia pages, books, and text from GitHub, an online code repository. The data set contains text in 10 languages, including English, Spanish, German, French, Russian, Chinese, Swahili, and Urdu.
RETRO’s neural network has only 7 billion parameters. But the system makes up for this with a database containing around 2 trillion passages of text. Both the database and the neural network are trained at the same time.
When RETRO generates text, it uses the database to look up and compare passages similar to the one it is writing, which makes its predictions more accurate. Outsourcing some of the neural network’s memory to the database lets RETRO do more with less.
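The look-up step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DeepMind's implementation: passages are compared with a bag-of-words cosine similarity rather than a learned encoder, the database is three hypothetical sentences rather than trillions of words, and the names embed, cosine, and retrieve are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (an assumption; not RETRO's actual encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=2):
    """Return the k database passages most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


# Hypothetical stand-in for the external text database.
DATABASE = [
    "the eiffel tower is in paris",
    "the colosseum is in rome",
    "python is a programming language",
]

# The text being written acts as the query; the retrieved passages would
# then be handed to the model as extra context for its next prediction.
neighbors = retrieve("where is the eiffel tower", DATABASE, k=1)
print(neighbors)
```

The point of the design is that facts such as where the Eiffel Tower stands can live in the database rather than in the network's parameters, so the network itself can stay small.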
The idea isn’t new, but this is the first time a look-up system has been developed for a large language model, and the first time the results from this approach have been shown to rival the performance of the best language AIs around.
Bigger isn’t always better
RETRO draws from two other studies released by DeepMind this week, one looking at how the size of a model affects its performance and one looking at the potential harms caused by these AIs.
To study size, DeepMind built a large language model called Gopher, with 280 billion parameters. It beat state-of-the-art models on 82% of the more than 150 common language challenges used for testing. The researchers then pitted it against RETRO and found that the 7-billion-parameter model matched Gopher’s performance on most tasks.
The ethics study is a comprehensive survey of well-known problems inherent in large language models. These models pick up biases, misinformation, and toxic language such as hate speech from the articles and books they are trained on. As a result, they sometimes spit out harmful statements, mindlessly mirroring what they have encountered in the training text without knowing what it means. “Even a model that perfectly mimicked the data would be biased,” says Rae.