From ChatGPT to Google Gemini, large language models play an increasingly important role in our everyday lives. These models belong to the field of natural language processing – or ‘NLP’ – which studies how machines understand and generate human language. Most NLP systems are built using machine learning, with vast amounts of language data as training material. A successfully trained model should then be able to handle new scenarios, an ability called ‘generalization’. For a large language model that generalizes well, a conversation about a topic it hasn’t been trained on, such as a new scientific discovery, should not be a problem.
While NLP researchers widely agree that generalization is important, they don’t agree on what good generalization looks like, what types of generalization exist, or which types matter in which scenarios. To address this, Dieuwke Hupkes and collaborators from 20 different universities and companies conducted the largest study of generalization in NLP to date.
In their paper, ‘A taxonomy and review of generalization research in NLP’, published in Nature Machine Intelligence, they present a map of the generalization research landscape, built on a new taxonomy of five axes along which studies can differ.
The first axis captures the motivation for studying a model’s generalization. Some papers have a purely practical motivation: they want to ensure that models continue to perform well when conditions change. Others have a cognitive motivation: do models generalize the way humans do? Still others look at fairness and inclusivity: does the model behave equally well across languages and for users from all social backgrounds?
The second axis examines the type of generalization being studied, such as generalization across tasks: if a model is trained to answer questions, can it also write poetry? Another type is cross-lingual generalization: if a model is trained mostly on English data, how much additional training data does it need to do well in another language? Google Gemini, for example, having seen only a single grammar book in Kalamang, a language spoken by fewer than 200 people, performs almost as well as a human working from the same material.
To measure generalization, researchers create intentional differences – called shifts – between the training material and the test material. Axes 3 to 5 give a technical description of how those shifts are created.
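As a concrete illustration, the short Python sketch below builds one common kind of shift: the training set contains only short sentences, while the test set contains only long ones, so evaluation probes whether a model handles lengths it has never seen. The split function, threshold, and toy corpus here are invented for illustration and do not come from the paper.

```python
# A minimal sketch (not from the paper) of one way to create a shift:
# train only on short sentences, test only on long ones, so the test
# set probes generalization to sentence lengths never seen in training.

def length_shift_split(sentences, max_train_len=10):
    """Split a corpus so training and test data differ systematically in length."""
    train = [s for s in sentences if len(s.split()) <= max_train_len]
    test = [s for s in sentences if len(s.split()) > max_train_len]
    return train, test

corpus = [
    "the cat sat on the mat",
    "dogs bark",
    "the quick brown fox jumps over the lazy dog near the quiet river bank today",
]
train_set, test_set = length_shift_split(corpus)
print(len(train_set), "training sentences,", len(test_set), "test sentences")
```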
Having developed their taxonomy, Hupkes’ team mapped out the current state of generalization research as a whole, documenting over 700 NLP experiments along the five axes. Among other things, the team found that 70% of generalization studies are motivated by practical concerns, whereas only 3% have a fairness motivation. Similarly, generalization across tasks is studied far more often than cross-lingual generalization. These findings point to an urgent need for more research in these neglected areas, particularly given the risks that large language models pose to underrepresented communities.
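To show what annotating an experiment along the five axes might look like, here is a minimal sketch of one record as a Python dataclass. The field names paraphrase the article’s description of the taxonomy (motivation, generalization type, and three axes characterizing the shift); the exact names for axes 3 to 5 and the values filled in below are assumptions for illustration, not an entry from the study.

```python
from dataclasses import dataclass

@dataclass
class GeneralizationExperiment:
    """One NLP experiment annotated along the taxonomy's five axes.

    Field names paraphrase the article's description; the three shift
    fields stand in for axes 3 to 5, which technically describe how the
    train/test shift was created.
    """
    motivation: str            # axis 1: e.g. practical, cognitive, fairness
    generalization_type: str   # axis 2: e.g. cross-task, cross-lingual
    shift_type: str            # axes 3-5: how the train/test shift
    shift_source: str          #   was created (assumed field names;
    shift_locus: str           #   values below are invented)

# A hypothetical entry, not taken from the 700+ documented experiments.
example = GeneralizationExperiment(
    motivation="practical",
    generalization_type="cross-lingual",
    shift_type="covariate",
    shift_source="naturally occurring",
    shift_locus="train-test",
)
print(example)
```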
This work is part of a larger initiative called GenBench – short for Generalization Benchmarking. Visitors to the GenBench website can find the paper, a visualization of the NLP generalization landscape, and a tool to register new generalization experiments.
The GenBench team is organizing a series of academic workshops and coordinating the creation of collaboratively built generalization benchmarks. The team ultimately hopes that GenBench will lead researchers towards a better, more coordinated approach to NLP evaluation. In turn, this should improve model development, so that when you use a large language model, its responses are trustworthy and reliable, even in new and unexpected scenarios.