What is exam question answering?
Most question answering (QA) tasks can be phrased as exam question answering (EQA), and as such most modeling improvements to QA systems are also applicable to EQA; EQA can even be seen as a particular instance of open-domain question answering. What sets EQA apart, however, is its use of standardized, human-curated exams and resources. Because these exams are also administered to humans, they provide not only a point of comparison among machine reading systems, but also a benchmark of machine reading against human intelligence.
This website serves as a community resource for researchers interested in machine intelligence, particularly in systems developed for, or evaluated in, an EQA setting. Here you will find a curated list of relevant papers, data sets, and comparative result tables. In addition, we provide tools for unifying data and evaluation across different QA models to aid researchers in this area.
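To give a sense of what a unified evaluation could look like, the sketch below scores any model that exposes a simple "pick a choice label" interface against a set of multiple-choice exam questions using exact-match accuracy. It is an illustrative example only, not the site's actual tooling; the ExamQuestion schema and the evaluate helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ExamQuestion:
    """One multiple-choice exam question (hypothetical schema)."""
    stem: str                  # the question text
    choices: dict[str, str]    # e.g. {"A": "Venus", "B": "Mercury", ...}
    answer_key: str            # the correct choice label, e.g. "B"


def evaluate(model: Callable[[str, dict[str, str]], str],
             questions: Iterable[ExamQuestion]) -> float:
    """Return exact-match accuracy of `model` over `questions`.

    `model` takes (stem, choices) and returns a single choice label,
    so any QA system can be plugged in behind the same interface.
    """
    questions = list(questions)
    if not questions:
        return 0.0
    correct = sum(model(q.stem, q.choices) == q.answer_key for q in questions)
    return correct / len(questions)


if __name__ == "__main__":
    # Trivial baseline: always guess the first listed choice.
    qs = [ExamQuestion("Which planet is closest to the Sun?",
                       {"A": "Venus", "B": "Mercury", "C": "Mars"}, "B")]
    always_first = lambda stem, choices: next(iter(choices))
    print(f"accuracy = {evaluate(always_first, qs):.2f}")
```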
Motivation
Since the earliest days of artificial intelligence, researchers have sought to define what it means for a machine to be intelligent and, in doing so, to devise a test with which to measure our progress towards creating intelligent machines. While the Turing test has emerged as the most notable and widely cited of these measures, it is a poor fit for a research environment that needs a quick and objective measure of a machine's reading and reasoning capabilities.
Simultaneously, after decades of steady progress, the performance of state-of-the-art systems on many traditional NLP tasks has leveled off. Alongside the development of more powerful models, attention has shifted away from "pre-processing" tasks with little inherent value, such as part-of-speech tagging and parsing, and towards end tasks that correlate more strongly with a potential user's goals. What will be the defining NLP tasks of tomorrow? Question answering is likely to be among them, but QA still requires more formalization, including more consistent choices of data sets and evaluation measures, to ensure that QA research is just as scientifically rigorous as the NLP tasks that came before it.
In recent years, increasing attention has been paid to exams, quizzes, and word games as measures of these skills. For the same reasons that we subject ourselves to standardized testing, namely to probe what knowledge we have memorized and what new facts we can infer, we feel that EQA is an excellent benchmark for assessing machine intelligence in NLP. While simple tactics can produce high-scoring baselines [1], it stands to reason that matching expert human performance on exams will require machines to comprehend the same facts and to make similar inferences.