A recent article by NLP researcher Jesse Dunietz argues that the rapidly evolving field of Natural Language Processing has become very good at solving benchmark problems while not actually building much that is useful.
Much of today’s reading comprehension research entails carefully tweaking models to eke out a few more percentage points on the latest data sets. “State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”
But many people in the field are growing weary of such leaderboard-chasing. What has the world really gained if a massive neural network achieves SOTA on some benchmark by a point or two?
This is an example of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
I do have some sympathy with this position: researchers do tend to focus on benchmarks. However, the algorithms they produce go on to be used in more practical settings by practitioners like me. There does need to be some way to compare progress, imperfect as it is, but these benchmarks need constant tweaking as the technology and requirements progress.
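For readers unfamiliar with what those SQuAD "points" actually measure: leaderboard scores there are largely token-level F1 overlap between a predicted answer span and the gold answer. A minimal sketch of that metric (my own simplified version, not the official SQuAD evaluation script):

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation, drop English articles, split on whitespace
    # (roughly mirrors the normalization SQuAD-style scripts apply).
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1_score(prediction, gold):
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset of shared tokens between prediction and gold answer.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

So "beating SOTA by 2.4 points" means nudging an average of scores like this one, across thousands of questions, by a couple of percentage points, which is exactly why a leaderboard gain says so little about real-world usefulness.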