Machine learning is algorithms plus data. A lot of focus goes into improving algorithms; not enough goes into improving data.
In a recent talk, machine learning pioneer Andrew Ng gave several examples where algorithmic improvements made little difference, but improving the data made a big difference. This runs contrary to the usual focus on algorithmic tweaks. Often, simply by improving the consistency of labelling in the training data, we can improve a model dramatically.
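As a minimal sketch of what a labelling-consistency check might look like, the snippet below scans a toy labelled dataset for inputs that have been given conflicting labels by different annotators. The dataset and the `find_label_conflicts` helper are hypothetical illustrations, not anything from Ng's talk:

```python
from collections import defaultdict

# Hypothetical labelled dataset: (input, label) pairs from several annotators.
data = [
    ("wire snapped near connector", "defect"),
    ("scratch on casing", "cosmetic"),
    ("wire snapped near connector", "cosmetic"),  # conflicts with the first row
    ("scratch on casing", "cosmetic"),
]

def find_label_conflicts(pairs):
    """Return a dict of inputs that received more than one distinct label."""
    labels_by_input = defaultdict(set)
    for x, y in pairs:
        labels_by_input[x].add(y)
    return {x: sorted(ys) for x, ys in labels_by_input.items() if len(ys) > 1}

conflicts = find_label_conflicts(data)
for text, labels in conflicts.items():
    print(f"{text!r} was labelled inconsistently: {labels}")
```

Surfacing these conflicts and agreeing a single labelling convention is exactly the kind of cheap, data-side fix that often beats further model tuning.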
I agree wholeheartedly with this point of view. In ML research (and Kaggle-style competitions) there is very little focus on data exploration and understanding, because so many ML papers concentrate on improving model performance against a static benchmark dataset. This has been a useful approach, but it bears little resemblance to how ML is used in the real world. Time and again I have seen cases where extensive hyperparameter tuning gives virtually no improvement, while simple changes to the data give huge improvements.
Why this matters: taking a more systematic approach to data quality is essential to making ML work in practice.