One model to rule them all!
Self-supervised learning allows models to learn from large volumes of unlabelled data. It has been a very successful avenue of research lately, partly because it lets models consume huge volumes of freely available data.
Most models that use this approach are designed to work with a single type of data: a model that handles text can't be repurposed to handle images or sound. As a result, many different model architectures are being developed in parallel, one for each type of input.
To counter this, Facebook AI (I'm not calling them Meta yet) has just designed a new model that achieves 'state of the art' performance (don't they all!) using a single model type across speech, text and images. The model is made up of two networks: a student and a teacher. The teacher network produces a target representation from the full input (text, speech or image), and the student has to predict that representation from an incomplete, masked version of the same input. Using this approach, the network learns a representation for each type of data from scratch.
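To make the student/teacher idea concrete, here is a minimal toy sketch of one training step. Everything here is an assumption for illustration: a single linear layer stands in for the real transformer encoder, the mask pattern is fixed, and the update rule is plain gradient descent with an exponential-moving-average (EMA) teacher. It shows the shape of the approach, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "encoder": one linear layer instead of a transformer.
dim_in, dim_rep = 8, 4
student_w = rng.normal(size=(dim_in, dim_rep))
teacher_w = student_w.copy()  # teacher starts as a copy of the student


def encode(w, x):
    # Map input tokens to representation space.
    return x @ w


def ema_update(teacher, student, tau=0.999):
    # Teacher weights slowly track the student (exponential moving average).
    return tau * teacher + (1 - tau) * student


# A batch of 2 "inputs", each 5 tokens long (could be text, speech or image patches).
x = rng.normal(size=(2, 5, dim_in))

# Fixed mask for the sketch: hide every other token from the student.
mask = np.zeros((2, 5), dtype=bool)
mask[:, 1::2] = True

target = encode(teacher_w, x)                    # teacher sees the full input
x_masked = np.where(mask[..., None], 0.0, x)     # student sees a corrupted copy
pred = encode(student_w, x_masked)

# Loss: regress the teacher's representation at the masked positions.
loss = ((pred - target)[mask] ** 2).mean()

# Manual gradient step for the linear layer (real training would use autograd).
err = np.zeros_like(pred)
err[mask] = 2 * (pred - target)[mask] / (mask.sum() * dim_rep)
grad = np.einsum('bti,btj->ij', x_masked, err)
student_w -= 0.1 * grad

# Teacher drifts towards the updated student.
teacher_w = ema_update(teacher_w, student_w)
```

Because the target is the teacher's internal representation rather than raw pixels, characters or audio samples, the same training loop applies unchanged to any input type.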