We perceive the world more like video - a continuous stream of audio and visual signals from a single point of view - than like any other medium. Most internet traffic is video, and a large proportion of time online is spent watching it.
However, ML models tend to focus on static images and written or spoken language. This is largely down to the complexity of handling video. Recent work by researchers at Facebook may start to change that. They have developed a method for building joint representations of audio and video data and have used it for clustering and classifying videos on Instagram. The key development, like many recent advances in ML, is the use of large unlabelled datasets for unsupervised pre-training.
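The core idea behind this kind of joint audio-video pre-training is that a video clip and its own soundtrack form a natural positive pair, so a model can learn a shared embedding space from unlabelled data by pulling matching audio-video pairs together and pushing mismatched ones apart. Below is a minimal sketch of such a contrastive (InfoNCE-style) objective in NumPy; the function names and setup are my own illustration, not the Facebook researchers' actual implementation.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce_loss(video_emb, audio_emb, temperature=0.1):
    """Contrastive loss over a batch of clips.

    video_emb, audio_emb: (batch, dim) arrays where row i of each
    comes from the same clip. Each clip's own soundtrack is the
    positive; every other clip's soundtrack is a negative.
    """
    # L2-normalise so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    # logits[i, j] compares clip i's video to clip j's audio
    logits = v @ a.T / temperature
    # Softmax over each row; the matching pair sits on the diagonal
    log_probs = logits - _logsumexp(logits, axis=1)
    return float(-np.mean(np.diag(log_probs)))

# Toy demo: audio embeddings that track their video embeddings
# should score a lower loss than deliberately shuffled ones.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
a = v + 0.01 * rng.normal(size=(8, 16))
aligned_loss = info_nce_loss(v, a)
shuffled_loss = info_nce_loss(v, a[::-1].copy())
```

In the real setting the embeddings would come from trainable video and audio encoders, and the loss gradient is what teaches them the shared space; here random vectors just illustrate that the objective rewards alignment.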
This research is still at a fairly early stage, and I don't think models are yet building the kind of rich temporal representations of the world that our brains do, but it is a step forward.
Why this matters: most of the data on the web is video. Now it can be used to train AI.