Just talk!

30.09.21

A child learns to walk before they can run, and to talk before they can read. Maybe AI should do the same?

The standard approach for 'talking' AI systems (like Siri, Alexa, etc.) is to turn speech into text, analyse the text and then convert the result back to speech. This is because our language models have been built using large corpora of (usually English) text. Facebook's AI research group have recently published a new approach to NLP that is based purely on audio.

The Generative Spoken Language Model (GSLM) uses a transformer architecture to model raw audio signals, without any labels or text. The model has three parts:

An encoder that converts speech into discrete sound units
An autoregressive language model that predicts the next unit
A decoder that converts the units back to speech.
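
To make the three parts concrete, here is a toy sketch of the same encode, predict, decode loop. It is not the GSLM implementation: the fixed codebook, the bigram counter standing in for the transformer, and every name in it (CODEBOOK, encode, train_unit_lm, generate, decode) are assumptions made up for illustration, and plain numbers stand in for audio frames.

```python
from collections import Counter, defaultdict

# Hypothetical codebook: each discrete unit id maps to one representative
# feature value. A real system would learn this from audio features.
CODEBOOK = [0.0, 1.0, 2.0, 3.0]


def encode(frames):
    """Encoder stand-in: map each frame to the id of the nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - x))
        for x in frames
    ]


def train_unit_lm(units):
    """Language model stand-in: count which unit follows which (bigrams)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(units, units[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, n_steps):
    """Autoregressive generation: repeatedly append the most likely next unit."""
    units = list(prompt)
    for _ in range(n_steps):
        followers = counts.get(units[-1])
        if not followers:
            break  # no continuation was seen for this unit
        units.append(followers.most_common(1)[0][0])
    return units


def decode(units):
    """Decoder stand-in: map unit ids back to representative feature values."""
    return [CODEBOOK[u] for u in units]


# Numbers standing in for unlabelled audio frames; no text or labels involved.
frames = [0.1, 0.9, 2.1, 2.9, 0.2, 1.1, 1.9, 3.2]
units = encode(frames)
lm = train_unit_lm(units)
continued = generate(lm, units[:2], 4)
audio_like = decode(continued)
print(units, continued, audio_like)
```

The point of the sketch is only the data flow: nothing in it ever touches text, and the language model sees nothing but unit ids.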

The base model is trained on a large amount of unlabelled audio and can then be fine-tuned to do lots of other tasks. The approach has some benefits over the standard text-based approach:

It can be applied to languages that do not have a large corpus of labelled training data
It allows us to encode the full expressivity of oral communication
It opens up a huge amount of unlabelled audio data.

Really interesting development.
