A child learns to walk before they can run, and to talk before they can read. Maybe AI should do the same?
The standard approach for 'talking' AI systems (like Siri, Alexa, etc) is to turn speech into text, analyse the text, and then convert the result back into speech. This is because our language models have been built using large corpora of (usually English) text. Facebook's AI research group has recently published a new approach to NLP that is based purely on audio.
The Generative Spoken Language Model (GSLM) uses a transformer architecture to model raw audio signals, without any labels or text. The model has three parts: an encoder that converts raw speech into discrete units, a unit-based language model trained over those units, and a decoder that converts generated units back into speech.
The base model is trained on a large amount of unlabelled audio and can then be fine-tuned for lots of other tasks. The approach has some benefits over the standard pipeline: it needs no transcribed training data, it can be applied to languages with little or no written text, and it can capture prosody and intonation that a text transcript throws away.
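To make the three-stage design concrete, here is a toy sketch of the encoder → unit language model → decoder flow. All function names and logic are illustrative stand-ins, not the real GSLM API: the actual system uses learned components (e.g. a self-supervised speech encoder, a transformer LM over units, and a neural vocoder), whereas this sketch uses trivially simple substitutes just to show how the pieces connect.

```python
# Hypothetical sketch of a GSLM-style pipeline; every component here is a
# toy stand-in for the learned models described in the article.

def speech_to_units(waveform):
    # Encoder stage: quantise raw audio into discrete "pseudo-text" units.
    # Toy version: bucket each sample's magnitude into one of 5 units.
    return [int(abs(x) * 10) % 5 for x in waveform]

def unit_language_model(units, n_new=3):
    # Language-model stage: autoregressively continue the unit sequence.
    # Toy version: append the most frequent unit seen so far.
    most_common = max(set(units), key=units.count)
    return units + [most_common] * n_new

def units_to_speech(units):
    # Decoder stage: map discrete units back to an audio signal.
    # Toy version: each unit becomes one constant-amplitude sample.
    return [u / 10.0 for u in units]

waveform = [0.05, 0.12, 0.12, 0.31, 0.12]   # stand-in for raw audio samples
units = speech_to_units(waveform)           # [0, 1, 1, 3, 1]
continued = unit_language_model(units)      # original units + 3 generated
audio_out = units_to_speech(continued)      # back to (toy) audio
```

The key point the sketch illustrates is that text never appears anywhere in the loop: the language model operates directly on discrete units derived from audio.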
Really interesting development.