All you need to understand the Transformer neural architecture
Since their introduction in 2017 as a tool for sequence transduction, transformer models have steadily advanced many cutting-edge areas of machine learning, primarily natural language processing. Broadly, transformers rely on the mechanism of self-attention, which differentially weights the significance of each part of the input data. But fully understanding how they work, from scratch, is a big task.
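To make "weighting the significance of each part of the input" concrete, here is a minimal single-head self-attention sketch in NumPy. This is my own illustration, not code from the article; the projection names `Wq`, `Wk`, `Wv` are assumed conventions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each input position into a query, a key, and a value
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaled dot-product scores: how much each position attends to every other
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # The output is a weighted mix of the values
    return weights @ V, weights

# Toy example (hypothetical sizes): 4 tokens, embedding and head dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` is a probability distribution over the input positions, which is exactly the "differential weighting" described above.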
This huge article from Brandon Rohrer attempts exactly that. It builds up from the basics of matrices and linear algebra, through sequences and attention, to encoders and decoders. Each complex concept, such as matrix multiplication or backpropagation, is explained as it is introduced, so little prior knowledge is required to follow the article.
I would recommend working through this slowly, alongside resources like The Illustrated Transformer and Andrew Ng's videos. Don't worry if some bits don't make sense at first; just keep going and eventually it will all fall into place. Skip over any maths you don't understand initially, but make sure you come back to it later. Completing the article will give you a strong understanding of how transformer models work at their core, so the next time you read about the latest natural language processing advancement, it should make sense!