DeepSeek’s Breakthrough – Experts and Efficiency in AI
The addition of attention enabled translators to process complex concepts
During the evolution of automatic translators, it was realised that the chance of a word occurring in a sentence depends heavily on the other words around it. A sentence that contains the word ‘tomato’ is much more likely to include ‘greenhouse’ than ‘particle accelerator’.
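To make that idea concrete, here is a toy sketch, using a tiny invented four-sentence corpus (not data from any real translator), of how the presence of one word shifts the odds of another word appearing in the same sentence:

```python
# A toy illustration with an invented corpus: estimate how likely one word is
# to appear in a sentence, given that another word already appears in it.
corpus = [
    "the tomato ripened in the greenhouse",
    "she watered the tomato plants in the greenhouse",
    "the particle accelerator ran overnight",
    "engineers calibrated the particle accelerator",
]

def cooccurrence(word, other):
    """Estimate P(other appears in a sentence | word appears in it) by counting."""
    with_word = [s for s in corpus if word in s.split()]
    if not with_word:
        return 0.0
    return sum(other in s.split() for s in with_word) / len(with_word)

print(cooccurrence("tomato", "greenhouse"))   # 1.0 in this toy corpus
print(cooccurrence("tomato", "accelerator"))  # 0.0 in this toy corpus
```

Real language models capture the same intuition statistically, over vastly larger corpora and far richer notions of context than simple co-occurrence counts.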
The Power of Attention in AI
Humans pay attention to the context of what they plan to read by looking at a headline, skimming the document or reading the first line. This allows us to make sense of the content before we read it, so the content is processed more accurately and efficiently. There are many ways to replicate attention in automated translation, all emulating what we do instinctively.
Overcoming GPU Limitations
Processing text word-by-word to focus the algorithm’s attention on the context is painfully slow in all but the most trivial cases, even on the fastest computer. There was a need for a method that could identify the train of thought of long sequences of tokens by working on all the tokens at the same time.
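The bottleneck is easy to see in a minimal sketch of word-by-word processing (a simplified recurrent-style update with toy numbers, not the method of any particular system): each step needs the result of the previous one, so the steps cannot run in parallel.

```python
# A toy sketch of strictly sequential processing: step t depends on step t-1,
# so no two steps can be computed at the same time.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1000, 64))     # 1000 token vectors, each 64 numbers wide
W = rng.normal(size=(64, 64)) * 0.01     # a toy, fixed mixing matrix

state = np.zeros(64)                     # running summary of the text so far
for t in range(len(tokens)):             # this loop cannot be parallelised
    state = np.tanh(W @ state + tokens[t])
```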
The parallel processing of tokens to compute attention was first implemented in 2017, in the Transformer architecture, although it came with a substantial increase in computing requirements.
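The key trick is that attention for every token can be expressed as a handful of large matrix operations, so all tokens are handled at once. The sketch below shows the core calculation with toy random numbers; real models add learned projections, multiple attention heads and many stacked layers.

```python
# A minimal sketch of attention computed for all tokens at once.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 64                          # sequence length, vector width
Q = rng.normal(size=(n, d))              # queries: what each token is looking for
K = rng.normal(size=(n, d))              # keys: what each token offers
V = rng.normal(size=(n, d))              # values: the information to be mixed

scores = Q @ K.T / np.sqrt(d)            # one n-by-n table scoring every token pair
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
output = weights @ V                     # every token blends information from every other token
```

Every line is a bulk matrix operation, which is exactly the kind of work GPUs perform in parallel.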
The use of Graphics Processing Units (GPUs) for fast parallel numerical computation was well established by 2017, and the translation of long sequences of tokens with large language models (LLMs) became feasible for the first time.
Even so, scoring the colossal number of relationships between every input token and every generated output token represents a computational, financial and technical barrier to all but the wealthiest AI companies.
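A rough back-of-the-envelope count shows why: because every token is scored against every other, the number of pairs grows with the square of the sequence length. The figures below are illustrative only, not measurements of any particular model.

```python
# Illustrative only: how the number of token pairs grows with sequence length.
for n in (1_000, 10_000, 100_000):
    pairs = n * n
    print(f"sequence length {n:>7,} -> {pairs:>17,} token pairs to score")
```

Multiply that by many layers, many attention heads and billions of training sequences, and the cost quickly runs into the millions of GPU-hours.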
In our next article, we’ll explore the use of embedded experts in DeepSeek’s approach.