How BERT differs from its predecessors
BERT's main innovation lies in how it is pretrained. Earlier architectures learned by generating text, that is, by predicting which word is most likely to come next given all the words before it. The network's decision was influenced only by the words on the left (this is how, for example, OpenAI's GPT Transformer works); such networks are called unidirectional. Humans do not read this way: we usually take in the whole sentence at once.
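To make the restriction concrete, here is a minimal pure-Python sketch (illustrative only, not a real language model) of the context a unidirectional model is allowed to use at each position:

```python
# Sketch: a unidirectional ("left-to-right") language model may only use
# the tokens to the LEFT of the position it is predicting.
def visible_context(tokens, position):
    """Return the context a left-to-right LM is allowed to see."""
    return tokens[:position]

sentence = ["the", "person", "went", "to", "the", "store", "for", "milk"]

# When predicting "milk" (index 7), the model sees everything before it...
left_only = visible_context(sentence, 7)
# ...but when predicting "store" (index 5), the words after it are
# invisible, even though "for milk" would clearly help.
mid = visible_context(sentence, 5)

print(left_only)  # ['the', 'person', 'went', 'to', 'the', 'store', 'for']
print(mid)        # ['the', 'person', 'went', 'to', 'the']
```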
To address this, bidirectional neural networks were proposed. In short, two identical networks run in parallel: one predicts words from left to right, the other from right to left, and the outputs of the two networks are simply "glued together." This idea underlies the ELMo model. A bidirectional network performs better on some tasks than a unidirectional one, but it is still not quite what you want: we effectively get two "one-eyed" models, neither of which knows what the other is doing.
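This "gluing" can be sketched in a few lines of pure Python (a deliberately simplified illustration of the idea, not ELMo's actual architecture):

```python
# Sketch: a "shallow" bidirectional model runs two independent
# unidirectional views and concatenates them after the fact.
# Neither view ever sees the other's context while it works.
def forward_context(tokens, i):
    return tokens[:i]        # left-to-right model: words before position i

def backward_context(tokens, i):
    return tokens[i + 1:]    # right-to-left model: words after position i

def glued_representation(tokens, i):
    # The two "one-eyed" views are simply paired up at the end.
    return (forward_context(tokens, i), backward_context(tokens, i))

sentence = ["the", "person", "went", "to", "the", "store", "for", "milk"]
left, right = glued_representation(sentence, 5)
print(left)   # ['the', 'person', 'went', 'to', 'the']
print(right)  # ['for', 'milk']
```

The point of the sketch is the limitation: at no step does either function combine information from both sides, which is exactly what BERT's pretraining fixes.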
Therefore, BERT is pretrained as a "masked language model." The idea is to predict a word not at the end of the sentence but somewhere in the middle: not "the person went to the store for ???" but "the person went to ??? for milk." It is called a masked model because the target word is replaced by the special [MASK] token.
This training approach made possible something the original unidirectional Transformer could not do: feed in (and attend to) the entire phrase, not just the words on the left or right. Showing the whole sentence to a one-way Transformer would hand a ready-made answer to a system trying to guess what comes next, so it would learn nothing. With the new training objective, "deep bidirectionality" is achieved: the model looks both ways at once rather than gluing together separate representations of the left and right context.
"Under the hood," BERT uses an attention mechanism deliberately made similar to its counterpart in the GPT Transformer so that results are easier to compare. The architectures are alike: simplifying greatly, BERT is a Transformer in which the number and size of layers have been increased, the decoder part has been removed, and the model has been taught to look at context in both directions. In a unidirectional Transformer, attention is always restricted to the tokens to the left of the current one: the attention weights for the words on the right are forced to zero. BERT, on the other hand, "masks" only the tokens that need to be predicted, which means attention is directed at all tokens of the input sequence, on both the left and the right.
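The masking difference can be shown with a toy attention calculation in pure Python (an illustration of the principle, not BERT's or GPT's real implementation). A causal model sets the attention logits for future positions to negative infinity, so softmax drives their weight to exactly zero; BERT applies no such restriction:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, query_pos, causal):
    """Attention weights for one query position over a toy sequence."""
    if causal:
        # GPT-style: zero out attention to positions after query_pos.
        scores = [s if j <= query_pos else float("-inf")
                  for j, s in enumerate(scores)]
    return softmax(scores)

scores = [1.0, 2.0, 0.5, 1.5, 0.2]   # toy attention logits, 5 tokens

causal_w = attention_weights(scores, query_pos=2, causal=True)
full_w = attention_weights(scores, query_pos=2, causal=False)

# Causal: positions 3 and 4 get exactly zero weight.
print([round(w, 3) for w in causal_w])
# BERT-style: every position, left and right, gets some weight.
print([round(w, 3) for w in full_w])
```

In both cases the weights sum to one; the only difference is which positions are allowed to contribute.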
The attention mechanism learns multipliers that increase the weight of significant words in the context, which greatly improves the accuracy of the network's decisions.
Where is BERT used?
BERT is used in the Google search engine: at first the model worked only for English; later it was added to search in other languages. BERT-based technologies can also be used to moderate texts such as reviews and comments, find answers to legal questions, and streamline document workflows. BERT and similar technologies are changing how sites are optimized for Google search: since the search engine can now understand more colloquial, "live" text, you can no longer oversaturate pages with cumbersome keyword-stuffed descriptions.