Is BERT Really Robust?

In this blog post I will introduce the paper “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment.” What is BERT? What is a natural language attack model, and how is it useful? We will break the paper down and answer these questions in this post.
Natural language processing (NLP) has been a popular topic in machine learning in recent years. Its name is self-explanatory: it enables computers to classify and generate human language through a learning process. However, the word “learning” is a bit misleading. Computers don’t have the level of intelligence needed to “learn” a language the way we do (at least not today). Instead, their learning is done by processing a large amount of human language and extracting features from it, so that they end up with a model, or we could say an algorithm, of how human language is formed and understood. Thinking about it further, this is actually similar to how every human learns his/her language --- through receiving a large amount of information --- we hear people speaking to us a lot as babies, and then we try to formulate the language based on what we have heard.
Although the title mentions a specific NLP model, BERT, the paper doesn’t pay that much attention to the model itself. Instead, it presents a strong baseline for natural language attacks named TEXTFOOLER, which can generate adversarial examples for NLP models. One fancy thing about TEXTFOOLER is that it works in a black-box setup, meaning it can attack NLP models without knowing their architecture or parameters. This property gives TEXTFOOLER a lot of freedom when attacking models that are not open-source.
Basically, what TEXTFOOLER does is generate adversarial texts that “fool” NLP models. These adversarial texts are very similar to the original texts, sometimes differing by only one or two words in a sentence, so that human judgment remains the same. However, NLP models give a different prediction on these adversarial examples. More formally, the adversarial examples created by TEXTFOOLER must hold three utility-preserving properties:
(1) Human prediction consistency: the prediction of human evaluators should remain unchanged between the original and adversarial texts.
(2) Semantic similarity: the adversarial text should have the same meaning as the source text.
(3) Language fluency: adversarial examples should be grammatically acceptable to human judges.
Suppose we now have a model in our hands. Think of it as a magical function F. We have a corpus of sentences X = {X1, …, Xn} and a set of corresponding labels Y = {Y1, …, Yn}. This magical function F, a.k.a. our model, maps every sentence to its label, F: X -> Y. What TEXTFOOLER needs to do is generate an adversarial example Xadv that is mapped to a different label from the source text X, yet is similar enough to X that the similarity score Sim(Xadv, X) exceeds a certain threshold.
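To make this definition concrete, here is a minimal sketch of the success condition in Python. The names model_predict, sentence_similarity, and the threshold value epsilon are placeholders of my own, not the paper's code.

```python
# A minimal sketch of the success condition for an adversarial example.
# `model_predict` and `sentence_similarity` are hypothetical stand-ins for
# the target classifier F and a sentence-level similarity function;
# `epsilon` is an illustrative threshold value.

def is_successful_adversary(x, x_adv, model_predict, sentence_similarity,
                            epsilon=0.7):
    """Return True if x_adv flips the model's label while staying
    semantically close to the original sentence x."""
    label_changed = model_predict(x_adv) != model_predict(x)
    similar_enough = sentence_similarity(x, x_adv) >= epsilon
    return label_changed and similar_enough
```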
So, how are the adversarial texts generated by TEXTFOOLER? We can break the process down into several procedures.
(1) Word Importance Ranking
In human language, every sentence has key words that carry its meaning. As long as you catch those keywords, you know what the sentence is about. For example, in the sentence “May I have your phone number please?” the words “phone” and “number” best represent the meaning of the sentence, while “may” and “please” don’t help much. NLP models like BERT take advantage of this property, so only some key words in a sentence serve as influential factors in the prediction. Thus, TEXTFOOLER applies a selection mechanism to choose the words with higher significance. Because of the black-box setup, the mechanism ranks each word by deleting it from the sentence and measuring the change this creates in the label prediction.
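Here is a rough sketch of this deletion-based scoring in Python. The predict_proba function is a hypothetical stand-in for querying the black-box model's confidence in its original prediction; the paper's actual importance score is a bit more involved (it also accounts for cases where deleting a word flips the label), so treat this as an illustration only.

```python
# A rough sketch of deletion-based word importance scoring, assuming a
# hypothetical `predict_proba(sentence)` that returns the probability the
# black-box model assigns to its predicted label for the original sentence.

def rank_words_by_importance(words, predict_proba):
    """Score each word by how much the prediction confidence drops
    when that word is removed, then rank from most to least important."""
    base_score = predict_proba(" ".join(words))
    importances = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]          # sentence without word i
        drop = base_score - predict_proba(" ".join(reduced))
        importances.append((drop, i, words[i]))
    # A larger drop in confidence means a more important word.
    return sorted(importances, reverse=True)
```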
(2) Candidate Selection
A word replacement mechanism is then applied to the selected words. Note that the replacement must obey the three utility-preserving rules mentioned above. We do this through a pipeline of checks.
First, think of every word as a vector in an embedding space, in which words with similar meanings are closer to each other in terms of cosine similarity. Using this space, we gather a pool of candidates for a given word w, as sketched below.
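A simplified sketch of this candidate gathering might look like the following. The embeddings dictionary, top_n, and delta are my own placeholders; the paper uses counter-fitted word embeddings and its own threshold values.

```python
import numpy as np

# A simplified sketch of synonym candidate gathering, assuming `embeddings`
# is a dict mapping words to vectors; `top_n` and `delta` are illustrative
# values, not necessarily the paper's settings.

def get_candidates(word, embeddings, top_n=50, delta=0.7):
    """Return up to top_n words whose cosine similarity to `word`
    is at least the threshold delta."""
    if word not in embeddings:
        return []
    v = embeddings[word]
    v = v / np.linalg.norm(v)
    scored = []
    for other, u in embeddings.items():
        if other == word:
            continue
        sim = float(np.dot(v, u / np.linalg.norm(u)))
        if sim >= delta:
            scored.append((sim, other))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_n]]
```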
(3) POS checking
Next, we need to make sure the sentence remains human readable after replacement, which requires the candidate words to have the same part-of-speech as the original word. Using this check, we can filter out a lot of candidates that wouldn't make sense in the sentence.
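A toy version of this filter could look like the following. I'm using NLTK's part-of-speech tagger purely for illustration (it may not be the exact tool the paper uses), and tagging each word in isolation, which is a simplification of tagging it in context.

```python
import nltk  # one possible off-the-shelf tagger; not necessarily the paper's choice
# Requires the tagger data: nltk.download("averaged_perceptron_tagger")

def filter_by_pos(original_word, candidates):
    """Keep only candidates whose part-of-speech tag matches the
    original word's tag (tagged here in isolation, as a simplification)."""
    original_pos = nltk.pos_tag([original_word])[0][1]
    kept = []
    for cand in candidates:
        if nltk.pos_tag([cand])[0][1] == original_pos:
            kept.append(cand)
    return kept
```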
(4) Semantic Similarity Checking
Up to now, the words left in the pool should have a similar meaning to, and the same part-of-speech as, the source word. What we do next is place each of them into the sentence to create a candidate adversarial example Xadv.
We then feed the candidate Xadv into model F to get a label prediction Yadv and check whether Yadv differs from Y, because according to the problem definition, model F should give a different prediction on the adversarial example. Moreover, we check that Xadv is still similar enough to the original text X, such that their similarity Sim(Xadv, X) exceeds the threshold. Note that this check is at the sentence level, so it is different from checking the similarity between a candidate word and the original word.
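Here is a sketch of how each candidate could be screened at the sentence level. Again, model_predict and sentence_similarity are hypothetical helpers of my own; the paper uses the Universal Sentence Encoder for the sentence-level similarity.

```python
# A sketch of sentence-level screening of candidate words, assuming
# hypothetical `model_predict` and `sentence_similarity` helpers.

def screen_candidates(words, index, candidates, y, model_predict,
                      sentence_similarity, epsilon=0.7):
    """Substitute each candidate at position `index`, keep the ones that
    flip the model's prediction while staying above the similarity
    threshold, and record their sentence-level similarity scores."""
    original = " ".join(words)
    survivors = []
    for cand in candidates:
        perturbed_words = words[:index] + [cand] + words[index + 1:]
        perturbed = " ".join(perturbed_words)
        sim = sentence_similarity(original, perturbed)
        if sim >= epsilon and model_predict(perturbed) != y:
            survivors.append((sim, cand))
    return survivors
```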
Finally, what we have in the pool should be a set of words that
(a) are similar enough to the original word w
(b) have the same part-of-speech as the original word w
(c) when placed in the sentence, cause the adversarial example Xadv to receive a different prediction than the original text X
(d) when placed in the sentence, maintain the similarity between the adversarial example Xadv and the original text X
(5) Finalization of Adversarial Examples
For the last step, we choose the word c from the pool that has the highest similarity to the original word w and substitute it into the original text. Thus we create Xadv.
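Putting it together, the final substitution might look like this simple sketch, which takes the (similarity, word) pairs produced by the screening sketch above and splices the best one into the sentence. It is an illustration of the overall idea, not the paper's exact selection rule.

```python
# A minimal sketch of the final substitution step. `survivors` is assumed
# to be a list of (similarity, candidate_word) pairs, as produced by the
# screening sketch above.

def finalize_adversary(words, index, survivors):
    """Replace the word at `index` with the highest-scoring surviving
    candidate; return None if no candidate survived the checks."""
    if not survivors:
        return None
    _, best_candidate = max(survivors)
    return " ".join(words[:index] + [best_candidate] + words[index + 1:])
```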
The performance of TEXTFOOLER is stunning. Almost all of the targeted models drop to an accuracy of 10% or below after its adversarial attack. Moreover, its adversarial examples stay very similar to the original texts, with fewer than 20% of the words being replaced.
NLP attack tools like TEXTFOOLER are very beneficial for the future of NLP model development. They serve not only as excellent tools for finding potential weaknesses in existing models, but also as a way of checking how a specific model performs under different levels of attack.