
RoBERTa

Facebook AI Research (FAIR), 2019

(Robustly optimized BERT approach)

Differences from BERT

The paper studies the impact of key hyperparameters and training data size, starting from the observation that BERT was significantly undertrained.

Points

1. Bigger on everything

  • Larger batch size, longer training, and more data (which require more time, resources, and compute)
  • BookCorpus + English Wikipedia (16GB), CC-News crawled from CommonCrawl News data (76GB), OpenWebText (38GB), STORIES (31GB)
  • Training with large mini-batches improves perplexity on the MLM objective, and it is also easier to parallelize via distributed data-parallel training (see the sketch after this list).
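  Large-batch training in this spirit can be approximated on ordinary hardware with gradient accumulation. Below is a minimal sketch using Hugging Face `TrainingArguments`; the model and data pipeline are omitted, and the hyperparameter values are illustrative rather than the paper's exact pretraining configuration.

  ```python
  # Minimal sketch: approximating RoBERTa-style large mini-batches with
  # gradient accumulation. Values are illustrative, not the paper's exact
  # pretraining configuration.
  from transformers import TrainingArguments

  # Effective batch size = per_device_train_batch_size
  #                        * gradient_accumulation_steps
  #                        * number of GPUs
  # e.g. 16 * 64 * 8 GPUs = 8192 sequences per optimizer step, in the spirit
  # of RoBERTa's 8K-sequence batches.
  args = TrainingArguments(
      output_dir="roberta-pretrain-sketch",
      per_device_train_batch_size=16,
      gradient_accumulation_steps=64,
      learning_rate=6e-4,
      warmup_steps=24_000,
      max_steps=500_000,
      weight_decay=0.01,
  )
  print(args.per_device_train_batch_size * args.gradient_accumulation_steps)
  ```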

2. No NSP

  • RoBERTa drops the Next Sentence Prediction (NSP) objective: the paper finds that removing NSP matches or slightly improves downstream performance, so the model is pretrained with the MLM objective only (see the sketch below).
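
  One way to see the missing NSP objective is to compare the pretraining heads exposed by the two model classes in Transformers. A minimal sketch, using the standard public checkpoints:

  ```python
  # Minimal sketch: BERT's pretraining model carries an MLM head plus an NSP
  # ("seq_relationship") classifier, while RoBERTa is pretrained with MLM only.
  from transformers import BertForPreTraining, RobertaForMaskedLM

  bert = BertForPreTraining.from_pretrained("bert-base-uncased")
  roberta = RobertaForMaskedLM.from_pretrained("roberta-base")

  print(hasattr(bert.cls, "seq_relationship"))  # True  -> NSP head present
  print(hasattr(roberta, "lm_head"))            # True  -> MLM head only, no NSP head
  ```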

3. Dynamic Masking

  • BERT uses static masking: in the MLM objective, masking is performed once during data preprocessing, so the same masked inputs are fed to the model at every epoch.
  • RoBERTa uses dynamic masking, generating a new masking pattern every time a sequence is fed to the model.
  • As a workaround for static masking, the original setup duplicated the training data 10 times so that each sequence is masked in 10 different ways over 40 epochs. With dynamic masking, the mask is sampled each time the sequence is seen, so the model naturally encounters different versions of the same sentence with masks in different positions (see the sketch below).
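
  In the Transformers library, dynamic masking falls out naturally when masking is done in the data collator instead of in preprocessing. A minimal sketch (the checkpoint name and sentence are just examples):

  ```python
  # Minimal sketch of dynamic masking: [MASK] positions are sampled every time
  # a batch is built, so the same sentence is masked differently across epochs.
  from transformers import AutoTokenizer, DataCollatorForLanguageModeling

  tokenizer = AutoTokenizer.from_pretrained("roberta-base")
  collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

  enc = tokenizer("RoBERTa replaces static masking with dynamic masking.")
  example = {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}

  # Collating the same example twice usually yields different masked positions.
  batch1 = collator([example])
  batch2 = collator([example])
  print((batch1["input_ids"] != batch2["input_ids"]).any().item())
  ```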

4. Training on Longer Sequences

  • RoBERTa is pretrained only with full-length sequences of up to 512 tokens, whereas BERT trained with a shortened sequence length for most of its pretraining steps.

5. Larger Byte-level BPE

  • Byte-Pair Encoding (BPE) is a hybrid between character-level and word-level representations that relies solely on subword units. These subword units are extracted by a statistical analysis of the training corpus, and BPE vocabularies typically range from 10K to 100K subword units.
  • BERT uses a character-level BPE vocabulary of 30K subwords, learned after preprocessing the input with heuristic tokenization rules.
  • RoBERTa adopts the byte-level BPE of Radford et al. (2019), with a vocabulary of roughly 50K subword units (larger than BERT's 30K) that can encode any text without unknown tokens and without any extra preprocessing or tokenization rules. Although this encoding slightly degraded end-task performance in some cases, it was kept because it is a universal encoding scheme that needs no preprocessing or tokenization rules (see the tokenizer comparison below).
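
  The practical effect of the byte-level BPE is easiest to see by tokenizing the same string with both vocabularies. A minimal sketch using the standard public checkpoints:

  ```python
  # Minimal sketch: BERT's 30K character-level vocabulary vs RoBERTa's 50K
  # byte-level BPE. Byte-level BPE can encode any string without falling back
  # to an unknown token.
  from transformers import AutoTokenizer

  bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

  print(bert_tok.vocab_size, roberta_tok.vocab_size)  # ~30K vs ~50K

  text = "RoBERTa handles rare symbols like 🤖 natively"
  print(bert_tok.tokenize(text))     # the emoji becomes [UNK]
  print(roberta_tok.tokenize(text))  # byte-level pieces, no unknown tokens
  ```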

Pretrained model in Korean

klue/roberta-large · Hugging Face
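
A minimal sketch of loading this Korean checkpoint with Transformers (the example sentence simply says "This is a Korean RoBERTa pretrained model"):

```python
# Minimal sketch: loading the KLUE Korean RoBERTa checkpoint.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
model = AutoModel.from_pretrained("klue/roberta-large")

# "This is a Korean RoBERTa pretrained model."
inputs = tokenizer("한국어 RoBERTa 사전학습 모델입니다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size=1024)
```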


References

https://towardsdatascience.com/robustly-optimized-bert-pretraining-approaches-537dc66522dd

https://arxiv.org/abs/1907.11692

https://medium.com/analytics-vidhya/evolving-with-bert-introduction-to-roberta-5174ec0e7c82