Article Source
Retrieval-Augmented Language Model Pre-Training
- REALM: Retrieval-Augmented Language Model Pre-Training
- REALM: Integrating Retrieval into Language Representation Models
- google-research/language
Abstract
REALM is a BERT-based language model (LM) augmented with a learned retriever. Pretraining is unsupervised and uses the masked language modeling (MLM) task, where salient spans (entities and dates) are masked. Fine-tuning targets the Open QA task: predict (start, end) answer spans given the question and the retrieved documents z. The retriever is backed by a knowledge corpus and maximum inner product search (MIPS). The paper showed strong results on Open QA, and it formed the basis for many later studies: RAG, RETRO, and ATLAS.
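To make the retrieve-then-predict idea concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code): the retriever distribution p(z|x) is a softmax over query-document inner products, whose top-k is what MIPS serves at corpus scale, and the reader's answer distribution p(y|x, z) is marginalized over the retrieved documents. The `reader` function and all toy shapes are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- toy setup; all names and shapes here are illustrative assumptions ---
num_docs, dim, num_answers, k = 1000, 128, 50, 5
doc_embs = rng.normal(size=(num_docs, dim))  # Embed_doc(z) for each corpus doc
query_emb = rng.normal(size=dim)             # Embed_input(x) for the question

def reader(doc_id):
    """Hypothetical stand-in for the reader p(y | x, z).

    REALM's actual reader scores (start, end) answer spans within doc z;
    here we just return a random distribution over candidate answers.
    """
    return softmax(rng.normal(size=num_answers))

# Retriever: p(z | x) is a softmax over inner products between the query
# and document embeddings. At corpus scale, the top-k inner products are
# found (approximately) with a MIPS index rather than a full argsort.
scores = doc_embs @ query_emb
topk = np.argsort(-scores)[:k]
p_z_given_x = softmax(scores[topk])

# Marginalize the reader over the retrieved documents:
#   p(y | x) = sum_z p(y | x, z) * p(z | x)
p_y_given_x = sum(p_z * reader(z) for z, p_z in zip(topk, p_z_given_x))
print(p_y_given_x.sum())  # ~1.0, since each term is a weighted distribution
```

Because p(z|x) is differentiable through the softmax, gradients from the MLM (or QA) loss flow back into the retriever embeddings, which is what lets REALM train the retriever jointly with the LM.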
In this post I will walk through REALM's training details and its evaluation on open question answering datasets.