Text summarization is the process of distilling the most important information from a source text to produce an abridged version for a particular user and task.
There are two main approaches to automatic text summarization: extractive summarization and abstractive summarization.
Extractive Summarization: selects the most informative units of text from the input and copies them directly into the summary. In most extractive models, the extracted units are sentences (see the sketch after these definitions).
Abstractive Summarization: goes a step further to resemble human-written summaries by rephrasing or paraphrasing the source, which requires more sophisticated linguistic understanding and sometimes the incorporation of real-world knowledge.
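To make the extractive approach concrete, here is a minimal sketch that scores sentences by word frequency and copies the top-scoring ones verbatim. The frequency-based scoring is an illustrative assumption; real extractive systems use richer features, and the models discussed below are abstractive.

```python
from collections import Counter
import re

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    # Split the input into sentences and compute word frequencies.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total frequency of the words it contains.
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # Copy the top-scoring sentences verbatim, keeping their original order.
    top = sorted(sorted(scored, reverse=True)[:num_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

print(extractive_summary(
    "Transformers changed NLP. They rely on attention. Attention lets models "
    "weigh every token. Many summarizers are built on transformers."
))
```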
We propose a system capable of summarizing a paper. It uses BART, which pre-trains a model combining Bidirectional and Auto-Regressive Transformers, and PEGASUS, a state-of-the-art model for abstractive text summarization.
In 2019, researchers at Facebook AI published a new model for Natural Language Processing (NLP) called BART. BART outperforms other models in the NLP domain, achieving new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
In 2020, researchers at Google published a new NLP model called PEGASUS.
BART
BART was introduced in the paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." It can be seen as generalizing BERT on the encoder side and GPT on the decoder side.
Because BART has an autoregressive decoder, it can be fine-tuned for sequence generation tasks such as summarization. In summarization, information is copied from the input but manipulated in a controlled way, which is closely related to the denoising pre-training objective.
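A minimal sketch of using a fine-tuned BART model for summarization with the Hugging Face transformers library; "facebook/bart-large-cnn" is the released checkpoint fine-tuned on CNN/DailyMail, and the length settings below are illustrative assumptions rather than tuned values.

```python
from transformers import pipeline

# Load the BART checkpoint fine-tuned on CNN/DailyMail for summarization.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = "..."  # the paper or article text to be summarized
# max_length / min_length are illustrative values, not tuned settings.
result = summarizer(document, max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])
```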
PEGASUS
PEGASUS, which stands for Pre-training with Extracted Gap-Sentences for Abstractive Summarization, pre-trains a large Transformer-based encoder-decoder model on massive text corpora with a new self-supervised objective. Its base architecture consists of an encoder and a decoder: during pre-training, the encoder is trained with a Masked Language Model (MLM) objective and the decoder with Gap Sentence Generation (GSG).
The input is a document with missing sentences, and PEGASUS recovers them; the output consists of the missing sentences concatenated together. This task is called Gap Sentence Generation, and PEGASUS is trained to predict these masked sentences.
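A minimal sketch of running a fine-tuned PEGASUS model for summarization with the transformers library; "google/pegasus-cnn_dailymail" is the released checkpoint fine-tuned on CNN/DailyMail, and the generation settings are illustrative assumptions.

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = "..."  # the article text to be summarized
# Tokenize the document, truncating to the model's maximum input length.
inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
# Beam search settings here are illustrative, not tuned values.
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```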
Conclusion
The BART model trained on the CNN/DailyMail data performs well and produces fluent summaries. However, we think it still has some weaknesses:
The BART model is trained on an English vocabulary, so it may not be usable for other languages.
BART may miss some keywords that researchers might want to see in the summary.
The PEGASUS model, also trained on the CNN/DailyMail data, produces shorter summaries than the BART model. However, the summaries are not always meaningful and correct.