# Big Bird: Transformers for Longer Sequences

NeurIPS 2020.

Abstract:

Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BIGBIRD, a sparse attention mechanism…

Introduction

- Models based on Transformers [92], such as BERT [22, 63], are wildly successful for a wide variety of Natural Language Processing (NLP) tasks and are a mainstay of modern NLP research.
- The key innovation in Transformers is the introduction of a self-attention mechanism, which can be evaluated in parallel for each token of the input sequence, eliminating the sequential dependency of recurrent neural networks such as LSTMs.
- This parallelism enables Transformers to leverage the full power of modern SIMD hardware accelerators like GPUs/TPUs, thereby facilitating training of NLP models on datasets of unprecedented size.
- Pretraining has led to significant improvements on low-data-regime downstream tasks [51] as well as tasks with sufficient data [102], and has been a major force behind the ubiquity of Transformers in contemporary NLP.

Highlights

- We systematically develop BIGBIRD, an attention mechanism whose complexity is linear in the number of tokens (Sec. 2)
- We show that when sparse attention mechanisms are used in a standalone encoder, they are universal approximators of sequence-to-sequence functions, in the style of Yun et al. [105]
- Complementing the above positive results, we show that moving to a sparse attention mechanism incurs a cost, i.e. there is no free lunch
- We propose BIGBIRD: a sparse attention mechanism that is linear in the number of tokens
- We achieve state-of-the-art results for question answering and document summarization on a number of different datasets
- We further introduce an attention-based contextual language model for DNA and fine-tune it for downstream tasks such as promoter region prediction and predicting effects of non-coding variants
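BIGBIRD's sparse attention combines three components: random attention, a sliding window, and a small set of global tokens. The Boolean attention mask they induce can be sketched in NumPy as follows; this is a minimal illustrative sketch (function name and parameter values are assumptions, not the authors' blocked TPU implementation):

```python
import numpy as np

def bigbird_mask(n, window=3, n_global=2, n_random=3, seed=0):
    """Boolean mask where entry (i, j) = True means query i may attend to key j.

    Combines the three BIGBIRD components:
      - sliding window: each token attends to `window` neighbors on each side
      - global tokens: the first `n_global` tokens attend everywhere and are
        attended to by every token
      - random: each token attends to `n_random` extra random keys
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(n)
    # sliding window (offset 0 covers the diagonal)
    for offset in range(-window, window + 1):
        rows = idx[(idx + offset >= 0) & (idx + offset < n)]
        mask[rows, rows + offset] = True
    # global tokens: full rows and full columns
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # random connections
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = bigbird_mask(16)
print(m.sum(), "of", m.size, "entries are attended")
```

Because each row has only O(window + n_global + n_random) entries set (plus the global rows), the number of attended pairs grows linearly in n rather than quadratically.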

Methods

- Natural Language Processing: the goal is to showcase the benefits of modeling longer input sequences for NLP tasks, for which the authors select three representative tasks.
- Genomics: there has been a recent upsurge in using deep learning for genomics data [87, 107, 13], which has resulted in improved performance on several biologically significant tasks such as promoter site prediction [71], methylation analysis [55], and predicting functional effects of non-coding variants [110].
- These approaches consume DNA sequence fragments as inputs, and the authors believe the longer-input handling capability of BIGBIRD would be beneficial, as many functional effects in DNA are highly non-local [12].

Results

**Theoretical Results about Sparse Attention Mechanism**

- The authors show that sparse attention mechanisms are as powerful and expressive as full-attention mechanisms in two respects.
- When sparse attention mechanisms are used in a standalone encoder, they are universal approximators of sequence-to-sequence functions, in the style of Yun et al. [105].
- The authors note that this property was explored theoretically in the contemporary work of Yun et al. [106].

Conclusion

- The authors propose BIGBIRD: a sparse attention mechanism that is linear in the number of tokens.
- The authors use the power of extra global tokens to preserve the expressive power of the model.
- The authors complement these results by showing that moving to a sparse attention mechanism does incur a cost.
- BIGBIRD gives state-of-the-art performance on a number of NLP tasks such as question answering and long document classification.
- The authors further introduce an attention-based contextual language model for DNA and fine-tune it for downstream tasks such as promoter region prediction and predicting effects of non-coding variants.


- Table 1: Building block comparison @512
- Table 2: QA Dev results using Base size models. We report accuracy for WikiHop and F1 for HotpotQA, Natural Questions, and TriviaQA
- Table 3: Fine-tuning results on Test set for QA tasks. The Test results (F1 for HotpotQA, Natural Questions, TriviaQA, and accuracy for WikiHop) have been picked from their respective leaderboards
- Table 4: Summarization ROUGE scores for long documents
- Table 5: MLM bits per character (BPC)
- Table 6: Comparison
- Table 7: Chromatin-Profile Prediction

Related work

- There have been a number of interesting attempts aimed at alleviating the quadratic dependency of Transformers, which can broadly be categorized into two directions. The first line of work embraces the length limitation and develops methods around it. The simplest methods in this category employ a sliding window [94], but in general most work fits the following paradigm: use some other mechanism to select a smaller subset of relevant contexts to feed into the transformer, and optionally iterate, i.e. call the transformer block multiple times with different contexts each time. Most prominently, SpanBERT [42], ORQA [54], REALM [34], and RAG [57] have achieved strong performance on different tasks. However, it is worth noting that these methods often require significant engineering effort (such as back-propagating through large-scale nearest-neighbor search) and are hard to train.

The second line of work questions whether full attention is essential and tries to come up with approaches that do not require full attention, thereby reducing the memory and computation requirements. Prominently, Dai et al. [21], Sukhbaatar et al. [83], and Rae et al. [74] proposed auto-regressive models that work well for left-to-right language modeling but suffer on tasks which require bidirectional context. Child et al. [16] proposed a sparse model that reduces the complexity to O(N√N); Kitaev et al. [49] further reduced the complexity to O(N log N) by using LSH to compute nearest neighbors. Ye et al. [104] proposed binary partitions of the data, whereas Qiu et al. [73] reduced complexity by using block sparsity. Recently, Longformer [8] introduced a localized sliding-window-based mask with a few global masks to reduce computation, and extended BERT to longer-sequence tasks. Finally, our work is closely related to and builds on the work of Extended Transformers Construction [4]. This work was designed to encode structure in text for transformers, and the idea of global tokens was used extensively by them to achieve their goals. Our theoretical work can be seen as providing a justification for the success of these models as well. It is important to note that most of the aforementioned methods are heuristic-based and empirically are not as versatile and robust as the original transformer, i.e. the same architecture does not attain SoTA on multiple standard benchmarks. (The one exception is Longformer, which we include in all our comparisons, Sec. 4.) Moreover, these approximations do not come with theoretical guarantees.
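As a concrete illustration of this second line of work, sliding-window attention (the building block used by Longformer [8] and one of BIGBIRD's components) restricts each query to a local neighborhood of keys, reducing the cost from O(n²·d) to O(n·w·d). The sketch below is illustrative NumPy, assuming single-head, unbatched inputs, not any paper's actual implementation:

```python
import numpy as np

def window_attention(q, k, v, w=2):
    """Sliding-window attention: token i attends only to keys i-w .. i+w.

    q, k, v: arrays of shape (n, d). Each output row is a softmax-weighted
    average of at most 2*w + 1 value rows, so cost is O(n * w * d) instead
    of the O(n^2 * d) of full attention.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out
```

When w ≥ n the window covers every key and the result coincides with full attention, which makes the approximation explicit: shrinking w trades context for linear cost.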

Funding

- We achieve state-of-the-art results for question answering and document summarization on a number of different datasets
- We showcase that our long-input BIGBIRD, along with the proposed pretraining, significantly improves performance on two downstream tasks
- We see that BIGBIRD achieves nearly perfect accuracy, a 5% jump over the previous best reported accuracy
- Comparing with the baselines in Tab. 7, we see that we significantly improve performance on the harder task

Study subjects and analysis

standard datasets: 4

This task involves predicting a random subset of tokens which have been masked out. We use four standard datasets for pretraining (listed in App. E.1, Tab. 9), warm-starting from the public RoBERTa checkpoint.
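The masked-language-model (MLM) objective described above can be sketched as follows. `mask_tokens` and its parameters are illustrative assumptions; real pipelines (BERT, RoBERTa) additionally apply the 80/10/10 mask/random/keep split to the selected positions:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15, seed=0):
    """BERT-style MLM corruption: hide a random ~15% subset of tokens.

    Returns the corrupted sequence plus a parallel list of targets: the
    original token at each masked position, or None at positions that are
    ignored in the loss.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(mask_token)
            targets.append(tok)    # model must predict this token
        else:
            corrupted.append(tok)
            targets.append(None)   # excluded from the loss
    return corrupted, targets
```

The model is then trained to recover the `targets` at the masked positions from the surrounding (bidirectional) context, which is exactly where a longer attention span can help.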

challenging datasets: 4

Question Answering (QA): we considered the following four challenging datasets. 1. Natural Questions [52]: for the given question, find a short answer span (SA) from the given evidence, as well as highlight the paragraph from the given evidence containing information about the correct answer (long answer, LA)

long document datasets: 3

Summarization: document summarization is the task of creating a short and accurate summary of a text document. We used three long document datasets for testing our model, details of which are mentioned in Tab. 18. In this paper we focus on abstractive summarization of long documents, where using a longer contextual encoder should improve performance

Reference

- A. Abboud, V. V. Williams, and O. Weimann. Consequences of faster alignment of sequences. In International Colloquium on Automata, Languages, and Programming, pages 39–51.
- A. Abboud, A. Backurs, and V. V. Williams. Tight hardness results for lcs and other sequence similarity measures. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 59–78. IEEE, 2015.
- J. Abreu, L. Fred, D. Macêdo, and C. Zanchettin. Hierarchical attentional hybrid neural networks for document classification. In International Conference on Artificial Neural Networks, pages 396–402.
- J. Ainslie, S. Ontanon, C. Alberti, P. Pham, A. Ravula, and S. Sanghai. Etc: Encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483, 2020.
- C. Alberti, K. Lee, and M. Collins. A bert baseline for the natural questions. arXiv preprint arXiv:1901.08634, 2019.
- J. Alt, R. Ducatez, and A. Knowles. Extremal eigenvalues of critical Erdős–Rényi graphs. arXiv preprint arXiv:1905.03243, 2019.
- A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless seth is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 51–58, 2015.
- I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- F. Benaych-Georges, C. Bordenave, A. Knowles, et al. Largest eigenvalues of sparse inhomogeneous Erdős–Rényi graphs. Annals of Probability, 47(3):1653–1676, 2019.
- F. Benaych-Georges, C. Bordenave, A. Knowles, et al. Spectral radii of sparse random matrices. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 2141–2161. Institut Henri Poincaré, 2020.
- R. Bharanikumar, K. A. R. Premkumar, and A. Palaniappan. Promoterpredict: sequence-based modelling of escherichia coli 70 promoter strength yields logarithmic dependence between promoter strength and sequence. PeerJ, 6:e5862, 2018.
- S. Buldyrev, A. Goldberger, S. Havlin, R. Mantegna, M. Matsa, C.-K. Peng, M. Simons, and H. Stanley. Long-range correlation properties of coding and noncoding dna sequences: Genbank analysis. Physical Review E, 51(5):5084, 1995.
- A. Busia, G. E. Dahl, C. Fannjiang, D. H. Alexander, E. Dorfman, R. Poplin, C. Y. McLean, P.-C. Chang, and M. DePristo. A deep learning approach to pattern recognition for short dna sequences. BioRxiv, page 353474, 2019.
- J. Chen, S.-t. Lin, and G. Durrett. Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610, 2019.
- Y.-C. Chen, Z. Gan, Y. Cheng, J. Liu, and J. Liu. Distilling the knowledge of bert for text generation. arXiv preprint arXiv:1911.03829, 2019.
- R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- F. Chung and L. Lu. The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences, 99(25):15879–15882, 2002.
- C. Clark and M. Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723, 2017.
- K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
- A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian. A discourse-aware attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685, 2018.
- Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv:1901.02860, 2019.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054, 2019.
- R. Dreos, G. Ambrosini, R. Cavin Périer, and P. Bucher. Epd and epdnew, high-quality promoter resources in the next-generation sequencing era. Nucleic acids research, 41(D1):D157–D164, 2013.
- G. Erkan and D. R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of artificial intelligence research, 22:457–479, 2004.
- Y. Fang, S. Sun, Z. Gan, R. Pillai, S. Wang, and J. Liu. Hierarchical graph network for multi-hop question answering. arXiv preprint arXiv:1911.03631, 2019.
- L. A. Gates, C. E. Foulds, and B. W. O’Malley. Histone marks in the ‘driver’s seat’: functional roles in steering the transcription cycle. Trends in biochemical sciences, 42(12):977–989, 2017.
- J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR. org, 2017.
- S. Gehrmann, Y. Deng, and A. M. Rush. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792, 2018.
- M. Ghandi, D. Lee, M. Mohammad-Noori, and M. A. Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7), 2014.
- A. Gidiotis and G. Tsoumakas. A divide-and-conquer approach to the summarization of academic articles. arXiv preprint arXiv:2004.06190, 2020.
- M. Gong. ReflectionNet, 2020 (accessed June 3, 2020). URL https://www.microsoft.com/en-us/research/people/migon/.
- S. Gray, A. Radford, and D. P. Kingma. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 3, 2017.
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
- J. He, L. Wang, L. Liu, J. Feng, and H. Wu. Long document classification from local word glimpses via recurrent attention learning. IEEE Access, 7:40707–40718, 2019.
- K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
- G. Izacard and E. Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, 2020.
- Y. Jiang, J. Petrak, X. Song, K. Bontcheva, and D. Maynard. Team bertha von suttner at semeval-2019 task 4: Hyperpartisan news detection using elmo sentence representation convolutional network. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 840–844, 2019.
- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics.
- M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8: 64–77, 2020.
- E. Katzav, O. Biham, and A. K. Hartmann. Distribution of shortest path lengths in subcritical erdos-rényi networks. Physical Review E, 98(1):012301, 2018.
- W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at ucsc. Genome research, 12(6):996–1006, 2002.
- U. Khandelwal, K. Clark, D. Jurafsky, and L. Kaiser. Sample efficient text summarization using a single pre-trained transformer. arXiv preprint arXiv:1905.08836, 2019.
- E. Khurana, Y. Fu, D. Chakravarty, F. Demichelis, M. A. Rubin, and M. Gerstein. Role of non-coding sequence variants in cancer. Nature Reviews Genetics, 17(2):93, 2016.
- J. Kiesel, M. Mestre, R. Shukla, E. Vincent, P. Adineh, D. Corney, B. Stein, and M. Potthast. Semeval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, 2019.
- B. Kim, H. Kim, and G. Kim. Abstractive summarization of reddit posts with multi-level memory networks. arXiv preprint arXiv:1811.00783, 2018.
- N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019.
- T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
- V. Kumar, A. Choudhary, and E. Cho. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- J.-S. Lee and J. Hsiang. Patent classification by fine-tuning bert language model. World Patent Information, 61:101965, 2020.
- K. Lee, M.-W. Chang, and K. Toutanova. Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300, 2019.
- J. J. Levy, A. J. Titus, C. L. Petersen, Y. Chen, L. A. Salas, and B. C. Christensen. Methylnet: an automated and modular deep learning approach for dna methylation analysis. BMC bioinformatics, 21(1):1–15, 2020.
- M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401, 2020.
- W. Liang. Segmenting dna sequence into words based on statistical language model. Nature Precedings, pages 1–1, 2012.
- H. Lin, Z.-Y. Liang, H. Tang, and W. Chen. Identifying sigma70 promoters with novel pseudo nucleotide composition. IEEE/ACM transactions on computational biology and bioinformatics, 2017.
- J. Lin, D. Quan, V. Sinha, K. Bakshi, D. Huynh, B. Katz, and D. R. Karger. What makes a good answer? the role of context in question answering. In Proceedings of the Ninth IFIP TC13 International Conference on Human-Computer Interaction (INTERACT 2003), pages 25–32, 2003.
- D. Liu, Y. Gong, J. Fu, Y. Yan, J. Chen, D. Jiang, J. Lv, and N. Duan. Rikinet: Reading wikipedia pages for natural question answering. arXiv preprint arXiv:2004.14560, 2020.
- Y. Liu and M. Lapata. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345, 2019.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.
- L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot. Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894, 2019.
- D. Miller. Leveraging bert for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165, 2019.
- S. Narayan, S. B. Cohen, and M. Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
- A. Nenkova and L. Vanderwende. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005, 101, 2005.
- M. L. Olson, L. Zhang, and C.-N. Yu. Adapting pretrained language models for long document classification. OpenReview, 2019.
- A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- M. Oubounyt, Z. Louadi, H. Tayara, and K. T. Chong. Deepromoter: Robust promoter predictor using deep learning. Frontiers in genetics, 10, 2019.
- J. Pérez, J. Marinkovic, and P. Barceló. On the turing completeness of modern neural network architectures. arXiv preprint arXiv:1901.03429, 2019.
- J. Qiu, H. Ma, O. Levy, S. W.-t. Yih, S. Wang, and J. Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.
- J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- S. Rothe, S. Narayan, and A. Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461, 2019.
- A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.
- E. Sharma, C. Li, and L. Wang. Bigpatent: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741, 2019.
- P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4): 981–1025, 2011.
- E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
- S. Subramanian, R. Li, J. Pilault, and C. Pal. On extractive and abstractive neural document summarization with transformer language models. arXiv preprint arXiv:1909.03186, 2019.
- S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799, 2019.
- C. Sun, L. Huang, and X. Qiu. Utilizing bert for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588, 2019.
- D. Sussman. Lecture Notes for Boston University MA 882 Spring 2017, 2017 (accessed June 3, 2020). URL http://math.bu.edu/people/sussman/MA882_2017/2017-01-26-Lecture-2.html.
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- A. Tampuu, Z. Bzhalava, J. Dillner, and R. Vicente. Viraminer: Deep learning on raw dna sequences for identifying viral genomes in human samples. PloS one, 14(9), 2019.
- Z. Tang, Y. Shen, X. Ma, W. Xu, J. Yu, and W. Lu. Multi-hop reading comprehension across documents with path-based graph convolutional network. arXiv:2006.06478, 2020.
- T. Thongtan and T. Phienthrakul. Sentiment classification using document embeddings trained with cosine similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 407–414, 2019.
- T. H. Trinh and Q. V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
- R. K. Umarov and V. V. Solovyev. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PloS one, 12(2), 2017.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang. Multi-passage bert: A globally normalized bert model for open-domain question answering. arXiv preprint arXiv:1908.08167, 2019.
- D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’networks. nature, 393(6684): 440–442, 1998.
- J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302, 2018.
- R. Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357–365, 2005.
- S. Wiseman, S. M. Shieber, and A. M. Rush. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052, 2017.
- X. Xiao, Z.-C. Xu, W.-R. Qiu, P. Wang, H.-T. Ge, and K.-C. Chou. ipsw (2l)-pseknc: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo k-tuple nucleotide composition. Genomics, 111(6):1785–1793, 2019.
- Y. Yang, R. Zhang, S. Singh, and J. Ma. Exploiting sequence-based features for predicting enhancer– promoter interactions. Bioinformatics, 33(14):i252–i260, 2017.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
- Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019.
- Z. Yao, S. Cao, W. Xiao, C. Zhang, and L. Nie. Balanced sparsity for efficient dnn inference on gpu. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5676–5683, 2019.
- Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang. Bp-transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070, 2019.
- C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
- C. Yun, Y.-W. Chang, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. o(n) connections are expressive enough: Universal approximability of sparse transformers. In Advances in Neural Information Processing Systems, 2020.
- H. Zhang, C.-L. Hung, M. Liu, X. Hu, and Y.-Y. Lin. Ncnet: Deep learning network models for predicting function of non-coding dna. Frontiers in genetics, 10, 2019.
- J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.
- X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
- J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931–934, 2015.
- Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE international conference on computer vision, pages 19–27, 2015.
