# Document Summarization Based on Data Reconstruction

AAAI, 2012.

Abstract:

Document summarization is of great value to many real-world applications, such as snippet generation for search results and news headline generation. Traditionally, document summarization is implemented by extracting sentences that cover the main topics of a document with minimum redundancy. In this paper, we take a different perspect...

Introduction

- With the explosive growth of the Internet, people are overwhelmed by a large number of accessible documents.
- Summarization can represent the document with a short piece of text covering the main topics, and help users sift through the Internet, catch the most relevant document, and filter out redundant information.
- News sites usually describe hot news topics in concise headlines to facilitate browsing.
- Both snippets and headlines are specific forms of document summary in practical applications.

Highlights

- We propose a novel framework called Document Summarization based on Data Reconstruction (DSDR) which finds the summary sentences by minimizing the reconstruction error
- Among the seven summarization algorithms compared, latent semantic indexing and symmetric non-negative matrix factorization show the poorest performance on both data sets
- Summarization by latent semantic indexing applies singular value decomposition to the term-by-sentence matrix and chooses the sentences with the largest index values along the orthogonal latent semantic directions
- We propose a novel summarization framework called Document Summarization based on Data Reconstruction (DSDR) which selects the most representative sentences that can best reconstruct the entire document
- Document Summarization based on Data Reconstruction with linear reconstruction is more efficient while Document Summarization based on Data Reconstruction with nonnegative reconstruction has better performance
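The idea of choosing summary sentences by minimizing reconstruction error can be written generically as below. This is a sketch under stated assumptions rather than the paper's exact formulation: the notation ($v_i$ for the vector of the $i$-th candidate sentence, $x_j$ for the $j$-th selected sentence, $a_{ij}$ for the reconstruction coefficients, $m$ for the summary size, $\lambda$ for a ridge penalty) is introduced here for illustration only.

```latex
\min_{\substack{\{x_1,\dots,x_m\} \subset \{v_1,\dots,v_n\} \\ a_{ij} \in \mathbb{R}}}
\; \sum_{i=1}^{n} \left( \Bigl\| v_i - \sum_{j=1}^{m} a_{ij}\, x_j \Bigr\|_2^2
+ \lambda \sum_{j=1}^{m} a_{ij}^2 \right)
```

In this reading, a nonnegative variant would additionally require $a_{ij} \ge 0$, so that sentences can only be added together, never subtracted, when rebuilding the document.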

Methods

- The authors do not compare DSDR with any supervised methods (Toutanova et al 2007; Haghighi and Vanderwende 2009; Celikyilmaz and Hakkani-Tur 2010; Lin and Bilmes 2011).

Results

- Overall performance comparison: ROUGE can generate three types of scores: recall, precision and F-measure.
- The authors obtain similar experimental results with all three score types, with DSDR taking the lead in each.
- As the highest ROUGE scores (in bold type) in the two tables show, DSDR takes the lead, followed by ClusterHITS.
- The authors' DSDR selects sentences which span the intrinsic subspace of the candidate sentence space.
- Such sentences contribute to reconstructing the original document, and therefore to improving summary quality.
- Under the DSDR framework, the sequential method of linear reconstruction is suboptimal, so DSDR-non outperforms DSDR-lin.
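To make the three ROUGE score types concrete, here is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is not the official ROUGE package: the function name `rouge_1`, the whitespace tokenization, and the balanced F-measure are simplifying assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple


def rouge_1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (recall, precision, f_measure) from clipped unigram overlap.

    Simplified illustration only: lowercased whitespace tokens, a single
    reference summary, and a balanced F-measure (an assumption).
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())

    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    if recall + precision == 0.0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```

For instance, `rouge_1("the cat", "the cat sat down")` matches two of the four reference tokens, giving a recall of 0.5 and a precision of 1.0.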

Conclusion

- The authors propose a novel summarization framework called Document Summarization based on Data Reconstruction (DSDR) which selects the most representative sentences that can best reconstruct the entire document.
- The authors introduce two types of reconstruction and develop efficient optimization methods for them.
- The linear reconstruction problem is solved using a greedy strategy and the nonnegative reconstruction problem is solved using multiplicative updates.
- DSDR with linear reconstruction is more efficient while DSDR with nonnegative reconstruction has better performance.
- It would be of great interest to develop a more efficient solution for DSDR with nonnegative reconstruction.
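The greedy strategy for linear reconstruction can be illustrated with a naive forward selection: at each step, add the sentence that most reduces the total ridge-regularized error of rebuilding every sentence from the selected ones. This is a dependency-free sketch, not the authors' optimized procedure; the function names, the default `lam=0.1`, and the from-scratch error recomputation are all assumptions made for clarity.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def reconstruction_error(sentences: List[Vector], selected: List[int], lam: float) -> float:
    """Total ridge-regularized error of rebuilding all sentences from the selected ones."""
    basis = [sentences[j] for j in selected]
    # Gram matrix plus lam on the diagonal (positive definite for lam > 0).
    gram = [[dot(p, q) + (lam if i == j else 0.0) for j, q in enumerate(basis)]
            for i, p in enumerate(basis)]
    total = 0.0
    for v in sentences:
        coeffs = solve(gram, [dot(x, v) for x in basis])
        approx = [sum(a * x[d] for a, x in zip(coeffs, basis)) for d in range(len(v))]
        total += sum((vi - ai) ** 2 for vi, ai in zip(v, approx)) + lam * dot(coeffs, coeffs)
    return total


def greedy_select(sentences: List[Vector], m: int, lam: float = 0.1) -> List[int]:
    """Pick m sentence indices one at a time, each minimizing the total error."""
    if not 0 <= m <= len(sentences):
        raise ValueError("m must be between 0 and the number of sentences")
    selected: List[int] = []
    for _ in range(m):
        remaining = [j for j in range(len(sentences)) if j not in selected]
        best = min(remaining, key=lambda j: reconstruction_error(sentences, selected + [j], lam))
        selected.append(best)
    return selected
```

Because each pick is fixed before the next one is considered, a sequential selection of this kind is not guaranteed to find the best overall subset, which is consistent with the remark that the sequential method is suboptimal.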

- Table 1: Average F-measure performance on DUC 2006. "DSDR-lin" and "DSDR-non" denote DSDR with the linear reconstruction and DSDR with the nonnegative reconstruction respectively
- Table 2: Average F-measure performance on DUC 2007. "DSDR-lin" and "DSDR-non" denote DSDR with the linear reconstruction and DSDR with the nonnegative reconstruction respectively
- Table 3: The associated p-values of the paired t-test on DUC 2006
- Table 4: The associated p-values of the paired t-test on DUC 2007
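A paired t-test of the kind reported in Tables 3 and 4 compares two systems on the same inputs by testing whether the mean of their score differences is zero. The sketch below computes only the t statistic and the degrees of freedom using the standard library; the function name is hypothetical, and how the scores are paired in the paper (for example, per topic) is an assumption. The p-value would then be read from a Student t distribution with the returned degrees of freedom, for example via a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Made-up scores for two hypothetical systems on four inputs (illustration only).
t_value, dof = paired_t_statistic([0.40, 0.42, 0.38, 0.45], [0.35, 0.40, 0.37, 0.41])
```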

Related work

- Many extractive document summarization methods have been proposed recently. Most of them assign saliency scores to the sentences of the original document and compose the summary from the highest-scoring sentences. The rules for computing saliency scores fall into three groups (Hu, Sun, and Lim 2008): feature-based, lexical-chain-based, and graph-based measurements. Wang et al (2008) discover the semantic relations of terms in the same semantic role using WordNet (Miller 1995). Choi (2011) proposes a tree pattern expression for extracting information from syntactically parsed text. Algorithms such as PageRank (Brin and Page 1998) and HITS (Kleinberg 1999) propagate sentence scores over a graph built from the similarity between sentences. Wan and Yang (2007) show that graph-based measurements can also improve single-document summarization by integrating multiple documents on the same topic.
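As an illustration of graph-based score propagation, the following sketch runs a PageRank-style power iteration over a sentence similarity matrix. It is a generic example, not the method of any of the cited papers: the function name, the damping factor of 0.85, the fixed iteration count, and the choice to ignore self-similarity are assumptions.

```python
from typing import List, Sequence


def pagerank_scores(similarity: Sequence[Sequence[float]],
                    damping: float = 0.85,
                    iterations: int = 50) -> List[float]:
    """Score sentences by power iteration on a similarity graph."""
    n = len(similarity)
    if n == 0:
        return []

    # Row-normalize similarities (self-links dropped) into transition probabilities.
    transition = []
    for i, row in enumerate(similarity):
        weights = [w if j != i else 0.0 for j, w in enumerate(row)]
        total = sum(weights)
        # A sentence with no neighbours spreads its score uniformly.
        transition.append([w / total for w in weights] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores
```

Sentences that are similar to many other well-connected sentences accumulate higher scores, and a summary would then be composed from the top-scoring ones.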

Funding

- This work was supported in part by the National Natural Science Foundation of China (Grant No: 61125203, 61173185, 90920303), the National Basic Research Program of China (973 Program) under Grant 2011CB302206, the Zhejiang Provincial Natural Science Foundation of China (Grant No: Y1101043) and the Foundation of Zhejiang Provincial Educational Department under Grant Y201018240.

Reference

- Brin, S., and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems 30(1-7):107–117.
- Cai, D., and He, X. 2012. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering 24(4):707–719.
- Cai, D.; He, X.; Ma, W.-Y.; Wen, J.-R.; and Zhang, H. 2004. Organizing WWW images based on the analysis of page layout and web link structure. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo.
- Cai, D.; He, X.; Han, J.; and Huang, T. S. 2011. Graph regularized non-negative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1548–1560.
- Celikyilmaz, A., and Hakkani-Tur, D. 2010. A hybrid hierarchical model for multi-document summarization. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics.
- Choi, Y. 2011. Tree pattern expression for extracting information from syntactically parsed text corpora. Data Mining and Knowledge Discovery 1–21.
- Conroy, J., and O’Leary, D. 2001. Text summarization via hidden markov models. In Proc. of the 24th ACM SIGIR, 406. ACM.
- Gong, Y., and Liu, X. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proc. of the 24th ACM SIGIR, 19–25. ACM.
- Haghighi, A., and Vanderwende, L. 2009. Exploring content models for multi-document summarization. In Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
- Harabagiu, S., and Lacatusu, F. 2005. Topic themes for multidocument summarization. In Proc. of the 28th ACM SIGIR, 209. ACM.
- He, X.; Cai, D.; Wen, J.-R.; Ma, W.-Y.; and Zhang, H.-J. 2007. Clustering and searching www images using link and page layout analysis. ACM Transactions on Multimedia Computing, Communications and Applications 3(1).
- Hoerl, A., and Kennard, R. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 55–67.
- Hu, M.; Sun, A.; and Lim, E. 2008. Comments-oriented document summarization: understanding documents with readers’ feedback. In Proc. of the 31st ACM SIGIR, 291–298. ACM.
- Huang, Y.; Liu, Z.; and Chen, Y. 2008. Query biased snippet generation in xml search. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
- Kleinberg, J. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46(5):604–632.
- Lin, H., and Bilmes, J. 2011. A class of submodular functions for document summarization. In The 49th ACL-HLT, Portland, OR, June.
- Lin, C., and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 71–78. Association for Computational Linguistics.
- Lin, C. 2004. Rouge: A package for automatic evaluation of summaries. In Proc. of the WAS, 25–26.
- Miller, G. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- Natarajan, B. 1995. Sparse approximate solutions to linear systems. SIAM journal on computing 24(2):227–234.
- Nenkova, A.; Vanderwende, L.; and McKeown, K. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of the 29th ACM SIGIR, 580. ACM.
- Palmer, S. 1977. Hierarchical structure in perceptual representation. Cognitive Psychology 9(4):441–474.
- Park, S.; Lee, J.; Kim, D.; and Ahn, C. 2007. Multi-document Summarization Based on Cluster Using Non-negative Matrix Factorization. SOFSEM: Theory and Practice of Computer Science 761–770.
- Riedel, K. 1992. A sherman-morrison-woodbury identity for rank augmenting matrices with application to centering. SIAM Journal on Matrix Analysis and Applications 13(2):659–662.
- Sha, F.; Lin, Y.; Saul, L.; and Lee, D. 2007. Multiplicative updates for nonnegative quadratic programming. Neural Computation 19(8):2004–2031.
- Shen, D.; Sun, J.; Li, H.; Yang, Q.; and Chen, Z. 2007. Document summarization using conditional random fields. In Proc. of IJCAI, volume 7, 2862–2867.
- Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
- Toutanova, K.; Brockett, C.; Gamon, M.; Jagarlamudi, J.; Suzuki, H.; and Vanderwende, L. 2007. The pythy summarization system: Microsoft research at duc 2007. In Proc. of DUC, volume 2007.
- Turpin, A.; Tsegay, Y.; Hawking, D.; and Williams, H. E. 2007. Fast generation of result snippets in web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
- Wachsmuth, E.; Oram, M.; and Perrett, D. 1994. Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex 4(5):509.
- Wan, X., and Yang, J. 2007. CollabSum: exploiting multiple document clustering for collaborative single document summarizations. In Proc. of the 30th annual international ACM SIGIR, 150. ACM.
- Wan, X., and Yang, J. 2008. Multi-document summarization using cluster-based link analysis. In Proc. of the 31st ACM SIGIR, 299–306. ACM.
- Wang, D.; Li, T.; Zhu, S.; and Ding, C. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proc. of the 31st ACM SIGIR.
- Wasson, M. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proc. of the 17th international conference on Computational linguistics-Volume 2.
- Yu, K.; Zhu, S.; Xu, W.; and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proc. of the 31st ACM SIGIR, 635–642. ACM.
- Yu, K.; Bi, J.; and Tresp, V. 2006. Active learning via transductive experimental design. In Proc. of the 23rd ICML, 1081–1088. ACM.

Best Paper

Best Paper of AAAI, 2012
