Document Summarization Based on Data Reconstruction

AAAI, 2012.

We propose a novel summarization framework called Document Summarization based on Data Reconstruction which selects the most representative sentences that can best reconstruct the entire document

Abstract:

Document summarization is of great value to many real-world applications, such as snippet generation for search results and news headline generation. Traditionally, document summarization is implemented by extracting sentences that cover the main topics of a document with minimum redundancy. In this paper, we take a different perspect...

Introduction
  • With the explosive growth of the Internet, people are overwhelmed by a large number of accessible documents.
  • Summarization can represent the document with a short piece of text covering the main topics, and help users sift through the Internet, catch the most relevant document, and filter out redundant information.
  • News sites usually describe hot news topics in concise headlines to facilitate browsing.
  • Both the snippets and headlines are specific forms of document summary in practical applications
Highlights
  • We propose a novel framework called Document Summarization based on Data Reconstruction (DSDR) which finds the summary sentences by minimizing the reconstruction error
  • Among the seven summarization algorithms compared, latent semantic indexing and symmetric non-negative matrix factorization show the poorest performance on both data sets (DUC 2006 and DUC 2007)
  • Applying singular value decomposition to the term-by-sentence matrix, summarization by latent semantic indexing chooses the sentences with the largest index values along the orthogonal latent semantic directions
  • We propose a novel summarization framework called Document Summarization based on Data Reconstruction (DSDR) which selects the most representative sentences that can best reconstruct the entire document
  • Document Summarization based on Data Reconstruction with linear reconstruction is more efficient while Document Summarization based on Data Reconstruction with nonnegative reconstruction has better performance
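The idea of choosing sentences that minimize the reconstruction error can be illustrated with a naive greedy search. This is a minimal sketch, not the authors' optimized algorithm: it assumes each sentence is already a term-weight vector, uses a ridge-regularized least-squares reconstruction, and re-solves the regression from scratch for every candidate. The penalty weight `lam` and all function names are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def reconstruction_error(sentences, selected, lam):
    """Sum over all sentences v of ||v - X^T a||^2 + lam * ||a||^2,
    where X holds the selected sentences and a is the ridge solution."""
    X = [sentences[j] for j in selected]
    # Regularized Gram matrix X X^T + lam * I (positive definite for lam > 0).
    gram = [[dot(xi, xj) + (lam if i == j else 0.0)
             for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    total = 0.0
    for v in sentences:
        a = solve(gram, [dot(x, v) for x in X])
        recon = [sum(a[k] * X[k][d] for k in range(len(X)))
                 for d in range(len(v))]
        total += sum((vd - rd) ** 2 for vd, rd in zip(v, recon)) + lam * dot(a, a)
    return total

def greedy_select(sentences, m, lam=0.1):
    """Add, one at a time, the sentence that lowers the total error most."""
    selected = []
    for _ in range(m):
        candidates = [j for j in range(len(sentences)) if j not in selected]
        best = min(candidates,
                   key=lambda j: reconstruction_error(sentences, selected + [j], lam))
        selected.append(best)
    return selected
```

With three identical sentences and one distinct sentence, for example `greedy_select([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], 2)`, the second pick is the distinct sentence, because a repeated sentence adds nothing to the reconstruction.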
Methods
Results
  • Overall performance comparison: ROUGE can generate three types of scores: recall, precision and F-measure.
  • The authors obtain similar experimental results with all three score types, with DSDR taking the lead.
  • As shown by the highest ROUGE scores in bold type in the two tables, DSDR takes the lead, followed by ClusterHITS.
  • The authors' DSDR selects sentences which span the intrinsic subspace of the candidate sentence space.
  • Such sentences contribute to reconstructing the original document, and therefore to improving summary quality.
  • Under the DSDR framework, the sequential (greedy) method used for linear reconstruction is suboptimal, so DSDR-non outperforms DSDR-lin.
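The three ROUGE score types mentioned above can be sketched for the n-gram case as follows. This is a simplified illustration, not the official ROUGE package: it assumes whitespace tokenization, a single reference summary, and a balanced F-measure, and the function name `rouge_n` is hypothetical.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) from clipped n-gram overlap."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if overlap == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```

Recall rewards covering the reference summary, precision penalizes extra material in the candidate, and the F-measure balances the two.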
Conclusion
  • The authors propose a novel summarization framework called Document Summarization based on Data Reconstruction (DSDR) which selects the most representative sentences that can best reconstruct the entire document.
  • The authors introduce two types of reconstruction and develop efficient optimization methods for them.
  • The linear reconstruction problem is solved using a greedy strategy and the nonnegative reconstruction problem is solved using multiplicative updates.
  • DSDR with linear reconstruction is more efficient while DSDR with nonnegative reconstruction has better performance.
  • It would be of great interest to develop a more efficient solution for DSDR with nonnegative reconstruction.
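To give a feel for the multiplicative updating mentioned in the conclusion, here is a simplified stand-in rather than the paper's full optimization: the generic multiplicative update for nonnegative least squares, reconstructing one sentence vector from a fixed set of basis sentences. It omits the sentence-selection variables of the DSDR objective, assumes all vector entries are nonnegative (as with term frequencies), and uses hypothetical names throughout.

```python
def nonneg_reconstruct(v, basis, iters=200, eps=1e-12):
    """Approximately minimize ||v - sum_j a[j] * basis[j]||^2 subject to a >= 0.

    Uses the update a[j] <- a[j] * (X v)[j] / (X X^T a)[j], which keeps
    every coefficient nonnegative when v and basis are nonnegative.
    """
    def dot(u, w):
        return sum(x * y for x, y in zip(u, w))

    k = len(basis)
    a = [1.0] * k                                   # positive starting point
    xv = [dot(x, v) for x in basis]                 # X v
    gram = [[dot(xi, xj) for xj in basis] for xi in basis]  # X X^T
    for _ in range(iters):
        a = [a[j] * xv[j] / (sum(gram[j][l] * a[l] for l in range(k)) + eps)
             for j in range(k)]
    return a
```

Because each step only rescales the current coefficients by a nonnegative ratio, no projection step is needed; the price is an iterative solve per sentence, which is in line with the note above that the nonnegative variant is the less efficient one.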
Summary
  • Methods:

    The authors do not compare with any supervised methods (Toutanova et al 2007; Haghighi and Vanderwende 2009; Celikyilmaz and Hakkani-Tur 2010; Lin and Bilmes 2011)
Tables
  • Table 1: Average F-measure performance on DUC 2006. "DSDR-lin" and "DSDR-non" denote DSDR with the linear reconstruction and DSDR with the nonnegative reconstruction respectively
  • Table 2: Average F-measure performance on DUC 2007. "DSDR-lin" and "DSDR-non" denote DSDR with the linear reconstruction and DSDR with the nonnegative reconstruction respectively
  • Table 3: The associated p-values of the paired t-test on DUC 2006
  • Table 4: The associated p-values of the paired t-test on DUC 2007
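For readers unfamiliar with the paired t-test behind Tables 3 and 4, the statistic can be sketched as below. This is an illustrative helper with a hypothetical name, assuming two systems scored on the same set of topics; it returns only the t statistic and degrees of freedom, since converting them to a p-value needs the Student t distribution (available, for instance, through `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-topic scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

The test works on per-topic differences, so it needs the two systems' scores on identical inputs and at least two topics with non-constant differences.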
Related work
  • Many extractive document summarization methods have been proposed recently. Most of them assign salience scores to the sentences of the original document and compose the summary from the top-scoring sentences. The rules for computing salience scores fall into three groups (Hu, Sun, and Lim 2008): feature-based, lexical-chain-based, and graph-based measurements. In (Wang et al 2008), the semantic relations of terms in the same semantic role are discovered using WordNet (Miller 1995). A tree pattern expression for extracting information from syntactically parsed text is proposed in (Choi 2011). Algorithms such as PageRank (Brin and Page 1998) and HITS (Kleinberg 1999) propagate sentence scores over a graph built from the similarity between sentences. Wan and Yang (2007) show that graph-based measurements can also improve single-document summarization by integrating multiple documents on the same topic.
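The graph-based scoring described above can be sketched as a PageRank-style power iteration over a sentence similarity matrix. This is a minimal illustration under assumptions: `sim` is a precomputed nonnegative similarity matrix with a zero diagonal, the damping factor and iteration count are conventional defaults, and sentences with no similar neighbours are simply skipped when propagating scores.

```python
def pagerank_scores(sim, damping=0.85, iters=100):
    """Score sentences by propagating weight along similarity edges."""
    n = len(sim)
    out = [sum(row) for row in sim]       # total edge weight leaving each sentence
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to sim[j][i].
            incoming = sum(scores[j] * sim[j][i] / out[j]
                           for j in range(n) if out[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

A summary would then be composed from the highest-scoring sentences, typically with an extra step to avoid picking near-duplicates.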
Funding
  • This work was supported in part by National Natural Science Foundation of China (Grant No: 61125203, 61173185, 90920303), National Basic Research Program of China (973 Program) under Grant 2011CB302206, Zhejiang Provincial Natural Science Foundation of China (Grant No: Y1101043) and Foundation of Zhejiang Provincial Educational Department under Grant Y201018240
Reference
  • Brin, S., and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems 30(1-7):107–117.
  • Cai, D., and He, X. 2012. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering 24(4):707–719.
  • Cai, D.; He, X.; Ma, W.-Y.; Wen, J.-R.; and Zhang, H. 2004. Organizing WWW images based on the analysis of page layout and web link structure. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo.
  • Cai, D.; He, X.; Han, J.; and Huang, T. S. 2011. Graph regularized non-negative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1548–1560.
  • Celikyilmaz, A., and Hakkani-Tur, D. 2010. A hybrid hierarchical model for multi-document summarization. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics.
  • Choi, Y. 2011. Tree pattern expression for extracting information from syntactically parsed text corpora. Data Mining and Knowledge Discovery 1–21.
  • Conroy, J., and O’Leary, D. 2001. Text summarization via hidden markov models. In Proc. of the 24th ACM SIGIR, 406. ACM.
  • Gong, Y., and Liu, X. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proc. of the 24th ACM SIGIR, 19–25. ACM.
  • Haghighi, A., and Vanderwende, L. 2009. Exploring content models for multi-document summarization. In Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
  • Harabagiu, S., and Lacatusu, F. 2005. Topic themes for multidocument summarization. In Proc. of the 28th ACM SIGIR, 209. ACM.
  • He, X.; Cai, D.; Wen, J.-R.; Ma, W.-Y.; and Zhang, H.-J. 2007. Clustering and searching www images using link and page layout analysis. ACM Transactions on Multimedia Computing, Communications and Applications 3(1).
  • Hoerl, A., and Kennard, R. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 55–67.
  • Hu, M.; Sun, A.; and Lim, E. 2008. Comments-oriented document summarization: understanding documents with readers’ feedback. In Proc. of the 31st ACM SIGIR, 291–298. ACM.
  • Huang, Y.; Liu, Z.; and Chen, Y. 2008. Query biased snippet generation in xml search. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
  • Kleinberg, J. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46(5):604–632.
  • Lin, H., and Bilmes, J. 2011. A class of submodular functions for document summarization. In The 49th ACL-HLT, Portland, OR, June.
  • Lin, C., and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 71–78. Association for Computational Linguistics.
  • Lin, C. 2004. Rouge: A package for automatic evaluation of summaries. In Proc. of the WAS, 25–26.
  • Miller, G. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
  • Natarajan, B. 1995. Sparse approximate solutions to linear systems. SIAM journal on computing 24(2):227–234.
  • Nenkova, A.; Vanderwende, L.; and McKeown, K. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of the 29th ACM SIGIR, 580. ACM.
  • Palmer, S. 1977. Hierarchical structure in perceptual representation. Cognitive Psychology 9(4):441–474.
  • Park, S.; Lee, J.; Kim, D.; and Ahn, C. 2007. Multi-document Summarization Based on Cluster Using Non-negative Matrix Factorization. SOFSEM: Theory and Practice of Computer Science 761–770.
  • Riedel, K. 1992. A sherman-morrison-woodbury identity for rank augmenting matrices with application to centering. SIAM Journal on Matrix Analysis and Applications 13(2):659–662.
  • Sha, F.; Lin, Y.; Saul, L.; and Lee, D. 2007. Multiplicative updates for nonnegative quadratic programming. Neural Computation 19(8):2004–2031.
  • Shen, D.; Sun, J.; Li, H.; Yang, Q.; and Chen, Z. 2007. Document summarization using conditional random fields. In Proc. of IJCAI, volume 7, 2862–2867.
  • Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
  • Toutanova, K.; Brockett, C.; Gamon, M.; Jagarlamudi, J.; Suzuki, H.; and Vanderwende, L. 2007. The pythy summarization system: Microsoft research at duc 2007. In Proc. of DUC, volume 2007.
  • Turpin, A.; Tsegay, Y.; Hawking, D.; and Williams, H. E. 2007. Fast generation of result snippets in web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
  • Wachsmuth, E.; Oram, M.; and Perrett, D. 1994. Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex 4(5):509.
  • Wan, X., and Yang, J. 2007. CollabSum: exploiting multiple document clustering for collaborative single document summarizations. In Proc. of the 30th annual international ACM SIGIR, 150. ACM.
  • Wan, X., and Yang, J. 2008. Multi-document summarization using cluster-based link analysis. In Proc. of the 31st ACM SIGIR, 299–306. ACM.
  • Wang, D.; Li, T.; Zhu, S.; and Ding, C. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proc. of the 31st ACM SIGIR.
  • Wasson, M. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proc. of the 17th international conference on Computational linguistics-Volume 2.
  • Yu, K.; Zhu, S.; Xu, W.; and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proc. of the 31st ACM SIGIR, 635–642. ACM.
  • Yu, K.; Bi, J.; and Tresp, V. 2006. Active learning via transductive experimental design. In Proc. of the 23rd ICML, 1081–1088. ACM.
Best Paper
Best Paper of AAAI, 2012