
Semi-supervised learning of attribute-value pairs from product descriptions

IJCAI, pp. 2838-2843 (2007)


Abstract

We describe an approach to extract attribute-value pairs from product descriptions. This allows us to represent products as sets of such attribute-value pairs to augment product databases. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic enti...

Introduction
  • Retailers have been collecting a large amount of sales data containing customer information and related transactions.
  • Improved forecasting is possible if the retailer is able to describe a product such as a shoe not with a product number, but with a set of attribute-value pairs, such as material: lightweight mesh nylon, sole: low profile, lacing system: standard.
  • This would enable the retailer to use data from other products having similar attributes.
  • The work presented in this paper is motivated by the need to make the extraction of attribute-value pairs from product descriptions more efficient and cheaper by developing an interactive tool that helps human experts with this task.
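The representation described above can be illustrated with a minimal Python sketch. It treats a product as a mapping from attribute to value and compares two products by the Jaccard overlap of their attribute-value pairs. The similarity measure, the function name `pair_similarity`, and the second product are assumptions made for illustration; the source does not say how similar products are identified.

```python
from typing import Dict

# A product is modeled as attribute -> value (an assumption for this sketch).
Product = Dict[str, str]


def pair_similarity(a: Product, b: Product) -> float:
    """Jaccard overlap of the two products' (attribute, value) pair sets."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# The shoe from the example above, plus a hypothetical second product.
shoe = {
    "material": "lightweight mesh nylon",
    "sole": "low profile",
    "lacing system": "standard",
}
other_shoe = {
    "material": "lightweight mesh nylon",
    "sole": "low profile",
    "lacing system": "speed lace",
}

if __name__ == "__main__":
    print(pair_similarity(shoe, other_shoe))
```

Two products sharing most pairs score close to 1, which is one simple way a forecaster could decide whose sales history to borrow.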
Highlights
  • Retailers have been collecting a large amount of sales data containing customer information and related transactions.
  • For the experiments reported in this paper, we developed a web crawler that traverses retailer web sites and extracts product descriptions.
  • We report precision results for the two categories in two ways: first, we do a simple evaluation of each unique data item.
  • We describe an approach to extract attribute-value pairs from product descriptions that requires very little user supervision.
  • Future work will focus on adding an interactive step to the extraction algorithm that will allow users to correct extracted pairs as quickly and efficiently as possible.
Results
  • Precision results for the most frequent data items: because the training data contains many duplicates, it is more important to extract correct pairs for the most frequent data items than for the less frequent ones.
  • The authors therefore report precision results for the most frequent data items.
  • This is done by sorting the training data by frequency, and manually inspecting the pairs that the system extracted for the most frequent 300 data items.
  • This was done only for the system run that includes co-EM classification.
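The frequency-based selection described above could be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code: the function name `most_frequent_items` and the whitespace and case normalization are hypothetical, and only the cutoff of 300 comes from the text.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_items(data_items: List[str], n: int = 300) -> List[Tuple[str, int]]:
    """Return the n most frequent data items with their counts, most frequent first.

    Items are normalized (stripped, lower-cased) before counting; that
    normalization is an assumption of this sketch.
    """
    counts = Counter(item.strip().lower() for item in data_items)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["Low profile sole", "mesh nylon", "low profile sole ", "mesh nylon", "low profile sole"]
    for item, freq in most_frequent_items(sample, n=2):
        print(freq, item)
```

The returned list is what a reviewer would then inspect by hand, judging the pairs extracted for each item.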
Conclusion
  • The authors describe an approach to extract attribute-value pairs from product descriptions that requires very little user supervision.
  • The authors start with a novel unsupervised seed generation algorithm that has high precision but limited recall.
  • The supervised and especially the semi-supervised algorithm yield significantly increased recall with little or no decrease in precision using the automatically generated seeds as labeled data.
  • Future work will focus on adding an interactive step to the extraction algorithm that will allow users to correct extracted pairs as quickly and efficiently as possible.
  • One of the main challenges with an interactive approach is to make co-EM efficient enough that the time of the user is optimized.
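To make the semi-supervised step more concrete, here is a minimal sketch of generic co-EM in the style of Nigam and Ghani (2000), using a small multinomial Naive Bayes model per view. It is not the authors' implementation. The two views (tokens of the phrase and tokens of its context), the label names, the smoothing, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

LABELS = ("attribute", "value")
Soft = Dict[str, float]  # label -> probability


def one_hot(label: str) -> Soft:
    return {c: 1.0 if c == label else 0.0 for c in LABELS}


def train_nb(views: Sequence[Sequence[str]], soft_labels: Sequence[Soft]):
    """Fit multinomial Naive Bayes from probabilistically labeled examples."""
    prior = {c: 1.0 for c in LABELS}  # add-one smoothing on class counts
    counts = {c: defaultdict(float) for c in LABELS}
    vocab = set()
    for tokens, soft in zip(views, soft_labels):
        for c in LABELS:
            prior[c] += soft[c]
            for t in tokens:
                counts[c][t] += soft[c]
                vocab.add(t)
    return prior, counts, vocab


def predict_nb(model, tokens: Sequence[str]) -> Soft:
    """Return the posterior over LABELS for one example."""
    prior, counts, vocab = model
    prior_total = sum(prior.values())
    log_post = {}
    for c in LABELS:
        denom = sum(counts[c].values()) + len(vocab) + 1  # +1 covers unseen tokens
        lp = math.log(prior[c] / prior_total)
        for t in tokens:
            lp += math.log((counts[c].get(t, 0.0) + 1.0) / denom)
        log_post[c] = lp
    top = max(log_post.values())
    exp = {c: math.exp(v - top) for c, v in log_post.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def co_em(view1, view2, seeds: Dict[int, str], iterations: int = 5) -> List[Soft]:
    """seeds maps example index -> known label; every other example is unlabeled."""
    n = len(view1)

    def label_all(model, view) -> List[Soft]:
        soft = [predict_nb(model, view[i]) for i in range(n)]
        for i, label in seeds.items():  # seed labels stay fixed
            soft[i] = one_hot(label)
        return soft

    labeled = sorted(seeds)
    model1 = train_nb([view1[i] for i in labeled], [one_hot(seeds[i]) for i in labeled])
    soft = label_all(model1, view1)
    for _ in range(iterations):
        model2 = train_nb(view2, soft)   # view 2 learns from view 1's labels
        soft = label_all(model2, view2)
        model1 = train_nb(view1, soft)   # view 1 learns from view 2's labels
        soft = label_all(model1, view1)
    return soft


# Toy data: view 1 is the phrase itself, view 2 its left/right context.
VIEW1 = [["material"], ["nylon"], ["sole"], ["rubber"], ["sole"]]
VIEW2 = [["left:the", "right:is"], ["left:is", "right:."],
         ["left:the", "right:is"], ["left:is", "right:."],
         ["left:durable", "right:offers"]]
SEEDS = {0: "attribute", 1: "value"}

if __name__ == "__main__":
    for phrase, soft in zip(VIEW1, co_em(VIEW1, VIEW2, SEEDS)):
        print(phrase, max(soft, key=soft.get))
```

In this toy run the unlabeled phrase "sole" shares a context with the seed "material", so the context view labels it an attribute; the phrase view then carries that label to the second occurrence of "sole", whose context was never seen with a seed. Every iteration retrains both models on all examples, which is why efficiency matters once a user is waiting on the result.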
Tables
  • Table 1: Automatically extracted seed attribute-value pairs. ... name pairs, e.g., Smith is extracted as an attribute because it occurs after many first names (Joe Smith, Mike Smith, etc.) and so fulfills our criteria. Our unsupervised seed generation algorithm gets about 65% accuracy in the tennis category and about 68% accuracy in the football category. We have experimented with manually correcting the seeds by eliminating all those that were incorrect. This did not improve the final extraction performance, leading us to conclude that our algorithm is robust to noisy seeds.
  • Table 2: Examples of extracted pairs for system run with co-EM
  • Table 3: Recall for Tennis and Football Categories
  • Table 4: Precision for Tennis and Football Categories
  • Table 5: F-1 measure for Tennis and Football Categories
  • Table 6: Non-weighted and Weighted Precision Results for Tennis and Football Categories. ‘T’ stands for tennis, ‘F’ for football, ‘nW’ for non-weighted, and ‘W’ for weighted
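The measures named in Tables 3 to 6 can be computed as in the sketch below. Precision, recall, and F-1 follow their standard definitions. The weighted variant assumes that each data item's judgment is weighted by how often the item occurs, which matches the emphasis on frequent items but is not spelled out in this summary. All function names and example values are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (attribute, value)


def precision_recall_f1(extracted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Standard precision, recall and F-1 over sets of attribute-value pairs."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def weighted_precision(item_correct: Dict[str, bool], item_freq: Dict[str, int]) -> float:
    """Precision where each data item counts as many times as it occurs.

    Assumption: weighting is by the item's frequency in the data.
    """
    total = sum(item_freq[item] for item in item_correct)
    good = sum(item_freq[item] for item, ok in item_correct.items() if ok)
    return good / total if total else 0.0


if __name__ == "__main__":
    extracted = {("material", "nylon"), ("sole", "low profile"), ("smith", "joe")}
    gold = {("material", "nylon"), ("sole", "low profile"),
            ("lacing system", "standard"), ("color", "red")}
    print(precision_recall_f1(extracted, gold))
    print(weighted_precision({"item a": True, "item b": False}, {"item a": 9, "item b": 1}))
```

With the non-weighted measure the two items in the last call would score 0.5; weighting by frequency rewards getting the common item right.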
Related work
  • While we are not aware of any system that addresses the task described in this paper, much research has been done on extracting information from text documents. One task that has received attention recently is that of extracting product features and their polarity from online user reviews. Liu et al. [Liu and Cheng, 2005] describe a system that focuses on extracting relevant product attributes, such as focus for digital cameras. These attributes are extracted by a rule miner and are restricted to noun phrases. In a second phase, the system extracts polarized descriptors, e.g., good, too small, etc. Popescu and Etzioni [Popescu and Etzioni, 2005] describe a similar approach: they first extract noun phrases as candidate attributes and then compute the mutual information between the noun phrases and salient context patterns. Our work is related in that in both cases a product is expressed as a vector of attributes. However, our work focuses not only on attributes, but also on values, and on pairing attributes with values. Furthermore, the attributes extracted from user reviews often differ from the attributes that retailers would mention. For example, a review might mention photo quality as an attribute, whereas camera specifications would tend to use megapixels or lens manufacturer.
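As an illustration of the mutual-information idea mentioned above, the sketch below computes pointwise mutual information (PMI) between noun phrases and context patterns from co-occurrence counts. It shows the generic statistic only and is not claimed to match the formulation used by Popescu and Etzioni; the function name `pmi_table` and the sample observations are invented.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Cooccurrence = Tuple[str, str]  # (noun phrase, context pattern)


def pmi_table(observations: Iterable[Cooccurrence]) -> Dict[Cooccurrence, float]:
    """PMI(np, pattern) = log2( P(np, pattern) / (P(np) * P(pattern)) )."""
    pair_counts = Counter(observations)
    total = sum(pair_counts.values())
    phrase_counts: Counter = Counter()
    pattern_counts: Counter = Counter()
    for (phrase, pattern), count in pair_counts.items():
        phrase_counts[phrase] += count
        pattern_counts[pattern] += count
    return {
        (phrase, pattern): math.log2(
            count * total / (phrase_counts[phrase] * pattern_counts[pattern])
        )
        for (phrase, pattern), count in pair_counts.items()
    }


if __name__ == "__main__":
    # Invented observations: each tuple is one sighting of a phrase in a pattern.
    observations = (
        [("focus", "X of the camera")] * 3
        + [("price", "X of the camera")]
        + [("price", "X is too high")] * 4
    )
    for pair, score in sorted(pmi_table(observations).items()):
        print(pair, round(score, 3))
```

A positive score means the phrase and the pattern co-occur more often than chance would predict, which is the signal used to keep a noun phrase as a candidate attribute.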
References
  • [Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT-98, 1998.
  • [Brill, 1995] Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 1995.
  • [Collins and Singer, 1999] M. Collins and Y. Singer. Unsupervised Models for Named Entity Classification. In EMNLP/VLC, 1999.
  • [Ghani and Jones, 2002] Rayid Ghani and Rosie Jones. A comparison of efficacy of bootstrapping algorithms for information extraction. In LREC 2002 Workshop on Linguistic Knowledge Acquisition, 2002.
  • [Jones, 2005] Rosie Jones. Learning to Extract Entities from Labeled and Unlabeled Text. Ph.D. Dissertation, 2005.
  • [Lin, 1998] Dekang Lin. Dependency-based evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, 1998.
  • [Liu and Cheng, 2005] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of WWW 2005, 2005.
  • [Nigam and Ghani, 2000] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM 2000), 2000.
  • [Peng and McCallum, 2004] Fuchun Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In HLT 2004, 2004.
  • [Popescu and Etzioni, 2005] Ana-Maria Popescu and Oren Etzioni. Extracting product features and opinions from reviews. In Proceedings of EMNLP 2005, 2005.
  • [Porter, 1980] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
  • [Seymore et al., 1999] Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.