# CONCIERGE: Improving Constrained Search Results by Data Melioration

Very Large Data Bases, pp. 2865-2868, 2020.

Abstract:

The problem of finding an item-set of maximal aggregated utility that satisfies a set of constraints is at the cornerstone of many e-commerce applications. Its classical definition assumes that all the information needed to verify the constraints is explicitly given. In practice, however, the data available in e-commerce databases on the ...

Introduction

- The selection of a k-size item-set with the maximal aggregated utility that satisfies a set of constraints is a fundamental problem to many e-commerce applications.
- The problem of finding an item-set of maximal aggregated utility that satisfies a set of constraints is often referred to as Constrained Search (CS) [3].
- Given a set of constrained search queries of interest and a bound on the number of requests from the sellers, CONCIERGE, the system presented in this work, assists the e-commerce platform in identifying a bounded-size set of items whose data should be manually completed.
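
To make the Constrained Search definition above concrete, here is a minimal brute-force sketch. It is not the authors' algorithm: the function name `constrained_search`, the toy `catalog`, and the `at_least_two_blue` constraint are all hypothetical, and the sketch assumes every attribute is fully known, which is exactly the classical assumption that the paper relaxes.

```python
from itertools import combinations

def constrained_search(items, k, utility, constraints):
    """Return the k-size item-set of maximal total utility that satisfies all constraints.

    Brute force over all k-size subsets; illustration only, exponential in general.
    """
    best_set, best_utility = None, float("-inf")
    for candidate in combinations(items, k):
        # Keep the candidate only if every constraint predicate accepts it.
        if all(check(candidate) for check in constraints):
            total = sum(utility(item) for item in candidate)
            if total > best_utility:
                best_set, best_utility = candidate, total
    return best_set, best_utility

# Hypothetical catalog rows: (item id, utility score, color).
catalog = [(1, 0.9, "red"), (2, 0.8, "blue"), (3, 0.7, "blue"), (4, 0.4, "red")]

def at_least_two_blue(candidate):
    # Hypothetical constraint: the result must contain at least two blue items.
    return sum(1 for _, _, color in candidate if color == "blue") >= 2

best_set, best_utility = constrained_search(
    catalog, k=3, utility=lambda item: item[1], constraints=[at_least_two_blue]
)
```

When an attribute such as the color is missing for some items, the constraint check can no longer be evaluated directly, which is the gap that data melioration is meant to close.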

Highlights

- We propose a hybrid approach that harnesses the information derived by common Machine Learning (ML) modules to reduce the manual effort, focusing on the potentially most “beneficial” items
- We demonstrate the operation of CONCIERGE over real-world e-commerce data
- We note that computing the probability that a k-size item-set satisfies a constraint is exponential in k. We estimate this probability using the possible worlds semantics of [4], showing that it can be estimated up to a constant factor in O(k). As both AVG-PCS and LMPCS naturally generalize Probabilistic Constrained Search (PCS), we provide our hardness results w.r.t. ...
- Since PCS cannot be approximated to a constant factor in PTIME, we provide an efficient best-effort algorithm, which we experimentally show to be highly effective
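
The exponential cost mentioned above can be seen in a small possible-worlds enumeration. The sketch below is an illustration only, not the paper's O(k) estimator: it assumes each item has the relevant attribute independently with an ML-derived probability, and the names `satisfaction_probability` and `probs` are hypothetical.

```python
from itertools import product

def satisfaction_probability(attr_probs, constraint):
    """Exact probability that `constraint` holds, by enumerating all 2^k possible worlds.

    attr_probs[i] is the probability that item i has the attribute
    (independence across items is an assumption of this sketch).
    """
    total = 0.0
    for world in product([True, False], repeat=len(attr_probs)):
        # Weight of this world: product of the per-item probabilities.
        weight = 1.0
        for has_attr, p in zip(world, attr_probs):
            weight *= p if has_attr else 1.0 - p
        if constraint(world):
            total += weight
    return total

# Hypothetical ML confidences that each of three items has the attribute.
probs = [0.9, 0.5, 0.2]

# Probability that at least two of the three items have the attribute.
p_at_least_two = satisfaction_probability(probs, lambda world: sum(world) >= 2)
```

The loop visits 2^k worlds for a k-size item-set, which is why an estimate that avoids full enumeration matters once k grows.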

Results

- CONCIERGE uses the probabilities derived by the ML modules to choose a bounded-size set of items that is expected to improve both the utility and the probability of satisfying the constraints, for both queries.
- CONCIERGE would turn to the sellers of items 2, 6, and 7, resulting in the optimal (w.r.t. the ground truth) solutions for both queries: S1 and S2.
- Via CONCIERGE’s dedicated UI, the user can: (1) select the search queries of interest and their imposed constraints; (2) limit the overall number of requests from sellers, and (3) define the aggregation strategy.
- The authors first present the data model, formally define the Probabilistic Constrained Search (PCS) problem, and provide two extensions for handling multiple queries.
- The result of a constrained search query q can be improved in two ways: by increasing the overall utility or by satisfying the constraints with higher probability.
- The optimal result for q is a k-size item-set that is most likely to satisfy the constraints while maximizing utility.
- Completing missing data on these items helps the platform include them in the result set of q1, improving it in terms of both utility and probability of satisfying the constraints.
- In the first problem definition, called AVG-PCS, the goal is to find a k-size set of items that maximizes the average contribution across all queries.
- Completing missing data on these items may improve the results of both queries.
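
The average-based aggregation described for AVG-PCS can be sketched as follows. This is a toy under stated assumptions, not the authors' procedure: the `gains` table is invented, contributions are assumed to be additive per item, and the search is a brute force over all k-size sets.

```python
from itertools import combinations

# Hypothetical gain of completing each item's data, per query (invented numbers).
gains = {
    "q1": {"a": 0.6, "b": 0.1, "c": 0.4, "d": 0.0},
    "q2": {"a": 0.1, "b": 0.5, "c": 0.4, "d": 0.2},
}

def contribution(candidate, query):
    # Assumption of this sketch: a set's contribution to a query is the sum of item gains.
    return sum(gains[query][item] for item in candidate)

def avg_score(candidate, queries):
    # Average contribution of the candidate set across all queries.
    return sum(contribution(candidate, q) for q in queries) / len(queries)

def best_avg_set(items, k, queries):
    """Brute-force choice of the k-size item-set with the highest average contribution."""
    return max(combinations(items, k), key=lambda c: avg_score(c, queries))

queries = ["q1", "q2"]
best = best_avg_set(["a", "b", "c", "d"], k=2, queries=queries)
best_score = avg_score(best, queries)
```

Averaging favors items that help several queries at once: in this toy table, item `c` is not the top item for either query on its own, yet it is part of the best pair.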

Conclusion

- The authors prove this bound to hold for PCS, even if it is known which k-size item-set satisfies the constraints with the highest probability or which has maximal utility.
- The audience will play the roles of both data analysts, attempting to improve the results of constrained search queries, and the items’ sellers, requested to complete missing attribute values.
- The authors begin by asking the audience to select: (1) the search queries of interest, and their corresponding constraints; (2) a bound on the number of requests from sellers, and (3) the aggregation policy.

Related work

- Our work is closely related to a line of work studying different variants of the CS problem, proposing efficient algorithms for solving them [7, 9, 3]. While we establish the connection between CS and the optimization problem that we study in this work (showing our problem to be harder), we emphasize that our goal is different. Instead of finding the optimal solution for the search queries, we aim to improve their results, via data melioration.

- Multiple data cleansing tools combine human input and ML [8], typically using domain experts to generate adequate labeled data for supervised learning, while minimizing human effort [10, 5]. Our work complements these previous efforts by leveraging the probabilities obtained by the ML algorithms to identify which items should be manually cleaned. CONCIERGE can be used to optimize the cleaning process of a database, as well as to assist in its ongoing maintenance: whenever a new constraint is imposed, CONCIERGE can take over to efficiently identify what missing information may improve the queries’ results.

Funding

- This work has been partially funded by the Israel Science Foundation, the Binational US-Israel Science Foundation, and the Tel Aviv University Data Science Center.

Reference

- [2] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. Technical report, Stanford, 2005.
- [3] L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.
- [4] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 2007.
- [5] C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, 2013.
- [6] A. Kannan, I. E. Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product specifications. In KDD, 2011.
- [7] J. Stoyanovich, K. Yang, and H. Jagadish. Online set selection with fairness and diversity constraints. In EDBT, 2018.
- [8] C. Sun, N. Rampalli, F. Yang, and A. Doan. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proc. VLDB Endow., 7(13):1529–1540, 2014.
- [9] T. Wu, L. Chen, P. Hui, C. J. Zhang, and W. Li. Hear the whole story: Towards the diversity of opinion in crowdsourcing markets. Proc. VLDB Endow., 8(5):485–496, 2015.
