Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Yiyun Zhao, Jiarong Jiang, Yiqun Hu, Wuwei Lan, Henry Zhu, Anuj Chauhan, Alexander Li, Lin Pan, Jun Wang, Chung-Wei Hang, Sheng Zhang, Marvin Dong, Joe Lilien, Patrick Ng, Zhiguo Wang, Vittorio Castelli, Bing Xiang

conf_acl (2022)

Abstract

Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examine existing synthesized datasets and find that state-of-the-art text-to-SQL algorithms do not improve further on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings: illogical synthetic SQL queries caused by independent column sampling, and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
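To make the idea of schema-distance-weighted column sampling concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not specify the weighting function, so the exponential decay in foreign-key hop distance, the example schema, and every name below (`FK_EDGES`, `COLUMNS`, `table_distances`, `column_weights`, `sample_column`, `decay`) are assumptions chosen for illustration only.

```python
import random
from collections import deque

# Hypothetical toy schema (an assumption, not from the paper):
# tables connected by foreign-key relationships.
FK_EDGES = [
    ("singer", "concert_singer"),
    ("concert_singer", "concert"),
    ("concert", "stadium"),
]
COLUMNS = {
    "singer": ["name", "age"],
    "concert_singer": ["singer_id", "concert_id"],
    "concert": ["theme", "year"],
    "stadium": ["capacity", "location"],
}


def table_distances(start):
    """Breadth-first search over foreign-key edges.

    Returns the hop count from `start` to every reachable table.
    """
    graph = {table: set() for table in COLUMNS}
    for a, b in FK_EDGES:
        graph[a].add(b)
        graph[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for neighbor in graph[table]:
            if neighbor not in dist:
                dist[neighbor] = dist[table] + 1
                queue.append(neighbor)
    return dist


def column_weights(anchor_table, decay=0.5):
    """Assumed weighting: decay ** hop_distance, normalized to sum to 1.

    Columns in tables unreachable from the anchor get no weight at all,
    which avoids sampling columns that cannot be joined.
    """
    dist = table_distances(anchor_table)
    raw = {(t, c): decay ** dist[t] for t in dist for c in COLUMNS[t]}
    total = sum(raw.values())
    return {key: weight / total for key, weight in raw.items()}


def sample_column(anchor_table, rng, decay=0.5):
    """Draw one (table, column) pair, favoring tables near the anchor."""
    weights = column_weights(anchor_table, decay)
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_column("singer", rng))
```

Under these assumptions, a column in `singer` is eight times as likely to be drawn as a column in `stadium` (three hops away), in contrast to independent uniform sampling, which would treat all columns alike regardless of how the tables relate.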

Keywords

data,high-quality,text-to-sql
