Anonymouth Revamped: Getting Closer to Stylometric Anonymity

Andrew W. E. McDonald, Jeffrey Ulman, Marc Barrowclift, Rachel Greenstadt

Semantic Scholar (2013)

Abstract
Stylometry, the study of writing style (such as word choice, sentence length, and sentence structure), is a very real threat to privacy. Even if a would-be author is completely anonymous in every other respect, as soon as he/she encodes any thoughts in text, his/her anonymity may vanish. Today, a document’s author can be selected from a pool of a hundred thousand authors [1]. Further, even when it is unknown whether a document’s author is in a given pool of potential authors, methods exist to determine the likelihood that the document was written by any author in that pool. The accuracy and ability [2] of stylometric authorship detection/attribution in open and closed world scenarios are rapidly advancing. It is therefore exceedingly important to continue the evolution of tools that combat this potential privacy breach and offer privacy-seeking individuals the ability to remain anonymous while expressing their ideas via a textual medium. While it is possible to anonymize a document with nothing more than a text editor, studies show that it is quite challenging to do well, that there is no guarantee that the author has done a sufficient job of removing his/her style from the document, and that it is hard to be consistent in hiding one’s own writing style [3,4]. We present a revised version of the open-source, Java-based authorship anonymization tool, Anonymouth, presented at the 2012 Privacy Enhancing Technologies Symposium [5]. The revised Anonymouth has a fully redesigned graphical user interface to enhance usability, along with an increased repertoire and updated algorithms to improve performance. Anonymouth uses JStylo [5], an authorship attribution platform, as its backend, and uses machine learning and natural language processing techniques to help a user remove his/her style from a document he/she authored.
To do this, the user must first input three sets of documents: the document to be anonymized, documentToAnonymize; previous documents authored by the user, userSampleDocuments; and a set of documents by at least three other authors, otherSampleDocuments. Once all documents have been loaded into Anonymouth, JStylo extracts features from all of them and classifies the documentToAnonymize with respect to the set formed by combining the userSampleDocuments and the otherSampleDocuments (the userAndOtherDocuments), using one of the available machine learning algorithms (though SMO is used almost exclusively). This classification is shown to the user to provide an idea of the document’s baseline anonymity. Anonymouth then analyzes the features extracted from all three document sets and, based upon the k-means clustering algorithm and the user’s average feature values, decides upon a few sets of potential target values for the documentToAnonymize’s features. Each set of potential target values is tested against the classifier trained on the set of userAndOtherDocuments. The set of potential target values that returns a classification suggesting the greatest degree of anonymity is selected as the set of target values for the user’s documentToAnonymize. Next, alternative ways of expressing ideas, as well as elements to add to or remove from the document, are presented to the user. Once the user is satisfied with his/her edits, he/she may reprocess the document. This edit and reprocess cycle continues until the user is satisfied with the classification returned by JStylo. Initially, Anonymouth presented the actual extracted features (such as character n-grams) to the user, and “suggested” that the number of occurrences of the selected feature be either increased or decreased. It quickly became apparent that these “suggestions” made Anonymouth completely unusable.
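The target-selection step described above can be illustrated with a minimal sketch. It is written in Python for brevity (Anonymouth itself is Java-based) and is not Anonymouth's or JStylo's actual code. It rests on several assumptions: features are already extracted into small numeric vectors, a simple nearest-centroid scorer stands in for the SMO classifier, and candidate target sets are taken to be k-means centroids of the other authors' documents. The abstract does not specify how the clusters and the user's average feature values are combined, so here the user's averages serve only as the user's centroid in the classifier. All function names and sample values are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train_centroids(docs_by_author):
    """Stand-in 'classifier': one mean feature vector per author (not SMO)."""
    return {author: mean_vector(vecs) for author, vecs in docs_by_author.items()}

def user_probability(model, features, user):
    """Share of inverse-distance weight given to `user`; lower means more anonymous."""
    weights = {a: 1.0 / (math.dist(features, c) + 1e-9) for a, c in model.items()}
    return weights[user] / sum(weights.values())

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means with seeded initialization; returns k centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def select_target(user_docs, other_docs_by_author, k=3, user="user"):
    """Return the candidate target vector least attributed to the user, plus the model."""
    model = train_centroids({user: user_docs, **other_docs_by_author})
    other_points = [v for vecs in other_docs_by_author.values() for v in vecs]
    candidates = kmeans(other_points, k)  # assumed source of candidate target sets
    best = min(candidates, key=lambda c: user_probability(model, c, user))
    return best, model

# Hypothetical feature vectors: [average word length, average sentence length].
user_sample_documents = [[4.1, 12.0], [3.9, 11.0], [4.0, 13.0]]
other_sample_documents = {
    "author_a": [[5.0, 20.0], [5.2, 21.0]],
    "author_b": [[6.0, 30.0], [6.1, 29.0]],
    "author_c": [[4.5, 16.0], [4.4, 15.0]],
}
document_to_anonymize = [4.0, 12.5]

target, model = select_target(user_sample_documents, other_sample_documents)
baseline = user_probability(model, document_to_anonymize, "user")
print("baseline attribution to user:", round(baseline, 3))
print("selected target values:", target)
```

In this sketch, the baseline score plays the role of the classification shown to the user before editing, and the selected target is the candidate the stand-in classifier is least likely to attribute to the user. An edit-and-reprocess loop would recompute the score after each round of edits.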
In addition, the user interface was unintuitive and poor at relaying necessary information, as gleaned from the surveys we conducted during the original Anonymouth’s user study. Further yet, while the core of Anonymouth appeared