Comparative Analysis of Word vs. Character Embedding for Machine Learning Based Detection of Malicious URLs and DGA-Generated Domains

Oludayo C. Ayodele, Suleiman Y. Yerima

2023 IEEE 15th International Conference on Computational Intelligence and Communication Networks (CICN), 2023

Abstract
This study presents a comprehensive comparative analysis of the effectiveness of word-level and character-level embeddings in the context of machine learning-based detection of malicious URLs and DGA-generated domains. Using distinct datasets comprising DGA-generated domains and Spam URLs, we systematically evaluate various machine learning models coupled with word-level and character-level tokenization techniques. Our findings indicate that character-level tokenization yields superior results in identifying DGA-generated domains, particularly due to the random character composition of these domains. Conversely, both word-level and character-level embeddings exhibit comparable success rates in classifying Spam URLs, owing to the non-random structure of these URLs. The study sheds light on the importance of tailoring tokenization strategies to the unique characteristics of the data. We recommend character-level embeddings for detecting DGA-generated domains characterized by random characters. In contrast, the choice between word-level and character-level embeddings is less critical when dealing with Spam URLs, as both approaches yield effective results.
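
The difference between the two tokenization levels can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the delimiter rule, the n-gram size, and both sample strings are assumptions chosen only for illustration. It shows why a random-looking domain yields almost no reusable word-level tokens, while character n-grams still produce many features.

```python
import re
from collections import Counter


def word_tokens(url):
    """Word-level: split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def char_ngrams(url, n=2):
    """Character-level: slide a window of n characters over the string."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def bag_of_tokens(tokens):
    """Bag-of-words style count vector, stored as a token -> count mapping."""
    return Counter(tokens)


# Hypothetical inputs, made up for this sketch (not taken from the paper's datasets).
dga_like = "xkqjvbzpwt.com"
spam_like = "cheap-pills.example.com/buy/now"

print(word_tokens(dga_like))         # ['xkqjvbzpwt', 'com']
print(char_ngrams(dga_like, 2)[:4])  # ['xk', 'kq', 'qj', 'jv']
print(word_tokens(spam_like))        # ['cheap', 'pills', 'example', 'com', 'buy', 'now']
print(bag_of_tokens(char_ngrams(spam_like, 2)).most_common(3))
```

In this sketch the DGA-like string collapses into a single opaque word token plus the TLD, whereas the spam-like URL splits into several recognizable words. The counts from `bag_of_tokens` could then be reweighted (for example with TF-IDF) before being passed to a classifier.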
Keywords
Malicious URL detection, DGA domain detection, word embedding, character embedding, term frequency-inverse document frequency, n-grams, bag-of-words, machine learning