We developed a machine learning model to detect malicious URLs by combining lexical, host-based, and content-based features, overcoming the limitations of traditional blacklisting methods. Using the Random Forest algorithm, our approach analyzes specific URL attributes such as length, HTTP tokens, domain age, Google index status, web traffic, and the presence of iframe or right-click events. Trained on a dataset of 11,000 URLs from the UCI Machine Learning Repository, the model achieved 94.7% accuracy.
View the Project on GitHub KSruthiVel/Malicious-URL-Detection-using-Machine-Learning
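Lexical features are statistical properties extracted from the literal URL string, such as the length of the URL, the number of digits, the number of parameters in its query part, and whether the URL is encoded. For example, the URL ‘amazon.com.support.info’ places a familiar brand name in its subdomains, a pattern that is visible from the string alone.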
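As a rough illustration of how such string-level features could be computed, here is a minimal sketch using only the Python standard library. The function name `lexical_features`, the output keys, and the short `SHORTENERS` list are assumptions made for this example; they are not taken from the project's code.

```python
import re
from urllib.parse import urlparse, parse_qsl

# Illustrative, non-exhaustive list of shortening services (an assumption).
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}

# Matches a leading scheme such as "http://" or "https://".
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def lexical_features(url: str) -> dict:
    """Compute a few lexical features from the raw URL string."""
    # Add a scheme when it is missing so that urlparse can find the host.
    parsed = urlparse(url if SCHEME_RE.match(url) else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        # Number of key=value pairs in the query part.
        "param_count": len(parse_qsl(parsed.query, keep_blank_values=True)),
        # True when the URL contains at least one %XX escape.
        "is_percent_encoded": bool(re.search(r"%[0-9A-Fa-f]{2}", url)),
        "dot_count_in_host": host.count("."),
        "having_at": "@" in url,
        "having_dash": "-" in host,
        "shortening_service": host in SHORTENERS,
    }
```

Calling `lexical_features("amazon.com.support.info")` reports three dots in the host name, which is the kind of signal a classifier could use to flag brand names buried in subdomains.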
Host-based features describe the host of the webpage, for example, country of registration, domain name properties, name servers, connection speed, time to live from registration, etc. The motivation for including them is that malicious and benign sites differ in their deployment tactics, longevity, and reputation.
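Two of the host-based signals, an IP address used in place of a domain name and the age of the domain, can be sketched without any network access. In the sketch below the creation date is passed in as an argument; in practice it would come from a WHOIS lookup. The function names and the 180-day threshold are assumptions for illustration, not values taken from the project.

```python
import ipaddress
from datetime import date
from urllib.parse import urlparse


def having_ip(url: str) -> bool:
    """Return True when the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def age_of_domain_days(creation_date: date, today: date) -> int:
    """Number of days between domain registration and `today`."""
    return (today - creation_date).days


def is_young_domain(creation_date: date, today: date,
                    min_age_days: int = 180) -> bool:
    """Flag domains younger than `min_age_days` (threshold is an assumption)."""
    return age_of_domain_days(creation_date, today) < min_age_days
```

Keeping the date arithmetic separate from the lookup itself makes the feature easy to test and lets the WHOIS client be swapped without touching the feature logic.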
Content-based features are obtained from the downloaded HTML code of the webpage. They capture the structure of the page and the content embedded in it, including script tags, embedded objects, executables, and hidden elements. For example, in an SQL injection attack, anomalies such as malformed documents or repeated tags show up in the raw HTML content.
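To make the idea concrete, the following sketch scans an HTML string for a few of these signals. It uses the standard library's `html.parser` to stay dependency-free, although the project itself lists beautifulsoup4. The class and key names are hypothetical, and treating any `oncontextmenu` attribute as a sign of disabled right-click is a simplifying heuristic.

```python
from html.parser import HTMLParser


class ContentFeatureParser(HTMLParser):
    """Collect simple structural signals while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.iframe_count = 0
        self.script_count = 0
        self.disables_right_click = False
        self.mailto_form = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        if tag == "iframe":
            self.iframe_count += 1
        elif tag == "script":
            self.script_count += 1
        elif tag == "form":
            action = (attr_map.get("action") or "").lower()
            if action.startswith("mailto:"):
                self.mailto_form = True
        # Heuristic: an oncontextmenu handler often blocks the right-click menu.
        if "oncontextmenu" in attr_map:
            self.disables_right_click = True


def content_features(html: str) -> dict:
    """Parse `html` and return the collected content-based features."""
    parser = ContentFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "iframe": parser.iframe_count > 0,
        "script_count": parser.script_count,
        "rightclick": parser.disables_right_click,
        "submitting_to_email": parser.mailto_form,
    }
```

In a full pipeline the HTML string would be the body of the downloaded page; this sketch takes it as a plain argument so it can run offline.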
Lexical Features | Host-based Features | Content-based Features |
---|---|---|
url_of_anchor | registration_length | web_traffic |
sub_domain | age_of_domain | favicon |
having_- | having_ip | redirect |
links_in_tags | google_index | submitting_to_email |
sfh | dns_record | statistical_report |
request_url | mouse_over | |
url_length | iframe | |
https_token | rightclick | |
shortening_service | | |
having_@ | | |
abnormal_url | | |
having_// | | |
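The features in the table above form the input vector for the Random Forest classifier. The sketch below shows how such a model might be trained with scikit-learn. It runs on synthetic data generated in the code, not on the UCI dataset, so its score says nothing about the 94.7% accuracy reported for the project. The four-feature subset, the -1/0/1 encoding, the labelling rule, and the hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small illustrative subset of the feature names from the table above.
FEATURES = ["url_length", "having_ip", "age_of_domain", "iframe"]


def make_synthetic_data(n_rows: int = 400, seed: int = 0):
    """Generate random -1/0/1 features with a made-up labelling rule."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 0, 1], size=(n_rows, len(FEATURES)))
    # Synthetic rule: label 1 when the feature values sum to >= 0, else -1.
    y = np.where(X.sum(axis=1) >= 0, 1, -1)
    return X, y


def train_random_forest(X, y, seed: int = 0):
    """Fit a Random Forest on an 80/20 split and return (model, test score)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

After training, `model.feature_importances_` gives one weight per column of `X`, which is a convenient way to see which of the lexical, host-based, or content-based features the forest relies on most.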
UI: HTML5, CSS3.
Backend: Python3.
Libraries: beautifulsoup4, googlesearch-python, scikit-learn, pandas, requests, whois.