AI and Robotics | Published

Privacy-Preserving Machine Learning for Fraudulent Website Detection Using Differentially Private Decision Trees

Muhammad Ali Fauzi (Faculty of Computer Science, Universitas Brawijaya, Malang, Indonesia), Yudhistira Bian Yang (Faculty of Computer Science, Universitas Brawijaya, Malang, Indonesia), Bian Yang (Department of Information Security and Communication Technology, Norwegian University of Science and Technology (NTNU), Gjøvik, Norway)
April 25, 2025

Abstract

Detecting fraudulent websites is essential for cybersecurity, as these sites often serve as vehicles for phishing, identity theft, and malware distribution. Machine learning models, particularly Decision Trees, have shown strong effectiveness in identifying patterns indicative of fraudulent sites. However, the need to protect sensitive user data introduces challenges, necessitating privacy-preserving approaches like Differential Privacy (DP). This study investigates the application of DP in Decision Tree classifiers to assess the trade-offs between privacy and utility in detecting fraudulent websites. Using a publicly available dataset, we compare a traditional Decision Tree with a differentially private version across various values of the privacy budget, epsilon (ϵ). Our results show that while the original model achieves optimal performance with an accuracy of 95.4%, the DP model at an optimal ϵ = 4.09 maintains high utility, achieving 83.7% accuracy and an F1-score of 82.7%, demonstrating its suitability for privacy-sensitive applications. This study highlights that with moderate ϵ values, DP Decision Trees can provide effective privacy protection with minimal performance loss, making them viable for real-world applications where both privacy and predictive accuracy are critical.

Keywords

Fraudulent website detection, machine learning, Decision Tree, differential privacy, privacy-preserving technol-

... balancing model accuracy with privacy. This paper investigates the application of DP in Decision Trees, comparing a traditional Decision Tree with a differentially private version and analyzing the impact of varying levels of the privacy budget, epsilon, on model performance.

II. MATERIALS AND METHODS

A. Dataset

The dataset used in this study is sourced from a public dataset [6]. This dataset comprises several features that characterize websites, including elements such as URL length, domain age, and the presence of HTTPS. These features are essential in distinguishing legitimate websites from fraudulent ones. The dataset is labeled, with each instance marked as either fraudulent or non-fraudulent, which allows supervised learning models to be trained and evaluated on this binary classification task.
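The role of the privacy budget ϵ can be made concrete with a small sketch. The text here does not say which mechanism the differentially private Decision Tree uses, so the example below assumes one common construction: the class counts in a leaf are perturbed with Laplace noise of scale 1/ϵ before the majority label is chosen. It also assumes that neighbouring datasets differ by adding or removing a single record. The function names (`laplace_noise`, `noisy_leaf_label`, `flip_rate`) and the leaf counts are hypothetical and serve only as illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_leaf_label(counts, epsilon, rng):
    """Return the label whose Laplace-perturbed count is largest.

    counts: dict mapping label -> number of training records in the leaf.
    Assumption: adding or removing one record changes one count by 1,
    so noise of scale 1/epsilon is applied to each count.
    """
    scale = 1.0 / epsilon
    noisy = {label: n + laplace_noise(scale, rng) for label, n in counts.items()}
    return max(noisy, key=noisy.get)


def flip_rate(counts, epsilon, trials=2000, seed=0):
    """Estimate how often the noisy label differs from the true majority."""
    rng = random.Random(seed)
    true_label = max(counts, key=counts.get)
    flips = sum(
        noisy_leaf_label(counts, epsilon, rng) != true_label
        for _ in range(trials)
    )
    return flips / trials


# Made-up leaf counts, not taken from the dataset in the paper.
leaf_counts = {"fraudulent": 12, "non-fraudulent": 8}

for eps in (0.1, 1.0, 4.09):
    print(f"epsilon={eps}: estimated flip rate = {flip_rate(leaf_counts, eps):.3f}")
```

Smaller values of ϵ inject more noise, so the leaf's predicted label departs from the true majority more often. This is the privacy-utility trade-off that the comparison across ϵ values is meant to measure.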