Mobile Malware Detection; Big Data Analytics; Machine Learning
[en] Mobile malware is on the rise. Due to their popularity, smartphones represent an attractive target for cybercriminals, especially regarding unauthorized access to private user data; smartphones incorporate a lot of sensitive information about users, even more than a personal computer. Indeed, besides personal information such as documents, accounts, passwords, contacts, etc., smartphone sensors centralize other sensitive data such as user location, physical activities, etc. In this paper, we study the problem of malware detection in smartphones, using supervised machine learning models and big data analytics frameworks. Using a publicly available dataset for smartphone data analysis (the SherLock data collection, see http://bigdata.ise.bgu.ac.il/sherlock/), we train and benchmark different supervised machine learning models to detect malware apps activity.The Sherlock data collection is a crowdsourcing-based smartphone dataset in which hundreds of features from many different "sensors" or vantage points within the device are monitored, using a tailored smartphone agent. The collection is done during a long-term - 2 years (2015/16), field trial on 50 smartphones used as primary device for 50 different participants. The monitoring agent collects a wide variety of network, software and sensor data at a high sample rate (as low as 5 seconds); in addition, participant devices include a sandbox-like smartphone agent which runs controlled malware apps, perpetrating attacks on the user's device (such as contacts theft, spyware, phishing, etc.), while creating labels for the SherLock dataset. The complete labeled dataset contains more than 10 billion data records, with a total of about 4 TB of data. We additionally complement the labels for malicious apps which might have been installed by participants by analyzing the installed apps' hashes in Virus Total (https://www.virustotal.com), a well-known multi antivirus online scanning system. From the complete dataset, we keep two specific feature categories: all those features related to the network traffic generated by the apps, and all those features corresponding to the footprint of the app on the CPU and internal running processes (e.g., statistics on CPUs, memory usage, linux-level processes information, etc.). The rationale is that some malware activity would be more visible at the network traffic level, whereas some others would be better identified at the local processes level. Using this dataset, we train different machine learning models (e.g., decision trees, neural networks, SVMs, etc.) and verify their accuracy to automatically spot out malicious apps running on the users’ devices. We also apply feature selection strategies to improve results and reduce computational times. Given the size of the dataset, we rely on big data platforms (such as Spark) to perform the analysis, complementing the machine learning based analysis with scikit-learn like pipelines. We evaluate three different concepts, including (i) overall model performance (using multi-fold cross validation on the complete dataset), (ii) generalization of the learned models across different users (train in N-1 users, and test in the remaining user), and (iii) detection accuracy drift along time (train during first month/week, test the resulting model in the subsequent months/weeks). Initial results are very promising, especially regarding overall model performance for decision tree based models.