BIGMOMAL - Big Data Analytics for Mobile Malware Detection

Wassermann, Sarah; Casas, Pedro


Poster (scientific conferences and congresses)

2017 • ACM Internet Measurement Conference 2017

Permalink
https://hdl.handle.net/2268/215139


Documents

Full text

poster.png

Author preprint (7.98 MB)

All documents in ORBi are protected by a user license.

Details

Keywords:

Mobile Malware Detection; Big Data Analytics; Machine Learning

Abstract:

[en] Mobile malware is on the rise. Due to their popularity, smartphones represent an attractive target for cybercriminals, especially regarding unauthorized access to private user data; smartphones hold a lot of sensitive information about users, even more than a personal computer. Indeed, besides personal information such as documents, accounts, passwords, and contacts, smartphone sensors centralize other sensitive data such as user location and physical activities. In this paper, we study the problem of malware detection in smartphones, using supervised machine learning models and big data analytics frameworks. Using a publicly available dataset for smartphone data analysis (the SherLock data collection, see http://bigdata.ise.bgu.ac.il/sherlock/), we train and benchmark different supervised machine learning models to detect malware app activity. The SherLock data collection is a crowdsourcing-based smartphone dataset in which hundreds of features from many different "sensors" or vantage points within the device are monitored, using a tailored smartphone agent. The collection was done during a long-term (2 years, 2015/16) field trial on 50 smartphones used as primary devices by 50 different participants. The monitoring agent collects a wide variety of network, software, and sensor data at a high sampling rate (with sampling intervals as short as 5 seconds); in addition, participant devices include a sandbox-like smartphone agent which runs controlled malware apps, perpetrating attacks on the user's device (such as contacts theft, spyware, and phishing), while creating labels for the SherLock dataset. The complete labeled dataset contains more than 10 billion data records, with a total of about 4 TB of data. We additionally complement the labels for malicious apps which might have been installed by participants by analyzing the installed apps' hashes in VirusTotal (https://www.virustotal.com), a well-known online multi-antivirus scanning system.
From the complete dataset, we keep two specific feature categories: all features related to the network traffic generated by the apps, and all features corresponding to the footprint of the app on the CPU and internal running processes (e.g., statistics on CPUs, memory usage, and Linux-level process information). The rationale is that some malware activity would be more visible at the network traffic level, whereas other activity would be better identified at the local process level. Using this dataset, we train different machine learning models (e.g., decision trees, neural networks, and SVMs) and verify their accuracy in automatically spotting malicious apps running on the users' devices. We also apply feature selection strategies to improve results and reduce computation times. Given the size of the dataset, we rely on big data platforms (such as Spark) to perform the analysis, complementing the machine learning based analysis with scikit-learn-like pipelines. We evaluate three aspects: (i) overall model performance (using multi-fold cross-validation on the complete dataset), (ii) generalization of the learned models across different users (train on N-1 users and test on the remaining user), and (iii) detection accuracy drift over time (train during the first month/week, then test the resulting model in the subsequent months/weeks). Initial results are very promising, especially regarding overall model performance for decision tree based models.
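The three evaluation setups (i) to (iii) differ only in how records are split into training and test sets. The following is a minimal sketch of those splits in plain Python, not the authors' code: the function names, the toy `user_ids` and `periods` lists, and the week granularity are all assumptions made for illustration. Each generator yields index lists that could be passed to any classifier.

```python
from collections import defaultdict


def k_fold_splits(n_records, k):
    """(i) Plain k-fold: every record lands in exactly one test fold."""
    for fold in range(k):
        test = [i for i in range(n_records) if i % k == fold]
        train = [i for i in range(n_records) if i % k != fold]
        yield train, test


def leave_one_user_out(user_ids):
    """(ii) Train on N-1 users, test on the single held-out user."""
    by_user = defaultdict(list)
    for i, user in enumerate(user_ids):
        by_user[user].append(i)
    for held_out in sorted(by_user):
        test = by_user[held_out]
        train = [i for user, idx in by_user.items()
                 if user != held_out for i in idx]
        yield held_out, train, test


def temporal_splits(periods, train_until):
    """(iii) Train on early periods, test on each later period separately."""
    train = [i for i, p in enumerate(periods) if p <= train_until]
    for later in sorted({p for p in periods if p > train_until}):
        test = [i for i, p in enumerate(periods) if p == later]
        yield later, train, test


# Hypothetical toy metadata: 3 users, one record per user for weeks 0..3.
user_ids = ["u1"] * 4 + ["u2"] * 4 + ["u3"] * 4
periods = [0, 1, 2, 3] * 3

folds = list(k_fold_splits(len(user_ids), k=3))
per_user = list(leave_one_user_out(user_ids))
per_week = list(temporal_splits(periods, train_until=0))
```

Keeping the test sets of setup (iii) separate per period, rather than pooling all later records, is what makes accuracy drift visible as a trend over time.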

Disciplines:

Computer science

Author, co-author:

Wassermann, Sarah ; Université de Liège - ULiège > Master sc. informatiques, à fin.

Casas, Pedro

Document language:

English

Title:

BIGMOMAL - Big Data Analytics for Mobile Malware Detection

Publication date:

November 2017

Event name:

ACM Internet Measurement Conference 2017

Event location:

London, United Kingdom

Event date:

1 November 2017 to 3 November 2017

Event scope:

International

Research project title:

BigDAMA

Available on ORBi:

since 17 October 2017

Statistics

Number of views

437 (including 11 from ULiège)

Number of downloads

60 (including 1 from ULiège)