The Unspoken Problems With Machine Learning in Security Noa Weiss
Hi! AI & Machine Learning Consultant ● Playing with data for over a decade ● Risk and Security ● PayPal, Armis
● Deep Voice foundation ● Leader of Women in Data Science Israel ● Mentor junior data scientists
Agenda ● Is the grass really greener? ○ ML - other domains ○ ML - security ● The things that hold us back ● Possible solutions
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
ML IN OTHER DOMAINS: COMPUTER VISION
Computer Vision Today ● Autonomous vehicles ● Facial recognition ● Generative AI
COMPUTER VISION: EXAMPLES
Image Completion Algorithm: Image-GPT
Sketches → Photorealism Algorithm: GauGAN
Sketches → Photorealism Algorithm: GauGAN Developed by Katherine Nicholls, PhD
Fictional People www.thispersondoesnotexist.com
Fictional People / Cats www.thiscatdoesnotexist.com
Fictional Everything www.thispersondoesnotexist.com www.thiscatdoesnotexist.com www.thishorsedoesnotexist.com/ www.thisartworkdoesnotexist.com/ www.thischemicaldoesnotexist.com/
ML IN OTHER DOMAINS: NATURAL LANGUAGE PROCESSING (NLP)
NLP Today ● Pretty good automatic translation ● Long-form question answering ● GPT-3
NLP: EXAMPLES
GPT-3 ● Language model (multi-purpose NLP model) ● Mostly generative ● Astonishing performance
GPT-3: Generative Code ● Free description of layout → JSX code ● (No task-specific training)
GPT-3: Generative Code ● Free description of ML model → model code!
GPT-3: Coding Interview
Google Duplex ● “Personal assistant” for phone reservations
Google Duplex
Security
ML in Security Today The good stuff: ● Some significant improvements in malware detection ○ Next-Generation Anti-Virus (NGAV) ● Some promise for network intrusion detection ○ Not yet prominent in practice
ML in Security Today ● All in all: ○ ML models with so-so performance ○ ML makes up only a small part of the core product ○ Data and ML technology under-utilized ● Lagging behind other domains
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
WHY?
Anomaly Detection Algorithms Algorithms aimed at identifying data points, events, or observations that deviate from a dataset's normal behavior ● Very common in Security ○ Algorithm task fits business needs ○ Unsupervised (no labels needed)
Anomaly Detection Algorithms Yet, not ideal for Security: ● High false positive rate (FPR) ○ Legitimate user activity is often anomalous ○ Higher cost of errors than in other domains ■ (Block legit activity? Wait for manual review?) ● Human-designed features are our “Ground Truth” ○ Very prone to human bias ○ Model only spots MOs we already know
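The false-positive problem above can be seen even in the simplest possible anomaly detector. The following is a minimal sketch (a z-score threshold over event counts, not any specific product's algorithm; the data and field names are invented for illustration): anything far from "normal" gets flagged, whether or not it is actually malicious.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean -- a toy stand-in for unsupervised
    anomaly detection: unusual == flagged, malicious or not."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Daily login counts for one user; the spike on the last day is
# legitimate (say, a company-wide password reset), not an attack --
# yet the detector flags it all the same.
logins = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6, 5, 4, 6, 60]
print(zscore_anomalies(logins))  # [13] -- a false positive
```

No labels were needed, which is exactly why the approach is popular in security; but nothing in the model distinguishes "anomalous" from "hostile", which is exactly the FPR problem described above.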
Changing Environment ● Most ML domains: mostly unchanging environment ○ E.g.: CV, NLP ● Environment in Security: ○ New devices ○ New apps ○ New protocols ○ Etc. ● This is a problem for a learning model
An Adapting Adversary ● As we become better at securing our devices and networks, attackers become better at outsmarting our defences ● This is a problem uncommon in most fields ○ E.g.: CV, NLP
Tagging ● How CV and NLP get tagged datasets ● Why we can’t do that in security ○ Expertise ○ Context ○ Confidentiality ○ Scale ● Bigger datasets = bigger tagging problems ○ Sampling?
Imbalanced Classes Different classes are extremely over/under-represented in the data ● Results in poor predictive performance (especially for the minority class)
Imbalanced Classes ● A major problem when aiming to identify fraud/attacks ● While common solutions exist, they are limited, and do not fully solve this problem
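A quick sketch of why imbalance bites, and of one of the common-but-partial remedies mentioned above (inverse-frequency class weighting). The 990/10 split and the label names are illustrative only; real attack rates are often far more extreme:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally
    larger weight, e.g. to scale their contribution to a loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# 990 benign events, 10 attacks -- a *mild* case by security standards.
labels = ["benign"] * 990 + ["attack"] * 10

# A model that never predicts "attack" is 99% accurate -- and useless.
accuracy = sum(y == "benign" for y in labels) / len(labels)
print(accuracy)               # 0.99
print(class_weights(labels))  # "attack" weighted 99x more than "benign"
```

Weighting (like oversampling or SMOTE-style synthesis) shifts the trade-off rather than removing it: the minority class is still represented by the same handful of examples, which is the "limited, do not fully solve" point above.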
Need for Explainability ● CV / NLP: mostly based on deep learning techniques ● Deep learning models are considered “black boxes” ● Security decision-making requires explainability (more so than other domains) ● DL could still be used with post-hoc explainability methods added on - but those are imperfect, and add complexity
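To make the "added-on explainability" idea concrete, here is a minimal permutation-importance sketch: shuffle one input feature at a time and count how often the black box's verdict changes. Everything here is invented for illustration (the `black_box` rule, the event fields), and this is the general post-hoc technique, not any specific product's explainer:

```python
import random

def black_box(event):
    # Stand-in for an opaque model; internally it only uses two fields.
    return event["failed_logins"] > 3 and event["off_hours"]

def permutation_importance(model, events, trials=20, seed=0):
    """Shuffle one feature at a time across events and measure the
    fraction of predictions that flip. High fraction == the model
    relies on that feature; zero == the feature is ignored."""
    rng = random.Random(seed)
    base = [model(e) for e in events]
    importance = {}
    for feat in events[0]:
        flips = 0
        for _ in range(trials):
            col = [e[feat] for e in events]
            rng.shuffle(col)
            shuffled = [dict(e, **{feat: v}) for e, v in zip(events, col)]
            flips += sum(model(e) != b for e, b in zip(shuffled, base))
        importance[feat] = flips / (trials * len(events))
    return importance

events = [
    {"failed_logins": 0, "off_hours": False, "port": 443},
    {"failed_logins": 9, "off_hours": True,  "port": 443},
    {"failed_logins": 9, "off_hours": False, "port": 22},
    {"failed_logins": 1, "off_hours": True,  "port": 22},
]
imp = permutation_importance(black_box, events)
print(imp)  # "port" scores 0; the two features the model uses do not
```

This illustrates the "imperfect and complex" caveat too: the explanation is statistical, sensitive to the probing data, and says which features mattered without saying *why* the alert fired.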
Confidentiality ● Other domains: ○ Public datasets ○ Public baselines ○ Publicly-released trained models ● All of those enable not only direct collaboration, but also a way to compare new methods and algorithms
Confidentiality Security: ● Companies bound by confidentiality ● No natively public data available ● Few publicly available datasets - small / outdated
Many researchers are struggling to find comprehensive and valid datasets to test and evaluate their proposed techniques and having a suitable dataset is a significant challenge in itself. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)
In order to test the efficiency of such mechanisms, reliable datasets are needed that (i) contain both benign and several attacks, (ii) meet real world criteria, and (iii) are publicly available. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
WHAT CAN WE DO TO CHANGE THIS?
1. Public Datasets
2. Benchmarks
3. Direct Collaboration
Public Datasets Benchmarks Direct Collaboration Encourage an active discussion & indirect collaboration, in the public domain, resulting in faster, better progress for the security domain as a whole.
Wrap Up ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
Thank you hi@weissnoa.com @NWeiss linkedin.com/in/noa-weiss www.weissnoa.com Presentation template by SlidesCarnival