The Unspoken Problems With Machine Learning in Security Noa Weiss
Hi! AI & Machine Learning Consultant ● Playing with data for over a decade ● Risk and Security ● PayPal, Armis
● Deep Voice foundation ● Leader of Women in Data Science Israel ● Mentor junior data scientists
Agenda ● Is the grass really greener? ○ ML - other domains ○ ML - security ● The things that hold us back ● Possible solutions
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
ML IN OTHER DOMAINS: COMPUTER VISION
Computer Vision Today ● Autonomous vehicles ● Facial recognition ● Generative AI
COMPUTER VISION: EXAMPLES
Image Completion Algorithm: Image-GPT
Sketches → Photorealism Algorithm: GauGAN
Sketches → Photorealism Algorithm: GauGAN Developed by Katherine Nicholls, PhD
Fictional People www.thispersondoesnotexist.com
Fictional People / Cats www.thiscatdoesnotexist.com
Fictional Everything www.thispersondoesnotexist.com www.thiscatdoesnotexist.com www.thishorsedoesnotexist.com/ www.thisartworkdoesnotexist.com/ www.thischemicaldoesnotexist.com/
ML IN OTHER DOMAINS: NATURAL LANGUAGE PROCESSING (NLP)
NLP Today ● Pretty good automatic translation ● Long-form question answering ● GPT-3
NLP: EXAMPLES
GPT-3 ● Language model (multi-purpose NLP model) ● Mostly generative ● Astonishing performance
GPT-3: Generative Code ● Free description of layout → JSX code ● (No task-specific training)
GPT-3: Generative Code ● Free description of ML model → model code!
GPT-3: Coding Interview
Google Duplex ● “Personal assistant” for phone reservations
Google Duplex
Security
ML in Security Today The good stuff: ● Some significant improvements in malware detection ○ Next-Generation Anti-Virus (NGAV) ● Some promise for network intrusion detection ○ Not yet prominent in practice
ML in Security Today ● All in all: ○ ML models with so-so performance ○ ML makes up only a small part of the core product ○ Data and ML technology under-utilized ● Lagging behind other domains
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
WHY?
Anomaly Detection Algorithms Algorithms aimed at identifying data points, events, or observations that deviate from a dataset's normal behavior ● Very common in Security ○ Algorithm task fits business needs ○ Unsupervised (no labels needed)
Anomaly Detection Algorithms Yet, not ideal for Security: ● High false positive rate (FPR) ○ Legitimate user activity is often anomalous ○ Higher cost of errors than in other domains ■ (Block legit activity? Wait for manual review?) ● Human-designed features are our “Ground Truth” ○ Very prone to human bias ○ Model only spots MOs we already know
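The false-positive problem above can be seen even in the simplest possible anomaly detector. The following is a minimal sketch (a z-score threshold over event counts, not any specific product's algorithm; the data and field names are invented for illustration): anything far from "normal" gets flagged, whether or not it is actually malicious.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean -- a toy stand-in for unsupervised
    anomaly detection: unusual == flagged, malicious or not."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Daily login counts for one user; the spike on the last day is
# legitimate (say, a company-wide password reset), not an attack --
# yet the detector flags it all the same.
logins = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6, 5, 4, 6, 60]
print(zscore_anomalies(logins))  # [13] -- a false positive
```

No labels were needed, which is exactly why the approach is popular in security; but nothing in the model distinguishes "anomalous" from "hostile", which is exactly the FPR problem described above.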
Changing Environment ● Most ML domains: mostly unchanging environment ○ E.g.: CV, NLP ● Environment in Security: ○ New devices ○ New apps ○ New protocols ○ Etc. ● This is a problem for a learning model
An Adapting Adversary ● As we become better at securing our devices and networks, attackers become better at outsmarting our defences ● This is a problem uncommon in most fields ○ E.g.: CV, NLP
Tagging ● How CV and NLP get tagged datasets ● Why we can’t do that in security ○ Expertise ○ Context ○ Confidentiality ○ Scale ● Bigger datasets = bigger tagging problems ○ Sampling?
Imbalanced Classes Different classes are extremely over/under-represented in the data ● Results in poor predictive performance (especially for the minority class)
Imbalanced Classes ● A major problem when aiming to identify fraud/attacks ● While common solutions exist, they are limited, and do not fully solve this problem
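A quick sketch of why imbalance bites, and of one of the common-but-partial remedies mentioned above (inverse-frequency class weighting). The 990/10 split and the label names are illustrative only; real attack rates are often far more extreme:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally
    larger weight, e.g. to scale their contribution to a loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# 990 benign events, 10 attacks -- a *mild* case by security standards.
labels = ["benign"] * 990 + ["attack"] * 10

# A model that never predicts "attack" is 99% accurate -- and useless.
accuracy = sum(y == "benign" for y in labels) / len(labels)
print(accuracy)               # 0.99
print(class_weights(labels))  # "attack" weighted 99x more than "benign"
```

Weighting (like oversampling or SMOTE-style synthesis) shifts the trade-off rather than removing it: the minority class is still represented by the same handful of examples, which is the "limited, do not fully solve" point above.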
Need for Explainability ● CV / NLP: mostly based on deep learning techniques ● Deep learning models are considered “black boxes” ● Security decision-making requires explainability (more so than other domains) ● DL could still be used with post-hoc explainability methods added on - but those are imperfect, and add complexity
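To make the "added-on explainability" idea concrete, here is a minimal permutation-importance sketch: shuffle one input feature at a time and count how often the black box's verdict changes. Everything here is invented for illustration (the `black_box` rule, the event fields), and this is the general post-hoc technique, not any specific product's explainer:

```python
import random

def black_box(event):
    # Stand-in for an opaque model; internally it only uses two fields.
    return event["failed_logins"] > 3 and event["off_hours"]

def permutation_importance(model, events, trials=20, seed=0):
    """Shuffle one feature at a time across events and measure the
    fraction of predictions that flip. High fraction == the model
    relies on that feature; zero == the feature is ignored."""
    rng = random.Random(seed)
    base = [model(e) for e in events]
    importance = {}
    for feat in events[0]:
        flips = 0
        for _ in range(trials):
            col = [e[feat] for e in events]
            rng.shuffle(col)
            shuffled = [dict(e, **{feat: v}) for e, v in zip(events, col)]
            flips += sum(model(e) != b for e, b in zip(shuffled, base))
        importance[feat] = flips / (trials * len(events))
    return importance

events = [
    {"failed_logins": 0, "off_hours": False, "port": 443},
    {"failed_logins": 9, "off_hours": True,  "port": 443},
    {"failed_logins": 9, "off_hours": False, "port": 22},
    {"failed_logins": 1, "off_hours": True,  "port": 22},
]
imp = permutation_importance(black_box, events)
print(imp)  # "port" scores 0; the two features the model uses do not
```

This illustrates the "imperfect and complex" caveat too: the explanation is statistical, sensitive to the probing data, and says which features mattered without saying *why* the alert fired.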
Confidentiality ● Other domains: ○ Public datasets ○ Public baselines ○ Publicly-released trained models ● All of those enable not only direct collaboration, but also a way to compare new methods and algorithms
Confidentiality Security: ● Companies bound by confidentiality ● No natively public data available ● Few publicly available datasets - small / outdated
Many researchers are struggling to find comprehensive and valid datasets to test and evaluate their proposed techniques and having a suitable dataset is a significant challenge in itself. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)
In order to test the efficiency of such mechanisms, reliable datasets are needed that (i) contain both benign and several attacks, (ii) meet real world criteria, and (iii) are publicly available. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)
Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
WHAT CAN WE DO TO CHANGE THIS?
1. Public Datasets
2. Benchmarks
3. Direct Collaboration
Public Datasets Benchmarks Direct Collaboration Encourage an active discussion & indirect collaboration, in the public domain, resulting in faster, better progress for the security domain as a whole.
Wrap Up ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do?
Thank you hi@weissnoa.com @NWeiss linkedin.com/in/noa-weiss www.weissnoa.com Presentation template by SlidesCarnival