Top 12 Languages

    id  Indonesian   3548
    en  English      1804
    tl  Tagalog       733
    es  Spanish       329
    so  Somali        305
    ja  Japanese      300
    pt  Portuguese    262
    ar  Arabic        256
    nl  Dutch         150
    it  Italian       137
    sw  Swahili       118
    fr  French         92

I guarantee you people aren't tweeting at me in Swahili.
Language Detection

Can't trust the language field in a user's profile data. Used character N-grams and character sets for detection (Text_LanguageDetect from PEAR, textcat from PECL). Detection has its own error rate, so it needs some post-processing. A sketch of the n-gram idea is below.
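Not from the deck: a minimal sketch of character n-gram detection under toy assumptions. The `ngrams`, `PROFILES`, and `detect` names are hypothetical, and a real detector like Text_LanguageDetect builds its profiles from large per-language corpora rather than the stand-in sentences here.

```python
# Hedged sketch of character n-gram language detection (illustrative only).
from collections import Counter

def ngrams(text, n=3):
    """Count character trigrams, padded with spaces at the edges."""
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical pre-built profiles; stand-ins for ones trained on real corpora.
PROFILES = {
    'en': ngrams('the quick brown fox jumps over the lazy dog'),
    'es': ngrams('el veloz zorro marron salta sobre el perro perezoso'),
}

def detect(text):
    """Pick the language whose trigram profile overlaps the sample most."""
    sample = ngrams(text)
    overlap = lambda prof: sum(min(c, prof[g]) for g, c in sample.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```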
EnglishNotEnglish

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it's English; else:
✓ If it's not a character-set-determined language, try harder:
✓ Tokenize into words
✓ Take the difference with an English vocabulary
✓ If words remain, run a part-of-speech tagger on each
✓ For NNS, VBZ, and VBD tags, run a stemming algorithm
✓ If the result is in the English vocabulary, remove it from the remaining set
✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
✓ If the ratio is < 20%, pretend it's English

A lot of this is heuristic-based, arrived at through trial and error; it seems to help with my corpus. A rough sketch follows this list.
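A rough sketch of that heuristic chain, assuming NLTK's tagger, stemmer, and word list stand in for whatever the talk actually used; the `probably_english` name and the clean-up regex are illustrative, not the deck's code.

```python
# Hedged sketch of the EnglishNotEnglish heuristic, using NLTK as a stand-in.
import re
import nltk
from nltk.stem import PorterStemmer

# Requires: nltk.download('words'), nltk.download('punkt'),
#           nltk.download('averaged_perceptron_tagger')
ENGLISH_VOCAB = set(w.lower() for w in nltk.corpus.words.words())
stemmer = PorterStemmer()

def probably_english(text, threshold=0.20):
    # Clean-up: strip @mentions, #hashtags, and links before tokenizing.
    text = re.sub(r'(@\w+|#\w+|https?://\S+)', '', text)
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    if not words:
        return True
    # Difference with the English vocabulary.
    remaining = [w for w in words if w not in ENGLISH_VOCAB]
    # POS-tag the leftovers; for plurals (NNS) and inflected verbs (VBZ, VBD),
    # stem and re-check the stem against the vocabulary.
    for word, tag in nltk.pos_tag(remaining):
        if tag in ('NNS', 'VBZ', 'VBD') and stemmer.stem(word) in ENGLISH_VOCAB:
            remaining.remove(word)
    return len(remaining) / len(words) < threshold
```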
BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.
Input: feature vectors. Output: labels (good/bad). With that, I had my input and output.
BIAS CORRECTION

One more thing to address.
BIAS CORRECTION

99% of the corpus was bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; the class imbalance needs to be corrected.
OVERSAMPLING

Oversampling: use multiple copies of good tweets to equalize with the bad. Problem: the bias is very high, so each good tweet would have to be copied about 100 times, and the copies would contribute no variance to the good category.
UNDERSAMPLING

Undersampling: drop most of the bad tweets to equalize with the good. Problem: the total corpus ends up being fewer than 200 tweets, not enough for training. A quick sketch of both strategies is below.
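For illustration only, not from the deck: the two sampling strategies in a few lines, assuming `good` and `bad` are lists of labeled feature vectors.

```python
# Hedged sketch of the two naive rebalancing strategies.
import random

def oversample(good, bad):
    """Duplicate minority-class examples until roughly balanced."""
    return good * (len(bad) // len(good)) + bad

def undersample(good, bad):
    """Keep only as many majority examples as there are minority ones."""
    return good + random.sample(bad, len(good))
```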
SYNTHETIC OVERSAMPLING

Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.
chance  feature
 90%    “good” language
 70%    no hashtags
 25%    1 hashtag
  5%    2 hashtags
  2%    @a at the end
 85%    rand length > 10

The actual synthesis is somewhat more complex and was also trial-and-error based; a simplified sketch follows. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
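A simplified sketch of the synthesis step using the probabilities from the table above; the feature names (`good_language`, `hashtags`, ...) are hypothetical stand-ins for the real feature vector.

```python
# Hedged sketch of synthetic oversampling; weights come from the table above,
# but the feature names and representation are illustrative assumptions.
import random

def synthesize_good_tweet():
    """Draw one synthetic 'good' feature vector by weighted random choice."""
    return {
        'good_language': random.random() < 0.90,
        # 70% no hashtags, 25% one hashtag, 5% two hashtags
        'hashtags': random.choices([0, 1, 2], weights=[70, 25, 5])[0],
        'mention_at_end': random.random() < 0.02,
        'long_enough': random.random() < 0.85,  # stands in for rand length > 10
    }

# e.g. synthesize until good + synthetic = 2/3 of the bad count (capped at 1000)
synthetic = [synthesize_good_tweet() for _ in range(900)]
```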
Model Training

We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
COST FUNCTION

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}),\, y^{(i)}\big)$$

Measures how far the prediction of the system is from reality. The cost depends on the parameters: the lower the cost, the closer we are to the ideal parameters for the model.
LOGISTIC COST

$$\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log(h_\theta(x)) & \text{if } y = 1 \\
-\log(1 - h_\theta(x)) & \text{if } y = 0
\end{cases}$$
LOGISTIC COST

[Plot: the two cost curves as a function of h_θ(x) ∈ [0, 1], one panel for y=1 and one for y=0.]

When y=1 and h(x) is 1 (a correct guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm: the cost grows without bound. The same holds, mirrored, for y=0. A numeric sketch is below.
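A hedged numeric sketch of the per-example logistic cost and the averaged cost J(θ) defined above; the epsilon guard is an implementation detail I've added, not something from the slides.

```python
import numpy as np

def logistic_cost(h, y):
    """Per-example cost: -log(h) when y = 1, -log(1 - h) when y = 0."""
    eps = 1e-12  # guard against log(0); not from the deck
    return -(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def J(theta, X, y):
    """Average cost over the m training examples, matching the formula above."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x): sigmoid of X.theta
    return np.mean(logistic_cost(h, y))
```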
MINIMIZE COST OVER θ

Finding the best values of θ that minimize the cost.
GRADIENT DESCENT

Start at a random point. Pretend you're standing on a hill: find the direction of steepest descent and take a step. Repeat. Imagine a ball rolling down a hill. A minimal sketch is below.
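A minimal batch gradient descent sketch for logistic regression; the learning rate `alpha` and the iteration count are illustrative choices, not the talk's settings.

```python
# Hedged sketch: batch gradient descent on the logistic cost J(theta).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """X is the (m x n) feature matrix, y the (m,) label vector."""
    m, n = X.shape
    theta = np.zeros(n)               # starting point
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # predictions for all m examples
        grad = (X.T @ (h - y)) / m    # gradient of J(theta)
        theta -= alpha * grad         # take a step downhill
    return theta
```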