Top 12 Languages

    id  Indonesian   3548
    en  English      1804
    tl  Tagalog       733
    es  Spanish       329
    so  Somali        305
    ja  Japanese      300
    pt  Portuguese    262
    ar  Arabic        256
    nl  Dutch         150
    it  Italian       137
    sw  Swahili       118
    fr  French         92

I guarantee you people aren't tweeting at me in Swahili.
Language Detection

Can't trust the language field in a user's profile data. Used character N-grams and character sets for detection (Text_LanguageDetect from PEAR, textcat from PECL). Detection has its own error rate, so it needs some post-processing. A sketch of the n-gram idea is below.
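Not from the deck: a minimal sketch of character n-gram detection under toy assumptions. The `ngrams`, `PROFILES`, and `detect` names are hypothetical, and a real detector like Text_LanguageDetect builds its profiles from large per-language corpora rather than the stand-in sentences here.

```python
# Hedged sketch of character n-gram language detection (illustrative only).
from collections import Counter

def ngrams(text, n=3):
    """Count character trigrams, padded with spaces at the edges."""
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical pre-built profiles; stand-ins for ones trained on real corpora.
PROFILES = {
    'en': ngrams('the quick brown fox jumps over the lazy dog'),
    'es': ngrams('el veloz zorro marron salta sobre el perro perezoso'),
}

def detect(text):
    """Pick the language whose trigram profile overlaps the sample most."""
    sample = ngrams(text)
    overlap = lambda prof: sum(min(c, prof[g]) for g, c in sample.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```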
EnglishNotEnglish

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it's English; else:
✓ If it's not a character-set-determined language, try harder:
✓ Tokenize into words
✓ Take the difference with an English vocabulary
✓ If words remain, run a part-of-speech tagger on each
✓ For NNS, VBZ, and VBD tags, run a stemming algorithm
✓ If the result is in the English vocabulary, remove it from the remaining set
✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
✓ If the ratio is < 20%, pretend it's English

A lot of this is heuristic-based, arrived at through trial and error; it seems to help with my corpus. A rough sketch follows this list.
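A rough sketch of that heuristic chain, assuming NLTK's tagger, stemmer, and word list stand in for whatever the talk actually used; the `probably_english` name and the clean-up regex are illustrative, not the deck's code.

```python
# Hedged sketch of the EnglishNotEnglish heuristic, using NLTK as a stand-in.
import re
import nltk
from nltk.stem import PorterStemmer

# Requires: nltk.download('words'), nltk.download('punkt'),
#           nltk.download('averaged_perceptron_tagger')
ENGLISH_VOCAB = set(w.lower() for w in nltk.corpus.words.words())
stemmer = PorterStemmer()

def probably_english(text, threshold=0.20):
    # Clean-up: strip @mentions, #hashtags, and links before tokenizing.
    text = re.sub(r'(@\w+|#\w+|https?://\S+)', '', text)
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    if not words:
        return True
    # Difference with the English vocabulary.
    remaining = [w for w in words if w not in ENGLISH_VOCAB]
    # POS-tag the leftovers; for plurals (NNS) and inflected verbs (VBZ, VBD),
    # stem and re-check the stem against the vocabulary.
    for word, tag in nltk.pos_tag(remaining):
        if tag in ('NNS', 'VBZ', 'VBD') and stemmer.stem(word) in ENGLISH_VOCAB:
            remaining.remove(word)
    return len(remaining) / len(words) < threshold
```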
BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.
Input: feature vectors. Output: labels (good/bad). With that, I had my input and output.
BIAS CORRECTION

One more thing to address.
BIAS CORRECTION

99% of the corpus was bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; the class imbalance needs to be corrected.
OVERSAMPLING

Oversampling: use multiple copies of good tweets to equalize with the bad. Problem: the bias is very high, so each good tweet would have to be copied about 100 times, and the copies would contribute no variance to the good category.
UNDERSAMPLING

Undersampling: drop most of the bad tweets to equalize with the good. Problem: the total corpus ends up being fewer than 200 tweets, not enough for training. A quick sketch of both strategies is below.
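For illustration only, not from the deck: the two sampling strategies in a few lines, assuming `good` and `bad` are lists of labeled feature vectors.

```python
# Hedged sketch of the two naive rebalancing strategies.
import random

def oversample(good, bad):
    """Duplicate minority-class examples until roughly balanced."""
    return good * (len(bad) // len(good)) + bad

def undersample(good, bad):
    """Keep only as many majority examples as there are minority ones."""
    return good + random.sample(bad, len(good))
```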
SYNTHETIC OVERSAMPLING

Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.
chance  feature
 90%    “good” language
 70%    no hashtags
 25%    1 hashtag
  5%    2 hashtags
  2%    @a at the end
 85%    rand length > 10

The actual synthesis is somewhat more complex and was also trial-and-error based; a simplified sketch follows. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
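A simplified sketch of the synthesis step using the probabilities from the table above; the feature names (`good_language`, `hashtags`, ...) are hypothetical stand-ins for the real feature vector.

```python
# Hedged sketch of synthetic oversampling; weights come from the table above,
# but the feature names and representation are illustrative assumptions.
import random

def synthesize_good_tweet():
    """Draw one synthetic 'good' feature vector by weighted random choice."""
    return {
        'good_language': random.random() < 0.90,
        # 70% no hashtags, 25% one hashtag, 5% two hashtags
        'hashtags': random.choices([0, 1, 2], weights=[70, 25, 5])[0],
        'mention_at_end': random.random() < 0.02,
        'long_enough': random.random() < 0.85,  # stands in for rand length > 10
    }

# e.g. synthesize until good + synthetic = 2/3 of the bad count (capped at 1000)
synthetic = [synthesize_good_tweet() for _ in range(900)]
```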
Model Training

We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
COST FUNCTION

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}),\, y^{(i)}\big)$$

Measures how far the prediction of the system is from reality. The cost depends on the parameters: the lower the cost, the closer we are to the ideal parameters for the model.
LOGISTIC COST

$$\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log(h_\theta(x)) & \text{if } y = 1 \\
-\log(1 - h_\theta(x)) & \text{if } y = 0
\end{cases}$$
LOGISTIC COST

[Plot: the two cost curves as a function of h_θ(x) ∈ [0, 1], one panel for y=1 and one for y=0.]

When y=1 and h(x) is 1 (a correct guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm: the cost grows without bound. The same holds, mirrored, for y=0. A numeric sketch is below.
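A hedged numeric sketch of the per-example logistic cost and the averaged cost J(θ) defined above; the epsilon guard is an implementation detail I've added, not something from the slides.

```python
import numpy as np

def logistic_cost(h, y):
    """Per-example cost: -log(h) when y = 1, -log(1 - h) when y = 0."""
    eps = 1e-12  # guard against log(0); not from the deck
    return -(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def J(theta, X, y):
    """Average cost over the m training examples, matching the formula above."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x): sigmoid of X.theta
    return np.mean(logistic_cost(h, y))
```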
MINIMIZE COST OVER θ

Finding the best values of θ that minimize the cost.
GRADIENT DESCENT

Start at a random point. Pretend you're standing on a hill: find the direction of steepest descent and take a step. Repeat. Imagine a ball rolling down a hill. A minimal sketch is below.
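A minimal batch gradient descent sketch for logistic regression; the learning rate `alpha` and the iteration count are illustrative choices, not the talk's settings.

```python
# Hedged sketch: batch gradient descent on the logistic cost J(theta).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """X is the (m x n) feature matrix, y the (m,) label vector."""
    m, n = X.shape
    theta = np.zeros(n)               # starting point
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # predictions for all m examples
        grad = (X.T @ (h - y)) / m    # gradient of J(theta)
        theta -= alpha * grad         # take a step downhill
    return theta
```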