Statistical Natural Language Processing
Text Classification
Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2017

Some examples: is it spam?
From: Dr Pius Ayim <>
Subject: Dear Friend / Lets work together
Dear Friend,
My name is Dr. Pius Anyim, former senate president of the Republic Nigeria under regime of Jonathan Good-luck. I am sorry to invade your privacy; but the ongoing ANTI-CORRUPTION GRAFT agenda of the rulling government is a BIG problem that I had to get your contact via a generic search on internet as a result of looking for a reliable person that will help me to retrieve funds I … deposited at a financial institute in Europe.
* Fresh from this morning

Some examples: is the customer happy?
I never understood what's the BIG deal behind this album. Yes, the production is wonderfull but the songwriting is childish and rubbish. They definitly can not write great lyrics like Bob Dylan sometimes do. "God Only Know" and "Wouldnt Be nice" are indeed masterpieces...but the rest of the album is background music.
@DB_Bahn mußten sie für den Sauna-Besuch zuzahlen ? (German, roughly: "Did you have to pay extra for the sauna visit?")
• Sentiment analysis is currently one of the most popular applications of text classification

Some examples: which language is this text in?
Član 3. Svako ima pravo na život, slobodu i ličnu bezbjednost.
• Detecting the language of a text is often the first step for many NLP applications
• Extremely easy for the most part, but tricky for
  – closely related languages
  – text with code-switching

More questions
• Who wrote the book?
• What category should a product be listed under, based on its description?
• Find the author’s
  – age
  – gender
  – native language
  – political party affiliation
• What is the genre of the book?
• Which department should answer the support email?
• Is the author depressed?
• Is this news about
  – politics
  – sports
  – travel
  – economy
• What is the proficiency level of a language learner?
• What grade should a student essay get?
• Is the web site an institutional or personal web page?
• What is the diagnosis, given a doctor’s report?

Text classification
• In many NLP applications we need to classify text documents
• Documents of interest vary from short messages to complete books
• The classification task can be binary or multi-class
• The core part of the solution is a classifier
• The way to extract features from the documents is important (and interacts with the classification method)

Text classification: the definition
• Given a document, our aim is to classify it into one (or more) of the known classes
• During prediction
  input: a document
  output: the predicted document class
• During training
  input: a set of documents with associated labels
  output: a classifier
Essentially, the task is supervised learning (classification).

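The two phases above can be sketched as a pair of functions. This is only an illustration of the input/output contract: the scoring rule (relative frequency of the document's words in each class) is a toy stand-in chosen for brevity, not a method from these slides, and the data and names (`train`, `classify`, `docs`) are invented.

```python
from collections import Counter, defaultdict


def train(documents, labels):
    """Training: input labeled documents, output a classifier (a function)."""
    # Count how often each lowercased word occurs in each class.
    word_counts = defaultdict(Counter)
    for text, label in zip(documents, labels):
        word_counts[label].update(text.lower().split())

    def classify(document):
        """Prediction: input a document, output the predicted class."""
        words = document.lower().split()

        def score(label):
            # Toy score: share of the class's tokens that match the document.
            total = sum(word_counts[label].values())
            return sum(word_counts[label][w] for w in words) / total

        # sorted() makes tie-breaking deterministic.
        return max(sorted(word_counts), key=score)

    return classify


# Hypothetical toy data, invented for illustration only.
docs = ["win money now", "cheap money offer", "meeting at noon", "lunch meeting today"]
labels = ["spam", "spam", "ham", "ham"]

classifier = train(docs, labels)
prediction = classifier("cheap money")
```

Training returns a classifier and prediction only needs a document, which matches the input/output pairs listed above.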
How about a rule-based method?
• They exist, and are still often used in industry
• Rule-based approaches are language specific
• It is difficult to adapt them to new environments
We will stick to statistical / machine learning approaches

Supervised learning
[Figure: supervised learning pipeline. Training: training data → features and labels → ML algorithm → ML model. Prediction: new data → features → ML model → predicted label.]

Two important parts
• How do we represent a document?
  – what features to use? words? characters? both? n-grams of words or characters?
  – what value to assign to each feature?
• What classification algorithm should we use?

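As a rough illustration of the feature choices listed above, word and character n-grams can be extracted with a sliding window. The function names and the whitespace tokenization are assumptions made for this sketch.

```python
def word_ngrams(text, n):
    """Return all word n-grams of `text` as tuples (whitespace tokenization)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all character n-grams of `text` as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word_bigrams = word_ngrams("a good thing", 2)
char_trigrams = char_ngrams("good", 3)
```

Both functions return an empty list when the text is shorter than `n`.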
Bag of words (BoW) representation
The idea: use words that occur in text as features without paying attention to their order.
The document:
It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.
BoW representation:
film what , proof titan a have movies about . come clue to this that from sci-fi a.e. “ ” supposed be hollywood is animated it do because . ’s know does most how japan good n’t I thing

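A minimal sketch of the "no attention to order" property, assuming lowercasing and whitespace tokenization (the slide's own bag appears to use a finer tokenizer that also splits off punctuation and clitics such as "n't"):

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to a multiset of its lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order get the same bag.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
```

Here `same_bag` is `True`, which is exactly the information a bag-of-words representation discards.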
Bag of words representation with binary features
The document:
It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.
• If the word is in the document, the feature has the value of 1, otherwise 0
• The feature vector contains values for all words in our document collection
feature: value
a: 1
thing: 1
do: 1
to: 1
have: 1
good: 1
be: 1
clue: 1
great: 0
pathetic: 0
masterpiece: 0
…

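The binary encoding can be sketched as follows. The vocabulary is a subset of the features shown on the slide; the token list is only the first clause of the example document, pre-tokenized by hand, so the resulting vector differs from the slide's table.

```python
def binary_features(tokens, vocabulary):
    """Return 1 for each vocabulary word present in `tokens`, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = ["a", "thing", "do", "have", "good", "be", "clue",
              "great", "pathetic", "masterpiece"]
# First clause of the example document only (hand-tokenized assumption).
tokens = "it 's a good thing most animated sci-fi movies come from japan".split()

vector = binary_features(tokens, vocabulary)
```

The vector has one position per vocabulary word, whether or not the word occurs in the document.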
Bag of words representation with (document) frequencies
The document:
It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.
• Use frequencies rather than binary vectors
• May help in some cases, but
  – effect of document length
  – frequent is not always good
feature: value
a: 2
to: 2
do: 2
thing: 1
have: 1
good: 1
be: 1
clue: 1
great: 0
pathetic: 0
masterpiece: 0
…

Bag of words representation with relative frequencies
The document:
It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.
• Relative frequencies are less sensitive to document length
• Still, high-frequency words dominate
feature: value
a: 0.06
to: 0.06
do: 0.06
thing: 0.03
have: 0.03
good: 0.03
be: 0.03
clue: 0.03
great: 0.00
pathetic: 0.00
masterpiece: 0.00
…

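The effect of document length on raw counts, and how relative frequencies reduce it, can be illustrated with a small sketch. The vocabulary, the documents, and the function names are invented for illustration; `relative_features` assumes a non-empty token list.

```python
from collections import Counter


def count_features(tokens, vocabulary):
    """Raw frequency of each vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_features(tokens, vocabulary):
    """Frequency divided by document length (assumes tokens is non-empty)."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


vocabulary = ["a", "good", "great"]
short_doc = "a good thing a clue".split()
long_doc = short_doc * 2  # the same text repeated twice

short_counts = count_features(short_doc, vocabulary)
long_counts = count_features(long_doc, vocabulary)
short_rel = relative_features(short_doc, vocabulary)
long_rel = relative_features(long_doc, vocabulary)
```

Repeating the text doubles every raw count, while the relative frequencies stay the same. A frequent word such as "a" still gets the largest value in both representations.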
tf-idf weighting
• Intuition:
  – Words that appear multiple times in a document are important/representative for the document
  – Words that appear in many documents are not specific
• tf-idf uses two components
  tf: term frequency - frequency of the word in the document
  idf: inverse document frequency - inverse of the ratio of documents that contain the term
• Both components are typically normalized
\[ \text{tf-idf}_{t,d} = \frac{C_{t,d}}{|d|} \times \log \frac{N}{n_t} \]
where C_{t,d} is the term count in the document, |d| is the document length, N is the number of documents, and n_t is the number of documents with t.

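A minimal sketch of this weighting, assuming documents are token lists and the logarithm is the natural log (the slide does not state a base). The corpus is invented for illustration, and returning 0.0 for a term that occurs in no document is a convention chosen here, not something the slide specifies.

```python
import math


def tf_idf(term, doc, corpus):
    """(count of term in doc / doc length) * log(N / n_t), natural log assumed."""
    tf = doc.count(term) / len(doc)
    n_t = sum(1 for d in corpus if term in d)  # documents containing the term
    if n_t == 0:
        return 0.0  # convention for unseen terms (assumption)
    return tf * math.log(len(corpus) / n_t)


# Hypothetical toy corpus of pre-tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

weight_the = tf_idf("the", corpus[0], corpus)
weight_dog = tf_idf("dog", corpus[1], corpus)
weight_cat = tf_idf("cat", corpus[0], corpus)
```

A word that occurs in every document ("the") gets weight 0 because log(N / n_t) = log(1) = 0, while a word found in a single document ("dog") gets the largest weight.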
tf-idf example
tf-idf(t, d) = tf(t, d) × idf(t)
Document 1 (d1): the: 5, good: 2, bad: 1
Document 2 (d2): a: 2, book: 2, the: 1
Document 3 (d3): a: 1, good: 2, the: 3
tf-idf(good, d1) = ?
tf-idf(bad, d1) = ?
tf-idf(the, d1) = ?
tf-idf(good, d3) = ?

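One possible working of the exercise is sketched below. Two assumptions are made. First, the pairing of words and counts is taken to be "the" = 5, "good" = 2, "bad" = 1 in d1 (length 8) and "good" = 2 in d3 (length 6), with "good" occurring in two of the three documents, "bad" in one, and "the" in all three. Second, the natural logarithm is used, since no base is stated; another base would rescale all values by the same factor.

```latex
\begin{align*}
\text{tf-idf}(\text{good}, d_1) &= \frac{2}{8} \times \log\frac{3}{2} \approx 0.101 \\
\text{tf-idf}(\text{bad}, d_1)  &= \frac{1}{8} \times \log\frac{3}{1} \approx 0.137 \\
\text{tf-idf}(\text{the}, d_1)  &= \frac{5}{8} \times \log\frac{3}{3} = 0 \\
\text{tf-idf}(\text{good}, d_3) &= \frac{2}{6} \times \log\frac{3}{2} \approx 0.135
\end{align*}
```

Under these assumptions, "the" is the most frequent word in d1 but gets weight 0 because it occurs in every document, while "bad" gets the highest weight in d1 despite occurring only once.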