Text Processing • Information Retrieval • Inf 141 / CS 121
Tokenization • Break the input into words • Character stream -> token stream • Called a tokenizer / lexer / scanner • Compiler front-end • Lexer hooks up to parser • Preprocessor for information retrieval • Lexer feeds tokens to retrieval system
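A minimal sketch of the character stream -> token stream idea in Java (class name and input are illustrative), splitting on whitespace:

```java
// Minimal sketch: turn a character stream into a token stream by
// splitting on runs of whitespace. Names and input are illustrative.
import java.util.Arrays;
import java.util.List;

public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        // Split on one or more whitespace characters.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Character stream in, token stream out.
        System.out.println(tokenize("Text processing for information retrieval"));
        // -> [Text, processing, for, information, retrieval]
    }
}
```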
Identifying Tokens • Divide on whitespace and throw away punctuation? • What is a token? Depends… • Apostrophes • O’Neill • aren’t • Hyphen-handling • clear-headed vs clearheaded • mother-in-law
Identifying Tokens • Multiple words as single token? • San Francisco • white space • New York University vs York University • Tokens that aren’t words • jossher@uci.edu • http://www.ics.uci.edu/~lopes • 192.168.0.1
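One way to handle tokens that aren't plain words is a pattern-based tokenizer. The sketch below uses deliberately simplified regular expressions (not production-quality) to keep e-mail addresses, URLs, and hyphen/apostrophe words intact:

```java
// Sketch of pattern-based tokenization that keeps e-mail addresses, URLs,
// and hyphenated/apostrophe words as single tokens. Patterns are simplified
// illustrations, not production-quality.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTokenizer {
    // Order matters: try the more specific patterns (URL, e-mail) first.
    private static final Pattern TOKEN = Pattern.compile(
        "https?://[\\w./~-]+"            // URLs
        + "|[\\w.+-]+@[\\w.-]+"          // e-mail addresses
        + "|\\w+(?:[-']\\w+)*"           // words, incl. mother-in-law, aren't
    );

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "Email jossher@uci.edu or see http://www.ics.uci.edu/~lopes, aren't you?"));
        // -> [Email, jossher@uci.edu, or, see, http://www.ics.uci.edu/~lopes, aren't, you]
    }
}
```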
Markup as Tokens • Many documents are structured using markup • HTML, XML, ePub, etc… • What to do about tags? • Include them as tokens • Filter them out entirely • Filter tokens based on tags
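A sketch of the "filter them out entirely" option, assuming a simple regex is acceptable; a real HTML parser is more robust against malformed markup:

```java
// Sketch: strip HTML/XML tags before tokenizing. A simple regex is enough
// to illustrate the idea, though a real parser handles messy markup better.
public class MarkupFilter {
    public static String stripTags(String markup) {
        // Remove anything of the form <...>, then collapse whitespace.
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Information <b>Retrieval</b> is fun</p>";
        System.out.println(stripTags(html));
        // -> Information Retrieval is fun
    }
}
```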
Advanced Tokenization • Tokenization can do more than break a character stream into tokens • Programming language tokenizers use specific grammars • Can identify comments, literals • Associate a type with each token
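A small sketch of associating a type with each token, in the spirit of programming-language lexers; the token types and patterns are illustrative:

```java
// Sketch of a tokenizer that attaches a type to each token.
// Types and patterns are illustrative, not a full lexer grammar.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TypedTokenizer {
    enum TokenType { NUMBER, WORD, PUNCTUATION }

    // One capture group per token type; the group that matched tells the type.
    private static final Pattern TOKEN =
        Pattern.compile("(\\d+)|([A-Za-z]+)|(\\S)");

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("Lexers emit 42 typed tokens!");
        while (m.find()) {
            TokenType type = m.group(1) != null ? TokenType.NUMBER
                           : m.group(2) != null ? TokenType.WORD
                           : TokenType.PUNCTUATION;
            System.out.println(type + "\t" + m.group());
        }
    }
}
```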
Writing a Tokenizer • while loop looking for delimiters • Fast to write and execute • Hard to maintain and easy to mess up • Java library methods • java.util.Scanner • java.lang.String.split() • java.util.StringTokenizer
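A quick sketch comparing the three library approaches above on the same input; the delimiters and output format are only for illustration:

```java
// The three library approaches from the slide, side by side on the same input.
import java.util.Scanner;
import java.util.StringTokenizer;

public class LibraryTokenizers {
    public static void main(String[] args) {
        String text = "to be, or not to be";

        // java.util.Scanner: pulls tokens off a stream, whitespace-delimited
        // by default (note the comma stays attached to "be,").
        Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            System.out.print(scanner.next() + " | ");
        }
        scanner.close();
        System.out.println();

        // java.lang.String.split(): one regex call, returns an array.
        for (String token : text.split("[\\s,]+")) {
            System.out.print(token + " | ");
        }
        System.out.println();

        // java.util.StringTokenizer: legacy class, delimiter characters only.
        StringTokenizer st = new StringTokenizer(text, " ,");
        while (st.hasMoreTokens()) {
            System.out.print(st.nextToken() + " | ");
        }
        System.out.println();
    }
}
```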
Writing a Tokenizer • Deterministic Finite Automaton (DFA) • Finite set of states • Alphabet • Transition function • Start state • End states • [Diagram: Start and End states with transitions on a-z, space, and \n]
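One way to hand-code the DFA sketched above (the End state is folded into emitting a token and returning to Start; the a-z alphabet matches the slide, so digits and uppercase are simply ignored):

```java
// Sketch of a hand-coded DFA tokenizer: states START and IN_WORD,
// alphabet a-z, with space or \n ending a token.
import java.util.ArrayList;
import java.util.List;

public class DfaTokenizer {
    private enum State { START, IN_WORD }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.START;

        for (char c : input.toCharArray()) {
            boolean isLetter = c >= 'a' && c <= 'z';   // alphabet a-z, as on the slide
            switch (state) {
                case START:
                    if (isLetter) {                    // transition START -> IN_WORD
                        current.append(c);
                        state = State.IN_WORD;
                    }                                  // on space/\n, stay in START
                    break;
                case IN_WORD:
                    if (isLetter) {                    // stay in IN_WORD
                        current.append(c);
                    } else {                           // space/\n ends the token
                        tokens.add(current.toString());
                        current.setLength(0);
                        state = State.START;
                    }
                    break;
            }
        }
        if (current.length() > 0) {                    // flush a trailing token
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("the quick brown\nfox"));
        // -> [the, quick, brown, fox]
    }
}
```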
Generating a Tokenizer • Numerous open source tools • ANTLR, JFlex, JavaCC • Usually focused on programming languages • Specify the grammar, tool generates the program • Easy to maintain • Very flexible
Dropping Common Terms • Very common words can be bad for IR systems • he has it on that and as by with a… • Stop words • Use up lots of space in the index • Match nearly every document • Rarely central to document’s meaning • How to detect them? • Assignment part b
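A sketch of one detection approach, flagging the most frequent terms in a collection as stop-word candidates; the corpus and cut-off below are placeholders, not the assignment's specification:

```java
// Sketch: count term frequencies over a collection and flag the most
// frequent terms as stop-word candidates. Corpus and cut-off are placeholders.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StopWordDetector {
    public static List<String> topTerms(List<String> documents, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())  // most frequent first
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "he has it on that and as by with a book",
                "a book and a pen on the desk",
                "it has a cover and a spine");
        // Most frequent terms across the collection = stop-word candidates.
        System.out.println(topTerms(docs, 3));
    }
}
```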
Drop Common Terms? • Should you remove stop words? • Flights to London vs Flights London • Flights from London vs Flights London • How to search for “to be or not to be”? • The trend in Information Retrieval is to not use stop word lists • Removal is replaced by statistical techniques such as term weighting
Normalization • Canonicalize tokens so that superficial differences don’t matter • USA = U.S.A. = usa • C.A.T = cat? • Techniques • Remove accents & diacritics • Case-folding • Collapse alternate spellings
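A sketch of two of these normalization steps (case-folding and accent removal via Unicode decomposition), plus dropping periods so that U.S.A. and usa collapse; the rules are illustrative, not exhaustive:

```java
// Sketch of token normalization: case-folding, removing diacritics via
// Unicode decomposition, and dropping periods. Illustrative, not exhaustive.
import java.text.Normalizer;

public class TokenNormalizer {
    public static String normalize(String token) {
        String lower = token.toLowerCase();                               // case-folding
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")      // drop accents
                .replace(".", "");                                        // U.S.A. -> usa
    }

    public static void main(String[] args) {
        System.out.println(normalize("U.S.A."));   // -> usa
        System.out.println(normalize("Résumé"));   // -> resume
    }
}
```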
Stemming and Lemmatisation • Reduce word variants to a single version • am, are, is => be • Stemming • Reduce words to their stem by chopping off suffixes • Lemmatisation • Remove inflections to arrive at the base dictionary form of the word, called a lemma
Porter’s Algorithm • Most common algorithm for stemming English • 5 phases of sequential word reduction • Stage 1 example • SSES -> SS caresses -> caress • IES -> I ponies -> poni • SS -> SS caress -> caress • S -> (null) cats -> cat
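A sketch of just the Stage 1a rules listed above; the full Porter stemmer applies several more phases of rules:

```java
// Sketch of Porter Stage 1a only (SSES->SS, IES->I, SS->SS, S->null).
// The full algorithm has several more phases.
public class PorterStage1a {
    public static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);   // caresses -> caress
        } else if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);   // ponies -> poni
        } else if (word.endsWith("ss")) {
            return word;                                   // caress -> caress
        } else if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);   // cats -> cat
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```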
Stemming Example
Stemming vs Lemmatisation • Stemming is easy (ish) • Fairly simple set of rules • Lemmatisation is hard • Requires complete vocabulary and morphological analysis • Which is better for retrieval? • Depends… • Both improve recall and harm precision
Acronym Expansion • Expands acronyms and abbreviations into their full form • USA -> united states of america • In4matx -> informatics • Usefulness depends on domain • Source code retrieval is greatly aided
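A sketch of dictionary-based acronym expansion; the lookup table below is a tiny hypothetical example, and a real system would load a domain-specific list:

```java
// Sketch of dictionary-based acronym expansion. The table is a tiny
// hypothetical example; real systems load a domain-specific list.
import java.util.Map;

public class AcronymExpander {
    private static final Map<String, String> EXPANSIONS = Map.of(
            "usa", "united states of america",
            "in4matx", "informatics");

    public static String expand(String token) {
        // Fall back to the original token when no expansion is known.
        return EXPANSIONS.getOrDefault(token.toLowerCase(), token);
    }

    public static void main(String[] args) {
        System.out.println(expand("USA"));       // -> united states of america
        System.out.println(expand("In4matx"));   // -> informatics
        System.out.println(expand("token"));     // -> token
    }
}
```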
Language Differences • Some languages have more morphology than English • Spanish, German, Latin • German has compound words • Chinese and Japanese don’t separate words with spaces • The French word for “the” can appear as a prefix (l’) that changes depending on context