Center for Reflected Text Analytics Lecture 2 Annotation tools & Segmentation
Summary of Part 1
• Annotation theory
  • Guidelines
  • Inter-annotator agreement
  • Inter-subjective annotations
• Annotation exercise
  • Discuss disagreements with your neighbor
  • Improve annotation guidelines
University of Stuttgart
Annotation Tool Support
• Tools can support the annotation process at various stages
• Managing multiple annotators
  • Assign documents to annotate
  • Supervise their progress
• Analyse disagreements
  • Display disagreements (only)
  • Calculate quantitative IAA (κ)
• Create a gold standard
  • Make decisions on disagreements
  • Record final decisions
• Usable tools: See handout
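The quantitative IAA step can be sketched in a few lines. This is a minimal illustration that assumes κ refers to Cohen's kappa for two annotators; the function name `cohen_kappa` and the label lists are made up for the example and are not taken from any of the tools on the handout.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical annotations of six tokens by two annotators.
annotator_1 = ["PER", "PER", "LOC", "O", "O", "O"]
annotator_2 = ["PER", "LOC", "LOC", "O", "O", "PER"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six items (0.67), but a third of that agreement is expected by chance, which is why κ comes out lower than the raw agreement.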
Segmentation
Segmentation Tool Download
• http://tinyurl.com/cretanetworker
• Short link for http://www2.ims.uni-stuttgart.de/gcl/reiterns/creta/CRETANetworker.jar
Segmentation
• Abstract definition
  • The task of separating a text into multiple parts (“segments”)
  • No meaning of a segment implied
• Segmentation according to various criteria based on
  • Structure (chapters, acts, letters, speeches)
  • Linguistics (sentences, paragraphs)
  • Narrative content (scenes, time, place)
  • Content (topics under discussion)
• No generic criterion covering multiple research questions
Segmentation Viewpoints
• Focus on segments
  • Spans of text
• Focus on segment boundaries
  • Positions in a text
• Views are equivalent: we will switch between them when appropriate
[Figure: a text divided into Segment 1 to Segment 4, with the three boundaries between them marked]
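The equivalence of the two views can be made concrete with a small sketch. It assumes that positions are token offsets and that segments are contiguous, end-exclusive spans; both function names are hypothetical.

```python
def boundaries_to_segments(boundaries, text_length):
    """Turn inner boundary positions into (start, end) spans, end exclusive."""
    cuts = [0] + sorted(boundaries) + [text_length]
    return list(zip(cuts[:-1], cuts[1:]))

def segments_to_boundaries(segments):
    """Recover the inner boundaries from contiguous (start, end) spans."""
    return [start for start, _ in segments[1:]]

# Three boundaries in a 20-token text yield four segments, and back again.
spans = boundaries_to_segments([4, 9, 15], text_length=20)
print(spans)
print(segments_to_boundaries(spans))
```

Converting in either direction loses no information, so a tool can store whichever representation is more convenient.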
Entities + Segments = Networks
[Figure: co-occurrence network with the nodes Mary, Peter, and Paul]
Entities + Segments = Networks
Slightly more abstract description
• Segmented text with the appearing entities: ⟨ {A, B}, {A, B, B, B, A}, {A, C} ⟩
• Convert into a (square) adjacency matrix
  • Diagonal is typically uninteresting
  • Matrix is symmetric

        A   B   C
    A   -   2   1
    B   2   -   0
    C   1   0   -

• Create network
  • A node is created for each row (or column)
  • An edge is created for each cell, weighted according to the cell value
[Figure: resulting network with the nodes A, B, and C]
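The conversion from segments to edge weights can be sketched as follows. The sketch assumes that a pair of entities is counted once per segment in which both occur, however often each is mentioned, which reproduces the matrix values of the example; the function name `cooccurrence` is made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(segments):
    """Count, for each unordered entity pair, the segments containing both."""
    counts = Counter()
    for segment in segments:
        entities = sorted(set(segment))  # repeated mentions count only once
        for pair in combinations(entities, 2):
            counts[pair] += 1
    return counts

segments = [["A", "B"], ["A", "B", "B", "B", "A"], ["A", "C"]]
edges = cooccurrence(segments)
print(edges[("A", "B")], edges[("A", "C")], edges[("B", "C")])
```

Only the upper triangle is stored, because the matrix is symmetric and the diagonal is not needed.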
Segmentation Annotation
• Theoretically
  • Segments can be annotated just like entity references
  • Both cover sequences of words
  • Appropriate annotation guidelines would define when to annotate segments
• Practically
  • Segmentation criterion closely tied to research question
  • No reasonable generic abstraction layer that works for multiple research questions and/or text corpora
  • Single texts only contain a few segments
  • Many more annotated texts needed for any kind of automation
Segment Annotation Tool
• Web-based UI
• Beta software
• Automatic annotation through rules and tools
• Entity annotation
  • Stanford Named Entity Recognizer (Finkel et al., 2005)
    • Only proper names, no descriptive noun phrases
  • Rules (regular expressions) to specify the entity references
• Segment annotation
  • Rules (regular expressions) to specify the segment boundaries
  • Unsupervised segmentation algorithm (TextTiling; Hearst, 1994)
• Network export → Gephi
Gephi Network Tool
• Free and open source
• https://gephi.org
• Wide range of metrics, filters, and layout algorithms
• Network editing (e.g., merge nodes)
• Plugins
• Export into static images
Demo
Regular Expressions
Useful text processing skills 101
• A powerful way to describe sets of character sequences
• Many search tools support REs, and so do most programming languages
• They look cryptic, but are quite systematic
• REs on slides/handout are enclosed in forward slashes / / for readability
  • The slashes don’t need to be typed in the tool
• Basics
  • Many ordinary characters stand for themselves
    • The RE /a/ finds occurrences of the character “a”
  • Sequences of characters stand for sequences of themselves
    • The RE /the/ finds occurrences of the string “the”
Regular Expressions
Basics
• Meta characters (“quantifiers”) apply to the preceding character
  • ?: preceding character is optional (0 or 1 times)
    • /them?/ finds both “the” and “them”
  • +: preceding character one or more times
    • /ab+/ finds “ab”, “abb”, “abbb”, …
  • *: the Kleene star, preceding character zero or more times
    • /ab*/ finds “a”, “ab”, “abb”, “abbb”, …
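The three quantifiers can be tried out with Python's `re` module. This sketch tests whole candidate strings with `re.fullmatch`, whereas the tool searches inside a text; the helper `matching` and the candidate lists are made up for illustration.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional = matching(r"them?", ["the", "them", "then"])
one_or_more = matching(r"ab+", ["a", "ab", "abb", "abbb"])
zero_or_more = matching(r"ab*", ["a", "ab", "abb", "abbb"])

print(optional)      # strings accepted by /them?/
print(one_or_more)   # strings accepted by /ab+/
print(zero_or_more)  # strings accepted by /ab*/
```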
Regular Expressions
Alternations and Character Classes
• /(re1|re2)/ matches everything that either re1 or re2 matches
  • /(good|better|best)/ finds the positive, comparative, and superlative forms of the adjective “good”
  • /great(er|est)?/ finds “great” and its comparative and superlative forms
    • The question mark makes the suffixes optional
• Alternatives on the character level are written in square brackets: […]
  • /[Tt]he/ finds the capitalised and the lower-case form of “the”
• Square brackets support ranges of characters
  • /[A-Z]/ finds upper-case characters (beware: locale)
  • /[0-9]/ finds digits
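A short sketch of alternations and character classes in Python's `re` module. The sentence and the helper `find_all` are invented for the example; the helper returns the full matched strings because `re.findall` would return only the group contents for patterns with parentheses.

```python
import re

def find_all(pattern, text):
    """Return every full match of the pattern in the text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

sentence = "The best plan is greater than the good one; room 42 is the greatest."

forms_of_good = find_all(r"(good|better|best)", sentence)
forms_of_great = find_all(r"great(er|est)?", sentence)
articles = find_all(r"[Tt]he", sentence)
capitals = find_all(r"[A-Z]", sentence)
numbers = find_all(r"[0-9]+", sentence)  # + added to get whole numbers

print(forms_of_good, forms_of_great, articles, capitals, numbers)
```

Note that /[Tt]he/ would also match inside longer words such as “then”, which is one source of the over-matching discussed on the later slides.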
Regular Expressions
Special cases and exceptions
• The dot . matches any single character
  • /a.*b/ finds everything that begins with “a” and ends with “b”
• Escape character: backslash
  • In order to find a literal dot, we need to suppress its special meaning
  • /.*\.doc/ finds everything that ends in “.doc” (e.g., filenames)
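The effect of escaping the dot shows up when the same pattern is applied with and without the backslash. The file names are made up, and `re.fullmatch` is used so that each name has to match as a whole.

```python
import re

names = ["report.doc", "notes.txt", "reportXdoc", "draft.docx"]

# Unescaped: the second dot matches any character, including "X".
unescaped = [n for n in names if re.fullmatch(r".*.doc", n)]
# Escaped: only a literal dot is accepted before "doc".
escaped = [n for n in names if re.fullmatch(r".*\.doc", n)]

print(unescaped)
print(escaped)
```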
Regular Expressions
Real examples
• Chapter 10.
  • /Chapter [0-9]+\./
• Chapter V. (Roman numerals)
  • /Chapter [IVXCM]+\./
  • Beware: possible over-matching
  • The class omits L and D, so “Chapter XL.”, for instance, is not found
• Dates: MAY 22., AUGUST 23.
  • /[A-Z]+ [0-9]+\./
  • Beware: possible over-matching
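How such a rule turns into segments can be sketched as follows: every match of the heading pattern starts a new segment that runs up to the next match. The sample texts are invented, and the tool itself may apply its rules differently.

```python
import re

text = (
    "Chapter 1. In which the journey begins. "
    "Chapter 2. In which a wager is made. "
    "Chapter 10. In which the travellers arrive."
)

heading = re.compile(r"Chapter [0-9]+\.")
starts = [m.start() for m in heading.finditer(text)]
ends = starts[1:] + [len(text)]
segments = [text[s:e].strip() for s, e in zip(starts, ends)]

for segment in segments:
    print(segment)

# Over-matching: the date pattern also accepts other upper-case words
# that happen to be followed by a number and a full stop.
diary = "MAY 22. We left at noon. ROOM 7. was empty."
date_like = re.findall(r"[A-Z]+ [0-9]+\.", diary)
print(date_like)
```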
TextTiling
Hearst (1994)
• Unsupervised segmentation algorithm, developed for expository texts
• Compares the lexicon in windows to the left and right of a target sentence gap
[Figure: windows (window size = 2, step size = 3) on either side of the sentence boundary between sentences n and n+1; the word-count vectors v1 and v2 of the two windows, with the example values (2, 3, 1, 0) and (7, 2, 0, 9), give the distance d_n = dist(v1, v2)]
TextTiling
Hearst (1994)
[Figure: the distance is computed at successive sentence gaps, e.g. d_n at the sentence boundary between sentences n and n+1, and d_n+3 three sentences later]
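The window comparison can be sketched in a few lines. This is a simplified reading of the slide, not Hearst's full algorithm: it assumes whitespace tokenisation, raw word counts, and cosine distance, and it simply reports the gap with the largest distance. The function names and the six sample sentences are made up.

```python
import math
from collections import Counter

def cosine_distance(left, right):
    """1 - cosine similarity of two word-count vectors (as Counters)."""
    dot = sum(left[w] * right[w] for w in left)
    norm = math.sqrt(sum(c * c for c in left.values())) * math.sqrt(
        sum(c * c for c in right.values())
    )
    return 1.0 - dot / norm if norm else 1.0

def gap_distances(sentences, window=2, step=1):
    """Distance d_n at the gap after the n-th sentence, for every step-th gap."""
    tokens = [s.lower().split() for s in sentences]
    distances = {}
    for n in range(window, len(tokens) - window + 1, step):
        left = Counter(w for sent in tokens[n - window:n] for w in sent)
        right = Counter(w for sent in tokens[n:n + window] for w in sent)
        distances[n] = cosine_distance(left, right)
    return distances

sentences = [
    "ship sails sea", "ship sea storm", "ship storm harbour",
    "train rails station", "train station city", "train city rails",
]
distances = gap_distances(sentences)
best_gap = max(distances, key=distances.get)
print(distances, best_gap)
```

In the sample, the vocabulary changes completely after the third sentence, so the distance at that gap is maximal and it would be proposed as a segment boundary.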
TextTiling
Hearst (1994)
• More powerful algorithms are available
  • E.g., topic segmentation
• Clear adaptation possibilities
  • How to create word vectors?
    • Which words are included (function/content words)?
    • Which value is represented in the vector (frequency, tf*idf, information, …)?
  • How to calculate similarity/distance?
    • Cosine, Manhattan, …
• But: evaluation is hard
  • No gold standard available
  • Different expectations
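The choice of distance measure matters because the measures react differently to vector length. A minimal sketch with two illustrative frequency vectors (the function names are hypothetical):

```python
import math

def manhattan_distance(u, v):
    """Sum of the absolute differences per dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cosine of the angle between the vectors; ignores their length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

left_counts = [2, 3, 1, 0]
right_counts = [7, 2, 0, 9]

print(manhattan_distance(left_counts, right_counts))
print(round(cosine_distance(left_counts, right_counts), 3))
```

Manhattan distance grows when one window simply contains more words, whereas cosine distance only compares the proportions of the words, which is one reason to prefer it for windows of unequal length.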
Hands-On Session 2
• Go to …
• Load a text of your liking (it’s better if you are familiar with it)
• Add entity references by applying the Stanford NER system
  • Briefly check whether the important entities are included (“Passepartout”, for instance, is not)
  • You can add specific names by specifying regular expressions
• Add reasonable segment annotations
• Export a GEXF file and load it into Gephi
• Play with various options and see how the network changes
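For orientation, a GEXF file is an XML document listing nodes and weighted edges. The sketch below writes a minimal one by hand with Python's standard library; the element and attribute names follow the GEXF 1.2 draft as far as it is assumed here, the names and weights are invented, and the file exported by the tool may contain more detail.

```python
import xml.etree.ElementTree as ET

def to_gexf(edge_weights):
    """Serialise {(source, target): weight} as a minimal GEXF document."""
    root = ET.Element(
        "gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"}
    )
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    # One node per entity that takes part in at least one edge.
    for name in sorted({name for pair in edge_weights for name in pair}):
        ET.SubElement(nodes, "node", {"id": name, "label": name})
    for i, ((source, target), weight) in enumerate(sorted(edge_weights.items())):
        ET.SubElement(
            edges,
            "edge",
            {"id": str(i), "source": source, "target": target, "weight": str(weight)},
        )
    return ET.tostring(root, encoding="unicode")

# Hypothetical co-occurrence counts.
gexf_xml = to_gexf({("Mary", "Peter"): 2, ("Mary", "Paul"): 1})
print(gexf_xml)
```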