Discourse markers and other signals: annotation and analysis Ludivine CRIBLE Bucharest, 15-16 Oct 2019
Overview
1. Domains and functions: operational definitions
2. EXMARaLDA suite: general functionalities
3. Hands-on demo: creating an annotated TedTalk transcription
4. Extracting and analyzing data
5. The next step: signalling analysis

The taxonomy in practice: definitions

Key principles: reminders
– Two independent layers of functional information
  – What is the relation/function expressed by the (semantics of the) DM? (15 functions)
  – Which type of content/elements/layer does the DM target? (4 domains)
– Each function can combine with each domain (theoretically)
– Only 1 value per level (no double tags)
– You can start annotating at any level
Domains
– Ideational: objective relations between external facts. Low degree of speaker involvement. Incompatible with expressions of opinion.
– Rhetorical: subjective relations between thoughts or speech-acts. Speaker’s attitude, beliefs, reasoning. Distance from facts (“I think that…”, “I can say that…”).
– Sequential: segments management. Structuring topics, turns, digressions, hesitations, stalling. Make the steps and flow of speech more explicit.
– Interpersonal: addressee management. Phatic function, manifests the relationship with the hearer. Explicit call or answer to the addressee.
Functions I: discourse relations
Conjunction
– Addition (ADD): S2 provides discourse-new information related to S1
– Specification (SPE): S2 elaborates on S1 with more details or an example
– Temporal (TMP): the two segments are chronologically ordered
Contingency
– Cause (CAU): S2 explains the situation in S1
– Consequence (CSQ): S2 is the result of the situation in S1
– Condition (CND): S2 is the condition for the truth/relevance of S1
Comparison
– Concession (CCS): S2 denies expectations related to S1
– Contrast (CTR): the two segments differ with respect to a shared property
– Alternative (ALT): the segments can replace each other
Functions II: speech-specific
– Hedging (HDG): the DM signals some approximation
– Monitoring (MNT): the DM signals the speaker’s intent to control the flow
– Agreeing (AGR): the DM signals agreement
– Disagreeing (DIS): the DM signals disagreement
Domain-specific
– Topic (TOP): the DM signals a start, change or return to topic
– Quoting (QUO): the DM introduces (pseudo-)reported speech
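The "one value per level" principle can be checked mechanically once annotations are exported. The following is a minimal sketch, not part of the original materials: it assumes the four domain abbreviations used on the Examples slide (IDE, RHE, SEQ, INT) and the fifteen function codes listed above. The helper name `validate_tag` is hypothetical.

```python
# Tag inventories taken from the slides; the validation helper is illustrative.
DOMAINS = {"IDE", "RHE", "SEQ", "INT"}
FUNCTIONS = {
    "ADD", "SPE", "TMP", "CAU", "CSQ", "CND", "CCS", "CTR", "ALT",  # discourse relations
    "HDG", "MNT", "AGR", "DIS", "TOP", "QUO",                       # speech-specific
}


def validate_tag(domain, function):
    """Return a list of problems for one annotated DM (empty list = valid).

    Each level must hold exactly one known value, so a double tag such as
    "IDE+RHE" is reported as invalid.
    """
    problems = []
    if domain not in DOMAINS:
        problems.append(f"invalid domain value: {domain!r}")
    if function not in FUNCTIONS:
        problems.append(f"invalid function value: {function!r}")
    return problems


print(validate_tag("IDE", "CAU"))      # valid pair: no problems reported
print(validate_tag("IDE+RHE", "CAU"))  # double tag on the domain level
```

Running such a check over the exported table before computing frequencies catches spelling mistakes and accidental double tags early.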
Examples
Addition
– IDE: le grand frère avait un rôle de papa et en plus d’être papa il avait un rôle de d’essayer les choses avant nous
– RHE: non je marchais pas ah non non j'ai pas couru (0.180) et j'ai fait encore un détour
– SEQ: Pacs avait fait une intendance aux baladins (0.780) et euh Camille lui dit euh tu oublieras pas de payer
– INT: <spk1> tu dis euh cheese pour le cliché et genre euh un peu pour se cacher <spk2> et un peu pour se cacher aussi ouai
Alternative
– IDE: on est plusieurs ou tu me vouvoies ?
– RHE: c’est pas pour ça qu’on fait de la musique mais c’est enfin c’est pas pour être reconnu dans la rue
– SEQ: euh ben j'ai fait euh deux ans enfin ma première et ma deuxième euh d'institutrice euh primaire
– INT: <spk1> j’avais repris euh des études en gestion des ressources humaines […] <spk2> directement après? <spk1> ben euh enfin j’ai arrêté euh l’année passée euh avril et euh […] l’année scolaire suivante
Concession
– IDE: elle devait partir le lendemain mais elle n’est jamais partie
– RHE: si la démocratie est un mot ancien, ici et maintenant la démocratie signifie la prospérité pour tous
– SEQ: c’était assez comique de les entendre parler comme ça euh des filles (0.690) mais euh ouais puis après euh voilà quoi
– INT: cet auditeur euh vigilant il va vous dire tiens euh encore Jean d’Ormesson mais on entend Jean d’Ormesson à chaque automne
Tips and notes
– Domains form a relative cline, allow for “more” or “less” interpretations
– Domains might not mean exactly the same thing for all functions, be flexible
– In case of doubt for the function, the bias is the “dictionary” meaning
– Test phase and discussion with second annotator necessary
– Practice makes perfect

EXMARaLDA suite: general functionalities
Generalities
– Thomas Schmidt’s team in Hamburg (CLARIN-D)
– Open-source annotation software
– Designed specifically for spoken text
  – transcription
  – text-to-sound alignment
  – annotation
– Download and documentation available at: http://exmaralda.org/en/

EXMARaLDA suite (Schmidt & Wörner 2012)
– Corpus Manager for corpus metadata
– Partitur Editor for transcription and annotation
– Exakt for extraction/concordancer
Pros and cons
Pros
– Open-source
– All-in-one
– User-friendly, intuitive (vs. Praat)
– Few constraints (vs. ELAN)
– Interoperable format
Cons
– Cannot handle heavy files
– Several steps for extraction
– Each annotation tier per speaker
Input formats
– ELAN (.eaf)
– Praat (.TextGrid)
– Transcriber (.trs)
– Folker (.flk)
– CHAT (.cha)
– Anvil (.anvil)
– Annotation Graph file (.xml)
– Plain text (.txt)
– Treetagger (.txt)
– TEI (.xml)
The same formats are available for export.
Annotation panel
– View > Annotation panel
– Open: choose your .xml file in its folder
– You can edit the annotation panel with any text editor, e.g. Notepad++
– The file provided follows Crible & Degand (in press)
  – You can change it or create a new one, cf. EXMARaLDA documentation
– The name of the « category » must be exactly the same as the name of the tier
– Automatically displays the list of available values + any description you want
– Double-click on the value to add it in the cell (avoids spelling mistakes)

Tips for DM annotation
– word-level segmentation
– either merge transcription tiers or double annotation tiers
– enter list of labels as « Annotation panel » for easy use
– prefer annotating in chronological order rather than DM by DM, to understand the context
– don’t do 5 hours in a row
– keep calm
Hands-on demo: creating an annotated TedTalk

Exercise 1
1. Use transcript provided or download any from https://www.ted.com/
2. Import it to Partitur Editor
3. Select segmentation rule
4. Create annotation tiers
5. Open annotation panel
6. Identify 5 DRDs and annotate their functions
7. Save as .exb file
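A saved .exb file is an XML document, so the annotations can also be inspected outside the Partitur Editor. The sketch below is illustrative only: the element and attribute names (`tier`, `event`, `category`, `start`, `end`) reflect an assumed layout of the basic-transcription format and should be checked against one of your own .exb files. The sample content and the helper `events_by_category` are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Invented miniature transcription; the structure is an assumption about .exb files.
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="0.4"/>
      <tli id="T2" time="0.9"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="v" type="t">
      <event start="T0" end="T1">well </event>
      <event start="T1" end="T2">hello </event>
    </tier>
    <tier id="TIE1" speaker="SPK0" category="DM" type="a">
      <event start="T0" end="T1">well</event>
    </tier>
    <tier id="TIE2" speaker="SPK0" category="DOMAIN" type="a">
      <event start="T0" end="T1">SEQ</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def events_by_category(exb_xml):
    """Map each tier category to {(start, end): text} for all its events."""
    root = ET.fromstring(exb_xml)
    tiers = {}
    for tier in root.iter("tier"):
        events = tiers.setdefault(tier.get("category"), {})
        for event in tier.iter("event"):
            span = (event.get("start"), event.get("end"))
            events[span] = (event.text or "").strip()
    return tiers


tiers = events_by_category(SAMPLE_EXB)
# Annotations on different tiers line up through shared timeline points.
for span, marker in tiers["DM"].items():
    print(marker, tiers["DOMAIN"].get(span))
```

For a real file, read its content with `open(path, encoding="utf-8").read()` and pass it to the same function.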
Extraction and analysis: from EXMARaLDA to Excel

CorpusManager file (1)
– Group all your annotated files (.exb) in the same folder
– Open « CorpusManager » (CoMa)
– File > Create corpus from transcriptions
– Name the corpus
– Click on « Browse »: go into the folder where all the .exb files are stored
  – DO NOT CLICK ON ONE OF THE FILES: click anywhere else in the folder, otherwise the corpus will overwrite the .exb file
  – Make sure that the « File name » field reads *YourCorpusName*.coma
– It will show how many transcription files you have in this folder
– « Next »

CorpusManager file (2)
– « Select transcriptions »: just click on « Next »
– « Segmentation »
  – Tick the box on « Segment transcriptions »
  – Select « …use default segmentation », click on « Next »
– « Metadata assignment »: click on « Next »
– « Speakers »: click on « Finish »
– You have now created a .coma file
EXAKT
– Open EXAKT
– File > Open corpus (or shortcut) and find the .coma file you just created
– Select « RegEx(A) »
– Annotation: select your tier name, e.g. « DM »
– In the « RegEx » box, type your search string, e.g. « well »
– Typing the dollar sign $ will give you all the annotations, everything you typed in the « DM » tier
– Then click on the binoculars on the right
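The reason the dollar sign returns every annotation is that `$` matches the end of a string, and every annotation has an end. The sketch below illustrates this with Python's `re` module, which is an assumption made for convenience: EXAKT uses its own regular expression engine, but the basic constructs shown here behave the same way in the common dialects. The list of annotations and the helper `search` are invented.

```python
import re

# Invented content of a « DM » tier.
annotations = ["well", "you know", "and", "but", "well"]


def search(pattern, values):
    """Return the values in which the pattern is found at least once."""
    return [value for value in values if re.search(pattern, value)]


print(search("well", annotations))         # only the annotations containing "well"
print(search("$", annotations))            # end-of-string anchor: every annotation
print(search("^(and|but)$", annotations))  # whole-cell match on a set of markers
```

Anchoring with `^` and `$` is useful when one marker is a substring of another (e.g. « so » inside « also »).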
Visualizing the annotations
– You will see a concordancer with all your DMs.
– To add the annotations from other tiers:
  – Columns (top-left corner) > Add annotation
  – Select the Annotation Category you want (e.g. start with « DM », then « DOMAIN »…)
  – The « Exact » option is fine
  – « OK »
– To add metadata, such as the name of the transcription file:
  – Columns > Metadata
  – Select « Filename* », click on the « + » sign, then « OK »

The result should look like this: [screenshot of the EXAKT concordance not included]
Exploring the annotations
– You can add more characters to the Left and Right context by clicking on the magnifying glass on the right
– By double-clicking on one item, you can visualize it in the transcription format at the bottom
  – You can also play it!
– You can add a « comment » column if you want, once you revise your annotations
  – Columns > Add analysis > Analysis name: « Comment » > OK

Extracting the annotations
– Click anywhere on the concordancer
– Ctrl + A (select everything)
– Ctrl + C (copy)
– Go to Excel and Ctrl + V (paste into a new Excel sheet)

Working in Excel
– You can now filter your data, create pivot tables and graphs, look at frequencies…
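The same frequency counts and pivot-style tables can be obtained from the copied concordance without Excel. This is a minimal sketch under stated assumptions: the pasted data is tab-separated, a header row with the hypothetical column names DM, DOMAIN and FUNCTION has been added by hand, and the five rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a concordance pasted from EXAKT (tab-separated).
PASTED = (
    "DM\tDOMAIN\tFUNCTION\n"
    "well\tSEQ\tTOP\n"
    "but\tIDE\tCTR\n"
    "but\tRHE\tCCS\n"
    "so\tIDE\tCSQ\n"
    "well\tSEQ\tTOP\n"
)

rows = list(csv.DictReader(io.StringIO(PASTED), delimiter="\t"))

# Simple frequencies per function.
function_counts = Counter(row["FUNCTION"] for row in rows)

# Domain-by-function cross-tabulation, the equivalent of a pivot table.
crosstab = Counter((row["DOMAIN"], row["FUNCTION"]) for row in rows)

for function, count in function_counts.most_common():
    print(function, count)
for (domain, function), count in sorted(crosstab.items()):
    print(domain, function, count)
```

To use real data, save the pasted sheet as a tab-separated text file and replace `io.StringIO(PASTED)` with an opened file.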
Inter- and intra-annotator reliability
– To assess the reliability and replicability of your analysis
– Intra = repeat your annotations after a while and compare
– Inter = compare with another annotator
– % agreement measured in Excel: =IF(A1=A2;"same";"diff") (depending on your Excel locale, the separator is a comma instead of a semicolon)
– Kappa scores measured in R or online: https://nlp-ml.io/jg/software/ira/
– Aim for k = 0.7, see Spooren & Degand (2010)
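Both measures are simple enough to compute directly. The sketch below assumes the kappa in question is Cohen's kappa for two annotators, which the slides do not state explicitly; the two label lists are invented. Kappa corrects the observed agreement p_o for the agreement p_e expected by chance: k = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Proportion of items on which the two annotation passes agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators (or two passes by one annotator)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two marginal proportions, summed over labels.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example: four DMs annotated for function by two annotators.
annotator_1 = ["ADD", "ADD", "CAU", "CAU"]
annotator_2 = ["ADD", "ADD", "CAU", "ADD"]
print(percent_agreement(annotator_1, annotator_2))  # 0.75
print(cohen_kappa(annotator_1, annotator_2))        # 0.5
```

The example shows why raw percentages overstate reliability: 75% agreement drops to a kappa of 0.5 once chance agreement is removed.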