Textual and Lexical Statistics Carme 2011 Mónica Bécue Bertaut Universitat Politècnica de Catalunya
Outline
1. Introduction
2. Illustrative example
3. Statistical analysis. Methods and results
   3.1 Text encoding and data structures
   3.2 Glossaries
   3.3 Lexical statistics
   3.4 Textual statistics
       - Correspondence analysis
       - Constrained clustering
   Other applications, other methods, in brief: canonical correspondence analysis, multiple factor analysis for contingency tables
4. Concluding remarks
Mónica Bécue-Bertaut Lexical and textual statistics 2/39
1. Introduction. Some dates in the statistical analysis of texts
In the late 1950s, several researchers favoured archiving the classics on electronic devices (Trésor général des langues et parlers français in France, Oxford Concordances in the UK) and developing methods to deal with the huge collections of texts that thus became available.
Lexical statistics (P. Guiraud, Ch. Muller, G.K. Zipf, G.U. Yule, …)
Muller Ch. (1977). Principes et méthodes de statistique lexicale. Hachette. Next year marks the 35th anniversary of that publication.
Correspondence analysis (CA), proposed by Benzécri (1961-1964), burst in on these early works, offering a very new and invigorating approach. CA and clustering methods constitute the core of textual statistics, whose applications are extremely wide-ranging.
Benzécri (1981). Pratique de l’analyse des données. Tome 3. Linguistique et lexicologie. This year marks the 30th anniversary of that publication.
1. Introduction. Types of corpus (collection of texts/documents)
- The Bible
- The classics
- Political speech
- Newspapers
- Non-directive interviews
- Free-text answers to open-ended questions in surveys
- Information retrieval
- Bibliography on a given theme for technological watch and scientific policy
- Complaint letters
- Automatic search in textual bases such as legal bases
- Organisation of textual bases
- Film or TV series scripts. What is a good script?
- Closing prosecution speech in a trial. Is the speech a good rhetorical text?
- Free comments in hall test sessions related to wine/perfume/cheese… tasting
1. Introduction. Challenge and approaches
Challenge: turning texts into (textual) data
Starting point: counting the word frequencies
- Lexical statistics start from the sequence of words. They mainly look for the lexical structure of the corpus (Muller, 1977; Labbé, 1990).
- Textual statistics start from the documents × words matrix. They focus on the distribution of the words across the documents, applying correspondence analysis and clustering methods (Benzécri, 1981; Lebart & Salem, 1994; Murtagh, 2005).
2. Illustrative example: uncovering the discursive strategy in a prosecution closing speech
Speech delivered at the end of a murder trial at the Barcelona Audience Court (in Spanish). The researchers segmented it into 59 discursive spaces, taking the prosecutor's breaks in tone and silences as breakpoints. It lasts one hour and 15 minutes.
Beginning of the speech:
----esp0001 con la venia por este ministerio fiscal se ha formulado un escrito de conclusiones elevado a la definitiva por varios hechos un delito relativo a la prostitución un delito de asesinato así como un delito de estafa en concurso ideal con el anterior con respecto al primero de ellos es decir al delito relativo a la prostitución no nos queda ningún género de dudas ( … )
(approximate translation: with the leave of the Court, this public prosecutor’s department has submitted a final written statement of conclusions concerning several incriminating facts: a prostitution-related offence, a murder offence, and a premeditated fraud offence. Concerning the first offence, we have no doubts....)
Illustrative example. Objectives
For the prosecutor
- To demonstrate his thesis and convince the audience (judges/lawyers; in this case, no jury) while observing the law
- By adopting a suitable discursive strategy: selection of words, order of arguments, rhythm of the speech
For the analyst
- To uncover, through statistical tools, how the prosecutor uses the information at his disposal to structure his speech and argumentation. Is the speech good?
Illustrative example. Some words about the context
- Murder of a prostitute (MJAM) by her pimp in order to cash in on a substantial life insurance policy
- No evidence
- Only one witness (MF) had heard the pimp talking to an accomplice about the murder. However, she withdrew her statement to the police when facing the judge. She was most probably afraid …
- The prosecutor has to rely on persuasive clues / circumstantial evidence to show the implausibility of the defence thesis
- It is the prosecutor's first important case, while the defendant has the best criminal counsel in Barcelona. The prosecutor's colleagues consider this a “lost case” …
Illustrative example. Some words about the context
Oral speech that follows a classical scheme:
1. Listing the offences and indicating the legal framework
2. Describing the facts, the data and their connections
3. Qualifying these facts in legal terms, as the concluding part of the speech
This speech is largely improvised: the prosecutor has to deliver it right at the end of the trial, taking into account everything that has occurred in the events and statements of the trial.
3. Statistical analysis: methods and results
1. Text encoding and data structures. Defining textual units and segmenting the corpus into documents
2. Glossaries
3. Lexical statistics
   - Distribution of the vocabulary
   - Growth of the vocabulary
4. Textual statistics
   - Correspondence analysis
   - Constrained clustering
   - Characteristic lexical features
Other applications, other methods
3.1 Text encoding and data structures
Textual units
- Words: │ con │ la │ venia │ por │ este │ ministerio │ fiscal │ se │ ha │ formulado │
- Lemmas: │ ConPr │ leArt │ veniaSubf │ porPr │ esteDet │ ministerioSubm │ fiscalAdj │ sePro │ formularVb │
- Repeated segments: │ ministerio fiscal │ este ministerio fiscal │
In this case, both tool words and full words are kept.
Segmentation of the corpus into documents
In the example, 2 nested segmentations:
- 59 discursive spaces (roughly, paragraphs)
- A coarser and more regular segmentation into 20 “Blocks” of about 500 occurrences (whose limits coincide with discursive-space limits)
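Repeated segments, as listed among the textual units above, can be thought of as word sequences that occur more than once in the corpus. The Python sketch below is only an illustration under stated assumptions: the function name `repeated_segments`, the length bounds and the count threshold are hypothetical choices, not the software used for the talk, and the toy token sequence extends the opening words of the speech with an invented second mention so that something repeats.

```python
from collections import Counter

def repeated_segments(tokens, min_len=2, max_len=3, min_count=2):
    """Return {segment: count} for word n-grams seen at least min_count times.

    min_len, max_len and min_count are illustrative defaults (assumptions).
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n words over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(seg): c for seg, c in counts.items() if c >= min_count}

# Hypothetical toy sequence: opening words plus an invented second mention.
text = ("con la venia por este ministerio fiscal se ha formulado "
        "por este ministerio fiscal")
segments = repeated_segments(text.split())

for segment, count in sorted(segments.items()):
    print(f"{segment}: {count}")
```

On a real corpus the window would normally be stopped at document boundaries so that no segment straddles two documents; that refinement is left out here for brevity.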
Data structures. Short example
Corpus composed of 3 documents; N = 14 occurrences, V = 9 distinct words

Corpus encoded into a sequence of labelled occurrences:
doc1: W1 W8 W1 W3
doc2: W8 W6 W7 W8 W5
doc3: W9 W7 W8 W5 W6

Corpus encoded into a documents × words table (frequency table):
doc  w1 w2 w3 w4 w5 w6 w7 w8 w9
1     2  0  1  0  0  0  0  1  0
2     0  0  0  0  1  1  1  2  0
3     0  0  0  0  1  1  1  1  1

Corpus and metadata encoded into a multiple table:
doc  w1 w2 w3 w4 w5 w6 w7 w8 w9  date  topic  author
1     2  0  1  0  0  0  0  1  0  y1    a      1
2     0  0  0  0  1  1  1  2  0  y2    a      2
3     0  0  0  0  1  1  1  1  1  y3    b      1
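The passage from a sequence of labelled occurrences to a documents × words frequency table can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the names `corpus`, `frequency_table` and `multiple_table` are hypothetical, and the order of the words inside each document is an assumption (only the counts are certain from the slide).

```python
from collections import Counter

# Short example: three documents as sequences of labelled occurrences.
# The within-document word order is assumed; the counts match the slide.
corpus = {
    "doc1": ["w1", "w8", "w1", "w3"],
    "doc2": ["w8", "w6", "w7", "w8", "w5"],
    "doc3": ["w9", "w7", "w8", "w5", "w6"],
}
vocabulary = [f"w{j}" for j in range(1, 10)]  # columns w1..w9, as on the slide

def frequency_table(corpus, vocabulary):
    """Count, for each document, how often each vocabulary word occurs."""
    table = {}
    for doc, tokens in corpus.items():
        counts = Counter(tokens)
        table[doc] = [counts[w] for w in vocabulary]
    return table

table = frequency_table(corpus, vocabulary)
n_occurrences = sum(sum(row) for row in table.values())

# Metadata (date, topic, author) appended to each row gives the multiple table.
metadata = {"doc1": ("y1", "a", 1), "doc2": ("y2", "a", 2), "doc3": ("y3", "b", 1)}
multiple_table = {doc: table[doc] + list(metadata[doc]) for doc in table}

for doc, row in multiple_table.items():
    print(doc, row)
```

Keeping the frequency columns and the metadata columns in one row per document mirrors the multiple table of the slide, where contextual variables sit alongside the word counts.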
3.2 Glossary of words
- N = 10400 occurrences, V = 1799 distinct words
- 302 words repeated 5 times or more (8031 occurrences are kept)
- Mean length of the discursive spaces: 176.3 occurrences
- Length range of the discursive spaces: from 54 to 463 occurrences
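The glossary figures above (corpus size N, vocabulary size V, words kept above a frequency threshold, document lengths) are simple counts. The following Python sketch shows one way to obtain them; the function name `glossary_summary`, the dictionary keys and the toy corpus are hypothetical, and only the threshold of 5 and the figures 10400 and 59 come from the slides.

```python
from collections import Counter

def glossary_summary(documents, min_freq=5):
    """Summarise a corpus given as a list of token lists (one per document)."""
    tokens = [w for doc in documents for w in doc]
    freq = Counter(tokens)
    # Words whose corpus frequency reaches the threshold are kept.
    kept = {w: c for w, c in freq.items() if c >= min_freq}
    lengths = [len(doc) for doc in documents]
    return {
        "N": len(tokens),                        # total occurrences
        "V": len(freq),                          # distinct words
        "kept_words": len(kept),
        "kept_occurrences": sum(kept.values()),
        "mean_length": len(tokens) / len(documents),
        "length_range": (min(lengths), max(lengths)),
    }

# Hypothetical toy corpus of two short documents.
docs = [["a", "a", "a", "b", "c"], ["a", "a", "b", "d"]]
summary = glossary_summary(docs)
print(summary)

# Consistency check on the reported figures: 10400 occurrences over
# 59 discursive spaces gives the stated mean length.
mean_length_reported = round(10400 / 59, 1)
print(mean_length_reported)
```

The last two lines only verify the arithmetic of the slide: 10400 / 59 rounds to 176.3 occurrences per discursive space.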
Glossary. Frequent words
- Life insurance: seguro (insurance, 51), persona (person, 40), seguro de vida (life insurance, 25), relación (relationship, 26), millones (millions, 13), beneficiarios (beneficiaries, 11).
- Actors: MJAM (victim, 38), FPM (33) and JCM (22) (defendants), hijos (children, 21), SRT (wife of the defendant, 17), MF (witness for the prosecution, 15).
- Statement of the witness for the prosecution: declaración(es) (statement(s), 43), policía (police, 27), delito (offence, 21), caso (case, 17), defensa (defence, 17), manifestaciones (manifestations, 16).
- Facts, data, clues: because there is no evidence, the prosecutor mentions hecho(s) (fact(s), 50), otro(s) dato(s) (other data, 32), indicios (clues, 11), “un cúmulo de índices” (accumulation of clues/circumstantial evidence), prueba (evidence, 15).
- Words indicating conviction: realmente (really, 44), consta (it is established, 31), perfectamente (perfectly, 18), es evidente (it is evident, 13), tenemos (we have, 14), sabemos (we know, 11).