TEXT FILTERING FOR SPANISH Enrique Puertas Sanz Universidad Europea de Madrid

Contents
• Goals
• Scientific approach
• Design and implementation
• Current results

Goals
• Effective filtering of Spanish text dealing with
  – Pornography
  – Gross language
• Two-level filtering (efficiency-driven)
  – Light filtering
  – Heavy filtering

Scientific approach
• Light filter – pornography
  – Statistical text processing
    • Very shallow text analysis
    • Machine Learning
  – High accuracy on “easy” text
  – Efficient

Scientific approach
• Light filter – pornography (details)
  – Very shallow text analysis
    • Basic tokenization
      – Isolating words using separators (space, EOL, etc.)
    • Stop list filtering
      – Filtering out very common words (e.g. prepositions)
    • Stemming
      – Basic morphology (“analysis”, “analyser” → “analy”)
    • Binary text representation
      – Weight vector (e.g. “sex” occurs → “sex” has weight 1)
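
The four indexing steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `LightIndexer`, the stop list, and the suffix-stripping stemmer are hypothetical stand-ins, since the slides do not give the actual stop list or stemming rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class LightIndexer {
    // Hypothetical stop list; the list used by the real filter is not given in the slides.
    static final Set<String> STOP_WORDS = Set.of("de", "en", "a", "con", "por", "para", "el", "la");

    // Basic tokenization: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Toy stemmer (assumption): strips the first matching suffix from a short list.
    static String stem(String word) {
        for (String suffix : new String[] {"es", "as", "os", "a", "o", "e", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Binary representation: a stem in the returned set has weight 1, any other stem weight 0.
    static Set<String> binaryVector(String text) {
        Set<String> stems = new TreeSet<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) stems.add(stem(token));
        }
        return stems;
    }
}
```

Calling `LightIndexer.binaryVector(text)` on a page's extracted text yields the set of stems with weight 1; representing the vector as a set keeps it sparse, which suits the efficiency goal of the light filter.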

Scientific approach
• Light filter – pornography (details)
  – Machine Learning
    • Filtering tokens with Information Gain
      – Retaining the top-scoring 1% of word stems
    • Support Vector Machines (SVM) & regression
      – SVM linear model: -1.99 * sex - 0.35 * porn + ... > 0 → safe
      – Logistic regression
        » Obtain class probabilities by fitting the model
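
As a rough illustration of the Information Gain step, the score of a single binary feature (a stem that is present or absent) can be computed from four document counts. The class and parameter names below are hypothetical, and the slides do not show how the actual system computes the score (it may simply rely on WEKA); this is the textbook definition, IG = H(class) - H(class | stem).

```java
class InfoGain {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Entropy in bits of a two-class distribution, given raw counts.
    static double entropy(double a, double b) {
        double n = a + b;
        if (n == 0) return 0.0;
        double h = 0.0;
        for (double c : new double[] {a, b}) {
            if (c > 0) {
                double p = c / n;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // presentPos / presentNeg: documents of each class that contain the stem.
    // absentPos / absentNeg: documents of each class that do not contain it.
    static double informationGain(int presentPos, int presentNeg, int absentPos, int absentNeg) {
        double present = presentPos + presentNeg;
        double absent = absentPos + absentNeg;
        double n = present + absent;
        if (n == 0) return 0.0;
        double prior = entropy(presentPos + absentPos, presentNeg + absentNeg);
        double conditional = (present / n) * entropy(presentPos, presentNeg)
                           + (absent / n) * entropy(absentPos, absentNeg);
        return prior - conditional;
    }
}
```

Stems would then be ranked by this score and only the top 1% kept as features for the learner.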

Scientific approach
• Light filter – gross language
  – Swear words in 3 groups (low, med, high)
  – Extracted from the official Spanish language dictionary (DRAE), stemmed
  – Operation
    • If any high swear word occurs → score high
    • else if any med swear word occurs → score med
    ...
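
A minimal sketch of this cascade, assuming that each group maps to its own score level and that a text with no listed stem gets no score at all (the slide elides the remaining cases). The stems below are neutral placeholders, not the DRAE-derived lists, and the class and enum names are hypothetical.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;

class LightGrossFilter {
    enum Level { NONE, LOW, MED, HIGH }

    // Placeholder stems (hypothetical); the real lists are extracted from the DRAE.
    static final Set<String> HIGH = Set.of("stemhigh1", "stemhigh2");
    static final Set<String> MED = Set.of("stemmed1");
    static final Set<String> LOW = Set.of("stemlow1");

    // Checks the groups from most to least severe and stops at the first hit.
    static Level score(Collection<String> stems) {
        if (!Collections.disjoint(stems, HIGH)) return Level.HIGH;
        if (!Collections.disjoint(stems, MED)) return Level.MED;
        if (!Collections.disjoint(stems, LOW)) return Level.LOW;
        return Level.NONE;
    }
}
```

Because the check stops at the first matching group, the cost is a handful of set lookups per text, which fits the light filter's efficiency goal.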

Scientific approach
• Heavy filter – pornography
  – More advanced text processing
    • Shallow text analysis with some NLP
    • Machine Learning (as in light filtering)
  – Better accuracy on “difficult” text
  – Less efficient

Scientific approach
• Heavy filter – pornography (details)
  – Shallow text analysis with some NLP
    • Previous approach plus more indicative indexing units
    • Noun Phrase recognition
    • Named Entity recognition (“Pam Anderson” vs. “Bill Gates”)

Scientific approach
• Heavy filter – pornography (details)
  – Noun Phrase recognition (3 phases)
    1. Part-Of-Speech tagging of the training data
       “el perro come” → “el_det perro_n come_v”
       where det = determiner, n = noun, v = verb (simplified)
       – Maximum Entropy with the MXPOST package (95+% accuracy)
       – Trained on the CRATER corpus (news text)

Scientific approach
• Heavy filter – pornography (details)
  – Noun Phrase recognition (3 phases)
    2. Noun phrases (NPs) as regular expressions
       – E.g. np = det n adj (“el_det niño_n listo_adj”)
    3. NP normalization (avoids tagging incoming text, since MXPOST is not GPL’ed)
       – Stop list, stemming and ordering
       – E.g. “el niño listo” → “list niñ”
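
Phases 2 and 3 can be sketched as a regular expression over `word_tag` tokens followed by a normalization step. Everything named here is an assumption made for illustration: the class `NounPhrases`, the single `det n adj` pattern, the stop list, and the toy stemmer that only drops a final vowel. The real system presumably uses more patterns and a proper stemmer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NounPhrases {
    // np = det n adj, written over "word_tag" tokens as in the slide's example.
    static final Pattern NP = Pattern.compile("(\\S+)_det (\\S+)_n (\\S+)_adj");

    // Hypothetical stop list (determiners only).
    static final Set<String> STOP_WORDS = Set.of("el", "la", "los", "las", "un", "una");

    // Toy stemmer (assumption): drops a final unaccented vowel from words longer than 3 letters.
    static String stem(String w) {
        boolean endsInVowel = "aeiou".indexOf(w.charAt(w.length() - 1)) >= 0;
        return (w.length() > 3 && endsInVowel) ? w.substring(0, w.length() - 1) : w;
    }

    // NP normalization: stop list, stemming, then alphabetical ordering of the stems.
    static String normalize(String phrase) {
        List<String> stems = new ArrayList<>();
        for (String w : phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) stems.add(stem(w));
        }
        Collections.sort(stems);
        return String.join(" ", stems);
    }

    // Finds NPs in POS-tagged training text and returns their normalized forms.
    static List<String> extract(String tagged) {
        List<String> result = new ArrayList<>();
        Matcher m = NP.matcher(tagged);
        while (m.find()) {
            result.add(normalize(m.group(1) + " " + m.group(2) + " " + m.group(3)));
        }
        return result;
    }
}
```

Since the normalized form is just an ordered list of stems, incoming text only needs the stop list and stemmer at filtering time; the tagger is used on training data alone.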

Scientific approach
• Heavy filter – pornography (details)
  – Named Entity recognition
    • As defined in the Computational Natural Language Learning (CONLL) 02/03 workshops
      – Named entities = phrases with names of persons, organizations, locations, times and quantities
      – E.g. [PER Wolff], currently a journalist in [LOC Argentina], played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid].
    • We partly follow the approach of the 02 top performers (Carreras et al.)

Scientific approach
• Heavy filter – pornography (details)
  – Named Entity recognition
    • A selection of Carreras’ text features
      – Focus word capitalization, punctuation marks, etc.
    • A number of Machine Learning algorithms
      – Naive Bayes, SVM, kNN, etc.
    • Trained on CONLL Spanish corpora (news text)

Scientific approach
• Heavy filter – gross language
  – Same swear word groups as in light filter
  – Weight vector (3 = high, 2 = med, etc.)
  – Cosine similarity with text input weight vector ∈ [0,1] → score
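
A hedged sketch of the cosine score follows. The slide does not say how the input text is turned into a weight vector, so this example assumes a term-frequency vector over all stems in the text; the stems, the class name `HeavyGrossFilter`, and the weight 1 for the low group are likewise assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeavyGrossFilter {
    // Placeholder stems (hypothetical) with group weights: 3 = high, 2 = med, 1 = low (assumed).
    static final Map<String, Integer> WEIGHTS = Map.of("stemhigh", 3, "stemmed", 2, "stemlow", 1);

    // Cosine similarity between the lexicon weight vector and the text's term-frequency vector.
    // All components are non-negative, so the result lies in [0, 1].
    static double score(List<String> stems) {
        Map<String, Integer> tf = new HashMap<>();
        for (String s : stems) tf.merge(s, 1, Integer::sum);

        double dot = 0.0, textNorm = 0.0, lexNorm = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int count = e.getValue();
            textNorm += count * count;
            dot += count * WEIGHTS.getOrDefault(e.getKey(), 0);
        }
        for (int w : WEIGHTS.values()) lexNorm += w * w;

        if (textNorm == 0.0) return 0.0; // empty text
        return dot / Math.sqrt(textNorm * lexNorm);
    }
}
```

Under this assumption a text with no listed stem scores 0, and the score falls as harmless words dilute the text, which gives a graded value rather than the three-way decision of the light filter.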

Design and implementation
• Coded in Java
• Third-party (Java) libraries
  – WEKA (learning)
  – HTMLParser (text extraction)
  – Muffin (filtering test)
  – MXPOST (POS-tagging training data)
• Available at
  – PoesiaSoft/TextFilter/Spanish

Design and implementation
• Package overview
  – indexer (core) – indexing, training
  – gross – gross language
  – ner – Named Entity recognition
  – filter – filtering utils (testing)
  – html2Text – HTML processing and bot
  – main – the filters

Design and implementation
• Statistics
  – Code
    • 50 classes (300 Kb.)
    • 10 data files (10 Mb.)
  – Corpus
    • 35k HTML files (29k vs. 6k)
    • 1 Gb. of source HTML

Current results
• Official results (beta version, porn light filter)
• Sample of 4824 Web pages (891 harmful / 3933 harmless)

                        Predicted
                        Harmful   Harmless   Total
  Actual   Harmful          816         75     891
           Harmless           4       3929    3933
           Total            820       4004    4824

  Precision               0.995      0.981
  Recall                  0.916      0.999
  F-Measure               0.954      0.990
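
The per-class figures can be re-derived from the four cells of the confusion matrix. The helper below is a small sketch with hypothetical names; it uses the standard definitions of precision, recall and the balanced F-measure.

```java
class Metrics {
    // Precision: fraction of pages predicted as the class that really belong to it.
    static double precision(int truePositives, int falsePositives) {
        return truePositives / (double) (truePositives + falsePositives);
    }

    // Recall: fraction of pages of the class that were predicted as such.
    static double recall(int truePositives, int falseNegatives) {
        return truePositives / (double) (truePositives + falseNegatives);
    }

    // Balanced F-measure: harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }
}
```

For the harmful class, `Metrics.precision(816, 4)` and `Metrics.recall(816, 75)` give the precision and recall of the first column; for the harmless class the roles of the cells swap, i.e. `Metrics.precision(3929, 75)` and `Metrics.recall(3929, 4)`.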

Current results
• Official results (beta version, porn light filter)
  – Highlights
    • effectiveness value = 0.916 (harmful pages blocked: 816 of 891)
    • over-blocking value = 0.001 (harmless pages blocked: 4 of 3933)

Current results
• Unofficial results
  – Light filter (porn) improved
  – Heavy filter (porn)
    • Slight (untested) improvement due to
      – Bigger feature space
      – NP and NE recognition
