Recent Developments in the Czech National Corpus

Michal Křen
Charles University in Prague
Lancaster, 20 July 2015
3rd Workshop on the Challenges in the Management of Large Corpora

▶ Introduction of the project
▶ Corpus compilation
  ▶ Written corpora
  ▶ Spoken corpora
  ▶ Parallel corpus
  ▶ Specialized corpora
▶ Data processing and annotation
  ▶ Project management tools
  ▶ Tools for linguistic annotation
▶ User application development
  ▶ KonText
  ▶ SyD, Morfio & KWords
▶ Services
  ▶ Wiki, Support & Biblio
  ▶ Corpus hosting
  ▶ Data packages
▶ Future plans

Czech National Corpus
▶ long-term project (since 1994)
▶ continuous mapping of Czech language
▶ compilation, maintenance and providing public access to various language corpora
▶ research infrastructure (since 2012) ⇒ service-oriented operation
▶ more than 4,500 registered active users
▶ almost 1,900 queries a day
▶ http://www.korpus.cz

corpus            size          contents        time span
SYN2000           100 mil.      representative  most of the texts from 1900–1999
SYN2005           100 mil.      representative  most of the texts from 2000–2004
SYN2010           100 mil.      representative  most of the texts from 2005–2009
SYN2006PUB        300 mil.      newspaper       1989–2004
SYN2009PUB        700 mil.      newspaper       1995–2007
SYN2013PUB        935 mil.      newspaper       2005–2009
SYN (version 3)   2,232 mil.    union

Currently available SYN-series corpora.

▶ traditional corpora with detailed bibliographical information
▶ lemmatized & morphologically tagged
▶ outlook:
  ▶ new representative corpus SYN2015
  ▶ fresh data in SYN (2010–2014 added)
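To make "lemmatized & morphologically tagged" concrete, here is a minimal Python sketch of what such token-level annotation allows. The `Token` class, the `index_by_lemma` helper, the toy sentence, and the tag values are all illustrative assumptions; they are not the CNC data format or its tagset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary (base) form
    tag: str    # morphological tag; placeholder values, not the CNC tagset


def index_by_lemma(tokens):
    """Map each lemma to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lemma].append(position)
    return dict(index)


# Hypothetical toy data: "korpusy" and "korpusu" are inflected forms of "korpus".
tokens = [
    Token("korpusy", "korpus", "NOUN-PL"),
    Token("a", "a", "CONJ"),
    Token("velikost", "velikost", "NOUN-SG"),
    Token("korpusu", "korpus", "NOUN-SG"),
]

lemma_index = index_by_lemma(tokens)
print(lemma_index["korpus"])  # positions of every inflected form of "korpus"
```

Searching by lemma rather than by word form matters for a highly inflected language such as Czech, where one lexeme surfaces in many forms.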

corpus     size         coverage         time span
ORAL2006   1 mil.       Bohemia          recordings from 2002–2006
ORAL2008   1 mil.       Bohemia          recordings from 2002–2007
ORAL2013   2.78 mil.    Czech Republic   recordings from 2008–2011

Currently available ORAL-series corpora.

▶ only unscripted, informal dialogical speech
▶ ORAL2013 designed as a representation of contemporary spontaneous spoken Czech
▶ manual one-layer transcription
▶ outlook:
  ▶ lemmatization & tagging
  ▶ two-layer ORTOFON series

InterCorp
▶ large parallel corpus
▶ texts aligned on sentence level with their translations between Czech and a number of other languages
▶ consists of two major parts:
  ▶ core: manually revised alignment, mostly fiction
  ▶ collections: automatic alignment, various domains

Version 8 (June 2015)
▶ 38 foreign languages, out of which 20 lemmatized and/or tagged
▶ foreign-language texts: size of the core: 194 mil., total size of the InterCorp: 1,423 mil. words
▶ collections included:
  ▶ journalistic texts: Project Syndicate, Presseurop
  ▶ Acquis Communautaire, EuroParl, Open Subtitles
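Sentence-level alignment can be pictured as a list of segments, each pairing one or more Czech sentences with their counterpart in the other language. The sketch below is only an illustration under that assumption: the class names, fields, and example sentences are invented here and do not describe InterCorp's actual storage format.

```python
from dataclasses import dataclass


@dataclass
class AlignedSegment:
    czech: list    # one or more Czech sentences
    foreign: list  # the corresponding sentence(s) in the other language


@dataclass
class ParallelText:
    title: str
    language: str  # code of the foreign language, e.g. "en"
    part: str      # "core" or "collections"
    segments: list

    @property
    def manually_revised(self):
        # Per the slide, only the core has manually revised alignment.
        return self.part == "core"


def alignment_types(text):
    """Count segments by shape, e.g. '1-1' or '2-1' (Czech-foreign)."""
    counts = {}
    for segment in text.segments:
        key = f"{len(segment.czech)}-{len(segment.foreign)}"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Invented example text; not taken from the corpus.
sample = ParallelText(
    title="Example novel",
    language="en",
    part="core",
    segments=[
        AlignedSegment(["Byl pozdní večer."], ["It was late evening."]),
        AlignedSegment(["Pršelo.", "Nikdo nepřišel."], ["It rained and nobody came."]),
    ],
)

print(alignment_types(sample))
print(sample.manually_revised)
```

Allowing several sentences on either side of a segment reflects that translators often merge or split sentences, so a strict one-to-one pairing would not always be possible.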

DIAKORP
▶ diachronic corpus of historical Czech (from 14th century onwards, with current focus on the 19th century)
▶ current size 2 mil. words, major update soon

DIALEKT
▶ dialectal corpus
▶ target size 200,000 words (end of 2016)

DEAF
▶ corpus of Czech texts written by the deaf
▶ target size 200,000 words (end of 2016)

Project management tools
▶ software environments for internal work flow management
▶ web-based “wrappers” that combine both CNC and third-party tools

SynKorp
▶ database interconnected with data processing toolchain
▶ data collection and processing of the written language corpora
▶ customizable text conversion and clean-up
▶ bibliographic annotation and text classification

Mluvka
▶ database and integrated project management system
▶ coordination of spoken and dialectal data collection
▶ large networks of external collaborators ⇒ three-level project coordination hierarchy
▶ manual two-layer annotation (orthographic and phonetic)
▶ formal compliance checks and expert revisions
▶ balancing of the collected material
▶ payment calculation
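The slide does not say what the formal compliance checks consist of. As one hypothetical example of such a check, the Python sketch below verifies that an orthographic and a phonetic layer line up token by token. The function name, the rules, and the sample tokens are assumptions made for illustration, not Mluvka's actual checks.

```python
def check_layers(orthographic, phonetic):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    if len(orthographic) != len(phonetic):
        problems.append(
            f"layer length mismatch: {len(orthographic)} vs {len(phonetic)}"
        )
    # zip stops at the shorter layer, so only paired tokens are inspected here.
    for i, (orth, phon) in enumerate(zip(orthographic, phonetic)):
        if not orth.strip() or not phon.strip():
            problems.append(f"empty token at position {i}")
    return problems


# Hypothetical sample: "jsem" is commonly pronounced "sem" in informal speech.
orthographic_layer = ["já", "jsem", "tady"]
phonetic_layer = ["já", "sem", "tady"]

print(check_layers(orthographic_layer, phonetic_layer))
print(check_layers(["já", "jsem"], ["já"]))
```

Returning a list of problems instead of raising on the first one lets a reviser see every issue in a transcript at once.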

InterCorp database
▶ database and integrated project management system
▶ coordination of data collection for InterCorp
▶ large networks of external collaborators ⇒ three-level project coordination hierarchy
▶ work flow management of the individual texts
▶ manual verification and revision of the alignment (using InterText, a project-independent editor of aligned parallel texts)
▶ payment calculation
