Sharing Data and Language Resources: Technical Aspects and Best Practices
Stelios Piperidis
ELRC, ILSP/Athena RC
ELRC Workshop in Slovenia, 08.12.2015
Illustration of data packaging workflow
Data → LRs (Language Resources)
Value chain:
• Identification & Selection of Data
• Basic documentation
• Legal status determination (PSI vs. Licensing)
• Cleaning & Conversion (content, container)
• Privacy handling and acceptance (i.e. anonymization)
• Validation
• Processing of LRs (e.g. Alignment)
• Description & Storage of LRs
• Upload data to the Repository & Sharing
(Diagram labels: Partnership; Market knowledge; Industry network; ELRC; ELRC / EC; Public Partner)
Issues to address (1)
(Diagram stages: Identification & Selection of Data; Basic documentation; Legal status determination; PSI vs. Licensing)
• Identification of sources
• Identification and selection of data sets (raw data)
• Legal issues
  – Licensing
  – Privacy and ethics management
• Technical issues
  – Choice of medium and data formats for the transfer of the “raw” data (preference for the ELRC ad hoc platform)
  – Documentation with basic identification elements (Languages, Domains, year, …)
(Diagram labels: Partnership; Market knowledge; Industry network)
Any digital textual data!
Issues to address (2)
(Diagram stages: Cleaning & Conversion (content, container); Privacy handling and acceptance (i.e. anonymization))
Technical issues
• Cleaning of data format
  – Encoding and character sets, e.g. UTF-8
  – Discarding formatting, e.g. bold, italic; graphics, ads, tables, HTML tags, etc.
  – …
• File cleaning (e.g. conversion to XML, XLIFF, etc.)
• Data anonymization
(Diagram labels: Market knowledge; Industry network; ELRC)
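The cleaning steps listed on this slide (fixing the character encoding, discarding formatting and HTML tags) could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `clean_markup`, the sample sentence and the choice of ISO-8859-7 as the legacy input encoding are hypothetical and are not part of any ELRC tooling.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the content of script/style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_markup(raw_bytes, source_encoding="utf-8"):
    """Decode bytes, drop HTML tags, and return NFC-normalised plain text.

    Hypothetical helper: the name and behaviour are assumptions for this sketch.
    """
    text = raw_bytes.decode(source_encoding, errors="replace")
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    plain = unicodedata.normalize("NFC", "".join(parser.parts))
    # Collapse runs of whitespace left behind by removed markup.
    return re.sub(r"\s+", " ", plain).strip()


# Assumed input: a Greek HTML fragment stored in a legacy encoding.
sample = "<p>Η <b>Ελλάδα</b> αποτελεί έναν χώρο <i>πολιτισμού</i>.</p>".encode("iso-8859-7")
cleaned = clean_markup(sample, source_encoding="iso-8859-7")
utf8_bytes = cleaned.encode("utf-8")  # re-encode as UTF-8 for delivery
```

Decoding with `errors="replace"` keeps the pipeline running on damaged input; the replacement character it leaves behind can then be caught at the validation stage.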
Formatting example
(The slide shows the following Greek-English parallel passage twice, in two different layouts.)
Η Ελλάδα αποτελεί έναν χώρο πολιτισμού, τέχνης και επιστημών. Η μακραίωνη συμβολή της στο παγκόσμιο γίγνεσθαι, σε συνδυασμό με το μοναδικό φυσικό κάλλος και τις άρτιες υποδομές, την καθιστούν ιδανικό τόπο διεξαγωγής συνεδρίων. Τα τελευταία χρόνια, η ελληνική επικράτεια υποδέχεται όλο και συχνότερα ανθρώπους των γραμμάτων, των επιστημών και των τεχνών, οι οποίοι συμμετέχουν σε συμπόσια, συνέδρια και εκθέσεις. Ο Διεθνής Αερολιμένας Αθηνών «Ελευθέριος Βενιζέλος», ένα από τα πλέον σύγχρονα αεροδρόμια παγκοσμίως, ο οποίος λειτουργεί από το 2001, έδωσε μεγάλη ώθηση στη διοργάνωση διεθνών συνεδρίων.
Greece is a place of culture, the arts and sciences. Its tradition of contribution to global cultural and scientific communities, combined with its outstanding natural beauty and excellent infrastructure, has made it an ideal place in which to hold conferences. Over the last few years, Greece has more and more frequently welcomed people of letters, sciences and the arts, who have participated in symposia, conferences and exhibitions. Athens International Airport ‘Eleftherios Venizelos’, one of the most modern airports in the world in operation since 2001, greatly boosted the organization of international conferences.
Data anonymization
• Identify a large source of data on individuals, organizations, etc.
• Use a Named Entity Recognizer (NER) to find private biodata (names, locations, dates, birth information, etc.) and replace it with generic placeholders
• Confirm that the results meet the acceptance requirements
  – Reject the data if the anonymization is not as accurate as required
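The find, replace and confirm steps above can be sketched as follows. This is a toy illustration, not the procedure used by ELRC: a small hypothetical gazetteer (`KNOWN_ENTITIES`) and two regular expressions stand in for a trained Named Entity Recognizer, and the sample sentence and all names in it are invented for the example.

```python
import re

# Hypothetical gazetteer standing in for a trained NER model (assumption).
KNOWN_ENTITIES = {
    "Maria Novak": "PERSON",
    "Ljubljana": "LOCATION",
}

# Simple patterns for dates such as 03.05.1980 and for e-mail addresses.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def anonymize(text):
    """Replace recognised private data with generic placeholders.

    Returns the anonymized text and the number of replacements made.
    """
    count = 0
    for surface, label in KNOWN_ENTITIES.items():
        text, n = re.subn(re.escape(surface), f"[{label}]", text)
        count += n
    text, n = EMAIL_PATTERN.subn("[EMAIL]", text)
    count += n
    text, n = DATE_PATTERN.subn("[DATE]", text)
    count += n
    return text, count


def residual_leaks(text):
    """List known private strings that survived anonymization."""
    return [s for s in KNOWN_ENTITIES if s in text]


anonymized, replaced = anonymize(
    "Maria Novak, born 03.05.1980 in Ljubljana, wrote to maria.novak@example.org."
)
# Accept the data only if no known private string is left in the output.
accept = not residual_leaks(anonymized)
```

In practice the acceptance check would rely on manual sampling or a second recognizer rather than on the same gazetteer that produced the replacements.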
Issues to address (3)
(Diagram stage: Validation)
• Validation and quality control of the output of the anonymization procedure
• Validation and quality control of the output (Language Resource format, content)
  – Accept / reject the LR
(Diagram label: Public partner)
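A minimal sketch of what an automated accept / reject check on the format and content of a parallel resource might look like is given below. The specific checks (matching segment counts, no empty segments, no leftover markup, no decoding errors) are assumptions chosen for illustration; the slides do not specify the validation criteria.

```python
import re

# Anything that still looks like an HTML/XML tag after cleaning.
TAG_PATTERN = re.compile(r"<[^>]+>")


def validate_parallel(source_segments, target_segments):
    """Return a list of problems found; an empty list means 'accept'."""
    problems = []
    if len(source_segments) != len(target_segments):
        problems.append("segment counts differ")
    for side, segments in (("source", source_segments), ("target", target_segments)):
        for i, seg in enumerate(segments):
            if not seg.strip():
                problems.append(f"{side} segment {i} is empty")
            if TAG_PATTERN.search(seg):
                problems.append(f"{side} segment {i} contains markup")
            if "\ufffd" in seg:
                # U+FFFD is left behind when bytes could not be decoded.
                problems.append(f"{side} segment {i} has a decoding error")
    return problems


report = validate_parallel(["<b>a</b>", ""], ["x"])
decision = "reject" if report else "accept"
```

Returning the list of problems rather than a bare boolean lets the data provider see why a resource was rejected.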
Issues to address (4)
(Diagram stages: Processing of LRs (e.g. Alignment); Description & Storage of LRs; Upload data to the Repository & Sharing)
• Data preparation and processing for Automated Translation tools (e.g. Alignment)
• Description of the Language Resource (metadata)
• Packaging and delivery (Data Repository with e-sharing) to EC and Owner
(Diagram labels: Market knowledge; Industry network; ELRC / EC)
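As a rough illustration of the alignment and description steps, the sketch below pairs sentences by position and writes a small metadata record. It assumes the two texts are sentence-for-sentence translations, which real data rarely guarantees; the function names, the metadata field names and the sample values (such as the domain) are hypothetical and do not follow any official ELRC schema.

```python
import json
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def align_one_to_one(source_text, target_text):
    """Pair sentences by position; refuse texts whose sentence counts differ."""
    src, tgt = split_sentences(source_text), split_sentences(target_text)
    if len(src) != len(tgt):
        raise ValueError("sentence counts differ; a real aligner is needed")
    return list(zip(src, tgt))


greek = (
    "Η Ελλάδα αποτελεί έναν χώρο πολιτισμού, τέχνης και επιστημών. "
    "Η μακραίωνη συμβολή της στο παγκόσμιο γίγνεσθαι, σε συνδυασμό με το "
    "μοναδικό φυσικό κάλλος και τις άρτιες υποδομές, την καθιστούν ιδανικό "
    "τόπο διεξαγωγής συνεδρίων."
)
english = (
    "Greece is a place of culture, the arts and sciences. "
    "Its tradition of contribution to global cultural and scientific "
    "communities, combined with its outstanding natural beauty and excellent "
    "infrastructure, has made it an ideal place in which to hold conferences."
)
pairs = align_one_to_one(greek, english)

# Hypothetical metadata record; field names and values are illustrative only.
metadata = {
    "name": "Example parallel corpus",
    "languages": ["el", "en"],
    "domain": "tourism",
    "year": 2015,
    "size_in_segments": len(pairs),
}
description = json.dumps(metadata, ensure_ascii=False, indent=2)
```

The record reuses the basic identification elements named earlier in the deck (languages, domains, year), so the description can travel with the resource when it is uploaded.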
Cooperation actions
• Identification of sources
• Identification and selection of data sets (raw data)
  – Data can be obtained from visible sources (e.g. harvested from the web)
  – Data can be handed over by public sector players
  – Public sector players can boost the identification of visible sources
• The processing indicated above can be carried out jointly by ELRC and the data provider
How can ELRC help?
• Support for all procedures and technical issues
  – Support services
    • ELRC portal
    • Technical & legal support helpdesk
    • Repository for sharing LRs
    • Forum
ELRC portal: www.lr-coordination.eu (screenshot placeholder)
ELRC portal: Helpdesk (screenshot placeholder)
ELRC portal: Repository (screenshot placeholder)
ELRC portal: Web forum (screenshot placeholder)
Conclusions
• Repurposing existing data (human translations) is the best way to improve Automated Translation quality
• Data-driven paradigms provide an efficient way to leverage value from existing resources
• ELRC can help review data for suitability (at any phase)
• Do not underestimate the value of your language resources; provide for a Data Management Plan
Best practice for the future: Capitalize on your valuable data
Best Practice in Data Management
My data in the future
• Now that I know the value of data, what should my plans be?
• What are the best ways to collect, maintain, archive and re-use my data?
• In particular, how can I use it to improve MT performance?
Main phases of data development
Value chain:
• Identification & Selection of Data
• Basic documentation
• Legal status determination (PSI vs. Licensing)
• Cleaning & Conversion (content, container)
• Privacy handling and acceptance (i.e. anonymization)
• Validation
• Processing of LRs (e.g. Alignment)
• Description & Storage of LRs
• Upload data to the Repository & Sharing
• Sustainable storage
This can be part of the data management plan (DMP)
(Diagram labels: Market knowledge; Industry network)