Generalized Analysis of Logs for Automatic Translation and Episodic Analysis of Searches http://www.galateas.eu
GALATEAS - EU Project
Part of the European Commission's Information and Communication Technologies Policy Support Programme (co-funded by the European Commission for an overall budget of 3.7M euros), 01/04/2010 to 31/03/2013.
Many vendors offer engines for retrieving contents and metadata via search requests by end users. These queries are a precious resource for understanding user behaviour. GALATEAS will greatly help to customize these search requests and enable content providers to understand what information users are really looking for.
GALATEAS will address two important challenges:
• Making sense of short queries in any language
• Translating them
This will help content administrators to answer questions that are crucially important to them, such as:
• Which topics are most commonly searched in my collection, for a given language?
• How do these topics relate to my catalogue?
• Which named entities (people, places) are most popular among my users?
GALATEAS - EU Project
Today, content providers cannot customize content and indexing because they don't know their users. The GALATEAS project offers digital content providers
– an innovative approach to understanding users' behaviour by analysing language-based information from transaction logs
– technologies facilitating improved navigation and search for multilingual content access
GALATEAS - web services
GALATEAS develops two web services.
LangLog: It will analyze transaction logs containing queries to search engines for a given content provider. By applying statistical technologies coupled with language-oriented services, it will produce reports concerning the informational needs of the users accessing that particular aggregation. LangLog will provide generalizations of the actions that information seekers perform in order to find contents inside a searchable collection of digital objects.
QueryTrans: It will translate queries coming from an external search engine into several target languages, so that the external search engine can return to the user results in languages different from the one in which the query was formulated.
LangLog - Understand user needs
Challenge: Recognise named entities and deal with multilingual terms in very short texts
• Query 1: Tableau Mona Lisa (F)
• Query 2: Oil painting la Gioconda (EN)
• Query 3: La Gioconda pitturi da Vinci (IT)
• Index term 1: La Gioconda
• Index term 2: Oil
• Index term 3: Painting
GALATEAS: Identify appropriate index terms according to what the user is looking for
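The mapping from multilingual queries to index terms can be illustrated with a minimal sketch. It assumes a small hand-written lexicon (`LEXICON`) and a simple phrase lookup; both are hypothetical stand-ins and not the language resources or named entity recognition that GALATEAS actually uses.

```python
import unicodedata

# Hypothetical multilingual lexicon: surface form -> index term.
# The entries are illustrative only, not GALATEAS language resources.
LEXICON = {
    "mona lisa": "La Gioconda",
    "la gioconda": "La Gioconda",
    "oil": "Oil",
    "tableau": "Painting",
    "painting": "Painting",
    "pitturi": "Painting",
}


def normalize(text):
    """Lowercase the text and strip accents so lookups tolerate spelling variants."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(query):
    """Return the index terms whose surface forms occur in the query."""
    tokens = normalize(query).split()
    found = []
    # Try two-word phrases first, then single words.
    for size in (2, 1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[i:i + size])
            term = LEXICON.get(phrase)
            if term and term not in found:
                found.append(term)
    return found


for q in ("Tableau Mona Lisa", "Oil painting la Gioconda", "La Gioconda pitturi da Vinci"):
    print(q, "->", index_terms(q))
```

With this toy lexicon, all three example queries resolve to the index term "La Gioconda" even though they are phrased in different languages.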
LangLog - Customise according to user needs
Challenge: Perform classification and clustering with short query texts
Classified queries (Query ID: Query, Class):
• Query 1: Leonardo da Vinci, La Gioconda (Art)
• Query 2: Leonardo da Vinci, Vitruvian Man (Science)
• Query 2: Oil painting, la Gioconda (Art)
• Query 3: La Gioconda, pitturi, da Vinci (Art)
• Query 5: Leonardo da Vinci, meteorology (Science)
New query:
• Query X: Leonardo da Vinci, hydraulics, hydrometer (Science)
GALATEAS: Assign to previously unseen queries a class from your indexing hierarchy
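Assigning a class to an unseen short query can be sketched with a multinomial naive Bayes classifier using add-one smoothing. This is a generic textbook technique chosen for illustration, not necessarily the method GALATEAS uses; the class `QueryClassifier` and the training list are hypothetical, with the query/class pairs taken from the slide as far as they can be read.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(query):
    """Split a query into lowercase word tokens."""
    return re.findall(r"\w+", query.lower())


class QueryClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()             # number of queries per class
        self.token_counts = defaultdict(Counter)  # class -> token frequencies
        self.vocabulary = set()

    def train(self, labelled_queries):
        for query, label in labelled_queries:
            self.class_counts[label] += 1
            for token in tokenize(query):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, query):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each query token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            for token in tokenize(query):
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Query/class pairs as read from the slide (assumed training data).
TRAINING = [
    ("Leonardo da Vinci, La Gioconda", "Art"),
    ("Leonardo da Vinci, Vitruvian Man", "Science"),
    ("Oil painting, la Gioconda", "Art"),
    ("La Gioconda, pitturi, da Vinci", "Art"),
    ("Leonardo da Vinci, meteorology", "Science"),
]

classifier = QueryClassifier()
classifier.train(TRAINING)
print(classifier.classify("Gioconda painting"))
print(classifier.classify("Vitruvian Man"))
```

With only a handful of very short training queries the decision rests on a few shared tokens, which is exactly why classification of short query texts is listed as a challenge.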
QueryTrans – Query in multiple languages
• User query: I want a picture about "le radeau de la Méduse" (F)
• Direct query: no answer
• Query translation by GALATEAS (machine translation resources): the raft of the Medusa?
• Translated query: more answers
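The flow on this slide (a direct query finds nothing, the translated query finds more) can be sketched as follows. The catalogue, the phrase table and all function names are hypothetical; in particular, the dictionary lookup in `translate` only stands in for a real machine translation system and is not how QueryTrans is implemented.

```python
# Toy catalogue: record id -> English title (hypothetical data).
CATALOGUE = {
    1: "The Raft of the Medusa",
    2: "Mona Lisa",
}

# Hypothetical French-to-English phrase table standing in for a real
# machine translation system.
PHRASE_TABLE = {
    "le radeau de la méduse": "the raft of the medusa",
}


def search(query):
    """Return ids of records whose title contains the query (case-insensitive)."""
    needle = query.lower()
    return [rid for rid, title in CATALOGUE.items() if needle in title.lower()]


def translate(query):
    """Translate a known phrase; return None when no translation is known."""
    return PHRASE_TABLE.get(query.lower())


def multilingual_search(query):
    """Search with the original query and, in addition, with its translation."""
    hits = search(query)
    translated = translate(query)
    if translated:
        hits += [rid for rid in search(translated) if rid not in hits]
    return hits


print(search("le radeau de la Méduse"))               # direct query
print(multilingual_search("le radeau de la Méduse"))  # with query translation
```

Keeping the original query's hits and appending the translated query's hits mirrors the idea that translation adds answers rather than replacing the direct result.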
Sources
• Our sources are the transaction logs of specialised content providers, which contain
– Information that is already structured, such as:
• Session data
• Clickthrough data
• Digital content providers' structured information hierarchies
– Unstructured information
• The queries themselves
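One way to picture how the structured and unstructured parts of a transaction log fit together is a single record type per logged query. The sketch below is an assumption for illustration only: the `LogEntry` fields and helper functions are hypothetical and do not reflect any actual log format of the content providers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    """One logged search request (hypothetical field names)."""
    session_id: str                        # structured: session data
    query: str                             # unstructured: the query itself
    clicked_record: Optional[str] = None   # structured: clickthrough data
    category: Optional[str] = None         # structured: node in the provider's hierarchy


def group_by_session(entries):
    """Group log entries by session id, preserving their order."""
    sessions = defaultdict(list)
    for entry in entries:
        sessions[entry.session_id].append(entry)
    return dict(sessions)


def clickthrough_rate(entries):
    """Fraction of queries that were followed by a click on a result."""
    if not entries:
        return 0.0
    clicked = sum(1 for entry in entries if entry.clicked_record is not None)
    return clicked / len(entries)


LOG = [
    LogEntry("s1", "la Gioconda", clicked_record="rec-17", category="Art"),
    LogEntry("s1", "Leonardo da Vinci meteorology"),
    LogEntry("s2", "oil painting", clicked_record="rec-42", category="Art"),
]

print(group_by_session(LOG).keys())
print(clickthrough_rate(LOG))
```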
Technologies
GALATEAS will uniquely combine
• Language resources
– Bilingual dictionaries, word lists, synonyms
• State-of-the-art natural language processing tools
– E.g. Xerox Incremental Parser (XIP)
• Data mining and log management components
– Extract, Transform, Load tools
• Query expansion, classification and clustering systems
– E.g. CLUTO
• Machine translation software
– MOSES
Technologies
All technologies are incorporated in a web services framework that allows easy integration of third-party technologies and great extensibility.
[Architecture diagram. Labels: Customer; Web services; GALATEAS core; Original query; Translated query; Query logs; Customised reports; Semantic services (semantic similarity); Natural Language Processing services (Named Entity Recognition / Part of Speech Tagging)]
GALATEAS partners
• Project coordinator: Xerox Research Centre Europe (France)
• Objet Direct (France)
• CELI (Italy)
• University of Trento (Italy)
• Gonetwork (Italy)
• Bridgeman Art (England)
• Humboldt University (Germany)
• University of Amsterdam (Netherlands)
http://www.galateas.eu
Questions?
http://www.galateas.eu
Contact: Xerox Research Centre Europe, Frédérique Segond, 6 chemin de Maupertuis, 38240 Meylan, France
frederique.segond@xrce.xerox.com