Content-based Linked Data Summarization Andrejs Abele Supervisor: Paul Buitelaar Mentor: Georgeta Bordea
Introduction 1. Motivation 2. Datasets 3. Approach 4. Evaluation 5. Experiments 6. Conclusion & Future work
Terminology ● Linked data ● Automatic summarization: ○ Extraction-based summarization ○ Abstraction-based summarization ● Single-document summarization ● Multi-document summarization
Motivation
● Audience: data scientists and developers
● Datahub contains 8 731 datasets
[Figure: datasets such as Wikipedia, DynaMed, outbreakdatabase, and US Census Data are fed to a Summarizer, which outputs a description and top entries for each dataset.]
Example summarizer output:
● DBpedia
○ Description: Contains information about science, technology, math, history …
○ Top entries: Encyclopedia, history, structure, ...
● outbreakdatabase
○ Description: Provides summaries of significant food and water related outbreaks occurring since ...
○ Top entries: Outbreak, illness, ...
Datasets
● DBpedia - English DBpedia dump ( 866 461 004 ), for example:
<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en .
<http://dbpedia.org/resource/Mithridate> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://br.dbpedia.org/resource/Mitridates> .
<http://dbpedia.org/resource/Uguisu_no_fun> <http://dbpedia.org/ontology/abstract> "Uguisu no fun (\u9DAF\u306E\u7CDE), which literally means \u201Cnightingale feces\u201D in Japanese, refers to the excrement (fun) produced by a particular nightingale called the Japanese bush warbler (Cettia diphone) (uguisu). The droppings have been used in facials since ancient Japanese times. Recently, the product has been used in the Western world. This facial has been referred to as the \u201CGeisha Facial\u201D. The facial is supposed to lighten the skin and balance skin tones that have acne or sun damage."@en .
…
● WikiAbstracts - Wikipedia abstracts ( 4 636 227 )
● acquis - Acquis English corpus ( 23228 )
Experiment 1. Extract information about one topic from the linked dataset 2. Determine the most important terms 3. Create a summary from the extracted words 4. Compare the summary to the Wikipedia article about the topic
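Ranking methods
● Normalized Term Frequency (TF/N)
● Term Frequency - Inverse Document Frequency (TF*IDF), with $\mathrm{IDF}(t) = \ln(N_d / N_{dt})$, where $N_d$ is the number of documents and $N_{dt}$ is the number of documents containing term $t$
● Taxonomy extraction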
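The first two ranking methods can be illustrated with a minimal Python sketch. It assumes the topic text is already tokenized and stemmed, that each background document is represented as a set of terms, and that a term found in no background document gets an IDF of 0.0. The function names are hypothetical and not taken from the original implementation.

```python
import math
from collections import Counter


def normalized_tf(terms):
    """Return TF/N per term, where N is the number of terms in the topic document."""
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}


def idf(term, documents):
    """IDF(t) = ln(N_d / N_dt); unseen terms get 0.0 (an assumption of this sketch)."""
    n_d = len(documents)
    n_dt = sum(1 for doc in documents if term in doc)
    return math.log(n_d / n_dt) if n_dt else 0.0


def rank_tf_idf(terms, documents):
    """Rank the topic terms by TF*IDF, highest score first, ties broken alphabetically."""
    tf = normalized_tf(terms)
    scores = {term: tf[term] * idf(term, documents) for term in tf}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


topic_terms = ["medicin", "medicin", "herbal", "the"]
background = [{"the", "medicin"}, {"the", "herbal"}, {"the"}, {"the", "moth"}]
ranked = rank_tf_idf(topic_terms, background)
```

In this toy input, "the" occurs in every background document, so its IDF is ln(4/4) = 0 and it falls to the bottom of the ranking, while "medicin" ranks first.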
Evaluation
ROUGE-N: N-gram based co-occurrence statistics
ROUGE output:
1. Recall
2. Precision
3. F-measure
pyrouge - https://github.com/andersjo/pyrouge.git
Parameters used:
m - uses Porter stemmer
s - removes stopwords (around,as,aside,ask,asking,...)
n - max-ngram
l - n-words
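To make the three ROUGE-N outputs concrete, here is a small Python sketch of the n-gram overlap computation. It is not the pyrouge or ROUGE implementation: it assumes pre-tokenized input, skips stemming and stopword removal, and uses a balanced F-measure, all of which are simplifications chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) from clipped n-gram overlap."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most min(cand, ref) times
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_measure


summary = "herbal medicin japan studi".split()
abstract = "japan herbal medicin system diagnosi".split()
r, p, f = rouge_n(summary, abstract, n=1)
```

When the candidate and the reference contain the same number of n-grams, the two denominators are equal, so recall, precision, and F-measure coincide.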
Experiments ● Term preprocessing ○ Stemming ○ Removing Stopwords ○ Part-of-speech tagging
Experiment 1
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
3. acquis
4. Extracted literals
Pipeline:
1. Compute TF on the extracted literals
2. Compute IDF on the IDF data source
3. Compute TF-IDF to obtain a ranked term list (230)
4. Compare the ranked term list with the Wikipedia article abstract (230) using ROUGE
5. ROUGE output (using stemming)
Extract information about one topic
1. grep for all triples containing <http://dbpedia.org/page/Category:Traditional_medicine>
2. Get all subjects and objects and merge them into one list
3. Use the list to grep for all related triples from DBpedia
4. Upload the triples to a triplestore
5. Query for unique subjects and objects, where the object is a literal
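The steps above can be approximated in memory with a short Python sketch. It replaces grep with a regular expression over N-Triples lines and replaces the triplestore query of steps 4 and 5 with a plain filter, so it illustrates the idea rather than the pipeline actually used. The helper names are hypothetical.

```python
import re

# One N-Triples statement: subject IRI, predicate IRI, then an IRI or literal object.
TRIPLE = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def parse(line):
    """Return (subject, predicate, object) or None if the line is not a triple."""
    match = TRIPLE.match(line)
    return match.groups() if match else None


def topic_triples(lines, category):
    """Steps 1-3: collect resources linked to the category, then all their triples."""
    triples = [t for t in map(parse, lines) if t]
    seeds = set()
    for s, _, o in triples:
        if category in (s, o):
            # Keep only IRIs; literals are not used as lookup keys.
            seeds.update(x for x in (s, o) if x.startswith("<"))
    return [t for t in triples if t[0] in seeds or t[2] in seeds]


def literal_pairs(triples):
    """Step 5 (in-memory stand-in for the query): unique (subject, literal) pairs."""
    return sorted({(s, o) for s, _, o in triples if o.startswith('"')})
```

Loading the whole dump into a list only works for small samples; for a dump of the size given on the Datasets slide, a streaming pass per step would be the more realistic choice.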
Topic specific data ( 369 )
?s ?o
<http://dbpedia.org/resource/Kampo> ", alternatively shortened as just Kanpō, is the Japanese study and adaptation of Traditional Chinese medicine (TCM). The fundamental principles of Chinese medicine came to Japan between the 7th and 9th centuries. Since then, the Japanese have created their own unique herbal medical system and diagnosis. Kampo uses most of the Chinese medical system including acupuncture and moxibustion but is primarily concerned with the study of herbs."
<http://dbpedia.org/resource/Kampo> "Kampo"
<http://dbpedia.org/resource/Apocroustic> "Apocroustics, in pre-modern medicine, were medications intended to stop the flux of malignant humours to a diseased part. They were usually cold, astringent, and consisting of large particles."
...
● Text is stemmed using the Lucene library and merged into one document:
altern shorten as just Kanp is the Japanes studi and adapt of Tradit Chines medicin TCM The fundament principl of Chines medicin came to Japan between the 7th and 9th centuri Sinc then the Japanes have creat their own uniqu herbal medic system and diagnosi Kampo us most of the Chines medic system includ acupunctur and moxibust but is primarili concern with the studi of herb Kampo Apocroust in pre modern medicin were medic intend to stop the flux of malign humour to a diseas part Thei were usual cold astring and consist of larg particl
IDF datasets
● Input is a standard triple (S P O)
<http://dbpedia.org/resource/Irani_traditional_medicine> <http://www.w3.org/2000/01/rdf-schema#label> "Irani traditional medicine"@en .
<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en
● Using the Jena parser, filter out the literals
Irani traditional medicine
Lignum nephriticum
● Words are stemmed using the Lucene library
(Irani, tradit, medicin)
(Lignum, nephriticum)
● Calculate IDF
0.09230952124 medicin
0.03787453865 tradit
0.02862030703 herbal
0.01959969126 medic
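A rough Python sketch of this literal-to-IDF step follows. It swaps the Jena parser for a regular expression and the Lucene stemmer for plain lowercasing, and it treats every literal as one document. These are all assumptions made to keep the example self-contained, so its scores will not match the values listed above. Unicode escapes such as \u201C are left undecoded.

```python
import math
import re
from collections import Counter

# A quoted literal, optionally followed by a language tag or datatype, ending the statement.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(?:@[A-Za-z-]+|\^\^<[^>]*>)?\s*\.\s*$')


def literal_of(line):
    """Return the literal text of an N-Triples line, or None if the object is an IRI."""
    match = LITERAL.search(line)
    return match.group(1) if match else None


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(lines):
    """Compute ln(N_d / N_dt) per term, with each literal counted as one document."""
    docs = [set(tokenize(lit)) for lit in map(literal_of, lines) if lit is not None]
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    return {term: math.log(n_docs / freq) for term, freq in doc_freq.items()}
```

Counting each term once per literal, via the set, is what makes this a document frequency rather than a raw term count.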
Experiment 1 results (R = recall, P = precision, F = F-measure)
                     Without stopword removal     With stopwords removed
IDF source           R        P        F          R        P        F
DBpedia              0.21304  0.21304  0.21304    0.25373  0.17617  0.20795
Wikipedia            0.18261  0.18261  0.18261    0.22388  0.14493  0.17595
acquis               0.15217  0.15217  0.15217    0.19403  0.12093  0.149
extracted literals   0.21739  0.21645  0.21692    0.24627  0.17647  0.20561
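Experiment 2
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
Pipeline:
1. POS tagging (applied before TF and IDF are computed)
2. Compute TF on the extracted literals
3. Compute IDF on the IDF data source
4. Compute TF-IDF to obtain a ranked term list (230)
5. Compare the ranked term list with the Wikipedia article abstract (230) using ROUGE
6. ROUGE output (using stemming)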
Part-of-speech tagging
● Extract all literals
Trisuloides sericea is a moth of the Noctuidae family. It is found in South-east Asia. The wingspan is about 24 mm.
Khvorakabad is a village in Mazraeh Now Rural District, in the Central District of Ashtian County, Markazi Province, Iran. At the 2006 census, its population was 72, in 23 families.
● Tag the text using the Stanford part-of-speech tagger (3.5.0)
Trisuloides_NNS sericea_NN is_VBZ a_DT moth_NN of_IN the_DT Noctuidae_NNP family_NN ._. It_PRP is_VBZ found_VBN in_IN South-east_JJ Asia_NNP ._. The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._.
Khvorakabad_NNP is_VBZ a_DT village_NN in_IN Mazraeh_NNP Now_NNP Rural_NNP District_NNP ,_, in_IN the_DT Central_NNP District_NNP of_IN Ashtian_NNP County_NNP ,_, Markazi_NNP Province_NNP ,_, Iran_NNP ._. At_IN the_DT 2006_CD census_NN ,_, its_PRP$ population_NN was_VBD 72_CD ,_, in_IN 23_CD families_NNS ._.
● Keep only verbs and nouns (NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ)
Trisuloides_NNS, sericea_NN, is_VBZ, moth_NN, Noctuidae_NNP, family_NN, is_VBZ, found_VBN, Asia_NNP, wingspan_NN, is_VBZ, mm_NN, Khvorakabad_NNP, is_VBZ, village_NN, Mazraeh_NNP, Now_NNP, Rural_NNP, District_NNP, Central_NNP, District_NNP, Ashtian_NNP, County_NNP, Markazi_NNP, Province_NNP, Iran_NNP, census_NN, population_NN, was_VBD, families_NNS
● Compute TF-IDF
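The filtering step can be sketched in a few lines of Python. The sketch assumes the tagger output is available as whitespace-separated word_TAG tokens, as shown above. The function name is hypothetical, and the tagging itself is not reproduced here.

```python
# Penn Treebank tags kept by the filter: all noun and verb tags listed on the slide.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def filter_tagged(tagged_text):
    """Return the words whose tag is a noun or verb tag, in their original order."""
    kept = []
    for token in tagged_text.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, _, tag = token.rpartition("_")
        if word and tag in KEEP_TAGS:
            kept.append(word)
    return kept


words = filter_tagged("The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._.")
# words == ['wingspan', 'is', 'mm']
```

The result matches the corresponding part of the filtered list on the slide (wingspan_NN, is_VBZ, mm_NN), with the tags dropped.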
Results (R = recall, P = precision, F = F-measure)
                     Without stopword removal     With stopwords removed
IDF source           R        P        F          R        P        F
DBpedia              0.17826  0.17826  0.17826    0.26119  0.16509  0.20231
Wikipedia            0.16087  0.16087  0.16087    0.23881  0.14884  0.18338
Experiment 3
Taxonomy parameters: MincommonDoc=2, MincommonDoc=3
Pipeline:
1. Split the extracted literals into documents
2. Generate a taxonomy with Saffron
3. Obtain a ranked term list (230)
4. Compare the ranked term list with the Wikipedia article abstract (230) using ROUGE
5. ROUGE output (using stemming)
Results (R = recall, P = precision, F = F-measure)
Taxonomy
                     Without stopword removal     With stopwords removed
                     R        P        F          R        P        F
MinComDoc=2 Words    0.31739  0.28968  0.3029     0.49254  0.27049  0.34921
MinComDoc=2 Terms    0.17826  0.25309  0.20918    0.29104  0.24528  0.26621
MinComDoc=3 Words    0.12174  0.73684  0.20896    0.20896  0.73684  0.32559
MinComDoc=3 Terms    0.05652  0.65     0.104      0.09701  0.65     0.16882
POS
                     Without stopword removal     With stopwords removed
                     R        P        F          R        P        F
DBpedia              0.17826  0.17826  0.17826    0.26119  0.16509  0.20231
Wikipedia            0.16087  0.16087  0.16087    0.23881  0.14884  0.18338
Stemmed
                     Without stopword removal     With stopwords removed
                     R        P        F          R        P        F
DBpedia              0.21304  0.21304  0.21304    0.25373  0.17617  0.20795
Wikipedia            0.18261  0.18261  0.18261    0.22388  0.14493  0.17595
acquis               0.15217  0.15217  0.15217    0.19403  0.12093  0.149
extracted literals   0.21739  0.21645  0.21692    0.24627  0.17647  0.20561
Conclusion
● TF-IDF considering triples as documents shows good results (F-measure of about 0.21 with stemming)
● Taxonomy extraction provided the best results (F-measure of up to 0.34921 with MinComDoc=2, words, stopwords removed)
Future work
● Automatically extract categories/topics from a dataset
● Generate N-gram summaries for topics, based on a model trained on the full dataset
● Gather relevant statistics about datasets
● Create a more precise evaluation method