
Microsoft Academic Graph (Viszards session)

Vladimir Batagelj, IMFM Ljubljana and IAM UP Koper. XXXVI Sunbelt 2016.


  1. Microsoft Academic Graph. Viszards session. Vladimir Batagelj, IMFM Ljubljana and IAM UP Koper. XXXVI Sunbelt 2016, Newport Beach, California; April 5–10, 2016

  2. Outline: 1 Microsoft Academic Graph; 2 Pajek files; 3 Years; 4 Authors and keywords; 5 Derived networks; 6 Citation network; 7 Conclusions; 8 References. Vladimir Batagelj: vladimir.batagelj@fmf.uni-lj.si. Current version of slides (April 10, 2016, 16:57): http://vlado.fmf.uni-lj.si/pub/slides/vbMAG16.pdf

  3. Microsoft Academic Graph: The Microsoft Academic Graph (MAG) is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals and conference "venues", and fields of study. The first version was published on June 5, 2015; the last updated version is from February 5, 2016. Reference: Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, and Kuansan Wang, An Overview of Microsoft Academic Service (MAS) and Applications, WWW – World Wide Web Consortium (W3C), 18 May 2015.

  4. MAG – entities and sizes
     Entity name            Entity count
     Papers                 > 83 million
     Authors                > 20 million
     Institutions           > 770,000
     Journals               > 22,000
     Conference series      > 900
     Conference instances   > 26,000
     Fields of study        > 50,000
     The ZIP containing all data files has size 28.2 GB.
     Searching, machine learning, recommendation tasks.

  5. MAG – data files structure
     Affiliations: 1 Affiliation ID; 2 Affiliation name
     Authors: 1 Author ID; 2 Author name
     FieldsOfStudy: 1 Field of study ID; 2 Field of study name
     FieldOfStudyHierarchy: 1 Child field of study ID; 2 Child field of study level; 3 Parent field of study ID; 4 Parent field of study level; 5 Confidence
     ConferenceSeries: 1 Conference series ID; 2 Short name (abbreviation); 3 Full name
     ConferenceInstances: 1 Conference series ID; 2 Conference instance ID; 3 Short name (abbreviation); 4 Full name; 5 Location; 6 Official conference URL; 7 Conference start date; 8 Conference end date; 9 Conference abstract registration date; 10 Conference submission deadline date; 11 Conference notification due date; 12 Conference final version due date

  6. MAG – data files structure
     Papers: 1 Paper ID; 2 Original paper title; 3 Normalized paper title; 4 Paper publish year; 5 Paper publish date; 6 Paper Document Object Identifier (DOI); 7 Original venue name; 8 Normalized venue name; 9 Journal ID mapped to venue name; 10 Conference series ID mapped to venue name; 11 Paper rank
     PaperAuthorAffiliations: 1 Paper ID; 2 Author ID; 3 Affiliation ID; 4 Original affiliation name; 5 Normalized affiliation name; 6 Author sequence number
     PaperReferences: 1 Paper ID; 2 Paper reference ID
     PaperUrls: 1 Paper ID; 2 URL
     PaperKeywords: 1 Paper ID; 2 Keyword name; 3 Field of study ID mapped to keyword
     Journals: 1 Journal ID; 2 Journal name
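To make the field positions concrete, here is a minimal Python sketch that reads a few rows of Papers and PaperAuthorAffiliations and picks fields by their position in the listings. It assumes the files are tab-separated text without a header line. The sample rows and the names `read_rows`, `authors_of` and `affiliations_of` are invented for illustration and do not come from the slides.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; tab-separated with no header line (an assumption).
# Field order follows the Papers and PaperAuthorAffiliations listings.
PAPERS_TXT = (
    "P1\tOn Networks\ton networks\t2010\t2010/05/01\t10.1000/a\tVenue X\tvenue x\tJ1\t\t19000\n"
    "P2\tMore Networks\tmore networks\t2012\t2012/03/15\t\tVenue Y\tvenue y\t\tC1\t20000\n"
)
PAA_TXT = (
    "P1\tA1\tI1\tUniv One\tuniv one\t1\n"
    "P1\tA2\tI2\tUniv Two\tuniv two\t2\n"
    "P2\tA2\tI2\tUniv Two\tuniv two\t1\n"
)


def read_rows(text):
    """Split tab-separated text into lists of fields; quote characters are kept as data."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))


# Paper ID (field 1) -> Paper publish year (field 4)
year = {row[0]: int(row[3]) for row in read_rows(PAPERS_TXT)}

# Paper ID (field 1) -> set of Author IDs (field 2)
authors_of = defaultdict(set)
# Author ID (field 2) -> set of Affiliation IDs (field 3)
affiliations_of = defaultdict(set)
for row in read_rows(PAA_TXT):
    authors_of[row[0]].add(row[1])
    affiliations_of[row[1]].add(row[2])

print(year)
print(dict(authors_of))
```

For the real files, the same loops would run over an open file handle instead of the in-memory sample strings.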

  7. MAG into a collection of networks: MAG is similar to data from bibliographic databases (Web of Science, Scopus, DBLP, ZB Math, etc.). In our paper On bibliographic networks we proposed to transform such data into a collection of one-mode and two-mode networks; in the case of MAG into: Cite, WA, WK, WV, AC, where: W – works (papers, books, etc.), A – authors, K – keywords, V – venues (conferences, journals, publishers), C – companies or institutions, F – field; and some properties of nodes: year – publication year of a work. An important fact about these networks is that many pairs share a common set, so using network multiplication we can get derived networks.
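As an illustration of network multiplication, the following Python sketch multiplies small two-mode networks that share the set W. The networks are stored as sparse dictionaries. The toy data, the function names and the names `Co` and `AK` for the products are assumptions made for this example; the slide does not say which derived networks are meant.

```python
from collections import defaultdict


def transpose(net):
    """Transpose a sparse network stored as {row: {col: weight}}."""
    out = defaultdict(dict)
    for r, cols in net.items():
        for c, w in cols.items():
            out[c][r] = w
    return dict(out)


def multiply(a, b):
    """Sparse product: result[i][j] = sum over k of a[i][k] * b[k][j]."""
    out = defaultdict(lambda: defaultdict(int))
    for i, row in a.items():
        for k, w1 in row.items():
            for j, w2 in b.get(k, {}).items():
                out[i][j] += w1 * w2
    return {i: dict(row) for i, row in out.items()}


# Toy two-mode networks; all IDs are invented for illustration.
WA = {  # works x authors
    "w1": {"a1": 1, "a2": 1},
    "w2": {"a2": 1, "a3": 1},
    "w3": {"a1": 1, "a2": 1},
}
WK = {  # works x keywords
    "w1": {"k1": 1},
    "w2": {"k1": 1, "k2": 1},
    "w3": {"k2": 1},
}

AW = transpose(WA)
Co = multiply(AW, WA)  # authors x authors: number of works two authors share
AK = multiply(AW, WK)  # authors x keywords: how often an author's works use a keyword

print(Co["a1"])
print(AK["a2"])
```

The product only visits nonzero entries, which matters for networks of the size listed for MAG, where a dense matrix would not fit in memory.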

  8. Problems
     • The networks obtained from the complete MAG are very large and require substantial time for construction and analysis. We decided:
       • to limit the analysis in the first phase to a smaller subset of data on which the analyses can be performed fast;
       • to explore the data and see what the problems are;
       • to identify problems and develop solutions.
     • transforming and cleaning the data
     • identifying problems
     • missing "standard" bibliographic data such as Volume and First page.
     We selected as the subset the data related to SNA. Extraction was done by Juergen Pfeffer.

  9. MAG/SNA – sizes
     W – works (papers, books, etc.): 634552
     A – authors: 1048433
     K – keywords: 24535
     V – venues (conferences, journals, publishers)
     C – companies or institutions
     F – field

  10. Cleaning: The networks are too large to do individual cleaning in general. We can identify some problems that can be corrected using (short) programs. For example, the same author appears several times in the list of authors: the identity problem. We produced a partition that puts all authors with the same name into the same class. Applying it to shrink the set of authors can be risky: in MathSciNet there exist 697 Chinese mathematicians with the name Wang, Li.
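A same-name partition of this kind could be produced by a short program along the following lines. This is a sketch under assumptions, not the program used for the slides: the lowercase and whitespace normalization, the function names and the sample records are all invented for illustration.

```python
from collections import defaultdict


def name_partition(authors):
    """Assign one class number to all author IDs that share a (normalized) name.

    authors: iterable of (author_id, name) pairs.
    Normalization (lowercase, collapsed whitespace) is an assumption of this sketch.
    """
    class_of_name = {}
    partition = {}
    for author_id, name in authors:
        key = " ".join(name.lower().split())
        # The first occurrence of a name opens a new class; later ones reuse it.
        partition[author_id] = class_of_name.setdefault(key, len(class_of_name) + 1)
    return partition


def shared_classes(partition):
    """Return only the classes holding more than one author ID."""
    members = defaultdict(list)
    for author_id, cls in partition.items():
        members[cls].append(author_id)
    return {cls: ids for cls, ids in members.items() if len(ids) > 1}


# Invented sample records.
authors = [
    ("A1", "Li Wang"),
    ("A2", "li  wang"),
    ("A3", "Vladimir Batagelj"),
]
partition = name_partition(authors)
print(partition)
print(shared_classes(partition))
```

Listing the shared classes before shrinking gives a chance to inspect large classes, such as a very common name, which are the ones most likely to merge different people.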

  11. MAG – entities and sizes: Another such partition is the partition DOI that puts into the same class all works with the same DOI. In this case it is reasonable to assume that they identify the same work. In general we treat the remaining inconsistencies in the data as noise. If they also show up in the results, we correct the data in an appropriate way and repeat the analysis.
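One way the DOI partition could be used to shrink the citation network is sketched below in Python. It makes several assumptions that the slide does not state: works without a DOI stay in singleton classes, DOIs are compared case-insensitively, the first work seen with a DOI represents its class, and loops created by merging are dropped. The function names and sample data are invented.

```python
def doi_representatives(works):
    """Map each work ID to a representative: the first work seen with the same DOI.

    works: iterable of (work_id, doi) pairs; an empty DOI means "unknown".
    """
    first_with_doi = {}
    rep = {}
    for work_id, doi in works:
        key = doi.strip().lower()  # DOIs are compared case-insensitively
        if key:
            rep[work_id] = first_with_doi.setdefault(key, work_id)
        else:
            rep[work_id] = work_id  # no DOI: the work stays on its own
    return rep


def shrink_arcs(arcs, rep):
    """Redirect citation arcs to representatives, dropping duplicates and loops."""
    return sorted({(rep[u], rep[v]) for u, v in arcs if rep[u] != rep[v]})


# Invented sample data: W1 and W2 carry the same DOI in different letter case.
works = [("W1", "10.1000/abc"), ("W2", "10.1000/ABC"), ("W3", ""), ("W4", "")]
arcs = [("W3", "W1"), ("W3", "W2"), ("W2", "W1"), ("W4", "W3")]

rep = doi_representatives(works)
print(rep)
print(shrink_arcs(arcs, rep))
```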

  12. MAG/SNA – The distribution of papers by years [Figure: scatter plot "The distribution of papers by years"; x-axis: year (1950 to 2010), y-axis: freq (0 to 30000).]

  13. MAG/SNA – The distribution of papers by years
      > setwd("c:/users/Batagelj/work/Python/MAG")
      > # read the publication years; the first two lines of Year.clu are header lines
      > years <- read.table(file="Year.clu",header=FALSE,skip=2)$V1
      > t <- table(years)
      > min(years)
      [1] 1803
      > max(years)
      [1] 2016
      > year <- as.integer(names(t))
      > freq <- as.vector(t[1950<=year & year<=2016])
      > y <- 1950:2016
      > # fit a scaled lognormal density to the yearly counts
      > model <- nls(freq~c*dlnorm(2017-y,a,b),start=list(c=500000,a=2.5,b=0.7))
      > model
      Nonlinear regression model
        model: freq ~ c * dlnorm(2017 - y, a, b)
         data: parent.frame()
              c         a         b
      6.317e+05 2.655e+00 6.164e-01
       residual sum-of-squares: 51166952

      Number of iterations to convergence: 6
      Achieved convergence tolerance: 9.371e-06
      > plot(y,freq,pch=16,cex=0.75,main="The distribution of papers by years",
      +   xlab="year",ylab="freq")
      > # the model is expressed in y, so predict() is given y directly
      > lines(y,predict(model,list(y=y)),col='red',lwd=2)
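The model fitted in this session can be written out explicitly. R's `dlnorm(x, meanlog, sdlog)` is the lognormal density, so with x = 2017 − y the fitted curve takes the following form (the symbol f for the density is introduced here for illustration):

```latex
\mathrm{freq}(y) \approx c \cdot f(2017 - y;\, a, b),
\qquad
f(x;\, a, b) = \frac{1}{x\, b\, \sqrt{2\pi}}
  \exp\!\left( -\frac{(\ln x - a)^2}{2 b^2} \right),
\quad x > 0 .
```

With the reported estimates c ≈ 6.317·10^5, a ≈ 2.655 and b ≈ 0.6164, the mode of the lognormal density lies at x = exp(a − b²) ≈ 9.7, so the fitted curve peaks at roughly 2017 − 9.7 ≈ 2007.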
