Automatically assembling a census of an academic field Allison Morgan, Samuel Way, Aaron Clauset University of Colorado Boulder
About Me Third-year PhD student in CS at CU Boulder. Collaborators and I study the “sociology of science”. Interested in computational methods to study under-representation in academia. Papers shown: “Systematic inequality and hierarchy in faculty hiring networks,” Aaron Clauset, Samuel Arbesman, Daniel B. Larremore, Science Advances 1(1), e1400005 (2015); Proc. 25th Int'l World Wide Web Conf. (WWW), 2016; Proceedings of the National Academy of Sciences, Oct 2017, 201702121.
Motivation Much of the sociology of science studies small samples of the academic workforce (pictured: Nobel Prize winners, chemists) at a single point in time. Can we build a tool to efficiently collect the employment information of all faculty, and of those who leave academia, across institutions and across time? Cartoons by Jorge Cham; phdcomics.com
Challenge Every department maintains a public directory of its faculty, with the same information: names, titles, email addresses, and webpages. But this information is distributed and not well structured. Example directory entries: Jane, Professor, jane@example.edu; Mark, Associate Professor, mark@example.edu; Susan, Assistant Professor, susan@example.edu.
Our Approach Start from the department homepage (Courses | Faculty …). Navigate to its faculty directory. Identify the directory’s HTML structure and extract faculty information (faculty_name: Jane, title: Professor, website: ..., email: ...). Filter non-tenure-track faculty for further analyses. Example titles encountered: Assistant Professor, Research Professor, Full Professor, Instructor.
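The filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the set of titles treated as tenure-track, the exact-match rule, and the two extra records (Pat, Lee) are assumptions made for the example.

```python
# Assumption: these four titles count as tenure-track; the talk does not
# list the exact set it uses.
TENURE_TRACK_TITLES = {
    "assistant professor",
    "associate professor",
    "professor",
    "full professor",
}

def is_tenure_track(title):
    """Exact match after normalizing case and whitespace."""
    return " ".join(title.lower().split()) in TENURE_TRACK_TITLES

def filter_tenure_track(records):
    """Keep only records whose title is in the tenure-track set."""
    return [r for r in records if is_tenure_track(r.get("title", ""))]

# Jane, Mark, and Susan come from the slide; Pat and Lee are hypothetical.
directory = [
    {"faculty_name": "Jane", "title": "Professor", "email": "jane@example.edu"},
    {"faculty_name": "Mark", "title": "Associate Professor", "email": "mark@example.edu"},
    {"faculty_name": "Susan", "title": "Assistant Professor", "email": "susan@example.edu"},
    {"faculty_name": "Pat", "title": "Research Professor", "email": "pat@example.edu"},
    {"faculty_name": "Lee", "title": "Instructor", "email": "lee@example.edu"},
]

kept = filter_tenure_track(directory)
print([r["faculty_name"] for r in kept])
```

Exact matching is used so that "Research Professor" is not kept merely because it contains the word "Professor".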
Our Approach
(i) Navigate to the directory: start at every department homepage URL, collect all outward links, sort the links, and pick a link. Is this a directory? If not, try the next likely link.
(ii) Identify the HTML structure of the directory: parse tables, parse lists, parse divs, parse articles.
(iii) Identify faculty members: given that the HTML structure has been identified, identify names, titles, webpages, and emails.
(iv) Sample the relevant faculty members: for each person on the page, is their title tenure-track? If not, remove the entry from the directory. The result is a filtered directory.
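As a rough illustration of parsing one directory layout and identifying names, titles, and emails, the sketch below handles only the table case using Python's standard library. The class and function names, the email regular expression, and the rule "a title is any cell containing the word professor" are assumptions for this example; the sample HTML is made up.

```python
import re
from html.parser import HTMLParser

# Simplified email pattern, sufficient for the example only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def rows_to_records(rows):
    """Turn rows of cell text into person records; skip rows without an email or title."""
    records = []
    for cells in rows:
        email = next((m.group(0) for c in cells for m in [EMAIL_RE.search(c)] if m), None)
        title = next((c for c in cells if "professor" in c.lower()), None)
        if email is None or title is None:
            continue  # e.g. a header row
        name = next((c for c in cells if c and c != title and not EMAIL_RE.search(c)), None)
        records.append({"faculty_name": name, "title": title, "email": email})
    return records

sample_html = """
<table>
<tr><th>Name</th><th>Title</th><th>Email</th></tr>
<tr><td>Jane</td><td>Professor</td><td>jane@example.edu</td></tr>
<tr><td>Mark</td><td>Associate Professor</td>
    <td><a href="mailto:mark@example.edu">mark@example.edu</a></td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
records = rows_to_records(parser.rows)
print(records)
```

Lists, divs, and articles would each need a comparable parser. The title rule here also misses titles such as "Instructor", so a fuller version would need a broader title vocabulary.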
Navigation From a department homepage, sort all outgoing links by keywords: ["professor", "faculty", "people", "directory", "personnel", "staff", …]. For more than half of departments, this heuristic results in the shortest path. [Figure: fraction of traversals vs. number of extra steps relative to the shortest path.] Showing http://www.cs.ucdavis.edu to http://www.cs.ucdavis.edu/people/faculty/
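One way to sketch the keyword-based link ordering is shown below. It assumes that the position of a keyword in the list is its priority and that a link matches when the keyword appears in its URL or anchor text; the talk states neither detail. The helper names and the sample homepage HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords from the slide; treating list order as priority is an assumption.
KEYWORDS = ["professor", "faculty", "people", "directory", "personnel", "staff"]

class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs for every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def keyword_rank(link):
    """Index of the first matching keyword; non-matching links sort last."""
    url, text = link
    haystack = (url + " " + text).lower()
    for rank, keyword in enumerate(KEYWORDS):
        if keyword in haystack:
            return rank
    return len(KEYWORDS)

def sort_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    # sorted() is stable, so links with equal rank keep their page order.
    return sorted(collector.links, key=keyword_rank)

sample_html = """
<a href="/courses">Courses</a>
<a href="/prospective">Prospective Students</a>
<a href="/people/faculty/">Faculty</a>
"""

ranked = sort_links(sample_html, "http://www.cs.ucdavis.edu")
print(ranked[0])
```

A crawler would then visit the links in this order, stopping at the first page judged to be a directory.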
Navigation To stop at directories, we use a random forest classifier trained on all directory pages and a sample of non-directory pages. Important features: ["NAME", "TITLE", "EMAIL", "PHONE", "website", "profile", "office", "interest"]. Average accuracy is 82%.* * To avoid skipping directory pages, we parse any page whose predicted likelihood of being a directory is > 0. This results in perfect recall, at the expense of precision.
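The effect of the "likelihood > 0" rule on precision and recall can be illustrated with a small calculation. The scores and labels below are invented for the example. In practice the score would be a classifier's predicted probability that a page is a directory, and the example assumes every true directory receives a nonzero score.

```python
def precision_recall(scored_pages, threshold):
    """scored_pages: list of (score, is_directory). A page is parsed if score > threshold."""
    tp = sum(1 for score, is_dir in scored_pages if score > threshold and is_dir)
    fp = sum(1 for score, is_dir in scored_pages if score > threshold and not is_dir)
    fn = sum(1 for score, is_dir in scored_pages if score <= threshold and is_dir)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scores: three true directories, three non-directories.
pages = [
    (0.9, True), (0.6, True), (0.3, True),
    (0.4, False), (0.1, False), (0.0, False),
]

for threshold in (0.5, 0.0):
    precision, recall = precision_recall(pages, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0 recovers the directory that was previously skipped, at the cost of also parsing two non-directory pages.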
Summary of Engineering Results Fast: average < 1 minute vs ~8 hours to produce a single department’s faculty directory. Accurate: 99% recall (nearly all tenure-track faculty are retrieved) and precision (few non-tenure-track faculty are retrieved).* Comparable to findings of a major survey organization: 16% net growth in the number of faculty between our 2011 and 2017 censuses vs 11% from the CRA. [Figure: overlap of the 2011 Census and the 2017 Census. 582 faculty appear only in 2011 (11.2% of the 2011 Census), 4608 appear in both (88.8% of the 2011 Census, 76.8% of the 2017 Census), and 1393 appear only in 2017 (23.2% of the 2017 Census).] * Manually checked against a third of departments. Computing Research Association: https://cra.org
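The census figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes the three counts are the regions of a two-set overlap diagram (582 only in 2011, 4608 in both, 1393 only in 2017), which is the reading under which the printed percentages agree.

```python
# Assumed reading of the overlap diagram.
only_2011, both, only_2017 = 582, 4608, 1393

total_2011 = only_2011 + both   # size of the 2011 census
total_2017 = both + only_2017   # size of the 2017 census

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

shares = {
    "only 2011, of 2011": pct(only_2011, total_2011),
    "both, of 2011": pct(both, total_2011),
    "both, of 2017": pct(both, total_2017),
    "only 2017, of 2017": pct(only_2017, total_2017),
}
net_growth = pct(total_2017 - total_2011, total_2011)

print(total_2011, total_2017)
print(shares)
print(net_growth)
```

Under this reading the net growth comes out just under 16%, consistent with the 16% quoted for the censuses.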
So what can we do with this tool? We investigate the “leaky pipeline”: women leave STEM at various career stages, resulting in their under-representation at the faculty level. [Figure: number of students (millions), men vs. women, across stages.] Journal of Animal Science, 74(11), 2843-2848, 1996; PLoS ONE, 11(7), e0157447, 2016
Leaky Pipeline Three stages of tenure-track
Leaky Pipeline Arrows represent the flow of faculty from their tenure-track stage in 2011 to their stage in 2017
Leaky Pipeline Retention
Leaky Pipeline Promotion
Leaky Pipeline Attrition
Leaky Pipeline Overall attrition for women is slightly higher than for men (15.5% vs 14.3%)
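A minimal sketch of how flows between two census snapshots could be tallied is given below. The definitions are assumptions for illustration: retention means the same rank in both years, promotion means a higher rank in 2017, and attrition means the person is absent from the 2017 snapshot. The people and titles are made up.

```python
# Assumed ordering of the three tenure-track stages.
RANK = {"Assistant Professor": 0, "Associate Professor": 1, "Professor": 2}

def classify_transitions(census_2011, census_2017):
    """Count what happened to each person listed in the 2011 snapshot."""
    counts = {"retention": 0, "promotion": 0, "attrition": 0, "other": 0}
    for person, old_title in census_2011.items():
        new_title = census_2017.get(person)
        if new_title is None:
            counts["attrition"] += 1      # absent from the later snapshot
        elif RANK[new_title] == RANK[old_title]:
            counts["retention"] += 1      # same stage in both years
        elif RANK[new_title] > RANK[old_title]:
            counts["promotion"] += 1      # moved to a higher stage
        else:
            counts["other"] += 1          # anything else, e.g. a lower stage
    return counts

# Hypothetical snapshots keyed by person.
census_2011 = {
    "Jane": "Professor",
    "Mark": "Associate Professor",
    "Susan": "Assistant Professor",
}
census_2017 = {
    "Jane": "Professor",
    "Mark": "Professor",
    "Beth": "Assistant Professor",  # new hire, not counted in 2011 flows
}

flows = classify_transitions(census_2011, census_2017)
print(flows)
```

An attrition rate for a group is then its attrition count divided by the group's size in the earlier snapshot. Note that absence from the later snapshot cannot by itself distinguish leaving academia from moving to a department outside the census.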
Future Work Expand support to other academic fields (pictured: Dept. of Demography, Dept. of Sociology). Use the Internet Archive to collect the historical data (pictured: a department directory changing over time).
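For the historical-data idea, one possible starting point is the Internet Archive's Wayback Machine availability endpoint, which reports the archived snapshot closest to a requested date. The sketch below only builds the query URL and reads a response. It makes no network call, and the sample response is a made-up example in the shape that endpoint returns. The function names are hypothetical, and the talk does not say which Internet Archive interface would be used.

```python
import json
from urllib.parse import urlencode

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"

def availability_query(page_url, timestamp):
    """Build the query URL; timestamp is a string such as YYYYMMDD."""
    return WAYBACK_AVAILABLE + "?" + urlencode({"url": page_url, "timestamp": timestamp})

def closest_snapshot(response_text):
    """Return the closest archived snapshot URL, or None if there is none."""
    closest = json.loads(response_text).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

query = availability_query("http://example.edu/faculty", "20110101")
print(query)

# Made-up response for illustration only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20110101000000/http://example.edu/faculty",
            "timestamp": "20110101000000",
            "status": "200",
        }
    }
})
print(closest_snapshot(sample_response))
```

The snapshot URL returned this way could then be fed to the same directory parser used for live pages.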
Thanks! https://arxiv.org/abs/1804.02760 Prof. Aaron Clauset, PhD Computer Science, aaron.clauset@colorado.edu; Dr. Sam Way, PhD Computer Science, samuel.way@colorado.edu