The End of Term Archive: Archiving the U.S. Government Web MLTW | Dec. 5, 2017 Abigail Grotke, Web Archiving Team Lead, Library of Congress @agrotke | abgr@loc.gov
It all began a long, long time ago, in a far away… place https://flic.kr/p/4JNkLE https://flic.kr/p/4N2jHU
Goals of the End of Term Project ◻ work collaboratively to preserve public U.S. Government websites ◻ document federal agencies’ presence on the web during the end of Presidential terms ◻ enhance the existing research collections of the partner institutions ◻ raise awareness about the need for preservation ◻ engage with researchers and subject experts
Extant .gov web archiving efforts
Capture, Preservation, & Access
• LOC: legislative branch, some executive
• GPO: agency sites, often ephemeral
• NARA: congressional web harvest every 2 years
• IA: global & curated crawls
• Agency-level: NIH/NLM, DOE, DOL, HHS, CMS, others, using Archive-It or other tools
• UNT & Others: Topical .gov collecting
• Internal agency guidelines
Community Efforts
• Federal Web Archiving Group
• Research Initiatives: academic; NGO or watchdog
• Citizen Driven: grassroots efforts, e.g. Data Refuge/Rescue
• End of Term: focused but large-scale multi-institutional project
Original End of Term Web Archive Partners for 2008/2012 - all IIPC & NDIIPP/NDSA partners
EOT 2016: more partners! Federal Government Web Archiving Working Group
EOT Collaborative Roles • Project management and coordination (All, rotates) • Nomination tool development (UNT) • URL nomination (All + community/public) • Crawling (IA, UNT, LC, GW) • Preservation of full copy (IA, LC, UNT) • Access: portal, full-text search, metadata, research & support (IA, CDL) • Outreach, press, Twitter account (primarily IA, LC, UNT, Stanford) The 2016 project brought more partners, capacity for better-distributed crawling, and more community and researcher engagement.
Defining the “government web presence” Stanford WebBase Project 2004 crawl list of URLs
Community Engagement Events : NYAM, U-Toronto, U-Penn, UC-Riverside, more Research : CMU, Georgetown, U-Washington
Volunteer contributions
Year | Nominations | Volunteers
2008 | 457 | 26
2012 | 1476 | 31
2016 | 15,000+ | 400+
Plus over 150,000 from DataRescue/EDGI events/tools!
Size of Archive
Year | URLs | Size | Comments
2008 | ~160 million | 17.95 TB | Multiple crawls, deduplicated
2012 | ~120 million | 18.60 TB | More focused crawls, deduplicated. Notable for media richness, uniqueness, density.
2016 | over 429 million | 293.91 TB | Includes ~150 TB of FTP crawls
EOT 2016 Content
Content includes:
• 9,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT
• ~190K total domains, subdomains, gov and non-gov sites
• more crowdsourced, curatorial nominations
[Slide background: list of archived hosts in reversed-host form, e.g. gov,energy,eere) gov,epa,archive) gov,fema,training) gov,house,clerk)]
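The host keys shown on this slide, such as gov,epa,archive) and gov,house,clerk), look like reversed-host (SURT-style) keys: the labels of a hostname in reverse order, joined by commas and closed with a parenthesis. Below is a minimal sketch of that mapping, assuming plain hostnames only. The function names are hypothetical, and dropping a leading "www" label is an assumption borrowed from common Wayback-style canonicalization, not something the slide states.

```python
def host_to_surt_key(host: str) -> str:
    """Turn 'archive.epa.gov' into a reversed-host key like 'gov,epa,archive)'."""
    labels = host.lower().strip(".").split(".")
    # Assumption: a leading "www" label is dropped, as many canonicalizers do.
    if len(labels) > 2 and labels[0] == "www":
        labels = labels[1:]
    return ",".join(reversed(labels)) + ")"


def surt_key_to_host(key: str) -> str:
    """Turn a key like 'gov,house,clerk)' back into 'clerk.house.gov'."""
    labels = key.rstrip(")").split(",")
    return ".".join(reversed(labels))


if __name__ == "__main__":
    for host in ("archive.epa.gov", "www.fema.gov", "clerk.house.gov"):
        print(host, "->", host_to_surt_key(host))
```

Sorting keys in this form groups every subdomain under its parent domain, which is why a list like the one on the slide reads agency by agency.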
EOT 2016 Press ++ • Press • Dozens of articles and interviews • Collaborations • Data Refuge • EDGI • GSA / 18F • data.gov
EOT Challenges • Typical web archiving challenges • complexity of content • volume & proliferation • “you get what you get” w/ little cataloging or QA • Distribution of work • more partners = more project/partner mgmt • contributed seed lists • Resource constraints • the “it isn’t anyone’s actual job” problem • tech, time limitations & scale of data • funding = (there is none)
Using the EOT Archive
End of Term Web Archive http://eotarchive.cdlib.org/
http://eot.us.archive.org/eot/*/www.whitehouse.gov
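The URL on this slide follows the familiar Wayback pattern of prefix, timestamp slot, then target URL, with "*" in the timestamp slot meaning "list all captures". Here is a minimal sketch of building such URLs. It assumes the EOT endpoint also accepts a numeric timestamp in that slot, as Wayback deployments generally do. The function names and the example timestamp are hypothetical, and nothing here makes a network request.

```python
EOT_WAYBACK_PREFIX = "http://eot.us.archive.org/eot"  # prefix taken from the slide


def capture_list_url(target: str) -> str:
    """URL listing all captures of `target` ('*' fills the timestamp slot)."""
    return f"{EOT_WAYBACK_PREFIX}/*/{target}"


def replay_url(target: str, timestamp: str) -> str:
    """URL for a capture near `timestamp` (assumed form: YYYYMMDDhhmmss or a prefix of it)."""
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20170120000000")
    return f"{EOT_WAYBACK_PREFIX}/{timestamp}/{target}"


if __name__ == "__main__":
    print(capture_list_url("www.whitehouse.gov"))
    # Hypothetical timestamp, chosen only to illustrate the URL shape.
    print(replay_url("www.whitehouse.gov", "20170120000000"))
```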
Plans for release of 2016 – 2017 • All web crawl data from IA, LC, UNT has been ingested at IA. • Derivative datasets for all the data (WATs, WANEs, extracted page text) have been generated. • Components to integrate new content into portal are being worked on (metadata, search, thumbnails, Wayback indexes). Once finalized, CDL will begin process to update the portal. • We have a number of researchers interested in the data (IA working with them)
analysis by project team members
http://vphill.com/journal/?s=eot
Comparing PDFs in EOT from 2008 to 2012 http://vphill.com/journal/post/5872/ http://vphill.com/journal/post/5861/
.gov & .mil biggest change
Top 15 .gov & .mil domains present in 2008 but missing in 2012
Top 15 .gov & .mil domains new in 2012
EOT2008 and EOT2012 Crawling Schedule
Extracted Special Web Collections https://archive.org/details/MilitaryIndustrialPowerpointComplex http://archive.org/~vinay/20th-century-gov-headshots.html http://archive.org/~vinay/20th-century-gov-groupshots.html
eot 2016 “raw” content https://archive.org/details/EndOfTerm2016WebCrawls
researcher access to .gov
• WAT Datasets (Web Archive Transformation): Key Metadata from Every Resource
• LGA Datasets (Longitudinal Graph Analysis): What Links to What over Time
• WANE Datasets (Web Archive Named Entities): Names of People, Places, Organizations
Web Archive Datasets (via platform, disk, APIs, etc.)
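To give a sense of how WAT metadata can feed a "what links to what" view, here is a minimal sketch that pulls (source, target) link pairs out of one WAT-style JSON record. The nested field names follow the commonly documented WAT layout as I understand it, so treat them as an assumption and check them against the actual files. The sample record and its example.gov URLs are hand-written and hypothetical, and `extract_links` is an illustrative helper, not part of any EOT tooling.

```python
import json
from typing import List, Tuple

# Hand-written, hypothetical record shaped like a WAT JSON payload.
SAMPLE_WAT_RECORD = json.loads("""
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://www.example.gov/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://www.example.gov/about"},
            {"path": "A@/href", "url": "http://data.example.gov/"},
            {"path": "IMG@/src"}
          ]
        }
      }
    }
  }
}
""")


def extract_links(record: dict) -> List[Tuple[str, str]]:
    """Return (source_url, target_url) pairs for one WAT-style record."""
    envelope = record.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    # Skip link entries that carry no URL.
    return [(source, link["url"]) for link in links if "url" in link]


if __name__ == "__main__":
    for src, dst in extract_links(SAMPLE_WAT_RECORD):
        print(src, "->", dst)
```

Collecting these pairs across crawls from different years is one plausible way to arrive at a longitudinal link graph of the kind the LGA datasets describe.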
researcher access to .gov http://archivesunleashed.com/ http://www.websci16.org/hackathon https://github.com/vinaygoel/ars-workshop http://webarchives.ca/
web preservation for content creators
Recommend
More recommend