The End of Term Archive: Archiving the U.S. Government Web

The End of Term Archive: Archiving the U.S. Government Web MLTW | Dec. 5, 2017 Abigail Grotke, Web Archiving Team Lead, Library of Congress @agrotke | abgr@loc.gov It all began a long, long, time ago, in a far away place


  1. The End of Term Archive: Archiving the U.S. Government Web MLTW | Dec. 5, 2017 Abigail Grotke, Web Archiving Team Lead, Library of Congress @agrotke | abgr@loc.gov

  2. It all began a long, long, time ago, in a far away… place https://flic.kr/p/4JNkLE https://flic.kr/p/4N2jHU

  3. Goals of the End of Term Project ◻ work collaboratively to preserve public U.S. Government websites ◻ document federal agencies’ presence on the web during the end of Presidential terms ◻ enhance the existing research collections of the partner institutions ◻ raise awareness about the need for preservation ◻ engage with researchers and subject experts

  4. Extant .gov web archiving efforts. Capture, Preservation, & Access: • LOC: legislative branch, some executive • GPO: agency sites, often ephemeral • NARA: congressional web harvest every 2 years • IA: global & curated crawls • Agency-level: NIH/NLM, DOE, DOL, HHS, CMS, others, using Archive-It or other tools • UNT & Others: Topical .gov collecting • Internal agency guidelines. Community Efforts: • Federal Web Archiving Group • Research Initiatives (academic; NGO or watchdog) • Citizen Driven (grassroots efforts, e.g. Data refuge/rescue) • End of Term (focused but large-scale multi-institutional project)

  5. Original End of Term Web Archive Partners for 2008/2012 - all IIPC & NDIIPP/NDSA partners

  6. EOT 2016: more partners! Federal Government Web Archiving Working Group

  7. EOT Collaborative Roles • Project management and coordination (All, rotates) • Nomination tool development (UNT) • URL nomination (All + community/public) • Crawling (IA, UNT, LC, GW) • Preservation of full copy (IA, LC, UNT) • Access: portal, full-text search, metadata, research & support (IA, CDL) • Outreach, press, Twitter account (primarily IA, LC, UNT, Stanford). The 2016 project brought more partners, capacity for better, distributed crawling, and more community and researcher engagement.

  8. Defining the “government web presence” Stanford WebBase Project 2004 crawl list of URLs

  9. Community Engagement • Events: NYAM, U-Toronto, U-Penn, UC-Riverside, more • Research: CMU, Georgetown, U-Washington

  10. Volunteer contributions • 2008: 457 nominations, 26 volunteers • 2012: 1476 nominations, 31 volunteers • 2016: 15,000+ nominations, 400+ volunteers • Plus over 150,000 from DataRescue/EDGI events/tools!

  11. Size of Archive • 2008: ~160 million URLs, 17.95 TB. Multiple crawls, deduplicated. • 2012: ~120 million URLs, 18.60 TB. More focused crawls, deduplicated. Notable for media richness, uniqueness, density. • 2016: over 429 million URLs, 293.91 TB. Includes ~150 TB of FTP crawls.

  12. EOT 2016 Content • Content includes: • 9,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT • ~190K total domain, subdomains, gov and non-gov sites • more crowdsourced, curatorial nominations (Slide background: a long listing of crawled hosts in reversed-domain form, e.g. gov,energy,eere) gov,epa) gov,fcc) gov,fema) gov,dot,fhwa) gov,house,clerk) )
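The host entries on this slide (gov,epa), gov,dot,fhwa), gov,dot,nhtsa,www-esv) and so on) appear to be hostnames written in reversed, comma-separated form, in the style of SURT keys used by web archiving tools. A minimal Python sketch of that transformation follows. It is an approximation under stated assumptions, not the project's actual tooling: the function name `surt_host_key` is hypothetical, and the rule of dropping a leading `www` label is inferred from the entries shown on the slide.

```python
def surt_host_key(host: str) -> str:
    """Approximate the reversed-domain host key seen on the slide.

    Assumptions (not confirmed by the source): labels are lowercased,
    a leading "www" label is dropped, and the key ends with ")".
    """
    labels = host.strip().lower().rstrip(".").split(".")
    # Drop a bare leading "www" label, but keep labels such as "www-esv".
    if len(labels) > 2 and labels[0] == "www":
        labels = labels[1:]
    # Reverse the label order and join with commas.
    return ",".join(reversed(labels)) + ")"


if __name__ == "__main__":
    for name in ["www.epa.gov", "www.fhwa.dot.gov", "www-esv.nhtsa.dot.gov"]:
        print(name, "->", surt_host_key(name))
```

Writing hosts this way keeps all subdomains of one agency adjacent when the list is sorted, which is likely why the slide's listing groups gov,dot,... and gov,house,... entries together.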

  13. EOT 2016 Press ++ • Press • Dozens of articles and interviews • Collaborations • Data Refuge • EDGI • GSA / 18F • data.gov

  14. EOT Challenges • Typical web archiving challenges • complexity of content • volume & proliferation • “you get what you get” w/ little cataloging or QA • Distribution of work • more partners = more project/partner mgmt • contributed seed lists • Resource constraints • the “it isn’t anyone’s actual job” problem • tech, time limitations & scale of data • funding = (there is none)

  15. Using the EOT Archive

  16. End of Term Web Archive http://eotarchive.cdlib.org/

  17. http://eot.us.archive.org/eot/*/www.whitehouse.gov
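The URL on this slide follows the familiar Wayback-style layout: an archive prefix, a timestamp slot (here the `*` wildcard, which lists all captures), and the original site address. The Python sketch below only assembles such URLs as strings and makes no network request. The function name `eot_wayback_url` is hypothetical, and the idea that the `*` slot also accepts a specific timestamp is an assumption based on general Wayback conventions, not something the slide states.

```python
# Prefix taken from the slide; everything else here is illustrative.
EOT_PREFIX = "http://eot.us.archive.org/eot"


def eot_wayback_url(target: str, timestamp: str = "*") -> str:
    """Build a Wayback-style URL for a target site in the EOT archive.

    timestamp="*" reproduces the capture-listing form shown on the slide.
    Passing a digit string such as "20090120" is an assumption based on
    the usual Wayback URL convention.
    """
    if timestamp != "*" and not timestamp.isdigit():
        raise ValueError("timestamp must be '*' or a string of digits")
    return f"{EOT_PREFIX}/{timestamp}/{target}"


if __name__ == "__main__":
    print(eot_wayback_url("www.whitehouse.gov"))
```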

  18. Plans for release of 2016 – 2017 • All web crawl data from IA, LC, UNT has been ingested at IA. • Derivative datasets for all the data (WATs, WANEs, extracted page text) have been generated. • Components to integrate new content into portal are being worked on (metadata, search, thumbnails, Wayback indexes). Once finalized, CDL will begin process to update the portal. • We have a number of researchers interested in the data (IA working with them)

  19. analysis by project team members

  20. http://vphill.com/journal/?s=eot

  21. Comparing PDFs in EOT from 2008 to 2012 http://vphill.com/journal/post/5872/ http://vphill.com/journal/post/5861/

  22. .gov & .mil biggest change

  23. Top 15 .gov & .mil domains present in 2008 but missing in 2012; Top 15 .gov & .mil domains new in 2012

  24. EOT2008 and EOT2012 Crawling Schedule

  25. Extracted Special Web Collections https://archive.org/details/MilitaryIndustrialPowerpointComplex http://archive.org/~vinay/20th-century-gov-headshots.html http://archive.org/~vinay/20th-century-gov-groupshots.html

  26. eot 2016 “raw” content https://archive.org/details/EndOfTerm2016WebCrawls

  27. researcher access to .gov • WAT Datasets (Web Archive Transformation): Key Metadata from Every Resource • LGA Datasets (Longitudinal Graph Analysis): What Links to What over Time • WANE Datasets (Web Archive Named Entities): Names of People, Places, Organizations • Web Archive Datasets (via platform, disk, APIs, etc.)

  28. researcher access to .gov http://archivesunleashed.com/ http://www.websci16.org/hackathon https://github.com/vinaygoel/ars-workshop http://webarchives.ca/

  29. web preservation for content creators
