An updated Portrait of the Portuguese Web João Miranda, Daniel - PowerPoint PPT Presentation

An updated Portrait of the Portuguese Web João Miranda, Daniel Gomes {joao.miranda,daniel.gomes} @ fccn.pt http://arquivo.pt

Summary • Introduction • Methodology • Metrics • Conclusions 2/28

3/28 Introduction

The Web • The Web is a huge source of information – Information published exclusively on the Web – Information disappears • Preservation started by the Web Archives – Access for future generations • First initiative: Internet Archive 4/28

Web Archives • Altavista across time [Figure: archived screenshots from 1996, 1997, 1998, 1999, 2001, 2003, 2004 and 2009] 5/28

The Portuguese Web Archive • The Portuguese Web Archive [Diagram of the Web archive; components shown: Content, Crawler, Storage, Indexer, Web Interface, User] 6/28

Crawler • What is a crawler? – Collects contents from the Web – Starts from an initial set of addresses • How does it work? – Iteratively downloads contents – Extracts links to find new ones 7/28
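
The crawling loop described on this slide can be sketched as a breadth-first traversal. This is a minimal illustration, not the Heritrix implementation used in the study: `TOY_WEB`, `toy_fetch_links` and `crawl` are hypothetical names, and the "Web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "Web": each address maps to the links found in its content.
TOY_WEB = {
    "http://a.pt/": ["http://a.pt/news", "http://b.pt/"],
    "http://a.pt/news": ["http://a.pt/"],
    "http://b.pt/": ["http://c.pt/", "http://a.pt/news"],
    "http://c.pt/": [],
}


def toy_fetch_links(url):
    """Stand-in for 'download the content and extract its links'."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_contents=1000):
    """Start from the seed addresses, download each address once,
    and queue every newly discovered link."""
    frontier = deque(seeds)   # addresses waiting to be downloaded
    seen = set(seeds)         # addresses already queued, to avoid repeats
    visited = []              # download order
    while frontier and len(visited) < max_contents:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.pt/"], toy_fetch_links)
print(order)
```

A real crawler would replace `toy_fetch_links` with an HTTP download plus link extraction, and would restrict the frontier to the crawl scope (here, the .PT domain).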

8/28 Methodology

Methodology • Crawl of the Portuguese Web (March-May, 2008) – .PT domain – Heritrix crawler – 180 000 initial addresses – 48 million contents – 2.5 TB • No content analysis, only log analysis 9/28

10/28 Metrics

Sites hosted per IP address – Why? • Politeness policies for crawling [Diagram: a crawler fetching siteA, siteB and siteC from server1, server2 and server3] 11/28
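
One way to see why the number of sites per IP address matters for politeness is a toy scheduler. The sketch below assumes that "politeness" means leaving a fixed delay between consecutive requests to the same IP address; the slides do not state the actual policy, and `schedule`, `delay` and the IP addresses are hypothetical.

```python
def schedule(requests, delay):
    """Assign a start time (in seconds) to each request so that two requests
    to the same IP address are at least `delay` seconds apart.
    Requests to different IP addresses may start at the same time."""
    next_free = {}  # IP address -> earliest time the next request may start
    plan = []
    for site, ip in requests:
        start = next_free.get(ip, 0.0)
        plan.append((site, ip, start))
        next_free[ip] = start + delay
    return plan


# siteA and siteB share one server; siteC is hosted alone (example addresses).
plan = schedule(
    [("siteA", "192.0.2.1"), ("siteB", "192.0.2.1"), ("siteC", "192.0.2.2")],
    delay=2.0,
)
print(plan)
```

Under this assumption, sites that share a server wait for each other even though they are distinct sites, so a per-site delay alone would overload a server hosting many sites.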

Sites hosted per IP address - Results • 75% of the IP addresses host 1 site • 23% host ]1,10] sites • 2% host more than 10 sites 12/28
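
A distribution like the one above could be computed from a site-to-IP mapping roughly as follows. This is a sketch under assumptions: the input format, the function name `sites_per_ip_distribution` and the sample addresses are hypothetical, and only the three buckets come from the slide.

```python
from collections import defaultdict


def sites_per_ip_distribution(site_to_ip):
    """Return the percentage of IP addresses hosting 1 site,
    ]1,10] sites and more than 10 sites."""
    sites_by_ip = defaultdict(set)
    for site, ip in site_to_ip.items():
        sites_by_ip[ip].add(site)

    buckets = {"1 site": 0, "]1,10] sites": 0, ">10 sites": 0}
    for sites in sites_by_ip.values():
        n = len(sites)
        if n == 1:
            buckets["1 site"] += 1
        elif n <= 10:
            buckets["]1,10] sites"] += 1
        else:
            buckets[">10 sites"] += 1

    total = len(sites_by_ip)
    if total == 0:
        return {label: 0.0 for label in buckets}
    return {label: 100.0 * count / total for label, count in buckets.items()}


# Hypothetical sample: four IP addresses, one of which hosts two sites.
sample = {
    "a.pt": "192.0.2.1",
    "b.pt": "192.0.2.2",
    "c.pt": "192.0.2.3",
    "d.pt": "192.0.2.4",
    "e.pt": "192.0.2.4",
}
dist = sites_per_ip_distribution(sample)
print(dist)
```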

Successful responses – Why? • Quality indicator • A large % of broken links undermines the trust of users 13/28

Successful responses - Results • 18% of the sites returned 100% OK responses 14/28
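
The per-site success rate behind this figure might be derived from crawl log records along these lines. The sketch assumes that each record is a `(site, status)` pair and that an "OK response" is HTTP status 200; neither the log format nor that definition is given in the slides, and the function names are hypothetical.

```python
from collections import Counter


def ok_rate_per_site(records):
    """Percentage of responses with status 200 for each site."""
    total = Counter()
    ok = Counter()
    for site, status in records:
        total[site] += 1
        if status == 200:
            ok[site] += 1
    return {site: 100.0 * ok[site] / total[site] for site in total}


def share_of_fully_ok_sites(rates):
    """Percentage of sites whose responses were all OK."""
    if not rates:
        return 0.0
    fully_ok = sum(1 for rate in rates.values() if rate == 100.0)
    return 100.0 * fully_ok / len(rates)


# Hypothetical log excerpt.
records = [("a.pt", 200), ("a.pt", 200), ("b.pt", 200), ("b.pt", 404)]
rates = ok_rate_per_site(records)
share = share_of_fully_ok_sites(rates)
print(rates, share)
```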

Media types – Why? • Browsers or document viewers for cellphones • Parsing and indexing for search engines 15/28

Media types - Results • 90% of the number of contents are html, jpeg, gif • 69% of the amount of data are html, pdf, jpeg 16/28

Rank  Media type       % contents    Media type        % amount of data
1     text/html        57.8%         text/html         35.4%
2     image/jpeg       22.8%         application/pdf   17.9%
3     image/gif        9.4%          image/jpeg        16.1%
4     text/xml         1.9%          text/plain        4.2%
-     Other            8.1%          Other             26.4%
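
The two rankings differ because one counts contents and the other counts bytes. A minimal sketch of both computations, assuming each log record is a `(media_type, size_in_bytes)` pair (a hypothetical format; `media_type_shares` is an illustrative name):

```python
from collections import Counter


def media_type_shares(records):
    """Return two dicts: share of each media type by number of contents
    and by amount of data, both as percentages."""
    count = Counter()
    volume = Counter()
    for media_type, size in records:
        key = media_type.lower()  # media type names are case-insensitive
        count[key] += 1
        volume[key] += size

    n = sum(count.values()) or 1        # avoid division by zero on empty input
    total_bytes = sum(volume.values()) or 1
    by_count = {t: 100.0 * c / n for t, c in count.items()}
    by_bytes = {t: 100.0 * v / total_bytes for t, v in volume.items()}
    return by_count, by_bytes


# Hypothetical sample: few but large PDFs dominate the data volume.
sample = [
    ("text/html", 1000),
    ("Text/HTML", 1000),
    ("image/jpeg", 2000),
    ("application/pdf", 6000),
]
by_count, by_bytes = media_type_shares(sample)
print(by_count)
print(by_bytes)
```

In the sample, PDF accounts for a quarter of the contents but most of the bytes, which mirrors how application/pdf ranks second by amount of data in the table without appearing in the top four by number of contents.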

Content size – Why? • Estimate the storage resources required to create Web data repositories 17/28

Content size - Results • 96% lower than 128 KB 18/28

Dynamically generated contents – Why? • Identify technological trends in Web publishing 19/28

Dynamically generated contents - Results • At least 46.3% of the contents were dynamically generated – 22.4% PHP – 10% ASP – 0.9% JSP – 13% other, with parameters • 53.7% not dynamically generated 20/28
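
Since the study relied on log analysis only, a classification like this one could plausibly be made from the URL alone. The rules below are an assumption, not the authors' documented method: file extensions `.php`, `.asp`/`.aspx` and `.jsp` mark the technology, and any other URL with a query string counts as "Other with parameters". This also illustrates why the slide says "at least": dynamic pages served under plain-looking URLs would go undetected.

```python
from urllib.parse import urlsplit


def classify_url(url):
    """Guess whether a content was dynamically generated, from its URL only."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if path.endswith(".php"):
        return "PHP"
    if path.endswith((".asp", ".aspx")):
        return "ASP"
    if path.endswith(".jsp"):
        return "JSP"
    if parts.query:
        return "Other with parameters"
    return "Not dynamically generated"


# Hypothetical URLs.
labels = [
    classify_url("http://example.pt/index.php?id=3"),
    classify_url("http://example.pt/list.ASP"),
    classify_url("http://example.pt/search?q=web"),
    classify_url("http://example.pt/logo.gif"),
]
print(labels)
```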

URL length – Why? • Influences interaction design • Determine adequate length for input boxes that receive URLs • How many characters should be presented on a search engine results page 21/28

URL length - Results • 84% lower than 100 characters 22/28

23/28 Conclusions

Conclusions I • Long URL addresses • Nearly half of the contents (at least 46.3%) are dynamically generated (mainly PHP) • 90% of the contents are HTML, JPEG and GIF • 69% of the amount of data are HTML, PDF and JPEG 24/28

Conclusions II • 96% of the contents are smaller than 128 KB • Half of the sites present a successful response rate below 80% • Most IP addresses host a single site 25/28

“Future” work • Study trends in the evolution of web characteristics –João Miranda, Daniel Gomes, Trends in Web characteristics, 7th Latin American Web Congress • Analyze metrics extracted from content and link analysis 26/28

Contribute to preserving the Web • Anyone can help preserve the Web • Lend disk space to keep backup copies – Just need to install rARC – http://arquivo.pt/rarc • Help required to test beta version 27/28

Thank you. Logs used in this study are available for research purposes. Please contact us. http://arquivo.pt 28/28
