Image Classification for Mobile Web Browsing
Takuya Maekawa*, Takahiro Hara**, Shojiro Nishio**
*NTT, **Osaka Univ.
International World Wide Web Conference 2006
Background: Many commercial products and research studies focus on how to browse large Web pages on mobile devices with small screens. Approaches: Web page reconfiguration and Web page analysis.
Background (Web page reconfiguration): Some Web images (contents) are discarded or downsized to fit the page layout of the small screen. Examples: deleting images for page layout; reducing photographic images; discarding contents (personalization).
Background (Web page analysis): Overcoming the limitations of mobile devices and supporting Web browsing activities by analyzing Web pages. Examples: component recognition; automatically providing important images in the page. [Figure: example news page with Live 8 headlines]
Problems: Most studies and commercial products are prone to serious errors in detecting Web images because of their simple image role detection mechanisms. Role of a Web image: • Menu • Content • Ad ...
Problems (Web page reconfiguration): Deleting images for page layout; reducing photographic images; discarding contents (personalization). Figure annotations: delete hyperlink; reduce; delete; delete image and ignore table tag. [Figure: example page listing film titles]
Advantage (Web page analysis): Automatically providing important images in the page; component recognition. Simple rule: width and height < 10 pixels means an image for layout (small white images). 32% error.
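The simple rule on this slide can be sketched as a one-line predicate. This is only an illustration: the function name and the example sizes are hypothetical, and the slide does not say how existing systems actually implement the rule.

```python
def is_layout_image_simple(width, height, threshold=10):
    """Simple rule from the slide: width AND height below the threshold (in pixels)."""
    return width < threshold and height < threshold


# Hypothetical image sizes (width, height) used only for illustration.
examples = [(1, 1), (8, 8), (468, 60), (5, 200)]
labels = [is_layout_image_simple(w, h) for w, h in examples]
print(labels)
```

Note that a tall, thin spacer such as the hypothetical 5x200 image is not flagged, which hints at why a size-only rule can misjudge an image's role.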
Goal: Automatic classification of Web images into categories according to image role
1. Collect 3901 images from 40 Web sites.
2. Define 11 categories of Web images.
3. Manually categorize the 3901 images into the 11 categories.
4. Select 37 image features for accurate automatic categorization of Web images.
Collecting Web images: Images were taken from 120 pages in 40 sites. For each site, we selected 3 pages including an index page. In total, we collected 3901 images.
11 image categories
String images: MENU, SECTION, DECORATION, BUTTON
Small images: ITEM, ICON
TITLE, MAP, AD, CONTENT, LAYOUTER
Image categories: MENU. Images for the site menu. They are arranged in a horizontal line in the upper and/or lower portion of the page. 67.6% of them had more than two horizontally in-line images at the same height. They usually have small aspect ratios (average: 0.320).
Image categories: SECTION. Headers of a section or a column of the page. They have text following them (92.8%). They usually have small aspect ratios (average: 0.142).
Image categories: DECORATION. Images for decorative text. They represent text that would be difficult to create using only HTML tags. These images do not have hyperlinks.
Image categories: BUTTON. Images with hyperlinks. These images have neighboring text and hyperlinks to the associated pages. Position of the text around them: above: 16.1%, below: 8.0%, left: 36.8%, right: 13.8%.
Image categories: ITEM. Line-head images of an itemization. ITEM images with the same width are aligned vertically (74.6%). They have neighboring text on the right (99.4%). ITEM images usually have aspect ratios of about 1 (average: 1.052).
Image categories: ICON. Images that represent some kind of object. ICON images have neighboring text on the right or left (right: 58.3%, left: 22.0%). ICON images usually have aspect ratios of about 1 (average: 0.942).
Image categories: TITLE. Title images of the page. TITLE images have hyperlinks to the index page of the site or to themselves.
Image categories: MAP. Image maps. <MAP NAME="world"> <AREA href="map.gif" … > </MAP>
Image categories: AD. Advertisement images. Some AD images have hyperlinks to other domains (average: 25.5%). AD images usually have small aspect ratios (average: 0.459).
Image categories: CONTENT. Content images that are associated with the main contents of the page. CONTENT images have neighboring text on the right or below them (right: 35.1%, below: 51.7%). 55.4% of the CONTENT images were in JPEG format (other images: 6.6%).
Image categories: LAYOUTER. Images that control the design and layout of other images and/or text on the page. Most LAYOUTER images are filled with a single color. LAYOUTER images usually appear many times on a page (average: 10.7).
Distribution of collected images: We manually categorized the collected images.
Category: number of images
MENU: 686
SECTION: 469
DECORATION: 69
BUTTON: 87
ITEM: 311
ICON: 264
TITLE: 141
MAP: 53
AD: 329
CONTENT: 951
LAYOUTER: 541
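As a quick consistency check, the per-category counts on this slide can be summed and turned into shares. The counts are taken from the slide; the variable names are only illustrative.

```python
# Per-category image counts as listed on the slide.
counts = {
    "MENU": 686, "SECTION": 469, "DECORATION": 69, "BUTTON": 87,
    "ITEM": 311, "ICON": 264, "TITLE": 141, "MAP": 53,
    "AD": 329, "CONTENT": 951, "LAYOUTER": 541,
}

total = sum(counts.values())  # should match the 3901 collected images
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
largest = max(counts, key=counts.get)

for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<11}{n:>5}{shares[name]:>7}%")
```

The total agrees with the 3901 images stated earlier, and the classes are clearly imbalanced: CONTENT is the largest category and MAP the smallest.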
Image features: We defined 37 image features (F1-F37) to classify Web images. Not all mobile devices can extract all features, so we grouped the features according to their sources.
F1-F20: HTML source analysis
F21, F22: Web server
F23-F30: Rendering information
F31-F37: Image processing
Image features (HTML)
F1: Dimension
F2: Width
F3: Height
F4: Aspect ratio
F5: Uses Map or not {TRUE, FALSE}
F6: Has a hyperlink or not {TRUE, FALSE}
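A minimal sketch of how F1-F6 might be computed from an image tag's attributes. Several points are assumptions, not statements from the slides: "dimension" is taken to be width × height, and the aspect ratio is taken to be height / width (which would be consistent with wide MENU and SECTION images having small average ratios). The `ImageTag` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageTag:
    """Hypothetical container for attributes read from an <img> tag."""
    width: int
    height: int
    has_usemap: bool = False     # True if the tag refers to an image map
    href: Optional[str] = None   # target of an enclosing hyperlink, if any


def basic_features(tag):
    """Compute F1-F6 under the assumptions stated above."""
    return {
        "F1_dimension": tag.width * tag.height,  # assumed: area in pixels
        "F2_width": tag.width,
        "F3_height": tag.height,
        # assumed: height / width, so wide banners get small values
        "F4_aspect_ratio": tag.height / tag.width if tag.width else 0.0,
        "F5_uses_map": tag.has_usemap,
        "F6_has_hyperlink": tag.href is not None,
    }


banner = basic_features(ImageTag(width=468, height=60, href="/news/"))
bullet = basic_features(ImageTag(width=12, height=12))
print(banner)
print(bullet)
```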
Image features (HTML)
F7: Has an outlink or not {TRUE, FALSE}. An outlink is a hyperlink to another domain.
F8: Has a loop-back link or not {TRUE, FALSE}. A loop-back link is a hyperlink to the index page of the site or to the page that the image is on. TITLE images and MENU images usually have this set to 'TRUE'.
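The two link features could be derived from the hyperlink target roughly as follows. This is a sketch under assumptions: "another domain" is approximated by comparing host names, URLs are compared after light normalization, and the function name and the example.com / example.net URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def _normalize(url):
    """Reduce a URL to (host, path) for a rough equality check."""
    parts = urlparse(url)
    return parts.netloc.lower(), (parts.path.rstrip("/") or "/")


def link_features(page_url, href, index_url):
    """Return F7 (outlink) and F8 (loop-back link) for an image's hyperlink."""
    if href is None:
        return {"F7_outlink": False, "F8_loop_back": False}
    target = urljoin(page_url, href)  # resolve relative links against the page
    target_host = urlparse(target).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    # Assumption: a different host name counts as "another domain".
    is_outlink = bool(target_host) and target_host != page_host
    is_loop_back = _normalize(target) in {_normalize(page_url), _normalize(index_url)}
    return {"F7_outlink": is_outlink, "F8_loop_back": is_loop_back}


page = "http://example.com/news/a.html"
index = "http://example.com/"
title_link = link_features(page, "/", index)
ad_link = link_features(page, "http://ads.example.net/click", index)
self_link = link_features(page, "a.html", index)
print(title_link, ad_link, self_link)
```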
Image features (HTML)
F9: Has an ALT string or not {TRUE, FALSE}. String images and other text images usually have this set to 'TRUE' (MENU: 85.4%, SECTION: 74.0%, DECORATION: 66.7%, BUTTON: 63.2%).
F10: Number of characters in an ALT string
Image features (HTML)
F11: Number of characters in neighboring text
F12: JPEG image or not {TRUE, FALSE}
F13: Index in the HTML source. The index is the order of the corresponding tag in an HTML source. TITLE images have small values (average: 48.4; average of all images: 424.7).
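One way to obtain an index like F13 is to count start tags while parsing and record the count at each image tag. This is a sketch only: the slide does not say whether the order is counted over all tags or over image tags alone, so counting all start tags is an assumption, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ImgIndexParser(HTMLParser):
    """Records, for each <img>, its position among all start tags seen so far."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.img_indices = []  # list of (src, index) pairs

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "img":
            self.img_indices.append((dict(attrs).get("src"), self.tag_count))


def image_indices(html):
    parser = ImgIndexParser()
    parser.feed(html)
    parser.close()
    return parser.img_indices


sample = (
    '<html><body><a href="/"><img src="logo.gif"></a>'
    '<p>text</p><img src="photo.jpg"></body></html>'
)
indices = image_indices(sample)
print(indices)
```

In this toy page the title-like image near the top receives a smaller index than the later image, in line with the observation that TITLE images have small values.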
Image features (HTML)
F14: Number of appearances on a page
F15: Number of images with the same dimension on a page (CONTENT: 7.5, ICON: 4.3, ITEM: 4.0)
Image features (HTML)
F16: Number of images with the same width on a page (CONTENT: 8.1, AD: 3.5, ICON: 4.3, ITEM: 4.5, SECTION: 4.4)
F17: Number of images with the same height on a page (CONTENT: 8.1, MENU: 8.5, SECTION: 4.8, ICON: 4.4, ITEM: 4.8)
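The per-page counts F15-F17 can be sketched with simple counters over the image sizes on a page. Two assumptions are made here: "same dimension" is read as the same (width, height) pair, and each count includes the image itself. The function name and sizes are hypothetical.

```python
from collections import Counter


def same_size_counts(sizes):
    """For each (width, height) on a page, count images sharing its size attributes."""
    by_dimension = Counter(sizes)
    by_width = Counter(w for w, _ in sizes)
    by_height = Counter(h for _, h in sizes)
    return [
        {
            "F15_same_dimension": by_dimension[(w, h)],
            "F16_same_width": by_width[w],
            "F17_same_height": by_height[h],
        }
        for w, h in sizes
    ]


# Hypothetical page with four images given as (width, height).
page_sizes = [(88, 31), (88, 31), (88, 40), (120, 31)]
features = same_size_counts(page_sizes)
for size, feats in zip(page_sizes, features):
    print(size, feats)
```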
Image features (HTML)
F18-F20: Number of neighboring images with the same attribute (height, width, dimension)
Image features (Web server)
F21: Byte size
F22: Byte size per dimension (CONTENT: 0.83, AD: 0.71, ICON: 1.2, ITEM: 1.0, LAYOUTER: 8.9)
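F22 can be read as the file size divided by the image area. This interpretation (bytes per pixel, with "dimension" meaning width × height) is an assumption, and the file sizes below are hypothetical.

```python
def byte_size_per_dimension(byte_size, width, height):
    """F22 sketch: bytes per pixel, assuming dimension = width * height."""
    area = width * height
    if area <= 0:
        return 0.0  # guard against missing or zero sizes
    return byte_size / area


# Hypothetical examples: a 1x1 spacer GIF and a 300x200 photograph.
spacer = byte_size_per_dimension(43, 1, 1)
photo = byte_size_per_dimension(48000, 300, 200)
print(spacer, photo)
```

Under this reading, tiny spacer images carry a large fixed file overhead per pixel, which would fit the high LAYOUTER average on the slide.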
Image features (Rendering info.)
F23-F30: Features extracted when rendering the page
• X coordinate
• Y coordinate
• Number of images with the same X coordinate
• Number of images with the same Y coordinate
...
Image features (Image processing)
F31: Number of colors
F32: Number of concolorous regions
F33: Minimum similarity to neighboring images
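F31 and F32 can be illustrated on a tiny pixel grid. This sketch assumes that a concolorous region is a 4-connected group of pixels with exactly the same color; the slides do not specify the connectivity or any color tolerance, and the function names are hypothetical.

```python
from collections import deque


def count_colors(pixels):
    """F31 sketch: number of distinct colors in a 2D grid of color values."""
    return len({color for row in pixels for color in row})


def count_concolorous_regions(pixels):
    """F32 sketch: number of 4-connected regions of identical color."""
    if not pixels:
        return 0
    height, width = len(pixels), len(pixels[0])
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            regions += 1
            color = pixels[y][x]
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill over same-colored neighbors
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and pixels[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


W, R, B = (255, 255, 255), (255, 0, 0), (0, 0, 255)
grid = [
    [W, W, R],
    [W, R, R],
    [B, B, W],
]
n_colors = count_colors(grid)
n_regions = count_concolorous_regions(grid)
print(n_colors, n_regions)
```

A single-colored spacer would give one color and one region, while a photograph would typically give many of both.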
Image features (Image processing)
F34: Animated GIF or not. 14.29% of AD images were animated GIFs (other images: 0.36%).
F35: Has a rounded-corner rectangle or not (BUTTON: 37.9%)
Image features (Image processing)
F36: Text region occupancy ratio (LAYOUTER: 0.40%, SECTION: 37.89%, DECORATION: 55.19%, TITLE: 44.85%)
F37: Number of text regions (AD: 2.75, MENU: 1.04, SECTION: 1.19)
Experiment: We performed forty classification tests with a decision tree.
Training set: images from thirty-nine sites
Test set: images from the one remaining site
[Conditions]
C1: HTML source analysis (F1-F20)
C2: HTML + Web server (F1-F22)
C3: HTML + Web server + Rendering info. (F1-F30)
C4: HTML + Web server + Image processing (F1-F22, F31-F37)
C5: All features (F1-F37)
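The experimental protocol amounts to leave-one-site-out evaluation repeated for five feature subsets. The sketch below only builds the folds and the feature lists; it does not train the decision tree, and the sample records, site names, and function name are hypothetical.

```python
# Feature numbers used in each condition, as listed on the slide.
CONDITIONS = {
    "C1": list(range(1, 21)),                         # F1-F20
    "C2": list(range(1, 23)),                         # F1-F22
    "C3": list(range(1, 31)),                         # F1-F30
    "C4": list(range(1, 23)) + list(range(31, 38)),   # F1-F22, F31-F37
    "C5": list(range(1, 38)),                         # F1-F37
}


def leave_one_site_out(samples):
    """Yield (held_out_site, training_samples, test_samples) once per site."""
    sites = sorted({s["site"] for s in samples})
    for held_out in sites:
        train = [s for s in samples if s["site"] != held_out]
        test = [s for s in samples if s["site"] == held_out]
        yield held_out, train, test


# Hypothetical data: 40 sites with two labeled images each.
samples = [
    {"site": f"site{i:02d}", "image": f"img{j}.gif", "label": "MENU"}
    for i in range(40)
    for j in range(2)
]
folds = list(leave_one_site_out(samples))
print(len(folds), {name: len(feats) for name, feats in CONDITIONS.items()})
```

Holding out a whole site at a time means the classifier is always tested on images from a site it has not seen, which is a stricter check than splitting images at random.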