
Using Semantic Similarity in Crawling-based Web Application Testing
Jun-Wei Lin (UC-Irvine), Farn Wang (National Taiwan Univ.), Paul Chu (QNAP, Inc.)

Crawling-based Web App Testing
• Treats the web app under test as a black box
• Interacts with the app through its interface
  – DOMs in browsers
• Usage
  – Model-based testing
  – Invariant detection
  – Cross-browser compatibility testing
J.-W. Lin, F. Wang, P. Chu (ICST 2017)

Crawling-based Web App Testing
Challenges:
• Input value selection
  – Topic identification
• GUI state comparison
Present approaches:
• Labor-intensive manual work
• Application-specific
• String-matching based
  – Rules written by humans

Present approaches (1/4)
Input Value Selection (Topic Identification)
    input.id("last_name").setValue("James");

Present approaches (2/4)
String-matching Based Rules
1. Map the feature string to a topic
2. Select a value from the dataset for the topic
    input.id("last_name").setValue("James");

Present approaches (3/4)
String-matching Based Rules
    input.id("last_name").setValue("James");
Drawbacks:
• What if the field is labeled "last name", "family name", "surname", or even has a randomly generated id?
• What if an id maps to multiple topics? e.g.,
  – "tel" → telephone
  – "ln" → last_name
  – "aycreateln" → ?
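The two drawbacks can be reproduced with a minimal sketch of substring-based rules. The rule table and the function name match_topics are hypothetical, chosen only to mirror the examples on the slide; real rule sets would be larger and application-specific.

```python
from typing import Dict, Set

# Hypothetical substring rules (key found in the feature string -> topic).
RULES: Dict[str, str] = {
    "tel": "telephone",
    "ln": "last_name",
    "last_name": "last_name",
}


def match_topics(feature: str) -> Set[str]:
    """Return every topic whose rule key occurs in the feature string."""
    feature = feature.lower()
    return {topic for key, topic in RULES.items() if key in feature}


for field_id in ["last_name", "surname", "aycreateln"]:
    topics = sorted(match_topics(field_id))
    print(field_id, "->", topics if topics else "no match")
```

Here "surname" matches no rule, while "aycreateln" contains both "tel" and "ln" and so maps to two topics at once.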

Present approaches (4/4)
GUI State Abstraction
• Distinguish newly discovered GUI states from explored ones
• Abstract the states by DOM content filtering
• Application-specific

Observations
• Humans interact with web applications through natural-language text
  – not through DOM structures or attributes
• In markup languages (e.g., HTML and XML), the reserved words for DOM attributes are limited
  – id, name, type…
• Although the words used in the text and attributes of input fields for the same topic may differ among web applications, they are usually semantically similar
  – “last name”, “surname”, “family name”

Our Proposal
Inference with Semantic Similarity

Inference with Semantic Similarity
Running Example
[Figure: training data and the input field to be inferred]

Inference with Semantic Similarity
Feature Extraction

Inference with Semantic Similarity
Vector Transformation
Bag-of-Words

Inference with Semantic Similarity
Vector Transformation
Tf-idf (term frequency with inverse document frequency):
$f_{\text{"password"},\,d_3} \cdot \log_2\left(N / n_{\text{"password"}}\right) = 4$
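As a rough illustration of the two vector transformations, the sketch below computes bag-of-words counts and tf-idf weights of the form f(t, d) · log2(N / n_t), where f(t, d) is the count of term t in document d, N is the number of documents, and n_t is the number of documents containing t. The four-document corpus is hypothetical and is not the running example from the slides.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical training documents: feature words extracted from input fields.
docs: List[List[str]] = [
    ["last", "name"],
    ["email", "address"],
    ["password"],
    ["confirm", "password", "password"],
]


def bag_of_words(doc: List[str]) -> Dict[str, int]:
    """Raw term counts f(t, d) for one document."""
    return dict(Counter(doc))


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight each term by f(t, d) * log2(N / n_t)."""
    n_docs = len(corpus)
    weights: Dict[str, float] = {}
    for term, freq in Counter(doc).items():
        # n_t: number of corpus documents that contain the term.
        n_t = sum(1 for d in corpus if term in d)
        if n_t == 0:
            continue  # terms unseen in the corpus get no weight
        weights[term] = freq * math.log2(n_docs / n_t)
    return weights


print(bag_of_words(docs[3]))
print(tf_idf(docs[3], docs))
```

In this toy corpus "confirm" appears once in one of four documents and "password" twice in a document but in two of four documents, so both end up with the same weight.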

Inference with Semantic Similarity
Vector Transformation
Latent Semantic Indexing
• Singular Value Decomposition: $X = U \Sigma V^T$
  – $U$: latent concepts in the documents
  – $\Sigma$: importance of each latent concept
  – $V^T$: coordinates of the documents in the latent vector space
• In our experiment, we use the gensim library.
• Also see http://www.bluebit.gr/matrix-calculator/

Inference with Semantic Similarity
Similarity Calculation
• With $U$, $\Sigma$ and $V^T$, we can transform a document $q$ into the latent vector space, where its coordinates are $q' = \Sigma^{-1} U^T q$
• Similarity of $q$ to the training documents = cosine similarity of $q'$ to the vectors in $V^T$
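The transformation of a new document can be derived from the decomposition itself. The short derivation below writes the decomposition as $X = U \Sigma V^T$ and assumes the columns of $U$ are orthonormal and $\Sigma$ is invertible (which holds for a truncated SVD that keeps only nonzero singular values). The symbols $d_j$ (the j-th column of $X$, i.e. the j-th training document) and $v_j$ (its latent coordinates) are introduced here for illustration.

```latex
\begin{aligned}
X &= U \Sigma V^T \\
U^T X &= \Sigma V^T && \text{since } U^T U = I \\
\Sigma^{-1} U^T X &= V^T \\
\Sigma^{-1} U^T d_j &= v_j && \text{column } j \text{ of both sides} \\
q' &= \Sigma^{-1} U^T q && \text{treating } q \text{ as one more column} \\
\operatorname{sim}(q, d_j) &= \frac{q' \cdot v_j}{\lVert q' \rVert \, \lVert v_j \rVert}
\end{aligned}
```

Each training document is thus mapped to its own coordinates by the same operator that is applied to the new document, so the cosine comparison takes place in a shared space.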

Inference with Similarity
[Figure: similarity scores 0.9976, 0.0697, 0.0000, 0.0000]
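One plausible way to turn such scores into an inferred topic is to pick the training document with the highest cosine similarity. The sketch below assumes this arg-max rule, which the slides do not state explicitly; the latent coordinates, topic labels, and the name infer_topic are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical latent-space coordinates of labelled training documents.
training: List[Tuple[str, List[float]]] = [
    ("last_name", [0.9, 0.1]),
    ("email", [0.1, 0.9]),
    ("password", [0.0, 0.0]),
]


def infer_topic(query: Sequence[float],
                labelled: List[Tuple[str, List[float]]]) -> Tuple[str, float]:
    """Return the (topic, score) pair with the highest cosine similarity."""
    scored = [(topic, cosine(query, vec)) for topic, vec in labelled]
    return max(scored, key=lambda pair: pair[1])


topic, score = infer_topic([0.8, 0.2], training)
print(topic, round(score, 4))
```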

Experiment 1
Input Topic Identification
• 100 real-world graduate program registration forms
• 985 input fields in total

Experiment 1
Input Topic Identification
Steps
• Randomly choose x% of the forms as training data (corpus)
  – x = 10, 20, 30, 40, 50, 60, 70
• Generate rules (i.e., mappings from feature strings to topics) using the training forms
• Infer the remaining forms with:
  – The proposed approach (NL)
  – Rule-based approach (RB)
  – RB+NL-n (no-match)
  – RB+NL-m (multiple-topic)
  – RB+NL-b (both)
• Repeat 1000 times
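The steps above can be sketched as a random split plus a dispatch between the two approaches. This is an interpretation, not the authors' code: it assumes that the suffixes n, m and b mean "fall back to NL when the rules give no match", "when they give multiple topics", and "in both cases", and that an unresolved field yields no answer. The names split_forms and hybrid_infer and the stub classifiers are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Set, Tuple


def split_forms(forms: Sequence, train_percent: int,
                rng: random.Random) -> Tuple[List, List]:
    """Randomly choose train_percent% of the forms as the training corpus."""
    shuffled = list(forms)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


def hybrid_infer(field: str,
                 rb: Callable[[str], Set[str]],
                 nl: Callable[[str], str],
                 mode: str) -> Optional[str]:
    """Combine rule-based (rb) and NL inference; mode is 'n', 'm' or 'b'."""
    topics = rb(field)
    if len(topics) == 1:
        return next(iter(topics))      # the rules give a unique answer
    if not topics and mode in ("n", "b"):
        return nl(field)               # no rule matched
    if len(topics) > 1 and mode in ("m", "b"):
        return nl(field)               # rules matched several topics
    return None                        # left unresolved (assumption)


# Stub classifiers standing in for the real RB and NL components.
def rb_stub(field: str) -> Set[str]:
    table = {"last_name": {"last_name"},
             "aycreateln": {"telephone", "last_name"}}
    return table.get(field, set())


def nl_stub(field: str) -> str:
    return "last_name"


train, test = split_forms(range(100), 30, random.Random(0))
print(len(train), len(test))
print(hybrid_infer("surname", rb_stub, nl_stub, "n"))
```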

Experiment 1
Input Topic Identification
Result

Experiment 2
GUI State Abstraction
• A real-world web app and its test cases
• The states were manually examined and clustered by an engineer at the company

Experiment 2
GUI State Abstraction
Abstraction Methods
• WS (White Space)
  – Replace all line breaks and tabs with white space
  – Collapse white space
• TagAttrWD
  – Keep only tag names and important attributes
  – Remove timestamps
  – WS abstraction
• NL
  – Use enclosed text in visible DOM elements
  – A similarity threshold to determine equivalence
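A minimal sketch of the WS abstraction and of threshold-based equivalence, assuming the DOM is available as a string and a similarity score has already been computed. Trimming leading and trailing spaces and the default threshold of 0.9 are assumptions added here, not values from the slides.

```python
import re


def ws_abstraction(dom: str) -> str:
    """Replace line breaks and tabs with spaces, then collapse runs of spaces."""
    flattened = re.sub(r"[\r\n\t]", " ", dom)
    # Trimming the ends is an extra assumption, not part of the slide.
    return re.sub(r" +", " ", flattened).strip()


def same_state_ws(dom_a: str, dom_b: str) -> bool:
    """Two states are equivalent if their WS abstractions are identical."""
    return ws_abstraction(dom_a) == ws_abstraction(dom_b)


def same_state_nl(similarity: float, threshold: float = 0.9) -> bool:
    """NL variant: equivalent when the text similarity reaches the threshold."""
    return similarity >= threshold


a = "<div>\n\t<p>Hello</p>\n</div>"
b = "<div>  <p>Hello</p> </div>"
print(ws_abstraction(a))
print(same_state_ws(a, b))
```

The two sample DOM strings differ only in whitespace, so the WS abstraction treats them as the same state.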

Experiment 2
GUI State Abstraction
Result

Contribution
• Natural language techniques for automating crawling-based web application testing
  – Input topic identification and value selection
  – State equivalence checking
• Experiments

Future Work
• The impact on overall crawling efficacy with more data and other topic model alternatives such as LDA
• Information retrieval from other DOM content, e.g., comments
• Mobile apps?
