A Data Warehouse/OLAP Framework for Web Usage Mining and Business Intelligence Reporting

Xiaohua Hu
College of Information Science, Drexel University, Philadelphia, PA, USA 19104
email: thu@cis.drexel.edu

Nick Cercone
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
email: nick@cs.dal.ca

Abstract

Web usage mining is the application of data mining techniques to discover usage patterns and behaviors from web data (clickstream, purchase information, customer information, etc.) in order to understand and serve e-commerce customers better and improve the online business. In this paper we present a general Data Warehouse/OLAP framework for web usage mining and business intelligence reporting. We integrate web data warehouse construction, data mining and On-Line Analytical Processing (OLAP) into the e-commerce system; this tight integration dramatically reduces the time and effort required for web usage mining, business intelligence reporting and mining deployment. Our Data Warehouse/OLAP framework consists of four phases: data capture, webhouse construction (clickstream marts), pattern discovery and cube construction, and pattern evaluation and deployment. We discuss data transformation operations for web usage mining and business reporting at the clickstream, session and customer levels, describe the problems and challenging issues in each phase in detail, and provide plausible solutions to these issues, illustrated with examples from real websites. Our Data Warehouse/OLAP framework has been integrated into some commercial e-commerce systems. We believe this framework would be very useful for developing real-world web usage mining and business intelligence reporting systems.

1. Introduction

Knowing customers and understanding their needs is essential for customer retention in an online web store, since competitors are just one click away.
To maintain a successful e-commerce solution, it is necessary to collect and analyze customer click behaviors at the web store. A web site generates a large amount of reliable data and is a killer domain for data mining applications. Web usage mining can help an e-commerce solution improve up-selling, cross-selling, personalized ads, click-through rate and so on by analyzing the clickstream and customer purchase data with data mining techniques. Web usage mining has recently attracted much attention from researchers and e-business professionals, and it offers many benefits to an e-commerce web site, such as:

• Targeting customers based on usage behavior or profile (personalization)
• Adjusting web content and structure dynamically based on the page access patterns of users (adaptive web site)
• Enhancing the service quality and delivery to the end user (cross-selling, up-selling)
• Improving web server system performance based on web traffic analysis
• Identifying hot areas/killer areas of the web site.

We present a general Data Warehouse/OLAP framework for web usage mining and business intelligence reporting. In our framework, data mining is tightly integrated into the e-commerce system. Our Data Warehouse/OLAP framework consists of four phases: data capture, webhouse construction (clickstream marts), pattern discovery and cube construction, and pattern evaluation and deployment, as shown in Figure 1. The framework provides the appropriate data transformations (also called ETL: Extraction, Transformation and Loading) from the OLTP system to the data warehouse, builds data cubes from the data warehouse, mines the data for business analysis, and finally deploys the mining results to improve the online business. We describe the problems and challenging issues in each phase in detail and provide a general approach and guideline to web usage mining and business intelligence reporting for e-commerce.

The rest of the paper is organized as follows. In Section 2, we discuss the various data capture methods and some of their pitfalls and challenging issues. In Section 3, we describe the data transformation operations for web data at different levels of granularity (clickstream level, session level and customer level) and show how to organize the dimension and fact tables for the webhouse, which is the data source for web usage mining and business intelligence reporting. We discuss cube construction and various data mining methods for web usage mining in Section 4, and pattern evaluation (mining rules evaluation) in Section 5. We conclude in Section 6 with some discussion.
Data Capture (clickstream, sale, customer, product, etc) -> Data Webhouse Construction (dimensions, fact tables, aggregation table, etc) -> Mining, OLAP (rules, prediction models, cubes, reports, etc) -> Pattern Evaluations & Deployment

Figure 1: The Data Warehouse/OLAP Data Flow Diagram

2. Data Capture

Capturing the necessary data in the data collection stage is a key step for a successful data mining task. A large part of web data is represented in the web log collected on the web server. A web log records the interactions between the web server and the web user (web browser). A typical web log (Common Log format) contains information such as the IP address of the requesting host, the user ID used to access a restricted area, the time stamp of the URL request, the request method, the status or error code, and the size in bytes of the transaction. The Extended Log format adds extra information such as the referrer and agent. Web logs were originally designed to help debug web servers. One of the fundamental flaws of analyzing web log data is that log files contain information about the files transferred from the server to the client, not information about the people visiting the web site
[9,19]. Some of these fields are useless for data mining and are filtered out in the data pre-processing step. Others, such as the IP address, referrer and agent, can reveal much about the site visitors and the web site. Mining the web store often starts with the web log data. Web log data need to go through a set of transformations before data mining algorithms can be applied. In order to have a complete picture of the customers, web usage data should include the web server access log, browser logs, user profiles, registration data, user sessions, cookies, user search keywords, and user business events [1,9,14]. Based on our practice and experience in web usage mining, we believe that web usage mining requires combining multiple data sources. The data needed to perform the analysis should come from five main sources:

(1) The web server logs recording the visitors' clickstream behaviors (page template, cookie, transfer log, time stamp, IP address, agent, referrer etc.)
(2) Product information (product hierarchy, manufacturer, price, color, size etc.)
(3) Content information of the web site (image, gif, video clip etc.)
(4) The customer purchase data (quantity of the products, payment amount and method, shipping address etc.)
(5) Customer demographics information (age, gender, income, education level, lifestyle etc.)

Data collected in a typical web site fall into different levels of granularity: page view, session, order item, order header, customer. A page view carries information such as the type of page and the time spent on it. A session consists of a sequence of page views; an order contains a few order items. It is best practice in the data collection phase to collect the most detailed, finest-grained data possible describing the clicks on the web server and the items sold at the web store.
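As a rough illustration of the web log fields discussed in this section, the following Python sketch parses one log line into a record with one entry per field. It assumes the widely used NCSA Combined Log Format layout (the Common Log format fields followed by a quoted referrer and agent) as a stand-in for the Extended Log format mentioned above. The function name `parse_log_line`, the field names and the sample line are hypothetical and are not taken from the framework described in this paper.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout (not from the paper): Common Log format fields, optionally
# followed by a quoted referrer and a quoted agent (Combined Log Format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the time stamp, status code and size to typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line using documentation-only addresses and hosts.
sample = (
    '192.0.2.10 - alice [10/Oct/2000:13:55:36 -0700] '
    '"GET /catalog/item42.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
record = parse_log_line(sample)
```

Each parsed record describes one file transfer, not one visitor, so further transformation is still needed before the data can be analyzed at the session or customer level.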
Each web server will potentially report different details, but at the lowest level, we should be able to obtain a record for every page hit and every item sold if we want a complete picture of the click behavior and sales of the web store. There are various methods to capture and collect valuable information about e-commerce visitors at the server level, proxy level and client level through the CGI interface, Java API, or JavaScript [1,9,14]. Most of them use web log data or packet sniffers as the data source for the clickstream. Web log data alone are not sufficient for data mining purposes, for the following main reasons:

(1) Unable to identify the sessions
(2) Lack of web store transaction data; the web store transaction data record all sale-related information of a web store and are necessary for business analysis and data mining in order to answer some basic and important business questions such as "which referrer site leads to more product sales at my site?", "what is the conversion rate of the web site?", "which parts of my web site are more attractive to purchasers?".
(3) Lack of business events of the web store; business events of a web store such as "add an item to shopping cart", "research key event", "abandoning shopping cart" are very useful for analyzing the shopping and browsing behavior of users of a web store.

In our framework, we believe that collecting data at the web application server layer is the most effective approach, as suggested by some commercial vendors [9,14]. The web application server controls all the user activities such as registration and logging in/out, and can create a unified database to store web log data, sale transaction data and business events of