Concept and Applications of Data Mining
Week 1
Topics
• Introduction
• Syllabus
• Data Mining Concepts
• Team Organization
Introduction Session
• Your name and major
• The definition of data mining
• Your expectations for this course
Course Syllabus
• Syllabus
Data Mining Applications
Classes of Data-Mining Applications in 2003
Data-Mining Application          Percentage
Banking                          13
Bioinformatics/biotech           10
Direct marketing/fundraising     10
Fraud detection                  9
Scientific data                  9
Insurance                        8
Telecommunication                8
Medical/pharmaceuticals          6
Retail                           6
e-Commerce/Web                   5
Other                            4
Investment/stocks                3
Manufacturing                    2
Security                         2
Supply chain analysis            2
Travel                           2
Entertainment                    1
Source: www.kdnuggets.com
Newsweek, May 22, 2006
Market Basket Analysis
Chemistry Informatics
Figure 9.14 A chemical database.
What is Data Mining?
Source: Cover page of Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, MIT Press
How Much Information in 2003
• http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
What is Data Mining?
• Misnomer??
• Gold Mining vs. Sand (Rock) Mining
• Knowledge Discovery from Data (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archaeology
• Data dredging
Data Mining is an Interdisciplinary and Multidisciplinary Field
[Diagram: data mining at the center, surrounded by database technology, machine learning, statistics & math, information theory, information retrieval, and other disciplines]
Figure 1.1 The evolution of database system technology
Data Mining is a Process of Knowledge Discovery
Figure 1.4 Data mining as a step in the process of knowledge discovery
Architecture of a Data Mining System
[Diagram components, top to bottom: Graphical User Interface; Pattern/Model Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, Other Info Repositories]
Figure 1.5 Architecture of a typical data mining system
Data Mining and Stakeholders
[Diagram: layered pyramid; the potential to support business decisions increases toward the top]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Knowledge Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Stakeholders, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Types - Perspective on Structure
• Structured
• Semi-structured
• Unstructured
Structured Data (1)
• Data is organized in semantic entities
• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions (attributes, features)
Structured Data (2)
• Descriptions for all entities in a group (schema)
• Attributes
  – Have the same defined formats
  – Have predefined lengths
  – Follow the same order
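The idea of a shared schema can be sketched in a few lines of Python. This is only an illustration: the `Customer` class, its fields, and the sample rows are hypothetical, and predefined attribute lengths are not modeled here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Customer:
    # Hypothetical schema: every entity in the group has the same
    # attributes, declared in the same order with the same types.
    customer_id: int
    name: str
    city: str


# Made-up rows; each one must supply every attribute of the schema.
customers = [
    Customer(1, "Hayes", "Harrison"),
    Customer(2, "Jones", "Brooklyn"),
]

# The schema belongs to the group, so it can be read off the class itself.
schema = [f.name for f in fields(Customer)]
print(schema)
```

Leaving out an attribute when constructing a `Customer` raises a `TypeError`, which is the structured-data constraint in miniature.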
Semi-structured Data (1)
• Semi-structured data are organized in semantic entities
• Similar entities are grouped together
• Entities in the same group may not have the same attributes
Semi-structured Data (2)
• Attributes
  – Order of attributes not necessarily important
  – Not all attributes may be required
  – Size of same attributes in a group may differ
  – Type of same attributes in a group may differ
XML
<bank-1>
  <customer>
    <customer_name> Hayes </customer_name>
    <customer_street> Main </customer_street>
    <customer_city> Harrison </customer_city>
    <account>
      <account_number> A-102 </account_number>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
    <account> … </account>
  </customer>
  .
  .
</bank-1>
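As a rough illustration of how such a semi-structured document might be read programmatically, the sketch below parses a copy of the fragment above with Python's standard `xml.etree.ElementTree` module. The elided second account and the omitted customers are left out, and the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the slide's fragment; elided parts are not filled in.
xml_text = """
<bank-1>
  <customer>
    <customer_name> Hayes </customer_name>
    <customer_street> Main </customer_street>
    <customer_city> Harrison </customer_city>
    <account>
      <account_number> A-102 </account_number>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
  </customer>
</bank-1>
""".strip()

root = ET.fromstring(xml_text)

records = []
for customer in root.findall("customer"):
    name = customer.findtext("customer_name").strip()
    # A customer may hold zero, one, or many accounts: the number of
    # nested entities is not fixed by a schema.
    accounts = [
        (acc.findtext("account_number").strip(),
         int(acc.findtext("balance")))
        for acc in customer.findall("account")
    ]
    records.append((name, accounts))

print(records)
```

Because the accounts are collected with `findall`, a customer with no `<account>` element simply yields an empty list instead of an error.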
Unstructured Data (1)
• Masses of computerized data that do not have a data structure easily readable by a machine
Unstructured Data (2)
“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.”
-- DM Review Magazine, February 2003 Issue
Data Types – Perspective on Representation
• Numeric and categorical
• Quantitative and qualitative
• Nominal and ordinal
• Static and dynamic (temporal)
Numeric and Categorical Data (1)
• Numeric data
  – Real number data, integer number data
  – Properties
    • Order relation (2 < 5)
    • Distance relation (d(2.3, 4.2) = 1.9)
    • Equality relation (2 = 2)
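The three relations on numeric data can be checked directly. The sketch below assumes that d(a, b) means the absolute difference |a − b|, which matches the slide's example; the function name `distance` is chosen only for this illustration.

```python
import math


def distance(a: float, b: float) -> float:
    """Absolute difference, used here as the distance d(a, b)."""
    return abs(a - b)


order_holds = 2 < 5          # order relation
equality_holds = 2 == 2      # equality relation
d = distance(2.3, 4.2)       # distance relation

# Floating-point subtraction is not exact, so compare with a tolerance
# instead of testing d == 1.9 directly.
distance_matches = math.isclose(d, 1.9)

print(order_holds, equality_holds, distance_matches)
```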
Numeric and Categorical Data (2)
• Categorical (symbolic) values
  – Equality relation
    • Blue = Blue or Red <> Blue
  – Categorical values can be converted to numeric values
    • Gender (male, female) → (0, 1)
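A minimal sketch of the conversion, using the slide's gender example. The sample values are made up, and the mapping simply follows the (male, female) → (0, 1) coding shown above.

```python
# Only the equality relation is meaningful for categorical values.
same = "Blue" == "Blue"
different = "Red" != "Blue"

# Coding from the slide: (male, female) -> (0, 1).
gender_codes = {"male": 0, "female": 1}

# Made-up sample column.
values = ["female", "male", "female"]
encoded = [gender_codes[v] for v in values]

print(same, different, encoded)
```

The codes are labels only: 1 > 0 here does not imply any order between the categories, and the numeric difference between codes carries no meaning.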
Quantitative and Qualitative Data
• Quantitative data
  – Numeric values are quantitative values
  – Height, weight, salary
• Qualitative data
  – Nominal
  – Ordinal
Nominal Data
• Utility customer type (residential, commercial, industrial, governmental)
• Values are represented by different symbols, characters, or numbers
• These values can be coded alphabetically as A, B, C, and D, or numerically as 1, 2, 3, and 4
• Order-less
Ordinal Data
• The rank of a student in a class
• An ordinal variable is a categorical variable for which an order relation is defined but not a distance relation
• The ordered scale need not be linear; the difference between the 4th and 5th students may differ from that between the 14th and 15th students
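The point about ranks can be made concrete with a small sketch. The student names and scores below are invented for illustration; only the ranks derived from them play the role of the ordinal variable.

```python
# Hypothetical exam scores (not from the slides).
scores = {"Ann": 98.0, "Bo": 97.5, "Cy": 90.0, "Di": 89.5}

# Rank 1 is the highest score.
ranked = sorted(scores, key=scores.get, reverse=True)
rank = {name: position + 1 for position, name in enumerate(ranked)}

# The order relation on ranks is meaningful ...
ann_ahead_of_bo = rank["Ann"] < rank["Bo"]

# ... but equal rank differences can hide unequal underlying gaps.
gap_1st_2nd = scores[ranked[0]] - scores[ranked[1]]
gap_2nd_3rd = scores[ranked[1]] - scores[ranked[2]]

print(rank, gap_1st_2nd, gap_2nd_3rd)
```

Both pairs are one rank apart, yet the score gaps differ, which is why arithmetic on ranks (subtracting or averaging them) should be treated with care.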
Static and Dynamic Data
• Static data
  – Attribute values do not change with time
• Dynamic data
  – Attribute values change with time
Data Repositories
• Transactional database
• Relational database
• Data warehouse
• Advanced database
• Data stream
• The World Wide Web
Transactional Database
TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Table 5.1 Transactional data for an AllElectronics branch
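Transactional data of this kind is the usual input to market basket analysis. As a minimal sketch (plain counting, not a full frequent-itemset algorithm), the code below tallies how many of the nine transactions in Table 5.1 contain each item and each pair of items. The variable names are chosen for this example.

```python
from collections import Counter
from itertools import combinations

# Table 5.1, keyed by transaction ID.
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

# Support count of an item: number of transactions that contain it.
item_counts = Counter(
    item for items in transactions.values() for item in items
)

# Support count of a pair: number of transactions containing both items.
# Sorting gives each pair one canonical form, e.g. ("I1", "I2").
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

# Relative support of {I1, I2} as a fraction of all transactions.
support_i1_i2 = pair_counts[("I1", "I2")] / len(transactions)

print(item_counts["I2"], pair_counts[("I1", "I2")], support_i1_i2)
```

Counting every pair directly works for a table this small; algorithms for larger data avoid enumerating combinations that cannot be frequent.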
Figure 1.6 Fragments of relations from a relational database for AllElectronics
Data Warehouse (Mart)
Figure 1.7 Typical framework of a data warehouse for AllElectronics
Table 3.1 Comparison between OLTP and OLAP systems
Star Schema of a Data Warehouse for Sales
Figure 3.4 Star schema of a data warehouse for sales
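The figure itself did not survive extraction, so the sketch below is not a reproduction of it. It only illustrates the general shape of a star schema: one central fact table whose rows reference surrounding dimension tables. All table names, column names, and rows are hypothetical, and Python's built-in `sqlite3` module stands in for a warehouse server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical): one row per item / location.
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table (hypothetical): foreign keys to each dimension plus a measure.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER
);

INSERT INTO item VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO location VALUES (1, 'Chicago'), (2, 'Toronto');
INSERT INTO sales VALUES (1, 1, 3), (1, 2, 5), (2, 1, 2);
""")

# A typical analytical query: join the fact table to a dimension
# and aggregate the measure.
rows = conn.execute("""
    SELECT i.item_name, SUM(s.units_sold)
    FROM sales AS s
    JOIN item AS i ON s.item_key = i.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()

print(rows)
conn.close()
```

Keeping descriptive attributes in the dimension tables leaves the fact table narrow, so queries aggregate over many fact rows and join to a dimension only for labels.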