Learning Retrieval Knowledge from Data

Helge Langseth, Norwegian University of Science and Technology, Dept. of Mathematical Sciences
Agnar Aamodt, Norwegian University of Science and Technology, Dept. of Computer and Information Science
Ole Martin Winnem, SINTEF Telecom and Informatics, Dept. of Computer Science

Work partly performed within NOEMIE, ESPRIT project no. 22312
Participants: NTNU, SINTEF, Saga, JRC, Schlumberger, Matra, Acknosoft, Dauphine

NTNU

Outline
• Background / NOEMIE-project
• CREEK
• A data mining method
• Integrating semantic networks with automatically generated network structures:
– Problems with the semantics
– Benefits
• Initial empirical results

Data and User views
[Figure: The Task Reality]

Study of the task reality
[Figure: The Task Reality is studied along two paths. Experience gathering yields past cases and general domain knowledge (CBR); data capturing yields a data warehouse (DM).]

An example case
case-16
  instance-of value case
  has-activity value
    tripping-in
    circulating
  has-depth-of-occurrence value 5318
  has-task value solve-lc-problem
  has-observable-parameter value
    high-pump-pressure
    high-mud-density-1.41-1.7kg/l
    high-viscosity-30-40cp
    normal-yield-point-10-30-lb/100ft2
    large-final-pit-volume-loss->100m3
    long-lc-repair-time->15h
    low-pump-rate
    low-running-in-speed-<2m/s
    complete-initial-loss
    decreasing-loss-when-pump-off
    very-depleted-reservoir->0.3kg/l
    tight-spot
    high-mud-solids-content->20%
    small-annular-hydraulic-diameter-2-4in
    small-leak-off/mw-margin-0.021-0.050kg/l
    very-long-stands-still-time->2h
  has-well-section-position value in-reservoir-section
  has-failure value induced-fracture-lc
  has-repair-activity value
    pooh-to-casing-shoe
    waited-<1h
    increased-pump-rate-stepwise
    lost-circulation-again
    pumped-numerous-lcm-pills
    no-return-obtained
    set-and-squeezed-balanced-cement-plug

Initial design
• User experiences
• Problem descriptions
• Solutions
[Figure: boxes labelled Controller, Data Mining, Case-based reasoning and DW.]

Tangled CreekL Network
[Figure: a tangled semantic network for a car-diagnosis domain. Concepts shown include thing, goal, case, domain-object, vehicle, car, van, transportation, wheel, engine, battery, starter-motor, fuel-system, electrical-system, car-fault, engine-fault, fuel-system-fault, electrical-fault, battery-fault, battery-low, broken-carburettor-membrane, diagnosis, find-fault, find-treatment, diagnostic-case, case#54, diagnostic-hypothesis, test-procedure, test-step, engine-test, engine-turns, starter-motor-turns, turning-of-ignition-key, finding and observed-finding. Relation labels include has-function, has-status, has-state, has-fault, has-output, has-electrical-status, has-engine-status, tested-by, test-for, described-in, case-of, status-of, possible-status-of, part-of, subclass-of and instance-of.]
Legend:
hsc - has subclass
hi - has-instance
hp - has-part
hd - has-descriptor

Suitable DM methods must be:
• Able to generate structures from data, including a method for use (and update) of the domain expert’s model
• Able to learn new entities when exposed to new data
• Expressive. Limited models (like decision trees) are not suitable.
• Open for inspection, since our system performs explanation-driven CBR
• Non-deterministic: as we work in open, weak-theory domains, we cannot expect a deterministic structure to capture the main effects
• Semantically similar to a semantic network structure
• Bayesian networks are our initial method of choice, although there are significant differences which impose some limitations on the integration
• Other methods (e.g. ILP) are candidates for future activities

Bayesian networks (BN)
• A computationally efficient representation of probability distributions, exploiting conditional independence among the attributes/states of a domain.
• Has a qualitative part (below left), representing statistical dependence/independence statements. Can often be interpreted as a causal model among states.
• Has a quantitative part (below right), representing conditional probability values for a specific state given one or more other states. Can be interpreted as a degree of belief in one state given other states.
Left: Alarm (A) is caused by earthquake (E) and burglary (B). Alarm is independent of radio (R) given E and B.
Right: The degree of belief in A (and not A) given the state of E and B. E.g.: Belief in A is 0.2 given E and not B (2nd row).

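The quantitative part described above can be sketched as a small lookup table. This is only an illustration: the single entry 0.2 for "E and not B" comes from the slide, while every other probability, the function names, and the assumption that E and B are independent root nodes are placeholders introduced here.

```python
# Conditional probability table for P(A = true | E, B), keyed by (earthquake, burglary).
# Only the (True, False) entry is taken from the slide; the others are assumed values.
P_ALARM = {
    (True, True): 0.95,    # assumed placeholder
    (True, False): 0.2,    # from the slide: belief in A given E and not B
    (False, True): 0.9,    # assumed placeholder
    (False, False): 0.01,  # assumed placeholder
}


def belief_in_alarm(earthquake, burglary, alarm=True):
    """Degree of belief in A (or in not A) given the states of E and B."""
    p = P_ALARM[(earthquake, burglary)]
    return p if alarm else 1.0 - p


def marginal_alarm(p_e, p_b):
    """P(A), assuming E and B are independent root nodes with priors p_e and p_b."""
    total = 0.0
    for e in (True, False):
        for b in (True, False):
            weight = (p_e if e else 1.0 - p_e) * (p_b if b else 1.0 - p_b)
            total += weight * P_ALARM[(e, b)]
    return total


if __name__ == "__main__":
    print(belief_in_alarm(True, False))
    print(marginal_alarm(p_e=0.01, p_b=0.02))  # priors are made up for the demo
```

Each row of the table stores only the belief in A; the belief in not A follows as one minus that value, which is why a binary node needs one number per parent configuration.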
Information flow
• User experiences
• Problem descriptions
• Solutions
[Figure: a Controller sits above Data Mining and Case-based reasoning, which both use the DW. Data Mining comprises General DM (clustering, time series, etc.) and Causal DM (BNs). Case-based reasoning comprises KI CBR (Creek) and “Data driven” CBR. Numbered arrows 1 to 3 mark the steps below.]
1) Data preprocessing/cleaning
2) Structure learning and parameter tuning in the Bayesian Network
3) Generation of similarity matrices etc.

CBR and BN integration: General picture
[Figure: General Domain Knowledge is both human generated and machine generated. It feeds Knowledge Intensive CBR, which works on the Case Base. Causal Data Mining draws on User DBs and General Data Mining draws on General purpose DBs.]

The experiment of Heckerman et al.
[Figure: three network diagrams with numbered nodes (up to 37) and a data table with columns case#, x1, x2, x3, ..., x37 and rows up to case 10,000. One arc in the last network is marked “Deleted”.]

Generating Networks:
• Initialize network
repeat
  • Propose some change to the structure
  • Fit parameters to the new structure
  • Evaluate the new network according to some measure (like BIC, AIC, MDL)
  • If the new network is better than the previous, then keep the change
until finished

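The loop above can be sketched as a greedy hill-climbing search. This is a minimal illustration under stated assumptions, not the method used in the project: the proposed change is toggling a single arc, parameters are fitted by maximum-likelihood counts inside the score, the measure is BIC, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def family_bic(data, child, parents, arity):
    """BIC term for one node: maximum log-likelihood minus a complexity penalty."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood parameters are the relative frequencies c / parent_counts.
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return loglik - 0.5 * math.log(len(data)) * n_params


def bic(data, parents_of, arity):
    """BIC of a whole network given as {node: set of parent nodes}."""
    return sum(family_bic(data, v, sorted(ps), arity) for v, ps in parents_of.items())


def has_cycle(parents_of):
    """Detect a directed cycle by repeatedly removing nodes without parents."""
    remaining = {v: set(ps) for v, ps in parents_of.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return True
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return False


def hill_climb(data, arity):
    """Greedy structure search: toggle one arc at a time while BIC improves."""
    n_vars = len(arity)
    parents_of = {v: set() for v in range(n_vars)}  # initialize: network without arcs
    best = bic(data, parents_of, arity)
    improved = True
    while improved:  # repeat ... until finished
        improved = False
        for a, b in permutations(range(n_vars), 2):
            candidate = {v: set(ps) for v, ps in parents_of.items()}
            candidate[b] ^= {a}  # propose a change: add or delete the arc a -> b
            if has_cycle(candidate):
                continue
            score = bic(data, candidate, arity)  # fit parameters and evaluate
            if score > best + 1e-9:  # keep the change only if it is better
                parents_of, best, improved = candidate, score, True
    return parents_of, best


if __name__ == "__main__":
    # Synthetic data: column 1 copies column 0, column 2 is unrelated to both.
    rows = [(i % 2, i % 2, (i // 2) % 2) for i in range(200)]
    print(hill_climb(rows, arity=[2, 2, 2]))
```

The search stops at the first structure that no single arc change improves, so it can end in a local optimum. Swapping BIC for AIC or MDL only changes the penalty term in `family_bic`.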
BNs are powered by Conditional Independencies
[Figure: network over the nodes Age, Gender, Exposure To Toxic, Smoking, Cancer, Serum Calcium and Lung Tumour.]
Cancer is independent of Age and Gender given Exposure To Toxic and Smoking.

Bayesian Networks: semantics
[Figure: network over the nodes S, C, L, E, X, D, annotated “conditional independencies in BN structure + local probability models = full joint distribution over domain”.]
$$P(s, c, l, e, x, d) = P(s)\,P(c)\,P(l \mid s)\,P(e \mid s, c)\,P(x \mid l)\,P(d \mid l, e)$$
• Compact & natural representation:
– nodes have ≤ k parents ⇒ $O(2^k n)$ vs. $O(2^n)$ parameters
– parameters natural and easy to elicit.
Slide taken from Nir Friedman: “Learning the Structure of Probabilistic Models”

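The factorization and the parameter saving can be checked with a short sketch. The parent sets are read off the factorization on the slide; the assumption that all six variables are binary, the placeholder probability values, and the function names are introduced here for illustration only.

```python
from itertools import product

# Parent sets read off the factorization
# P(s) P(c) P(l|s) P(e|s,c) P(x|l) P(d|l,e).
PARENTS = {"s": (), "c": (), "l": ("s",), "e": ("s", "c"), "x": ("l",), "d": ("l", "e")}


def bn_parameter_count(parents):
    """Free parameters for binary nodes: one per parent configuration."""
    return sum(2 ** len(ps) for ps in parents.values())


def full_joint_parameter_count(n_vars):
    """Free parameters of an unrestricted joint table over n binary variables."""
    return 2 ** n_vars - 1


def make_placeholder_cpts(parents):
    """Hypothetical CPTs holding P(node = 1 | parent values); values are arbitrary."""
    cpts = {}
    for node, ps in parents.items():
        table = {}
        for vals in product((0, 1), repeat=len(ps)):
            table[vals] = 0.1 + 0.8 * (sum(vals) + 1) / (len(ps) + 2)
        cpts[node] = table
    return cpts


def joint_probability(assignment, parents, cpts):
    """Multiply the local conditional probabilities, one factor per node."""
    p = 1.0
    for node, ps in parents.items():
        p_one = cpts[node][tuple(assignment[q] for q in ps)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p


if __name__ == "__main__":
    print(bn_parameter_count(PARENTS), full_joint_parameter_count(len(PARENTS)))  # 14 63
    cpts = make_placeholder_cpts(PARENTS)
    names = list(PARENTS)
    total = sum(
        joint_probability(dict(zip(names, vals)), PARENTS, cpts)
        for vals in product((0, 1), repeat=len(names))
    )
    print(total)  # close to 1, up to floating-point rounding
```

With at most two parents per node, the network needs 14 numbers where the unrestricted joint table needs 63, and the product of the local models still sums to one over all assignments.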
Can we learn causation from data?

Can we learn causation … (continued)
[Figure: two networks over the nodes IQ, Clothes, Sex and Test result, one labelled “The newspaper’s theory” and the other “The Bimbo Theory”.]
The “meaning” is different, but the two networks are equally plausible from the newspaper story.

Inferred Causation

Integration of BN and EDoMo
[Figure: fragment of a network for the car domain. Nodes shown: fuel-system, carburettor, fuel-system-fault, carburettor-fault, carburettor-valve-fault, carburettor-valve-stuck, condensation-in-gas-tank, water-in-gas-tank, water-in-gas-mixture, too-rich-gas-mixture-in-cylinder, no-chamber-ignition, engine-turns, engine-does-not-fire, observable-state and observed-finding. Relation labels: hp, hsc, hi, has-fault and causes.]

Integration Level
Levels: Low, Medium, High
Purpose: Domain level integration; Inference level integration
Data-source:
  Low: Separate data files
  Medium: Common data format, different use
  High: Everything represented as frames
Typical BN-inference task:
  Low: No dedicated BN inference unit
  Medium: RetrieveCases
  High: ExplainSimilarity (AttrA, AttrB)
EDoMo verification:
  Low: No verification
  Medium: Verify substructures by examining “hidden nodes” and KL divergence
  High: Verification on arc level IMPOSSIBLE?

Effect of Evidence During BN-retrieve
[Figure: labels Observed attributes, Domain model and Cases.]