Matching Twigs in Probabilistic XML
Benny Kimelfeld & Yehoshua Sagiv
VLDB 2007, Vienna, Austria
[Figure: aerial photo with uncertain object labels: road (90%), road (60%), factory bldg. & wall (40%) / house & road (30%), house (50%) / factory bldg. (50%), factory bldg. (40%); answers with probabilities 45%, 36%, 24%, 36%]
A prob. process for generating random data
[Twig: *, region, road, factory building]
Querying probabilistic data:
Each answer has an amount of certainty: the probability of being obtained when querying a random database
* region road factory building
specific match?
pair of road & factory building?
* region road factory building
project
answer after the projection?
road & factory building?
A factory building, a road, an antenna, a heliport, a track
[Figure: a match consisting of factory bldg. w/ antennas (50%), road (90%) and heliport (80%); probability 36%]
[Figure: a match consisting of factory bldg. w/ antennas (50%), road (90%), heliport (80%) and track (20%); probability 7.2%]
Should we just filter out the whole match?
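The two percentages shown for these matches (36% and 7.2%) are consistent with multiplying the probabilities of the matched objects. That holds if the objects are assumed to be independent, which is an assumption made here for illustration and is not stated on the slides. A minimal Python sketch, where `match_probability` is a hypothetical helper name:

```python
from math import prod

def match_probability(component_probs):
    """Probability that every component of a match exists,
    ASSUMING the components are mutually independent."""
    return prod(component_probs)

# factory bldg. w/ antennas (50%), road (90%), heliport (80%)
three_parts = match_probability([0.5, 0.9, 0.8])

# the same three components plus track (20%)
four_parts = match_probability([0.5, 0.9, 0.8, 0.2])

print(round(three_parts, 3), round(four_parts, 3))
```

Under this assumption, adding the single unlikely track lowers the probability of the whole match by a factor of five, which is what prompts the question of whether the whole match should be filtered out.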
– Projection: Very simple queries can be highly intractable (data complexity) [Dalvi & Suciu, VLDB 04]
– Maximally joining relations: Tractable under data complexity, generally intractable under query-and-data complexity [Kimelfeld & Sagiv, PODS 07]
In the paper, we explain in detail why our results do not follow from previous results on XML/relational models
In the paper, we also have some preliminary results on the combination of maximal matches and projection
− XML and Twig Queries
− Probabilistic XML
− Querying Probabilistic XML (Complete Semantics)
Each node has a tag, a value or both
[Figure: a twig with nodes *, region, @area ≥ 10 km², factory, heliport, park.lot]
– Node predicate over the tag and value
– Child edge
– Descendant edge
– Output node (projection); possibly, more than one
root(T) → root(d)
node predicates are satisfied
child edge → edge
That is, applying projection to the match
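The conditions for a match can be illustrated on an ordinary (non-probabilistic) document. The following is a minimal Python sketch of a Boolean check only (is there at least one match?). It rests on several assumptions: the class and function names are hypothetical, node predicates are reduced to a tag test with `*` as a wildcard, and a descendant edge is taken to map to a path to a proper descendant, which is the usual convention but is not spelled out in the surviving slide text.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    label: str
    children: list = field(default_factory=list)   # list of Doc

@dataclass
class Twig:
    label: str                                     # "*" matches any tag
    edges: list = field(default_factory=list)      # (kind, Twig), kind is "child" or "desc"

def descendants(d):
    """Yield all proper descendants of document node d."""
    for c in d.children:
        yield c
        yield from descendants(c)

def embeds(t, d):
    """True if the twig rooted at t can be matched with t mapped to d."""
    if t.label != "*" and t.label != d.label:      # node predicate
        return False
    for kind, sub in t.edges:
        # child edge -> document edge; descendant edge -> document path (assumed)
        candidates = d.children if kind == "child" else descendants(d)
        if not any(embeds(sub, c) for c in candidates):
            return False
    return True

# root(T) -> root(d): the check starts at both roots
doc = Doc("aerial-photo", [Doc("region", [Doc("road"), Doc("factory")])])
twig = Twig("*", [("desc", Twig("region", [("child", Twig("road")),
                                           ("child", Twig("factory"))]))])
print(embeds(twig, doc))
```

The sketch does not enumerate matches or apply projection; it only tests existence, which corresponds to a Boolean query.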
A probabilistic process that generates ordinary XML documents
An ordinary XML document d is generated with probability Pr(d)
E.g., uncertainty in many small pieces of data
Such as the following
[Figure: a probabilistic XML document with ordinary nodes (aerial-photo, region, neighborhood, house, size, factory, building, heliport, park.lot, vehicle, type, track, private) and distributional nodes whose outgoing edges carry probabilities]
– Ordinary nodes
– Distributional nodes: independent or mutually exclusive
Traverse the tree top-down
Distributional nodes choose a set of children:
– Independent: choose children independently
– Mutually exclusive: choose at most one child
Drop unchosen children
[Figure: the tree after the distributional nodes have chosen their children and the unchosen children have been dropped]
Drop the distributional nodes
Connect each remaining node to its closest ordinary ancestor
[Figure: the resulting ordinary document, with nodes aerial-photo, region, neighborhood, house, size, factory, heliport, vehicle, type, track]
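The generation process described on the preceding slides (top-down traversal, distributional nodes choosing children, dropping unchosen children, then dropping the distributional nodes) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the class names `Ord` and `Dist`, the kind strings `"ind"` and `"mux"`, and the function names are all hypothetical, and the probabilities in the example are made up.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Ord:                      # ordinary node
    label: str
    children: list = field(default_factory=list)   # Ord or Dist nodes

@dataclass
class Dist:                     # distributional node
    kind: str                   # "ind" (independent) or "mux" (mutually exclusive)
    options: list = field(default_factory=list)    # (probability, child) pairs

def choose(dist, rng):
    """Pick the children that a distributional node keeps."""
    if dist.kind == "ind":      # each child is kept independently
        return [c for p, c in dist.options if rng.random() < p]
    r, acc = rng.random(), 0.0  # mux: at most one child is kept
    for p, c in dist.options:
        acc += p
        if r < acc:
            return [c]
    return []                   # probabilities may sum to less than 1

def sample(node, rng):
    """Return the ordinary nodes generated from `node`; distributional
    nodes are dropped, so chosen children attach to the closest
    ordinary ancestor."""
    if isinstance(node, Dist):
        out = []
        for c in choose(node, rng):
            out.extend(sample(c, rng))
        return out
    kids = []
    for c in node.children:
        kids.extend(sample(c, rng))
    return [Ord(node.label, kids)]

# Example with made-up probabilities
pdoc = Ord("region", [
    Dist("ind", [(0.8, Ord("heliport")), (0.3, Ord("park.lot"))]),
    Dist("mux", [(0.75, Ord("factory")), (0.25, Ord("house"))]),
])
random_doc = sample(pdoc, random.Random(0))[0]
```

Because unchosen subtrees are never visited, each call to `sample` produces one ordinary document d with probability Pr(d).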
That is, of the type that is applied to non-probabilistic documents
* region road factory building
Twig w/ projection
When querying probabilistic data,
Non-Boolean Queries: Boolean Queries:
We apply a standard reduction from regular queries (that generate mappings) to Boolean ones:
That is, computing the probability of a match
Next, we consider the evaluation of Boolean queries
Document nodes are traversed bottom-up
When visiting a node, evaluate a collection of queries (inc. the original)
Special treatment if the visited node is distributional
involve several different children
How can we compute the probability that there is a match, based on previous results for the descendants?
The principle of
A document satisfies a conjunction of negated twig branches iff each of the
Good news: Document branches are independent!
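Independence is what makes these probabilities easy to combine: the probability that several independent events all hold is the product of their individual probabilities. As a simplified illustration (not the algorithm of the paper), consider a single twig branch and suppose the i-th document branch matches it with probability p_i, independently of the others. The function names below are hypothetical.

```python
from math import prod

def prob_no_branch_matches(branch_probs):
    """Probability that none of the independent document branches
    matches the twig branch."""
    return prod(1.0 - p for p in branch_probs)

def prob_some_branch_matches(branch_probs):
    """Probability that at least one document branch matches."""
    return 1.0 - prob_no_branch_matches(branch_probs)

# Two document branches that match with probabilities 0.6 and 0.4
p_some = prob_some_branch_matches([0.6, 0.4])
```

Working with the negated event (no branch matches) is convenient because it turns an "at least one" question into a plain product over the branches.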
Cut the roots from both twig and doc. branches:
root has only child edges; it would not work otherwise!
– Ordinary node (sketched in the previous slides)
– Distributional node
Is there an efficient algorithm under query-and-data complexity (polynomial in the query also)?
under query & data complexity! Even if:
– No desc. edges
– Only independent distributions
the root
That is, m1=m2
Ordinary Data:
Probabilistic Data:
In other words, m is maximal among the partial answers with a sufficient probability
efficiently under data complexity
– Known data model
– Twig patterns (node predicates, child & desc. edges)
– Complete & maximal semantics, projection
– Also used for evaluating queries with projection
– Efficient under data complexity
– Efficient under query-and-data complexity
Complexity of query evaluation, for complete semantics and maximal semantics (Boolean, w/o projection, w/ projection):
Data Complexity: Poly. in all five cases
Query & Data Complexity: Open, #P-complete, #P-complete, NP-complete
Fuzzy trees [Abiteboul & Senellart, 2006]: Query Evaluation: #P-Complete
ProTDB [Nierman and Jagadish, 2002]: Query Evaluation: Tractable
PXML [Hung, Getoor & Subrahmanian, 2003]: Query Evaluation: Tree docs.: Tractable, DAG docs.: #P-hard
Simple prob. trees [Abiteboul & Senellart, 2006]: Query Evaluation: Tractable
Query evaluation: Complete semantics w/ projection
– We already obtained significant improvements, both experimentally and analytically
– New types of distributional nodes
– Ongoing work: A combination of ProTDB [Nierman and Jagadish, 2002] and PXML [Hung, Getoor & Subrahmanian, 2003]