Transcript Slide 1

Keyword Search over XML
1
Inexact Querying
• Until now, our queries have been complex
patterns, represented by trees or graphs
• Such query languages are not appropriate
for the naive user:
– if XML “replaces” HTML as the web standard,
users can’t be expected to write graph queries
Allow Keyword Search over XML!
2
Keyword Search
• A keyword search is a list of search terms
• There can be different ways to define legal
search terms. Examples:
– label:keyword, e.g., author:Smith
– keyword, e.g., :Smith
– label, e.g., author:
– value (without distinguishing between keywords
and labels)
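The term forms listed above can be made concrete with a small parser. This is a minimal sketch, not part of the original slides: the names `SearchTerm`, `parse_term` and `parse_query` are hypothetical, and the sketch assumes that terms are separated by commas, as in the example queries on the following slides.

```python
from typing import List, NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]    # element label to match, e.g. "author"
    keyword: Optional[str]  # keyword to match in the text, e.g. "Smith"
    value: Optional[str]    # bare value: may match a label or a keyword


def parse_term(term: str) -> SearchTerm:
    """Parse 'label:keyword', ':keyword', 'label:' or a bare value."""
    if ":" not in term:
        return SearchTerm(None, None, term)
    label, keyword = term.split(":", 1)
    # An empty side of the colon means "not specified".
    return SearchTerm(label or None, keyword or None, None)


def parse_query(query: str) -> List[SearchTerm]:
    """Split a comma-separated query into search terms."""
    return [parse_term(t.strip()) for t in query.split(",") if t.strip()]
```

For example, `parse_query(":ACID, author:")` yields one keyword-only term and one label-only term.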
3
Challenges (1)
• Determining which part of the XML document
corresponds to an answer
– When searching HTML, the result units are usually
documents
– When searching XML, a finer granularity should be
returned, e.g., a subtree
4
What should be returned for the query :ACID, :Kempster ?
5
Challenges (2)
• Avoiding the return of non-meaningfully
related elements
– XML documents often contain many unrelated
fragments of information. Can these information
units be recognized?
6
What should be returned for the query :XML, author: ?
7
What should be returned for the query :XML, :Kempster ?
8
Challenges (3)
• Ranking mechanisms
– How should document fragments/XML elements
be ranked?
• Ideas?
9
In what order should the answers be returned for :ACID, author: ?
10
Defining a Search Semantics
• When defining a search over XML, all previous
challenges must be considered.
• We must decide:
– what portions of a document are a search result?
– should any results be filtered out since they are not
meaningful?
– how should ranking be performed?
• Typically, research focuses on one of these
problems and provides simple solutions for the
other problems.
11
Topics Discussed
• XRank: Paper presents a variation of
PageRank for ranking XML elements
– focus on ranking
• Interconnection Semantics: Methods to
determine whether a set of nodes is
meaningfully related
– focus on filtering out meaningless results
12
XRank: Ranked Keyword Search over
XML Documents
Guo, Shao, Botev, Shanmugasundram
SIGMOD 2003
13
Queries and their Semantics
• Queries are keywords k1,…,kn, as in a search
engine
• Query results are portions of XML documents that
contain all words. Formally:
– Let v be a node in the document. To determine whether
v should be returned: First, “remove” any descendants
of v that contain all the keywords k1,…,kn. If v still
contains all of k1,…,kn, then v should be a result of the
search.
– Intuition: Only return v if no more specific element can
be returned.
Note: Containment is via child edges, not IDREF edges
14
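The result definition above can be sketched as a single bottom-up pass over the tree. This is an illustration under stated assumptions, not the paper's implementation: the helper names `direct_words` and `search` are hypothetical, words are matched case-insensitively, and attribute values are counted as text that the element contains directly.

```python
import re
import xml.etree.ElementTree as ET


def direct_words(elem):
    """Words contained directly: own text, attribute values, text between children."""
    parts = [elem.text or ""] + list(elem.attrib.values())
    parts += [child.tail or "" for child in elem]
    return set(re.findall(r"\w+", " ".join(parts).lower()))


def search(root, keywords):
    """Return elements that still contain all keywords after "removing"
    every descendant that contains all keywords."""
    kws = {k.lower() for k in keywords}
    results = []

    def visit(elem):
        # Returns the keywords elem contributes to its parent,
        # or None if elem's subtree contains all keywords (elem is "removed").
        found = kws & direct_words(elem)
        removed_descendant = False
        for child in elem:
            sub = visit(child)
            if sub is None:
                removed_descendant = True
            else:
                found |= sub
        if found == kws:
            results.append(elem)
            return None
        return None if removed_descendant else found

    visit(root)
    return results
```

On a document shaped like the workshop example, the query XQL language returns the subsection and also the enclosing paper, because the paper's title and abstract still supply both keywords once the section is removed. Only child edges are followed, in line with the note above.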
<workshop date=”28 July 2000”>
<title> XML and Information Retrieval: A SIGIR 2000 Workshop
</title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id=”1”>
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the language …</abstract>
<section name=”Implementing XQL Operations”>
<subsection name=”Path Expressions”>
At first sight, the XQL language looks…
</subsection>
</section> …
<cite ref=”2”> Querying XML in Xyleme </cite>
<cite xmlns:xlink=”http://www8.org/paper/xmlql”> …</cite>
</paper>
<paper id =“2”>
…
</workshop>
What should be returned for the query XQL language?
15
Ranking Results: Intuition
• Granularity of ranking
– In HTML, there is a rank for each document
– In XML, we want a rank for each element. Different elements in the
same document may have different ranks
• Propose to extend ideas used for ranking HTML:
– PageRank: Documents with more incoming links are more
important (recursive definition)
– Proximity: If the document contains the search terms close
together, then the document is more important
• Overall Rank: combination of PageRank and proximity
17
<workshop date=”28 July 2000”>
<title> XML and Information Retrieval: A SIGIR 2000 Workshop
</title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id=”1”>
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the language …</abstract>
<section name=”Implementing XML Operations”>
<subsection name=”Path Expressions”>
At first sight, the XQL language looks…
</subsection>
</section> …
<cite ref=”2”> Querying XML in Xyleme </cite>
<cite xmlns:xlink=”http://www8.org/paper/xmlql”> …</cite>
</paper>
<paper id =“2”>
…
</workshop>
Should both papers be ranked the same?
18
Topics
• We discuss:
– Ranking
– The Index Structure
– Query Processing
19
Ranking Results
• Take into consideration
– hyperlinks
– proximity
• We only discuss here ranking by the linking
structure. Ranking by proximity can easily be
defined (ideas?)
• What kind of “links” are there in a graph of
XML documents?
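The slides leave the proximity measure open. One common choice, shown here purely as an assumption, is the width of the smallest window of word positions that covers every keyword; a smaller window would then mean a higher proximity score. The function name `smallest_window` and its input format (one list of word offsets per keyword) are hypothetical.

```python
def smallest_window(positions):
    """positions: one list of word offsets per keyword.
    Returns the width of the smallest window containing every keyword,
    or None if some keyword never occurs."""
    events = sorted((p, k) for k, plist in enumerate(positions) for p in plist)
    need = len(positions)
    count = {}      # keyword index -> occurrences inside the current window
    best = None
    left = 0
    for pos, k in events:
        count[k] = count.get(k, 0) + 1
        # Shrink from the left while the window still covers all keywords.
        while len(count) == need:
            width = pos - events[left][0] + 1
            if best is None or width < best:
                best = width
            left_kw = events[left][1]
            count[left_kw] -= 1
            if count[left_kw] == 0:
                del count[left_kw]
            left += 1
    return best
```

A combined score could then weight the link-based rank by some decreasing function of this width; the exact combination is a design choice that the slides do not fix.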
20
<workshop date=”28 July 2000”>
<title> XML and Information Retrieval: A SIGIR 2000 Workshop
</title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id=”1”>
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the language …</abstract>
<section name=”Implementing XML Operations”>
<subsection name=”Path Expressions”>
At first sight, the XQL language looks…
</subsection>
</section> …
<cite ref=”2”> Querying XML in Xyleme </cite>
<cite xmlns:xlink=”http://www8.org/paper/xmlql”> …</cite>
</paper>
<paper id =“2”>
…
</workshop>
Child/Parent “links”
21
<workshop date=”28 July 2000”>
<title> XML and Information Retrieval: A SIGIR 2000 Workshop
</title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id=”1”>
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the language …</abstract>
<section name=”Implementing XML Operations”>
<subsection name=”Path Expressions”>
At first sight, the XQL language looks…
</subsection>
</section> …
<cite ref=”2”> Querying XML in Xyleme </cite>
<cite xmlns:xlink=”http://www8.org/paper/xmlql”> …</cite>
</paper>
<paper id =“2”>
…
</workshop>
IDREF “links”
22
<workshop date=”28 July 2000”>
<title> XML and Information Retrieval: A SIGIR 2000 Workshop
</title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id=”1”>
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the language …</abstract>
<section name=”Implementing XML Operations”>
<subsection name=”Path Expressions”>
At first sight, the XQL language looks…
</subsection>
</section> …
<cite ref=”2”> Querying XML in Xyleme </cite>
<cite xmlns:xlink=”http://www8.org/paper/xmlql”> …</cite>
</paper>
<paper id =“2”>
…
</workshop>
XLink “links” (out of the document)
23
Remember: PageRank
[Figure: node v with hyperlink edges, each labeled d/3]
• d: Probability of following a hyperlink
• 1-d: Probability of a random jump
• Formula annotations: number of outgoing links; number of documents
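As a reminder of how these quantities fit together, the standard PageRank recurrence gives each document a share (1-d)/N from random jumps, plus d times the rank of every document linking to it divided by that document's number of outgoing links. The sketch below iterates this recurrence; the function name, the default d = 0.85 and the fixed iteration count are assumptions for illustration, and documents without outgoing links are simply not redistributed.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each document to the list of documents it links to."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump share, identical for every document.
        new_rank = {v: (1 - d) / n for v in nodes}
        for u in nodes:
            out = links.get(u, [])
            for v in out:
                # u passes on d * rank(u), split evenly over its outgoing links.
                new_rank[v] += d * rank[u] / len(out)
        rank = new_rank
    return rank
```

With `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, document c ends up ranked above b, since it receives rank from both a and b.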
24
A Graph of XML documents
• Nodes: N
– each element in a document is a node
• Edges: E = CE ∪ CE⁻¹ ∪ HE
– CE are “containment links”, i.e., there is an edge (u,v) in
CE if u is a parent of v in the XML document
– HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if
there is an IDREF link or XLink link from u to v
• Want to define ElemRank, the parallel to
PageRank, but for XML elements
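The three edge sets can be extracted from a parsed document roughly as follows. This is a sketch under simplifying assumptions: IDREF links are taken to be `ref` attributes whose value matches some element's `id` attribute (as in the workshop example), XLinks that leave the document are ignored, and the function name `build_graph` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_graph(root):
    """Return (nodes, CE, CE_inverse, HE) for one XML document."""
    nodes = list(root.iter())                      # every element is a node
    ce = [(parent, child) for parent in nodes for child in parent]
    ce_inv = [(child, parent) for parent, child in ce]
    by_id = {e.get("id"): e for e in nodes if e.get("id") is not None}
    # Assumed IDREF convention: ref="X" points at the element with id="X".
    he = [(e, by_id[e.get("ref")]) for e in nodes if e.get("ref") in by_id]
    return nodes, ce, ce_inv, he
```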
25
Attempt 1 at ElemRank
[Figure: graph around node v; legend: hyperlink edge, containment edge]
There are now 4 ways to get to an element. Consider all in the formula.
26
Attempt 1 at ElemRank:
Problem
[Figure: graph around node v; legend: hyperlink edge, containment edge]
Consider a paper with few sections and many references.
The more references there are, the less important each section is. Why?
27
Attempt 2 at ElemRank
[Figure: graph around node v; legend: hyperlink edge, containment edge]
Consider hyperlinks and structural links separately
28
Attempt 2 at ElemRank:
Problem
[Figure: graph around node v; legend: hyperlink edge, containment edge]
In fact, better to consider parent-child links differently from child-parent links
29
Actual ElemRank
[Figure: graph around node v; legend: hyperlink edge, containment edge]
Consider hyperlinks, parent links and child links separately
30
Interpretation in terms of Random
Walks
• The element rank of e is the probability that e will
be reached if we start at a random element and at
each point we choose one of the following options:
– with probability 1-d1-d2-d3 jump to a random element in a
random page
– with probability d1 follow a random hyperlink from the
current element
– with probability d2 follow a random edge to a child
element
– with probability d3 follow the parent edge
31
ElemRank Example
[Figure: example graph with nodes 1, 2, 3 and 4; legend: hyperlink edge, containment edge]

\[
e(v) = \frac{1 - d_1 - d_2 - d_3}{N_e}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
     + d_3 \sum_{(u,v) \in CE^{-1}} e(u)
\]
• Suppose that d1 = d2 = d3 = 0.3
• In what order will the nodes be ranked?
• What will be the formula for each node?
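To experiment with these questions, the ElemRank formula can be iterated until the values settle. The sketch below is illustrative only: it reads N_h(u) as the number of outgoing hyperlinks of u, N_c(u) as the number of children of u, and N_e as the total number of elements, and it runs on a hypothetical input format (a child-to-parent map plus a list of hyperlink pairs), not on the graph from the figure.

```python
def elem_rank(parent, hyperlinks, d1=0.3, d2=0.3, d3=0.3, iterations=100):
    """parent: dict element -> parent element (None for a root).
    hyperlinks: list of (u, v) pairs, i.e. the edge set HE."""
    nodes = list(parent)
    n = len(nodes)
    children = {v: [] for v in nodes}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)
    out_links = {v: [] for v in nodes}
    for u, v in hyperlinks:
        out_links[u].append(v)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - d1 - d2 - d3) / n for v in nodes}   # random jump
        for u in nodes:
            for v in out_links[u]:                               # hyperlink edges
                new_rank[v] += d1 * rank[u] / len(out_links[u])
            for v in children[u]:                                # parent-to-child edges
                new_rank[v] += d2 * rank[u] / len(children[u])
            if parent[u] is not None:                            # child-to-parent edge
                new_rank[parent[u]] += d3 * rank[u]
        rank = new_rank
    return rank
```

Because elements without hyperlinks or children pass on less than their full rank, the values produced by this sketch need not sum to 1; only their relative order matters for ranking.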
32
Think About it
• Very nice definition of ElemRank
• Does it make sense? Would ElemRank give good
results in the following scenarios:
– IDREFs connect articles with articles that they cite
– IDREFs connect managers with their departments
– IDREFs connect cleaning staff with their departments in
which they work
– IDREFs connect countries with bordering countries (as in
the CIA factbook)
33
Topics
• We discuss:
– Ranking
– The Index Structure
– Query Processing
34
Indexing
• We now discuss the index structure
• Recall that we will be ranking according to
ElemRank
• Recall that we want to return “most specific
elements”
• How should the data be stored in an index?
35
Naive Method
Treat elements as documents: normal inverted lists
[Figure: element tree with nodes numbered 0 to 9: <workshop>, date, <title>, <editors>, <proceedings>, two <paper> elements, <title>, <author>, <Section>; text values 28 July …, XML and …, David Carmel …, XQL and …, Ricardo …]
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
…
Problem: Space Overhead
How much space is needed in storage?
36
Naive Method
Treat elements as documents: normal inverted lists
[Figure: element tree with nodes numbered 0 to 9: <workshop>, date, <title>, <editors>, <proceedings>, two <paper> elements, <title>, <author>, <Section>; text values 28 July …, XML and …, David Carmel …, XQL and …, Ricardo …]
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
…
Problem: Spurious Results
Can’t simply return the intersection of the lists, since if a
node satisfies a query, so do all its ancestors
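Both problems of the naive method show up in a few lines of code. This is a sketch under assumptions: elements are numbered in document order (which may differ from the numbering in the figure), only element text is indexed, and the function name `naive_index` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def naive_index(root):
    """Treat every element as a document containing all text in its subtree."""
    ids = {elem: i for i, elem in enumerate(root.iter())}   # document order
    index = defaultdict(set)

    def visit(elem):
        words = set(re.findall(r"\w+", (elem.text or "").lower()))
        for child in elem:
            words |= visit(child)
            words |= set(re.findall(r"\w+", (child.tail or "").lower()))
        for w in words:
            index[w].add(ids[elem])   # one posting per (word, enclosing element)
        return words

    visit(root)
    return index
```

Every word is posted once for each of its ancestors, so the index grows with the depth of the tree, and intersecting two lists returns the shared ancestors as well as any specific element.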
37
Dewey Encoding of ID
• Use path information to identify elements – DeweyID
• An ancestor’s ID is a prefix of its descendant’s ID
• Actually (not shown) all the node ids are prefixed by the
document number
[Figure: element tree labeled with Dewey IDs]
<workshop> 0
  <date> 0.0 (28 July …)
  <title> 0.1 (XML and …)
  <editors> 0.2 (David Carmel …)
  <proceedings> 0.3
    <paper> 0.3.0
      <title> 0.3.0.0 …
      <author> 0.3.0.1 …
    <paper> 0.3.1 …
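Assigning Dewey IDs is a simple top-down traversal. In this sketch (names are hypothetical) an ID is stored as a tuple of integers rather than a dotted string, so that comparing IDs follows document order even when a node has ten or more children.

```python
import xml.etree.ElementTree as ET


def assign_dewey(root, doc_id=None):
    """Map each element to its Dewey ID, optionally prefixed by a document number."""
    ids = {}

    def visit(elem, path):
        ids[elem] = path
        for i, child in enumerate(elem):
            visit(child, path + (i,))     # child ID = parent ID + child position

    visit(root, (0,) if doc_id is None else (doc_id, 0))
    return ids


def is_ancestor(a, b):
    """True if the element with ID a is a proper ancestor of the one with ID b."""
    return len(a) < len(b) and b[:len(a)] == a


def dewey_str(path):
    return ".".join(str(c) for c in path)
```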
38
Dewey Inverted List (DIL)
• Store, for each keyword a list containing :
– the id of the node containing the keyword
– the rank of the node containing the keyword
– the positions of the keyword in the node
• Rank and positions are needed to compute
ranking
• To simplify, in the following slides, we only
store lists of node ids
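A Dewey inverted list along these lines could be built as follows. This is a sketch, not the paper's storage format: the function name `build_dil` is hypothetical, only an element's own leading text counts as directly contained, and the rank is looked up in an optional, caller-supplied table (0.0 if absent) rather than computed.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root, doc_id, ranks=None):
    """Return keyword -> list of (dewey_id, rank, positions), in document order."""
    ranks = ranks or {}
    dil = defaultdict(list)

    def visit(elem, dewey):
        words = re.findall(r"\w+", (elem.text or "").lower())
        positions = defaultdict(list)
        for pos, w in enumerate(words):
            positions[w].append(pos)
        for w, plist in positions.items():
            dil[w].append((dewey, ranks.get(dewey, 0.0), plist))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    # Pre-order traversal appends entries already sorted by Dewey ID.
    visit(root, (doc_id, 0))
    return dil
```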
39
Topics
• We discuss:
– Ranking
– The Index Structure
– Query Processing
40
Query Processing
• Challenges:
– How do we find nodes that contain all
keywords?
– How do we find only the most specific node that
contains all keywords?
– Can this be done in a single scan of the inverted
keyword lists?
41
Example: Document
47th Document in Corpus
[Figure: document tree]
proceedings
  paper
    title: … XQL …
    abstract: … language …
    section: … XQL …
      subsection: … XQL language …
  paper: … XQL …
42
Example: Document with IDs
47th Document in Corpus
[Figure: document tree with Dewey IDs]
proceedings 47.0
  paper 47.0.0
    title 47.0.0.0: … XQL …
    abstract 47.0.0.1: … language …
    section 47.0.0.2: … XQL …
      subsection 47.0.0.2.0: … XQL language …
  paper 47.0.1: … XQL …
43
Example: Inverted Lists
[Figure: the document tree with Dewey IDs, as on the previous slide]
Lists contain ids for nodes that directly contain the keyword. Lists are sorted.
XQL: 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language: 47.0.0.1; 47.0.0.2.0
44
Example: Inverted Lists
[Figure: the document tree with Dewey IDs, as on the previous slide]
We want to find nodes that should be returned. Which? How will they be ranked?
XQL: 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language: 47.0.0.1; 47.0.0.2.0
45
Algorithm: Data Structures
[Figure: inverted lists for XQL and language; Dewey stack with columns DeweyID, Contains[1], Contains[2] and ContainsAll; result heap]
46
Algorithm: Pseudo Code
• Find the smallest next entry in the inverted lists
• Find the longest common prefix (lcp) of the entry and the Dewey
stack
• Pop all non-matching values from the Dewey stack.
When popping:
– propagate down containment information, if containsAll
is false
– if containsAll turns from false to true, add result to output
• Add non-matching values from the entry into the Dewey
stack. Mark containment for the entry’s keyword
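The steps above can be sketched as a single merge pass over the inverted lists. This is an illustrative reading of the pseudocode, not the authors' code: the function name `dil_query` is hypothetical, lists hold only Dewey IDs (as tuples of integers), and results are collected in a plain list instead of a rank-ordered heap.

```python
import heapq


def dil_query(lists):
    """lists: one sorted list of Dewey IDs (tuples of ints) per keyword.
    Returns the IDs of the most specific elements containing all keywords."""
    k = len(lists)
    # One stream of (dewey_id, keyword_index), ordered by Dewey ID.
    stream = heapq.merge(*[[(dewey, i) for dewey in lst] for i, lst in enumerate(lists)])
    stack = []     # entries: [id component, contains flags per keyword, contains_all]
    results = []

    def pop():
        comp, contains, contains_all = stack.pop()
        dewey = tuple(entry[0] for entry in stack) + (comp,)
        if all(contains):
            results.append(dewey)        # no more specific element qualified
            contains_all = True
        if stack:
            parent = stack[-1]
            if contains_all:
                parent[2] = True         # this subtree is "removed" for ancestors
            else:
                # Propagate containment information to the parent.
                parent[1] = [a or b for a, b in zip(parent[1], contains)]

    for dewey, kw in stream:
        lcp = 0                          # longest common prefix with the stack
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:          # pop non-matching entries
            pop()
        for comp in dewey[lcp:]:         # push the remaining components
            stack.append([comp, [False] * k, False])
        stack[-1][1][kw] = True          # mark containment for this keyword
    while stack:
        pop()
    return results
```

On the lists of the running example, XQL = 47.0.0.0, 47.0.0.2, 47.0.0.2.0, 47.0.1 and language = 47.0.0.1, 47.0.0.2.0, this sketch returns 47.0.0.2.0 (the subsection) and 47.0.0 (the first paper).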
47
Example: Algorithm
[Figure: initial state of the inverted lists for XQL and language, the Dewey stack (DeweyID, Contains[1], Contains[2], ContainsAll) and the result heap]
48
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Smallest entry is for keyword 1, XQL.
lcp with Dewey stack = none. Pop (nothing). Add (all).
49
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap after the entry 47.0.0.0 has been added to the stack]
50
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Next smallest entry is for keyword 2, language.
lcp with Dewey stack = 47.0.0
Pop non-matching entries
51
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Next smallest entry is for keyword 2, language.
lcp with Dewey stack = 47.0.0
Add additional entries
52
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Next smallest entry is for keyword 2, language.
lcp with Dewey stack = 47.0.0
53
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Next smallest entry is for keyword 1, XQL.
lcp with Dewey stack = 47.0.0
Pop non-matching entries
54
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap at this step]
Next smallest entry is for keyword 1, XQL.
lcp with Dewey stack = 47.0.0
Continue on Blackboard!
55