The Internet Encyclopedia (Volume 3)

(coco) #1

P1: 57


Yu WL040/Bidgolio-Vol I WL040-Sample.cls June 20, 2003 17:52 Char Count= 0


METASEARCHENGINETECHNOLOGY 747

the INQUERY document retrieval system (Callan, Croft, &
Harding, 1992). This technique is known as theinference
networkapproach. When ranking collections, CORI Net
conceptually treats each collection representative as a (su-
per)document. Thus, the set of all representatives forms
a collection of superdocuments. LetDdenote this collec-
tion of superdocuments. Consider collectionCof regular
documents and a termtinC. Thedfioftcan be treated as
theterm frequencyoftin the superdocument ofC. Simi-
larly, thecfioftcan be treated as thedocument frequency
oftinD. This means that we have the term frequency
and document frequency of each term in each superdoc-
ument. With the representative of each collection being
treated like a superdocument, the problem of ranking col-
lections has been reduced to the problem of ranking (su-
per)documents. At this point, any term frequency- and
document frequency-based term-weighting scheme may
be utilized to compute the weight of each term in each
superdocument. As a result, all superdocuments can be
represented as vectors of terms with weights. Further-
more, any vector-based similarity function such as those
discussed under Text Retrieval may be used to compute
the similarities between a given query and any superdoc-
ument, and these similarities can be used as the ranking
scores of collections for the query.
CORI Net employs an inference network-based prob-
abilistic approach to rank superdocuments for a given
query. In this approach, the ranking score of a collection
with respect to queryqis an estimated belief that the col-
lection contains useful documents to the query. The belief
is the combined probability that the collection contains
useful documents due to each query term. Suppose query
qcontainsktermst 1 ,...,tk. LetNbe the number of collec-
tions under consideration. Letdfijbe the document fre-
quency of thejth term in theith collectionCiandcfjbe
the collection frequency of thejth term. The belief that
Cicontains useful documents due to thejth query term
is estimated by

p(tj|Ci)=a 1 +(1−a 1 )·Tij·Ij,

where

Tij=a 2 +(1−a 2 )·

dfij
dfij+K

,

Ij=log

(
N+ 0. 5
cfj

)/
log(N+ 1 .0).

In the above,Tijis a formula employed by CORI Net for
computing the term frequency weight of thejth term in
the superdocument ofCi,Ijis for computing the inverse
document frequency weight of the jth term,a 1 anda 2
are constants andKis a parameter related to the size
of collectionCi. Appropriate values fora 1 ,a 2 , andKcan
be determined empirically by experiments (Callan et al.,
1995). Letp(q|tj) denote the belief that termtjrepresents
queryq. One way to estimatep(q|tj) is to use the weight
oftjinq. Finally, the belief that collectionCicontains
documents useful to queryqcan be estimated to be

ri=p(q|Ci)=

∑k

j= 1

p(q|tj)·p(tj|Ci)

A nice feature of the CORI Net approach is that the
same method can be used to rank documents for a query
as well as to rank collections for a query. More informa-
tion about the CORI Net approach can be found in Callan
(2000).

Most Similar Document Approach.One useful measure
for ranking databases is the global similarity of the most
similar document in a database for a given query. The
global similarity may incorporate other usefulness mea-
sures of a page such as the PageRank in addition to the
usual similarity between a query and the page. Theoreti-
cally, documents should be retrieved in descending order
of some global measure to achieve optimal retrieval, if
such a measure truly indicates the degree of relevance.
The global similarities of the most similar documents in
all databases can be used to rank databases optimally
for a given query for retrieving themmost similar doc-
uments across all databases. Suppose there areMlocal
databasesD 1 ,D 2 ,...,DMand themmost similar docu-
ments from these databases are desired. Intuitively, it is
desirable to order these databases in such a way that the
firstkdatabases collectively contain themmost similar
documents and each of thesekdatabases contains at least
one of these documents for some integerk. The following
definition introduces the concept of an optimal database
order for a given query (Yu, Meng, Liu, Wu, & Rishe,
1999).
Definition.For a given queryq, databasesD 1 ,D 2 ,...,
DM are said to be optimally ranked in the order
[D 1 ,D 2 ,...,DM] if there exists an integerksuch thatD 1 ,
D 2 ,...,Dkcontain all of themmost similar documents
and eachDi,1≤i≤k, contains at least one of them
most similar documents.
Clearly, if databases are optimally ranked for a query,
then it is sufficient to search the firstkdatabases only.
Letmsim(q,D) be the global similarity of the most sim-
ilar document in databaseDwith the queryq. It can be
shown (Yu et al., 1999) that if databases are ranked in de-
scending order of theirmsim(q,D)’s, then these databases
are optimally ranked with respect toq. This theory can-
not be applied as is because it is not practical to search
all databases, find the global similarities of the most sim-
ilar documents in these databases, and then rank them
for each query. The solution to this problem is to estimate
msim(q,D) for any databaseDusing the representative
ofDstored in the metasearch engine.
One method for estimatingmsim(q,D) is as follows
(Yu et al., 1999). Two types of representatives are used.
In the global representative for all local databases, the
global inverse document frequency weight (gidfi) is kept
for each distinct termti. There is a local representative
for each local databaseD. For each distinct termtiinD,
two quantities,mnwiandanwi, are kept, wheremnwiand
anwiare themaximum normalized weightand theaverage
normalized weightof termti, respectively. Ifdiis the weight
oftiin a documentd, thend/|d|is the normalized weight
oftiind, where|d|is the length ofd. For termtiin database
D,mnwiandanwiare defined to be the maximum and the
average of the normalized weights oftiin all documents
inD, respectively. Let (q 1 ,...,qk) be the vector of queryq.
Free download pdf