The Internet Encyclopedia (Volume 3)

P1: 57

Yu WL040/Bidgolio-Vol I WL040-Sample.cls June 20, 2003 17:52 Char Count= 0

METASEARCHENGINETECHNOLOGY 747

the INQUERY document retrieval system (Callan, Croft, & Harding, 1992). This technique is known as theinference networkapproach. When ranking collections, CORI Net conceptually treats each collection representative as a (su- per)document. Thus, the set of all representatives forms a collection of superdocuments. LetDdenote this collection of superdocuments. Consider collectionCof regular documents and a termtinC. Thedfioftcan be treated as theterm frequencyoftin the superdocument ofC. Simi- larly, thecfioftcan be treated as thedocument frequency oftinD. This means that we have the term frequency and document frequency of each term in each superdocument. With the representative of each collection being treated like a superdocument, the problem of ranking collections has been reduced to the problem of ranking (su- per)documents. At this point, any term frequency- and document frequency-based term-weighting scheme may be utilized to compute the weight of each term in each superdocument. As a result, all superdocuments can be represented as vectors of terms with weights. Further- more, any vector-based similarity function such as those discussed under Text Retrieval may be used to compute the similarities between a given query and any superdocument, and these similarities can be used as the ranking scores of collections for the query. CORI Net employs an inference network-based prob- abilistic approach to rank superdocuments for a given query. In this approach, the ranking score of a collection with respect to queryqis an estimated belief that the collection contains useful documents to the query. The belief is the combined probability that the collection contains useful documents due to each query term. Suppose query qcontainsktermst 1 ,...,tk. LetNbe the number of collections under consideration. Letdfijbe the document frequency of thejth term in theith collectionCiandcfjbe the collection frequency of thejth term. The belief that Cicontains useful documents due to thejth query term is estimated by

p(tj|Ci)=a 1 +(1−a 1 )·Tij·Ij,

where

Tij=a 2 +(1−a 2 )·

dfij dfij+K

,

Ij=log

( N+ 0. 5 cfj

)/ log(N+ 1 .0).

In the above,Tijis a formula employed by CORI Net for computing the term frequency weight of thejth term in the superdocument ofCi,Ijis for computing the inverse document frequency weight of the jth term,a 1 anda 2 are constants andKis a parameter related to the size of collectionCi. Appropriate values fora 1 ,a 2 , andKcan be determined empirically by experiments (Callan et al., 1995). Letp(q|tj) denote the belief that termtjrepresents queryq. One way to estimatep(q|tj) is to use the weight oftjinq. Finally, the belief that collectionCicontains documents useful to queryqcan be estimated to be

ri=p(q|Ci)=

∑k

j= 1

p(q|tj)·p(tj|Ci)

A nice feature of the CORI Net approach is that the same method can be used to rank documents for a query as well as to rank collections for a query. More informa- tion about the CORI Net approach can be found in Callan (2000).

Most Similar Document Approach.One useful measure for ranking databases is the global similarity of the most similar document in a database for a given query. The global similarity may incorporate other usefulness mea- sures of a page such as the PageRank in addition to the usual similarity between a query and the page. Theoreti- cally, documents should be retrieved in descending order of some global measure to achieve optimal retrieval, if such a measure truly indicates the degree of relevance. The global similarities of the most similar documents in all databases can be used to rank databases optimally for a given query for retrieving themmost similar documents across all databases. Suppose there areMlocal databasesD 1 ,D 2 ,...,DMand themmost similar documents from these databases are desired. Intuitively, it is desirable to order these databases in such a way that the firstkdatabases collectively contain themmost similar documents and each of thesekdatabases contains at least one of these documents for some integerk. The following definition introduces the concept of an optimal database order for a given query (Yu, Meng, Liu, Wu, & Rishe, 1999). Definition.For a given queryq, databasesD 1 ,D 2 ,..., DM are said to be optimally ranked in the order [D 1 ,D 2 ,...,DM] if there exists an integerksuch thatD 1 , D 2 ,...,Dkcontain all of themmost similar documents and eachDi,1≤i≤k, contains at least one of them most similar documents. Clearly, if databases are optimally ranked for a query, then it is sufficient to search the firstkdatabases only. Letmsim(q,D) be the global similarity of the most similar document in databaseDwith the queryq. It can be shown (Yu et al., 1999) that if databases are ranked in descending order of theirmsim(q,D)’s, then these databases are optimally ranked with respect toq. This theory can- not be applied as is because it is not practical to search all databases, find the global similarities of the most similar documents in these databases, and then rank them for each query. The solution to this problem is to estimate msim(q,D) for any databaseDusing the representative ofDstored in the metasearch engine. One method for estimatingmsim(q,D) is as follows (Yu et al., 1999). Two types of representatives are used. In the global representative for all local databases, the global inverse document frequency weight (gidfi) is kept for each distinct termti. There is a local representative for each local databaseD. For each distinct termtiinD, two quantities,mnwiandanwi, are kept, wheremnwiand anwiare themaximum normalized weightand theaverage normalized weightof termti, respectively. Ifdiis the weight oftiin a documentd, thend/|d|is the normalized weight oftiind, where|d|is the length ofd. For termtiin database D,mnwiandanwiare defined to be the maximum and the average of the normalized weights oftiin all documents inD, respectively. Let (q 1 ,...,qk) be the vector of queryq.

The Internet Encyclopedia (Volume 3)

Get our desktop app

Company

Features

Documentation

Resources