Oracle8
ConText Cartridge Application Developer's Guide
Release 2.4 A63821-01
This appendix describes the scoring algorithm for text queries.
Note: This appendix discusses how ConText calculates scores for text queries, which is different from the way it calculates scores for theme queries. For more information about scoring for theme queries, see "Theme Querying" in Chapter 4.
To calculate a relevance score for a returned document in
a text query, ConText uses an inverse frequency algorithm. Inverse frequency
scoring assumes that frequently occurring terms in a document set are "noise"
terms, and so these terms are scored lower. For a document to score high,
the query term must occur frequently in the document but infrequently in
the document set as a whole.
The following table illustrates ConText's inverse frequency
scoring. The first column shows the number of documents in the document
set, and the second column shows the number of times the query term must
occur in a document for that document to score 100.
This table assumes that only one document in the set contains
the query term.
Number of Documents in Document Set | Frequency of Term in Document |
---|---|
1 | 34 |
5 | 20 |
10 | 17 |
50 | 13 |
100 | 12 |
500 | 10 |
1,000 | 9 |
10,000 | 7 |
100,000 | 5 |
1,000,000 | 4 |
The table illustrates that if only one document contains
the query term and there are five documents in the set, the term must
occur 20 times in the document to score 100. If there are
1,000,000 documents in the set, the term needs to occur only
4 times in the document to score 100.
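This appendix does not state the formula behind the table. As a rough sketch, the following assumes a formula of the form 3f(1 + log10(N/n)), which later Oracle Text documentation gives for its scoring algorithm; whether ConText 2.4 uses exactly this formula is an assumption. Here f is the term frequency in the document, N is the number of documents in the set, and n is the number of documents containing the term. The function names are hypothetical.

```python
import math


def text_score(term_freq, docs_total, docs_with_term):
    """Assumed inverse frequency score: 3 * f * (1 + log10(N / n)), capped at 100.

    This formula is an assumption, not something stated in this appendix.
    """
    raw = 3 * term_freq * (1 + math.log10(docs_total / docs_with_term))
    return min(100.0, raw)


def min_frequency_for_100(docs_total, docs_with_term=1):
    """Smallest term frequency at which the assumed score reaches 100."""
    freq = 1
    while text_score(freq, docs_total, docs_with_term) < 100:
        freq += 1
    return freq


# Print the minimum frequency for several document set sizes,
# assuming only one document contains the term.
for n_docs in (1, 5, 10, 50, 100, 500, 1000, 10000):
    print(n_docs, min_frequency_for_100(n_docs))
```

Under this assumed formula, the computed frequencies agree with the table for sets of up to 10,000 documents. For the two largest sets the formula yields a frequency one higher than the table, so treat it as an approximation of ConText's behavior rather than a specification.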
For example, suppose you have 5000 documents dealing with chemistry, and the
term chemical occurs at least once in every document. The term chemical
thus occurs frequently in the document set.
You have a document that contains 5 occurrences of chemical
and 5 occurrences of the term hydrogen. No other document contains
the term hydrogen.
Because chemical occurs so frequently in the document
set, its score for the document is lower than the score for hydrogen,
which is infrequent in the document set as a whole. This is so even though
both terms occur 5 times in the document.
Inverse frequency scoring also means that adding documents
that contain hydrogen lowers the score for that term in the document,
and adding more documents that do not contain hydrogen raises the
score.
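The chemical and hydrogen example can be worked through numerically. The sketch below assumes a scoring formula of the form 3f(1 + log10(N/n)), as given in later Oracle Text documentation; this appendix does not state the formula, so the exact numbers are illustrative only. The function name and the added document counts (99 and 5000) are hypothetical.

```python
import math


def text_score(term_freq, docs_total, docs_with_term):
    """Assumed inverse frequency score: 3 * f * (1 + log10(N / n)), capped at 100.

    This formula is an assumption, not something stated in this appendix.
    """
    raw = 3 * term_freq * (1 + math.log10(docs_total / docs_with_term))
    return min(100.0, raw)


TOTAL_DOCS = 5000

# Both terms occur 5 times in the document being scored.
chemical = text_score(5, TOTAL_DOCS, 5000)  # term appears in every document
hydrogen = text_score(5, TOTAL_DOCS, 1)     # term appears in this document only

# Add 99 more documents that contain hydrogen: the score for hydrogen drops.
hydrogen_more_matches = text_score(5, TOTAL_DOCS + 99, 100)

# Add 5000 more documents that do not contain hydrogen: the score rises.
hydrogen_more_docs = text_score(5, TOTAL_DOCS + 5000, 1)

print("chemical:", round(chemical, 1))
print("hydrogen:", round(hydrogen, 1))
print("hydrogen, more matching documents:", round(hydrogen_more_matches, 1))
print("hydrogen, more non-matching documents:", round(hydrogen_more_docs, 1))
```

The ordering of the four values is what matters here, not the exact numbers: the common term scores lowest, and the score for the rare term moves down or up as matching or non-matching documents are added.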
Because the scoring algorithm is based on the number of documents
in the document set, inserting, updating, or deleting documents in the document
set is likely to change the score for any given term before and after the
DML.
If DML is heavy, you or your ConText administrator must optimize
the index. Perfect relevance ranking is obtained by executing a query right
after optimizing the index.
If DML is light, ConText still gives fairly accurate relevance
ranking.
In either case, you or your ConText administrator must synchronize the index with CTX_DML.SYNC whenever DML is performed on the index.
See Also: For more information about optimizing and synchronizing an index, see Oracle8 ConText Cartridge Administrator's Guide.