Information Retrieval Using Vector Space Models.
by
Jia Mao & Howard Zhou
Problems in Indexing Large Text Collections
Traditional indexing mechanisms pose many problems in retrieving information from large dynamic databases. Among them, scale and consistency have always been two of the most concerned issues. As the scale of today's databases increases continuously, the traditional way of indexing simply can not keep up with its rate. For instance, there are currently about 300 million Web pages on the Internet, and a typical search engine updates or acquires pointers to as many as 10 million Web pages in a single day, but because the pages are indexed at a much slower rate, the indexed collection of the largest search engine presently totals about 100 million documents. Moreover, since the traditional way of indexing relies heavily on human-generated indexes, it is difficult to maintain consistency. Experiments have shown that there is a 20% disparity on average in the terms chosen as appropriate to describe a given document by two different professional indexers.
These problems of scale and consistency have fueled the development of automated IR (Information Retrieval) techniques. Such methods can be applied to extremely large databases, and they can, without prejudice, model the concept-document association patterns that constitute the semantic structure of a document collection. Nonetheless, automated systems have their own problems such as complexities of language and disparities between the vocabulary of the systems' authors and that of their users.
Linear Algebra Coming into Rescure in Automated IR
Referring to Berry's paper, we will show how linear algebra can be used in automated IR. we will explore the automated Information Retrieval system using vector space model to manage and index large text collections. In this model, each document is encoded as a vector and each vector component reflects the importance of a particular term in representing the semantics or meaning of that document. Then the vectors for all documents in a database are stored as the columns of a single matrix. A user's query of the database will be represented as a vector of numbers. Relevant documents in the database are then identified via simple vector operations.
A Typical Model of a Vector Space Representation of Information
In the vector space IR model, a vector is used to represent each document in a collection. Each component of the vector reflects a particular concept, key word, or term associated with the given document. The value assigned to that component reflects the improtance of the term in representing the semantics of the document. Typically, the value is a function of the frequency with which the term occurs in the document.
A database containing a total of d documents described by t terms is represented as a txd term-by-document matrix A. The d vectors representing the d documents form the columns of the matrix. Thus, the matrix element aij is the weighted frequency at which term i occurs in document j. The columns of A are the document vectors, and the rows of A are the term vectors. The semantic content of the database is wholly contained in the column space of A.
The following example demonstrates how a simple collction of five titles described by six terms leads to a 6 x 5 term-by-document matrix.
The t = 6 terms:
The d = 5 document titles:
The 6 x 5 term-by-document matrix before normalization, where the element
aij is the number of times term i appears in document title j:

The 6 x 5 term-by-document matrix with unit columns:

Suppose that a user in search of cooking information initiates a search for books about baking bread. The corresponding query would be written as the vector
with nonzero entries for the terms baking and bread. The search for relevant documents is carried out by computing the cosines of the angles between the query vector q and the document vectors aj by the following formula. A document is returned as relevent only if the cosine value computed is greater than some cutoff value (predefined).

For this particula query q, the five cosines, in order, 0.8615, 0, 0, 0.5774, 0. Thus if we set the cutoff value to 0.5, all of the documents concerning baking bread (the first and fourth) are returned as relevant. The second, third, and fifth documents, which concern neither of these topics, are correctly ignored.
Improving the model with QR factorization and Singular Value Decomposition(SVD)
A newer method of latent semantic indexing(LSI) is a variant of the vector space model in which a low-rank approximation to the vector space representation of the database is employed. That is, we replace the original matrix by another matrix that is as close as possible to the original matrix but whose column space is only a subspace of the column space of the original matrix. Reducing the rank of the matrix serves to remove extraneous information or noise from the database it represents. Referring to Berry's paper, we will show how the process of rank reduction can be archieved by using QR factorization and then the less familiar but more powerful Singular Value Decomposition(SVD)
Page created on 06-09-2000