The vector space information retrieval model represents each document in a database as a vector. A single document is one column vector, and each component of that vector holds information about a particular keyword, or term, associated with the document. One way to assign a term's value is by its frequency: the number of times the word appears in the document becomes the value of that term in the document vector. This is how we will assign term values in our project. We could also perform more sophisticated searches by assigning global weights to each keyword. For example, the term "animal" could be given a lower weight than "elephant" to reflect the greater importance of a specific example relative to its more general group.
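As a minimal sketch of the term-frequency scheme described above, the following Python builds one document vector from raw text. The vocabulary, the sample sentence, the whitespace tokenizer, and the `global_weights` values are all illustrative assumptions, not part of the project.

```python
from collections import Counter

def document_vector(text, terms, global_weights=None):
    """Return one term-frequency value per entry of `terms`.

    `global_weights` is a hypothetical optional mapping term -> weight
    that scales each raw count; missing terms default to weight 1.0.
    """
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    weights = global_weights or {}
    return [counts[t] * weights.get(t, 1.0) for t in terms]

terms = ["animal", "elephant", "zoo"]  # illustrative vocabulary
text = "the elephant is an animal and the elephant lives in a zoo"

tf = document_vector(text, terms)
weighted = document_vector(text, terms, {"animal": 0.5, "elephant": 2.0})
print(tf)        # [1.0, 2.0, 1.0]
print(weighted)  # [0.5, 4.0, 1.0]
```

With the assumed weights, "elephant" counts for more than "animal" even before its higher frequency is taken into account.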
When many documents are put together, their columns form a term-by-document matrix. (We will henceforth refer to the dimensions of this matrix as t x d.) The columns of this matrix are called document vectors and the rows are called term vectors. For text collections spanning many contexts, like an encyclopedia, the number of terms is often greater than the number of documents, t >> d. In the case of the Internet, the situation is reversed. There are about 100 times more web pages than there are words in the largest English dictionary, so t << d.
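To make the t x d layout concrete, here is a small sketch that assembles a term-by-document matrix as a list of rows. The three terms, the four toy documents, and the whitespace tokenizer are assumptions chosen only for illustration.

```python
from collections import Counter

def term_document_matrix(docs, terms):
    """Return a t x d matrix as a list of rows.

    Entry [i][j] is the number of times term i appears in document j.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]  # naive tokenizer (assumption)
    return [[c[term] for c in counts] for term in terms]

terms = ["cat", "dog", "fish"]            # t = 3 (illustrative)
docs = ["cat chases dog",                 # d = 4 (illustrative)
        "fish fears cat",
        "dog dog everywhere",
        "one lonely fish"]

A = term_document_matrix(docs, terms)
term_vector = A[0]                        # row 0: "cat" across all documents
document_vector = [row[1] for row in A]   # column 1: the second document

print(A)                # [[1, 1, 0, 0], [1, 0, 2, 0], [0, 1, 0, 1]]
print(term_vector)      # [1, 1, 0, 0]
print(document_vector)  # [1, 0, 1]
```

Here t < d, the situation the text attributes to the Internet; an encyclopedia-like collection would instead have many more rows than columns.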
In the Vector Space Model, queries are made to the database with a query vector, which has the same dimension as a document vector (one component per term). The terms on which one wants to retrieve documents have higher values in this query vector. Documents are then returned from the database according to how geometrically close they are to the query vector. Closeness can be measured in many ways, but the measure we will use in this project is the cosine. During the querying process, the cosine of the angle between the query vector and every document vector is computed by the following formula:

$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}$$

for j = 1...d, where a_j is the j-th document column and q is the query vector. Documents whose cosines exceed a certain threshold are considered relevant; all others are considered irrelevant.
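The cosine test described above can be sketched in a few lines of Python. The matrix `A`, the query `q`, the threshold of 0.5, and the choice to return 0.0 for an all-zero vector are assumptions for illustration. Columns are indexed from 0 here, whereas the text counts j from 1.

```python
import math

def cosine(a, q):
    """Cosine of the angle between a and q (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, q))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_a == 0 or norm_q == 0:
        return 0.0  # assumption: treat an empty vector as matching nothing
    return dot / (norm_a * norm_q)

def relevant_documents(A, q, threshold):
    """Return (j, cosine) pairs for columns of A whose cosine exceeds threshold,
    ordered from most to least similar."""
    hits = []
    for j in range(len(A[0])):
        a_j = [row[j] for row in A]  # extract document column j
        c = cosine(a_j, q)
        if c > threshold:
            hits.append((j, c))
    return sorted(hits, key=lambda h: h[1], reverse=True)

A = [[1, 1, 0, 0],   # illustrative 3 x 4 term-by-document matrix
     [1, 0, 2, 0],
     [0, 1, 0, 1]]
q = [1, 0, 0]        # query on the first term only

hits = relevant_documents(A, q, threshold=0.5)
print([j for j, _ in hits])  # [0, 1]
```

Documents 0 and 1 each contain the queried term once alongside one other term, so both have cosine 1/sqrt(2), roughly 0.707. The other two documents do not contain the term and score 0.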
Why use the cosine as the similarity measure rather than, say, the norm of the difference between the query vector and the document column? Once the vectors are normalized, the two measures rank documents identically: for unit vectors a and q, ||a - q||^2 = 2 - 2 cos(theta), so the distance decreases monotonically as the cosine increases. In both cases, we have to normalize the vectors. However, the sparsity of the vectors, especially the query vector, is a key feature of the model. Consider computing the similarity of a very sparse query vector with a dense document vector. With the difference-norm method, you would have to subtract each entry of the query from the corresponding entry of the document, then square and sum the results to find the norm of the resulting dense vector. Even if you precompute the norms of the document vectors, not only is this precomputation expensive, but so is storing the values when we have databases with millions of documents. With cosines, however, we can take advantage of the sparsity of the query vector and compute only those multiplications (for the numerator of the formula) in which the query entry is nonzero. The number of additions is limited in the same way. The time saved by exploiting sparsity would be significant when searching the Internet.
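The sparsity argument can be illustrated with a short sketch. It assumes the document columns have already been normalized to unit length and are stored as dictionaries mapping term index to value; the helper names and toy data are hypothetical. The inner loop visits only the nonzero query entries, so a one-term query costs one multiplication per document.

```python
import math

def normalize(col):
    """Scale a sparse column {term_index: value} to unit 2-norm."""
    norm = math.sqrt(sum(v * v for v in col.values()))
    return {i: v / norm for i, v in col.items()}

def sparse_query_cosines(unit_columns, q_sparse):
    """Cosine of a sparse query with each unit-length document column.

    q_sparse holds only the nonzero query entries, so the dot product
    touches len(q_sparse) terms per document instead of all t terms.
    """
    q_norm = math.sqrt(sum(v * v for v in q_sparse.values()))
    if q_norm == 0:
        return [0.0] * len(unit_columns)  # assumption: empty query matches nothing
    return [
        sum(v * col.get(i, 0.0) for i, v in q_sparse.items()) / q_norm
        for col in unit_columns
    ]

# Illustrative columns, stored sparsely as {term_index: count}.
raw_columns = [{0: 1, 1: 1}, {0: 1, 2: 1}, {1: 2}, {2: 1}]
columns = [normalize(c) for c in raw_columns]

q_sparse = {0: 1.0}  # query with a single nonzero entry
cosines = sparse_query_cosines(columns, q_sparse)
print([round(c, 3) for c in cosines])  # [0.707, 0.707, 0.0, 0.0]
```

A difference-norm computation over the same data would have to visit every entry of every document column, whether or not the query mentions that term.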
We now demonstrate how this retrieval process works in the "Baking" Example.