Abstract: This article presents a new, ontology-based approach to information retrieval (IR). The system is based on a domain knowledge representation schema in the form of an ontology. New resources registered within the system are linked to conce
Model elements can also be used for the search and retrieval of relevant documents. If all documents are linked to the same domain model, the similarity between documents can be calculated using the above-mentioned conceptual structure of this domain model. This approach also supports ‘soft’ techniques, in which a search engine uses the domain model to find concepts related to those specified by the user. The search engine can thus return every document linked to concepts that are close enough to the concepts mentioned in the user’s query.
To evaluate the retrieval efficiency of this ontology-based approach, we ran a series of experiments comparing it with two other frequently used information retrieval techniques (the vector model with the tf-idf weighting scheme and the latent semantic indexing model). Section 2 briefly describes all three retrieval methods. Section 3 describes the data set used in the experiments and the results achieved. Finally, Section 4 summarizes the experimental results and offers suggestions for future work.
2. SCHEME OF DOCUMENT RETRIEVAL
We developed a package with three different approaches to document retrieval: vector representation, the latent semantic indexing (LSI) method, and the ontology-based method used in the Webocrat system. Each of these approaches is briefly described in the following subsections.
2.1. VECTOR REPRESENTATION APPROACH
This well-known approach is based on a vector representation of the document collection. First, every document is passed through a set of pre-processing tools (conversion to lower case, stop-word filtering, document-frequency filtering). Then a vector of index term weights is calculated as the internal representation of the document. These weights are calculated using the most commonly used tf-idf scheme [4]:
$$w_{ij} = tf_{ij} \times idf_i, \quad \text{where} \quad tf_{ij} = \frac{freq_{ij}}{\max_e freq_{ej}} \quad \text{and} \quad idf_i = \log\frac{N}{n_i}.$$
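Here, $freq_{ij}$ is the number of occurrences of term $t_i$ in document $d_j$ (the maximum in the denominator of $tf_{ij}$ is taken over all terms $t_e$ in $d_j$), $N$ is the number of documents in the collection, and $n_i$ is the document frequency of term $t_i$ in the whole document collection.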
Such a vector is then normalized to unit length and stored in the term-document matrix A, which is the internal representation of the whole document collection.
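The weighting and normalization steps described above can be illustrated with a minimal Python sketch. It is not the package's actual implementation: the function name `tfidf_columns`, the use of pre-tokenized documents, and the natural logarithm are assumptions made for illustration (the text does not state the base of the logarithm, and after unit-length normalization the base does not affect the result).

```python
import math
from collections import Counter


def tfidf_columns(docs):
    """Build unit-length tf-idf vectors for a list of tokenized documents.

    Hypothetical helper: `docs` is a list of token lists that have already
    been pre-processed. Returns (vocab, columns), where columns[j] is the
    weight vector of document j, i.e. column j of the term-document matrix.
    """
    n_docs = len(docs)                                  # N
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    # n_i: number of documents that contain term t_i
    doc_freq = Counter(term for doc in docs for term in set(doc))

    columns = []
    for doc in docs:
        freq = Counter(doc)                             # freq_ij for this document
        max_freq = max(freq.values()) if freq else 1    # max_e freq_ej
        vec = [0.0] * len(vocab)
        for term, f in freq.items():
            tf = f / max_freq
            idf = math.log(n_docs / doc_freq[term])     # natural log (assumption)
            vec[index[term]] = tf * idf
        # Normalize to unit length; an all-zero vector is left unchanged.
        norm = math.sqrt(sum(w * w for w in vec))
        if norm > 0:
            vec = [w / norm for w in vec]
        columns.append(vec)
    return vocab, columns


if __name__ == "__main__":
    toy_docs = [
        ["ontology", "retrieval", "retrieval"],
        ["retrieval", "vector"],
        ["ontology", "model"],
    ]
    vocab, columns = tfidf_columns(toy_docs)
    for term, weight in zip(vocab, columns[0]):
        print(f"{term}: {weight:.4f}")
```

Note that a term occurring in every document receives an idf of zero and therefore contributes nothing to any document vector.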
To find documents relevant to a specific query Q, the query must be represented in the same way as a document $D_i$ (i.e., as a vector of index term weights). The similarity between a query Q and a document $D_i$ is computed as the cosine of the angle between the two normalized vectors (the document and query vectors):
$$\mathrm{sim}_{TF\text{-}IDF}(Q, D_i) = \frac{D_i \cdot Q}{|D_i|\,|Q|}$$
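As an illustration of the cosine measure, the following Python sketch scores a query vector against a small set of document vectors and ranks the documents. The function names `cosine_similarity` and `rank_documents`, and the convention of returning 0 for an all-zero vector, are assumptions made for this example and are not taken from the described package.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_d * norm_q)


def rank_documents(doc_vecs, query_vec):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(d, query_vec)) for i, d in enumerate(doc_vecs)]
    # Sort by descending similarity; ties are broken by document index.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query_vec = [1.0, 0.0]
    for doc_index, score in rank_documents(doc_vecs, query_vec):
        print(doc_index, round(score, 4))
```

When both vectors are already normalized to unit length, the denominator equals 1 and the similarity reduces to the dot product; the general form is kept here so the function also works on unnormalized input.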
2.2. LATENT SEMANTIC INDEXING APPROACH
The LSI approach is based on the singular value decomposition of the tf-idf matrix A. This decomposition yields three matrices [8]:
$$A = U S V^T$$