Abstract: This article presents a new, ontology-based approach to information retrieval (IR). The system is based on a domain knowledge representation schema in the form of an ontology. New resources registered within the system are linked to conce
Figure 1. Ontology-based document retrieval
3. EXPERIMENTS
3.1. DOCUMENT COLLECTION
The Cystic Fibrosis collection was used for our experiments. It consists of 1239 files [1] and is a subset of the larger MEDLINE collection, selected using the keyword Cystic Fibrosis. The minimum file size is 0.12 kB, the maximum is 3.8 kB, and the average is 1.045 kB.
A file with 100 queries is supplied with the document collection, and the set of relevant documents is known for each query. Each document in the answer set is ranked by several experts with respect to its relevance to the query; the ranking can take values from 0 to 8 (see Table 1). In our experiments, a document was considered relevant to a query if its average expert ranking was greater than 4.
The collection can be viewed as a group of documents and ontology concepts, where every document is assigned to an appropriate set of concepts and, conversely, every concept can “hold” some documents. There are 821 concepts, and the average number of concepts assigned to a document is 2.8. Viewed from the concept side, the average number of documents assigned to one concept is 4.2.
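The relevance criterion above can be sketched as a small filter. This is a minimal illustration, not the authors' code: the function name, the dictionary layout and the sample scores are hypothetical, and it assumes that each expert score is already expressed on the 0 to 8 scale mentioned in the text.

```python
from statistics import mean

# From the text: a document counts as relevant when its average
# expert ranking is greater than 4.
RELEVANCE_THRESHOLD = 4


def relevant_documents(judgments, threshold=RELEVANCE_THRESHOLD):
    """Return the ids of documents whose mean expert ranking exceeds threshold.

    judgments: dict mapping a document id to a list of expert rankings
    (hypothetical layout; the collection's real file format is not shown here).
    """
    return {
        doc_id
        for doc_id, scores in judgments.items()
        if scores and mean(scores) > threshold
    }


# Made-up rankings for one query, for illustration only.
judgments = {
    "doc_a": [6, 5, 7],  # mean 6.0, relevant
    "doc_b": [4, 4, 4],  # mean 4.0, not strictly greater than 4
    "doc_c": [2, 8, 3],  # mean about 4.33, relevant
}
relevant = relevant_documents(judgments)
```

Raising the threshold argument gives a stricter relevance set for the same judgments.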
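The two views of the collection (documents with their concepts, and concepts with their documents) can be sketched as a mapping and its inverse. This is an illustrative sketch under assumed names; the identifiers and the toy data are hypothetical and are not taken from the Webocrat system.

```python
from collections import defaultdict


def invert(doc_concepts):
    """Turn a document -> concepts mapping into a concept -> documents mapping."""
    concept_docs = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            concept_docs[concept].add(doc_id)
    return dict(concept_docs)


def average_size(mapping):
    """Average number of items assigned to each key of the mapping."""
    if not mapping:
        return 0.0
    return sum(len(items) for items in mapping.values()) / len(mapping)


# Toy data for illustration only.
doc_concepts = {
    "d1": {"c1", "c2"},
    "d2": {"c1"},
    "d3": {"c1", "c2", "c3"},
}
concept_docs = invert(doc_concepts)

avg_concepts_per_doc = average_size(doc_concepts)  # 6 assignments / 3 documents
avg_docs_per_concept = average_size(concept_docs)  # 6 assignments / 3 concepts
```

Both averages are computed from the same set of assignments, so the number of documents times the first average should roughly equal the number of concepts times the second. For the reported figures, 1239 × 2.8 ≈ 3469 and 821 × 4.2 ≈ 3448, which agree up to rounding.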
Name of collection   Relevance   Min num. of   Max num. of   Average num.
                                 documents     documents     of documents
Cystic fibrosis      3           1             131           17.95
                     4           1             121           15.59
                     5           1             114           13.5
                     6           1             96            11.03
Table 1. Cystic fibrosis document collection
3.2. COMPARISON OF VARIOUS APPROACHES FOR DOCUMENT RETRIEVAL
In this section we compare document retrieval experiments in which three different approaches were used: full-text search (vector representation approach), latent semantic indexing (LSI), and the ontology-based approach. The first approach was used as described above, with the lower document frequency threshold set to 0.2% and the upper threshold set to 80%, i.e. only terms whose document frequency falls within this interval were included in the index. The threshold for LSI dimension reduction was set to 100.
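The document frequency thresholds can be illustrated with a short term filter. This is a sketch only: the function name and the tokenised toy documents are hypothetical, and treating both interval bounds as inclusive is an assumption, since the text does not say how boundary values were handled.

```python
from collections import Counter


def filter_terms_by_df(docs, min_df=0.002, max_df=0.80):
    """Keep terms whose document frequency lies within [min_df, max_df].

    docs: list of token lists, one per document.
    The defaults mirror the 0.2% and 80% thresholds quoted in the text.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return set()
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {
        term
        for term, count in df.items()
        if min_df <= count / n_docs <= max_df
    }


# Toy collection; thresholds are widened so that the effect is visible
# on only five documents.
docs = [
    ["cystic", "fibrosis", "lung"],
    ["cystic", "fibrosis", "gene"],
    ["cystic", "fibrosis", "lung"],
    ["cystic", "sweat"],
    ["cystic", "fibrosis"],
]
index_terms = filter_terms_by_df(docs, min_df=0.3, max_df=0.9)
```

In the toy run, "cystic" is dropped because it occurs in every document, while "gene" and "sweat" are dropped because they occur in only one document each.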
Figure 2. Precision-recall curve for three analyzed retrieval approaches.
Precision-recall curves for all of the approaches described above are presented in Figure 2. Our experiments showed that the Webocrat-like approach based on an ontology is very promising, providing better retrieval efficiency than LSI or the standard full-text approach. However, as mentioned above, concepts were assigned to queries manually.
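For reference, the points of a precision-recall curve for a single query can be computed as in the following sketch. It uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved); the function name and the toy ranking are hypothetical, and the averaging over queries and any interpolation behind Figure 2 are not specified in the text and are not reproduced here.

```python
def precision_recall_points(ranked, relevant):
    """Return (recall, precision) pairs, one at each rank where a relevant
    document is retrieved.

    ranked: document ids in the order returned by the retrieval system.
    relevant: set of ids judged relevant for the query.
    """
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Toy example: three relevant documents, two of them retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2", "d5"}
points = precision_recall_points(ranked, relevant)
```

Because "d5" is never retrieved, recall in this toy run stops at 2/3.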
4. CONCLUSIONS
In this paper we have presented the results of experiments performed to evaluate the retrieval efficiency of the ontology-based approach implemented within the Webocrat system. We compared it in a series of experiments with two other frequently used information retrieval techniques (the vector model with a tf-idf weighting scheme and the latent semantic indexing model). The experiments on the well-known Cystic Fibrosis document collection have shown that the ontology-based approach employed in the Webocrat system is very promising and may yield better precision-recall characteristics.
However, there are still open questions related to this approach. Probably the major one is how to transform a user-defined query into a set of concepts from the actual ontology. In our