International Journal of Information Studies, Vol 1, No 1 (2009)
Performance Analysis of Distributed IR Systems: Document-based, Term-based and
Ahmad Abusukhon, Michael P. Oakes, Ayman M. Abdalla
Abstract
Information retrieval (IR) systems for large-scale data collections must build an index in order to provide efficient
retrieval that meets the user’s needs. In distributed IR systems, query response time is affected by the way in which the data collection
is partitioned across nodes. There are three types of collection partitioning: document-based partitioning (called the local index),
term-based partitioning (called the global index) and hybrid partitioning. Average query response time and load balance are highly
affected by the way in which the data collection is partitioned among nodes. In this paper, we analyze the average query response
time with respect to the number of nodes. We decompose the query response time into four components, namely search time, rank time, sort
time and communication time. We compare the above three types of partitioning in terms of average query response time, load
balance and query throughput for a system with 10 nodes (one broker and nine other nodes). Our results showed that within our
distributed IR system, the document-based and hybrid partitioning outperformed the term-based partitioning in terms of average
query response time, query throughput and load balance. However, unlike Xi, Somil, Luo, & Fox (2002), we did not find that hybrid
partitioning (using 10 nodes) was any better than document-based partitioning in terms of average query response time and query
throughput. In addition, our results showed that document-based partitioning performed better than hybrid partitioning in terms of
load balance.
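
To make the distinction between the local and the global index concrete, the following is a minimal sketch in Python, not the system evaluated in this paper. It assumes a toy whitespace tokenizer and round-robin assignment of documents or terms to nodes; the function names (build_index, document_partition, term_partition) are hypothetical and chosen only for illustration. Hybrid partitioning is not shown, since its exact definition is not given in the text above.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index mapping each term to a sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # assumption: whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def document_partition(docs, num_nodes):
    """Local index: each node indexes only its own subset of the documents."""
    shards = [dict() for _ in range(num_nodes)]
    # Assumption: documents are assigned to nodes round-robin by sorted id.
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % num_nodes][doc_id] = docs[doc_id]
    return [build_index(shard) for shard in shards]


def term_partition(docs, num_nodes):
    """Global index: each node holds the full posting lists of a subset of terms."""
    global_index = build_index(docs)
    nodes = [dict() for _ in range(num_nodes)]
    # Assumption: terms are assigned to nodes round-robin in sorted order.
    for i, term in enumerate(sorted(global_index)):
        nodes[i % num_nodes][term] = global_index[term]
    return nodes
```

Under document-based partitioning, a term's posting list is split across nodes, so a query generally has to be answered by every node and the partial results merged. Under term-based partitioning, each posting list is complete on exactly one node, so only the nodes that own the query terms need to be involved.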