Information Retrieval: Data Structures and Algorithms
Format: PDF / Kindle (mobi) / ePub
Information retrieval is a sub-field of computer science that deals with the automated storage and retrieval of documents. Providing the latest information retrieval techniques, this guide discusses Information Retrieval data structures and algorithms, including implementations in C. Aimed at software engineers building systems with book processing components, it provides a descriptive and evaluative explanation of storage and retrieval systems, file structures, term and query operations, document operations and hardware. Contains techniques for handling inverted files, signature files, and file organizations for optical disks. Discusses such operations as lexical analysis and stoplists, stemming algorithms, thesaurus construction, and relevance feedback and other query modification techniques. Provides information on Boolean operations, hashing algorithms, ranking algorithms and clustering algorithms. In addition to being of interest to software engineering professionals, this book will be useful to information science and library science professionals who are interested in text retrieval technology.
based on finite automata theory. Usually, before indexing, the text is filtered. Figure 2.7 shows the complete process for the text. Figure 2.7: Text preprocessing The preprocessing time needed to build the index is amortized by using it in searches. For example, if file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD...ooks_Algorithms_Collection2ed/books/book5/chap02.htm (12 of 15)7/3/2004 4:19:26 PM Information Retrieval: CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND building the index
Guild, Sterling, VA 22170 Abstract This chapter introduces and defines basic IR concepts, and presents a domain model of IR systems that describes their similarities and differences. The domain model is used to introduce and relate the chapters that follow. The relationship of IR systems to other information systems is dicussed, as is the evaluation of IR systems. 1.1 INTRODUCTION Automated information retrieval (IR) systems were originally developed to help manage the huge scientific literature
implementing different aspects of the computerization of the OED from 1985. Hence, there was considerable interest in indexing the OED to provide fast searching of its 600Mb of text. A log2(n) t hours, standard building of a Patricia tree in any of its forms would have required about n where n is the number of index points and t is the time for a random access to disk. As it turned out, the dictionary had about n = 119,000,000 and our computer systems would give us about 30 random 27/30 60 60 =
were flagged as unresolved in the final index and fixed in a later pass. With 48 characters per key it was possible to read 600,000 keys into a 32Mb memory. For a 600Mb text, this means that on average we are reading 1 key per kilobyte, so we can use sequential I/O. Each reading of the file needs between 30 and 45 minutes and for 120,000,000 index points it takes 200 passes, or approximately 150 hours. Thus, for building an index for the OED, this algorithm is not as effective as the large
Optical Disks. Paper presented at the annual meeting of the Special Interest Group for Information Retrieval of the Association of Computing Machinery (ACM SIGIR'89), Cambridge, Massachusetts. EASTON, M. C. 1986. "Key-Sequence Data Sets on Indelible Storage." IBM Journal of Research and Development, 30(3), 230-41. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD...ooks_Algorithms_Collection2ed/books/book5/chap06.htm (17 of 18)7/3/2004 4:19:45 PM Information Retrieval: CHAPTER 6: FILE