# Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

## Dan Gusfield

Language: English

Pages: 556

ISBN: 0521585198

Format: PDF / Kindle (mobi) / ePub

Traditionally an area of study in computer science, string algorithms have, in recent years, become an increasingly important part of biology, particularly genetics. This volume is a comprehensive look at computer algorithms for string processing. In addition to pure computer science, Gusfield adds extensive discussions on biological problems that are cast as string problems and on methods developed to solve them. This text emphasizes the fundamental ideas and techniques central to today's applications. New approaches to this complex material simplify methods that up to now have been for the specialist alone. With over 400 exercises to reinforce the material and develop additional topics, the book is suitable as a text for graduate or advanced undergraduate students in computer science, computational biology, or bio-informatics.

determine i2 and i3 when needed during the search. We will first discuss i2. Recall that a denotes the substring of T that was matched in the search just ended. To determine i2, we need to find which pattern P in V contains a copy of a ending closest to its right end, but not occurring as a suffix of P . If that copy of a ends r places 7.16. APL15: A BOYER-MOORE APPROACH TO EXACT SET MATCHING 161 Figure 7.10: Substring a in F\ matched from position / down to position /of 7"; no further match

occurs in S[l..i]- So, s, equals cv and /, equals the string-depth of p. Exploiting the fact that the alphabet is fixed, the time to find (s,, /,) is <9(/,). Thus the entire compression algorithm runs in 0{m) time. 7.17.2. A one-pass version In the implementation above we assumed S to be known ahead of time and that a suffix tree for S could be built before compression begins. That works fine in many contexts, but the method can also be modified to operate on-line as S is being input, one

compaction process and prove that the resulting DAG recognizes substrings of S. The finite-state machine for S is created by expanding each edge of this DAG labeled by more than a single character. Each node in the DAG is now a state in the finite-state machine. 17. Show that the finite-state machine created above has the fewest number of states of any finite-state machine that recognizes substrings of S. The key to this proof is that a deterministic finite-state machine has the fewest number of

papers of the 1970s have a reputation for being extremely difficult to understand. That reputation is well deserved but unfortunate, because the algorithms, although nontrivial, are no more complicated than are many widely taught methods. And, when implemented well, the algorithms are practical and allow efficient solutions to many complex string problems. We know of no other single data structure (other than those essentially equivalent to suffix trees) that allows efficient solutions to such a

solution is easy to establish and is left as an exercise. This method runs in linear time and is therefore time optimal. A different linear-time method with a smaller constant was given by Shiloach [404]. 7.14. APL13: Suffix arrays - more space reduction In Section 6.5.1, we saw that when alphabet size is included in the time and space bounds, the suffix tree for a string of length m either requires 0 ( w | E | ) space or the minimum of O(m logm) and O(m log |E|) time. Similarly, searching for a