# Principles of Data Mining (Undergraduate Topics in Computer Science)

## Max Bramer

Language: English

Pages: 454

ISBN: 1447148835

Format: PDF / Kindle (mobi) / ePub

Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas.

*Principles of Data Mining* explains and explores the principal techniques of Data Mining: for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail.

This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data.

*Principles of Data Mining* aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, as well as enabling advanced readers or academic researchers to understand or contribute to future technical advances in the field.

Suitable as a textbook to support courses at undergraduate or postgraduate levels in a wide range of subjects including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.

Chapter 20: Text Mining (DOI 10.1007/978-1-4471-4884-5_20, © Springer-Verlag London 2013). Max Bramer, School of Computing, University of Portsmouth, Portsmouth, UK.

Abstract: This chapter looks at a particular type of classification task, where the objects are text documents. A method of processing the documents for use by the classification algorithms given earlier in this book, using a bag-of-words representation, is described. An important special case of text classification arises when the …

Suppose there are 4 instances with classification 1, 5 instances with classification 2 and 15 instances with classification 3 (24 instances in total). So p1 = 4/24, p2 = 5/24 and p3 = 15/24. We will call the entropy E_start. It is given by

E_start = −(4/24) log2(4/24) − (5/24) log2(5/24) − (15/24) log2(15/24)
        = 0.4308 + 0.4715 + 0.4238
        = 1.3261 bits

(these and subsequent figures in this chapter are given to four decimal places).

### 5.3.3 Using Entropy for Attribute Selection

The process of decision tree generation by repeatedly splitting on attributes is equivalent to …
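The entropy calculation above can be checked with a short Python sketch. The function name `entropy` is ours, not from the book; it takes the class distribution as a list of instance counts:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as instance counts."""
    total = sum(counts)
    # Terms with zero count contribute nothing (0 * log2 0 is taken as 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Class distribution from the worked example: 4, 5 and 15 instances.
e_start = entropy([4, 5, 15])
print(round(e_start, 4))  # 1.3261
```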

A test such as length < 2.4 would include instances where length is say 2.39999 and exclude those where length is 2.40001. It is highly unlikely that there is any real difference between those values, especially if they were all measured imprecisely by different people at different times. On the other hand, if there were no values between say 2.3 and 2.5, a test such as length < 2.4 would probably be far more reasonable. Another possibility would be to divide length into three ranges, this time so that there is the same number of instances in each of the three ranges …
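The equal-frequency approach described above can be sketched in a few lines of Python. The function and the sample length values are ours, invented for illustration; cut points are placed midway between adjacent values so that no instance sits exactly on a boundary:

```python
def equal_frequency_bins(values, n_bins=3):
    """Split attribute values into n_bins ranges containing (as nearly as
    possible) the same number of instances; return the cut points."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    # Each cut point lies midway between the last value of one bin and
    # the first value of the next.
    return [(ordered[i * size - 1] + ordered[i * size]) / 2
            for i in range(1, n_bins)]

# Nine hypothetical length measurements; three bins of three instances each.
lengths = [1.2, 1.9, 2.1, 2.3, 2.5, 2.8, 3.0, 3.3, 3.7]
print([round(c, 2) for c in equal_frequency_bins(lengths)])  # [2.2, 2.9]
```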

… stored in the database is a separate issue, which is not considered here. For convenience we write the items in an itemset in the order in which they appear in I, the set of all possible items, i.e. {a, b, c} not {b, c, a}. All itemsets are subsets of I. We do not count the empty set as an itemset, so an itemset can have anything from 1 up to m members.

### 17.3 Support for an Itemset

We will use the term support count of an itemset S, or just the count of an itemset S, to mean the number of transactions in the database that contain all the items in S …
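The support count just defined is straightforward to compute directly. A minimal sketch, with a hypothetical database of four transactions of our own invention:

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

# Hypothetical transaction database over items {a, b, c, d, e}.
transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c", "e"},
]
print(support_count({"a", "c"}, transactions))  # 2
```

Scanning every transaction like this is fine for small databases; the FP-tree methods of Chapter 18 exist precisely to avoid repeated scans when the database is large.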

… respectively. In all cases the itemset ends with item p rather than starting with item f. Looking at Figure 18.18 this way, the support counts for the a, c and f nodes cannot be 3, 3 and 4 respectively, as they were in the FP-tree. If there are two transactions that include item p, there cannot be more than 2 transactions that include items a and p together, or any other such combination. For this reason the best approach to constructing the conditional FP-tree for {p} is to construct the tree …
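The point that no item's count in the conditional tree can exceed the support of p can be illustrated with a short sketch. The prefix paths and counts below are invented for illustration, not taken from Figure 18.18; each path leading to a p node contributes that p node's count, rather than the original node counts, to the items along it:

```python
from collections import Counter

def conditional_counts(prefix_paths):
    """Recount item supports for a conditional FP-tree.

    prefix_paths: list of (items, count) pairs, where items is the path of
    items leading to a p node and count is the support of that p node.
    Each item's new count is the sum of the p-counts of the paths it lies
    on, so it can never exceed the total support of p.
    """
    counts = Counter()
    for items, count in prefix_paths:
        for item in items:
            counts[item] += count
    return counts

# Hypothetical conditional pattern base for {p}: two prefix paths, each
# reaching a p node with count 1, so p's total support is 2.
paths = [(("f", "c", "a"), 1), (("c", "b"), 1)]
print(conditional_counts(paths))  # c has count 2; f, a and b have count 1
```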