
# Terminology

## Data mining

Automatic searching for useful patterns in large amounts of data

• Classify
• Associate
• Cluster

## Model

A simple representation of the data

## Machine learning

Performance that improves with experience

Instance-based learning: just memorize all the possible sets of values

ML has some overlap with data mining, but DM involves more preprocessing, while ML uses processed data

# Weka

## Explorer

### Preprocess

• Open file -> data

### Classify

• Choose
• Prism
• Test options
• Don't test on the training set
• Cross validation
• Folds == number of models
• N folds means that the training data is divided into N portions and the algorithm is trained with N - 1 parts as the training data, while the remaining part is used to test the resulting model
• Stratified
• In weka, the class attribute is the last column -- it's the 'output'
• Tree pruning
• Reduces the tree size: an unpruned tree might be OVERFIT, so it may not perform well on new data
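The N-fold scheme above can be sketched in a few lines of Python. The fold assignment here is a simple round-robin split (Weka's stratified folds also balance class proportions, which this sketch does not do):

```python
def n_fold_splits(data, n):
    """Yield (train, test) pairs: each fold serves as the test set once."""
    folds = [data[i::n] for i in range(n)]  # round-robin split into n folds
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# 10 instances, 5 folds -> 5 models, each trained on 8 instances and tested on 2
data = list(range(10))
splits = list(n_fold_splits(data, 5))
```

Each instance appears in exactly one test fold, so every instance is used for testing exactly once across the N models.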

## Why not just use the training set?

With real world data,we likely won't be able to produce a perfect model anyway, so using less data to make the model gives us amore realistic version of what to expect

## Confusion matrix

Tells how many were correctly classified and how many were incorrectly classified. Ideally, the diagonal of the matrix should have all the values (sort of like an identity matrix)

# Classification

## ZeroR (0R)

No rules... Just the majority value.
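ZeroR is small enough to write out directly; this sketch just picks the most frequent class in the training data:

```python
from collections import Counter

def zero_r(classes):
    """ZeroR: always predict the majority class of the training data."""
    return Counter(classes).most_common(1)[0][0]

prediction = zero_r(["yes", "yes", "no", "yes"])
```

Useless as a real classifier, but it gives the baseline accuracy any other method has to beat.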

## OneR (1R)

Choose a single attribute for classification; choose the one with the least errors
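A sketch of OneR, assuming nominal attributes stored as tuples (the toy outlook/temperature data is invented for illustration). Each candidate attribute gets one rule per value, predicting that value's majority class; the attribute with the fewest resulting errors wins:

```python
from collections import Counter, defaultdict

def one_r(instances, classes):
    """OneR: one rule per attribute value; keep the attribute with fewest errors.
    Returns (attribute index, {value: predicted class}, error count)."""
    best = None
    for a in range(len(instances[0])):
        counts = defaultdict(Counter)              # value -> class counts
        for inst, cls in zip(instances, classes):
            counts[inst[a]][cls] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

# Toy data: (outlook, temperature) -> play
instances = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
classes = ["no", "no", "yes", "yes"]
attr, rules, errors = one_r(instances, classes)
```

Here outlook (attribute 0) classifies the toy data perfectly, so it is chosen with zero errors.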

## Discretization

Divide up the numeric values
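One simple way to divide up a numeric attribute, assuming the cut points are already chosen (the temperature boundaries here are arbitrary examples):

```python
def discretize(value, cut_points):
    """Map a numeric value to a bin index by counting the cut points it exceeds."""
    return sum(value > c for c in cut_points)

# Cut temperature at 70 and 80 -> cool / mild / hot
bins = ["cool", "mild", "hot"]
label = bins[discretize(75, [70, 80])]
```

After discretization, the numeric attribute can be handled by methods like OneR or Naive Bayes that expect nominal values.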

## Naive Bayes

• table of values
• Laplace estimator adds n (usually 1) to every count
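The Laplace estimator can be sketched as follows; the attribute values are invented, and the key point is that an unseen value ("overcast" below, count 0) still gets a nonzero probability:

```python
from collections import Counter

def laplace_probs(values, vocab, n=1):
    """Estimate P(v) with the Laplace estimator: add n to every count."""
    counts = Counter(values)
    total = len(values) + n * len(vocab)
    return {v: (counts[v] + n) / total for v in vocab}

probs = laplace_probs(["sunny", "sunny", "rainy"],
                      vocab=["sunny", "rainy", "overcast"])
```

Without the +n, "overcast" would get probability 0 and zero out the whole Naive Bayes product.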

## Prism Method

• If ? then class = value1
• Repeat until all the instances of the given class are used by rules

## Naive Bayes theorem (variables are independent)

### Temperature table

• If an attribute is unknown simply omit its value
• Removing redundant columns improves Naive Bayes predictions
• Likelihood and probability are not the same
• Pr(yes) = L(yes) / (L(yes) + L(no))
• To get rid of possible zeros, just add 1 to each number of instances for each class...? This avoids multiplying by zero
• Removing an attribute-- decrease likelihood fraction by 1??
• Note how to calculate likelihoods and probabilities
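The likelihood-to-probability step can be checked numerically. The fractions below are placeholder values in the style of the weather data (a class prior times one conditional probability per observed attribute, multiplied under the independence assumption):

```python
from math import prod

# likelihood = Pr(class) * product of Pr(attribute value | class)
likelihood_yes = prod([9/14, 2/9, 3/9])
likelihood_no  = prod([5/14, 3/5, 1/5])

# likelihoods are not probabilities until normalized by their sum
pr_yes = likelihood_yes / (likelihood_yes + likelihood_no)
pr_no  = likelihood_no  / (likelihood_yes + likelihood_no)
```

The two likelihoods need not sum to 1, but the normalized probabilities do, which is exactly why likelihood and probability are not the same thing.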

### Document classification

• Ground truth - the manual classification of documents

## (ID3) Choosing the first attribute to branch on

### Best, highest information gain

• See info() calculations
• When the formula contains a zero, that term's contribution is zero
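The info() and gain calculations can be sketched directly; the example numbers are the standard weather data (9 yes / 5 no, split on outlook into sunny, overcast, rainy):

```python
from math import log2

def info(counts):
    """Entropy of a class distribution; zero counts contribute zero."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_splits):
    """Information gain = info(parent) - weighted average info of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * info(c) for c in child_splits)
    return info(parent_counts) - remainder

# 9 yes / 5 no overall; outlook splits into [2,3], [4,0], [3,2]
g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

ID3 computes this gain for every attribute and branches on the one with the highest value; note that the pure [4, 0] branch has info 0, which is where the zero terms come in.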

## PRISM a simple covering algorithm

### Start with the classes, try to generate a rule set for each

• For each class
• While there are instances with the class
• Go through each attribute/value, find with highest accuracy
• Repeat till the rule is perfect

### If ? Then class value

• Fill out the ? field
• If astigmatism is true and ?Then class value
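The covering loop above can be sketched for one class. This is a simplified PRISM, assuming nominal attributes as tuples; ties in the p/t accuracy ratio are broken by whichever test is found first, and the toy lens-style data is invented:

```python
def prism(instances, classes, target):
    """Generate rules for `target`: grow each rule by adding the
    attribute-value test with the highest accuracy (p/t) until the rule
    is perfect, then remove the instances it covers."""
    remaining = list(zip(instances, classes))
    rules = []
    while any(c == target for _, c in remaining):
        rule, covered = [], remaining
        while any(c != target for _, c in covered):   # rule not yet perfect
            best = None
            used = {a for a, _ in rule}
            for a in range(len(instances[0])):
                if a in used:
                    continue
                for v in {inst[a] for inst, _ in covered}:
                    sub = [(i, c) for i, c in covered if i[a] == v]
                    p = sum(c == target for _, c in sub)   # correct under this test
                    if best is None or p / len(sub) > best[0]:
                        best = (p / len(sub), a, v, sub)
            _, a, v, covered = best
            rule.append((a, v))                        # fill in one more "?"
        rules.append(rule)
        remaining = [(i, c) for i, c in remaining
                     if not all(i[a] == v for a, v in rule)]
    return rules

# Toy data: (astigmatism, prescription) -> lens class
instances = [("true", "myope"), ("true", "hyper"),
             ("false", "myope"), ("false", "hyper")]
classes = ["hard", "soft", "none", "none"]
rules = prism(instances, classes, "hard")
```

For the "hard" class this grows "if astigmatism = true and prescription = myope then hard", i.e. the ? fields get filled one test at a time.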

## Association rules

• Item = attribute-value pair
• Coverage - frequency of occurrence
• Use hashing for efficiency when checking if the previous item sets contain this item...?
• If ABC -> D
• If X then Y; LHS is antecedent, RHS is consequent. See p/t notes
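Coverage and confidence for one rule can be computed directly over item sets (the transactions below are invented):

```python
def coverage_and_confidence(transactions, antecedent, consequent):
    """Coverage = instances matching LHS and RHS; confidence = that
    count divided by the instances matching the LHS alone."""
    lhs = [t for t in transactions if antecedent <= t]      # LHS matches
    both = [t for t in lhs if consequent <= t]              # LHS and RHS match
    return len(both), len(both) / len(lhs)

transactions = [{"A", "B", "C", "D"}, {"A", "B", "C"},
                {"A", "B", "D"}, {"B", "D"}]
cov, conf = coverage_and_confidence(transactions, {"A", "B", "C"}, {"D"})
```

For the rule ABC -> D, two transactions contain the antecedent and one of those also contains D, giving confidence 1/2.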

# Numerical Regression

## Linear model

### Numeric

• Straight line
• Linear regression
• Sum of squared errors
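For one input attribute, the line minimizing the sum of squared errors has a closed form; this sketch fits y = w0 + w1·x on made-up points:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1*x (minimizes sum of squared errors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

def sse(xs, ys, w0, w1):
    """Sum of squared errors of the fitted line on the data."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

w0, w1 = fit_line([1, 2, 3, 4], [2, 4, 6, 8])   # data lies exactly on y = 2x
```

With more attributes the same idea generalizes to a weight per attribute, solved as a linear system.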

## Nominal (class)

### Multi-response

• Attributes are all numeric, output is nonnumeric class
• Can convert to numeric (1, 0)
• Membership function

## Linearly separable

Can put a plane between data-points in different classes

## Linear Classification using the Perceptron

if $w \cdot a > 0$ predict class A else predict class B
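A sketch of the perceptron training rule on toy linearly separable points, encoding class A as +1 and class B as -1 and simulating the bias by prepending a constant 1 to each instance:

```python
def train_perceptron(data, epochs=20):
    """Perceptron rule: on a misclassified instance, add it to the
    weights (class A, +1) or subtract it (class B, -1)."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for a, cls in data:                       # cls is +1 or -1
            x = [1.0] + list(a)                   # prepend bias input
            score = sum(wi * xi for wi, xi in zip(w, x))
            if score * cls <= 0:                  # wrong side (or on boundary)
                w = [wi + cls * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data
data = [((2, 2), 1), ((3, 3), 1), ((-2, -1), -1), ((-3, -2), -1)]
w = train_perceptron(data)

def predict(a):
    return 1 if sum(wi * xi for wi, xi in zip(w, [1.0] + list(a))) > 0 else -1
```

If the classes are linearly separable, this loop is guaranteed to converge to a separating hyperplane; if not, it cycles forever, hence the epoch cap.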

## Instance based learning

From wikipedia: instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory. Instance-based learning is a kind of lazy learning.

Instance based learning involves storing all of the instances of the training set. Given a new instance, a, predict the class according to its nearest neighbour

Ways to find nearest neighbour:

• KD Tree
1. Creating the tree from the instances
1. Find the median point on the most widely distributed attribute, draw a line through it (this becomes a parent node)
2. Switch directions and draw lines through the other direction on either sides of the first line
3. Repeat until there is only at most 1 node inside each box. These nodes are leaf nodes
2. Finding the nearest neighbour for a new instance
1. A "star" represents a new instance - move it down the tree from the root toward where it is known to be on the attribute plane
2. When you've reached the last leaf in the direction of the star location, draw a circle around the star that is of the radius of the distance to that last leaf
3. Walk back up the tree and down adjacent branches as appropriate to decide if there is a closer neighbour based on the radius
• Euclidean distance
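The brute-force version of nearest-neighbour classification is a one-liner over stored instances (a KD tree returns the same answer with fewer distance computations); the training points here are invented:

```python
from math import dist   # Euclidean distance between two points

def nearest_neighbour_class(training, query):
    """1-NN: store every training instance, predict the class of the
    one closest to the query (Euclidean distance)."""
    instance, cls = min(training, key=lambda ic: dist(ic[0], query))
    return cls

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
label = nearest_neighbour_class(training, (1.1, 1.0))
```

Note there is no training step at all beyond storing the instances, which is what makes this "lazy" learning.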