Automatic searching for useful patterns in large amounts of data

- Classify
- Associate
- Cluster

Simple Representation of the data

Improved performance

Instance-based learning: just memorize all the possible sets of values

ML has some overlap with data mining, but DM involves more preprocessing, while ML uses processed data

- Open file -> data

- Choose
- Prism
- Test options
- Don't test on the training set
- Cross validation
- Folds == number of models
- N folds means that the training data is divided into N portions; the algorithm is trained with N-1 parts as the training data, while the remaining part is used to test the resulting model
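
The fold-splitting loop described above can be sketched like this (a minimal illustration; the function names `train_fn`/`score_fn` and the list-based data layout are assumptions, not Weka's API):

```python
# Minimal sketch of N-fold cross-validation.
# Each fold serves once as the test portion; the other N-1 folds train the model.
def cross_validate(data, n_folds, train_fn, score_fn):
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        test = data[i * fold_size:(i + 1) * fold_size]          # held-out fold
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # the rest
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / n_folds   # average score over the N models
```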

- Stratified

- In Weka, the class attribute is the last column -- it's the 'output'
- Tree pruning
- Reduces the tree size: the unpruned tree might be OVERFIT, so it may not perform well on new data

With real-world data, we likely won't be able to produce a perfect model anyway, so using less data to make the model gives us a more realistic version of what to expect

Confusion matrix: tells how many instances were correctly classified and how many were incorrectly classified. Ideally, the diagonal of the matrix should hold all the values (sort of like an identity matrix)
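
As a quick sketch (the matrix values below are made up for illustration), the diagonal gives the correct classifications, so accuracy is the diagonal sum over the grand total:

```python
# Accuracy from a confusion matrix: diagonal cells are correct
# classifications, off-diagonal cells are errors.
def accuracy(confusion):
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

cm = [[50, 3],    # actual class 0: 50 classified right, 3 wrong
      [2, 45]]    # actual class 1: 2 wrong, 45 right
```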

No rules... Just the majority value.

Choose a single attribute for classification; choose the one with the least errors (1R)

Divide up the numeric values

- table of values
- Laplace estimator adds n (usually 1) to everything

*If ? then class = value1*. Repeat until all the instances of the given class are covered by rules

- If an attribute is unknown simply omit its value
- Removing redundant columns improves naive bayes predictions
- Likelihood and probability are not the same
- Pr(yes) = L(yes) / (L(yes) + L(no))

- To get rid of possible zeros, just add 1 to each number of instances for each class...? This avoids multiplying by zero
- Removing an attribute -- decrease likelihood fraction by 1??
- Note how to calculate likelihoods and probabilities
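
The likelihood/probability calculation with the Laplace estimator might be sketched like this (the function name and the counts-tuple layout are illustrative assumptions): each per-attribute factor gets k (usually 1) added to its count, so no factor is ever zero, and the final probability is the normalized likelihood from the notes above.

```python
# Naive Bayes sketch with a Laplace estimator.
# counts_yes/counts_no: per attribute, (count_of_observed_value,
# total_instances_of_class, number_of_possible_values).
def class_probability(counts_yes, counts_no, prior_yes, prior_no, k=1):
    l_yes = prior_yes
    for count, total, n_values in counts_yes:
        l_yes *= (count + k) / (total + k * n_values)   # never zero
    l_no = prior_no
    for count, total, n_values in counts_no:
        l_no *= (count + k) / (total + k * n_values)
    return l_yes / (l_yes + l_no)   # Pr(yes) = L(yes) / (L(yes) + L(no))
```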

- Ground truth - the manual classification of documents

- See info() calculations
- When the formula contains a zero, the output is zero

- For each class
- While there are instances with the class
- Go through each attribute/value pair, find the one with the highest accuracy
- Repeat till the rule is perfect
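
The covering loop above might be sketched roughly like this (a simplified toy version, not Weka's Prism implementation; the function name and dict-based instance layout are illustrative): grow a rule by greedily adding the attribute/value test with the highest accuracy until the rule covers only the target class, then remove the covered instances and repeat.

```python
# PRISM-style covering sketch for one class.
# instances: list of dicts with a 'class' key plus attribute keys.
def prism_for_class(instances, target):
    remaining = list(instances)
    rules = []
    while any(x['class'] == target for x in remaining):
        conditions = {}
        covered = remaining
        # grow the rule until it is perfect (covers only the target class)
        while any(x['class'] != target for x in covered):
            best = None
            for x in covered:
                for attr, val in x.items():
                    if attr == 'class' or attr in conditions:
                        continue
                    subset = [y for y in covered if y.get(attr) == val]
                    acc = sum(y['class'] == target for y in subset) / len(subset)
                    if best is None or acc > best[0]:
                        best = (acc, attr, val)
            if best is None:
                break                      # no test left to add
            _, attr, val = best
            conditions[attr] = val
            covered = [y for y in covered if y.get(attr) == val]
        rules.append(conditions)
        # remove the instances this rule covers, then look for the next rule
        remaining = [y for y in remaining
                     if not all(y.get(a) == v for a, v in conditions.items())]
    return rules
```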

- Fill out the ? field
- If astigmatism is true and ? then class value

- Item = attribute-value pair
- Coverage - frequency of occurrence
- Use hashing for efficiency when checking if the previous item sets contain this item...?
- If ABC -> D
- If X then Y; LHS is antecedent, RHS is consequent. See p/t notes
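
Coverage and accuracy of an "if X then Y" rule can be sketched like this (the function name is illustrative; transactions are modelled as sets of items, i.e. attribute-value pairs): support counts how often antecedent and consequent occur together, confidence is the fraction of antecedent matches where the consequent also holds.

```python
# Support (coverage) and confidence of the rule "if X then Y".
def support_and_confidence(transactions, antecedent, consequent):
    covers_x = [t for t in transactions if antecedent <= t]       # X holds
    covers_both = [t for t in covers_x if consequent <= t]        # X and Y hold
    support = len(covers_both) / len(transactions)
    confidence = len(covers_both) / len(covers_x) if covers_x else 0.0
    return support, confidence
```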

- Straight line
- Linear regression
- Sum of squared errors
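
A straight-line fit that minimizes the sum of squared errors has a closed form in one dimension; a minimal sketch (function name illustrative):

```python
# Simple linear regression: fit y = w0 + w1*x by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x,y over variance of x
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x      # intercept passes through the means
    return w0, w1
```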

- Attributes are all numeric, output is a non-numeric class
- Can convert the class to numeric (1, 0)

- Membership function

Can put a plane between data-points in different classes

if $w \cdot a > 0$ predict class A else predict class B
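
That decision rule is a one-liner; a minimal sketch (assuming the bias is folded into w as a constant first component of a):

```python
# Linear decision rule: predict class A when w . a > 0, else class B.
def classify(w, a):
    dot = sum(wi * ai for wi, ai in zip(w, a))
    return 'A' if dot > 0 else 'B'
```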

From wikipedia: instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory. Instance-based learning is a kind of lazy learning.

Instance based learning involves storing all of the instances of the training set. Given a new instance, **a**, predict the class according to its nearest neighbour
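
A minimal 1-nearest-neighbour sketch of that prediction step (function name illustrative; uses the Euclidean distance mentioned below):

```python
import math

# 1-NN: store all training instances; predict the label of the closest one.
def nearest_neighbour(training, a):
    # training: list of (point, label); a: the new instance
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    point, label = min(training, key=lambda t: dist(t[0], a))
    return label
```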

Ways to find nearest neighbour:

- KD Tree
- Creating the tree from the instances
- Find the median point on the most widely distributed attribute, draw a line through it (this becomes a parent node)
- Switch directions and draw lines through the other direction on either sides of the first line
- Repeat until there is only at most 1 node inside each box. These nodes are leaf nodes

- Finding the nearest neighbour for a new instance
- A "star" represents a new instance - move it down the tree from the root toward where it is known to be on the attribute plane
- When you've reached the last leaf in the direction of the star location, draw a circle around the star that is of the radius of the distance to that last leaf
- Walk back up the tree and down adjacent branches as appropriate to decide if there is a closer neighbour based on the radius
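
The construction steps above (median split on the most widely spread attribute, then alternating axes, stopping at one point per box) can be sketched like this; it builds the tree only and leaves out the back-up search, and the nested-dict node layout is an illustrative choice:

```python
# KD-tree construction sketch: split at the median point of the most
# widely spread attribute first, then alternate axes; a region with at
# most one point becomes a leaf.
def build_kd(points, axis=None):
    if len(points) <= 1:
        return points[0] if points else None      # leaf (or empty region)
    if axis is None:
        # root: pick the axis with the widest spread
        spreads = [max(p[d] for p in points) - min(p[d] for p in points)
                   for d in range(len(points[0]))]
        axis = spreads.index(max(spreads))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                           # median point = parent node
    next_axis = (axis + 1) % len(points[0])       # switch direction below
    return {'point': pts[mid], 'axis': axis,
            'left': build_kd(pts[:mid], next_axis),
            'right': build_kd(pts[mid + 1:], next_axis)}
```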

- Euclidean distance