Classification and Prediction

Lecture 5/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia
Objectives
Introduction
What is Classification?
Classification vs Prediction
Supervised and Unsupervised Learning
Data Preparation
Classification Accuracy
ID3 Algorithm
Information Gain
Bayesian Classification
Predictive Modelling
Introduction
Databases are rich with hidden information that can be used
for making intelligent business decisions.
Classification and prediction can be used to extract models
describing important data classes or to predict future data
trends.
Classification predicts categorical labels. Ex: categorize bank
loan applications as safe or risky.
Prediction models continuous-valued functions. Ex: predict the
expenditures of potential customers on computer equipment
given their income and occupation.
Typical Applications:
Credit approval, target marketing,
Medical diagnosis, treatment effectiveness analysis
What is Classification? – A two-step process
Model construction:
Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label.
Data tuples are also referred to as samples, examples, or objects.
All tuples used for model construction are collectively called the training set.
Since the class label of each training sample is provided, this is
known as supervised learning. In clustering (unsupervised learning),
the class label of each training sample is not known, and the
number or set of classes to be learned may not be known in
advance.
The model is represented in the following forms:
Classification rules (IF-THEN statements), decision trees, mathematical formulae
What is Classification? – A two-step process (2)
The model is used for classifying future or
unknown objects.
First, the predictive accuracy of the model is estimated
The known label of each test sample is compared with the classified result
from the model.
Accuracy rate is the percentage of test set samples that are correctly
classified by the model.
The test set is independent of the training set; otherwise over-fitting (it may
have incorporated some particular anomalies of the training data that
are not present in the overall sample population) will occur.
If the accuracy of the model is considered acceptable, the
model can be used to classify future objects for which the
class label is not known (unknown, previously unseen data).
Classification Process (1)
Training data are fed to a classification algorithm, which produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
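As a side illustration (not from the slides), the construction step can be sketched in Python; the integer encoding of rank and the use of scikit-learn are my own assumptions:

```python
# A minimal sketch of step 1 (model construction) on the tenure table above.
from sklearn.tree import DecisionTreeClassifier

# rank encoded as: 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]  # (rank, years)
y_train = ['no', 'yes', 'yes', 'yes', 'no', 'no']           # tenured label

model = DecisionTreeClassifier().fit(X_train, y_train)      # learn the model
```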
Classification Process (2)
The classifier is first applied to testing data to estimate its accuracy, and then to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
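Continuing the same hedged sketch, step 2 estimates accuracy on the test tuples and then classifies the unseen tuple (Jeff, Professor, 4):

```python
# Step 2: estimate accuracy on the independent test set, then classify Jeff.
X_test = [[0, 2], [1, 7], [2, 5], [0, 7]]   # Tom, Merlisa, George, Joseph
y_test = ['no', 'no', 'yes', 'yes']

print(model.score(X_test, y_test))          # accuracy rate on the test set
print(model.predict([[2, 4]]))              # Jeff: Professor (2), 4 years
```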
What is Prediction?
Prediction is similar to classification
First, construct a model.
Second, use the model to predict future or unknown objects.
The major method for prediction is regression:
Linear and multiple regression
Non-linear regression
Prediction is different from classification
Classification refers to predicting a categorical class label.
Prediction refers to predicting a continuous value.
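For concreteness, a minimal regression sketch; the data values and library choice are my own assumptions, not from the slides:

```python
# Predicting a continuous value (expenditure) from income via linear regression.
from sklearn.linear_model import LinearRegression

incomes = [[30_000], [50_000], [70_000], [90_000]]  # hypothetical inputs
spend = [500, 900, 1_300, 1_700]                    # hypothetical targets

reg = LinearRegression().fit(incomes, spend)
print(reg.predict([[60_000]]))                      # -> about 1100
```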
Classification vs Prediction
Sending out promotional literature to every new
customer in the database can be quite costly. A more cost-efficient
method would be to target only those new
customers who are likely to purchase a new computer:
this is classification.
Predicting the number of major purchases that a customer
will make during a fiscal year: this is prediction.
Supervised vs Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations.
New data are classified based on the training set.
Unsupervised learning (clustering)
We are given a set of measurements, observations, etc., with
the aim of establishing the existence of classes or clusters in
the data.
There are no training data, or the "training data" are not
accompanied by class labels.
Issues – Data Preparation
Data preprocessing can be used to help improve the
accuracy, efficiency, and scalability of the classification or
prediction process.
Data Cleaning
Remove/reduce noise and treat missing values
Relevance Analysis
Many of the attributes in the data may be irrelevant to the
classification or prediction task. Ex: data recording the day of
the week on which a bank loan application was filed is unlikely
to be relevant to the success of the application.
Other attributes may be redundant.
This step is known as feature selection.
Issues – Data Preparation (2)
Data Transformation
Data can be generalized to higher-level concepts.
Useful for continuous-valued attributes.
Income can be generalized to low, medium, high.
Street can be generalized to city.
Generalization compresses the original training data; fewer
input/output operations may be involved during learning.
When using neural networks (or other methods involving
distance measurements), data may also be normalized.
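A small sketch of min-max normalization, one common way to normalize such data (my own illustration; the function name is hypothetical):

```python
# Rescale a continuous attribute into [0, 1] so that no attribute dominates
# distance computations (useful for neural networks and distance-based methods).
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([30_000, 50_000, 70_000, 90_000]))
# [0.0, 0.333..., 0.666..., 1.0]
```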
Comparing Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in large databases (not memory-resident data)
Interpretability:
the level of understanding and insight provided by the model
Goodness of rules
decision tree size
the compactness of classification rules
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing
use two independent data sets, e.g., training set (2/3) and test set (1/3)
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as test
data: k-fold cross-validation
for data sets of moderate size
Bootstrapping (leave-one-out)
for small data sets
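A minimal k-fold cross-validation sketch; scikit-learn and the iris data are my own choices for illustration:

```python
# Estimate accuracy by 10-fold cross-validation: each fold in turn serves as
# the test set while the other nine folds are used for training.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean())   # average accuracy over the 10 folds
```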
What is a decision tree?
A decision tree is a flow-chart-like tree structure.
Internal node denotes a test on an attribute
Branch represents an outcome of the test
All tuples in a branch have the same value for the tested
attribute.
Leaf node represents class label or class label distribution.
To classify an unknown sample, the attribute values of the
sample are tested against the decision tree. A path is traced
from the root to a leaf node that holds the class prediction
for that sample.
Decision trees can easily be converted to classification rules.
Training Dataset
An example from Quinlan's ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree
Outlook
  sunny    → Humidity
               high   → N
               normal → P
  overcast → P
  rain     → Windy
               true  → N
               false → P
Decision-Tree Classification Methods
The basic top-down decision tree generation approach
usually consists of two phases:
Tree construction
At start, all the training examples are at the root.
Partition examples recursively based on selected attributes.
Tree pruning
Aiming at removing tree branches that may lead to errors
when classifying test data (training data may contain noise,
outliers, …).
ID3 Algorithm
All attributes are categorical
Create a node N;
if samples are all of the same class C, then
  return N as a leaf node labeled with C;
if attribute-list is empty, then
  return N as a leaf node labeled with the most common class;
select the split-attribute with the highest information gain;
label N with the split-attribute;
for each value A_i of the split-attribute, grow a branch from node N;
  let S_i be the branch in which all tuples have the value A_i for the split-attribute;
  if S_i is empty, then
    attach a leaf labeled with the most common class;
  else recursively run the algorithm at node S_i;
until all branches reach leaf nodes.
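The pseudocode can be rendered as a compact Python sketch; this is my own rendering under the slide's assumptions (all attributes categorical, class column named 'Class'), not the slides' implementation:

```python
# A sketch of ID3 following the pseudocode above. Rows are dicts such as
# {'Outlook': 'sunny', ..., 'Class': 'N'}; the row schema is an assumption.
import math
from collections import Counter

def entropy(labels):
    # I(p, n) generalized to any multiset of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def id3(rows, attributes):
    labels = [r['Class'] for r in rows]
    if len(set(labels)) == 1:                  # all samples in one class C
        return labels[0]
    if not attributes:                         # attribute-list is empty
        return Counter(labels).most_common(1)[0][0]

    def expected_info(a):                      # E(A), defined on later slides
        return sum(len(part) / len(rows) * entropy([r['Class'] for r in part])
                   for v in {r[a] for r in rows}
                   for part in [[r for r in rows if r[a] == v]])

    best = min(attributes, key=expected_info)  # min E(A) == max gain(A)
    tree = {best: {}}
    for v in {r[best] for r in rows}:          # grow a branch per value A_i
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best])
    return tree
```

Run on the 14 training tuples above with attributes ['Outlook', 'Temperature', 'Humidity', 'Windy'], this reproduces the sample tree with Outlook at the root.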
Choosing Split Attribute – Information Gain (ID3/C4.5) (1)
Assume all attributes to be categorical (discrete-valued).
Continuous-valued attributes must be discretized.
Used to select the test attribute at each node in the tree.
Also called measure of the goodness of split.
The attribute with the highest information gain is chosen
as the test attribute for the current node.
Information Gain (ID3/C4.5) (2)
Assume that there are two classes, P and N.
Let the set of examples S contain p elements of class P and n
elements of class N.
The amount of information needed to decide if an arbitrary
example in S belongs to P or N is defined as

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

Assume that using attribute A as the root in the tree will partition
S into sets {S_1, S_2, …, S_v}.
If S_i contains p_i examples of P and n_i examples of N, the information
needed to classify objects in all subtrees S_i is

E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n} I(p_i, n_i)
Information Gain (ID3/C4.5) (3)
The attribute A is selected such that the information gain

gain(A) = I(p, n) - E(A)

is maximal, that is, E(A) is minimal, since I(p, n) is the same for all
attributes at a node.
In the given sample data, attribute outlook is chosen to split at
the root:
gain(outlook) = 0.246
gain(temperature) = 0.029
gain(humidity) = 0.151
gain(windy) = 0.048
Information Gain (ID3/C4.5) (4)
Examples:
See Table 7.1.
Class label: buys_computer. Two values: YES, NO.
m = 2. C_1 corresponds to yes, C_2 corresponds to no.
9 samples of class yes and 5 samples of class no.
Compute the expected information needed to classify a given
sample:

I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
Information Gain (ID3/C4.5) (5)
Next, compute the entropy of each attribute. Let's start with the
attribute age.

For age = "<=30":   s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
For age = "31…40":  s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
For age = ">40":    s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971

Using equation (7.2), the expected information needed to classify
a given sample if the samples are partitioned by age is

E(age) = \frac{5}{14}I(s_{11}, s_{21}) + \frac{4}{14}I(s_{12}, s_{22}) + \frac{5}{14}I(s_{13}, s_{23}) = 0.694

Hence, the gain in information from such a partitioning:

Gain(age) = I(s_1, s_2) - E(age) = 0.246

Similarly, we can compute Gain(income) = 0.029, Gain(student) =
0.151, Gain(credit_rating) = 0.048.
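These numbers are easy to verify; a quick numeric check (my own, not part of the slides):

```python
# Recompute I(9, 5), E(age), and Gain(age) for the buys_computer example.
import math

def info(p, n):
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

i_total = info(9, 5)                                              # 0.940
e_age = (5/14)*info(2, 3) + (4/14)*info(4, 0) + (5/14)*info(3, 2)
print(round(i_total, 3), round(e_age, 3))                         # 0.94 0.694
print(round(i_total - e_age, 3))   # 0.247; the slides' 0.246 comes from
                                   # subtracting the rounded 0.940 - 0.694
```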
How to use a tree?
Directly
test the attribute values of the unknown sample against the tree;
a path is traced from the root to a leaf, which holds the label.
Indirectly
the decision tree is converted to classification rules;
one rule is created for each path from the root to a leaf;
IF-THEN rules are easier for humans to understand.
Example:
IF age = “…
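As an aside (my own illustration, not the slides' example), scikit-learn's export_text prints a fitted tree in exactly this IF-THEN-like form:

```python
# Convert a fitted decision tree into readable rules, one path per leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))
```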