Data Preprocessing
Lecture 3/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia
Objectives
Motivation: Why preprocess the Data?
Data Preprocessing Techniques
Data Cleaning
Data Integration and Transformation
Data Reduction
Why Preprocess the Data?
Quality decisions must be based on quality data
Data could be incomplete, noisy, and inconsistent
Data warehouse needs consistent integration of quality data
Incomplete
Lacking attribute values or certain attributes of interest
Containing only aggregate data
Causes:
Not considered important at the time of entry
Equipment malfunctions
Data not entered due to misunderstanding
Inconsistent with other recorded data and thus deleted
Why Preprocess the Data? (2)
Noisy (having incorrect attribute values)
Containing errors, or outlier values that deviate from the expected
Causes:
Data collection instruments used may be faulty
Human or computer errors occurring at data entry
Errors in data transmission
Inconsistent
Containing discrepancies in the department codes used to categorize items
Why Preprocess the Data? (3)
“Clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Some examples of inconsistencies:
customer_id vs cust_id
Bill vs William vs B.
Some attributes may be inferred from others. Data cleaning includes the detection and removal of redundancies that may have resulted.
Data Preprocessing Techniques
Data Cleaning
To remove noise and correct inconsistencies in the data
Data Integration
Merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube
Data Transformation
Normalization (to improve the accuracy and efficiency of mining algorithms involving distance measurements, e.g., neural networks, nearest-neighbor)
Data Discretization
Data Reduction
Data Preprocessing Techniques (2)
Data Reduction
Warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Strategies for Data Reduction
Data aggregation (e.g., building a data cube)
Dimension reduction (e.g. removing irrelevant attributes through correlation analysis)
Data compression (e.g. using encoding schemes such as minimum length
encoding or wavelets)
Numerosity reduction
Generalization
Data Preprocessing Techniques (3)
Data Cleaning – Missing Values
1. Ignore the tuple
Usually done when the class label is missing (in classification tasks)
Not effective when the missing values in attributes are spread across different tuples
2. Fill in the missing value manually: tedious, and often infeasible
3. Use a global constant to fill in the missing value
‘unknown’, a new class?
Mining programs may mistakenly think that they form an interesting concept, since they all have a value in common
not recommended
4. Use the attribute mean to fill in the missing value (e.g., the average income)
Data Cleaning – Missing Values (2)
5. Use the attribute mean for all samples belonging to the same class as the given tuple (e.g., the same credit risk category)
6. Use the most probable value to fill in the missing value
Determined with regression, inference-based tools such as Bayesian formalism, or decision tree induction
Methods 3 to 6 bias the data; the filled-in value may not be correct. However, method 6 is a popular strategy, since:
It uses the most information from the present data to predict missing values
There is a greater chance that the relationships between income and the other attributes are preserved.
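Methods 4 and 5 can be sketched in a few lines of Python. This is an illustration, not from the lecture; the function names, the `income`/`risk` fields, and the data values are invented for demonstration.

```python
def fill_with_mean(rows, attr):
    """Method 4: replace None with the attribute mean over all tuples."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean
    return rows

def fill_with_class_mean(rows, attr, cls):
    """Method 5: replace None with the mean among tuples of the same class."""
    for r in rows:
        if r[attr] is None:
            peers = [p[attr] for p in rows
                     if p[cls] == r[cls] and p[attr] is not None]
            r[attr] = sum(peers) / len(peers)
    return rows

# Invented example data: one customer has a missing income.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 90000, "risk": "high"},
]
fill_with_class_mean(customers, "income", "risk")
print(customers[2]["income"])  # mean of the two known "low" incomes: 40000.0
```

Note how method 5 ignores the high-risk customer's income when imputing, which is why it tends to preserve the relationship between income and the class attribute better than method 4.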
Data Cleaning – Noise and Incorrect (Inconsistent) Data
Noise is a random error or variance in a measured variable.
How can we smooth out the data to remove the noise?
Binning Method
Smooth a sorted data value by consulting its “neighborhood”, that
is, the values around it.
The sorted values are distributed into a number of buckets, or bins.
Because binning methods consult the neighborhood of values, they
perform local smoothing.
Binning is also used as a discretization technique (will be discussed later)
Data Cleaning – Noisy Data
Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equidepth) bins of depth 4, each bin containing four values:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: the larger the width, the greater the effect
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
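The binning example above can be sketched in Python. This is a minimal illustration with invented function names; ties between the two bin boundaries are resolved toward the lower boundary here, which is an assumption rather than something the lecture specifies.

```python
def make_bins(sorted_values, depth):
    """Equal-frequency (equidepth) partitioning: consecutive runs of `depth` values."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = make_bins(prices, 4)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```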
Data Cleaning – Noisy Data
Clustering
Similar values are organized into groups, or clusters.
Values that fall outside of the set of clusters may be considered outliers.
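A minimal sketch of this idea, assuming the cluster centers and the distance threshold are already known (a real pipeline would learn them, e.g., with k-means); the function name, data, and numbers are invented for illustration.

```python
def find_outliers(values, centers, threshold):
    """Flag values whose distance to the nearest cluster center
    exceeds the given threshold."""
    outliers = []
    for v in values:
        nearest = min(centers, key=lambda c: abs(v - c))
        if abs(v - nearest) > threshold:
            outliers.append(v)
    return outliers

readings = [10, 11, 12, 50, 51, 52, 300]
print(find_outliers(readings, centers=[11, 51], threshold=5))  # [300]
```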
Data Cleaning – Noisy Data
Regression
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression extends this to more than two variables, fitting the data to a multidimensional surface.
(Figure: a regression line y = x + 1 fitted through scattered points, with X1 on the x-axis predicting the smoothed value Y1' for the observed Y1.)
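Smoothing by linear regression can be sketched as follows: fit the least-squares line y = a·x + b, then replace each observed y with its fitted value. The data points are invented to lie roughly on the slide's example line y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [2.1, 2.9, 4.2, 4.8]          # noisy samples near y = x + 1
a, b = fit_line(xs, ys)
smoothed = [a * x + b for x in xs]  # each y replaced by its fitted value
```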
Data Smoothing vs Data Reduction
Many methods for data smoothing are also methods for data reduction involving discretization.
for data reduction involving discretization.
Examples
Binning techniques reduce the number of distinct values per attribute. This is useful for decision tree induction, which repeatedly makes value comparisons on sorted data.
Concept hierarchies are also a form of data discretization that can be used for data smoothing.
Mapping real price values into inexpensive, moderately_priced, expensive
Reducing the number of data values to be handled by the
mining process.
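The price mapping above can be sketched as a simple concept-hierarchy function. The cutoff values and the function name are invented for illustration; only the three category labels come from the slide.

```python
def price_concept(price, cutoffs=(50, 200)):
    """Map a raw price to a higher-level concept.
    Cutoffs are hypothetical: below 50 -> inexpensive,
    50..199 -> moderately_priced, 200 and up -> expensive."""
    if price < cutoffs[0]:
        return "inexpensive"
    if price < cutoffs[1]:
        return "moderately_priced"
    return "expensive"

print([price_concept(p) for p in [12, 75, 480]])
# ['inexpensive', 'moderately_priced', 'expensive']
```

After this mapping the mining process handles only three distinct values instead of the full range of raw prices, which is exactly the reduction the slide describes.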
Data Cleaning - Inconsistent Data
May be corrected manually.
Errors made at data entry may be corrected by performing a paper trace, coupled with routines designed to help correct the inconsistent use of codes.
Tools can also be used to detect violations of known data constraints.
Data Integration and Transformation
Data Integration: combines data from multiple data stores
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different
sources are different
possible reasons: different representations, different scales (feet vs metre)
Data Transformation
Data are transformed into forms appropriate for mining
Methods:
Smoothing: binning, clustering, and regression
Aggregation: summarization, data cube construction
Generalization: low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies
Street → city or country
Numeric attribute age → young, middle-aged, senior
Normalization: attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0
Useful for classification involving neural networks, or distance
measurements such as nearest neighbor classification and clustering
Data Transformation (2)
Normalization: scaled to fall within a small, specified range
min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
z-score normalization:
v' = (v − mean_A) / stand_dev_A
normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
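The three normalization formulas can be sketched directly in Python. The income figures below (values in [12000, 98000] mapped to [0.0, 1.0]) are invented for illustration; the formulas themselves follow the definitions above.

```python
import math

def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [mn, mx] onto [new_min, new_max]."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mean, stand_dev):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / stand_dev

def decimal_scaling(v, max_abs):
    """Decimal scaling: divide by the smallest power of ten that makes
    every normalized value's magnitude strictly less than 1."""
    j = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))   # ≈ 0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(-986, 986))     # -0.986 (j = 3)
```

Min-max normalization is sensitive to the observed min and max, while z-score normalization is useful when those bounds are unknown or when outliers dominate them.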