data mining

beeramlakshmi

Data mining is emerging as one of the key features of many homeland security initiatives. Often used as a means for detecting fraud, assessing risk, and retailing products, data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. In the context of homeland security, data mining is often viewed as a potential means to identify terrorist activities, such as money transfers and communications, and to identify and track individual terrorists themselves, for example through travel and immigration records.

While data mining represents a significant advance in the analytical tools currently available, there are limitations to its capability. One limitation is that although data mining can help reveal patterns and relationships, it does not tell the user the value or significance of these patterns; those determinations must be made by the user. A second limitation is that while data mining can identify connections between behaviors and/or variables, it does not necessarily identify a causal relationship. To be successful, data mining still requires skilled technical and analytical specialists who can structure the analysis and interpret the output that is created.

Data mining is becoming increasingly common in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. In the public sector, data mining applications were initially used to detect fraud and waste, but have since grown to serve purposes such as measuring and improving program performance.

As with other aspects of data mining, while technological capabilities are important, there are other implementation and oversight issues that can influence the success of a project. One issue is data quality, which refers to the accuracy and completeness of the data being analyzed.

A second issue is the interoperability of the data mining software and databases used by different agencies. A third issue is mission creep, or the use of data for purposes other than those for which the data were originally collected.

A fourth issue is privacy. Questions that may be considered include the degree to which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those for which they were originally designed, and possible application of the Privacy Act to these initiatives. It is anticipated that congressional oversight of data mining projects will grow as data mining efforts continue to evolve.

A View into the World of Data Mining
 Introduction
 What is Data Mining?
 Why Data Mining?
 Various Techniques of Data Mining


INTRODUCTION

 Data Abundance

 Services for Customer

 Importance of Knowledge Discovery

What is Data Mining?

“It is defined as the process of discovering patterns in data. The process must be automatic or semi-automatic. The patterns discovered must be meaningful, in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.”

In simpler terms…

Data mining is about extracting useful knowledge from large databases.

Put more simply, a data miner analyzes historical data and discovers patterns in it; these patterns help a human analyst (semi-automatic) or an automated decision-making tool predict an apt outcome for a future situation.

Why Data Mining?

 Automation has outperformed manual analysis by human analysts

 Increasing Competition in the market

 Accurate, fast, and unexpected predictions can have an enormous effect on the economy

Data Mining Application Areas

 Business Transactions
 E-commerce
 Scientific Study
 Health Care Study
 Web Study
 Crime Detection
 Loan Delinquency
 Banking


Data Mining Techniques


 Association Rule Mining
 Cluster Analysis
 Classification Rule Mining
 Frequent Episodes
 Deviation Detection/ Outlier Analysis
 Genetic Algorithms
 Rough Set Techniques
 Support Vector Machines

Association Rule Mining


 Basic terminology in Association DM
 Market Basket Analysis
 Algorithms
 Apriori Algorithm
 Partition Algorithm
 Pincer Search Algorithm
 Dynamic Itemset Counting Algorithm
 FP-Tree Growth Algorithm

Basic Association DM Terms


 Support
It is the percentage of records containing an item combination, relative to the total number of records.
 Confidence
It is the support of an item combination divided by the support of a condition. It measures how confident we can be that, given that a customer has purchased one product, he will also purchase another product.

 Association Rule
It is a rule of the form X => Y, expressing an association between X and Y: if X occurs, then Y will probably occur. It is accompanied by a confidence percentage for the rule.


 Itemset
It is a set of items in a transaction. A k-itemset is a set of k items.

 Frequent Itemset
It is an itemset whose support in a transaction database is at least the minimum support specified.

 Maximal Frequent Itemset
It is an itemset which is frequent and no superset of it is frequent.

 Border set
An itemset is a border set if it is not frequent but all its proper subsets are frequent.

 Downward Closure Property
Any subset of a frequent itemset is frequent.


 Upward Closure Property
Any superset of an infrequent itemset is infrequent.

Market Basket Analysis

It is an analysis conducted to determine which products customers purchase together. Knowing these purchasing patterns can be very useful to a retail store or company.

Once it is known that customers who buy product A are likely to also buy product B, the company can market A and B together, making buyers of one product target prospects for the other.

 For example, consider the following database:

Transaction   Products
1             Burger, Coke, Juice
2             Juice, Potato Chips
3             Coke, Burger
4             Juice, Ground Nuts
5             Coke, Ground Nuts



Looking at this, there is no obvious rule or relationship visible between items in the customers' buying patterns.

Read on to see how the relationships are mined…

Example explanation


          Burger   Juice   Coke   P Chips   G Nuts
Burger       2       1      2        0        0
Juice        1       3      1        1        1
Coke         2       1      3        0        1
P Chips      0       1      0        1        0
G Nuts       0       1      1        0        2

The table above shows how many times each item was purchased together with each other item. The main diagonal simply counts each item with itself, so we ignore it when looking for associations.
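
If you want to reproduce these counts yourself, the following minimal Python sketch (standard library only, using the five transactions listed above) rebuilds the co-occurrence table:

from itertools import combinations
from collections import Counter

transactions = [
    {"Burger", "Coke", "Juice"},
    {"Juice", "Potato Chips"},
    {"Coke", "Burger"},
    {"Juice", "Ground Nuts"},
    {"Coke", "Ground Nuts"},
]

pair_counts = Counter()
for t in transactions:
    for a, b in combinations(sorted(t), 2):
        pair_counts[(a, b)] += 1          # off-diagonal cells: items bought together

item_counts = Counter(i for t in transactions for i in t)  # diagonal: item with itself

print(pair_counts[("Burger", "Coke")])    # 2, matching the Burger/Coke cell
print(item_counts["Juice"])               # 3, matching the Juice diagonal cell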

A view of the terms in this example:
Association Rule: “If a customer purchases Coke, then he will probably purchase a burger” is an association rule associating Coke and Burger.


Support: This rule has a support of 40%.
records in which both burger and coke occur together = 2
total number of records = 5
support = 2/5 * 100 = 40%


Confidence: The above rule has a confidence of about 66%.
support for the combination (Coke + Burger) is 40%
support for the condition (Coke) is 60%
confidence = 40/60 * 100 ≈ 66%


Itemset: {Coke, Burger} is a 2-itemset containing 2 items. {Coke} is a 1-itemset.

Frequent Itemset: If the minimum support is 2, then {Coke}, {Juice}, {Burger}, {G Nuts}, and {Coke, Burger} are frequent itemsets.
Maximal Frequent Itemset: {Coke, Burger} is a maximal frequent itemset: it is frequent, and no superset of it (for example {Coke, Burger, Juice}) is frequent.
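
The same numbers can be recomputed with a small, self-contained Python sketch. The transactions and the minimum support count of 2 come from the example above; the helper function support_count is illustrative:

from itertools import combinations

transactions = [
    {"Burger", "Coke", "Juice"},
    {"Juice", "Potato Chips"},
    {"Coke", "Burger"},
    {"Juice", "Ground Nuts"},
    {"Coke", "Ground Nuts"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Rule {Coke} => {Burger}
supp = support_count({"Coke", "Burger"}) / len(transactions)        # 2/5 = 40%
conf = support_count({"Coke", "Burger"}) / support_count({"Coke"})  # 2/3, about 66%
print(f"support = {supp:.0%}, confidence = {conf:.1%}")

# Frequent itemsets at a minimum support count of 2
items = sorted({i for t in transactions for i in t})
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support_count(set(c)) >= 2]

# Maximal frequent itemsets: frequent itemsets with no frequent proper superset
maximal = [f for f in frequent if not any(f < g for g in frequent)]
print(frequent)  # {'Burger'}, {'Coke'}, {'Ground Nuts'}, {'Juice'}, {'Burger', 'Coke'}
print(maximal)   # {'Ground Nuts'}, {'Juice'}, {'Burger', 'Coke'}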
Apriori Algorithm
This is a level-wise algorithm developed by R. Agrawal and R. Srikant.
A set of frequent 1-itemsets is found first; it is used to generate candidate 2-itemsets, the frequent 2-itemsets are used to generate candidate 3-itemsets, and so on.
Each pass has two parts:
- Join step (candidate generation)
- Prune step
Input: database D of transactions, minimum support threshold σ
Output: L, the set of all frequent itemsets in D

Initialize: k = 1, C1 = all 1-itemsets
Read database D to count the support of C1 and determine L1
L1 = {frequent 1-itemsets}
k = 2                                  // k is the pass number
While (Lk-1 ≠ Ø) do
Begin
    Ck = Ø
    // Join step (candidate generation)
    For all itemsets l1 ∈ Lk-1 do
        For all itemsets l2 ∈ Lk-1 do
            If (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
            Then c = {l1[1], l1[2], …, l1[k-1], l2[k-1]}
                 Ck = Ck ∪ {c}
    // Prune step
    For all c ∈ Ck do
        For all (k-1)-subsets d of c do
            If d ∉ Lk-1
            Then Ck = Ck \ {c}
    // Counting pass
    For all transactions t ∈ D do
        Increment the count of every candidate in Ck that is contained in t
    Lk = all candidates in Ck with minimum support
    k = k + 1
End

Answer: L = ∪k Lk
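
To make the level-wise loop concrete, here is a runnable Python sketch of the same idea. It is an illustrative implementation, not the authors' original code: the join step simply unions pairs of frequent (k-1)-itemsets (a straightforward variant of the prefix-based join written above), and min_support is taken as an absolute count.

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(candidates):
        # one pass over the database, counting each candidate's support
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets
    current = {c: s for c, s in count(frozenset([i]) for i in items).items()
               if s >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)                     # the frequent (k-1)-itemsets
        # Join step (simple variant): union pairs of (k-1)-itemsets into k-itemsets
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Counting pass, then keep candidates meeting the threshold
        current = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"Burger", "Coke", "Juice"},
    {"Juice", "Potato Chips"},
    {"Coke", "Burger"},
    {"Juice", "Ground Nuts"},
    {"Coke", "Ground Nuts"},
]
print(apriori(transactions, min_support=2))
# e.g. frozenset({'Coke'}): 3, ..., frozenset({'Burger', 'Coke'}): 2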
Pincer Search Algorithm

This is one of the fastest Apriori-based algorithms that implement horizontal mining. It was developed by Dao-I Lin and Zvi M. Kedem.
It uses the Apriori method but makes it more efficient by using the concept of the maximal frequent set (MFS), combining a bottom-up search (for frequent itemset generation) with a top-down search (for the MFS).

L0 = Ø; k = 1; C1 = {{i} | i ∈ I}; S0 = Ø
MFCS = {the set of all items}; MFS = Ø
Do until Ck = Ø and Sk-1 = Ø
    Read the database and count support for Ck and for MFCS
    MFS = MFS ∪ {frequent itemsets in MFCS}
    Sk = {infrequent itemsets in Ck}
    If Sk ≠ Ø
        For all itemsets s ∈ Sk
            For all itemsets m ∈ MFCS
                If s is a subset of m
                    MFCS = MFCS \ {m}
                    For all items e ∈ s
                        If m \ {e} is not a subset of any itemset in MFCS
                            MFCS = MFCS ∪ {m \ {e}}
    For all itemsets c in Lk
        If c is a subset of any itemset in the current MFS
            Delete c from Lk
    Generate candidates Ck+1 from Ck
    If any frequent itemset in Ck was removed from Lk
        Then for all itemsets l ∈ Lk
            For all itemsets m ∈ MFS
                If the first k-1 items of l are also in m
                    For i from k+1 to |m|
                        Ck+1 = Ck+1 ∪ {{l.item1, …, l.itemk, m.itemi}}
    For all itemsets c in Ck+1
        If c is not a subset of any itemset in the current MFCS
            Delete c from Ck+1
    k = k + 1
Answer: MFS
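
The step that makes Pincer Search top-down is the maintenance of the MFCS whenever an infrequent itemset is discovered (the inner loop over Sk above). Below is a minimal Python sketch of just that update step; the function and variable names are illustrative and not taken from the original paper.

def update_mfcs(mfcs, infrequent):
    """Split every MFCS member that contains the infrequent itemset.
    mfcs: set of frozensets; infrequent: frozenset found infrequent this pass."""
    updated = set(mfcs)
    for m in mfcs:
        if infrequent <= m:                  # m can no longer be maximal frequent
            updated.discard(m)
            for e in infrequent:
                candidate = m - {e}          # drop one item of the infrequent set
                # add it only if it is not already covered by another MFCS member
                if not any(candidate <= other for other in updated):
                    updated.add(candidate)
    return updated

# Tiny illustration: with items {A, B, C, D}, if {A, B} turns out infrequent,
# the initial MFCS {ABCD} splits into {ACD} and {BCD}.
print(update_mfcs({frozenset("ABCD")}, frozenset("AB")))

Each split removes the infrequent combination from every MFCS candidate while keeping the candidates as large as possible, which is what lets the top-down search reach maximal frequent sets early.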



 
Hey Lakshmi, I think you have already explained the concept of data mining in detail, and I am really impressed by your effort. I am also sharing a document where you will find some more content which may be useful for others.
 
