Description
Protecting Business Intelligence And Customer Privacy While Outsourcing Data Mining Tasks
Knowl Inf Syst
DOI 10.1007/s10115-007-0113-3
REGULAR PAPER
Protecting business intelligence and customer privacy
while outsourcing data mining tasks
Ling Qiu · Yingjiu Li · Xintao Wu
Received: 7 February 2007 / Revised: 4 July 2007 / Accepted: 20 September 2007
© Springer-Verlag London Limited 2007
Abstract Nowadays data mining plays an important role in decision making. Since many
organizations do not possess the in-house expertise of data mining, it is bene?cial to outsource
data mining tasks to external service providers. However, most organizations hesitate to do
so due to the concern of loss of business intelligence and customer privacy. In this paper,
we present a Bloom ?lter based solution to enable organizations to outsource their tasks of
mining association rules, at the same time, protect their business intelligence and customer
privacy. Our approach can achieve high precision in data mining by trading-off the storage
requirement.
1 Introduction
1.1 Background and motivation
The pervasive impact of business computing has made information technology (IT) an indis-
pensable part of daily operations and the key to success for organizations. Many organizations
This research was supported by the USA National Science Foundation Grants CCR-0310974 and
IIS-0546027.
L. Qiu (B)
School of Mathematics, Physics and Information Technology, James Cook University,
Townsville, QLD 4811, Australia
e-mail: [email protected]
Y. Li
School of Information Systems, Singapore Management University, Singapore 178902, Singapore
e-mail: [email protected]
X. Wu
Department of Software and Information Systems, University of North Carolina at Charlotte,
Charlotte, NC 28223, USA
e-mail: [email protected]
1 3
L. Qiu et al.
have accumulated large amount of data from various channels in today’s digitalized age. It
is important to make these data available for decision making. Data mining, as one of the IT
services needed by organizations, provides such a technique for the exploration and analysis
of these raw data so as to reveal hidden information and knowledge; it has also been realized
as an important way for discovering knowledge from the data and converting “data rich"
to “knowledge rich" so as to assist strategic decision making. The bene?ts of using data
mining for decision making have been demonstrated in various industries and governmental
sectors [7], e.g., banking, insurance, direct-mail marketing, telecommunications, retails, and
health care [32]. Among all of the available data mining methods, the discovery of associa-
tions between business events or transactions is one of the most commonly used techniques.
Association rule mining has been an important application in decision support and marketing
strategy [25] for an organization.
Let us consider a typical application scenario as follows. In an organization (e.g., an enter-
prise or a governmental sector), there are several divisions including an IT division which
provides ITservices for the whole organization. Afunctional division may have to delegate or
outsource its data mining tasks to the IT division due to the lack of IT expertise and powerful
computing infrastructure which are usually centrally managed by the IT division.
This scenario can be extended to a more general circumstance in which all divisions are
individually independent organizations (or companies). This is because in today’s fast-paced
business environment, it is impossible for any single organization to understand, develop,
and implement every IT needed. By outsourcing, an organization can obtain speci?c human
resources (e.g., skilled programming personnel) and technological resources (e.g., more pow-
erful computing infrastructure) for its needs of IT services (e.g., data analysis) with lower
costs [12]. It can also be extended to online scenarios, e.g., a distributed computing environ-
ment comprising of a center server and some edge servers.
The practice of outsourcing data mining tasks involves extensive collaboration (e.g.,
exchange or share of data) across different organizations. Either the raw data or the revealed
information after analysis contains the business intelligence (BI) and customer privacy of an
organization. There is a security concern of potential risk of exposing private information
in outsourcing activities [28]. Without proper security policy and technology, these priva-
cies could be very vulnerable to security breaches. Therefore, to protect BI and customer
privacy, it is urgent and critical to provide solutions from the perspectives of both legality or
regulation and technology [27]. In this paper, we focus on the technology-based solutions.
When outsourcing mining tasks,
1
we should protect the following three elements which may
expose BI and customer privacy: (1) the source data which contain all transactions and items;
(2) the mining requests which are itemsets of interests; and (3) the mining results which are
frequent itemsets and association rules.
There are various methods proposed to preserve privacy in data mining; but they cannot
protect all three elements simultaneously. This is because with these methods, when a ?rst
party
2
outsources its mining tasks to a third party,
3
it has to provide the source database
(which might be someway encrypted) together with some additional information (e.g., plain
text of mining requests) without which the mining tasks may not be carried out. Given this
situation, the existing methods are unable to ef?ciently prevent the exposure of private infor-
1
Without further speci?cation, we always refer to association rule mining tasks.
2
This is the party that outsouces its data mining tasks. It may be a functional division of an organization or
the center server in a distributed environment with client–server architecture.
3
This is the party that is authorized by the ?rst party to undertake the outsourced data mining tasks. It may be
the ITdivision of an organization or an edge server in a distributed environment with client–server architecture.
1 3
Outsourcing data mining tasks
mation to the third party, or unable to prevent the third party from (either intentionally or
unintentionally) deciphering (and possibly spreading) further information from the mining
results (which would be sent back to the ?rst party) with the additional information.
1.2 Overview of our paper
1.2.1 Our solutions
Given the situation discussed in the previous paragraphs, to protect BI and customer privacy
when outsourcing data mining tasks, we should control the direct access to the original data.
In other words, the third party should not be allowed to access the original transactional data
while performing mining tasks, and should not be able to interpret the mining results.
In this paper, we present a Bloom ?lter based approach which provides an algorithm
for privacy preserving association rule mining with computation ef?ciency and predictable
(controllable) analysis precision. The Bloom ?lter [10] is a stream (or a vector) of binary
bits. It is a computationally ef?cient and irreversible coding scheme that can represent a
set of objects while preserving privacy of the objects. With our approach, ?rstly the source
data are converted to Bloom ?lter representation and handed over to a third party together
with mining algorithms. Then the ?rst party sends its mining requests to the third party.
Mining requests are actually candidates of frequent itemsets which are also represented by
Bloom ?lters. Lastly, the third party runs the mining algorithms with source data and mining
requests, and comes out the mining results which are frequent itemsets or association rules
represented by Bloom ?lters. In the above mining process, what the ?rst party exposes to the
third party does not violate privacy [22]; that is, the third party would not be able to distill
down private information from Bloom ?lters. Therefore, all three elements aforementioned
are fully protected by Bloom ?lters.
The goal of protecting BI and customer privacy during outsoucing can be achieved by
Bloom ?lter because it satis?es simultaneously the following three conditions. First, transac-
tions containing different numbers of items are mapped to Bloom?lters with the same length.
This prevents an adversary from deciphering the compositions of transactions by analyzing
the lengths of transactions. Second, Bloom ?lters support membership queries. This allows
an authorized third party to carry out data mining tasks with only Bloom ?lters (i.e., Bloom
?lters of either transactions or candidates of frequent itemsets). Third, without knowing all
possible individual items in the transactions, it is dif?cult to identify what items are included
in the Bloom ?lter of a transaction by counting the numbers of 1’s and 0’s. This is because
the probability of a bit in a Bloom ?lter being 1 or 0 is 0.5 given that the parameters of the
Bloom ?lter are optimally chosen (see Sect. 3 for details).
1.2.2 Main contributions
A condensed (5-page) version of our study has been published in [34]. In this version, we
present in details the mathematical analysis model for problem formulation and error rate
estimation, the analysis of the mining algorithm, and the results and analysis of experiments.
In brief, our main contributions include the following:
• We propose an approach allowing to outsource data mining tasks while protecting BI and
customer privacy. It prevents third parties from interpreting customer privacy from the
source data and from deciphering BI from the mining results.
• We present solid theoretical analysis for our approach. Our analysis shows that a tradeoff
can be made between storage requirement and mining precision.
1 3
L. Qiu et al.
• We test the proposed approach rigorously on both real and synthetic datasets. Our exper-
imental results show that our approach is both effective and ?exible for practical usage.
1.2.3 Organization of the paper
The remaining sections are organized as follows. We review the related work in Sect. 2.
We revisit the basics of Bloom ?lters and formulate our data mining problem in Sect. 3. In
Sect. 4, we present theoretical analysis for the formulated problems and estimate the errors
in data mining. We also present a method of using multiple groups of Bloom ?lters to further
reduce data mining errors. Following that in Sect. 5, we develop a multi-phased algorithm
and conduct a set of empirical simulations to verify our approach. Lastly we conclude our
paper in Sect. 6 with a discussion of contribution, limitation, and future research directions.
2 Literature review
Association rule mining has been an active research area since its introduction [2]. Many
algorithms have been proposed to improve the performance of mining association rules or
frequent itemsets. An interesting direction is the development of techniques that incorporate
privacy concerns.
One type of these techniques is perturbation based, which perturbs the data to a certain
degree before data mining so that the real values of sensitive data are obscured while non-
sensitive statistics on the collection of data are preserved. In an early work [5] a perturbation-
based approach was proposed for decision tree learning. This approach was further studied in
[1, 19, 23, 24]. Some recent work [6, 8, 9, 11, 14, 15, 29–31, 36, 37, 39] investigates the tradeoff
between leakage of private information and accuracy of mining results. In [8, 11], the authors
considered the problem of limiting disclosure of sensitive rules, aiming at selectively hiding
some frequent itemsets from large databases with as little impact on other, non-sensitive
frequent itemsets as possible. The idea is to modify a given database so that the support
of a given set of sensitive rules decreases below a predetermined threshold. Similarly, in
[37] a method is presented for selectively replacing individual values with unknowns from a
database to prevent the discovery of a set of rules, while minimizing the side effects on non-
sensitive rules. In [9, 31], the authors studied the impact of hiding strategies on an original
data set by quantifying how much information is preserved after sanitizing the data set. In
[6, 15, 36], researchers also studied the problemof mining association rules fromtransactions
in which the data has been randomized to preserve the privacy of individual transactions.
One problem of perturbation-based approach is that it may introduce some false association
rules. Another drawback of this approach is that it cannot always fully preserve data privacy
while achieving high mining precision [23, 24].
The second type is distributed privacy preserving data mining [13, 21, 26, 33, 38] based on
secure multi-party computation [40]. Though this approach can preserve privacy, it works
only in distributed environment (with several parties to collaborate in mining process) and
needs sophisticated protocols (secure multi-party computation based), which makes it infea-
sible for our scenario.
Both types of techniques are designed to protect privacy by masquerading the source data,
and cannot protect privacy distillable from the mining requests and results accessible to data
miners.
Related, but not directly relevant to our work, is the research in outsourced databases
[3, 16–18, 20] and most recently in data con?dentiality [35]. Hacigumus et al. [16–18, 20]
1 3
Outsourcing data mining tasks
explored a new paradigm for data management in which a third party service provides host
database as a service. They proposed several encryption techniques to process as many que-
ries as possible at the service providers’ site without having to decrypt the data. Agrawal
et al. [3] presented an order-preserving encryption scheme for numeric data that allows com-
parison operations to be directly applied on encrypted data. However, encryption is time
consuming and it may require auxiliary indices. It is only designed for certain type of queries
but may not be suitable for complex tasks such as association rule mining. Most recently Ra´ s
et al. [35] presented a generalized strategy to reduce a disclosure risk of con?dential data by
hiding some attributes of the source data. Hiding (or replacing) some attributes may break
the integrity of source data and thus the mining results may not be meaningful if it is applied
to our application scenario.
3 Problem formulation
Our research question is how to outsource the association rule data ming tasks, at the same
time, protect BI and customer privacy. We propose a Bloom?lter based approach with which
an authorized third party (e.g., an edge server or an IT division as mentioned earlier) will
perform the association rule mining based on the dataset transformed by Bloom Filters and
report the frequent itemsets with controllable error boundary.
In this section, we ?rst brie?y review the basics of Bloom ?lter, and then present the
mathematic formalization for our problem.
3.1 Bloom ?lter revisited
A Bloom ?lter is a simple, space-ef?cient, randomized data structure for representing a set
of objects so as to support membership queries.
De?nition 3.1 Given an n-element set S = {s
1
, . . . , s
n
} and k hash functions h
1
, . . . , h
k
of range m, the Bloom ?lter of S, denoted as B(S), is a binary vector of length m that is
constructed by the following steps: (i) every bit is initially set to 0; (ii) every element s ? S is
hashed into the bit vector through the k hash functions, and the corresponding bits h
i
(s) are
set to 1.
4
A Bloom ?lter function, denoted as B(·), is a mapping from a set (not necessarily
n-element set) to its Bloom ?lter.
For membership queries, i.e., whether an item x ? S, we hash x to the Bloom ?lter of S
(through those hash functions) and check whether all h
i
(x) are 1’s. If not, then clearly x is not
a member of S. If yes, we say x is in S although this could be wrong with some probability.
De?nition 3.2 For an element s and a set S, de?ne s ?
B
S if s hashes to all 1’s in the Bloom
?lter of S, and s / ?
B
S otherwise. The false positive rate of the Bloom ?lter of S is de?ned
as the probability of s ?
B
S while s / ? S, or Pr(s ?
B
S | s / ? S).
Assuming that all hash functions are perfectly random, we have the following
Lemma 3.3 Given an n-element set S = {s
1
, . . . , s
n
} and its Bloom ?lter B(S) of length m
constructed from k hash functions, the probability for a speci?c bit in B(S) being 0 is
p
0
=
_
1 ?
1
m
_
kn
? e
?kn/m
4
A location can be set to 1 multiple times, but only the ?rst change has an effect.
1 3
L. Qiu et al.
and the probability for a speci?c bit being 1 is
p
1
= 1 ? p
0
? 1 ?e
?kn/m
Then the false positive rate of B(S) is
f = p
k
1
?
_
1 ?e
?kn/m
_
k
(1)
Given n and m, f is minimized when p
0
= p
1
= 0.5 and k =
m
n
ln 2, in which case
f = 1/2
k
= (0.6185)
m/n
.
What’s missing from traditional Bloom ?lter? Traditional Bloom ?lters cannot solve the
privacy issue because anybody knowing the hash function can derive the original itemsets
based on Bloom Filters.
To preserve the privacy of Bloom ?lters, we propose keyed Bloom ?lter by augmenting
the hash functions h
i
with a secret key K. To represent set S, an element s ? S is inserted
into Bloom ?lter B by setting the corresponding bits h
i
(s ? K) in B to 1, where ? represents
concatenation. Also, to query whether an item x ? S, we check whether all h
i
(x ? K) bits are
set to 1. Without knowing the secret key, one is unable to derive the original set by examining
a Bloom ?lter. Without further mention we always assume that Bloom ?lters in our paper are
constructed with secret keys.
3.2 Our problem
Fundamentally we want to provide a solution for privacy preserving frequent itemsets min-
ing. The frequent itemset mining has been a common task in many data mining projects
for the past decade. From frequent itemsets, one can easily derive all association rules. The
mining of frequent itemsets of association rules has a wide range of applications in many
areas, from the analysis of customer preferences to DNA patterns.
For market basket data, we de?ne each transaction, such as a list of items purchased, as a
subset of all possible items.
De?nition 3.4 Let I = {I
1
, . . . , I
d
} be a set of d boolean variables called items. Let data-
base D be a set of transactions T
1
, T
2
, . . . , T
N
where each transaction T
i
is a set of items
such that T
i
? I. The support of an itemset S over I, denoted support (S), is de?ned as the
number of the transactions that contain S. The frequency of an itemset S, denoted freq(S),
is de?ned as support (S)/N.
Problem 1 Traditional problem: frequent itemsets mining. Mathematically, given a trans-
action database D over I and a threshold ? ? [0, 1], traditional research focuses on ?nding
all frequent itemsets FS ? 2
I
such that freq(FS) ? ?.
Our idea is to transform transaction database to a collection of Bloom ?lters to preserve
the privacy in frequent itemset mining. Each transaction T
i
? T is transformed to Bloom
?lter B(T
i
) of size m using k hashed functions. To preserve the privacy of items I
i
, we assume
that the mining process is done on the (keyed) Bloom ?lters B(I
i
) of the items rather than
on the items themselves.
Problem 2 Our research problem: privacy preserving frequent itemsets mining. Given (i)
a collection of Bloom ?lters {B(T
1
), . . . , B(T
N
)} for transaction database D over I, (ii) a
set of Bloom ?lters {B(I
1
), . . . , B(I
d
)} for items in I, and (iii) a threshold ? ? [0, 1], ?nd
all Bloom ?lters B(FS) of itemsets FS ? 2
I
such that freq(FS) ? ?.
1 3
Outsourcing data mining tasks
Normally without knowing the secret key, the third party, which may not be fully trusted,
will be unable to interpret sensitive and private information either from the contents of dat-
abases or from mining results. However, we still need to handle some extreme cases for the
protection of BI and customer privacy. Case 1, some transactions may contain only one item.
The number of 1’s in these Bloom ?lters are no more than but very close to k. Therefore, the
outsourced database may divulge partial individual items. To prevent such divulgence, one
of the solutions is to insert several virtual items as white noise to those transactions in which
item numbers are smaller than a threshold. Case 2, candidates of frequent 1-itemsets are
exactly individual items which may divulge some sensitive information. To prevent this, we
extend the usage of secret key by inserting several virtual items, denoted by k
1
, k
2
, . . . , k
i
,
into all transactions before outsoucing. At the same time, each candidate of frequent 1-item-
sets is inserted with a virtual item randomly chosen from k
1
to k
i
and is sent to the third
party together with mining requests (see Sect. 5.1 for detailed mining process). Thus the
edge servers cannot easily identify candidates of frequent 1- and 2-itemsets because both
types of candidates look alike. This method can be applied to conceal candidates of frequent
2-, 3-, …, and k-itemsets. Moreover, this operation can be done before or after the mixing of
white noise discussed in case 1.
4 Analysis
In this section, we analyze the possible error rates introduced by mining the frequent itemsets
based on Bloom ?lters instead of the original dataset. For any given itemset, the frequency
learnt from Bloom ?lters may be larger than its real frequency learnt from original transac-
tions due to the false positive of Bloom?lters. We make this clear in the following analysis. By
default, we assume that for any itemset, there is a Bloom ?lter function B(·) which produces
binary vector of length m through k hash functions.
4.1 Preliminaries
The false positive of a Bloom ?lter was de?ned for checking an element from a Bloom ?lter.
Now we extend the concept of false positive to checking an itemset from a Bloom ?lter.
De?nition 4.1 Given an itemset S and a transaction T
i
, de?ne S ?
B
T
i
if for all items s ? S,
s ?
B
T
i
, and de?ne S
B
T
i
otherwise. The false positive rate for checking S from the
Bloom ?lter of T
i
, denoted as f
i
, is de?ned as the probability of S ?
B
T
i
while S T
i
, or
Pr(S ?
B
T
i
| S T
i
).
Due to the false positive of checking an itemset from a Bloom ?lter, the support or fre-
quency learnt from a collection of Bloom ?lters is different from that learnt from original
transactions. Regarding such support and frequency, we have the following
De?nition 4.2 Given a collection of N Bloom ?lters {B(T
1
), . . . , B(T
N
)} for transaction
database D over I, the support of an itemset S ? 2
I
that is learnt from the collection of ?l-
ters, denoted as Bsupport (S), is de?ned as the number of ?lters B(T
i
) that satisfy S ?
B
T
i
.
The frequency of S that is learnt fromthe collection of ?lters, denoted as Bfreq(S), is de?ned
as Bsupport (S)/N.
Lemma 4.3 In the setting of Definition 4.2, the following statements hold: (i) S ?
B
T
i
iff B(S) ? B(T
i
) = B(S), where ? is bitwise AND. (ii) If S ? T
i
, then S ?
B
T
i
. (iii)
1 3
L. Qiu et al.
If S T
i
, then S ?
B
T
i
with probability f
i
, and S
B
T
i
with probability 1 ? f
i
. (iv)
Bfreq(S) ? freq(S).
Theorem 4.4 Given an itemset S and a transaction T
i
, the false positive rate of checking S
from the Bloom ?lter of T
i
is
f
i
=
_
1 ?e
?kn
i
/m
_
||B(S?T
i
)||
where n
i
= |T
i
| is the length of transaction T
i
in terms of the number of items, and || · ||
indicates the number of 1’s in a binary vector.
Proof From Eq. (1), one can derive that the false positive rate for checking any single item
s ? S ?T
i
is p
k
1
, where p
1
=
_
1 ?e
?kn
i
/m
_
is the probability that a speci?c bit is 1 in B(T
i
)
and k is the number of bits to which the item is hashed. From Lemma 4.3, we know that
S ?T
i
?
B
T
i
. Since the items in S ?T
i
are hashed to ||B(S ?T
i
)|| bits all together, the false
positive rate f
i
for checking S from B(T
i
) is p
||B(S?T
i
)||
1
=
_
1 ?e
?kn
i
/m
_
||B(S?T
i
)||
.
Corollary 4.5 Given an itemset S and a transaction T
i
, the false positive rate f
i
of checking
S from the Bloom ?lter of T
i
is bounded:
_
1 ?e
?kn
i
/m
_
||B(S)||
? f
i
?
_
1 ?e
?kn
i
/m
_
k
Proof According to the definition of false positive, we have S T
i
and thus 1 ? ||B(S ?T
i
)
|| ? ||B(S)||. Combining this with Theorem 4.4, we have
_
1 ?e
?kn
i
/m
_
||B(S)||
? f
i
?
_
1 ?e
?kn
i
/m
_
k
.
4.2 False positive and false negative
In our data mining problem, an outsourced server has access to the Bloom ?lters {B(T
i
)} of
all transactions. However, it has no access to the original data. Therefore, given a Bloom?lter
B(S), the server cannot compute the frequency freq(S) directly. The approaches to solving
the traditional frequent itemset mining problemcannot be applied directly to solving our data
mining problem.
According to Lemma 4.3 (i), the frequency Bfreq(S) can be derived from those Bloom
?lters. Our solution is to ?nd all Bloom ?lters B(S) such that Bfreq(S) ? ?
where ?
is a
revised threshold. Note that freq(S) ? ? is required in our data mining problem; thus, we
must ensure that
_
B(S) : Bfreq(S) ? ?
_
{B(S) : freq(S) ? ?}. Because the two sets are
not necessarily the same, we need to de?ne the false positive rate and false negative rate for
checking an itemset from all Bloom ?lters.
De?nition 4.6 Given an itemset S and N Bloom ?lters B(T
i
), i = 1, . . . , N, the false
positive rate for checking S from all Bloom ?lters using revised threshold ?
? ?, denoted as
f
+
?
, is de?ned as the probability of Bfreq(S) ? ?
while freq(S) < ?, or Pr
_
Bfreq(S) ? ?
|
freq(S) < ?). The false negative rate for checking S fromall Bloom?lters using?
, denotedas
f
?
?
, is de?ned as the probability of Bfreq(S) < ?
while freq(S) ? ?, or Pr
_
Bfreq(S) < ?
|
freq(S) ? ?).
If the revised threshold ?
is the same as the threshold ?, then the false negative rate will
be zero due to the fact Bfreq(S) ? freq(S) (Lemma 4.3 (iv)); however, the false positive
rate may be greater than zero in this case. In general, one may use ?
? ? so as to balance
1 3
Outsourcing data mining tasks
false negative rate and false positive rate. The higher the ?
, the higher the false negative rate,
and the lower the false positive rate. If the false negative rate is of major concerns, one may
choose ?
= ? to zero out false negative rate. Note that choosing ?
< ? is meaningless as it
only increases the false positive rate without decreasing the false negative rate which is zero.
To formalize our analysis, we de?ne a random variable for checking an itemset against a
Bloom ?lter. Then we re-write in terms of the de?ned variables the frequency of an itemset
learnt from Bloom ?lters and the false positive/negative rates for checking an itemset from
all Bloom ?lters.
De?nition 4.7 For an itemset S and a transaction T
i
such that S T
i
, de?ne a random 0–1
variable e
i
such that e
i
= 1 if S ?
B
T
i
and e
i
= 0 if S
B
T
i
.
The de?ned variable indicates whether S ?
B
T
i
; that is, e
i
= 1 with probability f
i
and
e
i
= 0 with probability 1 ? f
i
(see Lemma 4.3). In other words, e
i
represents a Bernoulli
trial with probabilities f
i
of success and 1 ? f
i
of failure. Without loss of generality, we can
assume that S T
i
for the ?rst N · (1 ?freq(S)) transactions T
i
. Then we have
Bfreq(S) = freq(S) +
1
N
·
N(1?freq(S))
i =1
e
i
(2)
Lemma 4.8 Let s
e
be the sum of N · (1 ?freq(S)) random 0-1 variables e
i
de?ned for an
itemset S. The false positive and false negative rates for checking S from all Bloom ?lters
using a revised threshold ?
? ? are
f
+
?
= Pr
_
s
e
? N
_
?
?freq(S)
_
| freq(S) < ?
_
f
?
?
= Pr
_
s
e
< N
_
?
?freq(S)
_
| ? ? freq(S) < ?
_
4.3 Estimate of false positive and false negative
We estimate the false positive and false negative rates in the case of n
i
?
= n, where n
i
is the
size of each transaction, and n =
1
N
N
i =1
n
i
is the average size of N transactions. Note that
if n
i
n, the transaction data can be clustered into multiple groups such that in each group,
the size of each transaction is equal or close to the average size of the transactions in the
group. The data mining task can be easily extended to each group. For simplicity, we assume
that n
i
= n in our analysis; we leave the multi-group case in Sect. 4.4.
Let k =
m
n
ln 2 be the optimal number of hash functions that are used for generating
the Bloom ?lter of length m for each transaction, which consists of n items. According to
Lemma 3.3 and Corollary 4.5, the false positive rate f
i
for checking an itemset S from a
Bloom ?lter has a lower bound and an upper bound:
2
?||B(S)||
? f
i
? 2
?k
(3)
First consider a special case where f
i
=
¯
f for all i = 1, . . . , N. In this case s
e
is the
sum of N · (1 ? freq(S)) independent Bernoulli trials with probabilities
¯
f for success and
1?
¯
f for failure. Let b(?, ?,
¯
f ) =
?!
?!(???)!
¯
f
?
(1?
¯
f )
???
be the probability that ? Bernoulli
trials with probabilities
¯
f for success and (1 ?
¯
f ) for failure result in ? successes and ? ??
failures (? ? ?). Let C(?, ?,
¯
f ) =
?
i =?
b(i, ?,
¯
f ) be the cumulative binomial probability
of having at least ? successes in ? trials.
5
Then the false positive and false negative rates for
5
Function C is a standard function provided in many commercial software packages such as Matlab.
1 3
L. Qiu et al.
checking an itemset S from all Bloom ?lters are
f
+
?
(
¯
f ) = C
_
N ·
_
?
?freq(S)
_
, N · (1 ?freq(S)) ,
¯
f
_
, where 0 < freq(S) < ? (4)
f
?
?
(
¯
f ) = 1 ?C
_
N ·
_
?
?freq(S)
_
, N · (1 ?freq(S)) ,
¯
f
_
, where ? ? freq(S) < ?
(5)
Since the cumulative binomial probability C(?, ?,
¯
f ) is monotonic increasing with
¯
f ,
from formulae (3) to (5) it is easy to know the lower bounds and upper bounds for the false
positive rate and false negative rate in general case:
f
+
?
_
2
?||B(S)||
_
? f
+
?
? f
+
?
_
2
?k
_
, where 0 < freq(S) < ?
f
?
?
_
2
?k
_
? f
?
?
? f
?
?
_
2
?||B(S)||
_
, where ? ? freq(S) < ?
In a special case where ?
= ?, we have f
?
?
= 0 and
C
_
A, B, 2
?||B(S)||
_
? f
+
?
? C
_
A, B, 2
?k
_
(6)
where A = N · (? ?freq(S)) and B = N · (1 ?freq(S)).
Equation (6) indicates that the greater the number k of hash functions, the smaller the false
positive and false negative rates. In other words, the longer the Bloom ?lters, the smaller the
false positive and false negative rates. To further understand this, we compare Bfreq(S) with
freq(S) in the average case. Let E[·] denote the mean of a random variable. From Eq. (2),
we have
E
_
Bfreq(S)
_
= freq(S) +
1
N
· E[s
e
] = freq(S) +
1
N
·
N(1?freq(S))
i =1
f
i
Recall the bounds for f
i
(see Eq. 3). In the false positive case (where freq(S) ? ?), we
have
N · (1 ??) · 2
?||B(S)||
? E[s
e
] ? N · 2
?k
Similarly, in the false negative case (where ? ? freq(S) < ?
), we have
N · (1 ??
) · 2
?||B(S)||
? E[s
e
] ? N · (1 ??) · 2
?k
The above three equations imply that the average value of Bfreq(S) is greater than freq(S),
but the difference is bounded. Note that ||B(S)|| ? k. We have
E
_
Bfreq(S)
_
?freq(S) ? (1 ??) · 2
?k
< 2
?k
(7)
The longer the Bloom ?lters (i.e., the greater the k), the smaller the difference between
Bfreq(S) and freq(S). If the length of Bloom ?lters increases linearly, then the difference
of frequencies decreases exponentially. For example, if k ? 20, then the difference of fre-
quencies will be less than 10
?6
. This means that in the average case, the frequency of an
itemset detected from Bloom ?lters will be greater than that detected from original data by
at most 10
?6
. Note that the above analysis is conducted for each itemset. The overall false
positive and false negative rates depend on the distribution of itemsets and their frequencies.
Empirical study will be conducted in Sect. 5.
1 3
Outsourcing data mining tasks
4.4 Multiple groups of Bloom ?lters
The estimate of false positive and false negative rates is based on the assumption that the size
of each transaction is equal or close to the average size of all transactions. This might not
be true in many real data sets. A simple solution is to cluster the transactions into multiple
groups such that in each group, the size of each transaction is equal or close to the average
size of the transactions in the group. Across all groups, the same number k of hash functions
are used for generating Bloom ?lters. In each group, k =
m
n
ln 2 is optimized for the length
m of Bloom ?lters and the average size n of the transactions in the group. This means that
for different groups, the length of Bloom ?lters are different depending on the average size
of transactions in the group. Roughly speaking, long Bloom ?lters will be used for long
transactions, while short Bloom ?lters for short transactions.
Using different Bloom?lters for different groups of transactions will not only save storage
requirement but also increase the precision of data mining. Note that the bounds presented
in the previous section are based on k. Since the same k is optimized across multiple groups,
the bounds can be used to estimate the false positive and false negative rates in all groups.
In the case of multiple groups, the transactions are represented by the Bloom ?lters of
different lengths. This requires that each candidate itemset be represented by Bloom ?lters
of different lengths too. This is because in our data mining problem, the Bloom ?lter of each
candidate itemset needs to be checked against the Bloom ?lters of the same length. To avoid
transmitting multiple Bloom ?lters for each candidate itemset between client and server,
only the longest Bloom ?lter of each candidate itemset is sent to the server at one time. In
data mining process, the outsourced server needs to transform it into different lengths so as
to check it against different groups of Bloom ?lters. We provide a simple solution called
?-folding to transform the Bloom ?lters.
De?nition 4.9 Given an m-bit Bloom ?lter B, ?-folding of B, denoted as B
?
, is de?ned as a
?m-bit vector generated from B: for 0<i ??m, bi t (B
?
, i )=bit(B, i )
_
0<j ?m
j =i mod ?m
bi t (B, j ),
where bi t (B, i ) denotes the i
th
bit of B, ? the bitwise OR, and 0 < ? ? 1.
Example 1 Let B be a 10-bit Bloom?lter. B
0.7
is de?ned as a 7-bit vector B
0.7
: bi t (B
0.7
, i ) =
bi t (B, i ) ? bi t (B, i +7) for i = 1, 2, 3, and bi t (B
0.7
, i ) = bi t (B, i ) for i = 4, 5, 6, 7.
From the definition of Bloom ?lter, it is clear that vector B
?
is a Bloom ?lter generated
with the same k hash functions which are used for generating Bloom ?lter B.
Given a Bloom ?lter B(S) of longest length m of candidate itemset S, the frequency
Bfreq(S) is computed by checking B(S) against the Bloom ?lters of transactions in all
groups. For a particular group in which the Bloom ?lters have length m
? m (note that
m is the longest length for all groups), the server applies
m
m
-folding to B(S) such that the
transformed Bloom ?lter has the length m
.
5 Experiments
To evaluate the performance of our Bloom ?lter based method for mining frequent itemsets,
we conduct experiments based on both synthetic data and real data. A framework of our
method is shown in Algorithm 1.
1 3
L. Qiu et al.
5.1 Algorithm
Algorithm 1 can be divided into three phases: counting phase (lines 3–5), pruning phase
(lines 6–8), and candidates generating phase (lines 9–10) in each round , where indicates
the size of each candidate itemset dealt with. In the counting phase, each candidate ?lter is
checked against all transaction ?lters and the candidate’s count is updated. In the pruning
phase, any Bloom ?lter is eliminated from the candidate set if its count (i.e., Bsupport) is less
than a revised threshold N · ?
. Finally, in the candidates generating phase, new candidate
Bloom ?lters are generated from the Bloom ?lters discovered in the current round. The new
candidates will be used for data mining in the next round.
Algorithm 1 Mining frequent itemsets from Bloom ?lters
1: C
1
=
_
B(I
1
), . . . , B(I
d
)
_
// B(I
i
) is the Bloom ?lter of item I
i
2: for ( = 1; C
= ?; ++) do
3: for each B(S) ? C
and each transaction ?lter B(T
i
) do
4: if S ?
B
T
i
then Bsupport(S) ++ // S ?
B
T
i
iff B(S) ? B(T
i
) = B(S)
5: end for
6: for each B(S) ? C
do
7: if Bsupport (S) < N · ?
then delete B(S) from C
// ?
is the revised threshold in data mining
8: end for
9: F
= C
// F
is the collection of Bloom ?lters of all “frequent" itemsets with length
10: C
+1
= can_gen(F
) // generate ?lters of candidate itemsets for the next round
11: end for
12: Answer =
_
F
// all ?lters of frequent itemsets
5.1.1 Ef?cient counting
To improve the ef?ciency in the counting phase, we organize the Bloom ?lters of the trans-
actions of each group in a tree hierarchy and use every q bits to partition them at different
levels, where q is a parameter. For example, at the root level, the partition leads to 2
q
child
nodes; the Bloom ?lters in each node share the same ?rst q bits. A node splits if it contains
more than c Bloom ?lters, where c is another parameter. At the end of partition, each leaf
node contains limited number of Bloom ?lters, while each non-leaf node (except the root) is
associated with a q-bit segment with which the node shares.
Because of the randomness of keyed hash functions, the distribution of Bloom ?lters is
uniform,
6
which implies that the tree is well balanced. Therefore, an L-level tree can be used
to index up to c · 2
qL
Bloom ?lters. Given q = 5 and c = 20, for example, a 4-level tree can
be used to index 20M Bloom ?lters.
Heuristic 5.1 Let s be the q-bit segment associated with a non-leaf node and B(S) be a
Bloom ?lter of candidate itemset S. If any bit in s is 1 while the corresponding bit in B(S) is
0, then no Bloom ?lter in the subtree rooted at the non-leaf node needs to be checked in the
counting phase.
In the counting phase, we traverse the tree to compare each candidate ?lter with the trans-
action ?lters stored in the leaf nodes and update the count of the candidate ?lter appropriately.
According to the above heuristic, we may skip some subtrees in the counting process.
6
Using the optimal number of hash functions, each bit in a Bloom ?lter has the equal probability of being 1
and 0.
1 3
Outsourcing data mining tasks
An alternative way to do this is to organize candidate ?lters in a tree structure and update
their counts appropriately while traversing the tree for each transaction ?lter. In multiple
group case, this solution requires that the tree of candidate ?lters be built differently for
each group, while in the above method a static tree of transaction ?lters can be used for any
candidate ?lters.
5.1.2 Candidates generating
To support interactive data mining and discover all frequent itemsets as required in Problem2,
a multiple step interaction can be conducted between client and server in candidates gener-
ating phase. The client provides the server with a set of candidate ?lters C
; after the server
sends back the mining result F
, the client generates another set C
+1
of candidate ?lters and
sends it to the server for data mining. This process can be done repetitively in each round.
As mentioned in Sect. 3.2, with C
1
(i.e., Bloom ?lters of individual items) an edge server
may decipher partial sensitive data (e.g., the compositions of transactions). Therefore, for
the concern of protecting BI and preserving customer privacy, it is advisory not to outsource
frequent 1-itemset mining tasks, or to use alternative choice of concealing C
1
discussed in
Sect. 3.2.
The candidate generation C
+1
= can_gen(F
) in this case is conducted at client side
for privacy reasons. The Bloom ?lters in F
are transformed back to itemsets with the help
of secret key. From this collection of itemsets, the client generates a new set of candidate
itemsets using the well-known method apri ori _gen as proposed in [4]. The basic idea of
apri ori _gen is that a candidate itemset of length + 1 is generated only if all its subsets
of length appear in the collection of itemsets. The client may also edit the set of candi-
dates according to application requirements and constraints. Finally, the client transforms
the candidate itemsets to Bloom ?lters and sends them back (in C
+1
) to the server. All of
our experiments presented in the next section are based on this scenario.
Another choice to perform candidate generation is at server side. However, the server has
no secret key to perform the hash functions, so it cannot transform back and forth between
Bloom ?lters and itemsets. A possible solution is to use C
+1
={B(S
1
) ? B(S
2
) : B(S
1
),
B(S
2
) ? F
} as candidate set for the next round. It is easy to verify that B(S
1
) ? B(S
2
) is
the Bloom ?lter of itemset S
1
? S
2
; therefore, this solution generates all Bloom ?lters of the
itemsets that are unions of any two frequent itemsets (clearly, no frequent itemset is missed
in this process). The disadvantage of this solution is that the server cannot exploit Apriori
property using apri ori _gen at itemset level.
5.2 Experiment settings
Rigorous experiments are conducted on both real data and synthetic data so as to evaluate the
performance of our Bloom ?lter method in terms of mining precision, storage requirement,
and computation time. All the experiments are run on a Compaq desktop computer with
Pentium-4 CPU clock rate of 3.00GHz, 3.25GB of RAM, and 150GB harddisk, running
Microsoft Windows XP Professional Version 2002 with SP2.
5.2.1 Real data
The real data sets we adopt for this experiment are BMS-POS, BMS-WebView-1, and BMS-
WebView-2 available at http://www.ecn.purdue.edu/KDDCUP. The dataset BMS-POS con-
tains several years worth of point-of-sale data from a large electronics retailer, whereas the
1 3
L. Qiu et al.
Fig. 1 Distribution of
transaction sizes
0
20
40
60
80
100
5 10 15 20 25 30 35
Transaction size
F
r
e
q
u
e
n
c
y
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Synthetic
datasets BMS-WebView-1 and BMS-WebView-2 contain several months worth of click-
stream data from two e-commerce websites.
5.2.2 Synthetic data
We generate synthetic data using the transaction generator designed in IBM Quest project
[4]. A synthetic dataset contains 100–750K transactions; each transaction is generated from
a set of 1,000 frequent itemsets. The size of each frequent itemset is picked from Poisson
distribution with mean 4. There are totally 1,000 distinct items. The size of each transaction
is picked from Poisson distribution with mean 10. We use four sets of synthetic data, namely
Syn1, Syn2, …, and Syn4.
5.2.3 Distributions of real data and synthetic data
The distributions of transaction sizes for both real data and synthetic data are shown in
Fig. 1. According to [41], the exponential-like distributions of transaction sizes are typical in
many real data sets used for mining association rules, while the synthetic data generated by
IBM generator with Poisson distribution have been most widely used for testing the scalabil-
ity of association rule mining algorithms. We use the real data to test the mining precision,
storage requirement, computational time, and the tradeoffs among them. We use the synthetic
data for scalability test. We show that our method works well for both types of data.
5.2.4 Mining precision
For mining precision, we compare the result of our method with the solution to the traditional
problem (see Problem 1). We use the standard association rule mining algorithm Apriori [4]
to discover exactly all frequent itemsets for the traditional problem. Given a revised threshold
?
in our method, the mining precision is measured in terms of overall false positive rate F
+
?
and overall false negative rate F
?
?
. Let F be the set of Bloom?lters of those frequent itemsets
discovered by Apriori, and F
be the set of Bloom ?lters discovered by our algorithm. Then
• Overall false positive rate: F
+
?
=
|F
?F|
|F
|
.
• Overall false negative rate: F
?
?
=
|F?F
|
|F
|
.
Note that the overall false positive and false negative rates are not exactly the false positive
and false negative rates discussed in Sect. 4.3. Given a data set and a data mining threshold,
the false positive and false negative rates studied in Sect. 4.3 are the theoretical probabilities
particular to each itemset, while the overall false positive and false negative rates are the
experimental results for the whole data set. Statistically, one can expect a positive correlation
between the two measurements.
1 3
Outsourcing data mining tasks
Table 1 Experiment parameters and their default values
Parameter Meaning default value
k Number of hash functions used for generating Bloom ?lters 30
? Frequency threshold for traditional data mining problem 1%
? Coef?cient for revised threshold ?
= ? +? · 2
?k
0
g Number of groups 4
5.2.5 Storage requirement
The storage requirement is measured in terms of the length of Bloom ?lters and the num-
ber of transactions that an outsourced server needs to store. If there are multiple groups of
transactions represented by Bloom ?lters of different lengths, then the storage requirement
is the sum of the product of Bloom ?lter length and number of transactions in each group.
Let there be g groups. For each group i , let the length of Bloom ?lters be m
i
and the number
of transactions N
i
. Then
Storage requirement =
g
i =1
m
i
N
i
.
5.2.6 Experiment parameters
Table 1 gives the parameters used in our experiments as well as their default values. In our
experiments, unless otherwise indicated, only one parameter is changed at a time while others
are kept at their default values.
The number k of hash functions is the optimal parameter to determine the length m
i
of
Bloom ?lters in each group i based on the average length n
i
of transactions in the group; that
is, k =
m
i
n
i
ln 2. The same k is used across all groups.
Recall that for any itemset, the average difference of the revised frequency and the original
frequency is upper-bounded by 2
?k
(see Eq. 7). We use the coef?cient ? (where 0 ? ? ? 1)
to calculate the revised frequency threshold ?
= ? + ? · 2
?k
. This coef?cient is used to
balance between the overall false positive and false negative rates.
When the transactions are divided into multiple groups, the grouping is based on the dis-
tribution of transaction size. In our experiments, the grouping is carried out by the following
method. Let n
1
be the average length of transactions in a dataset. The transactions are ?rstly
divided into two groups such that Group 1 contains those transactions whose lengths are
not greater than n
1
and otherwise for Group 2. Likewise, Group 2 is further divided into
two groups. The grouping operation is continued until the original dataset is divided into g
groups.
5.3 Experimental results
5.3.1 Data mining time versus data conversion time
In the ?rst set of experiments, we study the data mining time versus the data conversion time
by changing the number k of hash functions from 25 to 40.
Figure 2 shows that the time of mining frequent itemsets is much more than the time of
converting raw data to Bloom ?lter representation, the more transactions, the more mining
1 3
L. Qiu et al.
Fig. 2 Data mining time versus
data conversion time
0.0
6.0
12.0
18.0
24.0
30.0
25 30 35 40
Number of hash functions
R
e
l
a
t
i
v
e
r
u
n
n
i
n
g
t
i
m
e
(
m
i
n
i
n
g
/
c
o
n
v
e
r
s
i
o
n
)
BMS-WebView-1 BMS-WebView-2 BMS-POS
0
4
8
12
16
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
0
1.5
3
4.5
6
0.5 0.75 1.5
Frequency Threshold (%)
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
BMS-WebView-1
BMS-WebView-2
BMS-POS
2 1 0.5 0.75 1.5
Frequency Threshold (%)
2 1
Fig. 3 Changing mining thresholds where g = 4, k = 30, and ? = 0
time. This result reveals that the mining process takes the major part of running time. It veri-
?es the worthiness of pre-processing (i.e., data format conversion before outsourcing mining
tasks).
5.3.2 Change mining thresholds
In the second set of experiments, we study the mining precision and running time by changing
the frequency threshold ? from 0.5 to 2%. Other parameters are kept their default values, i.e.,
g = 4, k = 30, and ? = 0.
Figure 3 shows that the false positive rate is less than 6% for datasets BMS-WebView-1
and BMS-WebView-2, and less than 1% for datatset BMS-POS. For all datasets, the running
time is decreasing with mining threshold. The reason is that a larger mining threshold will
cause fewer frequent itemsets to be discovered in a shorter time. In addition, the experiments
on dataset BMS-POS require more time than those on datasets BMS-WebView-1 and BMS-
WebView-2 because dataset BMS-POS contains approximately 6–8 times more transactions
than other two datasets.
The false negative rates are 0 for all datasets when ? = 0. We also run experiments for
the revised frequency threshold ?
= ? + ? · 2
?k
when ? varies from 0.2 to 1.0 with an
increment of 0.2. Since our experimental results are the same as compared with the cases for
which ? = 0, the mining precision is not sensitive to the change of frequency threshold in
our experiments. Therefore, in the following, we use ? = 1% and ? = 0 and only consider
the false positive in our analysis.
1 3
Outsourcing data mining tasks
0
40
80
120
160
200
20 25 30 35 40
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
2
4
6
8
20 25 30 35 40
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
10
20
30
40
20 25 30 35 40
Number of Hash Functions Number of Hash Functions
Number of Hash Functions
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Fig. 4 Changing number of hash functions where g = 4, ? = 1%, and ? = 0
5.3.3 Change number of hash functions
In the third set of experiments, we study the mining precision, computation cost, and storage
requirement with the change of the number k of hash functions. We change k = from 20 to
40 and keep the other parameters at their default values, i.e., g = 4, ? = 1%, and ? = 0.
Figure 4 shows the mining precision with the change of k. There is a globally decreasing
trend of false positive rate for each real dataset. For dataset BMS-POS, the false positive rate
is nearly 5% for k = 20, and is less than 1% for k ? 25. For datasets BMS-WebView-1 and
BMS-WebView-2, the false positive rates are below 5% for k ? 30. It also shows that the
running time changes slightly with k. The running time is around 8min for dataset BMS-POS
and within 1min for datasets BMS-WebView-1 and BMS-WebView-2. The storage require-
ment is linearly increasing with k for all datasets based on the ?gure. The reason is that
the larger the k, the longer the Bloom ?lters, and the more storage is required to store the
Bloom ?lters. The experimental results show that high mining precision can be achieved by
increasing the number of hash functions. Consequently, the storage requirement increases
linearly due to the use of longer Bloom ?lters.
5.3.4 Change number of groups
In the fourth set of experiments, we study the mining precision, computation cost, and storage
requirement by changing the number g of groups from g = 2 to 5. The other parameters are
kept at their default values, i.e., k = 30, ? = 1%, and ? = 0.
Figure 5 shows that the false positive rate decreases with the number of groups. For data-
sets BMS-WebView-1 and BMS-WebView-2, the false positive rate can be as low as 5% for
g ? 3; whereas for dataset BMS-POS, the false positive rate is less than 2.5% for g ? 2.
It also shows that running time and storage requirement change little with parameter g. The
1 3
L. Qiu et al.
0
30
60
90
120
150
2 3
2 3 2 3
4 5
4 5 4 5
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
2
4
6
8
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
5
10
15
20
Number of Groups Number of Groups
Number of Groups
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Fig. 5 Changing number of groups where k = 30, ? = 1%, and ? = 0
results of this set of experiments reveal that high mining precision can be achieved with larger
number of groups without increasing much the storage requirement and the running time.
5.3.5 Scalability
In the last set of experiments, we use synthetic datasets Syn1, Syn2, …, and Syn4 to test the
scalability of our algorithm with respect to the running time, mining precision, and storage
requirement. We let k = 20 and keep other parameters at their default values, i.e., g = 4,
? = 1%, and ? = 0.
Figure 6 shows that the false positive rate for all synthetic datasets is no more than 1%,
and that the running time and the storage requirement increase linearly with the number of
transactions which is scaled up from 100K in dataset Syn1 to 750K in dataset Syn4. The
scalability is also veri?ed by experiments for k > 20 (e.g., k = 25 and 30) with zero false
positive rate.
From the experimental results, we can conclude that the running time and storage require-
ment are scalable with the number of transactions while the mining precision is reasonably
high.
5.4 Summary of experiments
One can draw the following conclusions from out experiments: (1) in the various cases of
our experiments, zero false negative rate can be achieved. This result is not sensitive to the
change of mining threshold; (2) by increasing the number of hash functions in formulating
Bloom ?lters, the false positive rate decreases at the price of higher storage requirement; (3)
by increasing the number of groups in data mining, the false positive rate decreases without
increasing the running time or the storage requirement; and (4) the running time and the
1 3
Outsourcing data mining tasks
0
70
140
210
280
350
Synthetic Dataset
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
0
30
60
90
120
Synthetic Dataset
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
0
0.25
0.5
0.75
1
Syn1 Syn2 Syn3 Syn4 Syn1 Syn2 Syn3 Syn4
Syn1 Syn2 Syn3 Syn4
Synthetic Dataset
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
Fig. 6 Scalability results where k = 20, g = 4, ? = 1%, and ? = 0
storage requirement are scalable with the number of transactions that are processed in data
mining.
6 Conclusions
The contribution of this paper is threefold. First, we proposed a new approach to outsourc-
ing association rule mining tasks while protecting BI and customer privacy. The proposed
approach is different from previous solutions in that it can protect BI and customer privacy
while outsourcing mining tasks, at the same time, maintain the precision of mining results.
Second, we performed theoretical analysis on the false positive and false negative rates in
data mining. We also estimated the upper and lower bounds for the mining errors. Third,
we investigated the tradeoffs between mining precision and storage requirement. Rigorous
experiments were conducted on typical real and synthetic datasets showing that our approach
can save storage significantly without sacri?cing much mining precision, security level, and
running time.
We are planning to investigate how to protect BI and customer privacy while outsourcing
other mining tasks such as decision tree mining. It is also interesting to study privacy pre-
serving techniques for mining speci?c types of frequent itemsets, such as maximum itemsets
and closed itemsets.
References
1. Agrawal D, Aggarwal CC (2001) On the design and quanti?cation of privacy preserving data mining
algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART symposium on principles of
database systems, pp 247–255
1 3
L. Qiu et al.
2. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large dat-
abases. In: Proceedings of the 1993 ACMSIGMODinternational conference on management of database
pp 207–216
3. Agrawal R, Kiernan J, Srikant R, Xu Y (2004) Order preserving encryption for numeric data. In: Pro-
ceedings of the ACM SIGMOD ICMD, pp 563–574
4. Agrawal R, Srikant R (1994) Faster algorithms for mining association rules in large databases. In: Pro-
ceedings of the 20th international conference on very large data bases (VLDB’94), Santiago de Chile,
Chile, September 12–15, pp 487–499
5. Agrawal R, Srikant R(2000) Privacy preserving data mining. In: Proceedings of the 2000 ACMSIGMOD
international conference on management of database, Texas, USA, May 16–18, pp 439–450
6. Agrawal S, Haritsa JR(2005) Aframework for high-accuracy privacy-preserving mining. In: Proceedings
of the 21th IEEE international conference on data engineering (ICDE 2005), Tokyo, Japan, pp 193–204
7. Apte C, Liu B, Pednault E, Smyth P (2002) Business applications of data mining. Commun ACM
45(8):49–53
8. Atallah M, Bertino E, Elmagarmid AK, IbrahimM, Verykios VS (1999) Disclosure limitation of sensitive
rules. In: Proceedings of the IEEE KDEE, pp 45–52
9. Bishop M, Bhumiratana B, Crawford R, Levitt K (2004) How to sanitize data. In: Proceedings of the
13th IEEE international workshops on enabling technologies: infrastructure for collaborative enterprises
(WETICE’04), Modena, Italy, June 14–16, pp 217–222
10. Bloom B (1970) Space time tradeoffs in hash coding with allowable errors. Commun ACM 7(13):422–
426
11. Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using con?dence
and support. In: Proceedings of the 4th international information hiding workshop, pp 369–383
12. Dibbeern J, Heinzl A (2002) Outsourcing information systems in small and medium sized enterprises: a
test of a multi-theoretical casaul model. In: Dibbeern J (ed) Information systems outsourcing: enduring
themes, emergent patterns, and future directions. Springer, New York
13. Du W, Zhan Z (2002) Building decision tree classi?er on private data. In: Proceedings of IEEE ICDM’02
workshop on privacy, security, and data mining, vol 14, pp 1–8
14. Ev?mievski A, Gehrke J, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining.
In: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on principles of database
system, pp 211–222
15. Ev?mievski A, Srikant R, Agrawal R, Gehrke J (2002) Privacy preserving mining of association rules.
In: Proceedings of the 8th ACM SIGKDD KDD 2002, pp 217–228
16. Hacigumus H, Iyer B, Li C, Mehrotra S (2002) Executing SQL over encrypted data in the database-ser-
vice-provider model. In: Proceedings of the ACM SIGMOD international conference on management
of database, pp 216–227
17. Hacigumus H, Iyer B, Mehrotra S (2002) Providing database as a service. In: Proceedings of the inter-
national conference on data engineering, pp 29–40
18. Hacigumus H, Iyer B, Mehrotra S (2004) Ef?cient execution of aggregation queries over encrypted rela-
tional databases. In: Proceedings of international conference on database systems for advanced applica-
tions, pp 125–136
19. Huang Z, Du W, Chen B (2005) Deriving private information from randomized data. In: Proceedings of
the ACM SIGMOD international conference on management of data, Baltimore, MA, USA, June 14–16,
pp 37–48
20. Iyer B, Mehrotra S, Mykletun E, Tsudik G, Wu Y (2004) A framework for ef?cient storage security in
RDBMS. In: Proceedings of international conference on EDBT, pp 147–164
21. Kantarc?o? glu M, Clifton C (2002) Privacy preserving distributed mining of association rules on hori-
zontally partitioned data. In: Proceedings of the ACM SIGMOD workshop on research issues on data
mining and knowledge discovery, pp 24–31
22. Kantarc?o? glu M, Jin J, Clifton C (2004) When do data mining results violate privacy? In: Proceedings
of the 10th ACM SIGKDD KDD 2004, pp 599–604
23. Kargupta H, Datta S, Wang Q, Sivakumar K (2003) On the privacy preserving properties of random data
perturbation techniques. In: Proceedings of the 3rd IEEE ICDM, pp 99–106
24. Kargupta H, Datta S, Wang Q, Sivakumar K (2005) Random-data perturbation techniques and privacy-
preserving data mining. Knowledge Inf Syst Int J 7(4):387–414
25. Lin Q-Y, Chen Y-L, Chen J-S, Chen Y-C (2003) Mining inter-organizational retailing knowledge for an
alliance formed by competitive ?rms. Inf Manage 40(5):431–442
26. Lindell Y, Pinkas B (2002) Privacy preserving data mining. J Cryptol 15(3):177–206
27. Lui SM, Qiu L(2007) Individual privacy and organizational privacy in business analytics. In: Proceedings
of the 40th Hawaii international conference on system sciences (HICSS 2007), Hawaii, USA, January
3–6, p 216b
1 3
Outsourcing data mining tasks
28. Milne G-R (2000) Privacy and ethical issues in database/interactive marketing and public policy: a
research framework and overview of the special issue. J Public Policy Marketing 19:1–6
29. Oliveira S, Zaiane O (2002) Privacy preserving frequent itemset mining. In: Proceedings of the IEEE
ICDM workshop on privacy, security and data mining, pp 43–54
30. Oliveira S, Zaiane O (2003) Algorithms for balancing privacy and knowledge discovery in association
rule mining. In: Proceedings of the 7th international database engineering and applications symposium,
pp 54–63
31. Oliveira S, Zaiane O (2003) Protecting sensitive knowledge by data sanitization. In: Proceedings of the
3rd IEEE ICDM, pp 211–218
32. Ordones C, Ezquerra N, Santana CA (2006) Constraining and summarizing association rules in medical
data. Knowledge Inf Syst Int J 9(3):259–283
33. Pinkas B (2002) Cryptographic techniques for privacy preserving data mining. ACM SIGKDD Explor
4(2):12–19
34. Qiu L, Li Y, Wu X (2006) An approach to outsourcing data mining tasks while protecting business
intelligence and customer privacy. In: Workshops proceedings of the 6th IEEE international conference
on data mining (ICDM 2006), Hong Kong, China, December 18–22, pp 551–558
35. Ra´ s ZW, Gürdal O, Im S, Tzacheva A (2007) Data con?dentiality versus chase. In: Proceedings of the
joint rough sets symposium(JRS07), Toronto, Canada, May 14–16. Springer LNAI vol 4482, pp 330–337
36. Rizvi S, Haritsa J (2002) Maintaining data privacy in association rule mining. In: Proceedings of
VLDB’02, pp 682–693
37. Saygin Y, Verykios VS, Clifton C (2001) Using unknowns to prevent discovery of association rules.
Sigmod Rec 30(4):45–54
38. Vaidya J, Clifton C (2004) Privacy-preserving data mining: why, how, and when. IEEE Security Privacy
2(6):19–27
39. Xu S, Zhang J, Han D, Wang J (2006) A singular value decomposition based data distortion strategy for
privacy protection. Knowledge Inf Syst Int J 10(3):383–397
40. Yao AC-C (1986) How to generate and exchange secrets. In: Proceedings of the 27th IEEE symposium
on foundations of computer science (FOCS’86), Xi’an, China, pp 162–167
41. Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Pro-
ceedings of the 7th ACM-SIGKDD international conference on knowledge discovery and data mining,
pp 401–406
Author’s biographies
Ling Qiu is a Lecturer (academic level B) of IT in the School of Mathe-
matics, Physics and Information Technology at James Cook University,
Australia. He holds a PhD degree in Computer Science & Engineer-
ing awarded by Nanyang Technological University, Singapore in 2003.
He received an ME degree from Zhejiang University, China and a BE
degree from Huazhong University of Science and Technology, China.
His research interests include parallel and distributed computing, algo-
rithms, database systems and data mining, information systems and secu-
rity, and system simulations.
1 3
L. Qiu et al.
Yingjiu Li is currently an Assistant Professor in the School of Infor-
mation Systems at Singapore Management University, Singapore. He
received his PhDdegree in Information Technology fromGeorge Mason
University in 2003. His research interests include applications security,
privacy protection, and data rights management. He has published over
40 technical papers in the refereed journals and conference proceedings.
Yingjiu Li is a member of the ACM and the IEEE. The URL for his web
page is http://www.mysmu.edu/faculty/yjli/.
Xintao Wu is an Associate Professor in the Department of Software and
Information Systems at the University of North Carolina at Charlotte, the
USA. He got his PhD degree in Information Technology from George
Mason University in 2001. He received his BSdegree in Information Sci-
ence from the University of Science and Technology of China in 1994,
and an ME degree in Computer Engineering from the Chinese Acad-
emy of Space Technology in 1997. His major research interests include
data mining and knowledge discovery, data privacy and security, and
bio-informatics.
1 3
doc_241713779.pdf
Protecting Business Intelligence And Customer Privacy While Outsourcing Data Mining Tasks
Knowl Inf Syst
DOI 10.1007/s10115-007-0113-3
REGULAR PAPER
Protecting business intelligence and customer privacy
while outsourcing data mining tasks
Ling Qiu · Yingjiu Li · Xintao Wu
Received: 7 February 2007 / Revised: 4 July 2007 / Accepted: 20 September 2007
© Springer-Verlag London Limited 2007
Abstract Nowadays data mining plays an important role in decision making. Since many
organizations do not possess the in-house expertise of data mining, it is bene?cial to outsource
data mining tasks to external service providers. However, most organizations hesitate to do
so due to the concern of loss of business intelligence and customer privacy. In this paper,
we present a Bloom ?lter based solution to enable organizations to outsource their tasks of
mining association rules, at the same time, protect their business intelligence and customer
privacy. Our approach can achieve high precision in data mining by trading-off the storage
requirement.
1 Introduction
1.1 Background and motivation
The pervasive impact of business computing has made information technology (IT) an indis-
pensable part of daily operations and the key to success for organizations. Many organizations
This research was supported by the USA National Science Foundation Grants CCR-0310974 and
IIS-0546027.
L. Qiu (B)
School of Mathematics, Physics and Information Technology, James Cook University,
Townsville, QLD 4811, Australia
e-mail: [email protected]
Y. Li
School of Information Systems, Singapore Management University, Singapore 178902, Singapore
e-mail: [email protected]
X. Wu
Department of Software and Information Systems, University of North Carolina at Charlotte,
Charlotte, NC 28223, USA
e-mail: [email protected]
1 3
L. Qiu et al.
have accumulated large amount of data from various channels in today’s digitalized age. It
is important to make these data available for decision making. Data mining, as one of the IT
services needed by organizations, provides such a technique for the exploration and analysis
of these raw data so as to reveal hidden information and knowledge; it has also been realized
as an important way for discovering knowledge from the data and converting “data rich"
to “knowledge rich" so as to assist strategic decision making. The bene?ts of using data
mining for decision making have been demonstrated in various industries and governmental
sectors [7], e.g., banking, insurance, direct-mail marketing, telecommunications, retails, and
health care [32]. Among all of the available data mining methods, the discovery of associa-
tions between business events or transactions is one of the most commonly used techniques.
Association rule mining has been an important application in decision support and marketing
strategy [25] for an organization.
Let us consider a typical application scenario as follows. In an organization (e.g., an enter-
prise or a governmental sector), there are several divisions including an IT division which
provides ITservices for the whole organization. Afunctional division may have to delegate or
outsource its data mining tasks to the IT division due to the lack of IT expertise and powerful
computing infrastructure which are usually centrally managed by the IT division.
This scenario can be extended to a more general circumstance in which all divisions are
individually independent organizations (or companies). This is because in today’s fast-paced
business environment, it is impossible for any single organization to understand, develop,
and implement every IT needed. By outsourcing, an organization can obtain speci?c human
resources (e.g., skilled programming personnel) and technological resources (e.g., more pow-
erful computing infrastructure) for its needs of IT services (e.g., data analysis) with lower
costs [12]. It can also be extended to online scenarios, e.g., a distributed computing environ-
ment comprising of a center server and some edge servers.
The practice of outsourcing data mining tasks involves extensive collaboration (e.g.,
exchange or share of data) across different organizations. Either the raw data or the revealed
information after analysis contains the business intelligence (BI) and customer privacy of an
organization. There is a security concern of potential risk of exposing private information
in outsourcing activities [28]. Without proper security policy and technology, these priva-
cies could be very vulnerable to security breaches. Therefore, to protect BI and customer
privacy, it is urgent and critical to provide solutions from the perspectives of both legality or
regulation and technology [27]. In this paper, we focus on the technology-based solutions.
When outsourcing mining tasks,
1
we should protect the following three elements which may
expose BI and customer privacy: (1) the source data which contain all transactions and items;
(2) the mining requests which are itemsets of interests; and (3) the mining results which are
frequent itemsets and association rules.
There are various methods proposed to preserve privacy in data mining; but they cannot
protect all three elements simultaneously. This is because with these methods, when a ?rst
party
2
outsources its mining tasks to a third party,
3
it has to provide the source database
(which might be someway encrypted) together with some additional information (e.g., plain
text of mining requests) without which the mining tasks may not be carried out. Given this
situation, the existing methods are unable to ef?ciently prevent the exposure of private infor-
1
Without further speci?cation, we always refer to association rule mining tasks.
2
This is the party that outsouces its data mining tasks. It may be a functional division of an organization or
the center server in a distributed environment with client–server architecture.
3
This is the party that is authorized by the ?rst party to undertake the outsourced data mining tasks. It may be
the ITdivision of an organization or an edge server in a distributed environment with client–server architecture.
1 3
Outsourcing data mining tasks
mation to the third party, or unable to prevent the third party from (either intentionally or
unintentionally) deciphering (and possibly spreading) further information from the mining
results (which would be sent back to the ?rst party) with the additional information.
1.2 Overview of our paper
1.2.1 Our solutions
Given the situation discussed in the previous paragraphs, to protect BI and customer privacy
when outsourcing data mining tasks, we should control the direct access to the original data.
In other words, the third party should not be allowed to access the original transactional data
while performing mining tasks, and should not be able to interpret the mining results.
In this paper, we present a Bloom ?lter based approach which provides an algorithm
for privacy preserving association rule mining with computation ef?ciency and predictable
(controllable) analysis precision. The Bloom ?lter [10] is a stream (or a vector) of binary
bits. It is a computationally ef?cient and irreversible coding scheme that can represent a
set of objects while preserving privacy of the objects. With our approach, ?rstly the source
data are converted to Bloom ?lter representation and handed over to a third party together
with mining algorithms. Then the ?rst party sends its mining requests to the third party.
Mining requests are actually candidates of frequent itemsets which are also represented by
Bloom ?lters. Lastly, the third party runs the mining algorithms with source data and mining
requests, and comes out the mining results which are frequent itemsets or association rules
represented by Bloom ?lters. In the above mining process, what the ?rst party exposes to the
third party does not violate privacy [22]; that is, the third party would not be able to distill
down private information from Bloom ?lters. Therefore, all three elements aforementioned
are fully protected by Bloom ?lters.
The goal of protecting BI and customer privacy during outsoucing can be achieved by
Bloom ?lter because it satis?es simultaneously the following three conditions. First, transac-
tions containing different numbers of items are mapped to Bloom?lters with the same length.
This prevents an adversary from deciphering the compositions of transactions by analyzing
the lengths of transactions. Second, Bloom ?lters support membership queries. This allows
an authorized third party to carry out data mining tasks with only Bloom ?lters (i.e., Bloom
?lters of either transactions or candidates of frequent itemsets). Third, without knowing all
possible individual items in the transactions, it is dif?cult to identify what items are included
in the Bloom ?lter of a transaction by counting the numbers of 1’s and 0’s. This is because
the probability of a bit in a Bloom ?lter being 1 or 0 is 0.5 given that the parameters of the
Bloom ?lter are optimally chosen (see Sect. 3 for details).
1.2.2 Main contributions
A condensed (5-page) version of our study has been published in [34]. In this version, we
present in details the mathematical analysis model for problem formulation and error rate
estimation, the analysis of the mining algorithm, and the results and analysis of experiments.
In brief, our main contributions include the following:
• We propose an approach allowing to outsource data mining tasks while protecting BI and
customer privacy. It prevents third parties from interpreting customer privacy from the
source data and from deciphering BI from the mining results.
• We present solid theoretical analysis for our approach. Our analysis shows that a tradeoff
can be made between storage requirement and mining precision.
1 3
L. Qiu et al.
• We test the proposed approach rigorously on both real and synthetic datasets. Our exper-
imental results show that our approach is both effective and ?exible for practical usage.
1.2.3 Organization of the paper
The remaining sections are organized as follows. We review the related work in Sect. 2.
We revisit the basics of Bloom ?lters and formulate our data mining problem in Sect. 3. In
Sect. 4, we present theoretical analysis for the formulated problems and estimate the errors
in data mining. We also present a method of using multiple groups of Bloom ?lters to further
reduce data mining errors. Following that in Sect. 5, we develop a multi-phased algorithm
and conduct a set of empirical simulations to verify our approach. Lastly we conclude our
paper in Sect. 6 with a discussion of contribution, limitation, and future research directions.
2 Literature review
Association rule mining has been an active research area since its introduction [2]. Many
algorithms have been proposed to improve the performance of mining association rules or
frequent itemsets. An interesting direction is the development of techniques that incorporate
privacy concerns.
One type of these techniques is perturbation based, which perturbs the data to a certain
degree before data mining so that the real values of sensitive data are obscured while non-
sensitive statistics on the collection of data are preserved. In an early work [5] a perturbation-
based approach was proposed for decision tree learning. This approach was further studied in
[1, 19, 23, 24]. Some recent work [6, 8, 9, 11, 14, 15, 29–31, 36, 37, 39] investigates the tradeoff
between leakage of private information and accuracy of mining results. In [8, 11], the authors
considered the problem of limiting disclosure of sensitive rules, aiming at selectively hiding
some frequent itemsets from large databases with as little impact on other, non-sensitive
frequent itemsets as possible. The idea is to modify a given database so that the support
of a given set of sensitive rules decreases below a predetermined threshold. Similarly, in
[37] a method is presented for selectively replacing individual values with unknowns from a
database to prevent the discovery of a set of rules, while minimizing the side effects on non-
sensitive rules. In [9, 31], the authors studied the impact of hiding strategies on an original
data set by quantifying how much information is preserved after sanitizing the data set. In
[6, 15, 36], researchers also studied the problemof mining association rules fromtransactions
in which the data has been randomized to preserve the privacy of individual transactions.
One problem of perturbation-based approach is that it may introduce some false association
rules. Another drawback of this approach is that it cannot always fully preserve data privacy
while achieving high mining precision [23, 24].
The second type is distributed privacy preserving data mining [13, 21, 26, 33, 38] based on
secure multi-party computation [40]. Though this approach can preserve privacy, it works
only in distributed environment (with several parties to collaborate in mining process) and
needs sophisticated protocols (secure multi-party computation based), which makes it infea-
sible for our scenario.
Both types of techniques are designed to protect privacy by masquerading the source data,
and cannot protect privacy distillable from the mining requests and results accessible to data
miners.
Related, but not directly relevant to our work, is the research in outsourced databases
[3, 16–18, 20] and most recently in data con?dentiality [35]. Hacigumus et al. [16–18, 20]
1 3
Outsourcing data mining tasks
explored a new paradigm for data management in which a third party service provides host
database as a service. They proposed several encryption techniques to process as many que-
ries as possible at the service providers’ site without having to decrypt the data. Agrawal
et al. [3] presented an order-preserving encryption scheme for numeric data that allows com-
parison operations to be directly applied on encrypted data. However, encryption is time
consuming and it may require auxiliary indices. It is only designed for certain type of queries
but may not be suitable for complex tasks such as association rule mining. Most recently Ra´ s
et al. [35] presented a generalized strategy to reduce a disclosure risk of con?dential data by
hiding some attributes of the source data. Hiding (or replacing) some attributes may break
the integrity of source data and thus the mining results may not be meaningful if it is applied
to our application scenario.
3 Problem formulation
Our research question is how to outsource the association rule data ming tasks, at the same
time, protect BI and customer privacy. We propose a Bloom?lter based approach with which
an authorized third party (e.g., an edge server or an IT division as mentioned earlier) will
perform the association rule mining based on the dataset transformed by Bloom Filters and
report the frequent itemsets with controllable error boundary.
In this section, we ?rst brie?y review the basics of Bloom ?lter, and then present the
mathematic formalization for our problem.
3.1 Bloom ?lter revisited
A Bloom ?lter is a simple, space-ef?cient, randomized data structure for representing a set
of objects so as to support membership queries.
De?nition 3.1 Given an n-element set S = {s
1
, . . . , s
n
} and k hash functions h
1
, . . . , h
k
of range m, the Bloom ?lter of S, denoted as B(S), is a binary vector of length m that is
constructed by the following steps: (i) every bit is initially set to 0; (ii) every element s ? S is
hashed into the bit vector through the k hash functions, and the corresponding bits h
i
(s) are
set to 1.
4
A Bloom ?lter function, denoted as B(·), is a mapping from a set (not necessarily
n-element set) to its Bloom ?lter.
For membership queries, i.e., whether an item x ? S, we hash x to the Bloom ?lter of S
(through those hash functions) and check whether all h
i
(x) are 1’s. If not, then clearly x is not
a member of S. If yes, we say x is in S although this could be wrong with some probability.
De?nition 3.2 For an element s and a set S, de?ne s ?
B
S if s hashes to all 1’s in the Bloom
?lter of S, and s / ?
B
S otherwise. The false positive rate of the Bloom ?lter of S is de?ned
as the probability of s ?
B
S while s / ? S, or Pr(s ?
B
S | s / ? S).
Assuming that all hash functions are perfectly random, we have the following
Lemma 3.3 Given an n-element set S = {s
1
, . . . , s
n
} and its Bloom ?lter B(S) of length m
constructed from k hash functions, the probability for a speci?c bit in B(S) being 0 is
p
0
=
_
1 ?
1
m
_
kn
? e
?kn/m
4
A location can be set to 1 multiple times, but only the ?rst change has an effect.
1 3
L. Qiu et al.
and the probability for a speci?c bit being 1 is
p
1
= 1 ? p
0
? 1 ?e
?kn/m
Then the false positive rate of B(S) is
f = p
k
1
?
_
1 ?e
?kn/m
_
k
(1)
Given n and m, f is minimized when p
0
= p
1
= 0.5 and k =
m
n
ln 2, in which case
f = 1/2
k
= (0.6185)
m/n
.
What’s missing from traditional Bloom ?lter? Traditional Bloom ?lters cannot solve the
privacy issue because anybody knowing the hash function can derive the original itemsets
based on Bloom Filters.
To preserve the privacy of Bloom ?lters, we propose keyed Bloom ?lter by augmenting
the hash functions h
i
with a secret key K. To represent set S, an element s ? S is inserted
into Bloom ?lter B by setting the corresponding bits h
i
(s ? K) in B to 1, where ? represents
concatenation. Also, to query whether an item x ? S, we check whether all h
i
(x ? K) bits are
set to 1. Without knowing the secret key, one is unable to derive the original set by examining
a Bloom ?lter. Without further mention we always assume that Bloom ?lters in our paper are
constructed with secret keys.
3.2 Our problem
Fundamentally we want to provide a solution for privacy preserving frequent itemsets min-
ing. The frequent itemset mining has been a common task in many data mining projects
for the past decade. From frequent itemsets, one can easily derive all association rules. The
mining of frequent itemsets of association rules has a wide range of applications in many
areas, from the analysis of customer preferences to DNA patterns.
For market basket data, we de?ne each transaction, such as a list of items purchased, as a
subset of all possible items.
De?nition 3.4 Let I = {I
1
, . . . , I
d
} be a set of d boolean variables called items. Let data-
base D be a set of transactions T
1
, T
2
, . . . , T
N
where each transaction T
i
is a set of items
such that T
i
? I. The support of an itemset S over I, denoted support (S), is de?ned as the
number of the transactions that contain S. The frequency of an itemset S, denoted freq(S),
is de?ned as support (S)/N.
Problem 1 Traditional problem: frequent itemsets mining. Mathematically, given a trans-
action database D over I and a threshold ? ? [0, 1], traditional research focuses on ?nding
all frequent itemsets FS ? 2
I
such that freq(FS) ? ?.
Our idea is to transform transaction database to a collection of Bloom ?lters to preserve
the privacy in frequent itemset mining. Each transaction T
i
? T is transformed to Bloom
?lter B(T
i
) of size m using k hashed functions. To preserve the privacy of items I
i
, we assume
that the mining process is done on the (keyed) Bloom ?lters B(I
i
) of the items rather than
on the items themselves.
Problem 2 Our research problem: privacy preserving frequent itemsets mining. Given (i)
a collection of Bloom ?lters {B(T
1
), . . . , B(T
N
)} for transaction database D over I, (ii) a
set of Bloom ?lters {B(I
1
), . . . , B(I
d
)} for items in I, and (iii) a threshold ? ? [0, 1], ?nd
all Bloom ?lters B(FS) of itemsets FS ? 2
I
such that freq(FS) ? ?.
1 3
Outsourcing data mining tasks
Normally without knowing the secret key, the third party, which may not be fully trusted,
will be unable to interpret sensitive and private information either from the contents of dat-
abases or from mining results. However, we still need to handle some extreme cases for the
protection of BI and customer privacy. Case 1, some transactions may contain only one item.
The number of 1’s in these Bloom ?lters are no more than but very close to k. Therefore, the
outsourced database may divulge partial individual items. To prevent such divulgence, one
of the solutions is to insert several virtual items as white noise to those transactions in which
item numbers are smaller than a threshold. Case 2, candidates of frequent 1-itemsets are
exactly individual items which may divulge some sensitive information. To prevent this, we
extend the usage of secret key by inserting several virtual items, denoted by k
1
, k
2
, . . . , k
i
,
into all transactions before outsoucing. At the same time, each candidate of frequent 1-item-
sets is inserted with a virtual item randomly chosen from k
1
to k
i
and is sent to the third
party together with mining requests (see Sect. 5.1 for detailed mining process). Thus the
edge servers cannot easily identify candidates of frequent 1- and 2-itemsets because both
types of candidates look alike. This method can be applied to conceal candidates of frequent
2-, 3-, …, and k-itemsets. Moreover, this operation can be done before or after the mixing of
white noise discussed in case 1.
4 Analysis
In this section, we analyze the possible error rates introduced by mining the frequent itemsets
based on Bloom ?lters instead of the original dataset. For any given itemset, the frequency
learnt from Bloom ?lters may be larger than its real frequency learnt from original transac-
tions due to the false positive of Bloom?lters. We make this clear in the following analysis. By
default, we assume that for any itemset, there is a Bloom ?lter function B(·) which produces
binary vector of length m through k hash functions.
4.1 Preliminaries
The false positive of a Bloom ?lter was de?ned for checking an element from a Bloom ?lter.
Now we extend the concept of false positive to checking an itemset from a Bloom ?lter.
De?nition 4.1 Given an itemset S and a transaction T
i
, de?ne S ?
B
T
i
if for all items s ? S,
s ?
B
T
i
, and de?ne S
B
T
i
otherwise. The false positive rate for checking S from the
Bloom ?lter of T
i
, denoted as f
i
, is de?ned as the probability of S ?
B
T
i
while S T
i
, or
Pr(S ?
B
T
i
| S T
i
).
Due to the false positive of checking an itemset from a Bloom ?lter, the support or fre-
quency learnt from a collection of Bloom ?lters is different from that learnt from original
transactions. Regarding such support and frequency, we have the following
De?nition 4.2 Given a collection of N Bloom ?lters {B(T
1
), . . . , B(T
N
)} for transaction
database D over I, the support of an itemset S ? 2
I
that is learnt from the collection of ?l-
ters, denoted as Bsupport (S), is de?ned as the number of ?lters B(T
i
) that satisfy S ?
B
T
i
.
The frequency of S that is learnt fromthe collection of ?lters, denoted as Bfreq(S), is de?ned
as Bsupport (S)/N.
Lemma 4.3 In the setting of Definition 4.2, the following statements hold: (i) S ?
B
T
i
iff B(S) ? B(T
i
) = B(S), where ? is bitwise AND. (ii) If S ? T
i
, then S ?
B
T
i
. (iii)
1 3
L. Qiu et al.
If S T
i
, then S ?
B
T
i
with probability f
i
, and S
B
T
i
with probability 1 ? f
i
. (iv)
Bfreq(S) ? freq(S).
Theorem 4.4 Given an itemset S and a transaction T
i
, the false positive rate of checking S
from the Bloom ?lter of T
i
is
f
i
=
_
1 ?e
?kn
i
/m
_
||B(S?T
i
)||
where n
i
= |T
i
| is the length of transaction T
i
in terms of the number of items, and || · ||
indicates the number of 1’s in a binary vector.
Proof From Eq. (1), one can derive that the false positive rate for checking any single item
s ? S ?T
i
is p
k
1
, where p
1
=
_
1 ?e
?kn
i
/m
_
is the probability that a speci?c bit is 1 in B(T
i
)
and k is the number of bits to which the item is hashed. From Lemma 4.3, we know that
S ?T
i
?
B
T
i
. Since the items in S ?T
i
are hashed to ||B(S ?T
i
)|| bits all together, the false
positive rate f
i
for checking S from B(T
i
) is p
||B(S?T
i
)||
1
=
_
1 ?e
?kn
i
/m
_
||B(S?T
i
)||
.
Corollary 4.5 Given an itemset S and a transaction T
i
, the false positive rate f
i
of checking
S from the Bloom ?lter of T
i
is bounded:
_
1 ?e
?kn
i
/m
_
||B(S)||
? f
i
?
_
1 ?e
?kn
i
/m
_
k
Proof According to the definition of false positive, we have S T
i
and thus 1 ? ||B(S ?T
i
)
|| ? ||B(S)||. Combining this with Theorem 4.4, we have
_
1 ?e
?kn
i
/m
_
||B(S)||
? f
i
?
_
1 ?e
?kn
i
/m
_
k
.
4.2 False positive and false negative
In our data mining problem, an outsourced server has access to the Bloom ?lters {B(T
i
)} of
all transactions. However, it has no access to the original data. Therefore, given a Bloom?lter
B(S), the server cannot compute the frequency freq(S) directly. The approaches to solving
the traditional frequent itemset mining problemcannot be applied directly to solving our data
mining problem.
According to Lemma 4.3 (i), the frequency Bfreq(S) can be derived from those Bloom
?lters. Our solution is to ?nd all Bloom ?lters B(S) such that Bfreq(S) ? ?
where ?
is a
revised threshold. Note that freq(S) ? ? is required in our data mining problem; thus, we
must ensure that
_
B(S) : Bfreq(S) ? ?
_
{B(S) : freq(S) ? ?}. Because the two sets are
not necessarily the same, we need to de?ne the false positive rate and false negative rate for
checking an itemset from all Bloom ?lters.
De?nition 4.6 Given an itemset S and N Bloom ?lters B(T
i
), i = 1, . . . , N, the false
positive rate for checking S from all Bloom ?lters using revised threshold ?
? ?, denoted as
f
+
?
, is de?ned as the probability of Bfreq(S) ? ?
while freq(S) < ?, or Pr
_
Bfreq(S) ? ?
|
freq(S) < ?). The false negative rate for checking S fromall Bloom?lters using?
, denotedas
f
?
?
, is de?ned as the probability of Bfreq(S) < ?
while freq(S) ? ?, or Pr
_
Bfreq(S) < ?
|
freq(S) ? ?).
If the revised threshold ?
is the same as the threshold ?, then the false negative rate will
be zero due to the fact Bfreq(S) ? freq(S) (Lemma 4.3 (iv)); however, the false positive
rate may be greater than zero in this case. In general, one may use ?
? ? so as to balance
1 3
Outsourcing data mining tasks
false negative rate and false positive rate. The higher the ?
, the higher the false negative rate,
and the lower the false positive rate. If the false negative rate is of major concerns, one may
choose ?
= ? to zero out false negative rate. Note that choosing ?
< ? is meaningless as it
only increases the false positive rate without decreasing the false negative rate which is zero.
To formalize our analysis, we de?ne a random variable for checking an itemset against a
Bloom ?lter. Then we re-write in terms of the de?ned variables the frequency of an itemset
learnt from Bloom ?lters and the false positive/negative rates for checking an itemset from
all Bloom ?lters.
De?nition 4.7 For an itemset S and a transaction T
i
such that S T
i
, de?ne a random 0–1
variable e
i
such that e
i
= 1 if S ?
B
T
i
and e
i
= 0 if S
B
T
i
.
The de?ned variable indicates whether S ?
B
T
i
; that is, e
i
= 1 with probability f
i
and
e
i
= 0 with probability 1 ? f
i
(see Lemma 4.3). In other words, e
i
represents a Bernoulli
trial with probabilities f
i
of success and 1 ? f
i
of failure. Without loss of generality, we can
assume that S T
i
for the ?rst N · (1 ?freq(S)) transactions T
i
. Then we have
Bfreq(S) = freq(S) +
1
N
·
N(1?freq(S))
i =1
e
i
(2)
Lemma 4.8 Let s
e
be the sum of N · (1 ?freq(S)) random 0-1 variables e
i
de?ned for an
itemset S. The false positive and false negative rates for checking S from all Bloom ?lters
using a revised threshold ?
? ? are
f
+
?
= Pr
_
s
e
? N
_
?
?freq(S)
_
| freq(S) < ?
_
f
?
?
= Pr
_
s
e
< N
_
?
?freq(S)
_
| ? ? freq(S) < ?
_
4.3 Estimate of false positive and false negative
We estimate the false positive and false negative rates in the case of n
i
?
= n, where n
i
is the
size of each transaction, and n =
1
N
N
i =1
n
i
is the average size of N transactions. Note that
if n
i
n, the transaction data can be clustered into multiple groups such that in each group,
the size of each transaction is equal or close to the average size of the transactions in the
group. The data mining task can be easily extended to each group. For simplicity, we assume
that n
i
= n in our analysis; we leave the multi-group case in Sect. 4.4.
Let k =
m
n
ln 2 be the optimal number of hash functions that are used for generating
the Bloom ?lter of length m for each transaction, which consists of n items. According to
Lemma 3.3 and Corollary 4.5, the false positive rate f
i
for checking an itemset S from a
Bloom ?lter has a lower bound and an upper bound:
2
?||B(S)||
? f
i
? 2
?k
(3)
First consider a special case where f
i
=
¯
f for all i = 1, . . . , N. In this case s
e
is the
sum of N · (1 ? freq(S)) independent Bernoulli trials with probabilities
¯
f for success and
1?
¯
f for failure. Let b(?, ?,
¯
f ) =
?!
?!(???)!
¯
f
?
(1?
¯
f )
???
be the probability that ? Bernoulli
trials with probabilities
¯
f for success and (1 ?
¯
f ) for failure result in ? successes and ? ??
failures (? ? ?). Let C(?, ?,
¯
f ) =
?
i =?
b(i, ?,
¯
f ) be the cumulative binomial probability
of having at least ? successes in ? trials.
5
Then the false positive and false negative rates for
5
Function C is a standard function provided in many commercial software packages such as Matlab.
1 3
L. Qiu et al.
checking an itemset S from all Bloom ?lters are
f
+
?
(
¯
f ) = C
_
N ·
_
?
?freq(S)
_
, N · (1 ?freq(S)) ,
¯
f
_
, where 0 < freq(S) < ? (4)
f
?
?
(
¯
f ) = 1 ?C
_
N ·
_
?
?freq(S)
_
, N · (1 ?freq(S)) ,
¯
f
_
, where ? ? freq(S) < ?
(5)
Since the cumulative binomial probability C(?, ?,
¯
f ) is monotonic increasing with
¯
f ,
from formulae (3) to (5) it is easy to know the lower bounds and upper bounds for the false
positive rate and false negative rate in general case:
f
+
?
_
2
?||B(S)||
_
? f
+
?
? f
+
?
_
2
?k
_
, where 0 < freq(S) < ?
f
?
?
_
2
?k
_
? f
?
?
? f
?
?
_
2
?||B(S)||
_
, where ? ? freq(S) < ?
In a special case where ?
= ?, we have f
?
?
= 0 and
C
_
A, B, 2
?||B(S)||
_
? f
+
?
? C
_
A, B, 2
?k
_
(6)
where A = N · (? ?freq(S)) and B = N · (1 ?freq(S)).
Equation (6) indicates that the greater the number k of hash functions, the smaller the false
positive and false negative rates. In other words, the longer the Bloom ?lters, the smaller the
false positive and false negative rates. To further understand this, we compare Bfreq(S) with
freq(S) in the average case. Let E[·] denote the mean of a random variable. From Eq. (2),
we have
E
_
Bfreq(S)
_
= freq(S) +
1
N
· E[s
e
] = freq(S) +
1
N
·
N(1?freq(S))
i =1
f
i
Recall the bounds for f
i
(see Eq. 3). In the false positive case (where freq(S) ? ?), we
have
N · (1 ??) · 2
?||B(S)||
? E[s
e
] ? N · 2
?k
Similarly, in the false negative case (where ? ? freq(S) < ?
), we have
N · (1 ??
) · 2
?||B(S)||
? E[s
e
] ? N · (1 ??) · 2
?k
The above three equations imply that the average value of Bfreq(S) is greater than freq(S),
but the difference is bounded. Note that ||B(S)|| ? k. We have
E
_
Bfreq(S)
_
?freq(S) ? (1 ??) · 2
?k
< 2
?k
(7)
The longer the Bloom ?lters (i.e., the greater the k), the smaller the difference between
Bfreq(S) and freq(S). If the length of Bloom ?lters increases linearly, then the difference
of frequencies decreases exponentially. For example, if k ? 20, then the difference of fre-
quencies will be less than 10
?6
. This means that in the average case, the frequency of an
itemset detected from Bloom ?lters will be greater than that detected from original data by
at most 10
?6
. Note that the above analysis is conducted for each itemset. The overall false
positive and false negative rates depend on the distribution of itemsets and their frequencies.
Empirical study will be conducted in Sect. 5.
1 3
Outsourcing data mining tasks
4.4 Multiple groups of Bloom ?lters
The estimate of false positive and false negative rates is based on the assumption that the size
of each transaction is equal or close to the average size of all transactions. This might not
be true in many real data sets. A simple solution is to cluster the transactions into multiple
groups such that in each group, the size of each transaction is equal or close to the average
size of the transactions in the group. Across all groups, the same number k of hash functions
are used for generating Bloom ?lters. In each group, k =
m
n
ln 2 is optimized for the length
m of Bloom ?lters and the average size n of the transactions in the group. This means that
for different groups, the length of Bloom ?lters are different depending on the average size
of transactions in the group. Roughly speaking, long Bloom ?lters will be used for long
transactions, while short Bloom ?lters for short transactions.
Using different Bloom?lters for different groups of transactions will not only save storage
requirement but also increase the precision of data mining. Note that the bounds presented
in the previous section are based on k. Since the same k is optimized across multiple groups,
the bounds can be used to estimate the false positive and false negative rates in all groups.
In the case of multiple groups, the transactions are represented by the Bloom ?lters of
different lengths. This requires that each candidate itemset be represented by Bloom ?lters
of different lengths too. This is because in our data mining problem, the Bloom ?lter of each
candidate itemset needs to be checked against the Bloom ?lters of the same length. To avoid
transmitting multiple Bloom ?lters for each candidate itemset between client and server,
only the longest Bloom ?lter of each candidate itemset is sent to the server at one time. In
data mining process, the outsourced server needs to transform it into different lengths so as
to check it against different groups of Bloom ?lters. We provide a simple solution called
?-folding to transform the Bloom ?lters.
De?nition 4.9 Given an m-bit Bloom ?lter B, ?-folding of B, denoted as B
?
, is de?ned as a
?m-bit vector generated from B: for 0<i ??m, bi t (B
?
, i )=bit(B, i )
_
0<j ?m
j =i mod ?m
bi t (B, j ),
where bi t (B, i ) denotes the i
th
bit of B, ? the bitwise OR, and 0 < ? ? 1.
Example 1 Let B be a 10-bit Bloom?lter. B
0.7
is de?ned as a 7-bit vector B
0.7
: bi t (B
0.7
, i ) =
bi t (B, i ) ? bi t (B, i +7) for i = 1, 2, 3, and bi t (B
0.7
, i ) = bi t (B, i ) for i = 4, 5, 6, 7.
From the definition of Bloom ?lter, it is clear that vector B
?
is a Bloom ?lter generated
with the same k hash functions which are used for generating Bloom ?lter B.
Given a Bloom ?lter B(S) of longest length m of candidate itemset S, the frequency
Bfreq(S) is computed by checking B(S) against the Bloom ?lters of transactions in all
groups. For a particular group in which the Bloom ?lters have length m
? m (note that
m is the longest length for all groups), the server applies
m
m
-folding to B(S) such that the
transformed Bloom ?lter has the length m
.
5 Experiments
To evaluate the performance of our Bloom ?lter based method for mining frequent itemsets,
we conduct experiments based on both synthetic data and real data. A framework of our
method is shown in Algorithm 1.
1 3
L. Qiu et al.
5.1 Algorithm
Algorithm 1 can be divided into three phases: counting phase (lines 3–5), pruning phase
(lines 6–8), and candidates generating phase (lines 9–10) in each round , where indicates
the size of each candidate itemset dealt with. In the counting phase, each candidate ?lter is
checked against all transaction ?lters and the candidate’s count is updated. In the pruning
phase, any Bloom ?lter is eliminated from the candidate set if its count (i.e., Bsupport) is less
than a revised threshold N · ?
. Finally, in the candidates generating phase, new candidate
Bloom ?lters are generated from the Bloom ?lters discovered in the current round. The new
candidates will be used for data mining in the next round.
Algorithm 1 Mining frequent itemsets from Bloom ?lters
1: C
1
=
_
B(I
1
), . . . , B(I
d
)
_
// B(I
i
) is the Bloom ?lter of item I
i
2: for ( = 1; C
= ?; ++) do
3: for each B(S) ? C
and each transaction ?lter B(T
i
) do
4: if S ?
B
T
i
then Bsupport(S) ++ // S ?
B
T
i
iff B(S) ? B(T
i
) = B(S)
5: end for
6: for each B(S) ? C
do
7: if Bsupport (S) < N · ?
then delete B(S) from C
// ?
is the revised threshold in data mining
8: end for
9: F
= C
// F
is the collection of Bloom ?lters of all “frequent" itemsets with length
10: C
+1
= can_gen(F
) // generate ?lters of candidate itemsets for the next round
11: end for
12: Answer =
_
F
// all ?lters of frequent itemsets
5.1.1 Ef?cient counting
To improve the ef?ciency in the counting phase, we organize the Bloom ?lters of the trans-
actions of each group in a tree hierarchy and use every q bits to partition them at different
levels, where q is a parameter. For example, at the root level, the partition leads to 2
q
child
nodes; the Bloom ?lters in each node share the same ?rst q bits. A node splits if it contains
more than c Bloom ?lters, where c is another parameter. At the end of partition, each leaf
node contains limited number of Bloom ?lters, while each non-leaf node (except the root) is
associated with a q-bit segment with which the node shares.
Because of the randomness of keyed hash functions, the distribution of Bloom ?lters is
uniform,
6
which implies that the tree is well balanced. Therefore, an L-level tree can be used
to index up to c · 2
qL
Bloom ?lters. Given q = 5 and c = 20, for example, a 4-level tree can
be used to index 20M Bloom ?lters.
Heuristic 5.1 Let s be the q-bit segment associated with a non-leaf node and B(S) be a
Bloom ?lter of candidate itemset S. If any bit in s is 1 while the corresponding bit in B(S) is
0, then no Bloom ?lter in the subtree rooted at the non-leaf node needs to be checked in the
counting phase.
In the counting phase, we traverse the tree to compare each candidate ?lter with the trans-
action ?lters stored in the leaf nodes and update the count of the candidate ?lter appropriately.
According to the above heuristic, we may skip some subtrees in the counting process.
6
Using the optimal number of hash functions, each bit in a Bloom ?lter has the equal probability of being 1
and 0.
1 3
Outsourcing data mining tasks
An alternative way to do this is to organize candidate ?lters in a tree structure and update
their counts appropriately while traversing the tree for each transaction ?lter. In multiple
group case, this solution requires that the tree of candidate ?lters be built differently for
each group, while in the above method a static tree of transaction ?lters can be used for any
candidate ?lters.
5.1.2 Candidates generating
To support interactive data mining and discover all frequent itemsets as required in Problem2,
a multiple step interaction can be conducted between client and server in candidates gener-
ating phase. The client provides the server with a set of candidate ?lters C
; after the server
sends back the mining result F
, the client generates another set C
+1
of candidate ?lters and
sends it to the server for data mining. This process can be done repetitively in each round.
As mentioned in Sect. 3.2, with C
1
(i.e., Bloom ?lters of individual items) an edge server
may decipher partial sensitive data (e.g., the compositions of transactions). Therefore, for
the concern of protecting BI and preserving customer privacy, it is advisory not to outsource
frequent 1-itemset mining tasks, or to use alternative choice of concealing C
1
discussed in
Sect. 3.2.
The candidate generation C
+1
= can_gen(F
) in this case is conducted at client side
for privacy reasons. The Bloom ?lters in F
are transformed back to itemsets with the help
of secret key. From this collection of itemsets, the client generates a new set of candidate
itemsets using the well-known method apri ori _gen as proposed in [4]. The basic idea of
apri ori _gen is that a candidate itemset of length + 1 is generated only if all its subsets
of length appear in the collection of itemsets. The client may also edit the set of candi-
dates according to application requirements and constraints. Finally, the client transforms
the candidate itemsets to Bloom ?lters and sends them back (in C
+1
) to the server. All of
our experiments presented in the next section are based on this scenario.
Another choice to perform candidate generation is at server side. However, the server has
no secret key to perform the hash functions, so it cannot transform back and forth between
Bloom ?lters and itemsets. A possible solution is to use C
+1
={B(S
1
) ? B(S
2
) : B(S
1
),
B(S
2
) ? F
} as candidate set for the next round. It is easy to verify that B(S
1
) ? B(S
2
) is
the Bloom ?lter of itemset S
1
? S
2
; therefore, this solution generates all Bloom ?lters of the
itemsets that are unions of any two frequent itemsets (clearly, no frequent itemset is missed
in this process). The disadvantage of this solution is that the server cannot exploit Apriori
property using apri ori _gen at itemset level.
5.2 Experiment settings
Rigorous experiments are conducted on both real data and synthetic data so as to evaluate the
performance of our Bloom ?lter method in terms of mining precision, storage requirement,
and computation time. All the experiments are run on a Compaq desktop computer with
Pentium-4 CPU clock rate of 3.00GHz, 3.25GB of RAM, and 150GB harddisk, running
Microsoft Windows XP Professional Version 2002 with SP2.
5.2.1 Real data
The real data sets we adopt for this experiment are BMS-POS, BMS-WebView-1, and BMS-
WebView-2 available at http://www.ecn.purdue.edu/KDDCUP. The dataset BMS-POS con-
tains several years worth of point-of-sale data from a large electronics retailer, whereas the
1 3
L. Qiu et al.
Fig. 1 Distribution of
transaction sizes
0
20
40
60
80
100
5 10 15 20 25 30 35
Transaction size
F
r
e
q
u
e
n
c
y
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Synthetic
datasets BMS-WebView-1 and BMS-WebView-2 contain several months worth of click-
stream data from two e-commerce websites.
5.2.2 Synthetic data
We generate synthetic data using the transaction generator designed in IBM Quest project
[4]. A synthetic dataset contains 100–750K transactions; each transaction is generated from
a set of 1,000 frequent itemsets. The size of each frequent itemset is picked from Poisson
distribution with mean 4. There are totally 1,000 distinct items. The size of each transaction
is picked from Poisson distribution with mean 10. We use four sets of synthetic data, namely
Syn1, Syn2, …, and Syn4.
5.2.3 Distributions of real data and synthetic data
The distributions of transaction sizes for both real data and synthetic data are shown in
Fig. 1. According to [41], the exponential-like distributions of transaction sizes are typical in
many real data sets used for mining association rules, while the synthetic data generated by
IBM generator with Poisson distribution have been most widely used for testing the scalabil-
ity of association rule mining algorithms. We use the real data to test the mining precision,
storage requirement, computational time, and the tradeoffs among them. We use the synthetic
data for scalability test. We show that our method works well for both types of data.
5.2.4 Mining precision
For mining precision, we compare the result of our method with the solution to the traditional
problem (see Problem 1). We use the standard association rule mining algorithm Apriori [4]
to discover exactly all frequent itemsets for the traditional problem. Given a revised threshold
?
in our method, the mining precision is measured in terms of overall false positive rate F
+
?
and overall false negative rate F
?
?
. Let F be the set of Bloom?lters of those frequent itemsets
discovered by Apriori, and F
be the set of Bloom ?lters discovered by our algorithm. Then
• Overall false positive rate: F
+
?
=
|F
?F|
|F
|
.
• Overall false negative rate: F
?
?
=
|F?F
|
|F
|
.
Note that the overall false positive and false negative rates are not exactly the false positive
and false negative rates discussed in Sect. 4.3. Given a data set and a data mining threshold,
the false positive and false negative rates studied in Sect. 4.3 are the theoretical probabilities
particular to each itemset, while the overall false positive and false negative rates are the
experimental results for the whole data set. Statistically, one can expect a positive correlation
between the two measurements.
1 3
Outsourcing data mining tasks
Table 1 Experiment parameters and their default values
Parameter Meaning default value
k Number of hash functions used for generating Bloom ?lters 30
? Frequency threshold for traditional data mining problem 1%
? Coef?cient for revised threshold ?
= ? +? · 2
?k
0
g Number of groups 4
5.2.5 Storage requirement
The storage requirement is measured in terms of the length of Bloom ?lters and the num-
ber of transactions that an outsourced server needs to store. If there are multiple groups of
transactions represented by Bloom ?lters of different lengths, then the storage requirement
is the sum of the product of Bloom ?lter length and number of transactions in each group.
Let there be g groups. For each group i , let the length of Bloom ?lters be m
i
and the number
of transactions N
i
. Then
Storage requirement =
g
i =1
m
i
N
i
.
5.2.6 Experiment parameters
Table 1 gives the parameters used in our experiments as well as their default values. In our
experiments, unless otherwise indicated, only one parameter is changed at a time while others
are kept at their default values.
The number k of hash functions is the optimal parameter to determine the length m
i
of
Bloom ?lters in each group i based on the average length n
i
of transactions in the group; that
is, k =
m
i
n
i
ln 2. The same k is used across all groups.
Recall that for any itemset, the average difference of the revised frequency and the original
frequency is upper-bounded by 2
?k
(see Eq. 7). We use the coef?cient ? (where 0 ? ? ? 1)
to calculate the revised frequency threshold ?
= ? + ? · 2
?k
. This coef?cient is used to
balance between the overall false positive and false negative rates.
When the transactions are divided into multiple groups, the grouping is based on the dis-
tribution of transaction size. In our experiments, the grouping is carried out by the following
method. Let n
1
be the average length of transactions in a dataset. The transactions are ?rstly
divided into two groups such that Group 1 contains those transactions whose lengths are
not greater than n
1
and otherwise for Group 2. Likewise, Group 2 is further divided into
two groups. The grouping operation is continued until the original dataset is divided into g
groups.
5.3 Experimental results
5.3.1 Data mining time versus data conversion time
In the ?rst set of experiments, we study the data mining time versus the data conversion time
by changing the number k of hash functions from 25 to 40.
Figure 2 shows that the time of mining frequent itemsets is much more than the time of
converting raw data to Bloom ?lter representation, the more transactions, the more mining
1 3
L. Qiu et al.
Fig. 2 Data mining time versus
data conversion time
0.0
6.0
12.0
18.0
24.0
30.0
25 30 35 40
Number of hash functions
R
e
l
a
t
i
v
e
r
u
n
n
i
n
g
t
i
m
e
(
m
i
n
i
n
g
/
c
o
n
v
e
r
s
i
o
n
)
BMS-WebView-1 BMS-WebView-2 BMS-POS
0
4
8
12
16
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
0
1.5
3
4.5
6
0.5 0.75 1.5
Frequency Threshold (%)
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
BMS-WebView-1
BMS-WebView-2
BMS-POS
2 1 0.5 0.75 1.5
Frequency Threshold (%)
2 1
Fig. 3 Changing mining thresholds where g = 4, k = 30, and ? = 0
time. This result reveals that the mining process takes the major part of running time. It veri-
?es the worthiness of pre-processing (i.e., data format conversion before outsourcing mining
tasks).
5.3.2 Change mining thresholds
In the second set of experiments, we study the mining precision and running time by changing
the frequency threshold ? from 0.5 to 2%. Other parameters are kept their default values, i.e.,
g = 4, k = 30, and ? = 0.
Figure 3 shows that the false positive rate is less than 6% for datasets BMS-WebView-1
and BMS-WebView-2, and less than 1% for datatset BMS-POS. For all datasets, the running
time is decreasing with mining threshold. The reason is that a larger mining threshold will
cause fewer frequent itemsets to be discovered in a shorter time. In addition, the experiments
on dataset BMS-POS require more time than those on datasets BMS-WebView-1 and BMS-
WebView-2 because dataset BMS-POS contains approximately 6–8 times more transactions
than other two datasets.
The false negative rates are 0 for all datasets when ? = 0. We also run experiments for
the revised frequency threshold ?
= ? + ? · 2
?k
when ? varies from 0.2 to 1.0 with an
increment of 0.2. Since our experimental results are the same as compared with the cases for
which ? = 0, the mining precision is not sensitive to the change of frequency threshold in
our experiments. Therefore, in the following, we use ? = 1% and ? = 0 and only consider
the false positive in our analysis.
1 3
Outsourcing data mining tasks
0
40
80
120
160
200
20 25 30 35 40
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
2
4
6
8
20 25 30 35 40
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
10
20
30
40
20 25 30 35 40
Number of Hash Functions Number of Hash Functions
Number of Hash Functions
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Fig. 4 Changing number of hash functions where g = 4, ? = 1%, and ? = 0
5.3.3 Change number of hash functions
In the third set of experiments, we study the mining precision, computation cost, and storage
requirement with the change of the number k of hash functions. We change k = from 20 to
40 and keep the other parameters at their default values, i.e., g = 4, ? = 1%, and ? = 0.
Figure 4 shows the mining precision with the change of k. There is a globally decreasing
trend of false positive rate for each real dataset. For dataset BMS-POS, the false positive rate
is nearly 5% for k = 20, and is less than 1% for k ? 25. For datasets BMS-WebView-1 and
BMS-WebView-2, the false positive rates are below 5% for k ? 30. It also shows that the
running time changes slightly with k. The running time is around 8min for dataset BMS-POS
and within 1min for datasets BMS-WebView-1 and BMS-WebView-2. The storage require-
ment is linearly increasing with k for all datasets based on the ?gure. The reason is that
the larger the k, the longer the Bloom ?lters, and the more storage is required to store the
Bloom ?lters. The experimental results show that high mining precision can be achieved by
increasing the number of hash functions. Consequently, the storage requirement increases
linearly due to the use of longer Bloom ?lters.
5.3.4 Change number of groups
In the fourth set of experiments, we study the mining precision, computation cost, and storage
requirement by changing the number g of groups from g = 2 to 5. The other parameters are
kept at their default values, i.e., k = 30, ? = 1%, and ? = 0.
Figure 5 shows that the false positive rate decreases with the number of groups. For data-
sets BMS-WebView-1 and BMS-WebView-2, the false positive rate can be as low as 5% for
g ? 3; whereas for dataset BMS-POS, the false positive rate is less than 2.5% for g ? 2.
It also shows that running time and storage requirement change little with parameter g. The
1 3
L. Qiu et al.
0
30
60
90
120
150
2 3
2 3 2 3
4 5
4 5 4 5
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
2
4
6
8
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
0
5
10
15
20
Number of Groups Number of Groups
Number of Groups
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
BMS-WebView-1
BMS-WebView-2
BMS-POS
Fig. 5 Changing number of groups where k = 30, ? = 1%, and ? = 0
results of this set of experiments reveal that high mining precision can be achieved with larger
number of groups without increasing much the storage requirement and the running time.
5.3.5 Scalability
In the last set of experiments, we use synthetic datasets Syn1, Syn2, …, and Syn4 to test the
scalability of our algorithm with respect to the running time, mining precision, and storage
requirement. We let k = 20 and keep other parameters at their default values, i.e., g = 4,
? = 1%, and ? = 0.
Figure 6 shows that the false positive rate for all synthetic datasets is no more than 1%,
and that the running time and the storage requirement increase linearly with the number of
transactions which is scaled up from 100K in dataset Syn1 to 750K in dataset Syn4. The
scalability is also veri?ed by experiments for k > 20 (e.g., k = 25 and 30) with zero false
positive rate.
From the experimental results, we can conclude that the running time and storage require-
ment are scalable with the number of transactions while the mining precision is reasonably
high.
5.4 Summary of experiments
One can draw the following conclusions from out experiments: (1) in the various cases of
our experiments, zero false negative rate can be achieved. This result is not sensitive to the
change of mining threshold; (2) by increasing the number of hash functions in formulating
Bloom ?lters, the false positive rate decreases at the price of higher storage requirement; (3)
by increasing the number of groups in data mining, the false positive rate decreases without
increasing the running time or the storage requirement; and (4) the running time and the
1 3
Outsourcing data mining tasks
0
70
140
210
280
350
Synthetic Dataset
S
t
o
r
a
g
e
(
M
e
g
a
-
b
i
t
)
0
30
60
90
120
Synthetic Dataset
R
u
n
n
i
n
g
T
i
m
e
(
m
i
n
)
0
0.25
0.5
0.75
1
Syn1 Syn2 Syn3 Syn4 Syn1 Syn2 Syn3 Syn4
Syn1 Syn2 Syn3 Syn4
Synthetic Dataset
F
a
l
s
e
P
o
s
i
t
i
v
e
R
a
t
e
(
%
)
Fig. 6 Scalability results where k = 20, g = 4, ? = 1%, and ? = 0
storage requirement are scalable with the number of transactions that are processed in data
mining.
6 Conclusions
The contribution of this paper is threefold. First, we proposed a new approach to outsourc-
ing association rule mining tasks while protecting BI and customer privacy. The proposed
approach is different from previous solutions in that it can protect BI and customer privacy
while outsourcing mining tasks, at the same time, maintain the precision of mining results.
Second, we performed theoretical analysis on the false positive and false negative rates in
data mining. We also estimated the upper and lower bounds for the mining errors. Third,
we investigated the tradeoffs between mining precision and storage requirement. Rigorous
experiments were conducted on typical real and synthetic datasets showing that our approach
can save storage significantly without sacri?cing much mining precision, security level, and
running time.
We are planning to investigate how to protect BI and customer privacy while outsourcing
other mining tasks such as decision tree mining. It is also interesting to study privacy pre-
serving techniques for mining speci?c types of frequent itemsets, such as maximum itemsets
and closed itemsets.
References
1. Agrawal D, Aggarwal CC (2001) On the design and quanti?cation of privacy preserving data mining
algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART symposium on principles of
database systems, pp 247–255
1 3
L. Qiu et al.
2. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large dat-
abases. In: Proceedings of the 1993 ACMSIGMODinternational conference on management of database
pp 207–216
3. Agrawal R, Kiernan J, Srikant R, Xu Y (2004) Order preserving encryption for numeric data. In: Pro-
ceedings of the ACM SIGMOD ICMD, pp 563–574
4. Agrawal R, Srikant R (1994) Faster algorithms for mining association rules in large databases. In: Pro-
ceedings of the 20th international conference on very large data bases (VLDB’94), Santiago de Chile,
Chile, September 12–15, pp 487–499
5. Agrawal R, Srikant R(2000) Privacy preserving data mining. In: Proceedings of the 2000 ACMSIGMOD
international conference on management of database, Texas, USA, May 16–18, pp 439–450
6. Agrawal S, Haritsa JR(2005) Aframework for high-accuracy privacy-preserving mining. In: Proceedings
of the 21th IEEE international conference on data engineering (ICDE 2005), Tokyo, Japan, pp 193–204
7. Apte C, Liu B, Pednault E, Smyth P (2002) Business applications of data mining. Commun ACM
45(8):49–53
8. Atallah M, Bertino E, Elmagarmid AK, IbrahimM, Verykios VS (1999) Disclosure limitation of sensitive
rules. In: Proceedings of the IEEE KDEE, pp 45–52
9. Bishop M, Bhumiratana B, Crawford R, Levitt K (2004) How to sanitize data. In: Proceedings of the
13th IEEE international workshops on enabling technologies: infrastructure for collaborative enterprises
(WETICE’04), Modena, Italy, June 14–16, pp 217–222
10. Bloom B (1970) Space time tradeoffs in hash coding with allowable errors. Commun ACM 7(13):422–
426
11. Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using con?dence
and support. In: Proceedings of the 4th international information hiding workshop, pp 369–383
12. Dibbeern J, Heinzl A (2002) Outsourcing information systems in small and medium sized enterprises: a
test of a multi-theoretical casaul model. In: Dibbeern J (ed) Information systems outsourcing: enduring
themes, emergent patterns, and future directions. Springer, New York
13. Du W, Zhan Z (2002) Building decision tree classi?er on private data. In: Proceedings of IEEE ICDM’02
workshop on privacy, security, and data mining, vol 14, pp 1–8
14. Ev?mievski A, Gehrke J, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining.
In: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on principles of database
system, pp 211–222
15. Ev?mievski A, Srikant R, Agrawal R, Gehrke J (2002) Privacy preserving mining of association rules.
In: Proceedings of the 8th ACM SIGKDD KDD 2002, pp 217–228
16. Hacigumus H, Iyer B, Li C, Mehrotra S (2002) Executing SQL over encrypted data in the database-ser-
vice-provider model. In: Proceedings of the ACM SIGMOD international conference on management
of database, pp 216–227
17. Hacigumus H, Iyer B, Mehrotra S (2002) Providing database as a service. In: Proceedings of the inter-
national conference on data engineering, pp 29–40
18. Hacigumus H, Iyer B, Mehrotra S (2004) Ef?cient execution of aggregation queries over encrypted rela-
tional databases. In: Proceedings of international conference on database systems for advanced applica-
tions, pp 125–136
19. Huang Z, Du W, Chen B (2005) Deriving private information from randomized data. In: Proceedings of
the ACM SIGMOD international conference on management of data, Baltimore, MA, USA, June 14–16,
pp 37–48
20. Iyer B, Mehrotra S, Mykletun E, Tsudik G, Wu Y (2004) A framework for ef?cient storage security in
RDBMS. In: Proceedings of international conference on EDBT, pp 147–164
21. Kantarc?o? glu M, Clifton C (2002) Privacy preserving distributed mining of association rules on hori-
zontally partitioned data. In: Proceedings of the ACM SIGMOD workshop on research issues on data
mining and knowledge discovery, pp 24–31
22. Kantarc?o? glu M, Jin J, Clifton C (2004) When do data mining results violate privacy? In: Proceedings
of the 10th ACM SIGKDD KDD 2004, pp 599–604
23. Kargupta H, Datta S, Wang Q, Sivakumar K (2003) On the privacy preserving properties of random data
perturbation techniques. In: Proceedings of the 3rd IEEE ICDM, pp 99–106
24. Kargupta H, Datta S, Wang Q, Sivakumar K (2005) Random-data perturbation techniques and privacy-
preserving data mining. Knowledge Inf Syst Int J 7(4):387–414
25. Lin Q-Y, Chen Y-L, Chen J-S, Chen Y-C (2003) Mining inter-organizational retailing knowledge for an
alliance formed by competitive ?rms. Inf Manage 40(5):431–442
26. Lindell Y, Pinkas B (2002) Privacy preserving data mining. J Cryptol 15(3):177–206
27. Lui SM, Qiu L(2007) Individual privacy and organizational privacy in business analytics. In: Proceedings
of the 40th Hawaii international conference on system sciences (HICSS 2007), Hawaii, USA, January
3–6, p 216b
1 3
Outsourcing data mining tasks
28. Milne G-R (2000) Privacy and ethical issues in database/interactive marketing and public policy: a
research framework and overview of the special issue. J Public Policy Marketing 19:1–6
29. Oliveira S, Zaiane O (2002) Privacy preserving frequent itemset mining. In: Proceedings of the IEEE
ICDM workshop on privacy, security and data mining, pp 43–54
30. Oliveira S, Zaiane O (2003) Algorithms for balancing privacy and knowledge discovery in association
rule mining. In: Proceedings of the 7th international database engineering and applications symposium,
pp 54–63
31. Oliveira S, Zaiane O (2003) Protecting sensitive knowledge by data sanitization. In: Proceedings of the
3rd IEEE ICDM, pp 211–218
32. Ordones C, Ezquerra N, Santana CA (2006) Constraining and summarizing association rules in medical
data. Knowledge Inf Syst Int J 9(3):259–283
33. Pinkas B (2002) Cryptographic techniques for privacy preserving data mining. ACM SIGKDD Explor
4(2):12–19
34. Qiu L, Li Y, Wu X (2006) An approach to outsourcing data mining tasks while protecting business
intelligence and customer privacy. In: Workshops proceedings of the 6th IEEE international conference
on data mining (ICDM 2006), Hong Kong, China, December 18–22, pp 551–558
35. Ra´ s ZW, Gürdal O, Im S, Tzacheva A (2007) Data con?dentiality versus chase. In: Proceedings of the
joint rough sets symposium(JRS07), Toronto, Canada, May 14–16. Springer LNAI vol 4482, pp 330–337
36. Rizvi S, Haritsa J (2002) Maintaining data privacy in association rule mining. In: Proceedings of
VLDB’02, pp 682–693
37. Saygin Y, Verykios VS, Clifton C (2001) Using unknowns to prevent discovery of association rules.
Sigmod Rec 30(4):45–54
38. Vaidya J, Clifton C (2004) Privacy-preserving data mining: why, how, and when. IEEE Security Privacy
2(6):19–27
39. Xu S, Zhang J, Han D, Wang J (2006) A singular value decomposition based data distortion strategy for
privacy protection. Knowledge Inf Syst Int J 10(3):383–397
40. Yao AC-C (1986) How to generate and exchange secrets. In: Proceedings of the 27th IEEE symposium
on foundations of computer science (FOCS’86), Xi’an, China, pp 162–167
41. Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Pro-
ceedings of the 7th ACM-SIGKDD international conference on knowledge discovery and data mining,
pp 401–406
Author’s biographies
Ling Qiu is a Lecturer (academic level B) of IT in the School of Mathe-
matics, Physics and Information Technology at James Cook University,
Australia. He holds a PhD degree in Computer Science & Engineer-
ing awarded by Nanyang Technological University, Singapore in 2003.
He received an ME degree from Zhejiang University, China and a BE
degree from Huazhong University of Science and Technology, China.
His research interests include parallel and distributed computing, algo-
rithms, database systems and data mining, information systems and secu-
rity, and system simulations.
1 3
L. Qiu et al.
Yingjiu Li is currently an Assistant Professor in the School of Infor-
mation Systems at Singapore Management University, Singapore. He
received his PhDdegree in Information Technology fromGeorge Mason
University in 2003. His research interests include applications security,
privacy protection, and data rights management. He has published over
40 technical papers in the refereed journals and conference proceedings.
Yingjiu Li is a member of the ACM and the IEEE. The URL for his web
page is http://www.mysmu.edu/faculty/yjli/.
Xintao Wu is an Associate Professor in the Department of Software and
Information Systems at the University of North Carolina at Charlotte, the
USA. He got his PhD degree in Information Technology from George
Mason University in 2001. He received his BSdegree in Information Sci-
ence from the University of Science and Technology of China in 1994,
and an ME degree in Computer Engineering from the Chinese Acad-
emy of Space Technology in 1997. His major research interests include
data mining and knowledge discovery, data privacy and security, and
bio-informatics.
1 3
doc_241713779.pdf