Mining Projected Clusters in High-Dimensional Spaces


Chapter-1 1. INTRODUCTION
1.1 Introduction
Data mining is the process of extracting potentially useful information from a data set. Clustering is a popular data mining technique intended to help the user discover and understand the structure or grouping of the data in a set according to a certain similarity measure. Clustering algorithms usually employ a distance metric (e.g., Euclidean) or a similarity measure in order to partition the database so that the data points in each partition are more similar to one another than to points in different partitions. The commonly used Euclidean distance, while computationally simple, requires similar objects to have close values in all dimensions. However, with the high-dimensional data commonly encountered nowadays, the concept of similarity between objects in the full-dimensional space is often invalid and generally unhelpful.

1." Pro#lem Statement
"eature selection techniques are commonly utili#ed as a preprocessing stage for clustering, in order to overcome the curse of dimensionality. The most informative dimensions are selected by eliminating irrelevant and redundant ones. $uch techniques speed up clustering algorithms and improve their performance. %evertheless, in some applications, different clusters may exist in different subspaces spanned by different dimensions. &n such cases, dimension reduction using a conventional feature selection technique may lead to substantial information loss.



1.3 Scope
Recent theoretical results reveal that data points in a set tend to become equally spaced as the dimension of the space increases, provided that the components of the data points are independently and identically distributed (i.i.d.). Although the i.i.d. condition is rarely satisfied in real applications, these results suggest that it becomes less meaningful to differentiate data points based on a distance or a similarity measure computed over all the dimensions. This explains the poor performance of conventional distance-based clustering algorithms on such data sets.
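The effect described above can be illustrated empirically. The following sketch (our own illustration, assuming i.i.d. uniform components; class and method names are ours) measures the relative contrast (dmax - dmin) / dmin between the farthest and nearest neighbors of a random query point. As the dimension grows, the contrast shrinks toward zero, i.e., the nearest and farthest points become nearly equidistant:

```java
import java.util.Random;

/** Demonstrates distance concentration: as dimensionality grows with i.i.d.
 *  components, the nearest and farthest neighbours of a query point become
 *  nearly equidistant. Illustrative sketch only. */
public class DistanceConcentration {

    /** Relative contrast (dmax - dmin) / dmin for one random query point. */
    public static double relativeContrast(int n, int dim, long seed) {
        Random rnd = new Random(seed);
        double[] query = new double[dim];
        for (int j = 0; j < dim; j++) query[j] = rnd.nextDouble();
        double[][] points = new double[n][dim];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < dim; j++) points[i][j] = rnd.nextDouble();

        double dmin = Double.POSITIVE_INFINITY, dmax = 0.0;
        for (double[] p : points) {
            double s = 0.0;
            for (int j = 0; j < dim; j++) { double diff = p[j] - query[j]; s += diff * diff; }
            double d = Math.sqrt(s);
            dmin = Math.min(dmin, d);
            dmax = Math.max(dmax, d);
        }
        return (dmax - dmin) / dmin;
    }

    public static void main(String[] args) {
        // The printed contrast drops sharply as the dimension increases.
        for (int dim : new int[]{2, 10, 100, 1000})
            System.out.printf("dim=%4d  contrast=%.3f%n",
                    dim, relativeContrast(500, dim, 7L));
    }
}
```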

1.4 Objective
A number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces of very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm.

1.5 Approach
These observations motivate our effort to propose a novel projected clustering algorithm, called Projected Clustering based on the K-Means Algorithm (PCKA). PCKA is composed of three phases: attribute relevance analysis, outlier handling, and discovery of projected clusters. Our algorithm is partitional in nature and able to automatically detect projected clusters of very low dimensionality embedded in high-dimensional space, thereby avoiding computation of distances in the full-dimensional space.
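Since PCKA is partitional and builds on the k-means scheme, a minimal sketch of the underlying Lloyd iteration may help fix ideas. The code below is plain full-dimensional k-means written by us for illustration only; PCKA itself avoids the full-dimensional distance used here by restricting attention to relevant dimensions:

```java
/** Minimal k-means (Lloyd) sketch: the partitional core PCKA builds on.
 *  Plain full-dimensional k-means for illustration only. */
public class KMeansSketch {

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) { double d = a[j] - b[j]; s += d * d; }
        return s;
    }

    /** Returns a cluster label per point after at most maxIter iterations. */
    public static int[] cluster(double[][] x, double[][] centers, int maxIter) {
        int n = x.length, k = centers.length, dim = x[0].length;
        int[] label = new int[n];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {          // assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(x[i], centers[c]) < dist2(x[i], centers[best])) best = c;
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            double[][] sum = new double[k][dim];
            int[] cnt = new int[k];
            for (int i = 0; i < n; i++) {          // update step: new centroids
                cnt[label[i]]++;
                for (int j = 0; j < dim; j++) sum[label[i]][j] += x[i][j];
            }
            for (int c = 0; c < k; c++)
                if (cnt[c] > 0)
                    for (int j = 0; j < dim; j++) centers[c][j] = sum[c][j] / cnt[c];
            if (!changed) break;                   // assignments stable: converged
        }
        return label;
    }
}
```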



1.6 Thesis Outline

The proposed dissertation consists of seven chapters, including the Introduction and Conclusions. Chapter 1 presents the motivation, problem definition, objective, and limitations of the proposed system. Chapter 2 provides a detailed literature survey. Chapter 3 describes the analysis, the software requirement specification, the software and hardware requirements, and the algorithms. Chapter 4 describes the overall design of the project using UML diagrams, and Chapter 5 describes the implementation details of the project. Testing and validation, together with the screenshots and reports, are described in Chapter 6. Chapter 7 presents the conclusion and future work of the project.



Chapter-" ". *IT+R(T!R+ S!R,+".1 Introduction
Data mining is the process of extracting potentially useful information from a data set. Clustering is a popular data mining technique intended to help the user discover and understand the structure or grouping of the data in a set according to a certain similarity measure. Clustering algorithms usually employ a distance metric (e.g., Euclidean) or a similarity measure in order to partition the database so that the data points in each partition are more similar to one another than to points in different partitions. The commonly used Euclidean distance, while computationally simple, requires similar objects to have close values in all dimensions. However, with the high-dimensional data commonly encountered nowadays, the concept of similarity between objects in the full-dimensional space is often invalid and generally unhelpful. Recent theoretical results reveal that data points in a set tend to become equally spaced as the dimension of the space increases, provided that the components of the data points are i.i.d. (independently and identically distributed). Although the i.i.d. condition is rarely satisfied in real applications, it becomes less meaningful to differentiate data points based on a distance or a similarity measure computed over all the dimensions. These results explain the poor performance of conventional distance-based clustering algorithms on such data sets. Feature selection techniques are commonly utilized as a preprocessing stage for clustering, in order to overcome the curse of dimensionality. The most informative dimensions are selected by eliminating irrelevant and redundant ones. Such techniques speed up clustering algorithms and improve their performance. Nevertheless, in some applications, different clusters may exist in different subspaces spanned by different


dimensions. In such cases, dimension reduction using a conventional feature selection technique may lead to substantial information loss.

The following example provides an idea of the difficulties encountered by conventional clustering algorithms and feature selection techniques. Figure 2.1 illustrates a generated data set composed of 1000 data points in 10-dimensional space. Note that this data set is generated based on the data generator model described in [?]. As we can see from Figure 2.1, there are four clusters, each of which has its own relevant dimensions (e.g., cluster 1 exists in dimensions A1, A4, ...).

PROCLUS, a variant of the k-medoids method, begins by choosing a set of candidate medoids. With this set of medoids, PROCLUS finds the subspace dimensions of each cluster by examining the neighboring locality of the space near it. After the subspace has been determined, each data point is assigned to the cluster of the nearest medoid. The algorithm is run until the sum of intracluster distances ceases to change. ORCLUS is an extended version of PROCLUS that looks for non-axis-parallel clusters, using Singular Value Decomposition (SVD) to transform the data to a new coordinate system and to select principal components. PROCLUS and ORCLUS were the first to successfully introduce a methodology for discovering projected clusters in high-dimensional spaces, and they continue to inspire novel approaches.
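The assignment step just described can be sketched concretely. In the code below (our illustration, not the PROCLUS implementation), each medoid carries its own set of relevant dimensions, and a point is assigned to the medoid nearest under the average per-dimension distance restricted to that medoid's subspace, in the spirit of PROCLUS's Manhattan segmental distance:

```java
/** Sketch of PROCLUS-style assignment: each medoid has its own set of
 *  relevant dimensions, and points go to the medoid that is nearest under
 *  the average distance computed over those dimensions only.
 *  Names and structure are illustrative, not taken from the PROCLUS paper. */
public class ProjectedAssignment {

    /** Segmental distance: average |x_j - m_j| over the relevant dims only. */
    public static double segmentalDist(double[] x, double[] medoid, int[] dims) {
        double s = 0;
        for (int j : dims) s += Math.abs(x[j] - medoid[j]);
        return s / dims.length;
    }

    /** Index of the medoid nearest to x, each medoid under its own subspace. */
    public static int assign(double[] x, double[][] medoids, int[][] dims) {
        int best = 0;
        for (int c = 1; c < medoids.length; c++)
            if (segmentalDist(x, medoids[c], dims[c])
                    < segmentalDist(x, medoids[best], dims[best])) best = c;
        return best;
    }
}
```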

A limitation of these two approaches is that the process of forming the locality is based on the full dimensionality of the space. However, it is not useful to look for neighbors in data sets with very low-dimensional projected clusters. In addition, PROCLUS and ORCLUS require the user to provide the average dimensionality of the subspaces, which is also very difficult to do in real-life applications.



In [?], Procopiuc et al. propose an approach called DOC (Density-based Optimal projective Clustering) in order to identify projected clusters. DOC proceeds by discovering clusters one after another, defining a projected cluster as a hypercube with width 2w, where w is a user-supplied parameter. In order to identify the relevant dimensions of each cluster, the algorithm randomly selects a seed point and a small set Y of neighboring data points from the data set. A dimension is considered relevant to the cluster if and only if the distance between the projected value of the seed point and that of each data point in Y on the dimension is no more than w. All data points that belong to the defined hypercube form a candidate cluster. The suitability of the resulting cluster is evaluated by a quality function which is based on a user-provided parameter that controls the trade-off between the number of objects and the number of relevant dimensions. DOC tries different seeds and neighboring data points in order to find the cluster that optimizes the quality function. The entire process is repeated to find other projected clusters. It is clear that since DOC scans the entire data set repetitively, its execution time is very high. To alleviate this problem, an improved version of DOC called FastDOC is also proposed in [?].
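DOC's per-dimension relevance test can be sketched directly from the definition above: a dimension is kept if every sampled neighbor in Y projects within w of the seed on that dimension. The class and helper names below are ours, for illustration only:

```java
/** Sketch of DOC's relevance test: a dimension is relevant to the trial
 *  cluster seeded at s if every sampled neighbour in Y projects within w of
 *  s on that dimension (the cluster is a hypercube of width 2w). */
public class DocRelevance {

    /** True if all points of Y lie within w of the seed on dimension dim. */
    public static boolean relevant(double[] seed, double[][] Y, int dim, double w) {
        for (double[] y : Y)
            if (Math.abs(y[dim] - seed[dim]) > w) return false;
        return true;
    }

    /** All dimensions that pass the width test for this seed and sample. */
    public static java.util.List<Integer> relevantDims(double[] seed, double[][] Y, double w) {
        java.util.List<Integer> dims = new java.util.ArrayList<>();
        for (int j = 0; j < seed.length; j++)
            if (relevant(seed, Y, j, w)) dims.add(j);
        return dims;
    }
}
```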

DOC is based on an interesting theoretical foundation and has been successfully applied to image processing applications. In contrast to the previous approaches (i.e., PROCLUS and ORCLUS), DOC is able to automatically discover the number of clusters in the data set. However, the input parameters of DOC are difficult to determine, and an inappropriate choice by the user can greatly diminish its accuracy. Furthermore, DOC looks for clusters with equal width along all relevant dimensions. In some types of data, however, clusters with different widths are more realistic.

Another hypercube approach called FPC (Frequent-Pattern-based Clustering) is proposed in [?] to improve the efficiency of DOC. FPC replaces the randomized module of DOC with a systematic search for the best cluster defined by a random medoid point p. In order to discover the relevant dimensions of the medoid p, an optimized adaptation of the


frequent-pattern tree growth method used for mining item sets is proposed. In this context, the authors of FPC illustrate the analogy between mining frequent item sets and discovering dense projected clusters around random points. The adapted mining technique is combined with FastDOC to discover clusters. However, the fact that FPC returns only one cluster at a time adversely affects its computational efficiency. In order to speed up FPC, an extended version named CFPC (Concurrent Frequent-Pattern-based Clustering) is also proposed in [?]. CFPC can discover multiple clusters simultaneously, which improves the efficiency of the clustering process.
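The analogy can be made concrete: fixing a medoid p and a width w, each data point is converted into the "itemset" of dimensions on which it lies within w of p, and a set of dimensions shared by many points (a frequent itemset) indicates a dense projected cluster around p. The minimal sketch below uses our own naming and omits FPC's actual FP-tree machinery:

```java
/** Sketch of the FPC analogy between frequent item sets and dense
 *  projected clusters around a medoid p. Illustrative only. */
public class FpcItemsets {

    /** Itemset of point x w.r.t. medoid p: dimensions with |x_j - p_j| <= w. */
    public static java.util.Set<Integer> itemset(double[] x, double[] p, double w) {
        java.util.Set<Integer> items = new java.util.TreeSet<>();
        for (int j = 0; j < x.length; j++)
            if (Math.abs(x[j] - p[j]) <= w) items.add(j);
        return items;
    }

    /** Support of dimension set D: number of points whose itemset contains D. */
    public static int support(double[][] data, double[] p, double w,
                              java.util.Set<Integer> D) {
        int count = 0;
        for (double[] x : data)
            if (itemset(x, p, w).containsAll(D)) count++;
        return count;
    }
}
```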

It is shown in [?] that FPC significantly improves the efficiency of DOC/FastDOC and can be much faster than the previous approaches. However, since FPC is built on DOC/FastDOC, it inherits some of their drawbacks: FPC performs well only when each cluster is in the form of a hypercube and the parameter values are specified correctly. A recent study proposes a hierarchical projected clustering algorithm called HARP (a Hierarchical approach with Automatic Relevant dimension selection for Projected clustering). The basic assumption of HARP is that if two data points are similar in high-dimensional space, they have a high probability of belonging to the same cluster in lower-dimensional space. Based on this assumption, two clusters are allowed to merge only if they are similar enough in a number of dimensions. The minimum similarity and the minimum number of similar dimensions are dynamically controlled by two thresholds, without the assistance of user parameters. The advantage of HARP is that it provides a mechanism to automatically determine the relevant dimensions of each cluster and avoids the use of input parameters whose values are difficult to set. In addition, the study illustrates that HARP provides interesting results on gene expression data.
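HARP's merge condition can be sketched in simplified form. The code below is our own caricature: per-dimension similarity is approximated by closeness of cluster means relative to the attribute range, whereas HARP defines its own relevance score; only the "merge only if similar on at least dMin dimensions" structure is faithful to the description above:

```java
/** Simplified sketch of HARP's merge test: two clusters may merge only if
 *  they are similar enough on at least dMin dimensions. The per-dimension
 *  similarity used here (closeness of means relative to the attribute
 *  range) is our assumption, not HARP's actual score. */
public class HarpMergeTest {

    /** True if the means are within (1 - rMin) * range on >= dMin dimensions. */
    public static boolean canMerge(double[] meanA, double[] meanB,
                                   double[] range, double rMin, int dMin) {
        int similar = 0;
        for (int j = 0; j < meanA.length; j++)
            if (Math.abs(meanA[j] - meanB[j]) <= (1.0 - rMin) * range[j]) similar++;
        return similar >= dMin;
    }
}
```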

On the other hand, as mentioned in Section 1, it has been shown in [?] that, for a number of common data distributions, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. Based on these results, the basic


assumption of HARP will be less valid when projected clusters have few relevant

dimensions. In such situations the accuracy of HARP deteriorates severely. This effect on HARP's performance was also observed by Yip et al. In order to overcome the limitations encountered by HARP and other projected clustering algorithms, the authors of HARP propose in [?] a semi-supervised approach named SSPC (Semi-Supervised Projected Clustering). This algorithm is partitional in nature and similar in structure to PROCLUS. As in semi-supervised clustering, SSPC makes use of domain knowledge (labeled data points and/or labeled dimensions) in order to improve the quality of a clustering. As reported in [?], the clustering accuracy can be greatly improved by inputting only a small amount of domain knowledge. However, in some applications, domain knowledge in the form of labeled data points and/or labeled dimensions is very limited and not usually available.

A density-based algorithm named EPCH (Efficient Projective Clustering by Histograms) is proposed in [?] for projected clustering. EPCH performs projected clustering by histogram construction. By iteratively lowering a threshold, dense regions are identified in each histogram. A "signature" is generated for each data point corresponding to some region in some subspace. Projected clusters are uncovered by identifying signatures associated with a large number of data points. EPCH has an interesting property in that no assumption is made about the number of clusters or the dimensionality of the subspaces. In addition, it has been shown in [?] that EPCH can be fast and is able to handle clusters of irregular shape. On the other hand, while EPCH avoids the computation of distances between data points in the full-dimensional space, it suffers from the curse of dimensionality. In our experiments, we have observed that when the dimensionality of the data space increases and the number of relevant dimensions for clusters decreases, the accuracy of EPCH is affected.

A field that is closely related to projected clustering is subspace clustering. CLIQUE was the pioneering approach to subspace clustering, followed by a number of algorithms in the same field such as MAFIA and SUBCLU. The idea behind subspace clustering is to identify all dense regions in all subspaces, whereas in projected clustering, as the name


implies, the main focus is on discovering clusters that are projected onto particular subspaces. The outputs of subspace clustering algorithms differ significantly from those of projected clustering algorithms. Subspace clustering techniques tend to produce a partition of the data set with overlapping clusters; the output of such algorithms is very large, because data points may be assigned to multiple clusters. In contrast, projected clustering algorithms produce disjoint clusters with a single partitioning of points. Depending on the application domain, both subspace clustering and projected clustering can be powerful tools for mining high-dimensional data. Since the major concern of this chapter is projected clustering, we focus only on such techniques. Further details and a survey of subspace clustering and projected clustering algorithms can be found in [?].
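The histogram step at the heart of EPCH, described above, reduces in one dimension to finding bins whose counts exceed a density threshold. The sketch below uses an illustrative fixed-width histogram with a bin count and threshold of our own choosing, not EPCH's actual construction:

```java
/** Sketch of EPCH's basic building block: build a fixed-width histogram
 *  over one dimension and report the bins whose counts reach a density
 *  threshold. Dense bins seed the per-point signatures described above. */
public class DenseBins {

    /** Indices of bins over [lo, hi) whose count is at least minCount. */
    public static java.util.List<Integer> denseBins(double[] values, double lo,
                                                    double hi, int bins, int minCount) {
        int[] count = new int[bins];
        double width = (hi - lo) / bins;
        for (double v : values) {
            int b = (int) ((v - lo) / width);
            if (b >= 0 && b < bins) count[b]++;   // ignore out-of-range values
        }
        java.util.List<Integer> dense = new java.util.ArrayList<>();
        for (int b = 0; b < bins; b++)
            if (count[b] >= minCount) dense.add(b);
        return dense;
    }
}
```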

".$. +.isting Su#space Clustering (lgorithms
Some algorithms adjust better to high dimensions. For example, the algorithm CACTUS (section Co-Occurrence of Categorical Data) adjusts well, since it defines a cluster only in terms of a cluster's 2D projections. In this section we cover techniques that are specifically designed to work with high-dimensional data. The algorithm CLIQUE (Clustering in Quest) [Agrawal et al. 1998] ...

When talking about high dimensionality, how high is high? Many spatial clustering algorithms depend on indices in spatial data sets (sub-section Data Preparation) to facilitate quick search of the nearest neighbors. Therefore, indices can serve as good proxies for the performance impact of the dimensionality curse. Indices used in clustering algorithms are known to work effectively for dimensions below 16. For a dimension d > 20, their performance degrades to the level of sequential search (though


newer indices achieve significantly higher limits). Therefore, we can arguably claim that data with more than 16 attributes is high-dimensional. How large is the gap? If we are dealing with a retail application, 52 weeks of sales volumes represent a typical set of features, which is a special example of the more general class of time-series data. In customer profiling, dozens of generalized item categories plus basic demographics result in at least 50-100 attributes. Web clustering based on site contents results in 200-1000 attributes (pages/contents) for modest Web sites. Biology and genomic data can have dimensions that easily surpass 2000-5000 attributes. Finally, text mining and information retrieval also deal with many thousands of attributes (words or index terms). So, the gap is significant. Two general-purpose techniques are used to fight high dimensionality: (1) attribute transformations and (2) domain decomposition. Attribute transformations are simple functions of existing attributes. For sales profiles and OLAP-type data, roll-ups as sums or averages over time intervals (e.g., monthly volumes) can be used. Due to the fine seasonality of sales, such brute-force approaches rarely work. In multivariate statistics, principal components analysis (PCA) is popular [Mardia et al. 19...].

Chapter-3 3. ANALYSIS

3.2 Software Requirements
The other required software needed for the simulation of this system is Java and its advanced components, including third-party components.

Table 3.2: Software Requirement Table

1. Operating System: Windows 2000 / XP, Linux-based systems
2. Languages / Software: Java Runtime Environment, Java Software Development Kit 1.6, Java NetBeans IDE



3.3.5 Communication Interface
The system should be connected to an intranet and to various communicating devices.

3.4 Functional Requirements
"unctional requirements will define the fundamental actions that must ta:e place in the software in accepting P processing the inputs in processing P generating the outputs.

3.4.1 Information flows
Class diagrams, use case diagrams, activity diagrams, sequence diagrams, and collaboration diagrams will be provided; these describe the flow of data between the various processes of the system.

$.%." Process Description
Process descriptions will be provided based on the process information. A use case specification will be enclosed which describes the detailed specification of each use case.

3.4.3 Performance Requirements
The data must be processed from different data sources in a finite amount of time.

3.5 Non-Functional Requirements
The major non-functional requirements of the system are as follows:

1. Usability: The system is designed as a completely automated process; hence there is little or no user intervention.

2. Reliability:


The system is reliable because of the qualities inherited from the chosen platform, Java; code built with Java is more reliable.

3. Performance: The system is developed in a high-level language and uses advanced front-end and back-end technologies, so it responds to the end user on the client system within a very short time.

4. Supportability: The system is designed to be cross-platform. It is supported on a wide range of hardware and on any software platform that has a JVM built in.

5. Implementation: The system is implemented in the Java environment, using the Java Software Development Kit and NetBeans, with Windows XP Professional as the platform.

3.6 Software system attributes
Scalability: The number of intermediate sources can be scaled, thus changing or updating the data.
Reliability: The proposed system should provide reliable results.
Resource Utilization Efficiency: The system will utilize less processing time.
Security: The system is developed in Java and hence is secured.
Safety: The system uses Java code safety.


Capacity: Any number of users will be able to use this system.
Interfaces: These will be provided in the design document.
Availability: The system will always cater to the needs of the users.
Accuracy: The system will produce accurate results.
Reusability: The system can be easily reused.
Ease of Use: The system is developed using a graphical user interface and hence is easy to use.
Interoperability: Interoperability is achieved through the use of a preprocessed file.
Portability: The system is portable to any version of Windows as well as to Linux systems.
Privacy: The system ensures the privacy of the data.
System Administration Ease: The system will provide easy administration capabilities.
Expandability: Any number of modules can be added to the system.
Maintainability: The system is designed as an open system, and new methods are easily added.
Testability: Test cases will be written to ensure correct results.



Chapter % %. D+SIolf, *.$. Bu, and U.$. *ar:, h"ast )lgorithm for *rojected Clustering,? *roc.)C,$&I,0D EAA, pp. 8'!91, 'AAA. G8H +.B.5. Bip, D.>. Cheung, ,.+. %g, and +. Cheung, h&dentifying *rojected Clusters from Iene Expression *rofiles,? U. =iomedical &nformatics, vol. 29, no. 6, pp. 236!269, 1;;3.


[7] K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering," Proc. 21st Int'l Conf. Data Eng. (ICDE '05), pp. 329-340, 2005.

K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, 2004.

R.V. Hogg, J.W. McKean, and A.T. Craig, Introduction to Mathematical Statistics,

sixth ed., Pearson Prentice Hall, 2005.

[22] J.F. Lawless, Statistical Models and Methods for Lifetime Data, John Wiley & Sons, 1982.

[23] M. Bouguessa, S. Wang, and H. Sun, "An Objective Approach to Cluster

Validation," Pattern Recognition Letters, vol. 27, no. 13, pp. 1419-1430, 2006.

[24] J.J. Oliver, R.A. Baxter, and C.S. Wallace, "Unsupervised Learning Using MML," Proc. 13th Int'l Conf. Machine Learning (ICML '96), pp. 364-372, 1996.

[25] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

... World Scientific, 19...
 
