Cluster analysis concepts

knt.nallasamygounder · Jan 21, 2013

Description
Covers the agglomeration schedule, cluster centroid, cluster centers, cluster membership, dendrogram, icicle diagram, and similarity/distance coefficient matrix. It also covers Euclidean distance, hierarchical clustering (agglomerative and divisive), linkage methods, Ward’s method, and the use of SPSS.

Cluster Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.

In cluster analysis, the groups or clusters are suggested by the data.

An Ideal Clustering Situation
Fig. 1 (scatter plot; axes: Variable 1 and Variable 2)

A Practical Clustering Situation
Fig. 2 (scatter plot; axes: Variable 1 and Variable 2)

Statistics Associated with Cluster Analysis

Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.

Cluster centroid. The cluster centroid consists of the mean values of the variables for all the cases or objects in a particular cluster.

Cluster centers. The cluster centers are the initial starting points around which clusters are built.

Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.

Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right.

Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.

Icicle diagram. An icicle diagram is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle diagram is read from bottom to top. Figure 7 is an icicle diagram.

Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.

Conducting Cluster Analysis
Fig. 3

1. Formulate the Problem
2. Select a Distance Measure
3. Select a Clustering Procedure
4. Decide on the Number of Clusters
5. Interpret and Profile Clusters
6. Assess the Validity of Clustering

Attitudinal Data For Clustering
Table 1

Case No.   V1   V2   V3   V4   V5   V6
 1          6    4    7    3    2    3
 2          2    3    1    4    5    4
 3          7    2    6    4    1    3
 4          4    6    4    5    3    6
 5          1    3    2    2    6    4
 6          6    4    6    3    3    4
 7          5    3    6    3    3    4
 8          7    3    7    4    1    4
 9          2    4    3    3    6    3
10          3    5    3    6    4    6
11          1    3    2    3    5    3
12          5    4    5    4    2    4
13          2    2    1    5    4    4
14          4    6    4    6    4    7
15          6    5    4    2    1    4
16          3    5    4    6    4    7
17          4    4    7    2    2    5
18          3    7    2    6    4    3
19          4    6    3    7    2    7
20          2    3    2    4    7    2

Conducting Cluster Analysis: Formulate the Problem

The most important part of formulating the clustering problem is selecting the variables on which the clustering is based.

Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.

The variables should be selected based on past research, theory, or the hypotheses being tested.

Conducting Cluster Analysis: Select a Distance or Similarity Measure

The most commonly used measure of similarity is the Euclidean distance or its square. The Euclidean distance is the square root of the sum of the squared differences in values for each variable. Other distance measures are also available.

Use of different distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.
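As a minimal sketch of the definition above, the Euclidean distance and its square can be computed directly from the ratings. The three cases are taken from Table 1; the function names are illustrative and not part of the original slides.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared differences across the variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(squared_euclidean(a, b))

# Ratings on V1..V6 for three cases from Table 1.
case_14 = (4, 6, 4, 6, 4, 7)
case_16 = (3, 5, 4, 6, 4, 7)
case_5 = (1, 3, 2, 2, 6, 4)

print(round(euclidean(case_14, case_16), 3))  # 1.414 (two variables differ by 1)
print(round(euclidean(case_14, case_5), 3))   # 7.141 (a much less similar pair)
```

Cases 14 and 16 give nearly identical ratings, so their distance is small; cases 14 and 5 differ on every variable.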

A Classification of Clustering Procedures
Fig. 4

Clustering Procedures
    Hierarchical
        Agglomerative
            Linkage Methods
                Single
                Complete
                Average
            Variance Methods
                Ward’s Method
            Centroid Methods
        Divisive
    Nonhierarchical
        Sequential Threshold
        Parallel Threshold
        Optimizing Partitioning

Conducting Cluster Analysis: Select a Clustering Procedure – Hierarchical

Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive.

Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all objects are members of a single cluster.

Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.

Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.

Conducting Cluster Analysis: Select a Clustering Procedure – Linkage Method

The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.

The complete linkage method is based on the maximum distance, or the furthest neighbor approach. The distance between two clusters is calculated as the distance between their two furthest points.

The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters.
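The three linkage rules differ only in how they summarize the pairwise distances between two clusters. The following sketch makes that explicit. The grouping of cases 14 and 16 against cases 5 and 11 is an illustrative choice, and the function names are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_distances(cluster_a, cluster_b):
    """All distances with one member of the pair drawn from each cluster."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))   # two closest points

def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))   # two furthest points

def average_linkage(cluster_a, cluster_b):
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)                             # mean over all pairs

# Two small clusters built from Table 1 ratings (V1..V6).
cluster_a = [(4, 6, 4, 6, 4, 7), (3, 5, 4, 6, 4, 7)]   # cases 14 and 16
cluster_b = [(1, 3, 2, 2, 6, 4), (1, 3, 2, 3, 5, 3)]   # cases 5 and 11

for rule in (single_linkage, average_linkage, complete_linkage):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 3))
```

For any pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.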

Linkage Methods of Clustering
Fig. 5 (three panels, each showing Cluster 1 and Cluster 2: Single Linkage, minimum distance; Complete Linkage, maximum distance; Average Linkage, average distance)

Conducting Cluster Analysis: Select a Clustering Procedure – Nonhierarchical

The nonhierarchical clustering methods are frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold, and optimizing partitioning.

In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together. Then a new cluster center or seed is selected, and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no longer considered for clustering with subsequent seeds.

The parallel threshold method operates similarly, except that several cluster centers are selected simultaneously and objects within the threshold level are grouped with the nearest center.

The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize an overall criterion, such as the average within-cluster distance.
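A rough sketch of the sequential threshold idea follows. The slides do not say how each seed is chosen, so this version simply takes the first still-unclustered point in input order; that rule, the toy points, and the threshold of 2 are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_threshold(points, threshold):
    """Return one cluster label per point (0, 1, 2, ...)."""
    labels = [None] * len(points)
    next_label = 0
    for i, seed in enumerate(points):
        if labels[i] is not None:
            continue  # already clustered with an earlier seed
        # Assumed seed rule: the first unclustered point becomes the new center.
        for j, p in enumerate(points):
            if labels[j] is None and euclidean(seed, p) <= threshold:
                labels[j] = next_label  # never reconsidered for later seeds
        next_label += 1
    return labels

toy_points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 1)]
print(sequential_threshold(toy_points, threshold=2))  # [0, 0, 1, 1, 0]
```

Because earlier seeds claim their neighbours first, the result depends on the order of the seeds. That order dependence is what the optimizing partitioning method relaxes by allowing later reassignment.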

Conducting Cluster Analysis: Select a Clustering Procedure

It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure. The number of clusters and cluster centroids so obtained are used as inputs to the optimizing partitioning method.
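The tandem approach can be sketched as follows, assuming a plain k-means loop (assign each case to its nearest center, recompute the centers as cluster means, repeat) as one possible form of optimizing partitioning. The starting centers are the three hierarchical centroids reported in Table 3 and the cases are the Table 1 ratings. This is not the SPSS implementation, and the function names are illustrative.

```python
# Table 1 ratings, one list per variable (V1..V6), cases 1..20 in order.
V = [
    [6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2],
    [4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3],
    [7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2],
    [3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4],
    [2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7],
    [3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2],
]
cases = list(zip(*V))

# Centroids of the hierarchical solution (Table 3), used as starting centers.
seeds = [
    (5.750, 3.625, 6.000, 3.125, 1.750, 3.875),
    (1.667, 3.000, 1.833, 3.500, 5.500, 3.333),
    (3.500, 5.833, 3.333, 6.000, 3.500, 6.000),
]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, centers, max_iter=100):
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # membership unchanged: converged
        labels = new_labels
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Assumption of this sketch: an empty cluster keeps its old center.
            updated.append(mean_point(members) if members else old)
        centers = updated
    return labels, centers

labels, centers = k_means(cases, seeds)
print([labels.count(k) for k in range(len(seeds))])  # [8, 6, 6]
```

Seeded this way, the loop settles on groups of 8, 6, and 6 cases. Note that the labels 0, 1, 2 follow the order of the seeds, so cluster numbers need not match those assigned by another program.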

Results of Hierarchical Clustering
Table 2
Agglomeration Schedule Using Ward’s Procedure

         Clusters combined                      Stage cluster first appears
Stage    Cluster 1   Cluster 2   Coefficient    Cluster 1   Cluster 2        Next stage
 1       14          16            1.000000      0           0                6
 2        6           7            2.000000      0           0                7
 3        2          13            3.500000      0           0               15
 4        5          11            5.000000      0           0               11
 5        3           8            6.500000      0           0               16
 6       10          14            8.160000      0           1                9
 7        6          12           10.166667      2           0               10
 8        9          20           13.000000      0           0               11
 9        4          10           15.583000      0           6               12
10        1           6           18.500000      6           7               13
11        5           9           23.000000      4           8               15
12        4          19           27.750000      9           0               17
13        1          17           33.100000     10           0               14
14        1          15           41.333000     13           0               16
15        2           5           51.833000      3          11               18
16        1           3           64.500000     14           5               19
17        4          18           79.667000     12           0               18
18        2           4          172.662000     15          17               19
19        1           2          328.600000     16          18                0
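To show where coefficients of this kind come from, here is a minimal sketch of Ward's criterion applied to the Table 1 data. It uses the standard definition: at each stage, merge the two clusters whose union adds least to the total within-cluster error sum of squares, and report the running total as the coefficient. It is not SPSS's algorithm, ties are broken by case order here, and the function names are hypothetical, so the order of tied early stages can differ from Table 2.

```python
# Table 1 ratings, one list per variable (V1..V6), cases 1..20 in order.
V = [
    [6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2],
    [4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3],
    [7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2],
    [3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4],
    [2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7],
    [3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2],
]
cases = list(zip(*V))

def centroid(members):
    """Mean of each variable over the cases (0-based indices) in a cluster."""
    pts = [cases[i] for i in members]
    return [sum(col) / len(pts) for col in zip(*pts)]

def ward_increase(a, b):
    """Increase in the error sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_schedule():
    clusters = [[i] for i in range(len(cases))]  # every case starts alone
    total = 0.0
    schedule = []
    while len(clusters) > 1:
        # Pick the pair whose merge adds least to the error sum of squares.
        inc, i, j = min(
            (ward_increase(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        total += inc
        # Name each cluster by its lowest case number (1-based).
        schedule.append((min(clusters[i]) + 1, min(clusters[j]) + 1, total))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return schedule

schedule = ward_schedule()
for stage, (c1, c2, coef) in enumerate(schedule, start=1):
    print(stage, c1, c2, round(coef, 3))
```

The final coefficient equals the total sum of squares of all 20 cases around the overall means, 328.6, which agrees with stage 19 of Table 2. The sharp rise in the coefficient over the last two stages is the usual signal that distinct clusters are being forced together.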

Vertical Icicle Plot Using Ward’s Method
Fig. 7

Dendrogram Using Ward’s Method
Fig. 8

Conducting Cluster Analysis: Decide on the Number of Clusters

Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.

The relative sizes of the clusters should be meaningful.

Conducting Cluster Analysis: Interpreting and Profiling the Clusters

Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.

It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.

Cluster Centroids
Table 3

Means of Variables

Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000

Conducting Cluster Analysis: Assess Reliability and Validity

1. Use different methods of clustering and compare the results.
2. Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.

Results of Nonhierarchical Clustering
Table 4 cont.

Cluster Membership

Case Number   Cluster   Distance
 1            3         1.414
 2            2         1.323
 3            3         2.550
 4            1         1.404
 5            2         1.848
 6            3         1.225
 7            3         1.500
 8            3         2.121
 9            2         1.756
10            1         1.143
11            2         1.041
12            3         1.581
13            2         2.598
14            1         1.404
15            3         2.828
16            1         1.624
17            3         2.598
18            1         3.555
19            1         2.154
20            2         2.102

Distances between Final Cluster Centers

Cluster   1       2       3
1                 5.568   5.698
2         5.568           6.928
3         5.698   6.928

ANOVA

           Cluster                Error
Variable   Mean Square   df      Mean Square   df      F        Sig.
V1         29.108        2       0.608         17      47.888   0.000
V2         13.546        2       0.630         17      21.505   0.000
V3         31.392        2       0.833         17      37.670   0.000
V4         15.713        2       0.728         17      21.585   0.000
V5         22.537        2       0.816         17      27.614   0.000
V6         12.171        2       1.071         17      11.363   0.001

Number of Cases in each Cluster

Cluster 1    6.000
Cluster 2    6.000
Cluster 3    8.000
Valid       20.000
Missing      0.000

SPSS Windows
To select these procedures using SPSS for Windows, click:
Analyze>Classify>Hierarchical Cluster …
Analyze>Classify>K-Means Cluster …
