Exploring BIRCH cluster hierarchy¶
An example illustrating how to explore the cluster hierarchy computed by the BIRCH algorithm.
In this example, we use a
patched verson of sklearn.cluster.Birch
that allows to store the id of samples belonging to each subcluster.
This modified version is available from freediscovery.cluster.Birch
.
Building the cluster hierarchy¶
We start by computing BIRCH clustering on some random structured data,
import numpy as np
from sklearn.datasets import make_blobs
from freediscovery.cluster import Birch, birch_hierarchy_wrapper
rng = np.random.RandomState(42)
X, y = make_blobs(n_samples=1000, n_features=10, random_state=rng)
cluster_model = Birch(threshold=0.9, branching_factor=20,
compute_sample_indices=True)
cluster_model.fit(X)
Next we wrap each subcluster in the cluster hierarchy
(cluster_model.root_
) with the
BirchSubcluster
class
that allows easier manipulation of the hierarchical tree.
htree, _ = birch_hierarchy_wrapper(cluster_model)
print('Total number of subclusters:', htree.tree_size)
Out:
Total number of subclusters: 78
Visualizing the hierarchy¶
We can now visualize the cluster hierarchy,
htree.display_tree()
Out:
[cluster_id=0] N_children: 7 N_samples: 1000
> [cluster_id=1] N_children: 3 N_samples: 37
> > [cluster_id=2] N_children: 0 N_samples: 19
> > [cluster_id=3] N_children: 0 N_samples: 3
> > [cluster_id=4] N_children: 0 N_samples: 15
> [cluster_id=5] N_children: 10 N_samples: 127
> > [cluster_id=6] N_children: 0 N_samples: 16
> > [cluster_id=7] N_children: 0 N_samples: 18
> > [cluster_id=8] N_children: 0 N_samples: 7
> > [cluster_id=9] N_children: 0 N_samples: 16
> > [cluster_id=10] N_children: 0 N_samples: 14
> > [cluster_id=11] N_children: 0 N_samples: 16
> > [cluster_id=12] N_children: 0 N_samples: 14
> > [cluster_id=13] N_children: 0 N_samples: 7
> > [cluster_id=14] N_children: 0 N_samples: 10
> > [cluster_id=15] N_children: 0 N_samples: 9
> [cluster_id=16] N_children: 17 N_samples: 230
> > [cluster_id=17] N_children: 0 N_samples: 9
> > [cluster_id=18] N_children: 0 N_samples: 19
> > [cluster_id=19] N_children: 0 N_samples: 14
> > [cluster_id=20] N_children: 0 N_samples: 6
> > [cluster_id=21] N_children: 0 N_samples: 19
> > [cluster_id=22] N_children: 0 N_samples: 19
> > [cluster_id=23] N_children: 0 N_samples: 18
> > [cluster_id=24] N_children: 0 N_samples: 20
> > [cluster_id=25] N_children: 0 N_samples: 13
> > [cluster_id=26] N_children: 0 N_samples: 14
> > [cluster_id=27] N_children: 0 N_samples: 9
> > [cluster_id=28] N_children: 0 N_samples: 20
> > [cluster_id=29] N_children: 0 N_samples: 11
> > [cluster_id=30] N_children: 0 N_samples: 13
> > [cluster_id=31] N_children: 0 N_samples: 13
> > [cluster_id=32] N_children: 0 N_samples: 8
> > [cluster_id=33] N_children: 0 N_samples: 5
> [cluster_id=34] N_children: 3 N_samples: 34
> > [cluster_id=35] N_children: 0 N_samples: 17
> > [cluster_id=36] N_children: 0 N_samples: 4
> > [cluster_id=37] N_children: 0 N_samples: 13
> [cluster_id=38] N_children: 7 N_samples: 103
> > [cluster_id=39] N_children: 0 N_samples: 22
> > [cluster_id=40] N_children: 0 N_samples: 22
> > [cluster_id=41] N_children: 0 N_samples: 12
> > [cluster_id=42] N_children: 0 N_samples: 9
> > [cluster_id=43] N_children: 0 N_samples: 14
> > [cluster_id=44] N_children: 0 N_samples: 15
> > [cluster_id=45] N_children: 0 N_samples: 9
> [cluster_id=46] N_children: 12 N_samples: 172
> > [cluster_id=47] N_children: 0 N_samples: 19
> > [cluster_id=48] N_children: 0 N_samples: 9
> > [cluster_id=49] N_children: 0 N_samples: 13
> > [cluster_id=50] N_children: 0 N_samples: 16
> > [cluster_id=51] N_children: 0 N_samples: 18
> > [cluster_id=52] N_children: 0 N_samples: 10
> > [cluster_id=53] N_children: 0 N_samples: 17
> > [cluster_id=54] N_children: 0 N_samples: 11
> > [cluster_id=55] N_children: 0 N_samples: 12
> > [cluster_id=56] N_children: 0 N_samples: 19
> > [cluster_id=57] N_children: 0 N_samples: 17
> > [cluster_id=58] N_children: 0 N_samples: 11
> [cluster_id=59] N_children: 18 N_samples: 297
> > [cluster_id=60] N_children: 0 N_samples: 21
> > [cluster_id=61] N_children: 0 N_samples: 22
> > [cluster_id=62] N_children: 0 N_samples: 18
> > [cluster_id=63] N_children: 0 N_samples: 17
> > [cluster_id=64] N_children: 0 N_samples: 17
> > [cluster_id=65] N_children: 0 N_samples: 20
> > [cluster_id=66] N_children: 0 N_samples: 20
> > [cluster_id=67] N_children: 0 N_samples: 13
> > [cluster_id=68] N_children: 0 N_samples: 10
> > [cluster_id=69] N_children: 0 N_samples: 12
> > [cluster_id=70] N_children: 0 N_samples: 21
> > [cluster_id=71] N_children: 0 N_samples: 14
> > [cluster_id=72] N_children: 0 N_samples: 22
> > [cluster_id=73] N_children: 0 N_samples: 13
> > [cluster_id=74] N_children: 0 N_samples: 11
> > [cluster_id=75] N_children: 0 N_samples: 23
> > [cluster_id=76] N_children: 0 N_samples: 7
> > [cluster_id=77] N_children: 0 N_samples: 16
We have a hierarchy 2 levels deep, with 78 sub-clusters and a total of 1000 samples.
For instance, let’s consider the subcluster with cluster_id=34
.
We can access it by id with the flattened representation of the hierarchy,
sc = htree.flatten()[34]
print(sc)
Out:
<BirchSubcluster 2b8d8ddc1db0: "{'document_id': [], 'cluster_id': 34, 'document_id_accumulated': [46, 134, 136, 156, 195, 217, 219, 348, 423, 579, 963, 713, 879, 911, 912, 925, 939, 5, 521, 961, 975, 20, 722, 459, 759, 783, 829, 785, 834, 882, 928, 943, 978, 982], 'cluster_size': 34}">
* parent: BirchSubcluster[subcluster_id=0]
* children: [BirchSubcluster[cluster_id=35], BirchSubcluster[cluster_id=36], BirchSubcluster[cluster_id=37]]
Each subcluster is a dictionary linked inside the hierarchy via the
parent / children attributes
(cf documentation of BirchSubcluster
).
The ids of the samples contained in a subcluster are stored under the
document_id_accumulated
key. We can perform any calculations with the
samples in a given cluster by indexing the original dataset X,
print('cluster_id', sc['cluster_id'])
print('document_id_accumulated', sc['document_id_accumulated'])
sc_centroid = X[sc['document_id_accumulated'], :].mean(axis=0)
print(sc_centroid)
Out:
cluster_id 34
document_id_accumulated [46, 134, 136, 156, 195, 217, 219, 348, 423, 579, 963, 713, 879, 911, 912, 925, 939, 5, 521, 961, 975, 20, 722, 459, 759, 783, 829, 785, 834, 882, 928, 943, 978, 982]
[ 1.28769333 -6.89092161 -4.06263732 -2.3882217 -0.58254266 5.00808135
-6.66734732 0.42977283 2.16344421 -9.97065391]
For instance, we can select only subclusters that are one level deep
(this includes cluster_id=34
) and compute their centroids,
htree_depth_1 = [sc for sc in htree.flatten() if sc.current_depth == 1]
for sc in htree_depth_1:
sc['centroid'] = X[sc['document_id_accumulated'], :].mean(axis=0)
print('Centroid for cluster_id=34:\n', htree.flatten()[34]['centroid'])
Out:
Centroid for cluster_id=34:
[ 1.28769333 -6.89092161 -4.06263732 -2.3882217 -0.58254266 5.00808135
-6.66734732 0.42977283 2.16344421 -9.97065391]
Custom calculations in the hierarchy¶
While for a number of computations, it is sufficient to iterate through
a flattened tree, sometimes the output of the calculation need to account
for data from any number of other subclusters in the tree (e.g. all the
children). In this case we can subclass
BirchSubcluster
to add out custom recursive
function. Here we will add a function that for any subcluster computes the
the maximum depth of the tree spanned by its children subclusters,
from freediscovery.cluster import BirchSubcluster
class NewBirchSubcluster(BirchSubcluster):
@property
def max_tree_depth(self):
if self.children:
return 1 + max(sc.max_tree_depth for sc in self.children)
else:
return 0
by re-wrapping the cluster hierarchy with this container, we get,
htree_new, _ = birch_hierarchy_wrapper(cluster_model,
container=NewBirchSubcluster)
print('Tree depth from the root node:', htree_new.max_tree_depth)
print('Tree depth from cluster_id=34:', htree_new.flatten()[34].max_tree_depth)
Out:
Tree depth from the root node: 2
Tree depth from cluster_id=34: 1
Total running time of the script: ( 0 minutes 0.263 seconds)