freediscovery.cluster.BirchSubcluster

class freediscovery.cluster.BirchSubcluster(**args)[source]

A container class for BIRCH cluster hierarchy

This is a dict-like container used to store each subcluster in the cluster hierarchy computed by freediscovery.cluster.Birch. A given subcluster links to its parent and children subclusters in the hierarchy.

Each subcluster stores the following dictionary keys:

  • document_id: list, the document / sample ids contained in this subcluster (excluding its children).
  • document_id_accumulated: list, the document / sample ids contained in this subcluster and its children. Only available when this class was built using birch_hierarchy_wrapper() with the compute_document_id=True parameter. It can be re-computed with the document_id_accumulated property.
  • cluster_size: int, the number of samples contained in this subcluster and its children. This corresponds to the length of the document_id_accumulated property. Only available when this class was built using birch_hierarchy_wrapper() with the compute_document_id=True parameter.

Other keys may be user-computed as necessary.
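The relationship between these keys can be illustrated with a small stand-in class. The sketch below does not use FreeDiscovery itself: ToySubcluster and its children attribute are hypothetical names chosen for illustration, and the real class may compute these values differently.

```python
class ToySubcluster(dict):
    """Hypothetical stand-in for a dict-like subcluster (not the FreeDiscovery class)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.children = []  # assumed attribute name for the child subclusters

    @property
    def document_id_accumulated(self):
        # Ids stored directly in this node, followed by those of all descendants.
        ids = list(self.get("document_id", []))
        for child in self.children:
            ids.extend(child.document_id_accumulated)
        return ids


root = ToySubcluster(document_id=[0])
leaf_a = ToySubcluster(document_id=[1, 2])
leaf_b = ToySubcluster(document_id=[3])
root.children.extend([leaf_a, leaf_b])

# Store the derived values as ordinary dictionary keys.
for node in (root, leaf_a, leaf_b):
    node["document_id_accumulated"] = node.document_id_accumulated
    node["cluster_size"] = len(node["document_id_accumulated"])

# A user-computed key is added like any other dict entry.
root["label"] = "top-level cluster"

print(root["cluster_size"])  # 4
```

Here cluster_size is simply the length of the accumulated id list, which matches the description of the keys above.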

See User Manual for more details.

Note

This class descends from freediscovery.externals.jwzthreading.Container, originally used to represent e-mail threads obtained with the JWZ algorithm in jwzthreading, though it is general enough to represent other hierarchical structures, such as a BIRCH cluster hierarchy.

In FreeDiscovery this class is primarily used for documents. As a result, the variables / methods containing the term “document” have the same meaning as “sample” in the general scikit-learn context.

add_child(child)[source]

Add a child to the container

Parameters:child (Container) – Child to add.
clear() → None. Remove all items from D.
copy() → a shallow copy of D
current_depth

Compute the depth in the hierarchy of the current container

display_tree(max_depth=None)[source]

Print the content of hierarchical tree below this subcluster

document_count

Count of all documents in the children subclusters

document_id_accumulated

Returns the list of document / sample ids contained in this subcluster or any of its children.

flatten()[source]

Return a flattened version of the hierarchical tree

Returns:list – a flat list of containers
Return type:Containers
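
As a rough illustration of what flattening a hierarchy means, the sketch below uses a hypothetical ToyContainer class with an assumed children attribute. Whether the real method includes the current container, and in which order it visits the tree, is not stated here; the pre-order traversal that includes the starting node is an assumption.

```python
class ToyContainer(dict):
    """Hypothetical stand-in for a hierarchical container (not the library class)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.children = []  # assumed attribute name

    def flatten(self):
        # Pre-order traversal: this node first, then each subtree in turn.
        nodes = [self]
        for child in self.children:
            nodes.extend(child.flatten())
        return nodes


root = ToyContainer(name="root")
left = ToyContainer(name="left")
right = ToyContainer(name="right")
grandchild = ToyContainer(name="grandchild")
root.children.extend([left, right])
left.children.append(grandchild)

flat_names = [node["name"] for node in root.flatten()]
print(flat_names)  # ['root', 'left', 'grandchild', 'right']
```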
fromkeys()

Returns a new dict with keys from iterable and values equal to value.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_descendant(ctr)[source]

Check if ctr is a descendant of this container.

Parameters:ctr (Container) – possible descendant container.
Returns:True if ctr is a descendant of self, else False.
Return type:bool
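
A descendant check of this kind can be sketched as a depth-first search over the children. ToyContainer and its children attribute are hypothetical names, and this sketch assumes that a container is not its own descendant, which may differ from the library's behaviour.

```python
class ToyContainer:
    """Hypothetical stand-in for a hierarchical container (not the library class)."""

    def __init__(self, name):
        self.name = name
        self.children = []  # assumed attribute name

    def has_descendant(self, ctr):
        # Iterative depth-first search over everything below this node.
        stack = list(self.children)
        while stack:
            node = stack.pop()
            if node is ctr:
                return True
            stack.extend(node.children)
        return False


root = ToyContainer("root")
child = ToyContainer("child")
grandchild = ToyContainer("grandchild")
unrelated = ToyContainer("unrelated")
root.children.append(child)
child.children.append(grandchild)

print(root.has_descendant(grandchild))  # True
print(root.has_descendant(unrelated))   # False
```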
increment_cluster_id(value)[source]

Increment the cluster_id of all children by the given value

is_dummy

Check if the container has some content.

items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
limit_depth(max_depth=None)[source]

Truncate the tree to the provided maximum depth

Parameters:max_depth (int) – hierarchy depth to which the tree is truncated
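
The idea of depth truncation can be sketched as follows. The conventions used here are assumptions and not taken from the library: the node on which the method is called is at depth 0, max_depth=None leaves the tree unchanged, and the truncation happens in place. ToyContainer, children and depth_below are hypothetical names.

```python
class ToyContainer:
    """Hypothetical stand-in for a hierarchical container (not the library class)."""

    def __init__(self, name):
        self.name = name
        self.children = []  # assumed attribute name

    def limit_depth(self, max_depth=None):
        # Assumed convention: this node is depth 0; None means no truncation.
        if max_depth is None:
            return
        if max_depth <= 0:
            self.children = []  # drop everything below the cut-off
            return
        for child in self.children:
            child.limit_depth(max_depth - 1)

    def depth_below(self):
        # Helper for the demonstration: number of levels under this node.
        if not self.children:
            return 0
        return 1 + max(child.depth_below() for child in self.children)


root = ToyContainer("root")
mid = ToyContainer("mid")
leaf = ToyContainer("leaf")
root.children.append(mid)
mid.children.append(leaf)

depth_before = root.depth_below()
root.limit_depth(max_depth=1)
depth_after = root.depth_below()
print(depth_before, depth_after)  # 2 1
```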
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

remove_child(child)[source]

Remove a child from the container

Parameters:child (Container) – Child to remove.
root

Get the root container

Returns:the top-most level container
Return type:Container
setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
tree_size

Recursively count the number of children containers. The current container is also included in the count.
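
A recursive count that includes the current container can be sketched as below. ToyContainer and its children attribute are hypothetical stand-ins rather than the library's implementation.

```python
class ToyContainer:
    """Hypothetical stand-in for a hierarchical container (not the library class)."""

    def __init__(self):
        self.children = []  # assumed attribute name

    @property
    def tree_size(self):
        # 1 for this container, plus the size of every subtree below it.
        return 1 + sum(child.tree_size for child in self.children)


root = ToyContainer()
child_a = ToyContainer()
child_b = ToyContainer()
grandchild = ToyContainer()
root.children.extend([child_a, child_b])
child_a.children.append(grandchild)

print(root.tree_size)  # 4
```

A container without children therefore has a tree_size of 1 in this sketch.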

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
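
The three cases described above are standard dict behaviour and can be checked on a plain Python dict. The key names below are only illustrative.

```python
d = {"cluster_size": 3}

# E has a .keys() method (a mapping): each key is copied over.
d.update({"document_id": [0, 1]})

# E lacks .keys() (an iterable of pairs): each (k, v) pair is assigned.
d.update([("label", "a")])

# E and F together: keyword arguments in F are applied after E.
d.update({"cluster_size": 5}, label="b")

print(d)  # {'cluster_size': 5, 'document_id': [0, 1], 'label': 'b'}
```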

values() → an object providing a view on D's values
