freediscovery.cluster.Clustering

class freediscovery.cluster.Clustering(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Document clustering

The algorithms are adapted from scikit-learn.

The option use_hashing=False must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0, binary=0.

Parameters:
  • cache_dir (str) – directory where to save temporary and regression files
  • dsid (str, optional) – dataset id
  • mid (str, optional) – model id
__init__(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Methods

__init__([cache_dir, dsid, mid])
birch(n_clusters[, threshold, lsi_components]) Perform Birch clustering
compute_labels([label_method, n_top_words, ...]) Compute the cluster labels
dbscan([n_clusters, eps, min_samples, ...]) Perform DBSCAN clustering
delete() Delete a trained model
get_dsid(cache_dir, mid)
get_params() Get model parameters
get_path(mid)
k_means(n_clusters[, lsi_components, batch_size]) Perform K-means clustering
list_models()
load(mid) Load results from cache specified by a mid
scores(ref_labels, labels)
ward_hc(n_clusters[, lsi_components, ...]) Perform Ward hierarchical clustering
birch(n_clusters, threshold=0.5, lsi_components=None)[source]

Perform Birch clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • threshold (float) – the Birch threshold
compute_labels(label_method=u'centroid-frequency', n_top_words=6, cluster_indices=None)[source]

Compute the cluster labels

Parameters:
  • label_method (str, default='centroid-frequency') – the method used for computing the cluster labels
  • n_top_words (int, default=6) – keep only the n_top_words most relevant words
  • cluster_indices (list, default=None) – if not None, ignore the clustering computed by the model and instead compute labels for the clusters defined by the given indices
Returns:

cluster_labels

Return type:

array [n_samples]
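The default 'centroid-frequency' method presumably labels each cluster by the highest-weighted terms of its centroid. A minimal pure-Python sketch under that assumption (the function name, the raw term-count representation, and the example data are illustrative, not FreeDiscovery's implementation):

```python
from collections import Counter

def centroid_frequency_labels(clusters, n_top_words=6):
    """For each cluster (a list of documents, each a term->count mapping),
    average the term counts into a centroid and keep the n_top_words
    highest-weighted terms as the cluster label."""
    labels = []
    for docs in clusters:
        centroid = Counter()
        for doc in docs:
            centroid.update(doc)
        # average the counts over the cluster to form the centroid
        for term in centroid:
            centroid[term] /= len(docs)
        labels.append([term for term, _ in centroid.most_common(n_top_words)])
    return labels

# hypothetical example data: two clusters of two tiny documents each
clusters = [
    [{"court": 3, "judge": 2, "ruling": 1}, {"court": 2, "ruling": 2}],
    [{"merger": 4, "shares": 1}, {"merger": 2, "shares": 3, "deal": 1}],
]
print(centroid_frequency_labels(clusters, n_top_words=2))
# → [['court', 'ruling'], ['merger', 'shares']]
```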

dbscan(n_clusters=None, eps=0.5, min_samples=10, algorithm=u'auto', leaf_size=30, lsi_components=None)[source]

Perform DBSCAN clustering

This can also be used for Duplicate Detection (when eps is set to a small value).

Parameters:
  • n_clusters (int) – number of clusters (not used; present only for API compatibility)
  • lsi_components (int) – apply LSA before the clustering algorithm
  • eps (float) –
    The maximum distance between two samples for them to be considered
    as in the same neighborhood.
  • min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
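Since the algorithms are adapted from scikit-learn, dbscan presumably wraps sklearn.cluster.DBSCAN. To illustrate the algorithm itself, and why a small eps groups near-duplicates, here is a minimal pure-Python sketch (quadratic-time, not FreeDiscovery's implementation):

```python
def dbscan_sketch(points, eps=0.5, min_samples=10):
    """Minimal DBSCAN over tuples using Euclidean distance.
    Returns one label per point; -1 marks noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1       # not a core point: provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_samples:   # j is itself a core point: expand
                queue.extend(j_neighbors)
    return labels

# hypothetical 1-D "documents": two groups of near-duplicates plus one outlier
docs = [(0.0,), (0.01,), (0.02,), (5.0,), (5.01,), (9.0,)]
print(dbscan_sketch(docs, eps=0.05, min_samples=2))
# → [0, 0, 0, 1, 1, -1]
```

With a small eps, only near-identical samples fall in the same neighborhood, which is what makes DBSCAN usable for duplicate detection.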
delete()[source]

Delete a trained model

get_params()[source]

Get model parameters

k_means(n_clusters, lsi_components=None, batch_size=1000)[source]

Perform K-means clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • batch_size (int) – the batch size for the MiniBatchKMeans algorithm
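The batch_size parameter refers to the mini-batch update rule of MiniBatchKMeans: each iteration, centroids move toward a random batch with a per-centroid learning rate. A minimal pure-Python sketch of that update (illustrative only; the deterministic initialization from the first points is a simplification — scikit-learn defaults to k-means++):

```python
import random

def mini_batch_kmeans(points, n_clusters, batch_size=10, n_iter=50, seed=0):
    """Sketch of mini-batch k-means: per-centroid counts give a
    decaying learning rate, so centroids converge as they see more data."""
    rng = random.Random(seed)
    # simplified deterministic init from the first n_clusters points
    centroids = [list(p) for p in points[:n_clusters]]
    counts = [0] * n_clusters

    def nearest(p):
        return min(range(n_clusters),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            k = nearest(p)
            counts[k] += 1
            lr = 1.0 / counts[k]          # learning rate decays with cluster size
            centroids[k] = [(1 - lr) * c + lr * x for c, x in zip(centroids[k], p)]
    return [nearest(p) for p in points]   # final hard assignment

# hypothetical data: two well-separated blobs
blobs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
print(mini_batch_kmeans(blobs, n_clusters=2))
# → [0, 0, 0, 1, 1, 1]
```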
load(mid)[source]

Load results from cache specified by a mid

scores(ref_labels, labels)[source]
Parameters:
  • ref_labels (list) – reference labels
  • labels (list) – computed labels
ward_hc(n_clusters, lsi_components=None, n_neighbors=10)[source]

Perform Ward hierarchical clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • n_neighbors (int) – number of nearest neighbors used for computing the connectivity matrix
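The n_neighbors parameter controls a nearest-neighbors connectivity graph that constrains which samples Ward hierarchical clustering may merge (scikit-learn's agglomerative clustering accepts such a connectivity matrix). A minimal sketch of how such a graph can be built from pairwise distances (illustrative, not FreeDiscovery's implementation):

```python
def knn_connectivity(points, n_neighbors=10):
    """Link each sample to its n_neighbors nearest neighbors and return
    the resulting undirected edge list (pairs of sample indices)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    edges = set()
    for i in range(len(points)):
        # sort the other samples by distance and keep the closest n_neighbors
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(points[i], points[j]))[:n_neighbors]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

print(knn_connectivity([(0.0,), (1.0,), (10.0,)], n_neighbors=1))
# → [(0, 1), (1, 2)]
```

Restricting merges to this graph keeps the hierarchy local: distant samples can only end up in the same cluster through a chain of near neighbors.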