freediscovery.cluster.Clustering

class freediscovery.cluster.Clustering(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Document clustering

The algorithms are adapted from scikit-learn.

The option use_hashing=False must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0, binary=0.

Parameters:
  • cache_dir (str) – directory where to save temporary and regression files
  • dsid (str, optional) – dataset id
  • mid (str, optional) – model id
__init__(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Methods

__init__([cache_dir, dsid, mid])
birch(n_clusters[, threshold, lsi_components]) Perform Birch clustering
compute_labels([label_method, n_top_words, ...]) Compute the cluster labels
dbscan([n_clusters, eps, min_samples, ...]) Perform DBSCAN clustering
delete() Delete a trained model
get_dsid(cache_dir, mid)
get_params() Get model parameters
get_path(mid)
k_means(n_clusters[, lsi_components, batch_size]) Perform K-means clustering
list_models()
load(mid) Load results from cache specified by a mid
scores(ref_labels, labels)
ward_hc(n_clusters[, lsi_components, ...]) Perform Ward hierarchical clustering
birch(n_clusters, threshold=0.5, lsi_components=None)[source]

Perform Birch clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • threshold (float) – the Birch threshold
compute_labels(label_method=u'centroid-frequency', n_top_words=6, cluster_indices=None)[source]

Compute the cluster labels

Parameters:
  • label_method (str, default='centroid-frequency') – the method used for computing the cluster labels
  • n_top_words (int, default=6) – keep only the n_top_words most relevant words
  • cluster_indices (list, default=None) – if not None, ignore the clustering computed by the model and instead compute labels for the clusters defined by the given indices
Returns:

cluster_labels

Return type:

array [n_samples]
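The default 'centroid-frequency' method presumably labels each cluster by the highest-weighted terms of its centroid. A minimal pure-Python sketch under that assumption (the function name, the raw term-count representation, and the example data are illustrative, not FreeDiscovery's implementation):

```python
from collections import Counter

def centroid_frequency_labels(clusters, n_top_words=6):
    """For each cluster (a list of documents, each a term->count mapping),
    average the term counts into a centroid and keep the n_top_words
    highest-weighted terms as the cluster label."""
    labels = []
    for docs in clusters:
        centroid = Counter()
        for doc in docs:
            centroid.update(doc)
        # average the counts over the cluster to form the centroid
        for term in centroid:
            centroid[term] /= len(docs)
        labels.append([term for term, _ in centroid.most_common(n_top_words)])
    return labels

# hypothetical example data: two clusters of two tiny documents each
clusters = [
    [{"court": 3, "judge": 2, "ruling": 1}, {"court": 2, "ruling": 2}],
    [{"merger": 4, "shares": 1}, {"merger": 2, "shares": 3, "deal": 1}],
]
print(centroid_frequency_labels(clusters, n_top_words=2))
# → [['court', 'ruling'], ['merger', 'shares']]
```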

dbscan(n_clusters=None, eps=0.5, min_samples=10, algorithm=u'auto', leaf_size=30, lsi_components=None)[source]

Perform DBSCAN clustering

This can also be used for Duplicate Detection (when eps is set to a small value).

Parameters:
  • n_clusters (int) – number of clusters (not used; present only for API compatibility)
  • lsi_components (int) – apply LSA before the clustering algorithm
  • eps (float) –
    The maximum distance between two samples for them to be considered
    as in the same neighborhood.
  • min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
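Since the algorithms are adapted from scikit-learn, dbscan presumably wraps sklearn.cluster.DBSCAN. To illustrate the algorithm itself, and why a small eps groups near-duplicates, here is a minimal pure-Python sketch (quadratic-time, not FreeDiscovery's implementation):

```python
def dbscan_sketch(points, eps=0.5, min_samples=10):
    """Minimal DBSCAN over tuples using Euclidean distance.
    Returns one label per point; -1 marks noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1       # not a core point: provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_samples:   # j is itself a core point: expand
                queue.extend(j_neighbors)
    return labels

# hypothetical 1-D "documents": two groups of near-duplicates plus one outlier
docs = [(0.0,), (0.01,), (0.02,), (5.0,), (5.01,), (9.0,)]
print(dbscan_sketch(docs, eps=0.05, min_samples=2))
# → [0, 0, 0, 1, 1, -1]
```

With a small eps, only near-identical samples fall in the same neighborhood, which is what makes DBSCAN usable for duplicate detection.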
delete()[source]

Delete a trained model

get_params()[source]

Get model parameters

k_means(n_clusters, lsi_components=None, batch_size=1000)[source]

Perform K-means clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • batch_size (int) – the batch size for the MiniBatchKMeans algorithm
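The batch_size parameter refers to the mini-batch update rule of MiniBatchKMeans: each iteration, centroids move toward a random batch with a per-centroid learning rate. A minimal pure-Python sketch of that update (illustrative only; the deterministic initialization from the first points is a simplification — scikit-learn defaults to k-means++):

```python
import random

def mini_batch_kmeans(points, n_clusters, batch_size=10, n_iter=50, seed=0):
    """Sketch of mini-batch k-means: per-centroid counts give a
    decaying learning rate, so centroids converge as they see more data."""
    rng = random.Random(seed)
    # simplified deterministic init from the first n_clusters points
    centroids = [list(p) for p in points[:n_clusters]]
    counts = [0] * n_clusters

    def nearest(p):
        return min(range(n_clusters),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            k = nearest(p)
            counts[k] += 1
            lr = 1.0 / counts[k]          # learning rate decays with cluster size
            centroids[k] = [(1 - lr) * c + lr * x for c, x in zip(centroids[k], p)]
    return [nearest(p) for p in points]   # final hard assignment

# hypothetical data: two well-separated blobs
blobs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
print(mini_batch_kmeans(blobs, n_clusters=2))
# → [0, 0, 0, 1, 1, 1]
```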
load(mid)[source]

Load results from cache specified by a mid

scores(ref_labels, labels)[source]
Parameters:
  • ref_labels (list) – reference labels
  • labels (list) – computed labels
ward_hc(n_clusters, lsi_components=None, n_neighbors=10)[source]

Perform Ward hierarchical clustering

Parameters:
  • n_clusters (int) – number of clusters
  • lsi_components (int) – apply LSA before the clustering algorithm
  • n_neighbors (int) – number of nearest neighbors used for computing the connectivity matrix
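The n_neighbors parameter controls a nearest-neighbors connectivity graph that constrains which samples Ward hierarchical clustering may merge (scikit-learn's agglomerative clustering accepts such a connectivity matrix). A minimal sketch of how such a graph can be built from pairwise distances (illustrative, not FreeDiscovery's implementation):

```python
def knn_connectivity(points, n_neighbors=10):
    """Link each sample to its n_neighbors nearest neighbors and return
    the resulting undirected edge list (pairs of sample indices)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    edges = set()
    for i in range(len(points)):
        # sort the other samples by distance and keep the closest n_neighbors
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(points[i], points[j]))[:n_neighbors]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

print(knn_connectivity([(0.0,), (1.0,), (10.0,)], n_neighbors=1))
# → [(0, 1), (1, 2)]
```

Restricting merges to this graph keeps the hierarchy local: distant samples can only end up in the same cluster through a chain of near neighbors.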