freediscovery.cluster.Clustering
class freediscovery.cluster.Clustering(cache_dir=u'/tmp/', dsid=None, mid=None)
Document clustering
The algorithms are adapted from scikit-learn.
Feature extraction must be run with use_hashing=False. Recommended options also include use_idf=1, sublinear_tf=0, and binary=0.
Parameters: - cache_dir (str) – directory where to save temporary and regression files
- dsid (str, optional) – dataset id
- mid (str, optional) – model id
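The recommended feature-extraction options can be illustrated with a small standard-library sketch of TF-IDF weighting. It assumes that use_idf, sublinear_tf and binary carry the same meaning as the identically named scikit-learn TfidfVectorizer parameters, and that smoothed IDF and L2 normalization are applied (scikit-learn's defaults). The tfidf helper and the toy documents are hypothetical and are not part of FreeDiscovery.

```python
import math
from collections import Counter


def tfidf(docs, use_idf=True, sublinear_tf=False, binary=False):
    """Return one {term: weight} dict per tokenized document (L2-normalized).

    Hypothetical helper for illustration only; not the FreeDiscovery API.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        weights = {}
        for term, count in Counter(doc).items():
            # binary=True replaces raw counts by presence/absence.
            tf = 1.0 if binary else float(count)
            # sublinear_tf=True dampens large counts: tf -> 1 + ln(tf).
            if sublinear_tf:
                tf = 1.0 + math.log(tf)
            # Smoothed inverse document frequency (assumed default).
            idf = (math.log((1 + n_docs) / (1 + df[term])) + 1.0) if use_idf else 1.0
            weights[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = [["contract", "payment", "contract"], ["payment", "invoice"]]
rows = tfidf(docs, use_idf=True, sublinear_tf=False, binary=False)
flat = tfidf(docs, use_idf=False, sublinear_tf=False, binary=True)
```

With the recommended settings, a term that is frequent in a document but rare in the collection ("contract") outweighs one that appears everywhere ("payment"); with binary=True and use_idf=False the two receive identical weights.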
Methods
- __init__([cache_dir, dsid, mid])
- birch(n_clusters[, threshold, lsi_components]) – Perform Birch clustering
- compute_labels([label_method, n_top_words, ...]) – Compute the cluster labels
- dbscan([n_clusters, eps, min_samples, ...]) – Perform DBSCAN clustering
- delete() – Delete a trained model
- get_dsid(cache_dir, mid)
- get_params() – Get model parameters
- get_path(mid)
- k_means(n_clusters[, lsi_components, batch_size]) – Perform K-means clustering
- list_models()
- load(mid) – Load results from cache specified by a mid
- scores(ref_labels, labels)
- ward_hc(n_clusters[, lsi_components, ...]) – Perform Ward hierarchical clustering
birch(n_clusters, threshold=0.5, lsi_components=None)
Perform Birch clustering
Parameters: - n_clusters (int) – number of clusters
- lsi_components (int) – apply LSA before the clustering algorithm
- threshold (float) – birch threshold
compute_labels(label_method=u'centroid-frequency', n_top_words=6, cluster_indices=None)
Compute the cluster labels
Parameters: - label_method (str, default='centroid-frequency') – the method used for computing the cluster labels
- n_top_words (int, default=6) – keep only the n_top_words most relevant words
- cluster_indices (list, default=None) – if not None, ignore clustering given by the clustering model and compute terms for the cluster provided by the given indices
Returns: cluster_labels
Return type: array [n_samples]
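The source does not describe how the 'centroid-frequency' method ranks terms. As a rough sketch of the general idea of labelling a cluster by its most heavily weighted terms, the example below averages the document vectors of each cluster and keeps the n_top_words highest-weighted vocabulary entries. The function top_terms_per_cluster, the toy matrix and the vocabulary are hypothetical, and the return shape (a dict) is chosen for readability rather than to match compute_labels.

```python
def top_terms_per_cluster(X, labels, vocabulary, n_top_words=6):
    """Map each cluster id to its highest-weighted centroid terms.

    Hypothetical illustration; not the FreeDiscovery implementation.
    """
    # Group document vectors by their cluster id.
    clusters = {}
    for row, label in zip(X, labels):
        clusters.setdefault(label, []).append(row)

    result = {}
    for label, rows in clusters.items():
        # Centroid = mean weight of every term over the cluster's documents.
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        ranked = sorted(range(len(vocabulary)),
                        key=lambda j: centroid[j], reverse=True)
        result[label] = [vocabulary[j] for j in ranked[:n_top_words]]
    return result


vocabulary = ["contract", "invoice", "meeting", "schedule"]
X = [
    [0.9, 0.4, 0.0, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.5],
    [0.0, 0.0, 0.6, 0.7],
]
labels = [0, 0, 1, 1]
terms = top_terms_per_cluster(X, labels, vocabulary, n_top_words=2)
```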
dbscan(n_clusters=None, eps=0.5, min_samples=10, algorithm=u'auto', leaf_size=30, lsi_components=None)
Perform DBSCAN clustering
This can also be used for Duplicate Detection (when ep
Parameters: - n_clusters (int) – number of clusters (not used; present only for compatibility)
- lsi_components (int) – apply LSA before the clustering algorithm
- eps (float) – The maximum distance between two samples for them to be considered as in the same neighborhood.
- min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
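The roles of eps and min_samples can be made concrete with a short standard-library sketch that finds the core points of a small 2-D dataset. It assumes Euclidean distance and unweighted samples; the distance metric used by this class is not stated in the source, and core_point_indices is a hypothetical helper, not part of the API.

```python
import math


def core_point_indices(points, eps=0.5, min_samples=10):
    """Return the indices of points that qualify as DBSCAN core points."""
    core = []
    for i, p in enumerate(points):
        # The eps-neighborhood includes the point itself (distance 0).
        neighborhood_size = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighborhood_size >= min_samples:
            core.append(i)
    return core


points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0)]
dense = core_point_indices(points, eps=0.5, min_samples=3)
strict = core_point_indices(points, eps=0.5, min_samples=4)
```

Here the three nearby points are core points when min_samples=3, while the isolated point at (5, 5) has only itself in its neighborhood. Raising min_samples to 4 leaves no core points at all.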
k_means(n_clusters, lsi_components=None, batch_size=1000)
Perform K-means clustering
Parameters: - n_clusters (int) – number of clusters
- lsi_components (int) – apply LSA before the clustering algorithm
- batch_size (int) – the batch size for the MiniBatchKMeans algorithm
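Since the algorithms are adapted from scikit-learn, the combination of lsi_components and batch_size can be sketched with scikit-learn directly. This is not the FreeDiscovery call sequence: it assumes scikit-learn is installed, that LSA corresponds to a TruncatedSVD projection, and that the reduced vectors are L2-normalized before clustering (a common convention, not confirmed by the source). The documents are made up for the example.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "contract payment invoice",
    "invoice payment overdue",
    "contract renewal payment",
    "meeting schedule agenda",
    "agenda for the weekly meeting",
    "schedule the team meeting",
]

# TF-IDF features with the options recommended for this class.
X = TfidfVectorizer(use_idf=True, sublinear_tf=False, binary=False).fit_transform(docs)

# LSA step: project onto a few latent components, then re-normalize.
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

# Mini-batch K-means on the reduced representation.
km = MiniBatchKMeans(n_clusters=2, batch_size=1000, n_init=3, random_state=0)
labels = km.fit_predict(X_lsa)
```

The result is one integer cluster id per document. With a batch_size larger than the collection, as here, every update step sees all documents.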
scores(ref_labels, labels)
Parameters: - ref_labels (list) – reference labels
- labels (list) – computed labels
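The source does not say which metrics scores reports. As a generic illustration of comparing reference and computed labels, and not a description of this method, the sketch below computes the pair-counting Rand index. Because it only asks whether two documents share a cluster, it is unaffected by how the cluster ids happen to be numbered. rand_index is a hypothetical helper.

```python
from itertools import combinations


def rand_index(ref_labels, labels):
    """Fraction of document pairs on which the two labelings agree."""
    if len(ref_labels) != len(labels):
        raise ValueError("ref_labels and labels must have the same length")
    pairs = list(combinations(range(len(labels)), 2))
    if not pairs:
        return 1.0
    # A pair agrees when both labelings put it together, or both split it.
    agree = sum(
        (ref_labels[i] == ref_labels[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Swapping the ids 0 and 1 leaves the score at 1.0, whereas a labeling that cuts across the reference clusters agrees on only 2 of the 6 pairs.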
ward_hc(n_clusters, lsi_components=None, n_neighbors=10)
Perform Ward hierarchical clustering
Parameters: - n_clusters (int) – number of clusters
- lsi_components (int) – apply LSA before the clustering algorithm
- n_neighbors (int) – N nearest neighbors used for computing the connectivity matrix
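The role of n_neighbors can be sketched with scikit-learn, from which the algorithms are adapted. The example assumes scikit-learn and NumPy are installed and that the connectivity matrix is a k-nearest-neighbors graph passed to Ward agglomerative clustering; how this class builds the matrix internally is not shown in the source. The synthetic two-blob dataset is made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Two well-separated groups of 10 points each.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 2)),
])

# Sparse graph linking every sample to its 10 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# Ward linkage, with merges restricted to connected samples.
model = AgglomerativeClustering(n_clusters=2, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
```

Restricting merges to neighboring samples keeps the hierarchy local and reduces the cost of agglomeration on large collections. A smaller n_neighbors gives a sparser graph and a stronger constraint.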