Python API Reference

FreeDiscovery provides two types of classes:
  • scikit-learn compatible estimators that inherit from sklearn.base.BaseEstimator
  • FreeDiscovery-specific classes that add a persistence layer and are designed to work together with the REST API.

Datasets

freediscovery.datasets.load_dataset([name, ...]) Download a benchmark dataset.

Parsers

freediscovery.parsers.EmailParser([...]) Parse emails
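
To make the parsing step concrete, here is a minimal sketch of splitting a raw message into headers and body with Python's standard email package. It is not the EmailParser implementation; the helper name parse_email and the returned keys are hypothetical and chosen only for illustration.

```python
import email


def parse_email(raw_text):
    """Split a raw RFC 822 message into a few headers and its body.

    Hypothetical helper for illustration; not part of FreeDiscovery.
    """
    msg = email.message_from_string(raw_text)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "message_id": msg["Message-ID"],
        # get_payload() returns a str only for non-multipart messages
        "body": None if msg.is_multipart() else msg.get_payload(),
    }


raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "Hi Bob, the report is attached.\n"
)
parsed = parse_email(raw)
```

Multipart messages would need a walk over their parts, which this sketch deliberately leaves out.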

Feature extraction

freediscovery.text.FeatureVectorizer([...]) Extract features from text documents

Document categorization

freediscovery.categorization.Categorizer([...]) Document categorization model
freediscovery.lsi.LSI([cache_dir, dsid, ...]) Document categorization using Latent Semantic Indexing (LSI)

Document clustering

freediscovery.cluster.Clustering([...]) Document clustering
freediscovery.cluster.ClusterLabels(vect, ...) Calculate the cluster labels.
freediscovery.cluster._DendrogramChildren(ddata) Compute children for a given dendrogram node
freediscovery.cluster.utils._binary_linkage2clusters(...) Given a list of elements of size n_sample and a linkage matrix
freediscovery.cluster.utils._merge_clusters(X) Compute a union of all clusters

Duplicate detection

freediscovery.dupdet.DuplicateDetection([...]) Find near duplicates in a document collection.
freediscovery.dupdet.SimhashDuplicates([...]) Find near duplicates using simhash-py
freediscovery.dupdet.IMatchDuplicates([...]) Find near duplicates using the randomized I-match backend
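
As background on the simhash approach, the sketch below computes a 64-bit fingerprint per document and compares fingerprints by Hamming distance; near duplicates tend to differ in only a few bits. It uses only the standard library and is not the simhash-py backend or the SimhashDuplicates class. The function names, the MD5 token hash, and the unweighted tokens are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text, n_bits=64):
    """Compute an n_bits-wide simhash fingerprint (illustrative sketch)."""
    weights = [0] * n_bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to n_bits bits (first 8 bytes of MD5 for 64 bits)
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: n_bits // 8], "big")
        # Each token votes +1 or -1 on every bit position
        for i in range(n_bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Keep the bits where the positive votes won
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp_a = simhash("The quick brown fox jumps over the lazy dog")
fp_b = simhash("The quick brown fox jumped over the lazy dog")
distance = hamming_distance(fp_a, fp_b)
```

Two documents would then be flagged as near duplicates when their distance falls below a chosen threshold; the right threshold depends on the collection.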

Email threading

freediscovery.threading.EmailThreading([...]) JWZ Email threading class
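
The core idea behind JWZ threading is to link each message to its ancestors through the Message-ID and References headers. The sketch below shows only that linking step, under simplifying assumptions: messages are plain dicts, and subject-based grouping and pruning of empty containers are omitted. It is not the EmailThreading implementation, and build_threads is a hypothetical name.

```python
from collections import defaultdict


def build_threads(messages):
    """Group messages into threads from their reference chains.

    Each message is a dict with a "message_id" key and an optional
    "references" list (oldest ancestor first). Returns a dict mapping
    each thread root id to the sorted ids of the messages in that thread.
    """
    parent = {}
    for msg in messages:
        mid = msg["message_id"]
        parent.setdefault(mid, None)
        # Each id in the chain is taken as the parent of the next one
        chain = list(msg.get("references", [])) + [mid]
        for ancestor, child in zip(chain, chain[1:]):
            parent.setdefault(ancestor, None)
            parent.setdefault(child, None)
            if parent[child] is None and ancestor != child:
                parent[child] = ancestor

    def find_root(mid):
        seen = set()  # guard against reference cycles
        while parent[mid] is not None and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for mid in parent:
        threads[find_root(mid)].append(mid)
    return {root: sorted(ids) for root, ids in threads.items()}


threads = build_threads([
    {"message_id": "a"},
    {"message_id": "b", "references": ["a"]},
    {"message_id": "c", "references": ["a", "b"]},
    {"message_id": "x"},
])
```

Ids that appear only in a References list still get a node, so replies to a message missing from the collection end up in the same thread.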

Tools

freediscovery.base.BaseEstimator
freediscovery.io.parse_ground_truth_file(...) Parse a ground truth file specified by a filename.
freediscovery.utils.generate_uuid() Generate a unique id for the model
freediscovery.utils.setup_model(base_path) Generate a unique model id and create the corresponding folder for storing results
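
The two helpers above can be pictured with the following standard-library sketch: generate an identifier, then create a folder named after it under a base path. This is not the FreeDiscovery code; the function names, the hexadecimal id format, and the returned tuple are assumptions made for illustration.

```python
import os
import tempfile
import uuid


def generate_uuid_sketch():
    """Return a random 32-character hexadecimal id (hypothetical stand-in)."""
    return uuid.uuid4().hex


def setup_model_sketch(base_path):
    """Create base_path/<new id>/ and return (model_id, model_dir).

    The return value is an assumption; the real function may differ.
    """
    model_id = generate_uuid_sketch()
    model_dir = os.path.join(base_path, model_id)
    os.makedirs(model_dir)
    return model_id, model_dir


# Try it in a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp:
    model_id, model_dir = setup_model_sketch(tmp)
    created = os.path.isdir(model_dir)
```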

Metrics

This module extends sklearn.metrics with a few additional metrics:

freediscovery.metrics.ratio_duplicates_score(x, y) Given cluster labels x and y, compute the relative error between the number of duplicates in x and the number of duplicates in y.
freediscovery.metrics.f1_same_duplicates_score(x, y) Given cluster labels x and y, compute the F1 score
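
To give a feel for what such scores measure, here is a standard-library sketch of one plausible reading of each. The exact definitions used by FreeDiscovery are not given in this reference, so everything below is an assumption: a document counts as a duplicate when it shares its label with at least one other document, x is taken as the reference labelling, and the F1 score is computed over pairs of documents that share a label. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_duplicates(labels):
    """Number of documents whose label is shared with another document."""
    counts = Counter(labels)
    return sum(c for c in counts.values() if c > 1)


def relative_duplicates_error(x, y):
    """Relative error of the duplicate count in y, with x as reference."""
    n_x, n_y = count_duplicates(x), count_duplicates(y)
    if n_x == 0:
        raise ValueError("reference labels contain no duplicates")
    return abs(n_x - n_y) / n_x


def duplicate_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(x, y):
    """F1 score of the duplicate pairs in y against those in x."""
    true_pairs, pred_pairs = duplicate_pairs(x), duplicate_pairs(y)
    if not true_pairs and not pred_pairs:
        return 1.0
    tp = len(true_pairs & pred_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


x = [0, 0, 1, 2, 2, 2]
y = [0, 0, 1, 2, 2, 3]
error = relative_duplicates_error(x, y)
f1 = pairwise_f1(x, y)
```

In this example y misses one member of the three-document cluster, so it finds 4 duplicates instead of 5 and recovers 2 of the 4 duplicate pairs.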