Python API Reference¶
Two types of classes can be found in FreeDiscovery:
- scikit-learn compatible estimators that inherit from sklearn.base.BaseEstimator
- FreeDiscovery-specific classes that add a persistence layer and are designed to function together with the REST API.
Datasets¶
freediscovery.datasets.load_dataset ([name, ...]) |
Download a benchmark dataset. |
Parsers¶
freediscovery.parsers.EmailParser ([...]) |
Parse emails |
Feature extraction¶
freediscovery.text.FeatureVectorizer ([...]) |
Extract features from text documents |
Document categorization¶
freediscovery.categorization.Categorizer ([...]) |
Document categorization model |
freediscovery.lsi.LSI ([cache_dir, dsid, ...]) |
Document categorization using Latent Semantic Indexing (LSI) |
Document clustering¶
freediscovery.cluster.Clustering ([...]) |
Document clustering |
freediscovery.cluster.ClusterLabels (vect, ...) |
Calculate the cluster labels. |
freediscovery.cluster._DendrogramChildren (ddata) |
Compute children for a given dendrogram node |
freediscovery.cluster.utils._binary_linkage2clusters (...) |
Given a list of elements of size n_sample and a linkage matrix |
freediscovery.cluster.utils._merge_clusters (X) |
Compute a union of all clusters |
Duplicate detection¶
freediscovery.dupdet.DuplicateDetection ([...]) |
Find near duplicates in a document collection. |
freediscovery.dupdet.SimhashDuplicates ([...]) |
Find near duplicates using simhash-py |
freediscovery.dupdet.IMatchDuplicates ([...]) |
Find near duplicates using the randomized I-match backend |
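To illustrate the idea behind simhash-based near-duplicate detection, here is a minimal pure-Python sketch. It is not the simhash-py API and not how SimhashDuplicates is implemented; the function names simhash and hamming_distance, the whitespace tokenization, and the MD5-based token hash are all assumptions made for this example.

```python
import hashlib

N_BITS = 64  # fingerprint width assumed for this sketch


def simhash(text):
    """Return a 64-bit simhash fingerprint of the whitespace tokens in text."""
    votes = [0] * N_BITS
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer (first 8 bytes of MD5).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Every token votes +1 or -1 on each bit position.
        for bit in range(N_BITS):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A fingerprint bit is set when the votes for that position are positive.
    return sum(1 << bit for bit in range(N_BITS) if votes[bit] > 0)


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Documents whose fingerprints differ in only a few bits would be treated as near duplicates; the distance threshold is a tuning choice that depends on the collection.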
Email threading¶
freediscovery.threading.EmailThreading ([...]) |
JWZ Email threading class |
Tools¶
freediscovery.base.BaseEstimator |
|
freediscovery.io.parse_ground_truth_file (...) |
Parse a ground truth file specified by a filename. |
freediscovery.utils.generate_uuid () |
Generate a unique id for the model |
freediscovery.utils.setup_model (base_path) |
Generate a unique model id and create the corresponding folder for storing results |
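The two utilities above describe a common pattern: generate a unique identifier, then create a folder named after it to hold a model's results. The sketch below shows that pattern using only the standard library. The names generate_model_id and create_model_dir are hypothetical, and the actual return values and folder layout used by freediscovery.utils may differ.

```python
import os
import tempfile
import uuid


def generate_model_id():
    """Return a random 32-character hexadecimal identifier (hypothetical helper)."""
    return uuid.uuid4().hex


def create_model_dir(base_path):
    """Create a uniquely named sub-folder of base_path.

    Returns the (model_id, model_dir) pair. This mirrors the documented
    behaviour only loosely and is an assumption, not the library's code.
    """
    model_id = generate_model_id()
    model_dir = os.path.join(base_path, model_id)
    os.makedirs(model_dir)
    return model_id, model_dir


# Usage: create a model folder under a temporary base directory.
base_path = tempfile.mkdtemp()
model_id, model_dir = create_model_dir(base_path)
```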
Metrics¶
This module aims to extend sklearn.metrics with a few additional metrics:
freediscovery.metrics.ratio_duplicates_score (x, y) |
Given cluster labels x and y, compute the relative error between the number of duplicates in x and the number of duplicates in y. |
freediscovery.metrics.f1_same_duplicates_score (x, y) |
Given cluster labels x and y, compute the F1 score |
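The summaries above do not spell out the exact definitions, so the following is only one plausible reading, sketched in plain Python. It assumes a sample counts as a duplicate when its cluster label occurs more than once, that x holds the reference labels, and that the relative error is normalized by the duplicate count in x. The function names are hypothetical and the real freediscovery.metrics functions may compute something different.

```python
from collections import Counter


def duplicate_mask(labels):
    """Mark each sample whose cluster label is shared with another sample."""
    counts = Counter(labels)
    return [counts[label] > 1 for label in labels]


def relative_duplicate_error(x, y):
    """Relative error between the duplicate counts of x (reference) and y."""
    n_x = sum(duplicate_mask(x))
    n_y = sum(duplicate_mask(y))
    if n_x == 0:
        # No duplicates in the reference: error is zero only if y agrees.
        return 0.0 if n_y == 0 else float("inf")
    return abs(n_x - n_y) / n_x


def duplicate_f1(x, y):
    """F1 score of the 'is a duplicate' flag in y against the flag in x."""
    true_mask = duplicate_mask(x)
    pred_mask = duplicate_mask(y)
    tp = sum(t and p for t, p in zip(true_mask, pred_mask))
    fp = sum(p and not t for t, p in zip(true_mask, pred_mask))
    fn = sum(t and not p for t, p in zip(true_mask, pred_mask))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


x = [0, 0, 1, 2, 2]  # reference cluster labels
y = [0, 0, 1, 2, 3]  # predicted cluster labels
error = relative_duplicate_error(x, y)
f1 = duplicate_f1(x, y)
```

With these inputs, x flags four samples as duplicates and y flags two of the same four, so under the stated assumptions the relative error is 0.5 and the F1 score is 2/3.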