Python API Reference

FreeDiscovery contains two types of classes:

scikit-learn compatible estimators that inherit from sklearn.base.BaseEstimator
FreeDiscovery-specific classes that add a persistence layer and are designed to work together with the REST API

Datasets

`freediscovery.datasets.load_dataset`([name, ...])	Download a benchmark dataset.

`freediscovery.parsers.EmailParser`([...])	Parse emails

`freediscovery.text.FeatureVectorizer`([...])	Extract features from text documents

`freediscovery.categorization.Categorizer`([...])	Document categorization model
`freediscovery.lsi.LSI`([cache_dir, dsid, ...])	Document categorization using Latent Semantic Indexing (LSI)

`freediscovery.cluster.Clustering`([...])	Document clustering
`freediscovery.cluster.ClusterLabels`(vect, ...)	Calculate the cluster labels.
`freediscovery.cluster._DendrogramChildren`(ddata)	Compute children for a given dendrogram node
`freediscovery.cluster.utils._binary_linkage2clusters`(...)	Given a list of elements of size n_sample and a linkage matrix
`freediscovery.cluster.utils._merge_clusters`(X)	Compute a union of all clusters

`freediscovery.dupdet.DuplicateDetection`([...])	Find near duplicates in a document collection.
`freediscovery.dupdet.SimhashDuplicates`([...])	Find near duplicates using simhash-py
`freediscovery.dupdet.IMatchDuplicates`([...])	Find near duplicates using the randomized I-match backend
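
The simhash technique named above can be illustrated with a minimal, standard-library sketch. It does not use simhash-py or any FreeDiscovery class; the function names, the whitespace tokenization and the MD5-based token hash are all assumptions made for illustration.

```python
import hashlib

N_BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text):
    """Compute a 64-bit simhash fingerprint from whitespace-separated tokens."""
    votes = [0] * N_BITS
    for token in text.lower().split():
        # Stable 64-bit hash of the token: first 8 bytes of its MD5 digest
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 for bits it sets and -1 for bits it clears
        for bit in range(N_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when its votes are strictly positive
    return sum(1 << bit for bit in range(N_BITS) if votes[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quarterly report is attached for your review"
doc_b = "the quarterly report is attached for your approval"
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Two documents would typically be treated as near duplicates when `distance` falls below a chosen threshold; a suitable threshold depends on the collection and is not specified here.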

`freediscovery.threading.EmailThreading`([...])	JWZ Email threading class

`freediscovery.base.BaseEstimator`
`freediscovery.io.parse_ground_truth_file`(...)	Parse a ground truth file specified by a filename.
`freediscovery.utils.generate_uuid`()	Generate a unique id for the model
`freediscovery.utils.setup_model`(base_path)	Generate a unique model id and create the corresponding folder for storing results
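
As a rough illustration of what the two utilities above describe, the following sketch generates a unique id and creates a matching folder using only the standard library. The names `generate_uuid_sketch` and `setup_model_sketch`, the hexadecimal id format and the returned tuple are hypothetical; the actual FreeDiscovery functions may differ.

```python
import os
import tempfile
import uuid


def generate_uuid_sketch():
    """Return a random 32-character hexadecimal id (hypothetical stand-in)."""
    return uuid.uuid4().hex


def setup_model_sketch(base_path):
    """Create base_path/<model id>/ and return (model_id, model_dir)."""
    model_id = generate_uuid_sketch()
    model_dir = os.path.join(base_path, model_id)
    # Raises FileExistsError in the unlikely event of an id collision
    os.makedirs(model_dir)
    return model_id, model_dir


# Usage: create a model folder inside a temporary base directory
with tempfile.TemporaryDirectory() as base_path:
    model_id, model_dir = setup_model_sketch(base_path)
    created = os.path.isdir(model_dir)
```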

This module extends sklearn.metrics with a few additional metrics:

`freediscovery.metrics.ratio_duplicates_score`(x, y)	Given cluster labels x and y, compute the relative error between the number of duplicates in x and the number in y.
`freediscovery.metrics.f1_same_duplicates_score`(x, y)	Given cluster labels x and y, compute the F1 score
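
The summaries above leave the exact definitions open, so the following is only one plausible reading, not the library's implementation. It assumes that a sample counts as a duplicate when its cluster label occurs more than once, that the first argument is the reference labelling, and that the F1 score is computed on the sets of samples flagged as duplicates. All function names are hypothetical.

```python
from collections import Counter


def duplicate_indices(labels):
    """Indices of samples whose cluster label occurs more than once."""
    counts = Counter(labels)
    return {i for i, label in enumerate(labels) if counts[label] > 1}


def relative_duplicates_error(labels_ref, labels_other):
    """Relative error in the duplicate count, with labels_ref as reference."""
    n_ref = len(duplicate_indices(labels_ref))
    n_other = len(duplicate_indices(labels_other))
    if n_ref == 0:
        raise ValueError("reference labels contain no duplicates")
    return abs(n_other - n_ref) / n_ref


def duplicates_f1(labels_ref, labels_other):
    """F1 score on the sets of samples flagged as duplicates (assumption)."""
    ref = duplicate_indices(labels_ref)
    other = duplicate_indices(labels_other)
    common = len(ref & other)
    if common == 0:
        return 0.0
    precision = common / len(other)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


x = [0, 0, 1, 1, 2, 3]  # samples 0 to 3 are duplicates
y = [0, 0, 1, 2, 3, 4]  # only samples 0 and 1 are duplicates
error = relative_duplicates_error(x, y)
f1 = duplicates_f1(x, y)
```

With these inputs, `x` has four duplicate samples and `y` has two, so `error` is 0.5; precision is 1.0 and recall is 0.5, giving an F1 of 2/3.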

`freediscovery.exceptions.NotFound`([message])
`freediscovery.exceptions.DatasetNotFound`([...])
`freediscovery.exceptions.ModelNotFound`([message])
`freediscovery.exceptions.InitException`([message])
`freediscovery.exceptions.WrongParameter`([...])
`freediscovery.exceptions.NotImplementedFD`([...])
`freediscovery.exceptions.OptionalDependencyMissing`([...])