API

This is the API reference for the FreeDiscovery Python package.

Datasets

freediscovery.datasets.load_dataset([name, ...]) Download a benchmark dataset.

Feature extraction

freediscovery.feature_weighting.SmartTfidfTransformer([...]) TF-IDF weighting and normalization with the SMART IR notation
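
As an illustration of what a SMART weighting code means, the sketch below applies the "ltc" scheme (logarithmic term frequency, inverse document frequency, cosine normalization) in plain Python. It does not call FreeDiscovery: the function name `smart_ltc` and the toy documents are assumptions made for this example, and the options actually supported by `SmartTfidfTransformer` may differ.

```python
import math
from collections import Counter


def smart_ltc(docs):
    """Weight tokenized documents with the SMART "ltc" scheme (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # "l": 1 + log(tf); "t": log(N / df).
        vec = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # "c": cosine (L2) normalization of the document vector.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return weighted


# Toy corpus, invented for illustration only.
docs = [["apple", "apple", "pear"], ["apple", "fig"], ["apple", "kiwi", "kiwi"]]
vectors = smart_ltc(docs)
```

A term that occurs in every document ("apple" here) receives an IDF of zero, so only the rarer terms contribute to each normalized vector.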

Categorization

freediscovery.neighbors.NearestNeighborRanker([...]) A nearest neighbor ranker.

Clustering

`freediscovery.cluster.Birch`([threshold, ...])	Non-online version of the Birch clustering algorithm
`freediscovery.cluster.BirchSubcluster`(*args)	A container class for BIRCH cluster hierarchy
`freediscovery.cluster.birch_hierarchy_wrapper`(birch)	Wrap BIRCH cluster hierarchy with a container class
`freediscovery.cluster.ClusterLabels`(vect, model)	Calculate the cluster labels.

Near Duplicates detection

`freediscovery.near_duplicates.SimhashNearDuplicates`([...])	Near duplicates detection using the simhash algorithm.
`freediscovery.near_duplicates.IMatchNearDuplicates`([...])	Near duplicates detection using the randomized I-Match algorithm.
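
To make the simhash idea concrete, here is a minimal standalone sketch of fingerprinting and comparison. It is not the `SimhashNearDuplicates` implementation: the whitespace tokenization, the MD5-based 64-bit token hash, and the helper names `simhash` and `hamming_distance` are all assumptions chosen for this example.

```python
import hashlib

N_BITS = 64  # fingerprint width assumed for this sketch


def simhash(text):
    """Return a 64-bit simhash fingerprint of whitespace-separated tokens."""
    counts = [0] * N_BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(N_BITS):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = simhash("the quick brown fox jumps over the lazy dog")
copy = simhash("the quick brown fox jumps over the lazy dog")
distance = hamming_distance(original, copy)
```

Documents whose fingerprints lie within a small Hamming distance of each other are treated as near-duplicate candidates; the threshold is a tuning choice.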

Semantic search

freediscovery.search.Search(vectorizer, tfidf) (Semantic) search in a document collection

IO

freediscovery.io.parse_smart_tokens(text) Parse a dataset stored in the SMART tokenized format, used in particular for the RCV1-v2 dataset, http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm (cf.
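
For orientation, the following sketch parses text laid out the way the RCV1-v2 token files are commonly described: a `.I <id>` line opens each document, a `.W` line follows, and the tokens come on the next lines. This layout, the function name `parse_smart_tokens_sketch`, and the list-of-pairs return value are assumptions for illustration; the return type of `freediscovery.io.parse_smart_tokens` is not specified here and may differ.

```python
def parse_smart_tokens_sketch(text):
    """Parse '.I <id>' / '.W' blocks into a list of (doc_id, tokens) pairs."""
    documents = []
    doc_id, tokens = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(".I "):
            # A new document starts: store the previous one, if any.
            if doc_id is not None:
                documents.append((doc_id, tokens))
            doc_id, tokens = int(line[3:]), []
        elif line == ".W" or not line:
            # Skip the text marker and blank separator lines.
            continue
        else:
            tokens.extend(line.split())
    if doc_id is not None:
        documents.append((doc_id, tokens))
    return documents


# Made-up sample, not taken from the real dataset.
sample = ".I 1\n.W\nstock market rise\nmarket\n\n.I 2\n.W\nrain forecast\n"
parsed = parse_smart_tokens_sketch(sample)
```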

Metrics

This module aims to extend sklearn.metrics with a few additional metrics.

freediscovery.metrics.recall_at_k_score(...) Recall after retrieving k documents from the collections
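
The idea behind recall at k can be sketched without the package: rank the documents by decreasing score, keep the first k, and divide the number of relevant documents retrieved by the total number of relevant documents. The function `recall_at_k` below, its argument names, and the toy data are assumptions for this example; the exact signature of `recall_at_k_score` (for instance, whether k is a count or a fraction) is not stated here.

```python
def recall_at_k(y_true, y_score, k):
    """Fraction of all relevant documents found among the k highest-scored ones."""
    n_relevant = sum(1 for label in y_true if label == 1)
    if n_relevant == 0:
        raise ValueError("recall is undefined when there are no relevant documents")
    # Indices of documents sorted by decreasing score.
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if y_true[i] == 1)
    return hits / n_relevant


# Toy relevance labels (1 = relevant) and scores, invented for illustration.
y_true = [1, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.2, 0.1]
recall_top2 = recall_at_k(y_true, y_score, k=2)
recall_top3 = recall_at_k(y_true, y_score, k=3)
```

With these values, the top two documents contain one of the three relevant items and the top three contain two of them.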