API

This is the API reference for the FreeDiscovery Python package.

Datasets

freediscovery.datasets.load_dataset([name, ...]) Download a benchmark dataset.

Feature extraction

freediscovery.feature_weighting.SmartTfidfTransformer([...]) TF-IDF weighting and normalization with the SMART IR notation
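
To make the SMART notation concrete, the following is a minimal pure-Python sketch of one common SMART scheme, "ltc" (logarithmic term frequency, inverse document frequency, cosine normalization). It does not call FreeDiscovery and is not the library's implementation; the function name `smart_ltc` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def smart_ltc(docs):
    """Weight tokenized documents with the SMART "ltc" scheme.

    l: logarithmic term frequency, 1 + ln(tf)
    t: inverse document frequency, ln(N / df)
    c: cosine (L2) normalization of each document vector

    `smart_ltc` is a hypothetical helper, not part of FreeDiscovery.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        vec = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Guard against all-zero vectors (every term occurs in every document).
        weighted.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return weighted


if __name__ == "__main__":
    toy_docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "dog"]]
    for vec in smart_ltc(toy_docs):
        print(vec)
```

Each letter of a SMART triple selects one stage (term frequency, document frequency, normalization), so other schemes differ only in which formula is applied at each of the three steps.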

Categorization

freediscovery.neighbors.NearestNeighborRanker([...]) A nearest neighbor ranker.
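
As a rough illustration of what ranking by nearest neighbors can mean, the sketch below scores each candidate document by its cosine similarity to the closest positive training document minus its similarity to the closest negative one. This scoring rule, the helper names `cosine` and `nn_rank`, and the 0/1 label convention are all assumptions; the sketch does not reproduce the behaviour or signature of `NearestNeighborRanker`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nn_rank(train_vectors, train_labels, candidates):
    """Rank candidates from most to least likely positive.

    Hypothetical scoring rule: similarity to the nearest positive
    example minus similarity to the nearest negative example.
    Returns (order, scores), where order lists candidate indices.
    """
    positives = [v for v, y in zip(train_vectors, train_labels) if y == 1]
    negatives = [v for v, y in zip(train_vectors, train_labels) if y == 0]
    scores = []
    for x in candidates:
        best_pos = max((cosine(x, p) for p in positives), default=0.0)
        best_neg = max((cosine(x, n) for n in negatives), default=0.0)
        scores.append(best_pos - best_neg)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    order, scores = nn_rank(
        [[1.0, 0.0], [0.0, 1.0]], [1, 0],
        [[0.1, 1.0], [1.0, 0.1], [1.0, 1.0]],
    )
    print(order, scores)
```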

Clustering

freediscovery.cluster.Birch([threshold, ...]) Non-online version of the Birch clustering algorithm
freediscovery.cluster.BirchSubcluster(**args) A container class for BIRCH cluster hierarchy
freediscovery.cluster.birch_hierarchy_wrapper(birch) Wrap BIRCH cluster hierarchy with a container class
freediscovery.cluster.ClusterLabels(vect, model) Calculate the cluster labels.

Near-duplicate detection

freediscovery.near_duplicates.SimhashNearDuplicates([...]) Near-duplicate detection using the simhash algorithm.
freediscovery.near_duplicates.IMatchNearDuplicates([...]) Near-duplicate detection using the randomized I-Match algorithm.
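
The simhash idea behind the first class above can be sketched in a few lines of standard-library Python. Every token votes on each bit of a 64-bit fingerprint, and documents whose fingerprints differ in only a few bits are treated as near duplicates. The use of MD5 as the token hash, the unweighted votes, and the names `simhash` and `hamming_distance` are assumptions for illustration; FreeDiscovery's own implementation may differ in all of these.

```python
import hashlib

N_BITS = 64  # fingerprint width used in this sketch


def simhash(tokens):
    """Compute a 64-bit simhash fingerprint of an iterable of tokens."""
    counts = [0] * N_BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(N_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << bit for bit in range(N_BITS) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog".split()
    edited = "the quick brown fox jumped over the lazy dog".split()
    print(hamming_distance(simhash(original), simhash(edited)))
```

A near-duplicate search then reduces to finding fingerprints within a small Hamming distance of a query fingerprint; the distance threshold is a tuning choice.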

IO

freediscovery.io.parse_smart_tokens(text) Parse a dataset stored in the SMART tokenized format, used in particular for the RCV1-v2 dataset, http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm (cf.

Metrics

This module aims to extend sklearn.metrics with a few additional metrics.

freediscovery.metrics.recall_at_k_score(...) Recall after retrieving k documents from the collection
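
The metric described above can be illustrated with a small standalone sketch: sort the documents by decreasing score, keep the first k, and divide the number of relevant documents retrieved by the total number of relevant documents. The function name `recall_at_k`, its argument order, and the restriction to an integer k are assumptions; the actual signature of `recall_at_k_score` is elided in this reference and is not reproduced here.

```python
def recall_at_k(y_true, y_score, k):
    """Fraction of all relevant documents found among the k top-scored ones.

    y_true  : list of 0/1 relevance labels
    y_score : list of scores, higher meaning more likely relevant
    k       : number of documents retrieved (integer, assumption of this sketch)
    """
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of documents")
    n_relevant = sum(y_true)
    if n_relevant == 0:
        raise ValueError("y_true contains no relevant documents")
    # Indices sorted by decreasing score; ties keep their original order.
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    return sum(y_true[i] for i in ranked[:k]) / n_relevant


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.2, 0.1]
    for k in (2, 3, 5):
        print(k, recall_at_k(labels, scores, k))
```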