freediscovery.dupdet.IMatchDuplicates

class freediscovery.dupdet.IMatchDuplicates(n_rand_lexicons=1, rand_lexicon_ratio=0.7)

Find near-duplicates using the randomized I-Match backend.

This class aims to expose a scikit-learn compatible API.

Parameters:	n_rand_lexicons – number of random lexicons used for duplicate detection (default: 1). If equal to 1, no lexicon randomization is used, which is equivalent to the original I-Match implementation by Chowdhury et al. (2002).
	rand_lexicon_ratio – ratio of the vocabulary used in random lexicons (default: 0.7).
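To make the role of the two parameters more concrete, here is a minimal pure-Python sketch of the I-Match idea with optional lexicon randomization. It is not FreeDiscovery's implementation: the helper names (`imatch_signature`, `build_lexicons`, `duplicate_pairs`) are hypothetical, the full vocabulary stands in for the I-Match lexicon, and the sketch assumes that two documents count as near-duplicates when their signatures match under at least one lexicon.

```python
import hashlib
import random


def imatch_signature(tokens, lexicon):
    """Hash the sorted set of document tokens that also occur in the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def build_lexicons(vocabulary, n_rand_lexicons=1, rand_lexicon_ratio=0.7, seed=0):
    """Return the list of lexicons to hash against.

    Assumption: with n_rand_lexicons == 1 the full vocabulary is the only
    lexicon (no randomization); otherwise each lexicon is a random subset
    holding rand_lexicon_ratio of the vocabulary.
    """
    vocab = sorted(vocabulary)
    if n_rand_lexicons == 1:
        return [set(vocab)]
    rng = random.Random(seed)
    size = max(1, int(rand_lexicon_ratio * len(vocab)))
    return [set(rng.sample(vocab, size)) for _ in range(n_rand_lexicons)]


def duplicate_pairs(docs, lexicons):
    """Return index pairs of documents sharing a signature under any lexicon."""
    pairs = set()
    for lexicon in lexicons:
        by_signature = {}
        for idx, tokens in enumerate(docs):
            sig = imatch_signature(tokens, lexicon)
            by_signature.setdefault(sig, []).append(idx)
        for group in by_signature.values():
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    pairs.add((group[i], group[j]))
    return pairs


docs = [
    ["the", "quick", "brown", "fox"],
    ["fox", "brown", "quick", "the", "the"],  # same terms, different order
    ["a", "different", "text"],
]
vocabulary = {tok for doc in docs for tok in doc}

single = build_lexicons(vocabulary)  # no randomization
randomized = build_lexicons(vocabulary, n_rand_lexicons=3, rand_lexicon_ratio=0.7)
pairs_single = duplicate_pairs(docs, single)
pairs_randomized = duplicate_pairs(docs, randomized)
```

Because the signature depends only on which lexicon terms a document contains, word order and repetition do not affect it. Hashing against several reduced lexicons gives documents that differ by a few terms more chances to collide, which is the motivation for the randomized variant.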

References

Kołcz & Chowdhury (2008) - Lexicon randomization for near-duplicate detection with I-Match.
Chowdhury et al. (2002) - Collection statistics for fast duplicate document detection.

__init__(n_rand_lexicons=1, rand_lexicon_ratio=0.7)

Methods

__init__([n_rand_lexicons, rand_lexicon_ratio])

fit(X[, y])

get_params([deep]) Get parameters for this estimator.

set_params(**params) Set the parameters of this estimator.

fit(X, y=None)

Parameters:	X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns:	self – Returns self.
Return type:	object
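As an illustration of the expected input layout, the sketch below builds a small dense document-term matrix with one row per document and one column per vocabulary term. Treating the features as term counts is an assumption made for this example; the documentation above only requires n_features-dimensional data points, and for real collections a sparse (CSR) matrix would be the more practical choice.

```python
from collections import Counter

docs = ["red apple", "green apple apple", "red car"]

# Column order: one feature per distinct term, sorted for reproducibility.
vocabulary = sorted({tok for doc in docs for tok in doc.split()})

# One row per document (n_samples), one column per term (n_features).
X = [[Counter(doc.split())[term] for term in vocabulary] for doc in docs]

n_samples, n_features = len(X), len(X[0])
```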

get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params – Parameter names mapped to their values.
Return type:	mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self
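The `<component>__<parameter>` convention and the effect of `deep` can be sketched with a toy class. `ToyEstimator` is a hypothetical stand-in written for this example; it belongs to neither FreeDiscovery nor scikit-learn and only mimics the routing behaviour described above.

```python
class ToyEstimator:
    """Hypothetical estimator that stores its parameters in a dict."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self, deep=True):
        out = dict(self._params)
        if deep:
            # Expose parameters of nested estimators as <component>__<parameter>.
            for name, value in self._params.items():
                if hasattr(value, "get_params"):
                    for sub_name, sub_value in value.get_params(deep=True).items():
                        out[f"{name}__{sub_name}"] = sub_value
        return out

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if name not in self._params:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                # Delegate everything after the first "__" to the nested component.
                self._params[name].set_params(**{sub_key: value})
            else:
                self._params[name] = value
        return self


inner = ToyEstimator(n_rand_lexicons=1, rand_lexicon_ratio=0.7)
outer = ToyEstimator(dedup=inner, verbose=False)

# Update a nested parameter and a top-level parameter in one call.
returned = outer.set_params(dedup__n_rand_lexicons=5, verbose=True)

shallow = outer.get_params(deep=False)
nested = outer.get_params(deep=True)
```

Returning `self` from `set_params` is what allows calls to be chained, for example `outer.set_params(verbose=False).get_params()`.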