freediscovery.dupdet.IMatchDuplicates
class freediscovery.dupdet.IMatchDuplicates(n_rand_lexicons=1, rand_lexicon_ratio=0.7)
Find near duplicates using the randomized I-Match backend.
This class aims to expose a scikit-learn compatible API.
Parameters:
- n_rand_lexicons – number of random lexicons used for duplicate detection (default 1). If equal to 1, no lexicon randomization is used, which is equivalent to the original I-Match implementation by Chowdhury et al. (2002).
- rand_lexicon_ratio – ratio of the vocabulary used in random lexicons (default 0.7).
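To make the two parameters concrete, here is a minimal standard-library sketch of I-Match with lexicon randomization. It is not the freediscovery implementation. The helper names (`imatch_signature`, `make_lexicons`, `find_duplicates`), the use of SHA-1, the `seed` argument, and the choice of the full vocabulary as the base lexicon are all assumptions made for illustration. The original method instead selects the base lexicon from collection statistics.

```python
import hashlib
import random


def imatch_signature(tokens, lexicon):
    """Hash the sorted set of tokens that also occur in the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def make_lexicons(vocabulary, n_rand_lexicons=1, rand_lexicon_ratio=0.7, seed=0):
    """Return the base lexicon plus (n_rand_lexicons - 1) random subsets.

    Assumption: with n_rand_lexicons=1 only the base lexicon is used,
    which corresponds to plain I-Match without randomization.
    """
    vocab = sorted(vocabulary)
    rng = random.Random(seed)
    lexicons = [set(vocab)]
    n_keep = int(len(vocab) * rand_lexicon_ratio)
    for _ in range(n_rand_lexicons - 1):
        lexicons.append(set(rng.sample(vocab, n_keep)))
    return lexicons


def find_duplicates(docs, lexicons):
    """Label documents so that any two sharing a signature under at
    least one lexicon receive the same cluster label."""
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for lexicon in lexicons:
        first_seen = {}
        for idx, tokens in enumerate(docs):
            sig = imatch_signature(tokens, lexicon)
            if sig in first_seen:
                parent[find(idx)] = find(first_seen[sig])
            else:
                first_seen[sig] = idx
    return [find(i) for i in range(len(docs))]


docs = [["a", "b", "c"], ["c", "b", "a", "a"], ["a", "b", "d"]]
vocabulary = {"a", "b", "c", "d"}
labels = find_duplicates(docs, make_lexicons(vocabulary, n_rand_lexicons=1))
```

With a single lexicon, the first two documents receive the same label because they contain the same set of terms, and the third stays separate. Raising `n_rand_lexicons` adds signatures computed on random vocabulary subsets, so documents that differ only in terms left out of some subset can still be matched.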
References
- Kołcz & Chowdhury (2008) - Lexicon randomization for near-duplicate detection with I-Match.
- Chowdhury et al. (2002) - Collection statistics for fast duplicate document detection.
Methods
- __init__([n_rand_lexicons, rand_lexicon_ratio])
- fit(X[, y]): X is a list of n_features-dimensional data points; each row corresponds to a single data point.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
fit(X, y=None)
Parameters:
- X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns: self (object) – Returns self.
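As an illustration of the expected input layout, the sketch below converts a small dense matrix of shape (n_samples, n_features) into the three arrays of the CSR format. The helper `dense_to_csr` is hypothetical and written only with the standard library. Treating the values as per-document term counts is also an assumption. In practice a matrix of this kind would typically be built with `scipy.sparse.csr_matrix`.

```python
def dense_to_csr(rows):
    """Convert a dense list of rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(col)   # column (feature) index of that value
        indptr.append(len(data))      # marks where the next row starts in data
    return data, indices, indptr


# Two samples (rows) and three features (columns).
X = [
    [1, 0, 2],
    [0, 0, 3],
]
data, indices, indptr = dense_to_csr(X)
```

Row `i` of the matrix is stored in `data[indptr[i]:indptr[i + 1]]`, which is why each row can be read back as a single data point.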
get_params(deep=True)
Get parameters for this estimator.
Parameters:
- deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params (mapping of string to any) – Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.
Returns: self
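The `get_params` / `set_params` convention described above can be sketched with a toy class. `ToyEstimator` and its `inner` parameter are hypothetical and exist only to demonstrate the `<component>__<parameter>` form. They are not part of freediscovery, and this is not scikit-learn's actual implementation.

```python
class ToyEstimator:
    """Hypothetical estimator mimicking the get_params/set_params convention."""

    # 'inner' is an illustrative nested component, not a real parameter
    # of IMatchDuplicates.
    _param_names = ("n_rand_lexicons", "rand_lexicon_ratio", "inner")

    def __init__(self, n_rand_lexicons=1, rand_lexicon_ratio=0.7, inner=None):
        self.n_rand_lexicons = n_rand_lexicons
        self.rand_lexicon_ratio = rand_lexicon_ratio
        self.inner = inner

    def get_params(self, deep=True):
        params = {name: getattr(self, name) for name in self._param_names}
        if deep:
            # Expose parameters of nested estimators as <component>__<parameter>.
            for name in self._param_names:
                value = params[name]
                if hasattr(value, "get_params"):
                    for sub_name, sub_value in value.get_params(deep=True).items():
                        params[f"{name}__{sub_name}"] = sub_value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if name not in self._param_names:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                # Delegate '<component>__<parameter>' to the nested object.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


outer = ToyEstimator(inner=ToyEstimator())
outer.set_params(n_rand_lexicons=5, inner__rand_lexicon_ratio=0.5)
params = outer.get_params()
```

Because `set_params` returns the estimator itself, calls can be chained, and `get_params(deep=False)` returns only the top-level parameters.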