freediscovery.dupdet.DuplicateDetection
class freediscovery.dupdet.DuplicateDetection(cache_dir=u'/tmp/', dsid=None, mid=None)

Find near duplicates in a document collection.
Currently supported backends are simhash-py and i-match.
The option use_hashing=False must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0, and binary=0 (see the usage sketch after the parameter list).
Parameters: - cache_dir (str) – directory where temporary and regression files are saved
- dsid (str, optional) – dataset id
- mid (str, optional) – model id
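As an illustration (not part of this reference), the sketch below instantiates the class against a dataset id produced by a prior feature-extraction step; the dsid value shown is hypothetical, and the features are assumed to have been extracted with use_hashing=False as required above.

    from freediscovery.dupdet import DuplicateDetection

    # `dsid` is assumed to identify a dataset whose features were
    # extracted with use_hashing=False (and, as recommended above,
    # use_idf=1, sublinear_tf=0, binary=0). The id is hypothetical.
    dsid = "0123456789abcdef"

    dd = DuplicateDetection(cache_dir='/tmp/', dsid=dsid)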
Methods

- __init__([cache_dir, dsid, mid])
- delete() – Delete a trained model
- fit([method]) – Precompute all the required values for duplicate detection
- get_dsid(cache_dir, mid)
- get_params() – Get model parameters
- get_path(mid)
- list_models()
- load(mid) – Load results from cache specified by a mid
- query(**args) – Find all the nearest neighbours for the dataset
query(**args)

Find all the nearest neighbours for the dataset.
Parameters: - distance (int, default=2) – Maximum number of different bits in the simhash
- blocks (int or 'auto', default='auto') – Number of blocks into which the simhash is split when searching for duplicates; see https://github.com/seomoz/simhash-py
Returns: cluster_id – exact duplicates (documents with the same simhash) share the same cluster_id
Return type: array
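A minimal usage sketch, assuming dd is the DuplicateDetection instance from the earlier example and that 'simhash' (one of the backends named above) is passed to fit; the grouping step is illustrative, reconstructing duplicate groups from the returned cluster_id array.

    from collections import defaultdict

    # Assumes `dd` was built as in the earlier sketch.
    dd.fit(method='simhash')          # precompute the required values
    cluster_id = dd.query(distance=2, blocks='auto')

    # Documents sharing a cluster_id are exact duplicates (same simhash);
    # collect the indices of each group with more than one member.
    groups = defaultdict(list)
    for doc_idx, cid in enumerate(cluster_id):
        groups[cid].append(doc_idx)
    duplicate_groups = [idx for idx in groups.values() if len(idx) > 1]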