freediscovery.dupdet.DuplicateDetection

class freediscovery.dupdet.DuplicateDetection(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Find near duplicates in a document collection.

Currently supported backends are simhash-py and i-match.

The option use_hashing=False must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0 and binary=0.

Parameters:
  • cache_dir (str) – directory where temporary and regression files are saved
  • dsid (str, optional) – dataset id
  • mid (str, optional) – model id
__init__(cache_dir=u'/tmp/', dsid=None, mid=None)[source]
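
Example (a minimal sketch; 'my_dsid' is a hypothetical dataset id produced by a prior feature extraction run with use_hashing=False, as noted above):

  from freediscovery.dupdet import DuplicateDetection

  # 'my_dsid' is a placeholder dataset id; the corresponding features must
  # have been extracted with use_hashing=False
  dd = DuplicateDetection(cache_dir='/tmp/', dsid='my_dsid')
  dd.fit(method='simhash')                          # precompute the simhashes
  cluster_id = dd.query(distance=2, blocks='auto')  # group near duplicates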

Methods

__init__([cache_dir, dsid, mid])
delete() Delete a trained model
fit([method]) Precompute all the required values for duplicate detection
get_dsid(cache_dir, mid)
get_params() Get model parameters
get_path(mid)
list_models()
load(mid) Load results from cache specified by a mid
query(**args) Find all the nearest neighbours in the dataset
delete()[source]

Delete a trained model

fit(method=u'simhash')[source]

Precompute all the required values for duplicate detection
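
A short sketch of selecting a backend (the default 'simhash' backend requires the simhash-py package; the exact method string for the I-Match backend is an assumption):

  from freediscovery.dupdet import DuplicateDetection

  dd = DuplicateDetection(cache_dir='/tmp/', dsid='my_dsid')  # placeholder dsid
  dd.fit(method='simhash')    # default backend, uses simhash-py
  # dd.fit(method='i-match')  # I-Match backend; method string assumed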

get_params()[source]

Get model parameters

load(mid)[source]

Load results from cache specified by a mid

query(**args)[source]

Find all the nearest neighbours in the dataset

Parameters:
  • distance (int, default=2) – maximum number of differing bits between two simhashes
  • blocks (int or 'auto', default='auto') – number of blocks into which the simhash is split when searching for duplicates; see https://github.com/seomoz/simhash-py
Returns:

cluster_id – exact duplicates (documents with the same simhash) share the same cluster_id value

Return type:

array
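
For example, the returned array can be used to group duplicate documents (a sketch; the cluster_id values below are purely illustrative):

  import numpy as np

  # cluster_id as returned by query(); documents sharing a value are
  # considered duplicates of each other (illustrative values)
  cluster_id = np.array([0, 1, 0, 2, 1])

  for cid in np.unique(cluster_id):
      members = np.where(cluster_id == cid)[0]
      if len(members) > 1:
          print('cluster %s: documents %s' % (cid, members.tolist()))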