freediscovery.dupdet.DuplicateDetection

class freediscovery.dupdet.DuplicateDetection(cache_dir=u'/tmp/', dsid=None, mid=None)[source]

Find near duplicates in a document collection.

Currently supported backends are simhash-py and i-match.

The option use_hashing=False must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0 and binary=0.

Parameters:
  • cache_dir (str) – directory where temporary and regression files are saved
  • dsid (str, optional) – dataset id
  • mid (str, optional) – model id
__init__(cache_dir=u'/tmp/', dsid=None, mid=None)[source]
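
Example (a minimal sketch; 'my_dsid' is a hypothetical dataset id produced by a prior feature extraction run with use_hashing=False, as noted above):

  from freediscovery.dupdet import DuplicateDetection

  # 'my_dsid' is a placeholder dataset id; the corresponding features must
  # have been extracted with use_hashing=False
  dd = DuplicateDetection(cache_dir='/tmp/', dsid='my_dsid')
  dd.fit(method='simhash')                          # precompute the simhashes
  cluster_id = dd.query(distance=2, blocks='auto')  # group near duplicates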

Methods

__init__([cache_dir, dsid, mid])
delete() Delete a trained model
fit([method]) Precompute all the required values for duplicate detection
get_dsid(cache_dir, mid)
get_params() Get model parameters
get_path(mid)
list_models()
load(mid) Load results from cache specified by a mid
query(**args) Find all the nearest neighbours in the dataset
delete()[source]

Delete a trained model

fit(method=u'simhash')[source]

Precompute all the required values for duplicate detection
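
A short sketch of selecting a backend (the default 'simhash' backend requires the simhash-py package; the exact method string for the I-Match backend is an assumption):

  from freediscovery.dupdet import DuplicateDetection

  dd = DuplicateDetection(cache_dir='/tmp/', dsid='my_dsid')  # placeholder dsid
  dd.fit(method='simhash')    # default backend, uses simhash-py
  # dd.fit(method='i-match')  # I-Match backend; method string assumed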

get_params()[source]

Get model parameters

load(mid)[source]

Load results from cache specified by a mid

query(**args)[source]

Find all the nearest neighbours in the dataset

Parameters:
  • distance (int, default=2) – maximum number of differing bits between two simhashes
  • blocks (int or 'auto', default='auto') – number of blocks into which the simhash is split when searching for duplicates; see https://github.com/seomoz/simhash-py
Returns:

cluster_id – exact duplicates (documents with the same simhash) share the same cluster_id value

Return type:

array
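
For example, the returned array can be used to group duplicate documents (a sketch; the cluster_id values below are purely illustrative):

  import numpy as np

  # cluster_id as returned by query(); documents sharing a value are
  # considered duplicates of each other (illustrative values)
  cluster_id = np.array([0, 1, 0, 2, 1])

  for cid in np.unique(cluster_id):
      members = np.where(cluster_id == cid)[0]
      if len(members) > 1:
          print('cluster %s: documents %s' % (cid, members.tolist()))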