freediscovery.dupdet.SimhashDuplicates

class freediscovery.dupdet.SimhashDuplicates(hash_func=u'murmurhash3_int_u32', hash_func_nbytes=32)

Find near duplicates using simhash-py

Parameters:

hash_func (str or function, default='murmurhash3_int_u32') – the hashing function used to hash documents. Possible values are “murmurhash3_int_u32” or a custom function.
hash_func_nbytes (int, default=32) – expected size of the hash produced by hash_func
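For orientation, the sketch below shows the generic simhash idea: hash every token, then take a bitwise majority vote over those hashes. It is not the code used by freediscovery or simhash-py. The helper names `token_hash` and `simhash` are hypothetical, and `hashlib.blake2b` stands in for the murmurhash3 function named above purely so that the example has no third-party dependencies.

```python
import hashlib


def token_hash(token, nbits=64):
    """Hash a token to an unsigned nbits-bit integer.

    Hypothetical stand-in for hash_func; blake2b is an assumption,
    not the hash used by the library.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=nbits // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens, nbits=64):
    """Combine token hashes into one fingerprint by per-bit majority vote."""
    votes = [0] * nbits
    for token in tokens:
        h = token_hash(token, nbits)
        for bit in range(nbits):
            # +1 when the token hash sets this bit, -1 otherwise
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when more token hashes set it than cleared it.
    return sum(1 << bit for bit in range(nbits) if votes[bit] > 0)


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()

fp_a = simhash(doc_a)
fp_b = simhash(doc_b)
# Number of differing bits between the two fingerprints
print(bin(fp_a ^ fp_b).count("1"))
```

Because the vote is taken per bit, the fingerprint does not depend on token order, and documents sharing most tokens tend to produce fingerprints that differ in few bits.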

__init__(hash_func=u'murmurhash3_int_u32', hash_func_nbytes=32)

Methods

__init__([hash_func, hash_func_nbytes])

fit(X[, y]) X: list of n_features-dimensional data points. Each row corresponds to a single data point.

get_index_by_hash(shash) Get document index by hash

get_params([deep]) Get parameters for this estimator.

query([distance, blocks]) Find all the nearest neighbours for the dataset

set_params(**params) Set the parameters of this estimator.

fit(X, y=None)

Parameters:	X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns:	self – Returns self.
Return type:	object

get_index_by_hash(shash)

Get document index by hash

Parameters:	shash (uint64) – a simhash value
Returns:	index – a document index
Return type:	int

get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params – Parameter names mapped to their values.
Return type:	mapping of string to any

query(distance=2, blocks=u'auto')

Find all the nearest neighbours for the dataset

Parameters:

distance (int, default=2) – Maximum number of different bits in the simhash
blocks (int or 'auto', default='auto') – number of blocks into which the simhash is split when searching for duplicates, see https://github.com/seomoz/simhash-py
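The two parameters can be related with a small pure-Python sketch. It shows the Hamming distance that `distance` bounds, and the pigeonhole argument behind splitting a hash into blocks: if two hashes differ in at most d bits and are split into b blocks, at least b - d blocks are identical. The helpers `hamming_distance` and `split_blocks` are hypothetical names, the contiguous equal-width block layout is an assumption, and none of this is simhash-py's actual search code.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def split_blocks(value, n_blocks, nbits=64):
    """Split an nbits-wide integer into n_blocks contiguous bit fields.

    Assumes nbits is divisible by n_blocks (illustrative layout only).
    """
    width = nbits // n_blocks
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(n_blocks)]


a = 0xDEADBEEFCAFEF00D
# Flip bit 0 and bit 40, so the two values differ in exactly two bits
b = a ^ ((1 << 0) | (1 << 40))

d = hamming_distance(a, b)
n_blocks = 4
matching = sum(
    x == y for x, y in zip(split_blocks(a, n_blocks), split_blocks(b, n_blocks))
)

print(d, matching)  # 2 differing bits, 2 of the 4 blocks still identical
```

A search can therefore compare only candidates that share some blocks exactly, rather than every pair of hashes.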

Returns:

simhash (array) – the simhash value for all documents in the collection
cluster_id (array) – the exact duplicates (documents with the same simhash) share the same value in cluster_id
dup_pairs (list) – list of tuples for the near-duplicates
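To make the shape of these return values concrete, here is a brute-force sketch that builds a `cluster_id` list and a `dup_pairs` list from a handful of made-up hash values. Several points are assumptions rather than documented behaviour: the function name `brute_force_query` is hypothetical, pairs are reported as document indices, and exact duplicates (distance 0) are left out of the pairs. The library's own search is block-based and not quadratic like this one.

```python
from itertools import combinations


def brute_force_query(simhashes, distance=2):
    """Illustrative O(n^2) stand-in for a near-duplicate query."""
    # Documents with an identical simhash receive the same cluster id.
    first_seen = {}
    cluster_id = [first_seen.setdefault(h, len(first_seen)) for h in simhashes]

    # Near-duplicates: pairs whose hashes differ in 1..distance bits
    # (assumption: exact duplicates are covered by cluster_id only).
    dup_pairs = [
        (i, j)
        for i, j in combinations(range(len(simhashes)), 2)
        if 0 < bin(simhashes[i] ^ simhashes[j]).count("1") <= distance
    ]
    return cluster_id, dup_pairs


# Made-up 4-bit values standing in for real simhashes
hashes = [0b1111, 0b1111, 0b1110, 0b0000]
cluster_id, dup_pairs = brute_force_query(hashes, distance=2)
print(cluster_id)  # [0, 0, 1, 2]
print(dup_pairs)   # [(0, 2), (1, 2)]
```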

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self
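The `<component>__<parameter>` convention can be sketched with a toy class. `Toy` is hypothetical and far simpler than the scikit-learn base class that provides get_params and set_params. It only shows how a double-underscore name is routed to a nested object and how `deep=True` exposes nested parameters.

```python
class Toy:
    """Minimal stand-in for an estimator with get_params / set_params."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self, deep=True):
        out = dict(self._params)
        if deep:
            # Expose parameters of nested objects as <component>__<parameter>
            for name, value in self._params.items():
                if isinstance(value, Toy):
                    for sub, v in value.get_params(deep=True).items():
                        out[f"{name}__{sub}"] = v
        return out

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub = key.partition("__")
            if sep:
                # Delegate the remainder of the name to the nested object
                self._params[name].set_params(**{sub: value})
            else:
                self._params[name] = value
        return self


inner = Toy(distance=2)
outer = Toy(dedup=inner, verbose=False)

result = outer.set_params(dedup__distance=3, verbose=True)
print(outer.get_params()["dedup__distance"])  # 3
```

Returning `self` from `set_params` is what allows calls to be chained.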