freediscovery.dupdet.SimhashDuplicates
class freediscovery.dupdet.SimhashDuplicates(hash_func=u'murmurhash3_int_u32', hash_func_nbytes=32)
Find near duplicates using simhash-py.
Parameters:
- hash_func (str or function, default='murmurhash3_int_u32') – the hashing function used to hash documents. Possible values are “murmurhash3_int_u32” or a custom function.
- hash_func_nbytes (int, default=32) – expected size of the hash produced by hash_func
Methods
__init__([hash_func, hash_func_nbytes])
fit(X[, y])               X: list of n_features-dimensional data points; each row corresponds to a single data point.
get_index_by_hash(shash)  Get document index by hash.
get_params([deep])        Get parameters for this estimator.
query([distance, blocks]) Find all the nearest neighbours in the dataset.
set_params(**params)      Set the parameters of this estimator.
fit(X, y=None)
Parameters: X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns: self – Returns self.
Return type: object
get_index_by_hash(shash)
Get document index by hash.
Parameters: shash (uint64) – a simhash value
Returns: index – a document index
Return type: int
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
query(distance=2, blocks=u'auto')
Find all the nearest neighbours in the dataset.
Parameters:
- distance (int, default=2) – maximum number of bits that may differ between two simhashes
- blocks (int or 'auto', default='auto') – number of blocks into which the simhash is split when searching for duplicates, see https://github.com/seomoz/simhash-py
Returns:
- simhash (array) – the simhash value for all documents in the collection
- cluster_id (array) – cluster labels; exact duplicates (documents with the same simhash) share the same cluster_id
- dup_pairs (list) – list of tuples for the near-duplicates
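To make the distance parameter concrete, here is a minimal pure-Python sketch of comparing hashes by the number of differing bits. It is not the FreeDiscovery or simhash-py implementation: the helper names hamming_distance and near_duplicate_pairs and the 8-bit sample hashes are hypothetical, and the search is a brute-force comparison of all pairs rather than the block-based lookup that simhash-py uses. Whether exact duplicates also appear in dup_pairs is not stated above; this sketch includes them.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Return the number of bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(hashes, distance=2):
    """Brute-force search (hypothetical helper, O(n^2) comparisons).

    Returns the index pairs (i, j), i < j, whose hashes differ in at most
    `distance` bits. Pairs with identical hashes (distance 0) are included.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(hashes), 2)
        if hamming_distance(a, b) <= distance
    ]


# Made-up 8-bit hashes: items 0 and 3 are identical, item 1 differs from
# them by one bit, item 2 differs from item 0 in every bit.
hashes = [0b10110010, 0b10110011, 0b01001101, 0b10110010]
pairs = near_duplicate_pairs(hashes, distance=2)
print(pairs)
```

A larger distance admits more pairs as near-duplicates, at the cost of more false positives.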
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
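The <component>__<parameter> naming convention can be illustrated with a toy sketch. The ToyEstimator class and the parameter names dedup, distance and verbose are hypothetical and exist only for this example; this is not the FreeDiscovery or scikit-learn code, only an illustration of how a double-underscore key can be routed to a nested component.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator, used only for illustration."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Nested key: forward the remainder to the named component.
                self.params[name].set_params(**{sub_key: value})
            else:
                # Plain key: set the parameter on this object.
                self.params[name] = value
        return self


inner = ToyEstimator(distance=2)
outer = ToyEstimator(dedup=inner, verbose=False)

# Update a parameter of the nested component and one of the outer object.
outer.set_params(dedup__distance=3, verbose=True)
print(inner.params["distance"], outer.params["verbose"])
```

Returning self, as in the documented method, allows calls to be chained.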