freediscovery.feature_weighting.SmartTfidfTransformer

class freediscovery.feature_weighting.SmartTfidfTransformer(weighting='nsc', norm_alpha=0.75, norm_pivot=None, compute_df=False, copy=True)

TF-IDF weighting and normalization with the SMART IR notation

This class is similar to sklearn.feature_extraction.text.TfidfTransformer but supports a larger number of TF-IDF weighting and normalization schemes. It should be fitted on the document-term matrix computed by sklearn.feature_extraction.text.CountVectorizer.

The TF-IDF transform consists of three successive operations, selected by the weighting parameter:

  1. Term frequency weighting:

    natural (n), log (l), augmented (a), boolean (b), log average (L)

  2. Document frequency weighting:

    none (n), idf (t), smoothed idf (s), probabilistic (p), smoothed probabilistic (d)

  3. Document normalization:

    none (n), cosine (c), length (l), unique (u).
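
The term frequency options in step 1 can be illustrated with a small pure-Python sketch. It uses the textbook SMART definitions from Manning et al.; the function name `tf_weights` is hypothetical, and keeping absent terms at zero (as a sparse matrix would) is an assumption, so the library's own implementation may differ in detail.

```python
import math


def tf_weights(counts, scheme="n"):
    """Weight the raw term counts of one document (textbook SMART forms).

    `counts` is a list of non-negative term counts for a single document.
    Terms with a zero count keep a zero weight (assumption: sparse input).
    """
    if scheme not in ("n", "l", "a", "b", "L"):
        raise ValueError("unknown term frequency scheme: %r" % scheme)

    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        return [0.0] * len(counts)
    max_tf = max(nonzero)
    avg_tf = sum(nonzero) / len(nonzero)

    def weight(tf):
        if tf == 0:
            return 0.0
        if scheme == "n":   # natural: raw count
            return float(tf)
        if scheme == "l":   # log: 1 + log(tf)
            return 1.0 + math.log(tf)
        if scheme == "a":   # augmented: 0.5 + 0.5 * tf / max(tf)
            return 0.5 + 0.5 * tf / max_tf
        if scheme == "b":   # boolean: 1 if the term is present
            return 1.0
        # log average: (1 + log(tf)) / (1 + log(mean tf))
        return (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf))

    return [weight(tf) for tf in counts]
```

For instance, `tf_weights([2, 0, 1], "a")` scales the counts relative to the largest count in the document, while `"b"` only records whether each term is present.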

Following the SMART IR notation, the weighting parameter is written as the concatenation of three characters, one for each processing step. In addition, pivoted normalization can be enabled with a fourth character, p.

See the TF-IDF schemes documentation section for more details.
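
To make the notation concrete, here is a minimal sketch that splits a weighting string into its components. The helper `parse_smart_notation` and its pattern are hypothetical, built only from the letters listed in the three steps above; they are not part of the library's API.

```python
import re

# One letter per step, taken from the lists above, plus an optional
# trailing "p" for pivoted normalization.
_SMART_PATTERN = re.compile(r"([nlabL])([ntspd])([nclu])(p?)")


def parse_smart_notation(weighting):
    """Split a SMART weighting string such as 'nsc' or 'lnup' into its parts."""
    match = _SMART_PATTERN.fullmatch(weighting)
    if match is None:
        raise ValueError("invalid SMART notation: %r" % weighting)
    tf, df, norm, pivot = match.groups()
    return {"tf": tf, "df": df, "norm": norm, "pivoted": pivot == "p"}
```

With this sketch, the default `'nsc'` reads as natural term frequency, smoothed idf and cosine normalization, without pivoting.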

Parameters:
  • weighting (str, default='nsc') – the SMART notation for term frequency weighting, document frequency weighting and normalization, in the form [nlabL][ntspd][nclu][p].
  • norm_alpha (float, default=0.75) – the α parameter in the pivoted normalization. This parameter is only used when weighting='???p'.
  • norm_pivot (float, default=None) – the pivot value used for the normalization. If not provided, it is computed as the mean of norm(tf*idf). This parameter is only used when weighting='???p'.
  • compute_df (bool, default=False) – compute the document frequency (df_ attribute) even when it’s not explicitly required by the weighting scheme.
  • copy (boolean, default=True) – Whether to copy the input array and operate on the copy or perform in-place operations in fit and transform.
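
The roles of norm_alpha and norm_pivot can be sketched as follows. This assumes the pivoted form from Singhal et al., (1 - α)·pivot + α·norm, applied to per-document norms; the function name `pivoted_norms` is hypothetical, and the library's exact expression is not shown in this page.

```python
def pivoted_norms(norms, alpha=0.75, pivot=None):
    """Return pivoted normalization factors for a list of document norms.

    Assumed form (Singhal et al.): (1 - alpha) * pivot + alpha * norm.
    When `pivot` is None, the mean of the norms is used, mirroring the
    documented default of norm_pivot.
    """
    if pivot is None:
        pivot = sum(norms) / len(norms)
    return [(1.0 - alpha) * pivot + alpha * norm for norm in norms]
```

Under this assumption, α = 1 leaves the norms unchanged, while smaller values pull every document's normalization factor toward the pivot, which reduces the penalty on long documents.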

References

[Manning2008] C.D. Manning, P. Raghavan, H. Schütze, “Document and query weighting schemes”, 2008
[Singhal1996] A. Singhal, C. Buckley, and M. Mitra, “Pivoted document length normalization”, 1996
fit(X, y=None)

Learn the document length and document frequency vector (if necessary).

Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts
fit_transform(X, y=None)

Apply document term weighting and normalization on text features.

Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts
get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Return type: self
transform(X, y=None)

Apply document term weighting and normalization on text features.

Parameters:
  • X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts
  • copy (boolean, default True) – Whether to copy X and operate on the copy or perform in-place operations.

Examples using freediscovery.feature_weighting.SmartTfidfTransformer