freediscovery.feature_weighting.SmartTfidfTransformer¶
class freediscovery.feature_weighting.SmartTfidfTransformer(weighting='nsc', norm_alpha=0.75, norm_pivot=None, compute_df=False, copy=True)[source]¶

TF-IDF weighting and normalization with the SMART IR notation.

This class is similar to sklearn.feature_extraction.text.TfidfTransformer but supports a larger number of TF-IDF weighting and normalization schemes. It should be fitted on the document-term matrix computed by sklearn.feature_extraction.text.CountVectorizer.
The TF-IDF transform consists of three successive operations, determined by the weighting parameter:

1. Term frequency weighting: natural (n), log (l), augmented (a), boolean (b), log average (L)
2. Document frequency weighting: none (n), idf (t), smoothed idf (s), probabilistic (p), smoothed probabilistic (d)
3. Document normalization: none (n), cosine (c), length (l), unique (u)
Following the SMART IR notation, the weighting parameter is written as the concatenation of the three characters describing each processing step. In addition, pivoted normalization can be enabled with a fourth character p.

See the TF-IDF schemes documentation section for more details.
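As a brief illustration, here is a minimal sketch of how a weighting string maps onto these steps (the example corpus and the particular weighting choice are illustrative only, not part of the documented API):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> docs = ["the cat sat on the mat",
...         "the dog sat on the log",
...         "cats chase dogs"]
>>> X_counts = CountVectorizer().fit_transform(docs)  # document-term counts
>>> # 'l' -> log term frequency, 't' -> idf, 'c' -> cosine normalization
>>> tfidf = SmartTfidfTransformer(weighting='ltc')
>>> X_tfidf = tfidf.fit_transform(X_counts)  # weighted, normalized features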
Parameters:

- weighting (str, default='nsc') – the SMART notation describing the term weighting, document frequency weighting and normalization scheme, in the form [nlabL][ntspd][nclu][p].
- norm_alpha (float, default=0.75) – the α parameter in the pivoted normalization. This parameter is only used when weighting='???p'.
- norm_pivot (float, default=None) – the pivot value used for the normalization. If not provided, it is computed as the mean of norm(tf*idf). This parameter is only used when weighting='???p'.
- compute_df (bool, default=False) – compute the document frequency (df_ attribute) even when it is not explicitly required by the weighting scheme.
- copy (boolean, default=True) – whether to copy the input array and operate on the copy, or to perform in-place operations in fit and transform.
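For instance, pivoted normalization is enabled by the trailing p character; a minimal sketch (the corpus below is illustrative):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> docs = ["a short document",
...         "a much longer document with many more words and terms in it"]
>>> X_counts = CountVectorizer().fit_transform(docs)
>>> # the trailing 'p' enables pivoted normalization; norm_pivot stays None,
>>> # so it is estimated during fit as the mean of norm(tf*idf)
>>> tfidf = SmartTfidfTransformer(weighting='nscp', norm_alpha=0.75)
>>> X_pivoted = tfidf.fit_transform(X_counts)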
References

[Manning2008] C.D. Manning, P. Raghavan, H. Schütze, “Document and query weighting schemes”, 2008
[Singhal1996] A. Singhal, C. Buckley, and M. Mitra, “Pivoted document length normalization”, 1996
fit(X, y=None)[source]¶

Learn the document length and document frequency vectors (if necessary).

Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts
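A short sketch of fitting on a term count matrix, with compute_df=True so that the df_ attribute described above is populated (the corpus is illustrative):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> docs = ["apples and oranges", "apples and pears", "pears only"]
>>> X_counts = CountVectorizer().fit_transform(docs)
>>> est = SmartTfidfTransformer(weighting='nsc', compute_df=True)
>>> est = est.fit(X_counts)
>>> df = est.df_  # document frequency vector, one entry per feature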
fit_transform(X, y=None)[source]¶

Apply document term weighting and normalization on text features.

Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts
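A one-step fit and transform, sketched under the same illustrative assumptions as the examples above:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> docs = ["information retrieval and text mining",
...         "text categorization with TF-IDF weighting"]
>>> X_counts = CountVectorizer().fit_transform(docs)
>>> # returns a sparse [n_samples, n_features] matrix of weighted features
>>> X_tfidf = SmartTfidfTransformer(weighting='nsc').fit_transform(X_counts)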
get_params(deep=True)[source]¶

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
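For example (a brief sketch; the parameter values are arbitrary):

>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> est = SmartTfidfTransformer(weighting='ltc', norm_alpha=0.5)
>>> params = est.get_params()
>>> params['weighting']
'ltc'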
set_params(**params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
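For instance, a minimal sketch with a scikit-learn Pipeline (the step names 'counts' and 'tfidf' are arbitrary choices for this illustration):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from freediscovery.feature_weighting import SmartTfidfTransformer
>>> pipe = Pipeline([('counts', CountVectorizer()),
...                  ('tfidf', SmartTfidfTransformer())])
>>> # <component>__<parameter> reaches into the nested estimator
>>> pipe = pipe.set_params(tfidf__weighting='ltc')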