Feature extraction

For a general introduction to feature extraction with textual documents see the scikit-learn documentation.

TF-IDF schemes

SMART TF-IDF schemes

FreeDiscovery extends sklearn.feature_extraction.text.TfidfTransformer with a larger number of TF-IDF weighting and normalization schemes in SmartTfidfTransformer. It follows the SMART Information Retrieval System notation:

[Figure: SMART TF-IDF weighting scheme notation (tf_idf_weighting.svg)]

The different options are described in more detail in the table below:

Term frequency

- n (natural): \({\text{tf}}_{t,d}\)
- l (logarithm): \(1+\log({\text{tf}}_{t,d})\)
- a (augmented): \(0.5 + {\tfrac {0.5\times {\text{tf}}_{t,d}}{\max_{t\in d}({\text{tf}}_{t,d})}}\)
- b (boolean): \({\begin{cases}1,&{\text{if tf}}_{t,d}>0\\0,&{\text{otherwise}}\end{cases}}\)
- L (log average): \({\tfrac {1+\log({\text{tf}}_{t,d})}{1+\log({\text{avg}}_{t\in d}({\text{tf}}_{t,d}))}}\)

Document frequency

- n (no): 1
- t (idf): \(\log{\tfrac {N}{df_{t}}}\)
- s (smoothed idf): \(\log{\tfrac {N+1}{df_{t}+1}}\)
- p (prob idf): \(\log{\tfrac {N-df_{t}}{df_{t}}}\)
- d (smoothed prob idf): \(\log{\tfrac {N+1-df_{t}}{df_{t}+1}}\)

Normalization

- n (none): 1
- c (cosine): \(\sqrt{\sum_{t\in d} w_{t}^{2}}\)
- l (length): \(\sum_{t\in d} |w_{t}|\)
- u (unique): \(\sum_{t\in d} \mathbf{bool}(|w_{t}|)\)
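A weighting scheme combines one entry from each column. As an illustration (this is an independent NumPy reimplementation on a toy dense matrix, not FreeDiscovery's own code, which operates on sparse matrices), the 'ltc' scheme can be sketched as:

```python
import numpy as np

# Toy term counts: rows are documents, columns are terms.
tf = np.array([[3.0, 0.0, 1.0],
               [2.0, 2.0, 0.0]])

N = tf.shape[0]                    # number of documents
df = np.count_nonzero(tf, axis=0)  # document frequency of each term

# 'l' term frequency: 1 + log(tf) where tf > 0, otherwise 0
w = np.zeros_like(tf)
nz = tf > 0
w[nz] = 1.0 + np.log(tf[nz])

# 't' document frequency weighting: multiply by idf = log(N / df)
w *= np.log(N / df)

# 'c' normalization: divide each row by its Euclidean norm
w /= np.sqrt((w ** 2).sum(axis=1, keepdims=True))

print(w.round(3))
```

Note that a term occurring in every document (here the first column) gets an idf of \(\log(N/N) = 0\) and is zeroed out by the 't' weighting.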

Pivoted document length normalization

In addition to the standard TF-IDF normalizations above, pivoted normalization was proposed by Singhal et al. as a way to avoid over-penalising long documents. It can be enabled with the weighting='???p' parameter. For each document, the normalization term \(V_{\textbf{d}}\) is replaced by,

\[{\displaystyle (1 - \alpha) \textbf{avg} \left( V_{\textbf{d}}\right) + \alpha V_{\textbf{d}}}\]

where \(\alpha\) (norm_alpha) is a user-defined parameter such that \(\alpha \in [0, 1]\). With norm_alpha=1 the pivot cancels out, and this case corresponds to regular TF-IDF normalization.

See the example on Optimizing TF-IDF schemes for a more practical illustration.

References