Feature extraction

For a general introduction to feature extraction with textual documents see the scikit-learn documentation.

TF-IDF schemes

SMART TF-IDF schemes

FreeDiscovery extends sklearn.feature_extraction.text.TfidfTransformer with a larger number of TF-IDF weighting and normalization schemes in SmartTfidfTransformer. It follows the SMART Information Retrieval System notation:

[Figure: SMART TF-IDF weighting scheme notation (tf_idf_weighting.svg)]

The different options are described in more detail in the table below:

Term frequency

- n (natural): \({\text{tf}}_{t,d}\)
- l (logarithm): \(1+\log({\text{tf}}_{t,d})\)
- a (augmented): \(0.5 + {\tfrac {0.5\times {\text{tf}}_{t,d}}{\max_{t\in d}({\text{tf}}_{t,d})}}\)
- b (boolean): \({\begin{cases}1,&{\text{if tf}}_{t,d}>0\\0,&{\text{otherwise}}\end{cases}}\)
- L (log average): \({\tfrac {1+\log({\text{tf}}_{t,d})}{1+\log({\text{avg}}_{t\in d}({\text{tf}}_{t,d}))}}\)

Document frequency

- n (no): 1
- t (idf): \(\log{\tfrac {N}{df_{t}}}\)
- s (smoothed idf): \(\log{\tfrac {N+1}{df_{t}+1}}\)
- p (prob idf): \(\log{\tfrac {N-df_{t}}{df_{t}}}\)
- d (smoothed prob idf): \(\log{\tfrac {N+1-df_{t}}{df_{t}+1}}\)

Normalization

- n (none): 1
- c (cosine): \(\sqrt{\sum_{t\in d} w_{t}^{2}}\)
- l (length): \(\sum_{t\in d} |w_{t}|\)
- u (unique): \(\sum_{t\in d} \mathbf{bool}(|w_{t}|)\)
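A weighting scheme combines one entry from each column. As an illustration (this is an independent NumPy reimplementation on a toy dense matrix, not FreeDiscovery's own code, which operates on sparse matrices), the 'ltc' scheme can be sketched as:

```python
import numpy as np

# Toy term counts: rows are documents, columns are terms.
tf = np.array([[3.0, 0.0, 1.0],
               [2.0, 2.0, 0.0]])

N = tf.shape[0]                    # number of documents
df = np.count_nonzero(tf, axis=0)  # document frequency of each term

# 'l' term frequency: 1 + log(tf) where tf > 0, otherwise 0
w = np.zeros_like(tf)
nz = tf > 0
w[nz] = 1.0 + np.log(tf[nz])

# 't' document frequency weighting: multiply by idf = log(N / df)
w *= np.log(N / df)

# 'c' normalization: divide each row by its Euclidean norm
w /= np.sqrt((w ** 2).sum(axis=1, keepdims=True))

print(w.round(3))
```

Note that a term occurring in every document (here the first column) gets an idf of \(\log(N/N) = 0\) and is zeroed out by the 't' weighting.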

Pivoted document length normalization

In addition to the standard TF-IDF normalizations above, pivoted normalization was proposed by Singhal et al. as a way to avoid over-penalising long documents. It can be enabled with the weighting='???p' parameter. For each document, the normalization term \(V_{\textbf{d}}\) is replaced by,

\[{\displaystyle (1 - \alpha) \textbf{avg} \left( V_{\textbf{d}}\right) + \alpha V_{\textbf{d}}}\]

where \(\alpha\) (norm_alpha) is a user-defined parameter such that \(\alpha \in [0, 1]\). With norm_alpha=1 the pivot cancels out, and this case corresponds to regular TF-IDF normalization.

See the example on Optimizing TF-IDF schemes for a more practical illustration.

References