Feature extraction
For a general introduction to feature extraction with textual documents see the scikit-learn documentation.
TF-IDF schemes
SMART TF-IDF schemes
FreeDiscovery extends sklearn.feature_extraction.text.TfidfTransformer
with a larger number of TF-IDF weighting and normalization schemes in SmartTfidfTransformer.
It follows the SMART Information Retrieval System notation, in which a scheme is written as a combination of letters, one per weighting component.
The different options are described in more detail in the table below.
Term frequency | Document frequency | Normalization |
n (natural): \(\text{tf}_{t,d}\) | n (no): 1 | n (none): 1 |
l (logarithm): \(1 + \log(\text{tf}_{t,d})\) | t (idf): \(\log \tfrac{N}{df_{t}}\) | c (cosine): \(\sqrt{\sum_{t \in d} w_{t}^{2}}\) |
a (augmented): \(0.5 + \tfrac{0.5 \times \text{tf}_{t,d}}{\max_{t}(\text{tf}_{t,d})}\) | s (smoothed idf): \(\log \tfrac{N + 1}{df_{t} + 1}\) | l (length): \(\sum_{t \in d} \lvert w_{t} \rvert\) |
b (boolean): \(\begin{cases}1, & \text{if } \text{tf}_{t,d} > 0 \\ 0, & \text{otherwise}\end{cases}\) | p (prob idf): \(\log \tfrac{N - df_{t}}{df_{t}}\) | u (unique): \(\sum_{t \in d} \text{bool}\left(\lvert w_{t} \rvert\right)\) |
L (log average): \(\tfrac{1 + \log(\text{tf}_{t,d})}{1 + \log(\text{avg}_{t \in d}(\text{tf}_{t,d}))}\) | d (smoothed prob idf): \(\log \tfrac{N + 1 - df_{t}}{df_{t} + 1}\) | |
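To make the table more concrete, the following is a minimal pure-Python sketch of how a three-letter scheme could be applied to a small matrix of term counts. It is not FreeDiscovery's implementation: the function name `smart_weights` is hypothetical, only a subset of the options is covered (n, l, b for term frequency; n, t, s for document frequency; n, c, l for normalization), natural logarithms are assumed, and terms absent from a document are assumed to keep a weight of zero.

```python
import math


def smart_weights(counts, scheme="ltc"):
    """Weight a dense term-count matrix with a three-letter SMART scheme.

    counts: list of rows, one row of term counts per document.
    scheme: term frequency, document frequency and normalization letters.
    Hypothetical helper for illustration; covers only part of the table.
    """
    tf_code, df_code, norm_code = scheme
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df_t: number of documents in which term t occurs at least once
    df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

    def tf_weight(tf):
        if tf == 0:
            return 0.0  # assumption: absent terms keep a zero weight
        if tf_code == "n":
            return float(tf)
        if tf_code == "l":
            return 1.0 + math.log(tf)
        if tf_code == "b":
            return 1.0
        raise ValueError("unsupported term frequency option: " + tf_code)

    def idf_weight(df_t):
        if df_code == "n":
            return 1.0
        if df_t == 0:
            return 0.0  # assumption: a term never seen gets a zero weight
        if df_code == "t":
            return math.log(n_docs / df_t)
        if df_code == "s":
            return math.log((n_docs + 1) / (df_t + 1))
        raise ValueError("unsupported document frequency option: " + df_code)

    idf = [idf_weight(df_t) for df_t in df]

    weighted = []
    for row in counts:
        w = [tf_weight(tf) * idf[t] for t, tf in enumerate(row)]
        if norm_code == "n":
            norm = 1.0
        elif norm_code == "c":
            norm = math.sqrt(sum(x * x for x in w))
        elif norm_code == "l":
            norm = sum(abs(x) for x in w)
        else:
            raise ValueError("unsupported normalization option: " + norm_code)
        # Divide each weight by the document's normalization term
        weighted.append([x / norm if norm > 0 else 0.0 for x in w])
    return weighted


# Three documents over a vocabulary of three terms
counts = [[1, 0, 2], [0, 1, 1], [1, 1, 0]]
weights = smart_weights(counts, "ntc")
```

With the cosine option (c), every non-empty row of `weights` has unit Euclidean length, while the length option (l) makes the absolute weights of each row sum to one.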
Pivoted document length normalization
In addition to the standard TF-IDF normalizations above, pivoted normalization was proposed by Singhal et al. as a way to avoid over-penalising long documents. It can be enabled with the weighting='???p'
parameter. For each document, the normalization term \(V_{\textbf{d}}\) is replaced by

\[(1 - \alpha) \, \text{avg}(V_{\textbf{d}}) + \alpha V_{\textbf{d}}\]

where \(\text{avg}(V_{\textbf{d}})\), the average of the normalization term over the documents, acts as the pivot, and \(\alpha\) (norm_alpha)
is a user-defined parameter such that \(\alpha \in [0, 1]\). If norm_alpha=1,
the pivot cancels out and this case corresponds to regular TF-IDF normalization.
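The effect of \(\alpha\) can be sketched in a few lines of Python. The example assumes the replacement term has the form \((1 - \alpha) \cdot \text{pivot} + \alpha V_{\textbf{d}}\) described by Singhal et al., with the pivot defaulting to the mean of the per-document norms. The function name `pivoted_norms` and its `pivot` argument are hypothetical and are not part of the FreeDiscovery API.

```python
def pivoted_norms(norms, alpha, pivot=None):
    """Return the pivoted normalization term for each document.

    norms: per-document normalization terms V_d (for example cosine norms).
    alpha: slope in [0, 1]; alpha=1 leaves the norms unchanged.
    pivot: pivot value; assumed here to default to the mean of the norms.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if pivot is None:
        pivot = sum(norms) / len(norms)
    return [(1.0 - alpha) * pivot + alpha * v for v in norms]


# One short, one medium and one long document
norms = [1.0, 2.0, 6.0]
unchanged = pivoted_norms(norms, alpha=1.0)   # regular normalization
flattened = pivoted_norms(norms, alpha=0.0)   # every document uses the pivot
in_between = pivoted_norms(norms, alpha=0.5)
```

For intermediate values of `alpha`, a document whose norm is above the pivot is divided by a smaller term than before, so it is penalised less, while a document below the pivot is divided by a larger one.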
See the example on Optimizing TF-IDF schemes for a more practical illustration.
References
- C.D. Manning, P. Raghavan, H. Schütze. “Document and query weighting schemes”, 2008.
- A. Singhal, C. Buckley, M. Mitra. “Pivoted document length normalization”, 1996.