freediscovery.io.parse_smart_tokens

freediscovery.io.parse_smart_tokens(text)

Parse a dataset stored in the SMART tokenized format, which is used in particular for the RCV1-v2 dataset (see Appendix B.12.i of http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm).

Parameters: text (str) – the full text of the dataset
Returns: result – the parsed dataset as an OrderedDict, with document ids as keys and a string of tokens as values
Return type: dict
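
The following is a minimal sketch of how such a parser could work. It is not the FreeDiscovery implementation. It assumes the token-file layout described in the RCV1-v2 README: each document starts with a `.I <id>` line, followed by a `.W` line and one or more lines of whitespace-separated tokens. The name `parse_smart_tokens_sketch`, the sample text, and the choice of integer keys are assumptions made for illustration.

```python
from collections import OrderedDict


def parse_smart_tokens_sketch(text):
    """Illustrative parser for SMART-style token files (hypothetical helper)."""
    result = OrderedDict()
    doc_id = None
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(".I "):
            # A new document begins: store the previous one, if any.
            if doc_id is not None:
                result[doc_id] = " ".join(tokens)
            doc_id = int(line[3:].strip())  # assumption: ids are integers
            tokens = []
        elif line == ".W" or not line:
            # Skip the text marker and blank separator lines.
            continue
        elif doc_id is not None:
            # Token lines may wrap; collect them into one list per document.
            tokens.extend(line.split())
    # Store the last document.
    if doc_id is not None:
        result[doc_id] = " ".join(tokens)
    return result


# Made-up sample in the assumed layout.
sample = """.I 2286
.W
recov recov excit
bring mexic

.I 2287
.W
uruguay uruguay compan
"""

parsed = parse_smart_tokens_sketch(sample)
for doc_id, doc_tokens in parsed.items():
    print(doc_id, doc_tokens)
```

Because the result is an ordered mapping, iterating over it yields documents in the order they appear in the input text.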