freediscovery.datasets.load_dataset
freediscovery.datasets.load_dataset(name='20_newsgroups_3categories', cache_dir='/tmp', verbose=False, verify_checksum=False, document_id_generation='squared', categories=None)
Download a benchmark dataset.
The currently supported datasets are listed below:
TREC 2009 legal collection
- treclegal09_2k_subset : 2 400 documents, 2 MB
- treclegal09_20k_subset : 20 000 documents, 30 MB
- treclegal09_37k_subset : 37 000 documents, 55 MB
- treclegal09 : 700 000 documents, 1.2 GB
The ground truth files for categorization are adapted from TAR Toolkit.
Fedora mailing list (2009-2009)
- fedora_ml_3k_subset
The 20 newsgroups dataset
- 20_newsgroups_3categories : only the ['comp.graphics', 'rec.sport.baseball', 'sci.space'] categories
If you encounter any issues when downloading with this function, you can also manually download and extract the required dataset to cache_dir (the download URL is http://r0h.eu/d/<name>.tar.gz), then re-run this function to get the required metadata.
Parameters:
- name (str, default='20_newsgroups_3categories') – the name of the dataset file to load
- cache_dir (str, default='/tmp/') – root directory where to save the download
- verbose (bool, default=False) – print download progress
- verify_checksum (bool, default=False) – verify the checksum of the downloaded archive
- document_id_generation (str) – specifies how the document_id is computed from internal_id; must be one of ['identity', 'squared'], default="identity" (i.e. document_id = internal_id)
- categories (str) – select a subsection of the dataset, default='all'
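The effect of document_id_generation can be illustrated with a small sketch. The documentation above only defines 'identity' (document_id = internal_id); the sketch assumes that 'squared' means the internal_id raised to the power of two, and compute_document_id is a hypothetical helper written for illustration, not a documented FreeDiscovery function.

```python
def compute_document_id(internal_id, mode="identity"):
    """Map an internal_id to a document_id (hypothetical helper)."""
    if mode == "identity":
        # Documented behaviour: document_id = internal_id
        return internal_id
    if mode == "squared":
        # Assumption: 'squared' squares the internal_id
        return internal_id ** 2
    raise ValueError("mode must be one of ['identity', 'squared']")


internal_ids = [0, 1, 2, 3]
identity_ids = [compute_document_id(i, "identity") for i in internal_ids]
squared_ids = [compute_document_id(i, "squared") for i in internal_ids]
```

Under this assumption, a non-identity mapping is mainly useful for checking that code does not silently confuse document_id with internal_id.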
Returns:
- metadata (dict) – a dictionary containing metadata corresponding to the dataset
- training_set ({dict, None}) – a list of dictionaries for the training set, or None
- test_set (dict) – a list of dictionaries for the test set
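The manual fallback described above (fetch http://r0h.eu/d/<name>.tar.gz, extract it into cache_dir, then re-run the function) could be sketched as follows. This is a minimal sketch rather than the library's own download logic: dataset_url, manual_download and load_after_manual_download are hypothetical helpers, the archive is assumed to unpack directly into the layout load_dataset expects, and the three return values are assumed to come back as a tuple in the order listed under Returns.

```python
import os
import tarfile
import urllib.request

BASE_URL = "http://r0h.eu/d/"  # download location given in the note above


def dataset_url(name):
    """Build the archive URL for a dataset name."""
    return "{}{}.tar.gz".format(BASE_URL, name)


def manual_download(name, cache_dir="/tmp"):
    """Download <name>.tar.gz and extract it into cache_dir (hypothetical helper)."""
    archive_path = os.path.join(cache_dir, name + ".tar.gz")
    urllib.request.urlretrieve(dataset_url(name), archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Assumption: the archive unpacks into the directory load_dataset expects
        archive.extractall(path=cache_dir)
    return archive_path


def load_after_manual_download(name="20_newsgroups_3categories", cache_dir="/tmp"):
    """Fetch the archive by hand, then re-run load_dataset to get the metadata."""
    # Imported lazily so the helpers above work without FreeDiscovery installed
    from freediscovery.datasets import load_dataset

    manual_download(name, cache_dir)
    # Assumption: the values listed under Returns are returned as a 3-tuple
    metadata, training_set, test_set = load_dataset(name, cache_dir=cache_dir)
    return metadata, training_set, test_set
```

Calling load_after_manual_download('treclegal09_2k_subset') would then download roughly 2 MB, according to the dataset list above.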