freediscovery.datasets.load_dataset
freediscovery.datasets.load_dataset(name=u'treclegal09_2k_subset', cache_dir=u'/tmp', force=False, verbose=True, load_ground_truth=False, verify_checksum=False)
Download a benchmark dataset.
The currently supported datasets are listed below.
TREC 2009 legal collection
- treclegal09_2k_subset : 2 400 documents, 2 MB
- treclegal09_20k_subset : 20 000 documents, 30 MB
- treclegal09_37k_subset : 37 000 documents, 55 MB
- treclegal09 : 700 000 documents, 1.2 GB
The ground truth files for categorization are adapted from TAR Toolkit.
Fedora mailing list (2009-2009) - fedora_ml_
If the download fails with this function, you can instead download the archive manually from http://r0h.eu/d/<name>.tar.gz and extract it to cache_dir, then re-run this function to get the required metadata.
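The manual fallback described above can be sketched with the Python 3 standard library alone. This is a minimal sketch, not part of FreeDiscovery: the helper names dataset_url and extract_archive are hypothetical, and only the URL pattern is taken from this page. It assumes the archive has already been downloaded to a local path.

```python
import os
import tarfile

# URL prefix taken from the documentation above.
BASE_URL = "http://r0h.eu/d/"


def dataset_url(name):
    """Build the download URL for a dataset name (hypothetical helper)."""
    return "{}{}.tar.gz".format(BASE_URL, name)


def extract_archive(archive_path, cache_dir):
    """Extract a manually downloaded .tar.gz archive into cache_dir.

    Hypothetical helper. Only extract archives from a source you trust.
    Returns the sorted top-level entries of cache_dir after extraction.
    """
    os.makedirs(cache_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(cache_dir)
    return sorted(os.listdir(cache_dir))
```

After extraction, calling load_dataset again with the same name and cache_dir should pick up the files already on disk, as described above.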
Parameters: - name (str, default='treclegal09_2k_subset') – the name of the dataset file to load
- cache_dir (str, default='/tmp') – root directory in which to save the download
- force (bool, default=False) – download again if the dataset already exists. Warning: this will remove previously downloaded files!
- load_ground_truth (bool, default=False) – parse the ground truth files present in the dataset
- verbose (bool, default=True) – print download progress
- verify_checksum (bool, default=False) – verify the checksum of the downloaded archive
Returns: response – a dictionary containing paths to the dataset and corresponding metadata
Return type: dict
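A short usage sketch follows. It assumes FreeDiscovery is installed and the download server is reachable. The wrapper describe_dataset and its loader argument are hypothetical, added only so the call can be exercised without network access. The keys of the returned dictionary are not listed on this page, so the sketch simply reports whatever keys come back.

```python
def describe_dataset(name="treclegal09_2k_subset", cache_dir="/tmp", loader=None):
    """Load a dataset and return the sorted keys of the response dict.

    `loader` is a hypothetical hook for substituting a stand-in function;
    by default the real load_dataset is imported lazily.
    """
    if loader is None:
        # Requires the freediscovery package.
        from freediscovery.datasets import load_dataset as loader

    response = loader(
        name,
        cache_dir=cache_dir,
        verbose=False,            # silence download progress
        load_ground_truth=True,   # also parse the ground truth files
    )
    return sorted(response)


if __name__ == "__main__":
    # Downloads about 2 MB on first use.
    print(describe_dataset())
```

Starting with the smallest subset keeps the first download quick; the larger collections use the same call with a different name.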