Load benchmark dataset
Currently, the following datasets based on the TREC 2009 legal collection are supported:
- treclegal09_2k_subset : 2 400 documents, 2 MB
- treclegal09_20k_subset : 20 000 documents, 30 MB
- treclegal09_37k_subset : 37 000 documents, 55 MB
- treclegal09 : 700 000 documents, 1.2 GB

The ground truth files for categorization are adapted from TAR Toolkit.
If you encounter any issues downloading a dataset with this function,
you can also manually download the required dataset and extract it to cache_dir
(the download URL is http://r0h.eu/d/<name>.tar.gz), then re-run this function
to get the required metadata.
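The manual fallback described above could be scripted roughly as follows. This is only a sketch: the helper names `dataset_url` and `download_and_extract` are hypothetical, and it assumes the archive is a gzip-compressed tarball that should be unpacked directly into cache_dir.

```python
import tarfile
import urllib.request
from pathlib import Path

# Dataset names listed in this section.
DATASETS = (
    "treclegal09_2k_subset",
    "treclegal09_20k_subset",
    "treclegal09_37k_subset",
    "treclegal09",
)


def dataset_url(name):
    """Build the download URL from the pattern given in the text."""
    if name not in DATASETS:
        raise ValueError(f"unknown dataset: {name!r}")
    return f"http://r0h.eu/d/{name}.tar.gz"


def download_and_extract(name, cache_dir):
    """Download the archive and unpack it into cache_dir (hypothetical helper).

    Assumption: unpacking straight into cache_dir gives the layout the
    endpoint expects when it is re-run.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    archive = cache / f"{name}.tar.gz"
    urllib.request.urlretrieve(dataset_url(name), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(cache)
    return cache
```

After extraction, calling the endpoint again should return the metadata without attempting a new download.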
URL:
/api/v0/dataset/<dataset-name>
Method:
GET
URL Params: None
Data Params: None
Success Response:
HTTP 200
{"data_dir": <str>, "base_dir": <str>, "seed_non_relevant_files": <list[str]>, "seed_relevant_files": <list[str]>, "ground_truth_file": <str>}
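A client call to this endpoint might look like the sketch below. The server address in `BASE_URL` is an assumption (the text does not state where the API is hosted), and `endpoint_url`, `missing_keys` and `load_dataset` are hypothetical helper names. Only the URL path and the response keys come from the description above.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; point this at your own server.
BASE_URL = "http://localhost:5001"

# Keys listed in the success response above.
EXPECTED_KEYS = {
    "data_dir",
    "base_dir",
    "seed_non_relevant_files",
    "seed_relevant_files",
    "ground_truth_file",
}


def endpoint_url(name, base_url=BASE_URL):
    """Build the GET URL for a dataset name."""
    return f"{base_url.rstrip('/')}/api/v0/dataset/{name}"


def missing_keys(payload):
    """Return the documented keys that are absent from a response dict."""
    return sorted(EXPECTED_KEYS - set(payload))


def load_dataset(name, base_url=BASE_URL):
    """Send the GET request and return the parsed JSON metadata."""
    with urllib.request.urlopen(endpoint_url(name, base_url)) as resp:
        payload = json.load(resp)
    missing = missing_keys(payload)
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return payload
```

For example, `load_dataset("treclegal09_2k_subset")` would request the smallest dataset. The first call may take a while, since the server may need to download the data.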