Categorization Example [REST API]
An example to illustrate binary categorization with FreeDiscovery.
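The script below drives a running FreeDiscovery server through its REST API. As a minimal sketch (assuming the server has already been started locally on its default port 5001), the connection can be checked before running the full example:

import requests

BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL
# Fetch the bundled example dataset description; a 200 status code
# confirms the server is reachable
res = requests.get(BASE_URL + '/datasets/treclegal09_2k_subset')
print(res.status_code)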
Out:
0. Load the test dataset
GET http://localhost:5001/api/v0/datasets/treclegal09_2k_subset
1.a Load dataset and initialize feature extraction
POST http://localhost:5001/api/v0/feature-extraction
=> received [u'id', u'filenames']
=> dsid = ffeb050dc0fe425ca76c6616b44c03e2
1.b Start feature extraction (in the background)
POST http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
1.c Monitor feature extraction progress
GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
1.d. Check the parameters of the extracted features
GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
- binary: False
- n_jobs: -1
- stop_words: english
- use_hashing: True
- min_df: 0.0
- n_samples: 2465
- analyzer: word
- ngram_range: [1, 1]
- max_df: 1.0
- chunk_size: 2000
- use_idf: True
- data_dir: ../freediscovery_shared/treclegal09_2k_subset/data
- sublinear_tf: False
- n_samples_processed: 2465
- n_features: 50001
- norm: l2
2.a. Train the ML categorization model
5 relevant, 63 non-relevant files
POST http://localhost:5001/api/v0/categorization/
Training...
=> model id = e3f2f4ad32454a7fa4fd107c892012ac
=> Training scores: MAP = 1.000, ROC-AUC = 1.000
2.b. Check the parameters used in the categorization model
GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac
- method: LinearSVC
- options: {'loss': 'squared_hinge', 'C': 1.0, 'verbose': 0, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 1000, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': True, 'tol': 0.0001, 'class_weight': None}
2.c Categorize the complete dataset with this model
GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/predict
=> Predicting 11 relevant and 2454 non-relevant documents
2.d Test categorization accuracy
using ../freediscovery_shared/treclegal09_2k_subset/ground_truth_file.txt
POST http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/test
=> Test scores: MAP = 0.959, ROC-AUC = 0.958
3.a. Calculate LSI
POST http://localhost:5001/api/v0/lsi/
=> LSI model id = 8b827da5547f46a6962274ee7771248b
=> SVD decomposition with 100 dimensions explaining 48.41 % of the data variability
3.b. Predict categorization with LSI
POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/predict
=> Training scores: MAP = 1.000, ROC-AUC = 1.000
3.c. Test categorization with LSI
POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/test
{u'recall': 0.8333333333333334, u'f1': 0.11695906432748539, u'roc_auc': 0.886295692349504, u'average_precision': 0.4485188870603544, u'precision': 0.06289308176100629}
=> Test scores: MAP = 0.449, ROC-AUC = 0.886
nearest_nrel_doc nearest_rel_doc prediction
0 9 1791 -0.378
1 1457 1791 -0.459
2 2314 3 -0.558
3 2451 3 1.000
4 2314 3 -0.598
5 1337 919 -0.473
6 1600 3 -0.705
7 2314 1047 -0.415
8 1337 3 -0.457
9 9 1791 -1.000
10 2451 3 -0.539
11 32 906 -0.391
12 32 906 -0.369
13 2039 919 -0.824
14 2039 1047 -0.996
15 1563 906 -0.328
16 32 3 -0.399
17 32 906 -0.270
18 99 919 -0.199
19 1104 906 -0.550
20 1337 919 -0.362
21 2275 3 -0.893
22 676 1047 -0.196
23 676 919 -0.197
24 2121 1047 -0.550
25 2275 1791 -0.615
26 32 906 -0.277
27 362 1047 -0.256
28 32 3 -0.389
29 32 906 -0.262
... ... ... ...
2435 987 1791 -0.150
2436 615 919 -0.745
2437 615 919 -0.463
2438 1337 1047 -0.217
2439 1503 1047 -0.424
2440 1503 1047 -0.225
2441 1503 3 -0.483
2442 539 1047 -0.248
2443 9 919 -0.410
2444 9 919 -0.206
2445 9 919 -0.410
2446 9 1047 -0.209
2447 2314 1047 -0.501
2448 2314 1047 -0.174
2449 1337 3 -0.582
2450 2387 3 -0.835
2451 2451 3 -1.000
2452 2451 3 -0.652
2453 2451 3 -0.554
2454 2451 3 -0.372
2455 1561 3 -0.448
2456 2387 3 0.654
2457 1563 1047 -0.537
2458 1117 919 -0.237
2459 1441 1047 -0.702
2460 81 1047 -0.193
2461 81 1047 -0.185
2462 1441 1047 -0.224
2463 81 1047 -0.199
2464 1441 1047 -0.497
[2465 rows x 3 columns]
4.a Delete the extracted features
DELETE http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
from __future__ import print_function
from time import time, sleep
from multiprocessing import Process
import requests
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
pd.options.display.expand_frame_repr = False
dataset_name = "treclegal09_2k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
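# The feature extraction request below is launched in a separate process,
# so the script body must live under the __main__ guard (required by
# multiprocessing when the 'spawn' start method is used, e.g. on Windows).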
if __name__ == '__main__':
    print(" 0. Load the test dataset")
    url = BASE_URL + '/datasets/{}'.format(dataset_name)
    print(" GET", url)
    res = requests.get(url).json()

    # To use a custom dataset, simply specify the following variables
    data_dir = res['data_dir']
    seed_filenames = res['seed_filenames']
    seed_y = res['seed_y']
    ground_truth_file = res['ground_truth_file']  # (optional)

    # 1. Feature extraction
    print("\n1.a Load dataset and initialize feature extraction")
    url = BASE_URL + '/feature-extraction'
    print(" POST", url)
    fe_opts = {'data_dir': data_dir,
               'stop_words': 'english', 'chunk_size': 2000, 'n_jobs': -1,
               'use_idf': 1, 'sublinear_tf': 0, 'binary': 0, 'n_features': 50001,
               'analyzer': 'word', 'ngram_range': (1, 1), "norm": "l2"
               }
    res = requests.post(url, json=fe_opts).json()

    dsid = res['id']
    print(" => received {}".format(list(res.keys())))
    print(" => dsid = {}".format(dsid))
    print("\n1.b Start feature extraction (in the background)")
    # Make this call in a background process (there should be a better way of doing it)
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" POST", url)
    p = Process(target=requests.post, args=(url,))
    p.start()
    sleep(5.0)  # wait a bit for the processing to start
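    # Poll the feature-extraction endpoint: the server answers 200 once
    # processing has finished, 520 if it failed to start, and reports
    # progress through 'n_samples_processed' in the meantime.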
    print('\n1.c Monitor feature extraction progress')
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" GET", url)
    t0 = time()
    while True:
        res = requests.get(url)
        if res.status_code == 520:
            p.terminate()
            raise ValueError('Processing did not start')
        elif res.status_code == 200:
            break  # processing finished
        data = res.json()
        print(' ... {}k/{}k files processed in {:.1f} min'.format(
              data['n_samples_processed']//1000, data['n_samples']//1000, (time() - t0)/60.))
        sleep(15.0)
    p.terminate()  # just in case, should not be necessary
    print("\n1.d. Check the parameters of the extracted features")
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(' GET', url)
    res = requests.get(url).json()
    print('\n'.join([' - {}: {}'.format(key, val) for key, val in res.items()
                     if "filenames" not in key]))
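    # Map the seed document filenames to their internal dataset indices;
    # the categorization endpoints expect these indices (together with the
    # labels in seed_y) as training input.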
    method = BASE_URL + "/feature-extraction/{}/index".format(dsid)
    res = requests.get(method, data={'filenames': seed_filenames})
    seed_index = res.json()['index']
    # 2. Document categorization with ML algorithms
    print("\n2.a. Train the ML categorization model")
    print(" {} relevant, {} non-relevant files".format(seed_y.count(1), seed_y.count(0)))
    url = BASE_URL + '/categorization/'
    print(" POST", url)
    print(' Training...')
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y,
                              'dataset_id': dsid,
                              'method': 'LinearSVC',  # one of "LinearSVC", "LogisticRegression", "xgboost"
                              'cv': 0  # Cross Validation
                              }).json()

    mid = res['id']
    print(" => model id = {}".format(mid))
    print(' => Training scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))

    print("\n2.b. Check the parameters used in the categorization model")
    url = BASE_URL + '/categorization/{}'.format(mid)
    print(" GET", url)
    res = requests.get(url).json()
    print('\n'.join([' - {}: {}'.format(key, val) for key, val in res.items()
                     if key not in ['index', 'y']]))

    print("\n2.c Categorize the complete dataset with this model")
    url = BASE_URL + '/categorization/{}/predict'.format(mid)
    print(" GET", url)
    res = requests.get(url).json()
    prediction = res['prediction']
    print(" => Predicting {} relevant and {} non-relevant documents".format(
          len(list(filter(lambda x: x > 0, prediction))),
          len(list(filter(lambda x: x < 0, prediction)))))

    print("\n2.d Test categorization accuracy")
    print(" using {}".format(ground_truth_file))
    url = BASE_URL + '/categorization/{}/test'.format(mid)
    print(" POST", url)
    res = requests.post(url, json={'ground_truth_filename': ground_truth_file}).json()
    print(' => Test scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
    # 3. Document categorization with LSI
    print("\n3.a. Calculate LSI")
    url = BASE_URL + '/lsi/'
    print(" POST", url)
    n_components = 100
    res = requests.post(url,
                        json={'n_components': n_components,
                              'dataset_id': dsid
                              }).json()

    lid = res['id']
    print(' => LSI model id = {}'.format(lid))
    print(' => SVD decomposition with {} dimensions explaining {:.2f} % of the data variability'.format(
          n_components, res['explained_variance']*100))

    print("\n3.b. Predict categorization with LSI")
    url = BASE_URL + '/lsi/{}/predict'.format(lid)
    print(" POST", url)
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y
                              }).json()
    prediction = res['prediction']
    print(' => Training scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
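    # Besides the prediction score, the LSI predict response reports, for
    # each document, its nearest relevant and nearest non-relevant seed
    # document (the nearest_rel_doc / nearest_nrel_doc columns printed in
    # the output above).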
    df = pd.DataFrame({key: res[key] for key in res
                       if key == 'prediction' or 'nearest' in key})

    print("\n3.c. Test categorization with LSI")
    url = BASE_URL + '/lsi/{}/test'.format(lid)
    print(" POST", url)
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y,
                              'ground_truth_filename': ground_truth_file
                              }).json()
    print(res)
    print(' => Test scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
    print('\n', df)

    # 4. Cleaning
    print("\n4.a Delete the extracted features")
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" DELETE", url)
    requests.delete(url)
Total running time of the script: (1 minute 8.696 seconds)