.. _sphx_glr_examples_categorization_example.py:

Categorization Example [REST API]
---------------------------------

An example to illustrate binary categorization with FreeDiscovery

.. rst-class:: sphx-glr-script-out

 Out::

     0. Load the test dataset
     GET http://localhost:5001/api/v0/datasets/treclegal09_2k_subset

    1.a Load dataset and initialize feature extraction
     POST http://localhost:5001/api/v0/feature-extraction
     => received [u'id', u'filenames']
     => dsid = ffeb050dc0fe425ca76c6616b44c03e2

    1.b Start feature extraction (in the background)
     POST http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

    1.c Monitor feature extraction progress
     GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

    1.d Check the parameters of the extracted features
     GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
     - binary: False
     - n_jobs: -1
     - stop_words: english
     - use_hashing: True
     - min_df: 0.0
     - n_samples: 2465
     - analyzer: word
     - ngram_range: [1, 1]
     - max_df: 1.0
     - chunk_size: 2000
     - use_idf: True
     - data_dir: ../freediscovery_shared/treclegal09_2k_subset/data
     - sublinear_tf: False
     - n_samples_processed: 2465
     - n_features: 50001
     - norm: l2

    2.a Train the ML categorization model
     5 relevant, 63 non-relevant files
     POST http://localhost:5001/api/v0/categorization/
     Training...
     => model id = e3f2f4ad32454a7fa4fd107c892012ac
     => Training scores: MAP = 1.000, ROC-AUC = 1.000

    2.b Check the parameters used in the categorization model
     GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac
     - method: LinearSVC
     - options: {'loss': 'squared_hinge', 'C': 1.0, 'verbose': 0, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 1000, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': True, 'tol': 0.0001, 'class_weight': None}

    2.c Categorize the complete dataset with this model
     GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/predict
     => Predicting 11 relevant and 2454 non-relevant documents

    2.d Test categorization accuracy
     using ../freediscovery_shared/treclegal09_2k_subset/ground_truth_file.txt
     POST http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/test
     => Test scores: MAP = 0.959, ROC-AUC = 0.958

    3.a Calculate LSI
     POST http://localhost:5001/api/v0/lsi/
     => LSI model id = 8b827da5547f46a6962274ee7771248b
     => SVD decomposition with 100 dimensions explaining 48.41 % variability of the data

    3.b Predict categorization with LSI
     POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/predict
     => Training scores: MAP = 1.000, ROC-AUC = 1.000

    3.c Test categorization with LSI
     POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/test
    {u'recall': 0.8333333333333334, u'f1': 0.11695906432748539, u'roc_auc': 0.886295692349504, u'average_precision': 0.4485188870603544, u'precision': 0.06289308176100629}
     => Test scores: MAP = 0.449, ROC-AUC = 0.886

          nearest_nrel_doc  nearest_rel_doc  prediction
    0                    9             1791      -0.378
    1                 1457             1791      -0.459
    2                 2314                3      -0.558
    3                 2451                3       1.000
    4                 2314                3      -0.598
    5                 1337              919      -0.473
    6                 1600                3      -0.705
    7                 2314             1047      -0.415
    8                 1337                3      -0.457
    9                    9             1791      -1.000
    10                2451                3      -0.539
    11                  32              906      -0.391
    12                  32              906      -0.369
    13                2039              919      -0.824
    14                2039             1047      -0.996
    15                1563              906      -0.328
    16                  32                3      -0.399
    17                  32              906      -0.270
    18                  99              919      -0.199
    19                1104              906      -0.550
    20                1337              919      -0.362
    21                2275                3      -0.893
    22                 676             1047      -0.196
    23                 676              919      -0.197
    24                2121             1047      -0.550
    25                2275             1791      -0.615
    26                  32              906      -0.277
    27                 362             1047      -0.256
    28                  32                3      -0.389
    29                  32              906      -0.262
    ...                ...              ...         ...
    2435               987             1791      -0.150
    2436               615              919      -0.745
    2437               615              919      -0.463
    2438              1337             1047      -0.217
    2439              1503             1047      -0.424
    2440              1503             1047      -0.225
    2441              1503                3      -0.483
    2442               539             1047      -0.248
    2443                 9              919      -0.410
    2444                 9              919      -0.206
    2445                 9              919      -0.410
    2446                 9             1047      -0.209
    2447              2314             1047      -0.501
    2448              2314             1047      -0.174
    2449              1337                3      -0.582
    2450              2387                3      -0.835
    2451              2451                3      -1.000
    2452              2451                3      -0.652
    2453              2451                3      -0.554
    2454              2451                3      -0.372
    2455              1561                3      -0.448
    2456              2387                3       0.654
    2457              1563             1047      -0.537
    2458              1117              919      -0.237
    2459              1441             1047      -0.702
    2460                81             1047      -0.193
    2461                81             1047      -0.185
    2462              1441             1047      -0.224
    2463                81             1047      -0.199
    2464              1441             1047      -0.497

    [2465 rows x 3 columns]

    4.a Delete the extracted features
     DELETE http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

|
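The full script is reproduced below. Note that feature extraction runs as a
background task: the script polls the feature-extraction endpoint until it
returns HTTP 200 (processing finished), treating HTTP 520 as a failure to
start. That polling logic can also be factored into a small reusable helper;
the following is a minimal sketch reusing the same status-code semantics (the
helper name and the ``timeout`` parameter are illustrative additions, not part
of the FreeDiscovery API):

.. code-block:: python

    from time import time, sleep

    import requests

    def wait_for_processing(url, poll_interval=15.0, timeout=3600.0):
        """Poll a FreeDiscovery task URL until processing completes.

        Assumes the semantics used in the script below:
        HTTP 200 -> processing finished, HTTP 520 -> processing failed.
        """
        t0 = time()
        while time() - t0 < timeout:
            res = requests.get(url)
            if res.status_code == 200:
                return res.json()  # processing finished
            elif res.status_code == 520:
                raise RuntimeError('processing failed for {}'.format(url))
            sleep(poll_interval)   # still running; wait and poll again
        raise RuntimeError('timed out waiting for {}'.format(url))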
.. code-block:: python

    from __future__ import print_function

    from time import time, sleep
    from multiprocessing import Process

    import requests
    import pandas as pd

    pd.options.display.float_format = '{:,.3f}'.format
    pd.options.display.expand_frame_repr = False

    dataset_name = "treclegal09_2k_subset"     # see list of available datasets
    BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL

    if __name__ == '__main__':

        print(" 0. Load the test dataset")
        url = BASE_URL + '/datasets/{}'.format(dataset_name)
        print(" GET", url)
        res = requests.get(url).json()

        # To use a custom dataset, simply specify the following variables
        data_dir = res['data_dir']
        seed_filenames = res['seed_filenames']
        seed_y = res['seed_y']
        ground_truth_file = res['ground_truth_file']  # (optional)

        # 1. Feature extraction

        print("\n1.a Load dataset and initialize feature extraction")
        url = BASE_URL + '/feature-extraction'
        print(" POST", url)
        fe_opts = {'data_dir': data_dir,
                   'stop_words': 'english', 'chunk_size': 2000, 'n_jobs': -1,
                   'use_idf': 1, 'sublinear_tf': 0, 'binary': 0,
                   'n_features': 50001, 'analyzer': 'word',
                   'ngram_range': (1, 1), 'norm': 'l2'}
        res = requests.post(url, json=fe_opts).json()

        dsid = res['id']
        print(" => received {}".format(list(res.keys())))
        print(" => dsid = {}".format(dsid))

        print("\n1.b Start feature extraction (in the background)")
        # Make this call in a background process
        # (there should be a better way of doing it)
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" POST", url)
        p = Process(target=requests.post, args=(url,))
        p.start()
        sleep(5.0)  # wait a bit for the processing to start

        print('\n1.c Monitor feature extraction progress')
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" GET", url)

        t0 = time()
        while True:
            res = requests.get(url)
            if res.status_code == 520:
                p.terminate()
                raise ValueError('Processing did not start')
            elif res.status_code == 200:
                break  # processing finished
            data = res.json()
            print(' ... {}k/{}k files processed in {:.1f} min'.format(
                  data['n_samples_processed'] // 1000,
                  data['n_samples'] // 1000,
                  (time() - t0) / 60.))
            sleep(15.0)
        p.terminate()  # just in case, should not be necessary

        print("\n1.d Check the parameters of the extracted features")
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(' GET', url)
        res = requests.get(url).json()
        print('\n'.join([' - {}: {}'.format(key, val)
                         for key, val in res.items()
                         if "filenames" not in key]))

        method = BASE_URL + "/feature-extraction/{}/index".format(dsid)
        res = requests.get(method, data={'filenames': seed_filenames})
        seed_index = res.json()['index']

        # 2. Document categorization with ML algorithms

        print("\n2.a Train the ML categorization model")
        print(" {} relevant, {} non-relevant files".format(
              seed_y.count(1), seed_y.count(0)))
        url = BASE_URL + '/categorization/'
        print(" POST", url)
        print(' Training...')

        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y,
                                  'dataset_id': dsid,
                                  # one of 'LinearSVC', 'LogisticRegression', 'xgboost'
                                  'method': 'LinearSVC',
                                  'cv': 0  # cross-validation
                                  }).json()

        mid = res['id']
        print(" => model id = {}".format(mid))
        print(' => Training scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        print("\n2.b Check the parameters used in the categorization model")
        url = BASE_URL + '/categorization/{}'.format(mid)
        print(" GET", url)
        res = requests.get(url).json()
        print('\n'.join([' - {}: {}'.format(key, val)
                         for key, val in res.items()
                         if key not in ['index', 'y']]))

        print("\n2.c Categorize the complete dataset with this model")
        url = BASE_URL + '/categorization/{}/predict'.format(mid)
        print(" GET", url)
        res = requests.get(url).json()
        prediction = res['prediction']
        print(" => Predicting {} relevant and {} non-relevant documents".format(
              len(list(filter(lambda x: x > 0, prediction))),
              len(list(filter(lambda x: x < 0, prediction)))))

        print("\n2.d Test categorization accuracy")
        print(" using {}".format(ground_truth_file))
        url = BASE_URL + '/categorization/{}/test'.format(mid)
        print(" POST", url)
        res = requests.post(url,
                            json={'ground_truth_filename': ground_truth_file}).json()
        print(' => Test scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        # 3. Document categorization with LSI

        print("\n3.a Calculate LSI")
        url = BASE_URL + '/lsi/'
        print(" POST", url)

        n_components = 100
        res = requests.post(url,
                            json={'n_components': n_components,
                                  'dataset_id': dsid}).json()

        lid = res['id']
        print(' => LSI model id = {}'.format(lid))
        print(' => SVD decomposition with {} dimensions explaining '
              '{:.2f} % variability of the data'.format(
              n_components, res['explained_variance'] * 100))

        print("\n3.b Predict categorization with LSI")
        url = BASE_URL + '/lsi/{}/predict'.format(lid)
        print(" POST", url)
        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y}).json()
        prediction = res['prediction']
        print(' => Training scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        df = pd.DataFrame({key: res[key] for key in res
                           if key == 'prediction' or 'nearest' in key})

        print("\n3.c Test categorization with LSI")
        url = BASE_URL + '/lsi/{}/test'.format(lid)
        print(" POST", url)
        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y,
                                  'ground_truth_filename': ground_truth_file}).json()
        print(res)
        print(' => Test scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        print('\n', df)

        # 4. Cleaning

        print("\n4.a Delete the extracted features")
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" DELETE", url)
        requests.delete(url)
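The ``/datasets/{}`` lookup in step 0 only serves to fetch the bundled test
corpus and its seed labels. To run the same workflow on your own documents,
the four input variables can be set directly instead; a minimal sketch, where
all paths and file names are hypothetical placeholders:

.. code-block:: python

    # Hypothetical placeholders -- these replace step 0 of the script above.
    data_dir = '/path/to/my_corpus/data'  # directory containing the documents
    seed_filenames = ['doc_0001.txt',     # a few manually labelled files
                      'doc_0137.txt',
                      'doc_0542.txt']
    seed_y = [1, 0, 0]                    # 1 = relevant, 0 = non-relevant
    # Optional: only needed for the test steps (2.d and 3.c)
    ground_truth_file = '/path/to/my_corpus/ground_truth_file.txt'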
**Total running time of the script:** ( 1 minutes 8.696 seconds)

.. container:: sphx-glr-footer

  .. container:: sphx-glr-download

     :download:`Download Python source code: categorization_example.py <categorization_example.py>`

  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: categorization_example.ipynb <categorization_example.ipynb>`

.. rst-class:: sphx-glr-signature

    `Generated by Sphinx-Gallery <https://sphinx-gallery.readthedocs.io>`_