Categorization Example [REST API]
An example to illustrate binary categorization with FreeDiscovery.
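The script below drives a running FreeDiscovery server through its REST API. As a minimal sketch (assuming the server has already been started locally on its default port 5001), the connection can be checked before running the full example:

import requests

BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL
# Fetch the bundled example dataset description; a 200 status code
# confirms the server is reachable
res = requests.get(BASE_URL + '/datasets/treclegal09_2k_subset')
print(res.status_code)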
Out:
0. Load the test dataset
GET http://localhost:5001/api/v0/datasets/treclegal09_2k_subset
1.a Load dataset and initialize feature extraction
POST http://localhost:5001/api/v0/feature-extraction
=> received [u'id', u'filenames']
=> dsid = ffeb050dc0fe425ca76c6616b44c03e2
1.b Start feature extraction (in the background)
POST http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
1.c Monitor feature extraction progress
GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
1.d. Check the parameters of the extracted features
GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
- binary: False
- n_jobs: -1
- stop_words: english
- use_hashing: True
- min_df: 0.0
- n_samples: 2465
- analyzer: word
- ngram_range: [1, 1]
- max_df: 1.0
- chunk_size: 2000
- use_idf: True
- data_dir: ../freediscovery_shared/treclegal09_2k_subset/data
- sublinear_tf: False
- n_samples_processed: 2465
- n_features: 50001
- norm: l2
2.a. Train the ML categorization model
5 relevant, 63 non-relevant files
POST http://localhost:5001/api/v0/categorization/
Training...
=> model id = e3f2f4ad32454a7fa4fd107c892012ac
=> Training scores: MAP = 1.000, ROC-AUC = 1.000
2.b. Check the parameters used in the categorization model
GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac
- method: LinearSVC
- options: {'loss': 'squared_hinge', 'C': 1.0, 'verbose': 0, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 1000, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': True, 'tol': 0.0001, 'class_weight': None}
2.c Categorize the complete dataset with this model
GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/predict
=> Predicting 11 relevant and 2454 non-relevant documents
2.d Test categorization accuracy
using ../freediscovery_shared/treclegal09_2k_subset/ground_truth_file.txt
POST http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/test
=> Test scores: MAP = 0.959, ROC-AUC = 0.958
3.a. Calculate LSI
POST http://localhost:5001/api/v0/lsi/
=> LSI model id = 8b827da5547f46a6962274ee7771248b
=> SVD decomposition with 100 dimensions explaining 48.41 % of the data variability
3.b. Predict categorization with LSI
POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/predict
=> Training scores: MAP = 1.000, ROC-AUC = 1.000
3.c. Test categorization with LSI
POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/test
{u'recall': 0.8333333333333334, u'f1': 0.11695906432748539, u'roc_auc': 0.886295692349504, u'average_precision': 0.4485188870603544, u'precision': 0.06289308176100629}
=> Test scores: MAP = 0.449, ROC-AUC = 0.886
nearest_nrel_doc nearest_rel_doc prediction
0 9 1791 -0.378
1 1457 1791 -0.459
2 2314 3 -0.558
3 2451 3 1.000
4 2314 3 -0.598
5 1337 919 -0.473
6 1600 3 -0.705
7 2314 1047 -0.415
8 1337 3 -0.457
9 9 1791 -1.000
10 2451 3 -0.539
11 32 906 -0.391
12 32 906 -0.369
13 2039 919 -0.824
14 2039 1047 -0.996
15 1563 906 -0.328
16 32 3 -0.399
17 32 906 -0.270
18 99 919 -0.199
19 1104 906 -0.550
20 1337 919 -0.362
21 2275 3 -0.893
22 676 1047 -0.196
23 676 919 -0.197
24 2121 1047 -0.550
25 2275 1791 -0.615
26 32 906 -0.277
27 362 1047 -0.256
28 32 3 -0.389
29 32 906 -0.262
... ... ... ...
2435 987 1791 -0.150
2436 615 919 -0.745
2437 615 919 -0.463
2438 1337 1047 -0.217
2439 1503 1047 -0.424
2440 1503 1047 -0.225
2441 1503 3 -0.483
2442 539 1047 -0.248
2443 9 919 -0.410
2444 9 919 -0.206
2445 9 919 -0.410
2446 9 1047 -0.209
2447 2314 1047 -0.501
2448 2314 1047 -0.174
2449 1337 3 -0.582
2450 2387 3 -0.835
2451 2451 3 -1.000
2452 2451 3 -0.652
2453 2451 3 -0.554
2454 2451 3 -0.372
2455 1561 3 -0.448
2456 2387 3 0.654
2457 1563 1047 -0.537
2458 1117 919 -0.237
2459 1441 1047 -0.702
2460 81 1047 -0.193
2461 81 1047 -0.185
2462 1441 1047 -0.224
2463 81 1047 -0.199
2464 1441 1047 -0.497
[2465 rows x 3 columns]
4.a Delete the extracted features
DELETE http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
from __future__ import print_function
from time import time, sleep
from multiprocessing import Process
import requests
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
pd.options.display.expand_frame_repr = False
dataset_name = "treclegal09_2k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
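# The feature extraction request below is launched in a separate process,
# so the script body must live under the __main__ guard (required by
# multiprocessing when the 'spawn' start method is used, e.g. on Windows).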
if __name__ == '__main__':
    print(" 0. Load the test dataset")
    url = BASE_URL + '/datasets/{}'.format(dataset_name)
    print(" GET", url)
    res = requests.get(url).json()

    # To use a custom dataset, simply specify the following variables
    data_dir = res['data_dir']
    seed_filenames = res['seed_filenames']
    seed_y = res['seed_y']
    ground_truth_file = res['ground_truth_file']  # (optional)

    # 1. Feature extraction
    print("\n1.a Load dataset and initialize feature extraction")
    url = BASE_URL + '/feature-extraction'
    print(" POST", url)
    fe_opts = {'data_dir': data_dir,
               'stop_words': 'english', 'chunk_size': 2000, 'n_jobs': -1,
               'use_idf': 1, 'sublinear_tf': 0, 'binary': 0, 'n_features': 50001,
               'analyzer': 'word', 'ngram_range': (1, 1), "norm": "l2"
               }
    res = requests.post(url, json=fe_opts).json()

    dsid = res['id']
    print(" => received {}".format(list(res.keys())))
    print(" => dsid = {}".format(dsid))
    print("\n1.b Start feature extraction (in the background)")
    # Make this call in a background process (there should be a better way of doing it)
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" POST", url)
    p = Process(target=requests.post, args=(url,))
    p.start()
    sleep(5.0)  # wait a bit for the processing to start
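    # Poll the feature-extraction endpoint: the server answers 200 once
    # processing has finished, 520 if it failed to start, and reports
    # progress through 'n_samples_processed' in the meantime.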
    print('\n1.c Monitor feature extraction progress')
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" GET", url)
    t0 = time()
    while True:
        res = requests.get(url)
        if res.status_code == 520:
            p.terminate()
            raise ValueError('Processing did not start')
        elif res.status_code == 200:
            break  # processing finished
        data = res.json()
        print(' ... {}k/{}k files processed in {:.1f} min'.format(
              data['n_samples_processed']//1000, data['n_samples']//1000, (time() - t0)/60.))
        sleep(15.0)
    p.terminate()  # just in case, should not be necessary
    print("\n1.d. Check the parameters of the extracted features")
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(' GET', url)
    res = requests.get(url).json()
    print('\n'.join([' - {}: {}'.format(key, val) for key, val in res.items()
                     if "filenames" not in key]))
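    # Map the seed document filenames to their internal dataset indices;
    # the categorization endpoints expect these indices (together with the
    # labels in seed_y) as training input.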
    method = BASE_URL + "/feature-extraction/{}/index".format(dsid)
    res = requests.get(method, data={'filenames': seed_filenames})
    seed_index = res.json()['index']
    # 2. Document categorization with ML algorithms
    print("\n2.a. Train the ML categorization model")
    print(" {} relevant, {} non-relevant files".format(seed_y.count(1), seed_y.count(0)))
    url = BASE_URL + '/categorization/'
    print(" POST", url)
    print(' Training...')
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y,
                              'dataset_id': dsid,
                              'method': 'LinearSVC',  # one of "LinearSVC", "LogisticRegression", "xgboost"
                              'cv': 0  # Cross Validation
                              }).json()

    mid = res['id']
    print(" => model id = {}".format(mid))
    print(' => Training scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))

    print("\n2.b. Check the parameters used in the categorization model")
    url = BASE_URL + '/categorization/{}'.format(mid)
    print(" GET", url)
    res = requests.get(url).json()
    print('\n'.join([' - {}: {}'.format(key, val) for key, val in res.items()
                     if key not in ['index', 'y']]))

    print("\n2.c Categorize the complete dataset with this model")
    url = BASE_URL + '/categorization/{}/predict'.format(mid)
    print(" GET", url)
    res = requests.get(url).json()
    prediction = res['prediction']
    print(" => Predicting {} relevant and {} non-relevant documents".format(
          len(list(filter(lambda x: x > 0, prediction))),
          len(list(filter(lambda x: x < 0, prediction)))))

    print("\n2.d Test categorization accuracy")
    print(" using {}".format(ground_truth_file))
    url = BASE_URL + '/categorization/{}/test'.format(mid)
    print(" POST", url)
    res = requests.post(url, json={'ground_truth_filename': ground_truth_file}).json()
    print(' => Test scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
    # 3. Document categorization with LSI
    print("\n3.a. Calculate LSI")
    url = BASE_URL + '/lsi/'
    print(" POST", url)
    n_components = 100
    res = requests.post(url,
                        json={'n_components': n_components,
                              'dataset_id': dsid
                              }).json()

    lid = res['id']
    print(' => LSI model id = {}'.format(lid))
    print(' => SVD decomposition with {} dimensions explaining {:.2f} % of the data variability'.format(
          n_components, res['explained_variance']*100))

    print("\n3.b. Predict categorization with LSI")
    url = BASE_URL + '/lsi/{}/predict'.format(lid)
    print(" POST", url)
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y
                              }).json()
    prediction = res['prediction']
    print(' => Training scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
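    # Besides the prediction score, the LSI predict response reports, for
    # each document, its nearest relevant and nearest non-relevant seed
    # document (the nearest_rel_doc / nearest_nrel_doc columns printed in
    # the output above).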
    df = pd.DataFrame({key: res[key] for key in res
                       if key == 'prediction' or 'nearest' in key})

    print("\n3.c. Test categorization with LSI")
    url = BASE_URL + '/lsi/{}/test'.format(lid)
    print(" POST", url)
    res = requests.post(url,
                        json={'index': seed_index,
                              'y': seed_y,
                              'ground_truth_filename': ground_truth_file
                              }).json()
    print(res)
    print(' => Test scores: MAP = {average_precision:.3f}, ROC-AUC = {roc_auc:.3f}'.format(**res))
    print('\n', df)

    # 4. Cleaning
    print("\n4.a Delete the extracted features")
    url = BASE_URL + '/feature-extraction/{}'.format(dsid)
    print(" DELETE", url)
    requests.delete(url)
Total running time of the script: (1 minute 8.696 seconds)