.. _sphx_glr_examples_categorization_example.py:

Categorization Example [REST API]
---------------------------------

An example to illustrate binary categorization with FreeDiscovery

.. rst-class:: sphx-glr-script-out

 Out::

     0. Load the test dataset
     GET http://localhost:5001/api/v0/datasets/treclegal09_2k_subset

    1.a Load dataset and initialize feature extraction
     POST http://localhost:5001/api/v0/feature-extraction
     => received [u'id', u'filenames']
     => dsid = ffeb050dc0fe425ca76c6616b44c03e2

    1.b Start feature extraction (in the background)
     POST http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

    1.c Monitor feature extraction progress
     GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

    1.d Check the parameters of the extracted features
     GET http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2
     - binary: False
     - n_jobs: -1
     - stop_words: english
     - use_hashing: True
     - min_df: 0.0
     - n_samples: 2465
     - analyzer: word
     - ngram_range: [1, 1]
     - max_df: 1.0
     - chunk_size: 2000
     - use_idf: True
     - data_dir: ../freediscovery_shared/treclegal09_2k_subset/data
     - sublinear_tf: False
     - n_samples_processed: 2465
     - n_features: 50001
     - norm: l2

    2.a Train the ML categorization model
     5 relevant, 63 non-relevant files
     POST http://localhost:5001/api/v0/categorization/
     Training...
     => model id = e3f2f4ad32454a7fa4fd107c892012ac
     => Training scores: MAP = 1.000, ROC-AUC = 1.000

    2.b Check the parameters used in the categorization model
     GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac
     - method: LinearSVC
     - options: {'loss': 'squared_hinge', 'C': 1.0, 'verbose': 0, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 1000, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': True, 'tol': 0.0001, 'class_weight': None}

    2.c Categorize the complete dataset with this model
     GET http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/predict
     => Predicting 11 relevant and 2454 non-relevant documents

    2.d Test categorization accuracy
     using ../freediscovery_shared/treclegal09_2k_subset/ground_truth_file.txt
     POST http://localhost:5001/api/v0/categorization/e3f2f4ad32454a7fa4fd107c892012ac/test
     => Test scores: MAP = 0.959, ROC-AUC = 0.958

    3.a Calculate LSI
     POST http://localhost:5001/api/v0/lsi/
     => LSI model id = 8b827da5547f46a6962274ee7771248b
     => SVD decomposition with 100 dimensions explaining 48.41 % variability of the data

    3.b Predict categorization with LSI
     POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/predict
     => Training scores: MAP = 1.000, ROC-AUC = 1.000

    3.c Test categorization with LSI
     POST http://localhost:5001/api/v0/lsi/8b827da5547f46a6962274ee7771248b/test
    {u'recall': 0.8333333333333334, u'f1': 0.11695906432748539, u'roc_auc': 0.886295692349504, u'average_precision': 0.4485188870603544, u'precision': 0.06289308176100629}
     => Test scores: MAP = 0.449, ROC-AUC = 0.886

          nearest_nrel_doc  nearest_rel_doc  prediction
    0                    9             1791      -0.378
    1                 1457             1791      -0.459
    2                 2314                3      -0.558
    3                 2451                3       1.000
    4                 2314                3      -0.598
    5                 1337              919      -0.473
    6                 1600                3      -0.705
    7                 2314             1047      -0.415
    8                 1337                3      -0.457
    9                    9             1791      -1.000
    10                2451                3      -0.539
    11                  32              906      -0.391
    12                  32              906      -0.369
    13                2039              919      -0.824
    14                2039             1047      -0.996
    15                1563              906      -0.328
    16                  32                3      -0.399
    17                  32              906      -0.270
    18                  99              919      -0.199
    19                1104              906      -0.550
    20                1337              919      -0.362
    21                2275                3      -0.893
    22                 676             1047      -0.196
    23                 676              919      -0.197
    24                2121             1047      -0.550
    25                2275             1791      -0.615
    26                  32              906      -0.277
    27                 362             1047      -0.256
    28                  32                3      -0.389
    29                  32              906      -0.262
    ...                ...              ...         ...
    2435               987             1791      -0.150
    2436               615              919      -0.745
    2437               615              919      -0.463
    2438              1337             1047      -0.217
    2439              1503             1047      -0.424
    2440              1503             1047      -0.225
    2441              1503                3      -0.483
    2442               539             1047      -0.248
    2443                 9              919      -0.410
    2444                 9              919      -0.206
    2445                 9              919      -0.410
    2446                 9             1047      -0.209
    2447              2314             1047      -0.501
    2448              2314             1047      -0.174
    2449              1337                3      -0.582
    2450              2387                3      -0.835
    2451              2451                3      -1.000
    2452              2451                3      -0.652
    2453              2451                3      -0.554
    2454              2451                3      -0.372
    2455              1561                3      -0.448
    2456              2387                3       0.654
    2457              1563             1047      -0.537
    2458              1117              919      -0.237
    2459              1441             1047      -0.702
    2460                81             1047      -0.193
    2461                81             1047      -0.185
    2462              1441             1047      -0.224
    2463                81             1047      -0.199
    2464              1441             1047      -0.497

    [2465 rows x 3 columns]

    4.a Delete the extracted features
     DELETE http://localhost:5001/api/v0/feature-extraction/ffeb050dc0fe425ca76c6616b44c03e2

|
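The full script is reproduced below. Note that feature extraction runs as a
background task: the script polls the feature-extraction endpoint until it
returns HTTP 200 (processing finished), treating HTTP 520 as a failure to
start. That polling logic can also be factored into a small reusable helper;
the following is a minimal sketch reusing the same status-code semantics (the
helper name and the ``timeout`` parameter are illustrative additions, not part
of the FreeDiscovery API):

.. code-block:: python

    from time import time, sleep

    import requests

    def wait_for_processing(url, poll_interval=15.0, timeout=3600.0):
        """Poll a FreeDiscovery task URL until processing completes.

        Assumes the semantics used in the script below:
        HTTP 200 -> processing finished, HTTP 520 -> processing failed.
        """
        t0 = time()
        while time() - t0 < timeout:
            res = requests.get(url)
            if res.status_code == 200:
                return res.json()  # processing finished
            elif res.status_code == 520:
                raise RuntimeError('processing failed for {}'.format(url))
            sleep(poll_interval)   # still running; wait and poll again
        raise RuntimeError('timed out waiting for {}'.format(url))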
.. code-block:: python

    from __future__ import print_function

    from time import time, sleep
    from multiprocessing import Process

    import requests
    import pandas as pd

    pd.options.display.float_format = '{:,.3f}'.format
    pd.options.display.expand_frame_repr = False

    dataset_name = "treclegal09_2k_subset"     # see list of available datasets
    BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL

    if __name__ == '__main__':

        print(" 0. Load the test dataset")
        url = BASE_URL + '/datasets/{}'.format(dataset_name)
        print(" GET", url)
        res = requests.get(url).json()

        # To use a custom dataset, simply specify the following variables
        data_dir = res['data_dir']
        seed_filenames = res['seed_filenames']
        seed_y = res['seed_y']
        ground_truth_file = res['ground_truth_file']  # (optional)

        # 1. Feature extraction

        print("\n1.a Load dataset and initialize feature extraction")
        url = BASE_URL + '/feature-extraction'
        print(" POST", url)
        fe_opts = {'data_dir': data_dir,
                   'stop_words': 'english', 'chunk_size': 2000, 'n_jobs': -1,
                   'use_idf': 1, 'sublinear_tf': 0, 'binary': 0,
                   'n_features': 50001, 'analyzer': 'word',
                   'ngram_range': (1, 1), 'norm': 'l2'}
        res = requests.post(url, json=fe_opts).json()

        dsid = res['id']
        print(" => received {}".format(list(res.keys())))
        print(" => dsid = {}".format(dsid))

        print("\n1.b Start feature extraction (in the background)")
        # Make this call in a background process
        # (there should be a better way of doing it)
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" POST", url)
        p = Process(target=requests.post, args=(url,))
        p.start()
        sleep(5.0)  # wait a bit for the processing to start

        print('\n1.c Monitor feature extraction progress')
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" GET", url)

        t0 = time()
        while True:
            res = requests.get(url)
            if res.status_code == 520:
                p.terminate()
                raise ValueError('Processing did not start')
            elif res.status_code == 200:
                break  # processing finished
            data = res.json()
            print(' ... {}k/{}k files processed in {:.1f} min'.format(
                  data['n_samples_processed'] // 1000,
                  data['n_samples'] // 1000,
                  (time() - t0) / 60.))
            sleep(15.0)
        p.terminate()  # just in case, should not be necessary

        print("\n1.d Check the parameters of the extracted features")
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(' GET', url)
        res = requests.get(url).json()
        print('\n'.join([' - {}: {}'.format(key, val)
                         for key, val in res.items()
                         if "filenames" not in key]))

        method = BASE_URL + "/feature-extraction/{}/index".format(dsid)
        res = requests.get(method, data={'filenames': seed_filenames})
        seed_index = res.json()['index']

        # 2. Document categorization with ML algorithms

        print("\n2.a Train the ML categorization model")
        print(" {} relevant, {} non-relevant files".format(
              seed_y.count(1), seed_y.count(0)))
        url = BASE_URL + '/categorization/'
        print(" POST", url)
        print(' Training...')

        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y,
                                  'dataset_id': dsid,
                                  # one of 'LinearSVC', 'LogisticRegression', 'xgboost'
                                  'method': 'LinearSVC',
                                  'cv': 0  # cross-validation
                                  }).json()

        mid = res['id']
        print(" => model id = {}".format(mid))
        print(' => Training scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        print("\n2.b Check the parameters used in the categorization model")
        url = BASE_URL + '/categorization/{}'.format(mid)
        print(" GET", url)
        res = requests.get(url).json()
        print('\n'.join([' - {}: {}'.format(key, val)
                         for key, val in res.items()
                         if key not in ['index', 'y']]))

        print("\n2.c Categorize the complete dataset with this model")
        url = BASE_URL + '/categorization/{}/predict'.format(mid)
        print(" GET", url)
        res = requests.get(url).json()
        prediction = res['prediction']
        print(" => Predicting {} relevant and {} non-relevant documents".format(
              len(list(filter(lambda x: x > 0, prediction))),
              len(list(filter(lambda x: x < 0, prediction)))))

        print("\n2.d Test categorization accuracy")
        print(" using {}".format(ground_truth_file))
        url = BASE_URL + '/categorization/{}/test'.format(mid)
        print(" POST", url)
        res = requests.post(url,
                            json={'ground_truth_filename': ground_truth_file}).json()
        print(' => Test scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        # 3. Document categorization with LSI

        print("\n3.a Calculate LSI")
        url = BASE_URL + '/lsi/'
        print(" POST", url)

        n_components = 100
        res = requests.post(url,
                            json={'n_components': n_components,
                                  'dataset_id': dsid}).json()

        lid = res['id']
        print(' => LSI model id = {}'.format(lid))
        print(' => SVD decomposition with {} dimensions explaining '
              '{:.2f} % variability of the data'.format(
              n_components, res['explained_variance'] * 100))

        print("\n3.b Predict categorization with LSI")
        url = BASE_URL + '/lsi/{}/predict'.format(lid)
        print(" POST", url)
        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y}).json()
        prediction = res['prediction']
        print(' => Training scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        df = pd.DataFrame({key: res[key] for key in res
                           if key == 'prediction' or 'nearest' in key})

        print("\n3.c Test categorization with LSI")
        url = BASE_URL + '/lsi/{}/test'.format(lid)
        print(" POST", url)
        res = requests.post(url,
                            json={'index': seed_index,
                                  'y': seed_y,
                                  'ground_truth_filename': ground_truth_file}).json()
        print(res)
        print(' => Test scores: MAP = {average_precision:.3f}, '
              'ROC-AUC = {roc_auc:.3f}'.format(**res))

        print('\n', df)

        # 4. Cleaning

        print("\n4.a Delete the extracted features")
        url = BASE_URL + '/feature-extraction/{}'.format(dsid)
        print(" DELETE", url)
        requests.delete(url)
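The ``/datasets/{}`` lookup in step 0 only serves to fetch the bundled test
corpus and its seed labels. To run the same workflow on your own documents,
the four input variables can be set directly instead; a minimal sketch, where
all paths and file names are hypothetical placeholders:

.. code-block:: python

    # Hypothetical placeholders -- these replace step 0 of the script above.
    data_dir = '/path/to/my_corpus/data'  # directory containing the documents
    seed_filenames = ['doc_0001.txt',     # a few manually labelled files
                      'doc_0137.txt',
                      'doc_0542.txt']
    seed_y = [1, 0, 0]                    # 1 = relevant, 0 = non-relevant
    # Optional: only needed for the test steps (2.d and 3.c)
    ground_truth_file = '/path/to/my_corpus/ground_truth_file.txt'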
**Total running time of the script:** ( 1 minutes 8.696 seconds)

.. container:: sphx-glr-footer

  .. container:: sphx-glr-download

     :download:`Download Python source code: categorization_example.py <categorization_example.py>`

  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: categorization_example.ipynb <categorization_example.ipynb>`

.. rst-class:: sphx-glr-signature

    `Generated by Sphinx-Gallery <https://sphinx-gallery.readthedocs.io>`_