Categorization
An example to illustrate binary categorization with FreeDiscovery.
from __future__ import print_function
import os.path
import requests
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
pd.options.display.expand_frame_repr = False
dataset_name = "treclegal09_2k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
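All the REST calls below follow the same pattern: print the URL, issue the request with requests, and decode the JSON body. A small helper such as the one below can cut down the repetition; it is purely a convenience sketch for this example (not part of the FreeDiscovery API), and the walkthrough keeps the explicit calls so each request stays visible.
def api_call(method, endpoint, **kwargs):
    # Hypothetical convenience wrapper; relies only on the `requests` library
    url = BASE_URL + endpoint
    print(" {} {}".format(method.upper(), url))
    response = requests.request(method, url, **kwargs)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()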
0. Load the example dataset
url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
input_ds = requests.get(url).json()
data_dir = input_ds['metadata']['data_dir']
dataset_definition = [{'document_id': row['document_id'],
'file_path': os.path.join(data_dir, row['file_path'])}
for row in input_ds['dataset']]
Out:
GET http://localhost:5001/api/v0/example-dataset/treclegal09_2k_subset
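The response bundles the dataset metadata, the list of documents and a labelled training set. As a quick sanity check (a sketch that only uses the fields returned above), one can peek at the dataset definition and at the distribution of training labels:
# Show the first few entries of the dataset definition built above
for row in dataset_definition[:3]:
    print(row['document_id'], row['file_path'])
# Count the labelled examples available for training
print(pd.DataFrame(input_ds['training_set'])['category'].value_counts())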
1. Feature extraction
print("\n1.a Load dataset and initalize feature extraction")
url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url).json()
dsid = res['id']
print(" => received {}".format(list(res.keys())))
print(" => dsid = {}".format(dsid))
Out:
1.a Load dataset and initialize feature extraction
POST http://localhost:5001/api/v0/feature-extraction
=> received ['id']
=> dsid = d57bcc9311af477d
1.b Start feature extraction (in the background)
url = BASE_URL+'/feature-extraction/{}'.format(dsid)
print(" POST", url)
res = requests.post(url, json={'dataset_definition': dataset_definition})
Out:
POST http://localhost:5001/api/v0/feature-extraction/d57bcc9311af477d
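The response of this POST is not inspected above. Since the extraction runs in the background, it is worth at least verifying that the request was accepted before moving on; a minimal check using only the requests response object:
# Abort early if the server rejected the extraction request
if not res.ok:
    raise RuntimeError("Feature extraction request failed: HTTP {} - {}"
                       .format(res.status_code, res.text))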
2. Calculate Latent Semantic Indexing
(used by the Nearest Neighbors method)
url = BASE_URL + '/lsi/'
print("POST", url)
n_components = 100
res = requests.post(url,
json={'n_components': n_components,
'parent_id': dsid
}).json()
lsi_id = res['id']
print(' => LSI model id = {}'.format(lsi_id))
print((' => SVD decomposition with {} dimensions explaining '
'{:.2f} % variability of the data')
.format(n_components, res['explained_variance']*100))
Out:
POST http://localhost:5001/api/v0/lsi/
=> LSI model id = 299dfe63623f468e
 => SVD decomposition with 100 dimensions explaining 69.79 % variability of the data
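The explained_variance field makes it easy to see how much of the data a given dimensionality retains. The sketch below reuses only the /lsi/ endpoint and the parameters shown above to compare a few values of n_components; note that each call fits (and stores) a separate LSI model.
# Compare the variance explained by different LSI dimensionalities
for n in [50, 100, 200]:
    r = requests.post(BASE_URL + '/lsi/',
                      json={'n_components': n, 'parent_id': dsid}).json()
    print(' n_components={:>4} => {:.2f} % explained variance'
          .format(n, r['explained_variance'] * 100))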
3. Document categorization
3.a. Train the categorization model
print(" {} positive, {} negative files".format(
pd.DataFrame(input_ds['training_set'])
.groupby('category').count()['document_id'], 0))
for method, use_lsi in [('LinearSVC', False),
('NearestNeighbor', True)]:
print('='*80, '\n', ' '*10,
method, " + LSI" if use_lsi else ' ', '\n', '='*80)
if use_lsi:
# Categorization with the previously created LSI model
parent_id = lsi_id
else:
# Categorization with original text features
parent_id = dsid
url = BASE_URL + '/categorization/'
print(" POST", url)
print(' Training...')
res = requests.post(url,
json={'parent_id': parent_id,
'data': input_ds['training_set'],
'method': method,
'training_scores': True
}).json()
mid = res['id']
print(" => model id = {}".format(mid))
print((" => Training scores: MAP = {average_precision:.3f}, "
"ROC-AUC = {roc_auc:.3f}, recall @20%: {recall_at_20p:.3f} ")
.format(**res['training_scores']))
print("\n3.b. Check the parameters used in the categorization model")
url = BASE_URL + '/categorization/{}'.format(mid)
print(" GET", url)
res = requests.get(url).json()
print('\n'.join([' - {}: {}'.format(key, val)
for key, val in res.items() if key not in ['index', 'category']]))
print("\n3.c Categorize the complete dataset with this model")
url = BASE_URL + '/categorization/{}/predict'.format(mid)
print(" GET", url)
res = requests.get(url, json={'subset': 'test'}).json()
data = []
for row in res['data']:
nrow = {'document_id': row['document_id'],
'category': row['scores'][0]['category'],
'score': row['scores'][0]['score']}
if method == 'NearestNeighbor':
nrow['nearest_document_id'] = row['scores'][0]['document_id']
data.append(nrow)
df = pd.DataFrame(data).set_index('document_id')
print(df)
print("\n3.d Compute the categorization scores")
url = BASE_URL + '/metrics/categorization'
print(" GET", url)
res = requests.post(url, json={'y_true': input_ds['dataset'],
'y_pred': res['data']}).json()
print((" => Test scores: MAP = {average_precision:.3f}, "
"ROC-AUC = {roc_auc:.3f}, recall @20%: {recall_at_20p:.3f} ")
.format(**res))
Out:
 5 positive, 63 negative files in the training set
================================================================================
LinearSVC
================================================================================
POST http://localhost:5001/api/v0/categorization/
Training...
=> model id = 1d814ddcb58a4c10
=> Training scores: MAP = 0.815, ROC-AUC = 0.900, recall @20%: 0.800
3.b. Check the parameters used in the categorization model
GET http://localhost:5001/api/v0/categorization/1d814ddcb58a4c10
- method: LinearSVC
- options: {'C': 1.0, 'class_weight': None, 'dual': True, 'fit_intercept': True, 'intercept_scaling': 1, 'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr', 'penalty': 'l2', 'random_state': None, 'tol': 0.0001, 'verbose': 0}
3.c Categorize the complete dataset with this model
GET http://localhost:5001/api/v0/categorization/1d814ddcb58a4c10/predict
category score
document_id
568516 negative 0.778
3717184 negative 0.762
454276 negative 0.762
5171076 negative 0.760
3568321 negative 0.759
1803649 negative 0.754
1252161 negative 0.753
5166529 negative 0.752
573049 negative 0.752
885481 negative 0.751
877969 negative 0.751
3286969 negative 0.750
274576 negative 0.750
118336 negative 0.750
5239521 negative 0.749
3709476 negative 0.749
319225 negative 0.749
4941729 negative 0.749
119716 negative 0.749
4507129 negative 0.749
4713241 negative 0.748
2845969 negative 0.748
4981824 negative 0.748
4708900 negative 0.748
320356 negative 0.748
4831204 negative 0.748
4359744 negative 0.747
318096 negative 0.747
1190281 negative 0.746
1254400 negative 0.746
... ... ...
4652649 negative 0.631
36481 negative 0.631
38025 negative 0.631
199809 negative 0.630
90000 negative 0.630
4857616 negative 0.630
268324 negative 0.630
270400 negative 0.630
80656 negative 0.630
5022081 negative 0.630
4968441 negative 0.630
88804 negative 0.630
1170724 negative 0.630
79524 negative 0.630
3073009 negative 0.629
350464 negative 0.628
984064 negative 0.627
221841 negative 0.627
846400 negative 0.621
3240000 negative 0.620
3222025 negative 0.620
1046529 negative 0.620
316969 negative 0.619
822649 negative 0.619
850084 negative 0.575
833569 negative 0.567
3422500 negative 0.552
819025 negative 0.549
3182656 negative 0.519
3323329 negative 0.501
[2397 rows x 2 columns]
3.d Compute the categorization scores
POST http://localhost:5001/api/v0/metrics/categorization
=> Test scores: MAP = 0.982, ROC-AUC = 1.000, recall @20%: 1.000
================================================================================
NearestNeighbor + LSI
================================================================================
POST http://localhost:5001/api/v0/categorization/
Training...
=> model id = c8262ecf879941d9
=> Training scores: MAP = 1.000, ROC-AUC = 1.000, recall @20%: 1.000
3.b. Check the parameters used in the categorization model
GET http://localhost:5001/api/v0/categorization/c8262ecf879941d9
- method: NearestNeighbor
- options: {'algorithm': 'brute', 'leaf_size': 30, 'n_jobs': 1, 'radius': None}
3.c Categorize the complete dataset with this model
GET http://localhost:5001/api/v0/categorization/c8262ecf879941d9/predict
category nearest_document_id score
document_id
37249 negative 150544 1.000
1245456 negative 150544 1.000
205209 negative 150544 1.000
35721 negative 150544 1.000
541696 negative 150544 1.000
531441 negative 150544 1.000
75076 negative 150544 1.000
851929 negative 150544 1.000
3625216 negative 1218816 1.000
1203409 negative 150544 1.000
3617604 negative 1218816 1.000
1210000 negative 1218816 1.000
1214404 negative 1218816 1.000
432964 negative 150544 1.000
354025 negative 150544 1.000
465124 negative 150544 1.000
861184 negative 150544 1.000
801025 negative 150544 1.000
868624 negative 150544 1.000
1256641 negative 150544 1.000
1261129 negative 150544 1.000
138384 negative 150544 1.000
5041 negative 150544 1.000
5625 negative 150544 1.000
1067089 negative 150544 1.000
164836 negative 150544 1.000
230400 negative 150544 1.000
1279161 negative 150544 1.000
925444 negative 150544 1.000
881721 negative 150544 1.000
... ... ... ...
5776 negative 1263376 0.086
4338889 negative 1263376 0.086
5184 negative 1263376 0.086
4624 negative 1263376 0.086
4322241 negative 1263376 0.086
4318084 negative 456976 0.086
4334724 negative 456976 0.086
4489 negative 456976 0.086
599076 negative 1846881 0.083
1800964 negative 1846881 0.083
300304 negative 6561 0.083
363609 negative 335241 0.082
544644 positive 844561 0.082
490000 negative 81 0.074
504100 negative 6007401 0.068
682276 negative 1024 0.067
1582564 negative 1024 0.067
1530169 negative 2788900 0.061
627264 negative 2788900 0.061
1532644 negative 2788900 0.059
628849 negative 2788900 0.059
609961 negative 2788900 0.058
1500625 negative 2788900 0.058
1503076 negative 2788900 0.056
611524 negative 2788900 0.056
361201 positive 3207681 0.051
546121 positive 3207681 0.051
524176 positive 844561 0.046
519841 negative 2560000 0.045
543169 positive 844561 0.043
[2397 rows x 3 columns]
3.d Compute the categorization scores
POST http://localhost:5001/api/v0/metrics/categorization
=> Test scores: MAP = 0.287, ROC-AUC = 0.667, recall @20%: 0.571
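Besides the aggregated metrics returned by /metrics/categorization, the predictions can also be compared with the labels client-side. The sketch below cross-tabulates predicted against true categories for the last trained model, assuming (as their use as y_true above suggests) that the rows of input_ds['dataset'] carry a 'category' field:
# Cross-tabulate predicted vs. true categories with pandas
y_true = pd.DataFrame(input_ds['dataset']).set_index('document_id')['category']
y_pred = df['category']  # predictions from step 3.c
print(pd.crosstab(y_true, y_pred, rownames=['true'], colnames=['predicted']))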
4. Delete the extracted features (and LSI decomposition)
url = BASE_URL + '/feature-extraction/{}'.format(dsid)
requests.delete(url)
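The response of the DELETE call is discarded above. If confirmation is wanted, the same call can be written as below (a minor variation, still using only requests):
# Variant of the call above that reports the HTTP status of the deletion
res = requests.delete(url)
print(" DELETE", url, "=>", res.status_code)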
Total running time of the script: ( 0 minutes 5.449 seconds)