Clustering
An example of clustering documents with the FreeDiscovery REST API, using K-means and BIRCH on LSI features.
import os.path
import pandas as pd
from time import time
import requests
pd.options.display.float_format = '{:,.3f}'.format
dataset_name = "treclegal09_2k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
0. Load the example dataset
url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
input_ds = requests.get(url).json()
# To use a custom dataset, specify the following variables (see the sketch after the output below)
data_dir = input_ds['metadata']['data_dir']
dataset_definition = [{'document_id': row['document_id'],
'file_path': os.path.join(data_dir, row['file_path'])}
for row in input_ds['dataset']]
Out:
GET http://localhost:5001/api/v0/example-dataset/treclegal09_2k_subset
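The dataset_definition above maps every document_id to an absolute file path. The same structure can be built for a custom corpus; a minimal sketch, where docs_dir and the sequential ids are placeholders rather than part of the example dataset:
import os

# Hypothetical local corpus: every file under docs_dir becomes one document.
docs_dir = '/path/to/my/corpus'  # placeholder path

custom_definition = [{'document_id': i,  # any unique integer id
                      'file_path': os.path.join(docs_dir, fname)}
                     for i, fname in enumerate(sorted(os.listdir(docs_dir)))]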
1. Feature extraction (non-hashed)
1.a Load the dataset and initialize feature extraction
url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url).json()
dsid = res['id']
print(" => received {}".format(list(res.keys())))
print(" => dsid = {}".format(dsid))
Out:
POST http://localhost:5001/api/v0/feature-extraction
=> received ['id']
=> dsid = 25c2c07f71124437
1.b Run feature extraction
url = BASE_URL + '/feature-extraction/{}'.format(dsid)
print(" POST", url)
res = requests.post(url, json={'dataset_definition': dataset_definition})
Out:
POST http://localhost:5001/api/v0/feature-extraction/25c2c07f71124437
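Creating the index and ingesting the documents are two separate POSTs against the same resource, so they can be bundled into a small helper. This is only a convenience sketch around the calls shown in 1.a and 1.b:
def extract_features(base_url, dataset_definition):
    """Create a feature-extraction index, then ingest the documents.

    Convenience wrapper around the two POST requests of steps 1.a and 1.b.
    """
    res = requests.post(base_url + '/feature-extraction').json()
    dsid = res['id']
    requests.post(base_url + '/feature-extraction/{}'.format(dsid),
                  json={'dataset_definition': dataset_definition})
    return dsid

# dsid = extract_features(BASE_URL, dataset_definition)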
2. Calculate LSI
url = BASE_URL + '/lsi/'
print("POST", url)
n_components = 300
res = requests.post(url,
json={'n_components': n_components,
'parent_id': dsid
}).json()
lsi_id = res['id']
print(' => LSI model id = {}'.format(lsi_id))
print((' => SVD decomposition with {} dimensions '
'explaining {:.2f} % variability of the data')
.format(n_components, res['explained_variance']*100))
Out:
POST http://localhost:5001/api/v0/lsi/
=> LSI model id = b373255f3cea4a85
=> SVD decomposition with 300 dimensions explaining 87.16 % variability of the data
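Server-side, LSI amounts to a truncated SVD of the weighted term-document matrix, and explained_variance reports the fraction of variance retained by the kept dimensions. For intuition only, here is a rough local equivalent with scikit-learn on a toy corpus (the corpus and the n_components value are illustrative, not FreeDiscovery internals):
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["grid energy trading", "energy futures contract",
              "court filing memo", "legal memo draft"]
X = TfidfVectorizer().fit_transform(toy_corpus)

svd = TruncatedSVD(n_components=2)          # n_components=300 above
X_lsi = svd.fit_transform(X)                # documents in LSI space
print(svd.explained_variance_ratio_.sum())  # analogue of res['explained_variance']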
3. Document Clustering (LSI + K-means)
print("\n3.a. Document clustering (LSI + K-means)")
url = BASE_URL + '/clustering/k-mean/'
print(" POST", url)
t0 = time()
res = requests.post(url,
json={'parent_id': lsi_id,
'n_clusters': 10,
}).json()
mid = res['id']
print(" => model id = {}".format(mid))
Out:
3.a. Document clustering (LSI + K-means)
POST http://localhost:5001/api/v0/clustering/k-mean/
=> model id = f5b218520ce0433e
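The choice n_clusters=10 is arbitrary here. Since model creation is a single POST, one way to probe other values is to sweep n_clusters and compare the returned cluster_similarity scores, reusing the label retrieval GET shown next in 3.b. A rough sketch; the mean-similarity criterion is illustrative only:
def mean_similarity(n_clusters):
    # Create a k-means model for the given n_clusters ...
    res = requests.post(BASE_URL + '/clustering/k-mean/',
                        json={'parent_id': lsi_id,
                              'n_clusters': n_clusters}).json()
    # ... then retrieve its cluster labels and average the similarities
    labels = requests.get(BASE_URL + '/clustering/k-mean/{}'.format(res['id']),
                          json={'n_top_words': 3}).json()
    sims = [row['cluster_similarity'] for row in labels['data']]
    return sum(sims) / len(sims)

# e.g. best_k = max([5, 10, 15, 20], key=mean_similarity)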
3.b. Computing cluster labels
url = BASE_URL + '/clustering/k-mean/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
json={'n_top_words': 3
}).json()
t1 = time()
data = res['data']
for row in data:
row['n_documents'] = len(row.pop('documents'))
print(' .. computed in {:.1f}s'.format(t1 - t0))
print(pd.DataFrame(data))
Out:
GET http://localhost:5001/api/v0/clustering/k-mean/f5b218520ce0433e
.. computed in 1.2s
cluster_id cluster_label cluster_similarity cluster_size n_documents
0 0 tenet normal mon 0.235 253 253
1 1 ect hou enron_development 0.390 113 113
2 2 migration outlook team 0.331 116 116
3 3 ricafrente ricafrente_david normal 0.241 212 212
4 4 teneo test recipients 0.297 145 145
5 5 enron tana jones 0.234 152 152
6 6 subject ect shall 0.070 1080 1080
7 7 tue normal oct 0.332 134 134
8 8 test administrative recipients 0.255 204 204
9 9 rewrite address server 1.000 56 56
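Note that cluster 6 groups over half the corpus at a similarity of only 0.070, suggesting a heterogeneous catch-all cluster. Each row of res['data'] also carries the documents list popped above; assuming each entry holds a document_id key, as in the optimal-sampling output of section 5, the most homogeneous cluster can be inspected like this:
res = requests.get(BASE_URL + '/clustering/k-mean/{}'.format(mid),
                   json={'n_top_words': 3}).json()
# Pick the cluster whose documents are closest to their centroid
best = max(res['data'], key=lambda row: row['cluster_similarity'])
doc_ids = [d['document_id'] for d in best['documents']]
print(best['cluster_label'], '->', len(doc_ids), 'documents')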
4. Document Clustering (LSI + Birch Clustering)
print("\n4.a. Document clustering (LSI + Birch clustering)")
url = BASE_URL + '/clustering/birch/'
print(" POST", url)
t0 = time()
res = requests.post(url,
json={'parent_id': lsi_id,
'n_clusters': -1,
'min_similarity': 0.7,
'branching_factor': 20,
'max_tree_depth': 2,
}).json()
mid = res['id']
print(" => model id = {}".format(mid))
Out:
4.a. Document clustering (LSI + Birch clustering)
POST http://localhost:5001/api/v0/clustering/birch/
=> model id = 679e56463de74cda
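With n_clusters=-1 the full BIRCH hierarchy is kept instead of being flattened, min_similarity sets a floor on within-subcluster cosine similarity, and branching_factor and max_tree_depth bound the shape of the tree. For intuition: on L2-normalized vectors a cosine floor s corresponds to a Euclidean radius sqrt(2*(1-s)), which is roughly how a threshold would be set for stock scikit-learn Birch. This is a loose analogy only; FreeDiscovery uses its own hierarchical variant, and max_tree_depth has no stock-sklearn counterpart:
import numpy as np
from sklearn.cluster import Birch

min_similarity = 0.7
threshold = np.sqrt(2 * (1 - min_similarity))  # ~0.775 for unit-norm vectors

X = np.random.rand(100, 5)                     # stand-in for LSI vectors
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows

birch = Birch(n_clusters=None, branching_factor=20, threshold=threshold)
labels = birch.fit_predict(X)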
4.b. Computing cluster labels
url = BASE_URL + '/clustering/birch/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
json={'n_top_words': 3
}).json()
t1 = time()
print(' .. computed in {:.1f}s'.format(t1 - t0))
data = res['data']
for row in data:
row['n_documents'] = len(row.pop('documents'))
print(pd.DataFrame(data))
Out:
GET http://localhost:5001/api/v0/clustering/birch/679e56463de74cda
.. computed in 2.0s
children cluster_depth cluster_id cluster_label cluster_similarity cluster_size n_documents
0 [1, 2, 12, 24, 25, 28, 41, 42, 43, 47, 48, 58,... 0 0 normal test ect 0.074 2465 2465
1 [] 1 1 deal aquilla muni 0.684 17 17
2 [3, 4, 5, 6, 7, 8, 9, 10, 11] 1 2 test shackleton administrative 0.187 258 258
3 [] 2 3 rate group public 0.255 36 36
4 [] 2 4 calo shackleton dinner 0.673 16 16
5 [] 2 5 shackleton load normal 0.396 47 47
6 [] 2 6 deseret etringer counsel 0.537 22 22
7 [] 2 7 sample financial trading 0.395 18 18
8 [] 2 8 jones tana load 0.404 32 32
9 [] 2 9 nemec doc gallup 0.458 16 16
10 [] 2 10 services commission energy 0.391 14 14
11 [] 2 11 test recipients administrative 0.446 57 57
12 [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] 1 12 rewrite address ect 0.219 372 372
13 [] 2 13 issues cone enron_development 0.470 26 26
14 [] 2 14 ect hou london 0.380 39 39
15 [] 2 15 rewrite address server 0.531 114 114
16 [] 2 16 attorney confirmation south 0.446 11 11
17 [] 2 17 ena transaction kean 0.394 43 43
18 [] 2 18 memo week clair 0.337 24 24
19 [] 2 19 tiger bailey sekse 0.487 14 14
20 [] 2 20 test bruno gaillard 0.563 15 15
21 [] 2 21 normal test teneo 0.445 46 46
22 [] 2 22 meet load teneo 0.535 13 13
23 [] 2 23 load teneo test 0.494 27 27
24 [] 1 24 amoco jefferson sorenson 0.556 15 15
25 [26, 27] 1 25 sanders normal nov 0.481 63 63
26 [] 2 26 sanders normal nov 0.493 54 54
27 [] 2 27 sanders fri conference 0.578 9 9
28 [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40] 1 28 normal subject wed 0.120 481 481
29 [] 2 29 south simons kean 0.296 41 41
.. ... ... ... ... ... ... ...
57 [] 2 57 wed nov tenet 0.390 38 38
58 [59, 60] 1 58 tue tenet thu 0.290 107 107
59 [] 2 59 thu tenet normal 0.325 68 68
60 [] 2 60 tue oct tenet 0.471 39 39
61 [62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 7... 1 61 enron_development teneo normal 0.102 687 687
62 [] 2 62 products weather internet 0.387 22 22
63 [] 2 63 enron_development advice catalytica 0.337 77 77
64 [] 2 64 teneo test date 0.352 47 47
65 [] 2 65 ruppert operator exxon 0.355 52 52
66 [] 2 66 enron_development gdr hagerty 0.394 32 32
67 [] 2 67 trade inflation investment 0.414 18 18
68 [] 2 68 bump doc agreement 0.356 45 45
69 [] 2 69 boyd hunter language 0.366 27 27
70 [] 2 70 sampling americancentury rcr 0.320 36 36
71 [] 2 71 houston dickson calo 0.352 29 29
72 [] 2 72 attorney fri transactions 0.344 21 21
73 [] 2 73 neuner wed teneo 0.293 32 32
74 [] 2 74 haedicke normal nov 0.413 42 42
75 [] 2 75 teneo taylor_mark bockius 0.334 32 32
76 [] 2 76 registration legal cftc 0.364 14 14
77 [] 2 77 fri tenet mtg 0.374 26 26
78 [] 2 78 lunch otc energy 0.308 48 48
79 [] 2 79 market teneo electricity 0.359 20 20
80 [] 2 80 houston shipper nemec 0.444 21 21
81 [] 2 81 migration outlook team 0.459 46 46
82 [83, 84, 85, 86] 1 82 shackleton_sara teneo group 0.230 82 82
83 [] 2 83 doc isda cini 0.397 33 33
84 [] 2 84 account stay motion 0.395 15 15
85 [] 2 85 laryngitis diligence review 0.434 10 10
86 [] 2 86 shackleton_sara teneo tiger 0.360 24 24
[87 rows x 7 columns]
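The children column encodes the hierarchy: rows with cluster_depth 0 are roots, and each children list names the cluster_ids of the subclusters one level down. The flat frame can be folded back into an indented tree, for example:
by_id = {row['cluster_id']: row for row in data}

def print_tree(cluster_id, indent=0):
    # Print a cluster, then recurse into its subclusters
    row = by_id[cluster_id]
    print('{}[{}] {} ({} docs)'.format('  ' * indent, cluster_id,
                                       row['cluster_label'],
                                       row['cluster_size']))
    for child_id in row['children']:
        print_tree(child_id, indent + 1)

for row in data:
    if row['cluster_depth'] == 0:
        print_tree(row['cluster_id'])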
5. Optimal sampling (LSI + Birch Clustering)
t0 = time()
url = BASE_URL + '/clustering/birch/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
json={'return_optimal_sampling': True,
'sampling_min_coverage': 0.9
}).json()
t1 = time()
print(' .. computed in {:.1f}s'.format(t1 - t0))
data = res['data']
print(pd.DataFrame(data))
Out:
GET http://localhost:5001/api/v0/clustering/birch/679e56463de74cda
.. computed in 0.1s
cluster_id cluster_similarity cluster_size documents
0 15 0.531 114 [{'document_id': 5041, 'similarity': 0.9245882...
1 63 0.337 77 [{'document_id': 33124, 'similarity': 0.594356...
2 39 0.309 76 [{'document_id': 2735716, 'similarity': 0.5962...
3 59 0.325 68 [{'document_id': 3587236, 'similarity': 0.6634...
4 37 0.325 60 [{'document_id': 2402500, 'similarity': 0.6318...
5 11 0.446 57 [{'document_id': 855625, 'similarity': 0.88858...
6 26 0.493 54 [{'document_id': 3802500, 'similarity': 0.5989...
7 55 0.345 54 [{'document_id': 3171961, 'similarity': 0.6304...
8 65 0.355 52 [{'document_id': 1664100, 'similarity': 0.6798...
9 31 0.337 50 [{'document_id': 714025, 'similarity': 0.62037...
10 33 0.336 50 [{'document_id': 4990756, 'similarity': 0.5743...
11 78 0.308 48 [{'document_id': 1119364, 'similarity': 0.5790...
12 5 0.396 47 [{'document_id': 61504, 'similarity': 0.640083...
13 64 0.352 47 [{'document_id': 2149156, 'similarity': 0.7265...
14 21 0.445 46 [{'document_id': 358801, 'similarity': 0.66319...
15 81 0.459 46 [{'document_id': 5527201, 'similarity': 0.6800...
16 68 0.356 45 [{'document_id': 300304, 'similarity': 0.64720...
17 17 0.394 43 [{'document_id': 79524, 'similarity': 0.603813...
18 74 0.413 42 [{'document_id': 2992900, 'similarity': 0.6323...
19 29 0.296 41 [{'document_id': 2809, 'similarity': 0.4766740...
20 14 0.380 39 [{'document_id': 4422609, 'similarity': 0.7943...
21 60 0.471 39 [{'document_id': 1893376, 'similarity': 0.7655...
22 30 0.366 38 [{'document_id': 627264, 'similarity': 0.73297...
23 57 0.390 38 [{'document_id': 2334784, 'similarity': 0.7788...
24 3 0.255 36 [{'document_id': 1168561, 'similarity': 0.3576...
25 36 0.311 36 [{'document_id': 1232100, 'similarity': 0.4595...
26 70 0.320 36 [{'document_id': 910116, 'similarity': 0.51148...
27 51 0.481 34 [{'document_id': 6041764, 'similarity': 0.7041...
28 32 0.351 33 [{'document_id': 1100401, 'similarity': 0.6933...
29 83 0.397 33 [{'document_id': 4765489, 'similarity': 0.5323...
30 8 0.404 32 [{'document_id': 215296, 'similarity': 0.58934...
31 66 0.394 32 [{'document_id': 91809, 'similarity': 0.806330...
32 73 0.293 32 [{'document_id': 3663396, 'similarity': 0.4763...
33 75 0.334 32 [{'document_id': 3378244, 'similarity': 0.6113...
34 38 0.408 30 [{'document_id': 3359889, 'similarity': 0.6664...
35 44 0.381 30 [{'document_id': 504100, 'similarity': 0.57251...
36 35 0.401 29 [{'document_id': 3164841, 'similarity': 0.5582...
37 71 0.352 29 [{'document_id': 3279721, 'similarity': 0.5513...
38 45 0.348 28 [{'document_id': 1962801, 'similarity': 0.6525...
39 54 0.368 28 [{'document_id': 2468041, 'similarity': 0.5644...
40 23 0.494 27 [{'document_id': 835396, 'similarity': 0.65802...
41 46 0.443 27 [{'document_id': 2076481, 'similarity': 0.8140...
42 69 0.366 27 [{'document_id': 1032256, 'similarity': 0.5915...
43 13 0.470 26 [{'document_id': 289, 'similarity': 0.85353300...
44 77 0.374 26 [{'document_id': 5788836, 'similarity': 0.6740...
45 18 0.337 24 [{'document_id': 16900, 'similarity': 0.457269...
46 50 0.372 24 [{'document_id': 4494400, 'similarity': 0.6840...
47 86 0.360 24 [{'document_id': 4622500, 'similarity': 0.4755...
48 34 0.368 23 [{'document_id': 874225, 'similarity': 0.48114...
49 52 0.384 23 [{'document_id': 2524921, 'similarity': 0.6179...
50 6 0.537 22 [{'document_id': 17689, 'similarity': 0.804906...
51 62 0.387 22 [{'document_id': 1129969, 'similarity': 0.6836...
52 47 0.498 21 [{'document_id': 3118756, 'similarity': 0.7570...
53 56 0.489 21 [{'document_id': 2307361, 'similarity': 0.7566...
54 72 0.344 21 [{'document_id': 4652649, 'similarity': 0.5300...
55 80 0.444 21 [{'document_id': 652864, 'similarity': 0.69850...
56 79 0.359 20 [{'document_id': 142884, 'similarity': 0.59510...
57 7 0.395 18 [{'document_id': 125316, 'similarity': 0.60864...
58 53 0.448 18 [{'document_id': 2669956, 'similarity': 0.7132...
59 67 0.414 18 [{'document_id': 44100, 'similarity': 0.658200...
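Each sampled row keeps its documents list of {document_id, similarity} entries. Assuming those lists hold the documents proposed for review, a quick client-side tally of the sample size versus the mass it covers:
n_to_review = sum(len(row['documents']) for row in data)
n_covered = sum(row['cluster_size'] for row in data)
print('{} documents to review, covering clusters that hold {} documents'
      .format(n_to_review, n_covered))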
6. Delete the extracted features
url = BASE_URL + '/feature-extraction/{}'.format(dsid)
requests.delete(url)
Total running time of the script: ( 0 minutes 9.746 seconds)