Clustering

Cluster documents with K-means and Birch, using LSI features computed by a FreeDiscovery server

import os.path
import pandas as pd
from time import time
import requests

pd.options.display.float_format = '{:,.3f}'.format


dataset_name = "treclegal09_2k_subset"     # see list of available datasets

BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL

0. Load the example dataset

url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
input_ds = requests.get(url).json()

# To use a custom dataset, simply specify the following variables
data_dir = input_ds['metadata']['data_dir']
dataset_definition = [{'document_id': row['document_id'],
                       'file_path': os.path.join(data_dir, row['file_path'])}
                      for row in input_ds['dataset']]

Out:

GET http://localhost:5001/api/v0/example-dataset/treclegal09_2k_subset
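For a custom dataset, the same `dataset_definition` structure could be built from a local directory. The following is a minimal sketch, not part of the original example: the helper name `build_dataset_definition` is hypothetical, and using sequential integers as `document_id` is an assumption made for illustration.

```python
import os


def build_dataset_definition(data_dir):
    """Return one {'document_id', 'file_path'} entry per file under data_dir.

    Hypothetical helper: document ids are simply sequential integers.
    """
    file_paths = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            file_paths.append(os.path.join(root, name))
    # Sort so that the assigned ids are reproducible between runs
    file_paths.sort()
    return [{'document_id': idx, 'file_path': path}
            for idx, path in enumerate(file_paths)]
```

The resulting list has the same shape as the `dataset_definition` built from the example dataset above, so it could be passed to the feature extraction step in its place.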

1. Feature extraction (non hashed)

1.a Load dataset and initialize feature extraction

url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url).json()

dsid = res['id']
print("   => received {}".format(list(res.keys())))
print("   => dsid = {}".format(dsid))

Out:

POST http://localhost:5001/api/v0/feature-extraction
   => received ['id']
   => dsid = 25c2c07f71124437

1.b Run feature extraction

url = BASE_URL+'/feature-extraction/{}'.format(dsid)
print(" POST", url)
res = requests.post(url, json={'dataset_definition': dataset_definition})

Out:

POST http://localhost:5001/api/v0/feature-extraction/25c2c07f71124437

2. Calculate LSI

url = BASE_URL + '/lsi/'
print("POST", url)

n_components = 300
res = requests.post(url,
                    json={'n_components': n_components,
                          'parent_id': dsid
                          }).json()

lsi_id = res['id']
print('  => LSI model id = {}'.format(lsi_id))
print(('  => SVD decomposition with {} dimensions '
       'explaining {:.2f} % of the variability in the data')
      .format(n_components, res['explained_variance']*100))

Out:

POST http://localhost:5001/api/v0/lsi/
  => LSI model id = b373255f3cea4a85
  => SVD decomposition with 300 dimensions explaining 87.16 % of the variability in the data

3. Document Clustering (LSI + K-Means)

print("\n3.a. Document clustering (LSI + K-means)")

url = BASE_URL + '/clustering/k-mean/'
print(" POST", url)
t0 = time()
res = requests.post(url,
                    json={'parent_id': lsi_id,
                          'n_clusters': 10,
                          }).json()

mid = res['id']
print("     => model id = {}".format(mid))

Out:

3.a. Document clustering (LSI + K-means)
 POST http://localhost:5001/api/v0/clustering/k-mean/
     => model id = f5b218520ce0433e

3.b. Computing cluster labels

url = BASE_URL + '/clustering/k-mean/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
                   json={'n_top_words': 3
                         }).json()
t1 = time()


data = res['data']
for row in data:
    # Replace each cluster's document list with its length to keep the table compact
    row['n_documents'] = len(row.pop('documents'))

print('    .. computed in {:.1f}s'.format(t1 - t0))
print(pd.DataFrame(data))

Out:

GET http://localhost:5001/api/v0/clustering/k-mean/f5b218520ce0433e
    .. computed in 1.2s
   cluster_id                       cluster_label  cluster_similarity  cluster_size  n_documents
0           0                    tenet normal mon               0.235           253          253
1           1           ect hou enron_development               0.390           113          113
2           2              migration outlook team               0.331           116          116
3           3  ricafrente ricafrente_david normal               0.241           212          212
4           4               teneo test recipients               0.297           145          145
5           5                    enron tana jones               0.234           152          152
6           6                   subject ect shall               0.070          1080         1080
7           7                      tue normal oct               0.332           134          134
8           8      test administrative recipients               0.255           204          204
9           9              rewrite address server               1.000            56           56
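In the table above, cluster 6 contains 1080 documents with a mean similarity of 0.070, which suggests a catch-all cluster. A quick way to spot such clusters is to compute the share of documents in each one. This is a sketch that works on values copied from the output above rather than on a live response; the name `cluster_shares` is introduced here for illustration.

```python
# (cluster_id, cluster_similarity, cluster_size) copied from the output above
kmeans_rows = [
    (0, 0.235, 253), (1, 0.390, 113), (2, 0.331, 116), (3, 0.241, 212),
    (4, 0.297, 145), (5, 0.234, 152), (6, 0.070, 1080), (7, 0.332, 134),
    (8, 0.255, 204), (9, 1.000, 56),
]


def cluster_shares(rows):
    """Map each cluster id to the fraction of all documents it contains."""
    total = sum(size for _, _, size in rows)
    return {cid: size / total for cid, _, size in rows}


shares = cluster_shares(kmeans_rows)
largest = max(shares, key=shares.get)
print("largest cluster: {} ({:.1%} of documents)".format(largest, shares[largest]))
```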

4. Document Clustering (LSI + Birch Clustering)

print("\n4.a. Document clustering (LSI + Birch clustering)")

url = BASE_URL + '/clustering/birch/'
print(" POST", url)
t0 = time()
res = requests.post(url,
                    json={'parent_id': lsi_id,
                          'n_clusters': -1,
                          'min_similarity': 0.7,
                          'branching_factor': 20,
                          'max_tree_depth': 2,
                          }).json()

mid = res['id']
print("     => model id = {}".format(mid))

Out:

4.a. Document clustering (LSI + Birch clustering)
 POST http://localhost:5001/api/v0/clustering/birch/
     => model id = 679e56463de74cda

4.b. Computing cluster labels

url = BASE_URL + '/clustering/birch/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
                   json={'n_top_words': 3
                         }).json()
t1 = time()

print('    .. computed in {:.1f}s'.format(t1 - t0))
data = res['data']
for row in data:
    # Replace each cluster's document list with its length to keep the table compact
    row['n_documents'] = len(row.pop('documents'))

print(pd.DataFrame(data))

Out:

GET http://localhost:5001/api/v0/clustering/birch/679e56463de74cda
    .. computed in 2.0s
                                             children  cluster_depth  cluster_id                        cluster_label  cluster_similarity  cluster_size  n_documents
0   [1, 2, 12, 24, 25, 28, 41, 42, 43, 47, 48, 58,...              0           0                      normal test ect               0.074          2465         2465
1                                                  []              1           1                    deal aquilla muni               0.684            17           17
2                       [3, 4, 5, 6, 7, 8, 9, 10, 11]              1           2       test shackleton administrative               0.187           258          258
3                                                  []              2           3                    rate group public               0.255            36           36
4                                                  []              2           4               calo shackleton dinner               0.673            16           16
5                                                  []              2           5               shackleton load normal               0.396            47           47
6                                                  []              2           6             deseret etringer counsel               0.537            22           22
7                                                  []              2           7             sample financial trading               0.395            18           18
8                                                  []              2           8                      jones tana load               0.404            32           32
9                                                  []              2           9                     nemec doc gallup               0.458            16           16
10                                                 []              2          10           services commission energy               0.391            14           14
11                                                 []              2          11       test recipients administrative               0.446            57           57
12       [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]              1          12                  rewrite address ect               0.219           372          372
13                                                 []              2          13        issues cone enron_development               0.470            26           26
14                                                 []              2          14                       ect hou london               0.380            39           39
15                                                 []              2          15               rewrite address server               0.531           114          114
16                                                 []              2          16          attorney confirmation south               0.446            11           11
17                                                 []              2          17                 ena transaction kean               0.394            43           43
18                                                 []              2          18                      memo week clair               0.337            24           24
19                                                 []              2          19                   tiger bailey sekse               0.487            14           14
20                                                 []              2          20                  test bruno gaillard               0.563            15           15
21                                                 []              2          21                    normal test teneo               0.445            46           46
22                                                 []              2          22                      meet load teneo               0.535            13           13
23                                                 []              2          23                      load teneo test               0.494            27           27
24                                                 []              1          24             amoco jefferson sorenson               0.556            15           15
25                                           [26, 27]              1          25                   sanders normal nov               0.481            63           63
26                                                 []              2          26                   sanders normal nov               0.493            54           54
27                                                 []              2          27               sanders fri conference               0.578             9            9
28   [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]              1          28                   normal subject wed               0.120           481          481
29                                                 []              2          29                    south simons kean               0.296            41           41
..                                                ...            ...         ...                                  ...                 ...           ...          ...
57                                                 []              2          57                        wed nov tenet               0.390            38           38
58                                           [59, 60]              1          58                        tue tenet thu               0.290           107          107
59                                                 []              2          59                     thu tenet normal               0.325            68           68
60                                                 []              2          60                        tue oct tenet               0.471            39           39
61  [62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 7...              1          61       enron_development teneo normal               0.102           687          687
62                                                 []              2          62            products weather internet               0.387            22           22
63                                                 []              2          63  enron_development advice catalytica               0.337            77           77
64                                                 []              2          64                      teneo test date               0.352            47           47
65                                                 []              2          65               ruppert operator exxon               0.355            52           52
66                                                 []              2          66        enron_development gdr hagerty               0.394            32           32
67                                                 []              2          67           trade inflation investment               0.414            18           18
68                                                 []              2          68                   bump doc agreement               0.356            45           45
69                                                 []              2          69                 boyd hunter language               0.366            27           27
70                                                 []              2          70         sampling americancentury rcr               0.320            36           36
71                                                 []              2          71                 houston dickson calo               0.352            29           29
72                                                 []              2          72            attorney fri transactions               0.344            21           21
73                                                 []              2          73                     neuner wed teneo               0.293            32           32
74                                                 []              2          74                  haedicke normal nov               0.413            42           42
75                                                 []              2          75            teneo taylor_mark bockius               0.334            32           32
76                                                 []              2          76              registration legal cftc               0.364            14           14
77                                                 []              2          77                        fri tenet mtg               0.374            26           26
78                                                 []              2          78                     lunch otc energy               0.308            48           48
79                                                 []              2          79             market teneo electricity               0.359            20           20
80                                                 []              2          80                houston shipper nemec               0.444            21           21
81                                                 []              2          81               migration outlook team               0.459            46           46
82                                   [83, 84, 85, 86]              1          82          shackleton_sara teneo group               0.230            82           82
83                                                 []              2          83                        doc isda cini               0.397            33           33
84                                                 []              2          84                  account stay motion               0.395            15           15
85                                                 []              2          85          laryngitis diligence review               0.434            10           10
86                                                 []              2          86          shackleton_sara teneo tiger               0.360            24           24

[87 rows x 7 columns]
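The `children` column encodes the Birch hierarchy: rows with an empty list are leaf clusters, and the other rows list the ids of their sub-clusters. The sketch below, using a few rows copied from the output above, selects the leaves and checks that each parent's size equals the sum of its children's sizes. The function names are introduced here for illustration, and the size check is only known to hold for the rows shown.

```python
# A few rows copied from the output above, reduced to the relevant fields
birch_rows = [
    {'cluster_id': 24, 'children': [], 'cluster_size': 15},
    {'cluster_id': 25, 'children': [26, 27], 'cluster_size': 63},
    {'cluster_id': 26, 'children': [], 'cluster_size': 54},
    {'cluster_id': 27, 'children': [], 'cluster_size': 9},
]


def leaf_clusters(rows):
    """Return the rows that have no sub-clusters."""
    return [row for row in rows if not row['children']]


def mismatched_parents(rows):
    """Return ids of parents whose size differs from the sum of their children."""
    by_id = {row['cluster_id']: row for row in rows}
    mismatched = []
    for row in rows:
        if row['children']:
            child_total = sum(by_id[cid]['cluster_size']
                              for cid in row['children'])
            if child_total != row['cluster_size']:
                mismatched.append(row['cluster_id'])
    return mismatched


print([row['cluster_id'] for row in leaf_clusters(birch_rows)])
print(mismatched_parents(birch_rows))
```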

5. Optimal sampling (LSI + Birch Clustering)

t0 = time()
url = BASE_URL + '/clustering/birch/{}'.format(mid)
print(" GET", url)
res = requests.get(url,
                   json={'return_optimal_sampling': True,
                         'sampling_min_coverage': 0.9
                         }).json()
t1 = time()
print('    .. computed in {:.1f}s'.format(t1 - t0))
data = res['data']

print(pd.DataFrame(data))

Out:

GET http://localhost:5001/api/v0/clustering/birch/679e56463de74cda
    .. computed in 0.1s
    cluster_id  cluster_similarity  cluster_size                                          documents
0           15               0.531           114  [{'document_id': 5041, 'similarity': 0.9245882...
1           63               0.337            77  [{'document_id': 33124, 'similarity': 0.594356...
2           39               0.309            76  [{'document_id': 2735716, 'similarity': 0.5962...
3           59               0.325            68  [{'document_id': 3587236, 'similarity': 0.6634...
4           37               0.325            60  [{'document_id': 2402500, 'similarity': 0.6318...
5           11               0.446            57  [{'document_id': 855625, 'similarity': 0.88858...
6           26               0.493            54  [{'document_id': 3802500, 'similarity': 0.5989...
7           55               0.345            54  [{'document_id': 3171961, 'similarity': 0.6304...
8           65               0.355            52  [{'document_id': 1664100, 'similarity': 0.6798...
9           31               0.337            50  [{'document_id': 714025, 'similarity': 0.62037...
10          33               0.336            50  [{'document_id': 4990756, 'similarity': 0.5743...
11          78               0.308            48  [{'document_id': 1119364, 'similarity': 0.5790...
12           5               0.396            47  [{'document_id': 61504, 'similarity': 0.640083...
13          64               0.352            47  [{'document_id': 2149156, 'similarity': 0.7265...
14          21               0.445            46  [{'document_id': 358801, 'similarity': 0.66319...
15          81               0.459            46  [{'document_id': 5527201, 'similarity': 0.6800...
16          68               0.356            45  [{'document_id': 300304, 'similarity': 0.64720...
17          17               0.394            43  [{'document_id': 79524, 'similarity': 0.603813...
18          74               0.413            42  [{'document_id': 2992900, 'similarity': 0.6323...
19          29               0.296            41  [{'document_id': 2809, 'similarity': 0.4766740...
20          14               0.380            39  [{'document_id': 4422609, 'similarity': 0.7943...
21          60               0.471            39  [{'document_id': 1893376, 'similarity': 0.7655...
22          30               0.366            38  [{'document_id': 627264, 'similarity': 0.73297...
23          57               0.390            38  [{'document_id': 2334784, 'similarity': 0.7788...
24           3               0.255            36  [{'document_id': 1168561, 'similarity': 0.3576...
25          36               0.311            36  [{'document_id': 1232100, 'similarity': 0.4595...
26          70               0.320            36  [{'document_id': 910116, 'similarity': 0.51148...
27          51               0.481            34  [{'document_id': 6041764, 'similarity': 0.7041...
28          32               0.351            33  [{'document_id': 1100401, 'similarity': 0.6933...
29          83               0.397            33  [{'document_id': 4765489, 'similarity': 0.5323...
30           8               0.404            32  [{'document_id': 215296, 'similarity': 0.58934...
31          66               0.394            32  [{'document_id': 91809, 'similarity': 0.806330...
32          73               0.293            32  [{'document_id': 3663396, 'similarity': 0.4763...
33          75               0.334            32  [{'document_id': 3378244, 'similarity': 0.6113...
34          38               0.408            30  [{'document_id': 3359889, 'similarity': 0.6664...
35          44               0.381            30  [{'document_id': 504100, 'similarity': 0.57251...
36          35               0.401            29  [{'document_id': 3164841, 'similarity': 0.5582...
37          71               0.352            29  [{'document_id': 3279721, 'similarity': 0.5513...
38          45               0.348            28  [{'document_id': 1962801, 'similarity': 0.6525...
39          54               0.368            28  [{'document_id': 2468041, 'similarity': 0.5644...
40          23               0.494            27  [{'document_id': 835396, 'similarity': 0.65802...
41          46               0.443            27  [{'document_id': 2076481, 'similarity': 0.8140...
42          69               0.366            27  [{'document_id': 1032256, 'similarity': 0.5915...
43          13               0.470            26  [{'document_id': 289, 'similarity': 0.85353300...
44          77               0.374            26  [{'document_id': 5788836, 'similarity': 0.6740...
45          18               0.337            24  [{'document_id': 16900, 'similarity': 0.457269...
46          50               0.372            24  [{'document_id': 4494400, 'similarity': 0.6840...
47          86               0.360            24  [{'document_id': 4622500, 'similarity': 0.4755...
48          34               0.368            23  [{'document_id': 874225, 'similarity': 0.48114...
49          52               0.384            23  [{'document_id': 2524921, 'similarity': 0.6179...
50           6               0.537            22  [{'document_id': 17689, 'similarity': 0.804906...
51          62               0.387            22  [{'document_id': 1129969, 'similarity': 0.6836...
52          47               0.498            21  [{'document_id': 3118756, 'similarity': 0.7570...
53          56               0.489            21  [{'document_id': 2307361, 'similarity': 0.7566...
54          72               0.344            21  [{'document_id': 4652649, 'similarity': 0.5300...
55          80               0.444            21  [{'document_id': 652864, 'similarity': 0.69850...
56          79               0.359            20  [{'document_id': 142884, 'similarity': 0.59510...
57           7               0.395            18  [{'document_id': 125316, 'similarity': 0.60864...
58          53               0.448            18  [{'document_id': 2669956, 'similarity': 0.7132...
59          67               0.414            18  [{'document_id': 44100, 'similarity': 0.658200...
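Each row of the sampling response holds a list of `{'document_id', 'similarity'}` entries. To review the sampled documents one by one, the nested lists could be flattened into one record per document. This is a sketch on invented data: the ids and similarities below are hypothetical and are not taken from the output above, and `flatten_samples` is a name introduced for illustration.

```python
# Hypothetical rows with the same shape as the response above;
# all ids and similarity values are invented for illustration
sampling_rows = [
    {'cluster_id': 1, 'cluster_size': 3,
     'documents': [{'document_id': 10, 'similarity': 0.91},
                   {'document_id': 11, 'similarity': 0.62}]},
    {'cluster_id': 2, 'cluster_size': 2,
     'documents': [{'document_id': 20, 'similarity': 0.75}]},
]


def flatten_samples(rows):
    """Return one (cluster_id, document_id, similarity) tuple per sampled document."""
    flat = []
    for row in rows:
        for doc in row['documents']:
            flat.append((row['cluster_id'], doc['document_id'], doc['similarity']))
    return flat


for cluster_id, document_id, similarity in flatten_samples(sampling_rows):
    print(cluster_id, document_id, similarity)
```

The flat list keeps the order of the response, so documents from the same cluster stay together.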

6. Delete the extracted features

url = BASE_URL + '/feature-extraction/{}'.format(dsid)
requests.delete(url)

Total running time of the script: ( 0 minutes 9.746 seconds)
