Semantic Search¶
An example of Semantic Search
from __future__ import print_function
import os.path
import requests
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
pd.options.display.expand_frame_repr = False
dataset_name = "treclegal09_2k_subset" # see list of available datasets
BASE_URL = "http://localhost:5001/api/v0" # FreeDiscovery server URL
0. Load the test dataset¶
url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
input_ds = requests.get(url).json()
# create a custom dataset definition for ingestion
data_dir = input_ds['metadata']['data_dir']
dataset_definition = [{'document_id': row['document_id'],
'file_path': os.path.join(data_dir, row['file_path'])}
for row in input_ds['dataset']]
Out:
GET http://localhost:5001/api/v0/example-dataset/treclegal09_2k_subset
1. Feature extraction¶
1.a Load dataset and initalize feature extraction
url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url).json()
dsid = res['id']
print(" => received {}".format(list(res.keys())))
print(" => dsid = {}".format(dsid))
Out:
POST http://localhost:5001/api/v0/feature-extraction
=> received ['id']
=> dsid = febef059c97f4f26
1.b Start feature extraction
url = BASE_URL+'/feature-extraction/{}'.format(dsid)
print(" POST", url)
requests.post(url, json={'dataset_definition': dataset_definition})
Out:
POST http://localhost:5001/api/v0/feature-extraction/febef059c97f4f26
2. Calculate LSI¶
(used for Nearest Neighbors method)
url = BASE_URL + '/lsi/'
print("POST", url)
n_components = 100
res = requests.post(url,
json={'n_components': n_components,
'parent_id': dsid
}).json()
lsi_id = res['id']
print(' => LSI model id = {}'.format(lsi_id))
print((" => SVD decomposition with {} dimensions explaining "
"{:.2f} % variabilty of the data")
.format(n_components, res['explained_variance']*100))
Out:
POST http://localhost:5001/api/v0/lsi/
=> LSI model id = 25172edac276482d
=> SVD decomposition with 100 dimensions explaining 69.79 % variabilty of the data
3. Semantic search¶
print("\n3.a. Perform the semantic search")
query = ("There are some conflicts with the draft date, so we will probably "
"need to have it on a different date.")
url = BASE_URL + '/search/'
print(" POST", url)
res = requests.post(url,
json={'parent_id': lsi_id,
'query': query
}).json()
data = res['data']
df = pd.DataFrame(data).set_index('document_id')
print(df)
print(df.score.max())
Out:
3.a. Perform the semantic search
POST http://localhost:5001/api/v0/search/
score
document_id
5267025 0.585
5262436 0.583
5148361 0.516
115600 0.516
116964 0.515
5157441 0.515
116281 0.463
5152900 0.463
5067001 0.348
97969 0.348
73984 0.338
54756 0.338
4813636 0.338
4950625 0.338
75625 0.338
53361 0.333
4800481 0.333
2500 0.319
4251844 0.319
6724 0.309
6889 0.300
4363921 0.293
3055504 0.292
43264 0.288
3728761 0.287
1265625 0.287
1004004 0.283
3189796 0.275
3625216 0.273
1210000 0.273
... ...
280900 -0.172
1600 -0.175
872356 -0.176
1263376 -0.181
1750329 -0.182
1838736 -0.183
1062961 -0.184
2268036 -0.186
249001 -0.187
605284 -0.188
3511876 -0.188
956484 -0.191
263169 -0.193
244036 -0.194
3671056 -0.198
3701776 -0.198
751689 -0.198
1656369 -0.198
3268864 -0.199
1766241 -0.208
3694084 -0.234
3125824 -0.237
3139984 -0.248
3118756 -0.250
3132900 -0.251
193600 -0.254
3122289 -0.255
3115225 -0.256
3136441 -0.258
3129361 -0.258
[2465 rows x 1 columns]
0.584964279257773
Delete the extracted features
url = BASE_URL + '/feature-extraction/{}'.format(dsid)
requests.delete(url)
Total running time of the script: ( 0 minutes 3.432 seconds)