FreeDiscovery
1.4.dev0

Semantic Search

An example of semantic search with the FreeDiscovery REST API. It assumes a FreeDiscovery server is running at http://localhost:5001 (see the Quick start section).

from __future__ import print_function

import os.path
import requests
import pandas as pd

pd.options.display.float_format = '{:,.3f}'.format
pd.options.display.expand_frame_repr = False

dataset_name = "treclegal09_2k_subset"     # see list of available datasets

BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL

0. Load the test dataset

url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
input_ds = requests.get(url).json()

# create a custom dataset definition for ingestion
data_dir = input_ds['metadata']['data_dir']
dataset_definition = [{'document_id': row['document_id'],
                       'file_path': os.path.join(data_dir, row['file_path'])}
                      for row in input_ds['dataset']]

Out:

GET http://localhost:5001/api/v0/example-dataset/treclegal09_2k_subset
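
For reference, each entry of dataset_definition pairs a caller-supplied document_id with the absolute path of the corresponding file on disk. A quick way to inspect the result (illustrative only; the exact paths depend on where the example dataset was downloaded):

# Illustrative only: print the first few entries of the custom dataset
# definition built above; each maps a document_id to a file_path on disk.
for row in dataset_definition[:3]:
    print(row)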

1. Feature extraction

1.a Load the dataset and initialize feature extraction

url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url).json()

dsid = res['id']
print("   => received {}".format(list(res.keys())))
print("   => dsid = {}".format(dsid))

Out:

POST http://localhost:5001/api/v0/feature-extraction
   => received ['id']
   => dsid = febef059c97f4f26

1.b Start feature extraction

url = BASE_URL + '/feature-extraction/{}'.format(dsid)
print(" POST", url)
requests.post(url, json={'dataset_definition': dataset_definition})

Out:

POST http://localhost:5001/api/v0/feature-extraction/febef059c97f4f26
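
The POST above triggers ingestion and vectorization of the documents listed in dataset_definition. As a sanity check one can query the same endpoint with a GET request; this is a sketch under the assumption that the endpoint also answers GET (it is not used in the original example, and the returned fields may differ between FreeDiscovery versions):

# Assumption: a GET on the feature-extraction endpoint returns the
# extraction parameters for this dsid (field names may vary by version).
url = BASE_URL + '/feature-extraction/{}'.format(dsid)
print(" GET", url)
print(requests.get(url).json())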

2. Calculate LSI

(the LSI representation is used for the Nearest Neighbors method and for semantic search)

url = BASE_URL + '/lsi/'
print("POST", url)
n_components = 100
res = requests.post(url,
                    json={'n_components': n_components,
                          'parent_id': dsid
                          }).json()

lsi_id = res['id']
print('  => LSI model id = {}'.format(lsi_id))
print(("  => SVD decomposition with {} dimensions explaining "
       "{:.2f} % variabilty     of the data")
      .format(n_components, res['explained_variance']*100))

Out:

POST http://localhost:5001/api/v0/lsi/
  => LSI model id = 25172edac276482d
  => SVD decomposition with 100 dimensions explaining 69.79 % variability of the data
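
Conceptually, LSI here is a truncated SVD of the TF-IDF document-term matrix: documents are projected onto n_components latent dimensions, and explained_variance reports how much of the original variance those dimensions retain. A minimal standalone sketch of the idea with scikit-learn (an illustration only, not FreeDiscovery's internal code):

# Standalone illustration of LSI (not FreeDiscovery internals): a truncated
# SVD of a TF-IDF matrix projects documents into a low-dimensional space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["conflicts with the draft date",
        "the meeting was moved to a different date",
        "quarterly financial report attached",
        "please review the attached financial statement"]
X = TfidfVectorizer().fit_transform(docs)      # shape: (n_docs, n_terms)
svd = TruncatedSVD(n_components=2)
X_lsi = svd.fit_transform(X)                   # shape: (n_docs, 2)
print(svd.explained_variance_ratio_.sum())     # analogue of 'explained_variance'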

3. Semantic search

print("\n3.a. Perform the semantic search")


query = ("There are some conflicts with the draft date, so we will probably "
         "need to have it on a different date.")

url = BASE_URL + '/search/'
print(" POST", url)

res = requests.post(url,
                    json={'parent_id': lsi_id,
                          'query': query
                          }).json()

data = res['data']

# scores for all documents in the collection, indexed by document_id
df = pd.DataFrame(data).set_index('document_id')
print(df)

# score of the best match
print(df.score.max())

Out:

3.a. Perform the semantic search
 POST http://localhost:5001/api/v0/search/
             score
document_id
5267025      0.585
5262436      0.583
5148361      0.516
115600       0.516
116964       0.515
5157441      0.515
116281       0.463
5152900      0.463
5067001      0.348
97969        0.348
73984        0.338
54756        0.338
4813636      0.338
4950625      0.338
75625        0.338
53361        0.333
4800481      0.333
2500         0.319
4251844      0.319
6724         0.309
6889         0.300
4363921      0.293
3055504      0.292
43264        0.288
3728761      0.287
1265625      0.287
1004004      0.283
3189796      0.275
3625216      0.273
1210000      0.273
...            ...
280900      -0.172
1600        -0.175
872356      -0.176
1263376     -0.181
1750329     -0.182
1838736     -0.183
1062961     -0.184
2268036     -0.186
249001      -0.187
605284      -0.188
3511876     -0.188
956484      -0.191
263169      -0.193
244036      -0.194
3671056     -0.198
3701776     -0.198
751689      -0.198
1656369     -0.198
3268864     -0.199
1766241     -0.208
3694084     -0.234
3125824     -0.237
3139984     -0.248
3118756     -0.250
3132900     -0.251
193600      -0.254
3122289     -0.255
3115225     -0.256
3136441     -0.258
3129361     -0.258

[2465 rows x 1 columns]
0.584964279257773
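
The search endpoint returns a score for every document in the collection; higher scores indicate closer semantic matches to the query. In practice one usually keeps only the top-ranked hits, which can be done client-side with the DataFrame built above (an illustrative follow-up, not part of the original example):

# Illustrative follow-up: keep only the five highest-scoring documents.
top_hits = df.sort_values('score', ascending=False).head(5)
print(top_hits)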

Delete the extracted features

url = BASE_URL + '/feature-extraction/{}'.format(dsid)
requests.delete(url)

Total running time of the script: ( 0 minutes 3.432 seconds)

Download Python source code: semantic_search.py
Download Jupyter notebook: semantic_search.ipynb

Generated by Sphinx-Gallery


© Copyright 2016 - 2017, Grossman Labs LLC. Last updated on Jan 31, 2019.
