FreeDiscovery

Version: v0

Schemes:

Summary

Path Operations
/api/v0/categorization/ GET, POST
/api/v0/categorization/{mid} DELETE, GET
/api/v0/categorization/{mid}/predict GET
/api/v0/clustering/birch POST
/api/v0/clustering/dbscan POST
/api/v0/clustering/k-mean/ POST
/api/v0/clustering/{method}/{mid} DELETE, GET
/api/v0/duplicate-detection/ POST
/api/v0/duplicate-detection/{mid} DELETE, GET
/api/v0/email-threading/{mid} DELETE, GET
/api/v0/example-dataset/{name} GET
/api/v0/feature-extraction GET, POST
/api/v0/feature-extraction/{dsid} DELETE, GET, POST
/api/v0/feature-extraction/{dsid}/append POST
/api/v0/feature-extraction/{dsid}/delete POST
/api/v0/feature-extraction/{dsid}/id-mapping POST
/api/v0/lsi/ GET, POST
/api/v0/lsi/{mid} DELETE, GET
/api/v0/metrics/categorization POST
/api/v0/metrics/clustering POST
/api/v0/metrics/duplicate-detection POST
/api/v0/search/ POST
/api/v0/stop-words/ POST
/api/v0/stop-words/{name} DELETE, GET

Paths

GET /api/v0/categorization/

List existing categorization models

default
object
method: string
options: string

POST /api/v0/categorization/

Build the categorization ML model

The option use_hashing=True must be set for the feature extraction. Recommended options also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • data: a list of dicts, each with a category field and one or several fields that can be used for indexing, such as document_id and optionally rendition_id.
  • method: classification algorithm to use (default: LogisticRegression)
  • cv: binary, if true optimal parameters of the ML model are determined by cross-validation over 5 stratified K-folds (default False).
  • training_scores: binary, compute the efficiency scores on the training dataset. This would make computations much slower for NearestNeighbors (default False).

cv: boolean
data: object[]
object
category: string
document_id: integer (int32)
render_id: integer (int32)
method: string
parent_id: string
training_scores: boolean
default
id: string
training_scores: object
average_precision: number
f1: number
precision: number
recall: number
recall_at_20p: number
roc_auc: number
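
As a minimal sketch of how a training request could be assembled from the parameters above: the code below only prepares the request and does not send it. The server address, the dataset id and the use of a JSON body are assumptions made for illustration, not something this reference states.

```python
import json
from urllib.request import Request

# Assumption: hypothetical server address; adjust to your deployment.
BASE_URL = "http://localhost:5001/api/v0"


def build_categorization_request(parent_id, labelled_docs,
                                 method="LogisticRegression", cv=False):
    """Prepare (but do not send) a POST request that trains a model."""
    payload = {
        "parent_id": parent_id,
        "data": labelled_docs,
        "method": method,
        "cv": cv,
        "training_scores": False,
    }
    return Request(
        BASE_URL + "/categorization/",
        data=json.dumps(payload).encode("utf-8"),  # assumption: JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_categorization_request(
    "my-dataset",  # hypothetical dataset_id
    [
        {"document_id": 1, "category": "relevant"},
        {"document_id": 2, "category": "irrelevant"},
    ],
)
```

Passing req to urllib.request.urlopen would send it. According to the response schema above, the reply should carry the model id.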

DELETE /api/v0/categorization/{mid}

Delete the categorization model

mid path string
default

GET /api/v0/categorization/{mid}

Load categorization model parameters

mid path string
default
method: string
options: string

GET /api/v0/categorization/{mid}/predict

Predict document categorization with a previously trained model

Parameters

  • max_result_categories : the maximum number of categories in the results
  • sort_by : if provided and not None, the field used for sorting results. Valid values are [None, 'score'] or any of the ingested category names.
  • sort_order: the sort order (if applicable), one of ['ascending', 'descending']
  • max_results : return only the first max_results documents. If max_results <= 0 all documents are returned.
  • ml_output : type of the output in ['decision_function', 'probability'], only affects ML methods.
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].
  • min_score : filter out results below a similarity threshold
  • subset: apply prediction to a document subset. Must be one of ['all', 'train', 'test']. Default: 'test'.
  • subset_document_id: apply prediction to a subset of document_id.
  • batch_id: retrieve a given subset of scores (-1 to retrieve all). Default: 0
  • batch_size: the number of document scores retrieved per batch. Default: 10000

batch_id: integer (int32)
batch_size: integer (int32) 10000
max_result_categories: integer (int32) 1
max_results: integer (int32)
metric: string cosine
min_score: number -1
ml_output: string probability
sort_by: string score
sort_order: string descending
subset: string test
subset_document_id: integer[]
integer (int32)
mid path string
default
data: object[]
object
document_id: integer (int32)
render_id: integer (int32)
scores: object[]
object
category: string
document_id: integer (int32)
render_id: integer (int32)
score: number
pagination: object
batch_id: integer (int32)
batch_id_last: integer (int32)
current_response_count: integer (int32)
total_response_count: integer (int32)
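
The pagination object suggests that large result sets are fetched batch by batch. The sketch below works out which batches remain after a response. It assumes that batch ids are consecutive integers and that batch_id_last is the id of the final batch; neither is spelled out in this reference.

```python
def remaining_batch_ids(pagination):
    """Return the batch ids still to fetch after the current response.

    Assumption: batch ids are consecutive integers ending at batch_id_last.
    """
    return list(range(pagination["batch_id"] + 1,
                      pagination["batch_id_last"] + 1))


# Made-up pagination object using the field names listed above.
first_page = {
    "batch_id": 0,
    "batch_id_last": 2,
    "current_response_count": 10000,
    "total_response_count": 25000,
}
print(remaining_batch_ids(first_page))  # [1, 2]
```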

POST /api/v0/clustering/birch

Compute birch clustering

The option use_hashing=False must be set for the feature extraction. Recommended options for data ingestion also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • n_clusters: the number of clusters or -1 to use hierarchical clustering (default: -1)
  • min_similarity: The radius of the subcluster obtained by merging a new sample and the closest subcluster should be less than the threshold. Otherwise a new subcluster is started. See sklearn.cluster.Birch. Increasing this value would increase the hierarchical tree depth (and the number of clusters).
  • branching_factor: Maximum number of CF subclusters in each node. If a new sample enters such that the number of subclusters exceeds the branching_factor, then the node has to be split. The corresponding parent also has to be split, and if the number of subclusters in the parent is greater than the branching factor, then it has to be split recursively. Decreasing this value would increase the number of clusters.
  • max_tree_depth : Maximum hierarchy depth (only applicable when n_clusters=-1)
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].

branching_factor: integer (int32) 20
max_tree_depth: integer (int32)
metric: string cosine
min_similarity: number 0.5
n_clusters: integer (int32) -1
parent_id: string
default
id: string
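
To make the defaults above concrete, here is a small sketch that builds a request body for Birch clustering. The helper name and the dataset id are hypothetical, and sending the options as JSON is an assumption.

```python
import json

# Defaults copied from the schema listed above.
BIRCH_DEFAULTS = {
    "n_clusters": -1,
    "min_similarity": 0.5,
    "branching_factor": 20,
    "metric": "cosine",
}


def birch_payload(parent_id, **overrides):
    """Return a body for POST /api/v0/clustering/birch (hypothetical helper)."""
    unknown = set(overrides) - set(BIRCH_DEFAULTS) - {"max_tree_depth"}
    if unknown:
        raise ValueError("unknown Birch options: {}".format(sorted(unknown)))
    payload = dict(BIRCH_DEFAULTS, parent_id=parent_id)
    payload.update(overrides)
    return payload


# "my-dataset" is a hypothetical dataset_id.
body = json.dumps(birch_payload("my-dataset", n_clusters=30))
```

Rejecting unknown option names locally catches typos such as branching_facter before a request is made.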

POST /api/v0/clustering/dbscan

Compute clustering (DBSCAN)

The option use_hashing=False must be set for the feature extraction. Recommended options for the data ingestion also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • min_similarity: The radius of the subcluster obtained by merging a new sample and the closest subcluster should be less than the threshold. Otherwise a new subcluster is started. See sklearn.cluster.Birch
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].
  • min_samples: (optional) int The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

metric: string cosine
min_samples: integer (int32) 10
min_similarity: number 0.5
parent_id: string
default
id: string

POST /api/v0/clustering/k-mean/

Compute K-mean clustering

The option use_hashing=False must be set for the feature extraction. Recommended options for feature extraction include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • n_clusters: the number of clusters
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].

metric: string cosine
n_clusters: integer (int32) 150
parent_id: string
default
id: string

DELETE /api/v0/clustering/{method}/{mid}

Delete a clustering model

mid path string
method path string
default

GET /api/v0/clustering/{method}/{mid}

Compute cluster labels

Parameters

  • n_top_words: keep only most relevant n_top_words words
  • return_optimal_sampling : Instead of cluster results, the optimal sampling results will be returned (with no cluster labels). This option is only valid with Birch algorithm. Note that optimal sampling cannot return more samples than the subclusters in the birch clustering results (default: false)
  • sampling_min_similarity : Similarity threshold used by smart sampling. Decreasing this value would result in more sampled documents. Default: 1.0 (i.e. use the full cluster hierarchy).
  • sampling_min_coverage : Minimal coverage requirement in [0, 1] range. Increasing this value would result in a larger number of samples. (default: 0.9)

n_top_words: integer (int32) 5
return_optimal_sampling: boolean
sampling_min_coverage: number 0.9
sampling_min_similarity: number 1
mid path string
method path string
default
data: object[]
object
children: integer[]
integer (int32)
cluster_depth: integer (int32)
cluster_id: integer (int32)
cluster_label: string
cluster_similarity: number
cluster_size: integer (int32)
documents: object[]
object
document_id: integer (int32)
render_id: integer (int32)
similarity: number
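
A clustering response nests documents inside clusters. The sketch below flattens such a response into a mapping from cluster_id to document ids. The sample response is made up and only reuses the field names listed above.

```python
def documents_by_cluster(response):
    """Map each cluster_id to the document_id values of its documents."""
    return {
        cluster["cluster_id"]: [doc["document_id"]
                                for doc in cluster["documents"]]
        for cluster in response["data"]
    }


# Made-up response that uses the field names listed above.
sample_response = {
    "data": [
        {"cluster_id": 0, "cluster_size": 2,
         "documents": [{"document_id": 4, "similarity": 0.91},
                       {"document_id": 9, "similarity": 0.85}]},
        {"cluster_id": 1, "cluster_size": 1,
         "documents": [{"document_id": 5, "similarity": 1.0}]},
    ]
}
print(documents_by_cluster(sample_response))  # {0: [4, 9], 1: [5]}
```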

POST /api/v0/duplicate-detection/

Compute near duplicates

Parameters

  • parent_id: dataset_id or lsi_id
  • method: str, default='simhash' Method used for duplicate detection. One of "simhash", "i-match"

method: string simhash
parent_id: string
default
id: string

DELETE /api/v0/duplicate-detection/{mid}

mid path string
default

GET /api/v0/duplicate-detection/{mid}

Query duplicates

Parameters

  • distance : int, default=2 Maximum number of different bits in the simhash (Simhash method only)
  • n_rand_lexicons : int, default=1 number of random lexicons used for duplicate detection (I-Match method only)
  • rand_lexicon_ratio : float, default=0.7 ratio of the vocabulary used in random lexicons (I-Match method only)
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].

distance: integer (int32)
metric: string cosine
n_rand_lexicons: integer (int32)
rand_lexicon_ratio: number
mid path string
default
data: object[]
object
children: integer[]
integer (int32)
cluster_depth: integer (int32)
cluster_id: integer (int32)
cluster_label: string
cluster_similarity: number
cluster_size: integer (int32)
documents: object[]
object
document_id: integer (int32)
render_id: integer (int32)
similarity: number

DELETE /api/v0/email-threading/{mid}

Delete a processed dataset

mid path string
default

GET /api/v0/email-threading/{mid}

Get email threading parameters

mid path string
default
group_by_subject: boolean

GET /api/v0/example-dataset/{name}

Download a benchmark dataset.

The currently supported datasets are listed below:

1. TREC 2009 legal collection

- `treclegal09_2k_subset` : 2 400 documents, 2 MB
- `treclegal09_20k_subset` : 20 000 documents, 30 MB
- `treclegal09_37k_subset` : 37 000 documents, 55 MB
- `treclegal09` : 700 000 documents, 1.2 GB

The ground truth files for categorization are adapted from TAR Toolkit.

2. Fedora mailing list (2009-2009)
- `fedora_ml_3k_subset`

3. The 20 newsgroups dataset
- `20_newsgroups_3categories`: only the ['comp.graphics',
'rec.sport.baseball', 'sci.space'] categories

If you encounter any issues for downloads with this function,
you can also manually download and extract the required dataset to
``cache_dir`` (the download url is ``http://r0h.eu/d/<name>.tar.gz``),
then re-run this function to get the required metadata.

n_categories: integer (int32) 2
name path string
default
dataset: object[]
object
category: string
document_id: integer (int32)
file_path: string
internal_id: integer (int32)
render_id: integer (int32)
metadata: object
data_dir: string
name: string
training_set: object[]
object
category: string
document_id: integer (int32)
file_path: string
internal_id: integer (int32)
render_id: integer (int32)
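
Two small helpers illustrate how this endpoint might be used in practice: one builds the manual download URL described above, the other reduces training_set entries to the fields that model training reads. Both helper names are hypothetical, and the sample response is made up.

```python
def manual_download_url(name):
    """URL of the archive to extract into cache_dir by hand."""
    return "http://r0h.eu/d/{}.tar.gz".format(name)


def categorization_data(example_response):
    """Keep only the fields used as `data` when training a model."""
    return [
        {"document_id": row["document_id"], "category": row["category"]}
        for row in example_response["training_set"]
    ]


# Made-up response fragment using the field names listed above.
response = {
    "training_set": [
        {"document_id": 7, "category": "comp.graphics",
         "file_path": "a.txt", "internal_id": 0},
    ]
}
print(manual_download_url("treclegal09_2k_subset"))
print(categorization_data(response))
```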

GET /api/v0/feature-extraction

View parameters used for the feature extraction

default
object
analyzer: string word
chunk_size: integer (int32)
column_ids: integer[]
integer (int32)
column_separator: string ,
data_dir: string
filenames: string[]
string
id: string
max_df: number
min_df: number
n_features: integer (int32) 100001
n_jobs: integer (int32) 1
n_samples: integer (int32)
n_samples_processed: integer (int32)
ngram_range: integer[] 1,1
integer (int32)
norm_alpha: number 0.75
overwrite: boolean
parse_email_headers: boolean
preprocess: string[]
string
stop_words: string english
use_hashing: boolean
weighting: string nnc

POST /api/v0/feature-extraction

Initialize the feature extraction on a document collection.

Parameters

  • n_features: [optional] number of features (overlapping character/word n-grams that are hashed). n_features refers to the number of buckets in the hash. The larger the number, the fewer collisions. (default: 1100000)
  • analyzer: 'word', 'char', 'char_wb' Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries. (default: 'word')
  • ngram_range : tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

  • stop_words: "english" or "None" Remove stop words from the resulting tokens. Only applies to the "word" analyzer. If "english", a built-in stop word list for English is used. (default: "english")

  • n_jobs: The maximum number of concurrently running jobs (default: 1)
  • chunk_size: The number of documents simultaneously processed by a running job (default: 5000)
  • weighting: the SMART notation for document term weighting and normalization. In the form [nlabL][ntp][ncb] , see https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
  • norm_alpha: the alpha value used for pivoted normalization

  • use_hashing: Enable hashing. This option must be set to True for classification and set to False for clustering. (default: True)

  • min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
  • max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
  • parse_email_headers: when documents are emails, attempt to parse the information contained in the header (default: False)
  • preprocess: a list of pre-processing steps to apply before vectorization. A subset of ['emails_ignore_header'], default: [].
  • id: (optional) custom dataset id. Can only contain letters, numbers, "_" or "-". It must also be between 2 and 50 characters long.
  • overwrite: if a custom dataset id was provided, and it already exists, overwrite it. Default: false
  • column_ids : list of ints. If provided, the input dataset is expected to be CSV, and the columns with the provided ids are selected. In this case documents can only be provided using the dataset_definition parameter, which must contain a single file path.
  • column_separator: str, character used to delimit columns. Only used if column_ids is provided. Default: ','

analyzer: string word
chunk_size: integer (int32)
column_ids: integer[]
integer (int32)
column_separator: string ,
data_dir: string
id: string
max_df: number
min_df: number
n_features: integer (int32) 100001
n_jobs: integer (int32) 1
ngram_range: integer[] 1,1
integer (int32)
norm_alpha: number 0.75
overwrite: boolean
parse_email_headers: boolean
preprocess: string[]
string
stop_words: string english
use_hashing: boolean
weighting: string nnc
default
id: string
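
The constraint on custom ids (letters, numbers, "_" or "-", between 2 and 50 characters) can be checked before a request is made. The sketch below is a local approximation only: it assumes "letters" means ASCII letters, which this reference does not state.

```python
import re

# Assumption: "letters" means ASCII letters only.
_CUSTOM_ID = re.compile(r"[A-Za-z0-9_-]{2,50}")


def is_valid_custom_id(candidate):
    """Check a custom dataset or model id against the documented rule."""
    return _CUSTOM_ID.fullmatch(candidate) is not None


print(is_valid_custom_id("my-dataset_01"))  # True
print(is_valid_custom_id("a"))              # False, too short
print(is_valid_custom_id("bad id!"))        # False, space and "!" not allowed
```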

DELETE /api/v0/feature-extraction/{dsid}

Delete a processed dataset

dsid path string
default

GET /api/v0/feature-extraction/{dsid}

Load extracted features (and obtain the processing status)

dsid path string

POST /api/v0/feature-extraction/{dsid}

Run feature extraction on a dataset.

Parameters

  • data_dir: [optional] relative path to the directory with the input files. Either data_dir or dataset_definition must be provided.
  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'content': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where document_id and rendition_id are optional, while either file_path or content field must be provided.
  • vectorize: [optional] this option can be used to ingest the dataset_definition in batches (optionally with document content), then make one final call to vectorize all sent documents (bool, default: True)
  • document_id_generator: [optional] if the document_id is not provided, this specifies how it is generated. If indexed_file_path the document_id is given by the index of the sorted file_path, otherwise if infer_file_path the document_id is inferred from the file_path strings, removing all non digit characters. In this second case, the file_path must contain a unique numeric ID (default: indexed_file_path)

data_dir: string
dataset_definition: object[]
object
content: string
document_id: integer (int32)
file_path: string
rendition_id: integer (int32)
document_id_generator: string indexed_file_path
vectorize: boolean true
dsid path string
default
id: string
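
The two document_id_generator modes are easy to confuse, so here is a sketch of what each description above implies. It mimics the documented behavior and is not the server's code; zero-based indexing for indexed_file_path is an assumption.

```python
import re


def infer_document_id(file_path):
    """Mimic infer_file_path: keep only the digits of the path."""
    digits = re.sub(r"\D", "", file_path)
    if not digits:
        raise ValueError("no digits in {!r}".format(file_path))
    return int(digits)


def indexed_document_ids(file_paths):
    """Mimic indexed_file_path: position of each path in sorted order.

    Assumption: positions are counted from zero.
    """
    return {path: index for index, path in enumerate(sorted(file_paths))}


print(infer_document_id("mail/0001234.txt"))     # 1234
print(infer_document_id("batch2/doc_15.txt"))    # 215
print(indexed_document_ids(["b.txt", "a.txt"]))  # {'a.txt': 0, 'b.txt': 1}
```

The second call shows why the path must contain a unique numeric ID: digits in directory names are kept as well, so batch2/doc_15.txt yields 215 rather than 15.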

POST /api/v0/feature-extraction/{dsid}/append

Add new documents to an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.

This operation cannot be undone.

Warning: all categorization, clustering, duplicate detection and email threading models associated with this dataset will be removed and need to be re-trained.

Parameters

  • data_dir: [optional] relative path to the directory with the input files. Either data_dir or dataset_definition must be provided.
  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where rendition_id is optional, while either file_path or content field must be provided.

data_dir: string
dataset_definition: object[]
object
content: string
document_id: integer (int32)
file_path: string
rendition_id: integer (int32)
dsid path string
default

POST /api/v0/feature-extraction/{dsid}/delete

Remove documents from an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.

This operation cannot be undone.

Warning: all categorization, clustering, duplicate detection and email threading models associated with this dataset will be removed and need to be re-trained.

Parameters

  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where rendition_id is optional.

dataset_definition: object[]
object
document_id: integer (int32)
file_path: string
rendition_id: integer (int32)
dsid path string
default

POST /api/v0/feature-extraction/{dsid}/id-mapping

Compute the correspondence between id fields for documents. At least one of the fields used for indexing must be provided, and all the rest will be computed (if available). If the data parameter is not provided, the full correspondence table is returned.

Parameters

  • data: the ids of documents used as the query
  • return_file_path: whether the results should include the file path

data: object[]
object
document_id: integer (int32)
file_path: string
internal_id: integer (int32)
render_id: integer (int32)
return_file_path: boolean true
dsid path string
default
data: object[]
object
document_id: integer (int32)
file_path: string
internal_id: integer (int32)
render_id: integer (int32)

GET /api/v0/lsi/

List existing LSI models

parent_id: string
default
object
n_components: integer (int32)
parent_id: string

POST /api/v0/lsi/

Build a Latent Semantic Indexing (LSI) model

Recommended data ingestion options include use_idf=1, sublinear_tf=0, binary=0.

The recommended value for n_components (the dimensionality of the SVD decomposition) is in the [100, 200] range.

Parameters

  • n_components: Desired dimensionality of the output data. Must be strictly less than the number of features.
  • parent_id: parent dataset identified by dataset_id
  • alpha: floor on the number of components used with small datasets
  • id: (optional) custom model id. Can only contain letters, numbers, "_" or "-". It must also be between 2 and 50 characters long.
  • overwrite: if a custom model id was provided, and it already exists, overwrite it. Default: false

alpha: number 0.33
id: string
n_components: integer (int32) 150
overwrite: boolean
parent_id: string
default
explained_variance: number
id: string

DELETE /api/v0/lsi/{mid}

Delete a Latent Semantic Indexing (LSI) model

mid path string
default

GET /api/v0/lsi/{mid}

Show Latent Semantic Indexing (LSI) model parameters

mid path string
default
n_components: integer (int32)
parent_id: string

POST /api/v0/metrics/categorization

Compute categorization metrics to assess the quality of categorization.

In the case of binary categorization, category labels are sorted alphabetically and the second one is expected to be the positive one.

Parameters

  • y_true: [required] ground truth categorization data
  • y_pred: [required] predicted categorization results
  • metrics: [required] list of str. Metrics to compute, any combination of "precision", "recall", "f1", "roc_auc"

metrics: string[]
string
y_pred: object[]
object
document_id: integer (int32)
render_id: integer (int32)
scores: object[]
object
category: string
document_id: integer (int32)
render_id: integer (int32)
score: number
y_true: object[]
object
category: string
document_id: integer (int32)
render_id: integer (int32)
default
average_precision: number
f1: number
precision: number
recall: number
recall_at_20p: number
roc_auc: number
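
The rule for binary categorization (labels sorted alphabetically, second one positive) can give surprising results, so it may help to check it locally. The sketch below uses Python's default string ordering as a stand-in for "sorted alphabetically"; the server's exact ordering is an assumption.

```python
def positive_category(categories):
    """Return the label treated as positive under the documented rule.

    Assumption: Python's default string ordering matches the server's.
    """
    labels = sorted(set(categories))
    if len(labels) != 2:
        raise ValueError("binary categorization needs exactly two labels")
    return labels[1]


print(positive_category(["relevant", "irrelevant", "relevant"]))  # relevant
print(positive_category(["1_positive", "2_negative"]))            # 2_negative
```

The second call shows how a naming scheme can silently flip which class is treated as positive.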

POST /api/v0/metrics/clustering

Compute clustering metrics to assess the quality of clustering, comparing the ground truth labels with the predicted ones.

Parameters

  • labels_true: [required] list of int. Ground truth clustering labels
  • labels_pred: [required] list of int. Predicted clustering labels
  • metrics: [required] list of str. Metrics to compute, any combination of "adjusted_rand", "adjusted_mutual_info", "v_measure"

labels_pred: integer[]
integer (int32)
labels_true: integer[]
integer (int32)
metrics: string[]
string
default
adjusted_mutual_info: number
adjusted_rand: number
v_measure: number
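
A request body for this endpoint can be validated before it is sent. In the sketch below the helper name is hypothetical, and the equal-length check assumes that the two label lists are aligned by position, which this reference does not state explicitly.

```python
ALLOWED_METRICS = {"adjusted_rand", "adjusted_mutual_info", "v_measure"}


def clustering_metrics_payload(labels_true, labels_pred, metrics):
    """Build a body for POST /api/v0/metrics/clustering (hypothetical helper)."""
    # Assumption: labels are compared position by position.
    if len(labels_true) != len(labels_pred):
        raise ValueError("labels_true and labels_pred differ in length")
    unknown = set(metrics) - ALLOWED_METRICS
    if unknown:
        raise ValueError("unknown metrics: {}".format(sorted(unknown)))
    return {
        "labels_true": list(labels_true),
        "labels_pred": list(labels_pred),
        "metrics": list(metrics),
    }


payload = clustering_metrics_payload([0, 0, 1], [1, 1, 0], ["v_measure"])
```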

POST /api/v0/metrics/duplicate-detection

Compute duplicate detection metrics to assess the quality of duplicate detection, comparing the ground truth labels with the predicted ones.

Parameters

  • labels_true: [required] list of int. Ground truth clustering labels
  • labels_pred: [required] list of int. Predicted clustering labels
  • metrics: [required] list of str. Metrics to compute, any combination of "ratio_duplicates", "f1_same_duplicates", "mean_duplicates_count"

labels_pred: integer[]
integer (int32)
labels_true: integer[]
integer (int32)
metrics: string[]
string
default
f1_same_duplicates: number
mean_duplicates_count: number
ratio_duplicates: number

POST /api/v0/search/

Perform document search (if parent_id is a dataset_id) or a semantic search (if parent_id is a lsi_id).

Parameters

  • parent_id : the id of the previous processing step (either dataset_id or lsi_id)
  • query : the search query. Either query or query_document_id must be provided.
  • query_document_id : the id of the document used as the search query. Either query or query_document_id must be provided.
  • metric : The similarity returned by nearest neighbor classifier in ['cosine', 'jaccard', 'cosine-positive'].
  • min_score : filter out results below a similarity threshold
  • max_results : return only the first max_results documents. If max_results <= 0 all documents are returned.
  • sort_by : if provided and not None, the field used for sorting results. Valid values are [None, 'score']
  • sort_order: the sort order (if applicable), one of ['ascending', 'descending']
  • batch_id: retrieve a given subset of scores (-1 to retrieve all). Default: 0
  • batch_size: the number of document scores retrieved per batch. Default: 10000
  • subset_document_id: apply prediction to a subset of document_id.

batch_id: integer (int32)
batch_size: integer (int32) 10000
max_results: integer (int32)
metric: string cosine
min_score: number -1
parent_id: string
query: string
query_document_id: integer (int32)
sort_by: string score
sort_order: string descending
subset_document_id: integer[]
integer (int32)
default
data: object[]
object
document_id: integer (int32)
render_id: integer (int32)
score: number
pagination: object
batch_id: integer (int32)
batch_id_last: integer (int32)
current_response_count: integer (int32)
total_response_count: integer (int32)
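
The requirement that either query or query_document_id be provided can be enforced when the request body is built. The helper below is a hypothetical sketch. It only rejects the case where both are missing, since this reference does not say whether supplying both is allowed.

```python
def build_search_payload(parent_id, query=None, query_document_id=None,
                         **options):
    """Build a body for POST /api/v0/search/ (hypothetical helper)."""
    if query is None and query_document_id is None:
        raise ValueError("either query or query_document_id must be provided")
    payload = {"parent_id": parent_id}
    if query is not None:
        payload["query"] = query
    if query_document_id is not None:
        payload["query_document_id"] = query_document_id
    payload.update(options)  # e.g. max_results, min_score, sort_order
    return payload


# "my-lsi-model" is a hypothetical lsi_id, which would make this a semantic search.
payload = build_search_payload("my-lsi-model", query="quarterly budget",
                               max_results=10)
```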

POST /api/v0/stop-words/

Store a list of custom stop words

name: string
stop_words: string[]
string
default
name: string
stop_words: string[]
string

DELETE /api/v0/stop-words/{name}

Delete a stored list of custom stop words

name path string
default

GET /api/v0/stop-words/{name}

Load a stored list of stop words

name path string
default
name: string
stop_words: string[]
string

Parameter definitions

Schema definitions