Version: v0
List existing categorization models
Build the categorization ML model
The option `use_hashing=True` must be set for the feature extraction. The recommended options also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `data`: a list of dicts, each having a `category` field and one or several fields that can be used for indexing, such as `document_id` and optionally `rendition_id`
- `method`: classification algorithm to use (default: LogisticRegression)
- `cv`: binary; if true, the optimal parameters of the ML model are determined by cross-validation over 5 stratified K-folds (default: False)
- `training_scores`: binary; compute the efficiency scores on the training dataset. This makes computations much slower for NearestNeighbors (default: False)
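A minimal training call might look as follows. This is a sketch using the `requests` library; the base URL `http://localhost:5001/api/v0` and the `/categorization/` route are assumptions, only the parameters above come from this page.

```python
# Sketch: train a categorization model. Base URL, route and response
# shape are assumptions; adjust them to your deployment.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "parent_id": "<lsi_id>",            # id from the dataset or LSI step
    "data": [
        {"document_id": 0, "category": "relevant"},
        {"document_id": 1, "category": "not_relevant"},
    ],
    "method": "LogisticRegression",     # default classification algorithm
    "cv": False,                        # True: 5-fold stratified cross-validation
    "training_scores": False,           # True: also score the training set
}

resp = requests.post(BASE + "/categorization/", json=payload)
resp.raise_for_status()
mid = resp.json()["id"]                 # model id (response shape assumed)
print("trained model:", mid)
```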
Delete the categorization model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Load categorization model parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Predict document categorization with a previously trained model
Parameters

- `max_result_categories`: the maximum number of categories in the results
- `sort_by`: if provided and not None, the field used for sorting results. Valid values are [None, 'score'] or any of the ingested category names.
- `sort_order`: the sort order (if applicable), one of ['ascending', 'descending']
- `max_results`: return only the first `max_results` documents. If `max_results <= 0`, all documents are returned.
- `ml_output`: type of the output, in ['decision_function', 'probability']; only affects ML methods.
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_score`: filter out results below a similarity threshold
- `subset`: apply prediction to a document subset. Must be one of ['all', 'train', 'test']. Default: 'test'.
- `subset_document_id`: apply prediction to a subset of `document_id`.
- `batch_id`: retrieve a given subset of scores (-1 to retrieve all). Default: 0
- `batch_size`: the number of document scores retrieved per batch. Default: 10000

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
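Predictions can then be fetched from the trained model. In this sketch the `/categorization/<mid>/predict` route, the use of a JSON body on a GET request, and the response shape are all assumptions; the query parameters come from this page.

```python
# Sketch: retrieve predictions batch by batch for a trained model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
mid = "<model_id>"                      # id returned when the model was trained

params = {
    "max_result_categories": 1,         # keep only the top category per document
    "sort_by": "score",
    "sort_order": "descending",
    "min_score": 0.0,                   # similarity threshold for filtering
    "subset": "test",
    "batch_id": 0,                      # -1 would retrieve all scores at once
    "batch_size": 10000,
}

resp = requests.get(f"{BASE}/categorization/{mid}/predict", json=params)
resp.raise_for_status()
for row in resp.json()["data"]:         # response shape assumed
    print(row)
```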
Compute Birch clustering
The option `use_hashing=False` must be set for the feature extraction. The recommended options for data ingestion also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `n_clusters`: the number of clusters, or -1 to use hierarchical clustering (default: -1)
- `min_similarity`: the radius of the subcluster obtained by merging a new sample and the closest subcluster must be smaller than this threshold, otherwise a new subcluster is started (see sklearn.cluster.Birch). Increasing this value increases the hierarchical tree depth (and the number of clusters).
- `branching_factor`: maximum number of CF subclusters in each node. If a new sample enters such that the number of subclusters exceeds the `branching_factor`, the node is split; the corresponding parent is also split, and if the number of subclusters in the parent then exceeds the branching factor, the split proceeds recursively. Decreasing this value increases the number of clusters.
- `max_tree_depth`: maximum hierarchy depth (only applicable when `n_clusters=-1`)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
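For illustration, a Birch clustering run could be triggered as follows; the `/clustering/birch` route is an assumption based on the endpoint name, while the payload fields come from the parameter list above.

```python
# Sketch: run Birch clustering on top of an existing LSI model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "parent_id": "<lsi_id>",
    "n_clusters": -1,                   # -1: hierarchical clustering
    "min_similarity": 0.5,
    "branching_factor": 20,
    "max_tree_depth": 2,
}

resp = requests.post(BASE + "/clustering/birch", json=payload)
resp.raise_for_status()
print("clustering model id:", resp.json()["id"])   # response shape assumed
```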
Compute clustering (DBSCAN)

The option `use_hashing=False` must be set for the feature extraction. The recommended options for data ingestion also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `min_similarity`: the radius of the subcluster obtained by merging a new sample and the closest subcluster must be smaller than this threshold, otherwise a new subcluster is started (see sklearn.cluster.Birch)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_samples`: (optional) int. The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
Compute K-means clustering

The option `use_hashing=False` must be set for the feature extraction. The recommended options for feature extraction also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `n_clusters`: the number of clusters
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
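The DBSCAN and K-means endpoints follow the same calling pattern as the Birch sketch above; only the payload differs. The `/clustering/dbscan` and `/clustering/k-mean` routes below are assumptions.

```python
# Sketch: the same POST pattern as Birch, with method-specific payloads.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# DBSCAN: density-based, no fixed number of clusters
requests.post(BASE + "/clustering/dbscan",
              json={"parent_id": "<lsi_id>",
                    "min_similarity": 0.5,
                    "min_samples": 10}).raise_for_status()

# K-means: the number of clusters must be given explicitly
requests.post(BASE + "/clustering/k-mean",
              json={"parent_id": "<lsi_id>",
                    "n_clusters": 10}).raise_for_status()
```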
Delete a clustering model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
| `method` | path | string |
Compute cluster labels
Parameters
- `n_top_words`: keep only the most relevant `n_top_words` words
- `return_optimal_sampling`: instead of cluster results, the optimal sampling results will be returned (with no cluster labels). This option is only valid with the Birch algorithm. Note that optimal sampling cannot return more samples than there are subclusters in the Birch clustering results (default: false)
- `sampling_min_similarity`: similarity threshold used by smart sampling. Decreasing this value results in more sampled documents. Default: 1.0 (i.e. use the full cluster hierarchy).
- `sampling_min_coverage`: minimal coverage requirement in the [0, 1] range. Increasing this value results in a larger number of samples (default: 0.9)

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
| `method` | path | string |
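A labels request could then look like this. The `/clustering/<method>/<mid>` route shape follows the path parameters listed above but is still an assumption, as is the response shape.

```python
# Sketch: retrieve clusters together with their top-word labels.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
method, mid = "birch", "<model_id>"

resp = requests.get(f"{BASE}/clustering/{method}/{mid}",
                    json={"n_top_words": 5})   # label clusters with 5 top words
resp.raise_for_status()
for cluster in resp.json()["data"]:            # response shape assumed
    print(cluster)
```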
Compute near duplicates
Parameters
- `parent_id`: `dataset_id` or `lsi_id`
- `method`: str, default='simhash'. Method used for duplicate detection, one of "simhash", "i-match"

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Query duplicates
Parameters
- `distance`: int, default=2. Maximum number of different bits in the simhash (simhash method only)
- `n_rand_lexicons`: int, default=1. Number of random lexicons used for duplicate detection (I-Match method only)
- `rand_lexicon_ratio`: float, default=0.7. Ratio of the vocabulary used in random lexicons (I-Match method only)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
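The two duplicate-detection endpoints are typically chained: first compute the model, then query it. In this sketch the `/duplicate-detection/` routes and the response shape are assumptions; the parameters come from this page.

```python
# Sketch: build a near-duplicates model with simhash, then query it.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# 1. Compute near duplicates
resp = requests.post(BASE + "/duplicate-detection/",
                     json={"parent_id": "<dataset_id>", "method": "simhash"})
resp.raise_for_status()
mid = resp.json()["id"]                 # response shape assumed

# 2. Query duplicate clusters, allowing up to 2 differing simhash bits
resp = requests.get(f"{BASE}/duplicate-detection/{mid}",
                    json={"distance": 2})
resp.raise_for_status()
print(resp.json())
```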
Delete a processed dataset
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Get email threading parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Download a benchmark dataset.
The currently supported datasets are listed below.
1. TREC 2009 legal collection
- `treclegal09_2k_subset` : 2 400 documents, 2 MB
- `treclegal09_20k_subset` : 20 000 documents, 30 MB
- `treclegal09_37k_subset` : 37 000 documents, 55 MB
- `treclegal09` : 700 000 documents, 1.2 GB
The ground truth files for categorization are adapted from TAR Toolkit.
2. Fedora mailing list (2009)
- `fedora_ml_3k_subset`
3. The 20 newsgroups dataset
- `20_newsgroups_3categories`: only the ['comp.graphics',
'rec.sport.baseball', 'sci.space'] categories
If you encounter any issues downloading data with this function,
you can also manually download and extract the required dataset to
``cache_dir`` (the download url is ``http://r0h.eu/d/<name>.tar.gz``),
then re-run this function to get the required metadata.
| Name | In | Type |
|---|---|---|
| `name` | path | string |
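A download could be triggered by name as sketched below; the `/datasets/<name>` route and the response shape are assumptions based on the endpoint description.

```python
# Sketch: fetch a benchmark dataset by name and inspect its metadata.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.get(BASE + "/datasets/20_newsgroups_3categories")
resp.raise_for_status()
meta = resp.json()                      # metadata for the downloaded collection
print(sorted(meta.keys()))
```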
View parameters used for the feature extraction
Initialize the feature extraction on a document collection.
Parameters
- `n_features`: [optional] number of features (overlapping character/word n-grams that are hashed). `n_features` refers to the number of buckets in the hash; the larger the number, the fewer collisions (default: 1100000)
- `analyzer`: 'word', 'char' or 'char_wb'. Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries (default: 'word')
- `ngram_range`: tuple (min_n, max_n), default=(1, 1). The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- `stop_words`: "english" or "None". Remove stop words from the resulting tokens. Only applies to the "word" analyzer. If "english", a built-in stop word list for English is used (default: "english")
- `n_jobs`: the maximum number of concurrently running jobs (default: 1)
- `chunk_size`: the number of documents simultaneously processed by a running job (default: 5000)
- `weighting`: the SMART notation for document term weighting and normalization, in the form [nlabL][ntp][ncb]; see https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- `norm_alpha`: the alpha value used for pivoted normalization
- `use_hashing`: enable hashing. This option must be set to True for classification and to False for clustering (default: True)
- `min_df`: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
- `max_df`: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
- `parse_email_headers`: when documents are emails, attempt to parse the information contained in the header (default: False)
- `preprocess`: a list of pre-processing steps to apply before vectorization. A subset of ['emails_ignore_header'] (default: [])
- `id`: (optional) custom dataset id. Can only contain letters, numbers, "_" or "-", and must be between 2 and 50 characters long.
- `overwrite`: if a custom dataset id was provided and it already exists, overwrite it. Default: false
- `column_ids`: list of ints. If provided, the input dataset is expected to be CSV, and the columns with the provided ids are selected. Documents can then only be provided using the `dataset_definition` parameter, which must contain a single file path.
- `column_separator`: str, the character used to delimit columns. Only used if `column_ids` is provided. Default: ','
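An initialization call might look as follows; the `/feature-extraction` route and the response shape are assumptions, while the payload fields come from the parameter list above.

```python
# Sketch: initialize feature extraction for a new document collection.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "n_features": 100001,    # number of hash buckets
    "analyzer": "word",
    "ngram_range": [1, 1],   # (min_n, max_n)
    "stop_words": "english",
    "weighting": "ntc",      # SMART notation, as recommended above
    "use_hashing": False,    # False: required later for clustering
}

resp = requests.post(BASE + "/feature-extraction", json=payload)
resp.raise_for_status()
dsid = resp.json()["id"]     # dataset id used by subsequent calls (shape assumed)
print("dataset id:", dsid)
```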
Delete a processed dataset

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Load extracted features (and obtain the processing status)
| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Run feature extraction on a dataset.
Parameters
- `data_dir`: [optional] relative path to the directory with the input files. Either `data_dir` or `dataset_definition` must be provided.
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'content': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `document_id` and `rendition_id` are optional, while either the `file_path` or the `content` field must be provided.
- `vectorize`: [optional] this option can be used to ingest the `dataset_definition` in batches (optionally with document content), then make one final call to vectorize all sent documents (bool, default: True)
- `document_id_generator`: [optional] if the `document_id` is not provided, this specifies how it is generated. With `indexed_file_path`, the `document_id` is given by the index of the sorted `file_path`; with `infer_file_path`, the `document_id` is inferred from the `file_path` strings by removing all non-digit characters. In this second case, the `file_path` must contain a unique numeric ID (default: `indexed_file_path`)

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
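For example, documents can be ingested by content and vectorized in one call; the `/feature-extraction/<dsid>` route is an assumption.

```python
# Sketch: ingest two documents by content and vectorize them immediately.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"                   # returned by the initialization call

payload = {
    "dataset_definition": [
        {"document_id": 0, "content": "The quick brown fox."},
        {"document_id": 1, "content": "Jumped over the lazy dog."},
    ],
    "vectorize": True,   # no further batches to send
}

resp = requests.post(f"{BASE}/feature-extraction/{dsid}", json=payload)
resp.raise_for_status()
```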
Add new documents to an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.
This operation cannot be undone.
Warning: all categorization, clustering, duplicate detection and email threading models associated with this dataset will be removed and need to be re-trained.
Parameters
- `data_dir`: [optional] relative path to the directory with the input files. Either `data_dir` or `dataset_definition` must be provided.
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `rendition_id` is optional, while either the `file_path` or the `content` field must be provided.

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Remove documents from an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.
This operation cannot be undone.
Warning: all categorization, clustering, duplicate detection and
email threading models associated with this dataset will be removed and
need to be re-trained.
Parameters
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `rendition_id` is optional.
| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
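Adding and removing documents follow the same pattern; a sketch covering both, where the `/feature-extraction/<dsid>/append` and `/feature-extraction/<dsid>/delete` routes are assumptions.

```python
# Sketch: add, then remove, documents in an existing processed dataset.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"

# Add two new documents (this invalidates trained models, see warning above)
requests.post(f"{BASE}/feature-extraction/{dsid}/append",
              json={"dataset_definition": [
                  {"document_id": 10, "file_path": "docs/0010.txt"},
                  {"document_id": 11, "file_path": "docs/0011.txt"},
              ]}).raise_for_status()

# Remove one document by id
requests.post(f"{BASE}/feature-extraction/{dsid}/delete",
              json={"dataset_definition": [
                  {"document_id": 10},
              ]}).raise_for_status()
```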
Compute correspondence between id fields for documents. At least one of the fields used for indexing must be provided, and all the others will be computed (if available). If the `data` parameter is not provided, the full correspondence table is returned.
Parameters
- `data`: the ids of the documents used as the query
- `return_file_path`: whether the results should include the file path

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
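For instance, to map a `document_id` to its other indexing fields; the `/feature-extraction/<dsid>/id-mapping` route is an assumption.

```python
# Sketch: query the id correspondence table for one document.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"

resp = requests.post(f"{BASE}/feature-extraction/{dsid}/id-mapping",
                     json={"data": [{"document_id": 0}],
                           "return_file_path": True})
resp.raise_for_status()
print(resp.json())
```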
List existing LSI models
Build a Latent Semantic Indexing (LSI) model
The recommended data ingestion options also include `use_idf=1, sublinear_tf=0, binary=0`.

The recommended value for `n_components` (the dimensionality of the SVD decomposition) is in the [100, 200] range.
Parameters
- `n_components`: desired dimensionality of the output data. Must be strictly less than the number of features.
- `parent_id`: parent dataset identified by `dataset_id`
- `alpha`: floor on the number of components used with small datasets
- `id`: (optional) custom model id. Can only contain letters, numbers, "_" or "-", and must be between 2 and 50 characters long.
- `overwrite`: if a custom model id was provided and it already exists, overwrite it. Default: false
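A build call might look like this; the `/lsi/` route and the response shape are assumptions.

```python
# Sketch: build an LSI model on top of a processed dataset.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.post(BASE + "/lsi/",
                     json={"parent_id": "<dataset_id>",
                           "n_components": 150})  # recommended range: [100, 200]
resp.raise_for_status()
lsi_id = resp.json()["id"]   # use as parent_id for categorization/clustering
print("LSI model id:", lsi_id)
```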
Delete a Latent Semantic Indexing (LSI) model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Show Latent Semantic Indexing (LSI) model parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Compute categorization metrics to assess the quality of categorization.
In the case of binary categorization, category labels are sorted alphabetically and the second one is expected to be the positive one.
Parameters
Compute clustering metrics to assess the quality of clustering, comparing the ground truth labels with the predicted ones.
Parameters
Compute duplicate detection metrics to assess the quality of duplicate detection, comparing the ground truth labels with the predicted ones.
Parameters
Perform a document search (if `parent_id` is a `dataset_id`) or a semantic search (if `parent_id` is an `lsi_id`).

Parameters

- `parent_id`: the id of the previous processing step (either `dataset_id` or `lsi_id`)
- `query`: the search query. Either `query` or `query_document_id` must be provided.
- `query_document_id`: the id of the document used as the search query. Either `query` or `query_document_id` must be provided.
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_score`: filter out results below a similarity threshold
- `max_results`: return only the first `max_results` documents. If `max_results <= 0`, all documents are returned.
- `sort_by`: if provided and not None, the field used for sorting results. Valid values are [None, 'score']
- `sort_order`: the sort order (if applicable), one of ['ascending', 'descending']
- `batch_id`: retrieve a given subset of scores (-1 to retrieve all). Default: 0
- `batch_size`: the number of document scores retrieved per batch. Default: 10000
- `subset_document_id`: apply the search to a subset of `document_id`.
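A semantic search could be issued as sketched below; the `/search/` route, the JSON body on a GET request, and the response shape are assumptions.

```python
# Sketch: semantic search against an LSI model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.get(BASE + "/search/",
                    json={"parent_id": "<lsi_id>",   # lsi_id: semantic search
                          "query": "contract termination notice",
                          "min_score": 0.3,
                          "max_results": 10,
                          "sort_by": "score",
                          "sort_order": "descending"})
resp.raise_for_status()
for hit in resp.json()["data"]:          # response shape assumed
    print(hit["document_id"], hit["score"])
```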
Store a list of custom stop words

Delete a stored list of custom stop words
| Name | In | Type |
|---|---|---|
| `name` | path | string |
Load a stored list of stop words
| Name | In | Type |
|---|---|---|
| `name` | path | string |
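A sketch of storing a custom list and loading it back; the `/stop-words/` route and the `stop_words` body field are assumptions, not documented above.

```python
# Sketch: store a named stop words list, then reload it.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# Store a named list (body field names assumed)
requests.post(BASE + "/stop-words/",
              json={"name": "my_stop_words",
                    "stop_words": ["the", "and", "of"]}).raise_for_status()

# Load it back by name
resp = requests.get(BASE + "/stop-words/my_stop_words")
resp.raise_for_status()
print(resp.json())
```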