Version: v0
List existing categorization models
Build the categorization ML model
The option `use_hashing=True` must be set for the feature extraction. The recommended options also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `data`: a list of dicts, each having a `category` field and one or several fields that can be used for indexing, such as `document_id` and optionally `rendition_id`
- `method`: classification algorithm to use (default: LogisticRegression)
- `cv`: binary; if true, the optimal parameters of the ML model are determined by cross-validation over 5 stratified K-folds (default: False)
- `training_scores`: binary; compute the efficiency scores on the training dataset. This makes computations much slower for NearestNeighbors (default: False)
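A minimal training call might look as follows. This is a sketch using the `requests` library; the base URL `http://localhost:5001/api/v0` and the `/categorization/` route are assumptions, only the parameters above come from this page.

```python
# Sketch: train a categorization model. Base URL, route and response
# shape are assumptions; adjust them to your deployment.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "parent_id": "<lsi_id>",            # id from the dataset or LSI step
    "data": [
        {"document_id": 0, "category": "relevant"},
        {"document_id": 1, "category": "not_relevant"},
    ],
    "method": "LogisticRegression",     # default classification algorithm
    "cv": False,                        # True: 5-fold stratified cross-validation
    "training_scores": False,           # True: also score the training set
}

resp = requests.post(BASE + "/categorization/", json=payload)
resp.raise_for_status()
mid = resp.json()["id"]                 # model id (response shape assumed)
print("trained model:", mid)
```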
Delete the categorization model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Load categorization model parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Predict document categorization with a previously trained model
Parameters

- `max_result_categories`: the maximum number of categories in the results
- `sort_by`: if provided and not None, the field used for sorting results. Valid values are [None, 'score'] or any of the ingested category names.
- `sort_order`: the sort order (if applicable), one of ['ascending', 'descending']
- `max_results`: return only the first `max_results` documents. If `max_results <= 0`, all documents are returned.
- `ml_output`: type of the output, in ['decision_function', 'probability']; only affects ML methods.
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_score`: filter out results below a similarity threshold
- `subset`: apply prediction to a document subset. Must be one of ['all', 'train', 'test']. Default: 'test'.
- `subset_document_id`: apply prediction to a subset of `document_id`.
- `batch_id`: retrieve a given subset of scores (-1 to retrieve all). Default: 0
- `batch_size`: the number of document scores retrieved per batch. Default: 10000

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
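Predictions can then be fetched from the trained model. In this sketch the `/categorization/<mid>/predict` route, the use of a JSON body on a GET request, and the response shape are all assumptions; the query parameters come from this page.

```python
# Sketch: retrieve predictions batch by batch for a trained model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
mid = "<model_id>"                      # id returned when the model was trained

params = {
    "max_result_categories": 1,         # keep only the top category per document
    "sort_by": "score",
    "sort_order": "descending",
    "min_score": 0.0,                   # similarity threshold for filtering
    "subset": "test",
    "batch_id": 0,                      # -1 would retrieve all scores at once
    "batch_size": 10000,
}

resp = requests.get(f"{BASE}/categorization/{mid}/predict", json=params)
resp.raise_for_status()
for row in resp.json()["data"]:         # response shape assumed
    print(row)
```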
Compute Birch clustering
The option `use_hashing=False` must be set for the feature extraction. The recommended options for data ingestion also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `n_clusters`: the number of clusters, or -1 to use hierarchical clustering (default: -1)
- `min_similarity`: the radius of the subcluster obtained by merging a new sample and the closest subcluster must be smaller than this threshold, otherwise a new subcluster is started (see sklearn.cluster.Birch). Increasing this value increases the hierarchical tree depth (and the number of clusters).
- `branching_factor`: maximum number of CF subclusters in each node. If a new sample enters such that the number of subclusters exceeds the `branching_factor`, the node is split; the corresponding parent is also split, and if the number of subclusters in the parent then exceeds the branching factor, the split proceeds recursively. Decreasing this value increases the number of clusters.
- `max_tree_depth`: maximum hierarchy depth (only applicable when `n_clusters=-1`)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
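For illustration, a Birch clustering run could be triggered as follows; the `/clustering/birch` route is an assumption based on the endpoint name, while the payload fields come from the parameter list above.

```python
# Sketch: run Birch clustering on top of an existing LSI model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "parent_id": "<lsi_id>",
    "n_clusters": -1,                   # -1: hierarchical clustering
    "min_similarity": 0.5,
    "branching_factor": 20,
    "max_tree_depth": 2,
}

resp = requests.post(BASE + "/clustering/birch", json=payload)
resp.raise_for_status()
print("clustering model id:", resp.json()["id"])   # response shape assumed
```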
Compute clustering (DBSCAN)

The option `use_hashing=False` must be set for the feature extraction. The recommended options for data ingestion also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `min_similarity`: the radius of the subcluster obtained by merging a new sample and the closest subcluster must be smaller than this threshold, otherwise a new subcluster is started (see sklearn.cluster.Birch)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_samples`: (optional) int. The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
Compute K-means clustering

The option `use_hashing=False` must be set for the feature extraction. The recommended options for feature extraction also include `weighting="ntc"`.
Parameters

- `parent_id`: `dataset_id` or `lsi_id`
- `n_clusters`: the number of clusters
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
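The DBSCAN and K-means endpoints follow the same calling pattern as the Birch sketch above; only the payload differs. The `/clustering/dbscan` and `/clustering/k-mean` routes below are assumptions.

```python
# Sketch: the same POST pattern as Birch, with method-specific payloads.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# DBSCAN: density-based, no fixed number of clusters
requests.post(BASE + "/clustering/dbscan",
              json={"parent_id": "<lsi_id>",
                    "min_similarity": 0.5,
                    "min_samples": 10}).raise_for_status()

# K-means: the number of clusters must be given explicitly
requests.post(BASE + "/clustering/k-mean",
              json={"parent_id": "<lsi_id>",
                    "n_clusters": 10}).raise_for_status()
```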
Delete a clustering model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
| `method` | path | string |
Compute cluster labels
Parameters
- `n_top_words`: keep only the most relevant `n_top_words` words
- `return_optimal_sampling`: instead of cluster results, the optimal sampling results will be returned (with no cluster labels). This option is only valid with the Birch algorithm. Note that optimal sampling cannot return more samples than there are subclusters in the Birch clustering results (default: false)
- `sampling_min_similarity`: similarity threshold used by smart sampling. Decreasing this value results in more sampled documents. Default: 1.0 (i.e. use the full cluster hierarchy).
- `sampling_min_coverage`: minimal coverage requirement in the [0, 1] range. Increasing this value results in a larger number of samples (default: 0.9)

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
| `method` | path | string |
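A labels request could then look like this. The `/clustering/<method>/<mid>` route shape follows the path parameters listed above but is still an assumption, as is the response shape.

```python
# Sketch: retrieve clusters together with their top-word labels.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
method, mid = "birch", "<model_id>"

resp = requests.get(f"{BASE}/clustering/{method}/{mid}",
                    json={"n_top_words": 5})   # label clusters with 5 top words
resp.raise_for_status()
for cluster in resp.json()["data"]:            # response shape assumed
    print(cluster)
```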
Compute near duplicates
Parameters
- `parent_id`: `dataset_id` or `lsi_id`
- `method`: str, default='simhash'. Method used for duplicate detection, one of "simhash", "i-match"

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Query duplicates
Parameters
- `distance`: int, default=2. Maximum number of different bits in the simhash (simhash method only)
- `n_rand_lexicons`: int, default=1. Number of random lexicons used for duplicate detection (I-Match method only)
- `rand_lexicon_ratio`: float, default=0.7. Ratio of the vocabulary used in random lexicons (I-Match method only)
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
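The two duplicate-detection endpoints are typically chained: first compute the model, then query it. In this sketch the `/duplicate-detection/` routes and the response shape are assumptions; the parameters come from this page.

```python
# Sketch: build a near-duplicates model with simhash, then query it.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# 1. Compute near duplicates
resp = requests.post(BASE + "/duplicate-detection/",
                     json={"parent_id": "<dataset_id>", "method": "simhash"})
resp.raise_for_status()
mid = resp.json()["id"]                 # response shape assumed

# 2. Query duplicate clusters, allowing up to 2 differing simhash bits
resp = requests.get(f"{BASE}/duplicate-detection/{mid}",
                    json={"distance": 2})
resp.raise_for_status()
print(resp.json())
```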
Delete a processed dataset
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Get email threading parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Download a benchmark dataset.
The currently supported datasets are listed below.
1. TREC 2009 legal collection
- `treclegal09_2k_subset` : 2 400 documents, 2 MB
- `treclegal09_20k_subset` : 20 000 documents, 30 MB
- `treclegal09_37k_subset` : 37 000 documents, 55 MB
- `treclegal09` : 700 000 documents, 1.2 GB
The ground truth files for categorization are adapted from TAR Toolkit.
2. Fedora mailing list (2009)
- `fedora_ml_3k_subset`
3. The 20 newsgroups dataset
- `20_newsgroups_3categories`: only the ['comp.graphics',
'rec.sport.baseball', 'sci.space'] categories
If you encounter any issues downloading data with this function,
you can also manually download and extract the required dataset to
``cache_dir`` (the download url is ``http://r0h.eu/d/<name>.tar.gz``),
then re-run this function to get the required metadata.
| Name | In | Type |
|---|---|---|
| `name` | path | string |
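A download could be triggered by name as sketched below; the `/datasets/<name>` route and the response shape are assumptions based on the endpoint description.

```python
# Sketch: fetch a benchmark dataset by name and inspect its metadata.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.get(BASE + "/datasets/20_newsgroups_3categories")
resp.raise_for_status()
meta = resp.json()                      # metadata for the downloaded collection
print(sorted(meta.keys()))
```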
View parameters used for the feature extraction
Initialize the feature extraction on a document collection.
Parameters
- `n_features`: [optional] number of features (overlapping character/word n-grams that are hashed). `n_features` refers to the number of buckets in the hash; the larger the number, the fewer collisions (default: 1100000)
- `analyzer`: 'word', 'char' or 'char_wb'. Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries (default: 'word')
- `ngram_range`: tuple (min_n, max_n), default=(1, 1). The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- `stop_words`: "english" or "None". Remove stop words from the resulting tokens. Only applies to the "word" analyzer. If "english", a built-in stop word list for English is used (default: "english")
- `n_jobs`: the maximum number of concurrently running jobs (default: 1)
- `chunk_size`: the number of documents simultaneously processed by a running job (default: 5000)
- `weighting`: the SMART notation for document term weighting and normalization, in the form [nlabL][ntp][ncb]; see https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- `norm_alpha`: the alpha value used for pivoted normalization
- `use_hashing`: enable hashing. This option must be set to True for classification and to False for clustering (default: True)
- `min_df`: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
- `max_df`: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
- `parse_email_headers`: when documents are emails, attempt to parse the information contained in the header (default: False)
- `preprocess`: a list of pre-processing steps to apply before vectorization. A subset of ['emails_ignore_header'] (default: [])
- `id`: (optional) custom dataset id. Can only contain letters, numbers, "_" or "-", and must be between 2 and 50 characters long.
- `overwrite`: if a custom dataset id was provided and it already exists, overwrite it. Default: false
- `column_ids`: list of ints. If provided, the input dataset is expected to be CSV, and the columns with the provided ids are selected. Documents can then only be provided using the `dataset_definition` parameter, which must contain a single file path.
- `column_separator`: str, the character used to delimit columns. Only used if `column_ids` is provided. Default: ','
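An initialization call might look as follows; the `/feature-extraction` route and the response shape are assumptions, while the payload fields come from the parameter list above.

```python
# Sketch: initialize feature extraction for a new document collection.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

payload = {
    "n_features": 100001,    # number of hash buckets
    "analyzer": "word",
    "ngram_range": [1, 1],   # (min_n, max_n)
    "stop_words": "english",
    "weighting": "ntc",      # SMART notation, as recommended above
    "use_hashing": False,    # False: required later for clustering
}

resp = requests.post(BASE + "/feature-extraction", json=payload)
resp.raise_for_status()
dsid = resp.json()["id"]     # dataset id used by subsequent calls (shape assumed)
print("dataset id:", dsid)
```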
Delete a processed dataset

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Load extracted features (and obtain the processing status)
| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Run feature extraction on a dataset.
Parameters
- `data_dir`: [optional] relative path to the directory with the input files. Either `data_dir` or `dataset_definition` must be provided.
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'content': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `document_id` and `rendition_id` are optional, while either the `file_path` or the `content` field must be provided.
- `vectorize`: [optional] this option can be used to ingest the `dataset_definition` in batches (optionally with document content), then make one final call to vectorize all sent documents (bool, default: True)
- `document_id_generator`: [optional] if the `document_id` is not provided, this specifies how it is generated. With `indexed_file_path`, the `document_id` is given by the index of the sorted `file_path`; with `infer_file_path`, the `document_id` is inferred from the `file_path` strings by removing all non-digit characters. In this second case, the `file_path` must contain a unique numeric ID (default: `indexed_file_path`)

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
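For example, documents can be ingested by content and vectorized in one call; the `/feature-extraction/<dsid>` route is an assumption.

```python
# Sketch: ingest two documents by content and vectorize them immediately.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"                   # returned by the initialization call

payload = {
    "dataset_definition": [
        {"document_id": 0, "content": "The quick brown fox."},
        {"document_id": 1, "content": "Jumped over the lazy dog."},
    ],
    "vectorize": True,   # no further batches to send
}

resp = requests.post(f"{BASE}/feature-extraction/{dsid}", json=payload)
resp.raise_for_status()
```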
Add new documents to an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.
This operation cannot be undone.
Warning: all categorization, clustering, duplicate detection and email threading models associated with this dataset will be removed and need to be re-trained.
Parameters
- `data_dir`: [optional] relative path to the directory with the input files. Either `data_dir` or `dataset_definition` must be provided.
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `rendition_id` is optional, while either the `file_path` or the `content` field must be provided.

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
Remove documents from an existing processed dataset. This will also automatically update the LSI model if any is present. Raw documents on disk are not affected.
This operation cannot be undone.
Warning: all categorization, clustering, duplicate detection and
email threading models associated with this dataset will be removed and
need to be re-trained.
Parameters
- `dataset_definition`: [optional] a list of dictionaries `[{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...]` where `rendition_id` is optional.
| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
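Adding and removing documents follow the same pattern; a sketch covering both, where the `/feature-extraction/<dsid>/append` and `/feature-extraction/<dsid>/delete` routes are assumptions.

```python
# Sketch: add, then remove, documents in an existing processed dataset.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"

# Add two new documents (this invalidates trained models, see warning above)
requests.post(f"{BASE}/feature-extraction/{dsid}/append",
              json={"dataset_definition": [
                  {"document_id": 10, "file_path": "docs/0010.txt"},
                  {"document_id": 11, "file_path": "docs/0011.txt"},
              ]}).raise_for_status()

# Remove one document by id
requests.post(f"{BASE}/feature-extraction/{dsid}/delete",
              json={"dataset_definition": [
                  {"document_id": 10},
              ]}).raise_for_status()
```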
Compute correspondence between id fields for documents. At least one of the fields used for indexing must be provided, and all the others will be computed (if available). If the `data` parameter is not provided, the full correspondence table is returned.
Parameters
- `data`: the ids of the documents used as the query
- `return_file_path`: whether the results should include the file path

| Name | In | Type |
|---|---|---|
| `dsid` | path | string |
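For instance, to map a `document_id` to its other indexing fields; the `/feature-extraction/<dsid>/id-mapping` route is an assumption.

```python
# Sketch: query the id correspondence table for one document.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address
dsid = "<dataset_id>"

resp = requests.post(f"{BASE}/feature-extraction/{dsid}/id-mapping",
                     json={"data": [{"document_id": 0}],
                           "return_file_path": True})
resp.raise_for_status()
print(resp.json())
```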
List existing LSI models
Build a Latent Semantic Indexing (LSI) model
The recommended data ingestion options also include `use_idf=1, sublinear_tf=0, binary=0`.

The recommended value for `n_components` (the dimensionality of the SVD decomposition) is in the [100, 200] range.
Parameters
- `n_components`: desired dimensionality of the output data. Must be strictly less than the number of features.
- `parent_id`: parent dataset identified by `dataset_id`
- `alpha`: floor on the number of components used with small datasets
- `id`: (optional) custom model id. Can only contain letters, numbers, "_" or "-", and must be between 2 and 50 characters long.
- `overwrite`: if a custom model id was provided and it already exists, overwrite it. Default: false
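A build call might look like this; the `/lsi/` route and the response shape are assumptions.

```python
# Sketch: build an LSI model on top of a processed dataset.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.post(BASE + "/lsi/",
                     json={"parent_id": "<dataset_id>",
                           "n_components": 150})  # recommended range: [100, 200]
resp.raise_for_status()
lsi_id = resp.json()["id"]   # use as parent_id for categorization/clustering
print("LSI model id:", lsi_id)
```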
Delete a Latent Semantic Indexing (LSI) model

| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Show Latent Semantic Indexing (LSI) model parameters
| Name | In | Type |
|---|---|---|
| `mid` | path | string |
Compute categorization metrics to assess the quality of categorization.
In the case of binary categorization, category labels are sorted alphabetically and the second one is expected to be the positive one.
Parameters
Compute clustering metrics to assess the quality of clustering, comparing the ground truth labels with the predicted ones.
Parameters
Compute duplicate detection metrics to assess the quality of duplicate detection, comparing the ground truth labels with the predicted ones.
Parameters
Perform a document search (if `parent_id` is a `dataset_id`) or a semantic search (if `parent_id` is an `lsi_id`).

Parameters

- `parent_id`: the id of the previous processing step (either `dataset_id` or `lsi_id`)
- `query`: the search query. Either `query` or `query_document_id` must be provided.
- `query_document_id`: the id of the document used as the search query. Either `query` or `query_document_id` must be provided.
- `metric`: the similarity returned by the nearest neighbors classifier, in ['cosine', 'jaccard', 'cosine-positive'].
- `min_score`: filter out results below a similarity threshold
- `max_results`: return only the first `max_results` documents. If `max_results <= 0`, all documents are returned.
- `sort_by`: if provided and not None, the field used for sorting results. Valid values are [None, 'score']
- `sort_order`: the sort order (if applicable), one of ['ascending', 'descending']
- `batch_id`: retrieve a given subset of scores (-1 to retrieve all). Default: 0
- `batch_size`: the number of document scores retrieved per batch. Default: 10000
- `subset_document_id`: apply the search to a subset of `document_id`.
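A semantic search could be issued as sketched below; the `/search/` route, the JSON body on a GET request, and the response shape are assumptions.

```python
# Sketch: semantic search against an LSI model.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

resp = requests.get(BASE + "/search/",
                    json={"parent_id": "<lsi_id>",   # lsi_id: semantic search
                          "query": "contract termination notice",
                          "min_score": 0.3,
                          "max_results": 10,
                          "sort_by": "score",
                          "sort_order": "descending"})
resp.raise_for_status()
for hit in resp.json()["data"]:          # response shape assumed
    print(hit["document_id"], hit["score"])
```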
Store a list of custom stop words

Delete a stored list of custom stop words
| Name | In | Type |
|---|---|---|
| `name` | path | string |
Load a stored list of stop words
| Name | In | Type |
|---|---|---|
| `name` | path | string |
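A sketch of storing a custom list and loading it back; the `/stop-words/` route and the `stop_words` body field are assumptions, not documented above.

```python
# Sketch: store a named stop words list, then reload it.
import requests

BASE = "http://localhost:5001/api/v0"   # assumed default server address

# Store a named list (body field names assumed)
requests.post(BASE + "/stop-words/",
              json={"name": "my_stop_words",
                    "stop_words": ["the", "and", "of"]}).raise_for_status()

# Load it back by name
resp = requests.get(BASE + "/stop-words/my_stop_words")
resp.raise_for_status()
print(resp.json())
```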