Load a dataset and initialize feature extraction
Initialize the feature extraction on a document collection.
- URL: /api/v0/feature-extraction/
- Method: POST
- URL Params: None
- Data Params (following the sklearn.feature_extraction.text.HashingVectorizer API):
  - data_dir: [required] relative path to the directory with the input files
  - n_features: [optional] number of features (overlapping character/word n-grams that are hashed). n_features is the number of buckets in the hash; the larger the number, the fewer collisions. (default: 1100000)
  - analyzer: ‘word’, ‘char’ or ‘char_wb’. Whether the features should be made of word or character n-grams. The option ‘char_wb’ creates character n-grams only from text inside word boundaries. (default: ‘word’)
  - ngram_range: tuple (min_n, max_n). The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used. (default: (1, 1))
  - stop_words: “english” or “None”. Remove stop words from the resulting tokens. Only applies to the “word” analyzer. If “english”, a built-in stop word list for English is used. (default: “None”)
  - n_jobs: the maximum number of concurrently running jobs (default: 1)
  - norm: the normalization to apply after the feature weighting (‘None’, ‘l1’, ‘l2’) (default: ‘None’)
  - chunk_size: the number of documents processed simultaneously by a running job (default: 5000)
  - binary: if set to 1, all non-zero counts are set to 1. (default: True)
  - use_idf: enable inverse-document-frequency reweighting (default: False)
  - sublinear_tf: apply sublinear tf scaling, i.e. replace tf with log(1 + tf) (default: False)
  - use_hashing: enable hashing. This option must be set to True for classification and to False for clustering. (default: True)
  - min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
  - max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
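To make the n_features and ngram_range parameters more concrete, here is a minimal sketch of feature hashing over word n-grams. It is an illustration only: the helper names are hypothetical, and zlib.crc32 stands in for the hash function used by scikit-learn's HashingVectorizer (MurmurHash3), so the bucket indices will not match those computed by the service.

```python
import zlib
from collections import Counter


def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n (hypothetical helper)."""
    min_n, max_n = ngram_range
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hash_features(text, n_features=1100000, ngram_range=(1, 1)):
    """Map each n-gram to one of n_features buckets and count occurrences.

    zlib.crc32 is a stand-in hash chosen for this sketch only.
    """
    counts = Counter()
    for gram in word_ngrams(text, ngram_range):
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        counts[bucket] += 1
    return counts


# Unigrams followed by bigrams: 4 + 3 = 7 n-grams in total.
grams = word_ngrams("the quick brown fox", ngram_range=(1, 2))
features = hash_features("the quick brown fox", ngram_range=(1, 2))
```

With a small n_features, distinct n-grams land in the same bucket more often, which is the collision trade-off described for n_features.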
- Success Response: HTTP 200
  {"id": <str>, "filenames": <list[str]>}
- Error Response: HTTP 422
  {"error": "Some error message"}
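A request to this endpoint could look like the sketch below, which uses only the Python standard library. The host and port (http://localhost:5001), the data_dir value, and the choice to send the parameters as a JSON body are assumptions made for illustration and are not stated in this section; adjust them to the actual deployment.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:5001"  # assumption: host and port of the server
ENDPOINT = "/api/v0/feature-extraction/"


def build_request(data_dir, **options):
    """Build the POST request; data_dir is required, other params are optional."""
    payload = {"data_dir": data_dir}
    payload.update(options)
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),  # assumption: JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def init_feature_extraction(data_dir, **options):
    """Send the request and return (status code, decoded JSON body)."""
    request = build_request(data_dir, **options)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Error responses such as HTTP 422 carry {"error": "..."} in the body.
        return err.code, json.loads(err.read().decode("utf-8"))


# Building the request does not contact the server.
request = build_request(
    "data/my_collection",  # hypothetical relative path
    analyzer="word",
    ngram_range=[1, 2],
    use_hashing=True,
)
```

Against a running server, init_feature_extraction("data/my_collection") would return the status code together with the decoded body: on HTTP 200 the body holds the "id" and "filenames" keys, and on HTTP 422 it holds the "error" key.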