Load a dataset and initialize feature extraction

Initialize the feature extraction on a document collection.

URL: /api/v0/feature-extraction/
Method: POST, URL Params: None
Data Params: (following the sklearn.feature_extraction.text.HashingVectorizer API)
- data_dir: [required] relative path to the directory with the input files
- n_features: [optional] number of features (overlapping character/word n-grams that are hashed). n_features refers to the number of buckets in the hash. The larger the number, the fewer collisions. (default: 1100000)
- analyzer: ‘word’, ‘char’, or ‘char_wb’. Whether the features should be made of word or character n-grams. The ‘char_wb’ option creates character n-grams only from text inside word boundaries. (default: ‘word’)
- ngram_range: tuple (min_n, max_n). The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. (default: (1, 1))
- stop_words: “english” or “None”. Remove stop words from the resulting tokens. Only applies to the “word” analyzer. If “english”, a built-in stop word list for English is used. (default: “None”)
- n_jobs: The maximum number of concurrently running jobs (default: 1)
- norm: The normalization to use after the feature weighting (‘None’, ‘l1’, ‘l2’) (default: ‘None’)
- chunk_size: The number of documents simultaneously processed by a running job (default: 5000)
- binary: If set to True, all non-zero counts are set to 1. (default: True)
- use_idf: Enable inverse-document-frequency reweighting (default: False).
- sublinear_tf: Apply sublinear tf scaling, i.e. replace tf with log(1 + tf) (default: False).
- use_hashing: Enable hashing. This option must be set to True for classification and set to False for clustering. (default: True)
- min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
- max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
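
As an illustration of how the parameters above might be assembled into a request, here is a minimal sketch using only the Python standard library. The server address (`http://localhost:5001`), the `build_request` helper, the `data/my_collection` path, and the choice of a JSON-encoded body are all assumptions made for this example; the document specifies only the path, the method, and the parameter names.

```python
import json
import urllib.request

# Hypothetical server address; the document gives only the path.
BASE_URL = "http://localhost:5001"
ENDPOINT = "/api/v0/feature-extraction/"


def build_request(data_dir, **options):
    """Build (but do not send) a POST request for the endpoint.

    Assumes the server accepts a JSON body; the document does not
    state which encoding is expected.
    """
    payload = {"data_dir": data_dir}  # the only required parameter
    payload.update(options)           # optional parameters, e.g. n_features
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: word unigrams with hashing enabled (the documented defaults).
req = build_request(
    "data/my_collection",   # hypothetical relative path
    analyzer="word",
    ngram_range=[1, 1],
    use_hashing=True,
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req)
```

Parameters that are omitted are left out of the body, so the server-side defaults listed above would apply.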

Success Response: HTTP 200

 {"id": <str>, "filenames": <list[str]>}

Error Response: HTTP 422
```
 {"error": "Some error message"}
```
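
A client might distinguish the two documented responses along the following lines. This is a sketch only: `parse_response` and `FeatureExtractionError` are hypothetical helpers, not part of the API, and the sample id and filenames are made-up values used to exercise the code.

```python
import json


class FeatureExtractionError(Exception):
    """Raised for an HTTP 422 reply (hypothetical helper, not part of the API)."""


def parse_response(status, body):
    """Return (id, filenames) on HTTP 200; raise on HTTP 422 or anything else."""
    data = json.loads(body)
    if status == 200:
        return data["id"], data["filenames"]
    if status == 422:
        raise FeatureExtractionError(data["error"])
    raise ValueError(f"unexpected HTTP status: {status}")


# Illustrative success payload with made-up values.
dataset_id, filenames = parse_response(
    200, '{"id": "abc123", "filenames": ["a.txt", "b.txt"]}'
)
```

Keeping the returned `id` is presumably what lets later requests refer to the same processed dataset, although the document does not describe those follow-up calls here.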