freediscovery.categorization.Categorizer
class freediscovery.categorization.Categorizer(cache_dir=u'/tmp/', dsid=None, mid=None, cv_scoring=u'roc_auc', cv_n_folds=3)
Document categorization model
The option use_hashing=True must be set for feature extraction. Recommended options also include use_idf=1, sublinear_tf=0, binary=0.
Parameters: - cache_dir (str) – folder where the model will be saved
- dsid (str, optional) – dataset id
- mid (str, optional) – model id
- cv_scoring (str, optional, default='roc_auc') – scoring metric used for cross-validation (cf. scikit-learn)
- cv_n_folds (int, optional, default=3) – number of folds used for K-fold cross-validation
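To make the meaning of cv_n_folds concrete, here is a minimal pure-Python sketch of K-fold splitting. It is not the Categorizer implementation, which presumably relies on scikit-learn; kfold_indices is a hypothetical helper introduced only for illustration. With cross-validation, the cv_scoring metric would be computed on each held-out fold in turn.

```python
def kfold_indices(n_samples, n_folds=3):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs.

    Hypothetical helper for illustration; not part of freediscovery.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    # The first (n_samples % n_folds) folds receive one extra sample.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        stop = start + base + (1 if k < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, test))
        start = stop
    return folds


# Each sample lands in exactly one held-out fold.
folds = kfold_indices(10, n_folds=3)
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Every document is used for evaluation exactly once and for training in the remaining folds, which is why a larger cv_n_folds costs more training runs.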
Methods
__init__([cache_dir, dsid, mid, cv_scoring, ...])
delete() – Delete a trained model
get_dsid(cache_dir, mid)
get_params() – Get model parameters
get_path(mid)
list_models()
predict([chunk_size]) – Predict the relevance using a previously trained model
train(index, y[, method, cv]) – Train the categorization model
predict(chunk_size=5000)
Predict the relevance using a previously trained model
Parameters: chunk_size (int) – chunk size
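The idea behind a chunk size can be sketched as follows: the documents are scored in consecutive slices so that only one slice needs to be held in memory at a time. This is an assumption about the general technique, not a description of what predict does internally; iter_chunks, predict_in_chunks and score_fn are hypothetical names.

```python
def iter_chunks(n_samples, chunk_size=5000):
    """Yield (start, stop) bounds that cover range(n_samples) in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)


def predict_in_chunks(score_fn, documents, chunk_size=5000):
    """Apply score_fn to successive slices of documents and join the results."""
    scores = []
    for start, stop in iter_chunks(len(documents), chunk_size):
        scores.extend(score_fn(documents[start:stop]))
    return scores


# Toy stand-in for a trained model: score each document by its length.
docs = ["a", "bb", "ccc", "dddd", "eeeee"]
bounds = list(iter_chunks(len(docs), chunk_size=2))
chunked = predict_in_chunks(lambda batch: [len(d) for d in batch], docs, chunk_size=2)
```

The chunked result is the same as scoring everything in one call; the chunk size only trades memory use against the number of calls.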
train(index, y, method=u'LinearSVC', cv=None)
Train the categorization model
Parameters: - index (array-like, shape (n_samples)) – document indices of the training set
- y (array-like, shape (n_samples)) – target binary class relative to index
- method (str) – the ML algorithm to use (one of 'LogisticRegression', 'LinearSVC', 'xgboost')
- cv (str, optional) – use cross-validation
Returns: - cmod (sklearn.BaseEstimator) – the scikit-learn classifier object
- Y_train (array-like, shape (n_samples)) – training predictions
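The input contract described for train can be expressed as a small validation sketch. This is not code from freediscovery: check_train_args and ALLOWED_METHODS are hypothetical names, and encoding the binary target as 0/1 is an assumption, since the reference only says the target is binary.

```python
# Method names as listed in the train() parameter description.
ALLOWED_METHODS = ("LogisticRegression", "LinearSVC", "xgboost")


def check_train_args(index, y, method="LinearSVC"):
    """Validate train()-style arguments (hypothetical helper, not library code)."""
    index, y = list(index), list(y)
    # index and y are both documented with shape (n_samples).
    if len(index) != len(y):
        raise ValueError("index and y must have the same length")
    # Assumption: the binary target is encoded as 0 and 1.
    if not set(y) <= {0, 1}:
        raise ValueError("y must contain only the labels 0 and 1")
    if method not in ALLOWED_METHODS:
        raise ValueError("method must be one of %s" % ", ".join(ALLOWED_METHODS))
    return index, y


# Three training documents identified by index, with one label each.
index, y = check_train_args([3, 7, 12], [1, 0, 1], method="LinearSVC")
```

Checking the arguments up front gives a clear error before any model fitting starts, for example when a label is missing for one of the document indices.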