freediscovery.categorization.Categorizer

class freediscovery.categorization.Categorizer(cache_dir=u'/tmp/', dsid=None, mid=None, cv_scoring=u'roc_auc', cv_n_folds=3)

Document categorization model

The option use_hashing=True must be set for the feature extraction. Recommended options also include use_idf=1, sublinear_tf=0 and binary=0.
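The use_hashing requirement above refers to hashed feature extraction. As a rough illustration of the general hashing trick (not FreeDiscovery's own implementation), the sketch below maps tokens to a fixed number of columns. The names `hashed_counts` and `n_features`, and the choice of CRC32 as the hash, are assumptions made for this example.

```python
import zlib


def hashed_counts(tokens, n_features=16):
    """Count tokens into a fixed-length vector using a stable hash.

    Hypothetical helper: illustrates the hashing trick in general,
    it is not part of the FreeDiscovery API.
    """
    vec = [0] * n_features
    for tok in tokens:
        # CRC32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, which is salted per process.
        col = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[col] += 1
    return vec


doc = ["contract", "email", "contract", "invoice"]
vec = hashed_counts(doc, n_features=8)
```

No vocabulary has to be stored, so the vector length stays fixed regardless of how many distinct tokens appear. The trade-off is that different tokens can collide in the same column.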

Parameters:

cache_dir (str) – folder where the model will be saved
dsid (str, optional) – dataset id
mid (str, optional) – model id
cv_scoring (str, optional, default='roc_auc') – score used for cross-validation, cf. scikit-learn
cv_n_folds (int, optional, default=3) – number of folds used for K-fold cross-validation
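To make the two cross-validation parameters concrete, here is a minimal standard-library sketch of what a 3-fold split and a ROC AUC score compute. It does not reproduce how FreeDiscovery or scikit-learn implement them. The helpers `k_fold_indices` and `roc_auc` are hypothetical names introduced for illustration.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Split range(n_samples) into n_folds contiguous (train, test) index lists."""
    # Fold sizes differ by at most one; the first folds absorb the remainder.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, test))
        start = stop
    return folds


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs both classes present")
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


folds = k_fold_indices(7, n_folds=3)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Each fold serves once as the held-out test set while the remaining folds are used for training, and the score is computed on the held-out part.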

__init__(cache_dir=u'/tmp/', dsid=None, mid=None, cv_scoring=u'roc_auc', cv_n_folds=3)

Methods

`__init__`([cache_dir, dsid, mid, cv_scoring, ...])
`delete`()	Delete a trained model
`get_dsid`(cache_dir, mid)
`get_params`()	Get model parameters
`get_path`(mid)
`list_models`()
`predict`([chunk_size])	Predict the relevance using a previously trained model
`train`(index, y[, method, cv])	Train the categorization model

delete()

Delete a trained model

get_params()

Get model parameters

predict(chunk_size=5000)

Predict the relevance using a previously trained model

Parameters:	chunk_size (int, default=5000) – chunk size
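The page does not describe how predict uses chunk_size internally. As a sketch of the general pattern, assuming the intent is to score the dataset in fixed-size batches so that only one batch is held in memory at a time, chunked prediction might look like the following. `iter_chunks`, `predict_in_chunks`, and `score_fn` are hypothetical names, not FreeDiscovery functions.

```python
def iter_chunks(n_samples, chunk_size=5000):
    """Yield (start, stop) bounds that cover range(n_samples) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_samples, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, n_samples)


def predict_in_chunks(score_fn, rows, chunk_size=5000):
    """Apply score_fn to successive slices of rows and concatenate the results."""
    out = []
    for start, stop in iter_chunks(len(rows), chunk_size):
        out.extend(score_fn(rows[start:stop]))
    return out


bounds = list(iter_chunks(12, chunk_size=5))
doubled = predict_in_chunks(lambda chunk: [2 * x for x in chunk], list(range(7)), chunk_size=3)
```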

train(index, y, method=u'LinearSVC', cv=None)

Train the categorization model

Parameters:

index (array-like, shape (n_samples)) – document indices of the training set
y (array-like, shape (n_samples)) – target binary class relative to index
method (str, default='LinearSVC') – the ML algorithm to use (one of 'LogisticRegression', 'LinearSVC', 'xgboost')
cv (str, optional, default=None) – use cross-validation

Returns:

cmod (sklearn.BaseEstimator) – the scikit-learn classifier object
Y_train (array-like, shape (n_samples)) – training predictions