freediscovery.categorization.Categorizer

class freediscovery.categorization.Categorizer(cache_dir=u'/tmp/', dsid=None, mid=None, cv_scoring=u'roc_auc', cv_n_folds=3)[source]

Document categorization model

Feature extraction must be run with use_hashing=True. The recommended options also include use_idf=1, sublinear_tf=0 and binary=0.
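As a rough illustration of what the recommended options select, the sketch below computes the weight of a single term. It assumes these options carry the same meaning as the identically named scikit-learn options and uses scikit-learn's default smoothed IDF formula; whether FreeDiscovery applies exactly this weighting is not stated here. The function name term_weight is hypothetical, and hashing and normalization are left out.

```python
import math


def term_weight(tf, df, n_docs, use_idf=True, sublinear_tf=False, binary=False):
    """Hypothetical helper: weight of one term in one document.

    tf     -- raw count of the term in the document
    df     -- number of documents that contain the term
    n_docs -- total number of documents
    Assumption: option semantics follow scikit-learn's TF-IDF options.
    """
    if tf == 0:
        return 0.0
    if binary:
        tf = 1  # keep only presence/absence of the term
    weight = float(tf)
    if sublinear_tf:
        weight = 1.0 + math.log(weight)  # dampen repeated occurrences
    if use_idf:
        # smoothed inverse document frequency (scikit-learn default form)
        weight *= math.log((1 + n_docs) / (1 + df)) + 1.0
    return weight
```

With the recommended settings (use_idf=1, sublinear_tf=0, binary=0) the raw term count is kept and only rescaled by the IDF factor, so a term that occurs in every document keeps a weight equal to its count.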

Parameters:
  • cache_dir (str) – folder where the model will be saved
  • dsid (str, optional) – dataset id
  • mid (str, optional) – model id
  • cv_scoring (str, optional, default='roc_auc') – scoring metric used for cross-validation (cf. scikit-learn)
  • cv_n_folds (int, optional, default=3) – number of folds used for K-fold cross-validation
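To make the role of cv_n_folds concrete, here is a minimal sketch of how K-fold cross-validation partitions a training set. It is a generic, unshuffled split written in plain Python; kfold_indices is a hypothetical name, and how the library itself builds its folds (for example with stratification) is not stated here.

```python
def kfold_indices(n_samples, n_folds=3):
    """Hypothetical helper: yield (train, test) index lists, one pair per fold.

    Folds are consecutive and unshuffled; the first n_samples % n_folds
    folds receive one extra sample so that every sample is tested once.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop
```

Each fold trains on the samples outside the test slice and is scored on the slice itself; the per-fold scores (here the metric named by cv_scoring) are then typically averaged.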
__init__(cache_dir=u'/tmp/', dsid=None, mid=None, cv_scoring=u'roc_auc', cv_n_folds=3)[source]

Methods

__init__([cache_dir, dsid, mid, cv_scoring, ...])
delete() Delete a trained model
get_dsid(cache_dir, mid)
get_params() Get model parameters
get_path(mid)
list_models()
predict([chunk_size]) Predict the relevance using a previously trained model
train(index, y[, method, cv]) Train the categorization model
delete()[source]

Delete a trained model

get_params()[source]

Get model parameters

predict(chunk_size=5000)[source]

Predict the relevance using a previously trained model

Parameters:
  • chunk_size (int, default=5000) – chunk size
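The chunking behind predict is not described further here. A minimal sketch, assuming chunk_size bounds how many documents are scored in one pass, could look like the following; iter_chunks is a hypothetical helper and not part of the library.

```python
def iter_chunks(n_samples, chunk_size=5000):
    """Hypothetical helper: yield (start, stop) bounds covering n_samples.

    Assumption: documents are processed in consecutive slices of at most
    chunk_size items, so memory use stays bounded on large datasets.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)
```

Under this assumption, a dataset of 12000 documents with the default chunk size would be scored in three passes, the last one covering the remaining 2000 documents.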
train(index, y, method=u'LinearSVC', cv=None)[source]

Train the categorization model

Parameters:
  • index (array-like, shape (n_samples)) – document indices of the training set
  • y (array-like, shape (n_samples)) – target binary class relative to index
  • method (str, optional, default='LinearSVC') – the ML algorithm to use (one of 'LogisticRegression', 'LinearSVC', 'xgboost')
  • cv (str, optional) – use cross-validation
Returns:

  • cmod (sklearn.BaseEstimator) – the scikit-learn classifier object
  • Y_train (array-like, shape (n_samples)) – training predictions
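The constraints on the train arguments can be checked before a model is fitted. The sketch below is a hypothetical pre-flight check and not part of the library: it assumes that the binary labels are encoded as 0 and 1 and that both classes must be present, neither of which is stated above.

```python
ALLOWED_METHODS = ("LogisticRegression", "LinearSVC", "xgboost")


def check_train_args(index, y, method="LinearSVC"):
    """Hypothetical helper: validate arguments intended for train().

    Assumptions: labels are encoded as 0/1 and both classes occur.
    Returns the inputs as plain lists when every check passes.
    """
    index, y = list(index), list(y)
    if len(index) != len(y):
        raise ValueError("index and y must have the same length (n_samples)")
    if set(y) != {0, 1}:
        raise ValueError("y must be binary and contain both classes 0 and 1")
    if method not in ALLOWED_METHODS:
        raise ValueError("method must be one of %s" % ", ".join(ALLOWED_METHODS))
    return index, y
```

For example, check_train_args([12, 40, 7], [1, 0, 1]) returns the two lists unchanged, while a misspelled method name or labels of unequal length raise ValueError before any training work starts.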