freediscovery.lsi.LSI

class freediscovery.lsi.LSI(cache_dir='/tmp/', dsid=None, mid=None, verbose=False)

Document categorization using Latent Semantic Indexing (LSI)

Parameters:
  • cache_dir (str) – folder where the model will be saved
  • dsid (str) – dataset id
  • mid (str) – LSI model id (the dataset id will be inferred)
  • verbose (bool, optional) – print progress messages

__init__(cache_dir='/tmp/', dsid=None, mid=None, verbose=False)

Methods

  • __init__([cache_dir, dsid, mid, verbose])
  • delete() – Delete a trained model
  • get_dsid(cache_dir, mid)
  • get_params() – Get model parameters
  • get_path(mid)
  • list_models()
  • load(mid) – Load the computed features from cache specified by mid
  • predict(index, y[, accumulate, chunk_size]) – Predict the document relevance using a previously trained LSI model
  • transform(n_components[, n_iter]) – Perform the SVD decomposition

delete()

Delete a trained model

get_params()

Get model parameters

load(mid)

Load the computed features from cache specified by mid

predict(index, y, accumulate='nearest-max', chunk_size=100)

Predict the document relevance using a previously trained LSI model

Parameters:
  • index (array-like, shape (n_samples)) – document indices of the training set
  • y (array-like, shape (n_samples)) – target binary class relative to index
  • accumulate (str, optional, default='nearest-max') – if accumulate == 'nearest-max', the cosine distance to the closest relevant/non-relevant document is used as the classification score; if accumulate == 'centroid-max', the centroid of the relevant documents is used as the query vector.
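
The difference between the two accumulate modes can be illustrated with a small NumPy sketch. This is not FreeDiscovery's implementation: the helper names cosine_similarity_matrix and score_documents are hypothetical, the toy matrix X stands in for documents already projected into LSI space, and for brevity only the relevant (y == 1) training documents are scored against, using cosine similarity.

```python
import numpy as np


def cosine_similarity_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T


def score_documents(X, index, y, accumulate="nearest-max"):
    """Hypothetical helper: score all rows of X against relevant training rows."""
    index = np.asarray(index)
    y = np.asarray(y)
    relevant = X[index[y == 1]]
    if accumulate == "nearest-max":
        # Score = similarity to the closest relevant training document.
        return cosine_similarity_matrix(X, relevant).max(axis=1)
    if accumulate == "centroid-max":
        # Score = similarity to the centroid of the relevant training documents.
        centroid = relevant.mean(axis=0, keepdims=True)
        return cosine_similarity_matrix(X, centroid)[:, 0]
    raise ValueError("unknown accumulate mode: %r" % accumulate)


# Toy data: 5 documents in a 2-dimensional "LSI" space (assumed values).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]])
index = [0, 1, 2]  # training documents
y = [1, 1, 0]      # 1 = relevant, 0 = non-relevant

nearest = score_documents(X, index, y, accumulate="nearest-max")
centroid = score_documents(X, index, y, accumulate="centroid-max")
print(nearest)
print(centroid)
```

With 'nearest-max' each document is compared against individual training documents, so a single close match is enough for a high score; with 'centroid-max' the training documents are first averaged into one query vector.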

transform(n_components, n_iter=5)

Perform the SVD decomposition

Parameters:
  • n_components (int) – number of selected singular values (number of LSI dimensions)
  • n_iter (int) – number of iterations for the stochastic SVD algorithm
Returns:
  • mid (str) – model id
  • lsi (BaseEstimator) – the TruncatedSVD object
  • exp_var (float) – the explained variance of the SVD decomposition
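
As a rough illustration of what n_components, n_iter, and the returned values correspond to, the sketch below calls scikit-learn's TruncatedSVD directly on a random toy matrix. It does not go through the LSI class, and reading exp_var as the sum of explained_variance_ratio_ is an assumption rather than something this page confirms.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# Toy stand-in for a document-term matrix: 20 documents, 50 terms.
X = rng.rand(20, 50)

n_components = 5  # number of LSI dimensions to keep
n_iter = 5        # iterations of the randomized SVD solver

lsi = TruncatedSVD(n_components=n_components, algorithm="randomized",
                   n_iter=n_iter, random_state=0)
X_lsi = lsi.fit_transform(X)  # documents projected into LSI space

# Assumption: exp_var is taken here as the total explained variance ratio.
exp_var = lsi.explained_variance_ratio_.sum()

print(X_lsi.shape)
print(exp_var)
```

Increasing n_components keeps more singular values and raises the explained variance, at the cost of a larger projected matrix.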