freediscovery.text.FeatureVectorizer

class freediscovery.text.FeatureVectorizer(cache_dir=u'/tmp/', dsid=None, verbose=False)[source]

Extract features from text documents

Parameters:
  • cache_dir (str, default='/tmp/') – directory in which to save temporary and regression files
  • dsid (str) – load an existing dataset
  • verbose (bool) – print progress messages
__init__(cache_dir=u'/tmp/', dsid=None, verbose=False)[source]

Methods

__init__([cache_dir, dsid, verbose])
delete() Delete the current dataset
get_params() Get the vectorizer parameters
list_datasets() List all datasets in the working directory
load(dsid) Load computed features from disk
preprocess(data_dir[, file_pattern, ...]) Initialize the feature extraction.
query_features(indices[, n_top_words]) Query the features with the most weight
search(filenames) Return the document ids that correspond to the provided filenames, without preserving order.
transform() Run the feature extraction

Attributes

n_samples_ Number of documents in the dataset
delete()[source]

Delete the current dataset

get_params()[source]

Get the vectorizer parameters

list_datasets()[source]

List all datasets in the working directory

load(dsid)[source]

Load computed features from disk

n_samples_

Number of documents in the dataset

preprocess(data_dir, file_pattern='.*', dir_pattern='.*', n_features=11000000, chunk_size=5000, analyzer='word', ngram_range=(1, 1), stop_words='None', n_jobs=1, use_idf=False, sublinear_tf=False, binary=True, use_hashing=True, norm=None, min_df=0.0, max_df=1.0)[source]

Initialize the feature extraction. See sklearn.feature_extraction.text for a detailed description of the input parameters.
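The defaults use_hashing=True, n_features=11000000 and binary=True suggest that tokens are hashed into a fixed number of columns and that only their presence is recorded. The following is a minimal sketch of that idea in plain Python, not the FreeDiscovery or scikit-learn implementation. The function name hashed_binary_features is hypothetical, and zlib.crc32 is an assumed stand-in for the hash function (scikit-learn's HashingVectorizer uses MurmurHash3).

```python
import zlib


def hashed_binary_features(text, n_features=16):
    """Return the sorted column indices that are 'on' for a document.

    Hypothetical helper for illustration only: each word token is hashed
    into one of n_features columns, and only presence is kept (binary).
    """
    columns = set()
    for token in text.lower().split():
        # crc32 is a stand-in hash; the real vectorizer uses a different one.
        columns.add(zlib.crc32(token.encode("utf-8")) % n_features)
    return sorted(columns)


doc_a = hashed_binary_features("the cat sat on the mat")
doc_b = hashed_binary_features("the mat")
```

Because the number of columns is fixed in advance, no vocabulary has to be stored, but distinct tokens can collide in the same column. A large n_features keeps such collisions rare.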

query_features(indices, n_top_words=10)[source]

Query the features with the most weight
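The description does not say how weights are combined across documents. One plausible reading, sketched below in plain Python, is that the weights of each term are summed over the selected documents and the n_top_words terms with the largest totals are returned. The function top_weighted_terms and the sample data are hypothetical and do not reflect how FreeDiscovery computes its result.

```python
import heapq


def top_weighted_terms(doc_term_weights, indices, n_top_words=10):
    """Return the terms with the largest summed weight over the given documents.

    doc_term_weights is a list with one {term: weight} dict per document.
    """
    totals = {}
    for i in indices:
        for term, weight in doc_term_weights[i].items():
            totals[term] = totals.get(term, 0.0) + weight
    # Pick the n_top_words terms with the highest accumulated weight.
    return heapq.nlargest(n_top_words, totals, key=totals.get)


docs = [
    {"contract": 0.9, "price": 0.2},
    {"contract": 0.4, "meeting": 0.7},
    {"lunch": 1.0},
]
top = top_weighted_terms(docs, indices=[0, 1], n_top_words=2)
```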

search(filenames)[source]

Return the document ids that correspond to the provided filenames, without preserving order.

Parameters: filenames (list[str]) – list of filenames (relative to the data_dir)
Returns: indices – corresponding list of document ids (order is not preserved)
Return type: array[int]
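To illustrate what "order is not preserved" means for the caller, here is a minimal sketch in plain Python. It assumes that document ids are positions in the dataset's file list. The helper find_document_ids is hypothetical and is not the library's implementation.

```python
def find_document_ids(all_filenames, wanted):
    """Return the positions of the wanted filenames, in dataset order."""
    wanted = set(wanted)
    return [doc_id for doc_id, name in enumerate(all_filenames) if name in wanted]


all_filenames = ["a.txt", "b.txt", "c.txt", "d.txt"]
# The request lists d.txt first, but the result follows the dataset order.
ids = find_document_ids(all_filenames, ["d.txt", "b.txt"])
matched = [all_filenames[i] for i in ids]
```

Since the returned ids need not line up with the input list, a caller that needs the filename for each id should look it up again from the id, as done for matched above, rather than pair inputs and outputs by position.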
transform()[source]

Run the feature extraction
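The method descriptions suggest a two-step workflow: preprocess() sets up the extraction and transform() runs it. The sketch below assumes that order and uses only the names documented on this page. The wrapper function run_feature_extraction and the data_dir argument are hypothetical, and the return values of preprocess() and transform() are deliberately ignored because they are not described here.

```python
def run_feature_extraction(data_dir, cache_dir="/tmp/"):
    """Extract features from the documents under data_dir; return their count."""
    # Imported inside the function so the sketch can be defined
    # without FreeDiscovery installed.
    from freediscovery.text import FeatureVectorizer

    fe = FeatureVectorizer(cache_dir=cache_dir)
    # Assumed order: configure the extraction first, then run it.
    fe.preprocess(data_dir)
    fe.transform()
    # n_samples_ is documented as the number of documents in the dataset.
    return fe.n_samples_
```

A call such as run_feature_extraction("/path/to/documents") would then be expected to report how many documents were processed, provided the directory exists and FreeDiscovery is installed.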