freediscovery.text.FeatureVectorizer

class freediscovery.text.FeatureVectorizer(cache_dir=u'/tmp/', dsid=None, verbose=False)

Extract features from text documents

Parameters:	cache_dir (str, default='/tmp/') – directory in which to save temporary and regression files
	dsid (str) – load an existing dataset
	verbose (bool) – print progress messages

Methods

`__init__`([cache_dir, dsid, verbose])
`delete`()	Delete the current dataset
`get_params`()	Get the vectorizer parameters
`list_datasets`()	List all datasets in the working directory
`load`(dsid)	Load computed features from disk
`preprocess`(data_dir[, file_pattern, ...])	Initialize the feature extraction.
`query_features`(indices[, n_top_words])	Query the features with most weight
`search`(filenames)	Return the document ids that correspond to the provided filenames, without preserving order.
`transform`()	Run the feature extraction

Attributes

`n_samples_`	Number of documents in the dataset

preprocess(data_dir, file_pattern='.*', dir_pattern='.*', n_features=11000000, chunk_size=5000, analyzer='word', ngram_range=(1, 1), stop_words='None', n_jobs=1, use_idf=False, sublinear_tf=False, binary=True, use_hashing=True, norm=None, min_df=0.0, max_df=1.0)

Initialize the feature extraction. See sklearn.feature_extraction.text for a detailed description of the input parameters.
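The defaults `use_hashing=True` and `binary=True` suggest a hashing-trick representation in which each token maps to a fixed column index and only its presence is recorded. The sketch below illustrates that idea in plain Python; it is not FreeDiscovery's or scikit-learn's implementation. The helper name `hash_features`, the use of MD5 (swapped in for the MurmurHash3 that scikit-learn's `HashingVectorizer` uses), the tokenizer regex, and the smaller `n_features` default are all assumptions made for illustration.

```python
import hashlib
import re


def hash_features(text, n_features=2**20, binary=True):
    """Map a document to {column_index: value} with the hashing trick.

    Hypothetical helper for illustration only. MD5 stands in for the
    hash function a real vectorizer would use.
    """
    features = {}
    # Lowercase and keep tokens of two or more word characters.
    for token in re.findall(r"\w\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Fold the first 8 bytes of the digest into the column range.
        column = int.from_bytes(digest[:8], "big") % n_features
        if binary:
            # Record presence only, regardless of how often the token occurs.
            features[column] = 1
        else:
            # Accumulate raw term counts.
            features[column] = features.get(column, 0) + 1
    return features


doc = "the cat sat on the mat"
binary_vec = hash_features(doc)
count_vec = hash_features(doc, binary=False)
```

With `binary=True` every stored value is 1, so the repeated token "the" contributes no more than a token seen once. With `binary=False` the values are raw counts and sum to the number of tokens in the document.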

query_features(indices, n_top_words=10)

Query the features with the most weight

search(filenames)

Return the document ids that correspond to the provided filenames, without preserving order.

Parameters:	filenames (list[str]) – list of filenames (relative to data_dir)
Returns:	indices – corresponding list of document ids (order is not preserved)
Return type:	array[int]
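The "order is not preserved" caveat is easy to overlook, so here is a minimal sketch of what such a lookup could look like. It is not the library's code: `search_filenames` is a hypothetical stand-in, document ids are assumed to be positions in the indexed file list, and raising `KeyError` for unknown filenames is a choice made for this sketch (the documentation above does not say how missing files are handled).

```python
def search_filenames(indexed_filenames, query_filenames):
    """Return the ids of the query filenames within indexed_filenames.

    Hypothetical helper: ids are assumed to be list positions. The result
    follows the index order, not the order of query_filenames.
    """
    wanted = set(query_filenames)
    # Assumption of this sketch: unknown filenames are reported as an error.
    missing = wanted - set(indexed_filenames)
    if missing:
        raise KeyError("filenames not in the dataset: %s" % sorted(missing))
    # Scan the index once; matches come out in index order.
    return [idx for idx, name in enumerate(indexed_filenames) if name in wanted]


indexed = ["a/doc1.txt", "a/doc2.txt", "b/doc3.txt"]
result = search_filenames(indexed, ["b/doc3.txt", "a/doc1.txt"])
print(result)  # [0, 2]
```

Because the returned ids follow the index order, callers that need to pair each id with its query filename should look the filename up from the id rather than rely on position.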