freediscovery.text.FeatureVectorizer
class freediscovery.text.FeatureVectorizer(cache_dir=u'/tmp/', dsid=None, verbose=False)
Extract features from text documents
Parameters:
- cache_dir (str, default='/tmp/') – directory where temporary and regression files are saved
- dsid (str) – load an existing dataset
- verbose (bool) – print progress messages
Methods
__init__([cache_dir, dsid, verbose])
delete() – Delete the current dataset
get_params() – Get the vectorizer parameters
list_datasets() – List all datasets in the working directory
load(dsid) – Load computed features from disk
preprocess(data_dir[, file_pattern, ...]) – Initialize the feature extraction.
query_features(indices[, n_top_words]) – Query the features with most weight
search(filenames) – Return the document ids that correspond to the provided filenames, without preserving order.
transform() – Run the feature extraction
Attributes
n_samples_ – Number of documents in the dataset
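The method summaries suggest a two-step workflow: preprocess configures the extraction and transform runs it. A minimal sketch of that sequence follows. The helper name extract_features is hypothetical, and the vectorizer class is passed in as an argument so the sketch does not depend on the import succeeding. Return values of preprocess and transform are not documented on this page, so they are ignored here. It is also an assumption that n_samples_ is populated once transform has run, and that file_pattern is a regular expression.

```python
def extract_features(vectorizer_cls, data_dir, cache_dir="/tmp/"):
    """Run preprocess followed by transform and report the dataset size.

    vectorizer_cls is expected to be freediscovery.text.FeatureVectorizer
    (hypothetical helper; not part of the library).
    """
    fe = vectorizer_cls(cache_dir=cache_dir, verbose=False)
    # Configure the extraction; keyword names follow the preprocess signature.
    fe.preprocess(data_dir, file_pattern=r".*\.txt", use_hashing=True)
    # Run the feature extraction on the files selected above.
    fe.transform()
    # Assumption: n_samples_ is available after transform() has run.
    return fe, fe.n_samples_
```

With the library installed, this would be called as extract_features(FeatureVectorizer, "/path/to/documents") after importing FeatureVectorizer from freediscovery.text.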
preprocess(data_dir, file_pattern='.*', dir_pattern='.*', n_features=11000000, chunk_size=5000, analyzer='word', ngram_range=(1, 1), stop_words='None', n_jobs=1, use_idf=False, sublinear_tf=False, binary=True, use_hashing=True, norm=None, min_df=0.0, max_df=1.0)
Initialize the feature extraction. See sklearn.feature_extraction.text for a detailed description of the input parameters.
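To give a feel for what the defaults use_hashing=True, binary=True, analyzer='word' and ngram_range=(1, 1) imply, here is a small standard-library sketch of the hashing trick with binary weights. It is an illustration only, not the code FreeDiscovery or scikit-learn runs: zlib.crc32 stands in for the MurmurHash3 function used by scikit-learn's HashingVectorizer, the tokenizer (lowercased words of two or more characters) is an assumption, and hashed_binary_features is a hypothetical name.

```python
import re
import zlib


def hashed_binary_features(text, n_features=16):
    """Return the sorted column indices that are set to 1 for this document."""
    # Word analyzer with unigrams: lowercase, keep tokens of 2+ word characters.
    tokens = re.findall(r"\w\w+", text.lower())
    # Hash each token to a column index. With binary weights only presence
    # matters, so repeated tokens collapse into a single active column.
    return sorted({zlib.crc32(tok.encode("utf-8")) % n_features for tok in tokens})
```

Because no vocabulary is stored, the number of columns is fixed up front by n_features, and distinct tokens can collide in the same column; a large value such as the default of 11000000 keeps such collisions rare.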
search(filenames)
Return the document ids that correspond to the provided filenames, without preserving order.
Parameters: filenames (list[str]) – list of filenames (relative to data_dir)
Returns: indices – corresponding list of document ids (order is not preserved)
Return type: array[int]
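The "order is not preserved" caveat can be illustrated with a plain-Python sketch. It assumes that a document id is the position of the file in the dataset's filename list. That is an assumption made for this example, and search_filenames is a hypothetical helper, not the library's implementation.

```python
def search_filenames(dataset_filenames, query_filenames):
    """Return the ids of documents whose filename is among the queried names."""
    wanted = set(query_filenames)
    # Iterating in dataset order means the ids come back sorted by position,
    # not in the order in which the caller listed the filenames.
    return [idx for idx, name in enumerate(dataset_filenames) if name in wanted]


ids = search_filenames(["a.txt", "b.txt", "c.txt"], ["c.txt", "a.txt"])
print(ids)  # [0, 2]
```

A caller that needs the ids aligned with the query order would have to build a filename-to-id mapping and look up each name in turn.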