Scaling Benchmarks

This page summarizes the performance and scaling of the algorithms used in FreeDiscovery.

Benchmarks were computed by running the examples on the TREC 2009 corpus of 700,000 documents (1.5 GB compressed, 7 GB uncompressed). The figures below were obtained on a server with an Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz (4 CPU cores) and 16 GB of RAM. The time complexities are approximate experimental estimates for the given parameters.

Document ingestion

Method       Parameters                          Time (s)   Complexity
Vectorizer   use_hashing=False, use_idf=False    780        O(n_samples)
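
FreeDiscovery builds on scikit-learn, so the ingestion step above is roughly equivalent to a non-hashed, term-frequency-only vectorization. The sketch below uses plain scikit-learn names and a placeholder two-document corpus, not FreeDiscovery's own API:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Roughly equivalent to use_hashing=False, use_idf=False: a sparse
    # term-frequency matrix without IDF reweighting (with use_hashing=True
    # one would reach for HashingVectorizer instead).
    documents = ["first example document", "second example document"]

    vectorizer = TfidfVectorizer(use_idf=False)
    X = vectorizer.fit_transform(documents)   # sparse (n_samples, n_features)
    print(X.shape)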

Preprocessing

Method   Parameters         Time (s)   Complexity
LSI      n_components=150   270        O(n_samples)
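
The LSI step corresponds to a truncated SVD of the document-term matrix. A rough sketch with scikit-learn (the component count is capped only so the toy corpus runs; the benchmark used n_components=150):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["first example document", "second example document",
                 "a third, rather different, document"]
    X = TfidfVectorizer(use_idf=False).fit_transform(documents)

    # n_components=150 in the benchmark; capped here for the toy corpus.
    n_comp = min(150, X.shape[0] - 1, X.shape[1] - 1)
    X_lsi = TruncatedSVD(n_components=n_comp).fit_transform(X)
    print(X_lsi.shape)   # dense (n_samples, n_comp) matrix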

Near Duplicates Detection

The examples/duplicate_detection_example.py script was used with dataset_name='legal09int'; illustrative sketches of each method follow the table.

Method    Parameters                                      Time (s)   Complexity
DBSCAN    eps=0.1, n_max_samples=2, lsi_components=100    3800       O(n_samples*log(n_samples))
I-Match   n_rand_lexicons=10, rand_lexicon_ratio=0.9      680        O(n_samples)
Simhash   distance=1                                      270        O(n_samples)

where n_samples is the number of documents in the dataset.
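
As a rough illustration of the DBSCAN-based deduplication, the sketch below uses scikit-learn directly; FreeDiscovery's n_max_samples and lsi_components are mapped onto DBSCAN's min_samples and TruncatedSVD's n_components, and the component count is capped so the toy corpus runs:

    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Two identical documents plus an unrelated one.
    documents = [
        "scaling benchmarks for near duplicate detection in freediscovery",
        "scaling benchmarks for near duplicate detection in freediscovery",
        "an unrelated text about something else entirely",
    ]
    X = TfidfVectorizer().fit_transform(documents)

    # lsi_components=100 in the benchmark; capped here for the toy corpus.
    n_comp = min(100, X.shape[0] - 1, X.shape[1] - 1)
    X_lsi = TruncatedSVD(n_components=n_comp).fit_transform(X)

    # eps=0.1 (cosine distance) with min_samples=2: documents that receive the
    # same non-negative cluster label are reported as near duplicates.
    labels = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit_predict(X_lsi)
    print(labels)   # e.g. [0, 0, -1]: the first two documents form a cluster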
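I-Match signatures can be sketched from scratch as follows; this is a simplified toy implementation, not FreeDiscovery's code, keeping the n_rand_lexicons and rand_lexicon_ratio names from the table:

    import hashlib
    import random

    def imatch_signatures(tokens, lexicon, n_rand_lexicons=10,
                          rand_lexicon_ratio=0.9, seed=0):
        # One signature per randomized sub-lexicon: a hash of the sorted set of
        # document terms falling inside that sub-lexicon.  Re-seeding the RNG
        # makes every document see the same sequence of sub-lexicons.
        rng = random.Random(seed)
        lexicon = sorted(lexicon)
        n_keep = int(len(lexicon) * rand_lexicon_ratio)
        sigs = []
        for _ in range(n_rand_lexicons):
            sub = set(rng.sample(lexicon, n_keep))
            kept = sorted(set(tokens) & sub)
            sigs.append(hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest())
        return sigs

    lexicon = {"quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
    a = imatch_signatures("the quick brown fox jumps over the lazy dog".split(), lexicon)
    b = imatch_signatures("a quick brown fox jumps over a lazy dog".split(), lexicon)
    # Terms outside the lexicon ("the" vs "a") do not affect the signatures, so
    # this pair shares all of them and is flagged as a near duplicate.
    print(any(sa == sb for sa, sb in zip(a, b)))   # True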
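Simhash fingerprints can likewise be sketched in a few lines; again a toy implementation rather than FreeDiscovery's, where the benchmark flags pairs whose fingerprints differ in at most distance=1 bit:

    import hashlib

    def simhash(tokens, n_bits=64):
        # Bitwise vote over per-token hashes: bit i of the fingerprint is set
        # when more token hashes have bit i set than unset.
        counts = [0] * n_bits
        for tok in tokens:
            h = int(hashlib.sha1(tok.encode("utf-8")).hexdigest(), 16)
            for i in range(n_bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    def hamming_distance(a, b):
        return bin(a ^ b).count("1")

    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown fox jumped over the lazy dog".split()
    dist = hamming_distance(simhash(doc_a), simhash(doc_b))
    # With distance=1 as in the benchmark, pairs whose fingerprints differ by
    # at most one bit are reported as near duplicates.
    print(dist, dist <= 1)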