Scaling Benchmarks

This page summarizes the performance and scaling of the algorithms used in FreeDiscovery.

Benchmarks are computed by running the examples on the TREC 2009 corpus of 700,000 documents (1.5 GB compressed, 7 GB uncompressed). The following benchmarks were obtained on an Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz server (4 CPU cores) with 16 GB of RAM. The time complexities are approximate experimental estimates for the given parameters.

Near Duplicates Detection

The examples/duplicate_detection_example.py script was run with dataset_name='legal09int'.

Method    Parameters                                     Time (s)  Complexity
--------  ---------------------------------------------  --------  ---------------------------
DBSCAN    eps=0.1, n_max_samples=2, lsi_components=100   3800      O(n_samples*log(n_samples))
I-Match   n_rand_lexicons=10, rand_lexicon_ratio=0.9     680       O(n_samples)
Simhash   distance=1                                     270       O(n_samples)

where n_samples is the number of documents in the dataset.
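
To make the table concrete, here is a minimal sketch of the DBSCAN approach assembled from scikit-learn (on which FreeDiscovery builds). It illustrates the pipeline, not the examples/duplicate_detection_example.py script itself; the toy corpus and the 2-component LSI projection are stand-ins for the benchmark settings.

    # Minimal sketch of DBSCAN-based near-duplicate detection using
    # scikit-learn; an illustration, not FreeDiscovery's implementation.
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import Normalizer

    docs = [
        "the quick brown fox jumps over the lazy dog",
        "over the lazy dog the quick brown fox jumps",  # same words, reordered
        "an entirely different document about scaling benchmarks",
    ]

    # TF-IDF features followed by an LSI projection; the benchmark uses
    # lsi_components=100, reduced here to 2 for this toy corpus.
    # Normalizing rows makes the Euclidean eps act like a cosine threshold.
    X = TfidfVectorizer().fit_transform(docs)
    lsi = TruncatedSVD(n_components=2, random_state=0)
    X_lsi = Normalizer(copy=False).fit_transform(lsi.fit_transform(X))

    # eps=0.1 / min_samples=2 mirror the eps / n_max_samples parameters above.
    labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(X_lsi)
    print(labels)  # [0, 0, -1]: docs 0 and 1 form a duplicate cluster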
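
I-Match can be sketched generically as follows (again an illustration, not FreeDiscovery's implementation): each document is hashed on the sorted intersection of its terms with a lexicon, and a number of randomly subsampled lexicons makes the signature tolerant to small edits. The mapping of n_rand_lexicons=10 and rand_lexicon_ratio=0.9 onto the lexicon count and subsampling fraction is an assumption, and the hand-picked lexicon stands in for the mid-IDF vocabulary used in practice.

    # Generic I-Match sketch: hash each document's sorted lexicon terms;
    # equal hashes on any lexicon flag the documents as duplicates.
    import hashlib
    import random
    from collections import defaultdict

    def imatch_signatures(doc_terms, lexicon, n_rand_lexicons=10, ratio=0.9, seed=0):
        # The fixed seed keeps the random lexicons identical across documents,
        # which is required for their hashes to be comparable.
        rng = random.Random(seed)
        lexicons = [lexicon] + [
            set(rng.sample(sorted(lexicon), int(ratio * len(lexicon))))
            for _ in range(n_rand_lexicons)
        ]
        return [
            hashlib.sha1(" ".join(sorted(doc_terms & lex)).encode()).hexdigest()
            for lex in lexicons
        ]

    docs = {
        "a": {"quick", "brown", "fox", "lazy", "dog"},
        "b": {"quick", "brown", "fox", "lazy", "dog", "again"},  # near duplicate
        "c": {"scaling", "benchmarks", "corpus"},
    }
    # Hand-picked stand-in for the mid-IDF vocabulary used in practice.
    lexicon = {"quick", "brown", "fox", "lazy", "dog", "scaling", "benchmarks"}

    buckets = defaultdict(set)
    for name, terms in docs.items():
        for sig in imatch_signatures(terms, lexicon):
            buckets[sig].add(name)

    duplicates = {frozenset(group) for group in buckets.values() if len(group) > 1}
    print(duplicates)  # {frozenset({'a', 'b'})}: a and b agree on every lexicon hash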
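
Simhash can likewise be sketched generically (an illustration, not FreeDiscovery's implementation): each document receives a 64-bit fingerprint, and with distance=1 candidate duplicates are found by probing the fingerprint itself plus each of its 64 single-bit flips, which keeps the whole pass linear in n_samples.

    # Generic simhash sketch: order-independent 64-bit fingerprints,
    # with lookups at Hamming distance <= 1.
    import hashlib
    from collections import defaultdict

    def simhash(tokens, n_bits=64):
        # Sum each token hash's bits as +1/-1; the fingerprint keeps
        # the sign of each bit-wise total.
        counts = [0] * n_bits
        for tok in tokens:
            h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
            for i in range(n_bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    docs = {
        "a": "the quick brown fox jumps over the lazy dog".split(),
        "b": "over the lazy dog the quick brown fox jumps".split(),  # reordered
        "c": "an entirely different document about benchmarks".split(),
    }

    index = defaultdict(set)
    for name, toks in docs.items():
        index[simhash(toks)].add(name)

    # With distance=1, candidates are the exact fingerprint match plus
    # every single-bit flip, so no pairwise comparison is needed.
    pairs = set()
    for fp, names in index.items():
        candidates = set(names)
        for i in range(64):
            candidates |= index.get(fp ^ (1 << i), set())
        pairs |= {frozenset((x, y)) for x in names for y in candidates if x != y}
    print(pairs)  # {frozenset({'a', 'b'})}: identical bags of words collide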

Other benchmarks will be added shortly.