Scaling Benchmarks

This page summarizes the performance and scaling of the algorithms used in FreeDiscovery.

Benchmarks are computed by running the examples on the TREC 2009 corpus of 700,000 documents (1.5 GB compressed, 7 GB uncompressed). The following benchmarks were obtained on an Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz server (4 CPU cores) with 16 GB of RAM. The time complexities are approximate experimental estimates for the given parameters.

Near Duplicates Detection

The examples/duplicate_detection_example.py script was run with dataset_name='legal09int'.

Method    Parameters                                     Time (s)  Complexity
--------  ---------------------------------------------  --------  ---------------------------
DBSCAN    eps=0.1, n_max_samples=2, lsi_components=100   3800      O(n_samples*log(n_samples))
I-Match   n_rand_lexicons=10, rand_lexicon_ratio=0.9     680       O(n_samples)
Simhash   distance=1                                     270       O(n_samples)

where n_samples is the number of documents in the dataset.
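
To make the table concrete, here is a minimal sketch of the DBSCAN approach assembled from scikit-learn (on which FreeDiscovery builds). It illustrates the pipeline, not the examples/duplicate_detection_example.py script itself; the toy corpus and the 2-component LSI projection are stand-ins for the benchmark settings.

    # Minimal sketch of DBSCAN-based near-duplicate detection using
    # scikit-learn; an illustration, not FreeDiscovery's implementation.
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import Normalizer

    docs = [
        "the quick brown fox jumps over the lazy dog",
        "over the lazy dog the quick brown fox jumps",  # same words, reordered
        "an entirely different document about scaling benchmarks",
    ]

    # TF-IDF features followed by an LSI projection; the benchmark uses
    # lsi_components=100, reduced here to 2 for this toy corpus.
    # Normalizing rows makes the Euclidean eps act like a cosine threshold.
    X = TfidfVectorizer().fit_transform(docs)
    lsi = TruncatedSVD(n_components=2, random_state=0)
    X_lsi = Normalizer(copy=False).fit_transform(lsi.fit_transform(X))

    # eps=0.1 / min_samples=2 mirror the eps / n_max_samples parameters above.
    labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(X_lsi)
    print(labels)  # [0, 0, -1]: docs 0 and 1 form a duplicate cluster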
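
I-Match can be sketched generically as follows (again an illustration, not FreeDiscovery's implementation): each document is hashed on the sorted intersection of its terms with a lexicon, and a number of randomly subsampled lexicons makes the signature tolerant to small edits. The mapping of n_rand_lexicons=10 and rand_lexicon_ratio=0.9 onto the lexicon count and subsampling fraction is an assumption, and the hand-picked lexicon stands in for the mid-IDF vocabulary used in practice.

    # Generic I-Match sketch: hash each document's sorted lexicon terms;
    # equal hashes on any lexicon flag the documents as duplicates.
    import hashlib
    import random
    from collections import defaultdict

    def imatch_signatures(doc_terms, lexicon, n_rand_lexicons=10, ratio=0.9, seed=0):
        # The fixed seed keeps the random lexicons identical across documents,
        # which is required for their hashes to be comparable.
        rng = random.Random(seed)
        lexicons = [lexicon] + [
            set(rng.sample(sorted(lexicon), int(ratio * len(lexicon))))
            for _ in range(n_rand_lexicons)
        ]
        return [
            hashlib.sha1(" ".join(sorted(doc_terms & lex)).encode()).hexdigest()
            for lex in lexicons
        ]

    docs = {
        "a": {"quick", "brown", "fox", "lazy", "dog"},
        "b": {"quick", "brown", "fox", "lazy", "dog", "again"},  # near duplicate
        "c": {"scaling", "benchmarks", "corpus"},
    }
    # Hand-picked stand-in for the mid-IDF vocabulary used in practice.
    lexicon = {"quick", "brown", "fox", "lazy", "dog", "scaling", "benchmarks"}

    buckets = defaultdict(set)
    for name, terms in docs.items():
        for sig in imatch_signatures(terms, lexicon):
            buckets[sig].add(name)

    duplicates = {frozenset(group) for group in buckets.values() if len(group) > 1}
    print(duplicates)  # {frozenset({'a', 'b'})}: a and b agree on every lexicon hash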
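
Simhash can likewise be sketched generically (an illustration, not FreeDiscovery's implementation): each document receives a 64-bit fingerprint, and with distance=1 candidate duplicates are found by probing the fingerprint itself plus each of its 64 single-bit flips, which keeps the whole pass linear in n_samples.

    # Generic simhash sketch: order-independent 64-bit fingerprints,
    # with lookups at Hamming distance <= 1.
    import hashlib
    from collections import defaultdict

    def simhash(tokens, n_bits=64):
        # Sum each token hash's bits as +1/-1; the fingerprint keeps
        # the sign of each bit-wise total.
        counts = [0] * n_bits
        for tok in tokens:
            h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
            for i in range(n_bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    docs = {
        "a": "the quick brown fox jumps over the lazy dog".split(),
        "b": "over the lazy dog the quick brown fox jumps".split(),  # reordered
        "c": "an entirely different document about benchmarks".split(),
    }

    index = defaultdict(set)
    for name, toks in docs.items():
        index[simhash(toks)].add(name)

    # With distance=1, candidates are the exact fingerprint match plus
    # every single-bit flip, so no pairwise comparison is needed.
    pairs = set()
    for fp, names in index.items():
        candidates = set(names)
        for i in range(64):
            candidates |= index.get(fp ^ (1 << i), set())
        pairs |= {frozenset((x, y)) for x in names for y in candidates if x != y}
    print(pairs)  # {frozenset({'a', 'b'})}: identical bags of words collide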

Other benchmarks will be added shortly.