Scaling Benchmarks
==================

This page summarizes the performance and scaling of the algorithms used in
FreeDiscovery. Benchmarks are computed by running the
`examples <./examples/index.html>`_ on the TREC 2009 corpus of 700 000
documents (1.5 GB compressed, 7 GB uncompressed).

The following benchmarks were obtained on an Intel(R) Xeon(R) CPU E3-1225 V2
@ 3.20GHz server (4 CPU cores) with 16 GB of RAM. The time complexities are
approximate, estimated experimentally for the given parameters.

Near Duplicates Detection
-------------------------

The `examples/duplicate_detection_example.py <./examples/duplicate_detection_example.html>`_
script with `dataset_name='legal09int'` was used:

+---------+---------------------------+-----------+---------------------------------+
| Method  | Parameters                | Time (s)  | Complexity                      |
+=========+===========================+===========+=================================+
|         | - `eps=0.1`               |           |                                 |
| DBSCAN  | - `n_max_samples=2`       | 3800      | `O(n_samples*log(n_samples))`   |
|         | - `lsi_components=100`    |           |                                 |
+---------+---------------------------+-----------+---------------------------------+
| I-Match | - `n_rand_lexicons=10`    | 680       | `O(n_samples)`                  |
|         | - `rand_lexicon_ratio=0.9`|           |                                 |
+---------+---------------------------+-----------+---------------------------------+
| Simhash | - `distance=1`            | 270       | `O(n_samples)`                  |
+---------+---------------------------+-----------+---------------------------------+

where `n_samples` is the number of documents in the dataset.

Other benchmarks will be added shortly.
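
For orientation, the sketch below shows what a DBSCAN-based near-duplicate
detection pipeline of this kind can look like when written directly against
scikit-learn (which FreeDiscovery builds on), rather than through the
FreeDiscovery API. The toy corpus, the reduced `n_components=5` (the benchmark
above used `lsi_components=100`), and the use of scikit-learn's `min_samples`
in place of `n_max_samples` are illustrative assumptions, not the exact
benchmarked code.

.. code-block:: python

    # Illustrative sketch only: scikit-learn building blocks, not the
    # FreeDiscovery API or the exact code used for the benchmark above.
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # Toy corpus: documents 0/1 and 2/3 differ only in formatting.
    documents = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick brown fox jumps over the lazy dog",
        "FreeDiscovery computes document similarities at scale.",
        "freediscovery computes document similarities at scale",
        "An unrelated memo about quarterly financial results.",
        "Scaling benchmarks for the TREC 2009 legal collection.",
    ]

    # TF-IDF features projected into a small LSI space; the benchmark used
    # lsi_components=100, reduced here to fit the toy corpus.
    lsi = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=5, random_state=0),
        Normalizer(copy=False),
    )
    X = lsi.fit_transform(documents)

    # Documents within cosine distance eps of each other are grouped together;
    # eps=0.1 matches the table above, and min_samples=2 plays a role similar
    # to n_max_samples=2.  Label -1 marks documents with no near duplicate.
    labels = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit_predict(X)
    print(labels)  # e.g. [ 0  0  1  1 -1 -1]

Documents sharing a non-negative label form a near-duplicate group; increasing
`eps` merges more loosely similar documents into the same group.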