# Scaling Benchmarks
This page summarizes the performance and scaling of the algorithms used in FreeDiscovery.
Benchmarks were obtained by running the examples on the TREC 2009 corpus of 700,000 documents (1.5 GB compressed, 7 GB uncompressed), on a server with an Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz (4 CPU cores) and 16 GB of RAM. The time complexities are approximate experimental estimates for the given parameters.
## Document ingestion
| Method     | Parameters | Time (s) | Complexity   |
|------------|------------|----------|--------------|
| Vectorizer |            | 780      | O(n_samples) |
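
FreeDiscovery's ingestion builds on scikit-learn's text vectorizers. The following is a minimal sketch of the equivalent step, not FreeDiscovery's actual API; the `data_dir` folder of plain-text files is a hypothetical example:

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# "data_dir" is a hypothetical folder of plain-text documents, one per file.
docs = [p.read_text(errors="ignore") for p in sorted(Path("data_dir").glob("*.txt"))]

# TF-IDF vectorization makes a single pass over the corpus, consistent with
# the O(n_samples) complexity reported above.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs)  # sparse (n_samples, n_features) matrix
```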
## Preprocessing
| Method | Parameters | Time (s) | Complexity   |
|--------|------------|----------|--------------|
| LSI    |            | 270      | O(n_samples) |
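
LSI here amounts to a truncated SVD of the TF-IDF matrix. A minimal sketch with scikit-learn, reusing `X` from the ingestion sketch above; the number of components is an assumed value, not the benchmark's actual setting:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# n_components=150 is an assumed value. Randomized SVD scales roughly
# linearly with the number of documents, matching the table above.
lsi = make_pipeline(TruncatedSVD(n_components=150), Normalizer(copy=False))
X_lsi = lsi.fit_transform(X)  # dense (n_samples, 150) array, rows unit-normalized
```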
## Semantic search
| Method     | Parameters | Time (s) | Complexity   |
|------------|------------|----------|--------------|
| Text query |            | 270      | O(n_samples) |
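
A text query can be scored against the whole corpus by projecting it into the same LSI space and taking cosine similarities against every document, a single O(n_samples) pass. A sketch reusing `vectorizer` and `lsi` from the snippets above, with a made-up query string:

```python
import numpy as np

query = "contract termination clause"  # made-up example query

# Project the query into the same LSI space as the documents.
q = lsi.transform(vectorizer.transform([query]))

# Rows of X_lsi and q are unit-normalized, so a dot product is a cosine similarity.
scores = X_lsi @ q.ravel()
top10 = np.argsort(scores)[::-1][:10]  # indices of the 10 most similar documents
```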
## Near Duplicates Detection
The `examples/duplicate_detection_example.py` script was run with `dataset_name='legal09int'`.
| Method  | Parameters | Time (s) | Complexity                  |
|---------|------------|----------|-----------------------------|
| DBSCAN  |            | 3800     | O(n_samples*log(n_samples)) |
| I-Match |            | 680      | O(n_samples)                |
| Simhash |            | 270      | O(n_samples)                |
where `n_samples` is the number of documents in the dataset.
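
One way to realize the DBSCAN row above, sketched here, is density-based clustering of the LSI vectors, treating documents that fall in the same cluster as near duplicates. This is a generic illustration under assumed settings, not FreeDiscovery's exact implementation; the `eps` threshold in particular is an assumed value:

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN

# On unit-normalized vectors, Euclidean distance is a monotone function of
# cosine similarity (d^2 = 2 * (1 - cos)), so a tree-based neighbor search
# applies, keeping the run close to O(n_samples*log(n_samples)).
# eps=0.5 (cos >= 0.875) is an assumed threshold; tune it for the corpus.
dbscan = DBSCAN(eps=0.5, min_samples=2, metric="euclidean")
labels = dbscan.fit_predict(X_lsi)

# Documents sharing a cluster label are near-duplicate candidates;
# label -1 marks documents with no near duplicate.
groups = defaultdict(list)
for i, lab in enumerate(labels):
    if lab != -1:
        groups[lab].append(i)
duplicate_groups = list(groups.values())
```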