To start, we had to design the database. For that, we signed a commercial agreement with a database company that agreed to be part of our solution as an OEM under our brand.
We decided to use IBM Cloud bare metal servers rather than virtual servers, because bare metal gives the most flexibility. The setup will very likely work on other clouds as well, but for budget reasons we did not test that.
As for the data, we used TPC-H data. TPC-H is an international benchmark that has a C++ generator, designed to generate data at any scale factor you give it, and it has known benchmarks for its queries.
Along the way, we came across an eBay open source project called Kylin. It involves a lot of technologies, including Hive and Spark. Kylin builds a very smart cache that sits between the database and the BI tool. It is distributed, which means it runs as a cluster of servers that can be as large as you want (depending on the size of the cache).
Kylin’s job is to calculate in advance all the possible combinations of dimensions you might ask for from the BI tool, so that it can return an answer in less than one second. Kylin is designed to support trillions of records, and it supports ANSI SQL.
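To make the idea of calculating answers in advance more concrete, here is a minimal sketch in Python. It is a toy illustration of the general technique, not Kylin's actual implementation: the table, the column names (returnflag, linestatus, quantity) and the helper functions build_cube and query are all hypothetical. It precomputes count and sum for every subset of the dimensions, so that a later aggregation becomes a dictionary lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact table: two dimensions and one measure.
ROWS = [
    {"returnflag": "A", "linestatus": "F", "quantity": 10},
    {"returnflag": "A", "linestatus": "F", "quantity": 5},
    {"returnflag": "N", "linestatus": "O", "quantity": 7},
    {"returnflag": "R", "linestatus": "F", "quantity": 3},
]
DIMENSIONS = ("returnflag", "linestatus")


def build_cube(rows, dimensions, measure):
    """Precompute (count, sum) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            groups = defaultdict(lambda: [0, 0])  # [count, sum]
            for row in rows:
                key = tuple(row[d] for d in dims)
                groups[key][0] += 1
                groups[key][1] += row[measure]
            cube[dims] = {k: tuple(v) for k, v in groups.items()}
    return cube


def query(cube, dims, key):
    """Answer an aggregation by lookup instead of scanning the rows."""
    return cube[tuple(dims)].get(tuple(key), (0, 0))


CUBE = build_cube(ROWS, DIMENSIONS, "quantity")

# Count and sum of quantity where returnflag = 'A'.
print(query(CUBE, ("returnflag",), ("A",)))
```

With two dimensions the sketch stores 4 groupings; with n dimensions it stores 2^n, which is why the size of the cache grows with the model and why the dimensions passed to query must be listed in the same order as in DIMENSIONS.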
We ran the experiments with TPCH1. We defined the dimensions, defined the metrics as count and sum, and ran a query that is a simple aggregation.
The performance was surprising. We tested the same amount of data on Hive and on Kylin. On Hive we got an answer after 58 seconds; on Kylin the answer came back after 1 second.
We ran another experiment with a join between two tables, defining another model and other dimensions, and the improvement was even more surprising: from 380 seconds to 3 seconds. This means we did not have to give up the join query and still maintained the performance.
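To put the two results side by side, the short Python snippet below turns the timings reported above into speedup factors. The function name speedup and the constant names are ours, chosen only for illustration; the numbers are the ones from the experiments.

```python
def speedup(baseline_seconds, cached_seconds):
    """Return how many times faster the cached answer was."""
    return baseline_seconds / cached_seconds


# Timings reported in the text, in seconds.
AGGREGATION_SPEEDUP = speedup(58, 1)  # simple aggregation
JOIN_SPEEDUP = speedup(380, 3)        # join between two tables

print(f"aggregation: {AGGREGATION_SPEEDUP:.0f}x faster")
print(f"join: {JOIN_SPEEDUP:.0f}x faster")
```

In other words, the simple aggregation was about 58 times faster and the join query roughly 127 times faster.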