JutoLake - Analytics with no boundaries
Author: Omid Vahdaty 18.11.2020
How It all started with a customer use case
A customer who has 42 billion records came to me and wished to make a self service BI dashboard for an end customer. Meaning, Our customer wants to give his end customer a direct access to the data, or in other words, give the end customer the ability to query 42 billion records on production. The idea was to let him build a live dashboard without knowing what he is going to do, which aggregations, which metrics and which dimensions.
When the information is not properly organized, it is not clear what would be the cost and whether it can even theoretically work, therefore, I said it can’t be done. In turn, they asked, what can be done and what can be compromised to make it work?
After a few days of thinking, I came back with an answer:
Let’s make a 10 days POC. I’ll look at the data, check the infrastructure, perform some benchmarks and try to put something together. Eventually, I went back to the client with a solution.
It’s not the first time I’ve been approached with such a use case.
What was different here?
This time I didn’t refuse, but invested my full experience to solve the problem and it made me think. What if tomorrow a customer will come with a bigger scale, lets say 100 billion records, or 500 billion records, or a trillion or 10 trillion and then what? How do I solve the problem?
Questions we asked ourselves
- If we do a POC, invest money and energy, we would have an interesting use case, but will it work?
Will it be fast? Cheap? Simple?
- Should it be in the cloud or in a data center?
- Where can I find trillions of records for the POC?
- What BI vendor can support billion or trillions of records?
I made some inquiries and I found out that the only vendor that has stated that it supported 4 billion records is Qlik.
It needs proper hardware, but one can put in 4 billion records and it will take care of it.
It also means, that before you are reaching Qlik, you should take care of all the trillions.
I went to the drawing board to look for solutions
A taste of the engineering complexity
BigQuery has 2 configurations: flat rate or pay as you go.
Flat rate is like any clusters of servers you know.
In a rough estimate:
Queries- Suppose we pay $30,000 per month on X fixed servers- $360,000 annual fee.
Storage on the order of 1 Petabyte, i.e. 1000 Terabytes, each tera costs $20 per month- $240,000 annual fee.
BigQuery flat rate- $600,000.
pay as you go:
Storage– the same amount but the price will be $5 per terabyte. If we want to analyze all the petabytes everyday, we will reach a cost of $5000 per day!!! All that just for the BI.
It means, at night there will be an automation that will run the data and give us an answer of aggregation into the BI, and in the morning, we will have a ready dashboard.
$5000 per day, 30 days- $150,000 per month.
BigQuery pay as you go- $2,000,000
If we summarize the possibility of using BigQuery, flat rate will save us about 80% of the cost in a simple design.
So what’s the problem?
The flat rate has a limited amount of resources, You will have to “grease” the system a lot, do a lot of experiments to make it work and eventually, you will reach a cost of about a million dollar a year.
Using Snowflake costs about $2 per hour for the smallest server and about $256 per hour for the largest service.
Now you have to start trying. If we’ll take something in the middle, around $64 per hour, 10 hours overnight, 365 days because we want it to run everyday- The cost will be $233,000 just for the compute.
Storage- let’s say it’s the same as everyone- $250,000.
Annual summary- $483,000
Most likely that in this price, the cost for the compute wont be enough, so we took an outside consultant to try to make an initial forecast.( The price of this experiment is very high, so it is only a forecast). We got a sum of about a $1,000,000 a year and these are the orders of magnitude.
Let’s see what others did
I found a company called Nielsen Marketing Cloud, which has a data department of about 30 people. They process about a quarter of a trillion records a day. Their operation is complicated and I will try to go over it briefly.
They use SQS to manage queues, AWS Lambda to do some computing and EMR to do some transformation. They claim that processing about 250 billion records a day costs $1000 per day.
They started at a cost of $7.7 per billion records and dropped to a cost of $4.25 per billion. So let’s take their numbers- $30,000 a month on 250 billion records, which is $360,000 a year just on data processing.
You need to add to that $250,000 for storage and you reach $610,000 at the minimum of the minimum.
What is the problem?
The numbers grow linearly as the amount of data and that means that to get to a trillion, the system will quadruple.
I also found a video running on the net about a company that used Data Studio and BigQuery about an enormous amount of data. This is very dangerous if you do not know exactly what you are doing, because Data Studio sends a query to BigQuery on each change in the UI and if you are in ”pay as you go” with huge amounts of data, the invoice will be high accordingly.
Most of the companies I came across, tried to solve this with Hadoop, Redshift, BigQuery and Snowflake. it is important to see how strong is your team in terms of data engineering, which methodologies you work with and what your commercial agreement is with this companies.
Bottom line, everyone will choose the tool they will work with and will have to compromise, not because these are not good tools, but because it is unrealistic to visually analyze more than a few billion records. Think what a 4 million points scatter would look like (Hint… Blue screen)
What is our solution?
Our solution is called JutoLake and we have invested many months of thinking, working, examining technologies, benchmarks and optimizations in the level of hardware, software and data.
- Trillion records are about 1 petabyte of data.
- We want to consider the annual TCO (cost of ownership) so that it makes financial sense – How much does it cost to own such a solution.
- Pay as you go is unrealistic, because the amount of data will increase and we will want to limit the budget for data processing.
- The solution should be scalable, which means if we start with 100 terabytes, and the next day I reach 10 petabytes, we will not have to replace the whole system.
- In terms of velocity, if we can prove that a trillion records a day are feasible, automation must be completed within 24 hours, to be ready for a daily dashboard in the morning.
- In terms of query patterns, we did not want to limit ourselves, which means that the joins and aggregation type of queries were important to us. Even though it is an anti-pattern in the Big Data world, analysts still do windows function, aggregations and joins as the amount of data does not concern them, they want to analyze data.
- For the purpose of the demo, we said that an update once a day is enough for us.
How does it work?
As a start, we had to design the database and for that, we made a commercial agreement with the database company that agreed to be a part of our solution as an OEM our brand.
We decided to use IBM cloud, BareMetal servers and not virtual servers because of the bare metal that gives the most flexibility. It is very likely that you can work with other clouds as well, due to budget reasons, we did not test it.
As for the data, we took what’s called TPCH data which is an international benchmark that has a C++ generator, designed to generate data at any scale factor you give it and has known benchmarks for queries.
Along the way, we came across an Ebay open source project called Kylin. It involves a lot of technologies, including Hive and Spark. Kylin makes a very smart cache and is located between the database and the BI tool, it is distributed, which means it is a cluster of servers as large as you want (depending one the size of the cache).
Kylin’s job is to calculate in advance all the possible permutations you will want to ask the BI and will return an answer in less than one second. Kylin designed to support trillions of records and it supports Ansi SQL.
We did the experiments with TPCH1, We defined the dimensions, the metrics as count and sum and ran a query which is a simple aggregation.
The performance was surprising. The same amount of data tested on Hive and Kylin. On Hive, we got an answer after 58 seconds, on Kylin the answer was received after 1 second.
We did another experiment with join between 2 tables, define another model and other dimensions and the improvement was even more surprising, from 380 seconds to 3 seconds. This means that we also managed not to give up the join query and also maintained the performance.
What is JutoLake DB?
We took a database that is peta scale and designed for high volume and velocity projects, that is SQL Ansi , all we need to do is insert and select and can make “crazy” & huge Join queries. The backend will be a reasonable amount of servers – between 1 to 4 max, knows how to handle the data quantities and without messing with indexes and performance tuning.
Why choose JutoLake?
A simple solution without architectural restrictions, linear in terms of scale and will give an answer to BI use cases in less than a second. The cost will be reasonable and obviously, the service will come with the knowledge and experience we have already accumulated over the years.
So how does the magic happen?
As with any data architecture, the first layer is the ingestion layer that ingests the data into the system. No matter what cloud or technology you use, JutoLake will be in the top layer – the presentation.
We will get a full copy into our database and its designation will allow 2 use cases:
The first would be the Hadoop query which means some kind of research or drill down. The second will be BI.
You can see the results in the benchmark:
We took 60 billion records, aggregation, and joins, ran and got an answer in minutes. This means that with this system, you can update our dashboard every hour and analyze 60 billion records, even for end customers.
If we want to analyze a few trillion records, it’s already a matter of a few hours and it’s something that should be overnight.
If we use it directly in front of the database and it can be done, it’s a matter of a few hours. That is, every few hours you can update the dashboard and if you want to update the Kylin, it is a matter of seconds.
The original idea was to reach 10 trillion records, but during the work, we ran out of space in hard disks 🙂 and stopped at only 3.6 trillion records.
The querying took much less than 1 day, which is a realistic time for such analysis and this amount of records.
If I may mention again, the aim of the experiment was to reach 1 trillion records in one machine and the assumption is that with more machines, more data can be reached.
If you want to try using JutoLake, or have any questions, please feel free to contact me.