Enjoyable Serverless Analytics

BoilingData
Sep 2, 2022

It’s real: high-performance, on-demand serverless analytics is more than just an option or add-on to your existing toolbox. You don’t have to suffer random cold starts and unpredictable query performance; you can actually enjoy interacting with your data on S3.

Serverless has been around for years, but it has recently become a viable option for data processing too, thanks to increased resources (more CPU and more memory), access to network storage, and larger local SSDs, at least with AWS Lambda.
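For reference, here is a minimal sketch (AWS SDK for JavaScript v3) of dialing a Lambda function up to those limits; the function name and region are placeholders, not anything from our stack:

```typescript
// Sketch: configure a Lambda function with the larger resource limits
// mentioned above. "my-analytics-fn" is a placeholder name.
import {
  LambdaClient,
  UpdateFunctionConfigurationCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "eu-west-1" });

await lambda.send(
  new UpdateFunctionConfigurationCommand({
    FunctionName: "my-analytics-fn",
    MemorySize: 10240,                 // up to 10 GB RAM (CPU scales with it)
    EphemeralStorage: { Size: 10240 }, // up to 10 GB of local /tmp SSD
  })
);
```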

We started working with DuckDB three years ago, before it was on top of the hype curve. It is a highly performant, state-of-the-art OLAP database that you can embed into your applications: a missing piece for serverless. It is a perfect fit for AWS Lambda and S3.

We added support for a RAM disk so that DuckDB can work with in-memory data, query after query, without having to worry about S3 latency, Lambda network bandwidth, or even SSD bandwidth and latency. This gives you a hot in-memory database overlay: an SQL compute cache, hot pools of boiling data that cool down when no longer used.
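A minimal sketch of the idea using the DuckDB Node.js bindings (file paths and table names are illustrative, not Boiling internals): pay the S3/disk latency once, then serve every subsequent query straight from memory.

```typescript
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:"); // pure in-memory database

// Pay the load latency once...
db.run("CREATE TABLE taxi AS SELECT * FROM read_parquet('/tmp/taxi.parquet')");

// ...then repeated queries are served from RAM in milliseconds.
db.all(
  "SELECT passenger_count, AVG(total_amount) FROM taxi GROUP BY 1",
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  }
);
```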

BoilingData has a GUI where you can test with demo data sets like NYC and CC.
Simple GUI with QueryLog and API logging

Single-tenant dedicated resources

The core of Boiling’s service is the ability to “boil” data: keep it hot and always route your queries to the hot Lambda instances that already have your data in memory. The routing happens globally, and queries run in the region where your S3 Buckets are located, avoiding cross-region data transfer and the egress costs that come with it.

With Lambda, every query always gets dedicated resources in a single-tenant environment. Lambda scales out to thousands of instances in seconds, so Boiling can run the same query over hundreds of S3 Objects at the same time, again and again.
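An illustrative sketch of that scale-out pattern (not Boiling’s actual implementation; bucket, prefix, and function names are placeholders): fan the same query out to one Lambda invocation per S3 object and gather the results.

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const lambda = new LambdaClient({});

// List the S3 objects the query should cover.
const { Contents = [] } = await s3.send(
  new ListObjectsV2Command({ Bucket: "my-data-bucket", Prefix: "parquet/" })
);

const sql = "SELECT COUNT(*) FROM parquet_scan('s3://my-data-bucket/KEY')";

// One single-tenant Lambda instance per S3 object, all running concurrently.
const results = await Promise.all(
  Contents.map(({ Key }) =>
    lambda.send(
      new InvokeCommand({
        FunctionName: "query-worker", // hypothetical worker function
        Payload: Buffer.from(JSON.stringify({ sql: sql.replace("KEY", Key!) })),
      })
    )
  )
);
console.log(`${results.length} partial results gathered`);
```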

NodeJS SDK — https://github.com/boilingdata/node-boilingdata

Asynchronous API

We started right away with an asynchronous WebSocket API so that query response data can be streamed as soon as it is ready. This helps with query response time, for example on analytics dashboards where trend data and aggregations over partitions are computed live and concurrently, and each result arrives the moment it is ready.
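A sketch of the asynchronous pattern with plain WebSockets (the endpoint and message shape below are assumptions for illustration, not our documented protocol): several queries go out at once, and each result set is handled the moment it streams back, in any order.

```typescript
import WebSocket from "ws";

const ws = new WebSocket("wss://example-endpoint.boilingdata.com/"); // placeholder URL

// Stand-in for a real UI update.
function updateDashboardPanel(tag: string, rows: unknown[]): void {
  console.log(tag, rows);
}

ws.on("open", () => {
  // Fire off independent dashboard queries concurrently.
  ["trends", "aggregates"].forEach((tag) =>
    ws.send(JSON.stringify({ tag, sql: `SELECT ... /* ${tag} query */` }))
  );
});

// Results arrive as soon as each query finishes, in any order.
ws.on("message", (data) => {
  const { tag, rows } = JSON.parse(data.toString());
  updateDashboardPanel(tag, rows);
});
```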

What’s Boiling bro?

People ask what BoilingData is, how it compares to Presto, or whether it can run true distributed queries to replace a data warehouse.

Boiling is an in-memory SQL compute overlay. Data is brought from S3 into Lambda memory, kept there, and SQL queries run against it with the embedded database. Once the data is in memory and queried with DuckDB, response times are on the order of milliseconds, whereas with e.g. Presto, which loads data from S3 on every query, response times are seconds or tens of seconds. They are different tools: Presto is a large cluster running JVMs, good for ETL jobs; DuckDB inside Lambda is much faster for data sets that fit in memory, but pays the initial load latency when data is brought from S3 into Lambda memory.

To take yet another perspective, consider Spark. It is said that DuckDB needs one instance where Spark needs 32 for the same query performance. They are different tools: Spark has a large user base and lots of legacy code and scales horizontally, while DuckDB runs queries in-memory and scales vertically for a single working set (and horizontally across multiple independent working sets).

Boiling is a caching layer. Any query you run is cached in-memory and optionally persisted. Data processing involves numerous caching layers, and Boiling brings the selected data from S3 closer to the CPU.

Boiling combines the best of both worlds. Try fetching hundreds of columns from a columnar data store like Parquet: it is painfully slow, as the format is meant for vertical (column-wise) data retrieval, not horizontal. However, if you take SQLite and fetch all columns by row number (or index), it is very fast, especially if the storage page hierarchy is already cached in memory. With BoilingData you can use DuckDB for the WHERE part and SQLite for the SELECT part, as sketched below.
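A minimal sketch of the two-engine idea (illustrative, not Boiling’s internal wiring; file names, table names, and the id column are assumptions): DuckDB scans the columnar Parquet data to evaluate the WHERE clause, then SQLite fetches the wide rows by id.

```typescript
import duckdb from "duckdb";
import Database from "better-sqlite3";

const columnar = new duckdb.Database(":memory:");
const rowstore = new Database("/tmp/rows.sqlite"); // same data, row-oriented

// 1) Columnar engine: fast vertical scan over just the filter columns.
columnar.all(
  "SELECT id FROM read_parquet('/tmp/events.parquet') WHERE amount > 1000",
  (err, matches) => {
    if (err) throw err;
    const ids = matches.map((m: any) => m.id);
    if (ids.length === 0) return;

    // 2) Row store: fast horizontal retrieval of all columns for the matches.
    const placeholders = ids.map(() => "?").join(",");
    const rows = rowstore
      .prepare(`SELECT * FROM events WHERE id IN (${placeholders})`)
      .all(...ids);
    console.log(rows);
  }
);
```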

But we are just getting started, early in the journey. The data engineering world has a lot to learn from software engineering when it comes to semantic layers, data contracts, and the like, and we know that data is becoming more real-time and streaming, globally. We love query graphs and want to make them more interactive; we love coding and APIs and want to bring them closer to data. What’s more, running our service on AWS with Lambda and other services gives us fast iterations and modularity. Our world is not restricted to a single low-level programming language codebase. We want to take you into our enjoyable serverless analytics land!

You can start using BoilingData by signing up to our application at https://app.boilingdata.com/ where you can play with the demo data sets, set your own IAM role, and access your S3 Buckets the way you like.
