Redefining SaaS Analytics: Dynamic, Real-time Insights at a Fraction of the Cost

BoilingData
Mar 27, 2023

We are excited to announce a major milestone for BoilingData: we have validated our system on the very dataset that prompted us to ask, “How can we decrease hot data querying costs while delivering real-time, dynamic analytics?” The results below show that it does exactly that.

The Data Challenge

In collaboration with a MarTech SaaS company that provides marketing insights from comprehensive web crawls, we faced a data challenge. The company crawls many of the largest websites on the internet, and the data is evaluated against hundreds of rules to identify issues or suggest optimizations. Over time, this company has accumulated a data warehouse that contains several hundred terabytes of crawl data.

A primary challenge for this company is balancing data availability for arbitrary client analysis against the associated costs. This usage pattern is common in B2B SaaS: clients expect complete, flexible data access at all times, yet predicting when they will log in, which data they will look at, and which insights they will need is difficult. Moreover, client users typically analyze only a small portion of their data at any given time. This unpredictability makes it hard to find a storage solution that is both flexible and cost-efficient.

To facilitate flexibility, the company stored all raw data from the past 90 days in a large data cluster and pre-calculated over 300 insights for each client-defined data segment to enable low latency upon initial login. However, this expensive and cumbersome approach did not adequately provide clients with the dynamic data filtering and segmented insights they required.

This architecture poses several challenges:

  • Storing vast amounts of historical data in an always-on analytics cluster, such as ElasticSearch, for immediate access is prohibitively expensive.
  • Despite retaining all data from the past 90 days in memory, we estimated that over 85% of it was never accessed or aggregated.
  • Responsive dashboarding necessitates calculating aggregates at the end of each crawl (using an ETL paradigm), but this method does not offer dynamic filtering and segmentation, resulting in end-user frustration.
  • To reduce cost overhead, data is archived after 3 months. Users can trigger an unarchive function to rehydrate it into the live database; however, this process can potentially take hours or days, depending on the dataset, leading to a subpar user experience.
  • If an end-user modifies their data segment definitions, constructing long-term trends of insights is impossible without rehydrating several terabytes of historical data into the live database.

These data-related challenges hindered product development and resulted in sub-optimal user experiences. Various solutions, including Athena, Presto, and Redshift, were explored to increase flexibility and decrease costs, but they all had limitations.

BoilingData Emerges

Our founding team was captivated by DuckDB’s 2019 launch, recognizing its potential to address exactly these challenges. DuckDB’s performance in delivering analytics over columnar data from an embeddable binary made it an ideal foundation for a highly scalable analytics platform, particularly when combined with our expertise in scaling systems on AWS Lambda.

To achieve this, we addressed several obstacles:

  • Lambda is stateless, making it unsuitable for analytics sessions where data loading and network speed are bottlenecks. We designed a method for keeping data warm and in-memory for extended periods, effectively addressing this issue.
  • Even if data remains active in a Lambda instance’s memory, subsequent queries are not guaranteed to be routed to that specific instance, resulting in significant cache misses. Our query planning and routing layer addresses this issue (a simplified routing sketch follows this list).
  • Lambdas have limited power and network capabilities. An effective querying system must harness the power of hundreds or thousands of individual Lambda instances, so we developed a network-optimized distributed query engine to make that possible.
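
To illustrate the routing idea, here is a minimal sketch of how queries could be kept “sticky” to workers that already hold a client’s dataset in memory. This is not BoilingData’s actual implementation: the worker function names, pool size, and payload shape are assumptions, and sharding across separately named Lambda functions is just one simple way to approximate instance affinity.

```python
# Minimal sketch of cache-aware query routing (illustrative only).
# Assumes a pool of pre-warmed worker Lambdas, "boiling-worker-0" ..
# "boiling-worker-127", each caching one client's dataset in memory.
import hashlib
import json

import boto3

lambda_client = boto3.client("lambda")
NUM_WORKERS = 128  # size of the warm worker pool (assumption)


def worker_for(client_id: str, dataset: str) -> str:
    """Hash the (client, dataset) pair to a stable worker so repeated
    queries land on an instance that already has the data in memory."""
    key = f"{client_id}:{dataset}".encode()
    slot = int(hashlib.sha256(key).hexdigest(), 16) % NUM_WORKERS
    return f"boiling-worker-{slot}"


def run_query(client_id: str, dataset: str, sql: str) -> dict:
    """Invoke the selected warm worker synchronously with the SQL query."""
    response = lambda_client.invoke(
        FunctionName=worker_for(client_id, dataset),
        InvocationType="RequestResponse",
        Payload=json.dumps({"dataset": dataset, "sql": sql}),
    )
    return json.loads(response["Payload"].read())
```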

Addressing the Original Problem

With all components in place, we can now address the initial problem.

Rather than focusing on optimizing data lifecycle management within a data warehouse to facilitate interactive analytics, we recognized the potential of creating a new serverless data warehouse paradigm where each client’s data would have its own transient and highly elastic data warehouse. The data boundaries and access rules can be defined and modified at any time, but we only initiate compute resources at the moment of query. This approach allows us to transfer hundreds of GBs of data to hundreds of Lambdas in under 7 seconds and promptly put the warehouse cluster to sleep the instant a query has completed execution.
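
To make the fan-out concrete, the sketch below shows one way to scatter a query across many Lambda invocations and gather the partial results; the function name, file layout, and worker response shape are hypothetical, not BoilingData’s production code.

```python
# Illustrative scatter/gather over AWS Lambda (hypothetical worker name and
# response shape). Each invocation aggregates its own slice of Parquet files;
# the caller merges the partial counts.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")


def query_partition(files: list[str], sql: str) -> dict:
    """Run the per-partition SQL on one worker Lambda and return its result.
    Assumes each worker responds with e.g. {"row_count": <int>}."""
    resp = lambda_client.invoke(
        FunctionName="boiling-partition-worker",  # assumed worker function
        InvocationType="RequestResponse",
        Payload=json.dumps({"files": files, "sql": sql}),
    )
    return json.loads(resp["Payload"].read())


def scatter_gather(all_files: list[str], sql: str, fanout: int = 200) -> int:
    """Split the file list across `fanout` workers and sum partial row counts."""
    chunks = [all_files[i::fanout] for i in range(fanout)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(lambda c: query_partition(c, sql), chunks))
    return sum(p["row_count"] for p in partials)
```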

We use a sample web crawl containing 192GB of ZSTD-compressed Parquet data about millions of webpages and hundreds of millions of links. This is a large dataset by most standards, but because an end-user may analyze a given crawl only a few times in a 3-month period, keeping it hot in a live database for that entire time is incredibly wasteful: if this single crawl were kept live in a “Hot Tier” cluster from Elastic.co, it would cost $5,000 over that period (Elasticsearch expands the dataset to 850GB per copy, or 2560GB in a standard HA configuration).

Using BoilingData, we can generate this crawl’s full insight set from cold data in S3 40% faster than with AWS Athena. Once the data has been loaded into a cluster of AWS Lambda instances, subsequent generations are approximately 96% faster due to BoilingData’s warm, in-memory data storage.
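
For reference, the query path inside a single worker can be as simple as pointing DuckDB at the Parquet files on S3. The bucket, prefix, and column names below are placeholders, and S3 credentials are assumed to be configured separately.

```python
# Querying ZSTD-compressed Parquet straight from S3 with DuckDB
# (placeholder bucket/prefix and columns; credentials config omitted).
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='eu-west-1'")  # assumed region

row = con.execute(
    "SELECT count(*) AS pages, avg(response_time_ms) AS avg_response_ms "
    "FROM read_parquet('s3://example-bucket/crawls/2023-03/*.parquet')"
).fetchone()
print(row)
```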

For this crawl, we can produce all 300 insights for 20 user-defined segments in under two seconds. This allows end-users to dynamically modify their segment definitions on the fly and receive long-term trends from all insights immediately. For B2B SaaS businesses, this means they can offer more flexible insights and fast analysis of more data at lower costs than keeping everything in an always-on cluster.
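
To make “dynamic segments” concrete: a segment can simply be a SQL predicate supplied at query time, so insights for a brand-new segment definition are computed on the fly instead of being pre-aggregated. The schema, predicates, and bucket path below are illustrative only.

```python
# Computing one insight across several user-defined segments in a single
# pass with DuckDB. Segment definitions are plain SQL predicates; the
# column names and bucket path are illustrative, not the real crawl schema.
import duckdb

segments = {
    "blog":    "url LIKE '%/blog/%'",
    "product": "url LIKE '%/product/%'",
    "slow":    "response_time_ms > 1000",
}

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

segment_columns = ", ".join(
    f"sum(CASE WHEN {predicate} THEN 1 ELSE 0 END) AS pages_{name}"
    for name, predicate in segments.items()
)
insight = con.execute(
    f"SELECT count(*) AS total_pages, {segment_columns} "
    "FROM read_parquet('s3://example-bucket/crawls/2023-03/*.parquet')"
).fetchall()
print(insight)
```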

What’s Next?

Although we are excited about solving this initial problem, it is just the beginning. We have an ambitious roadmap ahead, featuring improvements and new features such as:

  • Enabling BoilingData as a query acceleration layer for businesses that only need faster analytics without replacing their entire data stack.
  • Allowing BoilingData to be used as a data source in popular BI tools such as Tableau and PowerBI.
  • Deploying our distributed SQL JOIN model to allow analytics queries across relational datasets.
  • Lifting data out of more data sources and streams, such as Kafka topics, to undertake real-time aggregations.

As we continue to refine BoilingData, our goal remains the same: to provide an innovative, cost-effective, and efficient solution for real-time analytics that not only meets but exceeds the expectations of clients of SaaS businesses. Stay tuned for future updates on our progress as we work to revolutionize the way businesses access and analyze their data.

If you would like to hear more about BoilingData or run a quick demo on your own data, contact us at info@boilingdata.com.
