Quix’s CEO and streaming advocate discuss the foundations of the modern data stack
Michael Rosam: Welcome to this week’s Stream Session. I’m Mike, CEO and co-Founder of Quix. This week, I’m joined by Clara, our Streaming Advocate, to discuss the modern data stack. Could you define the modern data stack for us, Clara?
Clara Hennecke: Starting with the hardest question, are we? I think we all know that there are no clear boundaries on the modern data stack — it’s very hard to define.
What is the modern data stack?
Clara: The use of microservices is the first defining feature of the modern data stack. The introduction of the first scalable cloud data warehouse, Amazon Redshift, in 2012 allowed dozens of startups to pop up and offer SaaS tools that users could integrate. With this development, the sector began moving away from monolithic platforms.
Mike: From my perspective, the modern data stack sounds a bit like a marketing ploy. Are data professionals using the idea and language? What are you seeing in your experience?
Clara: I don’t have a negative opinion of the term; it’s helpful for facilitating even this conversation. But to your question, the use of the modern data stack depends on the industry. I have experience in the manufacturing industry, and its use of data and its data architectures look quite different from those in startups and venture capital firms. Startups and VCs certainly use the language and concept of the modern data stack. These companies integrate specialized, open-source tools.
Why use microservices rather than monoliths?
Mike: Why is this unbundling important? Is it because it allows you to use the best tool for the job, or because it allows people to continue using tools they’re familiar with?
Clara: Very good question. Due to the VC structure of startup funding, we have so many new, extremely specialized tools cropping up at startling speeds. It’s possible for practitioners to start with one tool and realize that another tool actually does the job they need even better.
Mike: It sounds like the modern data stack is real, and that startups and scale-ups are using one. I wonder what’s happening in bigger organizations. What happens in my bank, for example? What are they doing?
Clara: Each company is different, so we’d need to ask someone at your bank to get specifics. But long-established businesses typically are advancing toward a modern data stack, just at slower speeds than startups because they have legacy systems and internal processes that slow down the progression, which in itself brings some instability. Migrating from a monolithic platform to a microservice stack can take two to ten years.
Mike: Interesting. It sounds like some trade-off of speed versus certainty. Obviously, a startup is using the modern data stack so that it can build something quickly.
Clara: Yes. But startups are also attracting younger professionals who have had the opportunity to study data through university. That’s creating more experts who want to work with expert tools — and talk about them. I’m used to hearing people discuss the latest tools they’ve discovered in their Substacks.
Mike: I’d like to talk about some specific technologies, because the modern data stack is pretty complex.
Clara: I think we have a bit of a different opinion on that. I’m curious, what do you think about when you look into the modern data stack and all these emerging tools?
Mike: I definitely see opportunity. I think loads of these tools, such as Fivetran and Segment, decrease the hassle of working with data. But one of my fears is that an employee sets up an expert tool like Fivetran and then leaves without having trained anyone else to use it. I also wonder if data stacks fulfill a short-term need in the business lifecycle; I think that as startups become scale-ups and then enterprises, they’ll begin building systems in house because they’ll have more and more specialized requirements.
Clara: That’s interesting. But from my perspective, I haven’t seen companies outgrow the modern data stack because, when companies grow, they also grow their teams.
Of course, there are some reasons to build your own systems, such as very big data or very special requirements. I think that’s happening. I haven’t seen it much myself yet, because I think we’re talking about a very small portion of businesses that have reached this scale. I think the modern data stack is quite scalable.
Mike: Oh, that’s very interesting. I’d like to hear from anyone who may have seen use cases where they go into hyper-scale territory. If there’s a shelf life to the modern data stack, it would be quite interesting to explore.
A concrete example: the unified data infrastructure
Mike: Could you talk us through this diagram from Andreessen Horowitz from left to right?
Clara: For a bit of context, this stack includes AI capabilities, real-time capabilities, and business intelligence reporting and analytics capabilities. But this is a very advanced set-up and not the starting point for a company beginning its data-driven journey. It also has components that don’t fit into the typical modern data stack.
How to choose the right tools for your own modern data stack
Mike: Why would you choose one tool or another? Do you generally choose what you know?
Clara: In the teams that I’ve worked in, we focus on finding the tools that best fit our project and, even after we’ve started, we give new tools a try. In general, the data community is excited about new tools.
I think that, right now, Snowflake and BigQuery are easy to use and include a large number of integrations. An extensive ecosystem is something that we always look for, but that creates a self-reinforcing process.
Mike: The connector count and the number of integrations give you a good idea of the ecosystem. If it’s well supported, then you’ll feel confident using it.
Clara: Not just confident. It’s a relevant factor in how much easier a tool will make your life. You don’t want to build and maintain custom integrations anymore. You want to outsource this workload. That’s why integrations are important.
What comes after business intelligence? From dashboards to recycling data
Mike: Very cool. Let’s say we’ve got a data stack running for business intelligence. But now we want to move beyond business intelligence. What’s next?
Clara: That’s a great question. I see a lot of companies a bit stuck in business intelligence. They always advocate for it because a lot of people get stuck on, “Let’s build another report and let’s serve more stakeholders.”
To move forward, we first need to work on getting the system more stable, observe what’s there, and add governance. We look at the buckets at the bottom, data governance and data observability, and stabilize what we’ve already built. It’s just the ugly truth that dashboards and business intelligence fail a lot.
Mike: Wow, that’s an interesting insight. Data systems can fail. Software apps these days rarely fail unless someone pulls the plug on the Amazon Data Center. Where are the data and data apps’ weak points?
Clara: That’s a huge point you just mentioned. We could talk for hours about this. I always ask myself, why are we in the data space so far behind software engineering? I still see so many people working directly on master: no branches, no previews, no tests. It’s really wild.
One place where a lot of failures happen is in the data source, which is typically maintained by a non-data team or even by multiple teams. MongoDB is often maintained by the tech team, which has control over the schema of the tables used to load data into your warehouse. Then there’s a schema change in MongoDB, and nobody’s informed. Suddenly, your data loading job fails, and then everything downstream is going to fail too. That’s a very common problem.
When there’s a schema change in the incoming data or in a transformation step, the tech teams don’t test every case against existing data. A new type of data comes into column A and then the system fails, pulling down the dashboards.
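To make the schema-change failure concrete, here is a minimal Python sketch of a check that could run before a load job. It is only an illustration under stated assumptions: the field names, the `EXPECTED_SCHEMA` mapping, and the quarantine approach are hypothetical and were not discussed in the conversation.

```python
# Hypothetical expected schema for one source: field name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "email": str, "signup_ts": str}


def schema_violations(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    for field in record:
        if field not in expected:
            problems.append(f"unexpected field: {field}")
    return problems


def split_batch(records):
    """Separate loadable records from ones that should be quarantined."""
    good, quarantined = [], []
    for record in records:
        problems = schema_violations(record)
        if problems:
            quarantined.append((record, problems))
        else:
            good.append(record)
    return good, quarantined


batch = [
    {"user_id": 1, "email": "a@example.com", "signup_ts": "2022-05-01T10:00:00"},
    # Type drift: user_id silently became a string upstream.
    {"user_id": "2", "email": "b@example.com", "signup_ts": "2022-05-01T10:05:00"},
    # A field was dropped upstream.
    {"user_id": 3, "email": "c@example.com"},
]

good, quarantined = split_batch(batch)
for record, problems in quarantined:
    print(record, "->", problems)
```

The design choice here is to quarantine bad records and report why, rather than let one unexpected value in “column A” bring down the whole load and the dashboards behind it.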
Mike: Is it a problem that data goes through so many tools? Does a microservice architecture create a large number of failure points?
Clara: Yes, exactly. That’s why data observability and governance are so important. dbt has lineage graphs, but they only tell you what’s happening in dbt, not in other areas of the data stack.
Mike: Okay, let’s say we’ve stabilized our current business intelligence set-up. Now we want to introduce something more advanced, such as machine learning.
Clara: A lot of companies then run extract, load, transform backwards, which we call Reverse ETL. It means taking data that has been processed in your warehouse and moving it back out into some application. An easy example is enriching data from Salesforce and sending it to another tool.
Another example is getting product usage data from your data warehouse. Teams use Segment to track how much a product is used and then send this data to Salesforce so that customer success managers can see directly into accounts, be warned about high churn risk, and consider ways of upselling clients.
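The product-usage example can be sketched roughly as follows. This is a toy illustration, not how any particular Reverse ETL product works: an in-memory SQLite database stands in for the warehouse, the `product_usage` table and its columns are invented, and `push_to_crm` is a hypothetical placeholder for a real CRM API call.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (an assumption for this sketch).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE product_usage (account_id TEXT, week TEXT, sessions INTEGER)"
)
warehouse.executemany(
    "INSERT INTO product_usage VALUES (?, ?, ?)",
    [
        ("acme", "2022-W20", 40),
        ("acme", "2022-W21", 6),
        ("globex", "2022-W20", 25),
        ("globex", "2022-W21", 30),
    ],
)


def build_crm_updates(conn, prev_week, this_week, drop_threshold=0.5):
    """Compare two weeks of usage and flag accounts whose sessions fell sharply."""
    rows = conn.execute(
        """
        SELECT p.account_id, p.sessions, c.sessions
        FROM product_usage AS p
        JOIN product_usage AS c ON c.account_id = p.account_id
        WHERE p.week = ? AND c.week = ?
        ORDER BY p.account_id
        """,
        (prev_week, this_week),
    ).fetchall()
    updates = []
    for account_id, before, after in rows:
        at_risk = before > 0 and after < before * drop_threshold
        updates.append(
            {
                "account_id": account_id,
                "sessions_this_week": after,
                "churn_risk": at_risk,
            }
        )
    return updates


def push_to_crm(update):
    """Hypothetical sink: a real pipeline would call the CRM's API here."""
    print(f"CRM update: {update}")


updates = build_crm_updates(warehouse, "2022-W20", "2022-W21")
for update in updates:
    push_to_crm(update)
```

The point of the pattern is that the churn flag is computed once, in the warehouse, and then delivered to the tool where customer success managers already work.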
Mike: Let’s say you’ve got good data systems, you’re creating reliable data, and you want to do some machine learning. At this point, it’s all about going into a warehouse and training models on huge data sets — it’s a big data technology, right?
Clara: I don’t know if I would say big data technology because I think you can also have machine learning with little data. We’ll see more and more small companies use machine learning to process what little data they have.
Mike: Yeah, that’s true. You can also just take a pre-trained machine learning model, such as one from Hugging Face. You can deploy that, so maybe you don’t need a big data set to build machine learning applications at all.
What’s workflow management?
Mike: Let’s talk about workflow management. How would you describe a workflow manager like Airflow compared to a data replication tool like Fivetran?
Clara: For me, it’s a bit confusing. We also see the workflow manager twice in this picture. For me, the workflow manager would sit somewhere at the top, especially in this unbundled data system. We have a lot of tools for everything. They very often run in batches, and they run at specific times. To make this work, you need a tool somewhere that aligns all these processes and keeps them on a synchronized schedule. I would move it to the bottom, next to the other tools for data discovery, data governance, and data observability, because it also serves those purposes: “Okay, something breaks, you can look into your workflow manager. Which process was failing?”
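The two roles described here, ordering batch jobs and showing which process failed, can be illustrated with a small Python sketch. It uses only the standard library and is not Airflow’s API; the task names and the simulated failure are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "replicate_sources": set(),
    "transform_models": {"replicate_sources"},
    "refresh_dashboards": {"transform_models"},
    "reverse_etl_sync": {"transform_models"},
}


def run_pipeline(dependencies, tasks):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        blocked = any(status[dep] != "ok" for dep in dependencies[name])
        if blocked:
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"failed: {exc}"
    return status


def failing_transform():
    # Simulated failure for the sketch.
    raise ValueError("column A has an unexpected type")


TASKS = {
    "replicate_sources": lambda: None,
    "transform_models": failing_transform,
    "refresh_dashboards": lambda: None,
    "reverse_etl_sync": lambda: None,
}

status = run_pipeline(PIPELINE, TASKS)
for name, state in status.items():
    print(f"{name}: {state}")
```

The returned status map is the toy equivalent of opening the workflow manager after something breaks and asking which process failed and what was skipped as a result.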
Modern data stack for streaming
Mike: You’ve been a streaming advocate at Quix for two months now. You’re very much a modern data stack, BigQuery, Snowflake, dbt person. So what do you now think of streaming?
Clara: I first think of the gap between data and tech. With streaming technologies, especially Kafka, people in tech talk about it and some people in data talk about it, but I see more of the discussion in the tech departments that build the products and work on microservice architectures.
Kafka is typically used to build microservice architecture, so modern IT infrastructure. But that’s also quite separate from the data team. It’s very often not a discussion that the tech and the data team have together and think about.
Kafka is then used to send the data into a data warehouse, and there is the world of the data team. I think a bit about a broken experience when thinking about streaming because the companies don’t think holistically about this topic, but more in silos.
Mike: I do read a lot about Kafka in data, and indeed the other technologies in data. Of course, there’s technologies like Debezium from Red Hat, which lets you do some change data capture, which is like a version of an ingestion tool. In general, I see streaming the same as you; it’s really just a microservices event streaming technology today, but that’s not fulfilling the massive potential that there is for streaming.
Clara: It’s quite funny that when you then hear from data teams, “Yeah, we work with Kafka.” Then you ask them, “Hey, cool. What are you doing with Kafka?” Then it’s, “Yeah, we use it to load data into Snowflake.” They’re using it exactly like there’s the event streaming, then they load the data into their database, and then they do everything batch-based.
I think we see two kinds of companies. These really quite big companies, big data teams where they use streaming and stream processing with a lot of events and where they also don’t really come from modern data stack. I’d say these companies have a business intelligence analytics team doing the modern data stack, and then they have a separate team of data scientists, engineers, and software engineers working on stream processing.
But that’s a different journey for these companies, and it’s also just a portion of those companies. We have this application of streaming in data, but not that much in this journey of, “I start with my modern data stack, and then I start to adopt streaming.” That’s what we have yet to see.
Mike: Right. Today streaming is just being used to move data from A to B.
Clara: Yeah, except for these big players, such as Netflix, Airbnb, or Uber, where their entire value proposition is based on having to do stream processing.
Mike: How would you define stream processing?
Clara: Well, you have a data stream, a flow of real-time data, and you want to do transformations to this data in real time — any transformation you want to do. Stream processing is data transformation in real time.
Mike: Basically, stream processing would be dbt for Kafka. Any transformation that you want to do with a low latency, you would do with a stream processing workload.
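As a rough illustration of “data transformation in real time,” the sketch below chains Python generators so that each message is parsed, transformed, and filtered as it arrives, without waiting for a batch. The message format, sensor names, and threshold are invented for the example, and a plain list stands in for a broker topic.

```python
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn raw 'sensor_id,celsius' messages into records, one at a time."""
    for line in lines:
        sensor_id, celsius = line.split(",")
        yield {"sensor_id": sensor_id, "celsius": float(celsius)}


def to_fahrenheit(records: Iterable[dict]) -> Iterator[dict]:
    """Stateless per-event transformation."""
    for record in records:
        yield {**record, "fahrenheit": record["celsius"] * 9 / 5 + 32}


def only_hot(records: Iterable[dict], threshold_f: float = 100.0) -> Iterator[dict]:
    """Per-event filter: keep readings above the threshold."""
    for record in records:
        if record["fahrenheit"] > threshold_f:
            yield record


# A list stands in for a topic; a real consumer would yield messages as they arrive.
incoming = ["a,20", "b,40", "a,37.5"]

alerts = list(only_hot(to_fahrenheit(parse(incoming))))
print(alerts)
```

Because every stage is lazy, a message flows through the whole chain as soon as it is read, which is the property that separates this from loading everything into a warehouse and transforming it later.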
Clara: I think one thing to mention is real-time analytics and real-time analytics databases. A lot of people I talk with don’t really buy into that concept, because when we talk about real-time analytics, it’s very often that a human consumes the finalized data product. A human consumes these analytics, and we are just not made to react in real time. Real time is a nice feature, but I think there are probably other opinions out there, and there’s a lot of money in this space right now.
But the other thing that we talk about with streaming is transactional workloads. These are workloads where we act on single events and where a machine is the consumer, and a machine can very well react in real time.
Mike: Would low latency let you query across a number of rows in an event-stream table (we’d call them records on the broker)? It would let you query a whole bunch of records with low latency for the purpose of serving them to some human in a dashboard.
Clara: Yeah, of course. That’s embedded analytics. Then we also have analytics capabilities in products where we want to show our users specific statistics about their usage, such as, I don’t know, a trend of how many people visited my profile.
We have another question, Mike. Maybe this one to you. Anoop asks, “Are real-time data and data streams different things?”
Mike: Cool, interesting question. It depends how you want to answer that. I would say real-time data… I’m just going to say something really controversial here, all data is real time. All data, every single data point ever created was created in a moment with a timestamp; that’s real-time data. If you’re able to create and distribute your data, then it’s real-time data.
A stream would be a continuous series of real-time data events. Obviously, you can create one event, like “Clara logged in to my hypothetical banking app”; that’s just one real-time data point. A stream of them would be her clickstream. She logs in, then she goes to her balance page, then she makes a balance transfer, then she applies for a loan, then she applies for a mortgage. That would be a clickstream.
Clara: To me, it makes sense, but I also think it’s quite hard to clearly cut a line between the two concepts of real-time data and data streams. There’s, of course, an overlap. It’s not mutually exclusive that there’s one thing and the other thing. There’s a big overlap.
Mike: One of the misconceptions I often hear is it has to be a continuous stream to be streaming, and I don’t think it does. What streaming means really is that it’s also event-driven. It means it’s more of a push, call it on-demand. If you’re using a streaming system to build some data backend, you could use streaming to really kick off processes.
That means when Clara logs in, something happens and it’s going to happen in very low latency and it’s going to happen very reliably because that message was put onto Kafka, which is a queue built for software. So it’s a very reliable way to build backend data automations.
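The push-style, event-driven idea can be sketched in a few lines of Python. This is an in-process toy with no broker and none of Kafka’s delivery guarantees; the event types and handler names are hypothetical.

```python
from collections import defaultdict

handlers = defaultdict(list)  # event type -> list of callbacks
audit_log = []


def on(event_type):
    """Decorator that registers a function as a handler for one event type."""
    def register(func):
        handlers[event_type].append(func)
        return func
    return register


@on("login")
def record_login(event):
    audit_log.append(f"{event['user']} logged in")


@on("login")
def warm_balance_cache(event):
    audit_log.append(f"prefetching balance for {event['user']}")


def publish(event):
    """Push the event to every handler registered for its type."""
    for handler in handlers[event["type"]]:
        handler(event)


publish({"type": "login", "user": "clara"})
publish({"type": "page_view", "user": "clara"})  # no handlers, nothing happens
print(audit_log)
```

Nothing polls for work here: the login event itself kicks off both processes, which is the “push” behavior described above.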
Clara: Anoop put your answer into a short description. “Stream is a list of events and real time is the event itself.”
Mike: Perfect. Real-time data is one event and then streaming is a sequence of real-time events, for sure. The nuance that I’m trying to highlight here is that it doesn’t have to be a very continuous sequence. Indeed, often the sequences are very sporadic.
Clara might log into her mobile bank and then she might get distracted for a few moments, so there won’t be any events for, I don’t know, 15 seconds. At that point, actually, you might want to say, “Hey, Clara, we’re going to log you out,” so you send an event back. Event streaming doesn’t have to consist of closely spaced data points.
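A sporadic stream with an inactivity rule might look like the Python sketch below. It is a simplification under stated assumptions: the event tuples are invented, and the warning is only detected when the next event arrives, whereas a real system would more likely use a timer so the warning fires at the 15-second mark.

```python
IDLE_LIMIT_SECONDS = 15  # the 15-second figure used in the conversation


def with_logout_warnings(events, idle_limit=IDLE_LIMIT_SECONDS):
    """Yield each event, inserting a 'logout_warning' after a long idle gap.

    events: iterable of (timestamp_seconds, user, action), ordered by time.
    """
    last_seen = {}
    for ts, user, action in events:
        previous = last_seen.get(user)
        if previous is not None and ts - previous >= idle_limit:
            # Stamp the warning at the moment the idle limit was reached.
            yield (previous + idle_limit, user, "logout_warning")
        last_seen[user] = ts
        yield (ts, user, action)


clickstream = [
    (0, "clara", "login"),
    (4, "clara", "view_balance"),
    (30, "clara", "transfer"),  # 26 seconds of silence before this event
]

result = list(with_logout_warnings(clickstream))
for event in result:
    print(event)
```

The gap between events carries information of its own, which is why a stream does not need to be dense to be useful.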
Clara: There’s a follow-up question now: “Is event-driven systems equal to data streaming?”
Mike: Yeah. I think so. Data streaming is also called event streaming. Ultimately, this language comes from a software world where the login is called one event — so you create an event — and that’s a stamp in time, which is, “Clara logged into the mobile bank,” so you send that event.
Then, of course, there’s a second one, “Clara was authenticated. Clara was served this page.” So then you build up an event stream. Event streaming comes from the software world for sure. The other interesting thing about streaming is it can apply to sensor data. If you had something like a wind turbine in a wind farm and you had the rotation of the blades, you would send a stream of numeric values to represent the number of revolutions per second.
I think event streaming is increasingly moving toward what you might call traditional time-series applications of sensor data. But, in reality, in the streaming world, every single data point always has a timestamp, and that marks it all out as time-series data. Whether it’s event data, a photograph, a video, or a sensor reading, it’s always created at a point in time with a timestamp.
That’s quite an interesting nuance about stream processing that you’ve got to get your head around. Really, the primary key for all your data is time: the datetime.
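One way to picture time as the common key is to merge an event stream and a sensor stream into a single timeline ordered by timestamp. The records below are made up for illustration, and each input list is assumed to be already sorted by time.

```python
import heapq
from datetime import datetime


def ts(text):
    """Parse an ISO 8601 timestamp string."""
    return datetime.fromisoformat(text)


# Hypothetical business events from the banking example.
events = [
    (ts("2022-05-01T10:00:00"), "event", "Clara logged in"),
    (ts("2022-05-01T10:00:02"), "event", "Clara was authenticated"),
]

# Hypothetical wind-turbine readings in revolutions per second.
turbine_rps = [
    (ts("2022-05-01T10:00:01"), "sensor", 0.25),
    (ts("2022-05-01T10:00:03"), "sensor", 0.27),
]

# heapq.merge lazily interleaves sorted inputs; the timestamp is the sort key.
timeline = list(heapq.merge(events, turbine_rps, key=lambda record: record[0]))

for when, kind, value in timeline:
    print(when.isoformat(), kind, value)
```

The two sources share no other field, yet they line up cleanly because every record carries a timestamp.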
Clara: There’s one more question. “When we say event, we mean business event, not a technical event, ‘Clara logs in mobile bank, Clara went to order page.’” I can briefly comment on that. I think very often, there’s a business event happening and a technical event happening at the same time. If I understand correctly, I, as a user, do something, and then very often, or in most cases, there’s also a technical event. Something happens in my database; there’s an event collector that tracks this. I think very often they both appear at the same time.