Get ready to rethink your core, tables and processing
It’s been 130 years since Herman Hollerith tabulated the US Census electromechanically, pioneering a machine that processed data in batches on punched cards. Given the evolution of other critical technologies since then (the assembly line, the internet, and cloud computing, to name a few), it’s fitting that batch data processing also evolves.
Enter streaming data processing. The demand for instant data analytics, rather than waiting for data to be processed in batches, comes from dozens of industries and applications ranging from financial services to automotive to IoT.
In this article, I share how streaming data requires a significant paradigm shift from the habits we developed over five generations of handling data in batches. But first, let’s answer some basic questions.
What is batch data processing?
Batch data processing is the processing of large volumes of data collected over a period — such as minutes, hours, and days — in groups, called batches. It usually runs automatically without human interaction, at a scheduled time or as the need arises.
Batch data processing usually follows a three-stage process: data gathering and input, processing, and output as information. Put another way, batch data processing entails data that has been collected and grouped, then processed via a program, with results output in sequential order.
Each of the three stages (input, processing, and output) typically requires a different program, and all three have to work together for batch data processing to run seamlessly.
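The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production batch job: the function names, the store names and the sales figures are all made up for the example.

```python
from collections import defaultdict

def gather(records):
    """Stage 1 (input): collect raw records into a single batch."""
    return list(records)

def process(batch):
    """Stage 2 (processing): aggregate the whole batch at once."""
    totals = defaultdict(float)
    for store, amount in batch:
        totals[store] += amount
    return dict(totals)

def output(totals):
    """Stage 3 (output): emit results in a sequential (sorted) order."""
    return [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]

# Hypothetical sales collected over a day, before the scheduled job runs.
sales = [("north", 10.0), ("south", 5.5), ("north", 2.5)]
report = output(process(gather(sales)))
print(report)
```

Nothing is reported until the whole batch has been gathered, which is exactly the time lag discussed in the next section.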
The problem with batch data processing
One of the most significant advantages of batch data processing is the processing of large volumes of data. However, for modern businesses, access to real-time information is vital to making decisions.
“Access to real time information is vital to competitive decision making.”
Batch data processing is most suitable for data that doesn’t need to be processed immediately, such as payroll or sales records. However, some problems are associated with using batch data processing for businesses. These include:
- Cost: Batch data processing systems are capital intensive because the software, the hardware infrastructure, and the deployment of the system are all costly.
- Expertise: Setting up a batch data processing system is complex. Knowledgeable developers are expensive and rare but necessary to a well-functioning system. Additionally, when there are errors in processing, debugging is time consuming.
- Speed: Decision-making can suffer from the time lag between when data is created and when it is processed and returned to the business.
Batch data processing alternatives
What happens when organizations require real-time data analysis to support their growth? We take a look at alternatives to batched data processing.
Real-time data processing
Real-time data processing means that data is input, processed and output continuously. Data is processed as soon as it enters the system, in the shortest possible period (in “real time”), and the processor is always active.
Examples of real-time processing include ATM transactions and point-of-sale validation, where fraudulent transactions can be stopped before they are completed. Real-time processing can also significantly improve business functions through real-time analytics.
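To make the point-of-sale example concrete, here is a rough sketch of a check that runs on each transaction as it arrives rather than on a nightly batch. The rules, the threshold and the card data are hypothetical and only there for illustration; real fraud detection is far more involved.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    card_id: str
    amount: float
    country: str

# Hypothetical rules, chosen only for this example.
MAX_AMOUNT = 1_000.0
HOME_COUNTRY = {"card-1": "GB", "card-2": "US"}

def approve(tx: Transaction) -> bool:
    """Decide on a single transaction the moment it arrives."""
    if tx.amount > MAX_AMOUNT:
        return False
    return HOME_COUNTRY.get(tx.card_id) == tx.country

incoming = [
    Transaction("card-1", 25.0, "GB"),
    Transaction("card-1", 5_000.0, "GB"),
    Transaction("card-2", 40.0, "FR"),
]

decisions = []
for tx in incoming:  # each transaction gets its decision before the next is read
    decisions.append(approve(tx))
print(decisions)
```

The important part is the shape of the loop: a decision is made per transaction, so a suspicious payment can be declined before it completes.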
Stream data processing
Stream data processing is the continuous processing of data in an endless flow. In stream processing, data is analyzed as it arrives in the system. Access to information on the fly is crucial for stream processing.
One example of a continuous stream of data is your news feed on a platform like Twitter. Another is the constant stream of data generated by Formula One race car sensors, with each vehicle producing 1.1 million data points per second.
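A Python generator is a handy way to picture a stream: values are handed over one at a time and the consumer updates its result on every arrival. The sketch below keeps a running mean over sensor readings. The `sensor_feed` function and its values are stand-ins for a real, endless source.

```python
from typing import Iterable, Iterator

def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of everything seen so far, updated per reading."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean

def sensor_feed() -> Iterator[float]:
    """Stand-in for an endless source; finite here so the example ends."""
    yield from [100.0, 102.0, 98.0, 104.0]

means = list(running_mean(sensor_feed()))
print(means)
```

Because the mean is updated incrementally, the consumer never needs to store the full history, which is what makes this approach workable on a flow that never ends.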
Quix is a platform for working with streaming data. With Quix, developers can use Python or C# to connect their applications to a message broker (which we talk about in the next section), create contextual streams of data, and process and store them.
Paradigm shift #1: The message broker is your new core
In the old paradigm of batch data processing, a database was at the core of everything. Data had to be written to the database, analyzed there, and the results read back out. As we discussed above, all of this reading, processing and writing on a database required significant time and resources.
In the new paradigm, a message broker is the new beating heart of your information architecture. The message broker accepts streaming data the same way a database receives data, but there is no need to write information to a database before processing it because the processing happens as the streaming data comes in.
The significant advantage is that the broker holds the most recent data in memory, so the program running on the compute cluster can access it quickly, while older data is written to disk. Once your code is connected to the broker, your deployments receive messages as soon as they arrive. You can learn more about how this works in the Quix documentation.
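The idea of a broker can be sketched as a toy, single-process class. This is not how Quix or any real broker is implemented, and `ToyBroker` is a name invented for this example: it only shows subscribers receiving messages on publish while a bounded buffer keeps the most recent messages in memory. A real broker would write older messages to disk; this toy simply drops them.

```python
from collections import defaultdict, deque
from typing import Any, Callable

class ToyBroker:
    """Illustrative in-memory publish/subscribe broker (hypothetical)."""

    def __init__(self, recent_size: int = 3) -> None:
        self._subscribers = defaultdict(list)
        # Bounded buffer per topic: only the newest messages stay in memory.
        self._recent = defaultdict(lambda: deque(maxlen=recent_size))

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> None:
        self._recent[topic].append(message)
        for handler in self._subscribers[topic]:
            handler(message)  # delivered as soon as it is published

    def recent(self, topic: str) -> list:
        return list(self._recent[topic])

broker = ToyBroker(recent_size=2)
received = []
broker.subscribe("telemetry", received.append)
for speed in (310, 312, 305):
    broker.publish("telemetry", {"speed": speed})

print(received)
print(broker.recent("telemetry"))
```

The subscriber sees all three messages as they are published, while the in-memory buffer holds only the latest two.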
Paradigm shift #2: Think in streams, not tables
Tables are at the core of traditional relational databases. They store data in rows and columns and can be queried to retrieve it, which is the model we relied on in batch data processing.
In our paradigm shift to stream data processing, streams are at the core. Instead of data stored in tables in a database, data is delivered in a continuous flow of records on a message broker, which keeps them in an append-only log. This makes things pretty hard for developers because the broker imposes no structure on those records. No record knows what information is in the following record or the nth record, so it’s hard to build an application that efficiently processes the correct information at the right time.
We solved this at Quix by creating the Streams Class. It lets you define an object to collect all the information for a given context, such as one customer ID. The Streams Class then arranges your data in a table-like format with the timestamp as the primary key for each row and a column used for each parameter and event value at that timestamp in the stream of records.
“The Streams Class maintains structure and context when storing data, so it’s easy to explore historical data or use common ML libraries.”
The Streams Class can also maintain this structure when storing data. This makes it easy to explore historical data because it’s all recorded in your application context. It also makes it easy to use streaming data with standard ML libraries and tools such as Scikit-learn and Jupyter Notebooks.
You can use one format to develop models on historical data and deploy them to production, all by using the Quix SDK. And because streams are in memory, while tables are on disk, your applications will be fast and efficient.
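To show what a table-like view of a stream looks like, here is a plain-Python sketch that groups loose records by context and lays them out with the timestamp as the row key and one column per parameter. This is not the Quix Streams Class or its API. The record layout (`stream`, `ts`, `param`, `value`) and the `to_tables` helper are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical records as they might arrive from a broker, interleaved
# across two contexts (two cars) and two parameters.
records = [
    {"stream": "car-7", "ts": 1, "param": "speed", "value": 310},
    {"stream": "car-9", "ts": 1, "param": "speed", "value": 298},
    {"stream": "car-7", "ts": 1, "param": "gear", "value": 7},
    {"stream": "car-7", "ts": 2, "param": "speed", "value": 305},
]

def to_tables(records):
    """Group records per stream, then per timestamp, one column per param."""
    tables = defaultdict(lambda: defaultdict(dict))
    for r in records:
        tables[r["stream"]][r["ts"]][r["param"]] = r["value"]
    return {stream: dict(rows) for stream, rows in tables.items()}

tables = to_tables(records)
for ts, row in sorted(tables["car-7"].items()):
    print(ts, row)
```

Once the data is in this shape, each row can be treated like a row in a data frame, which is what makes the usual ML tooling straightforward to apply.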
Paradigm shift #3: In-memory processing
With batch data processing, the focus has always been on data that isn’t needed in real time. It requires digging into the database every time you need to process and output data. That works fine — as long as you’re not in a hurry. But high latency can undermine businesses that rely on timely analytics.
The traditional approach is neither fast enough nor sustainable when you need access to real-time data. In-memory processing changes the architectural practice of inputting, processing and outputting data: it sends data streams through the message broker instead of the database (which is on disk), leading to significantly lower latency and higher efficiency.
How can organizations improve their data processing?
Large organizations typically have many systems integrated with a wide variety of technologies. Their purpose is to receive, store and transmit data. The data is most often stored, at rest, in a database. Once the data has been collected and then funneled into the database for storage, it can be read for batch processing.
“With extremely high volumes of data, or where speed is vital, expensive hardware is often the solution. However, much of this data is either not needed or only relevant in the instant it’s generated.”
In situations where extremely high volumes of data are involved or where speed is vital, expensive hardware is often the solution. However, much of the data is either not needed or is only relevant in the instant it’s generated. The deferred nature of batch processing means that insights, decisions, or opportunities gained from working with live data in real time are lost.
The lost opportunity from batch processing stale data doesn’t need to be the reality. Processing live data in real time is possible — and much easier than you’d think.
Instead of a database, Quix is built with a message broker at its core, meaning Quix lets users work with live data the instant it’s created. What you do with the data at that moment can be as simple as discarding the portions that aren’t useful, or as involved as analyzing it and reacting in real time.
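Both ends of that range, discarding and reacting, fit in one small handler. The sketch below is generic Python rather than Quix SDK code, and the field names, the `OVERHEAT_C` threshold and the sample readings are assumptions made for illustration.

```python
OVERHEAT_C = 110.0  # hypothetical alert threshold

def handle(reading: dict, alerts: list) -> bool:
    """Process one reading on arrival; return True if it is worth keeping."""
    temp = reading.get("temp_c")
    if temp is None:
        return False  # not useful: discard immediately, never store it
    if temp > OVERHEAT_C:
        alerts.append(f"{reading['sensor']} overheating at {temp} C")
    return True

alerts = []
live = [
    {"sensor": "engine", "temp_c": 95.0},
    {"sensor": "engine", "temp_c": None},
    {"sensor": "engine", "temp_c": 121.5},
]
kept = [r for r in live if handle(r, alerts)]
print(len(kept), alerts)
```

The empty reading never reaches storage, and the alert is raised while the hot reading is still fresh.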
How data stream processing is changing business
An important trend in data is the demand for companies to act on data faster and more efficiently. Organizations already invest heavily in data, including data warehouses and data scientists, but it’s not enough to collect and store data. Producing insights from that data and being able to act on this analysis quickly are the key factors in transforming this data investment into actual business value.
Embracing the paradigm shifts associated with streaming data will enable organizations to achieve lower latency, higher bandwidth and greater efficiency compared to traditional batch processing. Harnessing the power to process an ever-growing volume and velocity of streaming data — and automate actions in response to it — creates a significant competitive advantage over businesses limited by last-generation technology.
Until now, working with real-time data streams has largely been available only to massive organizations with the resources to apply hundreds of developers to this problem. But with Quix’s platform, any developer can stream, process and store data at scale without managing infrastructure.
By creating a layer of abstraction over the complexities of streaming data, Quix’s SDK enables developers to write code in Python or C# that connects directly to a message broker, creating a seamless live data stream. This setup improves developer productivity without requiring expensive teams or infrastructure.
The transition from batch data processing to stream data processing will undoubtedly be difficult for some, since it requires several paradigm shifts in the approach to storing, processing and acting on data. But the exponential growth in digital products and services and heightened business competition demand not just a faster way to handle data but a more efficient approach as well.
If you’d like to try Quix’s data processing platform for free, sign up for immediate access. And join us in our Slack community channel, where you’ll find friendly technical folks to answer questions.