September 27, 2021 | Industry insights

Implementing stream processing: my experience using Python libraries

I tested the Python libraries of three stream processing systems (Apache Spark, Apache Flink and Quix) on performance, scalability and ease of use. Here’s what I learned.


Apache Spark on Databricks vs. DIY Apache Flink vs. Quix.ai

When I set out to answer the question of which data stream processing system is best, it seemed like a simple experiment: take a randomly generated string of lorem ipsum text and calculate various metrics over a sliding window of time. But this simple experiment quickly revealed the complexities of working with streaming data.
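To make the setup concrete, here is a minimal sketch of the kind of workload involved, assuming a local Kafka broker and the kafka-python package; the topic name and message volume are illustrative, not the exact harness used in the experiment.

    import random

    from kafka import KafkaProducer  # pip install kafka-python

    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def random_sentence(n_words=10):
        # Build one random lorem ipsum message as UTF-8 bytes.
        return " ".join(random.choices(WORDS, k=n_words)).encode("utf-8")

    # Each engine under test consumes this topic and aggregates metrics
    # over a sliding time window.
    for _ in range(100_000):
        producer.send("lorem-ipsum", value=random_sentence())
    producer.flush()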

TL;DR: the system you choose will significantly impact ease of use, performance, and cost.

This blog shares my experience with the pros and cons of each platform — Apache Spark on Databricks, a DIY Apache Flink install, and Quix — so you can make the best decision for your application. My colleague Tomas Neubauer (Quix CTO) details our results in full in his blog post.

I’m also eager to hear about your experiences. Jump on our Slack to chat with my colleagues and me.

Stream processing with Apache Spark

In Databricks, it’s easy to configure Kafka as a streaming source, and you can create a table from plain text messages. But despite Spark’s large user base, its documentation doesn’t provide enough information for troubleshooting. As a result, developers must frequently rely on community support via Stack Overflow and other forums.
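For example, a streaming read from Kafka takes only a few lines in PySpark (broker address and topic name carry over from the illustrative sketch above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lorem-metrics").getOrCreate()

    # Kafka as a streaming source; each record's value arrives as raw bytes.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "lorem-ipsum")
        .load()
    )

    # Cast the binary payload into a single-column table of plain text.
    lines = raw.selectExpr("CAST(value AS STRING) AS text")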

In particular, the Databricks platform’s file storage is not adequately documented. Spark needs the checkpoint location to store the internal variables of its Kafka source. However, the file storage directory (/FileStorage) seems to be hidden from the user, or at least undocumented. Since this directory is not accessible from the UI, the only way to reset a Kafka checkpoint is to create a new one and leave the old files on the system. Erasing the Spark state is done, illogically, via the SQL RESET command.
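Continuing the sketch, the checkpoint location shows up as a writeStream option; the path below mirrors the directory discussed above and is illustrative:

    # Spark persists Kafka offsets and query state under checkpointLocation.
    query = (
        lines.writeStream.outputMode("append")
        .format("memory")
        .queryName("lorem_debug")
        .option("checkpointLocation", "/FileStorage/checkpoints/lorem-v1")
        .start()
    )
    # "Resetting" means pointing checkpointLocation at a fresh path and
    # abandoning the old files; session state is cleared with SQL RESET.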

Spark’s DataFrame functionality is limited when working on streaming data, which, combined with the insufficient documentation, makes development difficult. There is no straightforward way for developers to debug the data being processed.
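For reference, the sliding-window aggregation itself can be expressed in the DataFrame API; this sketch continues from the stream above, with illustrative window sizes:

    from pyspark.sql import functions as F

    metrics = (
        lines.withColumn("ts", F.current_timestamp())
        .withWatermark("ts", "10 seconds")
        # 10-second windows that slide forward every second.
        .groupBy(F.window("ts", "10 seconds", "1 second"))
        .agg(
            F.count("text").alias("messages"),
            F.avg(F.length("text")).alias("avg_length"),
        )
    )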

Spark’s CPU utilization averaged 14%, but the system could not process more than 133,000 messages per second — a fraction of what Flink and Quix were capable of handling. This is a problem, especially when working on Databricks, where I had to provision a whole cluster for a month.

It cost nearly $1,000 to build, set up, run and optimize this experiment across the month, so it’s disappointing to know that, effectively, only 14% of that cost was spent wisely.

Stream processing with Apache Flink

Recognized as the industry standard, Apache Flink is a scalable data stream processing library with a mature codebase and user base. Its Table API is easy to use and understand, it provides many types of rolling windows out of the box, and it scales across nodes well enough to process a high volume of data.

However, the Table API is not very flexible: it’s really a SQL wrapper compiled to the Java runtime. As a developer, I can’t implement my own classes or methods, or use external libraries in my project.
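A short PyFlink sketch shows what this looks like in practice; everything meaningful happens inside SQL strings, and the topic, field and server names are illustrative:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Flink won't ingest bare plain text here, so the generator has to
    # emit JSON records instead.
    t_env.execute_sql("""
        CREATE TABLE lorem (
            text STRING,
            ts TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '1' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'lorem-ipsum',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json'
        )
    """)

    # A sliding (HOP) window: 10-second windows advancing every second.
    result = t_env.sql_query("""
        SELECT HOP_END(ts, INTERVAL '1' SECOND, INTERVAL '10' SECOND) AS w_end,
               COUNT(text) AS messages
        FROM lorem
        GROUP BY HOP(ts, INTERVAL '1' SECOND, INTERVAL '10' SECOND)
    """)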

There is a pure Python API, but it is incomplete, and the associated DataStream features are unsupported and undocumented. Additionally, there is no straightforward way for a developer to debug data that is being processed. The Flink Kafka connector requires you to include the JAR manually. And you can’t load plain text into a single-column table; you must create a dedicated generator just for Flink (we used JSON).
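The manual JAR step looks roughly like this (path and version are illustrative):

    # The Kafka SQL connector must be shipped to the job explicitly before
    # the DDL above will resolve 'connector' = 'kafka'.
    t_env.get_config().get_configuration().set_string(
        "pipeline.jars",
        "file:///opt/flink/lib/flink-sql-connector-kafka_2.12-1.14.0.jar",
    )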

Setting up a Flink cluster isn’t easy. Because Flink is self-managed, you need to know how to set up the servers yourself. In our case, we created Docker images that support executing Flink Python code and deployed them to Azure VMs. The default image on Docker Hub doesn’t include Python, so it must be installed into the container manually.

Stream processing with Quix Streams

As the newest data stream processing platform, Quix sets itself apart by solving many of the problems that plague Spark and Flink while introducing new techniques that provide greater performance and efficiency.

Quix has tightly coupled compute with the broker and database, providing a platform that’s easy to use and set up while also delivering very high performance. It took me less than a day to run the tests on Quix, compared to multiple weeks to run them properly on Flink and Spark.


Quix offers many out-of-the-box capabilities. For example, the visualize feature lets developers see incoming data from the message broker and quickly debug running code. This, together with a built-in IDE and Git versioning, ensures that you can iterate and develop very quickly.

The Quix Streams client library is simple to use, supports any Python code, offers very high performance, and lets you stream and process data in various formats (raw binary, numeric, string) and types (events and parameters, i.e., time-series scalar values like an XYZ position).
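As a sketch of what that feels like, here is a minimal pipeline using the open source quixstreams package; the API shown here has evolved since this post was written, and the broker address and topic name are illustrative:

    from quixstreams import Application

    app = Application(broker_address="localhost:9092",
                      consumer_group="lorem-metrics")
    topic = app.topic("lorem-ipsum", value_deserializer="json")

    # A streaming DataFrame over the topic; any Python code or external
    # library can run inside the transformations.
    sdf = app.dataframe(topic)
    sdf = sdf.apply(lambda row: {"length": len(row["text"])})
    sdf.print()  # inspect the live stream while debugging

    if __name__ == "__main__":
        app.run(sdf)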

Quix’s documentation is also the easiest to use and includes many ready-to-use examples. Support for pandas DataFrames makes development easy because the in-memory handling inside pandas is done for you.

The Python implementation is a dream for data scientists. It provides a simple API to access data, either in batched chunks to improve performance or immediately as each message arrives from the broker, for the lowest latency.

Quix’s limitations include some elements that Flink has already established:

- an out-of-the-box rolling window
- auto-scaling of parallelism across nodes (currently this is manual, via a slider in the UI)
- shared state (this must be created manually by adding topics)
- internal state management (implemented manually using the feature in the client library, as sketched below)
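For that last point, here is a sketch of manual state management with the current quixstreams package, continuing the pipeline above (an assumed API shape; the key and metric names are illustrative):

    from quixstreams import State

    def running_total(row: dict, state: State) -> dict:
        # Read, update and persist a per-key counter across messages.
        total = state.get("total_words", 0) + len(row["text"].split())
        state.set("total_words", total)
        return {"total_words": total}

    # stateful=True hands each call a State object backed by the
    # library's local store.
    sdf = sdf.apply(running_total, stateful=True)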

These limitations are all being addressed by Quix on the roadmap, and, for now, the performance, ease of use, Python API, and managed infrastructure more than make up for the limitations.

Which stream processing platform is best?

I’m a big fan of trying things before you buy them. That’s why this seemingly simple experiment helped me dig deep into the pros and cons of each system. It’s clear from the performance results that Apache Spark can’t handle the demands of real-time data stream processing. At the same time, Databricks is expensive and difficult to use for stream processing applications.

Both Quix and Flink have their advantages, depending on your use case, but Flink is tough to use and a 100% DIY solution. Considering efficiency (read: cost savings) and ease of use for setup and troubleshooting, I believe Quix is the compelling choice.

Developers waste hours troubleshooting and debugging when they could be writing fresh new code. The cost of their time, and the opportunity cost of bringing new products to market, are often underestimated. Quix removes all of this pain, allowing developers to focus on their code and data from day one.

See our full breakdown if you’d like to read more about my experiment and the performance results. Also on the blog, our CTO gives a much more detailed comparison of each of these client libraries. You can also visit my GitHub repository to see the attached codebase. If you’d like to try the Quix platform for free, you can get immediate access.
