What is Kafka?
Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. Originally created at LinkedIn to handle its real-time data feeds and open-sourced in 2011, Kafka quickly evolved from a messaging queue into a full-fledged event streaming platform capable of handling over one million messages per second, or trillions of messages per day.
Why Kafka?
Today, Kafka is used by over 80% of the Fortune 100 across virtually every industry, for countless use cases big and small. It is the de facto technology developers and architects use to build the newest generation of scalable, real-time data streaming applications. While similar results can be achieved with a range of other technologies on the market, the sections below cover the main reasons Kafka is so popular.
What is a data stream in Apache Kafka?
A data stream is typically thought of as a potentially unbounded sequence of data. The name "streaming" is used because we are interested in the data being accessible as soon as it is produced.
Each application in an organization where data is created is a potential data stream creator. The records in a data stream are typically small. Throughput varies widely from stream to stream: some streams receive tens of thousands of records per second, while others receive only one or two records per hour.
Apache Kafka is used to store these data streams (also called topics), which then allows systems to perform stream processing: the act of performing continual calculations on a potentially endless and constantly evolving source of data. Once processed, the data may be transferred out of Apache Kafka to another system, e.g., a database.
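To make this concrete, here is a minimal sketch of writing records to a Kafka topic with the Java client. The broker address (localhost:9092), the topic name (page_views), and the record contents are illustrative assumptions, not details from the text above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamProducer {
    public static void main(String[] args) {
        // Connection and serialization settings; localhost:9092 is an assumed broker address.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to the (assumed) "page_views" topic as soon as it is
            // produced, making it available to downstream consumers in near real time.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("page_views", "user-42", "/products/123");
            producer.send(record);
        }
    }
}
```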
What are the use cases of Apache Kafka?
Apache Kafka has many use cases, including stream processing for different business applications: Kafka serves as the storage layer for some of the prominent stream processing frameworks, e.g., Apache Flink and Apache Samza. Other common use cases include the following (a minimal consumer sketch follows the list):
- Messaging systems
- Activity tracking
- Gathering metrics from many different locations, e.g., IoT devices
- Application log analysis
- De-coupling of system dependencies
- Integration with Big Data technologies such as Spark, Flink, Storm, and Hadoop
- Event-sourcing store
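Most of these use cases boil down to the same consumption pattern: subscribe to a topic and continually poll for new records. Here is a sketch of that pattern with the Java client, assuming the same page_views topic and broker address as above; the group id "analytics" is likewise an assumption.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics");               // consumers in one group split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page_views"));
            // An endless poll loop: the stream is unbounded, so consumption never "finishes".
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```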
Where is Apache Kafka not a great fit?
Apache Kafka is a great fit for the use cases outlined above, but there are a few use cases where using Apache Kafka is either not possible or not recommended:
- Proxying millions of clients for mobile apps or IoT: the Kafka protocol was not designed for this, but some proxies exist to bridge the gap.
- A database with indexes: Kafka is an event streaming log with no built-in analytical capability and no complex query model.
- An embedded real-time technology for IoT: there are lower-level, lighter-weight alternatives for these use cases on embedded systems.
- Work queues: Kafka is made of topics, not queues (unlike RabbitMQ, ActiveMQ, or SQS). Queues are meant to scale to millions of consumers and to delete messages once they are processed. In Kafka, data is not deleted once processed, and consumers cannot scale beyond the number of partitions in a topic (the sketch after this list illustrates the partition limit).
- Kafka as a blockchain: Kafka topics share some characteristics of a blockchain, in that data is appended to a log and topics can be immutable, but they lack key properties of blockchains such as cryptographic verification of the data and full history preservation.
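To illustrate the partition limit mentioned in the work-queues point, here is a sketch of creating a topic with a fixed partition count using the Java AdminClient. The topic name (orders), partition count, and replication factor are assumptions; the point is that a consumer group reading this topic can have at most three active consumers, one per partition, with any extra consumers sitting idle.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions, replication factor one: at most three consumers in the
            // same group can read this topic in parallel, however many are started.
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```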
How is Kafka concretely being used within the industry?
Apache Kafka is widely used in the industry. Some of the use cases are highlighted below.
- Uber uses Kafka extensively in its real-time pricing pipeline. Kafka is the backbone through which a significant proportion of events are communicated to the different stream processing calculations. The speed and flexibility of Kafka allow Uber to adjust its pricing models to constantly evolving real-world events (the number of available drivers and their positions, users and their positions, weather, and other events) and to bill users the right amount to manage supply and demand.
- Netflix has integrated Kafka as the core component of its data platform, which it refers to internally as its Keystone data pipeline. To give an idea of the amount of data Kafka can handle, Netflix sends about 500 billion events and 1.3 petabytes of data per day into Kafka through Keystone.
Unknown to many, Kafka is at the core of many of the services we enjoy daily from some of the world's largest tech companies, such as Uber, Netflix, Airbnb, LinkedIn, Apple, and Walmart.