Target Audience :
This article is for those who are completely new to Kafka or have just started learning it. When I started learning Kafka I made some notes for my own use, which I am sharing with all of you. I have documented this in very simple language with some examples. I hope this will help you to understand Kafka.
What is Kafka ?
NOTE : The Apache Kafka documentation does a great job of explaining all the terms, architecture and implementation of Kafka.
Kafka is a distributed streaming platform. Now what exactly does this mean ?
- Distributed : You can store data in a distributed fashion, i.e. on various nodes.
- Streaming : Streaming data is data that is generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (on the order of kilobytes).
- You can consume data from topics.
So in simple terms, messages/records/data generated continuously by thousands of data sources can be stored in a distributed fashion. Let's discuss the following terms :
- Commit logs
Where are the messages stored ?
Messages are stored in a `Topic`. You can think of a topic as a logical entity. Whenever you publish data to Kafka you have to specify a topic name. The name simply identifies, logically, what type of messages you are storing in Kafka.
Let's say you have created a streaming application which generates a lot of data. If you want to store that data in Kafka, you can create a topic named after your application, just to identify what kind of messages are stored in it.
So if a topic is just a logical entity, where is the actual data stored ?
Every topic has some number of partitions (a configurable value) where the actual data is stored in the form of commit logs. Each partition has its own commit log, which is immutable: records are only ever appended, never modified in place.
A commit log stores messages in the order in which they are published.
What is topic offset : Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.
In simple terms, you can think of the offset as the address of each record/message stored in the commit log.
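To make the relationship between topic, partition, commit log and offset concrete, here is a minimal in-memory sketch in Python. It is only a toy model for illustration: the `CommitLog` and `Topic` classes are names I made up, and real Kafka stores the logs on disk and does far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Append-only list of records for one partition (illustrative only)."""
    records: list = field(default_factory=list)

    def append(self, message: str) -> int:
        # The offset of a new record is simply its position in the log.
        self.records.append(message)
        return len(self.records) - 1

    def read(self, offset: int) -> str:
        # The offset acts as the record's address within this partition.
        return self.records[offset]


@dataclass
class Topic:
    """A named group of partitions, each with its own commit log."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [CommitLog() for _ in range(self.num_partitions)]


topic = Topic(name="Test", num_partitions=2)
first = topic.partitions[0].append("hello")
second = topic.partitions[0].append("world")
other = topic.partitions[1].append("another partition")

print(first, second, other)          # offsets are counted per partition
print(topic.partitions[0].read(1))   # look a record up by its offset
```

Note that offsets start again from 0 in every partition, so an offset only identifies a record together with its partition.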
Let us try some practical stuff. We will create a topic with 2 partitions.
NOTE : While reading the output, try to spot the points we have covered so far. We will discuss the rest after observing the output of the command :
# kafka-topics.sh --create --zookeeper zookeeper.server.com:2181 --replication-factor 2 --partitions 2 --topic Test
Created topic “Test”.
To get more details about the topic we just created, we can run describe on it :
# /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --describe --zookeeper `hostname -f`:2181 --topic Test
Topic:Test    PartitionCount:2    ReplicationFactor:2    Configs:
Topic: Test    Partition: 0    Leader: 1001    Replicas: 1001,1002    Isr: 1001,1002
Topic: Test    Partition: 1    Leader: 1002    Replicas: 1001,1002    Isr: 1001,1002
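If you want to pick the fields out of this output in a script, a small parser along the following lines could work. This is a sketch that assumes the output looks exactly like the sample above. The `parse_partitions` helper and the regular expression are my own, not part of Kafka, and the output format may differ between Kafka versions.

```python
import re

# Sample output copied from the describe command above (whitespace normalised).
DESCRIBE_OUTPUT = """\
Topic:Test PartitionCount:2 ReplicationFactor:2 Configs:
Topic: Test Partition: 0 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
Topic: Test Partition: 1 Leader: 1002 Replicas: 1001,1002 Isr: 1001,1002
"""

# Hypothetical pattern for one per-partition line of the output.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]+)"
)


def parse_partitions(text: str) -> list:
    """Return one dict per partition line found in the describe output."""
    partitions = []
    for line in text.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # skip the topic summary line
        partitions.append({
            "topic": match["topic"],
            "partition": int(match["partition"]),
            "leader": int(match["leader"]),
            "replicas": [int(b) for b in match["replicas"].split(",")],
            "isr": [int(b) for b in match["isr"].split(",")],
        })
    return partitions


for p in parse_partitions(DESCRIBE_OUTPUT):
    print(p)
```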
From the describe output we can see that topic Test was created with two partitions, i.e. Partition: 0 and Partition: 1. As discussed, messages are stored in the commit logs of partitions. Kafka has something called the Kafka log directory (not to be confused with service logs), where partition data is stored locally on the filesystem. In my case the Kafka log directory is /kafka-logs.
Let's see what is in it :
# ls -ltr /kafka-logs/
-rw-r--r--. 1 kafka hadoop  0 Aug  7 06:38 cleaner-offset-checkpoint
-rw-r--r--. 1 kafka hadoop 57 Aug  7 06:38 meta.properties
drwxr-xr-x. 2 kafka hadoop 70 Aug 14 21:01 Test-1
drwxr-xr-x. 2 kafka hadoop 70 Aug 14 21:01 Test-0
-rw-r--r--. 1 kafka hadoop 53 Aug 14 21:10 recovery-point-offset-checkpoint
-rw-r--r--. 1 kafka hadoop 53 Aug 14 21:11 replication-offset-checkpoint
NOTE : Please ignore the remaining files. We will discuss them in a later section.
From the above output we can see two directories, Test-1 and Test-0. In the output below, the “00000000000000000000.log” file is the commit log of partition Test-0, where messages will be stored.
# ls -ltr /kafka-logs/Test-0/
total 0
-rw-r--r--. 1 kafka hadoop        0 Aug 14 21:01 00000000000000000000.log
-rw-r--r--. 1 kafka hadoop 10485760 Aug 14 21:01 00000000000000000000.index
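The names in these listings follow a pattern: the directory is the topic name plus the partition number, and the log file is named after the offset of the first record it holds, zero-padded to 20 digits. The following Python sketch only reproduces that naming. The helper functions are hypothetical and are not something Kafka exposes.

```python
def partition_dir(topic: str, partition: int) -> str:
    # Partition directories are named <topic>-<partition>, e.g. Test-0.
    return f"{topic}-{partition}"


def segment_file(base_offset: int) -> str:
    # Log files are named after the offset of their first record,
    # zero-padded to 20 digits.
    return f"{base_offset:020d}.log"


print(partition_dir("Test", 0))   # Test-0
print(segment_file(0))            # 00000000000000000000.log
```

Since our topic is brand new, its only log file starts at offset 0, which is why the file name is all zeros.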
You might have several questions by now :
- What is 1001,1002 ?
- What is ReplicationFactor ?
- What is Replicas ?
- What is Leader ?
- What is ISR ?
What is 1001,1002 ?
Brokers : Brokers are the nodes in your cluster where the Kafka service is installed. If you have two broker nodes with the Kafka service installed and running, each broker is identified by an ID called the broker ID. In the describe output, 1001 and 1002 are the broker IDs of the two brokers in this cluster.
How to check broker ID for each node : There are two ways of checking the broker ID :
1. From meta.properties in the Kafka log directory
2. By checking the znode in ZooKeeper
# cat /kafka-logs/meta.properties
version=0
broker.id=1001
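If you need the broker ID inside a script, the file is simple enough to parse by hand. Here is a small Python sketch that assumes the file contains plain `key=value` lines as shown above. The `read_broker_id` function is just an illustrative helper of mine.

```python
def read_broker_id(properties_text: str) -> int:
    """Extract broker.id from the contents of a meta.properties file."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        key, _, value = line.partition("=")
        if key.strip() == "broker.id":
            return int(value.strip())
    raise ValueError("broker.id not found")


# Contents taken from the cat output above.
sample = "version=0\nbroker.id=1001\n"
print(read_broker_id(sample))  # 1001
```

On a real broker you would read the text from the file itself, for example from /kafka-logs/meta.properties in my setup.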
What is ReplicationFactor and replicas ?
(Please look at the above diagram for reference.) While creating a topic you specified how many partitions you wanted (please refer to the kafka-topics.sh --create command). You can also specify the replication factor for each partition of the topic you are creating.
So with “kafka-topics.sh --create” we created a topic Test with 2 partitions and a replication factor of 2. This creates two replicas of each partition (Test-0 and Test-1), one on each of the two brokers available in the cluster. So for partition Test-0, a directory named Test-0 is created on both brokers under the Kafka log directory /kafka-logs. On both nodes this directory contains commit logs. The same applies to Test-1. These copies of a partition on the broker nodes are known as replicas. (Take a look at the describe command output and observe it carefully.)
On broker 1001
[root@broker1001 ~]# ls -ltr /kafka-logs/Test-1
total 0
-rw-r--r--. 1 kafka hadoop        0 Aug 14 21:01 00000000000000000000.log
-rw-r--r--. 1 kafka hadoop 10485760 Aug 14 21:01 00000000000000000000.index
On broker 1002
[root@broker1002 ~]# ls -ltr /kafka-logs/Test-1
total 0
-rw-r--r--. 1 kafka hadoop        0 Aug 14 21:01 00000000000000000000.log
-rw-r--r--. 1 kafka hadoop 10485760 Aug 14 21:01 00000000000000000000.index
Point to remember : the number of replicas of each partition is equal to the replication factor. The replication factor cannot be greater than the number of brokers in your cluster.
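The rule is easy to see in a toy replica-placement function. The Python sketch below uses a simplified round-robin scheme that I made up for illustration. It is not Kafka's actual assignment algorithm, and the order of brokers it produces may differ from what describe shows. The only point is that every replica of a partition needs its own broker.

```python
def assign_replicas(brokers: list, num_partitions: int, replication_factor: int) -> dict:
    """Simplified round-robin placement; not Kafka's exact algorithm."""
    if replication_factor > len(brokers):
        raise ValueError(
            f"replication factor {replication_factor} is larger than "
            f"the number of available brokers ({len(brokers)})"
        )
    assignment = {}
    for partition in range(num_partitions):
        # Start at a different broker for each partition, then wrap around,
        # so no broker holds two replicas of the same partition.
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


print(assign_replicas([1001, 1002], num_partitions=2, replication_factor=2))
# {0: [1001, 1002], 1: [1002, 1001]}
```

Calling it with `replication_factor=3` and only two brokers raises an error, because there is no third broker to hold the third copy.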
What is Leader ? What is ISR ?
Whenever you publish messages to or consume messages from a topic, the leader of each partition is responsible for handling that. By default, all requests go to the leaders of the topic's partitions.
The leader of each partition is selected from among that partition's replicas. (Take a look at the describe command output and observe it carefully.) For partition Test-0 the replica on broker 1001 is the leader, while for partition Test-1 the replica on broker 1002 is the leader.
ISR stands for in-sync replicas. Whenever you write a message to a topic (partition), that message is marked as committed only once it has also been written to the in-sync replicas of the partition (you can change this behaviour, though). In this way the replicas stay in sync with the leader. (Take a look at the describe command output and observe the ISR part carefully.) In our case, for partition Test-0 of topic Test, replica 1002 is in sync with leader 1001. For partition Test-1, replica 1001 is in sync with leader 1002. In the ISR column you will observe that the leader of each partition is also present, because the leader is itself one of the replicas.
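The idea of "committed" can be sketched in a few lines of Python. Assume each replica reports its log end offset, i.e. the next offset it would write. Everything below the smallest such value among the in-sync replicas is present on all of them. Kafka calls this boundary the high watermark. The function and the sample numbers are invented for illustration only.

```python
def committed_up_to(log_end_offsets: dict, isr: list) -> int:
    """Return the offset below which every record is on all in-sync replicas.

    log_end_offsets maps broker ID -> next offset that broker would write.
    """
    return min(log_end_offsets[broker] for broker in isr)


# Leader 1001 has written 5 records (offsets 0-4); follower 1002 has copied 3.
offsets = {1001: 5, 1002: 3}
print(committed_up_to(offsets, isr=[1001, 1002]))  # 3, so offsets 0, 1, 2 are committed

# If 1002 dropped out of the ISR, only the leader would have to hold a record.
print(committed_up_to(offsets, isr=[1001]))        # 5
```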