Target Audience :
This article is for folks who are completely new to Kafka. When I started learning Kafka, I made some notes for my own use, which I am now sharing with all of you. I hope they help you understand Kafka.
Retention Policy :
How long are messages stored in Kafka ?
Old log segments are discarded after a fixed period of time, set by the value of log.retention.hours, or once the log reaches a predetermined size, set by the value of log.retention.bytes.
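To make the two retention triggers concrete, here is a minimal Python sketch of the decision a broker has to make per segment. The function name, constants, and the list-of-segments representation are assumptions for illustration, not Kafka's actual code:

```python
RETENTION_SECONDS = 7 * 24 * 3600  # analogous to log.retention.hours = 168
RETENTION_BYTES = 1_000_000       # analogous to log.retention.bytes (example value)

def segments_to_delete(segments, now):
    """segments: list of (last_modified_ts, size_bytes), oldest first.
    Returns the indexes of segments that retention would remove."""
    doomed = set()
    # Time-based retention: drop segments older than the retention window.
    for i, (ts, _) in enumerate(segments):
        if now - ts > RETENTION_SECONDS:
            doomed.add(i)
    # Size-based retention: drop oldest segments until the total size fits.
    total = sum(size for _, size in segments)
    for i, (_, size) in enumerate(segments):
        if total <= RETENTION_BYTES:
            break
        doomed.add(i)
        total -= size
    return sorted(doomed)
```

Note that either condition alone is enough to expire a segment, which mirrors how the two broker settings work independently.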
Controller in kafka :
In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks such as reassigning partitions.
How to check which broker is the controller node ?
You can check this from the ephemeral znode created in ZooKeeper. Use zkCli.sh to log in to the ZooKeeper terminal, then do a get on the /controller znode, which will return a broker ID. You can then match that broker ID against the meta.properties file under the Kafka log directory where messages are stored, as described in the first post :
# get /controller
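As a small sketch of the second step, a few lines of Python can pull broker.id out of meta.properties so you can compare it with the ID returned by get /controller. The default path here is an assumption; substitute your own log directory:

```python
def read_broker_id(path="/kafka-logs/meta.properties"):
    # meta.properties is a simple key=value file; find the broker.id entry.
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition("=")
            if key == "broker.id":
                return int(value)
    return None  # no broker.id entry found
```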
Basic Operations in kafka :
Create a topic :
# bin/kafka-topics.sh --create --zookeeper zookeeper.example.com:2181 --replication-factor 2 --partitions 2 --topic Test
⁃ To list the topics :
# bin/kafka-topics.sh --list --zookeeper zookeeper.example.com:2181
⁃ Producer script to produce messages :
# bin/kafka-console-producer.sh --broker-list kafka.example.com:6667 --topic Test
⁃ Consumer script to consume messages :
# bin/kafka-console-consumer.sh --zookeeper zookeeper.example.com:2181 --topic Test
⁃ To describe a topic :
# bin/kafka-topics.sh --describe --zookeeper zookeeper.example.com:2181 --topic Test
⁃ To list the messages from the beginning :
# bin/kafka-console-consumer.sh --zookeeper zookeeper.example.com:2181 --from-beginning --topic Test
In my previous post we discussed that messages are stored in log files.
# ls -ltr /kafka-logs/Test-0/
-rw-r--r--. 1 kafka hadoop 0 Aug 14 21:01 00000000000000000000.log
-rw-r--r--. 1 kafka hadoop 10485760 Aug 14 21:01 00000000000000000000.index
Let’s discuss the log and index files. What are these files, and what is the purpose of having two of them ?
The log file is where messages are actually stored; it is also called a commit log or segment. There can be multiple log files / commit logs / segments, depending on the value of the log.segment.bytes property: once a segment reaches that size, Kafka creates a new log file along with a corresponding index file.
For each message in the log file, the first 8 bytes (64 bits) contain its offset. Now, if we want to look up a message at a specific offset, this takes a long time when the log file is huge: we would have to walk the entire file from the first offset until we reach the desired one, which is a time-consuming process.
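To see why the scan is slow, here is a toy Python model of a segment where each record starts with its offset. This is a deliberately simplified layout for illustration (8-byte offset, 4-byte length, then the payload), not Kafka's real on-disk format:

```python
import struct

# Toy record layout: [8-byte offset][4-byte length][payload bytes] ...
def append(buf, offset, payload):
    return buf + struct.pack(">QI", offset, len(payload)) + payload

def find_by_scan(buf, target_offset):
    """Walk the log from byte 0 until a record's offset matches: O(n)."""
    pos = 0
    while pos < len(buf):
        offset, length = struct.unpack_from(">QI", buf, pos)
        if offset == target_offset:
            return buf[pos + 12 : pos + 12 + length]
        pos += 12 + length
    return None  # offset not present in this segment
```

With only the log file, every lookup costs a scan proportional to the segment size, which is exactly the cost the index file is there to remove.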
Now consider publishing new messages to Kafka. For this, Kafka must know the last offset in the partition, so it would have to perform this same kind of lookup to determine the latest offset and correctly assign increasing offsets to incoming messages.
Similarly, consider a situation where Kafka wants to delete messages from a partition after their retention period is over. This operation would also be time consuming if it had to go through the entire log file.
This is why each log file has an index file. Every log file starts with a base offset (its starting offset), and every new log file’s base offset is greater than the base offset of the previous log file.
Every entry in the index file is 8 bytes: 4 bytes to store the offset relative to the base offset, and 4 bytes to store the position of the message in the log file. Say the base offset is 10000000000000000000; rather than storing the next offsets 10000000000000000001 and 10000000000000000002 in full, the index stores just 1 and 2.
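Putting the pieces together, here is a hedged Python sketch of an index lookup: 8-byte entries holding a 4-byte relative offset and a 4-byte file position, searched with binary search. Keep in mind that Kafka's real index is sparse (it does not index every single message), so treat this as a simplified model:

```python
import struct
from bisect import bisect_right

BASE_OFFSET = 10000000000000000000  # example base offset; only small deltas are stored

def pack_index(entries):
    # Each entry is 8 bytes: 4-byte relative offset + 4-byte file position.
    return b"".join(struct.pack(">II", rel, pos) for rel, pos in entries)

def lookup(index_bytes, target_offset):
    """Binary-search the index for the file position of target_offset."""
    rel_target = target_offset - BASE_OFFSET
    entries = [struct.unpack_from(">II", index_bytes, i * 8)
               for i in range(len(index_bytes) // 8)]
    rels = [rel for rel, _ in entries]
    i = bisect_right(rels, rel_target) - 1  # last entry at or before the target
    return entries[i][1] if i >= 0 else None
```

Storing only the 4-byte delta is what keeps each entry small even when the absolute offsets are enormous, and the binary search turns the O(n) scan from the log file into an O(log n) lookup.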