In a company, a source system and a target system need to exchange data with each other.
As the number of source and target systems grows, the amount of data to exchange grows with it, and the integrations quickly become complicated. Every data transfer raises its own choices: which transfer protocol to use, how the data is parsed, what the data schema is, and many more.
This is where Apache Kafka comes in. Apache Kafka lets you decouple your data streams from your systems: source systems send their data to Apache Kafka, and target systems read their data straight from Apache Kafka, so no system has to integrate directly with every other system.
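To make the benefit of decoupling concrete, here is a minimal sketch in plain Python. It is not Kafka code: `ToyBroker`, `point_to_point_links`, and `brokered_links` are hypothetical names invented for this illustration, and the broker is just an in-memory dictionary of lists. It compares how many integrations are needed with and without a central broker, and shows two target systems reading the same data independently.

```python
from collections import defaultdict


def point_to_point_links(sources: int, targets: int) -> int:
    """Integrations needed when every source talks to every target directly."""
    return sources * targets


def brokered_links(sources: int, targets: int) -> int:
    """Integrations needed when every system talks only to a central broker."""
    return sources + targets


class ToyBroker:
    """In-memory stand-in for a broker: named topics holding append-only logs."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        # A source system appends a record to the end of the topic's log.
        self.topics[topic].append(record)

    def read(self, topic, offset=0):
        # A target system reads from any offset without affecting other readers.
        return self.topics[topic][offset:]


broker = ToyBroker()
broker.publish("orders", {"id": 1, "amount": 30})
broker.publish("orders", {"id": 2, "amount": 12})

billing_view = broker.read("orders")     # one target system
analytics_view = broker.read("orders")   # another target, same data

print(point_to_point_links(4, 6))  # 24
print(brokered_links(4, 6))        # 10
```

With 4 sources and 6 targets, direct integration needs 24 links while the brokered layout needs 10, and the gap widens as more systems are added.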
What is a Stream?
In general, a stream can be defined as an unbounded, continuous flow of data packets in real time. The data packets take the form of key-value pairs, and they are moved from the source automatically; there is no need to place a request for them.
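As a rough illustration of this definition, a Python generator can model an unbounded sequence of key-value records. The sensor names and values below are made up for the example; the point is only that the stream never ends, so a reader can only ever take a finite slice of it.

```python
import itertools
from typing import Iterator, Tuple


def sensor_stream() -> Iterator[Tuple[str, int]]:
    """Yield (key, value) records forever; the stream has no defined end."""
    sensors = ["sensor-a", "sensor-b"]
    for n in itertools.count():
        yield sensors[n % len(sensors)], n * 10


# A reader can only ever look at a finite window of an unbounded stream.
first_four = list(itertools.islice(sensor_stream(), 4))
print(first_four)
# [('sensor-a', 0), ('sensor-b', 10), ('sensor-a', 20), ('sensor-b', 30)]
```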
Apache Kafka Streams API Architecture
Kafka Streams uses the Kafka producer and consumer libraries internally. It is tightly coupled with Kafka, and the API lets you build on Kafka's own capabilities to obtain data parallelism, fault tolerance, and numerous other powerful features.
The Kafka Streams architecture has the following components:
- Input Stream
- Output Stream
- Instance
The instance consists of the following three parts.
- Consumer
- Local State
- Stream Topology
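These components work together as follows:
- The input stream and the output stream hold the input and output data in the Kafka cluster.
- Every instance contains a Consumer, a Stream Topology, and a Local State.
- The Stream Topology is the flow, or DAG (directed acyclic graph), of processing steps through which the given task is completed.
- Intermediate results produced while the topology runs, for example running aggregates computed after Map or FlatMap steps, are kept in a local store called the State.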
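The components above can be sketched as a toy word count in plain Python. This is not the Kafka Streams API: `flat_map_words` and `run_topology` are hypothetical helpers, the input and output streams are ordinary lists, and the local state is a `Counter`. The sketch only shows the shape of a topology, with a stateless step feeding a stateful one.

```python
from collections import Counter


def flat_map_words(record):
    """Stateless step: split one (key, line) record into many (word, 1) records."""
    _key, line = record
    return [(word.lower(), 1) for word in line.split()]


def run_topology(input_stream, state):
    """Source -> flatMap -> stateful count -> sink, applied record by record."""
    output_stream = []
    for record in input_stream:
        for word, one in flat_map_words(record):
            state[word] += one                         # update the local state
            output_stream.append((word, state[word]))  # emit the updated count
    return output_stream


local_state = Counter()
input_stream = [(None, "kafka streams"), (None, "Kafka topics")]
output_stream = run_topology(input_stream, local_state)

print(output_stream)
# [('kafka', 1), ('streams', 1), ('kafka', 2), ('topics', 1)]
```

The flatMap step keeps nothing between records, while the counting step depends on the local state, which is why the second "kafka" is emitted with a count of 2.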
To improve data parallelism, we can simply increase the number of instances.
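A small sketch of why adding instances helps, assuming the input is split into partitions that are divided among the running instances. The round-robin rule in `assign_partitions` is a simplification chosen for this example, not the assignment strategy Kafka actually uses.

```python
def assign_partitions(num_partitions: int, num_instances: int) -> dict:
    """Spread partitions over instances round-robin (a toy stand-in)."""
    assignment = {i: [] for i in range(num_instances)}
    for p in range(num_partitions):
        assignment[p % num_instances].append(p)
    return assignment


print(assign_partitions(6, 2))  # {0: [0, 2, 4], 1: [1, 3, 5]}
print(assign_partitions(6, 3))  # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

Going from two instances to three reduces each instance's share from three partitions to two. In this toy model, instances beyond the number of partitions would receive no work.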
Kafka Streams Features
Elastic
Apache Kafka, an open-source project, was designed to be highly available and horizontally scalable. The Kafka Streams API is built on Kafka, so it inherits this elasticity and can be scaled out easily.
Fault-tolerant
The data logs are partitioned, and the partitions are distributed across the servers in the cluster, each of which handles data and requests for its share of the partitions. Kafka achieves fault tolerance by replicating each partition over several servers.
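The effect of replication can be sketched with a toy placement function. The broker names and the "consecutive brokers" placement rule are assumptions made for this example, and `place_replicas` and `surviving_partitions` are hypothetical helpers rather than Kafka functions.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Put each partition's copies on distinct, consecutive brokers (toy rule)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_partitions(placement, failed_broker):
    """Partitions that still have at least one copy after one broker fails."""
    return [
        p for p, replicas in placement.items()
        if any(b != failed_broker for b in replicas)
    ]


placement = place_replicas(3, ["b0", "b1", "b2"], replication_factor=2)
print(placement)
# {0: ['b0', 'b1'], 1: ['b1', 'b2'], 2: ['b2', 'b0']}
print(surviving_partitions(placement, "b1"))
# [0, 1, 2]
```

With two copies of every partition, losing any single broker still leaves each partition readable from another broker.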
Highly viable
Since Kafka clusters are highly available, they can be used for any use case irrespective of size. They are suited to small, medium, and large-scale use cases alike.
Integrated Security
Kafka offers strong security for data through three major security components:
- Authorization using ACLs
- Encryption of data using SSL/TLS
- Authentication using SSL or SASL
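As a hedged illustration of how encryption and authentication show up on the client side, the sketch below builds a small set of client properties and renders them in `key=value` form. The property names `security.protocol`, `sasl.mechanism`, and `ssl.truststore.location` are standard Kafka client settings; the values, the truststore path, and the `to_properties` helper are assumptions for this example. ACL-based authorization is configured on the broker and is not shown here.

```python
client_security_config = {
    # Encrypt traffic with TLS and authenticate with SASL.
    "security.protocol": "SASL_SSL",
    # One of the SASL mechanisms Kafka supports; chosen here only as an example.
    "sasl.mechanism": "SCRAM-SHA-512",
    # Hypothetical path to the truststore holding the broker's CA certificate.
    "ssl.truststore.location": "/path/to/client.truststore.jks",
}


def to_properties(config: dict) -> str:
    """Render a config dict as key=value lines, one setting per line."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(client_security_config))
```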
Support for Java and Scala
Because Kafka supports Java and Scala, designing and deploying a Kafka server-side application in either language is much easier.
Exactly-once processing semantics
Exactly-once processing means that the effect of the user's program on each record is applied only once, and the resulting data in the state is committed only once by the SPE (stream processing engine), even if a failure forces a record to be processed again.
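One way to picture this guarantee is a processor that commits its state and its input position together, so that work done before a crash but not committed is simply discarded. The sketch below is a toy model under that assumption, not how Kafka Streams is implemented; `ToyProcessor` and its `crash_before_commit_at` parameter are hypothetical.

```python
class ToyProcessor:
    """Keeps a running sum; state and input offset are committed together."""

    def __init__(self):
        self.committed = {"offset": 0, "total": 0}

    def process(self, log, crash_before_commit_at=None):
        # Always resume from the last committed snapshot.
        offset = self.committed["offset"]
        total = self.committed["total"]
        while offset < len(log):
            total += log[offset]
            offset += 1
            if crash_before_commit_at == offset:
                return False  # simulated crash: uncommitted work is lost
            # Offset and state are replaced in one step, never one without the other.
            self.committed = {"offset": offset, "total": total}
        return True


log = [5, 7, 11]
processor = ToyProcessor()
processor.process(log, crash_before_commit_at=2)  # crashes while handling the 2nd record
processor.process(log)                            # restart resumes from offset 1
print(processor.committed)  # {'offset': 3, 'total': 23}
```

The second record is processed twice, once before the crash and once after the restart, but it contributes to the committed total only once.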
Which Companies Use Kafka?
35% of Fortune 500 companies use Kafka, and its users include LinkedIn, Airbnb, Netflix, Uber, Walmart, and many others. Let's take a look at some concrete examples.
- Netflix uses Kafka to apply recommendations in real time while you're watching TV shows. This is why you get a new recommendation right away when you leave a TV show.
- Uber uses Kafka to gather user, taxi, and trip data in real time, to compute and forecast demand, and to compute surge pricing in real time.
- LinkedIn uses Kafka to prevent spam on its platform, and to collect user interactions and make better connection recommendations, all in real time.