169 questions and answers about Apache Kafka and growing.
When a _____ detects a quota violation, it computes and returns the amount of delay needed to bring the violating client under its quota. It then _____ to the client, refusing to process requests from it until the delay is over. The client will also refrain from sending further requests to the broker during the delay.
broker / mutes the channel bi-directionally
For any Linux filesystem used for data directories, enabling the _____ option is recommended, as it disables updating of a file's atime (last access time) attribute when the file is read. This can eliminate a significant number of filesystem writes, as Kafka does not rely on the atime attributes at all.
noatime
Could you use an asynchronous workflow to do expensive work (such as periodic data aggregation) in advance?
Yes
Can message queues receive messages?
Yes
Can message queues hold messages?
Yes
Can message queues deliver messages?
Yes
Can Redis be used as a message broker?
Yes
Can messages be lost in a Redis message broker?
Yes
An application publishes a job to a message queue, then notifies the user of the job status. A _____ picks up the job from the queue, processes it, then signals its completion.
worker
In asynchronous workflows, jobs are processed in the _____ without blocking the user. For example, a tweet can instantly appear on your timeline, but could take some time before it is actually delivered to followers.
background
Can queues add delays to operations?
Yes
When dealing with many inexpensive or realtime operations, are queues a good use case?
They can be, but they can introduce delay and complexity compared to synchronous execution.
A queue has grown significantly, becoming larger than available memory. What are some problems that may appear?
Cache misses, disk reads, slower performance
_____ pressure limits queue sizes, allowing for good throughput / latency for jobs already in the queue. Once filled, the queue's clients are asked to try again later.
Back pressure
What protocol is used in RabbitMQ message queues?
AMQP
_____ receive tasks and their related data, run them, then deliver their results. They can support scheduling.
Task queues
Are real-time payments and financial transactions a use case for event streaming?
Yes
Is real-time shipment/logistics monitoring a use case for event streaming?
Yes
Is real-time IoT device monitoring a use case for event streaming?
Yes
Is user interaction telemetry a use case for event streaming?
Yes
Is microservice implementation a use case for event streaming?
Yes
Kafka's three key capabilities are: To _____/_____ to streams of events. To _____ streams of events durably, reliably and indefinitely. To _____ streams of events, as they occur or retrospectively.
publish / subscribe / store / process
Can Kafka implement continuous import/export of your data from/to other systems?
Yes
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the _____.
brokers
If a Kafka server fails, the other servers will _____ to ensure continuous operations without any data loss.
take over its work
Reading / writing data to Kafka is done in the form of _____.
events
An event consists of:
- _____: "Alice"
- _____: "Made a payment of $200 to Bob"
- _____: "Jun. 25, 2020 at 2:06 p.m."
- _____ (optional)
a key
a value
a timestamp
metadata (optional)
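A minimal sketch of how these four parts map onto the Java producer API; the broker address, topic name, and header are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class EventAnatomy {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // key, value, and an explicit timestamp; headers carry the optional metadata
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "payments",                        // topic (hypothetical)
                    null,                              // partition: let the partitioner decide
                    1593088020000L,                    // timestamp in epoch millis (approx. Jun. 25, 2020)
                    "Alice",                           // key
                    "Made a payment of $200 to Bob");  // value
            record.headers().add("source", "flashcard-example".getBytes()); // optional metadata
            producer.send(record);
        }
    }
}
```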
_____ are client applications that publish (write) events to Kafka
Producers
_____ are clients that subscribe to (read and process) Kafka events.
consumers
Do producers sometimes need to wait for consumers by design?
No - they are fully decoupled
Events are organized and durably stored in _____, similar to files stored in a folder.
topics
An example _____ name could be "payments".
topic
_____ in Kafka are always multi-producer and multi-subscriber: each can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events.
Topics
A Kafka event has been consumed. What happens to it?
It is retained for as long as it is defined to be retained, configured per topic.
Topics are partitioned, meaning a topic is spread over a number of "_____" located on different Kafka brokers.
buckets
Topics are distributed via partitioning (buckets). This improves scalability because it allows client applications to both read and write the data from/to many _____ at the same time.
brokers
Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as _____.
they were written
When a new event is published to a topic, it is actually appended to one of the topic's _____. Events with the same event key (such as ID) are all written to the same one.
partitions
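Kafka's default partitioner achieves this by hashing the record key (murmur2) modulo the partition count. A simplified, self-contained sketch of the idea, using CRC32 as a stand-in hash rather than Kafka's actual internals:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class KeyPartitioner {
    // Stand-in for Kafka's murmur2: any stable hash demonstrates the property
    // that equal keys always map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        // "Alice" lands on the same partition every time, preserving per-key ordering
        System.out.println(partitionFor("Alice", 4));
        System.out.println(partitionFor("Alice", 4)); // same result as above
        System.out.println(partitionFor("Bob", 4));   // possibly a different partition
    }
}
```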
Can Kafka replace a traditional message broker?
Yes
Can Kafka be used for log aggregation?
Yes
Can Kafka process data in multiple-stage pipelines, where raw input data is consumed from topics, then aggregated/enriched/transformed into new topics for further consumption and processing?
Yes
Can an event represent a payment transaction?
Yes
Can an event represent a geolocation update?
Yes
Can an event represent a shipping order?
Yes
Can Kafka support log aggregation?
Yes
Does Kafka support large data backlogs?
Yes
In Kafka, can you process feeds to create new, derived feeds?
Yes - implemented by partitioning and the consumer model.
Kafka relies heavily on the _____ for storing and caching messages.
filesystem
In Kafka, using the filesystem and relying on _____ is superior to maintaining an in-memory cache or other structure: we at least double the available cache by having automatic access to all free memory, and likely double it again by storing a compact byte structure rather than individual objects.
pagecache
The Kafka protocol is built around a "_____" abstraction, where network requests group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.
message set
Byte copying can be an inefficiency under heavy load. To avoid this, we employ a standardized binary message format that is shared by the _____, the _____ and the _____ (so data chunks can be transferred without modification between them).
producer / broker / consumer
The _____ maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer.
message log
The producer sends data directly to the broker that is the _____ for the partition. To help the producer do this, all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time, allowing the producer to direct its requests appropriately.
leader
The Kafka _____ works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. It specifies its offset in the log with each request and receives back a chunk of log beginning from that position, with the possibility of rewinding to re-consume data as needed.
consumer
In Kafka, data is pushed from the _____ to the _____.
producer / broker
In Kafka, data is pulled from the _____ by the _____.
broker / consumer
A _____-based system like Kafka has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems.
pull
A consumer can deliberately _____ back to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if a bug in the consumer code is discovered after some messages were consumed, the consumer can re-consume those messages once the bug is fixed.
rewind
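A sketch of a consumer that fetches from an explicit offset and rewinds to re-consume, assuming a local broker and a hypothetical "payments" topic:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "example-group");           // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("payments", 0);
            consumer.assign(List.of(partition)); // manual assignment, for clarity

            consumer.seek(partition, 42L); // rewind: the next fetch starts at offset 42
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```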
The position of a consumer in each partition is a single integer: the _____ of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed, which makes the equivalent of message acknowledgements very cheap.
offset
"_____" delivery means messages may be lost but are never redelivered.
At most once
"_____" delivery means messages are never lost but may be redelivered.
At least once
"_____" delivery means messages are delivered once and only once. Kafka supports this (since 0.11) via idempotent producers, transactions, and read_committed consumers; the default behavior is at-least-once.
Exactly once
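A hedged sketch of the producer side of exactly-once, using the idempotence and transactions APIs; the transactional id, topic, and broker address are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ExactlyOnceProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");          // retries cannot duplicate
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");  // hypothetical id
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "Alice", "paid $200"));
                producer.commitTransaction(); // all-or-nothing across the sends above
            } catch (Exception e) {
                // sketch only: fatal errors such as ProducerFencedException require close(), not abort()
                producer.abortTransaction();  // read_committed consumers skip aborted records
                throw e;
            }
        }
    }
}
```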
When publishing a message, Kafka has a notion of the message being "_____" to the log. It will not be lost as long as one broker that replicates the partition to which the message was written remains "alive". If a producer attempts to publish a message and experiences a network error, it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.
committed
Kafka replicates the log for each topic's partitions across a configurable number of servers. Can you set this replication factor per topic?
Yes
Kafka replicates the _____ for each topic's partitions across a configurable number of servers.
log
Kafka is meant to be used with replication by default - in fact, we implement un-replicated topics as replicated topics where the replication factor is one. The unit of replication is the topic _____.
partition
Under non-failure conditions, each partition in Kafka has a single _____ and zero or more _____.
leader / followers
The total number of partition replicas, including the leader, constitutes the replication factor. All reads and writes go to the _____ of the partition.
leader
Typically, there are many more partitions than _____ and the leaders are evenly distributed among _____.
brokers / brokers
The logs on the _____ are identical to the leader's log - all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet unreplicated messages at the end of its log).
followers
Followers consume messages from the _____ just as a normal Kafka consumer would, and apply them to their own log.
leader
A Kafka node is "in sync" if it meets 2 conditions:
1. A node must be able to maintain its session with _____.
2. If it is a follower, it must _____ writes happening on the leader without falling too far behind.
ZooKeeper / replicate
The _____ keeps track of the set of "in sync" nodes.
leader
The determination of _____ replicas is controlled by the replica.lag.time.max.ms configuration.
stuck and lagging
If a follower dies, gets stuck, or falls behind, the _____ will remove it from the list of in sync replicas.
leader
A message is considered committed when all _____ for that partition have applied it to their log.
in sync replicas
A _____ message will not be lost, as long as there is at least one in sync replica alive, at all times.
committed
_____ have the option of waiting for the message to be committed, depending on their preference for tradeoff between latency and durability. This preference is controlled by the acks setting that the producer uses.
Producers
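For instance, a producer that waits for full commit sets acks=all; a sketch of the relevant configuration (broker address assumed):

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class AcksConfig {
    public static Properties durableProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        // acks=0   -> fire-and-forget (lowest latency, weakest durability)
        // acks=1   -> wait for the partition leader only
        // acks=all -> wait until all in-sync replicas have the message
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Pairs with the topic/broker setting min.insync.replicas, which bounds how
        // small the ISR may shrink before acks=all writes are rejected.
        return props;
    }
}
```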
Topics have a setting for the "minimum number" of in-sync replicas that is checked when the _____ requests acknowledgment that a message has been written to the full set of in-sync replicas. If a less stringent acknowledgement is requested by the _____, then the message can be committed, and consumed, even if the number of in-sync replicas is lower than the minimum (e.g. it can be as low as just the leader).
producer / producer
A Kafka partition is a replicated _____ which models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). A leader chooses the ordering of values provided to it. As long as the leader remains alive, all followers need only copy the values and ordering the leader chooses.
log
To choose its quorum set, Kafka dynamically maintains a set of _____ that are caught-up to the leader. Only members of this set are eligible for election as leader.
in-sync replicas (ISR)
A write to a Kafka partition is not considered committed until _____ have received the write.
all in-sync replicas
Kafka does not require that crashed nodes recover with all their data intact. Before being allowed to rejoin the _____, a replica must fully re-sync, even if it lost unflushed data in its crash.
ISR
Kafka's guarantee with respect to data loss is predicated on _____ remaining in sync.
at least one replica
Systems must do something when all the replicas die, usually choosing between availability and consistency:
- DEFAULT: Wait for a replica in the _____ to come back to life and choose this replica as the leader (hopefully it still has all its data). Kafka will remain unavailable as long as those replicas are down. If they or their data are gone, it is lost.
- Choose the first replica (not necessarily in the _____) that comes back to life as the leader. If a non-in-sync replica comes back to life and we allow it to become leader, then its log becomes the source of truth, even though it is not guaranteed to have every committed message.
ISR / ISR
When writing to Kafka, _____ can choose whether they wait for the message to be acknowledged by replicas. Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message.
producers
If a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify _____ will succeed. However, these writes could be lost if the remaining replica also fails. Although this ensures maximum availability of the partition, this behavior may be undesirable to some users who prefer durability over availability.
acks=all
A topic can disable _____ - if all replicas become unavailable, then the partition will remain unavailable until the most recent leader becomes available again. This prefers unavailability over the risk of message loss.
unclean leader election
A topic can specify a minimum _____ - the partition will only accept writes if the size of the ISR is above a certain minimum, in order to prevent the loss of messages that were written to just a single replica, which subsequently becomes unavailable. This setting only takes effect if the producer uses acks=all, and it guarantees that the message will be acknowledged by at least this many in-sync replicas. It offers a trade-off between consistency and availability: a higher minimum ISR size guarantees better consistency, since the message is guaranteed to be written to more replicas, reducing the probability of loss; but it reduces availability, since the partition will be unavailable for writes if the number of in-sync replicas drops below the threshold.
ISR size
A Kafka cluster will manage thousands of topic partitions, balanced within the cluster in a _____ fashion to avoid clustering all partitions for high-volume topics on a small number of nodes.
round-robin
Kafka balances leadership so that each _____ is the leader for a proportional share of its partitions.
node
Log _____ ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. Use cases:
- restoring state after application crashes or system failure
- reloading caches after application restarts during operational maintenance
compaction
Log _____ gives us a more granular retention mechanism, so that we are guaranteed to retain at least the last update for each primary key (e.g. bob@gmail.com). By doing this we guarantee that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.
compaction
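A sketch of creating a compacted topic with the Java AdminClient; the topic name, partition count, and replication factor are hypothetical:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("user-emails", 3, (short) 3) // hypothetical sizing
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            // keep delete markers at least 24h for lagging consumers
                            TopicConfig.DELETE_RETENTION_MS_CONFIG, "86400000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```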
Log compaction can be useful for _____. This is a style of application design which co-locates query processing with application design and uses a log of changes as the primary store for the application.
Event sourcing
Log _____ is useful when you have a data set in multiple data systems, and one of these systems is a database. For example, you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database will need to be reflected in the cache, the search cluster, and eventually in Hadoop. If you are only handling real-time updates, you only need the recent log; but if you want to be able to reload the cache or restore a failed search node, you may need a complete data set.
compaction
Log _____ is useful when a process that does local computation is made fault-tolerant by logging out the changes it makes to its local state, so another process can reload these changes and carry on if it should fail. A concrete example is handling counts, aggregations, and other "group by"-like processing in a stream query system. Samza, a real-time stream-processing framework, uses this feature for exactly this purpose.
compaction
Any _____ that stays caught-up to within the head of the log will see every message that is written; these messages will have sequential offsets.
- The topic's min.compaction.lag.ms can be used to guarantee the minimum length of time that must pass after a message is written before it could be compacted, i.e. it provides a lower bound on how long each message will remain in the (uncompacted) head.
- The topic's max.compaction.lag.ms can be used to guarantee the maximum delay between the time a message is written and the time the message becomes eligible for compaction.
consumer
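These are ordinary topic configs; a sketch of setting them on an existing topic via the AdminClient (topic name and values hypothetical):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CompactionLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-emails");
            List<AlterConfigOp> ops = List.of(
                    // messages stay uncompacted for at least 10 minutes...
                    new AlterConfigOp(new ConfigEntry("min.compaction.lag.ms", "600000"),
                            AlterConfigOp.OpType.SET),
                    // ...and become eligible for compaction within 7 days
                    new AlterConfigOp(new ConfigEntry("max.compaction.lag.ms", "604800000"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```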
Ordering of messages is always maintained. Log compaction will never re-order messages, just _____ some.
remove
With log compaction, the offset for a message never changes. It is the permanent _____ for a position in the log.
identifier
Log _____ guarantees that any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic's delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.
compaction
Log compaction is handled by the _____, a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log.
log cleaner
Can log cleaning be enabled per-topic?
Yes
The Kafka cluster has the ability to enforce _____ on requests to control the broker resources used by clients.
quotas
Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a quota:
- _____ quotas define byte-rate thresholds (since 0.9)
- _____ quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)
Network bandwidth
Request rate
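A sketch of configuring a byte-rate quota for a (user, client-id) group with the AdminClient; the entity names and rate here are hypothetical:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ProduceQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // quota group: all connections of user "test-user" with client-id "test-client"
            ClientQuotaEntity entity = new ClientQuotaEntity(Map.of(
                    ClientQuotaEntity.USER, "test-user",
                    ClientQuotaEntity.CLIENT_ID, "test-client"));
            // 10 MB/sec produce quota, shared across the whole group
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                    List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```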
Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the _____ system call.
sendfile
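In the JVM, this code path is reached through FileChannel.transferTo, which maps to sendfile on Linux. A minimal illustration of the zero-copy call; the segment path and destination address are hypothetical:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(
                     Path.of("/tmp/00000000000000000000.log"), // hypothetical log segment
                     StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = log.size();
            while (remaining > 0) {
                // transferTo lets the kernel copy pagecache -> socket directly
                // (sendfile on Linux), skipping user-space buffer copies entirely
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```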
When a machine crashes, or data needs to be re-loaded or re-processed, one needs to do a full load. _____ allows feeding both of these use cases off the same backing topic.
Log compaction
All connections of a quota group share the quota configured for the group. For example, if (user="test-user", client-id="test-client") has a produce quota of 10MB/sec, this is shared across all _____ instances of user "test-user" with the client-id "test-client".
producer
Quotas can be applied to (user, client-id), user or client-id groups. For a given connection, the _____ quota matching the connection is applied.
most specific
The identity of Kafka clients is the _____, which represents an authenticated user in a secure cluster.
user principal
The tuple (_____, _____) defines a secure logical group of clients that share both user principal and client-id.
user, client-id
In a cluster that supports unauthenticated clients, the _____ is a grouping of unauthenticated users chosen by the broker using a configurable PrincipalBuilder.
user principal
_____ is a logical grouping of clients with a meaningful name chosen by the client application.
Client-id
By default, each unique client group receives a fixed quota as configured by the cluster. This quota is defined and utilized by clients on a per-_____ basis before getting throttled.
broker
Messages consist of the variable-length items:
- a _____
- an opaque _____ byte array
- an opaque _____ byte array
header / key / value
Messages are also known as...
records
Messages are always written in _____.
batches
The Kafka consumer tracks the maximum offset it has consumed in each partition, and has the capability to _____ offsets so that it can resume from those offsets in the event of a restart.
commit
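A sketch of a poll loop that commits offsets manually after processing; the broker, group, and topic names are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "payments-processor");        // hypothetical group
        props.put("enable.auto.commit", "false");           // we commit explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application logic goes here
                }
                // checkpoint: the committed position is just one number per partition
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```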
Kafka provides the option to store all the offsets for a given consumer group in a designated broker (for that group) called the _____. Any consumer instance in that consumer group should send its offset commits and fetches to that group coordinator (broker). Consumer groups are assigned to coordinators based on their group names.
group coordinator
You have the option of either adding topics manually or having them be created automatically when data is first published to _____.
a non-existent topic
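A sketch of creating a topic manually with the AdminClient, including a per-topic retention override; all names and values are hypothetical:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 4, (short) 3) // 4 partitions, replication factor 3
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000")); // retain events 7 days
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```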
The Kafka cluster will automatically detect any broker shutdown or failure and elect new _____ for the partitions on that machine.
leaders
We refer to the process of replicating data between Kafka clusters as "_____", to avoid confusion with the replication that happens amongst the nodes in a single cluster.
mirroring
Does Kafka come with a tool for mirroring data between clusters?
Yes
To add servers to a Kafka cluster, assign them a _____ and start up Kafka on them.
unique broker ID
New Kafka servers will not automatically be assigned any data partitions, so unless partitions are moved to them they won't be doing any work until new _____ are created.
topics
The partition reassignment tool can be used to move partitions across _____.
brokers
Kafka lets you apply a _____ to replication traffic, setting an upper bound on the bandwidth used to move replicas from machine to machine. This is useful when rebalancing a cluster, bootstrapping a new broker, or adding or removing brokers, as it limits the impact these data-intensive operations will have on users.
throttle
The most important consumer configuration is the _____.
fetch size
Kafka always immediately writes all data to the filesystem, and supports the ability to configure the _____ policy that controls when data is forced out of the OS cache and onto disk using the _____. It can force data to disk after a period of time or after a certain number of messages has been written.
flush / flush
Kafka must eventually call _____ to know that data was flushed. When recovering from a crash, for any log segment not known to be _____'d, Kafka will check the integrity of each message by checking its CRC, and will also rebuild the accompanying offset index file as part of the recovery process executed on startup.
fsync / fsync
EXT4 has had more usage, but recent improvements to the _____ filesystem have shown it to have better performance characteristics for Kafka's workload, with no compromise in stability.
XFS
_____ is the practice of capturing, storing, processing, routing and reacting to streams of events built from event sources (databases, devices, software).
Event streaming
Is implementing data platforms and event-driven architecture a use case for event streaming?
Yes
Kafka is a distributed system consisting of _____ and _____ that communicate via a binary protocol over TCP.
clients and servers
Servers that run _____ continuously import/export data as event streams, integrating Kafka with your existing systems, databases or other Kafka clusters.
Kafka Connect
_____ allow you to write distributed applications and microservices that read/write/process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures.
Clients
An _____ records the fact that something happened in your system.
event (or "record"/"message")
Does Kafka's performance degrade as data size grows?
Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
Draw a diagram:
- A topic has four partitions P1, P2, P3, P4.
- Two different producers are independently publishing new events to the topic by writing events over the network to the topic's partitions. Both can write to the same partition if appropriate.
- Events with the same key (denoted by their color in the diagram) are written to the same partition.
A topic can be made fault-tolerant and highly available by replicating it across datacenters, so that there are always multiple _____ that have a copy of the data in case things go wrong, you want to do maintenance on the brokers, and so on. A common production setting is a replication factor of 3, i.e., there will always be three copies of your data. This replication is performed at the level of topic-partitions.
brokers
Can Kafka be used to aggregate monitoring statistics from distributed applications to create centralized feeds of operational data?
Yes
Log aggregation typically collects physical log files off servers and puts them in HDFS or a central server for processing. Kafka abstracts away the details of files and gives a cleaner, lower-latency abstraction of log/event data as _____. This allows for easier support for multiple data sources and distributed data consumption.
a stream of messages
A processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called _____ is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
Kafka Streams
_____ is a style of application design where state changes are logged as a time-ordered sequence of records.
Event sourcing
Kafka can serve as an external commit-log for a distributed system, helping replicate data between nodes and re-syncing failed nodes to restore their data. The _____ feature in Kafka helps support this usage.
log compaction
Do you have to create a topic before writing your events?
Yes
Events in Kafka are durably stored. Can they be read any number of times by any number of consumers?
Yes
_____ allows you to integrate (via 'connectors') and continuously ingest data from existing, external systems into Kafka, and vice versa.
Kafka Connect
You can process events with the _____ Java/Scala client library. The library supports exactly-once processing, stateful operations and aggregations, windowing, joins, processing based on event-time, etc.
Kafka Streams
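A minimal Kafka Streams topology in the spirit of the article pipeline above; the topic names and application id are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class ArticlePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-normalizer"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> articles = builder.stream("articles");
        articles
                .filter((key, value) -> value != null && !value.isBlank()) // drop empty crawls
                .mapValues(value -> value.trim().toLowerCase())            // "normalize" the content
                .to("articles-cleansed");                                  // publish the derived topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```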
If your disk usage favors linear reads, then read-ahead is effectively pre-populating this cache with useful data on each disk read. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk; in effect it is transferred into the kernel's _____.
pagecache
Efficient message compression requires compressing multiple messages together rather than compressing each message individually. A "_____" of messages can be clumped together, compressed, and sent to the server in this form. It will be written in compressed form, will remain compressed in the log, and will only be decompressed by the consumer.
batch
To enable batching, the Kafka producer will attempt to accumulate data in memory and to send out larger batches of N messages in a single _____.
request
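A sketch of the producer settings that control this batching; the values are illustrative, not recommendations:

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class BatchingConfig {
    public static Properties batchingProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);       // accumulate up to 64 KB per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms for more records
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // whole batches are compressed together
        return props;
    }
}
```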
Kafka's topics are divided into a set of totally ordered _____, each consumed by exactly one consumer within each subscribing consumer group at any given time.
partitions
_____ aims to improve the availability of stream applications, consumer groups and other applications built on top of the group rebalance protocol. The rebalance protocol relies on the group coordinator to allocate entity ids to group members. These generated ids are ephemeral and will change when members restart and rejoin. Kafka's group management protocol allows group members to provide persistent entity ids. Group membership remains unchanged based on those ids, thus no rebalance will be triggered.
Static membership
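Static membership is enabled on the consumer by setting a persistent instance id; a sketch, with hypothetical group and instance ids:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class StaticMember {
    public static Properties staticMemberProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");      // hypothetical group
        // Persistent entity id: survives restarts, so rejoining with the same id
        // does not trigger a rebalance (requires brokers >= 2.3)
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "worker-1");
        return props;
    }
}
```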
If a majority of servers suffer a permanent failure, then you must either choose to lose _____ of your data or violate _____ by taking what remains on an existing server as your new source of truth.
100% / consistency