## Current status and the problem

Previous algorithm in risingwavelabs/rfcs#72:

We only track `backfill_offset`, but not `upstream_offset`, and we finish backfill only when `upstream_offset > backfill_offset`. So if there's no data coming from upstream, we can never finish backfill.

This wasn't a problem until we wanted blocking DDL on source (#15587): the DDL would be blocked forever until new messages come, which would confuse users.
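To make the failure mode concrete, here is a minimal sketch of the old finish condition (the names are illustrative, not the actual RisingWave code; offsets are simplified to `u64`). With no upstream traffic, `upstream_offset` is never observed, so the condition can never fire:

```rust
/// Illustrative sketch of the previous finish condition.
/// `upstream_offset` is only known after at least one upstream message arrives;
/// with no upstream traffic it stays `None`, so this never returns `true`,
/// and a blocking DDL would wait forever.
fn backfill_finished(backfill_offset: u64, upstream_offset: Option<u64>) -> bool {
    match upstream_offset {
        Some(up) => up > backfill_offset,
        None => false,
    }
}
```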
## Solution
Since the message stream is unbounded, we have to do this kind of "compare progress" check between the backfill stream and the upstream stream. But we can improve it by checking on the backfill side instead of the upstream side, since the former is more controllable in the `SourceBackfillExecutor`.

We add a new in-memory state `target_offset`, which is updated to `upstream_offset` when upstream messages come. Then we can compare progress and finish backfill on the backfill side. (To be more precise, we can only enter the `SourceCachingUp` stage and wait for `upstream_offset` to become larger than it.) After this change we still cannot finish backfill when there's no upstream message, but it may end slightly faster.

Initialize `target_offset` to the latest available offset instead of `None` (in Kafka this is `high_watermark - 1`). With this, we can finish backfill even if there's no message from upstream. Note that this relies on the connector providing the ability to fetch `latest_offset`.
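The bookkeeping described above can be sketched as follows. This is a simplified illustration, not the actual executor code: offsets are reduced to `u64`, and `caught_up` stands in for "may enter `SourceCachingUp` (or finish directly, since the latest-available offset is inclusive)":

```rust
/// Illustrative sketch of the new progress tracking (not the real
/// `SourceBackfillExecutor`); offsets are simplified to `u64`.
struct BackfillProgress {
    /// Offset the backfill stream has consumed up to.
    backfill_offset: Option<u64>,
    /// Initialized to the connector's latest available offset
    /// (for Kafka, `high_watermark - 1`), then bumped whenever an
    /// upstream message with a newer offset is observed.
    target_offset: Option<u64>,
}

impl BackfillProgress {
    fn on_upstream_message(&mut self, upstream_offset: u64) {
        self.target_offset =
            Some(self.target_offset.map_or(upstream_offset, |t| t.max(upstream_offset)));
    }

    /// Whether the backfill side has caught up with the recorded target.
    fn caught_up(&self) -> bool {
        match (self.backfill_offset, self.target_offset) {
            (Some(b), Some(t)) => b >= t,
            // No known target at all: nothing to wait for.
            (_, None) => true,
            (None, Some(_)) => false,
        }
    }
}
```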
There's still a case not covered: there's no history data to backfill at all. We still cannot finish backfill in this case. To optimize it, we need the connector to provide a `has_message_available` ability, and check it before starting backfill.

Perhaps we could also use `has_message_available` as an exit condition of backfill, but that may require a larger change in the source reader.
Diagram of new algorithm:
(Note: this applies to each split. After #18300, we no longer need special hacks (waiting for all splits to finish before finishing the backfill), and can treat all splits equally.)
## Connector APIs

I investigated some connectors' API docs to see whether they can provide `latest_offset` and `has_message_available`.

Note: it might be tempting to just use `poll` to check `has_message_available`, but not all connector APIs return something like `NoMoreMessage` from `poll`, so we cannot distinguish a timeout from "no more messages".

Kafka:
- `latest_offset`: `high_watermark - 1`
- `has_message_available`: `high_watermark > low_watermark`

Kinesis: supports neither `latest_offset` nor `has_message_available` directly.
- `DescribeStream` returns `EndingSequenceNumber` only for closed shards (after a shard merge/split), not for open ones.
- However, `GetRecords` returns `MillisBehindLatest`: "A value of zero indicates that record processing is caught up, and there are no new records to process at this moment." So we can use this API as a workaround (we will fetch and drop data).

Pulsar: supports `getLastMessageIds` and `hasMessageAvailable`. The docs explicitly mention: "This check can be used by an application to scan through a topic and stop when the reader reaches the current last published message."

Note: another possibly useful ability is "fetch the last 1/N messages". But this is subtle: we need to make sure the fetched message is "inclusive".
- Kafka: supports `OffsetTail(n)` natively (it becomes `-n` in the protocol), and Kafka offsets are consecutive.
- Pulsar: supports `startMessageId(MessageId.latest).startMessageIdInclusive()` to read the last 1, but not the last N.
- Kinesis: seems not supported.
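To summarize the investigation, the two abilities could be modeled as an optional per-connector probe. The trait below is a hypothetical sketch (RisingWave's real connector abstraction differs); the mock implementation derives both abilities from Kafka-style watermarks, where `low_watermark` is the first available offset and `high_watermark` is one past the last message:

```rust
/// Hypothetical trait sketch; not the actual connector abstraction.
trait BackfillProbe {
    /// Latest available offset (**inclusive**), if the connector can report it.
    fn latest_offset(&self) -> Option<u64>;
    /// Whether the split currently has any message to read.
    fn has_message_available(&self) -> Option<bool>;
}

/// Kafka-style mock: derive both abilities from the watermarks.
struct KafkaLikeSplit {
    low_watermark: u64,  // first available offset
    high_watermark: u64, // one past the last message
}

impl BackfillProbe for KafkaLikeSplit {
    fn latest_offset(&self) -> Option<u64> {
        // Inclusive latest offset is `high_watermark - 1`; an empty split
        // (low == high) has no latest offset to report.
        (self.high_watermark > self.low_watermark).then(|| self.high_watermark - 1)
    }
    fn has_message_available(&self) -> Option<bool> {
        Some(self.high_watermark > self.low_watermark)
    }
}
```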
```rust
/// Information used to determine whether we should start and finish source backfill.
///
/// XXX: if a connector cannot provide the latest offsets (but we want to make it shareable),
/// perhaps we should ban blocking DDL for it.
#[derive(Debug, Clone)]
pub enum BackfillInfo {
    HasDataToBackfill {
        /// The last available offsets for each split (**inclusive**).
        ///
        /// This will be used to determine whether source backfill is finished when
        /// there are no _new_ messages coming from upstream `SourceExecutor`. Otherwise,
        /// blocking DDL cannot finish until new messages come.
        ///
        /// When there are upstream messages, we will use the latest offsets from the upstream.
        latest_offset: String,
    },
    /// If there are no messages in the split at all, we don't need to start backfill.
    /// In this case, there will be no message from the backfill stream too.
    /// If we started backfill, we cannot finish it until new messages come.
    /// So we mark this a special case for optimization.
    NoDataToBackfill,
}
```
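For illustration, here is how an executor might branch on this enum when deciding whether to start backfill at all. The helper name is hypothetical, and the enum is re-declared (doc comments elided) only to keep the example self-contained:

```rust
/// Re-declared here (doc comments elided) for a self-contained example.
#[derive(Debug, Clone)]
pub enum BackfillInfo {
    HasDataToBackfill { latest_offset: String },
    NoDataToBackfill,
}

/// Hypothetical helper: pick the initial `target_offset`, or return `None`
/// to skip backfill entirely when the split has no data, so a blocking DDL
/// can finish immediately instead of waiting for messages that never come.
fn initial_target_offset(info: &BackfillInfo) -> Option<String> {
    match info {
        BackfillInfo::HasDataToBackfill { latest_offset } => Some(latest_offset.clone()),
        BackfillInfo::NoDataToBackfill => None,
    }
}
```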