
Ensure that different data pipelines are logically separate #375

Open
jkarpen opened this issue Aug 28, 2024 · 6 comments

@jkarpen

jkarpen commented Aug 28, 2024

At one point the 30-second raw data pipeline was tightly coupled with the config table uploads. This meant that an incident in one pipeline could affect the others. As an example, in June there was an incident where the data relay server was down for a couple of weeks. It took almost a week of data crawling to recover, and the config table uploads were scheduled behind the 30-second data uploads. Because of the tight coupling, it took a long time to update the config tables (the scripts for which can run in under a minute), even though they are logically separate.


Going forward, the data relay server should be able to schedule the different parts of the pipeline independently so that incidents (or incident recovery) in one of them do not affect the others.

Note: there has been some refactoring of the upload scripts since the above incident, so the coupling may not be the same now.

@pingpingxiu-DOT-ca-gov
Contributor

The problem Ian stated is valid: the data relay was not designed to be "multi-tasking"; more specifically, it can only execute its planned tasks sequentially. Therefore, if the plan changes, such as in a backfill situation where we need to go back in history and redo the relay, the current data relay has to either completely cancel the existing plan and focus on the backfill, or continue executing the existing plan all the way to completion, after which other tasks can run.

The root cause is the single queue design.

Diagram: Single Queue Data Relay (drawio)

To keep tasks at the end of the queue from waiting too long, I changed the design to use multiple task queues, as shown below:

Diagram: Multi Queue Data Relay (drawio)
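A minimal sketch of the multi-queue idea, assuming the relay tasks are delivered over Kafka and consumed in Python. The topic names, group ids, broker address, and the `handle_task` helper are hypothetical illustrations, not the production code:

```python
# Sketch: one queue (topic) and one consumer loop per pipeline, so a slow
# backfill cannot block the 30-second or config-table tasks.
# All names below are assumptions for illustration only.
import threading
from confluent_kafka import Consumer

TOPICS = {
    "raw-30sec": "relay-tasks-raw-30sec",       # hypothetical topic names
    "config-daily": "relay-tasks-config-daily",
    "backfill": "relay-tasks-backfill",
}

def handle_task(name: str, payload: bytes) -> None:
    # Placeholder for the actual relay work.
    print(f"[{name}] executing task: {payload!r}")

def run_worker(name: str, topic: str) -> None:
    """Independent consumer loop for one pipeline's queue."""
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "group.id": f"data-relay-{name}",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            handle_task(name, msg.value())
    finally:
        consumer.close()

# One worker per queue; a backlog in one queue leaves the others free
# to keep draining their own tasks.
for name, topic in TOPICS.items():
    threading.Thread(target=run_worker, args=(name, topic)).start()
```

The point of the design is isolation: each pipeline's tasks live in their own queue and are drained by their own worker, so an incident or recovery backlog in one pipeline no longer delays the others.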

@pingpingxiu-DOT-ca-gov
Contributor

pingpingxiu-DOT-ca-gov commented Sep 19, 2024

Thanks for the thoughtful discussion. Documenting the Q&A:

  1. Is this the only solution for the challenge?
    Answer: the other option would be to add more machines (or processes) pulling from the database so that the task queue completes faster. The problem is that the pipeline would still be a single queue, and increasing the data pulls would put extra pressure on the production database.

  2. Would this address the "logical separation" concern?
    Answer: Yes. Tasks from the 30-second relay, the daily config relay, and backfills get fair treatment because they are no longer assigned to a single queue, and each queue gets guaranteed handling from the data puller.

  3. What are the major risks or difficulties in executing this plan?
    Answer: beyond the basic coding, the major challenge is operational: the single queue has to be migrated to multiple queues through a sequence of delicate Kafka operations, and this has to happen during a service-down period to eliminate any impact on production. From the dashboard, I can locate such a window to perform the operations without production impact. (A rough sketch of the migration step is included after this list.)
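A rough sketch of the kind of migration step described in item 3, assuming the queues are backed by Kafka topics. The topic names, partition counts, replication factor, and broker address are assumptions, not the actual operations performed:

```python
# Sketch: during the maintenance window, create the per-pipeline topics that
# replace the original single queue, then repoint producers and consumers.
# All names and settings below are assumptions for illustration only.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker

new_topics = [
    NewTopic("relay-tasks-raw-30sec", num_partitions=1, replication_factor=3),
    NewTopic("relay-tasks-config-daily", num_partitions=1, replication_factor=3),
    NewTopic("relay-tasks-backfill", num_partitions=1, replication_factor=3),
]

futures = admin.create_topics(new_topics)
for topic, future in futures.items():
    future.result()  # raises if topic creation failed
    print(f"created topic {topic}")
```

Doing this while the relay is stopped avoids tasks being split between the old and new queues mid-flight.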

The next step is to execute the plan.

@jkarpen
Author

jkarpen commented Sep 26, 2024

Pingping successfully implemented the changes outlined above. As a next step before closing this issue, Pingping would like to implement a dashboard to track the performance of the different queues. Pingping will meet with @ian-r-rose when he returns next week for input on the KPIs to use.

@jkarpen
Author

jkarpen commented Oct 3, 2024

Next step on this task is to document the code. Pingping will create a separate issue for creating a dashboard to track performance.

@jkarpen
Author

jkarpen commented Oct 10, 2024

Per @pingpingxiu-DOT-ca-gov there is no need to add a new dashboard; existing dashboards should capture any issues. The next step on this issue is for Pingping and @ian-r-rose to meet and review the code.

@jkarpen
Author

jkarpen commented Oct 24, 2024

Per @pingpingxiu-DOT-ca-gov this is waiting on the virtual environments PR to be completed. Then Pingping will submit a PR for this to be further reviewed before completion.
