SOLR-17306: fix replication problem on follower restart #2873

Closed

@ds-manzinger ds-manzinger commented Nov 19, 2024

https://issues.apache.org/jira/browse/SOLR-17306

Description

If the leader has replication disabled, do not delete the follower's data on restart.

Solution

Check whether replication is enabled on the leader before deleting the follower's data.
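
The check described above can be illustrated with a minimal sketch. This is not the code from the patch: the names `LeaderStatus`, `StartupAction`, and `decide_startup_action` are hypothetical, and treating a leader index version of 0 as "leader reports an empty index" is an assumption made only for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class StartupAction(Enum):
    KEEP_INDEX = "keep_index"
    CLEAR_INDEX = "clear_index"
    FETCH_INDEX = "fetch_index"


@dataclass(frozen=True)
class LeaderStatus:
    """Hypothetical snapshot of what a follower learns from its leader."""
    replication_enabled: bool
    index_version: int  # assumption: 0 means the leader reports no index


def decide_startup_action(leader: LeaderStatus, follower_version: int) -> StartupAction:
    """Decide what a restarting follower should do with its local index."""
    # The guard this PR describes: a leader with replication disabled
    # says nothing reliable about its index, so keep the local data.
    if not leader.replication_enabled:
        return StartupAction.KEEP_INDEX
    # Replication is enabled and the leader really is empty (assumed model).
    if leader.index_version == 0:
        if follower_version != 0:
            return StartupAction.CLEAR_INDEX
        return StartupAction.KEEP_INDEX
    # Leader has a different index version: pull it.
    if leader.index_version != follower_version:
        return StartupAction.FETCH_INDEX
    return StartupAction.KEEP_INDEX
```

Under this model, a follower holding version 42 that restarts while the leader is locked keeps its index, whereas without the first guard it would fall through to the "leader is empty" branch and clear its data.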

Tests

Implemented unit tests that check different restart scenarios. The tests enable directory storage for the replica; otherwise they will not work, because memory is cleaned on restart.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@epugh
Contributor

epugh commented Nov 19, 2024

I am hoping someone with more confidence in this area weighs in, but if you don't get a review, please do poke me!!!

@epugh
Contributor

epugh commented Nov 19, 2024

Could #1875 be augmented to test your scenario?

@ds-manzinger
Author

@epugh First, we tried using BATS for testing (as you already mentioned in the JIRA Ticket), but it was easier to create the unit test attached to the Pull Request.

We built this because our setup is Leader → Replication Leader → Followers.
When we update our catalog, we stop the Leader replication. If the Replication Leader restarts during this time (e.g., due to automated patching), it loses all the data. Followers are an autoscaling group that scales depending on the traffic.

We have been running this patch in our production environment for two months without any issues on Solr 9.6.

Should we also submit a Pull Request against the main branch?

@epugh
Contributor

epugh commented Nov 19, 2024

> Should we also submit a Pull Request against the main branch?

Please do open against main. Generally we commit to main and then backport to branch_9x. This should go out in 9.8, or maybe 9.7.1 if we cut that soon.

@epugh
Contributor

epugh commented Nov 19, 2024

BTW, it would be interesting if you could write a short story, a few paragraphs, about how you handle your lifecycle with leader/follower/replicas, your ASG, etc. In https://github.com/apache/solr/pull/2783/files#diff-b58818a370dac65f7abb0064599f8813a56841b4a40f960ea2b81e398b820f43 we are talking about the architecture you highlight. It would be great if you could review the pros/cons and weigh in. Let me know if you are interested and we can discuss more on that PR.

@ds-manzinger
Author

#2874

This is the pull request against the main branch.

@ds-manzinger
Author

ds-manzinger commented Nov 21, 2024

> BTW, it would be interesting if you could write a short story, a few paragraphs, about how you handle your lifecycle with leader/follower/replicas, your ASG, etc. In https://github.com/apache/solr/pull/2783/files#diff-b58818a370dac65f7abb0064599f8813a56841b4a40f960ea2b81e398b820f43 we are talking about the architecture you highlight. It would be great if you could review the pros/cons and weigh in. Let me know if you are interested and we can discuss more on that PR.

Our setup is already mentioned in that pull request: leader -> repeater -> follower.

Only the followers are autoscaled. We do not use any SolrCloud functionality, just a simple leader/follower setup.

The repeater is the leader that is always up for autoscaling, and the leader is locked for replication while we update the catalog. At some point we may change our update strategy so that we no longer need to lock the leader.
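
For readers unfamiliar with the locking step, here is a minimal sketch of how the leader could be locked and unlocked through Solr's replication handler. It assumes that "locking" corresponds to the handler's `disablereplication` and `enablereplication` commands, which this thread does not state explicitly. The host name, core name, and helper function are hypothetical.

```python
from urllib.parse import urlencode


def replication_command_url(base_url: str, core: str, command: str) -> str:
    """Build a replication handler URL for the given core and command."""
    query = urlencode({"command": command})
    return f"{base_url.rstrip('/')}/{core}/replication?{query}"


LEADER = "http://leader.example.com:8983/solr"  # hypothetical leader host
CORE = "catalog"  # hypothetical core name

# Assumed mapping: lock before the catalog update, unlock afterwards.
lock_url = replication_command_url(LEADER, CORE, "disablereplication")
unlock_url = replication_command_url(LEADER, CORE, "enablereplication")

print(lock_url)
print(unlock_url)
```

The sketch only builds the URLs; issuing the requests (for example with `urllib.request.urlopen`) is left out so it can be run without a Solr instance.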

The problem that led us to create this PR was:

The leader was locked, and the repeater received a system patch and was restarted automatically. Its data was then removed.

@epugh
Contributor

epugh commented Nov 21, 2024

Closing in favour of the main branch. Will dig into it, probably early next week, unless someone beats me to it!

@epugh epugh closed this Nov 21, 2024