CFE reading state file on different F5 pair #147
Comments
Hi @john-jones01, can you share the output of tail -f /var/log/restnoded/restnoded.log | grep f5-cloud-failover on both devices for the time period when this happened? It makes no sense that resetting the state file would cause failover, since all that does is clear out the file in storage. If you didn't already have silly logging enabled when trying the dry run on internal, can you set it if possible? Thanks. |
Hi Mike, thank you for your reply. I am also perplexed by what is happening with this. Attaching the files for the day that my prod edge F5 failed over, and the logs from when I attempted the dry-run yesterday after updating the RPM. I already have silly logging enabled. sep-17th-priv2-cfe.txt |
I just created issue EC-550 for this. Will let you know when I have a look at the logs. |
Please let me know if I didn't get the correct logs or if you need more. With log level "silly" it is really chatty of course, so going through it was a bit challenging. While I have you, I would love to better understand a few things if I may please:
|
Not sure if you have access to or use iHealth, but I uploaded a qkview for support ticket 00687310. Seems like it would be easier to navigate the logs using that than these text outputs. Will upload the prod edge qkview shortly. Just trying to make your life easier :) |
Sounds good, most likely support will get it sorted. If it's escalated we will use the issue number I provided above. |
Thank you, Mike. The assigned engineer said he was going to escalate it yesterday, and also asked me to submit an issue while he does that. I see that log as well, but above that I see it may be trying to pull from the edge state file (?). That is why I was asking the 2 questions above. It looks to me like it's trying to use the state file for our prod edge config file. There is no state file for priv. I have the same permissions for the edge sa & bucket as the internal sa and bucket. [f5-cloud-failover] Cloud provider found targetInstances: {"0":{"kind":"compute#targetInstance","id":"6217496490224367059","creationTimestamp":"2022-06-09T08:01:17.000-07:00","name":"nc-ec500-lbpha-edgea-us-central1-ti","description":"","zone":" |
I updated the RPM a few days ago and still see schema as 2.1.1. After update, I did restart javad and restnoded services. [john.jones01@nc-ec500-lbp-priv2-us-central1:Active:Standalone] config # curl -su admin: -X GET http://localhost:8100/mgmt/shared/cloud-failover/declare | jq . |
This actually shows that CFE thinks it should be updating the edge instance, but doesn't have permission to do anything with the storage account (which I think is the correct one). Instance discovery isn't keyed off of the state file, it's based on tags + self IPs IIRC. |
I have my cfe config file set to look at the priv bucket, and the labels all match -- I've checked 10+ times lol. So I am lost on why it keeps trying to go to the edge bucket. Unless... So I am now thinking the RPM I downloaded off the edge and put onto the priv somehow "hardwired" this data to my priv F5, even after I re-uploaded a new RPM to that instance. |
I am going to delete that RPM and install it fresh and see if that helps. Last time I imported the new RPM on top of 2.1.1 (vs removing and adding). Maybe it's written in a database or something along those lines. |
It appears priv is trying to use the correct bucket: Tue, 17 Sep 2024 23:51:16 GMT - finest: [f5-cloud-failover] deployment bucket name: f5-prod-priv1-priv2-failover But it doesn't have permission: Tue, 17 Sep 2024 23:51:19 GMT - finest: [f5-cloud-failover] State file data: Insufficient Permission The reference to nc-ec500-lbpha-edgea-us-central1-ti from the internal logs is the instance name, not the bucket name. So it looks like internal doesn't have permission to the bucket it needs, and is also discovering the edge instance when determining which instance needs updating, which is unrelated to the bucket issue. |
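For anyone wading through silly-level output, a rough sketch of pulling just those two signals (bucket name and permission errors) out of restnoded.log text. The line shape is assumed from the two log lines quoted above; `summarize` and `LINE_RE` are hypothetical names, not part of CFE.

```python
import re

# Assumed line shape, based only on the two log lines quoted above:
# "<timestamp> - <level>: [f5-cloud-failover] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>.+?) - (?P<level>\w+): \[f5-cloud-failover\] (?P<msg>.*)$"
)


def summarize(lines):
    """Return (bucket_names, permission_errors) found in CFE log lines."""
    buckets = []
    errors = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a CFE line, or a shape this sketch does not know
        msg = match.group("msg")
        if msg.startswith("deployment bucket name:"):
            buckets.append(msg.split(":", 1)[1].strip())
        elif "Insufficient Permission" in msg:
            errors.append((match.group("ts"), msg))
    return buckets, errors


sample = [
    "Tue, 17 Sep 2024 23:51:16 GMT - finest: [f5-cloud-failover] "
    "deployment bucket name: f5-prod-priv1-priv2-failover",
    "Tue, 17 Sep 2024 23:51:19 GMT - finest: [f5-cloud-failover] "
    "State file data: Insufficient Permission",
]
buckets, errors = summarize(sample)
print(buckets)
print(errors)
```

Running it over the logs from each device should make it quick to confirm which bucket each one resolves and when the permission failure appears.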
@john-jones01 Can you verify the forwarding rules have the correct labels for each deployment? The f5_target_instance_pair should be unique. |
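A minimal sketch of that uniqueness check, assuming the forwarding rule metadata has already been exported into plain dictionaries. The rule names, the `deployment` key, and the comma-separated value format are all hypothetical and only there for illustration; this does not call any GCP API.

```python
from collections import defaultdict


def find_shared_pairs(rules):
    """Return f5_target_instance_pair values used by more than one deployment."""
    seen = defaultdict(set)
    for rule in rules:
        # Assumption: the value is a comma-separated list of target instance names.
        names = rule["f5_target_instance_pair"].split(",")
        pair = tuple(sorted(name.strip() for name in names))
        seen[pair].add(rule["deployment"])
    return {pair: sorted(deps) for pair, deps in seen.items() if len(deps) > 1}


# Hypothetical data: the priv rule mistakenly reuses the edge pair.
rules = [
    {"name": "edge-fr", "deployment": "edge",
     "f5_target_instance_pair": "edge-a-ti, edge-b-ti"},
    {"name": "priv-fr", "deployment": "priv",
     "f5_target_instance_pair": "edge-b-ti,edge-a-ti"},
    {"name": "other-fr", "deployment": "other",
     "f5_target_instance_pair": "other-a-ti, other-b-ti"},
]
shared = find_shared_pairs(rules)
print(shared)
```

An empty result would mean every deployment has its own pair; any entry points at rules worth double-checking.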
Thanks. That 403 is interesting since I've been digging into the GCP logs. It is attempting to assume a permission not noted in the prep guide. Furthermore, there is no state file for this to "manage" per se. When is this state file created? |
The state file is created whenever CFE executes, if it doesn't already exist. If you don't have permission to the bucket then it makes sense that it's not there. I would be curious to see the logs for the edge bucket that's working. I assume the permissions are the same for both and CFE is making the same requests, so what's the difference? |
The actual forwarding rules themselves or the logic in the cfe.json in the config folder/path? |
Yes sir, permissions are the same. Attached screenshot of edge. Looking over the graph, the logs have most certainly jumped for edge with me doing this priv CFE. Only one error, ironically on the day and right around the times I thought maybe the state file command failed that pair over. 24 assigned permissions |
|
Is there a way to update the forwarding rules with a label vs a description? Google does NOT make it easy to update a forwarding rules description, but very easy to add a label. So in order to update the forwarding rules with a description, I would have to delete them and re-add. 😢 |
Hi @mikeshimkus, is there any way we can get on a meeting to discuss this please? |
Hi Mike, I figured this issue out. I have a question for you if I may please. Just want to better understand this :) How is the state file utilized with CFE? |
Hi @john-jones01, the purpose of the state file is to help CFE recover from a failed failover. Whenever CFE runs, it checks the state file for the status of the previous run. If that's anything but "SUCCEEDED" or "NEVER_RUN", CFE will attempt to use the state file info to reset the cloud provider networking back to the last known good configuration. |
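To make that check concrete, here is a rough sketch of the decision as described above. It is not CFE's actual code: the `taskState` field name, the `needs_recovery` helper, the "RUNNING" example value, and treating a missing status as "NEVER_RUN" are all assumptions for illustration. Only the "SUCCEEDED" and "NEVER_RUN" values come from the explanation.

```python
import json

# Statuses named in the explanation above; anything else is treated
# as an unfinished or failed previous run.
CLEAN_STATES = {"SUCCEEDED", "NEVER_RUN"}


def needs_recovery(state_file_text):
    """Return True if the previous run did not end in a clean state."""
    state = json.loads(state_file_text)
    # Assumption: the status lives under "taskState"; a missing value is
    # treated here like a run that never happened.
    status = state.get("taskState", "NEVER_RUN")
    return status not in CLEAN_STATES


print(needs_recovery('{"taskState": "SUCCEEDED"}'))
print(needs_recovery('{"taskState": "RUNNING"}'))
```

The point of the sketch is only the branching: a clean previous status means a normal run, and anything else means the stored info is used to put networking back first.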
Do you already have an issue opened with F5 support? Yes. 00687310
GitHub Issues are consistently monitored by F5 staff, but should be considered as best effort only and you should not expect to receive the same level of response as provided by F5 Support. Please open a case with F5 if this is a critical issue.
Description
Describe the problem you're having or the enhancement you'd like to request.
Our org has CFE set up on our edge F5s, and it works fine. So I am building out getting CFE deployed on our internal F5s.
I went through the entire quick-start guide verbatim. I created a new bucket with different labels and mapped those in the cfe.json on the F5s. The only deviation is that during my first dry-run attempt I imported the 2.1.1 RPM from our edge F5 onto our internal F5s. Yesterday I updated the RPM to the latest, 2.1.2.
The problem I am having is that the internal/priv dry-run is observing the edge state file and not creating its own f5cloudfailoverstate.json file to write and read data from. Even worse, when going through the troubleshooting steps for resetting the state file on our internal F5s, it failed over our edge F5s. I see through the restnoded logs that it sees the internal F5
Environment information
For bugs, enter the following information:
Severity Level
For bugs, enter the bug severity level. Do not set any labels.
Severity: 5
Severity level definitions: