CFE reading state file on different F5 pair #147

Open
john-jones01 opened this issue Sep 18, 2024 · 22 comments

Comments

@john-jones01

Do you already have an issue opened with F5 support? Yes. 00687310

GitHub Issues are consistently monitored by F5 staff, but should be considered as best effort only and you should not expect to receive the same level of response as provided by F5 Support. Please open a case with F5 if this is a critical issue.

Description

Describe the problem you're having or the enhancement you'd like to request.
Our org has CFE set up on our edge F5s, where it works fine. I am now building out a CFE deployment on our internal F5s.

I went through the entire quick-start guide verbatim. I created a new bucket with different labels and mapped those in the cfe.json on the F5s. The only deviation is that during my first dry-run attempt I imported the 2.1.1 RPM from our edge F5 onto our internal F5s. Yesterday I updated the RPM to the latest, 2.1.2.

The problem I am having is that the internal/priv dry-run is observing the edge state file and not creating its own f5cloudfailoverstate.json file to write and read data from. Even worse, when I went through the troubleshooting steps for resetting the state file on our internal F5s, it failed over our edge F5s. I see through the restnoded logs that it sees the internal F5

Environment information

For bugs, enter the following information:

  • Cloud Failover Extension Version: 2.1.2
  • BIG-IP version: 17.1.1.3 Engineering Hotfix 0.109.5
  • Cloud provider: GCP

Severity Level

For bugs, enter the bug severity level. Do not set any labels.

Severity: 5
[Screenshots attached: cfe-version, cfe-config, prod-EDGE-bucket-label, prod-priv-bucket-label, prod-EDGE-bucket, prod-priv-bucket-permissions, prod-priv-bucket]

Severity level definitions:

  1. Severity 1 (Critical) : Defect is causing systems to be offline and/or nonfunctional. Immediate attention is required.
  2. Severity 2 (High) : Defect is causing major obstruction of system operations.
  3. Severity 3 (Medium) : Defect is causing intermittent errors in system operations.
  4. Severity 4 (Low) : Defect is causing infrequent interruptions in system operations.
  5. Severity 5 (Trivial) : Defect is not causing any interruptions to system operations, but nonetheless is a bug.

@mikeshimkus
Contributor

Hi @john-jones01, can you share the output of tail -f /var/log/restnoded/restnoded.log | grep f5-cloud-failover on both devices for the time period when this happened? It makes no sense that resetting the state file would cause failover since all that does is clear out the file in storage.

If you didn't already have silly logging enabled when trying the dry run on internal, can you set it if possible? thanks
https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/troubleshooting.html#confirm-cfe-configuration

@john-jones01
Author

Hi Mike, thank you for your reply. I am perplexed by what is happening with this as well. Attaching the files for the day that my prod edge F5 failed over, and the logs from when I attempted the dry-run yesterday after updating the RPM. I already have silly logging enabled.
*Priv 2 is the internal F5 on which I am working to get CFE installed and working.
**Edge is the F5 that failed over around 16:28 GMT. I ran the command on priv2 around 16:00 GMT.

sep-17th-priv2-cfe.txt
sep-16th-reset-state-command-priv2.txt
cfe-prod-edge-output-september-16th.txt

@mikeshimkus
Contributor

I just created issue EC-550 for this. Will let you know when I have a look at the logs.

@john-jones01
Author

Please let me know if I didn't get the correct logs or if you need more. With log level "silly" it is really chatty of course, so going through it was a bit challenging.

While I have you, I would love to better understand a few things if I may please:

  1. When is the "f5cloudfailoverstate.json" file built and sent to the bucket? As you can see in my screenshot above, I do not have any app data for the priv (internal) F5s. Only edge has that state file. I am wondering if that is why I can see BOTH internal and external F5 configs. It might be storing both in the edge state file? Just a thought, not sure. Then again, I see a 403 when doing priv even though the SA account permissions are exactly the same for edge and priv.
  2. Would downloading an RPM off of the edge F5 and uploading it to the priv F5 have anything to do with this behavior? I have since updated it to 2.1.2, but the schema in the logs still shows 2.1.1. Makes me a bit suspicious...

@john-jones01
Author

Not sure if you have access to or use iHealth, but I uploaded a qkview for support ticket 00687310. It seems like it would be easier to navigate the logs using that than these text outputs. I will upload the prod edge qkview shortly. Just trying to make your life easier :)

@mikeshimkus
Contributor

mikeshimkus commented Sep 18, 2024

Sounds good, most likely support will get it sorted. If it's escalated we will use the issue number I provided above.

@john-jones01
Author

john-jones01 commented Sep 19, 2024

Thank you, Mike. The assigned engineer said he was going to escalate it yesterday, and also asked me to submit an issue while he does that.

I see that log as well, but above that I see it may be trying to pull from the edge state file (?). That is why I asked the 2 questions above. It looks to me like it is trying to use the state file for our prod edge config. There is no state file for priv. I have the same permissions for the edge SA and bucket as for the internal SA and bucket.

[f5-cloud-failover] Cloud provider found targetInstances: {"0":{"kind":"compute#targetInstance","id":"6217496490224367059","creationTimestamp":"2022-06-09T08:01:17.000-07:00","name":"nc-ec500-lbpha-edgea-us-central1-ti","description":"","zone":"
Wed, 18 Sep 2024 12:44:05 GMT - finest: [f5-cloud-failover] Cloud Provider initialization complete
Wed, 18 Sep 2024 12:44:05 GMT - finest: [f5-cloud-failover] Failover initialization complete
Wed, 18 Sep 2024 12:44:05 GMT - info: [f5-cloud-failover] Fetching device info
Wed, 18 Sep 2024 12:44:05 GMT - fine: [f5-cloud-failover] Address operations enabled? true
Wed, 18 Sep 2024 12:44:05 GMT - fine: [f5-cloud-failover] Route operations enabled? false
Wed, 18 Sep 2024 12:44:05 GMT - finest: [f5-cloud-failover] Data will be downloaded from f5cloudfailoverstate.json
Wed, 18 Sep 2024 12:44:06 GMT - finest: [f5-cloud-failover] downloadDataFromStorage found response code 403
Wed, 18 Sep 2024 12:44:06 GMT - finest: [f5-cloud-failover] State file data: Insufficient Permission
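With silly logging enabled the output is chatty, so a small filter can help isolate the storage failures. The following is only a sketch, assuming the line layout seen in the excerpt above (timestamp, level, `[f5-cloud-failover]` tag, message); the function name `find_storage_errors` and the matched phrases are illustrative choices, not part of CFE.

```python
import re

# Assumed layout, taken from the excerpt above:
# "Wed, 18 Sep 2024 12:44:06 GMT - finest: [f5-cloud-failover] message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) - "
    r"(?P<level>\w+): \[f5-cloud-failover\] (?P<msg>.*)$"
)


def find_storage_errors(lines):
    """Return (timestamp, message) pairs that mention a 403 or a permission failure."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        msg = match.group("msg")
        if "response code 403" in msg or "Insufficient Permission" in msg:
            hits.append((match.group("ts"), msg))
    return hits


# Three lines copied from the log excerpt above.
sample = [
    "Wed, 18 Sep 2024 12:44:05 GMT - finest: [f5-cloud-failover] Data will be downloaded from f5cloudfailoverstate.json",
    "Wed, 18 Sep 2024 12:44:06 GMT - finest: [f5-cloud-failover] downloadDataFromStorage found response code 403",
    "Wed, 18 Sep 2024 12:44:06 GMT - finest: [f5-cloud-failover] State file data: Insufficient Permission",
]

hits = find_storage_errors(sample)
for ts, msg in hits:
    print(ts, "|", msg)
```

The same function could be fed the lines of a saved restnoded.log to list every storage permission failure with its timestamp.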

@john-jones01
Author

I updated the RPM a few days ago and still see the schema version as 2.1.1. After the update, I restarted the javad and restnoded services.

[john.jones01@nc-ec500-lbp-priv2-us-central1:Active:Standalone] config # curl -su admin: -X GET http://localhost:8100/mgmt/shared/cloud-failover/declare | jq .
{
  "message": "success",
  "declaration": {
    "class": "Cloud_Failover",
    "environment": "gcp",
    "controls": {
      "class": "Controls",
      "logLevel": "silly"
    },
    "externalStorage": {
      "scopingName": "f5-prod-priv1-priv2-failover"
    },
    "failoverAddresses": {
      "enabled": true,
      "scopingTags": {
        "f5_cloud_failover_label": "lbp-priv-ac"
      },
      "requireScopingTags": false
    },
    "schemaVersion": "2.1.1"

cfe-rpm

@mikeshimkus
Contributor

The log excerpt above actually shows that CFE thinks it should be updating the edge instance, but doesn't have permission to do anything with the storage account (which I think is the correct one). Instance discovery isn't keyed off of the state file; it's based on tags + self IPs, IIRC.

@john-jones01
Author

I have my CFE config file set to look at the priv bucket, and the labels all match. I've checked 10+ times, lol. So I am lost on why it keeps trying to go to the edge bucket. Unless...

I am now thinking the RPM I downloaded off the edge and put onto the priv has somehow "hardwired" this data to my priv F5, even after I re-uploaded a new RPM to that instance.

@john-jones01
Author

I am going to delete that RPM and install it fresh to see if that helps. Last time I imported the new RPM on top of 2.1.1 (vs. removing and adding). Maybe it's written in a database or something along those lines.

@mikeshimkus
Contributor

It appears priv is trying to use the correct bucket:

Tue, 17 Sep 2024 23:51:16 GMT - finest: [f5-cloud-failover] deployment bucket name: f5-prod-priv1-priv2-failover

But doesn't have permission:

Tue, 17 Sep 2024 23:51:19 GMT - finest: [f5-cloud-failover] State file data: Insufficient Permission

The reference to nc-ec500-lbpha-edgea-us-central1-ti from the internal logs is the instance name, not the bucket name. So it looks like internal doesn't have permission to the bucket it needs, and is also discovering the edge instance when determining which instance needs updating, which is unrelated to the bucket issue.

@mikeshimkus
Contributor

@john-jones01 Can you verify the forwarding rules have the correct labels for each deployment? The f5_target_instance_pair should be unique.
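One way to sanity-check that uniqueness is to compare the pair value across deployments. The sketch below works on an in-memory list; the dictionary layout, the deployment names, and the label values are hypothetical, and in practice the data would have to be exported from GCP first.

```python
from collections import defaultdict


def shared_pairs(rules):
    """Return f5_target_instance_pair values that appear in more than one deployment."""
    deployments_by_pair = defaultdict(set)
    for rule in rules:
        pair = rule.get("labels", {}).get("f5_target_instance_pair")
        if pair:  # rules without the label are ignored here
            deployments_by_pair[pair].add(rule["deployment"])
    return {
        pair: sorted(deployments)
        for pair, deployments in deployments_by_pair.items()
        if len(deployments) > 1
    }


# Hypothetical inventory: one priv rule reuses the edge pair value by mistake.
rules = [
    {"name": "edge-fr-1", "deployment": "edge",
     "labels": {"f5_target_instance_pair": "edge-pair"}},
    {"name": "priv-fr-1", "deployment": "priv",
     "labels": {"f5_target_instance_pair": "edge-pair"}},
    {"name": "priv-fr-2", "deployment": "priv",
     "labels": {"f5_target_instance_pair": "priv-pair"}},
]

conflicts = shared_pairs(rules)
print(conflicts)
```

An empty result would mean no pair value is shared between deployments in the supplied inventory.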

@john-jones01
Author

john-jones01 commented Sep 19, 2024

Thanks. That 403 is interesting, since I've been digging into the GCP logs. It is attempting to use a permission not noted in the prep guide. Furthermore, there is no state file for this to "manage" per se. When is this state file created?

perm-query

@mikeshimkus
Contributor

The state file is created whenever CFE executes, if it doesn't already exist. If you don't have permission to the bucket then it makes sense that it's not there.

I would be curious to see the logs for the edge bucket that's working. I assume the permissions are the same for both and CFE is making the same requests, so what's the difference?

@john-jones01
Author

Regarding the forwarding rule labels: do you mean the actual forwarding rules themselves, or the logic in the cfe.json in the config folder/path?

@john-jones01
Author

Yes sir, the permissions are the same for edge and priv. A screenshot of the edge logs is attached. Looking over the graph, the logs have most certainly jumped for edge while I have been working on the priv CFE. There is only one error, ironically on the day and right around the time I thought the state file command had failed that pair over.

24 assigned permissions
compute.forwardingRules.get
compute.forwardingRules.list
compute.forwardingRules.setTarget
compute.instances.create
compute.instances.get
compute.instances.list
compute.instances.updateNetworkInterface
compute.networks.updatePolicy
compute.routes.create
compute.routes.delete
compute.routes.get
compute.routes.list
compute.targetInstances.get
compute.targetInstances.list
compute.targetInstances.use
storage.buckets.create
storage.buckets.get
storage.buckets.list
storage.buckets.update
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list
storage.objects.update
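Since the two service accounts are expected to match, a set comparison is a quick way to confirm it. This is a minimal sketch: the edge list is the storage subset of the permissions above, while the priv list is hypothetical and has one entry removed purely for illustration. The thread does not name the permission behind the 403.

```python
def permission_diff(reference, candidate):
    """Compare two permission lists and report what the candidate lacks or adds."""
    ref, cand = set(reference), set(candidate)
    return {"missing": sorted(ref - cand), "extra": sorted(cand - ref)}


# Storage permissions copied from the list above.
edge_permissions = [
    "storage.buckets.create",
    "storage.buckets.get",
    "storage.buckets.list",
    "storage.buckets.update",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.objects.get",
    "storage.objects.list",
    "storage.objects.update",
]

# Hypothetical priv role with one permission dropped for illustration only.
priv_permissions = [p for p in edge_permissions if p != "storage.objects.get"]

diff = permission_diff(edge_permissions, priv_permissions)
print(diff)
```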

edge-gcp-log
edge-gcp-log-all

@john-jones01
Author

Ahh, I see. You are talking about this. No, we do not have this.
image (1)

@john-jones01
Author

Is there a way to update the forwarding rules with a label instead of a description? Google does NOT make it easy to update a forwarding rule's description, but makes it very easy to add a label. So in order to update the forwarding rules with a description, I would have to delete them and re-add them. 😢

@john-jones01
Author

Hi @mikeshimkus, is there any way we can get on a meeting to discuss this, please?

@john-jones01
Author

Hi Mike, I figured this issue out.

I have a question for you if I may please. Just want to better understand this :)

How is the state file utilized with CFE?

@mikeshimkus
Contributor

Hi @john-jones01, the purpose of the state file is to help CFE recover from a failed failover. Whenever CFE runs, it checks the state file for the status of the previous run. If that is anything but "SUCCEEDED" or "NEVER_RUN", CFE will attempt to use the state file info to reset the cloud provider networking back to the last known good configuration.
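The check described above can be sketched roughly as follows. This is not CFE's code: the `taskState` key and the choice to treat a missing status as "NEVER_RUN" are assumptions made for illustration, and only the two status strings come from the explanation above.

```python
# Statuses that mean the previous run left nothing to repair.
HEALTHY_STATES = {"SUCCEEDED", "NEVER_RUN"}


def needs_recovery(state):
    """Return True when the previous run's status calls for a recovery attempt.

    `state` is the parsed state file as a dict. The "taskState" key is an
    assumed field name, and a missing status is treated as NEVER_RUN.
    """
    return state.get("taskState", "NEVER_RUN") not in HEALTHY_STATES


print(needs_recovery({"taskState": "SUCCEEDED"}))
print(needs_recovery({"taskState": "FAILED"}))
print(needs_recovery({}))
```

Under these assumptions, any other status, such as a run that was interrupted midway, would trigger the reset back to the last known good configuration.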
