Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic #1133

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

gcs278
Copy link
Contributor

@gcs278 gcs278 commented Aug 22, 2024

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

Additionally, the IBM DNS logic doesn't work for multiple targets and this creates unexpected and problematic results. The logic has been refactored to only create and delete using the first target. It warns the user when multiple targets are set.

This PR also includes some unit test fix up and missing unit test coverage for the IBM CIS Provider.

This resolves the same DNS issues for public PowerVS cloud as it uses the same logic.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Aug 22, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

This PR also includes some unit test fix up and missing coverage for the IBM CIS Provider.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci openshift-ci bot requested review from alebedev87 and frobware August 22, 2024 22:46
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from b387d07 to b07d602 Compare August 22, 2024 22:50
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 22, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci openshift-ci bot requested a review from lihongan August 22, 2024 22:57
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

This PR also includes some unit test fix up and missing coverage for the IBM CIS Provider.

This also resolves the same DNS issues for public PowerVS cloud as it uses the same logic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@gcs278
Copy link
Contributor Author

gcs278 commented Aug 26, 2024

I missed one detail of the bug, missing instructions in the Progressing status condition for PowerVS when you change scope. Looking into it.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2024
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from a057f1a to 0716651 Compare August 26, 2024 19:28
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 26, 2024

@lihongan Added the missing scope change instructions for the PowerVS type.

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 27, 2024

infra failures
/retest

@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from 0716651 to f51b1d3 Compare August 27, 2024 14:31
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 28, 2024

Infra issues
/retest

@lihongan
Copy link
Contributor

pre-merge tested on OpenStack and looks good now

$ oc -n openshift-ingress-operator patch ingresscontroller/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy": {"type":"LoadBalancerService", "loadBalancer": {"scope":"External"}}}}'
ingresscontroller.operator.openshift.io/intlb patched

$ oc get co/ingress
NAME      VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.17.0-0.ci.test-2024-08-28-040831-ci-ln-13rxbgt-latest   True        True          False      23m     ingresscontroller "intlb" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: LoadBalancerProgressing=True (OperandsProgressing: One or more managed resources are progressing: The IngressController scope was changed from "Internal" to "External".  To effectuate this change, you must delete the service: `oc -n openshift-ingress delete svc/router-intlb`; the service load-balancer will then be deprovisioned and a new one created.  This will most likely cause the new load-balancer to have a different host name and IP address from the old one's.  Alternatively, you can revert the change to the IngressController: `oc -n openshift-ingress-operator patch ingresscontrollers/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"Internal"}}}}'`).

$ oc -n openshift-ingress delete svc/router-intlb
service "router-intlb" deleted

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.110.89   38.102.83.53   80:32114/TCP,443:30834/TCP   4m23s

will test on IBMCloud later

@lihongan
Copy link
Contributor

pre-merge tested on IBMCloud and also looks good

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                         PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.111.21   53084b0d-eu-de.lb.appdomain.cloud   80:31026/TCP,443:32398/TCP   78s

$ oc -n openshift-ingress-operator patch ingresscontroller/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy": {"type":"LoadBalancerService", "loadBalancer": {"scope":"External"}}}}'
ingresscontroller.operator.openshift.io/intlb patched

$ oc -n openshift-ingress delete svc/router-intlb
service "router-intlb" deleted

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                         PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.242.68   72736508-eu-de.lb.appdomain.cloud   80:32041/TCP,443:30579/TCP   65s

$ oc -n openshift-ingress-operator get ingresscontroller/intlb -oyaml
<......>
  - lastTransitionTime: "2024-08-28T09:24:02Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady

@Miciah
Copy link
Contributor

Miciah commented Aug 28, 2024

/assign

@Miciah
Copy link
Contributor

Miciah commented Aug 28, 2024

@SzucsAti, this is a follow-up to #796. Are you available to review the changes to the IBM DNS provider?

@gcs278 gcs278 changed the title OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic and Add Missing Instructions to the Progressing Condition Aug 28, 2024
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from f51b1d3 to 7a97b9a Compare August 28, 2024 15:29
Copy link
Contributor

@SzucsAti SzucsAti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you for the fix!

@gcs278
Copy link
Contributor Author

gcs278 commented Sep 16, 2024

/retest

@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch 2 times, most recently from 4758a5b to ec2fcdf Compare September 16, 2024 20:46
Both IBM public and private cloud unit tests were missing unit test
coverage. This update extends test coverage for the Delete and
CreateOrUpdateRecord functions. This commit provides an important point
of reference for future commits that may preturb the existing
functionality.

Both Test_createOrUpdateDNSRecord functions previously only tested the
update logic. In order to test the create logic, `CreateDNSRecord` and
`CreateResourceRecord` needed to be implemented in the public and
private `fake_client.go` respectively.

The new test cases required the ability to control the response and results
of `ListAllDnsRecords`, which were previously hardcoded in both public
and private IBM cloud unit tests. Both public and private unit tests
were updated to use the new OutputResults field specified in
the `ListAllDnsRecordsInputOutput` struct, allowing the new test
cases to specify no result (indicating no existing DNS record) so we
can trigger the create logic.

Lastly, various test cases were added to cover untested scenarios, such
as testing CNAME Records, mismatching targets, missing record types or
IDs, handling nil results, etc.
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch 2 times, most recently from 32caad5 to 2223383 Compare September 16, 2024 20:55
@gcs278
Copy link
Contributor Author

gcs278 commented Sep 16, 2024

infra failures
/retest

@gcs278 gcs278 changed the title [WIP] OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic Sep 16, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 16, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Sep 16, 2024

Removed WIP, sorry for the delay, I am open to reviews again @candita @Miciah @SzucsAti

Updates since last review:

  • I added more unit test coverage.
    • I felt it necessary to fix existing bugs in the unit tests and add more coverage as it provides confidence that I didn't perturb the DNS provider logic in an unintentional way.
    • The unit tests being in separate commits is intentional and provides a valuable diff before & after the actual bug fix here. You can review the last commit and see there were a few required unit test updates which are results of the bug fixes I added.
  • I removed create/update support for multiple targets in DNSRecords for IBM Cloud.
    • First, the code was misleading because IBM DNS doesn't support it, and therefore it doesn't work (it hot loops and overwrites records in IBM private cloud).
    • Second, since the bug fix actually impacts the way that multi-targeted DNSRecords would get created in IBM Public Cloud, it seemed logical to fix this right now, otherwise, we'd create a future headache for maintaining compatibility (if we were ever concerned about compatibility here)

@gcs278
Copy link
Contributor Author

gcs278 commented Sep 16, 2024

Keeping the hold while I wait to see what the results of adding a IBM Cloud job are openshift/release#56785

/hold

@gcs278
Copy link
Contributor Author

gcs278 commented Sep 17, 2024

/retest

@gcs278
Copy link
Contributor Author

gcs278 commented Oct 28, 2024

Now that openshift/release#56785 has merged, though it will most likely fail due to DNS Propagation issues (being tested in #1132), I'm curious to see if we see anything interesting:
/test e2e-ibmcloud-operator

@gcs278
Copy link
Contributor Author

gcs278 commented Oct 30, 2024

I don't think it's enabled, but just curious:
/testwith e2e-ibmcloud-operator #1164

@gcs278
Copy link
Contributor Author

gcs278 commented Oct 30, 2024

/testwith openshift/cluster-ingress-operator/master/e2e-ibmcloud-operator #1164

@gcs278
Copy link
Contributor Author

gcs278 commented Nov 11, 2024

After introducing and running e2e-ibmcloud-operator on different PRs, I realized it was passing successfully without this fix. I was mistaken that our E2E would have caught this bug. However, it looks like adding a Availabile=True condition check to TestScopeChange after deleting the service ensures test coverage for bugs like this. Updated this PR to add that.

Now let's test to make sure everything works:
/testwith openshift/cluster-ingress-operator/master/e2e-ibmcloud-operator #1164

The IBM Public Cloud DNS provider (`cis_provider.go`) had a bug in
`createOrUpdateDNSRecord` where it checked for the existence of a
DNS record by filtering both DNS name and target. If the target
was updated (e.g., due to a load balancer recreation), the logic
would not match the existing DNS record. As a result, the function
would attempt to create a new record, but fail because a record with
that name already existed, as multiple DNS records with the same
name are not allowed in IBM Cloud DNS providers.

The fix is to remove the filtering by target and rely solely on
filtering by name, as the name is the only attribute that needs
to be unique.

Additionally, the IBM DNS logic doesn't work for multiple targets and
this creates unexpected and problematic results. The logic has been
refactored to only create using the first target and it warns the user
when multiple targets are set. This change is low risk since the Ingress
Operator will never create a DNSRecord with multiple targets in
`desiredDNSRecord`.

Modify TestScopeChange to check for `Available=True` after deleting the
service to add test coverage for this issue.
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from 396acf9 to d93c10b Compare November 12, 2024 01:33
@gcs278
Copy link
Contributor Author

gcs278 commented Nov 12, 2024

Forgot to push...
/testwith openshift/cluster-ingress-operator/master/e2e-ibmcloud-operator #1164

@gcs278
Copy link
Contributor Author

gcs278 commented Nov 12, 2024

/testwith openshift/cluster-ingress-operator/master/e2e-ibmcloud-operator openshift/cluster-ingress-operator/pull#1164

Copy link
Contributor

openshift-ci bot commented Nov 12, 2024

@gcs278, testwith: Error processing request. ERROR:

could not determine job runs: invalid format for additional PR: openshift/cluster-ingress-operator/pull#1164

@gcs278
Copy link
Contributor Author

gcs278 commented Nov 12, 2024

/testwith openshift/cluster-ingress-operator/master/e2e-ibmcloud-operator https://github.com/openshift/cluster-ingress-operator#1164

@gcs278
Copy link
Contributor Author

gcs278 commented Nov 12, 2024

multi-pr-openshift-cluster-ingress-operator-1133-openshift-cluster-ingress-operator-1164 passed which has E2E test coverage for this bug, proving it resolved the problem.

@gcs278
Copy link
Contributor Author

gcs278 commented Nov 12, 2024

/retest-required

@Miciah Miciah added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 13, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Dec 5, 2024

/retest

Copy link
Contributor

openshift-ci bot commented Dec 5, 2024

@gcs278: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ibmcloud-operator 2223383 link false /test e2e-ibmcloud-operator
ci/prow/e2e-azure-ovn d93c10b link false /test e2e-azure-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +137 to +139
if resourceRecord.ID == nil {
return fmt.Errorf("delete: record id is nil")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check this earlier, like where we check the resourceRecord.Type?

@@ -190,6 +194,10 @@ func (p *Provider) createOrUpdateDNSRecord(record *iov1.DNSRecord, zone configv1
log.Info("Warning: TTL must be one of [1 60 120 300 600 900 1800 3600 7200 18000 43200]. RecordTTL set to default", "default DSNSVCS record TTL", defaultDNSSVCSRecordTTL)
record.Spec.RecordTTL = defaultDNSSVCSRecordTTL
}
// We only support one target, warn the user.
if len(record.Spec.Targets) > 1 {
log.Info("Warning: Only one DNSRecord target is supported. Additional targets will be ignored.", "targets", record.Spec.Targets)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you decide that this should only be an Info warning? Shouldn't it be an Error to be more visible?

case iov1.CNAMERecordType:
inputRData, err := p.dnsService.NewResourceRecordInputRdataRdataCnameRecord(target)

// TODO DNS record update should handle the case where we have an A record and want a CNAME record or vice versa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is a carried-over TODO, but should it be done now in order to fix the bug? Or is it out of scope?

@@ -152,6 +154,10 @@ func (p *Provider) createOrUpdateDNSRecord(record *iov1.DNSRecord, zone configv1
log.Info("Warning: TTL must be between 120 and 2,147,483,647 seconds, or 1 for Automatic. RecordTTL set to default", "default CIS record TTL", defaultCISRecordTTL)
record.Spec.RecordTTL = defaultCISRecordTTL
}
// We only support one target, warn the user.
if len(record.Spec.Targets) > 1 {
log.Info("Warning: Only one DNSRecord target is supported. Additional targets will be ignored.", "targets", record.Spec.Targets)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more visible if it was logged as an Error.

}
}
if result == nil || result.Result == nil {
return fmt.Errorf("createOrUpdateDNSRecord: ListAllDnsRecords returned nil as result")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for the very first record created? We can't find any to update, but we need to go on to create it instead of returning an error, don't we?

createDnsRecordInputOutput: dnsclient.CreateDnsRecordInputOutput{
InputId: "testCreate",
OutputError: nil,
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusOK},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to find a way to check that the list was updated?

OutputError: errors.New("error in CreateDnsRecord"),
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusRequestTimeout},
},
expectErrorContains: "error in CreateDnsRecord",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really seem to be testing any error input if we're just matching the supplied error.

Comment on lines 57 to +58
func (fdc FakeDnsClient) ListResourceRecords(listResourceRecordsOptions *dnssvcsv1.ListResourceRecordsOptions) (result *dnssvcsv1.ListResourceRecords, response *core.DetailedResponse, err error) {
fakeListDnsrecordsResp := &dnssvcsv1.ListResourceRecords{}
recordType := string(iov1.ARecordType)
rData := map[string]interface{}{"ip": fdc.ListAllDnsRecordsInputOutput.RecordTarget}

fakeListDnsrecordsResp.ResourceRecords = append(fakeListDnsrecordsResp.ResourceRecords, dnssvcsv1.ResourceRecord{ID: &fdc.ListAllDnsRecordsInputOutput.RecordName, Name: &fdc.ListAllDnsRecordsInputOutput.RecordName, Type: &recordType, Rdata: rData})

resp := &core.DetailedResponse{
StatusCode: fdc.ListAllDnsRecordsInputOutput.OutputStatusCode,
Headers: map[string][]string{},
Result: result,
RawResult: []byte{},
}

return fakeListDnsrecordsResp, resp, fdc.ListAllDnsRecordsInputOutput.OutputError
return fdc.ListAllDnsRecordsInputOutput.OutputResult, fdc.ListAllDnsRecordsInputOutput.OutputResponse, fdc.ListAllDnsRecordsInputOutput.OutputError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listResourceRecordsOptions is unused. Also, can you add a comment about what is actually being returned here? We haven't changed anything.

Comment on lines +13 to +16
"k8s.io/utils/pointer"

"github.com/IBM/go-sdk-core/v5/core"
"github.com/IBM/networking-go-sdk/dnssvcsv1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for imports not in order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review this further after learning more about my questions in cis_provider_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. priority/backlog Higher priority than priority/awaiting-more-evidence. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants