
Should fence_mpath agent be utilized instead of the fence_scsi agent? #26

Open
rcproam opened this issue Apr 3, 2019 · 7 comments

rcproam commented Apr 3, 2019

This is not an issue with the current design. Possibly label as enhancement?

Specifically, given the documented issue "RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage", would it be more reliable to use the fence_mpath agent rather than the fence_scsi agent?
I've encountered an issue very similar to the one described here: https://access.redhat.com/solutions/3201072

Red Hat recommends using the fence_mpath agent instead of fence_scsi to resolve this particular issue; however, fence_mpath is more complex to configure and likely comes with its own caveats and issues.
https://access.redhat.com/articles/3078811

I still need to test the fence_mpath agent with my particular build-out to confirm whether it resolves the fencing/SCSI reservation issue I've encountered, but I'm opening this issue in case others have time to test the fence_mpath agent before I can.
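
For anyone who wants to experiment, a rough sketch of what a fence_mpath stonith device might look like with pcs follows. The device path, reservation keys, and resource name are placeholders (not from this deployment); check `pcs stonith describe fence_mpath` for the exact parameters supported by your fence-agents version.

# Sketch only -- /dev/mapper/mpatha and the per-node key values are placeholders
pcs stonith create fence-mpath fence_mpath \
    pcmk_host_map="cluster-nas1:1;cluster-nas2:2" \
    devices="/dev/mapper/mpatha" \
    meta provides=unfencing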

rcproam commented Apr 4, 2019

Description of fence_mpath agent and how it functions compared to fence_scsi:

fence_mpath: new fence agent for dm-multipath based on mpathpersist
Previously, the scenario of multipath on top of underlying SCSI devices was handled by fence_scsi, which works correctly but has some limitations. The most important is that unfencing has to be done while all paths are available, since it is executed only once. The new fence agent handles this situation properly, because most of these cases are dealt with by mpathpersist, which is part of dm-multipath.
https://lists.fedorahosted.org/pipermail/cluster-commits/2014-November/004033.html
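
As a rough illustration of the difference: fence_mpath drives reservations through mpathpersist, which operates on the whole multipath map, whereas fence_scsi uses sg_persist against SCSI devices directly. The device paths below are placeholders, not taken from this deployment.

# Inspect registrations and the reservation on the multipath map (what fence_mpath manages)
mpathpersist --in --read-keys /dev/mapper/mpatha
mpathpersist --in --read-reservation /dev/mapper/mpatha

# Equivalent view of a single underlying path (what fence_scsi manages via sg_persist)
sg_persist --in --read-keys /dev/sdb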

ewwhite commented Apr 4, 2019

I'd still suggest seeing if you can debug your specific issue. I don't know of anyone using fence_mpath for this type of setup, and there are plenty of folks using this guide with success.

Please note what I mentioned about diverse heartbeat network paths.
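
(Not from this thread, just a sketch of one way to get diverse heartbeat paths with the pcs/corosync versions of that era: give each node a second address on a separate network when creating the cluster, so corosync runs a redundant ring. The cluster name and the -alt hostnames are hypothetical.)

# Sketch: two comma-separated addresses per node create a second corosync ring (RRP)
pcs cluster setup --name zfs-nas \
    cluster-nas1,cluster-nas1-alt \
    cluster-nas2,cluster-nas2-alt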

rcproam commented Apr 8, 2019

Thanks @ewwhite, I will try to debug some more... I'm still trying to understand how the pcs resource start and stop timeouts affect failover, as the suggested 90 seconds seems like a very large value (IIRC the TCP session timeout for NFS is only about 60 seconds).
Also, my particular deployment uses a SuperMicro Storage Bridge Bay (SBB), which includes an internal Ethernet interconnect between the nodes that I am using for heartbeats.
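
A minimal sketch of how those operation timeouts can be inspected and adjusted with pcs (the resource name nfs-service and the 60s value are placeholders for illustration, not recommendations):

# Show the operations (and their timeouts) configured on a resource
pcs resource show nfs-service

# Change the stop timeout on that resource
pcs resource update nfs-service op stop timeout=60s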

rcproam commented Apr 8, 2019

So I placed node #2 (cluster-nas2) into standby, then shut it down completely. When I subsequently start node #2 back up, it causes Pacemaker to crash on node #1. Below is the excerpt from the syslog on node #1 showing the sequence:

Apr 8 01:35:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 01:50:41 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 01:50:41 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3481, saving inputs in /var/lib/pacemaker/pengine/pe-input-367.bz2
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: Transition 3481 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-367.bz2): Complete
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:05:41 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:05:41 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3482, saving inputs in /var/lib/pacemaker/pengine/pe-input-367.bz2
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: Transition 3482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-367.bz2): Complete
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:17:01 svr-lf-nas1 CRON[13384]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:18:52 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:18:52 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3483, saving inputs in /var/lib/pacemaker/pengine/pe-input-368.bz2
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: Transition 3483 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-368.bz2): Complete
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: Scheduling Node cluster-nas2 for shutdown
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3484, saving inputs in /var/lib/pacemaker/pengine/pe-input-369.bz2
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: Transition 3484 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-369.bz2): Complete
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:19:15 svr-lf-nas1 cib[2845]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 cib[2845]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [TOTEM ] A new membership (198.51.100.1:884) was formed. Members left: 2
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [QUORUM] Members[1]: 1
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [TOTEM ] A new membership (198.51.100.1:884) was formed. Members left: 2
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [QUORUM] Members[1]: 1
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:19:15 svr-lf-nas1 pacemakerd[2840]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:19:15 svr-lf-nas1 stonith-ng[2846]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 stonith-ng[2846]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Removing all cluster-nas2 attributes for peer loss
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Lost attribute writer cluster-nas2
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:25 svr-lf-nas1 kernel: [3133621.758535] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:19:27 svr-lf-nas1 ntpd[2809]: Deleting interface #11 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=3133254 secs
Apr 8 02:19:28 svr-lf-nas1 kernel: [3133624.730941] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:19:30 svr-lf-nas1 ntpd[2809]: Listen normally on 12 eno3 198.51.100.1:123
Apr 8 02:20:33 svr-lf-nas1 kernel: [3133689.895368] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:20:35 svr-lf-nas1 ntpd[2809]: Deleting interface #12 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=65 secs
Apr 8 02:20:37 svr-lf-nas1 kernel: [3133692.983744] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:20:38 svr-lf-nas1 ntpd[2809]: Listen normally on 13 eno3 198.51.100.1:123
Apr 8 02:20:42 svr-lf-nas1 kernel: [3133698.535494] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:20:44 svr-lf-nas1 ntpd[2809]: Deleting interface #13 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=6 secs
Apr 8 02:20:45 svr-lf-nas1 kernel: [3133701.371873] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:20:47 svr-lf-nas1 ntpd[2809]: Listen normally on 14 eno3 198.51.100.1:123
Apr 8 02:21:12 svr-lf-nas1 kernel: [3133728.815903] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:21:14 svr-lf-nas1 ntpd[2809]: Deleting interface #14 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=27 secs
Apr 8 02:21:38 svr-lf-nas1 kernel: [3133754.760563] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:21:38 svr-lf-nas1 kernel: [3133754.760633] igb 0000:05:00.0 eno3: Link Speed was downgraded by SmartSpeed
Apr 8 02:21:40 svr-lf-nas1 ntpd[2809]: Listen normally on 15 eno3 198.51.100.1:123
Apr 8 02:22:35 svr-lf-nas1 kernel: [3133811.692929] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:22:37 svr-lf-nas1 ntpd[2809]: Deleting interface #15 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=57 secs
Apr 8 02:23:34 svr-lf-nas1 kernel: [3133870.401931] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:23:34 svr-lf-nas1 kernel: [3133870.401997] igb 0000:05:00.0 eno3: Link Speed was downgraded by SmartSpeed
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [TOTEM ] A new membership (198.51.100.1:888) was formed. Members joined: 2
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [TOTEM ] A new membership (198.51.100.1:888) was formed. Members joined: 2
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: error: Node cluster-nas2[2] appears to be online even though we think it is dead
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_INTEGRATION
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [QUORUM] Members[2]: 1 2
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [QUORUM] Members[2]: 1 2
Apr 8 02:23:35 svr-lf-nas1 pacemakerd[2840]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 stonith-ng[2846]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Recorded attribute writer: cluster-nas2
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: error: Cannot perform modification with no data
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: warning: Completed cib_modify operation for section status: Invalid argument (rc=-22, origin=cluster-nas2/crmd/35, version=0.256.6)
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: warning: Another DC detected: cluster-nas2 (op=noop)
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Syncing the Cluster Information Base from cluster-nas2 to rest of cluster
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Requested version <generation_tuple crm_feature_set="3.0.11" validate-with="pacemaker-2.6" epoch="256" num_updates="13" admin_epoch="0" cib-last-written="Mon Apr 8 02:18:52 2019" update-origin="cluster-nas2" update-client="crm_attribute" update-user="hacluster" have-quorum="1" dc-uuid="2"/>
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Updating all attributes after cib_refresh_notify event
Apr 8 02:23:36 svr-lf-nas1 ntpd[2809]: Listen normally on 16 eno3 198.51.100.1:123
Apr 8 02:23:36 svr-lf-nas1 stonith-ng[2846]: notice: Operation reboot of cluster-nas1 by cluster-nas2 for [email protected]: OK
Apr 8 02:23:36 svr-lf-nas1 stonith-ng[2846]: notice: Operation on of cluster-nas2 by cluster-nas2 for [email protected]: OK
Apr 8 02:23:37 svr-lf-nas1 crmd[2850]: crit: We were allegedly just fenced by cluster-nas2 for cluster-nas2!
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: warning: The crmd process (2850) can no longer be respawned, shutting the cluster down.
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Shutting down Pacemaker
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping pengine
Apr 8 02:23:37 svr-lf-nas1 kernel: [3133873.286736] sd 0:0:13:0: Parameters changed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: new_event_notification (2847-2850-7): Bad file descriptor (9)
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 pengine[2849]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping attrd
Apr 8 02:23:37 svr-lf-nas1 attrd[2848]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping lrmd
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping stonith-ng
Apr 8 02:23:37 svr-lf-nas1 stonith-ng[2846]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping cib
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Disconnected from Corosync
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Disconnected from Corosync
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Shutdown complete
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Attempting to inhibit respawning after fatal error
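
For what it's worth, a rough sketch of commands that could help confirm what the cluster thinks happened in the "allegedly just fenced" sequence above (the node name matches the log; these are standard Pacemaker tools, not something taken from this report):

# Ask stonithd for its record of fencing actions involving node #1
stonith_admin --history cluster-nas1

# One-shot view of membership and resource state after the nodes rejoin
crm_mon -1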

ewwhite commented Apr 8, 2019

Can you show me the pcs resource creation string you used for the fencing?

Maybe also the cluster creation string... and also your hosts files?
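
(For reference, the kind of fencing creation string being asked about usually looks something like the sketch below; the device path and host names are placeholders, not the poster's actual configuration.)

pcs stonith create fence-scsi fence_scsi \
    pcmk_host_list="cluster-nas1 cluster-nas2" \
    devices="/dev/mapper/mpatha" \
    meta provides=unfencing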

ewwhite commented Apr 12, 2019

Any updates? @rcproam

rcproam commented Apr 12, 2019 via email
