Control plane network recovery and fault tolerance. #139

Open
YufengXin opened this issue Mar 5, 2020 · 3 comments

@YufengXin
Contributor

(1) LC shutdown and stateful recovery were validated on the RENCI testbed: the switches kept all their flow rules.
(2) Mininet runs into an issue because port occupancy is not cleared.
(3) On-disk database and manifest at the LC.
(4) Working on adding a backup port for management-plane resiliency under link failure, assuming one link failure at a time.
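
To illustrate item (4): with only one link failing at a time, a backup port helps only where an alternate path exists. The following is a minimal sketch, assuming the management network can be modeled as an undirected graph of sites. The site names and links are hypothetical (not read from the actual manifest), and `is_connected` and `critical_links` are illustrative helpers, not project code.

```python
from collections import deque

def is_connected(nodes, links):
    """Return True if every node is reachable from the first one over `links`."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)

def critical_links(nodes, links):
    """Return the links whose individual failure would partition the network."""
    return [link for link in links
            if not is_connected(nodes, [l for l in links if l != link])]

# Hypothetical management topologies, for illustration only.
SITES = ["renci", "duke", "unc", "ncsu"]
CHAIN = [("renci", "duke"), ("duke", "unc"), ("unc", "ncsu")]
RING = CHAIN + [("ncsu", "renci")]

print(critical_links(SITES, CHAIN))  # every link is critical in a chain
print(critical_links(SITES, RING))   # no single failure partitions a ring
```

Any link reported by `critical_links` is one where a backup port would be needed to survive a single failure.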

@YufengXin changed the title from "Management network recovery and fault tolerance." to "Control plane network recovery and fault tolerance." on May 21, 2020
@YufengXin
Contributor Author

(1) Adding a backup port in the manifest file proved to work.
(2) Next, the computation of the new spanning tree needs to be automated.
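
One possible way to automate item (2) is to rebuild the tree with a breadth-first search that skips the failed link. This is only a sketch under stated assumptions: the link list is hypothetical, `spanning_tree` is an illustrative name, and the actual controller may compute its tree differently.

```python
from collections import deque

def spanning_tree(root, links, failed=()):
    """Build a BFS spanning tree from `root`, ignoring links listed in `failed`.

    Returns a {node: parent} mapping; the root maps to None. Nodes that are
    unreachable after the failure are simply absent from the result.
    """
    down = {frozenset(link) for link in failed}
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in down:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

# Hypothetical ring of sites, for illustration only.
LINKS = [("renci", "duke"), ("duke", "unc"), ("unc", "ncsu"), ("ncsu", "renci")]

print(spanning_tree("renci", LINKS))
print(spanning_tree("renci", LINKS, failed=[("renci", "duke")]))
```

Comparing the two results shows which parent ports would have to change after the failure.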

@mcevik0
Contributor

mcevik0 commented Jul 29, 2020

I will add some findings here.
On the RENCI testbed setup, the following steps were performed:

  1. Activate SDX controller
  2. Activate Local Controllers on all sites (RENCI, DUKE, UNC, NCSU)
[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policies

============================================================================== 
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies
============================================================================== 
{
  "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies",
  "links": {
    "policy2": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/2",
      "policynumber": 2,
      "type": "FloodTree",
      "user": "SDXCTLR"
    },
    "policy3": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/3",
      "policynumber": 3,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy4": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/4",
      "policynumber": 4,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy5": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/5",
      "policynumber": 5,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy6": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/6",
      "policynumber": 6,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy7": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/7",
      "policynumber": 7,
      "type": "EdgePort",
      "user": "SDXCTLR"
    }
  }
}
  3. Stop Local Controller at DUKE
[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policies

============================================================================== 
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies
============================================================================== 
{
  "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies",
  "links": {
    "policy2": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/2",
      "policynumber": 2,
      "type": "FloodTree",
      "user": "SDXCTLR"
    },
    "policy3": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/3",
      "policynumber": 3,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy4": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/4",
      "policynumber": 4,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy5": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/5",
      "policynumber": 5,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy6": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/6",
      "policynumber": 6,
      "type": "EdgePort",
      "user": "SDXCTLR"
    },
    "policy9": {
      "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9",
      "policynumber": 9,
      "type": "ManagementSDXRecover",
      "user": "SDXCTLR"
    }
  }
}

[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policy -N 9

============================================================================== 
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9
============================================================================== 
{
  "policy9": {
    "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9",
    "json": {
      "ManagementSDXRecover": {
        "switch": "rencis1"
      }
    },
    "policynumber": "9",
    "type": "ManagementSDXRecover",
    "user": "SDXCTLR"
  }
}
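
The change between the two `get_policies` listings above can be extracted mechanically. The sketch below assumes only the `links` / `policynumber` / `type` fields visible in the output; the sample data is abbreviated from the listings, and `policy_types` and `diff_policies` are hypothetical helper names.

```python
def policy_types(listing):
    """Map policy number -> policy type for a get_policies response."""
    return {entry["policynumber"]: entry["type"]
            for entry in listing["links"].values()}

def diff_policies(before, after):
    """Return (removed, added) policies between two listings."""
    old, new = policy_types(before), policy_types(after)
    removed = {n: t for n, t in old.items() if n not in new}
    added = {n: t for n, t in new.items() if n not in old}
    return removed, added

def _listing(types):
    # Helper that builds an abbreviated listing (href and user fields omitted).
    return {"links": {f"policy{n}": {"policynumber": n, "type": t}
                      for n, t in types.items()}}

# Policy numbers and types as shown before and after stopping the DUKE LC.
BEFORE = _listing({2: "FloodTree", 3: "EdgePort", 4: "EdgePort",
                   5: "EdgePort", 6: "EdgePort", 7: "EdgePort"})
AFTER = _listing({2: "FloodTree", 3: "EdgePort", 4: "EdgePort",
                  5: "EdgePort", 6: "EdgePort", 9: "ManagementSDXRecover"})

removed, added = diff_policies(BEFORE, AFTER)
print("removed:", removed)
print("added:", added)
```

For these listings the diff reports policy 7 (EdgePort) as removed and policy 9 (ManagementSDXRecover) as added.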

Logs on SDX Controller

140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:03:53,924 sdxcontroller.usermanager: 140017185625856 INFO     getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:03:53] "GET /api/v1/policies HTTP/1.1" 200 -
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:03:59,320 sdxcontroller.usermanager: 140017185625856 INFO     getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:03:59] "GET /api/v1/policies HTTP/1.1" 200 -
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:04:01,708 sdxcontroller.usermanager: 140017185625856 INFO     getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:04:01] "GET /api/v1/policies HTTP/1.1" 200 -
SDX Closing: Missing a heartbeat on 0x7f585009f2d0
SDX Heartbeat Closing due to error on 0x7f585009f2d0
ATTRIBUTE ERROR 'NoneType' object has no attribute 'recv' ON CXN Connection:
  address:     10.14.11.2
  port:        34554
  recv_cb:     None
  recv_thread: None
  sock:        None

2020-07-29 16:04:05,465 sdxcontroller: 140017371850496 WARNING  Removing connection Connection:
  address:     10.14.11.2
  port:        34554
  recv_cb:     None
  recv_thread: None
  sock:        None

WARNING:sdxcontroller:Removing connection Connection:
  address:     10.14.11.2
  port:        34554
  recv_cb:     None
  recv_thread: None
  sock:        None

2020-07-29 16:04:05,465 sdxcontroller: 140017371850496 DEBUG    Local Controller Lost connection: dukectlr
DEBUG:sdxcontroller:Local Controller Lost connection: dukectlr
2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG    Getting backup LC.
DEBUG:sdxcontroller:Getting backup LC.
2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG    Got backup LC: rencis1
DEBUG:sdxcontroller:Got backup LC: rencis1
2020-07-29 16:04:05,472 sdxcontroller.rulemanager: 140017371850496 DEBUG    Sending remove breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f5850e469d0>
DEBUG:sdxcontroller.rulemanager:Sending remove breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f5850e469d0>
rule_rm_callback - EdgePort(dukes1)
2020-07-29 16:04:05,474 sdxcontroller.rulemanager: 140017371850496 INFO     add_rule: Beging with rule: ManagementSDXRecover(rencis1)
INFO:sdxcontroller.rulemanager:add_rule: Beging with rule: ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO     add_rule: breakdowns [<shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>]
INFO:debug.sdxcontroller.rulemanager:add_rule: breakdowns [<shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>]
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO     add_rule: hash and cookies set to 9
INFO:debug.sdxcontroller.rulemanager:add_rule: hash and cookies set to 9
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO     _add_rule_to_db: ManagementSDXRecover(rencis1):9
INFO:debug.sdxcontroller.rulemanager:_add_rule_to_db: ManagementSDXRecover(rencis1):9
2020-07-29 16:04:05,476 debug.sdxcontroller.rulemanager: 140017371850496 INFO       ACTIVE_RULE
INFO:debug.sdxcontroller.rulemanager:  ACTIVE_RULE
2020-07-29 16:04:05,476 debug.sdxcontroller.rulemanager: 140017371850496 DEBUG    _install_rule: ManagementSDXRecover(rencis1):9
DEBUG:debug.sdxcontroller.rulemanager:_install_rule: ManagementSDXRecover(rencis1):9
2020-07-29 16:04:05,476 sdxcontroller.rulemanager: 140017371850496 DEBUG    Sending install breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>
DEBUG:sdxcontroller.rulemanager:Sending install breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>
2020-07-29 16:04:05,476 sdxcontroller.rulemanager: 140017371850496 DEBUG        ManagementSDXRecoverRule: switch 201
DEBUG:sdxcontroller.rulemanager:    ManagementSDXRecoverRule: switch 201
rule_add_callback - ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,479 debug.sdxcontroller.rulemanager: 140017371850496 INFO     add_rule: Rule added to db: ManagementSDXRecover(rencis1)
INFO:debug.sdxcontroller.rulemanager:add_rule: Rule added to db: ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,479 sdxcontroller: 140017371850496 WARNING  Removing connection Connection:
  address:     10.14.11.2
  port:        34554
  recv_cb:     None
  recv_thread: None
  sock:        None

WARNING:sdxcontroller:Removing connection Connection:
  address:     10.14.11.2
  port:        34554
  recv_cb:     None
  recv_thread: None
  sock:        None

140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
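
When reading longer controller logs, the failover decisions can be pulled out with a small filter. This sketch assumes the timestamped line layout seen above (timestamp, logger name, thread id, level, message) and skips the duplicated `LEVEL:logger:` copies; `failover_events` is a hypothetical helper, not part of the controller.

```python
import re

# Matches only the timestamped copy of each log record.
TIMESTAMPED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ \S+ \d+ (\w+)\s+(.*)$")
LOST = re.compile(r"Local Controller Lost connection: (\S+)")
BACKUP = re.compile(r"Got backup LC: (\S+)")

def failover_events(lines):
    """Pair each lost LC with the backup LC logged after it (None if absent)."""
    events = []
    for line in lines:
        match = TIMESTAMPED.match(line)
        if not match:
            continue
        when, _level, message = match.groups()
        lost = LOST.search(message)
        if lost:
            events.append({"time": when, "lost": lost.group(1), "backup": None})
            continue
        backup = BACKUP.search(message)
        if backup and events and events[-1]["backup"] is None:
            events[-1]["backup"] = backup.group(1)
    return events

# Lines copied from the log excerpt above.
LOG = [
    "2020-07-29 16:04:05,465 sdxcontroller: 140017371850496 DEBUG    Local Controller Lost connection: dukectlr",
    "DEBUG:sdxcontroller:Local Controller Lost connection: dukectlr",
    "2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG    Getting backup LC.",
    "DEBUG:sdxcontroller:Getting backup LC.",
    "2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG    Got backup LC: rencis1",
    "DEBUG:sdxcontroller:Got backup LC: rencis1",
]

print(failover_events(LOG))
```

An event whose `backup` stays `None` marks a lost LC for which no backup was ever logged.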
  4. While rebuilding the VFC on the DUKE switch, I noticed that the SDX controller crashed with the logs below.
    For some reason, the NCSU connection (10.14.11.4) was affected.
    One data point for debugging: renci_ben.manifest may contain errors (or be missing entries) for the failover/backup ports. Safety checks for such configuration errors would help prevent the system from crashing.
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
SDX Closing: Missing a heartbeat on 0x7f5850942390
SDX Heartbeat Closing due to error on 0x7f5850942390
2020-07-29 16:07:22,894 sdxcontroller: 140017371850496 WARNING  Removing connection Connection:
  address:     10.14.11.4
  port:        44498
  recv_cb:     None
  recv_thread: None
  sock:        None

WARNING:sdxcontroller:Removing connection Connection:
  address:     10.14.11.4
  port:        44498
  recv_cb:     None
  recv_thread: None
  sock:        None

2020-07-29 16:07:22,895 sdxcontroller: 140017371850496 DEBUG    Local Controller Lost connection: ncsuctlr
DEBUG:sdxcontroller:Local Controller Lost connection: ncsuctlr
2020-07-29 16:07:22,899 sdxcontroller: 140017371850496 DEBUG    Getting backup LC.
DEBUG:sdxcontroller:Getting backup LC.
Traceback (most recent call last):
  File "SDXController.py", line 423, in <module>
    sdx._main_loop()
  File "SDXController.py", line 274, in _main_loop
    self._handle_connection_loss(cxn)
  File "SDXController.py", line 226, in _handle_connection_loss
    backuplcswitch = topo.node[name]['internalconfig']['backuplcswitch']
KeyError: 'backuplcswitch'
root@atlanticwave-sdx-controller:/# 
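
A safety check of the kind suggested above could look like the following. This is a minimal sketch, not the project's fix: it models the topology node attributes as a plain nested dict, the sample data is hypothetical (modeled on the controller names in the logs), and `get_backup_lc_switch` and `nodes_missing_backup` are illustrative names.

```python
def get_backup_lc_switch(nodes, name):
    """Return the configured backup LC switch for `name`, or None if absent.

    Uses dict.get at every level so a missing entry cannot raise KeyError.
    """
    internal = nodes.get(name, {}).get("internalconfig", {})
    return internal.get("backuplcswitch")

def nodes_missing_backup(nodes):
    """List nodes that have an internalconfig but no 'backuplcswitch' entry."""
    return sorted(
        name for name, attrs in nodes.items()
        if "internalconfig" in attrs
        and "backuplcswitch" not in attrs["internalconfig"]
    )

# Hypothetical node attributes; the real manifest layout may differ.
NODES = {
    "dukectlr": {"internalconfig": {"backuplcswitch": "rencis1"}},
    "ncsuctlr": {"internalconfig": {}},
}

print(nodes_missing_backup(NODES))            # could be run once at startup
print(get_backup_lc_switch(NODES, "ncsuctlr"))
```

Running the validation when the manifest is loaded would surface a missing entry as a warning, and the guarded lookup lets the connection-loss handler skip recovery instead of terminating the main loop.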

@mcevik0
Contributor

mcevik0 commented Jul 31, 2020

With the corrected manifest, the exception above no longer occurs.
