Disaster recovery using Multisite (light) with NSX-T 2.4

Last week I presented in front of give or take 400 VMware colleagues (at the VMware WWKO19 / TechSummit in Las Vegas), where I had expected to present in front of 40 colleagues. As if that was not exciting and nerve-wracking enough, I decided to do a live demo. Unfortunately, due to the nerves and the last-minute decision to include an unplanned live demo (because live demos are much cooler), the demo did not work out as I expected. The main reason was that I was not able to restore a backup on another NSX-T Manager with another IP address, which is crucial when you are doing disaster recovery with NSX-T 2.4 across multiple sites. I promised to find out what I did wrong, or what was wrong, and post a video about this…

Well, guess what?! I now know (what I did not know then) that restoring a backup to another NSX-T Manager with another IP address is not supported from the GUI. You need to do some REST API magic to make it work. I have tested this REST API magic and the complete disaster recovery procedure, and recorded it just as promised!

I have created a 4-part video series that demonstrates how NSX-T 2.4 multi-site (Disaster Recovery) works. Part 4 is definitely the cherry on top, but make sure you watch parts 1, 2 and 3 as well, so that you have a good understanding of the environment and fully understand what is happening.

A bit of a spoiler here: recovering from a site failure is a lengthy process with a lot of (manual) steps, checks and specific prerequisites. The whole process took me around 45 minutes (if I subtract my own slowness)!

The full high-level steps that should be taken are described below and can be watched in Part 4 of the video series:

  1. Make sure DC1 NSX-T Manager(s) is using FQDN for component registration and backup
    1. This is not the case out of the box
    2. This can only be turned on (and off) with a REST API call
  2. Verify that the backup is done correctly, with the FQDN in the folder name
  3. Verify that the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
  4. Deploy (a) new NSX-T Manager(s) on DC2 with a new IP address in a different IP range than the one the DC1 NSX-T Manager was in
  5. SIMULATE A DISASTER IN DC1 + START CONTINUOUS PING FROM WEB01 (172.16.10.11) + START STOPWATCH
    1. Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
  6. Repoint the DNS A record to the (new) NSX-T Manager(s) in DC2
  7. Make sure this new DC2 NSX-T Manager(s) is using FQDN for component registration and backup
    1. This is basically the same as we did in step 1
  8. Restore the backup on the new DC2 NSX-T Manager(s)
    1. This may take around 20 minutes to finish
  9. Verify if the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
    1. This is basically the same as we did in step 3
  10. Run the SRM Recovery Plan on the DC2 SRM Server and recover the Web, App and DB VMs of DC1
  11. Log in to the (newly restored from backup) NSX-T Manager(s)
  12. Move the T1 Gateway from the DC1-EN-CLUSTER (that is no longer available) to the DC2-EN-CLUSTER
  13. Move the uplink from the DC1-T0 Gateway (that is no longer available) to the DC2-T0 Gateway
  14. Verify if ping starts working again
    1. Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
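
To give an idea of what the REST API call in steps 1 and 7 and the DNS check after step 6 could look like, here is a minimal Python sketch. The endpoint `/api/v1/configs/management` and the `publish_fqdns` field are how I remember them from the NSX-T API guide, so treat them as assumptions and verify them against the API reference for your version. The hostname, the credentials and the function names are made up for illustration. The `_revision` value should be the one returned by a prior GET on the same endpoint. The sketch only builds the request; it does not send it.

```python
import base64
import json
import socket
import urllib.request

# Assumption: endpoint path as recalled from the NSX-T API guide; verify for your version.
MGMT_CONFIG_PATH = "/api/v1/configs/management"


def build_fqdn_request(manager, user, password, revision, enable=True):
    """Build (but do not send) the PUT request that turns FQDN publishing on or off."""
    # Assumption: "publish_fqdns" is the field that controls FQDN registration.
    body = json.dumps({"publish_fqdns": enable, "_revision": revision}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{manager}{MGMT_CONFIG_PATH}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def fqdn_points_to(fqdn, expected_ip, resolve=socket.gethostbyname):
    """Return True when the A record for fqdn resolves to expected_ip."""
    return resolve(fqdn) == expected_ip


if __name__ == "__main__":
    # Hypothetical lab values, only used to show the request that would be sent.
    req = build_fqdn_request("nsxmgr.lab.local", "admin", "secret", revision=0)
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` with an SSL context that matches your lab certificates. After repointing the DNS A record, `fqdn_points_to` can be used to confirm that the FQDN resolves to the new DC2 manager before you start the restore.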

Have fun testing this out! I will be writing an extensive blog post about this soon, so look out for that one, but for now you will have to watch the 4-part video series.

PART 1 = Introduction to the Lab / POC / Virtual Network environment https://youtu.be/rqmuTJeuAeA

PART 2 = Ping + Trace-route tests to demonstrate normal operation of the Active / Standby deployment https://youtu.be/c-HkB2PCcas

PART 3 = Ping + Trace-route tests to demonstrate normal operation of the Active / Active deployment https://youtu.be/MAp7BTDjfag

PART 4 = Simulate failure on DC1 and continue operations from DC2 https://youtu.be/auuNPPaQkV0