My love/hate relationship with Cisco Nexus 1000v Part 2

Continuing with the hate part of my relationship, we recently ran into an issue where our Primary VSM died on us. It would boot up to this:
Loader Loading stage 1.5.


Loader loading, please wait...



User Access Verification
KCN1K login: admin
Password:
No directory, logging in with HOME=/
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2010, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php
System coming up. Please wait...
Couldn't open shared segment for cli server
System is not ready. Please retry when system is ready.
Press F and hit <Return> to start shell in Forced mode.
[Note: Shell may not be fully functional in Forced mode]
F
Entering Forced mode
cli_share_segment: errors pertaining to opening shared memory
Shared memory create failed!
Error while communicating with vshd.
Error setting roles for user: admin
User role list initialization failed
Error while communicating with vshd.
Component group list initialization failed
Could not communicate with vshd
Could not communicate with vshd
When we purchased the Nexus 1000V, it was part of our Enterprise Plus licensing from VMware, so we have to open our support requests through VMware. The advice from VMware was to simply dump the Primary VSM, promote the Secondary to Primary, and deploy a new Secondary. Easy enough: I killed the Primary, took a snapshot of the Secondary, and promoted the Secondary to Primary. When it rebooted, it went into a reboot loop and wouldn't come online. My critical mistake was rolling back the snapshot - the VSM did boot properly afterward, but only half of the VEMs would register to it, and the other half had all kinds of unusual networking issues (some VMs working, others not, and so on).

Long story short, after VMware engaged Cisco to assist, it was discovered that the VSM had gone into a reboot loop because we had recently upgraded from 10 Gigabit Ethernet to 10 Gigabit FCoE, and I had failed to configure mac-pinning on the uplink port profile in the VSM (I was still using CDP pinning instead). Apparently, once the Secondary VSM was promoted to Primary, it had trouble communicating with the VEMs, which caused the reboot loop. Still confusing is why our configuration worked at all, and why it kept working with half of the hosts but not the other half. The world may never know.
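For reference, here is a minimal sketch of what the uplink port profile should have looked like once we switched to mac-pinning, in place of the CDP sub-group pinning we had been using. The profile name and VLAN numbers below are made up for illustration; your trunk and system VLANs will differ:

port-profile type ethernet SYSTEM-UPLINK
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 10,20,30
  channel-group auto mode on mac-pinning
  no shutdown
  system vlan 10,20
  state enabled

The important line is channel-group auto mode on mac-pinning, which pins traffic to individual uplinks in sub-groups rather than relying on CDP (or an upstream port channel) to build them.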

After getting the VEMs re-registered, promoting the Secondary to Primary was successful, and I was then able to deploy a new Secondary VSM. After that, I upgraded to the newest 1.4 VSM and VEMs, and everything seemed to be moving smoothly - until I tried to boot up VMs. Networking would not come up, and a vemcmd show ports revealed that all of the VM ports were in a blocked status. After many hours (spanning two time zones) on the phone with Cisco, including a technical escalation to someone who was running commands I had never even heard of, all but two of our hosts were back in service (one of which was purple-screening, caused by an odd Emulex driver issue).
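If you hit something similar, the quickest check is from the ESX host itself rather than the VSM. This is a sketch - the exact command form and output columns vary a bit between VEM versions:

vemcmd show port

Healthy VM ports show a forwarding state; in our case every VM port was reported as blocked, which lined up with the dead networking inside the guests.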

Another call to Cisco was required to resolve the issue with the other non-working host - the VEM module was flapping with these errors:
2011 Feb 20 22:33:44 KCN1K %VEM_MGR-2-VEM_MGR_REMOVE_UNEXP_NODEID_REQ: Removing VEM 20 (Unexpected Node Id Request)
2011 Feb 20 22:33:44 KCN1K %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet20/5 is detached (module removed)
2011 Feb 20 22:33:44 KCN1K %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet20/6 is detached (module removed)
Backing up a step, I intentionally left out part of the story: upon completing the 1000v 1.3a -> 1.4 upgrade, I was unable to run the svs upgrade complete command as documented - it returned the error "command failed. upgrade is not in start/progress state." At the time, I had no idea what the consequences of this might be. It turns out there was one significant consequence. If you run vemcmd show card from the host itself, you will see a couple of related fields, named Secondary VSM MAC and Upgrade. In a typical operating environment, these read 00:00:00:00:00:00 and Default, respectively. On all of our hosts, Upgrade was showing Started (an obvious consequence of not being able to complete the upgrade), and on this one particular host, the MAC address of the Secondary VSM was present instead of the all-zeros address. Apparently, this is what caused the Unexpected Node Id Request error message. Go ahead, Google it - as of today, that error message is non-existent as far as Google is concerned.
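A quick way to check for this condition on each host (again from the ESX shell; the field names are as they appeared on our 1.4 VEMs and may differ in other releases):

vemcmd show card | grep -iE "upgrade|secondary"

If Upgrade shows anything other than Default, or Secondary VSM MAC shows anything other than 00:00:00:00:00:00, the upgrade was never marked complete on that VEM.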

So what was the fix? Simple, really - just a single command on the VEM (on every VEM, really): echo "upgrade complete" > /tmp/dpafifo. You know you want to - Google this one too. Nothing? I didn't find anything either. Thankfully, after running this command on all of the hosts (along with a vem restart, and even a host reboot on a few of the hosts), our cluster was once again healthy. Never a dull moment.
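Putting it together, the per-host recovery looked roughly like this (run from the ESX host shell; a few of our hosts still needed a full reboot afterward):

echo "upgrade complete" > /tmp/dpafifo
vem restart
vemcmd show card

The last command is just to confirm that Upgrade has gone back to Default before moving on to the next host.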

While you are at it, take a look at the Bug (CSCtn49830) that Cisco opened for this issue.

Comments

Unknown said…
Man, what a great writeup. Not related to my exact issue, but a well written doc that I learned from. Thanks for that! I really appreciate the fact that you included so much information on the problem symptoms and resolution.

awesome =)

-Pete
