Posts

Showing posts from 2011

It's Been a While...

It's been a while since I've posted here, primarily because of a job change. Although there are all sorts of negative things I could say about my previous employer, I will instead focus on the positive aspects of my new position. Instead of being solely responsible for practically everything in the enterprise, I am now the Senior Engineer over Global Telephony. There are dedicated teams handling Network, Security, and Servers, which allows me to focus my efforts and produce quality work. That being said, I will be posting more about Telephony and less about other Data Center stuff. So, what is going on in my life? A new CUCM Cluster in London, planning a new CUCM Cluster for Brazil, a CUCMBE installation in Singapore, and a CUCM Migration here in the States. That should keep me busy for a couple of months...

VMware hates its loyal customers

Now that vSphere 5 has officially been announced, has anyone else reviewed the licensing changes [PDF]? They are changing the model to cap total vRAM at the socket-license level. For each socket, these are the vRAM entitlements per license level:

- 24GB vRAM for Essentials Kit
- 24GB vRAM for Essentials Plus Kit
- 24GB vRAM for Standard
- 32GB vRAM for Enterprise
- 48GB vRAM for Enterprise Plus

Let's say that you are using an 8-node cluster, each node with 96GB of physical RAM and Enterprise Plus licensing. Assuming two sockets per node, that is 16 licenses at 48GB each, so you are now entitled to 768GB of vRAM. Now let's say that you use a script, such as this one, to determine how much vRAM is in use in your current environment. If the answer is >768GB, you are now out of compliance. Let's say that you have fairly low consolidation ratios and you are consuming 1024GB of vRAM. That means you need to purchase an additional 256GB of vRAM licensing, which equates to 6...
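The licensing arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: the per-socket entitlements come from the list in the post, all figures are in GB, and the two-sockets-per-host count is my assumption (it is what makes the cluster total come out to 768). The function names are made up for this example.

```python
import math

# Per-socket vRAM entitlements in GB, taken from the list in the post.
VRAM_PER_LICENSE_GB = {
    "Essentials Kit": 24,
    "Essentials Plus Kit": 24,
    "Standard": 24,
    "Enterprise": 32,
    "Enterprise Plus": 48,
}


def vram_entitlement_gb(edition, hosts, sockets_per_host):
    """Total pooled vRAM entitlement: one license per physical socket."""
    return VRAM_PER_LICENSE_GB[edition] * hosts * sockets_per_host


def extra_licenses_needed(edition, entitled_gb, used_gb):
    """Whole licenses required to cover any vRAM used beyond the entitlement."""
    shortfall_gb = max(0, used_gb - entitled_gb)
    return math.ceil(shortfall_gb / VRAM_PER_LICENSE_GB[edition])


# Assumption: 8 hosts with 2 sockets each (the socket count is not stated in the post).
entitled = vram_entitlement_gb("Enterprise Plus", hosts=8, sockets_per_host=2)
extra = extra_licenses_needed("Enterprise Plus", entitled, used_gb=1024)
print(f"Entitled: {entitled}GB, extra licenses needed: {extra}")
```

Licenses can only be bought whole, so the shortfall is rounded up rather than divided evenly.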

Nexus 5000 - FWM-2-STM_LOOP_DETECT

In a previous post, I mentioned problems we were having with one of our Nexus 5000 switches. During all of the Nexus 1000v issues, it was throwing these messages continually:

2011 Mar 29 05:22:13 N5K-2 %FWM-2-STM_LEARNING_RE_ENABLE: Re enabling dynamic learning on all interfaces
2011 Mar 29 05:22:20 N5K-2 %FWM-2-STM_LOOP_DETECT: Loops detected in the network among ports Eth1/10 and Eth1/2 vlan 801 - Disabling dynamic learn notifications for 180 seconds

I couldn't tell if it was actually affecting anything, since VLAN 801 was being used as an FCoE VLAN. Looking at MAC addresses bound to VLAN 801 would reveal one MAC address in particular that would move around:

N5K-2(config)# sho mac add vlan 801
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link
   VLAN     MAC Address      Type      age     Secure NTFY    Ports
---------+-----------------+--------+---------+------+----+-----------...
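If you want to see which port pairs and VLANs keep showing up in these messages, a small script can pull them out of a saved syslog. This is a minimal sketch, assuming the message layout matches the two sample lines quoted above; the regular expression and the function name are my own, not anything provided by NX-OS.

```python
import re

# Matches only the STM_LOOP_DETECT message format shown in the post (assumed layout).
LOOP_RE = re.compile(
    r"^(?P<timestamp>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<switch>\S+) "
    r"%FWM-2-STM_LOOP_DETECT: Loops detected in the network among ports "
    r"(?P<port_a>\S+) and (?P<port_b>\S+) vlan (?P<vlan>\d+)"
)


def parse_loop_events(lines):
    """Return one dict per loop-detect message; all other lines are ignored."""
    events = []
    for line in lines:
        match = LOOP_RE.match(line.strip())
        if match:
            event = match.groupdict()
            event["vlan"] = int(event["vlan"])
            events.append(event)
    return events


# The two sample messages quoted in the post.
SAMPLE_LOG = [
    "2011 Mar 29 05:22:13 N5K-2 %FWM-2-STM_LEARNING_RE_ENABLE: "
    "Re enabling dynamic learning on all interfaces",
    "2011 Mar 29 05:22:20 N5K-2 %FWM-2-STM_LOOP_DETECT: Loops detected in the "
    "network among ports Eth1/10 and Eth1/2 vlan 801 - Disabling dynamic learn "
    "notifications for 180 seconds",
]

events = parse_loop_events(SAMPLE_LOG)
for event in events:
    print(event["switch"], event["port_a"], event["port_b"], event["vlan"])
```

Run against a longer log, the same list of events could be grouped by port pair to see whether the flapping is always between the same two interfaces.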

More Dell PowerEdge M1000e woes

I previously commented about issues we had with one of the pass-through I/O modules in our M1000e chassis. After we opened a case with support, they had us remove the blades, remove the modules, and so on, and it started working. Still not a particularly promising sign. After building out our ESX servers and trying to put VMs on them, we had all kinds of unusual issues trying to run FCoE Active/Active on them. We were getting errors such as:

Apr 23 15:02:04 host vmkernel: 0:00:39:45.582 cpu0:4284)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic2: transmit timed out
Apr 23 15:02:04 host vmkernel: 0:00:39:45.854 cpu3:4260)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.60060060060060060060060060060060".
Apr 23 15:02:04 host vmkernel: 0:00:39:45.854 cpu4:4258)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.60060060060060060060060060060060".
Apr 23 15:02:04 host vmker...
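To get a feel for how often each NIC and storage device shows up in errors like these, the vmkernel log can be tallied with a short script. This is a rough sketch, assuming the lines look like the samples quoted above; the two patterns and the function name are my own and only cover these two message types.

```python
import re
from collections import Counter

# Patterns for the two message types quoted in the post (assumed layout).
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (?P<nic>\S+): transmit timed out")
NULL_PATH_RE = re.compile(
    r'Activated path "NULL" for NMP device "(?P<device>[^"]+)"'
)


def tally_vmkernel_errors(lines):
    """Count transmit timeouts per NIC and NULL path activations per device."""
    nic_timeouts = Counter()
    null_paths = Counter()
    for line in lines:
        watchdog = WATCHDOG_RE.search(line)
        if watchdog:
            nic_timeouts[watchdog.group("nic")] += 1
        null_path = NULL_PATH_RE.search(line)
        if null_path:
            null_paths[null_path.group("device")] += 1
    return nic_timeouts, null_paths


# The three complete sample lines from the post.
SAMPLE_LOG = [
    "Apr 23 15:02:04 host vmkernel: 0:00:39:45.582 cpu0:4284)WARNING: LinNet: "
    "netdev_watchdog: NETDEV WATCHDOG: vmnic2: transmit timed out",
    "Apr 23 15:02:04 host vmkernel: 0:00:39:45.854 cpu3:4260)NMP: "
    'nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device '
    '"naa.60060060060060060060060060060060".',
    "Apr 23 15:02:04 host vmkernel: 0:00:39:45.854 cpu4:4258)NMP: "
    'nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device '
    '"naa.60060060060060060060060060060060".',
]

nic_timeouts, null_paths = tally_vmkernel_errors(SAMPLE_LOG)
print(dict(nic_timeouts))
print(dict(null_paths))
```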

Nexus 1000v and Cisco Support

After writing my previous posts about my love/hate relationship with Nexus 1000v, I received a phone call from the Cisco Nexus 1000v Product Manager. I can only guess that he tracked me down because I posted a Bug ID in there. Regardless, he was very interested in making sure that my issues were resolved, and he pulled some resources together to help me out. I needed the help because my Secondary VSM had started into a reboot loop. Even deploying a fresh VSM would do the same thing after the Config Sync happened. While pulling some debugs off of the busted VSM, somehow 6 of our VEMs (Hosts) unregistered with the Primary VSM. My TAC Engineer was out of the office, but an Engineer from the 1000v Escalation Team got on the phone with me, and started digging around. What he found was this: a 3750 switch, home to several Development ESX Hosts, using a port-channel connected via vPC to our Nexus 7000 switches. The only traffic allowed across this port-channel was the Control/Packet/Managemen...

Dell PowerEdge M1000e

We recently installed a Dell PowerEdge M1000e chassis with a couple of blades in it. Since we wanted to run FCoE to the blade, our only option (at least that was available at the time) was the 10 Gig pass-through blade. After hooking everything up and installing ESX, we found that only one of the 10 Gig links would come up for each host - not a promising start. We also had a score of annoying little problems across the CMC management interface and the KVM interface. So far, I'm not exceedingly impressed by Dell's blade offering.

My love/hate relationship with Cisco Nexus 1000v Part 2

Continuing on with the hate part of my relationship, we recently ran into an issue where our Primary VSM died on us. It would boot up to this:

Loader Loading stage 1.5.
Loader loading, please wait...
User Access Verification
KCN1K login: admin
Password:
No directory, logging in with HOME=/
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2010, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are owned by other third parties and used and distributed under license.
Certain components of this software are licensed under the GNU General Public License (GPL) version 2.0 or the GNU Lesser General Public License (LGPL) Version 2.1.
A copy of each such license is available at http://www.opensource.org/licenses/gpl-2.0.php and http://www.opensource.org/licenses/lgpl-2.1.php
System coming up. Please wait...
Couldn't open shared segment for cli server
System is not ready. Please retry when...

My love/hate relationship with Cisco Nexus 1000v Part 1

Over a year ago, we deployed Cisco Nexus 1000v virtual switching into our VMware cluster. I love some of the features it offers, but the problems we have run across still haunt me. The first time I deployed it into our environment, it was into a virtualized vCenter instance. We were running it on a separate SQL server VM in the cluster. This was all fine and dandy - until we had a major power outage that took the entire cluster offline. The problem with this scenario is that it leads to recursive and cascading failures if not carefully designed. Even if it is carefully designed, it can still make for an unpleasant scenario. Think this through with me. In the event of a complete power outage, what is the recovery process with a typical virtualized vCenter installation? I believe it looks something like this:

1) Power the hosts back up
2) Power up AD server (physical)
3) Locate and power up SQL server (unless this is on-box with vCenter)
4) Locate the vCenter VM and boot it up

What about wi...
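One way to think through the four recovery steps listed above is as a dependency graph, where each component can only come up after the things it needs. The sketch below is only an illustration: the dependency edges are my assumption based on the order of the steps, not a documented VMware requirement, and it needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each key lists what it needs running first. These edges are assumed from the
# step order in the post (hosts and AD first, then SQL, then vCenter).
DEPENDS_ON = {
    "ESX hosts": set(),
    "AD server (physical)": set(),
    "SQL server VM": {"ESX hosts", "AD server (physical)"},
    "vCenter VM": {"ESX hosts", "AD server (physical)", "SQL server VM"},
}


def recovery_order(depends_on):
    """Return a power-up order in which every dependency comes up first."""
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(DEPENDS_ON)
for step, component in enumerate(order, start=1):
    print(f"{step}) {component}")
```

If a circular dependency ever sneaks into the graph, `static_order()` raises `graphlib.CycleError` instead of returning an order, which makes this a cheap sanity check for a recovery plan.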