Mar 04

Servers keep dropping their network connection: is it the server’s or the network’s fault?

I had just arrived home from work when I received a call stating that many servers were dropping their network connection. The voice on the other end was very concerned that there was a major problem. I promptly logged into the network and started looking at the network equipment.

The specific servers were connected to Brocade MLXe switches via Multi Chassis Trunking (MCT). If you are not familiar with MCT, it is similar to Cisco’s Virtual Port Channel (vPC). It allows two MLX chassis to act like a single switch from the server’s view. LACP is used to create a LAG (Trunk/Etherchannel) to the server.

Upon reviewing my log, I found the following.

Feb 28 16:08:14:W:LACP: 13/32 state changes from LACP_BLOCKED to FORWARD
Feb 28 16:08:14:I:LACP: Port 13/32 mux state transition: not aggregate -> aggregate
Feb 28 16:08:14:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP up
Feb 28 16:08:12:I:LACP: Port 13/32 partner port state transition: not aggregate -> aggregate
Feb 28 16:08:12:I:LACP: Port 13/32 rx state transition: defaulted -> current
Feb 28 16:06:43:I:LACP: Port 13/32 rx state transition: current -> expired (reason: timeout)
Feb 28 16:05:29:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP down
Feb 28 16:05:29:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP up
Feb 28 16:05:26:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP down
Feb 28 16:05:26:I:LACP: Port 13/32 mux state transition: aggregate -> not aggregate (reason: peer is out of sync)
Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
Feb 28 16:04:36:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State FORWARDING (EnableFwding)
Feb 28 16:04:36:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State LEARNING (EnableLearning)
Feb 28 16:03:05:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State FORWARDING (EnableFwding)
Feb 28 16:03:05:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State LEARNING (EnableLearning)
Feb 28 14:29:28:I:RSTP: VLAN VLAN: 110 Port 16/6 – STP State FORWARDING (EnableFwding)

One thing I quickly noticed was the lack of interface up/down entries. I eventually concluded that the LACP, cluster, and RSTP entries were a result of the interfaces dropping and then returning to UP. The mass outage that I was called about wasn’t such a mass outage after all. Yes, the log showed many servers going down, but not all at the same time, and none of them were down now.
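To check that the servers did not all drop at once, each port’s FORWARD-to-DOWN entry can be paired with its next return to FORWARD. Here is a minimal Python sketch, assuming the log format shown above. The function name `lacp_outages` and the `year` argument are my own illustration, and the year value is an arbitrary placeholder because the log lines carry no year.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
STATE_CHANGE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}):\w:LACP: "
    r"(?P<port>\d+/\d+) state changes from (?P<old>\w+) to (?P<new>\w+)$"
)


def lacp_outages(log_lines, year):
    """Return (port, went_down, came_back, seconds) tuples, oldest first.

    Assumes the log is listed newest first, as in the output above.
    """
    events = []
    for line in reversed(list(log_lines)):
        m = STATE_CHANGE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["port"], m["new"]))
    events.sort(key=lambda e: e[0])  # stable sort keeps same-second order

    down_since = {}
    outages = []
    for ts, port, new_state in events:
        if new_state == "DOWN":
            down_since.setdefault(port, ts)
        elif new_state == "FORWARD" and port in down_since:
            start = down_since.pop(port)
            outages.append((port, start, ts, int((ts - start).total_seconds())))
    return outages


sample = """\
Feb 28 16:08:14:W:LACP: 13/32 state changes from LACP_BLOCKED to FORWARD
Feb 28 16:06:43:I:LACP: Port 13/32 rx state transition: current -> expired (reason: timeout)
Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
""".splitlines()

for port, start, end, secs in lacp_outages(sample, year=2000):
    print(f"{port}: down {start:%H:%M:%S}, forwarding {end:%H:%M:%S} ({secs}s)")
# 13/32: down 16:05:26, forwarding 16:08:14 (168s)
```

Run against the full log, a list like this makes it easy to see whether the outages overlap or are scattered across different ports and times.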

After some more conversation with the server team the next day, we came to the conclusion that all of the HP Gen8 servers were having this issue. They would drop their connection, send an SNMP trap, then recover by the time a support engineer could take a look at the server. I was surprised to hear this had been going on for many weeks. Knowing that the Brocade MLXe MCT had been stable for a couple of years, I felt safe suggesting that the server team update the NIC drivers on the servers. There was an update available, and it resolved the issue.

I have had wireless NIC drivers cause connectivity issues in the past, but never on a server.
Can any of you share any stories where a server network card driver caused an issue?

Back to the switch log: where were the port up/down entries?


After some more searching, I figured out that I had the following command applied to the interface: “no snmp-server enable traps link-change”. I have this command on every interface that is NOT an uplink interface. This command prevents the switch from sending interface up/down traps to the monitoring system. I do this because I don’t want to receive port up/down traps when the servers do their scheduled reboots or when the server team takes a server down for maintenance.

I removed the “no snmp-server enable traps link-change” command from a test interface. I then connected my PC to the port and disconnected it. I received the following log entries.

Feb 26 09:39:40:I:System: Interface ethernet 15/35, state down – link down
Feb 26 09:39:02:I:RSTP: VLAN VLAN: 4 Port 15/35 – STP State FORWARDING (EnableFwding)
Feb 26 09:39:02:I:RSTP: VLAN VLAN: 4 Port 15/35 – STP State LEARNING (EnableLearning)
Feb 26 09:39:02:I:System: Interface ethernet 15/35, state up

I found that the “no snmp-server enable traps link-change” command also prevents the link up/down log entries. According to Brocade support, this is a software defect in 5.2d.
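If you want to know which ports would be missing link up/down log entries because of this, you can scan a saved copy of the running config for the command. The Python sketch below assumes the usual stanza layout (an `interface ...` line, indented sub-commands, and a `!` separator). The function name and the sample config are made up for illustration and are not taken from my switches.

```python
SUPPRESS = "no snmp-server enable traps link-change"


def ports_without_link_logs(config_text):
    """List interfaces whose stanza contains the trap-suppression command.

    Assumes each stanza starts with an 'interface ...' line and ends
    with '!' (or a blank line).
    """
    ports = []
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            current = line[len("interface "):]
        elif line == "!" or not line:
            current = None
        elif line == SUPPRESS and current is not None:
            ports.append(current)
    return ports


# Hypothetical config excerpt, for illustration only.
sample_config = """\
interface ethernet 15/35
 no snmp-server enable traps link-change
!
interface ethernet 16/1
 port-name uplink-placeholder
!
"""

print(ports_without_link_logs(sample_config))
# ['ethernet 15/35']
```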

Have you run into this software defect on the MLX?
Have you run into this driver issue on the HP Gen8 servers? If so, what switches were you using?

No account is needed to post a reply. Find the Reply button below and add your comment!

Feb 25

QOS on the 4500-E with IOS XE is different from the older 4500s.

For years I have configured QOS on Cisco switches. The 4500s and 6500s always caused me the most frustration. Depending on the line card, you may have 2 or 4 hardware queues, with the priority queue different on each platform. The 4500s and 6500s are different than the other Cisco platforms.

I recently had the privilege to set up a new 4506-E running IOS-XE 3.3.0XO (15.1(1)XO). For the most part, this switch was very similar to the older 4500s. Most of my configuration from the other 4500s pasted easily into the switch. I was doing well until I got to the QOS portion of the config. I found that, with the exception of the marking policy, nothing else worked.

What did not work?
1. Trust DSCP commands on the uplink ports
2. Egress queueing, mapping COS value to the hardware queue
3. COS To DSCP mapping
4. Selecting the priority queue (which queue was the priority queue?)

After some more digging I found out that the 4500 trusts DSCP and COS values by default. This explains why the commands would not work. This in itself may cause new challenges for you. If you are not careful, an end user or application could put all of their traffic in the EF queue and use up all of your priority bandwidth. To resolve this challenge, I mark all traffic coming in from an edge port. How do you handle this?

Egress queuing is done with a policy map, just like on a router. The switch comes with 8 hardware queues. (I’m very happy to hear that Cisco finally added 8 hardware queues per port on their switches. Other vendors have been doing this for years.) In your policy map you identify which queue you want to be the priority queue, then under each class you can specify the bandwidth you want to give that queue. For more information on how to configure the policy map, please refer to the Cisco IOS XE documentation. You do need to be careful while going through this guide. The IOS XE software runs on routers too, so you may find documentation that only applies to ASR routers and does not work on the 4500 platform.

COS to DSCP mappings may not be needed. I mark all of my traffic at the edge port with DSCP values. I mark DSCP so I do not need to worry about the COS value being dropped going over an access port. With IOS XE, the outbound queuing policy is capable of queuing egress traffic by DSCP value. Due to this, I don’t have to worry about the COS to DSCP or DSCP to COS mappings.
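As a quick illustration of why DSCP is the safer thing to mark: DSCP is the upper six bits of the IP header’s ToS byte, so it travels with the packet, while COS lives in the 802.1Q tag and disappears on untagged links. The Python sketch below shows the arithmetic, including the conventional DSCP-to-COS relationship (keep the three most significant bits). The function names are mine, and the mapping shown is the common convention, not something read from the 4500’s config.

```python
# Well-known DSCP code points (decimal values).
DSCP = {"EF": 46, "AF41": 34, "CS3": 24}


def tos_byte(dscp, ecn=0):
    """Build the IP ToS byte: DSCP in the upper 6 bits, ECN in the lower 2."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("dscp must be 0-63 and ecn 0-3")
    return (dscp << 2) | ecn


def cos_from_dscp(dscp):
    """Conventional DSCP-to-COS mapping: keep the top 3 bits of the DSCP."""
    if not 0 <= dscp <= 63:
        raise ValueError("dscp must be 0-63")
    return dscp >> 3


for name, value in DSCP.items():
    print(f"{name}: DSCP {value}, ToS byte 0x{tos_byte(value):02X}, "
          f"COS {cos_from_dscp(value)}")
# EF: DSCP 46, ToS byte 0xB8, COS 5
# AF41: DSCP 34, ToS byte 0x88, COS 4
# CS3: DSCP 24, ToS byte 0x60, COS 3
```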

Now for the priority queue and mapping QOS markings to a hardware queue. This has been a major frustration for me due to the different capabilities and commands on the variety of Cisco platforms. To me, this has been no different than if every platform were a different vendor’s equipment. Due to the challenges in the past, I was really getting upset when I wasn’t able to find any documentation on how to perform this mapping. After speaking with my Cisco SE, I found out that IOS XE will automatically place the different classes in your policy map into different hardware queues. The specific queue for the priority traffic is GONE!!! The unique commands for allocating QOS markings to hardware queues are GONE!!!

Even though this new IOS was frustrating at first, I believe the changes in IOS regarding QOS are a drastic improvement. The old method was more difficult than I feel it should be. As long as you understand Cisco’s MQC logic, the change in trust, and the automatic queuing methods, I believe you will find this IOS much better to work with. Do you agree?

Other than marking and queuing, what other changes have you noticed in this new version of software?
Are you happy with the newer IOS XE on the 4500?