Servers keep dropping their network connection: is it the server’s or the network’s fault?

I had just arrived home from work when I received a call reporting that many servers were dropping their network connections. The caller was very concerned that there was a major problem. I promptly logged into the network and started looking at the network equipment.

The affected servers were connected to Brocade MLXe switches via Multi-Chassis Trunking (MCT). If you are not familiar with MCT, it is similar to Cisco’s Virtual Port Channel (vPC): it allows two MLX chassis to act like a single switch from the server’s point of view. LACP is used to create a LAG (trunk/EtherChannel) to the server.

Upon reviewing the switch log, I found the following (newest entries first).

Feb 28 16:08:14:W:LACP: 13/32 state changes from LACP_BLOCKED to FORWARD
Feb 28 16:08:14:I:LACP: Port 13/32 mux state transition: not aggregate -> aggregate
Feb 28 16:08:14:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP up
Feb 28 16:08:12:I:LACP: Port 13/32 partner port state transition: not aggregate -> aggregate
Feb 28 16:08:12:I:LACP: Port 13/32 rx state transition: defaulted -> current
Feb 28 16:06:43:I:LACP: Port 13/32 rx state transition: current -> expired (reason: timeout)
Feb 28 16:05:29:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP down
Feb 28 16:05:29:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP up
Feb 28 16:05:26:I:CLUSTER FSM: Cluster CNS-Cluster (Id: 1), client (RBridge Id: 161) – Remote client CCEP down
Feb 28 16:05:26:I:LACP: Port 13/32 mux state transition: aggregate -> not aggregate (reason: peer is out of sync)
Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
Feb 28 16:04:36:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State FORWARDING (EnableFwding)
Feb 28 16:04:36:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State LEARNING (EnableLearning)
Feb 28 16:03:05:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State FORWARDING (EnableFwding)
Feb 28 16:03:05:I:RSTP: VLAN VLAN: 110 Port 12/32 – STP State LEARNING (EnableLearning)
Feb 28 14:29:28:I:RSTP: VLAN VLAN: 110 Port 16/6 – STP State FORWARDING (EnableFwding)

One thing I quickly noticed was the lack of interface up/down entries. I eventually concluded that the LACP and RSTP entries were a result of the interface coming back up. The mass outage I was called about wasn’t such a mass outage after all. Yes, the log showed many servers going down, but not all at the same time, and none of them were down now.
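Reading a newest-first log like the one above by eye is error-prone. As a minimal sketch, the Python below parses entries in this format and pairs each LACP "FORWARD to DOWN" change with the next change back to "FORWARD" on the same port, to estimate how long the port was out. The regular expressions are inferred only from the sample entries above, and the function name, the constants, and the placeholder year (the log lines carry none) are assumptions made for illustration.

```python
import re
from datetime import datetime

# Entry layout inferred from the sample log (an assumption, not a documented format):
#   "Mon DD HH:MM:SS:<severity>:<facility>: <message>"
ENTRY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}):(?P<sev>[A-Z]):"
    r"(?P<facility>[^:]+):\s*(?P<msg>.*)$"
)
# Matches e.g. "13/32 state changes from FORWARD to DOWN"
LACP_STATE_RE = re.compile(
    r"^(?P<port>\d+/\d+) state changes from (?P<old>\S+) to (?P<new>\S+)$"
)


def lacp_outages(lines, year=2000):
    """Return (port, down_time, up_time, seconds) for each LACP DOWN -> FORWARD pair.

    `year` is a placeholder because the log timestamps do not include one.
    """
    entries = []
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m and m["facility"] == "LACP":
            when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            entries.append((when, m["msg"]))

    # The switch prints newest entries first; reverse, then stable-sort by time
    # so entries sharing a timestamp stay in their original chronological order.
    entries.reverse()
    entries.sort(key=lambda entry: entry[0])

    down_at = {}
    outages = []
    for when, msg in entries:
        state = LACP_STATE_RE.match(msg)
        if not state:
            continue
        port = state["port"]
        if state["new"] == "DOWN":
            down_at[port] = when
        elif state["new"] == "FORWARD" and port in down_at:
            start = down_at.pop(port)
            outages.append((port, start, when, (when - start).total_seconds()))
    return outages


# LACP entries for port 13/32, copied from the log above.
SAMPLE_LOG = """\
Feb 28 16:08:14:W:LACP: 13/32 state changes from LACP_BLOCKED to FORWARD
Feb 28 16:08:14:I:LACP: Port 13/32 mux state transition: not aggregate -> aggregate
Feb 28 16:06:43:I:LACP: Port 13/32 rx state transition: current -> expired (reason: timeout)
Feb 28 16:05:26:I:LACP: Port 13/32 mux state transition: aggregate -> not aggregate (reason: peer is out of sync)
Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
"""

if __name__ == "__main__":
    for port, start, end, seconds in lacp_outages(SAMPLE_LOG.splitlines()):
        print(f"{port}: down {start:%H:%M:%S}, forwarding {end:%H:%M:%S}, {seconds:.0f}s")
```

Fed the LACP entries for port 13/32 from the log above, this reports a single outage of 168 seconds (16:05:26 to 16:08:14), which fits the picture of short, self-recovering drops rather than one mass outage.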

After some more conversation with the server team the next day, we came to the conclusion that all of the HP Gen8 servers were having this issue. They would drop their connection, send an SNMP trap, then recover by the time a support engineer could take a look at the server. I was surprised to hear this had been going on for many weeks. Knowing that the Brocade MLXe MCT had been stable for a couple of years, I felt safe suggesting that the server team update the NIC drivers on the servers. There was an update available, and it resolved the issue.

I have had wireless NIC drivers cause connectivity issues in the past, but never on a server.
Can any of you share any stories where a server network card driver caused an issue?

Back to the switch log: where were the port up/down entries?


After some more searching, I figured out that I had the following command applied to the interface: “no snmp-server enable traps link-change”. I have this command on every interface that is NOT an uplink interface. It prevents the switch from sending interface up/down traps to the monitoring system. I do this because I don’t want to receive port up/down traps when the servers do their scheduled reboots or when the server team takes a server down for maintenance.

I removed the “no snmp-server enable traps link-change” command from a test interface, connected my PC to the port, and then disconnected it. I received the following log entries.

Feb 26 09:39:40:I:System: Interface ethernet 15/35, state down – link down
Feb 26 09:39:02:I:RSTP: VLAN VLAN: 4 Port 15/35 – STP State FORWARDING (EnableFwding)
Feb 26 09:39:02:I:RSTP: VLAN VLAN: 4 Port 15/35 – STP State LEARNING (EnableLearning)
Feb 26 09:39:02:I:System: Interface ethernet 15/35, state up

It turned out that the “no snmp-server enable traps link-change” command was also suppressing the link up/down log entries, not just the SNMP traps. When I talked to Brocade support, they confirmed that this is a software defect in 5.2d.
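One rough way to spot ports affected by this behaviour is to look for ports that show LACP or RSTP transitions in the log but never a "System: Interface ethernet" up/down entry. The Python below is a sketch under that assumption only. The entry pattern is inferred from the sample entries in this post, the function name and facility list are hypothetical, and a port can legitimately appear without link entries if the log window is too short, so treat the result as a hint rather than proof.

```python
import re

# Entry layout inferred from the sample entries in this post (an assumption):
#   "Mon DD HH:MM:SS:<severity>:<facility>: <message>"
ENTRY_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}:[A-Z]:(?P<facility>[^:]+):\s*(?P<msg>.*)$"
)
# Slot/port identifiers such as "13/32"
PORT_RE = re.compile(r"\b(\d+/\d+)\b")


def ports_missing_link_events(lines, protocol_facilities=("LACP", "RSTP")):
    """List ports with protocol transitions but no interface up/down entries."""
    protocol_ports = set()
    link_ports = set()
    for line in lines:
        entry = ENTRY_RE.match(line.strip())
        if not entry:
            continue
        port = PORT_RE.search(entry["msg"])
        if not port:
            continue
        if entry["facility"] == "System" and "Interface ethernet" in entry["msg"]:
            link_ports.add(port.group(1))
        elif entry["facility"] in protocol_facilities:
            protocol_ports.add(port.group(1))
    return sorted(protocol_ports - link_ports)


# Entries copied from the two logs in this post.
SAMPLE_LOG = """\
Feb 28 16:08:14:W:LACP: 13/32 state changes from LACP_BLOCKED to FORWARD
Feb 28 16:05:26:W:LACP: 13/32 state changes from FORWARD to DOWN
Feb 26 09:39:02:I:RSTP: VLAN VLAN: 4 Port 15/35 – STP State FORWARDING (EnableFwding)
Feb 26 09:39:02:I:System: Interface ethernet 15/35, state up
"""

if __name__ == "__main__":
    print(ports_missing_link_events(SAMPLE_LOG.splitlines()))
```

With these sample entries, port 13/32 (where the command was still applied) is flagged, while test port 15/35 is not because it logged an interface state change.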

Have you run into this software defect on the MLX?
Have you run into this driver issue on the HP Gen8 servers? If so, what switches were you using?

