Hello fellow VMware admins. Time and time again, we have dealt with strange issues which are really hard to pinpoint.
One issue I recently dealt with was especially frustrating and sent me deep into troubleshooting, all the way down to the kernel logs. What I found is described below.
First, a bit of history. Several days before the issue appeared, we added three more hosts to our existing VMware cluster. The hosts are all rack servers with 10GbE cards, since we run a massive production environment.
After racking the servers, installing ESXi and adding the hosts to the cluster, we saw network connectivity to the hosts drop every time many machines were moved with vMotion, whether manually or by DRS.
After looking into the kernel logs and getting a little help from VMware, we concluded that the 10GbE cards in the new servers were not certified for ESXi and that the driver they were using was not suitable.
Here are a couple of screenshots of the vmkernel log while vMotion was running:
As we can see, the socket is suddenly closed and the driver fails. That brings the network adapter down, and the management network goes down with it. A quick restart of the management network fixes the issue, but only temporarily: the next vMotion fails in the same way.
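If you want to look for the same symptoms on your own host, here is a minimal sketch for pulling the most recent driver messages out of the vmkernel log. It assumes the log lives at /var/log/vmkernel.log, which is the usual location on ESXi, and the function name driver_log_lines is only an illustrative helper of mine. What the lines actually say will depend on your hardware and ESXi build.

```shell
# Show the last 20 vmkernel log lines that mention a given driver.
# $1 = driver name (matched case-insensitively)
# $2 = log file (optional, assumed default: /var/log/vmkernel.log)
driver_log_lines() {
    grep -i "$1" "${2:-/var/log/vmkernel.log}" | tail -n 20
}

# Only run against the live log when it is present and readable.
if [ -r /var/log/vmkernel.log ]; then
    driver_log_lines ixgben
fi
```

Running it right after a failed vMotion should make it easier to spot the moment the adapter goes down.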
Now here comes the fun part! The permanent fix!
It looks like VMware has two drivers for these types of cards: ixgbe and ixgben.
The issue occurs when the ixgben driver is in use, and it appears to come from an incompatibility between that driver and the network card. The driver fails when the buffer value climbs under heavy vMotion traffic.
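Before changing anything, it helps to confirm which driver your NICs are actually bound to. The sketch below filters the output of esxcli network nic list, whose third column is the driver name. The helper name nics_using_driver is my own, and I am assuming awk is available in your ESXi shell.

```shell
# Print the names of NICs bound to a given driver.
# Reads `esxcli network nic list` output on stdin:
# column 1 is the NIC name, column 3 is the driver.
nics_using_driver() {
    # NR > 2 skips the header line and the dashed separator line.
    awk -v drv="$1" 'NR > 2 && $3 == drv { print $1 }'
}

# Only call esxcli when it exists, i.e. when running on an ESXi host.
if command -v esxcli >/dev/null 2>&1; then
    echo "NICs currently using ixgben:"
    esxcli network nic list | nics_using_driver ixgben
fi
```

If nothing is listed, your cards are not on ixgben and this particular fix probably does not apply to you.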
So to fix this issue, we will just disable the ixgben driver and enable the ixgbe one.
To do that, we will run the following commands from the management shell:
# esxcli system module set --enabled=true --module=ixgbe
# esxcli system module set --enabled=false --module=ixgben
Now just restart the ESXi host so the driver change takes effect, and you’re all done. I hope this article helps. Come back for a lot more!
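Once the host is back up, you may want to double-check that the switch really happened. This is a minimal sketch that reads the "Is Loaded" and "Is Enabled" columns from esxcli system module list for both drivers. The helper name module_state is mine, and again I am assuming awk is present in the ESXi shell.

```shell
# Print "<is loaded> <is enabled>" for one module.
# Reads `esxcli system module list` output on stdin:
# column 1 is the module name, columns 2 and 3 are the two flags.
module_state() {
    awk -v mod="$1" '$1 == mod { print $2, $3 }'
}

# Only call esxcli when it exists, i.e. when running on an ESXi host.
if command -v esxcli >/dev/null 2>&1; then
    for m in ixgbe ixgben; do
        echo "$m (loaded enabled): $(esxcli system module list | module_state "$m")"
    done
fi
```

After the fix I would expect ixgben to report false for both flags and ixgbe to report true. If a module shows nothing at all, it is not installed on that host.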