Is the problem in wiring?

by Chris Josephes

Let's say you have evidence of network errors. Here are the symptoms that you see:

1. A lot of TCP retransmits (layer 4)
2. No Ethernet frame errors, dropped packets, or CRC errors (layer 2)
3. No ICMP errors or IP-level errors. Pings report no lag or dropped packets. (layer 3)
4. Failures are only reported on two nodes in your network, but no errors on the switch between the two nodes.

Given the above evidence, would you look at the wiring between the two nodes, including the patch panel ports? If so, why?

No wrong answers, just trying to bring about an open discussion of opinions.



13 Comments

Rob
2007-12-14 06:59:26
I'll admit, my network-fu isn't the greatest ... but if there were lots of TCP problems but nothing on the lower layers (especially with the lack of lag/dropped packets) I'd look at the TCP/IP configs and then maybe the NICs themselves before I looked at the wiring.
Steve
2007-12-14 09:45:56
Usually TCP retransmits come from some sort of network congestion. I guess retransmits could also happen when the speeds of the two nodes are different. In that case I'd check the NIC speed of both nodes, and that could eventually lead to a wiring check to make sure the cable pinouts are correct. It's possible someone made a custom patch cable and didn't wire it up to spec.
Kelly
2007-12-14 10:44:19
Because of the lack of layer 1 and 2 errors I would narrow things down a bit. Have you tried different switch ports? You might also try pinging with a payload. Sometimes you won't see errors with small pings, but larger pings show problems. Do you have other nodes running through the same switch that are not experiencing problems? Through trial and error you will need to eliminate the different possibilities. What type of switch are you using? If it's a Cisco or other smart switch, you should have a good number of diagnostic commands available to look at port statistics such as buffers.
Kelly
2007-12-14 10:51:02
I would also agree with Steve (from a prior comment) that you should make sure that the speed and duplex settings on the switch and on each of the nodes match. If the NIC is set to auto-negotiate speed and duplex, then the switch port should be set the same. Likewise, if the NIC is specifically set to 100Mb full duplex, then the switch port should be set to 100Mb full duplex too, or else change the NIC to match the switch.
Stefan
2007-12-14 10:55:50
Replacing the wiring can be an easy and quick fix. Try the low-hanging fruit first--- if it doesn't work, you only wasted 10 minutes. On the downside, you introduced another change to the system.


We had a similar problem a year ago.


There were no errors, dropped packets or CRC errors in the logfiles. We discovered that the devices weren't logging any errors at all-- the log config was misconfigured!


There were no ICMP errors or IP level errors. Pings reported no lag or dropped packets. This was because the problem was intermittent, and it was gone by the time our network engineer ran his ping tests. (Running continuous ping tests eventually showed some packet loss.)


The problem? There were several--- an RJ45 connector was loose on one device (a bad handmade cable) and a second device was spamming the network for a few seconds at a time.

vance
2007-12-15 05:08:08
At my company, we were losing our wireless connection because of a microwave oven... we believed that its EMF was interfering with the Cisco routers. After we bought a less powerful microwave, the problem subsided.


Weird but interesting.

Andrew Green
2007-12-15 08:31:46
I saw something similar some years ago -- it turns out I'd plugged cables into ports 8 and 9 of an 8 port hub, where 9 was the hard-wired uplink socket and thus internally shared a connection with port 8. It happened because the hub was buried by gear and I made the connections by feel and got left and right mixed up.
Babak Farrokhi
2007-12-15 11:14:46
I would check the MTU and TCP MSS instead of wiring.
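For anyone following along, the arithmetic behind an MTU/MSS check can be sketched as below. This is only an illustration, assuming IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP echo header; the function names and the 1400-byte tunnel MTU are made up for the example.

```python
# Header sizes in bytes, assuming no IP or TCP options are in use.
IPV4_HEADER = 20
TCP_HEADER = 20
ICMP_HEADER = 8  # ICMP echo request header


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits without fragmentation."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def segment_fits(tcp_payload: int, path_mtu: int) -> bool:
    """True if a segment with this payload crosses the path unfragmented."""
    return tcp_payload + IPV4_HEADER + TCP_HEADER <= path_mtu


if __name__ == "__main__":
    # 1500 is standard Ethernet; 1400 is a hypothetical tunnel MTU.
    for mtu in (1500, 1400):
        print(mtu, tcp_mss(mtu), max_ping_payload(mtu))
```

If a host advertises an MSS computed from a 1500-byte MTU but some hop only passes 1400 bytes, `segment_fits(1460, 1400)` is false: small pings get through while full-sized TCP segments do not, which would match the symptoms in the post.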
Andy Davidson
2007-12-19 01:37:08
Hey Chris


I'd be interested to know the role of the nodes which are having problems, since this may point to useful clues which will help you find a fix. If both nodes were fileservers, say, then I would expect the profile of the traffic to be significantly different to other important, high traffic nodes on your network like log hosts, etc - from the filer, the packets would be big, the traffic would be relatively bursty, it would be two-way and sensitive to latency, etc.


Do you have data on precisely what traffic is retransmitted? For example, is it exclusively lots of ACK traffic? Is it Linux, and are you using FACK, i.e. forward acknowledgment (/proc/sys/net/ipv4/tcp_fack)? Or it could be a TCP stack bug ...
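On Linux, one rough way to put a number on the retransmits is to compare the TCP counters in /proc/net/snmp at two points in time. The sketch below assumes the usual layout of that file (a `Tcp:` line of field names followed by a `Tcp:` line of values); the helper names are my own, and these counters are host-wide, so they will not tell you which connection is affected.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Pair the 'Tcp:' header line with the 'Tcp:' value line."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp section found")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before: dict, after: dict) -> float:
    """Fraction of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def read_tcp_counters(path: str = "/proc/net/snmp") -> dict:
    """Read the counters from a live Linux host."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())
```

Calling `read_tcp_counters()` twice, a minute or so apart, and passing both results to `retransmit_ratio` gives a figure you can compare between the two problem nodes and a healthy one.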


A bad (too short) RTT estimate will cause excessive retransmission, since TCP isn't really good at recovering from such an event. So although there are no ICMP errors, is the path *consistently* low-latency?
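To illustrate how a short RTT estimate turns into retransmits, here is a minimal sketch of the standard retransmission timeout calculation from RFC 6298 (smoothed RTT plus four times the RTT variance). It ignores clock granularity, and the class name and the `min_rto` knob are assumptions for the example; real stacks apply their own floors and tweaks.

```python
class RtoEstimator:
    """Simplified RFC 6298 retransmission timeout estimator (seconds)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto: float = 1.0, initial_rto: float = 1.0):
        self.min_rto = min_rto          # RFC 6298 recommends a 1 s floor
        self.initial_rto = initial_rto  # used before any RTT sample exists
        self.srtt = None
        self.rttvar = None

    def update(self, rtt: float) -> float:
        """Feed one RTT measurement and return the new timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto

    @property
    def rto(self) -> float:
        if self.srtt is None:
            return self.initial_rto
        return max(self.min_rto, self.srtt + self.K * self.rttvar)


def fires_early(estimator: RtoEstimator, actual_rtt: float) -> bool:
    """True if the timer expires before the ACK for this segment arrives."""
    return actual_rtt > estimator.rto
```

With the floor removed (`min_rto=0.0`) and a long run of steady 10 ms samples, the timeout settles close to 10 ms, so a single 50 ms round trip makes `fires_early` return true and the segment is retransmitted even though nothing was lost. That is why a path that is only *usually* low-latency can produce retransmits with no errors at the lower layers.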


MTU mismatch, as hinted earlier?


Do your affected nodes have a buggy TCP Window Auto-Tuning implementation (a new feature that has appeared in this year's generation of operating systems, incl. Mac, MS Win, FreeBSD, Linux kernel)?


Lots of things it could be .... including wiring :-) ... but that is not where I would start.

Charlie
2007-12-20 09:15:58
Given the evidence above, using Cisco gear, I would not look at wiring.
If you have a wiring issue, it would manifest as CRC errors, or at least dropped packets of all types.
The first thing that comes to mind with TCP issues would be MTU in a WAN environment, especially where data is passing through a VPN.
If it's a local LAN environment this should not be an issue, unless you have a cheap hub or switch in between that cannot handle the throughput, but again this usually shows up as dropped packets.


The next thing I would look at is access lists or, more often, rate limits, which can target specific traffic.


Only after checking these items would I branch out into a layer 2 or 1 issue.

Richard
2007-12-20 09:47:49
We have sometimes seen unusual network problems where the switch ports have not auto-negotiated the correct speeds, i.e. one port is set to 10Mbit when it should be 100Mbit, although this does tend to lead to some packet loss.
Ian
2008-01-28 16:07:00
ACLs are a good possibility, since they can be precise to a node or service, but they shouldn't send RST if they are there for security reasons.


MTU mismatch requires intent on most contemporary LAN equipment and OSs. MSS could be a problem if you don't trust the application - well behaved apps won't overdrive the TCP stack and won't 'forget' to send ACKs forcing the sliding window size down - though some Flash applications and Java applets seem to do just that. Both will be clearly evident in the TCP dump.


It could be a higher level application problem and the RSTs are "proper" behaviour. There are numerous applications that reject connection attempts poorly, resulting in the TCP connection being reset by the OS properly, particularly if you deal with lots of proxies or locally produced network daemons.


I've seen such behaviour from automated SSL tunnels when the cipher suite was misconfigured, from client-server applications where the client IP differed from the server's cached information, and from HTTP servers behind proxies when a session cookie arrives at the wrong virtual server instance. In some cases a RST is proper; in others a FIN should have been generated but wasn't.


Applications (on the network) mess up networks far more often than problems in the network itself, like wiring and configs - because it is harder to break the network "just for me" inside L1-4.