This technical note is intended to help you tune your network drivers for increased performance or a reduced memory footprint.
First, we need to talk about the network interface hardware chips (ASICs), which are sometimes referred to as NICs (network interface controllers, or network interface chips).
At the risk of oversimplifying, we can categorize NICs into two groups: high-performance and low-performance. We aren't talking about the media bit rate (10, 100 or 1000 Mbit) but rather the ability of the complete system, when using the NIC, to avoid packet loss.
Hardware engineers work very hard to design media that rarely lose a packet. And there's an amplifying effect: when you lose 1% of your packets, you don't lose just 1% of your throughput; you lose around 50% of it, because of the software-level protocol timeouts and/or retransmissions that the loss incurs.
What we call “high-performance” NICs have the ability, in a loaded system, to not lose any packets. They generally do this by using transmit and receive descriptor rings in main memory, which in turn point to packet buffers also in main memory.
High-performance NICs use bus-master DMA (direct memory access) to transfer packet data to and from main memory entirely independent of the CPU, using the descriptor rings as laundry lists of packet transmit and receive requests to carry out.
Thus, large scheduling latencies in the software that services the NIC (e.g. io-pkt*) can be tolerated.
What we call “low-performance” NICs have been observed by users, in loaded systems, to consistently lose packets, with correspondingly poor data throughput. These NICs don't use descriptor rings and DMA; instead, for simplicity, they attempt to buffer the entire packet in a (usually limited) on-chip buffer area.
Unfortunately, these low-performance NICs, because of their low cost and size, are very attractive to board designers. Examples of these (usually older, obsolete) NICs include:
On a fast (e.g. 2 GHz) lightly loaded machine, these low-performance NICs can function adequately, without packet loss.
However on a slower (e.g. 100 MHz) machine that's CPU-bound with applications that may increase the scheduling latency of io-pkt*, packet loss during receive can often result because the limited hardware buffer overflows.
You shouldn't use these NICs where you need high-performance data throughput. You should use them only for low-cost debug and diagnostic ports, which are often removed for production versions of boards.
If you're using NFS with one of these low-performance NICs, you can get a great improvement by specifying the -B4096 or even -B2048 option to fs-nfs. Qnet in QNX Neutrino 6.3 and later generally goes into “windowed mode” automatically with these NICs to try to avoid packet loss.
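For the NFS case, here's a minimal sketch of such a mount. The server name and export path are placeholders, fs-nfs3 is only one of the NFS clients you might be running, and the assumption is that the smaller transfer size produces shorter bursts of back-to-back packets for the NIC's limited on-chip buffer to absorb:

    # mount an NFS export using 4 KB transfers, to shorten the bursts of
    # back-to-back packets hitting the low-performance NIC
    fs-nfs3 -B4096 server_node:/export /mnt/nfs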
Common examples of high-performance NICs include:
These NICs are all at least 10/100 Mbit, and some are gigabit, but what makes them high-performance is their ability to function independently of the CPU and to use the large main memory for packet buffering, which the low-performance NICs, by design, can't.
There are two critical data-transfer interfaces to a high-performance NIC, both of which you must tune correctly to avoid packet loss under load: the bus-master DMA transfers between the NIC's on-chip FIFOs and the descriptor rings and packet buffers in main memory, and the servicing of those descriptor rings by the software (io-pkt*).
If the latency to schedule the NIC as bus master is excessive, the FIFO will drain for transmit or will overflow for receive. Either will cause a packet to be lost.
Excessive bus master scheduling latency used to be more of a problem in QNX 4, where other devices (e.g. disk) were programmed with excessive DMA burst length; they would “park” themselves on the bus. This doesn't appear to be as much of a problem in QNX Neutrino, but you should be aware that it can be a problem if you're suffering from mysterious packet loss. The nicinfo output can often give you a clue here.
For receive, this usually happens when a high-priority (e.g. greater than 21) thread runs READY and hogs the CPU for an extended period of time. This prevents io-pkt* from being scheduled, so the receive descriptor ring eventually fills up as packets arrive and the NIC bus-masters them into main memory; once the ring is full, newly arriving packets are dropped.
For transmit, this usually happens when there's an extremely large burst of transmit activity (e.g. on a server), possibly combined with some kind of backup or congestion (e.g. PAUSE frames), which fills up the transmit descriptor ring faster than the NIC can drain it onto the wire.
In this network driver patch, the drivers for the high-performance NICs are generally configured with a default of 64 transmit descriptors and 128 receive descriptors. You can change these using the transmit=XXXX and receive=XXXX command-line options to the drivers. Generally, the minimum allowed is 16 and the maximum is 2048. Because of the hardware design, stick to a power of 2, such as 16, 32, 64, 128, 256, 512, 1024, or 2048.
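For example, the options are passed on the stack's command line along with the driver. This is only a sketch: io-pkt-v4 stands in for whichever io-pkt* variant you run, abc100 is a hypothetical driver name, and you should check your driver's use message for the exact options it supports:

    # larger transmit ring for a server that sees big transmit bursts;
    # the receive ring stays at the driver's default
    io-pkt-v4 -d abc100 transmit=1024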
Transmit buffer descriptors are quite small, generally in the range of 8 to 64 bytes. So, the cost of increasing the transmit=XXXX value to, say, 1024 for a server (which sees large bursts of transmitted data) is quite small:
(1024 - 64) x 32 = 30,720 bytes
for a transmit descriptor of 32 bytes.
Receive buffer descriptors are similarly small; however, there's a catch. For each receive descriptor, the driver must allocate a 1,500-byte Ethernet packet buffer. Because the packet buffers must be aligned and aren't permitted to cross a 4 KB page boundary, io-pkt* in reality allocates a 2 KB buffer for each 1,500-byte Ethernet packet.
So, the cost of increasing the receive descriptor ring from the default 128 to 1024, with an almost insignificant 32-byte receive descriptor, is:
(1024 - 128) x (32 + 2048) = 1,863,680 bytes
or almost 2 Megabytes, which is nowhere near as much as the filesystem grabs by default for its cache, but still not an insignificant amount of memory for a memory-constrained embedded system.
For a memory-constrained system, you should carefully select the sizes of the transmit and receive descriptor rings so that they're minimum-sized, yet no packets are lost under load, given the scheduling latency of io-pkt* on your system.
Obviously, reducing the receive descriptor ring has more of an effect than reducing the transmit descriptor ring.
In an application where memory is of no concern, but maximum performance is, generally transmit and receive descriptor rings of 1024 or even 2048 are used.
Most of the time, bigger is better. There is, however, a potential catch: for some benchmarks, such as RFC2544 (fast forwarding), we've observed that excessively large descriptor rings decrease performance because of cache thrashing.
However, that's really getting out there. Most of the time, you simply need to configure the transmit and receive descriptor ring size to suit your application so that minimum memory is consumed, and no packets are lost.
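As a sketch of the two extremes, again using the hypothetical abc100 driver (the right numbers for your system can be found only by testing under your worst-case load):

    # memory-constrained system: small rings; watch nicinfo for drops
    io-pkt-v4 -d abc100 transmit=64,receive=64

    # maximum performance; memory isn't a concern
    io-pkt-v4 -d abc100 transmit=2048,receive=2048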
Almost all of the network drivers in this patch have been optimized for performance with respect to PHY probing.
Prior to this patch, network drivers would periodically (e.g. every two or three seconds) communicate via the MII to the PHY chip connected to the NIC, to determine the speed and duplex of the current media connection.
The problem is that, while the PHY is being probed, packet loss can occur. The drivers in this patch contain an optimization to not probe the PHY, as long as there have recently been some packets received. This gives maximum performance for most users.
However, there's a nasty scenario: the NIC is connected to a 100 Mbit full-duplex link. The cable is rapidly unplugged and immediately replugged into a 10 Mbit half-duplex hub that also carries a steady stream of received (e.g. broadcast) packets. Because of that steady stream of received packets, the network driver won't probe the PHY, and will still think it's on a 100 Mbit full-duplex link. This is a problem, because the NIC isn't listening before it transmits; it's still full-duplex on a half-duplex link. Excessive collisions and out-of-window collisions will result in packet loss.
However, if you leave the cable unplugged for three seconds before plugging it into another hub, the driver probes the PHY and relearns the media parameters and reprograms the NIC with the appropriate duplex.
If you need to rapidly unplug and replug the cable into network boxes with different duplexes, you should specify probe_phy=1 to the network driver, to force it to always periodically probe the PHY. Packet loss may result during this probing, but you will know that the driver is always in sync with the PHY with respect to the media.
For maximum performance, the default is probe_phy=0.
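If you do need the periodic probing, the command line might look like this (abc100 is again a hypothetical driver name):

    # always probe the PHY periodically, accepting possible packet loss
    # during each probe, so the driver tracks the current media
    io-pkt-v4 -d abc100 probe_phy=1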
Most (but not all) of the drivers in this patch support more than one Ethernet speed and duplex. The most common are 10 and 100 Mbit, and half and full duplex, though the newest NICs support 1000 Mbit (gigabit) Ethernet as well.
All of the drivers let you specify speed=XX and duplex=Z, where XX is 10, 100, or 1000, and Z is 0 (zero) for half-duplex or 1 (one) for full-duplex. Most 10 Mbit links are half-duplex (to old hubs or repeaters, which is the original Xerox blue-book Ethernet), and most 100 Mbit links are full-duplex (switches with point-to-point connections). However, for maximum confusion, you'll occasionally see 10 Mbit full-duplex and 100 Mbit half-duplex, but not very often.
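For example, to force the media settings with the hypothetical abc100 driver (do this only if you control both ends of the link, as discussed below):

    # force 100 Mbit, full-duplex
    io-pkt-v4 -d abc100 speed=100,duplex=1

    # force 10 Mbit, half-duplex (e.g. an old repeater hub)
    io-pkt-v4 -d abc100 speed=10,duplex=0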
If you don't specify the speed and duplex, the driver attempts to auto-negotiate the fastest possible speed and duplex, as defined by the IEEE specification. Most Ethernet hardware produced in the last few years complies with this specification, but there's some older hardware around that doesn't.
In the absence of auto-negotiation, the PHY can figure out the speed pretty easily. However, the duplex is another matter. If auto-negotiation isn't supported, the remote device is assumed to be older and thus half-duplex, not full-duplex.
The moral of the story is that 99% of the time you shouldn't specify the speed and duplex; auto-negotiation should figure them out for you. If you run the nicinfo utility, it tells you what the auto-negotiated speed and duplex are, and most of the time, it's correct.
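For example, to check the result on the first Ethernet interface (the interface name en0 is an assumption; yours may differ):

    # show the driver's idea of the current media, along with packet
    # and error counters
    nicinfo en0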
It's crucial that the devices at both ends of the link use the same speed and duplex; otherwise, heavy packet loss can occur (see above).
So, you should specify the speed and duplex to the network driver only if you have older, perhaps broken or nonstandard Ethernet hardware, and you manually control both ends of the Ethernet link (for example, a managed hub or a cross-over cable).