When I started with this Atheros MIPS 11n stuff a few years ago, my first test board was a Routerstation Pro with a pair of AR9160 NICs. I could get ~150mbit/sec bridging performance out of it, and I thought I was doing pretty well.
Fast forward to now, and I’ve been bringing FreeBSD up on each of the subsequent boards. But the performance never improved. Now, I never bothered to look into it because I was always too busy with my day job, but finally someone trolled me correctly on the FreeBSD embedded IRC channel and I took a look.
It turns out that.. things could’ve benefitted from a lot of improvement.
First up – I’m glad George Neville-Neil brought up PMC (performance counters) on the MIPS24k platform. It made it easier for me to bring it up on the MIPS74k platform and it was absolutely instrumental in figuring out performance issues here. No, there’s no real ability to get DTrace up on these boards – some have 32MB of RAM. Heck, the packet filter (bpf) consumes most of a megabyte of RAM when you first start it up.
My initial tests are on an AP135 reference design board from Qualcomm Atheros. It’s a QCA9558 SoC with an AR8327 switch on board. Both on-chip ethernet ports (arge0, arge1) are available. I set it up as a straight bridge between arge0 and arge1 and then I used iperf between two laptops to measure performance.
The first test – 130mbit bridging performance. That’s terrible for this platform.
So I fired up hwpmc, and I found the first problem – packets were being copied in the receive and transmit path. Since I’m more familiar with the transmit path, I decided to look into that.
The AR7161 MAC requires both transmit and receive buffers to be DWORD (32 bit or 4 bytes) aligned. In addition, all transmit buffers in a frame save the last are required to be a multiple of DWORD in length. Plenty of frames don’t meet this requirement and end up being copied.
The AR7240 and later MAC relaxed this – transmit/receive buffers can now be byte-aligned. So that particular workaround can be removed. It still needs to do it for multi-descriptor transmits that aren’t DWORD sized (eg if you just prepend a fresh ethernet header) but that doesn’t happen in the bridging path in the normal case.
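To make the two sets of rules concrete, here is a minimal sketch of the "does this frame need to be copied before transmit?" decision. It is not the driver's actual code: `struct tx_seg`, `tx_needs_copy()` and the `strict_align` flag are hypothetical names made up for illustration, with `strict_align` standing in for "this is an AR7161-style MAC".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one buffer (segment) of an outgoing frame. */
struct tx_seg {
	uintptr_t addr;	/* start address of the buffer */
	size_t    len;	/* buffer length in bytes */
};

/*
 * Return true if the frame breaks the MAC's rules and has to be copied
 * into a fresh, well-formed buffer before it can be handed to the hardware.
 * strict_align is an assumed flag: true for the older MAC that needs
 * DWORD-aligned buffers, false for the later byte-aligned MACs.
 */
static bool
tx_needs_copy(const struct tx_seg *segs, size_t nsegs, bool strict_align)
{
	assert(segs != NULL || nsegs == 0);

	for (size_t i = 0; i < nsegs; i++) {
		/* Older MAC: every buffer must start on a 4-byte boundary. */
		if (strict_align && (segs[i].addr & 3) != 0)
			return (true);
		/* Every buffer except the last must be a multiple of 4 bytes long. */
		if (i + 1 < nsegs && (segs[i].len & 3) != 0)
			return (true);
	}
	return (false);
}
```

With the relaxed rules, a single-buffer frame never needs copying no matter where it starts, which is the common case when bridging.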
Fixing that got bridging performance from 130mbit to 180mbit. That’s not a huge difference, but it’s something.
Next up is the receive path. This was more .. complicated. The receive code copies the whole buffer back two bytes in order to ensure that the IP payload presented to the FreeBSD network stack is aligned. This is a problem in FreeBSD’s network stack – it assumes the hardware handles unaligned accesses fine. So if your receive engine is DWORD aligned, the 14 byte ethernet header will result in the start of the IP payload being non-DWORD aligned, and .. the stack blows up. Now, I have vague plans to start fixing that as a general rule, but I did the next worst hack – I grabbed a buffer and set its RX start point to two bytes in, so the ethernet header is unaligned but the IP header is aligned. Now, the ethernet stack in FreeBSD handles unaligned stuff correctly, so that works.
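The arithmetic behind the two-byte trick is easy to check. This is just a sketch of the offsets, not driver code; `RX_PAD` and `ip_hdr_addr()` are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHER_HDR_LEN	14	/* dst MAC (6) + src MAC (6) + ethertype (2) */
#define RX_PAD		2	/* hypothetical: bytes skipped at the start of the RX buffer */

/*
 * Given the DWORD-aligned start of a receive buffer, return the address
 * the IP header lands on when the frame is received `pad` bytes in.
 */
static uintptr_t
ip_hdr_addr(uintptr_t buf_start, size_t pad)
{
	assert((buf_start & 3) == 0);	/* the buffer itself starts aligned */
	return (buf_start + pad + ETHER_HDR_LEN);
}
```

With no padding the IP header sits at offset 14, which is 2 mod 4. With the two-byte pad it sits at offset 16, which is DWORD aligned, at the cost of the ethernet header now starting at offset 2.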
Except it wasn’t faster. It turns out that the MIPS busdma code was doing very inefficient things with mbuf handling if everything wasn’t completely aligned. Ian Lepore (who does ARM work) recently fixed this for armv6, so he ported it to MIPS and I added it.
The result? Bridging performance leaped from 180mbit to 420mbit. Quite nice, but not where Linux was.
I left it for a few days, and someone on the freebsd-mips mailing list pointed out big stability issues with his tests. I started looking at the Linux OpenWRT driver and the MIPS24K/MIPS74K memory coherency operations. I found a couple of interesting things:
- The busdma sync code never did a “SYNC” operation if things weren’t being copied or invalidated; and
- I was using cache-writethrough instead of cache-writeback for the cached memory attribute for MIPS74K.
Source: Adrian Chadd Blog