Last week Pia asked me to help her out with her yet-to-be-named Australian OLPC deployment. The deployment involves two remote sites connected by an ADSL WAN, and one of the key applications across this WAN is the VideoChat activity.
The children at the sites were experiencing audio blips and video artefacts, a sure sign of some sort of network-related packet loss. With Pia at one site and me at the other, we did some testing to try to rule out the WAN itself as the problem and determine what the issue was.
It quickly became obvious that the WAN wasn’t at fault. We set up some pings with an interval of 1/10 of a second from the XOs to their respective default gateways and between the default gateways themselves. Pia and I then started counting out loud over the video chat, which got us a couple of strange looks from the children playing around us :). During the audio blips there was no loss across the WAN, but there was loss to the default gateways.
Now here comes the interesting part: the packet loss to the default gateways seemed to be synchronised. Remember, these are totally independent wireless networks sitting a couple of hundred kilometres apart. At this stage I was cooking up crazy theories about difficult-to-decode/encode video packets hitting both XOs at the same time, but I was fairly dubious.
We did a little testing on XOs at the same site and while the problem didn’t seem to manifest in as obvious a manner it was still there (I think the latency involved across the WAN exacerbated the symptoms).
Back at home I did some further testing for a few days, trying all manner of different loads and writing various scripts to watch tcpdump output. To cut a long story short, while glancing at the XO during packet loss I eventually noticed the antenna light was flashing, which would indicate the XO is disassociating from the network.
A few minutes later I was able to verify that wireless scans were causing the problem, and that it is easily reproducible by running:
ping -i 0.1 GATEWAY_IP & iwlist eth0 scan
You should notice a drop of about 4 packets.
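If you would rather count the dropped packets than eyeball them, here is a rough sketch of a helper that lists the sequence numbers missing from ping output. The function name missing_seqs is my own invention for illustration, and it assumes the common iputils line format containing "icmp_seq=N" (some ping versions label the field differently, in which case the pattern needs adjusting).

```shell
#!/bin/sh
# Hypothetical helper: print every icmp_seq number that is missing
# from ping output read on stdin. Assumes lines of the form
# "64 bytes from 10.0.0.1: icmp_seq=N ttl=64 time=1.2 ms".
missing_seqs() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; keep only the digits
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            # a jump of more than 1 means the replies in between were lost
            if (seen && seq > last + 1)
                for (i = last + 1; i < seq; i++) print i
            last = seq
            seen = 1
        }
    '
}
```

Something like "ping -c 200 -i 0.1 GATEWAY_IP | missing_seqs", with the scan kicked off from another terminal, should then print the sequence numbers lost during the scan. Note that it only sees gaps between replies, so packets lost at the very start or end of the run are not reported.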
I’ve filed the bug on the OLPC bug tracker.
A temporary workaround is to get Network Manager to stop performing scans, although I assume this means the network view probably won’t get updated. You can do this using wpa_cli:
wpa_cli ap_scan 0