Kevin and I recently completed our PlanetLab project. This was basically a "Hello World" sort of task: we set up a slice of 130 nodes and had each node ping all of the others. Some prior familiarity with pssh made it fairly easy to set up the experiment. We generated a script that sequentially pinged each of the other nodes in the slice with "ping -c 10 -i .5 hostname" and then piped this script to pssh with the options "pssh -o output -e error -t 0 -h nodes.txt -l byu_cs660_1 -P -v -I -p 100 -O StrictHostKeyChecking=no". That looks like a lot of options, but it's not so bad when you consider all of the information it needed (where to store output and error files, which nodes to connect to, which user name to use, etc.). Anyway, pssh conveniently gave us one output file per node, which made the results easy to parse.
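For the curious, the setup can be sketched in a few lines. This is a minimal reconstruction, not our exact scripts: nodes.txt, the slice name, and the pssh flags come from above, while pingall.sh and the example hostnames are stand-ins.

```shell
# Stand-in node list; the real nodes.txt held 130 PlanetLab hostnames.
printf 'nodeA.example.org\nnodeB.example.org\n' > nodes.txt

# One "ping -c 10 -i .5 <host>" line per node, run sequentially on each machine.
awk '{ printf "ping -c 10 -i .5 %s\n", $0 }' nodes.txt > pingall.sh

# Fan the script out to every node via pssh (commented out here,
# since it needs pssh installed and live PlanetLab credentials):
# pssh -o output -e error -t 0 -h nodes.txt -l byu_cs660_1 \
#      -P -v -I -p 100 -O StrictHostKeyChecking=no < pingall.sh

cat pingall.sh
```

The `-I` flag is what lets pssh read the generated script from stdin and run it on every node.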
As usual, most of our time was spent analyzing and interpreting results. Availability on PlanetLab was surprisingly low. Five machines (4%) were completely down and never responded to even a single ping attempt. Nine additional machines (7%) responded to pings but never allowed us to log in. Even among the more cooperative 89% of nodes, packet loss was 38.3%, and about 5% of host pairs exhibited high RTT variance. Since many routers deprioritize or rate-limit ICMP traffic, I presume that UDP datagrams would have experienced less loss, but 38.3% is still significant.
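Since pssh drops one file per node into the output directory, aggregating loss is mostly a matter of summing the summary lines that ping prints. A toy sketch of the idea, with fabricated stand-in files rather than our actual measurements:

```shell
# Fake per-node files in the pssh layout (output/<host>); each real file
# held one "X packets transmitted, Y received, ..." summary per target.
mkdir -p output
printf '10 packets transmitted, 6 received, 40%% packet loss\n' > output/nodeA
printf '10 packets transmitted, 9 received, 10%% packet loss\n' > output/nodeB

# Overall loss = 1 - (total received / total transmitted), across all files.
awk '/packets transmitted/ { tx += $1; rx += $4 }
     END { printf "overall loss: %.1f%%\n", 100 * (tx - rx) / tx }' output/*
```

With the stand-in numbers above this prints an overall loss of 25.0%; run against the real output directory, the same one-liner gives the 38.3% figure discussed here.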
I don't suspect that PlanetLab is particularly unreliable. Rather, any experiment on a large number of machines across a best-effort network is bound to run into problems. The takeaway is that failures are inevitable at this scale, and systems should be designed to tolerate them.