| Watchdog: Network Computing Review |
|---|
| Web Polygraph |
In May 1999, Network Computing magazine published a review called "Speedy Performance, Rock-Bottom Price Put Squid Freeware on Top". Four caching products were tested: Cobalt Networks' CacheRaQ 2, InfoLibria's DynaCache, NLANR's Squid 2.0 and Novell's Internet Caching System (ICS). Web Polygraph was used for performance evaluation.
At the time of writing, Network Computing had not provided us with the workload parameters. Needless to say, publishing performance numbers without specifying the workload is a benchmarking crime. We hope that Network Computing will change their position.
What do we know about the workload? We know that Polygraph was configured to generate a ``Best Effort'' request stream. That is, each simulated ``robot'' submits a new request as soon as the reply to the previous request is received. The number of robots used (the major parameter of a best effort workload) is unknown to us. We know that all replies were cachable, and that the offered hit ratio was set at 50%.
The results of the performance tests shocked us. While Squid performance was within reasonable limits, the reported performance of the other three products was quite different from what one would expect:
| Cache | Network Computing (requests per second) | Common Sense (requests per second) |
|---|---|---|
| CacheRaQ | 5 | 70 |
| DynaCache | 16 | 260 |
| Squid | 71 | 70 |
| ICS | 55 | 300 |
The ``Common Sense'' column depicts expected minimal throughput (i.e., lower bound) based on the first cache-off results, reported hit ratios, and general knowledge about the products.
The hit ratio of all products was measured at about 50%, matching the ratio offered by Polygraph. Response time details are not available.
At least three of the four products tested performed much worse than one would expect. We do not have the workload parameters and environment specifics needed to guarantee that our analysis is 100% correct. However, here is our best shot at explaining what went wrong.
A best effort workload has certain properties that usually make it unsuitable for ``general purpose'' tests. One of its major deficiencies is the tight coupling of throughput and response time. In general, these two characteristics are independent, or orthogonal. When the best effort workload is used, a robot waits for a reply before submitting the next request. Thus, the throughput of each robot is simply the inverse of response time. For example, a response time of 200msec corresponds to a throughput of 5req/sec per robot. Consequently, even if a cache is capable of handling 500 requests per second, a workload with 10 robots and a 200msec response time would show only 50req/sec.
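The coupling can be checked with a few lines of arithmetic. The sketch below is an illustration only and is not part of Polygraph: the function name `best_effort_throughput` and its parameters are hypothetical, and it assumes every robot sees the same mean response time.

```python
def best_effort_throughput(robots: int, response_time_sec: float,
                           capacity_rps: float = float("inf")) -> float:
    """Upper bound on measured req/sec under a best effort workload.

    Assumption: each robot keeps at most one request outstanding, so it
    completes 1 / response_time requests per second. The cache capacity
    (hypothetical parameter) caps the total.
    """
    if robots < 0 or response_time_sec <= 0:
        raise ValueError("robots must be >= 0 and response time must be > 0")
    return min(robots / response_time_sec, capacity_rps)


# The numbers from the text: 200 msec response time, 10 robots,
# and a cache that could handle 500 req/sec.
per_robot = best_effort_throughput(1, 0.200)
ten_robots = best_effort_throughput(10, 0.200, capacity_rps=500)

print(f"one robot:  {per_robot:.0f} req/sec")
print(f"ten robots: {ten_robots:.0f} req/sec (cache capacity: 500 req/sec)")
```

Under these assumptions the measured number says more about the robot count and the response time than about the capacity of the cache.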
``How did Squid manage to perform so well regardless of the workload?'', you may ask. We speculate that the key difference was in the test setup (again, mostly unknown to us). Squid was installed on a Linux box by the Network Computing team. The other products came pre-installed on their own hardware. Let's assume that the installations other than Squid added a small delay to HTTP transactions. If Squid's mean response time was, say, 50msec while DynaCache's response time was 100msec, the seemingly negligible difference of 50msec would halve InfoLibria's throughput! In real life (or if an appropriate workload were used), those extra 50msec would have little or no effect on throughput. With a best effort workload, the impact of delays is crucial.
An example of an artificial delay that can be introduced by the benchmarking environment is the TCP option known as ``delayed acks''. The option results in an additional ~200msec delay per TCP connection. It is possible that the Linux box running Squid did not use that option and the other boxes did. Note that it does not have to be this particular TCP option; it could be any difference in the setup, however negligible it may seem.
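To see why the same 50msec matters so much in one case and so little in the other, compare a best effort workload with a workload that submits requests at a fixed rate regardless of replies. The sketch below is illustrative only: the robot count and the offered rate are made-up values (the real ones are unknown to us), the function names are hypothetical, and the fixed-rate case assumes the cache has enough capacity to keep up with the offered load. Under that assumption, Little's law says a longer response time only raises the number of requests in flight.

```python
def best_effort_rate(robots: int, response_time_sec: float) -> float:
    """Throughput when each robot waits for a reply before the next request."""
    return robots / response_time_sec


def in_flight_at_fixed_rate(offered_rps: float, response_time_sec: float) -> float:
    """Little's law: mean requests in flight = arrival rate * response time."""
    return offered_rps * response_time_sec


ROBOTS = 10          # hypothetical; the actual number is unknown
OFFERED_RPS = 100.0  # hypothetical fixed request rate

for label, rt in [("50 msec", 0.050), ("100 msec", 0.100)]:
    be = best_effort_rate(ROBOTS, rt)
    conc = in_flight_at_fixed_rate(OFFERED_RPS, rt)
    print(f"response time {label}: "
          f"best effort {be:.0f} req/sec; "
          f"fixed rate {OFFERED_RPS:.0f} req/sec with {conc:.0f} requests in flight")
```

With these made-up numbers, doubling the response time halves the best effort throughput, while the fixed-rate throughput stays at the offered rate and only the concurrency doubles.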
The faulty performance analysis could have been prevented if the magazine had consulted benchmarking experts before running the tests or, at least, had used a workload and setup that were tested before.
We strongly recommend that all performance results be accompanied by Polygraph logs and setup details. Otherwise, the results lose any meaning and are impossible to verify. The reader should probably ignore any performance claims not supported by logs and setup details.
We think that the measurements reported in the Network Computing review are not related to the products' actual performance. The tests used an inappropriate workload. The results are highly questionable and very difficult to explain, especially without knowing all the workload parameters.
Since the performance of the caches contributed 10% toward the overall rating and the overall ratings of the caches differ by less than 10%, the ratings are as questionable as the performance measurements.
After this Watchdog page was published on the Web, we had a good response from the caching community. People suggested various reasons for the strange performance numbers measured in the Network Computing lab. Network Computing also released the command line parameters used for the tests. The parameters and some performance data released by Network Computing are available here.
In short, our guesses about the workload proved to be correct. Unfortunately, the actual reason for extra network delays that corrupted the results remains unknown.
It is possible that a known Linux TCP stack bug (found in the Linux 2.2.6 kernel used for the tests; mentioned here) is to blame. Justin Pietsch (pietsch@cac.washington.edu) told us that the bug introduces 200msec network delays but does not show up if both machines are running Linux 2.2. The latter would explain the good results for the Squid and no-proxy experiments, the only two cases in which all the boxes in the test were running Linux 2.2. We cannot verify that this is indeed what happened, and we will be happy to update this information if an internal Network Computing investigation sheds some light on the problem.
On June 28th, Network Computing posted the following ``correction'' on their ``Letters'' page:
In our recent evaluation of Web-caching products ("Speedy Performance, Rock-Bottom Price Put Squid Freeware on Top," May 31, 1999, page 84), we used the Web Polygraph benchmarks for our performance testing, but we chose a scenario that was not a perfect model of real-life Web caching, and we used a best-effort workload with the benchmarks, which was also not ideal.
We regret that our performance numbers were not as realistic as we would have hoped. Performance criteria comprised only 10 percent of the total, so this did not weigh heavily in the overall score of the products tested--although we realize that many of you might rate performance higher. We will be retesting these products, and others, in future issues.
As one of the people affected by the review said after reading the correction, ``It's amazing what these magazines can get away with after truly botching a review''.
$Id: netcomp.sml,v 1.7 2001/05/29 19:34:48 wessels Exp $