Data Communications magazine was writing a feature story on Web caches and hired Polyteam to design and execute the performance tests. DataComm invited CacheFlow, Cisco, Cobalt, Entera, Eolian, Falcon Systems (now Tasmania Network Systems), IBM, InfoLibria, Inktomi, Microsoft, Netscape, Network Appliance, Novell, and Packetstorm to participate in the tests. Novell also asked for Dell to be present as an independent OEM running Novell's ICS proxy software. IBM decided to withdraw before the tests began. Entera simply did not show up for their test slot.
In June-July 1999, we benchmarked the remaining seven caching products (see the ``Executive Summary'' table below). The caches were tested in our lab using Web Polygraph and the DataComm-1 workload.
As always, we do not give purchasing advice. We provide performance measurements and expect the reader to use this information, along with other important factors, in their decision making process.
In developing our Web cache benchmark, we try to simulate ``real world conditions''. This is a difficult task. Many aspects of real Web traffic are not well understood. Some characteristics (such as network delays) vary greatly among different locations. Agreeing on a typical or representative value is difficult. Finally, we are somewhat constrained by the laboratory environment. The workload used for these tests is a good approximation to real Web traffic. No more, no less.
The workload models the following attributes of Web traffic. Most of the parameters were taken from measurements and analysis of production Web caches.
- Request Submission:
- Polygraph clients use a Poisson request submission model. The participant specifies the mean request rate. This parameter varies for each participant because each product has a different maximum throughput.
- Server-side Delays:
- Polygraph servers introduce an artificial delay for each request. The delays mimic traffic patterns on the Internet. Real Web requests experience delays due to factors such as packet loss and loaded origin servers. For the cache, these delays tie up resources (socket buffers, TCP ports, file descriptors) and may have an effect on performance. Also, because there are no artificial delays for cache hits, we can observe the cache hit ratio in the response time result. For these tests the server-side delays are taken from a normal distribution with a 3.0 second mean and 1.5 second standard deviation.
- Response Sizes:
- Web responses vary widely in size. Many icons and images are very small, while audio and video files can be very large. We use an exponential distribution of response sizes with a mean of 13 kbytes.
- Popularity:
- Popularity refers to the fact that some Web objects are requested more often than others. It is one of the most difficult things to model for the benchmark. See the memory hit ratio footnote for some of the model caveats. The complexities of the popularity model are outside the scope of this article. Please see the Web Polygraph documentation for details.
- Hit Ratio:
- Polygraph allows us to specify the document hit ratio that an ideal cache (of infinite size) would have. We use 55% for these tests.
- Life-cycle Model:
- Life-cycle refers to the way that Web objects change over time. These changes are reflected in the Last-Modified and Expires headers found in most server responses. For these tests we use a very simple approach. All responses are "Last-Modified" one year ago, and all expire one year from the start of the run.
- Persistent Connections:
- We use HTTP/1.1 persistent connections for these tests. Polygraph clients may send anywhere from 1-64 requests per connection, and Polygraph servers send anywhere from 1-16 responses per connection.
- Uncachable Responses:
- Real caches handle some amount of uncachable responses, for example due to dynamic content. For these tests 20% of all responses were uncachable.
- A single test lasts for four hours. Most caching products exhibit performance degradation over time due to disk layout policies. Also, if the cache can hold less than four hours' worth of cachable traffic, it will experience a decrease in hit ratio.
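The distributions listed above can be sketched in a few lines of Python. This is only an illustration of the stated parameters, not Polygraph's actual implementation: the function names, the clamping of negative delays to zero, and the reading of 13 kbytes as 13 x 1024 bytes are all assumptions made for the example.

```python
import random

# Parameters taken from the workload description above.
MEAN_DELAY_S = 3.0           # mean server-side delay
STDDEV_DELAY_S = 1.5         # standard deviation of server-side delay
MEAN_SIZE_BYTES = 13 * 1024  # 13 kbytes (1024-byte kbytes assumed)
UNCACHABLE_FRACTION = 0.20   # share of uncachable responses


def next_interarrival(rng, mean_rate):
    """Gap (seconds) before the next request of a Poisson stream."""
    return rng.expovariate(mean_rate)


def server_delay(rng):
    """Normally distributed delay; negative draws are clamped to 0 (assumption)."""
    return max(0.0, rng.gauss(MEAN_DELAY_S, STDDEV_DELAY_S))


def response_size(rng):
    """Exponentially distributed response size in bytes, at least 1 byte."""
    return max(1, round(rng.expovariate(1.0 / MEAN_SIZE_BYTES)))


def is_cachable(rng):
    """True for roughly 80% of responses."""
    return rng.random() >= UNCACHABLE_FRACTION


def sample_requests(n, mean_rate, seed=0):
    """Return n tuples of (submit time, server delay, size, cachable)."""
    rng = random.Random(seed)
    now = 0.0
    requests = []
    for _ in range(n):
        now += next_interarrival(rng, mean_rate)
        requests.append((now, server_delay(rng), response_size(rng), is_cachable(rng)))
    return requests
```

Calling `sample_requests(1000, mean_rate=100.0)` yields a synthetic request stream at a mean of 100 req/sec. Hit ratio, popularity, and persistent connections are deliberately left out of the sketch.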
It is important to note that, due to several fundamental differences, the results obtained on the DataComm-1 workload should not be compared with the results on the PolyMix-1 workload used during the first bake-off.
- Hardware and Operating System:
- Polygraph machines run on Pentium II 400 MHz computers with 128 MB RAM and a single fast ethernet interface. For the operating system, we use FreeBSD, version 3.1-RELEASE. The FreeBSD kernel is tuned to support high TCP connection rates. Please see PolyLab Web pages for details.
- Polygraph clients and servers are bound together in pairs. Each client sends requests to one server. Each pair is capable of generating up to 700 requests per second.
- Networking Equipment:
- Participants are also required to supply switches (and/or routers) to connect our Polygraph clients and servers to their cache.
We are aware of several shortcomings of the workload and environment. Many of the parameters were also used for our first Bake-Off, but some features (such as persistent connections) are new to this test. We are constantly working to improve the methodology and welcome your input and feedback.
The ``Executive Summary'' table below summarizes the performance results. Complete Polygraph logs are also available for those who want to reproduce the results or extract measurements not presented here.
The ``Total Price'' column is a sum of the list price of cache(s) and the cost of network gear in the test setup. In our price/performance analysis, we use total price rather than list price of a cache to adjust for expensive equipment that is sometimes required to achieve high performance numbers and/or aggregate individual caches into clusters. If the reader already has the required network components in place, the price/performance ratios should be adjusted.
The Throughput column shows the highest tested request rate for each Box. Tested products cover a wide spectrum of request rates. What is probably more interesting is that other performance figures, such as response time and hit ratio, vary a lot as well. Hence it is impossible to select a single ``winner'' of the test without assigning weights to various performance factors. Various non-performance features add complexity to the decision making process.
Mean response time is reported separately for hits and misses to show the proxy savings and overheads on the two most important request paths. The ``All'' column depicts response time for all request classes.
The Price/Performance column shows throughput (in req/sec) normalized by Box price (in thousands of dollars). In other words, ``how much throughput can one thousand dollars buy?'' To be precise, the column should have been named ``performance/price'', but we decided to use the traditional wording.
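The normalization is simple arithmetic and can be sketched as follows. The function name is hypothetical, and the figures in the usage note are taken from other parts of this report for illustration only; the ``Executive Summary'' table remains the authoritative source.

```python
def throughput_per_kilodollar(req_per_sec, total_price_usd):
    """Throughput (req/sec) bought by each thousand dollars of total price."""
    if total_price_usd <= 0:
        raise ValueError("total price must be positive")
    return req_per_sec / (total_price_usd / 1000.0)
```

For example, combining the 6000 req/sec quoted in Novell's comments with the 293,967 total price from the configuration table gives about 20.4 req/sec per thousand dollars, while 2000 req/sec at a total price of 191,990 gives about 10.4.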
The ``Persistent Connections'' column shows whether a box supports HTTP/1.1 persistent connections on the server and client sides of a proxy. Persistent connections are a new workload feature introduced after the First IRCache bake-off and are perceived by many as an important optimization. Web Polygraph clients and servers use HTTP/1.0, but implement persistent connections according to HTTP/1.1 specs. Please see the persistent connections footnote for caveats in interpreting this data.
All runs finished with less than 0.1% of failed transactions. Note that the rules disqualify a run with more than 3.0% of errors.
Participants had an opportunity to test two data points, and we selected only one for the baseline presentation. See the ``Other runs'' section for details.
This section gives a detailed analysis of major performance measurements. Before we proceed it is important to note that response time and hit ratio measurements discussed here are averages. Averages are meaningful in situations where performance does not change with time, or when changes are smooth and predictable. Cache design and characteristics of the workload (e.g., variable memory hit ratio) led to interesting performance trends over the four hour duration of a test. Chart descriptions below point out exceptions where average values are poor representations of actual performance. The ``Raw Traces'' section gives more details about run-time trends in cache performance.
Presenting throughput results is a tedious task. Due to tremendous differences in request rates, a simple graph with request rates from the ``Executive Summary'' table requires logarithmic scale and is virtually unusable. Moreover, comparing throughput of a $300K three-head cluster with a small $3K PC is usually not interesting.
After trying various presentation forms, we have decided to normalize the throughput results by the price of a Box. The price is a universal, albeit not perfect, measurement of Box complexity and ability. The normalized graph not only provides a fair comparison but answers an important question: ``How much throughput can one thousand dollars buy?''
Throughput-wise, NetHawk shows the best return on a dollar, beating competitors by a wide margin. Dell finishes second. The Novell and NetCache entries show similar return on investment, and so do CacheFlow and Cobalt. Interestingly, the two boxes running ICS software show quite different results (whether the sheer performance or the pricing models of the two OEMs differ is hard to say).
Clearly, the ``Normalized Throughput'' chart does not necessarily suggest that, to support a given request rate, it is better to buy N NetHawk caches than M CacheFlow boxes. The reader must take into account the price of integrating those boxes together and various other performance factors and non-performance features. The chart is however a good place to start your price/performance analysis.
Hit ratio is a standard measurement of cache performance. The DataComm-1 workload offers a hit ratio of 55%, so a cache cannot achieve a higher hit ratio. However, due to various overload conditions, insufficient disk space, deficiencies of the object replacement policy, and other reasons, the actual or measured cache hit ratio may be smaller than the offered 55%.
The ``Document Hit Ratio'' graph shows how well each cache maintains its hit ratio under the highest load. All boxes except CacheFlow were able to maintain close-to-ideal hit ratios. CacheFlow demonstrated a 50% hit ratio due to the behavior of the Foundry Ethernet switch. Under a load of 2000 req/sec the switch was not forwarding at least 2% of client requests to the caches (as measured by Polygraph servers), resulting in a lower actual hit ratio. We can only speculate that if those 2-4% of requests had reached the caches, the hit ratio would have increased without overloading the box.
Byte hit ratio was close to document hit ratio because Polygraph object size distribution is symmetric with respect to the popularity of objects. In other words, popular objects have the same size distribution as unpopular ones.
It is interesting to note that no participant chose to sacrifice some hit ratio in order to increase throughput. During the First IRCache Bake-off, InfoLibria demonstrated such a trade-off to the surprise of other participants. After the bake-off, some vendors considered enabling similar algorithms in their boxes. DataComm tests allowed each participant to show two data points so demonstrating a trade-off was quite feasible. Apparently DataComm boxes did not have a trade-off ability, or participants simply decided not to demonstrate it.
The Mean Response Time graph (below) shows response times under peak load. To simulate real-world conditions, the DataComm-1 workload introduces an artificial delay on the server side. The delays are normally distributed with a 3 sec mean and 1.5 sec standard deviation. They play a crucial role in creating a reasonable number of concurrent ``sessions'' in the cache. The delays, along with the 55% hit ratio, also affect transaction response time. The ideal response time for this test would be 1.35 sec (45% of the 3 sec mean), corresponding to zero-delay hits and 3 sec delay misses. An increase of response times above ~1.6 sec indicates loss of hit ratio and/or increased delays introduced by a cache.
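The ideal figure is a weighted average of hit and miss times, which the following sketch reproduces. The function and parameter names are hypothetical. Following the text, the model assumes that hits cost nothing and that misses cost exactly the mean server-side delay.

```python
def ideal_mean_response_time(hit_ratio, mean_miss_delay_s, hit_time_s=0.0):
    """Mean response time when hits take hit_time_s and misses take the server delay."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return hit_ratio * hit_time_s + (1.0 - hit_ratio) * mean_miss_delay_s
```

With `hit_ratio=0.55` and `mean_miss_delay_s=3.0` the function returns 1.35 sec, up to floating-point rounding. Plugging in a lower hit ratio or a non-zero `hit_time_s` shows how either effect pushes the mean above that floor.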
Differences in response times are more dramatic than variations in hit ratio. When artificial server-side delays were first introduced (PolyMix-1 workload), we expected response times to mimic hit ratio patterns. However, we soon realized that the two metrics are related but not equal. Indeed, a fast cache with relatively poor hit ratio may still have better response times than a slow box with an ideal hit ratio.
Dell, Eolian, and Novell have close-to-optimal response times below 1.5 sec. All three boxes started with response times around 1.4 sec and showed less than 0.1 sec overall increase in response time during the experiment. However, Eolian's mean performance suffered from a five minute response time spike at the beginning of the measurement phase (mean response time was about 3.6 sec during those five minutes).
CacheFlow and NetHawk show similar response times (around 2 sec) despite noticeable differences in hit ratios. The hit ratio alone does not determine the responsiveness of a box. Both participants blamed TCP stack bugs or misconfigurations in their boxes for the suboptimal response times. CacheFlow also suspected the L4 switch of adding noticeable delays under high load. The response times for CacheFlow and NetHawk were relatively stable throughout the measurement phase of the test.
NetCache response time is just above 1.6 sec. The response time trace shows that the NetCache box was able to maintain close-to-ideal response time for the first 2 hours of the test. During the last 2 hours the response time gradually increased up to 2.1 sec. The vendor attributes the decline in performance to the low memory hit ratio in the last half of the test, caused by the increasing Polygraph URL set. We also note that document hit ratio went down to 50% in the last 30 minutes. The latter is harder to explain with low memory hit ratio alone.
Cobalt shows the worst mean response time of 2.7 sec, close to the 3 sec mean of a no-proxy case. The run trace shows that response times of the Cobalt box grew steadily from 2.1 sec in the first 30 minutes to about 3.4 sec in the last 30 minutes of the test. The hit ratio, on the other hand, was stable during the test. Apparently, the box was slightly overloaded and slowly degraded with time.
The graphs below show document hit ratio and response time traces averaged at 5 minute intervals. A line on a trace corresponds to a Polygraph log collected on one Polygraph client. The number of clients per box differed depending on the number of ports in the Ethernet switch used to connect clients to the caches. For each Box, we have selected exactly two clients (marked A and B). If a box included more than one cache, the selected clients belong to different caches (L4 switches were configured based on IP addresses).
We show two clients to double check that there are no strange anomalies connected to load generation or balancing among client-server pairs. Most boxes show no noticeable difference in traces from different clients. However, the performance of individual components in CacheFlow and Novell clusters suggests that either the caches were not performing identically or the connecting L4 switch dropped/bypassed more sessions for one cache than for the other.
At first glance, some traces may look ``smoother'' than others. The visual difference is however deceiving. High throughput tests produced more samples per 5 minute interval than low throughput tests, leading to smoother curves. Hence the smooth appearance has nothing to do with the actual box performance. The traces illustrate the overall performance pattern or trend very well, and minor variations in measurements should be ignored as ``noise''.
All traces are plotted with the same scale to ease the comparison. Both warm-up (first 24 minutes) and measurement phases are shown.
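The 5 minute averaging used for the traces can be sketched as below. This is an illustration only, not the script used to produce the graphs; the function name and the `(timestamp, value)` input format are assumptions.

```python
from collections import defaultdict


def bucket_means(samples, bucket_s=300):
    """Average (timestamp_seconds, value) samples over fixed-length intervals.

    Returns a dict mapping interval index (0 = first interval) to the mean
    of the values that fall into that interval.
    """
    totals = defaultdict(lambda: [0.0, 0])  # interval -> [sum, count]
    for timestamp, value in samples:
        interval = int(timestamp // bucket_s)
        totals[interval][0] += value
        totals[interval][1] += 1
    return {i: total / count for i, (total, count) in sorted(totals.items())}
```

An interval holding many samples averages out more of the per-request variation than one holding few, which is why high throughput runs produce visually smoother lines.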
Each participant had an opportunity to test their boxes with two request rates. All participants except CacheFlow used their chance to achieve higher throughput rates. After all tests were completed, vendors were allowed to select the ``best'' data point for the baseline presentation. Only NetCache selected the higher request rate point. The choice was clear for all the vendors. With the exception of NetCache, the second test either failed or produced worse results.
[Table: runs not selected for the baseline presentation, including mean response time]
The table above summarizes the results not selected for the baseline presentation. Polygraph logs contain all successful runs. We note that a failure on the second data point (relative or absolute) should not be perceived as something bad. The vendors simply used their cannot-lose chance to get slightly better numbers.
Here are the configuration details for tested products.
|Vendor||Product and version tested||Total price||Unit list price||Units||RAM (GB)||Disk (GB)||Disks||Network gear||Interfaces (Mbps x n)||OS|
|CacheFlow||CacheFlow 5000/2.1||191,990||79,995||2||8.0||237||14||Foundry ServerIron and FastIron||100 x 6||CacheOS|
|Cobalt Networks||Cacheraq 2/ Update 2.0||2,599||2,349||1||0.1||2.5||1||Netgear FS108||100 x 1||Linux|
|Dell Computer||Novell ICS Powered by Dell/1.0||69,820||34,999||1||2.0||146||17||Packet Engine||1000 x 2||proprietary|
|Eolian||InfoStorm RA2/ Release 3.0||9,200||9,000||1||0.5||12||1||Netgear FS108||100 x 1||Linux|
|Network Appliance||Netcache C720s/3.4||23,125||22,875||1||0.5||27||4||Netgear FS108||100 x 1||proprietary|
|Novell||Novell ICS/1.0||293,967||65,999||3||12.0||324||39||see below||1000 x 6||proprietary|
|Tasmania Network Systems||Nethawk/1.0||8,300||n/a||1||0.5||45||6||Netgear FS516||100 x 1||FreeBSD|
All numbers are totals for the tested configuration unless noted otherwise.
Dell used a Packet Engine PowerRail 5200 with 2 Gbit modules and 2 Fast Ethernet modules. The Novell cluster used the following Ethernet switches: two Alteon 180, one Packet Engine PowerRail 5200 with 2 Gbit modules and 4 Fast Ethernet modules, and one Packet Engine PowerRail 1000 routing switch.
All vendors brought single CPU units, with the exception of Novell, which had two CPUs per unit.
Two participants used a layer 4 switch in their configurations. L4 switches distribute the load of a set of clients across a set of proxy servers. This is usually done by hashing the remote server's IP address, although there are alternative methods such as round robin. Non-IP-based methods are often suboptimal as they may reduce the hit ratio of the caches.
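The difference between the two balancing methods can be illustrated with a small sketch. The CRC32 hash, the class and function names, and the cache labels are assumptions chosen for the example; they do not describe the algorithm of any switch used in these tests.

```python
import ipaddress
import zlib


def pick_cache_by_dest_ip(dest_ip, caches):
    """Map a destination (origin server) IP address to one cache, deterministically."""
    key = ipaddress.ip_address(dest_ip).packed
    return caches[zlib.crc32(key) % len(caches)]


class RoundRobin:
    """Hand out caches in turn, ignoring the destination address."""

    def __init__(self, caches):
        self.caches = list(caches)
        self.next_index = 0

    def pick(self, dest_ip=None):
        cache = self.caches[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.caches)
        return cache


def count_first_fetches(requests, pick):
    """Count distinct (cache, url) pairs, i.e. fetches that cannot be cache hits."""
    seen = set()
    for dest_ip, url in requests:
        seen.add((pick(dest_ip), url))
    return len(seen)
```

With destination hashing, every request for a given origin server reaches the same cache, so each object is fetched from the origin once. With round robin, repeated requests for one object are spread over all caches and each cache must fetch its own copy, which is the hit ratio loss mentioned above.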
To prevent vendors from demonstrating virtually unlimited artificial scale, Polygraph clients (in a non-transparent mode) submit requests to a single proxy IP address. Consequently, vendors with more than one cache unit per Box must utilize some form of load balancing and redirection like L4 switching.
The DataComm-1 workload, with its one-to-one client-server bindings and small number of well-known destination addresses, makes configuring L4 switches easy. However, even in this simplified environment, we encountered a number of difficulties related to both performance and functionality.
For the Novell tests, an Alteon 180 L4 switch was used to distribute the load across three machines. Two main problems were encountered in this setup. The first was the lack of knowledge or documentation about the hashing algorithm in the Alteon switch. We used a trial-and-error approach to determine where the traffic from a given client/server pair would be redirected. This problem may not exist in the real world, where the large number of destination addresses would make balancing automatic. However, we believe that knowledge of the hashing algorithm may still be required for troubleshooting and fine tuning.
The second problem with Alteon was individual packets being passed through rather than being redirected. Packets which were not redirected could not (in Novell's setup) reach the servers. This caused a huge number of difficulties because only part of an HTTP connection was redirected, leaving a large number of TCP sessions on client and proxy waiting for more data. To correct this, each client was configured with the L4 switch as the default route, and then static routes were entered on the switch to direct the correct clients to the correct proxy. Note that this workaround would not have been possible in a real world environment, or if the DataComm-1 workload did not have a one-to-one client-server mapping. Interestingly enough, the redirection problem was not related to the amount of traffic coming through the switch, but rather to the number of different HTTP sessions outstanding on the switch. With a small number of permanent connections each making a large number of requests, the problem did not present itself. It only became apparent when there were a large number of HTTP connections.
CacheFlow front-ended their two machines with a Foundry ServerIron L4 switch. Their setup allowed traffic which was not redirected to be sent directly to the servers, effectively eliminating the kind of problems we experienced with the Novell setup. However, the Foundry switch did fail to redirect 2-5% of traffic at 2000 req/sec throughput. While it probably affected CacheFlow results, it did not prevent their test from functioning, primarily due to the fact that the client requests could always reach the servers.
Our experience shows that L4 switching, while an impressive technology, still has not developed to the point where its performance can be taken for granted. Some of the issues that we have encountered are likely to affect many real world configurations. The biggest potential problem would be the tendency of the switch to drop packets in the middle of a TCP session, creating delays and network retransmits. A user deploying L4 switching should speak carefully to their vendor and test the proposed solution before committing to it.
Polyteam is considering organizing an industry-wide competition for L4 switches and other Web traffic ``redirectors''.
It has become a Polyteam tradition to give participating vendors a chance to comment on the tests after they have seen the review draft. The comments below are verbatim vendor submissions. Polyteam has not verified any of the claims, promises, or speculations that these comments may contain.
Although these results are being published in October 1999, the actual tests were conducted in July 1999. Both the CacheFlow 5000 hardware and the CacheOS release 2.1 software were in very early stages of their release at that time. Since the July 1999 testing, both the model 5000 appliance and CacheOS v2.1 have been hardened, improved, and released to the general market, and are now running in production at many customer sites. Since that time, CacheFlow has also introduced several new high-performance caching appliances designed to hit different price/performance/redundancy requirements (e.g. the CacheFlow models 110, 515, 525, 545, and the 3000 series).
We have up-to-date, October 1999 Polygraph numbers for our products. We will be working with Polyteam in the next several weeks to retest, validate, and publish those numbers. For more information on CacheFlow Internet Caching Appliances go to http://www.cacheflow.com.
Based on these test results, Dell maintains its leadership position in single-cache peak request rate performance. The Dell ICS-635B also demonstrated near ideal average responses times for cache hits (0.05 seconds) and cache misses (3.11 seconds) under peak workloads.
To learn more about Dell's ICS cache appliances, see http://www.dell.com/products/poweredge/serversolutions/caching.htm.
Network Appliance would like to thank Data Communications Magazine and the Web Polygraph team for hosting this event and for all of their hard work. We look forward to future implementations of Polygraph that will hopefully be focused on improving the representation of real-world Web traffic--such as moving from a uniform access object popularity model to more of realistic zipf-like distribution: http://polygraph.ircache.net/Workloads/PolyMix-2/.
For more information on NetCache appliances from Network Appliance go to http://www.netapp.com/products/netcache/.
Novell's results on Intel Architecture (IA) systems demonstrate Novell ICS's fine-grained service tolerances under load. Both the Intel and Dell ICS results produced the lowest mean response times for cache hits and misses. The Intel solution delivered a 0.04 second average cache hit response time from a 324GB cache. Non-Novell ICS solutions in the competition delivered cache hit response times 100% to 3800% higher than Novell's. Latency induced by the competition during cache misses was 40% to 1000% higher than Novell's.
Using these results, Novell also demonstrated significant CPU headroom for value-added services. While servicing 6000 requests per second, the three Intel systems were not running at full capacity. The Layer-4 switch was the bottleneck in this configuration and the 3 ICS systems were operating at only 33% processor utilization during the workload's peak intensity. This performance capacity positions Novell ICS for in-the-flow value-added services that require significant CPU headroom and uniform response times under load.
[Data Communications magazine invited Falcon Systems to participate in the tests. The printed DataComm version of this review may contain Falcon Systems as the NetHawk vendor.]
Falcon Web Systems, Inc. no longer exists. The principal architect and the development team at Falcon Web System are now at Tasmania Network Systems, Inc., which is responsible for the ongoing development and support for the NetHawk Web Caching proxy.
The Web Polygraph implementation of persistent connection negotiation on the client side caused compatibility problems with NetCache and, possibly, NetHawk proxies.
Polygraph was sending a Connection: HTTP header field to a proxy that was expecting to receive a Proxy-Connection: field. Being an HTTP/1.0 client, Polygraph should have stayed away from using Connection: fields because some proxies take a conservative approach of not honoring that header when it comes from HTTP/1.0 clients. Polyteam considers Polygraph behavior with respect to the Connection: header buggy because most HTTP/1.0 clients send a Proxy-Connection: field, and benchmarks should follow common usage patterns rather than protocol standards.
Note that server-side persistent connections were not affected.
Unfortunately, the bug was discovered after the workload and code were frozen. The NetCache team apparently did not test persistent connections performance at home and did not notice or report any bugs. During the unofficial part of the tests, Polyteam detected the compatibility problem and notified the vendor. The NetCache team came up with a workaround that was used for the official tests.
The NetHawk team had unidentified problems with persistent connections support during home trials. They decided to turn persistent connections off for the official tests because the code was not mature enough. We suspect that the Polygraph compatibility bug was only a part of the problem (if it was). The fact that NetHawk did not use any persistent connections on the server side (the side unaffected by our bug) confirms our suspicion.
Other vendors had no problem with client-side persistent connections. Some were using transparent mode (where the bug is invisible), while other proxies were robust enough to honor both Connection: and Proxy-Connection: request headers.
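The two proxy behaviors described in this footnote can be sketched as follows. The function name and the `honor_connection` switch are hypothetical, and the ``keep-alive'' token is assumed from common HTTP/1.0 practice; the exact header values Polygraph sent are not given in this report.

```python
def wants_persistent(headers, honor_connection=True):
    """Decide whether an HTTP/1.0 client request asks for a persistent connection.

    headers: dict of request header names to values.
    honor_connection=False models the conservative proxy that only trusts
    Proxy-Connection: from HTTP/1.0 clients.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    accepted = ["proxy-connection"]
    if honor_connection:
        accepted.append("connection")
    for name in accepted:
        tokens = [t.strip().lower() for t in normalized.get(name, "").split(",")]
        if "keep-alive" in tokens:
            return True
    return False
```

A robust proxy corresponds to the default `honor_connection=True`, which accepts either header. A conservative proxy corresponds to `honor_connection=False`, for which a request carrying only `Connection: Keep-Alive` is treated as non-persistent.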
The CacheFlow product did not use persistent connections on the server side because the vendor was not confident that the server side code was ready.
Memory hit ratio is defined as the portion of hits that were served from cache RAM as opposed to the disk subsystem. Black-box benchmarks cannot reliably detect memory hits. However, the generated workload may affect memory hit ratio as measured by the proxy software.
To simulate a stable document hit ratio, Polygraph steadily increases the set of URLs from which new requests are taken. This approach is inevitable if URLs from past runs cannot be used. When all known URLs have the same probability of being requested at a given time (the so-called uniform popularity model), the ever increasing URL set has an unfortunate side-effect. As the known URL set grows, so does the set of ``hot'' URLs that some proxies tend to keep in memory. Another way to describe the same phenomenon is to say that each individual URL becomes ``cooler'' with time. This phenomenon makes caching objects in RAM less beneficial with time, decreasing memory hit ratio. The performance of a memory cache depends, of course, on the actual caching policy used by a proxy and on RAM size. We cannot compare the performance of various policies using macro-benchmarks due to insufficient data available on the benchmark side.
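The side-effect can be approximated with a deliberately simplified model. Assume that every known URL is equally likely to be requested and that the proxy keeps a fixed number of objects in RAM, whichever objects its policy selects. The share of hits served from memory is then roughly the ratio of memory-resident objects to known URLs. The function names and all numbers in the usage note are hypothetical.

```python
def known_urls_after(seconds, new_urls_per_sec):
    """Size of the known URL set if it grows at a constant rate (assumption)."""
    return int(seconds * new_urls_per_sec)


def memory_hit_share(memory_objects, known_urls):
    """Approximate share of hits served from RAM under uniform popularity."""
    if known_urls <= 0:
        return 0.0
    return min(1.0, memory_objects / known_urls)
```

For instance, with a hypothetical 200,000 objects in RAM and 100 new URLs per second, the share falls from about 0.56 after one hour to about 0.14 after four hours, even though the document hit ratio offered by the workload stays constant.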
Polyteam is working on better popularity models that will have constant memory hit ratio per megabyte of memory cache (assuming a given caching policy).
$Id: index.sml,v 1.9 1999/10/11 19:03:23 rousskov Exp rousskov $