Polyteam members have given a few presentations on Polygraph and
proxy benchmarking in general. We've been asked for a copy of our slides.
Below is an HTML compilation of our most comprehensive slide set.
The slides are based on version 1.x of Web Polygraph.
Please let us know if our semi-automated compilation screwed something up.
Benchmarking Proxy Caches
with Web Polygraph
Polyteam
March 1999
1. Motivation
- Customer
  - Performance-based purchasing decisions
  - Sizing decisions
  - Potentially ``expensive'' features (e.g., huge ACLs)
  - Upgrades
- Developer
  - Bottleneck identification
  - Optimization selection and tuning
  - Component stress testing
- Marketing
  - Fantastic peak performance numbers for the public
  - Testing competitors' products
  - Bake-offs
2. Available Benchmarks
- CacheFlow performance testing tool
- Inktomi large scale benchmark
- NetCache Load Generator
- Wisconsin Proxy Benchmark
- HTTP Blaster
- Web Polygraph
- other
2.1 CacheFlow performance testing tool
- Trace-driven clients (logs + synthetic) and real Web servers
- Load control, some cachability control, HTML parsing
``Focus on Web page response time, not throughput''
- Response time, throughput, object sizes
- Installation instructions and command samples
- Free binaries for Win32, Solaris, and Linux
(after answering 13+ questions!)
- http://www.cacheflow.com/technology/tool/
2.2 Inktomi large scale benchmark
- Synthetic clients and servers
- Synthetic workload; load control, hit ratio control
- Throughput (ops/sec and bytes/sec)
- Free Web page (binaries are probably available to clients)
- http://www.inktomi.com/inkbench/
2.3 NetCache Load Generator
- Synthetic clients and servers
- Synthetic workload; load control
- Response time for various object size ranges
- Installation instructions and command samples
- Free binary (Solaris) and source code
- ftp://ftp.netapp.com/netcache/cfh/
2.4 Wisconsin Proxy Benchmark
- Synthetic clients and servers
- Synthetic two-phase workload; hit ratio and temporal locality
(persistency, spatial locality, and more in 2.0)
- Response time, hit ratios
- White paper, command samples, and post-processing scripts
- Free source code
- http://www.cs.wisc.edu/~cao/
2.5 HTTP Blaster
2.6 Web Polygraph
- Synthetic clients and servers
- Synthetic or trace-based workload;
various types of load, hit ratio, cachability control, ...
- Throughput, response times, object sizes, concurrency levels
- Fairly detailed documentation;
benchmarking strategies, results, and caveats
- Free source code
- http://polygraph.ircache.net/
3. Benchmarking Methodology
- Sheer performance versus smart features
- ``Micro'' versus ``macro'' level benchmarking
- Synthetic versus real world workload
- White vs. gray vs. black box testing
3.1 Performance vs. Features
- Pro-performance
  - straightforward tests and clear results
    (``as good as it gets'')
  - can actually compare boxes
  - essential for developers
- Pro-features
  - features affect ``perceived'' performance
    (and only perceived performance matters)
  - features are the real difference between vendors
  - useful for customers
3.2 Micro vs. Macro
- Pro-micro
  - can isolate bottlenecks
  - gives insights on box operation and tuning
  - essential for developers
- Pro-macro
  - represents box performance as a whole
  - eases result presentation (SPEC)
  - essential for marketing
3.3 Synthetic vs. Real
- Pro-synthetic
  - models whatever we want (in theory)
  - shows relative importance of workload components
  - controlled environment and repeatable results
  - essential for developers
- Pro-real
  - models whatever real boxes see (in theory)
  - requires no understanding of traffic characteristics
  - useful for users
3.4 White vs. Gray vs. Black
- White box testing is virtually always better
(getting more information rarely hurts)
- ``Gray'' box testing is virtually always possible
(and discarding important info is usually stupid)
- Black box testing is worry-free!
4. Benchmarking Strategy
- Performance oriented
- Micro level with a macro test at the end
- Synthetic workload
- Was verified on several caching boxes
4.1 The Plan
- No-nothing: Testing the network
- No-proxy: Testing your test suite
- Null-proxy: Trying proxy's shoes on
- Miss-only workload, empty cache
- Filling the cache
- Miss-only workload, full cache
- Hit-only workload
- Hit-and-miss workload
- Bursty traffic
4.2 (0) No-nothing
- determine sustained network bandwidth between hosts
(know your limits!)
- check for half-duplex / full-duplex conflicts and other sources of
packet loss
- use network testing tools such as dd | ttcp, netperf, etc.
- do not be fooled by short experiments;
transmit at least 1 GB to test a 100 Mb/s link
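To see why 1 GB is a sensible minimum, here is a small back-of-the-envelope sketch. It assumes a decimal gigabyte and an ideal link with no protocol overhead; the function name is made up for illustration.

```python
def min_transfer_seconds(num_bytes, link_bits_per_sec):
    """Best-case time to move num_bytes over a link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_sec

GB = 10**9            # decimal gigabyte (assumption; the slide does not say which)
LINK = 100 * 10**6    # 100 Mb/s link

seconds = min_transfer_seconds(1 * GB, LINK)
print(f"1 GB over 100 Mb/s takes at least {seconds:.0f} s")
```

Even in the best case the transfer lasts over a minute, which gives packet loss and duplex problems time to show up in the sustained rate.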
4.3 (1) No-proxy
- stressing machine(s) running Polygraph software
- stressing network infrastructure
- "no-proxy" performance must be significantly better than
[expected] proxy performance
$ polysrv --goal $BigGoal --port $OriginPort
$ polyclt --ports 1024:30000 --proxy $Origin --origin $Origin \
--robots $cl --goal $BigGoal
4.4 (2) Null-proxy
- use null proxy or tunnel
- testing network and host bottlenecks (new paths)
- extra overhead for processing each client-server ``connection''
- not always feasible (software vs hardware solutions)
$ polysrv --goal $BigGoal --port $OriginPort
$ tunnel $ProxyPort $Origin
$ polyclt --ports 1024:30000 --proxy $Proxy --origin $Origin \
--robots $cl --goal $BigGoal
4.5 (3) Miss-only (empty cache)
- testing metadata maintenance and garbage collection
(to be compared with full-cache experiments later)
- two tests: cachable and uncachable reply generation
- uncachable test is arguably the easiest workload to handle!
- flush your cache after a cachable run
# to get cachable miss-only workload, use ``--rep_cachable yes''
$ polysrv --goal $Goal --port $OriginPort
$ polyclt --ports 1024:30000 --proxy $Proxy --origin $Origin \
--unique_urls 1 --robots $cl --goal $Goal
4.6 (4) Filling the cache
- watch performance change (if any) as disks get full, metadata grows,
and, at the end, garbage collection kicks in
- watch out for ``healing'' mode;
schedule for 20% extra time
- use the best number of clients from cachable miss-only runs
- this is the longest experiment; be sure you have enough time
$ polysrv --goal $HugeGoal --port $OriginPort
$ polyclt --ports 1024:30000 --proxy $Proxy --origin $Origin \
--rep_cachable 100p \
--unique_urls 1 --robots $cl --goal $HugeGoal
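A rough way to budget the fill run is sketched below. It assumes every reply is unique and cachable, so the cache grows by one mean-sized object per request; the box parameters are hypothetical and the function is not part of Polygraph.

```python
def fill_time_hours(cache_bytes, req_per_sec, mean_obj_bytes, extra=0.20):
    """Hours needed to push cache_bytes of unique cachable replies
    through the proxy, padded by the `extra` fraction for healing mode."""
    fill_seconds = cache_bytes / (req_per_sec * mean_obj_bytes)
    return fill_seconds * (1 + extra) / 3600

# Hypothetical box: 50 GB of disk cache, 100 req/s, 10 KB mean reply size
hours = fill_time_hours(50 * 10**9, 100, 10 * 10**3)
print(f"schedule about {hours:.1f} hours")
```

The request rate to plug in is the one sustained in the cachable miss-only runs; a slower proxy stretches the estimate proportionally.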
4.7 (5) Miss-only (full cache)
- similar to miss-only tests with an empty cache
- compare with previous experiments to see the effects of metadata,
garbage collection, and other full-cache factors
- cachable test is arguably the hardest workload to handle!
$ polysrv --goal $Goal --port $OriginPort
$ polyclt --ports 1024:30000 --proxy $Proxy --origin $Origin \
--unique_urls 1 --robots $cl --goal $Goal
# to get cachable workload, use ``--rep_cachable yes''
4.8 (6) Hit-only
- testing disk subsystem in read-only mode
- passes: fixed URL set, variable order of requests
- ideal HR: 0%, 100%, 100%, ...
actual HR: 0%, 80%, 90%, 95%, ...
- careful: only some benchmarks can do this test right
$ polysrv --goal $Goal --port $OriginPort
$ polyclt --order $pass --world_id hit-only.full.$cl \
--proxy $Proxy --origin $Origin --ports 1024:30000 \
--rep_cachable 100p \
--unique_urls 1 --robots $cl --goal $Goal
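The ``ideal HR'' sequence can be reproduced with a toy simulation. This is a sketch of the pass structure only, assuming an idealized cache that stores every reply and never evicts; it does not model Polygraph or any real proxy.

```python
import random

def hit_ratio_per_pass(num_urls, passes, seed=0):
    """Request the same URL set in a new random order on every pass
    against an idealized cache that keeps everything it sees."""
    rng = random.Random(seed)
    urls = [f"/obj{i}" for i in range(num_urls)]
    cache = set()
    ratios = []
    for _ in range(passes):
        rng.shuffle(urls)          # variable order, fixed URL set
        hits = 0
        for url in urls:
            if url in cache:
                hits += 1
            else:
                cache.add(url)     # miss: the ideal cache stores the reply
        ratios.append(hits / num_urls)
    return ratios

print(hit_ratio_per_pass(num_urls=1000, passes=3))   # [0.0, 1.0, 1.0]
```

A real proxy falls short of this on the second and later passes, which is the gap between the ideal and actual sequences listed above.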
4.9 (7) Hit-and-miss
- Macro benchmark
- mixture of misses and hits, Zipf-based object popularity
- vary offered hit ratio: 33%, 55%, 75%, ...
measure received hit ratio
- arguably close to real-world traffic
$ polysrv --xact_think norm:3sec,1.5sec \
--goal $Goal --port $OriginPort
$ polyclt --proxy $Proxy --origin $Origin --ports 1024:30000 \
--rep_cachable 80p \
--dhr $hr --robots $cl --goal $Goal
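The difference between offered and received hit ratio can be illustrated with a toy model. This is not Polygraph's algorithm: it assumes repeats are offered with a fixed probability, picks the repeated URL uniformly (a stand-in for Zipf popularity), and uses an ideal cache that hits only on replies marked cachable.

```python
import random

def measured_hit_ratio(offered_hr, cachable, requests, seed=0):
    """Offer repeats with probability offered_hr; an ideal cache can only
    hit on repeats of replies that were marked cachable."""
    rng = random.Random(seed)
    seen = []        # (url, is_cachable) for every distinct URL so far
    hits = 0
    for i in range(requests):
        if seen and rng.random() < offered_hr:
            _, is_cachable = rng.choice(seen)   # repeat an old URL
            hits += is_cachable                 # hit only if it could be cached
        else:
            # new unique URL, cachable with the given probability
            seen.append((f"/obj{i}", rng.random() < cachable))
    return hits / requests

for offered in (0.33, 0.55, 0.75):
    got = measured_hit_ratio(offered, cachable=0.8, requests=20000)
    print(f"offered {offered:.0%}, measured {got:.1%}")
```

Under these assumptions the measured ratio settles near the offered ratio times the cachable fraction, so even a perfect cache reports less than what was offered.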
4.10 (8) Bursty traffic
- constant number of concurrent requests is not realistic
(but easier to analyze)
- use Poisson stream with a given mean request rate
- caution: bursty traffic, proxy-independent request rate
(may want to try ``constant'' stream as well)
$ polysrv ... --goal $Goal --port $OriginPort
$ polyclt ... --req_rate $rate --robots 1 --goal $Goal
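For intuition about what a Poisson stream looks like, the sketch below draws exponentially distributed gaps between requests, which is the standard way to generate Poisson arrivals. The rate and duration are arbitrary examples, and this is not how Polygraph itself is implemented.

```python
import random

def poisson_arrivals(rate, duration, seed=0):
    """Arrival times (seconds) of a Poisson stream with the given mean rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponential gap with mean 1/rate
        if t >= duration:
            return times
        times.append(t)

arrivals = poisson_arrivals(rate=50, duration=60)
per_second = [0] * 60
for t in arrivals:
    per_second[int(t)] += 1          # count requests in each one-second slot
print(f"mean {len(arrivals) / 60:.1f} req/s, "
      f"busiest second {max(per_second)}, quietest {min(per_second)}")
```

The mean stays close to the requested rate while individual seconds swing well above and below it, and the request times do not depend on how fast the proxy answers.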
5. TODO
Developed areas
- packet delays
- memory hits
- persistent connections
- prefetching algorithms
- DNS delays
- loose client-server coupling
Premature?
- packet fragmentation
- proxy cooperation
- robustness
- compliance
- impact of source IP pool
- overheads of ACLs, content parsing, etc.
6. Results Presentation
6.1 Detailed report
- Response time versus throughput
- Throughput versus load
- Response time versus load
- Hit ratios (if any) versus load
- Total number of errors (if any) versus load
- Relative variation of trace data versus load
- For each run: raw traces (measurement vs. time)
for measurements (1) -- (3) above
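One possible reading of ``relative variation'' is the coefficient of variation: standard deviation divided by the mean. The sketch below assumes that reading, and the sample values are made up.

```python
import statistics

def relative_variation(samples):
    """Standard deviation divided by the mean (coefficient of variation)."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean

steady = [100, 102, 98, 101, 99]     # hypothetical response times, msec
jumpy = [100, 180, 40, 150, 30]      # similar mean, much less stable

print(f"steady: {relative_variation(steady):.1%}")
print(f"jumpy:  {relative_variation(jumpy):.1%}")
```

Plotting this value against load shows where a box stops behaving predictably, even if its average numbers still look fine.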
6.2 Executive Summary
7. Gotchas
- Environmental issues
  - Benchmarking a proxy without benchmarking the test environment and
    the benchmark first
  - Ignoring cron, update, etc. settings
  - Ignoring known OS- or protocol-dependent weirdness
- Experiment start
  - Starting all clients at once
  - Starting measurement phase before all clients are active
- Experiment duration
  - Using ``N replies per client'' as a termination condition
  - Using the same ``N total replies'' as a termination condition
    regardless of the number of simulated users
  - Using idle timeouts on server side during ``mix'' or ``hit-only'' experiments
- Run interdependencies
  - Using the same URLs for independent experiments
  - Using the same sequence (order) of requests for hit experiments
  - Changing HTTP header sizes between experiments
- Cachability issues
  - Trusting a proxy to cache ``cachable'' traffic
  - Trusting a proxy to cache all ``cachable'' traffic
- Statistics collection
  - Using statistics averaged over the entire run
  - Ignoring trace (ping-like) statistics
  - Ignoring various errors during a test
  - Ignoring actual socket read/write size distributions
- Resource limits
  - Running out of ephemeral ports with benchmarks that do not have built-in port management
  - Using insufficient listen queue length on server side
  - Using default MSL settings
- Other
  - Expecting the number of concurrent users to match
    the number of concurrent connections (see graphs)
  - Ignoring a proxy-imposed limit on the number of persistent
    connections per single IP address
- Vendor ``tricks''
  - ``flush'' button does not flush
  - healing modes
  - tests that target memory hits, clustered I/Os, etc.
  - compile-time optimizations ($ cc -OspecBench ...)
  - run-time benchmark detection
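The ephemeral port and MSL items under resource limits are connected, and a quick estimate shows how. The sketch assumes the client closes each connection first, so the client-side port stays in TIME_WAIT for 2 x MSL, and that the port range is inclusive; the 30-second MSL is an assumed value, not a measured one.

```python
def max_connection_rate(first_port, last_port, msl_seconds):
    """Upper bound on new connections/sec from one client IP to one server
    address and port, if each closed connection holds its local port
    in TIME_WAIT for 2 * MSL."""
    ports = last_port - first_port + 1
    return ports / (2 * msl_seconds)

# Port range from the polyclt examples above; 30 s MSL is an assumption
print(f"{max_connection_rate(1024, 30000, 30):.0f} conn/s")
```

With these numbers a single client address tops out at roughly 483 new connections per second, so a faster proxy would be limited by the client host rather than by its own performance. Lowering the MSL or adding client addresses raises the bound.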
8. System Level Tuning
- Performance drops after 1K, 10K, 100K requests?!
(``mbuf size'' hack for FreeBSD)
- Router eats performance?
(undocumented TCP_ACK_HACK option for FreeBSD)
- Incorrect Content-Length in replies
(do not use SO_LINGER on Solaris)
- More at http://polygraph.ircache.net/
9. Conclusions
- A couple of decent benchmarks are available
- Benchmarking is a complex activity,
but strategies and guidelines are getting better
- Do it yourself, but watch out for known pitfalls
- Build a comprehensive report and share it with others
- Demand detailed benchmarking results from vendors
$Id: index.sml,v 1.2 1999/09/16 17:37:27 rousskov Exp $