Many of you have heard that there are GemStone/S applications running in production with thousands of vms and that other applications are doing thousands of commits per second. I bet you have been wondering what kind of performance you could get running Seaside on GemStone/S – I certainly was:)
Seriously, I spent most of the summer working on tools and with the exception of a few small tests, I didn’t take the time to focus on performance. When I came back from ESUG, I decided to run some scaling tests.
[For those of you who want to see the results and move on with your surfing, here are a couple of links: Goals, Results, Conclusions].
Background
From the tests that I had run over the summer, it appeared that Apache was a limiting factor when trying to run at rates above 30 requests per second. I’d also seen some anomalies in the i/o on Linux, where we’d get flat spots in our performance graphs that appeared to be related to file system buffer flushing. If you have read the post on transactions, then you know that we write tranlog records on every commit (i.e., on every page request), so disk i/o can be a limiting factor.
With the help of our IS guy, it turned out that to get the best performance from Apache we needed to use the MPM worker module. With the worker module turned on, Apache was no longer a limiting factor.
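For reference, on Apache 2.x this means moving off the default prefork MPM to the worker MPM (selected at build or package time) and tuning a handful of directives. The snippet below is just a sketch; the thread and client counts are illustrative assumptions, not the values from our configuration.

```
# httpd.conf sketch -- worker MPM tuning (these numbers are assumptions)
<IfModule mpm_worker_module>
    StartServers          2
    MaxClients          150
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxRequestsPerChild   0
</IfModule>
```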
The issue with the i/o anomalies that we observed in Linux has not been as easy to resolve. I spent some time tuning GemStone/S to make sure that GemStone/S wasn’t the source of the anomaly. Finally our IS guy was able to reproduce the anomaly, and he ran into a few other folks on the net who have observed similar anomalies.
At this writing we haven’t found a solution to the anomaly, but we are pretty optimistic that it is resolvable. We’ve seen different versions of Linux running on similar hardware that don’t show the anomaly, so it is either a function of the kernel version or of the settings of some kernel parameters. As soon as we figure it out we’ll let you know.
For the purposes of these performance tests, I was able to work around the i/o anomaly by putting extents on raw partitions. In nearly all of the tests, the Shared Page Cache (SPC) is sized large enough to hold the entire working set for the test. Consequently there was very little read activity, and the system was able to write dirty pages from the SPC fast enough that random i/o to the raw extent partitions didn’t affect test results.
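For those who want to try the same workaround, extents, tranlogs, and the SPC size are all set in the stone’s configuration file. Here is a rough sketch; the raw device names and sizes are placeholders I’ve made up for illustration, not the actual configuration used in these runs.

```
# stone configuration sketch -- device names and sizes are made up
DBF_EXTENT_NAMES = /dev/raw/raw1;                        # extent on a raw partition
STN_TRAN_LOG_DIRECTORIES = /dev/raw/raw2, /dev/raw/raw3; # tranlogs on raw partitions
STN_TRAN_LOG_SIZES = 1000, 1000;                         # tranlog sizes in MB
SHR_PAGE_CACHE_SIZE_KB = 1000000;                        # roughly a 1G SPC
```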
Goals
I had three goals in mind when I ran this set of tests.
- demonstrate anticipated performance of the GLASS appliance.
- demonstrate production performance for the Web Edition.
- demonstrate performance potential beyond the Web Edition.
For the GLASS appliance tests I wanted to illustrate what you could expect if you installed the VMware image on a machine without paying particular attention to disk configurations. To simulate the GLASS appliance, I simply ran the tests using file-based tranlogs.
For the production-scale performance of the Web Edition, I wanted to illustrate what you could expect if you paid attention to the disk configuration (i.e., created some raw partitions and had a box with a minimum of 4 disk spindles).
I also wanted to illustrate what kind of performance improvements you could see if you were to increase the size of the SPC or the number of CPUs available.
Finally, based on a comment where Ramon Leon suggested I include some ‘speed comparisons between GemStone and Squeak’, I’ve included runs against a Squeak vm.
Test Strategy
I decided to base the performance tests on the Seaside Counter example. Since the Counter has dead-simple render logic and no significant application state, it is the perfect application for measuring baseline Seaside performance. For GemStone that means that when running tests against the Counter, we’ll be getting performance numbers that include the overhead of persisting and sharing session state (about 250 objects or 50k bytes per request).
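For context, the Counter component does nothing more than render its count with a pair of increment/decrement links. Its render method looks roughly like this (quoted from memory, so treat it as a sketch of the Seaside example rather than the exact source):

```
renderContentOn: html
	"Render the current count with ++ / -- links; each click is a callback
	 that bumps the count and triggers a full request/response (and, on
	 GemStone, a commit of the session state)."
	html heading: count.
	html anchor
		callback: [ self increment ];
		with: '++'.
	html space.
	html anchor
		callback: [ self decrement ];
		with: '--'
```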
Over the course of the summer I ran across siege, “an http regression testing and benchmarking utility” that is very easy to use. It can smack the heck out of a web application without putting too much of a load on the system running the test, and it provides some basic stats about how your app withstood the barrage, with Response Time, Transaction Rate, and Concurrency being the most interesting.
Siege basically arranges to fire a number of concurrent requests at a given URL. In benchmark mode, as soon as a response is received another request is launched in its place. In this mode you can force an application to its knees – very nice for finding bottlenecks. In internet mode, siege waits a random amount of time before firing off the follow-on request, simulating what the end users of your application might experience.
I ran my tests using the URL ‘http://penny:8000/seaside/examples/counter’, which, as many of you Seasiders out there will recognize, means that a new Seaside session is created on each hit. For benchmarking purposes, that’s just fine – a little extra load never hurts. In the future, I plan to run some scaling tests that measure intra-session performance.
For quite a while now, I have felt that it is important for folks to plan on running multiple vms when they go into production with GLASS. For the best overall response times, you should plan on running at least one vm per concurrent request, which in practice means 10 or more. For the benchmark tests, the Counter application is cpu bound, so when siege slams the web server into the ground the cpus are pegged and cpu contention between the processes becomes a factor. It turns out that with 10 vms running on a single cpu, there is a whole lot of contention going on. So after playing around a bit, I settled on using 5 vms for the benchmark tests while running with 10 concurrent requests. This combo gave good performance numbers in the single-core tests, while minimizing contention in the 4-core test when all 4 cpus were redlined. I also ran a couple of internet tests using a single cpu and 20 vms to confirm that in the wild you could afford to run with more than 5 vms without suffering a performance hit. Finally, I ran a benchmark test with 100 concurrent requests to see how the system behaved when it was being slashdotted.
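To make the Siege column in the tables below concrete, the invocations looked more or less like the following. The URL is the real one; the run lengths are placeholder values of my own, since I haven’t listed the exact duration used for each run.

```
# benchmark mode: 10 simulated users, each firing a new request as soon as a response arrives
siege -b -c 10 -t 10M http://penny:8000/seaside/examples/counter

# benchmark mode, "slashdotted": 100 concurrent users
siege -b -c 100 -t 10M http://penny:8000/seaside/examples/counter

# internet mode: each simulated user waits a random delay between requests
# (the tables show just -i; concurrency there comes from siege's config defaults)
siege -i -t 10M http://penny:8000/seaside/examples/counter
```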
Test Setup
For the hardware, I used 3 machines: foos, toronto, and penny.
Penny is a 2.6GHz AMD Opteron with 2 dual-core cpus, running SUSE Enterprise 10 with 8GB of RAM. Penny was used to host Apache and Siege. We had a dedicated 1Gb/s Ethernet connection between penny and toronto.
Foos is a 2.4GHz Intel with 1 dual-core cpu, running SUSE Enterprise 10 with 2GB of RAM and 2 disk drives (no raw partitions). Foos was used to simulate the GLASS appliance performance.
Toronto is a 2.2GHz AMD Opteron with 2 dual-core cpus, running SUSE Enterprise 10 with 8GB of RAM and 5 disk drives (raw partitions on 3 of the drives). Toronto was used to simulate a typical production machine.
Siege was pointed at the Apache instance listening on port 8000. In addition to the mpm_worker_module, mod_proxy_balancer was used to round-robin requests to the various vms. GemStone was running with version Seaside2.8g1-dkh.490 of Seaside.
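The balancer configuration was along these lines; the backend ports are invented for illustration (one per Seaside vm on toronto), and mod_proxy_balancer’s default byrequests scheduling provides the round-robin behavior.

```
# Apache 2.2-style sketch -- backend ports are assumptions, one per Seaside vm
Listen 8000
<Proxy balancer://seaside>
    BalancerMember http://toronto:9001
    BalancerMember http://toronto:9002
    BalancerMember http://toronto:9003
    BalancerMember http://toronto:9004
    BalancerMember http://toronto:9005
</Proxy>
ProxyPass /seaside balancer://seaside/seaside
```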
For the Squeak tests I used the latest development image from Damien Cassou (sq3.9-7067dev07.10.1), the 3.9 vm and loaded Seaside2.8a1-lr.492. I pointed siege directly at the Squeak image (both running on Toronto).
I did not enforce exclusive use of any of the machines during the tests (penny and toronto are shared by other folks in the company). But between running most of the tests multiple times and keeping an eye out for anomalous events, the numbers are good enough for government work.
Summary of Results
In all, I ran 15 different tests in 6 different categories (Squeak baseline, Squeak internet, Squeak benchmark, GemStone Web Edition, GemStone internet, and GemStone benchmark).
The following table is sorted by Req/Sec. Click on a Run number to jump to the category and a description of the test.
Run | Req/Sec | Core | Gem | VM | std | Siege | Machine | Notes |
1 | 10 | 1 | 1 | S | – | -b -c 10 | Toronto | 1.0 ART |
2 | 15 | 1 | 5 | G | 5 | -b -c 10 | Foos | file-based, 1G SPC |
3 | 16 | 1 | 1 | S | – | -i | Toronto | 0.3 ART |
4 | 25 | 2 | 5 | G | 3 | -b -c 10 | Foos | file-based, 1G SPC |
5 | 28 | 1 | 20 | G | 5 | -i | Toronto | raw, 1G SPC |
6 | 28 | 2 | 20 | G | 5 | -i | Toronto | raw, 5G SPC |
7 | 29 | 1 | 1 | G | 6 | -i | Toronto | 0.02 ART, raw, 1G SPC |
8 | 32 | 1 | 1 | S | – | -b -c 1 | Toronto | 0.03 ART |
9 | 50 | 1 | 5 | G | 10 | -b -c 10 | Toronto | raw, 1G SPC |
10 | 75 | 1 | 5 | G | 13 | -b -c 10 | Toronto | 0.1 ART, raw, 5G SPC |
11 | 87 | 1 | 5 | G | 6 | -b -c 100 | Toronto | 1.3 ART, raw, 5G SPC |
12 | 91 | 1 | 1 | G | 6 | -b -c 10 | Toronto | 0.1 ART, raw, 1G SPC |
13 | 140 | 2 | 5 | G | 20 | -b -c 10 | Toronto | raw, 5G SPC |
14 | 185 | 3 | 5 | G | 37 | -b -c 10 | Toronto | raw, 5G SPC |
15 | 230 | 4 | 5 | G | 40 | -b -c 10 | Toronto | raw, 5G SPC |
ART in the Notes column is shorthand for Average Response Time, a stat from siege.
Squeak Results
Run1, Run3, and Run8 are scaling tests against the Squeak image. Run7 and Run12 are comparable tests run against a GemStone vm. Note that there is only 1 GemStone vm being used in these tests.
Run | Req/Sec | Core | Gem | VM | std | Siege | Machine | Notes |
1 | 10 | 1 | 1 | S | – | -b -c 10 | Toronto | 1.0 ART |
3 | 16 | 1 | 1 | S | – | -i | Toronto | 0.3 ART |
7 | 29 | 1 | 1 | G | 6 | -i | Toronto | 0.02 ART, raw, 1G SPC |
8 | 32 | 1 | 1 | S | – | -b -c 1 | Toronto | 0.03 ART |
12 | 91 | 1 | 1 | G | 6 | -b -c 10 | Toronto | 0.1 ART, raw, 1G SPC |
Baseline
Run8 showed the best results for Squeak, with a rate of 32 requests/second when hit with a siege from a single user.
Internet tests
For the internet test (Run3), the Squeak vm hit 16 requests/second (0.34 seconds average response time and an average of 6 concurrent requests). In a comparable test (Run7), the GemStone vm hit 29 requests/second (0.02 seconds average response time and an average of 0.6 concurrent requests).
Benchmark tests
When 10 concurrent users slammed the Squeak vm (Run1), performance dropped to 10 requests/second. Siege doesn’t collect stats on the standard deviation, but I observed a wide range of response times around the 1 second average response time. In the comparable GemStone test (Run12), the GemStone vm hit 91 requests/second, with a standard deviation of 6 and an average response time of 0.1 seconds.
Under load it appears that GemStone is about 10 times faster at processing Seaside requests than Squeak (Run1 compared to Run12). While the GemStone vm is certainly faster than the Squeak vm, I don’t think that the GemStone vm is that much faster. I haven’t tried to analyze what might be going on, but my best guess is that under load, the garbage collector is siphoning cpu cycles away from the processing of requests.
GemStone Results
Web Edition tests
Run2, Run4, and Run9 are intended to illustrate the kind of performance you might expect when running a version of the Web Edition (i.e., 1 core, 1G SPC, and a 4G extent).
Run | Req/Sec | Core | Gem | VM | std | Siege | Machine | Notes |
2 | 15 | 1 | 5 | G | 5 | -b -c 10 | Foos | file-based, 1G SPC |
4 | 25 | 2 | 5 | G | 3 | -b -c 10 | Foos | file-based, 1G SPC |
9 | 50 | 1 | 5 | G | 10 | -b -c 10 | Toronto | raw, 1G SPC |
Run2 and Run4 used file-based tranlogs and extents, while Run9 used raw tranlogs. You can see that Run9 is over 3 times faster than Run2: using raw i/o for tranlogs makes a big difference.
Even with 2 cores, Run4 is still slower than Run9, confirming that with file-based tranlogs the test is i/o bound.
A sustained rate of 15 requests/second (24×7) is about the top rate that we’d recommend when using the Web Edition. In Run2 the disk-based garbage collector kept pace with the expiration of session state, consuming roughly 1/5 of the 4G repository before it was garbage collected. During Run10 (at a rate of 75 requests/second) a 4G extent was consumed in 15 minutes! That’s roughly what you’d expect: at about 50k bytes of session state per request, 75 requests/second works out to something like 3.4G of new session state in 15 minutes.
Internet tests
Run5 and Run6 illustrate what you might expect for production performance with real world loads.
Run | Req/Sec | Core | Gem | VM | std | Siege | Machine | Notes |
5 | 28 | 1 | 20 | G | 5 | -i | Toronto | raw, 1G SPC |
6 | 28 | 2 | 20 | G | 5 | -i | Toronto | raw, 5G SPC |
The difference between Run5 and Run6 is that I used 2 cores and a 5G SPC in Run6. It is clear that neither a larger SPC nor more cores is needed to sustain this rate.
These tests were run using 20 vms. In GemStone/S a vm handles a single http request at a time (each http request coming into the Seaside vm acquires the transaction mutex for the duration of the request), so in order to get concurrent handling of requests you need to have multiple vms running. A quick look at Run7, which runs at essentially the same rate with a single vm, shows that you don’t sacrifice performance by spreading the load across 20 vms, while gaining the ability to handle up to 20 requests concurrently.
Benchmark tests
Run10, Run11, Run13, Run14, and Run15 are intended to illustrate how GemStone/S stands up to siege in benchmark mode and to illustrate the scaling performance as you add cores into the mix.
Run | Req/Sec | Core | Gem | VM | std | Siege | Machine | Notes |
10 | 75 | 1 | 5 | G | 13 | -b -c 10 | Toronto | 0.1 ART, raw, 5G SPC |
11 | 87 | 1 | 5 | G | 6 | -b -c 100 | Toronto | 1.3 ART, raw, 5G SPC |
13 | 140 | 2 | 5 | G | 20 | -b -c 10 | Toronto | raw, 5G SPC |
14 | 185 | 3 | 5 | G | 37 | -b -c 10 | Toronto | raw, 5G SPC |
15 | 230 | 4 | 5 | G | 40 | -b -c 10 | Toronto | raw, 5G SPC |
Run10, Run13, Run14, and Run15 show a nearly linear progression in performance as more cores are added.
With Run11 I cranked siege up to 100 concurrent requests to see what would happen. The fact that Run11 averaged a little more than Run10, even though they were running on the same configuration, tells me that the cpu wasn’t running flat out in Run10. With 100 concurrent requests, though, the cpu was truly hammered. If you look at the Average Response Time for the two runs, you’ll see that with 10 times more concurrent requests, it took 10 times longer to respond to a request, which makes sense since the extra concurrent requests ended up being queued behind the 5 vms (roughly 100 outstanding requests divided by 87 requests/second gives about 1.1 seconds, close to the 1.3 second ART).
At rates approaching 200 requests/second, I started seeing indications that the data structure used to store session state (RcKeyValueDictionary) was reaching the limit of its effectiveness. As we move forward I will be looking at data structures that are better suited to these rates.
Conclusions
You can expect very reasonable performance numbers with the Web Edition – pretty good value for your money:). Along with transparent persistence, the Web Edition will support rates up to 50 requests/second across a large number of vms. You will have to keep an eye on the size of your repository if your sustained rate starts averaging near 15 requests/second. But hey, you can serve an awful lot of donuts at these rates.
If you start getting more traffic, it looks like GemStone/S can be scaled to some pretty respectable rates without having to change your application.