With the AWS Cloud now spanning 54 availability zones across 18 regions, Apache Cassandra on AWS is a popular EC2-based architecture. Cassandra is a free, open source, distributed NoSQL database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Clusters can span multiple data centers and regions using asynchronous masterless replication, which gives clients low-latency operations. Beyond running a wide variety of application workloads across the well-supported AWS Global Infrastructure, teams can take advantage of automated deployment patterns.
AWS regularly evolves EC2 by offering new configurations of computing and storage resources designed to improve the performance of database platforms and their workloads with higher input/output values. So we asked: Do you really need to pay for more IOPS to get more database performance?
IOPS (input/output operations per second) is a popular performance metric used to distinguish one storage type from another. Similar to device makers, AWS associates IOPS values with the volume component backing the storage option. Volumes with higher IOPS values target more demanding performance needs, and they cost more.
But it’s important to mind the gap between the promise of performance of a new configuration and the monitored performance of an existing configuration — especially if your database platform drives your highly-available Big Data application. Benchmarking your workload on a new configuration can help you make an informed decision before you reconfigure your platform.
To dig deeper, we designed our own benchmark experiment. We tested a single-node Cassandra instance against read-intensive (R/I), write-intensive (W/I), and mixed read/write (R/W) workloads on a variety of EC2 configurations. We were interested in how newer instance and Elastic Block Store options would impact runtime, latency, and throughput.
Below, you’ll find our method, insights, and the code we developed to help us practice principles such as Infrastructure as Code (IaC) and Don’t Repeat Yourself (DRY). Using Terraform, Git, and Bash, we were able to structure logic and track settings as well as provision and destroy tests in a scalable, repeatable way.
Our goal was to test our database platform in a way that would control for CPU, network, and memory performance so we could better isolate and observe the performance of several different types of storage.
Instance types and sizes. We started by choosing three popular instance types. R3 and R4 are memory-optimized instance types announced in 2014 and 2016 respectively, while the I3, a storage-optimized type for NoSQL workloads, was announced in 2017.
From each of the three instance types, we chose the same size: r4.4xlarge, r3.4xlarge, and i3.4xlarge. These oversized compute packages helped to normalize runtime and throughput of the database platform, highlighting differences in the data disk storage configurations.
Volume and storage types. Next, we chose the set of both volume types and storage types for the data disks to be tested. Volume types describe the capacities and capabilities of the low-level disk hardware, while storage types define the overall usability. Because Cassandra’s write path flushes data from memtable to SSTable files located on these test disks, we chose both local (physically connected) and elastic (networked) configurations. The following table provides more details about the different configurations we used:
Test Database Platforms. We set out to test the Apache Cassandra 3.11.1 database platform on the following instance configurations. We made minor tweaks to account for differences between storage mount points and memory limits:
Benchmarking. We used the popular benchmarking tool Yahoo! Cloud Serving Benchmark (YCSB) to initialize and run our workload parameters.
We provisioned YCSB 0.12.0 on a separate m4.4xlarge instance with storage defaults. It was responsive enough that there was no delay in read/write operation times between it and the database instance.
We also configured YCSB to load 60GB of data in 10,000,000 operations, running each of the following target workloads with 100 and 250 threads:
● Write Intensive (read 20% / write 80%)
● Read Intensive (read 80% / write 20%)
● Read/Write Mixed (read 50% / write 50%)
Note: The YCSB client executes a workload in a thread as a background process to increase the amount of load against a database. More threads will impact your choice of YCSB instance configuration.
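The three workload mixes above can be sketched as YCSB CoreWorkload property files. This is a minimal illustration rather than the files we actually used: it assumes that "write" maps to YCSB's `updateproportion`, and that both `recordcount` and `operationcount` were set to 10,000,000. The `MIXES` table and `render_workload` helper are hypothetical names.

```python
# Sketch: render YCSB CoreWorkload properties for the three workload mixes.
# Package name as used by YCSB 0.12.0.
WORKLOAD_CLASS = "com.yahoo.ycsb.workloads.CoreWorkload"

# (read proportion, write proportion) for each target workload.
MIXES = {
    "write_intensive": (0.2, 0.8),
    "read_intensive": (0.8, 0.2),
    "read_write_mixed": (0.5, 0.5),
}


def render_workload(name, record_count=10_000_000, operation_count=10_000_000):
    """Return the text of a YCSB properties file for the named mix."""
    read, write = MIXES[name]
    props = {
        "workload": WORKLOAD_CLASS,
        "recordcount": record_count,        # assumption: one record per load operation
        "operationcount": operation_count,
        "readproportion": read,
        "updateproportion": write,          # assumption: writes modelled as updates
        "insertproportion": 0,
        "scanproportion": 0,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(render_workload("read_intensive"))
```

Keeping the mixes in one table means a change to a proportion is made in a single place, which fits the DRY approach described earlier.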
With 4 instance configurations and 3 workloads, we provisioned and ran 12 tests (each at 2 thread counts). We provisioned each test as a pair of EC2 instances, one for the database platform and the other for YCSB.
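The arithmetic behind the test count can be sketched by enumerating the matrix. The configuration labels below are shorthand inferred from the results sections of this post, not identifiers from our Terraform code.

```python
from itertools import product

# Shorthand labels for the four database configurations (inferred from the results).
CONFIGS = ["r3.4xlarge gp2", "r4.4xlarge gp2", "r4.4xlarge io1", "i3.4xlarge NVMe"]
WORKLOADS = ["read_intensive", "write_intensive", "read_write_mixed"]
THREAD_COUNTS = [100, 250]


def build_tests():
    """One test per (configuration, workload) pair; each test runs at every thread count."""
    return [
        {"config": config, "workload": workload, "threads": list(THREAD_COUNTS)}
        for config, workload in product(CONFIGS, WORKLOADS)
    ]


tests = build_tests()
# Expand each test into its individual YCSB runs.
runs = [(t["config"], t["workload"], n) for t in tests for n in t["threads"]]
print(len(tests), len(runs))  # 12 24
```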
Provisioning Tests. To provision the tests at scale, we used Terraform and — dare I say it — Bash. Terraform provided a common runtime context which coordinated the parameters of our tests across the two instances. Bash allowed the assembly of Terraform components into a single workspace directory and then enabled test provisioning as a background process.
Running Tests. To run at scale, tests were grouped into batches. This strategy gave us the ability to run more than one test at a time across multiple AWS accounts and/or regions within account limits. For example, we could only provision two i3.4xlarge instances at once. By employing an additional account, we doubled our testing ability in roughly the same amount of test run time.
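To illustrate why a second account shortens the schedule, here is a small sketch of batch planning under a per-account instance limit. The `plan_batches` function and the account names are hypothetical; the real batching was done with Bash and Terraform.

```python
def plan_batches(tests, accounts, limit_per_account):
    """Greedy plan: each batch gives every account at most limit_per_account tests."""
    batches = []
    queue = list(tests)
    while queue:
        batch = {}
        for account in accounts:
            take, queue = queue[:limit_per_account], queue[limit_per_account:]
            if take:
                batch[account] = take
        batches.append(batch)
    return batches


# Hypothetical example: four tests that each need one i3.4xlarge, limit of two per account.
i3_tests = ["test-1", "test-2", "test-3", "test-4"]
print(len(plan_batches(i3_tests, ["account-a"], 2)))               # 2 batches, run back to back
print(len(plan_batches(i3_tests, ["account-a", "account-b"], 2)))  # 1 batch
```

With one account the four tests need two sequential batches; with two accounts they fit in one, which is the doubling described above.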
Monitoring Test Runs. We employed Datadog to monitor the database machines and give us insights into opportunities for last minute tweaks to improve provisioning logic.
● Amazon Web Services Command Line Interface (AWS CLI)
As we evaluate our results below, keep the following points in mind:
- Because databases are designed to solve different problems in different ways, please avoid direct comparisons between Cassandra results and other databases.
- For each target workload, insights are provided based on three key metrics in the YCSB output statistics. The following table outlines our intended use of these three measurements:
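Runtime, throughput, and latency all appear in the summary that YCSB prints at the end of a run, as comma-separated lines such as `[OVERALL], RunTime(ms), ...`. A minimal parser for those lines might look like the sketch below. The sample values are invented for illustration and are not results from our tests.

```python
def parse_ycsb_summary(text):
    """Collect '[SECTION], Metric, value' lines into a {(section, metric): float} dict."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip anything that is not a three-field summary line
        section, name, value = parts
        try:
            metrics[(section.strip("[]"), name)] = float(value)
        except ValueError:
            continue  # skip lines whose last field is not numeric
    return metrics


# Hypothetical output, for illustration only.
SAMPLE = """\
[OVERALL], RunTime(ms), 120000.0
[OVERALL], Throughput(ops/sec), 83333.3
[READ], AverageLatency(us), 950.5
[UPDATE], AverageLatency(us), 1200.0
"""

summary = parse_ycsb_summary(SAMPLE)
print(summary[("OVERALL", "RunTime(ms)")])
```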
Read Intensive Results
(R/I) Runtime. The differences are mind blowing! At 100 threads, runtimes on the newer i3.4xlarge with NVMe were about 20 times faster than on the older r3.4xlarge. At 250 threads, the difference grows to 33 times faster, as runtime on the R3 suffers under YCSB client activity.
(R/I) Throughput. The benefit of NVMe storage is obvious. Throughput measured 20 and 33 times higher at 100 and 250 threads respectively, compared to the older r3.4xlarge general purpose configuration. We also saw that adding more threads actually reduced throughput slightly.
(R/I) Latency. While quite diverse across the instance configurations, latency values get better with newer instance type configurations. Worst-case wait times occurred under the older r3.4xlarge configuration and became even more pronounced as thread count increased. This result makes sense: with more threads issuing read operations, latency rises, and throughput necessarily falls.
Write Intensive Results
(W/I) Runtime. NVMe provided the best runtime metrics and was at least 32% faster than its closest competitor, the 10K io1. Runtimes were significantly longer on the r4.4xlarge gp2 configuration, where we encountered YCSB timeout errors.
(W/I) Throughput. The fast runtimes on NVMe we saw above are supported by the greatest throughput. Compared to the r4.4xlarge 1TB gp2, throughput on the i3.4xlarge NVMe was over 700% higher in operations per second.
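Because these results mix "N times" comparisons with percentage increases, it helps to convert between the two. The helpers below are a small sketch with hypothetical names. A 700% increase, for example, corresponds to 8 times the baseline, not 7.

```python
def percent_increase(baseline, improved):
    """Percentage by which improved exceeds baseline."""
    return (improved - baseline) / baseline * 100


def ratio_from_percent_increase(pct):
    """Convert a percentage increase into a 'times the baseline' ratio."""
    return 1 + pct / 100


def times_faster(baseline_runtime, improved_runtime):
    """For runtimes, where lower is better: how many times faster the improved run is."""
    return baseline_runtime / improved_runtime


print(ratio_from_percent_increase(700))  # 8.0
```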
(W/I) Latency. Comparing the graphs at 100 thread count and 250 threads, we found a couple of interesting results.
● At 100 threads, the two older general purpose gp2 configurations experienced less latency than both the newer 10K io1 and NVMe configurations.
● At 250 threads, all latency values were higher than at 100 threads. However, the NVMe instance measured values closest to latency at 100 threads and was only 12.7% higher than the r4.4xlarge 10K io1. Therefore, NVMe showed the best result.
Read/Write Mixed Results
(R/W) Runtime. Similar to the Read Intensive results, runtimes on the newer i3.4xlarge NVMe configuration beat all other configurations. At 100 threads, the i3.4xlarge was nearly 12 times faster than the r3.4xlarge, and 3.8 times faster than the 10K io1 configurations. Even though at 250 threads we saw runtimes worsen for all configurations, there were bright spots where runtimes on the R4 gp2 and NVMe only grew by minor percentages (2% and 6.7% respectively).
(R/W) Throughput. This workload resulted in higher throughput from the provisioned IOPS (io1) and NVMe configurations. But what’s remarkable is that NVMe values are nearly 4 times higher than io1 and 20 times higher than gp2 configurations.
(R/W) Latency. Here, the differences are prominent. The highest levels correlate to worst-case wait times on the general purpose configurations. The worst-case gp2, the R3 configuration, is 22 times and 30 times slower than the NVMe configuration at 100 and 250 threads respectively.
Better outcomes with larger numbers. After reviewing YCSB run times across all the tests, we noted that lengthening YCSB run times by using larger values for workload parameters (like recordcount and operationcount) would have provided a larger sample of results from which to draw conclusions.
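One way to size a longer run is to work backwards from a target duration and an expected throughput. The sketch below shows that arithmetic. The throughput figure in the example is hypothetical, not one of our measurements.

```python
import math


def operationcount_for(target_seconds, expected_ops_per_sec):
    """Operations needed for a run to last roughly target_seconds at the expected rate."""
    return math.ceil(target_seconds * expected_ops_per_sec)


def expected_runtime_seconds(operation_count, ops_per_sec):
    """Approximate run duration for a given operationcount and sustained throughput."""
    return operation_count / ops_per_sec


# Hypothetical: at 20,000 ops/sec, 10,000,000 operations finish in about 500 seconds,
# while a 30-minute run would need 36,000,000 operations.
print(expected_runtime_seconds(10_000_000, 20_000))
print(operationcount_for(30 * 60, 20_000))
```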
There’s a lot more detail than you realize. Applying the Infrastructure as Code (IaC) and Don’t Repeat Yourself (DRY) principles by employing a Git repository and implementing Terraform component scripts really helped us track all of the moving parts and modify them in a simple way.
Some tests didn’t complete. Along the way, a pattern of timeout issues reared its head while testing on a few database machine configurations:
After reviewing the results, the error conditions demonstrated problematic performance of the database platform in these cases. YCSB run commands could generate errors like:
Cassandra timeout during write query at consistency ONE
(1 replica were required but only 0 acknowledged the write)
In the future, we would revisit the default configuration to apply more tweaks.
Some tests didn’t perform as well as expected. Did you notice the poor read performance on R3 and write performance on R4? These results were due to test configurations that combined network access (elastic block storage) with low or restrictive IOPS values.
Given another opportunity, we would define these configurations to better reflect real-world runtime scenarios. However, if your current runtime configuration matches one that we tested, network gp2 with restrictive (960 / 3,000) or default (3,000) IOPS values, our results show that choosing one of the newer storage configurations offers a significant gain in performance.
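The 960 and 3,000 IOPS values are consistent with how gp2 volumes are sized: baseline performance scales at 3 IOPS per GiB with a floor of 100, and smaller volumes can burst to 3,000. The sketch below applies that rule. The volume sizes in the example are assumptions chosen to reproduce the figures, and the upper cap has changed over time, so treat it as a parameter.

```python
def gp2_baseline_iops(size_gib, per_gib=3, floor=100, cap=16_000):
    """Baseline IOPS for a gp2 volume: 3 IOPS per GiB, bounded by a floor and a cap.

    The cap is an assumption that depends on when you read this; AWS has raised it
    over time, so pass the current value if it differs.
    """
    return max(floor, min(cap, size_gib * per_gib))


print(gp2_baseline_iops(320))    # 960
print(gp2_baseline_iops(1000))   # 3000
```

Under this rule, a gp2 volume well below 1 TB relies on burst credits to reach 3,000 IOPS, which may explain why sustained benchmark load exposed the lower baseline.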
So do you really need to pay for more IOPS to get more database performance? After considering our results and findings from tests run on Cassandra, we’ve come up with the following decision matrix:
Our results suggest taking a nuanced approach when debating whether to upgrade your database platform or not. By modelling workloads to measure the performance based on runtime, throughput, and latency, we observed varied results from the impact of alternative EC2 instance configurations.
Based on a default configuration, we found nearly all Cassandra workloads measured better values with NVMe storage. If you don’t want to update your Cassandra configuration for mount points, then 10K io1 is your next best option. No matter your choice, in general, using a newer storage option is better.
The remainder of the workloads measured better values on an R4 with general purpose SSD storage. For the majority of our models, we would achieve better performance by paying for more IOPS. However, if we were already using an older instance type with general purpose storage, the type of workload would be a big factor in deciding whether to upgrade our configuration.
So, which database platforms would you benchmark in AWS? There are hundreds of them! Get started by visiting our Kenzan Labs @ GitHub repository, then let us know about your experiences and what instance types you chose.
As with everything at Kenzan, this project wouldn’t have come to fruition without our incredibly talented team:
Co-writers and development
Lead Platform Engineer
Platform Engineer (Emeritus)