Vetting Apache Cassandra


Prologue

Check out Event-driven Data or: How I Learned to Stop Living in the Present and Travel Through Time by Zach Lark for a story about the data model that drove most of our research and performance testing.
Before I get into the setup, I want to clear the air. This isn't some writeup about how great Apache Cassandra is as a database or how NoSQL is the new world order. This is a look back at the long process by which we proved to ourselves and our colleagues that Cassandra was a great fit for our project. What I hope you, the reader, can take away from this is a sense of how we approached a challenge of this scale and nebulousness, and how we avoided feeling overwhelmed by breaking the project up into fun little chunks. Our actual metrics will be largely irrelevant here.
To quickly introduce Cassandra, I'll say it's an open source distributed database, like a clustered, load-balanced web service that happens to store and retrieve data. Each instance of the database running on a server is a node, and when networked together they form a ring or cluster. The distributed nature of the system can offer a high level of resilience, but brings with it a set of interesting challenges.
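To make that a little more concrete, here is a minimal smoke-test sketch using the DataStax Java driver (the 3.x-era API). The contact point and the system-table query are just placeholders for poking at a local node; none of this is specific to our project.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSmokeTest {
    public static void main(String[] args) {
        // The driver only needs a contact point or two; it discovers the rest of the ring itself.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // assumption: a local test node
                .build();
             Session session = cluster.connect()) {
            Row row = session.execute("SELECT release_version FROM system.local").one();
            System.out.println("Connected to Cassandra " + row.getString("release_version"));
        }
    }
}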

Choosing Our Own Adventure

Another developer and I were given an opportunity to explore new database technologies for a project that was years in the making, but not yet put together on any platform. We'd been building applications on top of the Oracle relational database management system (RDBMS) for as long as either of us had been at the company, so this was a "Big Deal." We were, in essence and seemingly out of the blue, given carte blanche to find the database that was right for the project. Imagine, for a moment, being told that cost is not a factor, and that you should base your decision on pure, objective, technical analysis. Exciting, right?
It should be noted, though, that a task granting this much freedom assigns equal responsibility. We were not simply choosing a database. Rather, we were equipping our managers, along with procurement folks, executives, a DBA team with decades of combined Oracle experience, and dozens of other people used to certain kinds of software running on certain kinds of hardware, with the information they would need to make an impactful decision for the company. Here's our little success story.

Enabling the Decision

The first thing we read online about any database was either a raving review from a diehard evangelist, a scathing diatribe from a fan of something else, or blanket statements about the product's superiority from the company's own marketing team. A dozen self-proclaimed fastest and most reliable databases made for a frustrating, unnavigable landscape, so our first goal was to split up, cut through the BS, and attempt to understand the technologies. As a mad genius once put it, "Instead of seeing what they want you to see, you gotta open your brains to the possibilities."
We needed to apply each database's recommended best practices to our particular data model, query patterns, volume, and peak load. We catalogued driver availability, support agreements, and hardware and network topology requirements. High availability, performance, security, and ease of administration were top priorities. To bring all that and more together, we kept a journal for the duration of the project detailing all of our research and findings on our company's wiki, so nobody could claim we didn't "do our due diligence."
After what felt like weeks of pure research, and several cuts, we compiled a very short list of contenders, including Cassandra, and subjected those to a thorough and well-catalogued series of tests that we basically kept scaling up until something broke.

Creating a Consistent Environment

For the most part, databases come with some assembly required. As we devised our test strategy for Cassandra, we reviewed the documentation as thoroughly as time would allow, looking for the configuration that would best serve our purposes. Turns out it was "default," but whatever. We set up our best production-style Linux virtual machine in a cloud platform, installed the database and any other necessary software, created a template based on that machine, and then assaulted our objective with an army of clones. Virtual machine clones, of course. Every time we had a test to run, we'd have a fresh, perfectly configured database sitting on consistent hardware. I can't stress enough how beneficial the cloud environment was for our evaluation. Misconfigure something? Bring up a new clone. A bug in your test garbles your data? Clone. Discover a new setting that would serve as a new baseline? Create a new template, then clone that.
The cloning process was super helpful when we wanted to experiment with new settings. Cassandra provides a few options for defining a data replication strategy, and in general they rely on essentially identical configuration across the various nodes in the cluster. Once a strategy is chosen, it can be set up as a template and cloned to any number of machines. Additional clones can be used to expand the cluster with minimal work, and opportunities for human error are greatly reduced. This is more important for some humans than others, but it's good to keep in mind regardless.
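As a quick illustration of what that shared configuration buys you, here is a hedged sketch of creating a keyspace with an explicit replication strategy through the Java driver. The keyspace name and replication factor are made up for the example, and SimpleStrategy is only reasonable for a single-datacenter test ring; every node sees the same definition, so data placement is driven by this one statement rather than per-machine settings.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateTestKeyspace {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Hypothetical keyspace name and replication factor, purely for illustration.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS event_test "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
        }
    }
}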

Creating Representative Data Sets

We had a handful of baseline snapshots based on varying data volumes from which we cloned all of our environments, and figuring out what those baselines should be was a pretty cool little side project. Our team created projections for the new feature's customer base, and thus its data volume, out to seven years. We chose a handful of milestones from those projections, generally points where something would grow by an order of magnitude: the number of users, the number of records of a particular type, or the average rate of changes to those records. Variance in user profiles gave us some interesting data sets to play with. For example, some objects in our system might be composed of a dozen elements, while others might contain thousands, but we needed great performance for both. This variance created excellent learning opportunities around best practices for query patterns and denormalization.
private void createFourYearData() {
    IdentityEvent identityEvent;
    for (int i = 0; i < numberOfIdentities; i++) {
        // Each identity gets a base identity event plus an ID event and an address event.
        identityEvent = createNewIdentityEvent();
        identityEventDao.save(identityEvent);
        eventDao.save(createNewEvent(identityEvent.getNetworkId(), ID_EVENT));
        eventDao.save(createNewEvent(identityEvent.getNetworkId(), ADDR_EVENT));
        // Partition size controls how many address-change events pile up per identity.
        for (int j = 0; j < partitionSize; j++) {
            addressEventDao.save(createNewAddressEvent(identityEvent.getNetworkId()));
        }
    }
}
The above code snippet highlights the simple scaling properties of our baselines. Over time, we expect more "identities" in the system, and also more "address change" events associated with each identity. Our target databases reflected growth in these dimensions in vastly different ways.

Creating Tools, Enabling Repeatability

We had an array of tests laid out for each database, and were prepared to gather more metrics than even seemed reasonable. We had read through some very cool articles in the Jepsen series on Aphyr.com, and were inspired to build up our own toolkit, albeit somewhat mashed in with our in-house code repositories. Not only would it serve our immediate needs, but it would help others at the company with a similar mission. Not to mention, if our results ever seemed unbelievable, we could reproduce them in a jiffy and say, "I told you so."
Our first tools were sample data hoses for blasting data at those fresh database snapshots. We used them to generate fairly representative data sets through our prototype microservice, which had the handy benefit of requiring only minimal code changes to support different databases. These services were quickly deployed to the same cloud environments as the databases and included in the snapshots, so a single web service request would build up any data configuration we could imagine. We were starting to feel powerful.
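The microservice itself is long gone from my hands, so the following is only a rough sketch of the idea using the JDK's built-in HTTP server: one request with a couple of hypothetical query parameters kicks off a data-generation loop like createFourYearData() above.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DataHose {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // A single request seeds an entire baseline; the query string picks the shape of the data.
        server.createContext("/seed", exchange -> {
            // Hypothetical parameters, e.g. "identities=100000&partitionSize=50".
            String query = exchange.getRequestURI().getQuery();
            // ... parse the parameters and call the DAO-backed generators here ...
            byte[] body = ("seeding: " + query + "\n").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}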

A couple more members of the team poured lots of time and energy into building up JMeter test suites that were basically designed to knock over either the microservice or the database. These were repeatable, scalable tests based on high concurrent user counts and other worst-case scenarios. Our virtual environments were expanded again to include machines dedicated to running these tests, and new batches of snapshots were taken.
By this point we had at least the corner pieces of the puzzle, and maybe some edges, but we were burying ourselves in data: piles of test results, heaps of potential, and nothing substantive yet. We also had several sources to cross-reference, like JMeter results and the metrics tracked by DataStax OpsCenter, a web-based monitoring interface for Cassandra. Our next move was obvious: whip up a tool to parse JMeter results, fetch and interpret OpsCenter data via its REST API, compute averages, percentiles, and such, and spit everything out as a CSV. Then, naturally, build some snazzy charts in the company colors. A few simple plots distilled weeks of setup, testing, and analysis into an at-a-glance view that succinctly but accurately described the performance limits of our particular environment and data model.
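I no longer have that parser handy, but the JMeter half of it boils down to something like the sketch below: read a CSV-format results file, compute a few aggregates, and print a single CSV row ready for charting. It assumes a .jtl in the default CSV layout with a header row, column 1 holding the elapsed time in milliseconds, and no embedded commas in labels, all of which are worth checking against your own JMeter settings. The OpsCenter side is omitted here because its endpoints and payloads were specific to our setup.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class JtlSummary {
    public static void main(String[] args) throws IOException {
        List<Long> elapsed = new ArrayList<>();
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String line : lines.subList(1, lines.size())) { // skip the header row
            elapsed.add(Long.parseLong(line.split(",")[1])); // naive split; assumes no commas in labels
        }
        if (elapsed.isEmpty()) {
            System.out.println("no samples");
            return;
        }
        Collections.sort(elapsed);
        double avg = elapsed.stream().mapToLong(Long::longValue).average().orElse(0);
        // One summary row per test run made it trivial to chart runs side by side.
        System.out.printf("samples,avg_ms,p95_ms,p99_ms,max_ms%n%d,%.1f,%d,%d,%d%n",
                elapsed.size(), avg,
                percentile(elapsed, 95), percentile(elapsed, 99),
                elapsed.get(elapsed.size() - 1));
    }

    private static long percentile(List<Long> sorted, int p) {
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, index));
    }
}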

Run Those Tests

Our actual tests were generally long-running simulations of extreme load or system instability, run on top of each of our baselines. Where the baselines scaled based on data volume and partition size, the tests generally scaled in terms of concurrent threads. We accounted for variable usage patterns by running different versions of each test for write-heavy, read-heavy, and balanced query workloads.

Testing durability and availability kept things rather exciting, despite this phase consisting of lots of waiting and crunching numbers. During extra executions of some of the tests, we would kill one of the Cassandra nodes, wait a while, and bring it back online. We were most interested in whether or how much the overall cluster performance was impacted during and after the outage, whether any writes were lost, and how long it would take the cluster to repair.

Lastly, while our primary focus was on scaling the client and data aspects of our system, another variable we felt deserved some attention was cluster size. We ran a few more of our tests with extra Cassandra nodes, and measured the performance against the baseline clusters. This provided fairly straightforward cost-versus-performance guidelines, a handy reference for future discussions, especially if customer growth exceeds expectations.
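None of our actual JMeter plans would fit here, but the shape of the workload-mix tests mentioned above is simple enough to sketch in plain Java. The Workload interface and the ratio knob are inventions for this example; the point is just that one harness can cover write-heavy, read-heavy, and balanced runs by changing a single parameter.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class WorkloadMix {
    // Hypothetical hook; the real tests drove our microservice through JMeter instead.
    interface Workload {
        void write();
        void read();
    }

    static void run(Workload workload, int threads, double writeRatio, long durationMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long deadline = System.currentTimeMillis() + durationMillis;
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (System.currentTimeMillis() < deadline) {
                    // writeRatio near 1.0 is write-heavy, near 0.0 is read-heavy, 0.5 is balanced.
                    if (ThreadLocalRandom.current().nextDouble() < writeRatio) {
                        workload.write();
                    } else {
                        workload.read();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(durationMillis + 60_000, TimeUnit.MILLISECONDS);
    }
}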

Distilling Data Yields Good Proof

In hindsight, an important goal of each of our tests was that it end in some sort of failure. In most cases, it had something to do with average and max response times reaching ridiculous levels. These were represented by steeply inclined walls on the far right of our graphs, and they gave us confidence in two things. The first was that the database could handle orders of magnitude more traffic than we planned to throw at it any time soon. The second was that we'd know what kinds of growth would impact performance and what to do about it.

Knowing where each database would fail against our use cases and with our data model was key, though. Imagine if we had simply tested our expected user loads and data volumes, even at our biggest seven-year baseline, and called it a draw when several databases handled them marvellously. If a database we chose that way had then failed to maintain low response times with, say, twice as many concurrent users or thirty percent more data, we'd probably have found out in production. Definitely not cool.
One of our bonus takeaways was an amusing anecdote about the benefits of consistent, repeatable tests. For at least a couple of weeks, an odd performance issue had plagued most of our tests. We identified some of the conditions associated with the spikes in our otherwise smooth graphs, but after significant debugging effort, we could not track down the root cause. We'd had a suspicion early on that it might be something weird with the cloud platform, but had nothing to base that on beyond intuition. We were also basically told that what we were seeing shouldn't be possible, which, by the way, is my absolute favorite software description. Our wild goose chase tracking down everything that wasn't the problem gave us enough information to have an effective conversation with the vendor's technical people and hand off our findings. They found and patched a bug in their software within a day or two, which was impressive. I'm not trying to knock the vendor; I'm just saying that repeatable tests in a consistent environment can help everyone involved in ways you can't predict.

Happy Hunting

I hope our experiences researching database solutions can help give perspective to anyone feeling daunted by a similar task. Just remember that finding or building tools is the best counter to time constraints, and when a test's usefulness is in doubt, turn it up to eleven. If you break something, you'll learn something. Thanks for reading!
