SQL vs NoSQL KO. Postgres vs Mongo


This article will compare and contrast one of the most commonly used relational databases (Postgres) with one of the most commonly used document databases (Mongo). It will look at the strengths and weaknesses of both, and try to make practical recommendations for when to use which. It is by no means a comprehensive survey of the database landscape. Entire families are not represented (graph databases for one).
This article is going to assume a relatively high level of general technical ability. Some understanding of JSON would be advantageous. Knowledge of databases is not required, as background theory will be explained. If you already have a solid understanding of database theory, the first chapter can be skipped with no loss of information. The second chapter will compare, at a high level, the features provided by both databases, and attempt to resolve this battle once and for all! In the interests of continuing the whimsy, when a round ends with a clearly dominant winner, I will declare it by knockout, but when it is slightly less clear cut, I will call it a technical decision, and include some justification.

1. Theoretical Background

In this chapter we will talk about the theoretical background necessary for a serious discussion of databases. Everything discussed here will be assumed as background knowledge throughout the rest of this article.

1.1 The CAP Theorem

Any serious discussion of databases will include at least a passing reference to the CAP theorem. The CAP theorem states that a networked shared-data system can have at most two of the following three guarantees: consistency (at any given point, every node sees the same, up-to-date data), availability (every request receives a response), and partition tolerance (the system keeps functioning even when the network is partitioned). The theorem is further used to loosely classify databases into three families, CA, CP, and AP, representing systems that fulfill two of the three guarantees. CA systems are patently impossible, as the introduction of even a single networked client renders the system partitioned; any vendor selling a CA system is lying outright. CP systems will, in the presence of a partition, choose to serve only consistent data: if consistency cannot be ensured, they will refuse to service the request. AP systems will, in the presence of a partition, continue responding, but will make no guarantees about the consistency of the response.

1.2 Transactions

A database transaction is a representation of a single unit of work against the database, to be processed as a whole.

1.2.1 ACID Transactions

ACID stands for:
Atomic: everything in the transaction is part of a single 'atom'; either everything succeeds or everything fails.
Consistent: both the starting and finishing states of the transaction are valid and do not violate any database rules.
Isolated: transactions are isolated from each other and cannot interfere with each other.
Durable: all completed transactions persist permanently.
Note that this consistency is not the same as the consistency in CAP; the C in CAP is a much stronger guarantee, commonly referred to as linearizability.
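To make atomicity concrete, here is a minimal sketch in Postgres-flavored SQL (the accounts table and the amounts are hypothetical): the transfer either applies both updates or neither.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- If either statement fails, or we issue ROLLBACK instead, neither change is applied;
-- COMMIT makes both changes durable at once.
COMMIT;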

1.3 Linearizability

This is the real meaning of the C in CAP. Informally, the guarantee is that once an operation completes, all later operations, irrespective of transaction or node, must see a persistent, durable result of that operation, or a later state. Simply put, if we have written to variable x, any and all later operations must read the correct, up-to-date value of x.

1.4 Normalization

Wikipedia defines normalization as "the process of organizing the attributes and tables of a relational database to minimize data redundancy." There are many normalization strategies, but a database is usually considered normalized if it is in third normal form (3NF). What this means is the following:
  1. Each attribute contains only atomic values. This explicitly rules out complex JSON or array structures stored at the leaf node.
  2. No data is redundantly represented through a partial dependency. Put differently, for every minimal set of columns that uniquely identifies a row (this is called a candidate key), no other attribute may depend on only a proper subset of that candidate key.
  3. No data depends on anything other than the key; in particular, non-key attributes may not depend on other non-key attributes. Some examples: let's take the following simplistic table:
dob   name   attributes
1/1/1936   Edsger   {hair: blue, eyes: green}
2/2/1937   Alan   {hair: green, eyes: purple}
This table fails the first condition (commonly referred to as first normal form or 1NF), as the attributes column is non-atomic, i.e. it represents multiple pieces of information per value. Now let's take a table that passes that criterion but fails the second (commonly referred to as second normal form or 2NF).
specialty   name   eye_color   hair_color
databases   Edsger   blue   green
programming   Alan   green   purple
algorithms   Alan   green   purple
algorithms   Edsger   blue   green
This fails the condition: neither the specialty nor the name alone is sufficient to uniquely identify a row (the candidate key is the pair), yet the eye and hair color depend only on the name, leading to duplicated information. Finally, let's take a look at the following table:
specialty   name   company_worked_at   company_address
databases   Edsger   Foobar LTD   1 Street
programming   Alan   Foobar LTD   2 Street
algorithms   Alan   Fubar GMBH   3 Street
algorithms   Edsger   Blue Corp   4 Street
This table shows us which companies the experts have worked at and in which capacity; however, the addresses describe the company rather than the expert or their specialty, which violates 3NF. Put simply, this can be remembered as "Everything must provide a fact about the key, the whole key, and nothing but the key, so help me Codd" (Codd being the scientist who coined these forms). At this point very reasonable questions to ask might be "Why is this useful?" and "How does this apply in practice?". In short, by normalizing data we ensure that data is not duplicated, that the Postgres query optimizer can make the best possible decisions (the exact algorithm for this is outside the scope of this article), and that we don't end up with corrupted data anywhere in our system (due to abandoned join tables, for example).
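As a hedged example of how that last table might be normalized in Postgres-flavored SQL (table and column names are invented for illustration): the company address now lives with the company, the personal attributes live with the expert, and the specialty rows only reference them.
CREATE TABLE companies( id SERIAL PRIMARY KEY, name TEXT, address TEXT );
CREATE TABLE experts( id SERIAL PRIMARY KEY, name TEXT, eye_color TEXT, hair_color TEXT );
CREATE TABLE specialties(
  expert_id INTEGER REFERENCES experts(id),
  company_id INTEGER REFERENCES companies(id),
  specialty TEXT,
  PRIMARY KEY (expert_id, company_id, specialty)
);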

1.5 Relational Databases (RDBMS)

A Relational Database Management System (RDBMS) is a database based on the relational model. It is designed to focus on the relations between data, and stores data in tables made up of rows and columns. A detailed discussion of the mathematics behind the relational model is out of scope for this article, but the concepts of normalization apply to RDBMSs to a very large degree.

1.6 Document Databases

Document databases are designed to store what is known as semi-structured data: data where there is no clear separation between the data's schema (think of this as the shape of the data: what relations it has, how many columns, and so on) and the data itself.

1.7 Indexing

Wikipedia defines a database index as:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure.[1]
More loosely speaking, a database index is a way for a database to keep track of the values of a particular column or attribute, thus making said column or attribute much faster to read or search for, but slower to write to.
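As a small, hedged example in Postgres-flavored SQL (the users table and email column are invented): once the index exists, equality lookups on email no longer need to scan the whole table, at the cost of extra work on every insert or update.
CREATE INDEX users_email_idx ON users (email);
SELECT * FROM users WHERE email = 'alan@example.com';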

2. The Fight

2.1. Classification (the contestants).

2.1.1 Classification of Postgres (in the right corner).

Postgres is an RDBMS with ACID-compliant transactions, up to and including full serializability of transactions if the transaction level is set to Serializable. Its query language is a flavor of SQL, and it is CP, as is demonstrated here[2].

2.1.2 Classification of MongoDB (in the left corner)

MongoDB is a multi-storage-engine beast. The default storage engine is MMAP, though WiredTiger is a newly released storage engine that purports to address some of the more systemic flaws. However, as evidenced by the large number of bugs related to both data loss and memory leakage, it is clearly not yet ready for prime time. As such, while improvements purported by WiredTiger will be called out and analyzed where possible, the contender in the left corner is MMAP. MongoDB is a schema-less, document-oriented DBMS with JSON-like objects forming the data model. It does not support transactions out of the box. It is not AP, because it is a single-master system and all reads go to the primary node; if the primary node is down, requests cannot be served (although this can be ameliorated with Replica Sets and automatic failover, whereby a new primary is elected on network partition). It is also not CP, as its consistency claims are disproven here[3].

2.2 Consistency and Transactions (Round 1):

2.2.1 Transactions and Durability in Postgres

Postgres does not require read locks (except when the transaction level is set to Serializable), since every transaction works off a snapshot of the database. An inconsistent read (also known as a dirty read) is therefore impossible. Postgres has three levels of transaction isolation. For an in-depth discussion, look at the documentation[4].
The first is Read Committed. This is the default. What this means is that each query within the transaction only sees data that had already been committed to the database when that query began. No changes made by other transactions during the execution of the query will be visible; however, concurrent changes committed between queries by other transactions will leak into the transaction.
The second is Repeatable Read. This transaction level guarantees that no changes made by concurrent transactions during the transaction's execution will be visible. In effect, the transaction's changes are applied onto a snapshot, independently of other transactions. If the transaction tries to modify data that a different transaction has already modified, it will fail with a concurrency error, and the application you are writing will have to retry the transaction.
The final level is Serializable. This transaction level guarantees that any transactions executed concurrently will have exactly the same effect as if they had been executed one after the other. While this may appear on the surface to be very similar to Repeatable Read, there is a great explanation of the difference here[5].
A short (practical) example (taken from the above documentation) is a bank which allows you to overdraw any one of your accounts, as long as the cumulative sum across all of your accounts stays above zero. A malicious attacker might attempt to exploit Repeatable Read by withdrawing large sums from all accounts concurrently: since each Repeatable Read transaction works off its own snapshot, every individual transaction will succeed, resulting in a large net loss to the bank, whereas Serializable detects the conflict and rejects one of the transactions. It is worth noting that, while a lot of work has been done to optimize the higher transaction levels, they still come with some performance overhead. Whether or not the stronger consistency is worth the performance cost is something that should be evaluated on a case by case basis, using benchmarks on your own data and operations.
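As a sketch of how an isolation level is chosen per transaction in Postgres (the accounts table is hypothetical; without the ISOLATION LEVEL clause you get the default, Read Committed):
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- Under Serializable, a concurrent transaction draining another account cannot
-- combine with this one to break the "cumulative sum above zero" rule; one of
-- the two fails with a serialization error and must be retried.
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
COMMIT;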

2.2.2 Transactions and Durability in MongoDB

MongoDB does not support transactions out of the box. It does, however, allow configuration of its Write Concern, which, while not related to transactions, does speak to the durability of write operations. The default is Acknowledged, which guarantees that the write operation has reached the database, but does not guarantee that the data has actually been written to disk. Other options are:
Journaled: the write request has been written to MongoDB's journal (a log of operations that have not yet been applied to the data files proper).
Majority: the write has propagated to a majority of nodes and been acknowledged by them.
While Majority does ensure consistency in the absence of network partitions, a network partition may still result in inconsistent data and/or data loss, even within the confines of a single document. It is also true that MongoDB provides an $isolated operator which enforces consistency by way of a write lock, but it cannot be used on sharded collections and does not actually guarantee atomicity, as an error during the write operation does not roll back the entire "transaction". MongoDB also provides a loose guideline for implementing a two phase commit[6].
Loosely speaking, this will require you to:
  1. Create a transaction with all of your queued changes.
  2. Execute each change sequentially, tracking the changes that have succeeded.
  3. If any change fails, roll back every change already made, and cancel the transaction.
  4. Mark the transaction done.
Keep in mind that this means you need to be able to encode an inverse, or rollback, operation for every operation, and that getting this right is a very hard problem, a full solution for which is out of scope of this article. It also means that: 1) your transaction logic lives client side, rather than in MongoDB itself, and 2) there is no global lock across all shards, which means that documents can be modified mid-transaction.

2.2.3 Transactions in WiredTiger

WiredTiger purports to provide ACID transactions (I do not have sufficient computing power to put this to the test, and unfortunately third party tests have yet to occur). Taking the documentation at its word, it provides a maximum level of Snapshot Isolation, roughly equivalent to Postgres' Repeatable Read (which Postgres implements as snapshot isolation), not full serializability.
Winner: Postgres. By knockout. Although once WiredTiger is ready for primetime, this gap will narrow somewhat.

2.3 Performance and Denormalized Data

2.3.1 Denormalized data in Postgres.

Postgres supports four column types for storing denormalized data.
hstore: this is for storing flat key-value pairs. It is largely only there for legacy use.
json: this stores JSON blobs as text. It runs validation to ensure well-formed JSON and provides convenience operators for access, but that's about it; it does not provide indexing into the structure. This is also largely only for legacy use.
jsonb: this stores JSON in a binary format while still displaying it as JSON rather than a single text value, and allows indexing into arbitrary attributes for speed of lookup. It does, however, still require a full rewrite of the value to update it, although functions coming in 9.5 will allow updates to nested paths more easily[7].
array: this stores an array of some other datatype (text, number, whatever). To find out more about the operators available on all of these, the documentation is a great place to go[8], and a short jsonb sketch follows below.
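A hedged sketch of jsonb in practice (the users table and its attributes are invented): JSON operators pull values out, and a GIN index speeds up containment queries.
CREATE TABLE users( id SERIAL PRIMARY KEY, name TEXT, attributes JSONB );
INSERT INTO users (name, attributes) VALUES ('Edsger', '{"hair": "blue", "eyes": "green"}');
-- ->> extracts an attribute as text.
SELECT name FROM users WHERE attributes ->> 'eyes' = 'green';
-- A GIN index makes containment queries (@>) fast.
CREATE INDEX users_attributes_idx ON users USING GIN (attributes);
SELECT name FROM users WHERE attributes @> '{"hair": "blue"}';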

2.3.2 Denormalized data in Mongo

Storing JSON is what Mongo is optimized to do. Mongo stores its data in a binary format called BSON, which is (roughly) just a binary representation of a superset of JSON. The reason for the 'roughly' qualifier is the lack of a plain JSON number type, while the reason for 'superset' is the support for native binary data.

2.3.3 BSON vs JSONB

So how do our contenders perform? The first thing an astute reader will notice is the similarity between JSONB and BSON: both are JSON structures stored in a binary format internally. How do they differ? The first point of difference is that JSONB outputs fully standards-compliant JSON, as described by the RFC[9], while BSON does not. This is, however, a double-edged sword: unlike BSON, JSONB does not support a native binary type, nor, more pertinently, a date type.

2.3.4 Performance Comparison

As demonstrated here[10], Postgres is faster, uses less space on disk, and is all around more performant for JSON storage and reads than Mongo.
Winner: Postgres. By technical victory. Postgres is faster, less memory intensive, and more standards compliant than MongoDB. However, if you require some of the intrinsic type checking of BSON, and this type checking must be done in a denormalized manner rather than via table columns, or if you need to do complex updates of nested JSON attributes, Mongo ekes out a victory. Once Postgres 9.5 ships, the latter ceases to be a concern.

2.4 Complex Model Relations, Access Patterns, and Normalized Data

2.4.1 Normalized Data in Postgres

This is more or less what Postgres does. It allows you to encode data relationships into tables, using foreign keys to encode relationships between tables, and allows you to join between tables to pull in data from across table boundaries[11]. For example:
CREATE TABLE ORGANIZATIONS( id SERIAL PRIMARY KEY, name TEXT );
CREATE TABLE USERS( id SERIAL PRIMARY KEY, organization_id INTEGER REFERENCES ORGANIZATIONS(id), name TEXT );
Now, to select users by organization name, we can write either:
SELECT USERS.* FROM USERS, ORGANIZATIONS WHERE USERS.organization_id = ORGANIZATIONS.id AND ORGANIZATIONS.name LIKE '%bar%';
Or
SELECT USERS.* FROM USERS INNER JOIN ORGANIZATIONS ON USERS.organization_id = ORGANIZATIONS.id WHERE ORGANIZATIONS.name LIKE '%bar%';

2.4.2 Normalized Data in MongoDB

Loosely speaking, Mongo collections map to Postgres tables, while Mongo documents map to Postgres rows. It is important to note that MongoDB does not support joins: you either embed the related data as a nested object and fetch it directly, or you store a reference (a raw id or a DBRef) and query for the nested relationship yourself. While storing a nested object like so:
db.createCollection('users'); 
db.users.insert({name: 'jack', organization: {name: 'foo corp'}});
is a trivial solution, it is also denormalized and, from a business logic point of view, presumes that an organization has no life independent of its users. What if this isn't true, or you would like to have normalized data? MongoDB allows two ways to achieve this:
1) DBRefs allow you to embed direct references to other documents in your documents. However, this will force additional queries to be run every time to fetch the referenced documents, and should (as per MongoDB's documentation) be avoided where possible.
2)
db.createCollection('users'); 
db.createCollection('organizations'); 
db.organizations.insert({name: 'foo corp'}); 
db.organizations.insert({name: 'bar corp'}); 
var foo = db.organizations.findOne({name: 'foo corp'});
db.users.insert({name: 'Jack', organization_id: foo._id});
You can then use application-level logic to extract the organization_id, fetch that organization separately, and join the data in the application. Note that this bypasses any transaction logic you have built or use, unless your transaction logic is handled at the application level. Winner: Postgres. By knockout.

2.4.3 Performance Comparison

MongoDB also provides an aggregation framework that lets users mimic much of the JOIN functionality of the relational world. A good (albeit somewhat biased in writing style) performance comparison is provided here[12]. TL;DR: Postgres is about three times faster at joins and aggregation than Mongo. Additionally, the MongoDB aggregation pipeline can only operate on a single collection. An important note here is that this benchmark only really applies to a single-machine database.
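For reference, a hedged sketch of what the join-plus-aggregation side of such a comparison looks like in Postgres, reusing the USERS and ORGANIZATIONS tables from 2.4.1: a single query joins and groups across both tables.
SELECT ORGANIZATIONS.name, COUNT(USERS.id) AS user_count
FROM ORGANIZATIONS
LEFT JOIN USERS ON USERS.organization_id = ORGANIZATIONS.id
GROUP BY ORGANIZATIONS.name;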

2.5 Scaling

There are fundamentally two types of scaling: horizontal and vertical. Vertical scaling loosely means adding resources to a given machine: more RAM, more CPU cores. Horizontal scaling means running your database across multiple machines.

2.5.1 Scaling Postgres

As long as you can keep scaling vertically, Postgres scaling is trivial: you add more power to your machine, bump up the resources allocated to Postgres, and you're off to the races. But sooner or later this will hit a ceiling. Scaling Postgres horizontally is significantly harder, but doable. There are several valid strategies to achieve this; at their core these are read replication (a master machine that accepts writes, plus multiple read-only replicas) and sharding. Sharding is a complex topic with multiple solutions, from application-level load balancing down to database-level logic that stores the shard id as part of the primary key, as done at Instagram[13]. A detailed discussion of the approaches is out of scope of this article, but a tiny sketch of the idea follows.
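As a minimal, hypothetical sketch of the application-level flavor of sharding (this is not Instagram's exact scheme): hash an immutable attribute of the row and take it modulo the number of shards to decide which database instance the row lives on.
-- hashtext() is a built-in Postgres hash function; 4 is an assumed shard count.
SELECT abs(hashtext('jack@example.com')) % 4 AS shard_number;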

2.5.2 Scaling Mongo

MongoDB supports sharding natively. When sharded, collections are partitioned by a shard key, and Mongo's query routers can then identify the right shard to read from. A great resource for achieving good sharding, complete with best practices for balancing shard sizes, can be found in the documentation[14].
Winner: MongoDB. Technical Victory due to native sharding support and ease.

2.6 Rapid Prototyping

So, you have investors breathing down your neck, and you owe a prototype yesterday. What technology do you choose for your data store? While Postgres seems to be the clear victor above, there are a few advantages to using Mongo:
  1. As the store is schema-less, you do not need to continuously write migrations as your requirements rapidly change.
  2. You do not need to think through your data model or ensure normalization.
  3. You do not need to write SQL; the query language is JSON-like and will feel very familiar to anybody with JavaScript experience.
  4. It is probably fair to say that at this early juncture much of your data is of limited importance, and your organization can survive its loss or corruption, so the strong guarantees provided by Postgres are not necessary.
The downside: all data is equally likely to be lost. If your organization deals with enterprise customers, or handles financial data, MongoDB is very simply not an option[15]. Additionally, while it is true that MongoDB is easier to scale down the road, Postgres is also scalable (if it's good enough for Instagram...).
Winner: MongoDB. Technical victory. Assuming you do not already have Postgres and/or database expertise, MongoDB's simpler query interface and lack of schema migration/maintenance requirements make it easier to rapidly prototype in. Just be aware that unless you fit a small set of niche use cases where individual, small-scale data loss is truly irrelevant (running large-scale analytics on normally distributed datasets, for example), you will eventually have to throw your database away and rewrite swathes of your application.

3. Summary

Postgres comes out the clear victor of this fight. There are valid use cases for MongoDB, such as reporting on and aggregating large datasets of normally distributed data, or storing truly denormalized data: data where relationships are mostly non-existent, documents are large and mostly unstructured with little to no overlap, and data loss is largely irrelevant (log parsing and caching come to mind). However, as a general-purpose database, Postgres is clearly the dominant fighter in this arena, and if some denormalized data is required, say a set of optional parameters on a user (eye color, height, weight, hair color), Postgres' JSONB column is more than sufficient.
