NoSQL

In most of my experience as a Data Scientist at a large bank, I have retrieved data from relational databases. More recently, we have built a new platform, where I have been working with Cassandra for time series on a Kubernetes cluster. I happened to watch this 1-hour presentation on NoSQL databases, which furthered my motivation to understand the topic, so I read a textbook on NoSQL by Pramod Sadalage and Martin Fowler. Below are my notes.

Polyglot Persistence is the idea of using different data stores in different circumstances. The term is borrowed from Polyglot Programming, which refers to using multiple programming languages within one application.
NoSQL is a movement, not a technology.
Relational databases are not designed to run on clusters, and thus scaling presents a challenge. Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for solving this problem.
Relational databases often work very well. Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
Integration databases are a single source of data for multiple applications. The alternative paradigm is an Application database, which has a one-to-one relationship between storage and application.
Application databases are the paradigm of NoSQL. The application “knows” the database structure. A schemaless db shifts the schema into the application that accesses it. This type of db is more forgiving for evolving needs.
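To make "the schema shifts into the application" concrete, here is a minimal sketch. The document shapes and the `read_order` helper are hypothetical, invented only for illustration: the store accepts both shapes without complaint, so the reading code has to know about every shape that was ever written.

```python
# Two "order" documents written at different times by the same application.
# The store imposes no schema, so only the reading code knows the shapes.
old_order = {"id": 1, "customer": "Ann", "total": 40.0}
new_order = {"id": 2, "customer": {"name": "Bob", "tier": "gold"}, "total_cents": 1250}


def read_order(doc: dict) -> dict:
    """Normalize both historical shapes into the one the application expects."""
    customer = doc["customer"]
    if isinstance(customer, str):
        # Older documents stored a bare customer name.
        customer = {"name": customer, "tier": None}
    if "total_cents" in doc:
        total_cents = doc["total_cents"]
    else:
        # Older documents stored the total as a float amount.
        total_cents = round(doc["total"] * 100)
    return {"id": doc["id"], "customer": customer, "total_cents": total_cents}


normalized = [read_order(old_order), read_order(new_order)]
```

This is the forgiving part (new fields can appear at any time) and also the cost (the implicit schema lives in code like `read_order`).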
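Relational databases are good for analyzing data. Aggregate-oriented NoSQL databases are less flexible for ad hoc querying, because the data is structured around how the application accesses it.
There are four types of NoSQL databases; the first three are called aggregate-oriented data models.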
- Key-Value: Redis, Riak, Dynamo; value is opaque
- Document: MongoDB, CouchDB; value has structure
- Column-family: Cassandra, HBase
- Graph: Neo4j
An aggregate is a collection of related objects that is treated as a unit. Aggregates form the boundaries of ACID operations, and they are central to running on a cluster because the aggregate is the natural unit for distribution.
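As a rough illustration of an aggregate and of the "opaque value" versus "structured value" distinction, the sketch below uses plain Python dicts as stand-ins for real stores. The order fields and the store variables are hypothetical and do not reflect any product's API.

```python
import json

# One aggregate: an order with its line items and shipping address,
# read and written as a single unit.
order = {
    "order_id": "o-1001",
    "customer_id": "c-42",
    "items": [
        {"sku": "book-1", "qty": 1, "price_cents": 3500},
        {"sku": "pen-7", "qty": 3, "price_cents": 200},
    ],
    "shipping": {"city": "Chicago"},
}

# Key-value style: the store sees only a key and an opaque blob.
kv_store = {}
kv_store[order["order_id"]] = json.dumps(order).encode("utf-8")

# The application must fetch and decode the whole aggregate to inspect it.
fetched = json.loads(kv_store["o-1001"])
order_total = sum(item["qty"] * item["price_cents"] for item in fetched["items"])

# Document style: the store understands the structure, so it can filter on fields.
doc_store = [order]
chicago_orders = [d["order_id"] for d in doc_store if d["shipping"]["city"] == "Chicago"]
```

In both cases the whole order travels together, which is what makes it easy to place on a single node of a cluster.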
Distribution models. These demonstrate the trade-off between consistency and availability.
- Single-server
- Sharding: putting different parts of the data onto different servers. Each server is the single source for a subset of the data.
- Master/slave replication: replicating data across multiple nodes, with one node as the authority (master). Helps read scalability.
- Peer-to-peer replication: replicating data across all nodes, with no authority. Helps write scalability. Scaling is linear because there is no master (Cassandra is an example).
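The routing logic behind sharding and master/slave replication can be sketched in a few lines. This is a simplified illustration, not how any particular database does it: the node names are hypothetical, and real systems use more robust placement schemes than a plain modulo.

```python
import hashlib
import itertools

SHARDS = ["node-a", "node-b", "node-c"]  # hypothetical server names


def shard_for(key: str) -> str:
    """Sharding: pick the one server that owns this key, using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


MASTER = "master"                       # hypothetical authority node
REPLICAS = ["replica-1", "replica-2"]   # hypothetical read-only copies
_replica_cycle = itertools.cycle(REPLICAS)


def route(operation: str) -> str:
    """Master/slave: writes go to the master, reads rotate across replicas."""
    if operation == "write":
        return MASTER
    return next(_replica_cycle)


owner = shard_for("customer:42")
first_read, second_read = route("read"), route("read")
```

Reads spread over the replicas, which is why this model helps read scalability, while every write still funnels through the single master.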
Consistency:
- Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
- The pessimistic approach locks data to prevent conflicts; the optimistic approach detects conflicts and then fixes them.
- To get good consistency, many nodes should be involved in an operation, but this increases latency.
- CAP Theorem: when a network partition occurs, you have to trade off consistency against availability.
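The optimistic approach to write-write conflicts can be sketched with version stamps. This is an in-memory toy under stated assumptions: `VersionedStore` and `VersionConflict` are hypothetical names, and a real database would perform the check atomically on the server.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the record."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key is version 0."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Accept the write only if nobody else has written since our read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
store.write("seat-12A", "free", expected_version=0)

# Two clients read the same version, then both try to write.
v_alice, _ = store.read("seat-12A")
v_bob, _ = store.read("seat-12A")

store.write("seat-12A", "alice", expected_version=v_alice)
try:
    store.write("seat-12A", "bob", expected_version=v_bob)
    bob_succeeded = True
except VersionConflict:
    bob_succeeded = False  # Bob must re-read and decide how to resolve
```

No lock is ever held, so readers and writers never wait on each other; the cost is that the losing writer has to handle the conflict after the fact.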