In most of my experience as a Data Science at a large Bank, I have retrieved data from relational databases. More recently, we have built a new platform, where I have been working with Cassandra for timeseries on a Kubernetes cluster. I happened to watch this 1-hour presentation on NoSQL databases, which furthered my motivation to understand the topic, so I read a textbook on NoSQL by Pramod Sadalage and Martin Fowler. Below are my notes.

  • Polyglot Persistence is the idea of using different data stores in different circumstances. The term borrows from the term Polyglot Programming referring to multiple computer programming languages within an application.
  • NoSQL is a movement, not a technology.
  • Relational databases are not designed to run on clusters and thus scaling presents a challenge. Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for the resolution.
  • Relational databases often work very well. Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
  • Integration databases are a single source of data for multiple applications. The alternative paradigm is an Application database, which has a one to one relationship between storage and application.
  • Application databases are the paradigm of NoSQL. The application “knows” the database structure. A schemaless db shifts the schema into the application that accesses it. This type of db is more forgiving for evolving needs.
  • Relational databases are good for analyzing data. NoSQL databases are not flexible for querying.
  • There are 4 types of NoSQL dbs, the first 3 are called aggregate oriented data models.
    • Key-Value: Redis, Riak, Dynamo; value is opaque
    • Document: MongoDB, Couch; value has structure.
    • Column-family: Cassandra, HBase
    • Graph: NodeJs
  • An aggregate is a collection of related objects that are treated as an object. They form the boundaries of an ACID operation. It is central to running on a cluster.
  • Distribution Model. Demonstrates the trade-off between consistency and availability.
    • Single-server
    • Sharding: different parts of data onto different servers. Each server is a single source of a subset of data.
    • Master/slave replication: Replicating data across multiple nodes with one node the authority (master). Helps read scalability.
    • Peer-to-peer replication: Replicating data across all nodes, no authority. Helps write scalability. Linear scaling because no master (Cassandra is example).
  • Consistency:
    • Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
    • Pessimistic approach lock data to prevent conflicts, optimistic detects conflicts and fixes.
    • To get good consistency, many nodes should be involved but the reduces latency.
    • CAP Theorem: when you create partitions, you trade-off consistency with availability.
Figure