Which “Big Data”-base should you use?
Here is a look at some “big data” databases and why they are used.
When we talk about “big data,” we aren’t just talking about large amounts of data. We’re talking about lots of somewhat unruly data that doesn’t easily fall into the precisely defined tables of relational databases. NoSQL databases are popular choices when dealing with very large amounts of weakly structured data. Several NoSQL databases have this aim. So, how do they compare?
MongoDB
MongoDB is one of the most popular. It’s open source, it’s free, and it does many things well. Its central concept is the document, which is like a row in a relational database but more flexible. It groups documents in collections, which are analogous to SQL tables but don’t have a fixed structure. Programmers can add new fields without having to update the database schema.
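To make the document-and-collection idea concrete, here is a minimal sketch in plain Python. It models a collection as a list of dictionaries; it does not use MongoDB or its driver, and the `customers` data and the `find` helper are hypothetical names invented for illustration.

```python
# A "collection" modeled as a list of dicts. This only illustrates the idea
# of schema-free documents; it is not the MongoDB driver API.
customers = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ben", "email": "ben@example.com", "city": "Boston"},  # extra field
]


def find(collection, **criteria):
    """Return the documents whose fields match every given key/value pair."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# A new document can introduce a field the others lack, with no schema update.
customers.append({"name": "Cy", "email": "cy@example.com", "tags": ["vip"]})

in_boston = find(customers, city="Boston")
print(in_boston)
```

Documents without a `city` field simply fail to match the query; nothing has to be migrated when a new field appears.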
It’s common to use MongoDB in connection with Apache Hadoop, a framework for storing and processing very large collections of data. Its heart is a distributed file system that can span a cluster of computers. Both MongoDB and Hadoop support the MapReduce programming model for processing large amounts of data.
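The MapReduce idea can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce steps using a word count; it is not Hadoop or MongoDB code, and all the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

In a real cluster the map and reduce steps run on many machines at once, which is what lets the approach handle very large inputs.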
Cassandra and DataStax
Apache Cassandra is likewise designed to handle large amounts of data spanning multiple servers. It began life as a Facebook project before becoming publicly available through the Apache Software Foundation. Like MongoDB, it works with Hadoop and MapReduce. A commercial version is available from DataStax.
Cassandra’s design allows for easy scaling; adding a new machine to the cluster is largely a matter of telling the new node where to find the existing nodes. Its CQL query language is similar to SQL. Redundancy is designed in, with data replicated over multiple machines.
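As a rough illustration of how data can be replicated over several machines, the sketch below places nodes on a hash ring and stores each key on the next few nodes around the ring. It is a deliberate simplification under stated assumptions: the MD5-based `token` function, the node names, and the `replicas_for` helper are all hypothetical, and Cassandra’s actual partitioning and replication strategies are more involved.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Hash a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor=3):
    """Return the nodes that would hold copies of `key` in this simplified model."""
    if not nodes:
        return []
    ring = sorted(nodes, key=token)            # nodes ordered by ring position
    tokens = [token(node) for node in ring]
    start = bisect_right(tokens, token(key)) % len(ring)  # first node past the key
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:42", nodes)
print(owners)
```

Adding a machine in this model just means appending another name to `nodes`; the ring is recomputed and only some keys change owners.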
A Cassandra data store is called a keyspace. It contains “column families,” which, like SQL tables, consist of a number of rows. As with MongoDB, there’s no predefined schema; you can freely add columns to a column family.
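The keyspace, column family, and row layout can be pictured as nested dictionaries. The sketch below is plain Python rather than CQL or a Cassandra driver, and the `users` column family and the `set_column` helper are hypothetical names chosen for the example.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "users": {
        "alice": {"email": "alice@example.com"},
        "bob": {"email": "bob@example.com", "city": "Boston"},
    },
}


def set_column(keyspace, family, row_key, column, value):
    """Write one column, creating the column family and row if they are missing."""
    keyspace.setdefault(family, {}).setdefault(row_key, {})[column] = value


# Add a column to a single row; other rows in the family are unaffected.
set_column(keyspace, "users", "alice", "phone", "555-0100")
print(keyspace["users"])
```

Rows in the same column family end up with different sets of columns, which is the flexibility the paragraph above describes.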
Neo4j
Neo4j is a “graph database.” This doesn’t mean it stores charts, but that it organizes data in what computer scientists call a graph structure. Lines called “edges” (Neo4j calls them relationships) connect data nodes, and each edge has a type that names the relationship it represents; nodes and edges can also carry “properties,” which are key-value pairs. For instance, a Neo4j database could have data nodes named “Barack Obama” and “White House,” with an edge connecting them and identified with the type “lives in.” Graph databases are useful when semantic relationships among data objects are important.
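The node-and-edge structure described above can be sketched with a small Python class. This is an in-memory illustration only; it does not use Neo4j or its query language, and the `Graph` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node name -> properties
    edges: list = field(default_factory=list)  # (source, relationship, target)

    def add_node(self, name, **properties):
        """Store a node along with any key-value properties."""
        self.nodes[name] = properties

    def add_edge(self, source, relationship, target):
        """Connect two nodes with a labeled, directed edge."""
        self.edges.append((source, relationship, target))

    def related(self, source, relationship):
        """Follow edges with the given label out of `source`."""
        return [t for s, r, t in self.edges if s == source and r == relationship]


graph = Graph()
graph.add_node("Barack Obama")
graph.add_node("White House")
graph.add_edge("Barack Obama", "lives in", "White House")

residence = graph.related("Barack Obama", "lives in")
print(residence)
```

Queries in a graph database amount to following labeled edges like this, often across many hops.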
Which to choose?
MongoDB and Cassandra are both solid choices for large databases with a flexible structure. MongoDB is oriented more toward JavaScript programming, while Cassandra may offer easier learning to SQL programmers. Cassandra offers some advantages in scalability, while MongoDB is easier to set up.
Graph databases such as Neo4j are in a category of their own that involves a different way of thinking. They’re best suited to applications in which relationships between objects, such as “X lives in Y,” are of primary importance and can get complicated. For instance, a collection of information on a whole population, covering all their interconnections, would work well with a graph database.
Whatever your big data needs, I can help you to connect with the right people for developing your project. Please contact me for more information.