Amit Mund: NoSQL and Graph Databases

A few collected notes on graph databases:

http://en.wikipedia.org/wiki/Relational_database

A relational database is a database that has a collection of tables of data items, all of which is formally described and organized according to the relational model. The term is in contrast to only one table as the database, and in contrast to other models which also have many tables in one database.
In the relational model, each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row. A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table. The relational model offers various levels of refinement of table organization and reorganization called database normalization. The database management system (DBMS) of a relational database is called an RDBMS.
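
To make the primary key / foreign key idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (author, post, author_id) are made up for illustration and do not come from the article above.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Each table declares a primary key that uniquely identifies a row.
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE post ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES author(id))"  # foreign key
)
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO post (id, title, author_id) VALUES (10, 'Graph notes', 1)")

# Follow the foreign key from post to author with a join.
rows = conn.execute(
    "SELECT post.title, author.name "
    "FROM post JOIN author ON post.author_id = author.id"
).fetchall()

# A row that points at a missing primary key is rejected.
try:
    conn.execute("INSERT INTO post (id, title, author_id) VALUES (11, 'Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The join is how the relationship is read back, and the rejected insert shows the database itself guarding the link between the two tables.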

NoSQL: Not only SQL
http://en.wikipedia.org/wiki/NoSQL
http://nosql-database.org/

The name attempted to label the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide atomicity, consistency, isolation and durability guarantees that are key attributes of classic relational database systems.

A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key–value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput. NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used.

There have been various approaches to classify NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and overlaps it is difficult to get and maintain an overview of non-relational databases. Nevertheless, the basic classification that most would agree on is based on data model. A few of these and their prototypes are:

Column: HBase, Accumulo
Document: MongoDB, Couchbase, Apache CouchDB
Key-value: Dynamo, Riak, Redis, Cache, Project Voldemort, Apache Cassandra, Memcached
Graph: Neo4j, AllegroGraph, Virtuoso

Term	Matching Database
KV Store	Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
KV Store - Eventually consistent	Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
KV Store - Hierarchical	GT.m, Cache
KV Store - Ordered	TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
KV Cache	Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
Tuple Store	Gigaspaces, Coord, Apache River
Object Database	ZopeDB, db4o, Shoal
Document Store	CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
Wide Columnar Store	BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Document Store:
Different implementations offer different ways of organizing and/or grouping documents:

Collections
Tags
Non-visible Metadata
Directory hierarchies

Compared to relational databases, collections are roughly analogous to tables and documents to records. But they are different: every record in a table has the same sequence of fields, while documents in a collection may have fields that are completely different.
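
A small sketch of that difference, using plain Python dictionaries as stand-ins for documents. The collection name, keys and values are all hypothetical, and no real document store API is used.

```python
# One "collection" holding documents with different fields (all data hypothetical).
people = [
    {"_id": "p1", "name": "Alice", "email": "alice@example.com"},
    {"_id": "p2", "name": "Bob", "phones": ["555-0100", "555-0101"]},
    {"_id": "p3", "name": "Carol", "address": {"city": "Pune"}},
]

# A relational table would need one fixed set of columns covering every row.
all_fields = sorted({field for doc in people for field in doc})

# Code reading a collection has to cope with fields that may be absent.
with_email = [doc["name"] for doc in people if "email" in doc]
```

Here all_fields is the union of columns a single table would have to declare, most of them empty for any given row.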

Documents are addressed in the database via a unique key that represents that document. One of the other defining characteristics of a document-oriented database is that, beyond the simple key-document (or key–value) lookup that you can use to retrieve a document, the database offers an API or query language that allows retrieval of documents based on their contents. Some NoSQL document stores offer an alternative way to retrieve information using MapReduce techniques. In CouchDB, MapReduce is mandatory if you want to retrieve documents based on their contents; the result is called a "view", an indexed collection holding the output of the MapReduce functions.
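
The two access paths described above (lookup by key, and a MapReduce-style view over document contents) can be sketched in memory as follows. This is only an illustration of the idea: CouchDB views are defined differently, and the names docs, map_tags and build_view are invented for this example.

```python
from collections import defaultdict

# key -> document (hypothetical data)
docs = {
    "post1": {"author": "alice", "tags": ["nosql", "graph"]},
    "post2": {"author": "bob", "tags": ["nosql"]},
    "post3": {"author": "alice", "tags": ["sql"]},
}


def map_tags(doc_id, doc):
    """Map step: emit one (key, value) pair per tag found in the document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def build_view(documents, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for doc_id, doc in documents.items():
        for key, value in map_fn(doc_id, doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


by_key = docs["post2"]                        # simple key lookup
tag_counts = build_view(docs, map_tags, sum)  # content-based "view"
```

A real store would keep the view indexed and update it incrementally; this sketch recomputes it from scratch each time.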

Name	Language	Notes
Apache CouchDB	Erlang	JSON database
MongoDB	C++, C#, Go	BSON store (binary-format JSON)
SimpleDB	Erlang	Online service

Graph:
This kind of database is designed for data whose relations are well represented as a graph (elements interconnected with an undetermined number of relations between them). The kind of data could be social relations, public transport links, road maps or network topologies, for example.

FlockDB Scala
InfiniteGraph Java
Neo4j Java
AllegroGraph SPARQL RDF GraphStore

Key-Value Stores:

Key–value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model. The following types exist:

-> KV - eventually consistent
Apache Cassandra
Dynamo
Riak

-> KV - hierarchical
InterSystems Cache

-> KV - cache in RAM
memcached

-> KV - solid state or rotating disk
BigTable
Couchbase Server
MemcacheDB

-> KV - ordered
MemcacheDB
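
As a rough illustration of the schema-less idea, here is a toy key-value store in Python. It assumes that "ordered" means keys are kept sorted so that range scans are possible, which the list above does not spell out. The class name OrderedKV and its methods are invented for this sketch.

```python
import bisect


class OrderedKV:
    """Toy ordered key-value store: opaque values, keys kept sorted for range scans."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._data = {}   # key -> value of any type

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]


store = OrderedKV()
store.put("user:2", {"name": "Bob"})   # values need not share a type or shape
store.put("user:1", "Alice")
store.put("session:9", [1, 2, 3])

users = list(store.scan("user:", "user;"))  # ";" is the character after ":"
```

The store never inspects the values, which is why no fixed data model is needed; only the keys carry structure.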

Object database:

ObjectDB

Tabular:

Apache Accumulo
BigTable
Apache HBase

Hosted:

Datastore on Google Appengine
Amazon DynamoDB

Graph Databases:

http://en.wikipedia.org/wiki/Graph_database
http://en.wikipedia.org/wiki/Graph_theory

Graph databases are based on graph theory. Graph databases employ nodes, properties, and edges. Nodes are very similar in nature to the objects that object-oriented programmers will be familiar with.

Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of. Properties are pertinent information that relate to nodes. For instance, if "Wikipedia" were one of the nodes, one might have it tied to properties such as "website", "reference material", or "word that starts with the letter 'w'", depending on which aspects of "Wikipedia" are pertinent to the particular database. Edges are the lines that connect nodes to nodes or nodes to properties and they represent the relationship between the two. Most of the important information is really stored in the edges. Meaningful patterns emerge when one examines the connections and interconnections of nodes, properties, and edges.
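
A minimal in-memory sketch of nodes, properties and edges, assuming a simple property-graph layout where properties live on the node and edges carry a label. The names Node, Edge, neighbours and incoming are hypothetical and are not taken from any graph database API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Edge:
    source: str
    label: str
    target: str


nodes = {
    "Wikipedia": Node("Wikipedia", {"kind": "website", "use": "reference material"}),
    "Alice": Node("Alice", {"kind": "person"}),
    "Bob": Node("Bob", {"kind": "person"}),
}
edges = [
    Edge("Alice", "EDITS", "Wikipedia"),
    Edge("Bob", "READS", "Wikipedia"),
    Edge("Alice", "KNOWS", "Bob"),
]


def neighbours(name, label=None):
    """Names reachable from `name` over outgoing edges, optionally filtered by label."""
    return [
        e.target
        for e in edges
        if e.source == name and (label is None or e.label == label)
    ]


def incoming(name):
    """(source, label) pairs for the edges pointing at `name`."""
    return [(e.source, e.label) for e in edges if e.target == name]
```

Asking who points at "Wikipedia" is a question about edges rather than about any single node, which is the sense in which most of the information sits in the connections.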

Graph database projects

Neo4j -> A highly scalable open source graph database that supports ACID, has high-availability clustering for enterprise deployments, and comes with a web-based administration tool that includes full transaction support and visual node-link graph explorer. Neo4j is accessible from most programming languages using its built-in REST web API interface. Neo4j is the most popular graph database in use today.

http://en.wikipedia.org/wiki/Big_data

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10¹⁸) bytes of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

External Links:
http://docs.neo4j.org/chunked/stable/cypher-query-lang.html
http://readwrite.com/2011/04/20/5-graph-databases-to-consider#awesm=~ojivymypzN5PBX
http://jasperpeilee.wordpress.com/2011/11/25/a-survey-on-graph-databases/
http://stackoverflow.com/questions/tagged/neo4j
http://docs.neo4j.org/chunked/milestone/introduction-pattern.html#_working_with_relationships

http://readwrite.com/2011/04/20/5-graph-databases-to-consider

Of the major categories of NoSQL databases - document-oriented databases, key-value stores and graph databases - we've given the least attention to graph databases on this blog. That's a shame, because as many have pointed out it may become the most significant category.

Graph databases apply graph theory to the storage of information about the relationships between entries. The relationships between people in social networks are the most obvious example. The relationships between items and attributes in recommendation engines are another. Yes, it has been noted by many that it's ironic that relational databases aren't good for storing relationship data. Adam Wiggins from Heroku has a lucid explanation of why that is here. Short version: among other things, relationship queries in RDBMSes can be complex, slow and unpredictable. Since graph databases are designed for this sort of thing, the queries are more reliable.
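
To give a feel for why relationship queries get complex in an RDBMS, here is a small sketch using Python's sqlite3 module. The follows table and its rows are hypothetical. It only shows how the query shape grows with each hop and measures nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("bob", "dan"), ("cat", "eve")],
)

# Two hops need one self-join ...
two_hops = conn.execute(
    "SELECT DISTINCT f2.followee "
    "FROM follows AS f1 "
    "JOIN follows AS f2 ON f1.followee = f2.follower "
    "WHERE f1.follower = ? ORDER BY f2.followee",
    ("ann",),
).fetchall()

# ... and every extra hop adds another join to the query text.
three_hops = conn.execute(
    "SELECT DISTINCT f3.followee "
    "FROM follows AS f1 "
    "JOIN follows AS f2 ON f1.followee = f2.follower "
    "JOIN follows AS f3 ON f2.followee = f3.follower "
    "WHERE f1.follower = ? ORDER BY f3.followee",
    ("ann",),
).fetchall()
```

A query for "anyone reachable within N hops" has no fixed join count at all, which is where traversal-oriented stores become attractive.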

Google has its own graph computing system called Pregel (you can find the paper on the subject here), but there are several commercial and open source graph databases available. Let's look at a few.

Neo4j

This is one of the most popular databases in the category, and one of the only open source options. It's the product of the company Neo Technology, which recently moved the community edition of Neo4j from the AGPL license to the GPL license (see our coverage here). However, its enterprise edition is still licensed under the AGPL. Neo4j is ACID compliant. It's Java based but has bindings for other languages, including Ruby and Python.
Neo Technology cites several customers, though none of them are household names.

An InfoQ article by Neo Technology COO Peter Neubauer includes a fun illustration of how relationship data in graph databases works.

FlockDB

FlockDB was created by Twitter for relationship related analytics. Twitter's Kevin Weil talked about the creation of the database, along with Twitter's use of other NoSQL databases, at Strange Loop last year. You can find our coverage here.

There is no stable release of FlockDB, and there's some controversy as to whether it can be truly referred to as a graph database. In a DevWebPro article Michael Marr wrote:

The biggest difference between FlockDB and other graph databases like Neo4j and OrientDB is graph traversal. Twitter's model has no need for traversing the social graph. Instead, Twitter is only concerned about the direct edges (relationships) on a given node (account). For example, Twitter doesn't want to know who follows a person you follow. Instead, it is only interested in the people you follow. By trimming off graph traversal functions, FlockDB is able to allocate resources elsewhere.

This led MyNoSQL blogger Alex Popescu to write: "Without traversals it is only a persisted graph. But not a graph database."

However, because it's in use at one of the largest sites in the world, and because it may be simpler than other graph DBs, it's worth a look.
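
The distinction drawn above between direct edges and traversal can be sketched in a few lines of Python. This is not FlockDB's API. The follows data and both function names are made up to contrast a one-hop adjacency lookup with a multi-hop breadth-first traversal.

```python
from collections import deque

# Hypothetical follower graph: user -> set of users they follow.
follows = {
    "ann": {"bob"},
    "bob": {"cat", "dan"},
    "cat": {"eve"},
    "dan": set(),
    "eve": set(),
}


def direct_follows(user):
    """Direct edges only: the kind of lookup the quote says Twitter needs."""
    return sorted(follows.get(user, set()))


def reachable_within(user, max_hops):
    """Breadth-first traversal: map each user reachable within max_hops to its hop distance."""
    distance = {user: 0}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in follows.get(current, set()):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    del distance[user]  # the start user is not part of the result
    return distance
```

The first function touches one adjacency set; the second has to walk the graph, which is the capability Popescu argues a graph database should provide.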

AllegroGraph

AllegroGraph is a graph database built around the W3C spec for the Resource Description Framework. It's designed for handling Linked Data and the Semantic Web, subjects we've written about often. It supports SPARQL, RDFS++, and Prolog.

AllegroGraph is a proprietary product of Franz Inc., which markets a number of Semantic Web products - including its flagship set of LISP-based development tools. The company claims Pfizer, Ford, Kodak, NASA and the Department of Defense among its AllegroGraph customers.
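
RDF describes data as subject-predicate-object triples, and SPARQL queries match patterns against them. The following is only a toy illustration of that pattern-matching idea in Python. It is not SPARQL and not AllegroGraph's API, and the triples and the match function are hypothetical.

```python
# Hypothetical triples: (subject, predicate, object).
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern, sorted for stable output.

    None acts as a wildcard, playing roughly the role of a query variable.
    """
    return sorted(
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


who_knows_whom = match(predicate="knows")
about_alice = match(subject="alice")
```

Because every fact has the same three-part shape, any position can be left open in a query, which is what makes the model convenient for linked data.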

GraphDB

GraphDB is a graph database built in .NET by the German company sones. sones was founded in 2007 and received a new round of funding earlier this year, said to be a "couple million" Euros. The community edition is available under an APL 2 license, while the enterprise edition is commercial and proprietary. It's available as a cloud-service through Amazon S3 or Microsoft Azure.

InfiniteGraph

InfiniteGraph is a proprietary graph database from Objectivity, the company behind the object database of the same name. Its goal is to create a graph database with "virtually unlimited scalability."
According to Gavin Clarke at The Register: "InfiniteGraph map is already being used by the CIA and Department of Defense running on top of the existing Objectivity/DB database and analysis engine."

Others

There are many more graph databases, including OrientDB, InfoGrid and HypergraphDB. Ravel is working on an open source implementation of Pregel. Microsoft is getting into the game with the Microsoft Research project Trinity.

You can find more by looking at the Wikipedia entry for graph databases or NoSQLpedia.

http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html

http://jasperpeilee.wordpress.com/2011/11/25/a-survey-on-graph-databases/

Friday, 4 October 2013