The NoSQL Movement

By Xah Lee. Date: 2010-01-26

In the past few years, a fashionable line of thinking has emerged against relational databases, now blessed with a rhyming term: NoSQL. Basically, it holds that the relational database is outdated and not “horizontally” scalable. I'm quite dubious of these claims.

fault-tolerance NoSQL. Comic by John Muellerleile (@jrecursive, https://twitter.com/jrecursive)

According to the Wikipedia Scalability article, vertical scalability means adding more resources to a single node, such as more CPU or memory. (You can easily do this by running your db server on a more powerful machine.) “Horizontal scalability” means adding more machines. Indeed, this is not simple with SQL databases, but it is the same situation with any software, not just databases. To add more machines to run one single piece of software, the software must have some sort of grid computing infrastructure built in. This is not a flaw of the software per se; it is just the way things are. It is not a problem specific to databases.

I'm quite old-fashioned when it comes to computer technology. To convince me of some revolutionary new-fangled technology, i must see improvement based on a mathematical foundation. I am an expert in SQL, and believe that the relational database is pretty much the gist of databases with respect to math. Sure, a tight definition of the relations in your data may not be necessary for many applications that simply need to store, retrieve, and modify data without much concern about the relations among them. But still, relational database technology does that too. You just don't worry about normalizing when you design your table schema.

The NoSQL movement is really a scaling movement: it is about adding more machines, about so-called “cloud computing”, and about services with simple interfaces. (Like so many fashionable movements in the computing industry, it is not well defined.) It is not really against relational design in your data. It's more about adding features for practical needs, such as providing easy-to-use APIs (so the db users don't have to be knowledgeable about SQL or schemas), the ability to add more nodes, commercial interface services to your database, and parallel systems that access your data. Of course, these needs have all been met by the big old relational database companies such as Oracle over the years, as they constantly adapt to the industry's changing needs and to cheaper computing power. If you need any relations in your data, you can't escape the relational database model. That is just the cold truth of math.

Important data, such as that used in bank transactions, has relations. You need tight relational definitions and assurance of data integrity.
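To make the data integrity point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `account` table, its columns, and the function names are made up for illustration; a real bank schema would be far more involved. The idea is only that the database itself refuses a negative balance, and a failed transfer leaves no half-finished update behind.

```python
import sqlite3

def make_bank():
    """Create an in-memory database (hypothetical schema) whose balances can never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)]
    )
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move amount from src to dst. Either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first on purpose, to show that the rollback undoes it.
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; nothing was changed.
        return False

def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

Calling `transfer(conn, 1, 2, 500)` on the accounts above returns `False`, and `balances(conn)` is unchanged, even though the credit to account 2 had already been issued inside the transaction.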

Here is a second-hand quote from Microsoft Technical Fellow David Campbell: http://reddevnews.com/blogs/data-driver/2009/12/nosql-heat_0.aspx

I've been doing this database stuff for over 20 years and I remember hearing that the object databases were going to wipe out the SQL databases. And then a little less than 10 years ago the XML databases were going to wipe out…. We actually … you know… people inside Microsoft, [have said] 'let's stop working on SQL Server, let's go build a native XML store because in five years it's all going….'

LOL. That's exactly my thought.

Though, i'd need some hands-on experience with one of those new database services to see what it's all about.

Amazon S3 and Dynamo

Look at the Wikipedia article Structured storage. That seems to be what these NoSQL databases are. Most are just a key-value pair structure, or just storage of documents with no relations. I don't see how this differs from a SQL database using one single table as its schema.

Amazon S3 is another storage service which, per its Wikipedia article, uses Amazon's Dynamo (storage system), itself listed by Wikipedia as one of those NoSQL dbs. Looking at the S3 and Dynamo articles, it appears the db is just a distributed hash table system with an added HTTP access interface. So, basically, little or no relations. Again, i don't see how this is different from, say, MySQL with one single table of 2 columns, plus distributed infrastructure. (A distributed database is often an integrated feature of commercial dbs. For example, the Wikipedia Oracle database article cites Oracle Real Application Clusters.)
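The "one single table of 2 columns" comparison can be sketched in a few lines. This uses Python's built-in sqlite3 rather than MySQL, only so the example runs without a server; the class name `TwoColumnStore` and the table name `kv` are made up for illustration. It says nothing about how Dynamo or S3 are actually implemented, and it leaves out the distributed part entirely.

```python
import sqlite3

class TwoColumnStore:
    """A key-value store that is nothing more than one SQL table with two columns."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def put(self, key, value):
        # INSERT OR REPLACE gives the usual "last write wins" put semantics.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def delete(self, key):
        with self.conn:
            self.conn.execute("DELETE FROM kv WHERE k = ?", (key,))
```

Usage is what you would expect from any key-value API: `store.put("a", b"1")`, `store.get("a")`, `store.delete("a")`. No joins, no schema design, no normalization.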

Here is an interesting quote on S3:

Bucket names and keys are chosen so that objects are addressable using HTTP URLs:

http://s3.amazonaws.com/bucket/key

http://bucket.s3.amazonaws.com/key

http://bucket/key (where bucket is a DNS CNAME record pointing to bucket.s3.amazonaws.com)

Because objects are accessible by unmodified HTTP clients, S3 can be used to replace significant existing (static) web hosting infrastructure.

So this means, for example, i can store all my images in S3, and in my HTML document, the inline images are just normal img tags with normal URLs. This applies to any other type of file: PDF, audio, and HTML too. So, S3 becomes the web host server as well as the file system.
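As a small illustration of the three URL forms in the quote, here is a sketch that builds them for a given bucket and key, plus a plain img tag pointing at one of them. The function names, and the bucket and key in the usage note, are hypothetical; the code only formats strings and does not talk to S3.

```python
from html import escape
from urllib.parse import quote

def s3_urls(bucket, key):
    """Return the three URL forms from the quoted passage for one object."""
    k = quote(key)  # percent-encode the key but keep "/" separators
    return [
        f"http://s3.amazonaws.com/{bucket}/{k}",
        f"http://{bucket}.s3.amazonaws.com/{k}",
        # Only resolves if bucket is a DNS CNAME pointing to bucket.s3.amazonaws.com.
        f"http://{bucket}/{k}",
    ]

def img_tag(url, alt=""):
    """Build an ordinary HTML img tag; nothing S3-specific is needed."""
    return f'<img src="{escape(url, quote=True)}" alt="{escape(alt, quote=True)}">'
```

For example, with a made-up bucket `media.example.com` and key `img/cat 1.png`, the third URL is `http://media.example.com/img/cat%201.png`, and `img_tag` wraps it like any other image URL.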

Here is an instruction on how to use it as an image server. Seems quite simple: How to use Amazon S3 for hosting web pages and media files? ~~http://www.bucketexplorer.com/documentation/amazon-s3--how-to-use-Amazon-s3-for-web-hosting.html~~

Google BigTable

Another is Google's BigTable. I can't make much comment. To make a sensible comment, one must have some experience of actually implementing a database. For example, a file system is a sort of database. Suppose i created a scheme that allows me to access my data as files in NTFS, distributed over hundreds of PCs that communicate thru HTTP, each running Apache. This would let me access my files. To insert or delete data, one can have CGI scripts on each machine. Would this be considered a new fantastic NoNoSQL?
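The routing part of such a scheme could be sketched like this: hash the file name to decide which PC holds it, then form an ordinary HTTP URL on that machine. The host names and the `/files/` path are assumptions for illustration. This is plain modulo hashing, not the consistent hashing that distributed hash tables typically use, so adding a machine would reshuffle most keys.

```python
import hashlib
from urllib.parse import quote

def node_for(key, hosts):
    """Pick the machine that owns a key by hashing the key (simple modulo scheme)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]

def url_for(key, hosts):
    """URL where the file would be served; the /files/ prefix is hypothetical."""
    return f"http://{node_for(key, hosts)}/files/{quote(key)}"
```

Every client that knows the same host list computes the same URL for the same key, which is all the "distributed" lookup amounts to in this toy version.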