What a ridiculous name yes? SQL is not what is meant by NoSQL, and now we are seeing SQL interfaces to NoSQL databases. It is one of those terms that do not accurately represent what the discussion is about.
List of NoSQL requirements
I am fairly neutral on using SQL vs NoSQL. I have been using BDB, Filesystems and many other NoSQL solutions for 20 years, and I have been using RDBMS for almost as long. They all fit nicely.
This section discusses the main arguments for NoSQL and my interpretation of how they can be done using an RDBMS.
Just create a single table: http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Basically you can provide the same fields that many of the NoSQL systems do, e.g. ID, Body, Last Udpated, etc.
BTW. This is not a lesson in "you should use schemaless" - just showing that you can achieve schemaless in an RDBMS.
See DBAs below.
Eventual consistency (aka Performance)
Eventual consistency is not really a requirement of NoSQL. It is a benefit, if it provides performance boost, which is the main claim.
Clustering aka Scalability
One of the first reasons. The comment is that NoSQL database designed to scale out, while RDBMs ar designed to scale up. This is a bit old school thinking. 10 years ago it was hard work, not now.
Drizzle is adding a native JSON interface.
This is the one hardest to understand. There is no less management now than there was.
The major issue here is that your scenario of data use and data storage changes per product. Removing the ability to customise is not a good thing. A standard RDBMS may allow you to add indexes, improving a query (on the fly too, no software changes required) at some loss on updates - you get to choose, and you get to choose per table, even per query. And that is only a minor example of what you can change.
But your DBA is not just an index creator - really most systems running a database is a system admin. They look at backup, recovery, system admin, log management etc. There is no difference here.
Schemaless: It has been a while since I have worked with a full time dedicated DBA in a company, but when I did, they would evaluate your schema changes thoroughly. They became the keepers of sensible structures, data that is understandable and kept over long periods. Now this is not a given with any structure, and worse than that, most schemas have not been well thought out. Now take away the schema, and allow anything to be stored anywhere. Most projects deal with this by adding an object model, which is reviewed instead. That is just moving the schema - but in a way now that makes it difficult for a heterogeneous environment to work (e.g. multiple programming languages).
One thing I can agree on, applications still need better defaults on installation. Although tuning a database is dependent on the use cases, data stored etc, you do seem on many products to have better setup - and this is the case, we are seeing products constantly improving. Just installing a package in linux now has a fraction of the questions.
I like this quote: "In short, NoSQL solutions offer a way to store and use large amounts of data, but with: Less overhead, Less work, Less downtime, Faster results" - it is not a magic bullet.
"I work for a NoSQL database company. The NoSQL advantages that our customers commonly relate are performance, scalability and ease of development (compared to RDBMS')." (see ref below).
- Performance - covered above.
- Scalability - covered above (cluster)
- Ease of Development - definitely not.
Most NoSQL developers will tell you that the learning curve and nuances of developing, especially with things like eventual consistency add to development time. I think this also comes down to the style of developers we are seeing out of education. The one language does everything approach. The desire to wrap anything that is not he one language in an object model, or dealing with native structures.
Therefore there is a lack of developers who know how to do a query. But this only is a problem because it is an advantage. If you can do structured queries, joins etc, then you get a language that makes that easy. If however you are doing just a simple query to get a single row by an ID, then you don't have to learn very much anyway, and I am sure there are plenty of wrappers to help you in every language. This goes onto a whole new topic though...
I have my own personal advantages that I would like to consider.
Lately I have been developing more with NodeJS because it allows me to run a fairly large number of concurrent connections with good speed on a very very light server. It used to be that servers grew so fast, I never really thought about memory and CPU. Micro instances of Amazon servers, and many other virtualisation frameworks, has made me rethink some of this. Sometimes a small idea with very little revenue may still be very social and have a large number of users and specifically concurrent users. I have not looked at how things compare yet - but MySQL, although light for an RDBMS, is not small. It is definitely the top object on all my servers for memory. Something to consider...
To NoSQL ?
An interesting read on Dr Dobb's - http://drdobbs.com/blogs/database/229400759 - lead me to think about the statements made there and analysis.
Statistics is a wonderful thing, and even better is conclusions. They are wonderful because you can practically determine any thing by just excluding some statistic or drawing your own conclusion.
I have seen all of the problems defined in the article in projects no matter what database or system they have used. The problems described in the article are architectural ones which unfortunately too often happen. I have been prone to making the same mistakes, especially when a project grows quicker than development time or costs. Classic mistake of adding monitoring only after an incident.
One statement in the article is also very misleading. The author talks about eventual consistency and NoSQL - eventual consistency is only some of the NoSQL servers. The use of the NoSQL server is too big a bucket. It is a bit like saying that we have Physiotherapy and Medicine - just lumping all other database together causes us to make assumptions.
One NoSQL system that is often not mentioned is the file system. Not only does this allow very easy key/value pairs, can be replicated, networked, sharded - but it has been around since before RDBMS and is still heavily used. Many products out there still use the file system for large data - e.g. attachments.
I have seen mutual locking bugs in file systems (try not to use locks... there is almost always better ways), but I have never seen an article on the net complaining about the back end.
The article goes on to explain that we don't have 25 years academic research into these databases - yet Berkley DB, File system and many other key value data stores has been around and used in projects for longer than RDBMS.
There is lots of cases on the internet documenting failures of MySQL. This is to be expected, and most of these are to do (IMO) with the architecture of the system rather than MySQL.
Fortunately the article does say that the root cause may be architectural, montoring etc - he does say we have far more experience with SQL. I don't see projects who choose SQL over an non-SQL alternative doing more or less monitoring.
There is some things that are worth consideration if you are choosing a NoSQL modern daemon such as:
- Eventual consistency - is the performance benefit worth the coding issues
- API - you can plug in JDBC, DBI, ODBC into most SQL but not all NoSQL sources.
- NoSQL is not a magic bullet, any more than SQL is.
Consider your data carefully and decide on the best approach.
- If you are storing data that is used relationally - you would be crazy not to use a relational database. Invoices, items, inventory - these things relate very nicely.
- If you are storing documents such as a CMS then consider the file system, also consider version control systems. Obviously though you need to keep metadata available - then you should probably use an RDBMS.
- Want to store authentication tokens (from a cookie) or users - consider a flat database (this is what we used to call NoSQL before NoSQL became popular term). Also consider an in memory cache. If you can use it in the application (e.g. Apache itself if you are authenticating there), or if you need multiple applications accessing the data something like Memcached.
- How about syncing data between a mobile application and a desktop application. Lets say a todo list. You have millions of users, each having their own set of data. This could be an opportunity to use a relational database per user. My choice for this is to use SQLite in a file system. SQLite locks the whole database during transactions, but in this case you have probably only a single user accessing your queries. This works well if data structures do not change very often.
- Need complex text search - consider Lucene. I particularly like CLucene.
- Caching pages created - consider memcached.
Developers have been using non SQL / RDBMS solutions in products for ages. There has been some fantastic innovations lately, and this has lead to some interesting advocacy. When I started coding you were frowned upon to not program in C, now it is probably Java. Lets not make RDBMS any more the answer than IBM was. There is no one solution, different solutions have different advantages and different problems. Java is not the only programming language; SQL is not the only database.
There is a lot more to consider and here is just some of them:
- Security - products like memcached are designed to be fast, but low security. That is by design, but needs to be considered before storing some data.
- Scaling - the holy grail - what is scaling anyway? If you want to scale AdHoc queries, or high number of writes, or high reads, or common queries, or page generations, or authentication - these all require different solutions.
- MORE to come.