A few years ago I joined a company that was in the process of selecting a new ERP system. The vendors were the usual suspects: consultants peddling SAP, Microsoft, Oracle and a few other ERPs. I got to quiz the vendors about the features and capabilities of the systems they were selling. I wanted to find out how costly integration with my company's other enterprise systems would be, so I asked every vendor whether their system exposed its functionality as web services. All of them did, of course. Then I asked what I considered a very logical and simple next question: since we were going to orchestrate data updates across multiple disparate systems, I needed to know whether their web services supported transactions, like WS-Transaction from the WS-* stack. I got blank stares and promises to find out. I asked whether other enterprise architects had ever asked them about this before, and to my huge surprise all of them said no. I asked how many ERP implementations they had under their belts, and all of them had dozens. Later all the vendors got back to me with astonishing news: none of their systems' web services supported transactions. That meant garbage data was bound to accumulate over time, and nobody even thought that was a problem.
That struck me as very odd and made me think: in my day-to-day work even mid-level developers usually have a decent grasp of what transactions are for and often don't need supervision in applying them, as long as we are talking about SQL programming or writing the data access layer of a business application. But as soon as people leave the database world, somehow even professional enterprise software integrators become completely unconcerned about transactions. That matched my experience of virtually every enterprise system I have ever encountered: lots of garbage data in it, requiring lots of effort and money to cleanse. The conclusion was inescapable: by and large, as a practical matter, corporations are pretty comfortable not having transactions guard the integrity of their data. As a matter of fact, companies don't care about data consistency.
Now, my thinking went like this: if people voluntarily give up transactions without getting anything in return, what could be gained if transactions were avoided not by neglect but by design? Well, if we believe the CAP theorem, letting go of data consistency should let us gain high availability and partition tolerance, which translates into high scalability. And, ladies and gentlemen, that is exactly what "big data" management systems offer: high scalability and high availability, if you can live with eventual consistency of the data. And since lots of companies are not even trying to achieve data consistency, switching to NoSQL-based "big data" platforms becomes a no-brainer.
Now, NoSQL and "big data" have become such incredibly abused buzzwords that I need to stop for a moment and state that, for example, MongoDB, in my opinion, although a very fast NoSQL data management system, is not necessarily a *big data* system, because it was not designed to be one - it was built for speed and sacrificed parts of CAP to achieve high performance. Then let's look at Hadoop. No question, that's a big data management system. But the main problem is that it's not for on-line data processing - it's strictly batch processing using the map/reduce approach. And if you want to set up a Hadoop cluster, it's a pretty expensive proposition.
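To make the batch-oriented map/reduce point concrete, here is a minimal word-count sketch in the Hadoop Streaming style, written in plain Python. The file layout and invocation are purely illustrative, not Hadoop's actual tooling - the point is simply that you process a whole data set as a batch of key/value passes, not individual on-line requests.

```python
#!/usr/bin/env python
# Minimal map/reduce word count in the Hadoop Streaming style.
# Mapper emits "word\t1" pairs; reducer sums counts per word.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as contiguous runs.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(count) for _, count in group)))

if __name__ == "__main__":
    # Illustrative usage: word_count.py map < input | sort | word_count.py reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```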
All that said, I argue that the first truly useful, general-purpose big data management system for on-line processing was Amazon's AWS DynamoDB. It has eventual consistency, no transactions, a limited returned data set size, and other restrictions, but it scales in a nearly linear manner. Then Microsoft came up with Azure Tables and now DocumentDB. Even though you may object that eventual consistency is not really on-line processing, I'd say the latency of these systems is tolerable enough that they can be considered pseudo-online.
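For what it's worth, here is roughly what an eventually consistent DynamoDB read looks like with boto3; the "Orders" table and its key are made up for the example. The eventually consistent read (ConsistentRead=False, which is the default) is the cheaper, more scalable path, and it is exactly the trade-off I'm describing: right after a write you may briefly see stale or missing data.

```python
# Sketch of an eventually consistent read from DynamoDB via boto3.
# "Orders" and "order_id" are hypothetical names for illustration.
import boto3

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("Orders")

# Eventually consistent read: may return stale data shortly after a write.
response = orders.get_item(Key={"order_id": "12345"}, ConsistentRead=False)
item = response.get("Item")  # None if the item has not propagated yet
```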
Now let's review the landscape again: transactions are already being abandoned, and non-transactional, highly scalable data management systems are available as part of the PaaS stacks from Amazon and Microsoft, so... there is pretty much no reason to have your data processing strategy depend completely on ACID databases. Moreover, if we, developers, train ourselves to deal with the more complex DAL tiers underpinned by Azure and AWS eventually-consistent big data engines, there is no real reason to keep ACID databases as the default position, which amounts to "everyone, to the cloud!"
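As one sketch of what that "more complex DAL tier" can mean in practice: instead of assuming a read immediately reflects the last write, the DAL can retry a read a few times before concluding the item is missing. The table name, key, and timings below are illustrative only, and this is just one of several patterns (alongside idempotent writes and compensating actions), not a complete recipe.

```python
# Sketch of a DAL helper that tolerates eventual consistency by retrying reads.
# Names and delays are hypothetical; tune or replace for a real system.
import time
import boto3

def get_item_with_retry(table_name, key, attempts=3, delay_seconds=0.2):
    table = boto3.resource("dynamodb").Table(table_name)
    for attempt in range(attempts):
        item = table.get_item(Key=key).get("Item")
        if item is not None:
            return item
        time.sleep(delay_seconds * (attempt + 1))  # back off and try again
    return None  # caller must be written to tolerate "not there yet"

order = get_item_with_retry("Orders", {"order_id": "12345"})
```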