The story begins with Neal Ford's 2006 post on polyglot programming. People started thinking about when to use which kind of programming language: weakly typed scripting languages vs. strongly typed compiled ones, functional vs. object-oriented languages, and so on:
> Applications of the future will take advantage of the polyglot nature of the language world. … We should embrace this idea. … It's all about choosing the right tool for the job and leveraging it correctly.
Then, in 2011, Martin Fowler came along and coined the term polyglot persistence, prompting people to question whether a relational database is truly the best fit to store their data for every kind of workload or use case:
> I'm confident to say that if you're starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. … One of the interesting consequences of this is that we are gearing up for a shift to polyglot persistence where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data.
Below I'll argue that, at the time of writing (mid 2014), we are witnessing a transition from polyglot persistence to something new: something that focuses on how data is manipulated and queried rather than on how or where it is stored.
In order to appreciate what is happening, let's step back a bit and look at two examples of the partly heated and controversial discussions going on these days:
- Only recently, we saw a number of independent discussions around the 'death' of MapReduce, be it triggered by Google introducing its new Big Data SaaS offering Cloud Dataflow or by Apache Spark's increasing popularity.
- The many SQL-on-Hadoop offerings that currently exist; though it is reasonable to expect some consolidation to take place, it is likely that specialised solutions will find a niche for certain workloads.
And last but not least, there is polyglot persistence's dirty little secret (and don't get me wrong, I'm a huge fan of it; it's just that I think there's an often overlooked challenge buried here): the data managed by the isolated datastores needs to be brought together. Depending on your requirements and preferences, you might end up with a more loosely or more tightly coupled system that integrates and synchronises the data in the different participating datastores. But you need to make up-front choices about your authoritative data source, and that is not something you change lightly.
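To make the integration challenge concrete, here is a minimal sketch, with two in-memory classes standing in for, say, a relational store and a search index (the names `OrdersSQL`, `OrdersSearchIndex`, and `save_order` are hypothetical, chosen for illustration). The point is that the write path itself encodes the choice of authoritative source:

```python
# Illustrative sketch of the polyglot-persistence integration challenge:
# two in-memory "datastores" stand in for an RDBMS and a search index.

class OrdersSQL:
    """Stand-in for the authoritative relational store."""
    def __init__(self):
        self.rows = {}
    def upsert(self, oid, data):
        self.rows[oid] = data

class OrdersSearchIndex:
    """Stand-in for a secondary, derived datastore."""
    def __init__(self):
        self.docs = {}
    def index(self, oid, data):
        self.docs[oid] = data

def save_order(sql, search, oid, data):
    # The write order bakes in the authoritative-source decision:
    # SQL is written first, and the index is synchronised from it.
    sql.upsert(oid, data)
    search.index(oid, sql.rows[oid])

sql, search = OrdersSQL(), OrdersSearchIndex()
save_order(sql, search, 42, {"customer": "acme", "total": 99.0})
print(sql.rows[42] == search.docs[42])  # True
```

Reversing that decision later (making the search index authoritative, for instance) means rewriting every write path, which is why the choice is hard to change lightly.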
So here we are. The above observations motivate me to suggest a new term that aims to capture the shift of focus toward the processing of data: polyglot processing, which essentially means using the right processing engine for a given task. To the best of my knowledge, no one has suggested or attempted to define this term yet, apart from a somewhat related mention in the realm of the Apache Bigtop project, albeit in a much narrower context.
Example manifestations of polyglot processing are meta-architectures such as the Lambda Architecture, coined by Nathan Marz, and the Kappa Architecture, suggested by Jay Kreps. Abstracting a little, both architectures have in common that they utilise different compute components, be it in batch mode or in streaming mode, to achieve the overall goal.
*Figure: the Lambda Architecture and the Kappa Architecture.*
And while many people (correctly) point out that combining real-time processing and batch processing has been done for many years, I argue that only in the past five or so years has a wider audience actually been able to gain substantial experience with large-scale processing systems. This is due, on the one hand, to the commoditisation of infrastructure (such as the IaaS cloud offerings from AWS or Google Cloud) and, on the other hand, to the availability of open source processing frameworks such as Hadoop or Spark. This increase in collective experience is, I believe, the reason we see the polyglot processing meme emerging now.
An implication of polyglot processing is that one ends up dealing with a number of data manipulation and query engines, such as Kafka, Storm, Spark, or Drill, and it is advisable to have a unified persistence layer that enables all of these tools to access the data. A distributed filesystem such as HDFS represents exactly this: a flexible, highly scalable, and cost-effective way to store the data.
This does not contradict polyglot persistence. The data may exist in a variety of formats (from CSV to Parquet), and it may live directly in the DFS or in higher-level abstractions such as HBase. The point is that the compute comes to the data; put another way, the (potentially very large) data is moved as little as possible, thereby reducing network traffic, CPU load, and RAM usage.
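The "compute comes to the data" principle can be sketched without a cluster. In this toy example (the function `node_scan` and the partition data are invented for illustration), the predicate is shipped to where a partition lives and only the small result set crosses the network, rather than the full dataset being pulled and filtered client-side:

```python
# Sketch of moving compute to data: ship the predicate to the node holding
# the partition so only the matching rows leave the node.

def node_scan(local_partition, predicate):
    """Runs on the node that stores the partition; only matches are returned."""
    return [row for row in local_partition if predicate(row)]

# 1000 rows live "on the node"; the caller only receives the 4 matches.
partition = [{"id": i, "val": i * 10} for i in range(1000)]
result = node_scan(partition, lambda r: r["val"] > 9950)
print(len(result))  # 4
```

In HDFS-backed systems such as Spark or Drill, the scheduler achieves the same effect by placing tasks on the nodes that hold the relevant blocks.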