Principia Data

Obviously, the post title is inspired by the Principia Mathematica, without claiming to be as comprehensive or influential. With that out of the way, let's focus on the underlying characteristics of data and on what they imply for processing and storing it.

Shape

Data comes in different shapes: tabular, nested, graphy.

As I have pointed out here and there, in terms of data shapes, one can distinguish between logical and physical data layouts:

A non-exhaustive taxonomy for logical and physical data layouts and serialisation formats, Notes on Physical & Logical Data Layouts (2013).

When you process data, you want to be aware of the physical layout in order to exploit it (for example, column-oriented formats such as Parquet for analytical workloads), and you also want to accommodate the logical layout, be it explicitly or through interfaces (CSV file vs. Google Spreadsheet).
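To make the point about exploiting the physical layout a bit more tangible, here is a minimal sketch in plain Python. It is not how Parquet works internally; the records, field names, and the `to_columns` helper are all made up for illustration. It only shows why an analytical query over a single field benefits from a column-oriented layout.

```python
# Hypothetical records; the field names are made up for illustration.
rows = [
    {"user": "alice", "country": "IE", "amount": 12.5},
    {"user": "bob", "country": "DE", "amount": 7.0},
    {"user": "carol", "country": "IE", "amount": 3.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: every whole record is visited to sum a single field.
total_from_rows = sum(record["amount"] for record in rows)

# Column-oriented: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both layouts carry the same logical (tabular) data. The difference is which access pattern each one makes cheap: whole records in the first case, single fields across many records in the second.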

Granularity

Data has granularity, or rather, we choose to treat it with a certain granularity.

One valid definition of data granularity is Wikipedia's, although it's arguably a simplified one. To appreciate the real depth of it, I suggest you read Martin Kleppmann's post Stream Processing, Event Sourcing, Reactive, CEP … and making sense of it all, where he concisely makes the case for raw events vs. aggregates, including their use cases and pros and cons.

In this post Martin also makes the case that one can have both fast reads and fast writes when decoupling the input and output schemata. Just read the post, I honestly cannot add more here ;)
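As a small illustration of raw events vs. aggregates (my own sketch, not code from Martin's post), assume a hypothetical stream of page-view events. The event fields and the two helper functions are invented for this example.

```python
from collections import Counter

# Hypothetical raw events: one immutable fact per page view.
events = [
    {"ts": "2015-03-01T10:00:00", "user": "alice", "page": "/home"},
    {"ts": "2015-03-01T10:00:05", "user": "bob", "page": "/home"},
    {"ts": "2015-03-01T10:01:00", "user": "alice", "page": "/pricing"},
]


def views_per_page(raw_events):
    """One possible aggregate (read-optimised view) over the raw events."""
    return Counter(event["page"] for event in raw_events)


def views_per_user(raw_events):
    """A different aggregate derived from the very same raw events."""
    return Counter(event["user"] for event in raw_events)


print(views_per_page(events))
print(views_per_user(events))
```

The raw events are the finer granularity: any number of aggregates can be derived from them later on, whereas the per-page counts alone can no longer tell you which user viewed what.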

Gravity

Data has gravity.

That means, in a nutshell, that it tends to put up resistance when moved and tends to be stickier than, for example, code. Say you've got a cluster with 3 PB on-prem and your objective is to design a DR solution hosted in a public cloud. What would you do?
Dave McCrory's data gravity concept (2010).

Again, I'm not gonna drill down here, I'll just refer you to experts on this topic: datagravity.org.
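For a rough feel of why the 3 PB scenario is hard, here is a back-of-the-envelope sketch. The 10 Gbit/s link, the decimal petabyte, and the `transfer_days` helper are my assumptions for illustration, and the calculation ignores protocol overhead, ongoing changes to the data, and everything else a real DR design has to deal with.

```python
PETABYTE = 10**15  # decimal petabyte in bytes (an assumption)
GIGABIT = 10**9    # bits


def transfer_days(size_bytes, link_bits_per_second, utilisation=1.0):
    """Days needed to push size_bytes over a link at the given utilisation."""
    seconds = (size_bytes * 8) / (link_bits_per_second * utilisation)
    return seconds / 86_400  # seconds per day


# Hypothetical dedicated 10 Gbit/s link, fully saturated around the clock.
days_at_full_speed = transfer_days(3 * PETABYTE, 10 * GIGABIT)

# The same link when only half of its capacity is available for the copy.
days_at_half_speed = transfer_days(3 * PETABYTE, 10 * GIGABIT, utilisation=0.5)

print(f"{days_at_full_speed:.1f} days, {days_at_half_speed:.1f} days")
```

Under these assumptions the initial copy alone takes roughly four weeks at full speed and about twice that at half speed, which is the stickiness that data gravity describes.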

Temperature

Data has a temperature.

One of the most interesting and practical pieces on this topic I came across is HDFS Storage Efficiency using Tiered Storage from the eBay engineering team:

HDFS Storage Efficiency using Tiered Storage by Benoy Antony (2015).

Do you know of other insightful ones?
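To make the notion of data temperature a little more concrete, here is a minimal sketch that labels datasets as hot, warm, or cold by the time since their last access. The thresholds, tier names, and dataset paths are made up for illustration and are not taken from the eBay article.

```python
from datetime import date


def temperature(last_access, today, warm_after_days=7, cold_after_days=90):
    """Classify a dataset by the number of days since it was last accessed."""
    age_days = (today - last_access).days
    if age_days >= cold_after_days:
        return "cold"
    if age_days >= warm_after_days:
        return "warm"
    return "hot"


# Hypothetical datasets mapped to their last access date.
today = date(2015, 6, 1)
last_accessed = {
    "clickstream/2015-05-31": date(2015, 5, 31),
    "clickstream/2015-05-01": date(2015, 5, 10),
    "clickstream/2014-01-01": date(2014, 2, 1),
}

tiers = {name: temperature(day, today) for name, day in last_accessed.items()}
print(tiers)
```

A storage layer could then map each label to a tier, for example fast disks for hot data and cheaper, denser storage for cold data.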


There we go: data has a temperature, gravity, granularity, and comes in different shapes. I'm sure that as we explore this space together, we will encounter even more underlying characteristics of data.
