Grounds for this post
In college I concentrated on distributed systems and big data. The other options were AI, compilers or "software design" so it was an easy decision for me; the others seemed meh. Because of the classes I chose, I went on to Seagtate (a hardware company) to design analytics software for their HPC products. The system I work on is a highly distributed, microservice oriented system that processes, stores and makes available big data. Without giving away any particulars, I learned a lot about big data in production.
Utilizing 20 nodes in the school computer lab to run a MapReduce job at 8pm on a Friday, in order to benchmark an algorithm, allows you to experience the greatest of performance levels. Unfortunately just because you hit those numbers once, doesn't mean you will most of the time, if ever again, because you know, the real world! And that's what this post is about, big data in production.
When to go big data?
When it comes to big data, people get very excited, very easily. Whether it's the managers bragging about their teams big data abilities or the developer using smart words around the interns, some people look at it as an elite ability, this ability to know/process/understand/program big data. I even see data scientist regularly wanting to use things like Spark for calculations of Hive for inquiring about the data, when really a python program or a simple SQL database will do just fine (and the data scientists should know better!). Big data is still a buzzword, or phrase (since it's two words haha) and with that comes the excitement of leaning a new tool and making something cool with it.
When I first started my current project at Seagate we were using things like Spark, HDFS, Hbase, Solr, Kafka and Zookeeper. Were weren't processing a whole lot of data, but we were told that very soon we would be and that the system must be able to handle it all. Well it turned out that a year later we still aren't handling break neck data processing speed requirements but we are seeing a very necessary need for big data storage.
So back to the original question: when to go bid data? The answer is when you believe you will max out on two of the following areas: Velocity, Variety and Volume. I have seen articles on the 3 V's, 4 V's and the 5 V's of big data. I think it all can get a bit redundant so I believe you can distill it to the 3 big data V's. Again if you think you will max out on 2 of the three then you might have a big data system on your hands so get ready.
When you are expecting to have a system with high I/O then it's time to pull out the speed demons. Things like load balances and queues. For this you can use tools like Nginx or HAproxy for load balancing and Kafka for handling high amounts of traffic.
Data variety is what relational databases have nightmares about. These are the data points you can't put into a relational database without having a sparse table or worse yet it's not even data: they're file objects! What ever will you do? You can use columnar database like Hbase or a document store like Mongo to get the job done.
When considering your 6 month data volume will you be able to handle it? Will your filesystem be able to handle it or will you have to use something like HDFS? Will one MySQL instance be enough or will you need to shard soon? If you filesystem is too small or sharding is in your future (hopefully not because of those bad tacos) then maybe starting off with a big data storage tool like Cassandra or HDFS would be wise.
The other end of the spectrum
The other end of the spectrum entails not using big data tools at all, either scubas the need is not present or because the tooling is not allowed or what have you. I have seen relational databases with well structured data (i.e. it is predictable), hundreds of millions of rows, and millions of hits a day and because of load balancing and aching tools like Redis, everything was fine. Was this a big data situation like described above? Kinda. This use case has velocity and volume maxed out, and one cold argue for a distributed database like Cassandra and tools like Kafka to queue the requests on the database , but things seemed just fine without all that.
This use case is to show that sometimes you can use normal tooling and still have two of the three big data V's. I feel that 80% of the time you can get away with just regular boring tooling. This even applies to things like frontend development where you might need just pure JS and not Angular or React.
The gray area
The gray area in our project was the fact that we were promised to receive MB's of data per minute. This means up to a few Gb per hour. In one test case I ran all our archived, un-processed data through our pipeline and because of Kafka, we were able to handle multiple GB's per hour. The real number here was 7GB of data per hour and it could have easily handled more. That might not seem like a lot, but considering how long single threaded apps would take to churn through 7GB, this is a fantastic. The problem we ran into was that our real time data processing engine was slow, as we were running two instances of the processor on Docker. So we bumped it to 4 instances and the processing time, which was preventing the Kafka queues from moving faster, was cut in half. Now making our processing time up to 14GB per hour.
After the data is processed it is stored in Hbase as file objects in addition to a columnar format, aka a dictionary (or a hashmap for you Java folk). Anyway, we used to use Solr and Lily-Hbase-Indexer to index the Hbase data. This is because Hbase can only do O(n) table scans for filtering queries. It can however do O(1) row retrieval because the Hbase master at least indexes the row keys. Solr is very fast at doing filter queries but at times we'd face instances where Lily missed some row keys and so therefore was not suitable for production level availability because it was out of sync. We switched over to Postgres for indexing and never looked back. In participial, Django was used and it was implemented in such a way that the write to Hbase and to Postgres was a single atomic transaction. This means that when the processed data is written to Hbase it must also be written to Postgres and vice versa. It's really a brilliant thing. This the beauty of not being tied to just big data tooling. Here we are using a sturdy, well known, battle tested tool, Postgres, and new fangled big data columnar database in order to accomplish the goal of making data real time. Postgres here is used in place of Solr and does a great job once all your indexes are set up.
The key take away here is don't get cornered by your tooling. In order to always have a way out, it pays huge dividends to invest some up front time in learning your tooling very well and knowing why it is exactly that you will be using it. This way you are not tied to the standard web search on how to do X, Y or Z, but instead you can create your own architectures and pipelines. This is the true meaning of software engineering and distributed systems: to compose many tools in order to create a unified system that solves a real problem and provides value.
It's a cool thing, big data. It comes with so many promises, so much potential. But it can be the downfall of the best-intentioned projects because the end solution might be too slow and lumpy to use. In order to move away from an R&D setting we must use tools that can provide us big data processing storage and delivery that seems like a non-big-data system. What you deliver in the end to a user, whether it be an API, a web or mobile app, graphs, spreadsheets, etc, it should be the tip of a very fast moving, highly tolerant, scalable iceberg. Don't be timid to mix big data tools with "regular" tools. This type of creativity is the essence of being a software engineer. So if you're gonna go big data, do it right, or go home.
Contact me for questions about your next big data project!