Bermann

N.B. This post was migrated from oli-hall.github.io to oli-hall.com on 18/04/2019

Spark unit testing sucks. After battling slow and inefficient tests for a long time, I finally decided to sit down and research the issue. Well, it turns out that there really isn't a clean solution. The current best practice is to instantiate a single Spark Context instance globally in your test suite, then use that for all your tests. I have to say, I'm not the biggest fan of this.

Why is this a bad idea? Spark Contexts are slow to instantiate, and rather heavy resource-wise. For processing thousands of rows, sure. For processing a couple of test data rows, it's way over the top. Anything which slows down a unit test suite is generally a bad thing, and Spark Contexts are pretty slow. Also, having a single instance of the Spark Context means that your Spark tests are now single threaded, which eliminates a bunch of the benefits that come with a properly parallelised unit test framework. Finally, for processing just a few rows, Spark is really slow. Whilst it may be extremely quick at processing thousands or millions of rows, processing just a handful of rows can take up to 30 seconds just to perform a simple map. This is not acceptable.

Introducing Bermann

So, what's the solution? I decided to start from first principles, and mock out the SparkContext and RDD classes in Python, performing all the operations with vanilla python built-in functions. This proved to be vastly faster than using Spark for tests, which means that not only does our existing Spark test suite run way faster, but there's now a sensible mechanism to perform much more granular tests of Spark logic. Previously, with the slowness involved, there was little reason to test at any level other than pipeline level.

It's available here, in an early form, so please go ahead and test it out! It's worth noting that for now, only RDD operations are supported. There's a minimal DataFrame implementation, but most logic has not yet been fleshed out - DataFrames are kinda complex, and implementing them neatly in Python is proving tricky. If anyone has any suggestions, I'm all ears!