Thursday 24 December 2015

Getting Started with Apache Spark and Cassandra



What is Apache Spark?

The standard description of Apache Spark is that it’s ‘an open source data analytics cluster computing framework’. Another way to define Spark is as a VERY fast in-memory data-processing framework – like lightning fast. 100x faster than Hadoop fast. As the volume and velocity of data collected from web and mobile apps rapidly increases, it’s critical that the speed of data processing and analysis stay at least a step ahead in order to support today’s Big Data applications and end-user expectations. Spark offers the competitive advantage of high-velocity analytics by stream processing large volumes of data, versus the more heavily “batch-oriented” approach to data processing traditionally seen with Hadoop. Spark also provides a more inclusive framework that allows for multiple analytics processing options, including fast interactive queries, streaming analytics, graph analytics and machine learning.
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations.
To learn more about Apache Spark, check out this video "Apache Spark - The SDK for All Big Data Platforms" from Cassandra Summit 2014.
Visit DataStax's Spark Driver for Apache Cassandra GitHub repository for install instructions.

The State of the Art

The state of the big data analysis art has changed over the last couple years.  Hadoop has dominated the space for many years now, but there are a number of new tools available that are worth a look, whether you’re new to the analysis game or have been using Hadoop for a long time.
So one of the first questions you might ask is, “What’s wrong with Hadoop?”
Well, for starters, it’s 11 years old--hard to believe, I know. Some of you are thinking, wow, I’m late to the game. To those people I would say, this is good news for you, because you’re unencumbered by any existing technology investment.
Hadoop was groundbreaking at its introduction, but by today’s standards it’s actually pretty slow and inefficient. It has several shortcomings:
  1. Everything gets written to disk, including all the interim steps.
  2. In many cases you need a chain of jobs to perform your analysis, making #1 even worse.
  3. Writing MapReduce code sucks, because the API is rudimentary, hard to test, and easy to screw up. Tools like Cascading, Pig, Hive, etc., make this easier, but that’s just more evidence that the core API is fundamentally flawed.
  4. It requires lots of code to perform even the simplest of tasks.
  5. The amount of boilerplate is crazy.
  6. It doesn’t do anything out of the box. There’s a good bit of configuration and far too many processes to run just to get a simple single-node installation working.
Start Learning with DS320: Analytics with Apache Spark

A New Set of Tools

Fortunately we are seeing a new crop of tools designed to solve many of these problems, like Apache Drill, Cloudera’s Impala, proprietary tools (Splunk, keen.io, etc.), Spark, and Shark. But the Hadoop ecosystem is already suffering from an overabundance of bolt-on components to do this and that, so it’s easy for the really revolutionary stuff to get lost in the shuffle.
Let’s quickly break these options down to see what they offer.
Cloudera’s Impala and Apache Drill are both Massively Parallel query processing tools designed to run against your existing Hadoop data. Basically they give you fast SQL-style queries for HDFS and HBase, with the theoretical possibility of integrating with other InputFormats, although I’m not aware of anyone having done this successfully with Cassandra.
There are also a number of vendors that offer a variety of analysis tools, which are mostly cloud-based solutions that require you to send them your data, at which point you’re able to run queries using their UIs and/or APIs.
Spark and its SQL query API provide a general-purpose in-memory distributed analysis framework. You deploy it as a cluster and submit jobs to it, much like you would with Hadoop. Shark is essentially Hive on Spark: it uses your existing Hive metastore, but it executes the queries on your Spark cluster. At Spark Summit 2014 it was announced that Shark would be replaced by Spark SQL.

 

… And the Winner Is?

So which tool or tools should you use? Well, actually you don’t really have to choose only one. Most of the major Hadoop vendors have embraced Spark, as has DataStax. It’s actually an ideal Hadoop replacement. Or if you haven’t started using Hadoop, at this point it makes sense to simply redirect your efforts towards Spark instead.
There’s been a lot of investment in Hadoop connectors, called InputFormats, and these can all be leveraged in Spark. So, for example, you can still use your Mongo-Hadoop connector inside Spark, even though it was designed for Hadoop. But you don’t have to install Hadoop at all, unless you decide to use HDFS as Spark’s distributed file system. If you don’t, you can choose one of the other DFS’s that it supports.
Unlike Hadoop, Spark supports both batch and streaming analysis, meaning you can use a single framework for your batch processing as well as your near real time use cases. And Spark introduces a fantastic functional programming model, which is arguably better suited for data analysis than Hadoop’s Map/Reduce API.  I’ll show you the difference in a minute.
Perhaps most importantly, since this is a Cassandra blog, I assume you’re all interested in how well this works with Cassandra. The good news is you don’t even have to use the Hadoop InputFormat, as DataStax has built an excellent direct Spark driver. And yes, it’s open source!

 

Spark and Cassandra 

DataStax has recently open sourced a fantastic Spark driver that’s available on their GitHub repo. It gets rid of all the job config cruft that early adopters of Spark on Cassandra previously had to deal with, as there’s no longer a need to go through the Spark Hadoop API.
It supports server-side filters (basically WHERE clauses) that allow Cassandra to filter the data prior to pulling it in for analysis.  Obviously this can save a huge amount of processing time and overhead, since Spark never has to look at data you don’t care about.  Additionally, it’s token aware, meaning it knows where your data lives on the cluster and will try to load data locally if possible to avoid network overhead.

Topology

So what does this all look like on a physical cluster?  Well, let’s start by looking at an analysis data center running Cassandra with Hadoop, since many people are running this today.  You need a Hadoop master with the NameNode, SecondaryNameNode, and JobTracker, then your Cassandra ring with co-located DataNodes and TaskTrackers.  This is the canonical setup for Hadoop and Cassandra running together.
Spark has a similar topology.  Assuming you’re running HDFS, you’ll still have your Hadoop NameNode and SecondaryNameNode, and DataNodes running on each Cassandra node.  But you’ll replace your JobTracker with a Spark Master, and your TaskTrackers with Spark Workers.
[Figure: cluster topology with a Spark Master and Spark Workers co-located with the Cassandra nodes, replacing the Hadoop JobTracker and TaskTrackers]

Building a Spark Application

Let’s take a high-level look at what it looks like to actually build a Spark application. To start, you’ll configure your app by creating a SparkConf, then use it to instantiate your SparkContext:
/** Configures Spark. CassandraSeed, SparkCleanerTtl, SparkMaster, and
  * SparkAppName are assumed to be defined elsewhere in your application. */
val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", CassandraSeed)
    .set("spark.cleaner.ttl", SparkCleanerTtl.toString)
    .setMaster(SparkMaster)
    .setAppName(SparkAppName)

/** Connect to the Spark cluster: */
lazy val sc = new SparkContext(conf)
Next, you’ll pull in some data, in this case using the DataStax driver’s cassandraTable method. For this example, we’ll assume you have a simple table with a Long for a key and a single column of type String:
val myTable = sc.cassandraTable[(Long, String)]("myKeyspace", "myTable")
This gives us an RDD, or Resilient Distributed Dataset. The RDD is the core abstraction in Spark: a distributed collection of items that also underpins Spark’s fault tolerance, because it can be recomputed at any point if nodes are slow or fail. Spark does this automatically!
You can perform two types of operations on an RDD:
  • Transformations are lazy operations that result in another RDD. They don’t require actual materialization of the data into memory. Operations like filter, map, flatMap, distinct, groupBy, etc., are all transformations, and therefore lazy.
  • Actions are immediately evaluated, and therefore do materialize the data. You can tell an action from a transformation because it doesn’t return an RDD. Operations like count, fold, and reduce are actions, and therefore NOT lazy.
 An example:
val adults = persons.filter(_.age > 17)  // this is a transformation
val total = adults.count()  // this is an action

Saving Data to Cassandra

Writing data into a Cassandra table is very straightforward. First, you generate an RDD of tuples that matches the CQL row schema you want to write, then you call saveToCassandra on that RDD:
adults.saveToCassandra("test", "adults")
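If your RDD’s fields don’t map one-to-one onto the target table, you can name the destination columns explicitly with SomeColumns. Here’s a minimal sketch, assuming a hypothetical table test.words (word text PRIMARY KEY, count int):
import com.datastax.spark.connector._  // brings saveToCassandra and SomeColumns into scope

// Hypothetical data; each tuple field maps onto the column named below, in order.
val words = sc.parallelize(Seq(("foo", 10), ("bar", 20)))
words.saveToCassandra("test", "words", SomeColumns("word", "count"))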

Running SQL Queries

To run SQL queries (using Spark 1.0 or later), you’ll need a case class to represent your schema. After that, you’ll register the RDD as a table, then you can run SQL statements against the table, as follows:
case class Person(name: String, age: Int)
sc.cassandraTable[Person]("test", "persons").registerAsTable("persons")
val adults = sql("SELECT * FROM persons WHERE age > 17")
The result is an RDD, just like the filter call produced in the earlier example.
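In case you’re wondering where sql comes from, here’s a minimal sketch of the plumbing the snippet above assumes in Spark 1.0-era code: a SQLContext whose members (including sql and the implicit conversion that makes registerAsTable available on an RDD of case classes) are imported into scope. Exact details vary with your Spark and connector versions:
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext._  // brings sql(...) and the implicit RDD-to-table conversion into scope

sc.cassandraTable[Person]("test", "persons").registerAsTable("persons")
val adults = sql("SELECT * FROM persons WHERE age > 17")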

Server-Side Filters

There are two ways to filter data on the server: using a select call to reduce the number of columns and using a where call to filter CQL rows. Here’s an example that does both:
sc.cassandraTable("test", "persons").select("name", "age").where("age = 17")
As a result, you have less data being ingested by Spark, and therefore less work to do in your analysis job. This is a good idea, as you should always filter early and often.

Spark in the Real World

I have shown you some simple examples that demonstrate how easy it is to get started with Spark, but you might be wondering how you structure your application for real-world task distribution.
For starters, RDD operations are similar to the Scala collections API, but:
  • Your choice of operations and the order in which they are applied is critical to performance.
  • You must organize your processes with task distribution and memory in mind.
The first thing to do is determine whether your data is partitioned appropriately. A partition in this context is merely a block of data. If possible, partition your data before Spark even ingests it. If this is not practical or possible, you may choose to repartition the data immediately following the load. You can repartition to increase the number of partitions, or coalesce to reduce the number of partitions.
The number of partitions should, as a lower bound, be at least 2x the number of cores that are going to operate on the data. Having said that, you will also want to ensure any task you perform takes at least 100ms to justify the distribution across the network. Note that a repartition will always cause a shuffle, whereas coalesce typically won’t. If you’ve worked with MapReduce, you know shuffling is what takes most of the time in a real job.
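A minimal sketch of the two calls (the partition counts here are arbitrary and would depend on your cluster):
val table = sc.cassandraTable[(Long, String)]("myKeyspace", "myTable")

// repartition redistributes the data into more partitions and always shuffles.
val widened = table.repartition(64)

// coalesce merges existing partitions into fewer and typically avoids a shuffle.
val narrowed = widened.coalesce(16)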
As I stated earlier, filter early and often. Assuming the data source is not preprocessed for reduction, your earliest and best place to reduce the amount of data Spark will need to process is the initial data query. This is often achieved by adding a where clause. Do not bring in any data that isn’t necessary to obtain your target result. Bringing in extra data increases how much data may be shuffled across the network and written to disk. Moving data around unnecessarily is a real killer and should be avoided at all costs.
At each step, look for opportunities to filter, distinct, reduce, or aggregate the data as much as possible before proceeding to the next operation.
Use pipelines as much as possible. Pipelines are a series of transformations that represent independent operations on a piece of data and do not require a reorganization of the data as a whole (a shuffle). For example, a map from a string to its length is independent, whereas a sort by value requires a comparison against other data elements and a reorganization of data across the network (a shuffle).
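A small sketch of the distinction, using made-up data:
val lines = sc.parallelize(Seq("spark", "cassandra", "analytics"))

// Pipelined (narrow) transformations: each element is processed independently,
// so these steps run back-to-back without moving data between nodes.
val lengths = lines.map(_.length).filter(_ > 5)

// sortBy must compare elements against one another, so it triggers a shuffle.
val sorted = lengths.sortBy(identity)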
In jobs that require a shuffle, see if you can employ partial aggregation or reduction before the shuffle step (similar to a combiner in MapReduce). This will reduce data movement during the shuffle phase.
Some common tasks that are costly and require a shuffle are sorts, group by key, and reduce by key. These operations require the data to be compared against other data elements, which is expensive. It is important to learn the Spark API well in order to choose the best combination of transformations and where to position them in your job. Create the simplest and most efficient algorithm necessary to answer the question.
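For example, reduceByKey performs partial aggregation within each node before the shuffle (much like a MapReduce combiner), while groupByKey ships every value across the network first. A hedged sketch with made-up word-count data:
val pairs = sc.parallelize(Seq(("spark", 1), ("cassandra", 1), ("spark", 1)))

// Preferred: values are summed locally within each partition before the shuffle.
val counts = pairs.reduceByKey(_ + _)

// Works, but every individual value crosses the network before any aggregation.
val grouped = pairs.groupByKey().mapValues(_.sum)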
Start Learning with DS320: Analytics with Apache Spark

Authors


Robbie Strickland, Director of Software Development at The Weather Channel
Robbie works for The Weather Channel's digital division, as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years, including work on drivers for Scala and C#, Hadoop integration, leading the Atlanta Cassandra Users Group, and answering lots of questions on StackOverflow.
Matt Kew, Internet Application Developer at The Weather Channel
Matt Kew is a Software Engineer with a background in emerging and disruptive technologies. He has experience with full-stack application development including high volume backend services, big data analytics, and highly responsive interactive client side applications. He enjoys building, training, and leading teams of software engineers to apply the right balance of innovation and transform existing technology stacks to best achieve company goals. In his career he has been recognized for best-in-breed software applications which led to strategic consultancy roles with teams of engineers from Google and partnerships with Microsoft.
Helena Edelson, Senior Engineer at DataStax.
Helena is a committer on several open source projects including Akka, the Spark Cassandra Connector and previously Spring Integration and Spring AMQP. She has been working with Scala since 2010 and is currently a Senior Software Engineer on the DSE Analytics team at DataStax, working with Apache Spark, Cassandra, Kafka, Scala and Akka. Most recently she has been a speaker at Big Data and Scala conferences.

 

Additional Spark and Cassandra Resources

Blog Postings
Installing the Cassandra / Spark OSS Stack by Al Tobey, Apache Cassandra Open Source Mechanic
Fast Spark Queries on In-Memory Datasets is a blog post from Ooyala that describes their use of Spark and Cassandra to help them derive actionable information from over 2 billion video events per day  

SlideShare Presentations
Escape from Hadoop: with Apache Spark and Cassandra with the Spark Cassandra Connector by Helena Edelson: What Apache Spark is, and why and how to use it with Cassandra without Hadoop. What Spark Streaming is, and how to use it with Apache Cassandra via the Spark Cassandra Connector. Example application: http://github.com/killrweather/killrweather
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka by Helena Edelson: Delivering meaning in near-real time, at high velocity and massive scale, with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector; why this pairing of technologies, and how easy it is to implement. Example application: http://github.com/killrweather/killrweather
Interactive Analytics with Spark and Cassandra: In this SlideShare by Evan Chan, learn to take your analytics to the next level by using Apache Spark to accelerate complex interactive analytics using your Apache Cassandra data. Includes an introduction to Spark as well as how to read Cassandra tables in Spark.

Video Presentations
Apache Spark - The SDK for All Big Data Platforms  Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today. https://www.youtube.com/watch?list=PLqcm6qE9lgKJkxYZUOIykswDndrOItnn2&v=Ot-nrOaJHLE&feature=player_embedded

AMP Camp 2014

Interactive OLAP Queries using Apache Cassandra and Spark: How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan’s experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data. https://www.youtube.com/watch?feature=player_embedded&v=zke8mp-kMMo&list=PLqcm6qE9lgKJkxYZUOIykswDndrOItnn2

Training
Public Apache Spark Training Workshops http://databricks.com/spark-training
Online Apache Spark Training Workshops - Basic and Advanced Apache Spark slides with videos: http://databricks.com/spark-training-resources 

