
Spark and the Resilient Distributed Dataset (RDD)

The SparkContext

  • Created by your driver program
  • Is responsible for making RDDs resilient and distributed
  • Creates RDDs
  • The Spark shell creates an "sc" object for you; in a standalone script you create it yourself (see the sketch below)
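A minimal sketch of creating the SparkContext yourself in a standalone PySpark script (the master URL and app name here are placeholder choices, not from the notes):

    from pyspark import SparkConf, SparkContext

    # Configure the driver program: cluster master URL and application name
    conf = SparkConf().setMaster("local[*]").setAppName("MyApp")

    # The SparkContext is the entry point for creating RDDs
    sc = SparkContext(conf=conf)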

Creating RDDs

  • nums = sc.parallelize([1, 2, 3, 4])
  • sc.textFile("file:///somewhere/test.txt")
    • or s3n://, hdfs:// URIs
  • hiveCtx = HiveContext(sc); rows = hiveCtx.sql("SELECT name, age FROM users") (see the sketch after this list)
  • Can also create from:
    • JDBC
    • Cassandra
    • HBase
    • Elasticsearch
    • JSON, CSV, sequence files, object files, various compressed formats
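Roughly how the creation methods above fit together in one script; the file path and Hive table come straight from the bullets and are placeholders, and HiveContext assumes a Hive-enabled Spark build:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext("local[*]", "CreateRDDs")

    # From an in-memory Python collection
    nums = sc.parallelize([1, 2, 3, 4])

    # From a text file; s3n:// or hdfs:// URIs work the same way
    lines = sc.textFile("file:///somewhere/test.txt")

    # From a Hive query (in recent Spark versions this returns a DataFrame
    # of Row objects rather than a plain RDD)
    hiveCtx = HiveContext(sc)
    rows = hiveCtx.sql("SELECT name, age FROM users")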

Transforming RDDs

  • map()
    • rdd = sc.parallelize([1, 2, 3, 4])
    • rdd.map(lambda x: x * x)
    • returns a new RDD containing 1, 4, 9, 16
  • flatMap()
  • filter()
  • distinct()
  • sample()
  • union(), intersection(), subtract(), cartesian() (see the sketch below)
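A short sketch of the transformations above on toy data (the sample sentences are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "Transformations")

    rdd = sc.parallelize([1, 2, 3, 4])
    squares = rdd.map(lambda x: x * x)           # 1, 4, 9, 16
    evens = rdd.filter(lambda x: x % 2 == 0)     # 2, 4

    lines = sc.parallelize(["the quick brown fox", "the lazy dog"])
    words = lines.flatMap(lambda line: line.split())  # one element per word
    unique = words.distinct()                         # duplicate "the" removed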

RDD Actions

  • collect()
  • count()
  • countByValue()
  • take()
  • top()
  • reduce()
  • and more (a combined example follows)
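The actions above, applied to a small example RDD (the input values are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "Actions")
    rdd = sc.parallelize([3, 1, 2, 2, 3, 3])

    print(rdd.collect())                    # [3, 1, 2, 2, 3, 3] -- brings the whole RDD back to the driver
    print(rdd.count())                      # 6
    print(rdd.countByValue())               # {3: 3, 1: 1, 2: 2}
    print(rdd.take(2))                      # first 2 elements
    print(rdd.top(2))                       # [3, 3] -- the 2 largest
    print(rdd.reduce(lambda a, b: a + b))   # 14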

Lazy evaluation

  • Transformations only build up a lineage of RDDs; nothing is actually computed in your driver program until an action is called (see the sketch below).
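A sketch that makes the laziness visible: the map call returns immediately because it only records the transformation, and the work happens when collect() is called:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LazyEvaluation")

    rdd = sc.parallelize([1, 2, 3, 4])
    squared = rdd.map(lambda x: x * x)   # nothing runs yet -- Spark just records the lineage

    result = squared.collect()           # the action triggers the actual computation
    print(result)                        # [1, 4, 9, 16]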