Spark introduction

  • Spark is a fast and general engine for large-scale data processing.
  • It's scalable; a Spark application consists of:
    • Driver program (SparkContext)
    • Cluster manager (Spark standalone, YARN)
    • Executors (cache, tasks)
  • It's fast
    • Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
    • DAG Engine (directed acyclic graph) optimizes workflows
  • It's hot
    • Amazon
    • eBay: log analysis and aggregation
    • NASA JPL: Deep Space Network
    • Groupon
    • TripAdvisor
    • Yahoo
    • Many others
  • It's not hard
    • Code in Python, Java, or Scala
    • Built around one main concept: the resilient distributed dataset (RDD)
  • Components:
    • Spark Streaming
    • Spark SQL
    • MLlib
    • GraphX
    • Spark Core
  • Python vs. Scala
    • Why Python?
      • No need to compile, manage dependencies, etc.
      • Less coding overhead
      • You already know Python
      • Lets us focus on the concepts instead of a new language
    • But...
      • Scala is probably the more popular choice with Spark
      • Spark is built in Scala, so coding in Scala is native to Spark
      • New features and libraries tend to be Scala-first.
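
The RDD idea mentioned above can be illustrated with a small pure-Python sketch. This is not the Spark API: `ToyRDD` and everything in it are hypothetical names made up for illustration, and the "partitions" are plain lists rather than data spread across executors. It only shows the general pattern of recording transformations lazily as a lineage and running them when an action is called.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists, one per pretend executor
        self._lineage = lineage        # recorded transformations, not yet run

    # Transformations: nothing is computed, only the lineage grows.
    def map(self, fn):
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded lineage over a single partition.
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: these trigger the actual computation.
    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two pretend partitions
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

result = squares_of_evens.collect()                   # [4, 16, 36]
total = squares_of_evens.reduce(lambda a, b: a + b)   # 56
```

In PySpark the analogous chain would look roughly like `sc.parallelize([1, 2, 3, 4, 5, 6]).filter(...).map(...).collect()`, where `sc` is the SparkContext created by the driver program. Because a lineage can be replayed over a partition's source data, a lost partition can be recomputed from it, which is the sense in which the dataset is "resilient".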