Spark introduction
- Spark is a fast an general engine for large-scale data processing.
- It's scalable, consists of:
- Driver Program (spark context)
- Cluster manager (spark, yarn)
- Executors (cache, tasks)
- It's fast
- Run programs up to 100x faster than hadoop mapreduce in memory, or 10x faster on disk
- DAG Engine (directed acyclic graph) optimizes workflows
- It's hot
- Amazon
- Ebay: log analysis and aggregation
- NASA JPL: Deep Space Network
- Groupon
- TripAdviser
- Yahoo
- Many others
- it's not hard
- Code in python, java or scala
- Built around one main concept - the resilient distributed dataset (RDD)
- Components:
- Spark streaming
- Spark SQL
- MLLib
- GraphX
- Spark Core
- Python vs scala
- Why python?
- No need to compile, manage dependencies etc
- Less coding overhead
- You already know python
- Lets us focus on the concepts instead of a new language
- But...
- Scala is probably more popular choice with spark
- Spark is built in scala so coding scala is native to spark
- New features, libraries tend to be scala first.