Apache Spark Onsite Training

Apache Spark Onsite Training - Onsite, Instructor-led

Running with Hadoop, Zeppelin and Amazon Elastic Map Reduce (AWS EMR)

Integrating Spark with Amazon Kinesis, Kafka and Cassandra

This three- to five-day Spark training course introduces experienced developers and architects to Apache Spark™. Developers will learn to build real-world, high-speed, real-time analytics systems through extensive hands-on examples. The goal is to introduce the key concepts that make Apache Spark™ such an important technology, and to prepare architects, development managers, and developers to understand what is possible with Apache Spark™.

Apache Spark is a fast-growing open source cluster computing framework that enables advanced data analytics. Apache Spark’s rapid success is due to its power and simplicity. It is productive and much faster than typical MapReduce-based analysis. It puts the power of Hadoop, big data and real-time analytics into the hands of mere mortal developers. Spark supports Scala, Java and Python. The course includes examples in all three languages, including use of the REPL for Python and Scala, in addition to full labs in Scala and Python. This course covers Spark SQL and Spark Streaming, with an introduction to GraphX and MLlib.

Spark is enabling the next generation of OLAP, which includes real-time analytics at scale. Your company can’t afford to be left behind by this critical advance in information technology.

This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. We also cover integrating with important AWS technologies like Amazon EMR, Amazon S3 and Amazon Kinesis.

The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface (e.g. Spark SQL and DataFrames). It also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and provides an overview of Spark GraphX (graph processing) and Spark MLlib (machine learning). Finally, the course explores possible performance issues and strategies for optimization.

The course is very hands-on, with many labs. Participants will interact with Spark through the Spark shell (for interactive, ad-hoc processing) as well as through programs using the Spark API. Labs currently support Scala; contact us for Python/Java support.

The Apache Spark distributed computing engine is rapidly becoming a primary tool for processing and analyzing large-scale data sets. It has many advantages over existing engines such as Hadoop MapReduce, including runtime speeds that can be 10-100x faster for some workloads, as well as a much simpler programming model. After taking this course, you will be ready to work with Spark in an informed and productive manner.

Spark Training Course Details

Duration: 3 days to 4 days

Labs: Minimum 50% hands-on labs

Objectives for Spark Training

  • Principles of Spark
  • RDDs (Resilient Distributed Datasets)
  • Spark SQL
  • Importing data into Spark
  • Understanding Spark Clustering
  • Spark and JSON import
  • Spark Streaming
  • Spark and Cassandra integration
  • Spark and Kafka integration
  • Spark/EMR and Kinesis integration
  • Spark and S3 integration

Prerequisites for Spark Training:

Reasonable programming experience. An overview of Scala is provided for those who don’t know it. Basic knowledge of Java, Scala or Python with some knowledge of core CS concepts and databases would be helpful. Experience with distributed data grids or Hadoop would be a plus, but not required.

Spark Knowledge and Skills Gained:

  • Understand the need for Spark in data processing
  • Understand the Spark architecture and how it distributes computations to cluster nodes
  • Be familiar with basic installation / setup / layout of Spark
  • Use the Spark shell for interactive and ad-hoc operations
  • Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
  • Understand/use RDD ops such as map(), filter(), reduce(), groupByKey(), join(), etc.
  • Understand Spark’s data caching and its usage
  • Write/run standalone Spark programs with the Spark API
  • Use Spark SQL / DataFrames to efficiently process structured data
  • Use Spark Streaming to process streaming (real-time) data
  • Understand performance implications and optimizations when using Spark
  • Be familiar with Spark GraphX and MLlib
  • Optional Day 4: Run Spark in Amazon EMR and understand how Spark works in the Hadoop ecosystem.
  • Optional Day 4: Use Spark with Apache Zeppelin

Spark Training Outline

Session 1 (Optional): Scala Ramp Up

  • Scala Introduction, Variables, Data Types, Control Flow
  • The Scala Interpreter
  • Collections and their Standard Methods (e.g. map())
  • Functions, Methods, Function Literals
  • Class, Object, Trait

Session 2: Introduction to Spark

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell

Session 3: RDDs and Spark Architecture

  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs - Creating and Transforming (map, filter, etc.)
  • Key-Value Pairs - Definition, Creation, and Operations
  • Caching - Concepts, Storage Type, Guidelines
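
To make the lazy-evaluation idea in this session concrete, here is a minimal sketch in plain Python (no Spark installation required). It is not the Spark API: the `LazyDataset` class and its methods are hypothetical names that only mimic how RDD transformations such as map() and filter() are recorded first and executed later, when an action such as reduce() or collect() is called.

```python
from functools import reduce as fold


class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection, e.g. a list
        self._steps = steps     # tuple of ("map" | "filter", function) pairs

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Pipeline all recorded steps over each element in a single pass.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []  # records which elements square() was actually applied to


def square(x):
    calls.append(x)
    return x * x


squares_of_evens = LazyDataset([1, 2, 3, 4, 5, 6]).filter(lambda x: x % 2 == 0).map(square)
work_before_action = len(calls)  # still 0: only the recipe has been built
total = squares_of_evens.reduce(lambda a, b: a + b)  # the action runs the pipeline
```

Building `squares_of_evens` calls `square` zero times; the work happens only inside `reduce`. Real RDDs add partitioning, fault tolerance and caching on top of this basic idea.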

Session 4: Spark API

  • Overview, Basic Driver Code, SparkConf
  • Creating and Using a SparkContext
  • Building and Running Applications
  • Application Lifecycle
  • Cluster Managers
  • Logging and Debugging

Session 5: Spark SQL

  • Introduction and Usage
  • DataFrames and SQLContext
  • Working with JSON
  • Querying - The DataFrame DSL, and SQL
  • Data Formats

Session 6: Spark Streaming

  • Overview and Streaming Basics
  • DStreams (Discretized Streams)
  • Architecture, Stateless, Stateful, and Windowed Transformations
  • Spark Streaming API
  • Programming and Transformations
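
As an illustration of the windowed transformations named in this session, the following plain-Python sketch treats a stream as a sequence of micro-batches and counts words over a sliding window. It does not use the Spark Streaming API; the function name `windowed_counts`, its parameters and the sample batches are assumptions made up for this example, with the window length and slide interval measured in batches rather than seconds.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield word counts over the last `window_length` batches, every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(Counter(batch))     # stateless step: count words within one batch
        if index % slide_interval == 0:
            total = Counter()
            for counts in window:         # windowed step: merge counts across batches
                total.update(counts)
            yield dict(total)


# Hypothetical micro-batches, one list of words per batch interval.
micro_batches = [
    ["spark", "kafka"],
    ["spark"],
    ["kinesis", "spark"],
    ["kafka"],
]
results = list(windowed_counts(micro_batches, window_length=3, slide_interval=2))
```

With a window of three batches sliding every two batches, the first result covers batches 1-2 and the second covers batches 2-4, so the first batch has already left the window by then.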

Session 7: Performance Characteristics and Tuning

  • The Spark UI
  • Narrow vs. Wide Dependencies
  • Minimizing Data Processing and Shuffling
  • Using Caching
  • Using Broadcast Variables and Accumulators
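
One way to picture "minimizing shuffling" is to compare how many records must cross the network when values are summed per key. The sketch below is plain Python, not Spark code: it models partitions as lists, and the two function names are hypothetical. The first function sends every record across the shuffle boundary (similar in spirit to groupByKey() followed by a sum), while the second combines values inside each partition first (similar in spirit to reduceByKey()).

```python
from collections import defaultdict


def shuffle_all_records(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle boundary."""
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def combine_then_shuffle(partitions):
    """reduceByKey-style: sum within each partition first, then shuffle the partial sums."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]
naive_totals, naive_records = shuffle_all_records(partitions)
combined_totals, combined_records = combine_then_shuffle(partitions)
```

Both approaches produce the same totals, but the second moves four records instead of seven. On real data sets with many repeated keys the gap can be far larger.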

Session 8 (Optional): Spark GraphX Overview

  • Introduction
  • Constructing Simple Graphs
  • GraphX API
  • Shortest Path Example
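
For readers who want a feel for the shortest-path example before the session, here is a small single-machine sketch in plain Python. It is not the GraphX API and makes no claim about how the course lab is written: it uses a breadth-first search over a directed edge list, and the function name `hop_distances` and the sample graph are assumptions for illustration only.

```python
from collections import deque


def hop_distances(edges, source):
    """Return the minimum number of hops from `source` to every reachable vertex."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, [])   # make sure sink vertices are present too
    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, []):
            if neighbor not in distances:  # first visit is the shortest in hop count
                distances[neighbor] = distances[vertex] + 1
                queue.append(neighbor)
    return distances


# Hypothetical directed graph given as (source, destination) pairs.
edges = [(1, 2), (2, 3), (1, 4), (4, 3), (3, 5), (6, 1)]
distances = hop_distances(edges, source=1)
```

Vertex 6 has an edge into vertex 1 but cannot be reached from it, so it does not appear in `distances`. A graph library such as GraphX addresses the same kind of question for graphs too large for one machine.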

Session 9 (Optional): MLlib Overview

  • Introduction
  • Feature Vectors
  • Clustering / Grouping, K-Means
  • Recommendations
  • Classifications

Session 10 (Optional): AWS EMR

  • Understanding Hadoop on AWS EMR
  • Relationship of Spark to Hadoop on EMR
  • Importing data from S3
  • Setting up a new EMR cluster
  • Building a new Cluster and initializing the data with EMR steps
  • Using Zeppelin to visualize and prototype with Spark
  • Debugging
  • Using Hive

Session 11 (Optional): Using Cassandra and Spark SQL together

  • Introduction to Cassandra
  • Using Cassandra from Spark

Session 12 (Optional): Using Kafka and Spark Streaming together

  • Introduction to Kafka
  • Using Kafka from Spark

Session 13 (Optional): Using Kinesis and Spark Streaming together

  • Introduction to Kinesis
  • Using Kinesis from Spark

Check out all of our SMACK training