Apache Spark Onsite Training

Apache Spark Training - Onsite, Instructor-Led

Running with Hadoop, Zeppelin, and Amazon Elastic MapReduce (AWS EMR)

Integrating Spark with Amazon Kinesis, Kafka and Cassandra

This three-to-five-day Spark training course introduces experienced developers and architects to Apache Spark™. It prepares developers to build real-world, high-speed, real-time analytics systems, and it includes extensive hands-on examples. The idea is to introduce the key concepts that make Apache Spark™ such an important technology, and to prepare architects, development managers, and developers to understand what is possible with Apache Spark™.

Apache Spark is a fast-growing framework that enables advanced data analytics on an open-source cluster computing system. Apache Spark’s rapid success is due to its power and simplicity: it is productive and much faster than typical MapReduce-based analysis, and it puts the power of Hadoop, big data, and real-time analytics into the hands of mere mortal developers. Spark supports Scala, Java, and Python, and the course includes examples in all three languages, including use of the REPL for Python and Scala, plus full labs in Scala and Python. The course covers Spark SQL and Spark Streaming, with an introduction to GraphX and MLlib.

Spark is enabling the next generation of OLAP, which includes real-time analytics at scale. Your company can’t afford to be left behind by this critical advance in information technology.

This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. We also cover integrating with important AWS technologies like Amazon EMR, Amazon S3 and Amazon Kinesis.

The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface (e.g. Spark SQL and DataFrames). It also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and provides an overview of Spark GraphX (graph processing) and Spark MLlib (machine learning). Finally, the course explores possible performance issues and strategies for optimization.
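
For example, a word count in the Spark shell exercises several of these building blocks in a few lines. This is a minimal sketch (Scala); the input path and file contents are made up:

    // RDD API: load a text file, split it into words, count each word
    val lines = sc.textFile("data/sample.txt")        // hypothetical input file
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.take(10).foreach(println)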

The course is very hands-on, with many labs. Participants will interact with Spark through the Spark shell (for interactive, ad-hoc processing) as well as through programs using the Spark API. Labs currently support Scala; contact us for Python/Java support.

The Apache Spark distributed computing engine is rapidly becoming a primary tool for processing and analyzing large-scale data sets. It has many advantages over existing engines such as Hadoop MapReduce, including runtime speeds that are 10-100x faster and a much simpler programming model. After taking this course, you will be ready to work with Spark in an informed and productive manner.

Spark Training Course Details

Duration: 3 to 5 days

Labs: minimum 50% hands-on lab time

Objectives for Spark Training

  • Principles of Spark
  • RDDs (Resilient Distributed Datasets)
  • Spark SQL
  • Importing data into Spark
  • Understanding Spark Clustering
  • Spark and JSON import
  • Spark Streaming
  • Spark and Cassandra integration
  • Spark and Kafka integration
  • Spark/EMR and Kinesis integration
  • Spark and S3 integration

Prerequisites for Spark Training:

Reasonable programming experience is required. An overview of Scala is provided for those who don’t know it. Basic knowledge of Java, Scala, or Python, along with some familiarity with core CS concepts and databases, is helpful. Experience with distributed data grids or Hadoop is a plus, but not required.

Spark Knowledge and Skills Gained:

  • Understand the need for Spark in data processing
  • Understand the Spark architecture and how it distributes computations to cluster nodes
  • Be familiar with basic installation / setup / layout of Spark
  • Use the Spark shell for interactive and ad-hoc operations
  • Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
  • Understand/use RDD ops such as map(), filter(), reduce(), groupByKey(), join(), etc. (see the sketch after this list)
  • Understand Spark’s data caching and its usage
  • Write/run standalone Spark programs with the Spark API
  • Use Spark SQL / DataFrames to efficiently process structured data
  • Use Spark Streaming to process streaming (real-time) data
  • Understand performance implications and optimizations when using Spark
  • Be familiar with Spark GraphX and MLlib
  • Optional Day 4: Run Spark in Amazon EMR and understand how Spark works in the Hadoop ecosystem.
  • Optional Day 4: Use Spark with Apache Zeppelin
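
As a taste of the RDD operations listed above, here is a minimal sketch that runs in the Spark shell (Scala); the data is made up:

    val nums    = sc.parallelize(1 to 10)
    val evens   = nums.filter(_ % 2 == 0)             // filter()
    val squares = evens.map(n => n * n)               // map()
    val total   = squares.reduce(_ + _)               // reduce(): 4+16+36+64+100 = 220

    val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 5.00), (2, 3.50)))
    val joined = users.join(orders)                   // join(): e.g. (1, ("ann", 9.99))
    val byUser = orders.groupByKey()                  // groupByKey(): e.g. (1, [9.99, 5.0])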

Spark Training Outline

Session 1 (Optional): Scala Ramp Up

  • Scala Introduction, Variables, Data Types, Control Flow
  • The Scala Interpreter
  • Collections and their Standard Methods (e.g. map())
  • Functions, Methods, Function Literals
  • Class, Object, Trait
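
A few lines of plain Scala (no Spark required) touching the ramp-up topics above:

    // Variables and type inference
    val greeting: String = "hello"
    var count = 0                                     // mutable; type Int is inferred

    // Collections and their standard methods, e.g. map()
    val doubled = List(1, 2, 3).map(_ * 2)            // List(2, 4, 6)

    // A function literal assigned to a value
    val isEven = (n: Int) => n % 2 == 0

    // Trait and class
    trait Greeter { def greet(name: String): String = s"hi, $name" }
    class FriendlyGreeter extends Greeter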

Session 2: Introduction to Spark

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell

Session 3: RDDs and Spark Architecture

  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs - Creating and Transforming (map, filter, etc.)
  • Key-Value Pairs - Definition, Creation, and Operations
  • Caching - Concepts, Storage Levels, Guidelines (see the sketch below)
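
A sketch of the key-value and caching ideas from this session (Scala); the log file and its layout are hypothetical:

    val pairs = sc.textFile("logs/access.log")        // hypothetical log file
      .map(line => (line.split(" ")(0), 1))           // key by the first field, e.g. an IP address
      .cache()                                        // keep the computed partitions in memory

    val hits = pairs.reduceByKey(_ + _)               // transformation: lazy, nothing runs yet
    hits.take(5)                                      // action: triggers evaluation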

Session 4: Spark API

  • Overview, Basic Driver Code, SparkConf
  • Creating and Using a SparkContext
  • RDD API
  • Building and Running Applications
  • Application Lifecycle
  • Cluster Managers
  • Logging and Debugging
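
A minimal standalone driver in the SparkConf/SparkContext style covered in this session; the app name and master setting are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCountApp")
          .setMaster("local[*]")                      // placeholder; normally supplied by the cluster manager
        val sc = new SparkContext(conf)
        try {
          sc.textFile(args(0))                        // input path from the command line
            .flatMap(_.split("\\s+"))
            .map((_, 1))
            .reduceByKey(_ + _)
            .saveAsTextFile(args(1))                  // output directory
        } finally {
          sc.stop()
        }
      }
    }

Packaged as a jar and launched with spark-submit, this exercises the full application lifecycle.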

Session 5: Spark SQL

  • Introduction and Usage
  • DataFrames and SQLContext
  • Working with JSON
  • Querying - The DataFrame DSL, and SQL
  • Data Formats
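
A Spark SQL sketch in the SQLContext style named above (Scala); the JSON file and its fields are made up:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)               // sc: an existing SparkContext
    val people = sqlContext.read.json("data/people.json")   // hypothetical JSON file

    // The DataFrame DSL
    people.select("name", "age").filter(people("age") > 21).show()

    // Plain SQL over the same data
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()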

Session 6: Spark Streaming

  • Overview and Streaming Basics
  • DStreams (Discretized Streams)
  • Architecture, Stateless, Stateful, and Windowed Transformations
  • Spark Streaming API
  • Programming and Transformations
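
A minimal DStream pipeline of the kind built in this session (Scala); the socket source is a placeholder for test input such as netcat:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))    // 5-second batches
    ssc.checkpoint("checkpoint/")                     // required for stateful transformations

    val words = ssc.socketTextStream("localhost", 9999)   // placeholder source
      .flatMap(_.split("\\s+"))
      .map((_, 1))

    // Windowed transformation: counts over the last 30s, updated every 10s
    words.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)).print()

    ssc.start()
    ssc.awaitTermination()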

Session 7: Performance Characteristics and Tuning

  • The Spark UI
  • Narrow vs. Wide Dependencies
  • Minimizing Data Processing and Shuffling
  • Using Caching
  • Using Broadcast Variables and Accumulators
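
Broadcast variables and accumulators, sketched (Scala); the lookup table and codes are made up:

    // Broadcast: ship a read-only lookup table to each executor once
    val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

    // Accumulator: count bad records as a side effect (diagnostic use only,
    // since task retries can re-apply updates)
    val badRecords = sc.accumulator(0L, "bad records")

    val resolved = sc.parallelize(Seq("US", "DE", "XX")).map { code =>
      countryNames.value.getOrElse(code, { badRecords += 1L; "unknown" })
    }
    resolved.count()                                  // run an action first...
    println(badRecords.value)                         // ...then read the accumulator: 1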

Session 8 (Optional): Spark GraphX Overview

  • Introduction
  • Constructing Simple Graphs
  • GraphX API
  • Shortest Path Example
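
A tiny graph with GraphX’s ShortestPaths helper (Scala); the vertices and edges are made up:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.graphx.lib.ShortestPaths

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph    = Graph(vertices, edges)

    // Hop-count distances from every vertex to the landmark vertex 3
    val result = ShortestPaths.run(graph, Seq(3L))
    result.vertices.collect().foreach(println)        // e.g. (1, Map(3 -> 2))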

Session 9 (Optional): MLlib Overview

  • Introduction
  • Feature Vectors
  • Clustering / Grouping, K-Means
  • Recommendations
  • Classifications
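
K-Means clustering with MLlib’s RDD-based API (Scala); the 2-D points are made up:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),   // one cluster near the origin
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)    // another far away
    )).cache()

    val model = KMeans.train(points, 2, 20)           // k = 2, up to 20 iterations
    model.clusterCenters.foreach(println)
    println(model.predict(Vectors.dense(0.2, 0.2)))   // index of the nearby cluster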

Session 10 (Optional): AWS EMR

  • Understanding Hadoop on AWS EMR
  • Relationship of Spark to Hadoop on EMR
  • Importing data from S3
  • Setting up a new EMR cluster
  • Building a new cluster and initializing data with EMR steps
  • Using Zeppelin to visualize and prototype with Spark
  • Debugging
  • Using Hive
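
On EMR, S3 data is read directly through Spark’s file APIs. A minimal sketch (Scala); the bucket and prefix are placeholders:

    // EMRFS resolves s3:// URIs, so S3 behaves like any other filesystem
    val logs = sc.textFile("s3://my-training-bucket/data/logs/")
    println(logs.count())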

Session 11 (Optional): Using Cassandra and Spark SQL together

  • Introduction to Cassandra
  • Using Cassandra from Spark
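
A read/write sketch, assuming the DataStax spark-cassandra-connector is on the classpath and spark.cassandra.connection.host is set; the keyspace, tables, and columns are made up:

    import com.datastax.spark.connector._             // assumed dependency

    val rows = sc.cassandraTable("store", "orders")   // hypothetical keyspace/table
    val totals = rows
      .map(r => (r.getString("customer"), r.getDouble("amount")))
      .reduceByKey(_ + _)
    totals.saveToCassandra("store", "order_totals", SomeColumns("customer", "amount"))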

Session 12 (Optional): Using Kafka and Spark Streaming together

  • Introduction to Kafka
  • Using Kafka from Spark
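
A direct-stream sketch, assuming the spark-streaming-kafka (Kafka 0.8) integration; the broker and topic are placeholders, and ssc is a StreamingContext as in Session 6’s sketch:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // placeholder broker
    val topics = Set("events")                                          // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2)                                  // drop keys, keep message values
      .count()
      .print()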

Session 13 (Optional): Using Kinesis and Spark Streaming together

  • Introduction to Kinesis
  • Using Kinesis from Spark
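
A receiver-based sketch using the spark-streaming-kinesis-asl helper; the app name, stream, endpoint, and region are placeholders, and the exact signature varies by Spark version:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

    val stream = KinesisUtils.createStream(
      ssc, "spark-training-app", "events-stream",               // app name and Kinesis stream
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",   // endpoint and region
      InitialPositionInStream.LATEST, Seconds(5),               // start position, checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    stream.map(bytes => new String(bytes))            // records arrive as Array[Byte]
      .print()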

Check out all of our SMACK training