Apache Spark Training - Onsite, Instructor-led
Running with Hadoop, Zeppelin and Amazon Elastic MapReduce (AWS EMR)
Integrating Spark with Amazon Kinesis, Kafka and Cassandra
This three- to five-day Spark training course introduces experienced developers and architects to Apache Spark™. It equips developers to build real-world, high-speed, real-time analytics systems and includes extensive hands-on examples. The idea is to introduce the key concepts that make Apache Spark™ such an important technology. The course should prepare architects, development managers, and developers to understand the possibilities with Apache Spark™.
Apache Spark is a fast-growing library and framework that enables advanced data analytics with its open source cluster computing system. Apache Spark’s rapid success is due to its power and simplicity. It is productive and much faster than typical MapReduce-based analysis. It puts the power of Hadoop, Big Data and real-time analytics into the hands of mere mortal developers. Spark supports Scala, Java and Python. The course has examples in all three languages, including use of the REPL for Python and Scala, in addition to full labs in Scala and Python. This course covers Spark SQL and Spark Streaming, with an introduction to GraphX and MLlib.
Spark is enabling the next generation of OLAP, which includes real-time analytics at scale. Your company can’t afford to be left behind by this critical advance in information technology.
This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. We also cover integrating with important AWS technologies like Amazon EMR, Amazon S3 and Amazon Kinesis.
The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface (e.g. Spark SQL and DataFrames). It also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and provides an overview of Spark GraphX (graph processing) and Spark MLlib (machine learning). Finally, the course explores possible performance issues and strategies for optimization.
The course is very hands-on, with many labs. Participants will interact with Spark through the Spark shell (for interactive, ad-hoc processing) as well as through programs using the Spark API. Labs currently support Scala; contact us for Python/Java support.
The Apache Spark distributed computing engine is rapidly becoming a primary tool for processing and analyzing large-scale data sets. It has many advantages over existing engines such as Hadoop MapReduce, including runtime speeds that can be 10-100x faster and a much simpler programming model. After taking this course, you will be ready to work with Spark in an informed and productive manner.
Spark Training Course Details
Duration: 3 to 4 days
Labs: Minimum 50% hands-on labs
Objectives for Spark Training
- Principles of Spark
- RDDs (Resilient Distributed Datasets)
- Spark SQL
- Importing data into Spark
- Understanding Spark Clustering
- Spark and JSON import
- Spark Streaming
- Spark and Cassandra integration
- Spark and Kafka integration
- Spark/EMR and Kinesis integration
- Spark and S3 integration
Prerequisites for Spark Training:
Reasonable programming experience. An overview of Scala is provided for those who don’t know it. Basic knowledge of Java, Scala or Python with some knowledge of core CS concepts and databases would be helpful. Experience with distributed data grids or Hadoop would be a plus, but not required.
Spark Knowledge and Skills Gained:
- Understand the need for Spark in data processing
- Understand the Spark architecture and how it distributes computations to cluster nodes
- Be familiar with basic installation / setup / layout of Spark
- Use the Spark shell for interactive and ad-hoc operations
- Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
- Understand/use RDD ops such as map(), filter(), reduce(), groupByKey(), join(), etc.
- Understand Spark’s data caching and its usage
- Write/run standalone Spark programs with the Spark API
- Use Spark SQL / DataFrames to efficiently process structured data
- Use Spark Streaming to process streaming (real-time) data
- Understand performance implications and optimizations when using Spark
- Be familiar with Spark GraphX and MLlib
- Optional Day 4: Run Spark in Amazon EMR and understand how Spark works in the Hadoop ecosystem.
- Optional Day 4: Use Spark with Apache Zeppelin
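To give a feel for the RDD operations named in the list above, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API: it only imitates, on small in-memory lists, what map(), filter(), reduce(), groupByKey() and join() compute. The sample data and the helper names group_by_key and join are hypothetical, chosen for illustration; in Spark the same operations run across the partitions of a distributed dataset.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key, in the spirit of groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def join(left, right):
    """Inner join of two (key, value) lists, in the spirit of join()."""
    right_groups = group_by_key(right)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_groups.get(key, [])
    ]


# Hypothetical sample data: (user, amount) purchases and (user, city) lookups.
purchases = [("ann", 30), ("bob", 5), ("ann", 12), ("cy", 50), ("bob", 8)]
cities = [("ann", "Austin"), ("bob", "Boston")]

# map(): transform every element (dollars to cents).
in_cents = list(map(lambda kv: (kv[0], kv[1] * 100), purchases))

# filter(): keep only the elements that pass a test.
large = list(filter(lambda kv: kv[1] >= 10, purchases))

# reduce(): combine all elements into a single value.
total = reduce(lambda a, b: a + b, (amount for _, amount in purchases))

# groupByKey(): gather the values for each key.
by_user = group_by_key(purchases)

# join(): pair up values from two datasets that share a key.
with_city = join(purchases, cities)

print(total)  # 105
print(by_user)
print(with_city)
```

Note that "cy" has no entry in the lookup data, so the inner join drops that record, which is also how an inner join on keyed data behaves in general.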
Spark Training Outline
Session 1 (Optional): Scala Ramp Up
Session 2: Introduction to Spark
Session 3: RDDs and Spark Architecture
Session 4: Spark API
Session 5: Spark SQL
Session 6: Spark Streaming
Session 7: Performance Characteristics and Tuning
Session 8 (Optional): Spark GraphX Overview
Session 9 (Optional): MLlib Overview
Session 10 (Optional): AWS EMR
Session 11 (Optional): Using Cassandra and Spark SQL together
Session 12 (Optional): Using Kafka and Spark Streaming together
Session 13 (Optional): Using Kinesis and Spark Streaming together