AgileWorks Romania

Spark v2.2 Workshop

This 4-hour training course introduces Apache Spark v2.2, the open-source cluster computing framework whose in-memory processing can make analytics applications up to 100 times faster than technologies in wide deployment today. Versatile across many environments and built on a strong foundation in functional programming, Spark is known for how easily exploratory code can be scaled up to production-grade quality relatively quickly (REPL-driven development).

The main focus will be on what is new in Spark v2.2: Datasets (the compile-time type-safe counterpart of DataFrames), Structured Streaming, and the de-emphasis of RDDs.
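
To make the difference between the typed and untyped APIs concrete, here is a minimal sketch. It is not taken from the workshop material: the Person case class, the sample rows and the local[*] master are assumptions chosen purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetSketch {
  // Typed API: the lambda is checked by the Scala compiler, so a typo
  // such as p.agee would be rejected at compile time.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(p => p.age >= 18)

  // Untyped DataFrame API: the column is looked up by name, so a
  // misspelled name only fails when the query is analyzed at run time.
  def adultsUntyped(people: DataFrame): DataFrame =
    people.filter(people("age") >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]") // assumption: run locally, e.g. on a laptop
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ana", 34), Person("Dan", 17)).toDS()

    adults(people).show()
    adultsUntyped(people.toDF()).show()

    spark.stop()
  }
}
```

Both methods express the same filter; the point of the sketch is only where a mistake would surface, at compile time for the Dataset version and at run time for the DataFrame version.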

The plan is to follow the agenda below, but if participants want to dive deeper into high-complexity topics, I will instead focus on live-coding ad-hoc demos.

1. The first part of the workshop covers Spark SQL with Scala, specifically the limited toy examples emphasized by Spark documentation and tutorials. Used in isolation, Spark SQL is realistically suited only to such didactic use cases. As a practitioner, I know from experience that Spark SQL very quickly shows its limitations when ingesting real-world datasets, so more powerful techniques are needed.

2. The second part of the workshop covers these techniques, without which Spark SQL is largely ineffective. This section is about sharing lessons learned the hard way and experience gathered in the trenches of the real world.

3. The third part of the workshop, titled "Machine Learning By Example", covers multiclass classification using SparkML's Pipeline API with Scala. SparkML is the machine learning module that ships with Spark.

4. During the remaining time, we'll focus on a Scala / Spark Streaming application that ingests data from Apache Kafka (an open-source, high-performance, distributed message queue), performs streaming analytics, then saves the analytics results back into Kafka.
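
As a flavour of what item 3 refers to, the following is a rough sketch of a multiclass classifier built with the Pipeline API. It is not the workshop's actual example: the toy data, the column names and the choice of multinomial logistic regression are all assumptions made for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Fits a three-stage pipeline on made-up data and returns the
  // training rows with a "prediction" column appended.
  def predictions(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical toy data with three classes: "a", "b" and "c".
    val training = Seq(
      (0.1, 0.2, "a"), (0.2, 0.1, "a"),
      (5.0, 5.1, "b"), (5.2, 4.9, "b"),
      (9.8, 0.1, "c"), (10.1, 0.3, "c")
    ).toDF("x", "y", "category")

    // Stage 1: map the string class names to numeric label indices.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("label")

    // Stage 2: pack the numeric columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")

    // Stage 3: multinomial logistic regression as the classifier.
    val classifier = new LogisticRegression()
      .setFamily("multinomial")
      .setRegParam(0.01)
      .setMaxIter(50)

    val pipeline = new Pipeline().setStages(Array(indexer, assembler, classifier))
    pipeline.fit(training).transform(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]") // assumption: run locally
      .getOrCreate()

    predictions(spark).select("category", "label", "prediction").show()
    spark.stop()
  }
}
```

The sketch evaluates on its own training data only to keep it short; a real application would hold out a test set.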

All examples will be in Scala.
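
For item 4, a Kafka-to-Kafka job could look roughly like the sketch below. It assumes Structured Streaming (rather than the older DStream API) and the separate spark-sql-kafka-0-10 connector on the classpath; the broker address, the topic names "events" and "event-counts", the checkpoint path and the counting logic are hypothetical placeholders, not the workshop's application.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object KafkaStreamSketch {
  // The analytics step, kept separate so it can also be tried on a
  // static DataFrame: counts how often each message value occurs and
  // shapes the result into the key/value columns the Kafka sink expects.
  def valueCounts(records: DataFrame): DataFrame =
    records
      .select(col("value").cast("string").as("key")) // Kafka values arrive as binary
      .groupBy("key")
      .count()
      .select(col("key"), col("count").cast("string").as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .master("local[*]") // assumption: run locally
      .getOrCreate()

    // Source: subscribe to a (hypothetical) input topic.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                        // hypothetical topic
      .load()

    // Sink: write the running counts to a (hypothetical) output topic.
    val query = valueCounts(input).writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "event-counts")
      .option("checkpointLocation", "/tmp/kafka-stream-sketch-checkpoint")
      .outputMode("update") // emit only the counts that changed in each trigger
      .start()

    query.awaitTermination()
  }
}
```

Keeping the transformation in its own method is a design choice of this sketch: the same logic can be checked against a small static DataFrame without a running Kafka broker.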

Please bring your laptop with you.

The workshop is free of charge and seating is first come, first served.

The workshop has some requirements. Please consider the following:
1. Bring your own laptop.
2. Have Docker already installed before the workshop.
3. Have the Docker image already pulled and available locally.

Here are the necessary instructions for requirements 2 and 3 (prefix these commands with sudo if required):

Requirement 2, install Docker:
Ubuntu: apt-get -y install
CentOS: yum -y install docker
Linux / Other: curl -fsSL | sh
Mac and Windows:

Requirement 3, pull the Docker image:
docker pull dserban/dockersparknotebook
