SparkSQL and DataFrames with PySpark
July 31 @ 9:30 am - 12:00 pm
Modern Languages Building (MLB), Room 2001A
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. Industry has quickly adopted Spark and deployed it at scale for processing big data. Its main advantages include in-memory processing and a rich set of operations for wrangling data using DataFrames.
In this workshop, we’ll introduce attendees to SparkSQL and DataFrames for basic data manipulation, file I/O, and SQL querying. Spark has language bindings for R, Python, Scala, and Java. We’ll be using PySpark (the Python API) in our workshop. The workshop is intended for users with INTERMEDIATE knowledge of R, Python, or a comparable language. Attendees should be familiar with DataFrames in Python (pandas) or R (dplyr). Attendees will NEED to have a Flux account beforehand to participate.