Hadoop and Spark Workshop

Overview

Learn how to process large amounts of data (up to terabytes) using SQL and/or simple programming models available in Python, Scala, and Java. Computers will be provided for following along with the hands-on examples; participants may also bring their own laptops.
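
As a taste of what the hands-on examples look like, the sketch below loads a CSV file with PySpark and answers the same question twice, once through the DataFrame API and once through Spark SQL. The file name and column names ("events.csv", "user", "bytes") are illustrative assumptions, not course materials.

    # Minimal PySpark sketch: query a CSV file with the DataFrame API and Spark SQL.
    # The path and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("workshop-sketch").getOrCreate()

    # Load a CSV file into a DataFrame, inferring column types from the data.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # "Simple programming model": aggregate with the DataFrame API.
    df.groupBy("user").sum("bytes").show()

    # The same question expressed in SQL against a temporary view.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user, SUM(bytes) AS total_bytes FROM events GROUP BY user").show()

    spark.stop()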

Prerequisites

Intro to the Linux Command Line or equivalent. This course assumes familiarity with the Linux command line.

A user account on Flux. If you do not have a Flux user account, apply for one at the account application page: https://arc-ts.umich.edu/fluxform/

Duo authentication.

Duo two-factor authentication is required to log in to the cluster. When logging in, you will need to type your UMICH password as well as authenticate through Duo in order to access Flux.

If you need to enroll in Duo, follow the instructions at Getting Started: How to Enroll in Duo.

Hadoop queue membership. If you did not ask to be put on the training Hadoop queue when applying for a Flux user account, send an email to hpc-support@umich.edu asking to be put on the training queue.

Instructor

Brock Palen
Associate Director
ARC-TS

Brock has over 10 years of high-performance computing and data-intensive computing experience in an academic environment. He currently works with the team at ARC-TS to provide HPC, data science, storage, and other research computing services to the University. Brock is also the Campus Champion for the NSF XSEDE project, representing the university to this and other national computing infrastructures and organizations.

Materials

Course Preparation

To participate successfully in the class exercises, you must have a Flux user account, Duo two-factor authentication, and membership in a Hadoop queue. The user account allows you to log in to the cluster; create, compile, and test applications; and transfer data into Hadoop’s filesystem for processing. The Hadoop queue allows you to submit jobs that run those applications in parallel on the cluster.
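
As a rough illustration of that workflow, the sketch below copies a local file into HDFS and submits a Spark application under a named queue by shelling out to the standard hdfs and spark-submit commands. The paths, the "training" queue name, and the wordcount.py script are placeholders, not files distributed with the course.

    # Hypothetical sketch: stage data in HDFS and run a Spark job on a queue.
    import subprocess

    # Copy input data from the login node into Hadoop's filesystem (HDFS).
    # "/user/uniqname/" stands in for your own HDFS home directory.
    subprocess.run(
        ["hdfs", "dfs", "-put", "input.txt", "/user/uniqname/input.txt"],
        check=True,
    )

    # Submit a Spark application to run in parallel on the cluster,
    # charged to the training queue. wordcount.py is a hypothetical script.
    subprocess.run(
        ["spark-submit", "--master", "yarn", "--queue", "training",
         "wordcount.py", "/user/uniqname/input.txt"],
        check=True,
    )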

Flux user account

A single Flux user account can be used to prepare and submit jobs using various allocations. If you already possess a user account, you can use it for this course and skip ahead to “Hadoop queue” below. If not, please visit https://arc-ts.umich.edu/fluxform to obtain one. A user account is free to members of the University community. Please note that obtaining an account requires human processing, so be sure to do this at least two business days before class begins.

Hadoop queue

We’ll add you to the training queue so you can run jobs on the cluster during the course. If you already have a Hadoop queue of your own, you can use that instead if you like.

Duo Authentication

Duo two-factor authentication is required to log in to the cluster. When logging in, you will need to type your UMICH password as well as authenticate through Duo in order to access Flux.

If you need to enroll in Duo, follow the instructions at Getting Started: How to Enroll in Duo.

More help

Please email hpc-support@umich.edu for questions, comments, or to seek further assistance.

New Data Science Computing Platform Available to U-M Researchers

Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.

The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.

The following functionalities are immediately available:

  • Structured databases: MySQL/MariaDB and PostgreSQL.
  • Unstructured databases: Cassandra, MongoDB, InfluxDB, Grafana, and Elasticsearch.
  • Data ingestion: Redis, Kafka, RabbitMQ.
  • Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.

Other types of databases can be created upon request.
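
As an example of what using one of these hosted services might look like from Python, the sketch below connects to a PostgreSQL database with the psycopg2 driver and runs a trivial query. The hostname, database name, and credentials are placeholders; the actual connection details for a provisioned database come from ARC-TS.

    # Hypothetical sketch of connecting to a hosted PostgreSQL database.
    import psycopg2

    # All connection parameters below are placeholders.
    conn = psycopg2.connect(
        host="your-database-host.arc-ts.umich.edu",
        dbname="research",
        user="uniqname",
        password="provided-at-provisioning",
    )

    # Run a trivial query inside a transaction and print the result.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])

    conn.close()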

These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage limits are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact hpc-support@umich.edu.

At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.

ARC-TS also operates a separate data science computing cluster available for researchers using the latest Hadoop components. This cluster also will be expanded in the near future.

MATLAB II

MATLAB is a powerful tool for solving engineering and scientific problems. This session is designed for participants who have some experience with the basic operations but would like to expand their knowledge of handling data. Topics include reading from and writing to files, and manipulating and visualizing data. The session will be held in a computer lab, and participants will be able to work either individually or in small groups on a few practice exercises.

OFFICE HOURS

9:00 a.m. – 5:00 p.m., Monday through Friday
Closed 12:00 – 1:00 p.m. every Tuesday for a staff meeting.
Voice: (734) 764-7828 (4-STAT from a campus phone)
Fax: (734) 647-2440
Email: cscar@umich.edu

ADDRESS

Consulting for Statistics, Computing and Analytics Research (CSCAR)
The University of Michigan
3550 Rackham
915 E. Washington St.
Ann Arbor, MI 48109-1070

MATLAB I

MATLAB is a powerful tool for solving engineering and scientific problems. This session is designed for participants who would like an introduction to MATLAB. The session focuses on the creation and manipulation of arrays and matrices, as well as on the creation of functions. The session will be held in a computer lab, and participants will be able to work either individually or in small groups on a few practice exercises.

Basic Go programming with data (part 2)

This workshop continues our discussion of using the Go language for data processing.  We will introduce concurrent programming in Go, discuss data serialization, and talk through several case studies.

Basic Go programming with data (part 1)

Go (golang.org) is an open-source programming language that can yield very high performance for large-scale data processing applications.  This workshop is an introduction to programming in Go with data.  Participants should have programming experience in some language, but prior exposure to Go is not expected.  We will cover writing a basic Go program, using the Go tools, Go data structures, and reading files.

Info Session: Data Services at U-M

Representatives of Consulting for Statistics, Computing and Analytics Research (CSCAR) and the U-M Library (UML) will give an overview of services that are now available to support data-intensive research on campus.  As part of the U-M Data Science Initiative, CSCAR and UML are expanding their scopes and adding capacity to support a wide range of research involving data and computation.  This includes consulting, workshops, and training designed to meet basic and advanced needs in data management and analysis, as well as specialized support for areas such as remote sensing and geospatial analyses, and a funding program for dataset acquisitions.  Many of these services are available free of charge to U-M researchers.

This event will begin with overview presentations about CSCAR and Library system data services.  There will also be opportunities for researchers to discuss individualized partnerships with CSCAR and UML to advance specific data-intensive projects.  Faculty, staff, and students are welcome to attend.

Numerical computing in Python with NumPy

NumPy is a powerful and widely used array and linear algebra library for Python. We will cover the basics of array manipulation with NumPy, along with selected more advanced topics including broadcasting and type conversion. The workshop assumes an intermediate level of Python programming, but no prior knowledge of NumPy is required.
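
For readers who want a preview, here is a small sketch of the topics named above: array creation and manipulation, broadcasting, and explicit type conversion. The arrays are made-up examples, not workshop datasets.

    # Small NumPy sketch: array creation, broadcasting, and type conversion.
    import numpy as np

    # Array creation and basic manipulation.
    a = np.arange(12).reshape(3, 4)   # 3x4 array of the integers 0..11
    col_means = a.mean(axis=0)        # per-column means, shape (4,)

    # Broadcasting: the length-4 row of means is stretched across all 3 rows.
    centered = a - col_means

    # Explicit type conversion with astype.
    a32 = a.astype(np.float32)

    print(centered)
    print(a32.dtype)   # float32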