Large Scale Data Engineering

2019-2020
This course is taught in English. Descriptions may therefore be available only in English.

Course objective

The goal of the course is to gain insight into and experience with
algorithms and infrastructures for managing big data.

Course content

This course confronts students with data management tasks in which the
sheer size of the data makes naive solutions, and solutions that work
only on a single machine, impractical. Solving such tasks requires the
computer scientist to have insight into the main factors that underlie
algorithm performance (data access patterns, hardware
latency/bandwidth), as well as skills and experience in managing
large-scale computing infrastructure. Apart from the data being large in
volume, another recurring problem is that data comes in strange forms
and formats, is polluted, and needs to be transformed and cleaned. The
main part of the course is the second assignment: a large big data
analysis project in which each student team tackles a different problem
and, while doing so, gains experience in multiple aspects of large-scale
data engineering (critical thinking, data management technologies,
visualization techniques, paper writing).

More information is available at http://event.cwi.nl/lsde. Also check
the "showcase" section there, where you can see past project results
(visualizations and papers).

Teaching methods

There are two lectures per week, and the course requires significant
practical work. The practicals are done outside lecture hours, at the
discretion of the students, who are supported remotely through Slack.

Assessment

In the first assignment (writing a hand-coded program that performs
graph analysis on a single machine), students can work either on their
own laptops via a prepared VM or in the cloud using an Amazon EC2 Micro
Instance. There is an online competition between practicum teams for the
best result. The second assignment is done on large datasets (TB scale)
using Spark on a cluster. For this assignment, a report of 5-8 pages
must be written. Students also need to read two scientific papers of
their choice, related to the second assignment. There is no written
exam; the grade is based on the two assignment grades and the grade for
the in-class presentation.

Required prior knowledge

Programming experience with Java as well as with C or C++.

Theoretical knowledge of database systems, computer architectures, and
operating systems.

Experience with Linux (to work with Hadoop environments) and SQL.

Literature

The course website http://event.cwi.nl/lsde provides all course
material. Each lecture is made available there, together with a summary
of its main points. Each lecture page also links to useful videos and
presentations, as well as to the scientific papers to be read.

Target audience

mCS, mPDCS

Recommended prior knowledge

Knowing how to work with a debugger (gdb for C/C++) and Git (GitHub).

General information

Course code: X_405116
Credits: 6 EC
Period: P1
Course level: 500
Language of instruction: English
Faculty: Faculteit der Bètawetenschappen (Faculty of Science)
Course coordinator: prof. dr. P.A. Boncz
Examiner: prof. dr. P.A. Boncz
Lecturers: prof. dr. P.A. Boncz

Practical information

You must register for this course yourself.

Last-minute registration is possible for this course.

Teaching formats: Lecture