Large Scale Data Engineering

This course is taught in English. Descriptions may therefore be available in English only.

Course objective

The goal of the course is to gain insight into and experience with
algorithms and infrastructures for managing big data.

Course content

This course confronts the students with some data management tasks,
where the challenge is that the sheer size of the data causes naive
solutions, and/or solutions that work only on a single machine, to stop
being practical. Solving such tasks requires the computer scientist to
have insight into the main factors that underlie algorithm performance
(data access patterns, hardware latency/bandwidth), as well as to possess
certain skills and experience in managing large-scale computing
infrastructure. Apart from the data being of large volume, another
recurring problem is that data comes in strange forms and formats, is
polluted, and needs to be transformed and cleaned. The main part of the
course is the second assignment: a large big-data analysis project where
each student team tackles a different problem, and while doing so gains
experience in multiple aspects of large-scale data engineering (critical
thinking, data management technologies, visualization techniques, paper

More information is found on the course website; also check the
"showcase" section, where you can see past project results
(visualizations and papers).


There are two lectures per week, and the course requires significant
practical work. The practicals are done outside lecture hours, at the
discretion of the students, who are supported remotely through Skype.


In the first assignment, the students can work either on their own
laptops via a prepared VM, or in the cloud using an Amazon EC2 Micro
Instance; there is an online competition between practicum teams for
the best result. The second assignment is done on the SurfSARA Hadoop
cluster (90 machines, 720 cores, 1.2PB storage). For this assignment, a
report of 5-8 pages must be written. The students also need to read two
scientific papers of their choice, related to the second assignment.
There is no written exam; the grade is based on the grades for the two
assignments and the grade for the in-class presentation.

Required prior knowledge

Hadoop environments consist of Linux machines, so some basic ability in
working with these comes in handy. You must also have some programming
skills in C, C++ or Java.


The course website provides all this
information. On the website, each lecture is provided, and its
main points are also summarized. Further, each lecture page provides links to
useful videos and presentations, but also the scientific papers to be



Recommended prior knowledge

Programming proficiency in C/C++ or Java

General information

Course code: X_405116
Credits: 6 EC
Period: P1
Course level: 500
Language of instruction: English
Faculty: Faculteit der Bètawetenschappen (Faculty of Science)
Course coordinator: prof. dr. P.A. Boncz
Examiner: prof. dr. P.A. Boncz
Lecturers: prof. dr. P.A. Boncz

Practical information

You must register for this course yourself.

Last-minute registration is available for this course.

Teaching methods: Lecture

This course is also available as: