Large Scale Data Engineering


Course Objective

The goal of the course is to gain insight into and experience with
algorithms and infrastructures for managing big data.

Course Content

This course confronts students with data management tasks where the
challenge is that the sheer size of the data causes naive solutions,
and/or solutions that work only on a single machine, to stop being
practical. Solving such tasks requires the computer scientist to have
insight into the main factors that underlie algorithm performance (data
access patterns, hardware latency/bandwidth), as well as skills and
experience in managing large-scale computing infrastructure. Apart from
the large volume of the data, another problem is invariably that data
comes in strange forms and formats, is polluted, and needs to be
transformed and cleaned. The main part of the course is the second
assignment: a large big data analysis project in which each student team
tackles a different problem, and while doing so gains
experience in multiple aspects of large-scale data engineering (critical
thinking, data management technologies, visualization techniques, paper

More information is found on - also check the
"showcase" section where you can see past project results
(visualizations and papers).

Teaching Methods

There are two lectures per week, and the course requires significant
practical work. The practicals are done outside lecture hours, at the
discretion of the students, who are supported remotely through Slack,

Method of Assessment

In the first assignment (writing a hand-coded program that performs graph
analysis on a single machine), the students can work either on their own
laptops via a prepared VM or in the cloud using an Amazon EC2 Micro
Instance, and there is an online competition between practicum teams for
the best result. The second assignment is done on large datasets (TB
scale) using Spark on a cluster. For this assignment, a report of 5-8
pages must be written. The students also need to read two scientific
papers of their choice, related to the second assignment. There is no
written exam; the grade is based on the two assignment grades and the
grade for the in-class presentation.

Entry Requirements

Programming experience with Java as well as with C or C++.

Theoretical knowledge of database systems, computer architectures, and
operating systems.

Experience with Linux (to work with Hadoop environments) and SQL.


The course website provides all this
information. On this website, each lecture is provided along with a
summary of its main points. Further, each lecture page provides links to
useful videos and presentations, but also the scientific papers to be

Target Audience


Recommended background knowledge

Knowing how to work with a debugger (gdb for C/C++) and git (GitHub).

General Information

Course Code: X_405116
Credits: 6 EC
Period: P1
Course Level: 500
Language of Tuition: English
Faculty: Faculty of Science
Course Coordinator: prof. dr. P.A. Boncz
Examiner: prof. dr. P.A. Boncz
Teaching Staff: prof. dr. P.A. Boncz

Practical Information

You need to register for this course yourself.

Last-minute registration is available for this course.

Teaching Methods: Lecture