# Data science and machine learning in Big Data

## Overview

This course gives an overview of a typical data science process in a Big Data setting. It covers the whole track, from collecting your data and building prediction models to presenting and integrating your final results.

The course is the second part of our Big Data track.

The data science process is explained from the ground up, and the math behind the models is left out as much as possible (if you are interested, see the course “The math behind data science”).

Since there are far more good machine learning libraries than one course can cover, this course focuses on a selection. In the Big Data setting, you will gain hands-on knowledge of Mahout and Spark MLlib.

In practice, the course walks through each step of the data science process in a Big Data setting:

- Collect (+ cleaning and sampling)

- Describe

- Discover

- Predict

- Advise

Use cases covering the entire cycle will be provided.

**Learning objectives:**

- Explain the data science process and how it differs from typical IT development cycles

- Get an idea of what machine learning can do for you

- Understand the difference between data science and machine learning in general and in a Big Data setting

- Use Apache Mahout to create recommendations or prediction models

- Use Spark MLlib to create recommendations or prediction models

## Topics

CHAPTER 1: Introduction to data science

- What is data science?

- The data science process

- Data science use cases

CHAPTER 2: Introduction to machine learning

- What is machine learning?

- A quick overview of some algorithms

- Machine learning use cases

CHAPTER 3: Introduction to Big Data

- What is Big Data?

- Machine learning and data science in Big Data

- Big Data use cases

CHAPTER 4: The data science process: pre-modeling

- Collect and clean your data

- To sample or not to sample

- Summary statistics and plots

- Exploratory data analysis

- Hands-on exercise

CHAPTER 5: The data science process: prediction models

- Prediction models on samples

- Prediction models with Mahout

- Hands-on exercises with Mahout

- Prediction models with Spark MLlib

- Hands-on exercises with Spark MLlib

CHAPTER 6: The data science process: Advise

- How to use your actual predictions

- Integration in applications

CHAPTER 7: The data science process: the entire cycle

- Example use case

- Use case as exercise

## Prerequisites

- The first part of the Big Data track “Big Data: introduction to Hadoop & hands on with Hive and Pig” is recommended, but not required

- Some experience working with the Linux command line

- Some basic understanding of prediction models

## Audience

This course is aimed at developers seeking insight into prediction models, machine learning and the entire data science process.

It is also aimed at data scientists and BI personnel who want to build their prediction models with Big Data tools.