Introduction to Hadoop & hands on with Hive and Pig


Course code: 
Time Unit: 

This course gives the necessary insight into Apache Hadoop and the hands-on knowledge to get started working with Apache Hive and Pig.
The course is the first part of our Big Data track.
The theoretical background on Big Data and Hadoop is explained from the ground up, requiring no prior knowledge of these topics.

Since Hadoop consists of a complete ecosystem of interrelated tools, this course gives a clear insight into the most important ones, and the situations in which they tend to fit.

Practically, this course gives a head start on development with the most important Hadoop tools:

- MapReduce: the low-level layer (Java API) for parallel processing in Apache Hadoop
- Hive: the SQL-like querying infrastructure for semi-structured data
- Pig: the tool and the Pig Latin scripting language for data processing
- Sqoop: the SQL-to-Hadoop tool for migrating data to and from a Hadoop environment

Learning objectives:
- Explain the nature of Big Data systems and the difference with typical IT systems
- Understand the concepts behind Big Data, semi-structured data and Hadoop
- Understand the infrastructure related to Apache Hadoop and HDFS
- Understand the MapReduce pattern and how it can be applied to solve problems in a parallel processing setup
- Grasp the role of each tool in the Apache Hadoop ecosystem, and the situations in which they best fit
- Query semi-structured data with HiveQL via Apache Hive
- Extend Apache Hive with User Defined Functions (UDFs) and custom SerDes
- Process semi-structured data with Apache Pig
- Extend Apache Pig with User Defined Functions (UDFs)
- Use Apache Sqoop to migrate data back and forth between relational (SQL) databases and Apache HDFS
- Build data processing pipelines with Hive, Pig and MapReduce



CHAPTER 1: Introduction to Big Data
- The problem with traditional IT systems
- What is Big Data?
- Example Big Data use cases
- Hadoop in Enterprise Applications

CHAPTER 2: Introduction to Hadoop
- Concepts
- MapReduce
- Hadoop Tooling Ecosystem: Pig, Hive, Sqoop, Oozie, HBase, ...

CHAPTER 3: HDFS – The Hadoop Distributed File System
- Concepts
--- Distributed Filesystem characteristics
--- NameNode
--- SecondaryNameNode
--- DataNode
--- Replication
--- Balancing
- Command line access
- Java API
- Exercises on HDFS

CHAPTER 4: Hive
- Concepts: Tables, External Tables, HiveQL, SerDes
- Creating tables
- Inserting data
- Moving data files
- Creating external tables
- Usage scenarios for Hive
- Extending Hive via UDFs
- Exercises on Hive
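As a taste of the UDF extension point listed above: a Hive UDF is, at its core, a Java class exposing an evaluate method that Hive calls once per row. The sketch below shows only that core logic in plain Java; in a real UDF the class would extend org.apache.hadoop.hive.ql.exec.UDF (with hive-exec on the classpath) and typically work with Hadoop Text objects rather than String. The class and example values here are illustrative, not part of any Hive API.

```java
// Illustrative core of a Hive UDF that extracts the domain from an e-mail address.
// A real UDF would extend org.apache.hadoop.hive.ql.exec.UDF; plain String is used
// here so the logic stands alone without Hive on the classpath.
public class EmailDomainSketch {
    // Mirrors the evaluate(...) method Hive would invoke once per row.
    public static String evaluate(String email) {
        if (email == null) return null;               // a UDF must tolerate NULL input
        int at = email.indexOf('@');
        return at < 0 ? null : email.substring(at + 1);
    }
}
```

In Hive, the compiled jar would be registered with ADD JAR and exposed with CREATE TEMPORARY FUNCTION before being used in a HiveQL query.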

CHAPTER 5: Pig
- Concepts: Tuples, Bags, …
- Loading data
- Pig Latin: Manipulating data: sort, group, …
- Extending Pig via UDFs
- Exercises on Pig
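To make the tuple and bag concepts concrete: a Pig relation is a bag of tuples, and a Pig Latin GROUP statement produces, for each key, a bag of the matching tuples. The plain-Java sketch below (purely illustrative, with no Pig dependency; the class name is ours) models a tuple as an ordered list of fields and mimics grouping a relation by its first field.

```java
import java.util.*;

// Illustrative plain-Java model of Pig's data model:
// a tuple is an ordered list of fields, a bag is a collection of tuples.
public class PigGroupSketch {
    // Mimics "grouped = GROUP rel BY $0;" — maps each key to the bag of its tuples.
    public static Map<Object, List<List<Object>>> groupByFirstField(List<List<Object>> bag) {
        Map<Object, List<List<Object>>> grouped = new LinkedHashMap<>();
        for (List<Object> tuple : bag)
            grouped.computeIfAbsent(tuple.get(0), k -> new ArrayList<>()).add(tuple);
        return grouped;
    }
}
```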

CHAPTER 6: MapReduce
- MR Concepts
--- Mapper, Reducer, Job, Driver, Counters
--- InputFormat, OutputFormat, Combiner, Partitioner, Input Splits
- YARN Concepts: Resource Manager, Node Manager
- Map-only jobs
- Java MapReduce 2 API
--- Overview
--- WordCount example explained
- Developing MapReduce jobs
--- Custom Writables
--- Running MapReduce jobs locally
--- Running MapReduce jobs on a mini-cluster
--- Running MapReduce jobs on a cluster
--- Unit testing MapReduce jobs with MRUnit
--- Debugging MapReduce programs
- A glimpse of MapReduce Design Patterns
- Exercises on MapReduce: writing a MapReduce job with MRUnit tests
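WordCount, mentioned above, is the canonical first MapReduce job. As a hedged preview of the pattern, the sketch below simulates the map, shuffle and reduce phases with plain Java collections and no Hadoop dependency; the real job covered in this chapter would instead implement Mapper and Reducer from the MapReduce 2 (org.apache.hadoop.mapreduce) API and run on a cluster.

```java
import java.util.*;

// In-memory illustration of the MapReduce pattern behind WordCount.
// No Hadoop classes are used; this only mimics the three phases.
public class WordCountSketch {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word in every input line.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    emitted.add(Map.entry(word, 1));

        // Shuffle phase: group the emitted values by key (the word).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: sum the grouped counts per word.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }
}
```

In the real API, the map and reduce phases become the map() and reduce() methods of Mapper and Reducer subclasses, while the shuffle is performed by the framework between the two.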

CHAPTER 7: Sqoop: SQL-to-Hadoop
- Moving data from SQL to Hadoop
- Incremental imports
- Moving data from Hadoop to SQL
- Exercises on Sqoop



Prerequisites:
- Java programming skills (JDK, Eclipse, Maven) are useful for the hands-on Hadoop work, but not strictly required, and certainly not necessary when you are mainly interested in Hive and Pig
- Some experience working with the Linux command line
- Some knowledge of SQL syntax is very useful, but not strictly required



Target audience:
This course is aimed at developers seeking insight into Big Data concepts and a first introduction to data processing in Apache Hadoop with MapReduce, Hive and Pig.