Big Data Hadoop and Spark Developer Certification - eLearning

450,00 EUR

30 hours

eLearning

This Big Data Hadoop Certification course is designed to give you in-depth knowledge of the big data framework using Hadoop and Spark. In this hands-on big data course, you will execute real-life, industry-based projects using Simplilearn’s integrated labs. Prerequisites: It is recommended that you have knowledge of Core Java and SQL.

Course timeline

Introduction to Big Data and Hadoop
Lesson 01
- Introduction to Big Data and Hadoop
- Introduction to Big Data
- Big Data Analytics
- What is Big Data?
- Four Vs of Big Data
- Case Study: Royal Bank of Scotland
- Challenges of Traditional System
- Distributed Systems
- Introduction to Hadoop
- Components of Hadoop Ecosystem Part One
- Components of Hadoop Ecosystem Part Two
- Components of Hadoop Ecosystem Part Three
- Commercial Hadoop Distributions
- Demo: Walkthrough of Simplilearn Cloudlab
- Key Takeaways
- Knowledge Check
Hadoop Architecture, Distributed Storage (HDFS), and YARN
Lesson 02
- Hadoop Architecture, Distributed Storage (HDFS), and YARN
- What is HDFS
- Need for HDFS
- Regular File System vs HDFS
- Characteristics of HDFS
- HDFS Architecture and Components
- High Availability Cluster Implementations
- HDFS Component File System Namespace
- Data Block Split
- Data Replication Topology
- HDFS Command Line
- Demo: Common HDFS Commands
- Practice Project: HDFS Command Line
- YARN Introduction
- YARN Use Case
- YARN and Its Architecture
- Resource Manager
- How Resource Manager Operates
- Application Master
- How YARN Runs an Application
- Tools for YARN Developers
- Demo: Walkthrough of Cluster Part One
- Demo: Walkthrough of Cluster Part Two
- Key Takeaways
- Knowledge Check
- Practice Project: Hadoop Architecture, Distributed Storage (HDFS), and YARN
Data Ingestion into Big Data Systems and ETL
Lesson 03
- Data Ingestion Into Big Data Systems and ETL
- Data Ingestion Overview Part One
- Data Ingestion Overview Part Two
- Apache Sqoop
- Sqoop and Its Uses
- Sqoop Processing
- Sqoop Import Process
- Sqoop Connectors
- Demo: Importing and Exporting Data from MySQL to HDFS
- Practice Project: Apache Sqoop
- Apache Flume
- Flume Model
- Scalability in Flume
- Components in Flume’s Architecture
- Configuring Flume Components
- Demo: Ingest Twitter Data
- Apache Kafka
- Aggregating User Activity Using Kafka
- Kafka Data Model
- Partitions
- Apache Kafka Architecture
- Demo: Setup Kafka Cluster
- Producer Side API Example
- Consumer Side API
- Consumer Side API Example
- Kafka Connect
- Demo: Creating Sample Kafka Data Pipeline Using Producer and Consumer
- Key Takeaways
- Knowledge Check
- Practice Project: Data Ingestion Into Big Data Systems and ETL
Distributed Processing MapReduce Framework and Pig
Lesson 04
- Distributed Processing MapReduce Framework and Pig
- Distributed Processing in MapReduce
- Word Count Example
- Map Execution Phases
- Map Execution Distributed Two Node Environment
- MapReduce Jobs
- Hadoop MapReduce Job Work Interaction
- Setting Up the Environment for MapReduce Development
- Set of Classes
- Creating a New Project
- Advanced MapReduce
- Data Types in Hadoop
- Output Formats in MapReduce
- Using Distributed Cache
- Joins in MapReduce
- Replicated Join
- Introduction to Pig
- Components of Pig
- Pig Data Model
- Pig Interactive Modes
- Pig Operations
- Various Relations Performed by Developers
- Demo: Analyzing Web Log Data Using MapReduce
- Demo: Analyzing Sales Data and Solving KPIs Using Pig
- Practice Project: Apache Pig
- Demo: Wordcount
- Key Takeaways
- Knowledge Check
- Practice Project: Distributed Processing - MapReduce Framework and Pig
Apache Hive
Lesson 05
- Apache Hive
- Hive SQL over Hadoop MapReduce
- Hive Architecture
- Interfaces to Run Hive Queries
- Running Beeline from Command Line
- Hive Metastore
- Hive DDL and DML
- Creating New Table
- Data Types
- Validation of Data
- File Format Types
- Data Serialization
- Hive Table and Avro Schema
- Hive Optimization: Partitioning, Bucketing, and Sampling
- Non-Partitioned Table
- Data Insertion
- Dynamic Partitioning in Hive
- Bucketing
- What Do Buckets Do?
- Hive Analytics: UDF and UDAF
- Other Functions of Hive
- Demo: Real-time Analysis and Data Filtration
- Demo: Real-World Problem
- Demo: Data Representation and Import Using Hive
- Key Takeaways
- Knowledge Check
- Practice Project: Apache Hive
NoSQL Databases - HBase
Lesson 06
- NoSQL Databases - HBase
- NoSQL Introduction
- Demo: YARN Tuning
- HBase Overview
- HBase Architecture
- Data Model
- Connecting to HBase
- Practice Project: HBase Shell
- Key Takeaways
- Knowledge Check
- Practice Project: NoSQL Databases - HBase
Basics of Functional Programming and Scala
Lesson 07
- Basics of Functional Programming and Scala
- Introduction to Scala
- Demo: Scala Installation
- Functional Programming
- Programming With Scala
- Demo: Basic Literals and Arithmetic Programming
- Demo: Logical Operators
- Type Inference, Classes, Objects, and Functions in Scala
- Demo: Type Inference, Functions, Anonymous Function, and Class
- Collections
- Types of Collections
- Demo: Five Types of Collections
- Demo: Operations on List
- Scala REPL
- Demo: Features of Scala REPL
- Key Takeaways
- Knowledge Check
- Practice Project: Apache Hive
Apache Spark Next-Generation Big Data Framework
Lesson 08
- Apache Spark Next-Generation Big Data Framework
- History of Spark
- Limitations of Mapreduce in Hadoop
- Introduction to Apache Spark
- Components of Spark
- Application of In-memory Processing
- Hadoop Ecosystem vs Spark
- Advantages of Spark
- Spark Architecture
- Spark Cluster in Real World
- Demo: Running Scala Programs in Spark Shell
- Demo: Setting Up Execution Environment in IDE
- Demo: Spark Web UI
- Key Takeaways
- Knowledge Check
- Practice Project: Apache Spark Next-Generation Big Data Framework
Spark Core Processing RDD
Lesson 09
- Introduction to Spark RDD
- RDD in Spark
- Creating Spark RDD
- Pair RDD
- RDD Operations
- Demo: Spark Transformation Detailed Exploration Using Scala Examples
- Demo: Spark Action Detailed Exploration Using Scala
- Caching and Persistence
- Storage Levels
- Lineage and DAG
- Need for DAG
- Debugging in Spark
- Partitioning in Spark
- Scheduling in Spark
- Shuffling in Spark
- Sort Shuffle
- Aggregating Data With Paired RDD
- Demo: Spark Application With Data Written Back to HDFS and Spark UI
- Demo: Changing Spark Application Parameters
- Demo: Handling Different File Formats
- Demo: Spark RDD With Real-world Application
- Demo: Optimizing Spark Jobs
- Key Takeaways
- Knowledge Check
- Practice Project: Spark Core Processing RDD
Spark SQL Processing DataFrames
Lesson 10
- Spark SQL Processing DataFrames
- Spark SQL Introduction
- Spark SQL Architecture
- DataFrames
- Demo: Handling Various Data Formats
- Demo: Implement Various DataFrame Operations
- Demo: UDF and UDAF
- Interoperating With RDDs
- Demo: Process DataFrame Using SQL Query
- RDD vs DataFrame vs Dataset
- Practice Project: Processing DataFrames
- Key Takeaways
- Knowledge Check
- Practice Project: Spark SQL - Processing DataFrames
Spark MLlib Modeling Big Data With Spark
Lesson 11
- Spark MLlib Modeling Big Data With Spark
- Role of Data Scientist and Data Analyst in Big Data
- Analytics in Spark
- Machine Learning
- Supervised Learning
- Demo: Classification of Linear SVM
- Demo: Linear Regression With Real World Case Studies
- Unsupervised Learning
- Demo: Unsupervised Clustering K-means
- Reinforcement Learning
- Semi-supervised Learning
- Overview of MLlib
- MLlib Pipelines
- Key Takeaways
- Knowledge Check
- Practice Project: Spark MLlib - Modeling Big Data With Spark
Stream Processing Frameworks and Spark Streaming
Lesson 12
- Streaming Overview
- Real-time Processing of Big Data
- Data Processing Architectures
- Demo: Real-time Data Processing Spark Streaming
- Demo: Writing Spark Streaming Application
- Introduction to DStreams
- Transformations on DStreams
- Design Patterns for Using foreachRDD
- State Operations
- Windowing Operations
- Join Operations
- Stream-Dataset Join
- Demo: Windowing of Real-time Data Processing Streaming Sources
- Demo: Processing Twitter Streaming Data
- Structured Spark Streaming
- Use Case Banking Transactions
- Structured Streaming Architecture Model and Its Components
- Output Sinks
- Structured Streaming APIs
- Constructing Columns in Structured Streaming
- Windowed Operations on Event-time
- Use Cases
- Demo: Streaming Pipeline
- Practice Project: Spark Streaming
- Key Takeaways
- Knowledge Check
- Practice Project: Stream Processing Frameworks and Spark Streaming
Spark GraphX
Lesson 13
- Spark GraphX
- Introduction to Graph
- GraphX in Spark
- GraphX Operators
- Join Operators
- GraphX Parallel System
- Algorithms in Spark
- Pregel API
- Use Case of GraphX
- Demo: GraphX Vertex Predicate
- Demo: Page Rank Algorithm
- Key Takeaways
- Knowledge Check
- Practice Project: Spark GraphX
- Project Assistance

Learning Outcomes

In this Big Data Hadoop and Spark Developer course, you will learn to:

Hadoop Ecosystem

Learn how to navigate the Hadoop ecosystem and understand how to optimize its use

Ingest Data

Ingest data using Sqoop, Flume, and Kafka.

Hive

Implement partitioning, bucketing, and indexing in Hive

Apache Spark

Work with RDD in Apache Spark

Data Streaming

Process real-time streaming data and perform DataFrame operations in Spark using SQL queries

Implementation

Implement User-Defined Functions (UDF) and User-Defined Aggregate Functions (UDAF) in Spark

Course End Projects

Project 1: Analyzing Historical Insurance Claims

Use Hadoop features to predict patterns and share actionable insights for a car insurance company. This project uses New York Stock Exchange data from 2010 to 2016, captured from 500+ listed companies. The data set consists of each listed company’s intraday prices and volume traded. The data is used in both machine learning and exploratory analysis projects for the purposes of automating the trading process and predicting the next trading-day winners or losers. The scope of this project is limited to exploratory data analysis.

Project 2: Employee Review Comment Analysis

The HR team is surfing social media to gather current and ex-employee feedback and sentiments. This information will be used to derive actionable insights and take corrective actions to improve the employer-employee relationship. The data is web-scraped from Glassdoor and contains detailed reviews of 67K employees from Google, Amazon, Facebook, Apple, Microsoft, and Netflix.

Project 3: K-Means Clustering for Telecommunications Domain

LoudAcre Mobile is a mobile phone service provider that has introduced a new open network campaign. As part of this campaign, the company has invited users to complain about mobile phone network towers in their area if they are experiencing connectivity issues with their present mobile network. LoudAcre has collected the dataset of users who have complained.

Project 4: Market Analysis in Banking Domain

Our client, a Portuguese banking institution, ran a marketing campaign to convince potential customers to invest in a bank term deposit promotion. The marketing campaign pitches were delivered by phone calls. Often, however, the same customer was contacted more than once. You have to perform the marketing analysis of the data generated by this campaign, keeping in mind the redundant calls.

Target Audience and Prerequisites

Target Audience

Analytics professionals
Senior IT professionals
Testing and mainframe professionals
Data management professionals
Business intelligence professionals
Project managers
Graduates looking to begin a career in big data analytics

Prerequisites:

It is recommended that you have knowledge of:

Core Java
SQL
