The Case for Apache Hadoop
• A brief history of Hadoop
• Core Hadoop components
• Fundamental concepts
The Hadoop Distributed File System
• HDFS features
• HDFS design assumptions
• Overview of HDFS architecture
• Writing and reading files
• NameNode considerations
• An overview of HDFS security
• Hands-On Exercise
MapReduce
• What is MapReduce?
• Features of MapReduce
• Basic MapReduce concepts
• Architectural overview
• Failure recovery
• Hands-On Exercise
An Overview of the Hadoop Ecosystem
• What is the Hadoop ecosystem?
• Integration tools
• Analysis tools
• Data storage and retrieval tools
Planning Your Hadoop Cluster
• General planning considerations
• Choosing the right hardware
• Network considerations
• Configuring nodes
Hadoop Installation
• Installing Hadoop
• Using Cloudera Manager for easy installation
• Basic configuration parameters
• Hands-On Exercise
Managing and Scheduling Jobs
• Managing running jobs
• Hands-On Exercise
• The FIFO Scheduler
• The FairScheduler
• Configuring the FairScheduler
• Hands-On Exercise
Cluster Maintenance
• Checking HDFS status
• Hands-On Exercise
• Copying data between clusters
• Adding and removing cluster nodes
• Rebalancing the cluster
• Hands-On Exercise
• NameNode metadata backup
• Upgrading the cluster
Cluster Monitoring and Troubleshooting
• General system monitoring
• Managing Hadoop's log files
• Using the NameNode and JobTracker Web UIs
• Hands-On Exercise
• Cluster monitoring with Ganglia
• Common troubleshooting issues
• Benchmarking your cluster
Populating HDFS from External Sources
• An overview of Flume
• Hands-On Exercise
• An overview of Sqoop
• Best practices for importing data
Installing and Managing Other Hadoop Projects
• Hive
• Pig
• HBase
HADOOP Developer Content:
Introduction
The Motivation for Hadoop
• Problems with traditional large-scale systems
• Requirements for a new approach
Hadoop: Basic Concepts
• An Overview of Hadoop
• The Hadoop Distributed File System
• Hands-On Exercise
• How MapReduce Works
• Hands-On Exercise
• Anatomy of a Hadoop Cluster
• Other Hadoop Ecosystem Components
Writing a MapReduce Program
• The MapReduce Flow
• Examining a Sample MapReduce Program
• Basic MapReduce API Concepts
• The Driver Code
• The Mapper
• The Reducer
• Hadoop's Streaming API
• Using Eclipse for Rapid Development
• Hands-On Exercise
• The New MapReduce API
Integrating Hadoop into the Workflow
• Relational Database Management Systems
• Storage Systems
• Importing Data from RDBMSs With Sqoop
• Hands-On Exercise
• Importing Real-Time Data with Flume
• Accessing HDFS Using FuseDFS and Hoop
Delving Deeper Into The Hadoop API
• More about ToolRunner
• Testing with MRUnit
• Reducing Intermediate Data With Combiners
• The configure and close Methods for Map/Reduce Setup and Teardown
• Writing Partitioners for Better Load Balancing
• Hands-On Exercise
• Directly Accessing HDFS
• Using the Distributed Cache
• Hands-On Exercise
Common MapReduce Algorithms
• Sorting and Searching
• Indexing
• Machine Learning With Mahout
• Term Frequency – Inverse Document Frequency
• Word Co-Occurrence
• Hands-On Exercise
Using Hive and Pig
• Hive Basics
• Pig Basics
• Hands-On Exercise
Practical Development Tips and Techniques
• Debugging MapReduce Code
• Using LocalJobRunner Mode For Easier Debugging
• Retrieving Job Information with Counters
• Logging
• Splittable File Formats
• Determining the Optimal Number of Reducers
• Map-Only MapReduce Jobs
• Hands-On Exercise
Joining Data Sets in MapReduce
• Map-Side Joins
• The Secondary Sort
• Reduce-Side Joins
Graph Manipulation in Hadoop
• Introduction to Graph Techniques
• Representing Graphs in Hadoop
• Implementing a Sample Algorithm: Single Source Shortest Path
Creating Workflows With Oozie
• The Motivation for Oozie
• Oozie's Workflow Definition Format
• Hands-On Exercise