Hadoop Ecosystem Components

Apache Hadoop is a framework designed for the storage and processing of large datasets.^[600-developer__big-data__big-data.md] The Hadoop ecosystem consists of various open-source components that handle different aspects of the big data pipeline, including distributed storage, resource management, data processing, and analysis.^[600-developer__big-data__big-data.md]

Core Components

The foundational layer of the ecosystem comprises the following distributed storage and processing systems; a short illustrative sketch of each follows the list:

  • HDFS (Hadoop Distributed File System): The primary storage system, which splits large files into blocks and distributes them across the cluster.^[600-developer__big-data__big-data.md]
  • MapReduce: The original computational engine for processing large datasets in parallel across the cluster.^[600-developer__big-data__big-data.md#L21-22]
  • HBase: A NoSQL, column-oriented distributed database that runs on top of HDFS, designed for real-time read/write access to large datasets.^[600-developer__big-data__big-data.md#L22-23]
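
As a minimal sketch of how a client talks to HDFS, the following Java snippet writes a file through the standard `org.apache.hadoop.fs.FileSystem` API. The path and cluster address are hypothetical; the block splitting and replication described above happen transparently beneath this call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Cluster location comes from core-site.xml on the classpath,
        // e.g. fs.defaultFS = hdfs://namenode:8020 (hypothetical).
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            // HDFS splits this file into blocks and replicates them
            // across DataNodes below this API; the client never sees it.
            out.writeUTF("hello, hdfs");
        }
    }
}
```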
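
The MapReduce model is easiest to see in the classic word-count example. The sketch below is a standard formulation against the `org.apache.hadoop.mapreduce` API; the input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the per-word counts produced by all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```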
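
For a feel of HBase's real-time read/write path, here is a minimal client sketch that writes and immediately reads back a single cell. The `users` table and `info` column family are assumptions; they would need to exist in the cluster already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.).
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Read it back immediately: HBase serves point reads in real time.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```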

Data Analysis and Query Engines

These components provide interfaces for querying and analyzing data stored in the Hadoop ecosystem without hand-writing complex MapReduce programs; illustrative sketches follow the list:

  • Hive: A data warehouse infrastructure that provides an SQL-like query language (HiveQL, or HQL) to facilitate data summarization and analysis.^[600-developer__big-data__big-data.md#L25-26]
  • Pig: A high-level platform for creating programs that run on Hadoop, using a scripting language (Pig Latin) for data manipulation and ETL (Extract, Transform, Load) processes.^[600-developer__big-data__big-data.md#L26-27]
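
Because Hive exposes HiveServer2 over JDBC, an HQL query can be issued from plain Java. The sketch below is a minimal example assuming a HiveServer2 instance at `localhost:10000`, a hypothetical `sales` table, and the Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // URL, credentials, and the "sales" table are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
            while (rs.next()) {
                // Hive compiles this HQL into distributed jobs behind the scenes.
                System.out.println(rs.getString("region") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```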
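
Pig Latin scripts are usually run from the `pig` shell, but they can also be driven programmatically through `PigServer`. This sketch strings together the classic word-count pipeline; the file paths are placeholders, and `ExecType.LOCAL` is chosen so it can run without a cluster (`ExecType.MAPREDUCE` would target Hadoop).

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCountExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // store() triggers execution of the whole pipeline.
        pig.store("counts", "wordcount-out");
    }
}
```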

Data Ingestion and Workflow

To move data into and out of the cluster, and to manage dependencies between jobs, the ecosystem relies on dedicated tools for transport and automation; illustrative sketches follow the list:

  • Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.^[600-developer__big-data__big-data.md#L28-29]
  • Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.^[600-developer__big-data__big-data.md#L29-30]
  • Oozie: A workflow scheduler system used to manage Hadoop jobs, coordinating dependencies between different tasks.^[600-developer__big-data__big-data.md#L31-32]
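
Sqoop is normally invoked from the command line; the same import can also be expressed through Sqoop 1's `Sqoop.runTool` entry point, as in the sketch below. The JDBC URL, credentials, table, and target directory are all placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/shop",   // placeholder source DB
            "--username", "etl",
            "--password-file", "/user/etl/.db-password",
            "--table", "orders",                        // placeholder table
            "--target-dir", "/user/etl/orders",         // HDFS destination
            "--num-mappers", "4"                        // parallel copy tasks
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```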
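
Flume agents are typically configured with a properties file; the same topology can also be assembled in-process with Flume's embedded-agent API. The sketch below buffers events in a memory channel and forwards them to a downstream Avro collector; the collector hostname and port are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class FlumeEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Channel buffers events locally; the Avro sink ships them downstream.
        Map<String, String> props = new HashMap<>();
        props.put("channel.type", "memory");
        props.put("channel.capacity", "200");
        props.put("sinks", "sink1");
        props.put("sink1.type", "avro");
        props.put("sink1.hostname", "collector.example.com"); // placeholder
        props.put("sink1.port", "4141");                      // placeholder
        props.put("processor.type", "default");

        EmbeddedAgent agent = new EmbeddedAgent("demo-agent");
        agent.configure(props);
        agent.start();
        try {
            agent.put(EventBuilder.withBody("one log line".getBytes("UTF-8")));
        } finally {
            agent.stop();
        }
    }
}
```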
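
Oozie workflows themselves are defined in XML and deployed to HDFS; the Java client below shows a minimal submit-and-poll loop against the Oozie server. The server URL, application path, and workflow properties are placeholders.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS path of the deployed workflow application (placeholder).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-wf");
        conf.setProperty("inputDir", "/user/demo/input");
        conf.setProperty("outputDir", "/user/demo/output");

        // Submit and start the workflow, then poll until it leaves RUNNING.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow " + jobId + " finished: "
                + oozie.getJobInfo(jobId).getStatus());
    }
}
```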

Management and Interfaces

  • HUE (Hadoop User Experience): A web-based tool for interacting with the Hadoop ecosystem, providing an interface for components such as HDFS, Hive, and Pig.^[600-developer__big-data__big-data.md#L30-31]

Sources

  • 600-developer__big-data__big-data.md