Hadoop ecosystem components¶
Apache Hadoop is a framework for the distributed storage and processing of large datasets across clusters of machines.^[600-developer__big-data__big-data.md] The Hadoop ecosystem consists of various open-source components that handle different stages of the big data pipeline, including distributed storage, resource management, data processing, and analysis.^[600-developer__big-data__big-data.md]
Core Components¶
The foundational layer of the ecosystem includes the following distributed storage and processing systems:
- HDFS (Hadoop Distributed File System): The primary storage system, which splits large files into blocks, replicates them for fault tolerance, and distributes them across the cluster^[600-developer__big-data__big-data.md].
- MapReduce: The original computational engine, which processes large datasets in parallel by running a map phase followed by a reduce phase across the cluster^[600-developer__big-data.md#L21-22].
- HBase: A NoSQL, column-oriented distributed database that runs on top of HDFS, designed for real-time read/write access to large datasets^[600-developer__big-data.md#L22-23].
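The map/shuffle/reduce flow that MapReduce popularized can be sketched in plain Python. This is a toy, in-memory word count for illustration only; a real job would be written against the Hadoop Java API and executed in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # → 3
```

In a real cluster, the map and reduce functions run on many nodes at once, and the shuffle step moves intermediate data between them over the network.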
Data Analysis and Query Engines¶
These components provide interfaces to query and analyze data stored in the Hadoop ecosystem without writing complex MapReduce programs:
- Hive: A data warehouse infrastructure that provides a SQL-like query language (HQL) to facilitate data summarization and analysis^[600-developer__big-data.md#L25-26].
- Pig: A high-level platform for creating programs that run on Hadoop, utilizing a scripting language (Pig Latin) for data manipulation and ETL (Extract, Transform, Load) processes^[600-developer__big-data.md#L26-27].
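HQL statements read almost identically to standard SQL. As a runnable stand-in (using SQLite here, since an actual Hive query needs a running cluster; the `page_views` table and its columns are hypothetical), the kind of summarization query Hive would compile into MapReduce jobs looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 10), ("about", 3), ("home", 5)],
)

# In Hive, this SELECT would be written in HQL and executed as
# distributed jobs over data stored in HDFS rather than locally.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 15)]
```

The point of Hive is exactly this: analysts write the declarative query, and the engine handles turning it into parallel work across the cluster.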
Data Ingestion and Workflow¶
To move data into and out of the cluster, and to manage dependencies between jobs, the ecosystem provides dedicated tools for transport and automation:
- Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases^[600-developer__big-data.md#L28-29].
- Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data^[600-developer__big-data.md#L29-30].
- Oozie: A workflow scheduler system used to manage Hadoop jobs, coordinating dependencies between different tasks^[600-developer__big-data.md#L31-32].
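Oozie workflows are defined as directed acyclic graphs of actions. The ordering constraint it enforces (a job runs only after all of its upstream jobs finish) can be sketched with Python's standard-library topological sorter; the job names below are made up for illustration:

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on, mirroring an
# Oozie workflow DAG: ingest first, then transforms, then a report.
workflow = {
    "sqoop_import": set(),
    "hive_transform": {"sqoop_import"},
    "pig_cleanup": {"sqoop_import"},
    "report": {"hive_transform", "pig_cleanup"},
}

# static_order() yields the jobs in an execution order that
# respects every dependency edge in the graph.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Here `sqoop_import` always runs first and `report` always runs last; the two transform jobs in between have no mutual dependency, so Oozie could run them in parallel.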
Management and Interfaces¶
- HUE (Hadoop User Experience): A web-based tool for interacting with the Hadoop ecosystem, providing an interface for components like HDFS, Hive, and Pig^[600-developer__big-data.md#L30-31].
Sources¶
600-developer__big-data__big-data.md