Hadoop Ecosystem

The Hadoop ecosystem refers to the collection of open-source software utilities and frameworks that facilitate the storage, processing, and management of large datasets.^[600-developer-big-data-big-data.md] It is built around the two primary concerns in big data: data storage and data computation.^[600-developer-big-data-big-data.md]

Core Components

The fundamental components of the Hadoop framework include:

  • HDFS: The distributed file system for storage.
  • MapReduce: The programming model for processing large datasets (see the sketch after this list).
  • HBase: A NoSQL database running on top of HDFS.^[600-developer-big-data-big-data.md]
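
As a minimal sketch of how HDFS and MapReduce fit together, the classic WordCount job below reads text from an HDFS input directory, maps each token to a (word, 1) pair, and reduces those pairs into per-word counts. The input and output paths passed on the command line are placeholders for real HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically launched with `hadoop jar wordcount.jar WordCount /input /output`, where both paths live in HDFS.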

Data Analysis and Ingestion

The ecosystem provides specific engines for analyzing and collecting data:

  • Analysis Engines: [[Hive]] and [[Pig]] are used for data analysis (a query sketch follows this list).^[600-developer-big-data-big-data.md]
  • Data Collection Engines: [[Sqoop]] and [[Flume]] handle data ingestion.^[600-developer-big-data-big-data.md]
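
Hive exposes its SQL dialect (HiveQL) over JDBC through HiveServer2, so analysis queries can be issued from ordinary Java code. The sketch below is a minimal example assuming a HiveServer2 endpoint on the default port 10000; the database, credentials, and the table name `page_views` are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // JDBC driver shipped with the Hive JDBC jar.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Assumed HiveServer2 endpoint and database; adjust for your cluster.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // 'page_views' is a hypothetical table used only for illustration.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS views "
                 + "FROM page_views GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```

Behind the scenes, Hive compiles such a query into MapReduce (or another execution engine's) jobs over data stored in HDFS, which is what makes it an analysis engine rather than a database.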

Management and Workflow

To manage the infrastructure and data processing tasks, the ecosystem includes:

  • Web Management: [[Hue]] provides a web-based interface.
  • Workflow: [[Oozie]] is used for workflow scheduling and management (see the client sketch after this list).^[600-developer-big-data-big-data.md]
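
Oozie workflows are defined in XML, deployed to HDFS, and submitted to the Oozie server, which also offers a Java client API for submission and monitoring. The sketch below assumes a local Oozie server on the default port 11000; the HDFS application path, `nameNode`, and `jobTracker` values are placeholders for real cluster addresses.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Assumed Oozie server URL; 11000 is the default port.
    OozieClient client = new OozieClient("http://localhost:11000/oozie");

    // Job properties; all paths and hosts below are placeholders.
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH,
        "hdfs://namenode:8020/user/alice/wordcount-wf");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");

    // Submit and start the workflow, then poll until it leaves RUNNING.
    String jobId = client.run(conf);
    while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10_000);
    }
    System.out.println("Workflow " + jobId + " finished: "
        + client.getJobInfo(jobId).getStatus());
  }
}
```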

Sources

^[600-developer-big-data-big-data.md]

  • [[Big Data]]
  • [[Spark]]
  • [[HDFS]]
  • [[MapReduce]]
  • [[Hive]]