Big data architecture components

Big data architecture is defined by two core requirements: data storage and data computation^[600-developer-big-data-big-data.md].

Major frameworks

The architecture of big data systems is generally built upon two primary frameworks: Hadoop and Spark^[600-developer-big-data-big-data.md].

Hadoop ecosystem

The Hadoop framework provides a comprehensive suite of components for distributed storage and processing^[600-developer-big-data-big-data.md]:

  • Core Components:
    • HDFS (Hadoop Distributed File System): Distributed storage.
    • MapReduce: A programming model for large-scale data processing (see the word-count sketch after this list).
    • HBase: A NoSQL database running on top of HDFS.
  • Data Analysis Engines:
    • Hive: A data warehouse infrastructure for data summarization and SQL-like querying.
    • Pig: A high-level platform for writing data-flow programs that run on Hadoop.
  • Data Collection Engines:
    • Sqoop: Designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
    • Flume: A service for efficiently collecting, aggregating, and moving large amounts of log data.
  • Management:
    • HUE: A web interface for interacting with Hadoop components.
    • Oozie: A workflow scheduler system to manage Hadoop jobs.
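
As a concrete illustration of the MapReduce programming model, below is a minimal word-count sketch written in Scala against the Hadoop Java API. It is not from the source note: the class names, input/output paths, and Scala 2.13 collection converters are illustrative assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.jdk.CollectionConverters._

// Map phase: emit a (word, 1) pair for every token in an input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCount.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer]) // pre-aggregate map output locally
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

The map phase emits a (word, 1) pair per token, the framework shuffles pairs by key, and the reduce phase sums each word's counts; registering the reducer as a combiner cuts shuffle traffic.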

Spark ecosystem

Spark is often used as an alternative to MapReduce for faster data processing, largely because it keeps intermediate results in memory rather than writing them to disk between stages^[600-developer-big-data-big-data.md]:

  • Languages: Written in Scala, with APIs for Java, Python, and R.
  • Core Computing:
    • Spark Core: The underlying execution engine that performs the primary data computations (see the first sketch after this list).
  • Modules:
    • Spark SQL: A module for working with structured data (also shown in the first sketch below).
    • Spark Streaming: Enables scalable, high-throughput, fault-tolerant stream processing (second sketch below).
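
To make the Spark Core and Spark SQL entries concrete, here is a minimal sketch in Scala: the SparkSession is the entry point to the Spark Core engine, and Spark SQL queries a structured file declaratively. The input path and column names are hypothetical, not from the source note.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point to the Spark Core engine.
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]") // on a real cluster, set the master via spark-submit instead
      .getOrCreate()

    // Spark SQL: load structured data and query it declaratively.
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/events.csv") // hypothetical HDFS input path

    events.createOrReplaceTempView("events")
    spark.sql(
      """SELECT user_id, COUNT(*) AS n_events
        |FROM events
        |GROUP BY user_id
        |ORDER BY n_events DESC""".stripMargin
    ).show(10)

    spark.stop()
  }
}
```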
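
And a minimal Spark Streaming sketch using the classic DStream word count over a socket; the host, port, and batch interval are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one core, so at least
    // one more is needed for processing the micro-batches.
    val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print() // emit each batch's word counts to stdout

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, run `nc -lk 9999` in another terminal and type lines of text; each 10-second micro-batch prints its word counts.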

Sources

^[600-developer-big-data-big-data.md]