# Big data architecture components
Big data architecture is defined by two core requirements: data storage and data computation^[600-developer-big-data-big-data.md].
## Major frameworks
The architecture of big data systems is generally built upon two primary frameworks: Hadoop and Spark^[600-developer-big-data-big-data.md].
### Hadoop ecosystem
The Hadoop framework provides a comprehensive suite of components for distributed storage and processing^[600-developer-big-data-big-data.md]:
- Core components:
  - HDFS (Hadoop Distributed File System): Distributed storage.
  - MapReduce: A programming model for large-scale data processing (see the sketch after this list).
  - HBase: A NoSQL database running on top of HDFS.
- Data analysis engines:
  - Hive: A data warehouse infrastructure for summarization and querying.
  - Pig: A high-level platform for creating programs that run on Hadoop.
- Data collection engines:
  - Sqoop: Designed for efficiently transferring bulk data between Hadoop and structured datastores.
  - Flume: A service for efficiently collecting, aggregating, and moving large amounts of log data.
- Management:
  - HUE: A web interface for interacting with Hadoop components.
  - Oozie: A workflow scheduler system to manage Hadoop jobs.
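As a concrete illustration of the MapReduce programming model, here is the classic word count as a minimal sketch. MapReduce jobs are usually written in Java; this version uses Scala against the Hadoop Java API to keep one language across the examples in this note. The class names and HDFS paths are illustrative, not from the source.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map phase: emit (word, 1) for every token in each input line.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      ctx.write(word, one)
    }
}

// Reduce phase: sum the counts collected for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // HDFS output directory
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

Packaged as a JAR, a job like this is submitted with something like `hadoop jar wordcount.jar WordCount /input /output`. Each MapReduce stage reads its input from HDFS and writes its output back to HDFS, which is the overhead Spark is designed to avoid.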
### Spark ecosystem
Spark is often used as an alternative to MapReduce for faster data processing^[600-developer-big-data-big-data.md] (see the sketch after this list):
- Language: Spark is written in Scala.
- Core computing:
  - Spark Core: The engine that performs the primary data computations.
- Modules:
  - Spark SQL: A module for working with structured data.
  - Spark Streaming: Enables scalable, high-throughput, fault-tolerant stream processing.
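To show how Spark replaces MapReduce, here is the same word count expressed as a chain of in-memory RDD transformations, followed by a small Spark SQL query over structured data. This is a minimal sketch: the application name, HDFS paths, JSON input, and `user` column are assumptions for illustration, not from the source.

```scala
import org.apache.spark.sql.SparkSession

object SparkExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-examples").getOrCreate()
    val sc = spark.sparkContext

    // Spark Core: word count as lazily evaluated RDD transformations.
    // Intermediate results stay in memory; nothing is written to HDFS
    // until the final action.
    sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///data/output")

    // Spark SQL: register structured data as a view and query it with SQL.
    val events = spark.read.json("hdfs:///data/events.json") // assumed schema with a `user` field
    events.createOrReplaceTempView("events")
    spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user").show()

    spark.stop()
  }
}
```

Where the MapReduce version above materializes each stage's results to HDFS, Spark keeps intermediate data in memory between stages, which is the main source of the speed advantage the note refers to.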
## Sources
^[600-developer-big-data-big-data.md]