Big data architecture components

Big data architecture is defined by two core requirements: data storage and data computation^[600-developer-big-data-big-data.md].

Major frameworks

The architecture of big data systems is generally built upon two primary frameworks: Hadoop and Spark^[600-developer-big-data-big-data.md].

Hadoop ecosystem

The Hadoop framework provides a comprehensive suite of components for distributed storage and processing^[600-developer-big-data-big-data.md]:

  • Core Components:
    • HDFS (Hadoop Distributed File System): Distributed storage.
    • MapReduce: A programming model for large-scale data processing (see the word-count sketch after this list).
    • HBase: A NoSQL database running on top of HDFS.
  • Data Analysis Engines:
    • Hive: A data warehouse infrastructure for data summarization and SQL-like querying.
    • Pig: A high-level platform for writing data-flow programs that run on Hadoop.
  • Data Collection Engines:
    • Sqoop: Designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
    • Flume: A service for efficiently collecting, aggregating, and moving large amounts of log data.
  • Management:
    • HUE: A web interface for interacting with Hadoop components.
    • Oozie: A workflow scheduler system to manage Hadoop jobs.
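
As a concrete illustration of the MapReduce programming model, below is a minimal word-count sketch written in Scala against the Hadoop Java API. It is not from the source note: the class names, input/output paths, and Scala 2.13 collection converters are illustrative assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.jdk.CollectionConverters._

// Map phase: emit a (word, 1) pair for every token in an input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCount.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer]) // pre-aggregate map output locally
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

The map phase emits a (word, 1) pair per token, the framework shuffles pairs by key, and the reduce phase sums each word's counts; registering the reducer as a combiner cuts shuffle traffic.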

Spark ecosystem

Spark is often used as an alternative to MapReduce for faster data processing, largely because it keeps intermediate results in memory rather than writing them to disk between stages^[600-developer-big-data-big-data.md]:

  • Languages: Written in Scala, with APIs for Java, Python, and R.
  • Core Computing:
    • Spark Core: The underlying execution engine that performs the primary data computations (see the first sketch after this list).
  • Modules:
    • Spark SQL: A module for working with structured data (also shown in the first sketch below).
    • Spark Streaming: Enables scalable, high-throughput, fault-tolerant stream processing (second sketch below).
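
To make the Spark Core and Spark SQL entries concrete, here is a minimal sketch in Scala: the SparkSession is the entry point to the Spark Core engine, and Spark SQL queries a structured file declaratively. The input path and column names are hypothetical, not from the source note.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point to the Spark Core engine.
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]") // on a real cluster, set the master via spark-submit instead
      .getOrCreate()

    // Spark SQL: load structured data and query it declaratively.
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/events.csv") // hypothetical HDFS input path

    events.createOrReplaceTempView("events")
    spark.sql(
      """SELECT user_id, COUNT(*) AS n_events
        |FROM events
        |GROUP BY user_id
        |ORDER BY n_events DESC""".stripMargin
    ).show(10)

    spark.stop()
  }
}
```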
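
And a minimal Spark Streaming sketch using the classic DStream word count over a socket; the host, port, and batch interval are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one core, so at least
    // one more is needed for processing the micro-batches.
    val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print() // emit each batch's word counts to stdout

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, run `nc -lk 9999` in another terminal and type lines of text; each 10-second micro-batch prints its word counts.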

Sources

^[600-developer-big-data-big-data.md]