Apache Storm Architecture

Apache Storm is a distributed real-time computation system designed for processing unbounded data streams.^[storm-01.md] Its architecture is defined by a specific set of components that manage data flow, cluster coordination, and concurrency.

Core Data Model

The fundamental unit of data in Storm is the Tuple, which acts as the main data structure.^[storm-01.md] An unbounded sequence of these tuples constitutes a Stream.^[storm-01.md] The architecture defines the flow of these streams through a graph of processing units.
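
This data model can be sketched in plain Java. The `TupleSketch` class and its field/value representation below are an illustrative stand-in, not Storm's actual `Tuple` class:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of a Storm Tuple: an ordered list of values
// addressable by field name. A Stream is an unbounded sequence of these.
public class TupleSketch {
    // Pair each declared field name with its value, preserving order.
    static Map<String, Object> tuple(List<String> fields, List<Object> values) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            t.put(fields.get(i), values.get(i));
        }
        return t;
    }

    public static void main(String[] args) {
        Map<String, Object> t = tuple(List.of("word", "count"), List.of("storm", 3));
        System.out.println(t.get("word"));  // storm
    }
}
```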

Topology and Processing Units

The computation logic is structured as a Topology, which connects two main types of components:^[storm-01.md]

  • Spouts: These serve as the source of a stream, ingesting or fetching data from external systems.^[storm-01.md] In code, a Spout typically implements IRichSpout and uses a SpoutOutputCollector to emit tuples into the topology.^[storm-01.md]
  • Bolts: These are the logical processing units; they consume tuples from spouts or other bolts, perform transformations, and pass the results onward.^[storm-01.md] A Bolt typically implements IRichBolt and uses an OutputCollector to emit tuples.^[storm-01.md]
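
A minimal in-process sketch of how a spout feeds a bolt. The `Spout`/`Bolt` interfaces and the `sentenceSpout`/`upperBolt` names below are illustrative stand-ins for Storm's IRichSpout/IRichBolt, not the real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative stand-ins for Storm's spout and bolt roles (not the real API):
// a spout pushes tuples to a collector; a bolt consumes a tuple, transforms
// it, and emits the result downstream.
public class MiniTopology {
    interface Spout { void nextTuple(Consumer<String> collector); }
    interface Bolt  { void execute(String tuple, Consumer<String> collector); }

    // Wire spout -> bolt -> sink, mimicking a two-component topology.
    static List<String> run() {
        List<String> sink = new ArrayList<>();
        Spout sentenceSpout = out -> out.accept("hello storm");
        Bolt upperBolt = (tuple, out) -> out.accept(tuple.toUpperCase());
        sentenceSpout.nextTuple(tuple -> upperBolt.execute(tuple, sink::add));
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [HELLO STORM]
    }
}
```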

The mechanism that dictates how data flows between spouts and bolts (or between bolts themselves) is known as Stream Grouping.^[storm-01.md]
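
One common grouping, fields grouping, routes tuples by hashing a chosen field so that equal field values always reach the same bolt task. A minimal sketch of that routing rule (Storm's internal hashing scheme may differ):

```java
// Sketch of a fields grouping's routing rule: hash the grouping field
// modulo the number of consumer tasks, so equal field values always land
// on the same task. (Storm's internal hash scheme may differ.)
public class FieldsGroupingSketch {
    static int targetTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The same word is always routed to the same bolt task.
        System.out.println(targetTask("storm", 4) == targetTask("storm", 4));  // true
    }
}
```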

Cluster Architecture

Storm runs as a distributed cluster, managed by two primary node types:^[storm-01.md]

  • Nimbus: This is the master node of the Storm cluster.^[storm-01.md] It handles central coordination: distributing topology code, assigning tasks to worker nodes, and monitoring for failures.
  • Supervisor: These are worker nodes that receive assignments from Nimbus and run worker processes to execute the assigned tasks.^[storm-01.md]

Execution Model (Concurrency)

To execute a topology across the cluster, Storm employs a hierarchy of execution elements designed for parallelism:^[storm-01.md]

  • Worker Processes (JVMs): A topology runs across one or more worker processes distributed over the cluster's worker nodes. Each worker process is a distinct JVM that executes a subset of the topology's tasks.^[storm-01.md]
  • Executors: An executor is a single thread spawned by a worker process.^[storm-01.md]
  • Tasks: The task is the basic unit where actual data processing occurs.^[storm-01.md] Multiple tasks can be assigned to a single executor thread, and the configuration allows fine-grained control over parallelism (e.g., supervisor.slots.ports determines the worker slots available on each node, while per-component parallelism is set through the topology configuration object).^[storm-01.md]
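
The hierarchy above reduces to simple arithmetic: a component's tasks are split among its executor threads, which are in turn spread across worker JVMs. A small sketch with hypothetical configuration numbers:

```java
// Sketch of Storm's parallelism arithmetic: a component's tasks are split
// evenly among its executor threads, which the scheduler spreads across
// worker JVMs. The numbers below are hypothetical configuration values.
public class ParallelismSketch {
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        int numTasks = 8;      // e.g., set via setNumTasks on the component
        int numExecutors = 4;  // e.g., the component's parallelism hint
        System.out.println(tasksPerExecutor(numTasks, numExecutors));  // 2
    }
}
```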

Sources

^[storm-01.md]

  • [[Distributed Systems]]
  • Stream Processing
  • [[Kafka]] (often integrated with Storm for data streaming)