Storm architecture components

Apache Storm is a distributed real-time computation system designed to process unbounded data streams. Its architecture relies on several core components that work together to manage data flow, cluster coordination, and execution logic.^[600-developer-big-data-storm-storm-01.md]

Core Data and Flow Concepts

At the most basic level, Storm operates on streams and tuples. A Tuple is the main data structure: a named list of values that serves as the unit of data. A Stream is an unbounded sequence of these tuples flowing through the system.^[600-developer-big-data-storm-storm-01.md]
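As a rough illustration, a tuple can be modeled as a list of values whose positions are named by a field list. The class below is a standard-library sketch only: `TupleSketch` and its methods are hypothetical names chosen for this example, not Storm's own tuple type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a tuple: an ordered list of values whose positions
// are named by a list of field names.
final class TupleSketch {
    private final List<String> fields;
    private final List<Object> values;

    TupleSketch(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        // Defensive copies keep the tuple immutable after construction.
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
    }

    // Look a value up by its field name.
    Object getValueByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }

    List<Object> getValues() {
        return values;
    }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch(
                Arrays.asList("word", "count"),
                Arrays.<Object>asList("storm", 3));
        System.out.println(t.getValueByField("word") + " -> " + t.getValueByField("count"));
    }
}
```

A stream, in this model, would simply be a never-ending supply of such objects that all share the same field list.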

The logic of a Storm application is structured as a Topology. Unlike a traditional MapReduce job, which eventually finishes, a Storm topology runs continuously until it is explicitly killed. A topology is a graph of computation in which the nodes represent processing logic and the edges represent the flow of data between them.^[600-developer-big-data-storm-storm-01.md]

Topology Elements

A topology consists of two main types of nodes:

  • Spouts: These act as the source of a stream. A spout reads data from an external source (such as a message queue like Kafka, or a database) and emits it into the topology as tuples.^[600-developer-big-data-storm-storm-01.md]
  • Bolts: These are the logical processing units. Bolts consume input tuples, process them (performing operations such as filtering, aggregation, or joining), and may subsequently emit new tuples to other bolts in the chain.^[600-developer-big-data-storm-storm-01.md]
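The division of labor between spouts and bolts can be sketched with a single-threaded, in-process model. Everything here is an assumption made for illustration: the `Spout` and `Bolt` interfaces, the `MiniTopologySketch` class, and the word-count example are hypothetical and do not come from the Storm library. The model also terminates because its source is finite, whereas a real topology keeps running until it is killed.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;
import java.util.function.Consumer;

// Single-threaded model of one spout feeding two bolts. All names are hypothetical.
final class MiniTopologySketch {

    // A spout pulls from an external source and emits tuples downstream.
    interface Spout {
        boolean nextTuple(Consumer<String> emit); // false once the source is drained
    }

    // A bolt consumes one input tuple and may emit zero or more output tuples.
    interface Bolt {
        void execute(String input, Consumer<String> emit);
    }

    static Map<String, Integer> run(List<String> sentences) {
        Queue<String> source = new ArrayDeque<>(sentences); // stand-in for a message queue
        Map<String, Integer> counts = new TreeMap<>();

        Spout sentenceSpout = emit -> {
            String next = source.poll();
            if (next == null) {
                return false;
            }
            emit.accept(next);
            return true;
        };
        // Transforming bolt: one sentence in, many words out.
        Bolt splitBolt = (sentence, emit) -> {
            for (String word : sentence.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        };
        // Aggregating bolt: updates state and emits nothing further.
        Bolt countBolt = (word, emit) -> counts.merge(word, 1, Integer::sum);

        // Wire the graph: sentenceSpout -> splitBolt -> countBolt.
        Consumer<String> toCount = word -> countBolt.execute(word, ignored -> { });
        Consumer<String> toSplit = sentence -> splitBolt.execute(sentence, toCount);
        while (sentenceSpout.nextTuple(toSplit)) {
            // keep pulling until the finite source is empty
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(run(Arrays.asList("the quick fox", "the lazy dog")));
    }
}
```

The two `Consumer` objects play the role of the graph's edges: each one hands emitted tuples to the next node.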

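Each connection in the graph, whether from a spout to a bolt or from one bolt to another, is governed by a Stream Grouping. The grouping defines how a stream's tuples are distributed among the tasks of the receiving component, for example randomly (shuffle grouping) or according to the values of selected fields (fields grouping).^[600-developer-big-data-storm-storm-01.md]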
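The difference between the two groupings mentioned above can be sketched as two functions that pick a receiving task index. This is a simplified model under stated assumptions: `GroupingSketch`, `shuffle`, and `byField` are hypothetical names, and the random pick and hash-modulo rule are stand-ins, not Storm's internal implementation.

```java
import java.util.Random;

// Hypothetical sketch of two grouping strategies. Each method returns the index
// of the downstream task that should receive a tuple.
final class GroupingSketch {

    // Shuffle-style grouping: any task may receive the tuple, chosen at random.
    static int shuffle(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Fields-style grouping: tuples with equal keys always map to the same task.
    // floorMod keeps the index non-negative even when hashCode() is negative.
    static int byField(Object key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random rng = new Random();
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.println(word
                    + ": shuffle -> task " + shuffle(rng, numTasks)
                    + ", fields -> task " + byField(word, numTasks));
        }
    }
}
```

The property that matters for a stateful bolt, such as a per-word counter, is the one `byField` provides: every occurrence of "storm" reaches the same task, so that task's count is complete.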

Physical and Execution Architecture

To understand how a topology runs, it is necessary to distinguish between the logical programming model and the physical execution model.

Cluster Nodes

  • Nimbus: This is the master node of the Storm cluster. It is responsible for assigning tasks to other machines and monitoring their health.^[600-developer-big-data-storm-storm-01.md]
  • Supervisor: These nodes follow the instructions given by the Nimbus. Each supervisor runs one or more worker processes.^[600-developer-big-data-storm-storm-01.md]

Process and Thread Hierarchy

  • Worker Process (JVM): A topology runs in a distributed manner across one or more worker nodes. A given topology is executed by one or more worker processes, each running in its own JVM.^[600-developer-big-data-storm-storm-01.md]
  • Executor: An executor is a single thread spawned by a worker process. It is the runtime entity that runs the logic of one component (a spout or a bolt).^[600-developer-big-data-storm-storm-01.md]
  • Task: A task is the basic unit of execution and performs the actual data processing. An executor may run several tasks of the same component, but it runs them serially on its single thread, so parallelism comes from the number of executors rather than from the number of tasks per executor.^[600-developer-big-data-storm-storm-01.md]
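The relationship between executors and tasks comes down to simple arithmetic, sketched below. The sketch assumes that a component's tasks are divided among its executors as evenly as possible; `ParallelismSketch` and `assignTasks` are hypothetical names, and the exact placement Storm chooses may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: divide one component's tasks among its executor threads.
final class ParallelismSketch {

    // Splits task ids 0..numTasks-1 into numExecutors contiguous groups whose
    // sizes differ by at most one. Each inner list is the work of one thread.
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need at least one task per executor");
        }
        int base = numTasks / numExecutors;   // every executor gets at least this many
        int extra = numTasks % numExecutors;  // the first 'extra' executors get one more
        List<List<Integer>> executors = new ArrayList<>();
        int nextTask = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(nextTask++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Four tasks on two executors: prints [[0, 1], [2, 3]]
        System.out.println(assignTasks(4, 2));
    }
}
```

In this example only two threads exist, so at most two tasks of the component make progress at any instant, even though four tasks are defined.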

Sources

^[600-developer-big-data-storm-storm-01.md]