Apache Storm Core Concepts

Apache Storm is a distributed real-time computation system designed for processing unbounded data streams. It is often used in conjunction with other tools like Flume (for data acquisition), Kafka (for temporary data storage), and Redis (an in-memory database for saving results)^[600-developer-big-data-storm-storm-01.md].

Core Data Model

The fundamental data structure in Storm is the Tuple, the basic unit of data transferred through the system^[600-developer-big-data-storm-storm-01.md]. A Stream is an unbounded sequence of these tuples^[600-developer-big-data-storm-storm-01.md].
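
To make the data model concrete, the following minimal sketch shows how a tuple pairs an ordered list of values with field names declared by the emitting component. It assumes the `org.apache.storm.tuple` classes from Storm 1.x+ (earlier releases used `backtype.storm` packages); the field names and values are illustrative.

```java
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSketch {
    public static void main(String[] args) {
        // Output schema: an ordered list of field names, declared once per component.
        Fields schema = new Fields("word", "count");
        // An emitted tuple supplies one value per declared field, in order.
        Values values = new Values("storm", 1);
        System.out.println(schema.toList() + " -> " + values);  // [word, count] -> [storm, 1]
    }
}
```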

Topology and Execution Model

The computation logic is structured as a Topology, a graph that connects Spouts and Bolts^[600-developer-big-data-storm-storm-01.md].

  • Spouts: These act as the source of the stream^[600-developer-big-data-storm-storm-01.md]. Their primary role is to read data from external sources and emit it into the topology for processing^[600-developer-big-data-storm-storm-01.md].
  • Bolts: These serve as the logical processing units^[600-developer-big-data-storm-storm-01.md]. They consume input tuples, perform processing or transformation, and may emit new tuples to other bolts^[600-developer-big-data-storm-storm-01.md].
  • Stream Grouping: This mechanism dictates how a stream's tuples are partitioned among the tasks of the receiving component, whether the stream flows from a spout to a bolt or from one bolt to another (see the wiring sketch after this list)^[600-developer-big-data-storm-storm-01.md].
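
The following minimal sketch wires these pieces together and submits the result. It assumes Storm 1.x+ package names and the hypothetical RandomWordSpout and WordCountBolt components sketched in the Programming Interface section below.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the source of the stream.
        builder.setSpout("words", new RandomWordSpout());

        // Bolt: the processing unit. The stream grouping (fieldsGrouping here)
        // routes all tuples with the same "word" value to the same bolt task.
        builder.setBolt("count", new WordCountBolt())
               .fieldsGrouping("words", new Fields("word"));

        // Submit the assembled topology to the cluster.
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```

Other groupings exist alongside fieldsGrouping; shuffleGrouping, for instance, distributes tuples randomly and evenly across the receiving bolt's tasks.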

Cluster Architecture

An Apache Storm cluster is built from two main node types^[600-developer-big-data-storm-storm-01.md]:

  • Nimbus: The master node, responsible for managing the cluster and assigning work to the supervisors^[600-developer-big-data-storm-storm-01.md].
  • Supervisor: The daemon running on each worker node; it starts and stops worker processes as instructed by Nimbus^[600-developer-big-data-storm-storm-01.md].

The actual execution workload is divided as follows^[600-developer-big-data-storm-storm-01.md]:

  • Worker Process (JVM): A topology runs across one or more worker processes, each a JVM that executes a subset of that topology's tasks^[600-developer-big-data-storm-storm-01.md].
  • Executor: A single thread spawned by a worker process; it runs one or more tasks of the same spout or bolt^[600-developer-big-data-storm-storm-01.md].
  • Task: The fundamental unit where actual data processing is performed; each task is an instance of a spout or bolt (see the parallelism sketch after this list)^[600-developer-big-data-storm-storm-01.md].
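
How workers, executors, and tasks relate is set when the topology is built. A minimal sketch with arbitrary numbers, reusing the hypothetical components from the other sketches:

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new RandomWordSpout(), 2);  // 2 executors (threads)

        // 4 executor threads share 8 task instances of the bolt,
        // so each executor runs 2 tasks.
        builder.setBolt("count", new WordCountBolt(), 4)
               .setNumTasks(8)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);  // spread the topology across 2 worker JVMs
    }
}
```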

Programming Interface

Developers implement the processing logic primarily using two interfaces:

  • IRichSpout: Implemented to define the data source. Storm repeatedly calls its nextTuple() method, which emits data via a collector^[600-developer-big-data-storm-storm-01.md].
  • IRichBolt: Implemented to define the processing logic. Its execute() method processes each input tuple received from the stream (both interfaces are sketched below)^[600-developer-big-data-storm-storm-01.md].
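
A minimal sketch of both interfaces, using the BaseRichSpout and BaseRichBolt convenience classes (which implement IRichSpout and IRichBolt) so only the relevant methods need overriding. It assumes Storm 2.x method signatures (earlier versions take a raw Map in open() and prepare()); the class names and word list are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Spout: reads from a source (here, a fixed array) and emits tuples.
public class RandomWordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"storm", "kafka", "flume", "redis"};
    private SpoutOutputCollector collector;
    private int index;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;  // keep the collector for nextTuple()
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; each call emits one tuple.
        collector.emit(new Values(WORDS[index++ % WORDS.length]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Bolt: counts words and emits running totals.
class WordCountBolt extends BaseRichBolt {
    private final Map<String, Integer> counts = new HashMap<>();
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
        collector.ack(input);  // acknowledge so Storm can track the tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```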

Sources

  • 600-developer-big-data-storm-storm-01.md