Apache Storm core concepts¶
Apache Storm is a distributed real-time computation system designed for processing unbounded data streams. It is often used in conjunction with other tools like Flume (for data acquisition), Kafka (for temporary data storage), and Redis (an in-memory database for saving results)^[600-developer-big-data-storm-storm-01.md].
Core Data Model¶
The fundamental data structure in Storm is the Tuple, the main unit of data transferred through the system^[600-developer-big-data-storm-storm-01.md]. A sequence of these tuples constitutes a Stream, which is defined as an unbounded sequence of tuples^[600-developer-big-data-storm-storm-01.md].
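The Tuple/Stream relationship can be pictured with a minimal, library-free Python sketch (this is not the Storm API; the field names and generator are purely illustrative):

```python
# Illustrative sketch: a Tuple modelled as a named-field record, and a
# Stream as an unbounded sequence of such tuples.
from itertools import count, islice

def sensor_stream():
    """Emit an unbounded stream of tuples; field names are made up."""
    for i in count():
        yield {"sensor_id": i % 3, "reading": i * 0.5}  # one Tuple

# A consumer only ever takes a finite slice of the unbounded stream.
first_four = list(islice(sensor_stream(), 4))
```

The point of the sketch is that the stream itself never terminates; downstream logic always works on whatever tuples have arrived so far.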
Topology and Execution Model¶
The computation logic is structured as a Topology, which connects Spouts and Bolts^[600-developer-big-data-storm-storm-01.md].
- Spouts: These act as the source of the stream^[600-developer-big-data-storm-storm-01.md]. Their primary role is to read data from external sources and emit it into the topology for processing^[600-developer-big-data-storm-storm-01.md].
- Bolts: These serve as the logical processing units^[600-developer-big-data-storm-storm-01.md]. They consume input tuples, perform processing or transformation, and may emit new tuples to other bolts^[600-developer-big-data-storm-storm-01.md].
- Stream Grouping: This mechanism dictates how the stream of data flows from spouts to bolts, or from one bolt to another^[600-developer-big-data-storm-storm-01.md].
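One common grouping routes tuples by the value of a field, so equal values always reach the same bolt instance. A plain-Python sketch of that idea (not Storm's API; the function name and tuple shape are assumptions):

```python
# Illustrative fields-grouping-style router: tuples with the same value in
# `field` are always routed to the same downstream bolt task.
def fields_grouping(tuples, field, num_bolt_tasks):
    routed = {i: [] for i in range(num_bolt_tasks)}
    for t in tuples:
        task = hash(t[field]) % num_bolt_tasks  # same value -> same task
        routed[task].append(t)
    return routed

tuples = [{"key": k} for k in range(5)]
routed = fields_grouping(tuples, "key", 2)
```

Other groupings (shuffle, all, global) differ only in how the target task is chosen for each tuple.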
Cluster Architecture¶
Apache Storm operates as a distributed cluster involving two main node types^[600-developer-big-data-storm-storm-01.md]:
- Nimbus: The master node responsible for managing the cluster^[600-developer-big-data-storm-storm-01.md].
- Supervisor: Worker nodes that follow instructions given by the Nimbus^[600-developer-big-data-storm-storm-01.md].
The actual execution workload is divided as follows^[600-developer-big-data-storm-storm-01.md]:
- Worker Process (JVM): A specific topology runs across one or more worker processes. Each worker executes a subset of the topology's tasks^[600-developer-big-data-storm-storm-01.md].
- Executor: An executor is a single thread spawned by a worker process; it runs one or more tasks for the same spout or bolt^[600-developer-big-data-storm-storm-01.md].
- Task: A task represents the fundamental unit where actual data processing is performed; each task is an instance of a spout or bolt^[600-developer-big-data-storm-storm-01.md].
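The nesting of tasks inside executors inside workers can be sketched as a toy round-robin assignment (this is not Storm's actual scheduler; the counts and distribution policy are assumptions for illustration):

```python
# Toy sketch: spread tasks evenly across executor threads, and executors
# round-robin across worker processes.
def assign(num_workers, num_executors, num_tasks):
    executors = {e: [] for e in range(num_executors)}
    for t in range(num_tasks):
        executors[t % num_executors].append(t)
    workers = {w: [] for w in range(num_workers)}
    for e in range(num_executors):
        workers[e % num_workers].append(e)
    return workers, executors

workers, executors = assign(num_workers=2, num_executors=4, num_tasks=8)
```

With these numbers each worker JVM hosts two executor threads, and each executor runs two tasks.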
Programming Interface¶
Developers implement the processing logic primarily using two interfaces:
- IRichSpout: Implemented to define the data source. It typically uses the `nextTuple()` method to emit data via a collector^[600-developer-big-data-storm-storm-01.md].
- IRichBolt: Implemented to define processing logic. It uses the `execute()` method to process input tuples received from the stream^[600-developer-big-data-storm-storm-01.md].
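The shape of these interfaces can be mirrored in a library-free Python sketch (the real interfaces are Java; the class names, the sentence data, and the collector here only echo the Storm pattern and are not its API):

```python
# Sketch of the spout/bolt contract: a spout emits tuples via a collector,
# a bolt consumes each tuple and may emit new ones.
class Collector:
    def __init__(self):
        self.emitted = []
    def emit(self, tup):
        self.emitted.append(tup)

class SentenceSpout:
    def __init__(self, sentences):
        self.sentences = iter(sentences)
    def next_tuple(self, collector):      # plays the role of nextTuple()
        collector.emit(next(self.sentences))

class SplitBolt:
    def execute(self, tup, collector):    # plays the role of execute()
        for word in tup.split():
            collector.emit(word)

# Wiring: spout emits, bolt consumes and re-emits, like a two-node topology.
spout_out, bolt_out = Collector(), Collector()
SentenceSpout(["storm processes streams"]).next_tuple(spout_out)
bolt = SplitBolt()
for t in spout_out.emitted:
    bolt.execute(t, bolt_out)
```

In real Storm code the framework, not the user, drives the `nextTuple()`/`execute()` loop and handles serialization and routing between nodes.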
Related Concepts¶
- Apache Kafka
- [[Redis]]
- Stream Processing
Sources¶
- 600-developer-big-data-storm-storm-01.md