# Big data core challenges
The concept of big data fundamentally revolves around two primary operational requirements: data storage and data computation.^[600-developer__big-data__big-data.md]
Meeting these requirements at scale has driven the development of dedicated software frameworks for large-scale data processing.^[600-developer__big-data__big-data.md] The two dominant frameworks in this domain are Hadoop and Spark.^[600-developer__big-data__big-data.md]
## Hadoop ecosystem
The Hadoop ecosystem addresses big data challenges through a distributed storage and processing model.^[600-developer__big-data__big-data.md] Its core components and related engines include:
- Storage and Processing: [[HDFS]] (Hadoop Distributed File System) for storage, [[MapReduce]] for processing, and [[HBase]] (a NoSQL database).^[600-developer__big-data__big-data.md]
- Data Analysis: Engines such as [[Hive]] and [[Pig]] facilitate data analysis on large datasets.^[600-developer__big-data__big-data.md]
- Data Ingestion: Tools like [[Sqoop]] and [[Flume]] are used for data collection and transfer.^[600-developer__big-data__big-data.md]
- Management: [[HUE]] provides web-based management, while [[Oozie]] is used for workflow scheduling.^[600-developer__big-data__big-data.md]
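To make the [[MapReduce]] processing model above concrete, here is a minimal single-machine sketch of its map → shuffle → reduce phases, using the classic word-count example. This is an illustration of the model only, not Hadoop's actual Java API; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs storage", "big data needs computation"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 2
```

In a real cluster, each phase runs in parallel across many nodes, with [[HDFS]] supplying the input splits and storing the final output.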
## Spark ecosystem
[[Spark|Apache Spark]] is a unified analytics framework for big data, known in particular for its speed and ease of use.^[600-developer__big-data__big-data.md] It addresses the computation requirement through several components:
- Core: Built on [[Scala]], Spark includes a core engine for general data computation.^[600-developer__big-data__big-data.md]
- Querying: [[Spark SQL]] is used for structured data processing.^[600-developer__big-data__big-data.md]
- Streaming: [[Spark Streaming]] enables real-time data processing and stream computing.^[600-developer__big-data__big-data.md]
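A key idea behind Spark's core engine is that transformations (like `map` and `filter`) are lazy: they only describe a pipeline, which runs when an action (like `collect`) is called. The toy class below mimics that model in plain Python; it is a conceptual sketch, not Spark's actual RDD API.

```python
class ToyRDD:
    """A toy, single-machine stand-in for a Spark RDD: transformations
    are recorded lazily and only executed when an action is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations, not yet run

    def map(self, fn):
        # Transformation: returns a new dataset description; computes nothing.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also lazy.
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: materializes the whole pipeline in one pass.
        result = iter(self._data)
        for kind, fn in self._ops:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return list(result)

rdd = ToyRDD(range(1, 6))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [4, 16]
```

In real Spark the same chaining style applies, but the data is partitioned across a cluster and the lazy plan lets the engine optimize and fuse stages before execution.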
## Sources
- 600-developer__big-data__big-data.md