Big data core challenges

The concept of Big Data revolves around two primary operational requirements: data storage and data computation.^[600-developer__big-data__big-data.md]

Managing these requirements has led to the development of specific software frameworks designed to handle large-scale information processing.^[600-developer__big-data__big-data.md] The two dominant frameworks in this domain are Hadoop and Spark.^[600-developer__big-data__big-data.md]

Hadoop ecosystem

The Hadoop ecosystem addresses big data challenges through a distributed storage and processing model.^[600-developer__big-data__big-data.md] Its core components and related engines include:

  • Storage and Processing: [[HDFS]] (Hadoop Distributed File System) for storage, [[MapReduce]] for processing, and [[HBase]] (a NoSQL database).^[600-developer__big-data__big-data.md]
  • Data Analysis: Engines such as [[Hive]] and [[Pig]] facilitate data analysis on large datasets.^[600-developer__big-data__big-data.md]
  • Data Ingestion: Tools like [[Sqoop]] and [[Flume]] are used for data collection and transfer.^[600-developer__big-data__big-data.md]
  • Management: [[HUE]] provides web-based management, while [[Oozie]] is used for workflow scheduling.^[600-developer__big-data__big-data.md]
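The map-shuffle-reduce pattern behind [[MapReduce]] can be sketched in a few lines of plain Python. This is a toy word count that mimics the three phases on a single machine; it does not use the Hadoop API, and the phase names are just illustrative labels for what Hadoop distributes across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs storage", "big data needs computation"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])      # 2
print(counts["storage"])  # 1
```

In real Hadoop, each phase runs in parallel across many nodes and the shuffle moves data over the network, but the dataflow is the same.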

Spark ecosystem

[[Spark|Apache Spark]] is another unified framework used to meet the challenges of big data, particularly known for its speed and ease of use.^[600-developer__big-data__big-data.md] It addresses computation through various components:

  • Core: Built on [[Scala]], Spark includes a core engine for general data computation.^[600-developer__big-data__big-data.md]
  • Querying: [[Spark SQL]] is used for structured data processing.^[600-developer__big-data__big-data.md]
  • Streaming: [[Spark Streaming]] enables real-time data processing and stream computing.^[600-developer__big-data__big-data.md]
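[[Spark Streaming]]'s core idea is to discretize a continuous stream into small batches and run ordinary batch computations over them. The sketch below models that micro-batch idea in plain Python; the function names and batch size are illustrative, and this is not the PySpark API.

```python
import itertools

def micro_batches(stream, batch_size):
    """Split an event stream into fixed-size micro-batches,
    mimicking the discretized-stream model Spark Streaming uses."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

def running_count(batches):
    """A stateful computation carried across batches: a running
    total of events seen so far."""
    total = 0
    for batch in batches:
        total += len(batch)
        yield total

events = range(10)  # stand-in for a live event source
totals = list(running_count(micro_batches(events, batch_size=3)))
print(totals)  # [3, 6, 9, 10]
```

Spark itself distributes each micro-batch across the cluster and checkpoints the state, but the batching-plus-stateful-update loop is the same shape.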

Sources

  • 600-developer__big-data__big-data.md