Big data core challenges

The concept of Big Data revolves around two primary operational requirements: data storage and data computation.^[600-developer__big-data__big-data.md]

Managing these requirements has led to the development of specific software frameworks designed to handle large-scale information processing.^[600-developer__big-data__big-data.md] The two dominant frameworks in this domain are Hadoop and Spark.^[600-developer__big-data__big-data.md]

Hadoop ecosystem

The Hadoop ecosystem addresses big data challenges through a distributed storage and processing model.^[600-developer__big-data__big-data.md] Its core components and related engines include:

  • Storage and Processing: [[HDFS]] (Hadoop Distributed File System) for storage, [[MapReduce]] for processing, and [[HBase]] (a NoSQL database).^[600-developer__big-data__big-data.md]
  • Data Analysis: Engines such as [[Hive]] and [[Pig]] facilitate data analysis on large datasets.^[600-developer__big-data__big-data.md]
  • Data Ingestion: Tools like [[Sqoop]] and [[Flume]] are used for data collection and transfer.^[600-developer__big-data__big-data.md]
  • Management: [[HUE]] provides web-based management, while [[Oozie]] is used for workflow scheduling.^[600-developer__big-data__big-data.md]
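The map-shuffle-reduce pattern behind [[MapReduce]] can be sketched in a few lines of plain Python. This is a toy word count that mimics the three phases on a single machine; it does not use the Hadoop API, and the phase names are just illustrative labels for what Hadoop distributes across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs storage", "big data needs computation"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])      # 2
print(counts["storage"])  # 1
```

In real Hadoop, each phase runs in parallel across many nodes and the shuffle moves data over the network, but the dataflow is the same.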

Spark ecosystem

[[Spark|Apache Spark]] is another unified framework used to meet the challenges of big data, particularly known for its speed and ease of use.^[600-developer__big-data__big-data.md] It addresses computation through various components:

  • Core: Built on [[Scala]], Spark includes a core engine for general data computation.^[600-developer__big-data__big-data.md]
  • Querying: [[Spark SQL]] is used for structured data processing.^[600-developer__big-data__big-data.md]
  • Streaming: [[Spark Streaming]] enables real-time data processing and stream computing.^[600-developer__big-data__big-data.md]
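[[Spark Streaming]]'s core idea is to discretize a continuous stream into small batches and run ordinary batch computations over them. The sketch below models that micro-batch idea in plain Python; the function names and batch size are illustrative, and this is not the PySpark API.

```python
import itertools

def micro_batches(stream, batch_size):
    """Split an event stream into fixed-size micro-batches,
    mimicking the discretized-stream model Spark Streaming uses."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

def running_count(batches):
    """A stateful computation carried across batches: a running
    total of events seen so far."""
    total = 0
    for batch in batches:
        total += len(batch)
        yield total

events = range(10)  # stand-in for a live event source
totals = list(running_count(micro_batches(events, batch_size=3)))
print(totals)  # [3, 6, 9, 10]
```

Spark itself distributes each micro-batch across the cluster and checkpoints the state, but the batching-plus-stateful-update loop is the same shape.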

Sources

  • 600-developer__big-data__big-data.md