java - I've recently become interested in big data. Is Hadoop outdated? Should I study Spark in depth instead?

Question

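I've recently become interested in big data and would like to move my career in that direction. Is Hadoop outdated, and should I study Spark in depth instead?
I work full time, so my study time is limited. I'm worried that after I spend that time learning Hadoop, companies will have stopped using it.
From what I've heard, companies are all adopting Spark now because it computes in memory and is much more efficient.
I'd appreciate advice from anyone who has been down this road. Thanks!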

天蓬老师 · Answer

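Hadoop does not refer only to its computing model, MapReduce; it refers to the whole Hadoop ecosystem, including HDFS, HBase, Hive, and so on. Spark only replaces and enriches the computing model within Hadoop, and it still depends on other parts of the Hadoop ecosystem to run. So if you mean only Hadoop's computing model, MapReduce, then I think it is indeed outdated to some extent (although there are still scenarios where it fits).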

If you are interested, you can read this article: Spark And Hadoop Are Friends, Not Foes

迷茫 · Answer

Hadoop is now in its 2.0 era and has three components: HDFS, YARN, and MapReduce. HDFS is a distributed file system that stores input and output data. YARN is a distributed resource management system that schedules the cluster's CPU and memory. MapReduce is a distributed computing framework whose programming model was originally designed at Google for tasks such as web page ranking (PageRank). It is a very general model that can be used to write all kinds of big data processing programs, such as word counting and page ranking.
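
As a rough illustration of the word-counting case, here is a minimal sketch of the MapReduce programming model in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this sketch, and everything runs in a single process on an in-memory list of lines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework's job in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (hypothetical single-process driver)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["hadoop spark hadoop", "spark storm"]
    print(word_count(sample))  # {'hadoop': 2, 'spark': 2, 'storm': 1}
```

In a real Hadoop job you would write only the map and reduce logic; the framework takes care of splitting the input, shuffling the pairs, and distributing the work across machines.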

Hadoop MapReduce, Spark, Storm, and others are all distributed computing frameworks, each suited to different application scenarios: Hadoop MapReduce for offline computing such as log processing, Spark for machine learning, and Storm for real-time stream computing. Put another way, they are like different apps on a mobile phone, each with its own function. Strictly speaking, then, there is no question of one replacing another. Of course, Spark and Hadoop MapReduce can be used to complete the same task, and Spark usually runs faster, but it also consumes more memory. That is why Spark cannot completely replace Hadoop MapReduce: some applications can tolerate the longer execution time of MapReduce in exchange for lower memory use.
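
The speed-versus-memory trade-off can be pictured with a toy model. The sketch below is not Hadoop or Spark code, and its names (run_with_disk_handoff, run_in_memory, halve_toward_mean) are hypothetical. It assumes a multi-step job in which each step feeds the next: the first function hands intermediate results over through a file, roughly the way chained MapReduce jobs pass data through HDFS, while the second keeps them in memory, which is closer to what Spark does when the data fits.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

Step = Callable[[List[float]], List[float]]


def halve_toward_mean(values: List[float]) -> List[float]:
    """Hypothetical iteration step: move every value halfway to the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def run_with_disk_handoff(data: List[float], step: Step, iterations: int) -> Tuple[List[float], int]:
    """Each iteration reads its input from a file and writes its output back."""
    disk_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        with open(path, "w") as f:  # initial load of the input data
            json.dump(data, f)
        disk_ops += 1
        for _ in range(iterations):
            with open(path) as f:  # read the previous step's output
                current = json.load(f)
            disk_ops += 1
            with open(path, "w") as f:  # write this step's output
                json.dump(step(current), f)
            disk_ops += 1
        with open(path) as f:  # collect the final result
            result = json.load(f)
        disk_ops += 1
    return result, disk_ops


def run_in_memory(data: List[float], step: Step, iterations: int) -> Tuple[List[float], int]:
    """Intermediate results never leave memory, so no file I/O is counted."""
    current = list(data)
    for _ in range(iterations):
        current = step(current)
    return current, 0


if __name__ == "__main__":
    numbers = [1.0, 2.0, 6.0]
    print(run_with_disk_handoff(numbers, halve_toward_mean, 3))
    print(run_in_memory(numbers, halve_toward_mean, 3))
```

Both functions return the same values; the difference is that the first one pays for a read and a write on every iteration, while the second must hold the whole working set in RAM.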

In addition, Hadoop MapReduce, Spark, Storm, and many other distributed computing frameworks belong to the Hadoop ecosystem. They can run in the same Hadoop cluster and share HDFS and YARN. If these computing frameworks are the apps on a mobile phone, then Hadoop's HDFS and YARN are the phone's operating system.

So, my suggestion is:

Hadoop is a must for getting started with big data. MapReduce is the most basic distributed computing framework, and other frameworks such as Spark build on its ideas, so understanding MapReduce makes the other systems easier to understand. Hadoop is also the platform that the other computing frameworks in its ecosystem run on, so you cannot avoid it.
Learn Spark and other computing frameworks according to your company's needs. Self-study only gets you started; you truly master a framework by writing real applications.

My blog may help you quickly set up a Hadoop test environment:

Build an upgraded version of Hadoop cluster based on Docker

阿神 · Answer

Hadoop is the infrastructure for distributed computing. At most, Spark can replace only Hadoop MapReduce. Many big data tools are built on HDFS and MapReduce, including HBase, Hive, Sqoop, and Kafka. That said, for development work it is better to learn Spark directly, and it is easy to get started with.

阿神 · Answer

Learning Hadoop and learning Spark do not conflict. At most companies that use Spark today, the data is still stored on Hadoop HDFS. Spark SQL and Hive both let you query data with SQL-like syntax, so they feel similar to use.