What are the three core components of HADOOP?
The three core components of Hadoop are HDFS, MapReduce, and YARN. In brief:
1. HDFS: a distributed file system used to store large amounts of data across a Hadoop cluster. It is highly fault tolerant, stores data across multiple DataNodes, and provides high-throughput data access.
2. MapReduce: a framework for parallel processing of large-scale data sets. It decomposes a big-data job into many small tasks, processes them in parallel on multiple nodes, and finally merges the results.
3. YARN: responsible for the allocation and management of cluster resources.
The three core components of Hadoop are HDFS (distributed file storage), MapReduce (distributed computing) and YARN (resource scheduling).
1. HDFS: Hadoop Distributed File System
HDFS (Hadoop Distributed File System) is a core subproject of Hadoop and is mainly responsible for storing and reading cluster data. HDFS is a distributed file system with a master/slave architecture. It supports a traditional hierarchical file organization, in which users or applications can create directories and then store files in them. The namespace hierarchy is similar to that of most existing file systems, and files can be created, read, updated, and deleted through file paths. However, because of its distributed nature, it differs from traditional file systems in important ways.
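As a concrete illustration, the following minimal sketch uses Hadoop's Java FileSystem API to create a directory, write a file, read it back, and delete it. It assumes a reachable cluster configured through fs.defaultFS on the classpath; the /demo paths are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath; fs.defaultFS selects the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo");            // illustrative directory
        Path file = new Path("/demo/hello.txt"); // illustrative file

        fs.mkdirs(dir);                          // create a directory

        try (FSDataOutputStream out = fs.create(file, true)) { // write, overwriting if present
            out.writeUTF("hello hdfs");
        }

        try (FSDataInputStream in = fs.open(file)) {            // read the file back
            System.out.println(in.readUTF());
        }

        fs.delete(file, false);                  // delete the file (non-recursive)
        fs.close();
    }
}
```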
HDFS advantages:
- High fault tolerance. Data uploaded to HDFS is automatically saved as multiple replicas, and fault tolerance can be strengthened by increasing the number of replicas. If a replica is lost, HDFS automatically re-replicates it on another machine, with no intervention from the user (see the sketch after this list).
- Suitable for big data processing. HDFS can handle data at the gigabyte, terabyte, and even petabyte scale, with file counts reaching into the millions. (1 PB = 1024 TB, 1 TB = 1024 GB)
- Streaming data access. HDFS uses a write-once, read-many streaming access model for storing very large files: once a file is written, it cannot be modified in place, only appended to. This keeps the data consistent.
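The replication behaviour mentioned above can also be adjusted per file. Here is a small, hypothetical sketch using FileSystem.setReplication; the path and the factor of 3 are illustrative, and the cluster-wide default still comes from dfs.replication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/hello.txt");   // illustrative path

        // Ask HDFS to keep 3 copies of this file; the NameNode schedules the re-replication.
        fs.setReplication(file, (short) 3);

        // Verify the replication factor recorded in the file's metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        fs.close();
    }
}
```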
2. MapReduce: Large-scale data processing
MapReduce is Hadoop's core computing framework, designed for programming parallel operations on large-scale data sets (larger than 1 TB). The model consists of two parts: Map (mapping) and Reduce (reduction).
When a MapReduce job starts, the Map side reads data from HDFS, maps it into key-value pairs of the required type, and passes them to the Reduce side. The Reduce side receives the key-value pairs from the Map side, groups them by key, processes each group of values sharing the same key, and writes the resulting new key-value pairs back to HDFS. This is the core idea of MapReduce.
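The classic word-count program is commonly used to illustrate this idea. Below is a sketch of the Map and Reduce sides written against the org.apache.hadoop.mapreduce API; the class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: one input record (offset, line) becomes one (word, 1) pair per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reduce side: all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // emit (word, total)
    }
}
```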
A complete MapReduce process includes several stages: data input and splitting, Map-stage processing, shuffle and sort, Reduce-stage processing, and data output:
- Read input data. Data in the MapReduce process is read from HDFS. When a file is uploaded to HDFS, it is generally divided into blocks of 128 MB by default, so when a MapReduce program runs, each data block normally produces one Map task. You can also adjust the number of Map tasks by changing the input split size: when the job runs, the file is re-divided (split) according to the configured split size, and each split corresponds to one Map task (see the driver sketch after this list).
- Map stage. A program has one or more Map tasks, determined by the number of input splits. In the Map stage, data is read as key-value pairs: the key is generally the byte offset of the line's first character from the beginning of the file, and the value is the content of that line. The key-value pairs are processed as required, mapped into new key-value pairs, and passed to the Reduce side.
- Shuffle/Sort stage. This stage covers the path from the Map output to the Reduce input. Output records with the same key within a Map are first combined to reduce the amount of data transferred, and the combined data is then sorted by key.
- Reduce stage. There can also be multiple Reduce tasks. According to the partitioning set up in the Map stage, each partition is processed by one Reduce task. Each Reduce task receives data from several different Map tasks, and the data coming from each Map is already sorted. A Reduce task reduces all of the values that share the same key and writes the resulting new key-value pairs to HDFS.
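To tie these stages together, here is a hypothetical driver that wires the word-count classes sketched above into a job. The input/output paths, the 64 MB split cap, and the choice of two Reduce tasks are illustrative; they only show where the split size and the number of Reduce tasks are configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional: pre-aggregate on the Map side
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One Map task per input split; capping the split size raises the number of Map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB, illustrative
        // One partition per Reduce task; here we ask for two Reduce tasks.
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));     // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```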
3. YARN: Resource Management
YARN (Yet Another Resource Negotiator) is the resource management framework introduced with Hadoop 2.0 as a redesign of the original MapReduce architecture. It separates resource management from job execution and provides a more efficient resource management core for the cluster.
YARN mainly consists of three modules: Resource Manager (RM), Node Manager (NM), and Application Master (AM):
- Resource Manager is responsible for the monitoring, allocation, and management of all cluster resources;
- Application Master is responsible for the scheduling and coordination of each specific application;
- Node Manager is responsible for managing and monitoring each individual node.
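As a rough sketch of how these components appear to an application, the code below uses the YarnClient API to ask the ResourceManager for cluster metrics and per-NodeManager reports. It assumes a running cluster whose address is supplied by yarn-site.xml on the classpath.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml, including the RM address
        yarn.start();

        // The ResourceManager tracks every NodeManager in the cluster.
        System.out.println("Running NodeManagers: "
                + yarn.getYarnClusterMetrics().getNumNodeManagers());

        // Per-node view: capacity and currently used resources on each NodeManager.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed());
        }

        yarn.stop();
    }
}
```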