Hadoop implements a distributed file system, the Hadoop Distributed File System, referred to as HDFS. HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data, making it well suited to applications with large datasets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.
The core of Hadoop's framework consists of HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data. In a word, Hadoop is storage plus computation. The name Hadoop is not an abbreviation but a made-up word. The project's creator, Doug Cutting, explained how Hadoop got its name: "This is the name my child gave to a brown elephant toy."
Hadoop is a distributed computing platform that users can easily build on and use. Users can readily develop and run applications that process massive amounts of data on Hadoop. Its main advantages are:
1. High reliability. Hadoop's bit-by-bit approach to storing and processing data has earned people's trust.
2. High scalability. Hadoop distributes data and computing tasks across clusters of available computers, and these clusters can easily be expanded to thousands of nodes.
3. Efficiency. Hadoop can move data dynamically between nodes and keep each node dynamically balanced, so processing is very fast.
4. High fault tolerance. Hadoop automatically keeps multiple copies of data and automatically redistributes failed tasks.
5. Low cost. Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so a project's software costs are greatly reduced.
Hadoop's framework is written in Java, so it is ideal to run on a Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.
The significance of Hadoop for big data processing
Hadoop's wide use in big data processing stems from its natural advantages in data extraction, transformation and loading (ETL). Hadoop's distributed architecture places the big data processing engine as close to the storage as possible, which suits batch operations such as ETL, since their batch results can go directly to storage. Hadoop's MapReduce function breaks a single task into pieces, sends the fragments (Map) to multiple nodes, and then loads (Reduce) the results into the data warehouse as a single dataset.
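The Map-then-Reduce flow described above can be sketched with a small, self-contained word-count simulation. This is plain Java with no Hadoop dependencies; the class and method names are illustrative, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // "Map" step: turn one input fragment into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String fragment) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : fragment.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // "Reduce" step: merge all intermediate pairs into a single result set.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Each fragment stands in for a task shard sent to one node.
        List<String> fragments = Arrays.asList("big data needs hadoop", "hadoop stores big data");
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String f : fragments) {
            intermediate.addAll(map(f)); // in real Hadoop, map tasks run in parallel on many nodes
        }
        System.out.println(reduce(intermediate)); // {big=2, data=2, hadoop=2, needs=1, stores=1}
    }
}
```

In real Hadoop, the map tasks run concurrently on the nodes holding the data, and a shuffle phase groups intermediate pairs by key before the reduce tasks run; this sketch collapses both into sequential loops to show only the data flow.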
A Hadoop learning route, from the PHP Chinese website:
1. Hadoop Common: the module at the bottom of the Hadoop stack, providing utilities for the Hadoop sub-projects, such as configuration-file handling and log operations.
2. HDFS: a distributed file system providing high-throughput access to application data. To external clients, HDFS looks like a traditional hierarchical file system. Files can be created,
deleted, moved, renamed, and so on. However, the HDFS architecture is built on a specific set of nodes (see Figure 1), which follows from its design. These nodes include the NameNode (only one), which provides metadata services inside HDFS, and the DataNodes, which provide storage blocks to HDFS. Because only one NameNode exists, this is a drawback of HDFS (a single point of failure). Files stored in HDFS are divided into blocks, and these blocks are then replicated to multiple computers (DataNodes); this is very different from a traditional RAID architecture. The block size (usually 64 MB) and the number of replicas are determined by the client when the file is created. The NameNode controls all file operations. All communication within HDFS is based on the standard
TCP/IP protocol.
3. MapReduce: a software framework for distributed processing of massive datasets on compute clusters.
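The block-and-replica model described for HDFS above can be illustrated with a toy placement planner. This is plain Java, not the real HDFS placement policy: the 64 MB block size comes from the text, while the replication factor of 3 is a common default assumed here for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlanner {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, as noted above

    // Number of blocks a file is split into (ceiling division).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Round-robin replica placement across DataNodes. Real HDFS uses a
    // rack-aware policy; this only shows that every block is copied to
    // several machines, so losing one DataNode loses no data.
    static List<String> placeReplicas(long fileSize, int replication, int dataNodes) {
        List<String> plan = new ArrayList<>();
        int node = 0;
        for (long b = 0; b < blockCount(fileSize); b++) {
            for (int r = 0; r < replication; r++) {
                plan.add("block-" + b + " copy-" + r + " -> datanode-" + node);
                node = (node + 1) % dataNodes;
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;               // a 200 MB file
        System.out.println(blockCount(fileSize));          // 4 blocks (ceiling of 200/64)
        for (String line : placeReplicas(fileSize, 3, 5)) // 3 copies each, 5 DataNodes
            System.out.println(line);
    }
}
```

In the real system the NameNode records this block-to-DataNode mapping as metadata, and clients then read or write the blocks directly from the DataNodes.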
4. Avro: an RPC project hosted by Doug Cutting, mainly responsible for data serialization. It is somewhat similar to Google's protobuf and Facebook's Thrift. Avro will be used for Hadoop's RPC in the future, making Hadoop's RPC communication faster and its data structures more compact.
5. Hive: similar to CloudBase, it is a set of software built on the Hadoop distributed computing platform that provides data-warehouse SQL functionality. It simplifies summarization and ad hoc queries over massive data stored in Hadoop. Hive provides a query language, HiveQL, based on SQL, which is very convenient to use.
6. HBase: built on the Hadoop Distributed File System, it is an open-source, scalable distributed database
based on a column-oriented storage model, and it supports the storage of structured data in large tables.
7. Pig: a high-level data-flow language and execution framework for parallel computation. Its SQL-like language is a high-level query language built on MapReduce: it compiles operations into the Map and Reduce steps of the MapReduce model, and users can define their own functions.
8. ZooKeeper: an open-source implementation of Google's Chubby. It is a reliable coordination system for large-scale distributed systems, providing configuration maintenance, naming services, distributed synchronization, group services, and more. ZooKeeper's goal is to encapsulate complex, error-prone key services and give users a simple, easy-to-use interface and a system with efficient performance and stable functionality.
9. Chukwa: a data collection system for managing large-scale distributed systems, contributed by Yahoo.
10. Cassandra: a scalable multi-master database with no single point of failure.
11. Mahout: A scalable machine learning and data mining library.
Hadoop's initial design goals were high reliability, high scalability, high fault tolerance, and efficiency. These inherent design advantages made Hadoop a favorite of many large companies as soon as it appeared, and also attracted widespread attention from the research community. To date, Hadoop technology has been widely applied across the Internet industry.
The above is a detailed introduction to what Hadoop is and to a Hadoop learning route. For more Hadoop news and information, follow the platform's official website, WeChat account, and other channels. The platform's IT career online learning and education service provides an authoritative big data Hadoop training course and video tutorial system, an adaptive Hadoop online video course recorded by a gold medal lecturer, helping you quickly master practical Hadoop skills from entry level to proficiency in big data development.
The above is the detailed content of A brief discussion on what Hadoop is and its learning route. For more information, please follow other related articles on the PHP Chinese website!