Big data learning route

(*-*)浩

Jun 05, 2019 am 10:59 AM


Java (Java SE, MySQL)

Linux (shell, high-concurrency architecture, Lucene, Solr)

Hadoop (Hadoop, HDFS, MapReduce, YARN, Hive, HBase, Sqoop, ZooKeeper, Flume)

Machine learning (R, Mahout)

Storm (Storm, Kafka, Redis)

Spark (Scala, Spark, Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX)

Python (Python, Spark Python)

Cloud computing platform (Docker, KVM, OpenStack)

Term explanation

Beginners have many things to watch for when learning big data. Whatever happens, once you have chosen to enter the big data industry, you have to be ready to weather its ups and downs. As the saying goes, never forget your original intention and you will succeed in the end. What learning big data demands most is perseverance.

Java SE basics (including MySQL). Note that this means Java SE, not Java EE: knowledge of Java web development is not required for big data engineers.

Linux

Lucene: A full-text search engine library.

Solr: A full-text search server built on Lucene. It is configurable and scalable, optimizes query performance, and provides a full-featured management interface.

Hadoop

HDFS: A distributed storage system made up of NameNodes and DataNodes. The NameNode stores metadata; the DataNodes store the data itself.
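The split between metadata and data can be sketched with a toy model. This is a minimal illustration, not the HDFS API: the class names mirror the terms above, but `write_file`, `read_file`, the 4-byte block size, the replication factor of 2, and the round-robin placement are all assumptions made for the example.

```python
BLOCK_SIZE = 4    # bytes; assumed tiny value for illustration only
REPLICATION = 2   # assumed number of copies kept of each block


class DataNode:
    """Stores raw block contents, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Holds only metadata: which blocks form a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_blocks = {}       # path -> [block_id, ...]
        self.block_locations = {}   # block_id -> [DataNode, ...]
        self.next_block_id = 0


def write_file(namenode, path, data):
    """Split data into blocks and place each block on REPLICATION DataNodes."""
    namenode.file_blocks[path] = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = namenode.next_block_id
        namenode.next_block_id += 1
        # Hypothetical round-robin placement across the available DataNodes.
        targets = [
            namenode.datanodes[(block_id + i) % len(namenode.datanodes)]
            for i in range(REPLICATION)
        ]
        for node in targets:
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        namenode.file_blocks[path].append(block_id)
        namenode.block_locations[block_id] = targets


def read_file(namenode, path):
    """Look up block ids in the metadata, then fetch bytes from a DataNode."""
    parts = []
    for block_id in namenode.file_blocks[path]:
        first_replica = namenode.block_locations[block_id][0]
        parts.append(first_replica.blocks[block_id])
    return b"".join(parts)
```

Note that the `NameNode` object never holds file contents, only the mapping from paths to blocks and from blocks to nodes.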

YARN: Hadoop's resource management and job scheduling layer. It can be thought of as the coordinator for MapReduce jobs (MapReduce itself being Hadoop's processing and analysis mechanism), and it is split into a ResourceManager and NodeManagers.

MapReduce: A software framework for writing programs that process large data sets in parallel.
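The programming model behind MapReduce can be sketched on a single machine with a word count. This is a plain-Python illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything happens in one process to keep the three steps visible.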

Hive: A data warehouse that can be queried with SQL; queries run as Map/Reduce programs. It is used for computing trends or analyzing website logs and should not be used for real-time queries, because results take a long time to come back.

HBase: A database well suited to real-time queries over big data. Facebook has used HBase to store message data and analyze messages in real time.

ZooKeeper: A reliable coordination system for large-scale distributed systems. Hadoop's distributed synchronization is implemented with ZooKeeper, for example coordinating multiple NameNodes and active/standby failover.

Sqoop: Transfers data between databases, in particular between relational databases and HDFS in both directions.

Mahout: A scalable machine learning and data mining library, used for recommendation mining, clustering, classification, and frequent itemset mining.

Chukwa: An open source data collection system for monitoring large distributed systems, built on HDFS and the Map/Reduce framework. It also displays, monitors, and analyzes the collected results.

Ambari: Used to configure, manage and monitor Hadoop clusters, based on the Web and with a friendly interface.

Cloudera

Cloudera Manager: Integrates management, monitoring, and diagnostics.

Cloudera CDH: (Cloudera's Distribution, including Apache Hadoop) Cloudera has made its own changes to Hadoop, and the resulting release is called CDH.

Cloudera Flume: A log collection system that supports customizing the various data senders in a logging system in order to collect data.

Cloudera Impala: Provides direct query and interactive SQL for data stored in Apache Hadoop's HDFS and HBase.

Cloudera Hue: A web manager comprising the Hue UI, Hue server, and Hue DB. Hue provides shell interfaces for all CDH components, and MapReduce jobs can be written in Hue.

Machine Learning/R

R: A language and runtime environment for statistical analysis and graphics; it can currently be used with Hadoop through Hadoop-R.

Mahout: Provides scalable implementations of classic machine learning algorithms, including clustering, classification, recommendation filtering, and frequent itemset mining, and can scale out to the cloud through Hadoop.

Storm

Storm: A distributed, fault-tolerant real-time stream computing system. It can be used for real-time analytics, online machine learning, information stream processing, continuous computation, and distributed RPC, and it processes messages and updates databases in real time.

Kafka: A high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data (page views, searches, and so on) of a consumer-scale website. Where Hadoop handles log data through offline analysis, Kafka makes real-time processing possible. Its parallel loading mechanism into Hadoop unifies online and offline message processing.
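The publish-subscribe idea can be sketched with an append-only log that several consumers read at their own pace. This is a toy in-process model, not the Kafka client API; the `Topic` class and its `publish` and `poll` methods are hypothetical names for the example.

```python
class Topic:
    """An append-only message log with an independent read offset per consumer."""

    def __init__(self):
        self.log = []       # messages in arrival order, never modified
        self.offsets = {}   # consumer name -> index of next message to read

    def publish(self, message):
        """Append a message and return its position in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch
```

Because each consumer tracks its own offset, a real-time consumer can read events as they arrive while a batch consumer later reads the same events in bulk, which is one way to picture the online and offline paths sharing a single stream.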

Redis: A key-value database written in C. It is network-accessible, log-type, memory-based, and can optionally persist data to disk.
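One way to picture a memory-based key-value store with log-style persistence is the sketch below. It is not Redis and does not use a Redis client; `TinyKV`, its JSON-lines log format, and the file path are assumptions made for illustration, and reading "log-type" as an append-only log is itself an interpretation.

```python
import json
import os


class TinyKV:
    """In-memory key-value store that replays an append-only log on startup."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        # Rebuild the in-memory state by replaying every logged operation.
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log_file:
                for line in log_file:
                    self._apply(json.loads(line))

    def _apply(self, entry):
        if entry["op"] == "set":
            self.data[entry["key"]] = entry["value"]
        elif entry["op"] == "del":
            self.data.pop(entry["key"], None)

    def _append(self, entry):
        # Write to the log first, then update memory.
        with open(self.log_path, "a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(entry) + "\n")
        self._apply(entry)

    def set(self, key, value):
        self._append({"op": "set", "key": key, "value": value})

    def get(self, key, default=None):
        # Reads are served from memory only.
        return self.data.get(key, default)

    def delete(self, key):
        self._append({"op": "del", "key": key})
```

Reads never touch the disk, which is where the speed of a memory-based store comes from; the log exists only so that the state survives a restart.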

Spark

Scala: A fully object-oriented programming language, similar to Java, that runs on the JVM and also supports functional programming.

jblas: A fast linear algebra library for Java. It is based on BLAS and LAPACK, the de facto industry standards for matrix computation, and uses state-of-the-art implementations such as ATLAS for all of its computational routines, which makes it very fast.

Spark: A general-purpose parallel framework, similar to Hadoop MapReduce, implemented in Scala. It keeps the advantages of Hadoop MapReduce but differs in that the intermediate results of a job can be kept in memory, so there is no need to read and write HDFS between steps. Spark is therefore better suited to iterative MapReduce-style algorithms such as those in data mining and machine learning. Spark can run alongside Hadoop and work with the Hadoop file system; the third-party cluster framework Mesos supports this setup.

Spark SQL: Part of the Apache Spark big data framework, used for structured data processing; it can run SQL-like queries over Spark data.

Spark Streaming: A real-time computing framework built on Spark that extends Spark's ability to process streaming big data.

Spark MLlib: Spark's library of commonly used machine learning algorithms. As of May 2014 it supports binary classification, regression, clustering, and collaborative filtering, and it also includes a low-level gradient descent optimization primitive. MLlib depends on the jblas linear algebra library, which in turn depends on native Fortran routines.
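Gradient descent, the optimization primitive mentioned above, can be sketched in plain Python by fitting a line y = w*x + b to a few points. This is not MLlib code; the function name, learning rate, and step count are assumptions chosen so that the small example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(err * x for err, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Points generated from y = 2x + 1, so the fit should recover w close to 2 and b close to 1.
w, b = gradient_descent([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

A distributed library would compute the gradient sums in parallel across partitions of the data; the update rule itself stays the same.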

Spark GraphX: Spark's API for graphs and graph-parallel computation. It offers a one-stop data solution on top of Spark and can run a complete graph computation pipeline conveniently and efficiently.

Fortran: The earliest high-level computer programming language, widely used in scientific and engineering computing fields.

BLAS: Basic Linear Algebra Subprograms, a library with a large collection of ready-made routines for linear algebra operations.

LAPACK: A well-known open software library for the most common numerical linear algebra problems in scientific and engineering computing, such as solving systems of linear equations, linear least squares problems, eigenvalue problems, and singular value problems.
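As an illustration of the first problem in that list, the sketch below solves a small linear system with Gaussian elimination and partial pivoting. It is a plain-Python teaching version and does not call LAPACK; the function name and the singularity threshold are assumptions for the example.

```python
def solve_linear_system(a, b):
    """Solve a*x = b for a square matrix a using Gaussian elimination."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: move the row with the largest entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x
```

For example, `solve_linear_system([[2, 1], [1, 3]], [3, 5])` solves 2x + y = 3 and x + 3y = 5. Optimized libraries perform the same kind of factorization with blocked, cache-aware routines, which is why they are so much faster than a loop like this.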

ATLAS: An optimized implementation of the BLAS linear algebra library.

Spark Python: Spark is written in Scala, but Java and Python interfaces are provided for wider adoption and compatibility.

Python

Python: An object-oriented, interpreted computer programming language.

Cloud computing platform

Docker: An open source application container engine.

KVM: Kernel-based Virtual Machine, a virtualization technology built into the Linux kernel.

OpenStack: An open source cloud computing management platform project.

