Thinking in BigDate (9): Offline Data Storage and Mining on a Hadoop Big Data Cluster
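Preface:
On February 23, in the Haoweilai (好未来) meeting room at the Danling Street SOHO building, Haidian Huangzhuang, Zhongguancun, Hadoop expert Wu Chao gave a talk on using Hadoop to analyze forum logs. The day after I got back, I made it to this grassroots face-to-face meetup. It was billed as grassroots, but only people like me were the grassroots; everyone else was an expert. The session was aimed mainly at beginners who want to get to know Hadoop, and most of what was covered is touched on in my previous post, Thinking in BigDate (8): The Internals of the Core Hadoop Big Data Architecture, HDFS + MapReduce + HBase + Hive. Why spend so many words on it again here? For one reason only: it gives you another way to extend what you know.
Lately I have been busy sorting out the architecture diagram, working out how the later stages of the project should proceed, and keeping up with my own studies. The diagram looks simple, but it requires you to understand every point in it. I remember that when July and Xia Fen gave a talk at Beihang University, July said: being able to write down what you know so that others can understand it is one level; being able to explain what you have written so that others understand it is another. We all need practice along the way. I have not been blogging for long, but I know well the pain that Wu Chao and July go through. To put it plainly, blogging eats up a great deal of time, and the choice is yours. Even so, some people stubbornly keep on writing.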
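An architecture for storing and mining massive offline data on a Hadoop cluster:
The architecture diagram uses the mainstream Hadoop + Hive + HBase cluster platform. At its simplest, it covers the basic log analysis workflow on a Hadoop cluster. But the diagram is not limited to simple log processing. We can position it as moving traditional data mining techniques onto a Hadoop cluster in order to raise computing efficiency, save time, and lower development cost. That raises a question: how does traditional data mining differ from data analysis on a Hadoop cluster?
I think this question has long puzzled people. Outsiders watch the show; insiders watch the craft. What is gained by moving a traditional data mining process onto a Hadoop cluster? The problem is this: traditional data mining runs on a single machine, or on a minicomputer with a lot of memory, to process the data and build the model. For 7-8 GB of data with few parameters, anyone reasonably familiar with modeling will have a feel for the time involved: ten-plus hours, or even more than a day, counts as a good result. That is far too slow. As the data keeps growing, this becomes the bottleneck. That is why distributed computing was proposed, and with it came the cluster. A cluster is a group of machines working on one problem, or different machines in the cluster handling different stages of a problem. Beyond time, what other advantages are there? This has puzzled me as well. Whenever I get the chance I ask the experts what else is gained, and they often cannot give a clear answer either.
A couple more words on other advantages: 1. handling non-relational (NoSQL) data, such as log files; 2. real-time capability. But neither of these is at the core of traditional data mining. In fact, the time saving alone is reason enough to explore this.
This architecture does not use Spark, the now-mainstream in-memory compute engine. I will discuss it in detail later if it comes up.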
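1. Data storage layer
Function: data collection, processing, storage, loading
Includes: data integration, ETL, data warehouse
Tools: Sqoop, Flume, Kettle, Hive.
Overview:
(1) Sqoop: a data collection tool that imports data into the Hadoop cluster, typically from relational databases.
(2) Flume: a distributed log collection tool, suited to collecting log files from websites, servers, and similar sources.
(3) Kettle: a free, open-source ETL tool. There are also many paid ETL tools; in China they all end up free anyway.
(4) Hive: a tool for building a data warehouse on a Hadoop cluster. Its main job is to bridge between SQL and its own SQL-like query language.
The data storage layer is the prerequisite. And the prerequisite of the prerequisite is data collection and ETL. As an earlier post mentioned, up-front data collection and ETL may take 75% or more of the whole project's time. That shows how important the early work is: without it, nothing later can happen.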
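To make the ETL step above concrete, here is a minimal sketch of the cleaning part of such a pipeline. The log format, the field names, and the rule that drops static resources are all assumptions made for illustration; they do not come from the talk or from any particular tool. The sketch parses raw lines, discards malformed ones, and emits tab-separated records of the kind that could later be loaded into a Hive table.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical access-log format, assumed for illustration only:
#   <ip> <timestamp> <url> <status>
LINE_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) (\S+) (\S+) (\d{3})$")

# Assumed cleaning rule: requests for static resources are not analysed.
STATIC_SUFFIXES = (".css", ".js", ".png", ".gif")


def clean_line(line: str) -> Optional[str]:
    """Return a tab-separated record, or None if the line should be dropped."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # malformed line
    ip, timestamp, url, status = match.groups()
    if url.endswith(STATIC_SUFFIXES):
        return None  # static resource
    return "\t".join((ip, timestamp, url, status))


def etl(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the cleaned records from a stream of raw log lines."""
    for line in lines:
        record = clean_line(line)
        if record is not None:
            yield record


raw = [
    "10.0.0.1 2014-02-23T10:00:01 /forum/thread-1.html 200",
    "10.0.0.2 2014-02-23T10:00:02 /static/site.css 200",
    "garbage line",
    "10.0.0.1 2014-02-23T10:00:05 /forum/thread-2.html 404",
]
cleaned = list(etl(raw))
for record in cleaned:
    print(record)
```

In a real project the same logic would more likely live in a Kettle transformation or in a map-only job; the point is only that most of the early effort goes into rules like these.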
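2. Cluster architecture layer
Function: offline data analysis system
Core: big data storage and cluster system: Hive 0.12.0 & Hadoop 2.2.0 & HBase 0.96.1
Overview:
(1) Hadoop: an open-source distributed cluster platform. 2.2.0 was the latest version at the time of writing.
(2) HBase: a column-oriented distributed database, suited to building data-serving systems that need low latency under concurrent access.
(3) HDFS: a distributed file system, the de facto standard for storing massive data.
The cluster architecture layer is, as the name says, the core of the cluster platform. When we talk about setting up a Hadoop platform, we usually mean Hive + Hadoop + HBase. You have to build it yourself on Linux by following the documentation. In fact, once the Hadoop configuration files are in place we can already test with data: upload a modestly sized dataset to check whether Hadoop runs correctly. HBase and Hive are there for big data processing. I will not cover how to write the configuration files here; with the documentation available online, installation and configuration should not be a problem.
The aim here is to walk through the overall data mining workflow and to form a picture of the architecture in your head.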
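3. Distributed computing engine layer
Function: data-intensive computation
Core: YARN, MapReduce
Overview:
(1) YARN: a distributed resource management framework; it can also be seen as a framework that manages distributed processing platforms such as MapReduce.
(2) Map/Reduce: a framework for data-intensive offline analysis. This differs from Spark, the currently popular in-memory data processing architecture.
This is where data processing itself comes in; the previous post discussed the internals of MapReduce. In essence, the data is split into blocks, the blocks are distributed to different machines and processed in parallel, and the processed results are then merged and output. It looks simple, but broken down block by block we can see how the data flows on a single machine. What you cannot avoid is that the data is still read line by line; there is no other way. Your job here is to write the map and reduce functions, tailored to the data type and the business requirements.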
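The split, process, and merge flow described above can be sketched in plain Python. This is a single-process imitation meant only to show what the map and reduce functions look like. It is not the Hadoop API, and the record layout (tab-separated ip, timestamp, url, status) is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: read one log line and emit (url, 1) if the line is well formed."""
    fields = line.split("\t")
    if len(fields) == 4:
        yield fields[2], 1


def reduce_fn(url: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: add up all the counts emitted for one url."""
    return url, sum(counts)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Imitate a MapReduce job; on a cluster each split goes to a different machine."""
    # Map phase: every split is read line by line.
    intermediate: List[Tuple[str, int]] = []
    for split in splits:
        for line in split:
            intermediate.extend(map_fn(line))

    # Shuffle phase: group the intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, results merged into the final output.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


splits = [
    ["10.0.0.1\t2014-02-23T10:00:01\t/a\t200", "10.0.0.2\t2014-02-23T10:00:02\t/b\t200"],
    ["10.0.0.3\t2014-02-23T10:00:03\t/a\t200", "broken record"],
]
hits = run_job(splits)
print(hits)
```

On a real cluster the framework handles the splitting, shuffling, and merging; only the bodies of `map_fn` and `reduce_fn` change with the data and the business requirement.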
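4. Algorithm integration layer
Function: integrating data mining algorithms
Core: HiveQL, R, Mahout
Overview:
(1) HiveQL: as mentioned above, it is SQL-like, which is also why Hive was chosen: it helps people who operate traditional databases move over to NoSQL-style data stores.
(2) R: a language mainly used for statistical analysis and plotting. It provides a complete software system for data processing, computation, and graphics, and also lays the groundwork for the data visualization below.
(3) Mahout: mainly a collection of implementations of classic machine learning and related algorithms. It helps uncover the patterns hidden in the data more effectively.
The algorithm integration layer is the heart of data mining, of finding the patterns in the data. These classic or optimized algorithms make it easier to dig valuable data out of massive datasets. Anyone who knows a little about data mining and machine learning will know two concepts: the training set and the test set. Modeling also comes up a lot here, and building a model has two parts: constructing the training set and constructing the test set. The training set is a portion drawn from the raw data and used to build the model and find patterns. The remaining data serves as the test set, used to measure the accuracy of the model. Going one step further, we run into a problem that gives the whole industry headaches: accuracy. All of our testing builds models on offline data. For offline analysis this matters little, because offline data does not change, and in most cases the patterns found on the training set also hold on the test set. In many other cases, such as machine learning on live online data, the demands are much higher. This leads to a common complaint: what do you do with a model whose offline test accuracy is very high but whose online accuracy is poor? The answer the experts give is that there is no trick: tune the parameters, keep testing, and raise the accuracy.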
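The training-set and test-set idea above, and the gap between offline and online accuracy, can be illustrated with a toy example. Everything here is an assumption made for illustration: the one-dimensional data, the 70/30 split, the midpoint-threshold "model", and the simulated drift of +60 in the online data. It is a sketch of the concept, not a recommended modeling method.

```python
import random
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, label)


def split(data: List[Sample], train_ratio: float, seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the data and cut it into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(train: List[Sample]) -> float:
    """Toy model: the midpoint between the mean of class 0 and the mean of class 1."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold: float, samples: List[Sample]) -> float:
    """Fraction of samples for which 'x > threshold' agrees with the label."""
    correct = sum(1 for x, y in samples if (x > threshold) == (y == 1))
    return correct / len(samples)


# Offline data: class 0 lies in 0..49, class 1 in 100..149 (cleanly separable).
offline = [(float(x), 0) for x in range(50)] + [(float(x), 1) for x in range(100, 150)]
train, test = split(offline, train_ratio=0.7, seed=42)
threshold = fit_threshold(train)
offline_acc = accuracy(threshold, test)

# Simulated online data: same labels, but every feature value has drifted by +60.
online = [(x + 60.0, y) for x, y in offline]
online_acc = accuracy(threshold, online)

print(f"offline accuracy: {offline_acc:.2f}, online accuracy: {online_acc:.2f}")
```

Because the offline data never changes, the rule learned on the training set also holds on the test set. Once the online distribution shifts, the same threshold starts misclassifying, which is the situation the paragraph above describes.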
I will not go further here; let us first finish walking through the whole architecture.
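5. Data visualization layer
One visualization tool has already been mentioned above: R. When the data that has been processed on the Hadoop cluster according to the business logic is written back to HDFS, some of it already shows clear patterns, and some is extracted to produce reports, pie charts, bar charts, and so on. There is one more step for this processed data: import the data from HDFS or the Hive data warehouse into a traditional relational database and present it with traditional visualization tools. This is currently a very mainstream approach. Once the data is in a relational database, the last step is BI, traditional BI. Everyone is busy hyping the big data concept, but do not forget the strengths of the traditional tools; otherwise you have dropped the watermelon to pick up the sesame seed.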
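As a rough illustration of that last step, the sketch below loads a small set of already-aggregated results into a relational table and runs the kind of query a reporting or BI tool would issue. The table name, the columns, and the numbers are made up for the example, and Python's built-in `sqlite3` stands in for the relational database; in practice the export out of Hadoop is usually done with a tool such as Sqoop rather than by hand.

```python
import sqlite3

# Hypothetical aggregated output of a Hadoop job: (url, hits).
aggregated = [
    ("/forum/thread-1.html", 120),
    ("/forum/thread-2.html", 45),
    ("/index.html", 300),
]

# An in-memory SQLite database stands in for the traditional relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (url TEXT PRIMARY KEY, hits INTEGER NOT NULL)")
conn.executemany("INSERT INTO page_hits (url, hits) VALUES (?, ?)", aggregated)
conn.commit()

# Queries of the kind a report or chart would be built from.
top_pages = conn.execute(
    "SELECT url, hits FROM page_hits ORDER BY hits DESC LIMIT 2"
).fetchall()
total_hits = conn.execute("SELECT SUM(hits) FROM page_hits").fetchone()[0]
conn.close()

print(top_pages, total_hits)
```

The heavy computation stays on the cluster; the relational database only holds the small, already-summarized result that the BI layer reads.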
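All this talk was really just to introduce the architecture diagram for storing and mining offline data. It lays out in advance what we need to do in our upcoming work.
(My own walkthrough of the process)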
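Summary:
Recently I have been organizing the core technical architecture, partly to write a planning proposal and partly to lay a foundation for further study. The architecture diagram above covers most of the steps involved in moving traditional data mining onto a Hadoop cluster. It gives beginners, and anyone unclear on the subject, a solution and a sense of where to start. For Hadoop engineers who already know the whole workflow, the work above may be superfluous. But if the time spent writing it up gives those who come later a solution, that is a good thing.
I am a beginner myself. While I still have the time, I am willing to set some aside to write up what I have learned recently, as a way of accumulating knowledge. If there are shortcomings, I will correct them later.
Copyright © BUAA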
