
Comparison of Impala and Hive

Jun 07, 2016 pm 04:31 PM
cloud hive impala Architecture Compare


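1. Impala Architecture

Impala is a real-time, interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch pipeline, Impala uses a distributed query engine similar to those in commercial parallel relational databases (made up of a Query Planner, a Query Coordinator, and a Query Exec Engine). It can query data directly in HDFS or HBase with SELECT, JOIN, and aggregate functions, which greatly reduces latency. As Figure 1 shows, Impala consists mainly of Impalad, the State Store, and the CLI.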

Figure 1

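Impalad: runs on the same nodes as the DataNodes, as the impalad process. It accepts query requests from clients. The Impalad that receives a query becomes its Coordinator: the Coordinator calls the Java front end through JNI to parse the SQL statement and build a query plan tree, then uses the scheduler to distribute the execution plan to the other Impalads that hold the relevant data. Each Impalad reads and writes data, executes the query in parallel, and streams its results over the network back to the Coordinator, which returns them to the client. Each Impalad also stays connected to the State Store to learn which Impalads are healthy and able to accept new work. An Impalad starts three Thrift servers, beeswax_server (client connections), hs2_server (HiveServer2-compatible client connections), and be_server (internal communication between Impalads), plus an ImpalaServer service.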


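Impala State Store: tracks the health and location of the Impalads in the cluster, as the statestored process. It creates multiple threads to handle Impalad registration and subscription and to maintain a heartbeat with each Impalad. Every Impalad caches a copy of the State Store's information. When an Impalad detects that the State Store is offline, it enters recovery mode and keeps trying to re-register; once the State Store rejoins the cluster, the Impalad returns to normal and refreshes its cache. While the State Store is offline, Impalads can keep working from the cached copy, but the cache cannot be updated, so if some Impalads fail in the meantime, execution plans may be assigned to them and those queries fail.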
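The stale-cache failure mode is easier to see in a toy model. The sketch below is only an illustration of the reasoning, not Impala code: the StateStore and Impalad classes, their methods, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Hypothetical stand-in for statestored: knows which Impalads are alive."""
    online: bool = True
    live: set = field(default_factory=set)


@dataclass
class Impalad:
    """Hypothetical Impalad that keeps a local copy of the membership list."""
    host: str
    cached_members: set = field(default_factory=set)

    def refresh(self, store: StateStore) -> bool:
        # The cache can only be refreshed while the State Store is reachable.
        if not store.online:
            return False
        self.cached_members = set(store.live)
        return True


def run_query(coordinator: Impalad, actually_alive: set) -> str:
    # The coordinator plans against its cache. A fragment sent to a dead host
    # fails the whole query, because there is no query-time fault tolerance.
    dead_targets = coordinator.cached_members - actually_alive
    return "FAILED" if dead_targets else "OK"


store = StateStore(live={"node1", "node2", "node3"})
coord = Impalad("node1")
coord.refresh(store)

store.online = False                 # the State Store goes offline
alive = {"node1", "node2"}           # node3 crashes while it is offline
refreshed = coord.refresh(store)     # the cache cannot be updated
stale_result = run_query(coord, alive)

store.online = True                  # the State Store rejoins the cluster
store.live = set(alive)              # and notices that node3 is gone
coord.refresh(store)
recovered_result = run_query(coord, alive)

print(refreshed, stale_result, recovered_result)
```

The query only fails when the stale cache still lists a host that has actually died; if no Impalad fails during the outage, the cached copy is good enough.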


CLI: the command-line tool for running queries (Impala Shell is implemented in Python). Impala also provides Hue, JDBC, and ODBC interfaces.

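2. Relationship to Hive

Impala and Hive are both data query tools built on Hadoop, each with a different focus. From the client's point of view, however, they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, and a shared storage pool. Figure 2 shows how the two fit into Hadoop. Hive suits long-running batch queries and analysis, while Impala suits real-time interactive SQL queries and gives data analysts a tool for quickly experimenting with and validating ideas on big data. A typical workflow is to use Hive for data transformation first, then use Impala for fast analysis of the resulting data set.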

Figure 2


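3. Impala Query Processing

An Impalad is split into a Java front end and a C++ back end. The Impalad that accepts the client connection acts as the Coordinator for that query. The Coordinator calls the Java front end through JNI to analyze the SQL and generate an execution plan tree, in which different operations map to different PlanNodes, such as SelectNode, ScanNode, SortNode, AggregationNode, and HashJoinNode.

Each atomic operation in the plan tree is represented by a PlanFragment, and a query usually consists of several PlanFragments. PlanFragment 0 is the root of the tree, which gathers the results and returns them to the user. The leaves are usually Scan operations, executed in a distributed, parallel fashion.

The Java front end returns the plan tree to the Impala C++ back end (the Coordinator) in Thrift format. The plan is divided into stages, each called a PlanFragment. A PlanFragment can usually be executed in parallel by multiple Impalad instances, although some, such as an aggregation step, can run on only one. Together the fragments form a single execution plan tree.

Using the plan and the data location information, the Coordinator (in Coordinator::Exec) hands the plan through the scheduler to the appropriate back-end Impalads. Impala talks to HDFS through libhdfs and calls hdfsGetHosts to find the nodes that hold a file's blocks. At present the only scheduler is simple-scheduler, which uses a round-robin algorithm. Queries use LLVM for code generation, compilation, and execution. The Coordinator calls GetNext() to fetch results. For an INSERT statement, the results are written back to HDFS through libhdfs. When all input data has been consumed, execution ends and the query is unregistered.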


Figure 3 gives a rough picture of the query processing flow:

Figure 3

The following SQL query illustrates the process: select sum(id), count(id), avg(id) from customer_small group by id; The plan generated for this statement is:

PLAN FRAGMENT 0
  PARTITION: UNPARTITIONED

  4:EXCHANGE
     tuple ids: 1

PLAN FRAGMENT 1
  PARTITION: HASH_PARTITIONED:

  STREAM DATA SINK
    EXCHANGE ID: 4
    UNPARTITIONED

  3:AGGREGATE
  |  output: SUM(), SUM()
  |  group by:
  |  tuple ids: 1
  |  
  2:EXCHANGE
     tuple ids: 1

PLAN FRAGMENT 2
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 2
    HASH_PARTITIONED:

  1:AGGREGATE
  |  output: SUM(id), COUNT(id)
  |  group by: id
  |  tuple ids: 1
  |  
  0:SCAN HDFS
     table=default.customer_small #partitions=1 size=193B
     tuple ids: 0

Figure 4 shows the execution plan tree; the green parts can be executed in a distributed, parallel fashion:

Figure 4
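The three fragments in the plan amount to a two-phase aggregation: each scanning node pre-aggregates SUM(id) and COUNT(id) per group, the partial results are hash-partitioned on the group key, a merge step sums the partial sums and partial counts, and the coordinator gathers the output. The sketch below is an assumption-laden illustration of that idea in plain Python, not Impala's implementation: the function names and the sample rows are made up, and AVG is derived here as merged sum divided by merged count.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Fragment 2 analogue: per-node partial SUM(id), COUNT(id) grouped by id."""
    partial = defaultdict(lambda: [0, 0])
    for (id_,) in rows:
        partial[id_][0] += id_
        partial[id_][1] += 1
    return dict(partial)


def hash_exchange(partials, num_partitions):
    """Exchange analogue: route every group to one partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, (s, c) in partial.items():
            partitions[hash(key) % num_partitions].append((key, s, c))
    return partitions


def merge_aggregate(partition):
    """Fragment 1 analogue: SUM the partial sums and SUM the partial counts."""
    merged = defaultdict(lambda: [0, 0])
    for key, s, c in partition:
        merged[key][0] += s
        merged[key][1] += c
    return {k: (s, c, s / c) for k, (s, c) in merged.items()}


def run(scan_ranges, num_partitions=2):
    partials = [pre_aggregate(rows) for rows in scan_ranges]
    result = {}
    for partition in hash_exchange(partials, num_partitions):
        result.update(merge_aggregate(partition))  # Fragment 0 analogue: gather
    return result


# Two hypothetical scan ranges, each holding single-column (id,) rows.
scan_ranges = [[(1,), (2,), (2,)], [(2,), (3,), (1,)]]
result = run(scan_ranges)
for key in sorted(result):
    print(key, result[key])  # id -> (sum, count, avg)
```

Because every occurrence of a group key hashes to the same partition, each merge step sees all partial results for its keys and can finish independently of the others.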

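4. Optimizations Impala Uses Compared with Hive

1. No MapReduce for parallel computation. MapReduce is a very good parallel computing framework, but it is geared toward batch processing rather than interactive SQL execution. Unlike MapReduce, Impala turns the whole query into a single execution plan tree instead of a chain of MapReduce jobs. After distributing the plan, Impala pulls results and streams them up the execution tree, which avoids writing intermediate results to disk and reading them back. Impala also runs as a long-lived service, so it avoids per-query startup cost; compared with Hive, there is no MapReduce startup time.

2. LLVM code generation. Impala generates code specific to each query and uses inlining to reduce function-call overhead, which speeds up execution.

3. Full use of available hardware instructions (SSE4.2).

4. Better I/O scheduling. Impala knows which disk each data block is on and can take better advantage of multiple disks. It also supports direct block reads and computes checksums in native code.

5. The best performance comes from choosing a suitable storage format (Impala supports several).

6. Maximum use of memory. Intermediate results are not written to disk and are passed over the network as streams as soon as they are available.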
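The idea behind point 2, generating code for one specific query instead of interpreting a generic plan, can be illustrated without LLVM. The sketch below is a loose Python analogy and an assumption throughout: Impala emits LLVM IR from its C++ back end, whereas this example builds Python source with the column, operator, and constant baked in and compiles it with exec. The function names and sample rows are hypothetical.

```python
def interpret(rows, column, op, constant):
    """Generic path: looks up and calls the operator for every row."""
    ops = {
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
        "=": lambda a, b: a == b,
    }
    out = []
    for row in rows:
        if ops[op](row[column], constant):
            out.append(row)
    return out


def generate(column, op, constant):
    """Specialized path: emit source with the predicate inlined, then compile it."""
    py_op = {"<": "<", ">": ">", "=": "=="}[op]
    src = (
        "def specialized(rows):\n"
        f"    return [row for row in rows if row[{column!r}] {py_op} {constant!r}]\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the query-specific function once
    return namespace["specialized"]


rows = [{"id": 1}, {"id": 5}, {"id": 9}]
specialized = generate("id", ">", 4)
print(specialized(rows))
print(interpret(rows, "id", ">", 4))
```

Both paths return the same rows; the specialized one simply has no per-row dispatch left, which is the kind of overhead that inlining removes.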

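5. Similarities and Differences Between Impala and Hive

Data storage: both use the same storage pool and support data stored in HDFS and HBase.


Metadata: both use the same metadata.


SQL parsing: fairly similar; both parse the SQL and generate an execution plan.


Execution plan
Hive: depends on the MapReduce framework, so the plan follows a map -> shuffle -> reduce -> map -> shuffle -> reduce ... model. A query compiled into several rounds of MapReduce writes more intermediate results, and because of how MapReduce works, the extra intermediate steps lengthen the whole query.
Impala: represents the plan as one complete execution plan tree, which can be distributed naturally to the Impalads rather than being forced into a map -> reduce pipeline as in Hive. This gives Impala better concurrency and avoids unnecessary intermediate sorts and shuffles.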


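Data flow
Hive: push-based. Each compute node pushes its data to the downstream nodes when it finishes.
Impala: pull-based. A downstream node calls getNext to request data from the upstream node, so results can be streamed back to the client. As soon as one row has been processed it can be displayed, without waiting for the whole query to finish, which suits interactive SQL.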
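Python generators give a compact analogue of the pull model: calling next() on the root operator plays the role of getNext, and each operator asks its child for rows only when it needs them. This is a sketch of the general iterator pattern, not Impala's operator code; the scan and select functions and the sample rows are hypothetical.

```python
def scan(rows, log):
    """Leaf operator: produces rows one at a time, recording each read."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def select(child, predicate):
    """Parent operator: pulls from its child and passes on matching rows."""
    for row in child:
        if predicate(row):
            yield row


rows = [1, 2, 3, 4]
log = []
root = select(scan(rows, log), lambda r: r % 2 == 0)

first = next(root)            # pull a single result from the root
reads_for_first = list(log)   # only the rows needed so far were scanned
rest = list(root)             # pull the remaining results

print(first, reads_for_first, rest)
```

The first result is available after scanning only two of the four rows, which is the property that lets an interactive client show rows before the query has finished.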


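Memory usage
Hive: if the data does not fit in memory during execution, Hive spills to external storage so that the query can still run to completion. At the end of each MapReduce round, intermediate results are written to HDFS, and the shuffle phase also writes to local disk, again because of the MapReduce architecture.
Impala: when data does not fit in memory, the current version (1.0.1) returns an error instead of spilling to external storage; later versions will probably improve on this. This currently limits the queries Impala can handle, so it is best used together with Hive. Impala passes data between stages over the network and does not write to disk during execution (except for INSERT).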


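Scheduling
Hive: task scheduling depends on Hadoop's scheduling policy.
Impala: does its own scheduling. At present there is only one scheduler, simple-scheduler, which tries to preserve data locality by placing the scanning process as close as possible to the machine that physically holds the data. The scheduler is still quite simple: as SimpleScheduler::GetBackend shows, it does not yet take load or network I/O into account. Impala already collects performance statistics during execution, however, and later versions will probably use them for scheduling.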
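A locality-first, round-robin assignment of this kind might look like the sketch below. It is not the code of SimpleScheduler::GetBackend; the function name, the input shapes, and the host names are assumptions made for illustration. Like the scheduler described above, it ignores load and network conditions.

```python
from itertools import cycle


def assign_scan_ranges(block_locations, backends):
    """Map each block to a backend, preferring hosts that hold a replica.

    block_locations: dict of block id -> list of hosts storing that block.
    backends: list of live Impalad hosts (hypothetical names).
    """
    if not backends:
        raise ValueError("no live backends to schedule on")
    live = set(backends)
    remote_rr = cycle(backends)   # round-robin fallback when no replica is local
    counter = 0                   # rotates among local candidates
    assignment = {}
    for block, replicas in block_locations.items():
        local = [host for host in replicas if host in live]
        if local:
            assignment[block] = local[counter % len(local)]
            counter += 1
        else:
            assignment[block] = next(remote_rr)
    return assignment


blocks = {"b0": ["n1", "n2"], "b1": ["n1", "n2"], "b2": ["n9"]}
plan = assign_scan_ranges(blocks, ["n1", "n2", "n3"])
print(plan)
```

Blocks b0 and b1 are read where a replica lives, while b2, whose only replica is on a host with no live Impalad, falls back to a remote read.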


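Fault tolerance
Hive: depends on Hadoop's fault tolerance.
Impala: has no fault-tolerance logic within a query; if a failure occurs during execution, it returns an error. (This is by design: Impala targets real-time queries, so when a query fails, running it again is cheap.) Viewed as a whole, though, Impala tolerates faults well. All Impalads are peers, and a user can submit a query to any of them. If an Impalad fails, every query running on it fails, but the user can resubmit to another Impalad, so the service is not affected. There is currently only one State Store, but its failure does not affect the service either, because every Impalad caches the State Store's information. The cluster state just cannot be updated, so work may be assigned to an Impalad that has already failed, causing that query to fail.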


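Use cases
Hive: complex batch queries and data transformation jobs.
Impala: real-time data analysis. Because it does not support UDFs, the range of problems it can handle is limited; use it together with Hive for real-time analysis of Hive's result sets.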

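6. Pros and Cons of Impala

Advantages

  1. Supports SQL and queries big data quickly.
  2. Can query existing data in place, reducing data loading and transformation.
  3. A choice of storage formats (Parquet, Text, Avro, RCFile, SequenceFile).
  4. Can be used together with Hive.

Disadvantages

  1. No user-defined functions (UDFs).
  2. No full-text search on text fields.
  3. No Transforms.
  4. No fault tolerance during a query.
  5. High memory requirements.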