Alex的Hadoop菜鸟教程:第10课Hive入门教程
Hive 安装 相比起很多教程先介绍概念,我喜欢先动手装上,然后用例子来介绍概念。我们先来安装一下Hive 先确认是否已经安装了对应的yum源,如果没有照这个教程里面写的安装cdh的yum源http://blog.csdn.net/nsrainbow/article/details/36629339 Hive是什么 Hi
Hive 安装
相比起很多教程先介绍概念,我喜欢先动手装上,然后用例子来介绍概念。我们先来安装一下Hive
先确认是否已经安装了对应的yum源,如果没有照这个教程里面写的安装cdh的yum源http://blog.csdn.net/nsrainbow/article/details/36629339
Hive是什么
Hive 提供了一个让大家可以使用sql去查询数据的途径。但是最好不要拿Hive进行实时的查询。因为Hive的实现原理是把sql语句转化为多个Map Reduce任务所以Hive非常慢,官方文档说Hive 适用于高延时性的场景而且很费资源。
举个简单的例子,可以像这样去查询
hive> select * from h_employee; OK 1 1 peter 2 2 paul Time taken: 9.289 seconds, Fetched: 2 row(s)
这个h_employee不一定是一个数据库表
metastore
Hive 中建立的表都叫metastore表。这些表并不真实的存储数据,而是定义真实数据跟hive之间的映射,就像传统数据库中表的meta信息,所以叫做metastore。实际存储的时候可以定义的存储模式有四种:
内部表(默认)分区表桶表外部表 举个例子,这是一个简历内部表的语句CREATE TABLE worker(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
这个语句的意思是建立一个worker的内部表,内部表是默认的类型,所以不用写存储的模式。并且使用逗号作为分隔符存储
建表语句支持的类型
基本数据类型tinyint / smalint / int /bigint
float / double
boolean
string
复杂数据类型
Array/Map/Struct
没有date /datetime
建完的表存在哪里呢?
在 /user/hive/warehouse 里面,可以通过hdfs来查看建完的表位置$ hdfs dfs -ls /user/hive/warehouse Found 11 items drwxrwxrwt - root supergroup 0 2014-12-02 14:42 /user/hive/warehouse/h_employee drwxrwxrwt - root supergroup 0 2014-12-02 14:42 /user/hive/warehouse/h_employee2 drwxrwxrwt - wlsuser supergroup 0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export drwxrwxrwt - root supergroup 0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs drwxrwxrwt - root supergroup 0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log drwxrwxrwt - username supergroup 0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1 drwxrwxrwt - username supergroup 0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2 drwxrwxrwt - username supergroup 0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog drwxrwxrwt - root supergroup 0 2014-12-02 15:12 /user/hive/warehouse/hive_employee
一个文件夹对应一个metastore表
Hive 各种类型表使用
内部表
CREATE TABLE workers( id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
通过这样的语句就建立了一个内部表叫 workers,并且分隔符是逗号, \054 是ASCII 码
我们可以通过 show tables; 来看看有多少表,其实hive的很多语句是模仿mysql的,当你们不知道语句的时候,把mysql的语句拿来基本可以用。除了limit比较怪,这个后面会说
hive> show tables; OK h_employee h_employee2 h_employee_export h_http_access_logs hive_employee workers Time taken: 0.371 seconds, Fetched: 6 row(s)
建立完后,我们试着插入几条数据。这边要告诉大家Hive不支持单句插入的语句,必须批量,所以不要指望能用insert into workers values (1,'jack') 这样的语句插入数据。hive支持的插入数据的方式有两种: 从文件读取数据从别的表读出数据插入(insert from select) 这里我采用从文件读数据进来。先建立一个叫 worker.csv的文件
$ cat workers.csv 1,jack 2,terry 3,michael
用LOAD DATA 导入到Hive的表中
hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers; Copying data from file:/home/alex/workers.csv Copying file: file:/home/alex/workers.csv Loading data to table default.workers Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0] OK Time taken: 0.655 seconds
注意 不要少了那个 LOCAL , LOAD DATA LOCAL INPATH 跟 LOAD DATA INPATH 的区别是一个是从你本地磁盘上找源文件,一个是从hdfs上找文件如果加上OVERWRITE可以再导入之前先清空表,比如 LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers; 查询一下数据
hive> select * from workers; OK 1 jack 2 terry 3 michael Time taken: 0.177 seconds, Fetched: 3 row(s)
我们去看下导入后在hive内部表是怎么存的
# hdfs dfs -ls /user/hive/warehouse/workers/ Found 1 items -rwxrwxrwt 2 root supergroup 25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
原来就是原封不动的把文件拷贝进去啊!就是这么土! 我们可以试验再放一个文件 workers2.txt (我故意把扩展名换一个,其实hive是不看扩展名的)
# cat workers2.txt 4,peter 5,kate 6,ted
导入
hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers; Copying data from file:/home/alex/workers2.txt Copying file: file:/home/alex/workers2.txt Loading data to table default.workers Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0] OK Time taken: 0.79 seconds
去看下文件的存储结构
# hdfs dfs -ls /user/hive/warehouse/workers/ Found 2 items -rwxrwxrwt 2 root supergroup 25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv -rwxrwxrwt 2 root supergroup 21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt
多出来一个workers2.txt 再用sql查询下
hive> select * from workers; OK 1 jack 2 terry 3 michael 4 peter 5 kate 6 ted Time taken: 0.144 seconds, Fetched: 6 row(s)
分区表
分区表是用来加速查询的,比如你的数据非常多,但是你的应用场景是基于这些数据做日报表,那你就可以根据日进行分区,当你要做2014-05-05的报表的时候只需要加载2014-05-05这一天的数据就行了。我们来创建一个分区表来看下create table partition_employee(id int, name string) partitioned by(daytime string) row format delimited fields TERMINATED BY '\054';
可以看到分区的属性,并不是任何一个列 我们先建立2个测试数据文件,分别对应两天的数据
# cat 2014-05-05 22,kitty 33,lily # cat 2014-05-06 14,sami 45,micky
导入到分区表里面
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05'); Copying data from file:/home/alex/2014-05-05 Copying file: file:/home/alex/2014-05-05 Loading data to table default.partition_employee partition (daytime=2014-05-05) Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0] Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0] OK Time taken: 1.154 seconds hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06'); Copying data from file:/home/alex/2014-05-06 Copying file: file:/home/alex/2014-05-06 Loading data to table default.partition_employee partition (daytime=2014-05-06) Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0] Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0] OK Time taken: 0.763 seconds
导入的时候通过 partition 来指定分区。
查询的时候通过指定分区来查询
hive> select * from partition_employee where daytime='2014-05-05'; OK 22 kitty 2014-05-05 33 lily 2014-05-05 Time taken: 0.173 seconds, Fetched: 2 row(s)
我的查询语句并没有什么特别的语法,hive 会自动判断你的where语句中是否包含分区的字段。而且可以使用大于小于等运算符
hive> select * from partition_employee where daytime>='2014-05-05'; OK 22 kitty 2014-05-05 33 lily 2014-05-05 14 sami 2014-05-06 45 mick' 2014-05-06 Time taken: 0.273 seconds, Fetched: 4 row(s)
我们去看看存储的结构
# hdfs dfs -ls /user/hive/warehouse/partition_employee Found 2 items drwxrwxrwt - root supergroup 0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05 drwxrwxrwt - root supergroup 0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06
我们试试二维的分区表
create table p_student(id int, name string) partitioned by(daytime string,country string) row format delimited fields TERMINATED BY '\054';
查入一些数据
# cat 2014-09-09-CN 1,tammy 2,eric # cat 2014-09-10-CN 3,paul 4,jolly # cat 2014-09-10-EN 44,ivan 66,billy
导入hive
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN'); Copying data from file:/home/alex/2014-09-09-CN Copying file: file:/home/alex/2014-09-09-CN Loading data to table default.p_student partition (daytime=2014-09-09, country=CN) Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0] Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0] OK Time taken: 0.736 seconds hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN'); Copying data from file:/home/alex/2014-09-10-CN Copying file: file:/home/alex/2014-09-10-CN Loading data to table default.p_student partition (daytime=2014-09-10, country=CN) Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0] Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0] OK Time taken: 0.691 seconds hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN'); Copying data from file:/home/alex/2014-09-10-EN Copying file: file:/home/alex/2014-09-10-EN Loading data to table default.p_student partition (daytime=2014-09-10, country=EN) Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0] Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0] OK Time taken: 0.622 seconds
看看存储结构
# hdfs dfs -ls /user/hive/warehouse/p_student Found 2 items drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09 drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10 # hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09 Found 1 items drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN
查询一下数据
hive> select * from p_student; OK 1 tammy 2014-09-09 CN 2 eric 2014-09-09 CN 3 paul 2014-09-10 CN 4 jolly 2014-09-10 CN 44 ivan 2014-09-10 EN 66 billy 2014-09-10 EN Time taken: 0.228 seconds, Fetched: 6 row(s)
hive> select * from p_student where daytime='2014-09-10' and country='EN'; OK 44 ivan 2014-09-10 EN 66 billy 2014-09-10 EN Time taken: 0.224 seconds, Fetched: 2 row(s)
桶表
桶表是根据某个字段的hash值,来将数据扔到不同的“桶”里面。外国人有个习惯,就是分类东西的时候摆几个桶,上面贴不同的标签,所以他们取名的时候把这种表形象的取名为桶表。桶表表专门用于采样分析下面这个例子是官网教程直接拷贝下来的,因为分区表跟桶表是可以同时使用的,所以这个例子中同时使用了分区跟桶两种特性
CREATE TABLE b_student(id INT, name STRING) PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS row format delimited fields TERMINATED BY '\054';
意思是根据userid来进行计算hash值,用viewTIme来排序存储 做数据跟导入的过程我就不在赘述了,这是导入后的数据
hive> select * from b_student; OK 1 tammy 2014-09-09 CN 2 eric 2014-09-09 CN 3 paul 2014-09-10 CN 4 jolly 2014-09-10 CN 34 allen 2014-09-11 EN Time taken: 0.727 seconds, Fetched: 5 row(s)
从4个桶中采样抽取一个桶的数据
hive> select * from b_student tablesample(bucket 1 out of 4 on id); Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/ Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1406097234796_0041 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2014-12-08 17:35:56,995 Stage-1 map = 0%, reduce = 0% 2014-12-08 17:36:06,783 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec 2014-12-08 17:36:07,845 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec MapReduce Total cumulative CPU time: 2 seconds 900 msec Ended Job = job_1406097234796_0041 MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 2.9 sec HDFS Read: 482 HDFS Write: 22 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 900 msec OK 4 jolly 2014-09-10 CN
外部表
外部表就是存储不是由hive来存储的,比如可以依赖Hbase来存储,hive只是做一个映射而已。我用Hbase来举例先建立一张Hbase表叫 employee
hbase(main):005:0> create 'employee','info' 0 row(s) in 0.4740 seconds => Hbase::Table - employee hbase(main):006:0> put 'employee',1,'info:id',1 0 row(s) in 0.2080 seconds hbase(main):008:0> scan 'employee' ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 row(s) in 0.0610 seconds hbase(main):009:0> put 'employee',1,'info:name','peter' 0 row(s) in 0.0220 seconds hbase(main):010:0> scan 'employee' ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 column=info:name, timestamp=1417591321072, value=peter 1 row(s) in 0.0450 seconds hbase(main):011:0> put 'employee',2,'info:id',2 0 row(s) in 0.0370 seconds hbase(main):012:0> put 'employee',2,'info:name','paul' 0 row(s) in 0.0180 seconds hbase(main):013:0> scan 'employee' ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 column=info:name, timestamp=1417591321072, value=peter 2 column=info:id, timestamp=1417591500179, value=2 2 column=info:name, timestamp=1417591512075, value=paul 2 row(s) in 0.0440 seconds
建立外部表进行映射
hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name") > TBLPROPERTIES ("hbase.table.name" = "employee"); OK Time taken: 0.324 seconds hive> select * from h_employee; OK 1 1 peter 2 2 paul Time taken: 1.129 seconds, Fetched: 2 row(s)
查询语法
具体语法可以参考官方手册https://cwiki.apache.org/confluence/display/Hive/Tutorial 我只说几个比较奇怪的点显示条数
展示x条数据,用的还是limit,比如hive> select * from h_employee limit 1 > ; OK 1 1 peter Time taken: 0.284 seconds, Fetched: 1 row(s)
下课!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



Dewu APP is currently a very popular brand shopping software, but most users do not know how to use the functions in Dewu APP. The most detailed usage tutorial guide is compiled below. Next is the Dewuduo that the editor brings to users. A summary of function usage tutorials. Interested users can come and take a look! Tutorial on how to use Dewu [2024-03-20] How to use Dewu installment purchase [2024-03-20] How to obtain Dewu coupons [2024-03-20] How to find Dewu manual customer service [2024-03-20] How to check the pickup code of Dewu [2024-03-20] Where to find Dewu purchase [2024-03-20] How to open Dewu VIP [2024-03-20] How to apply for return or exchange of Dewu

How to upgrade numpy version: Easy-to-follow tutorial, requires concrete code examples Introduction: NumPy is an important Python library used for scientific computing. It provides a powerful multidimensional array object and a series of related functions that can be used to perform efficient numerical operations. As new versions are released, newer features and bug fixes are constantly available to us. This article will describe how to upgrade your installed NumPy library to get the latest features and resolve known issues. Step 1: Check the current NumPy version at the beginning

After rain in summer, you can often see a beautiful and magical special weather scene - rainbow. This is also a rare scene that can be encountered in photography, and it is very photogenic. There are several conditions for a rainbow to appear: first, there are enough water droplets in the air, and second, the sun shines at a low angle. Therefore, it is easiest to see a rainbow in the afternoon after the rain has cleared up. However, the formation of a rainbow is greatly affected by weather, light and other conditions, so it generally only lasts for a short period of time, and the best viewing and shooting time is even shorter. So when you encounter a rainbow, how can you properly record it and photograph it with quality? 1. Look for rainbows. In addition to the conditions mentioned above, rainbows usually appear in the direction of sunlight, that is, if the sun shines from west to east, rainbows are more likely to appear in the east.

1. First open WeChat. 2. Click [+] in the upper right corner. 3. Click the QR code to collect payment. 4. Click the three small dots in the upper right corner. 5. Click to close the voice reminder for payment arrival.

PhotoshopCS is the abbreviation of Photoshop Creative Suite. It is a software produced by Adobe and is widely used in graphic design and image processing. As a novice learning PS, let me explain to you today what software photoshopcs5 is and how to use photoshopcs5. 1. What software is photoshop cs5? Adobe Photoshop CS5 Extended is ideal for professionals in film, video and multimedia fields, graphic and web designers who use 3D and animation, and professionals in engineering and scientific fields. Render a 3D image and merge it into a 2D composite image. Edit videos easily

Testing a monitor when buying it is an essential part to avoid buying a damaged one. Today I will teach you how to use software to test the monitor. Method step 1. First, search and download the DisplayX software on this website, install it and open it, and you will see many detection methods provided to users. 2. The user clicks on the regular complete test. The first step is to test the brightness of the display. The user adjusts the display so that the boxes can be seen clearly. 3. Then click the mouse to enter the next link. If the monitor can distinguish each black and white area, it means the monitor is still good. 4. Click the left mouse button again, and you will see the grayscale test of the monitor. The smoother the color transition, the better the monitor. 5. In addition, in the displayx software we

With the continuous development of smart phones, the functions of mobile phones have become more and more powerful, among which the function of taking long pictures has become one of the important functions used by many users in daily life. Long screenshots can help users save a long web page, conversation record or picture at one time for easy viewing and sharing. Among many mobile phone brands, Huawei mobile phones are also one of the brands highly respected by users, and their function of cropping long pictures is also highly praised. This article will introduce you to the correct method of taking long pictures on Huawei mobile phones, as well as some expert tips to help you make better use of Huawei mobile phones.

PHP Tutorial: How to Convert Int Type to String In PHP, converting integer data to string is a common operation. This tutorial will introduce how to use PHP's built-in functions to convert the int type to a string, while providing specific code examples. Use cast: In PHP, you can use cast to convert integer data into a string. This method is very simple. You only need to add (string) before the integer data to convert it into a string. Below is a simple sample code
