HBase Data Migration (2): Importing Data from TSV Files with the Bulk Load Tool

Jun 07, 2016, 04:29 PM

Original English text from: HBase Administration Cookbook. Translated and compiled by: ImportNew, 陈晨 (Chen Chen)

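This is the second of three articles in a series on data migration. The scenario it addresses is moving data from existing databases or data files of various kinds into HBase.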

Previous article: HBase Data Migration (1): Importing MySQL Data Through a Single Client

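HBase provides the importtsv tool for importing data from TSV files into HBase. Loading text data this way is very efficient because the import runs as a MapReduce job. Even when the data lives in an existing relational database, you can first dump it to text files and then load those files with importtsv. This works well for very large data sets, because dumping data is much faster than executing SQL against the relational database.
The importtsv tool can load data directly into an HBase table. It can also generate files in HBase's internal storage format (HFile), which you can then load straight into a running HBase cluster with the bulk load tool. This reduces the network traffic caused by data transfer and HBase loading during the migration. This article describes how to use importtsv and the bulk load tool. We first show how to load data from a TSV file into an HBase table with importtsv. We then cover how to generate HFiles directly and how to load the generated files into HBase.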

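Getting ready
This article uses the public climate normals data set from the National Oceanic and Atmospheric Administration (NOAA). Download it from http://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/. We use the hourly temperature data under products | hourly (reachable from the page linked above). Download the hly-temp-10pctl.txt file.
The downloaded file cannot be loaded by importtsv directly because its format is not supported. We provide a script that converts it to a TSV file. Besides the original data, the TSV file to be loaded must also contain a field that holds the row key of each HBase row. The to_tsv_hly.py script that accompanies this article reads the NOAA hourly data file, generates the row keys, and writes the result to a TSV file on the local file system: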

$ python to_tsv_hly.py -f hly-temp-10pctl.txt -t hly-temp-10pctl.tsv
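
The to_tsv_hly.py script is not reproduced in this article. The following is only a minimal sketch of what such a conversion could look like. It assumes that each input line holds a station ID, a month, a day, and 24 hourly values separated by whitespace, and that the row key is the concatenation of station ID, month, and day (which is consistent with the row key AQW000617050110 shown later in the scan output). The function names and the assumed layout are illustrative, not taken from the original script.

```python
def to_tsv_line(raw_line):
    """Convert one whitespace-separated record into a TSV line led by a row key.

    Assumed layout (not confirmed by the article): station ID, month, day,
    followed by 24 hourly values such as "781S".
    """
    fields = raw_line.split()
    if len(fields) != 27:
        raise ValueError("expected 27 fields, got %d" % len(fields))
    station, month, day = fields[0], fields[1], fields[2]
    row_key = station + month + day  # assumed row key scheme
    return "\t".join([row_key] + fields[3:])


def convert(src_path, dst_path):
    """Write one TSV line per non-empty input line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(to_tsv_line(line) + "\n")
```

Under these assumptions, calling convert("hly-temp-10pctl.txt", "hly-temp-10pctl.tsv") would produce a file with 25 tab-separated fields per line: the row key followed by the 24 hourly values.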

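Because importtsv performs the import by running a MapReduce job, MapReduce must be running on the cluster. Start the MapReduce daemons by running the following command on the master node: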

hadoop$ $HADOOP_HOME/bin/start-mapred.sh

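We add a hac user on the client server to run the job; this is the recommended setup for production. To run MapReduce jobs from the client, the hac user on the client needs write permission on the ${hadoop.tmp.dir} directory. We assume that ${hadoop.tmp.dir} is /usr/local/hadoop/var: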

root@client1# usermod -a -G hadoop hac
root@client1# chmod -R 775 /usr/local/hadoop/var

Create a home directory for the hac user in HDFS:

hadoop@client1$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hac
hadoop@client1$ $HADOOP_HOME/bin/hadoop fs -chown hac /user/hac

Also make sure that the hac user has write permission on the MapReduce temporary directory in HDFS:

hadoop@client1$ $HADOOP_HOME/bin/hadoop fs -chmod -R 775 /usr/local/hadoop/var/mapred

How to do it
To load data from the TSV file into an HBase table with MapReduce, follow these steps:
1. Create a directory in HDFS and copy the TSV file from the local file system into it:

hac@client1$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hac/input/2-1
hac@client1$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal hly-temp-10pctl.tsv /user/hac/input/2-1

2. Create the target table in HBase. Connect to HBase and create the hly_temp table:

hac@client1$ $HBASE_HOME/bin/hbase shell
hbase> create 'hly_temp', {NAME => 't', VERSIONS => 1}

3. If the table already exists (it was created in the previous section), add a new column family to it instead:

hbase> disable 'hly_temp'
hbase> alter 'hly_temp', {NAME => 't', VERSIONS => 1}
hbase> enable 'hly_temp'

4. Link hbase-site.xml into the Hadoop configuration directory so that Hadoop picks it up on its classpath:

hac@client1$ ln -s $HBASE_HOME/conf/hbase-site.xml $HADOOP_HOME/conf/hbase-site.xml

5. On the client server, edit hadoop-env.sh under $HADOOP_HOME/conf and add the HBase dependency libraries to the Hadoop classpath:

hadoop@client1$ vi $HADOOP_HOME/conf/hadoop-env.sh
export HADOOP_CLASSPATH=/usr/local/zookeeper/current/zookeeper-3.4.3.jar:/usr/local/hbase/current/lib/guava-r09.jar

6. As the hac user, run the importtsv tool with the following command:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,t:v01,t:v02,t:v03,t:v04,t:v05,t:v06,t:v07,t:v08,t:v09,t:v10,t:v11,t:v12,t:v13,t:v14,t:v15,t:v16,t:v17,t:v18,t:v19,t:v20,t:v21,t:v22,t:v23,t:v24 \
hly_temp \
/user/hac/input/2-1

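7. Check the job status on the MapReduce admin page at http://master1:50030/jobtracker.jsp.
8. Verify the imported data in the target HBase table. Check the total row count of hly_temp and inspect some sample rows. The row count should equal the number of lines in the file. Each row key should equal the first field of the corresponding line. Each row has the cells t:v01, t:v02, …, t:v24, and each cell value should match the corresponding field in the TSV file: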

hbase> count 'hly_temp'
95630 row(s) in 12.2020 seconds
hbase> scan 'hly_temp', {COLUMNS => 't:', LIMIT => 10}
AQW000617050110 column=t:v23, timestamp=1322959962261, value=781S
AQW000617050110 column=t:v24, timestamp=1322959962261, value=774C
10 row(s) in 0.1850 seconds
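
To have a number to compare against the output of count, it can help to check the TSV file on the client before or after the import. The sketch below is a hypothetical helper, not part of importtsv: it counts lines, verifies that every line has 25 tab-separated fields (the row key plus 24 values), and counts distinct row keys. HBase stores one row per distinct row key, so the table's row count matches the file's line count only when the row keys are unique.

```python
def check_tsv(path, expected_fields=25):
    """Return (line_count, distinct_key_count) for a TSV file.

    Raises ValueError on the first line whose field count is wrong.
    """
    keys = set()
    rows = 0
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != expected_fields:
                raise ValueError(
                    "line %d has %d fields, expected %d"
                    % (lineno, len(fields), expected_fields)
                )
            keys.add(fields[0])  # the first field is the row key
            rows += 1
    return rows, len(keys)
```

If the two returned numbers differ, some lines share a row key and the table will hold fewer rows than the file has lines.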

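How it works
The importtsv tool reads only from HDFS, so we first copy the TSV file from the local file system to HDFS with the hadoop fs -copyFromLocal command. In step 2 we create the table (hly_temp) and the column family (t) in HBase. If the table already exists, we alter it to add the column family instead. All data is loaded into the new column family; data already stored in existing column families is not affected. To run the MapReduce job, we use the hadoop jar command to run a JAR file that contains the compiled classes. To make the HBase configuration available to that command, we link hbase-site.xml into $HADOOP_HOME/conf; the hadoop command adds everything in that directory to the classpath of the Java process it starts.
In step 5 we set HADOOP_CLASSPATH in hadoop-env.sh to add the runtime dependencies. Besides the ZooKeeper library, importtsv also depends on guava-r09.jar at run time, which it uses to parse the TSV files.
importtsv itself is a Java class in the HBase JAR file. In step 6 we run it with the hadoop jar command, which starts a Java process and adds the dependencies automatically. The first argument of hadoop jar names the JAR to run, here hbase-0.92.1.jar.
The following parameters are passed to the main class of hbase-0.92.1.jar: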

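- The -Dimporttsv.columns parameter maps field positions in the TSV file to columns in the HBase table. In this article the TSV format is (rowkey, value1, value2, …, value24). We store the data in column family t, using v01 for value1, v02 for value2, and so on. HBASE_ROW_KEY marks the field that holds the row key.
- After -Dimporttsv.columns, the command line takes the table name (hly_temp) and the path of the TSV files (/user/hac/input/2-1).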
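
The column mapping described above follows a regular pattern, so it can be generated rather than typed by hand. This is a small illustrative sketch; the function name importtsv_columns is hypothetical, while the column family t and the qualifiers v01 to v24 are the ones used in this article.

```python
def importtsv_columns(family="t", count=24):
    """Build the value for -Dimporttsv.columns: the row key, then v01..vNN."""
    columns = ["HBASE_ROW_KEY"]
    columns += ["%s:v%02d" % (family, i) for i in range(1, count + 1)]
    return ",".join(columns)


# Print the complete option as it would appear on the command line.
print("-Dimporttsv.columns=" + importtsv_columns())
```

Generating the list avoids the kind of typo that is easy to make in a 25-entry mapping, such as a skipped or repeated qualifier.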

Several other options are available. Running importtsv without any arguments prints a short usage message:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar importtsv
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
…

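Other options that can be set with -D include:
-Dimporttsv.skip.bad.lines=false : fail if an invalid line is encountered
'-Dimporttsv.separator=|' : use the given character (here a pipe) instead of tabs as the field separator
-Dimporttsv.timestamp=currentTimeAsLong : use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper : use a user-defined Mapper class instead of the default org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
The tool starts a MapReduce job. In the map phase, the job reads and parses the TSV files under the given path and writes the data into the HBase table according to the column mapping. The reads and writes run in parallel on multiple servers, so the import is much faster than reading from a single node. By default the job has no reduce phase. The MapReduce admin page shows the job's progress, counters, and other MapReduce information.
To look at the data inserted into the table, use the scan command in the HBase shell. Specifying COLUMNS => 't:' restricts the scan to column family t.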
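
To make the map phase more concrete, the following sketch mimics in plain Python what happens to each input line: the line is split on the separator, the field marked HBASE_ROW_KEY becomes the row key, and the remaining fields become cells named by the column mapping. This is an illustration only, not the code of TsvImporterMapper. In particular, treating any line with the wrong number of fields as a bad line is an assumption of this sketch, and the function names are hypothetical.

```python
def map_line(line, columns, separator="\t"):
    """Turn one text line into (row_key, {column: value})."""
    values = line.rstrip("\n").split(separator)
    if len(values) != len(columns):
        raise ValueError(
            "bad line: expected %d fields, got %d" % (len(columns), len(values))
        )
    row_key = None
    cells = {}
    for name, value in zip(columns, values):
        if name == "HBASE_ROW_KEY":
            row_key = value
        else:
            cells[name] = value
    if row_key is None:
        raise ValueError("column mapping has no HBASE_ROW_KEY entry")
    return row_key, cells


def map_lines(lines, columns, separator="\t", skip_bad_lines=True):
    """Yield (row_key, cells) per line; skip or fail on bad lines."""
    for line in lines:
        try:
            yield map_line(line, columns, separator)
        except ValueError:
            if not skip_bad_lines:
                raise
```

The skip_bad_lines flag plays the role of -Dimporttsv.skip.bad.lines: with False, the first invalid line aborts the run instead of being dropped.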

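There's more
By default, importtsv inserts data into the HBase table through the HBase Put API, using TableOutputFormat in the map phase. When the -Dimporttsv.bulk.output option is specified, it uses HFileOutputFormat instead and generates files in HBase's internal format (HFile) in HDFS. We can then load the generated files into a running cluster with completebulkload. Follow these steps to use the bulk output option and the load tool:
1. Create a directory in HDFS for the generated files: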

hac@client1$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hac/output

2. Run importtsv with the bulk output option:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar importtsv \
-Dimporttsv.bulk.output=/user/hac/output/2-1 \
-Dimporttsv.columns=HBASE_ROW_KEY,t:v01,t:v02,t:v03,t:v04,t:v05,t:v06,t:v07,t:v08,t:v09,t:v10,t:v11,t:v12,t:v13,t:v14,t:v15,t:v16,t:v17,t:v18,t:v19,t:v20,t:v21,t:v22,t:v23,t:v24 \
hly_temp \
/user/hac/input/2-1

3. Complete the bulk load:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar completebulkload \
/user/hac/output/2-1 \
hly_temp
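
When the two-stage import is driven from a script, the commands from steps 2 and 3 can be assembled as argument lists. The sketch below only builds the lists; the helper names are hypothetical, and every flag and path is one already shown in this article. A list of this form could then be handed to something like Python's subprocess.run, which is left out here so the example has no side effects.

```python
def importtsv_argv(hadoop_bin, hbase_jar, columns, table, input_dir,
                   bulk_output=None):
    """Argument list for importtsv; adds the bulk output option if given."""
    argv = [hadoop_bin, "jar", hbase_jar, "importtsv"]
    if bulk_output is not None:
        argv.append("-Dimporttsv.bulk.output=" + bulk_output)
    argv.append("-Dimporttsv.columns=" + columns)
    argv += [table, input_dir]
    return argv


def completebulkload_argv(hadoop_bin, hbase_jar, hfile_dir, table):
    """Argument list for loading the generated HFiles into the table."""
    return [hadoop_bin, "jar", hbase_jar, "completebulkload", hfile_dir, table]
```

Passing the same directory as bulk_output in the first call and as hfile_dir in the second keeps the two stages consistent.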

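The completebulkload tool looks at the generated files, determines which region each one belongs to, and then contacts the appropriate region server. The region server moves the HFiles into its own storage directory and brings the data online for clients.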
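
The idea of finding the region a key belongs to can be illustrated with a short sketch. HBase keeps rows sorted by row key, and each region covers a contiguous key range that begins at its start key. Assuming the regions of a table are described by a sorted list of start keys, with an empty start key for the first region, a binary search finds the owning region. The start keys below are made up for the example, and the code is not taken from completebulkload.

```python
import bisect


def find_region(start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    start_keys must be sorted and begin with "" (the first region).
    A start key is inclusive: a row key equal to it belongs to that region.
    """
    return bisect.bisect_right(start_keys, row_key) - 1


# Hypothetical table with three regions.
start_keys = ["", "AQW", "CQC"]
print(find_region(start_keys, "AQW000617050110"))
```

With these made-up start keys, the sample row key from the scan output falls into the second region (index 1), because it sorts after "AQW" and before "CQC".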

Translation link: http://www.importnew.com/3645.html

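[If you repost this article, please say so in the text and keep the original link, the translation link, and the translator information. Thank you.]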
