
HBase Bulkload: Fixing a Bug and Submitting a Patch

Jun 07, 2016 pm 04:32 PM


Part 1: Tracking down the problem

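While developing and self-testing a feature for shop search, I ran into a problem: a bulkload was taking far too long and never finished. I checked the logs and found the bulkload program retrying over and over, with output like the following (I didn't save the output from that day; this is a capture taken right after reproducing it).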

Nothing here looks alarming: bulkload is loading HFiles into the table test_shopinfo, the load failed, but the error is recoverable and will be retried. Then I saw the following:

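So that is the problem: bulkload keeps failing and keeps retrying, with no end in sight. I first suspected that something was wrong with the HBase cluster. After some digging around in HBase, I finally found the corresponding entries in the regionserver log: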

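According to the log, the regionserver checked the bounds of the HFile for the ladder family, found that they fit the region's bounds, and should have loaded it successfully. The HFile for the ecrm family, however, failed to load, and the log claimed the failure was caused by a split but was recoverable.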

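However, our policy for HBase tables is to avoid splits by configuring the maximum HFile size, so splits basically never happen (we set the maximum very high). That made me think something must have gone wrong when the regionserver handled the ecrm HFile, so I looked up the code in HRegion.java. The relevant part is: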

      // validation failed, bail out before doing anything permanent.
      if (failures.size() != 0) {
        StringBuilder list = new StringBuilder();
        for (Pair<byte[], String> p : failures) {
          list.append("\n").append(Bytes.toString(p.getFirst())).append(" : ")
            .append(p.getSecond());
        }
        // problem when validating
        LOG.warn("There was a recoverable bulk load failure likely due to a" +
            " split.  These (family, HFile) pairs were not loaded: " + list);
        return false;
      }

Next, where does failures come from? The code sits right above the snippet just shown:

      List<IOException> ioes = new ArrayList<IOException>();
      List<Pair<byte[], String>> failures = new ArrayList<Pair<byte[], String>>();
      for (Pair<byte[], String> p : familyPaths) {
        byte[] familyName = p.getFirst();
        String path = p.getSecond();
        Store store = getStore(familyName);
        if (store == null) {
          IOException ioe = new org.apache.hadoop.hbase.exceptions.DoNotRetryIOException(
              "No such column family " + Bytes.toStringBinary(familyName));
          ioes.add(ioe);
          failures.add(p);
        } else {
          try {
            store.assertBulkLoadHFileOk(new Path(path));
          } catch (WrongRegionException wre) {
            // recoverable (file doesn't fit in region)
            failures.add(p);
          } catch (IOException ioe) {
            // unrecoverable (hdfs problem)
            ioes.add(ioe);
          }
        }
      }

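Two places in this code add to failures. The second one first calls HStore.assertBulkLoadHFileOk(). Reading that method showed that it is what prints the HFile-versus-region bounds check in the regionserver log. For the ecrm HFile, no bounds information was printed at all, so its entry must have been added by the first failures.add(p) in the code above. That is when it clicked: ecrm was a family newly added in this round of data, but the HBase table had not been rebuilt to include it. I rebuilt the table in the environment, ran the bulkload again, and it finished without any trouble. So the self-test problem was solved, but one question remained: when you bulkload an HFile for a family that does not exist in the table, the log tells you the failure is recoverable and then retries forever. How is that not a trap? Which leads to the rest of the story.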
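The root cause above (HFiles for a family the table does not have) can be caught before a bulkload is even started. The following is a minimal sketch under stated assumptions, not HBase code: FamilyPrecheck and missingFamilies are hypothetical names, and the family names are plain strings that the caller would have to collect from the table definition and from the HFile output directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-flight check; not part of HBase. */
final class FamilyPrecheck {

    /**
     * Returns the families that appear in the bulkload input but are not
     * defined on the target table, without duplicates, in first-seen order.
     */
    static List<String> missingFamilies(Set<String> tableFamilies, List<String> hfileFamilies) {
        Set<String> missing = new LinkedHashSet<>();
        for (String family : hfileFamilies) {
            if (!tableFamilies.contains(family)) {
                missing.add(family);
            }
        }
        return new ArrayList<>(missing);
    }

    public static void main(String[] args) {
        // Assumed state from the story: the table only has "ladder",
        // while the generated HFiles also cover the new family "ecrm".
        Set<String> tableFamilies = new LinkedHashSet<>(Arrays.asList("ladder"));
        List<String> hfileFamilies = Arrays.asList("ladder", "ecrm", "ecrm");

        List<String> missing = missingFamilies(tableFamilies, hfileFamilies);
        if (!missing.isEmpty()) {
            System.out.println("Refusing to bulkload; table lacks families: " + missing);
        }
    }
}
```

A check like this only guards the caller's side; it does not change how the regionserver reports the failure.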

Part 2: A round of wrangling in the HBase community

In the spirit of getting to the bottom of things, I went back to that infuriating code, read it carefully twice, and found the problem.

First, the documentation of the method this code lives in:

  /**
   * Attempts to atomically load a group of hfiles.  This is critical for loading
   * rows with multiple column families atomically.
   *
   * @param familyPaths List of Pair<byte[] column family, String hfilePath>
   * @param bulkLoadListener Internal hooks enabling massaging/preparation of a
   * file about to be bulk loaded
   * @param assignSeqId
   * @return true if successful, false if failed recoverably
   * @throws IOException if failed unrecoverably.
   */
  public boolean bulkLoadHFiles(List<Pair<byte[], String>> familyPaths, boolean assignSeqId,
      BulkLoadListener bulkLoadListener) throws IOException

It returns true on success, returns false on a recoverable failure, and throws IOException on an unrecoverable failure.
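To make that three-way contract concrete, here is a minimal sketch of how a caller might consume it with a retry cap, so that a failure that keeps being reported as recoverable cannot loop forever. This is an illustration only: BoundedRetry, Attempt, and runWithLimit are hypothetical names, and this is not how the HBase bulkload client is implemented.

```java
import java.io.IOException;

/** Hypothetical caller-side helper; not part of HBase. */
final class BoundedRetry {

    /** Stand-in for a call with the same contract as bulkLoadHFiles. */
    interface Attempt {
        // true = success, false = recoverable failure, IOException = unrecoverable
        boolean run() throws IOException;
    }

    /**
     * Retries while the attempt reports a recoverable failure, at most
     * maxAttempts times. Returns the 1-based attempt number that succeeded,
     * or -1 if every attempt failed recoverably. An IOException is not
     * caught here, so an unrecoverable failure stops the loop immediately.
     */
    static int runWithLimit(Attempt attempt, int maxAttempts) throws IOException {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.run()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated load that fails recoverably twice, then succeeds.
        int succeededOn = runWithLimit(() -> ++calls[0] >= 3, 10);
        System.out.println("succeeded on attempt " + succeededOn);

        // Simulated load that always claims to be recoverable.
        int gaveUp = runWithLimit(() -> false, 5);
        System.out.println("gave up, result " + gaveUp);
    }
}
```

The cap matters precisely because of the contract: a false return only says a retry might help, not that it will.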

Here is the whole block pasted together for easier reading:

      List<IOException> ioes = new ArrayList<IOException>();
      List<Pair<byte[], String>> failures = new ArrayList<Pair<byte[], String>>();
      for (Pair<byte[], String> p : familyPaths) {
        byte[] familyName = p.getFirst();
        String path = p.getSecond();
        Store store = getStore(familyName);
        if (store == null) {
          IOException ioe = new org.apache.hadoop.hbase.exceptions.DoNotRetryIOException(
              "No such column family " + Bytes.toStringBinary(familyName));
          ioes.add(ioe);
          failures.add(p);
        } else {
          try {
            store.assertBulkLoadHFileOk(new Path(path));
          } catch (WrongRegionException wre) {
            // recoverable (file doesn't fit in region)
            failures.add(p);
          } catch (IOException ioe) {
            // unrecoverable (hdfs problem)
            ioes.add(ioe);
          }
        }
      }
      // validation failed, bail out before doing anything permanent.
      if (failures.size() != 0) {
        StringBuilder list = new StringBuilder();
        for (Pair<byte[], String> p : failures) {
          list.append("\n").append(Bytes.toString(p.getFirst())).append(" : ")
            .append(p.getSecond());
        }
        // problem when validating
        LOG.warn("There was a recoverable bulk load failure likely due to a" +
            " split.  These (family, HFile) pairs were not loaded: " + list);
        return false;
      }
      // validation failed because of some sort of IO problem.
      if (ioes.size() != 0) {
        IOException e = MultipleIOException.createIOException(ioes);
        LOG.error("There were one or more IO errors when checking if the bulk load is ok.", e);
        throw e;
      }

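The first half of this code processes a batch of HFiles and records the failures and IOExceptions in two lists; the second half then acts on them. And that is where the problem lies: any IOException collected in the first half means this bulkload is certain to fail, yet the follow-up handling checks failures first, logs a warn message telling the user the failure is recoverable, and returns false, skipping the IOException handling below it entirely. Logically, this code should handle the IOExceptions first and only move on to failures if there are none.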
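The ordering argued for here can be sketched in isolation. The class below is a simplified stand-in written for illustration, not the HBase code and not the patch that was eventually committed: ValidationOrder and validate are hypothetical names, families are plain strings, and a set of "out of range" families simulates the recoverable WrongRegionException case.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified stand-in for the validation step; hypothetical, not HBase code. */
final class ValidationOrder {

    /**
     * Returns true if every requested family validates, false if the only
     * problems are recoverable, and throws IOException if any problem is
     * unrecoverable (here: a family the table does not have).
     */
    static boolean validate(Set<String> tableFamilies, Set<String> outOfRange,
                            List<String> requested) throws IOException {
        List<IOException> ioes = new ArrayList<>();
        List<String> failures = new ArrayList<>();
        for (String family : requested) {
            if (!tableFamilies.contains(family)) {
                // Unrecoverable: retrying will not make the family appear.
                ioes.add(new IOException("No such column family " + family));
                failures.add(family);
            } else if (outOfRange.contains(family)) {
                // Recoverable: the file does not fit the region right now.
                failures.add(family);
            }
        }
        // Check the unrecoverable errors FIRST, so they are never masked
        // by the "recoverable" branch below.
        if (!ioes.isEmpty()) {
            throw new IOException(ioes.size() + " unrecoverable validation error(s), first: "
                    + ioes.get(0).getMessage());
        }
        // Only recoverable problems (or none) remain at this point.
        return failures.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> tableFamilies = new HashSet<>(Arrays.asList("ladder"));
        Set<String> none = new HashSet<>();
        try {
            validate(tableFamilies, none, Arrays.asList("ladder", "ecrm"));
            System.out.println("validation did not throw");
        } catch (IOException e) {
            System.out.println("unrecoverable: " + e.getMessage());
        }
    }
}
```

With the checks in this order, the missing ecrm family from the story takes the exception path instead of being reported as recoverable.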

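At this point the problem was clear and the fix was more or less obvious. But HBase's code is not something I can just change at will, so what now?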

Just then, our resident expert 道凡 lent a hand: https://issues.apache.org/jira/browse/HBASE. File an issue right there, he said, and the problem can get fixed!

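Contributing something to HBase sounded pretty nice, so I opened the link with a let's-give-it-a-try attitude, registered, clicked create issue, and described the problem above in my not-so-fluent English. With the issue created, I figured some expert would come by, take a look at the bug, casually fix it, and that would be the end of my involvement.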

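When I got to the office the next day, 道凡 messaged me out of the blue to say someone had replied on the issue. I clicked through and saw that Ted Yu had agreed with the report and added "Any chance of a patch ?" I took that as an expert encouraging a newbie to be bold, which is exactly the attitude you would hope for. Out of admiration for him, and with the senior colleagues in our backend team chat cheering me on, I decided I couldn't chicken out and would give it a shot.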

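What came next was predictable: I had no idea what to do. Fortunately, with Ted Yu's guidance and my colleagues' encouragement and help, I worked through it step by step: checking out the code, changing it, setting up a build environment, submitting a patch, adding a test case, running the test case in my own environment, submitting a patch that included the test case, and a whole series of other complicated steps (tens of thousands of words omitted here). Finally, this morning, a committer committed my patch to the code for several versions. With that, the matter is basically settled. My name now shows up in the svn log, which makes the effort of the past few days feel worthwhile (because of the time difference, discussing problems with others and asking for help took patient waiting).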

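I hope more of you will be bold about filing issues. It helps you, it helps everyone else who uses this open source software, and it is a small contribution to the greater good.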

Here is the link to this issue: https://issues.apache.org/jira/browse/HBASE-8192

Finally, here is the rough path an issue takes from filing to resolution. I hope it helps anyone who files one later:

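1. Create the issue and describe the problem as clearly as you can. If the solution is fairly clear, include it; if not, discuss it with others in the comments.

2. Once there is a solution and you plan to submit the patch yourself, set up a development environment (if you don't have one yet). That includes checking out the code (patches are generally made against trunk, http://svn.apache.org/repos/asf/hbase/trunk) and installing mvn, a JDK, and so on. I'm not sure of the exact JDK version requirement; when I set mine up, compiling with 1.6 failed and switching to 1.7 worked. The official guide at http://hbase.apache.org/book.html#developer may help.

3. Change the code, rebuild, and run the test cases. The guide above covers these steps too and is worth consulting when you hit problems. There are some things to keep in mind when changing code; read http://wiki.apache.org/hadoop/Hbase/HowToContribute first. When running test cases, watch your free disk space: the errors you get when the disk is full may not point at the real cause and can show up as unrelated Exceptions, so keep it in mind (this bit me many times). Test data takes up a fair amount of space (several GB). Also remember to run mvn clean.

4. Use attach files to upload your patch, then submit patch. What you upload is a diff between your code and the trunk code, produced by running svn diff from the svn root directory of HBase trunk.

5. Each time you attach files, Hadoop QA (I'm not sure whether it is automated) will test your patch a little later. The problems listed in the test result need to be fixed, except for test case failures that were not caused by your change.

6. After you submit the patch, the issue status changes to patch available. At that point (possibly after some waiting) someone (I'm not sure it is always a committer) will review it. If they think it is fine, they leave a +1 or lgtm (looks good to me) in the comments.

7. Once the patch is basically fine, wait for a committer to pull it onto some branches for testing. After the tests pass, they commit it to the corresponding svn.

8. That is basically it. If anything comes up, the committer should contact you again.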

Last of all, if this interests you or you have questions you'd like to discuss, feel free to pester me.
