Home > Database > Mysql Tutorial > Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

DDD
Release: 2025-01-12 10:55:44
Original
422 people have browsed it

Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

*Hive's `COUNT()` Yields Unexpected Results After Filtering**

Using Hive's COUNT(*) function can sometimes produce surprising outcomes. Consider a table, "mytable," with a column "master_id."

An initial count query reveals 1,129,563 rows:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable;</code>
Copy after login

However, filtering for non-null "master_id" values unexpectedly increases the count to 1,134,041:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NOT NULL;</code>
Copy after login

This is puzzling, especially since a query checking for null "master_id" values returns zero:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NULL;</code>
Copy after login

The Reason Behind the Discrepancy

This inconsistency stems from Hive's reliance on table statistics for query optimization. The initial COUNT(*) query, lacking a WHERE clause, might use estimated statistics instead of a full table scan. Adding the WHERE clause forces a complete table scan, leading to a precise, and higher, row count.

How to Obtain Accurate Counts

To guarantee accurate results, consider these solutions:

  1. Disable Statistics-Based Optimization: Prevent Hive from using statistics for query optimization:

    <code class="language-sql">SET hive.compute.query.using.stats=false;</code>
    Copy after login
  2. Manually Update Table Statistics: Use the ANALYZE TABLE command to refresh the table's statistics:

    <code class="language-sql">ANALYZE TABLE mytable COMPUTE STATISTICS;</code>
    Copy after login
  3. Enable Automatic Statistics Gathering: Configure Hive to automatically update statistics during INSERT OVERWRITE operations:

    <code class="language-sql">SET hive.stats.autogather=true;</code>
    Copy after login

By implementing one of these methods, you can ensure your COUNT(*) queries consistently return accurate row counts.

The above is the detailed content of Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?. For more information, please follow other related articles on the PHP Chinese website!

source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template