*Hive's `COUNT()` Yields Unexpected Results After Filtering**
Using Hive's COUNT(*)
function can sometimes produce surprising outcomes. Consider a table, "mytable," with a column "master_id."
An initial count query reveals 1,129,563 rows:
<code class="language-sql">SELECT COUNT(*) AS c FROM mytable;</code>
However, filtering for non-null "master_id" values unexpectedly increases the count to 1,134,041:
<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NOT NULL;</code>
This is puzzling, especially since a query checking for null "master_id" values returns zero:
<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NULL;</code>
The Reason Behind the Discrepancy
This inconsistency stems from Hive's reliance on table statistics for query optimization. The initial COUNT(*)
query, lacking a WHERE
clause, might use estimated statistics instead of a full table scan. Adding the WHERE
clause forces a complete table scan, leading to a precise, and higher, row count.
How to Obtain Accurate Counts
To guarantee accurate results, consider these solutions:
Disable Statistics-Based Optimization: Prevent Hive from using statistics for query optimization:
<code class="language-sql">SET hive.compute.query.using.stats=false;</code>
Manually Update Table Statistics: Use the ANALYZE TABLE
command to refresh the table's statistics:
<code class="language-sql">ANALYZE TABLE mytable COMPUTE STATISTICS;</code>
Enable Automatic Statistics Gathering: Configure Hive to automatically update statistics during INSERT OVERWRITE
operations:
<code class="language-sql">SET hive.stats.autogather=true;</code>
By implementing one of these methods, you can ensure your COUNT(*)
queries consistently return accurate row counts.
The above is the detailed content of Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?. For more information, please follow other related articles on the PHP Chinese website!