Hive query count difference: non-empty count is greater than total count
In Hive, we observed an interesting phenomenon regarding row count calculation. Suppose there is a table named mytable that contains a field named master_id. When the following query is executed, the total number of rows retrieved is 1,129,563:
<code class="language-sql">SELECT COUNT(*) AS c FROM mytable;</code>
However, when querying for the number of rows with non-empty master_id values, the count increases to 1,134,041:
<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NOT NULL;</code>
This difference occurs because, by default, queries without a WHERE clause may use statistics to estimate the number of rows. To solve this problem, parameter hive.compute.query.using.stats
can be set to false to disable statistical estimation.
Alternatively, you can calculate table statistics explicitly using the ANALYZE TABLE
statement, or set hive.stats.autogather
to true to collect statistics when bulk importing data via the INSERT OVERWRITE operation. This will ensure accurate and consistent query results and avoid unexpected results like this.
The above is the detailed content of Why Does My Hive COUNT(*) Query Show Fewer Rows Than My Non-NULL Count Query?. For more information, please follow other related articles on the PHP Chinese website!