
pg-index-health – a static analysis tool for your PostgreSQL database

Linda Hamilton
Release: 2025-01-06 18:20:40

Hello there!

Since 2019, I have been developing an open-source tool called pg-index-health, which analyzes database structures and identifies potential issues. In one of my previous articles, I shared the story of how this tool came to life.

Over the years, pg-index-health has evolved and improved. In 2024, with the support of several contributors, I managed to address most of its remaining "growing pains" and bring the project to a state where it is ready for large-scale expansion.

The Growth of Databases with the Rise of Microservices

I've been working with PostgreSQL since 2015, and this fascinating journey began at the Yaroslavl-based company Tensor.

Back in 2015, it was still the era of monoliths with massive databases and a large number of tables. Typically, any changes to the structure of such databases required mandatory approval from an architect or a development lead, who served as the key knowledge holder. While this safeguarded against most errors, it slowed down the process of making changes and was entirely unscalable.

Gradually, people began transitioning to microservices.
The number of databases grew significantly, but the number of tables within each database, conversely, decreased. Now, each team started managing the structure of its own database independently. The centralized source of expertise disappeared, and database design errors began to multiply and propagate from one service to another.

The Testing Pyramid and Its Shapes

Most of you have probably heard of the testing pyramid. For monoliths, it has a fairly characteristic shape with a broad base of unit tests. For more details, I recommend Martin Fowler's article.


Microservices have changed not only the approach to development but also the appearance of the testing pyramid. This shift was largely driven by the rise of containerization technologies (Docker, Testcontainers). Today the testing pyramid is no longer a pyramid at all. It can have a very bizarre shape. The most well-known examples are the Honeycomb and the Testing Trophy.


The modern trend is to write as few unit tests as possible, focusing on implementation details, and to prioritize component and integration tests that validate the actual functionality provided by the service.

My personal favorite is the Testing Trophy. At its base lies static code analysis, designed to guard against common errors.

The Importance of Static Code Analysis

Static analysis for Java and Kotlin code is now common practice. For Kotlin services, the tool of choice is typically detekt. For Java applications, the range of available tools (often referred to as linters) is broader. The main tools include Checkstyle, PMD, SpotBugs, and Error Prone. You can read more about them in my previous article.

Notably, both detekt and Checkstyle also handle code formatting, effectively functioning as formatters.

Static Analysis for Database Migrations

Modern microservices often include database migrations for creating and updating the database structure alongside application code.

In the Java ecosystem, the main tools for managing migrations are Liquibase and Flyway. Any changes to the database structure must always be documented in migrations. Even if changes are made manually during an incident in production, a migration must later be created to apply those changes across all environments.

Writing migrations in plain SQL is the best practice as it provides maximum flexibility and saves time compared to learning the XML dialects of tools like Liquibase. I touched on this in my article, "Six Tips for Using PostgreSQL in Functional Tests".

Verifying SQL Migration Code

To verify the SQL code in migrations, I recommend using SQLFluff, which is essentially a Checkstyle equivalent for SQL. This linter supports multiple databases and dialects (including PostgreSQL) and can be integrated into your CI pipeline. It offers over 60 customizable rules, enabling you to manage table and column aliases, SQL command casing, indentation, column ordering in queries, and much more.

Compare the query with and without formatting:

-- well-formatted SQL
select
    pc.oid::regclass::text as table_name,
    pg_table_size(pc.oid) as table_size
from
    pg_catalog.pg_class pc
    inner join pg_catalog.pg_namespace nsp on nsp.oid = pc.relnamespace
where
    pc.relkind = 'r' and
    pc.oid not in (
        select c.conrelid as table_oid
        from pg_catalog.pg_constraint c
        where c.contype = 'p'
    ) and
    nsp.nspname = :schema_name_param::text
order by table_name;
-- poorly formatted SQL
SELECT pc.oid::regclass::text AS table_name, pg_table_size(pc.oid) AS table_size
FROM pg_catalog.pg_class  pc
JOIN pg_catalog.pg_namespace AS nsp
ON nsp.oid =  pc.relnamespace
WHERE pc.relkind = 'r'
and pc.oid NOT in (
  select c.conrelid as table_oid
  from pg_catalog.pg_constraint   c
  where    c.contype = 'p'
)
and nsp.nspname  = :schema_name_param::text
ORDER BY  table_name;

Well-formatted SQL code is much easier to read and understand. Most importantly, code reviews will no longer be bogged down by discussions about formatting preferences. SQLFluff enforces a consistent style, saving time.

SQLFluff in Action

This is what it looks like in a real pull request:


Here SQLFluff flagged three issues. First, the formatting of the return value in the select statement: when only one column is returned, we do not put it on a separate row. Second, the column order in the result set: simple columns come first, followed by computed values. Third, the keyword casing in the join clause: I prefer to write all queries in lowercase.

For more examples of SQLFluff in use, check out my open-source projects: one, two.

Using Metadata to Analyze Database Structure

The structure of the database itself can also be checked. However, working with migrations is extremely inconvenient: there can be many of them; a new migration may fix errors in a previous migration, and so on. As a rule, we are more interested in the final structure of the database than its intermediate state.

Leveraging Information Schema

PostgreSQL (like many other relational databases) stores metadata about all objects and relationships between them and provides it externally in the form of information_schema. We can use queries to information_schema to identify any deviations, problems, or common errors (this is exactly what SchemaCrawler does).

Since we work only with PostgreSQL, instead of information_schema we can use system catalogs (the pg_catalog schema), which provide much more information about the internal structure of a specific database.

Cumulative Statistics System

In addition to metadata, PostgreSQL collects information about the operation of each database: what queries are executed, how they are executed, what access methods are used, etc. The Cumulative Statistics System is responsible for collecting this data.

By querying these statistics through system views and combining them with data from the system catalogs, we can:

  • Identify unused indexes;
  • Detect tables that lack adequate indexing.
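For example, a query along these lines over the statistics views finds indexes that have never been scanned on the current host (a sketch; real checks usually add size thresholds and more filters):

```sql
-- indexes that have never been scanned on this host
-- (remember: statistics are collected per host and can be reset)
select
    psui.relname as table_name,
    psui.indexrelname as index_name,
    pg_relation_size(psui.indexrelid) as index_size
from pg_catalog.pg_stat_user_indexes psui
    inner join pg_catalog.pg_index i on i.indexrelid = psui.indexrelid
where
    psui.idx_scan = 0 and
    not i.indisunique;
```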

Statistics can be reset manually. The date and time of the last reset are recorded in the system; take it into account when deciding whether the statistics can be trusted. For example, if some business logic runs only once a month, quarter, or half-year, the statistics must cover at least that interval.

If a database cluster is used, then statistics are collected independently on each host and are not replicated within the cluster.

pg-index-health and Its Structure

I have implemented the idea described above — analyzing a database's structure through the metadata stored within the database itself — in a tool called pg-index-health.

My solution includes the following components:

  • A set of checks in the form of SQL queries, placed in a separate repository (currently consisting of 25 checks). The queries are decoupled from the Java codebase and can be reused in projects written in other programming languages.
  • A domain model — a minimal set of classes that represent the results of the checks as objects.
  • The HighAvailabilityPgConnection abstraction for connecting to a database cluster consisting of multiple hosts.
  • Utilities for executing SQL queries and serializing results into domain model objects.
  • A Spring Boot starter for convenient and quick integration of checks into unit/component/integration tests.
  • A migration generator that can create corrective SQL migrations for identified issues.

Types of Checks

All checks (also referred to as diagnostics) are divided into two groups:

  • Runtime checks (require statistics).
  • Static checks (do not require statistics).

Runtime checks

Runtime checks are meaningful only when executed on a live database instance in production. These checks require accumulated statistics and aggregate this data from all hosts in the cluster.

Let’s consider a database cluster consisting of three hosts: primary, secondary, and async replica. Some services use clusters with similar topologies and execute heavy read queries only on the asynchronous replica to balance the load. Such queries are usually not executed on the primary host as they create additional load and negatively impact the latency of other queries.


As previously mentioned, in PostgreSQL, statistics are collected separately on each host and are not replicated within the cluster. Thus, you can easily encounter a situation where certain indexes are used and necessary only on the asynchronous replica. To make a reliable determination about whether an index is needed or not, it is necessary to run the check on each host in the cluster and aggregate the results.

Static checks

Static checks do not require accumulated statistics and can be executed on the primary host immediately after applying migrations. Of course, they can also be used on a production database to obtain data in real time. However, most checks are static, and they are particularly useful in tests, as they help catch and prevent common errors during the development phase.


How to Use pg-index-health

The primary use case for pg-index-health is adding tests to verify the database structure in your testing pipeline.

For Spring Boot applications, you need to add the starter to your test dependencies:

dependencies {
    testImplementation("io.github.mfvanek:pg-index-health-test-starter:0.14.4")
}

Then add a standard test:

import io.github.mfvanek.pg.core.checks.common.DatabaseCheckOnHost;
import io.github.mfvanek.pg.core.checks.common.Diagnostic;
import io.github.mfvanek.pg.model.dbobject.DbObject;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.ActiveProfiles;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
@ActiveProfiles("test")
class DatabaseStructureStaticAnalysisTest {

    @Autowired
    private List<DatabaseCheckOnHost<? extends DbObject>> checks;

    @Test
    void checksShouldWork() {
        assertThat(checks)
            .hasSameSizeAs(Diagnostic.values());

        checks.stream()
            .filter(DatabaseCheckOnHost::isStatic)
            .forEach(c -> assertThat(c.check())
                .as(c.getDiagnostic().name())
                .isEmpty());
    }
}

In this test, all available checks are injected as a list. Then, only static checks are filtered and executed on the real database deployed in a container with applied migrations.

Ideally, each check should return an empty list. If a new migration introduces a deviation, the test will fail, forcing the developer to address it: either fix the migration or explicitly ignore the finding.

False Positives and Adding Exclusions

It is important to understand that pg-index-health, like any other static analyzer, may generate false positives. Additionally, some checks may not be relevant to your project. For example, it is considered good practice to document the database structure. PostgreSQL allows adding comments to almost all database objects. In a migration, this might look like the following:

create table if not exists demo.warehouse
(
    id bigint primary key generated always as identity,
    name varchar(255) not null
);

comment on table demo.warehouse is 'Information about the warehouses';
comment on column demo.warehouse.id is 'Unique identifier of the warehouse';
comment on column demo.warehouse.name is 'Human readable name of the warehouse';

Within your team, you might agree not to do this. In that case, the results of corresponding checks (TABLES_WITHOUT_DESCRIPTION, COLUMNS_WITHOUT_DESCRIPTION, FUNCTIONS_WITHOUT_DESCRIPTION) become irrelevant for you.

You can either completely exclude these checks:

@Test
void checksShouldWork() {
    assertThat(checks)
        .hasSameSizeAs(Diagnostic.values());

    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .filter(c -> c.getDiagnostic() != Diagnostic.TABLES_WITHOUT_DESCRIPTION &&
            c.getDiagnostic() != Diagnostic.COLUMNS_WITHOUT_DESCRIPTION)
        .forEach(c -> assertThat(c.check())
            .as(c.getDiagnostic().name())
            .isEmpty());
}

Or simply ignore their results:

@Test
void checksShouldWork() {
    assertThat(checks)
        .hasSameSizeAs(Diagnostic.values());

    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .forEach(c -> {
            final ListAssert<? extends DbObject> listAssert = assertThat(c.check())
                .as(c.getDiagnostic().name());
            switch (c.getDiagnostic()) {
                case TABLES_WITHOUT_DESCRIPTION, COLUMNS_WITHOUT_DESCRIPTION -> listAssert.hasSizeGreaterThanOrEqualTo(0); // ignored

                default -> listAssert.isEmpty();
            }
        });
}

When introducing pg-index-health, you may often find that the database structure already has some deviations that you do not want to address immediately. At the same time, the check itself is relevant, and disabling it is not an option. In such cases, it is best to explicitly record the existing deviations in the test code:

@Test
void checksShouldWorkForAdditionalSchema() {
    final PgContext ctx = PgContext.of("additional_schema");
    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .forEach(c -> {
            final ListAssert<? extends DbObject> listAssert = assertThat(c.check(ctx))
                .as(c.getDiagnostic().name());

            switch (c.getDiagnostic()) {
                case TABLES_WITHOUT_DESCRIPTION, TABLES_NOT_LINKED_TO_OTHERS ->
                    listAssert.hasSize(1)
                        .asInstanceOf(list(Table.class))
                        .containsExactly(
                            Table.of(ctx, "additional_table")
                        );

                default -> listAssert.isEmpty();
            }
        });
}

Now, I would like to focus in more detail on the most frequently encountered issues.

Tables Without Primary Keys

Due to the specifics of the MVCC mechanism in PostgreSQL, situations like bloat can occur, where the size of your table (or index) rapidly grows due to a large number of dead tuples. This can happen, for example, as a result of long-running transactions or a one-time update of a large number of rows.

Garbage collection within the database is handled by the autovacuum process, but it does not free up the physical disk space occupied. The only way to effectively reduce the physical size of a table is by using the VACUUM FULL command, which requires an exclusive lock for the duration of the operation. For large tables, this can take several hours, making full vacuuming impractical for most modern services.

To address the issue of table bloat without downtime, third-party extensions like pg_repack are often used. One of the mandatory requirements of pg_repack is the presence of a primary key or some other uniqueness constraint on the target table. The TABLES_WITHOUT_PRIMARY_KEY diagnostic helps detect tables without primary keys and prevents maintenance problems in the future.

Below is an example of a table without a primary key. If bloat occurs in this table, pg_repack will be unable to process it and will return an error.

create table if not exists demo.payment
(
    id bigint not null, -- column is not marked as primary key
    order_id bigint references demo.orders (id),
    status int not null,
    created_at timestamp not null,
    payment_total decimal(22, 2) not null
);

Duplicated Indexes

Our databases operate on hosts with limited resources, and disk space is one of them. When using Database-as-a-Service solutions, there is often a physical limitation on the maximum database size that cannot be changed.

Each index in a table is a separate entity on the disk. It occupies space and requires resources for maintenance, which slows down data insertion and updates. We create indexes to speed up searches or ensure the uniqueness of certain values. However, improper use of indexes can lead to a situation where their combined size exceeds the size of the useful data in the table itself. Thus, the number of indexes in a table should be minimal yet sufficient for its functionality.

I have encountered numerous cases where unnecessary indexes are created in migrations. For example, PostgreSQL automatically creates an index for a primary key. It is technically possible to additionally index the id column by hand, but doing so is completely pointless.
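A minimal sketch of this anti-pattern (table and index names are illustrative):

```sql
create table if not exists demo.dictionary
(
    id bigint primary key generated always as identity,
    code varchar(64) not null
);

-- pointless: the primary key already maintains a unique index on id
create index if not exists idx_dictionary_id on demo.dictionary (id);
```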


A similar situation arises with unique constraints. When you mark a column (or a group of columns) with the unique keyword, PostgreSQL automatically creates a unique index for that column (or group of columns). Manually creating additional indexes is unnecessary. If done, this results in duplicated indexes. Such redundant indexes can and should be removed, and the DUPLICATED_INDEXES diagnostic can help identify them.
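For example (names are illustrative):

```sql
create table if not exists demo.promo_code
(
    id bigint primary key generated always as identity,
    code varchar(64) not null unique  -- already backed by a unique index
);

-- duplicated: repeats the index created for the unique constraint
create index if not exists idx_promo_code_code on demo.promo_code (code);
```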


Overlapping (intersecting) Indexes

Most indexes are created for a single column. When query optimization begins, more complex indexes may be added, involving multiple columns. This leads to scenarios where indexes are created for columns like A, A B, and A B C. The first two indexes in this series can often be discarded since they are prefixes of the third (I recommend watching this video). Removing these redundant indexes can save significant disk space, and the INTERSECTED_INDEXES diagnostic is designed to detect such cases.
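A sketch of such a series (assuming a demo.orders table with user_id, status, and created_at columns; names are illustrative):

```sql
-- the first two indexes are prefixes of the third and can usually be dropped
create index if not exists idx_orders_user_id
    on demo.orders (user_id);
create index if not exists idx_orders_user_id_status
    on demo.orders (user_id, status);
create index if not exists idx_orders_user_id_status_created_at
    on demo.orders (user_id, status, created_at);
```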


Foreign Keys Without Indexes

PostgreSQL allows the creation of foreign key constraints without specifying a supporting index, meaning that a table referencing another does not require and will not automatically create an index. In some cases, this might not be a problem and may not manifest at all. However, sometimes it can lead to incidents in production.

Let's look at a small example (I’m using PostgreSQL 16.6):
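The schema below is an illustrative sketch of such a pair of tables (the column names beyond order_id are my assumptions, not the exact ones from the original example):

```sql
create table if not exists demo.orders
(
    id bigint primary key generated always as identity,
    user_id bigint not null,
    status int not null,
    created_at timestamptz not null default now()
);

create table if not exists demo.order_item
(
    id bigint primary key generated always as identity,
    order_id bigint not null references demo.orders (id),  -- no index is created for this FK
    sku varchar(255) not null,
    price decimal(22, 2) not null
);
```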


We have an orders table and an order_item table. They are linked via a foreign key on the order_id column. The foreign key should always reference either the primary key or some unique constraint, which is satisfied in our case.

Let's populate the tables with data and gather statistics. We'll add 100,000 orders, with half having two items and the others having one.
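A possible way to generate such data (a sketch; column names are assumptions and exact values are unimportant):

```sql
insert into demo.orders (user_id, status)
select gs, 1
from generate_series(1, 100000) gs;

-- every order gets one item
insert into demo.order_item (order_id, sku, price)
select id, 'sku-' || id, 100.0
from demo.orders;

-- and every second order gets a second item
insert into demo.order_item (order_id, sku, price)
select id, 'sku2-' || id, 50.0
from demo.orders
where id % 2 = 0;

vacuum analyze demo.orders, demo.order_item;
```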


If we attempt to retrieve items for an order with ID=100, we should successfully return 2 rows. Since there is an index on the id column in the orders table, it might seem that this query should be fast.
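The query might look like this (column names are assumptions):

```sql
select *
from demo.order_item
where order_id = 100;
```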


However, if we try to profile this query, we will see sequential scanning of the table in the execution plan. We should also be concerned about the large number of pages that need to be read (Buffers parameter).
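The plan can be inspected like this (the actual output depends on your data, so it is omitted here):

```sql
explain (analyze, buffers)
select *
from demo.order_item
where order_id = 100;
```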


If we create an index for a column with a foreign key, the situation will return to normal:
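For example (the index name is illustrative):

```sql
create index if not exists idx_order_item_order_id
    on demo.order_item (order_id);
```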


The sequential scan will disappear from the query plan, and the number of pages read will be significantly reduced.


The FOREIGN_KEYS_WITHOUT_INDEX diagnostic will allow you to catch such cases early during development, preventing performance issues.

Should I create indexes or not?

It’s important to remember the issue of false positives: not all foreign key columns need to be indexed. Try to estimate the approximate table size in production; check your code for filtering, searching, or joining on the foreign key column. If you are 100% sure that you don’t need the index, you can simply add it to the exclusions. If you're unsure, it's better to create the index (it can always be removed later).

I’ve often encountered incidents where the database "slowed down" due to the absence of an index on a foreign key, but I haven’t seen any incidents where the database "slowed down" because of the presence of such indexes. Therefore, I disagree with the point made in the Percona blog article that foreign key indexes should not be created at all from the start. This is a DBA approach. Do you have a dedicated DBA on your team?

Null value in indexes

By default, PostgreSQL includes null values in btree indexes, but they are usually not needed there. Every null is considered distinct from every other, and queries rarely look up rows by a null value. Most of the time, it is better to exclude nulls from the index by creating a partial index on the nullable column with a "where ... is not null" clause. The INDEXES_WITH_NULL_VALUES diagnostic helps to detect such cases.

Let's consider an example with orders and order_items. The order_item table has a nullable column warehouse_id, which represents the warehouse ID.


Suppose we have several warehouses. After the order is paid, we begin to assemble it. We will update the status of some orders and mark them as paid.
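For example (the status code is an assumption):

```sql
update demo.orders
set status = 2  -- assumption: 2 means "paid"
where id % 2 = 0;
```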


Individual items in an order may be shipped from different warehouses according to an internal algorithm, considering logistics, stock, warehouse load, etc. After assigning the warehouse and updating the stock, we update the warehouse_id field for each item in the order (which was initially null).
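A simplified sketch (the real assignment is done by business logic, not a modulo):

```sql
update demo.order_item
set warehouse_id = 1 + (order_id % 5)  -- assumes five warehouses
where order_id in (select id from demo.orders where status = 2);
```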


We will need to search by a specific warehouse ID to know which items need to be completed and shipped. We take only paid orders for a certain time range.
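Such a query might look like this (column names and literal values are illustrative):

```sql
select oi.id, oi.sku
from demo.order_item oi
    inner join demo.orders o on o.id = oi.order_id
where
    oi.warehouse_id = 3 and
    o.status = 2 and  -- paid
    o.created_at >= now() - interval '1 day';
```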


The first solution would likely be a regular index on the warehouse_id column:
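For example (the index name is illustrative):

```sql
create index if not exists idx_order_item_warehouse_id
    on demo.order_item (warehouse_id);
```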


If we create such an index, it will be used without issues when searching for items for a specific warehouse. It might seem that this index should allow efficiently finding all items where the warehouse is not yet assigned, filtering records with the condition warehouse_id is null.
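For example:

```sql
select count(*)
from demo.order_item
where warehouse_id is null;
```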


However, if we look at the query execution plan, we will see sequential access there — the index is not used.


Of course, this is related to the specific distribution of data in the test database. The warehouse_id column has low cardinality, meaning the number of unique values in it is small. An index on this column therefore has low selectivity. Index selectivity is the ratio of the number of distinct indexed values (i.e., the cardinality) to the total number of rows in the table (distinct / count(*)). For example, a unique index has a selectivity of one.
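To make the formula concrete, here is a tiny sketch of the selectivity calculation (the warehouse and row counts are hypothetical figures, not taken from the demo database):

```java
public final class SelectivityDemo {

    // selectivity = distinct indexed values / total rows
    static double selectivity(long distinctValues, long totalRows) {
        return (double) distinctValues / totalRows;
    }

    public static void main(String[] args) {
        // Hypothetical figures: 10 distinct warehouses across 100_000 order items.
        System.out.println(selectivity(10, 100_000));      // very low selectivity
        System.out.println(selectivity(100_000, 100_000)); // unique index -> 1.0
    }
}
```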

We can increase the selectivity of the index by removing null values and creating a partial index on the warehouse_id column.

create index if not exists idx_order_item_warehouse_id_without_nulls
    on demo.order_item (warehouse_id)
    where warehouse_id is not null;

We will immediately see this index in the query plan:

explain
select *
from demo.order_item
where warehouse_id = 1; -- looking up items for a specific warehouse

If we compare the sizes of the indexes, we will see a significant difference. The partial index is much smaller and will be updated less frequently. Using this index, we save disk space and improve performance.

Query to get the size of the indexes:

select
    x.indrelid::regclass::text as table_name,
    x.indexrelid::regclass::text as index_name,
    pg_relation_size(x.indexrelid) as index_size_bytes
from pg_catalog.pg_index x
where x.indrelid = 'demo.order_item'::regclass
order by index_name;

table_name       index_name                                       index_size_bytes
demo.order_item  demo.idx_order_item_warehouse_id                 1056768
demo.order_item  demo.idx_order_item_warehouse_id_without_nulls   16384

Plans for the Future

This is far from a complete list of the issues pg-index-health can detect. The full list of diagnostics is available in the project's README on GitHub, and it is expanded regularly.

Integrating pg-index-health into a Spring Boot application is quite simple. The overhead for running the checks is minimal. As a result, you will gain protection from common errors and issues. I encourage you to try implementing it!
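Wiring it up comes down to a single test dependency (the version shown is the one current at the time of writing):

```kotlin
dependencies {
    testImplementation("io.github.mfvanek:pg-index-health-test-starter:0.14.4")
}
```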

In the near future, I plan to add full support for partitioned tables in all checks. Currently, this is implemented only for 11 out of 25 checks. I also want to expand the number of checks: there are already tickets for implementing at least 5 new checks. Additionally, in 2025, I plan to switch to Java 17 and Spring Boot 3.

Repository Links

  • pg-index-health
  • Raw SQL queries for checks
  • Demo applications

Additional Material

  • My original post in Russian
  • A similar solution - SchemaCrawler
  • DBA: finding useless indexes (in Russian)
  • Index health in PostgreSQL through the eyes of a Java developer (in Russian)
  • Static analysis of the database structure (in Russian)

Source: dev.to