Tips for Working with Complex Normalized Databases

We've all been taught the benefits of normalizing data, so I won't bore you with the details, but to summarize:

Normalization is the process of organizing data in a database. It includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependencies.

Microsoft 365 - Description of the database normalization basics

To be honest, I hadn't given normalization much thought until recently, when I had to work with several "highly normalized" legacy applications. And when I say "highly normalized", I mean normalized to the point where it no longer makes sense. It reminded me of this great Coding Horror post: Maybe Normalizing Isn't Normal.

The thing is, unless you're really unlucky, you won't have to worry about anything like this. Rather than discussing it hypothetically, let's walk through a specific scenario and try out different techniques to get a feel for the complexity of the subject. Once we've worked through the scenario, we can dig into the technical details to better understand why a highly normalized architecture can become a problem, and review the optimizations we might consider to improve our experience.

You can check out the code for this article here.

You're working on an inventory management system built on an existing, large SaaS (software as a service) platform. The system consists of inventory items, and each item has a category, a supplier, a warehouse, and various attributes. A customer has requested a report that needs to show an item's details, including the supplier name and the warehouse name.

Here's a simplified schema, without multi-tenancy (just to keep things simple):

Each item references entries in the categories, suppliers, and warehouses tables. Each item's attributes are stored in the item_attributes table. It all makes sense and is easy to implement:

CREATE TABLE items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category_id INT,
    supplier_id INT,
    warehouse_id INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (category_id) REFERENCES categories(id),
    FOREIGN KEY (supplier_id) REFERENCES suppliers(id),
    FOREIGN KEY (warehouse_id) REFERENCES warehouses(id)
);

CREATE TABLE categories (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE suppliers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE warehouses (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    location VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE item_attributes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    item_id INT,
    attribute_name VARCHAR(255) NOT NULL,
    attribute_value VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (item_id) REFERENCES items(id)
);

-- To illustrate the denormalization strategy mentioned, here’s an example of a denormalized items_denormalized table:

CREATE TABLE items_denormalized (
    id INT AUTO_INCREMENT PRIMARY KEY,
    item_id INT,
    name VARCHAR(255) NOT NULL,
    category_id INT,
    category_name VARCHAR(255),
    supplier_id INT,
    supplier_name VARCHAR(255),
    warehouse_id INT,
    warehouse_name VARCHAR(255),
    attribute_name VARCHAR(255),
    attribute_value VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE INDEX idx_items_id ON items(id);
CREATE INDEX idx_categories_id ON categories(id);
CREATE INDEX idx_suppliers_id ON suppliers(id);
CREATE INDEX idx_warehouses_id ON warehouses(id);
CREATE INDEX idx_item_attributes_item_id ON item_attributes(item_id);

Seeding the Data

For any performance work we do, it's important to be able to reproduce the scale we expect, so we can get a better picture of how our application will behave. That's why I put together the following seeding script:

require 'faker'

def create_records(message, &block)
  puts "Creating #{message}."
  starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield if block_given?
  ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  elapsed = ending - starting
  puts "#{message.capitalize} created. #{elapsed}"
end

puts 'Truncating database...'
ActiveRecord::Tasks::DatabaseTasks.truncate_all
puts 'Database truncated.'

create_records('categories') do
  10.times do
    Category.create(name: Faker::Book.genre)
  end
end

create_records('suppliers') do
  25.times do
    Supplier.create(name: Faker::Company.name)
  end
end

create_records('warehouses') do
  1000.times do
    Warehouse.create(name: Faker::Company.name, location: Faker::Address.full_address)
  end
end

create_records('items') do
  categories = Category.all.to_a
  suppliers = Supplier.all.to_a
  warehouses = Warehouse.all.to_a

  items = 100_000.times.map do
    {
      name: Faker::Commerce.product_name,
      category_id: categories.sample.id,
      supplier_id: suppliers.sample.id,
      warehouse_id: warehouses.sample.id
    }
  end

  items.each_slice(1000) do |batch|
    Item.insert_all(batch)
  end
end

create_records('item attributes') do
  items = Item.all

  # We'll bump this up later to 1_000_000 in order to see
  # the perf issues come up.
  item_attributes = 100_000.times.map do
    {
      attribute_name: Faker::Lorem.word,
      attribute_value: Faker::Lorem.word,
      item_id: items.sample.id
    }
  end

  item_attributes.each_slice(1000) do |batch|
    ItemAttribute.insert_all(batch)
  end
end

create_records('denormalized items') do
  items_with_associations = Item.includes(:category, :supplier, :warehouse)

  denormalized_items_attributes = []

  items_with_associations.find_each(batch_size: 1000) do |item|
    denormalized_items_attributes << {
      name: item.name,
      item_id: item.id,
      category_name: item.category.name,
      category_id: item.category.id,
      supplier_name: item.supplier.name,
      supplier_id: item.supplier.id,
      warehouse_name: item.warehouse.name,
      warehouse_id: item.warehouse.id,
      created_at: DateTime.now,
      updated_at: DateTime.now
    }
  end

  denormalized_items_attributes.each_slice(1000) do |batch|
    ItemDenormalized.insert_all(batch)
  end
end

This seeding script creates records for all of our entities. You can tweak it to create more or fewer records in order to stress-test the schema, which is exactly what we'll do later.

Now, keep in mind that this runs on your local machine, so we won't be testing against production-level resources here. Ideally you'd have a production-like environment where you can try out different strategies, but the point here isn't to replicate production; it's to fully understand the complexities of working with a highly normalized schema.

When we run the seeds, we get the following output:

bundle exec rails db:seed
Truncating database...
Database truncated.
Creating categories.
Categories created. 3.226257999893278
Creating suppliers.
Suppliers created. 0.1299410001374781
Creating warehouses.
Warehouses created. 4.184017000021413
Creating items.
Items created. 7.629256000043824
Creating item attributes.
Item attributes created. 59.715396999847144
Creating denormalized items.
Denormalized items created. 12.066422999836504

Alright, let's start running some queries.

Performance? What Performance?!

So, let's say I want all items except those from McDermott-Casper (a supplier that has gone out of business). I also don't want any items that have the enim and/or modi attributes.

We can write the query with ActiveRecord like so:

excluded_suppliers =
  Supplier
    .select('id')
    .where(name: "McDermott-Casper")
    .to_sql

excluded_attributes =
  ItemAttribute
    .select(:item_id)
    .where(attribute_name: ['enim', 'modi'])
    .to_sql

Item
  .distinct
  .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
  .joins(:category, :supplier, :warehouse)
  .left_outer_joins(:item_attributes)
  .where("items.supplier_id NOT IN (#{excluded_suppliers})")
  .where("items.id NOT IN(#{excluded_attributes})")
  .to_a

The conditions that exclude items per our scenario are embedded as subqueries in the WHERE clause, while we join categories, suppliers, warehouses, and (via a left outer join) item attributes to make sure we only retrieve the items that match.

Alright, let's test it out:

bundle exec rails c
Loading development environment (Rails 7.1.3.4)
irb(main):001* excluded_suppliers =
irb(main):002>   Supplier
irb(main):003>     .select('id')
irb(main):004>     .where(name: "McDermott-Casper")
irb(main):005>     .to_sql
=> "SELECT \"suppliers\".\"id\" FROM \"suppliers\" WHERE \"suppliers\".\"name\" = 'McDermott-Casper'"
irb(main):006* excluded_attributes =
irb(main):007>   ItemAttribute
irb(main):008>     .select(:item_id)
irb(main):009>     .where(attribute_name: ['enim', 'modi'])
irb(main):010>     .to_sql
=> "SELECT \"item_attributes\".\"item_id\" FROM \"item_attributes\" WHERE \"item_attributes\".\"attribute_name\" IN ('enim', 'modi')"
irb(main):011> Item
irb(main):012>   .distinct
irb(main):013>   .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
irb(main):014>   .joins(:category, :supplier, :warehouse)
irb(main):015>   .left_outer_joins(:item_attributes)
irb(main):016>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):017>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):018>   .to_a
  Item Load (535.5ms)  SELECT DISTINCT items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name FROM "items" INNER JOIN "categories" ON "categories"."id" = "items"."category_id" INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id" INNER JOIN "warehouses" ON "warehouses"."id" = "items"."warehouse_id" LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id" WHERE (items.supplier_id NOT IN (SELECT "suppliers"."id" FROM "suppliers" WHERE "suppliers"."name" = 'McDermott-Casper')) AND (items.id NOT IN(SELECT "item_attributes"."item_id" FROM "item_attributes" WHERE "item_attributes"."attribute_name" IN ('enim', 'modi')))
=>

Awesome! We're fetching in sub-second time.

OK. Let's see what happens when we bump the number of attributes in the system up to, say, a million. We can do that by running the following code, extracted from the seed script:

  items = Item.all

  # Adding 900_000 more attributes on top of the initial 100_000 brings the
  # total to 1_000_000, which is enough to make the perf issues show up.
  item_attributes = 900_000.times.map do
    {
      attribute_name: Faker::Lorem.word,
      attribute_value: Faker::Lorem.word,
      item_id: items.sample.id
    }
  end

  item_attributes.each_slice(1000) do |batch|
    ItemAttribute.insert_all(batch)
  end

Now, remember that up to this point there were 1,187 item_attributes records matching enim or modi.

irb(main):001* excluded_suppliers =
irb(main):002>   Supplier
irb(main):003>     .select('id')
irb(main):004>     .where(name: "McDermott-Casper")
irb(main):005>     .to_sql
irb(main):006> 
=> "SELECT \"suppliers\".\"id\" FROM \"suppliers\" WHERE \"suppliers\".\"name\" = 'McDermott-Casper'"
irb(main):007* excluded_attributes =
irb(main):008>   ItemAttribute
irb(main):009>     .select(:item_id)
irb(main):010>     .where(attribute_name: ['enim', 'modi'])
irb(main):011>     .to_sql
irb(main):012> 
=> "SELECT \"item_attributes\".\"item_id\" FROM \"item_attributes\" WHERE \"item_attributes\".\"attribute_name\" IN ('enim', 'modi')"
irb(main):013> Item
irb(main):014>   .distinct
irb(main):015>   .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
irb(main):016>   .joins(:category, :supplier, :warehouse)
irb(main):017>   .left_outer_joins(:item_attributes)
irb(main):018>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):019>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):020>   .to_a
irb(main):021> 
  Item Load (3002.4ms)  SELECT DISTINCT items.id,

Whoa! OK. Now we're at 3 seconds.

This will only get worse over time as more items are added to the system, and item_attributes in particular will keep weighing on this query. When the additional 900,000+ attributes were added, the number of records matching enim or modi grew as well: we went from 1,187 to 12,154 matching records.

This kind of scale is completely normal and really shouldn't be unexpected, since the number of attributes on items in an inventory management system can grow significantly over time for all sorts of reasons. OK, so more records were added and of course performance was impacted. But what exactly is happening?

Is normalization really the issue here?

I’m going to remove the joins to categories and warehouses:

irb(main):029> Item
irb(main):030>   .distinct
irb(main):031>   .select('items.id, items.name, suppliers.name AS supplier_name')
irb(main):032>   .joins(:supplier)
irb(main):033>   .left_outer_joins(:item_attributes)
irb(main):034>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):035>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):036>   .to_a
irb(main):037>
  Item Load (1938.4ms)  SELECT DISTINCT items.id, items.name, suppliers.name AS supplier_name FROM "items" INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id" LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id" WHERE (items.supplier_id NOT IN (SELECT "suppliers"."id" FROM "suppliers" WHERE "suppliers"."name" = 'McDermott-Casper')) AND (items.id NOT IN(SELECT "item_attributes"."item_id" FROM "item_attributes" WHERE "item_attributes"."attribute_name" IN ('enim', 'modi')))
=>

Ok, so yeah, we get a ~30% improvement just by removing the joins. Let's run an EXPLAIN on these queries and try to understand what's going on.
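
If you want to reproduce the plans yourself, one option is ActiveRecord's explain helper. This is a minimal sketch, run in the same console session as above (so excluded_suppliers and excluded_attributes are still defined); on PostgreSQL with Rails 7.1+ you can also pass :analyze to execute the query and include real timings:

relation =
  Item
    .distinct
    .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
    .joins(:category, :supplier, :warehouse)
    .left_outer_joins(:item_attributes)
    .where("items.supplier_id NOT IN (#{excluded_suppliers})")
    .where("items.id NOT IN(#{excluded_attributes})")

puts relation.explain             # plain EXPLAIN
# puts relation.explain(:analyze) # EXPLAIN ANALYZE (PostgreSQL, Rails 7.1+)

The plan for the full query looks like this: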

Unique  (cost=80266.89..84016.89 rows=250000 width=99)
  ->  Sort  (cost=80266.89..80891.89 rows=250000 width=99)
        Sort Key: items.id, items.name, categories.name, suppliers.name, warehouses.name
        ->  Hash Join  (cost=20105.00..44177.93 rows=250000 width=99)
              Hash Cond: (items.warehouse_id = warehouses.id)
              ->  Hash Join  (cost=20066.50..43480.40 rows=250000 width=89)
                    Hash Cond: (items.supplier_id = suppliers.id)
                    ->  Hash Join  (cost=20030.63..42785.86 rows=250000 width=78)
                          Hash Cond: (items.category_id = categories.id)
                          ->  Hash Right Join  (cost=19998.80..42094.91 rows=250000 width=54)
                                Hash Cond: (item_attributes.item_id = items.id)
                                ->  Seq Scan on item_attributes  (cost=0.00..19471.00 rows=1000000 width=8)
                                ->  Hash  (cost=19686.30..19686.30 rows=25000 width=54)
                                      ->  Seq Scan on items  (cost=16933.30..19686.30 rows=25000 width=54)
                                            Filter: ((NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)))
                                            SubPlan 1
                                              ->  Seq Scan on suppliers suppliers_1  (cost=0.00..24.38 rows=1 width=8)
"                                                    Filter: ((name)::text = 'McDermott-Casper'::text)"
                                            SubPlan 2
                                              ->  Gather  (cost=1000.00..16878.93 rows=11996 width=8)
                                                    Workers Planned: 2
                                                    ->  Parallel Seq Scan on item_attributes item_attributes_1  (cost=0.00..14679.33 rows=4998 width=8)
"                                                          Filter: ((attribute_name)::text = ANY ('{enim,modi}'::text[]))"
                          ->  Hash  (cost=19.70..19.70 rows=970 width=40)
                                ->  Seq Scan on categories  (cost=0.00..19.70 rows=970 width=40)
                    ->  Hash  (cost=21.50..21.50 rows=1150 width=27)
                          ->  Seq Scan on suppliers  (cost=0.00..21.50 rows=1150 width=27)
              ->  Hash  (cost=26.00..26.00 rows=1000 width=26)
                    ->  Seq Scan on warehouses  (cost=0.00..26.00 rows=1000 width=26)

The plan above is telling us the output of each join is funneled into the next one:

(items <> warehouses) -> (items <> suppliers) -> (items <> categories)

Because of the multiple joins, the performance impact compounds as the data is spread out across more tables in your database, which is exactly what normalization encourages.

Now, let’s look at the plan after we remove the joins:

Unique  (cost=73750.91..76250.91 rows=250000 width=49)
  ->  Sort  (cost=73750.91..74375.91 rows=250000 width=49)
        Sort Key: items.id, items.name, suppliers.name
        ->  Hash Join  (cost=20034.68..42789.45 rows=250000 width=49)
              Hash Cond: (items.supplier_id = suppliers.id)
              ->  Hash Right Join  (cost=19998.80..42094.91 rows=250000 width=38)
                    Hash Cond: (item_attributes.item_id = items.id)
                    ->  Seq Scan on item_attributes  (cost=0.00..19471.00 rows=1000000 width=8)
                    ->  Hash  (cost=19686.30..19686.30 rows=25000 width=38)
                          ->  Seq Scan on items  (cost=16933.30..19686.30 rows=25000 width=38)
                                Filter: ((NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)))
                                SubPlan 1
                                  ->  Seq Scan on suppliers suppliers_1  (cost=0.00..24.38 rows=1 width=8)
"                                        Filter: ((name)::text = 'McDermott-Casper'::text)"
                                SubPlan 2
                                  ->  Gather  (cost=1000.00..16878.93 rows=11996 width=8)
                                        Workers Planned: 2
                                        ->  Parallel Seq Scan on item_attributes item_attributes_1  (cost=0.00..14679.33 rows=4998 width=8)
"                                              Filter: ((attribute_name)::text = ANY ('{enim,modi}'::text[]))"
              ->  Hash  (cost=21.50..21.50 rows=1150 width=27)
                    ->  Seq Scan on suppliers  (cost=0.00..21.50 rows=1150 width=27)

Ok, so we get a better query plan: fewer joins, less data to scan, and therefore better performance. However, doing this won't meet the requirements. Remember, the report needs the names of the associated suppliers and warehouses. Let's see what happens when we denormalize the data and simplify the lookup process.

irb(main):074* excluded_suppliers =
irb(main):075>   Supplier
irb(main):076>     .select('id')
irb(main):077>     .where(name: "McDermott-Casper")
irb(main):078>     .to_sql
irb(main):079> 
irb(main):080* excluded_attributes =
irb(main):081>   ItemAttribute
irb(main):082>     .select(:item_id)
irb(main):083>     .where(attribute_name: ['enim', 'modi'])
irb(main):084>     .to_sql
irb(main):085> 
irb(main):086> ItemDenormalized
irb(main):087>   .distinct
irb(main):088>   .select('items_denormalized.id as id, items_denormalized.category_name as category_name, items_denormalized.supplier_name as supplier_name, items_denormalized.warehouse_name as warehouse_name')
irb(main):089>   .joins(:supplier)
irb(main):090>   .left_outer_joins(:item_attributes)
irb(main):091>   .where("items_denormalized.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):092>   .where("items_denormalized.item_id NOT IN(#{excluded_attributes})")
irb(main):093>   .to_a
irb(main):094> 
  ItemDenormalized Load (1107.3ms)  SELECT DISTINCT items_denormalized.id as id,

In this example, the lookup on the denormalized table performed in the same ballpark as the version where we removed the joins (1107.3ms vs. 1938.4ms), with the difference that we still get the category and warehouse names. Denormalization does introduce complexities of its own that need to be handled, such as redundancy and data integrity: what happens when categories are updated, or when warehouses are deleted?
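
One common way to deal with that in a Rails app is to propagate changes from the source tables into the denormalized table. Here's a minimal sketch assuming the models and the category_id/category_name columns used by the seed script; treat it as illustrative rather than a complete integrity strategy (it doesn't cover deletions, bulk update_all calls that skip callbacks, and so on):

class Category < ApplicationRecord
  has_many :items

  # When a category is renamed, push the new name into the denormalized rows.
  after_update :sync_denormalized_name, if: :saved_change_to_name?

  private

  def sync_denormalized_name
    ItemDenormalized.where(category_id: id).update_all(category_name: name)
  end
end

Similar callbacks (or database triggers) would be needed for suppliers and warehouses, and deletions need an explicit policy, e.g. nulling out or removing the affected denormalized rows.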

Putting that aside though, we see that denormalization handles certain scenarios well when it comes to performance. We should consider its benefits when building applications that will inevitably need to scale. In our example above, with just a million records we already start to run into performance bottlenecks.

Performance Bottlenecks

Let's think through what bottlenecks start to come into play after running through the examples above.

Complex Queries

Highly normalized schemas often require complex queries with multiple joins, which can be slow and resource-intensive.

SELECT DISTINCT
    items.id,
    items.name,
    categories.name AS category_name,
    suppliers.name AS supplier_name,
    warehouses.name AS warehouse_name
FROM
    "items"
    INNER JOIN "categories" ON "categories"."id" = "items"."category_id"
    INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id"
    INNER JOIN "warehouses" ON "warehouses"."id" = "items"."warehouse_id"
    LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id"
WHERE (items.supplier_id NOT IN(
        SELECT
            "suppliers"."id" FROM "suppliers"
        WHERE
            "suppliers"."name" = 'McDermott-Casper'))
    AND(items.id NOT IN(
            SELECT
                "item_attributes"."item_id" FROM "item_attributes"
            WHERE
                "item_attributes"."attribute_name" IN('enim', 'modi')));

I wouldn't consider the query above too complex; however, conditions that execute subqueries can start to get complicated once you're joining on joins. This happens a lot in large-scale applications that have evolved over time. Again, normalization is great in an ideal world - but it's also important to understand what other complexities it introduces.

Increased I/O Operations

Each table lookup can lead to additional I/O operations, slowing down overall query performance. When we talk about I/O in the database, it's important to understand, at a high level, why it's such an important piece of the puzzle. So let's dive into some of the issues that come up at scale.

Read/Write: Each join that involves disk-based temporary tables or large data sets will increase the number of disk reads and writes. This can cause a significant I/O load, especially in applications where the behavior is quite active (jobs, high traffic, etc.).

Buffer Pool Pressure: Joins can put pressure on the MySQL buffer pool, especially with larger data sets. When the buffer pool is full, MySQL has to evict pages to make room for new data, causing additional disk I/O.

Temporary Tables:  MySQL may create temporary tables to hold intermediate results during complex join operations. These temporary tables can be stored in memory or on disk, depending on their size. Disk-based temporary tables increase I/O operations, leading to slower performance.
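
If you suspect that disk-based temporary tables are part of the problem, one quick check is MySQL's Created_tmp_* status counters. Here's a minimal sketch from a Rails console, assuming a MySQL connection (the EXPLAIN output earlier came from PostgreSQL, where the temp_files and temp_bytes columns of pg_stat_database play a similar role):

# Sketch: check how often MySQL has had to spill temporary tables to disk.
rows = ActiveRecord::Base.connection.select_rows(
  "SHOW GLOBAL STATUS LIKE 'Created_tmp%'"
)
rows.each { |name, value| puts "#{name}: #{value}" }
# A high ratio of Created_tmp_disk_tables to Created_tmp_tables suggests that
# intermediate results from joins and sorts are spilling to disk and adding I/O.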

Lock Contention

In a highly concurrent environment, frequent access and updates across multiple tables can lead to lock contention and further degrade performance.

Multiple Joins

Lock Types: MySQL uses different types of locks (e.g., shared, exclusive) depending on the operation. Complex queries with multiple joins can require various locks, leading to contention if different parts of the query need the same resources.

Row-Level vs. Table-Level Locks: InnoDB uses row-level locking, which is generally more efficient than table-level locking used by MyISAM. However, even row-level locks can cause contention if multiple transactions try to modify the same rows simultaneously.

Joins on Joins

Increased Lock Duration: Queries involving joins on joins often take longer to execute. The longer a transaction holds locks, the higher the chance of contention with other transactions.

Lock Escalation: Although InnoDB uses row-level locking, high contention can sometimes cause lock escalation, where the database engine escalates to table-level locks to manage the contention, leading to broader performance issues. This is typically due to missing or inadequate indexes.

Lock Waits and Deadlocks

Lock Waits: When a transaction needs a lock held by another transaction, it must wait, leading to increased query execution time and potential timeouts.

Deadlocks: Complex queries with multiple joins increase the risk of deadlocks, where two or more transactions are waiting on each other's locks. The database resolves this by automatically rolling back one of the transactions, the "victim", so the others can proceed.
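
At the application layer, the usual mitigations are keeping transactions short and retrying the victim. Here's a minimal sketch of a retry wrapper, assuming Rails (ActiveRecord raises ActiveRecord::Deadlocked when our transaction is picked as the victim); the attempt count and the update inside the block are illustrative only:

# Sketch: retry a transaction that was chosen as the deadlock victim.
def with_deadlock_retry(attempts: 3)
  tries = 0
  begin
    ActiveRecord::Base.transaction { yield }
  rescue ActiveRecord::Deadlocked
    tries += 1
    raise if tries >= attempts
    sleep(0.05 * tries) # brief backoff before retrying
    retry
  end
end

with_deadlock_retry do
  Item.where(supplier_id: 42).update_all(warehouse_id: 7)
end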

Strategies for Optimization

To mitigate performance issues in highly normalized architectures, consider the following strategies:

Denormalization

The process for denormalizing data involves adding redundant data to tables to reduce the number of joins required. While this increases storage requirements and the risk of data anomalies, it can significantly improve read performance.

SELECT i.id, i.name, i.category_name, i.supplier_name, i.warehouse_name, i.attribute_value
FROM items_denormalized i
WHERE i.id = ?

In this example, the items_denormalized table combines data from the categories, suppliers, warehouses, and item_attributes tables, eliminating the need for multiple joins.

Indexing

Proper indexing can dramatically improve query performance. Ensure that the columns used in joins and WHERE clauses are indexed. Remember, indexes are essential for avoiding full table scans (and the lock contention that tends to come with them). Keep in mind that even this won't help if your joins produce temporary tables, which will NOT have indexes.

CREATE INDEX idx_items_id ON items(id);
CREATE INDEX idx_categories_id ON categories(id);
CREATE INDEX idx_suppliers_id ON suppliers(id);
CREATE INDEX idx_warehouses_id ON warehouses(id);
CREATE INDEX idx_item_attributes_item_id ON item_attributes(item_id);
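
Note that primary key columns like items.id are already backed by an index, so the statements above are mostly illustrative. In our scenario the bigger wins come from indexing the columns the report query actually filters and joins on. Here's a sketch of what that could look like as a Rails migration; the column choices mirror the query above, and whether the foreign key columns already have indexes depends on how (and on which database) the tables were created:

# Sketch: index the columns the report query filters and joins on.
class AddReportIndexes < ActiveRecord::Migration[7.1]
  def change
    add_index :items, :supplier_id
    add_index :items, :category_id
    add_index :items, :warehouse_id
    add_index :suppliers, :name
    # Covers the subquery filtering item_attributes by attribute_name.
    add_index :item_attributes, [:attribute_name, :item_id]
  end
end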

Caching

Implement caching mechanisms to store frequently accessed data in memory, reducing the need for repeated database queries. There are multiple strategies for caching, which will be covered in a different post, ranging from summary tables to external stores (like Redis) that hold results temporarily.

# Example using Ruby on Rails with Redis cache
item = Rails.cache.fetch("item_#{id}", expires_in: 12.hours) do
  Item.includes(:category, :supplier, :warehouse, :item_attributes).find(id)
end

Query Optimization

Analyze and optimize your queries to ensure they are as efficient as possible. Use tools like MySQL’s EXPLAIN ANALYZE statement to understand the execution plan and identify bottlenecks.

EXPLAIN SELECT i.id, i.name, c.name AS category, s.name AS supplier, w.name AS warehouse, ia.attribute_value
FROM items i
JOIN categories c ON i.category_id = c.id
JOIN suppliers s ON i.supplier_id = s.id
JOIN warehouses w ON i.warehouse_id = w.id
JOIN item_attributes ia ON i.id = ia.item_id
WHERE i.id = 1;

Conclusion

Normalization is a powerful technique for maintaining data integrity, but it can lead to performance challenges in large-scale applications. Knowing the tradeoffs can help you scale your application over the long term, with denormalization treated as just another strategy for scaling. If denormalization isn't a good fit, consider reviewing your indexes (including composite indexes), result caching, and query optimization to improve performance. Thank you for reading, and please reach out if you have any questions!

References

Microsoft 365 - Description of the database normalization basics

Coding Horror - Maybe Normalizing Isn't Normal

informIT - When You Can't Change a SQL Database Design

PureStorage - Denormalized vs. Normalized Data

MySQL - Buffer Pool

MySQL - InnoDB Disk I/O

MySQL - Internal Temporary Table Use in MySQL

MySQL - Locks Set by Different SQL Statements in InnoDB

Percona - Understanding Hash Joins in MySQL 8

Percona - Horizontal Scaling in MySQL – Sharding Followup

PlanetScale - How to Scale your Database and when to Shard MySQL

Awesome - Database Design

Awesome - MySQL
