Tips for Working with Complex Normalized Databases

We've all learned about the benefits of data normalization. To summarize without boring you with the details:

Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.

Microsoft 365 - Description of the database normalization basics

To be honest, normalization never really crossed my mind until recently, when I had to deal with several "highly normalized" legacy applications. And when I say "highly normalized", I mean normalized to the point where it no longer made sense. It reminded me of the great Coding Horror article: Maybe Normalizing Isn't Normal.

The thing is, unless you're particularly unlucky, you rarely have to worry about this. Instead of talking about it hypothetically, let's walk through a specific scenario and try different techniques to get a feel for the complexity of the topic. After working through the scenario, we'll dig into the techniques to better understand why a highly normalized architecture can become a problem, and review optimizations you might consider to improve the experience.

You can check out the code for this article here.

You're working on an existing, large SaaS (software-as-a-service) inventory management system. The system is made up of inventory items, and each item has a category, a supplier, a warehouse, and various attributes. A customer has requested a report, and this report needs to show item details including the supplier name and the warehouse name.

Here's the simplified schema, without multi-tenancy (to keep things simple).

Each item references an entry in the categories, suppliers, and warehouses tables. Each item's attributes are stored in the item_attributes table. This all makes sense and is pretty easy to follow.

CREATE TABLE items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category_id INT,
    supplier_id INT,
    warehouse_id INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (category_id) REFERENCES categories(id),
    FOREIGN KEY (supplier_id) REFERENCES suppliers(id),
    FOREIGN KEY (warehouse_id) REFERENCES warehouses(id)
);

CREATE TABLE categories (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE suppliers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE warehouses (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    location VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE item_attributes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    item_id INT,
    attribute_name VARCHAR(255) NOT NULL,
    attribute_value VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (item_id) REFERENCES items(id)
);

-- To illustrate the denormalization strategy mentioned, here’s an example of a denormalized items_denormalized table:

CREATE TABLE items_denormalized (
    id INT AUTO_INCREMENT PRIMARY KEY,
    -- Source ids are kept alongside the copied names so rows can be found and refreshed later.
    item_id INT,
    name VARCHAR(255) NOT NULL,
    category_id INT,
    category_name VARCHAR(255),
    supplier_id INT,
    supplier_name VARCHAR(255),
    warehouse_id INT,
    warehouse_name VARCHAR(255),
    attribute_name VARCHAR(255),
    attribute_value VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE INDEX idx_items_id ON items(id);
CREATE INDEX idx_categories_id ON categories(id);
CREATE INDEX idx_suppliers_id ON suppliers(id);
CREATE INDEX idx_warehouses_id ON warehouses(id);
CREATE INDEX idx_item_attributes_item_id ON item_attributes(item_id);

Seeding Data

For any performance work, it's important to be able to reproduce the scale we expect, so we can get a good idea of how the application will behave. So I put together the following seed script:

require 'faker'

def create_records(message, &block)
  puts "Creating #{message}."
  starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield if block_given?
  ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  elapsed = ending - starting
  puts "#{message.capitalize} created. #{elapsed}"
end

puts 'Truncating database...'
ActiveRecord::Tasks::DatabaseTasks.truncate_all
puts 'Database truncated.'

create_records('categories') do
  10.times do
    Category.create(name: Faker::Book.genre)
  end
end

create_records('suppliers') do
  25.times do
    Supplier.create(name: Faker::Company.name)
  end
end

create_records('warehouses') do
  1000.times do
    Warehouse.create(name: Faker::Company.name, location: Faker::Address.full_address)
  end
end

create_records('items') do
  categories = Category.all.to_a
  suppliers = Supplier.all.to_a
  warehouses = Warehouse.all.to_a

  items = 100_000.times.map do
    {
      name: Faker::Commerce.product_name,
      category_id: categories.sample.id,
      supplier_id: suppliers.sample.id,
      warehouse_id: warehouses.sample.id
    }
  end

  items.each_slice(1000) do |batch|
    Item.insert_all(batch)
  end
end

create_records('item attributes') do
  items = Item.all

  # We'll bump this up later to 1_000_000 in order to see
  # the perf issues come up.
  item_attributes = 100_000.times.map do
    {
      attribute_name: Faker::Lorem.word,
      attribute_value: Faker::Lorem.word,
      item_id: items.sample.id
    }
  end

  item_attributes.each_slice(1000) do |batch|
    ItemAttribute.insert_all(batch)
  end
end

create_records('denormalized items') do
  items_with_associations = Item.includes(:category, :supplier, :warehouse)

  denormalized_items_attributes = []

  items_with_associations.find_each(batch_size: 1000) do |item|
    denormalized_items_attributes << {
      name: item.name,
      item_id: item.id,
      category_name: item.category.name,
      category_id: item.category.id,
      supplier_name: item.supplier.name,
      supplier_id: item.supplier.id,
      warehouse_name: item.warehouse.name,
      warehouse_id: item.warehouse.id,
      created_at: DateTime.now,
      updated_at: DateTime.now
    }
  end

  denormalized_items_attributes.each_slice(1000) do |batch|
    ItemDenormalized.insert_all(batch)
  end
end

This seed script creates records for every entity. You can tweak it to create more or fewer records in order to stress-test the architecture, which is exactly what we'll do in a moment.

Now, keep in mind this is running on my local machine, so we're not testing against production-grade resources here. Ideally you'd have a production-like environment in which to experiment with the different strategies. The point here, though, isn't to reproduce production; it's to build a good understanding of the complexities of working with a highly normalized architecture.

Running the seeds produces the following log:

bundle exec rails db:seed
Truncating database...
Database truncated.
Creating categories.
Categories created. 3.226257999893278
Creating suppliers.
Suppliers created. 0.1299410001374781
Creating warehouses.
Warehouses created. 4.184017000021413
Creating items.
Items created. 7.629256000043824
Creating item attributes.
Item attributes created. 59.715396999847144
Creating denormalized items.
Denormalized items created. 12.066422999836504

Alright, let's run some queries.

Performance? What Performance?!

Let's say we want all items except those from McDermott-Casper, a supplier that went out of business. We also don't want any items that have the enim and/or modi attributes attached.

With ActiveRecord, the query is pretty easy to put together:

excluded_suppliers =
  Supplier
    .select('id')
    .where(name: "McDermott-Casper")
    .to_sql

excluded_attributes =
  ItemAttribute
    .select(:item_id)
    .where(attribute_name: ['enim', 'modi'])
    .to_sql

Item
  .distinct
  .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
  .joins(:category, :supplier, :warehouse)
  .left_outer_joins(:item_attributes)
  .where("items.supplier_id NOT IN (#{excluded_suppliers})")
  .where("items.id NOT IN(#{excluded_attributes})")
  .to_a

Per the scenario, the conditions that exclude items are expressed as subqueries inside the WHERE clause, and we join categories, suppliers, warehouses, and (via a left outer join) item attributes so that only items matching the conditions are retrieved.

Alright, let's put this to the test.

bundle exec rails c
Loading development environment (Rails 7.1.3.4)
irb(main):001* excluded_suppliers =
irb(main):002>   Supplier
irb(main):003>     .select('id')
irb(main):004>     .where(name: "McDermott-Casper")
irb(main):005>     .to_sql
=> "SELECT \"suppliers\".\"id\" FROM \"suppliers\" WHERE \"suppliers\".\"name\" = 'McDermott-Casper'"
irb(main):006* excluded_attributes =
irb(main):007>   ItemAttribute
irb(main):008>     .select(:item_id)
irb(main):009>     .where(attribute_name: ['enim', 'modi'])
irb(main):010>     .to_sql
=> "SELECT \"item_attributes\".\"item_id\" FROM \"item_attributes\" WHERE \"item_attributes\".\"attribute_name\" IN ('enim', 'modi')"
irb(main):011> Item
irb(main):012>   .distinct
irb(main):013>   .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
irb(main):014>   .joins(:category, :supplier, :warehouse)
irb(main):015>   .left_outer_joins(:item_attributes)
irb(main):016>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):017>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):018>   .to_a
  Item Load (535.5ms)  SELECT DISTINCT items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name FROM "items" INNER JOIN "categories" ON "categories"."id" = "items"."category_id" INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id" INNER JOIN "warehouses" ON "warehouses"."id" = "items"."warehouse_id" LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id" WHERE (items.supplier_id NOT IN (SELECT "suppliers"."id" FROM "suppliers" WHERE "suppliers"."name" = 'McDermott-Casper')) AND (items.id NOT IN(SELECT "item_attributes"."item_id" FROM "item_attributes" WHERE "item_attributes"."attribute_name" IN ('enim', 'modi')))
=>

Nice! We're fetching the data in under a second.

Ok, now let's see what happens when we bump the number of attributes in the system up to a million. We can do that by running the following code, pulled from the seed script:

  items = Item.all

  # Adding 900,000 more attributes brings the total to ~1,000,000,
  # which is enough to make the perf issues show up.
  item_attributes = 900_000.times.map do
    {
      attribute_name: Faker::Lorem.word,
      attribute_value: Faker::Lorem.word,
      item_id: items.sample.id
    }
  end

  item_attributes.each_slice(1000) do |batch|
    ItemAttribute.insert_all(batch)
  end

Remember from above that there were 1,187 item_attribute records matching enim or modi.

irb(main):001* excluded_suppliers =
irb(main):002>   Supplier
irb(main):003>     .select('id')
irb(main):004>     .where(name: "McDermott-Casper")
irb(main):005>     .to_sql
irb(main):006> 
=> "SELECT \"suppliers\".\"id\" FROM \"suppliers\" WHERE \"suppliers\".\"name\" = 'McDermott-Casper'"
irb(main):007* excluded_attributes =
irb(main):008>   ItemAttribute
irb(main):009>     .select(:item_id)
irb(main):010>     .where(attribute_name: ['enim', 'modi'])
irb(main):011>     .to_sql
irb(main):012> 
=> "SELECT \"item_attributes\".\"item_id\" FROM \"item_attributes\" WHERE \"item_attributes\".\"attribute_name\" IN ('enim', 'modi')"
irb(main):013> Item
irb(main):014>   .distinct
irb(main):015>   .select('items.id, items.name, categories.name AS category_name, suppliers.name AS supplier_name, warehouses.name AS warehouse_name')
irb(main):016>   .joins(:category, :supplier, :warehouse)
irb(main):017>   .left_outer_joins(:item_attributes)
irb(main):018>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):019>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):020>   .to_a
irb(main):021> 
  Item Load (3002.4ms)  SELECT DISTINCT items.id,

Whoa! Ok, now we're at 3 seconds.

The problem will only get worse over time, as more items get added to the system and the related item_attributes keep weighing on this particular query. With the extra 900,000 attributes, the number of records matching enim or modi grew as well; in fact, the count went from 1,187 to 12,154.
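
If you want to verify that count yourself, a quick aggregate over item_attributes (a trivial sketch against the schema above) does it:

-- Count how many attribute rows would be excluded by the report's filter.
SELECT COUNT(*)
FROM item_attributes
WHERE attribute_name IN ('enim', 'modi');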

This kind of scale is completely normal and really shouldn't be unexpected, since the number of attributes per item can grow significantly over time in an inventory management system for all sorts of reasons. Ok, so more records were added; of course performance would be impacted. But what exactly is happening?

Is normalization really the issue here?

I’m going to remove the joins to categories and warehouses:

irb(main):029> Item
irb(main):030>   .distinct
irb(main):031>   .select('items.id, items.name, suppliers.name AS supplier_name')
irb(main):032>   .joins(:supplier)
irb(main):033>   .left_outer_joins(:item_attributes)
irb(main):034>   .where("items.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):035>   .where("items.id NOT IN(#{excluded_attributes})")
irb(main):036>   .to_a
irb(main):037>
  Item Load (1938.4ms)  SELECT DISTINCT items.id, items.name, suppliers.name AS supplier_name FROM "items" INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id" LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id" WHERE (items.supplier_id NOT IN (SELECT "suppliers"."id" FROM "suppliers" WHERE "suppliers"."name" = 'McDermott-Casper')) AND (items.id NOT IN(SELECT "item_attributes"."item_id" FROM "item_attributes" WHERE "item_attributes"."attribute_name" IN ('enim', 'modi')))
=>

Ok, so yeah, we get a ~30% improvement just by removing those joins. Let's run an EXPLAIN on both queries and try to understand what's going on.
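
If you want to reproduce the plans yourself, prefixing the SQL that ActiveRecord logged above with EXPLAIN is all it takes; here it is for the original query:

EXPLAIN
SELECT DISTINCT items.id, items.name,
       categories.name AS category_name,
       suppliers.name AS supplier_name,
       warehouses.name AS warehouse_name
FROM items
INNER JOIN categories ON categories.id = items.category_id
INNER JOIN suppliers ON suppliers.id = items.supplier_id
INNER JOIN warehouses ON warehouses.id = items.warehouse_id
LEFT OUTER JOIN item_attributes ON item_attributes.item_id = items.id
WHERE items.supplier_id NOT IN (SELECT id FROM suppliers WHERE name = 'McDermott-Casper')
  AND items.id NOT IN (SELECT item_id FROM item_attributes WHERE attribute_name IN ('enim', 'modi'));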

Unique  (cost=80266.89..84016.89 rows=250000 width=99)
  ->  Sort  (cost=80266.89..80891.89 rows=250000 width=99)
        Sort Key: items.id, items.name, categories.name, suppliers.name, warehouses.name
        ->  Hash Join  (cost=20105.00..44177.93 rows=250000 width=99)
              Hash Cond: (items.warehouse_id = warehouses.id)
              ->  Hash Join  (cost=20066.50..43480.40 rows=250000 width=89)
                    Hash Cond: (items.supplier_id = suppliers.id)
                    ->  Hash Join  (cost=20030.63..42785.86 rows=250000 width=78)
                          Hash Cond: (items.category_id = categories.id)
                          ->  Hash Right Join  (cost=19998.80..42094.91 rows=250000 width=54)
                                Hash Cond: (item_attributes.item_id = items.id)
                                ->  Seq Scan on item_attributes  (cost=0.00..19471.00 rows=1000000 width=8)
                                ->  Hash  (cost=19686.30..19686.30 rows=25000 width=54)
                                      ->  Seq Scan on items  (cost=16933.30..19686.30 rows=25000 width=54)
                                            Filter: ((NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)))
                                            SubPlan 1
                                              ->  Seq Scan on suppliers suppliers_1  (cost=0.00..24.38 rows=1 width=8)
"                                                    Filter: ((name)::text = 'McDermott-Casper'::text)"
                                            SubPlan 2
                                              ->  Gather  (cost=1000.00..16878.93 rows=11996 width=8)
                                                    Workers Planned: 2
                                                    ->  Parallel Seq Scan on item_attributes item_attributes_1  (cost=0.00..14679.33 rows=4998 width=8)
"                                                          Filter: ((attribute_name)::text = ANY ('{enim,modi}'::text[]))"
                          ->  Hash  (cost=19.70..19.70 rows=970 width=40)
                                ->  Seq Scan on categories  (cost=0.00..19.70 rows=970 width=40)
                    ->  Hash  (cost=21.50..21.50 rows=1150 width=27)
                          ->  Seq Scan on suppliers  (cost=0.00..21.50 rows=1150 width=27)
              ->  Hash  (cost=26.00..26.00 rows=1000 width=26)
                    ->  Seq Scan on warehouses  (cost=0.00..26.00 rows=1000 width=26)

The plan above is telling us the output of each join is funneled into the next one:

(items <> warehouses) -> (items <> suppliers) -> (items <> categories)

Because of the multiple joins, the performance impact compounds as more data is spread out across the database, which is exactly what normalization encourages.

Now, let’s look at the plan after we remove the joins:

Unique  (cost=73750.91..76250.91 rows=250000 width=49)
  ->  Sort  (cost=73750.91..74375.91 rows=250000 width=49)
        Sort Key: items.id, items.name, suppliers.name
        ->  Hash Join  (cost=20034.68..42789.45 rows=250000 width=49)
              Hash Cond: (items.supplier_id = suppliers.id)
              ->  Hash Right Join  (cost=19998.80..42094.91 rows=250000 width=38)
                    Hash Cond: (item_attributes.item_id = items.id)
                    ->  Seq Scan on item_attributes  (cost=0.00..19471.00 rows=1000000 width=8)
                    ->  Hash  (cost=19686.30..19686.30 rows=25000 width=38)
                          ->  Seq Scan on items  (cost=16933.30..19686.30 rows=25000 width=38)
                                Filter: ((NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)))
                                SubPlan 1
                                  ->  Seq Scan on suppliers suppliers_1  (cost=0.00..24.38 rows=1 width=8)
"                                        Filter: ((name)::text = 'McDermott-Casper'::text)"
                                SubPlan 2
                                  ->  Gather  (cost=1000.00..16878.93 rows=11996 width=8)
                                        Workers Planned: 2
                                        ->  Parallel Seq Scan on item_attributes item_attributes_1  (cost=0.00..14679.33 rows=4998 width=8)
"                                              Filter: ((attribute_name)::text = ANY ('{enim,modi}'::text[]))"
              ->  Hash  (cost=21.50..21.50 rows=1150 width=27)
                    ->  Seq Scan on suppliers  (cost=0.00..21.50 rows=1150 width=27)

Ok, so we get a better query plan. Fewer joins mean less data to scan and therefore better performance. However, doing this won't meet the requirements. Remember, the report needs the names of the associated suppliers and warehouses. Let's see what happens when we denormalize the data and simplify the lookup process.

irb(main):074* excluded_suppliers =
irb(main):075>   Supplier
irb(main):076>     .select('id')
irb(main):077>     .where(name: "McDermott-Casper")
irb(main):078>     .to_sql
irb(main):079> 
irb(main):080* excluded_attributes =
irb(main):081>   ItemAttribute
irb(main):082>     .select(:item_id)
irb(main):083>     .where(attribute_name: ['enim', 'modi'])
irb(main):084>     .to_sql
irb(main):085> 
irb(main):086> ItemDenormalized
irb(main):087>   .distinct
irb(main):088>   .select('items_denormalized.id as id, items_denormalized.category_name as category_name, items_denormalized.supplier_name as supplier_name, items_denormalized.warehouse_name as warehouse_name')
irb(main):089>   .joins(:supplier)
irb(main):090>   .left_outer_joins(:item_attributes)
irb(main):091>   .where("items_denormalized.supplier_id NOT IN (#{excluded_suppliers})")
irb(main):092>   .where("items_denormalized.item_id NOT IN(#{excluded_attributes})")
irb(main):093>   .to_a
irb(main):094> 
  ItemDenormalized Load (1107.3ms)  SELECT DISTINCT items_denormalized.id as id,

In this example, the lookup on the denormalized table performed comparably to the version where we removed the joins (1107.3ms vs. 1938.4ms), with the difference that we still get the category and warehouse names. Denormalization does introduce complexities of its own that need to be handled, such as redundancy and data integrity: what happens when categories are updated, or when warehouses are deleted?
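
As a rough illustration of that maintenance burden, here is a hedged sketch (not something from the original system) of what has to happen whenever a category is renamed: every copied name in the denormalized table needs to be refreshed too, whether via a statement like this, a trigger, or a background job.

-- Hypothetical sync step after renaming category 42 (a placeholder id): refresh the
-- copies that items_denormalized keeps for every affected row. Skip this and the
-- report silently serves stale names.
UPDATE items_denormalized AS d
SET category_name = (SELECT c.name FROM categories c WHERE c.id = d.category_id)
WHERE d.category_id = 42;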

Putting that aside though, we can see that denormalization handles certain scenarios well when it comes to performance. We should consider its benefits when building applications that will inevitably need to scale. In our example above, with just a million records, we already start to run into performance bottlenecks.

Performance Bottlenecks

Let's think through what bottlenecks start to come into play after running through the examples above.

Complex Queries

Highly normalized schemas often require complex queries with multiple joins, which can be slow and resource-intensive.

SELECT DISTINCT
    items.id,
    items.name,
    categories.name AS category_name,
    suppliers.name AS supplier_name,
    warehouses.name AS warehouse_name
FROM
    "items"
    INNER JOIN "categories" ON "categories"."id" = "items"."category_id"
    INNER JOIN "suppliers" ON "suppliers"."id" = "items"."supplier_id"
    INNER JOIN "warehouses" ON "warehouses"."id" = "items"."warehouse_id"
    LEFT OUTER JOIN "item_attributes" ON "item_attributes"."item_id" = "items"."id"
WHERE (items.supplier_id NOT IN(
        SELECT
            "suppliers"."id" FROM "suppliers"
        WHERE
            "suppliers"."name" = 'McDermott-Casper'))
    AND(items.id NOT IN(
            SELECT
                "item_attributes"."item_id" FROM "item_attributes"
            WHERE
                "item_attributes"."attribute_name" IN('enim', 'modi')));

I wouldn't consider the above too complex; however, conditions that execute subqueries can start to get genuinely complex once those subqueries join on joins of their own, as sketched below. This happens a lot in large-scale applications that have evolved over time. Again, normalization is great in an ideal world - but it is also important to understand what other complexities it introduces.
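
For example, imagine the report later needs to exclude items that carry a given attribute only when they sit in certain warehouses; the exclusion subquery now has to do its own joins before it can decide anything. This is a made-up variation on the scenario (the location filter is purely illustrative), but it shows how these conditions snowball:

-- Hypothetical extension of the report's filter: the NOT IN subquery now joins
-- item_attributes back to items and warehouses before it can produce the exclusion list.
SELECT i.id, i.name
FROM items i
WHERE i.id NOT IN (
    SELECT ia.item_id
    FROM item_attributes ia
    JOIN items i2 ON i2.id = ia.item_id
    JOIN warehouses w ON w.id = i2.warehouse_id
    WHERE ia.attribute_name = 'enim'
      AND w.location LIKE '%Street%'
);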

Increased I/O Operations

Each additional table lookup can lead to additional I/O operations, slowing down overall query performance. When we start talking about I/O in the database, it's worth understanding, at a high level, why it's such an important part of the puzzle. So let's dig into some of the issues that come up at scale.

Read/Write: Each join that involves disk-based temporary tables or large data sets increases the number of disk reads and writes. This can cause significant I/O load, especially in applications with a lot of activity (background jobs, high traffic, etc.).

Buffer Pool Pressure: Joins can put pressure on the MySQL buffer pool, especially with larger data sets. When the buffer pool is full, MySQL has to evict pages to make room for new data, causing additional disk I/O.

Temporary Tables:  MySQL may create temporary tables to hold intermediate results during complex join operations. These temporary tables can be stored in memory or on disk, depending on their size. Disk-based temporary tables increase I/O operations, leading to slower performance.
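
If you suspect your joins are spilling into disk-based temporary tables, MySQL's status counters give a quick read on it (a small sketch; the size thresholds that push a temporary table to disk are governed by tmp_table_size and max_heap_table_size):

-- Snapshot these before and after running the report query; a growing
-- Created_tmp_disk_tables means intermediate results are hitting disk.
SHOW GLOBAL STATUS LIKE 'Created_tmp%';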

Lock Contention

In a highly concurrent environment, frequent access and updates across multiple tables can lead to lock contention and further degrade performance.

Multiple Joins

Lock Types: MySQL uses different types of locks (e.g., shared, exclusive) depending on the operation. Complex queries with multiple joins can require various locks, leading to contention if different parts of the query need the same resources.

Row-Level vs. Table-Level Locks: InnoDB uses row-level locking, which is generally more efficient than table-level locking used by MyISAM. However, even row-level locks can cause contention if multiple transactions try to modify the same rows simultaneously.

Joins on Joins

Increased Lock Duration: Queries involving joins on joins often take longer to execute. The longer a transaction holds locks, the higher the chance of contention with other transactions.

Lock Escalation: Although InnoDB uses row-level locking, a query that cannot use a suitable index has to lock every row it scans, which effectively behaves like a table-level lock and leads to broader performance issues. This is typically due to missing or inadequate indexes.

Lock Waits and Deadlocks

Lock Waits: When a transaction needs a lock held by another transaction, it must wait, leading to increased query execution time and potential timeouts.

Deadlocks: Complex queries with multiple joins increase the risk of deadlocks, where two or more transactions are waiting on each other's locks. The database resolves this by automatically rolling back one of the transactions (the "victim").
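
When you do hit one, InnoDB keeps a record of the most recent deadlock, so a quick way to see which statements and locks were involved (MySQL-specific, a sketch) is:

-- The "LATEST DETECTED DEADLOCK" section of this output shows both transactions,
-- the statements they were running, and which one was rolled back as the victim.
SHOW ENGINE INNODB STATUS;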

Strategies for Optimization

To mitigate performance issues in highly normalized architectures, consider the following strategies:

Denormalization

The process for denormalizing data involves adding redundant data to tables to reduce the number of joins required. While this increases storage requirements and the risk of data anomalies, it can significantly improve read performance.

SELECT i.id, i.name, i.category_name, i.supplier_name, i.warehouse_name, i.attribute_value
FROM items_denormalized i
WHERE i.id = ?

In this example, the items_denormalized table combines data from the categories, suppliers, warehouses, and item_attributes tables, eliminating the need for multiple joins.
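
For completeness, here is a hedged sketch of how such a table could be backfilled from the normalized tables in a single pass (the post above seeds it from Ruby instead; the attribute columns are left out to keep it to one row per item):

-- One-time backfill of the denormalized table from the normalized schema.
INSERT INTO items_denormalized
    (item_id, name, category_id, category_name,
     supplier_id, supplier_name, warehouse_id, warehouse_name)
SELECT i.id, i.name, c.id, c.name, s.id, s.name, w.id, w.name
FROM items i
JOIN categories c ON c.id = i.category_id
JOIN suppliers s ON s.id = i.supplier_id
JOIN warehouses w ON w.id = i.warehouse_id;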

Indexing

Proper indexing can dramatically improve query performance. Ensure that all columns used in joins and WHERE clauses are indexed. Remember, indexes are also important for keeping InnoDB from locking far more rows than necessary. Keep in mind that even this will not help if your joins produce temporary tables, which will NOT have indexes.

CREATE INDEX idx_items_id ON items(id);
CREATE INDEX idx_categories_id ON categories(id);
CREATE INDEX idx_suppliers_id ON suppliers(id);
CREATE INDEX idx_warehouses_id ON warehouses(id);
CREATE INDEX idx_item_attributes_item_id ON item_attributes(item_id);
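
One note on the statements above: the id columns are already indexed by virtue of being primary keys. Given the schema in this post, the columns that actually show up in the joins and WHERE clauses are the foreign keys and attribute_name, so indexes along these lines (a sketch, with hypothetical names) are what the advice really calls for:

-- Index the columns the report query joins and filters on.
CREATE INDEX idx_items_category_id ON items(category_id);
CREATE INDEX idx_items_supplier_id ON items(supplier_id);
CREATE INDEX idx_items_warehouse_id ON items(warehouse_id);
CREATE INDEX idx_item_attributes_attribute_name ON item_attributes(attribute_name);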

Caching

Implement caching mechanisms to store frequently accessed data in memory, reducing the need for repeated database queries. There are multiple strategies for implementing caching, which will be covered in a different post; they range from utilizing summary tables to integrating technologies that can store results temporarily.

# Example using Ruby on Rails with Redis cache
item = Rails.cache.fetch("item_#{id}", expires_in: 12.hours) do
  Item.includes(:category, :supplier, :warehouse, :item_attributes).find(id)
end

Query Optimization

Analyze and optimize your queries to ensure they are as efficient as possible. Use tools like MySQL's EXPLAIN and EXPLAIN ANALYZE statements to understand the execution plan and identify bottlenecks.

EXPLAIN SELECT i.id, i.name, c.name AS category, s.name AS supplier, w.name AS warehouse, ia.attribute_value
FROM items i
JOIN categories c ON i.category_id = c.id
JOIN suppliers s ON i.supplier_id = s.id
JOIN warehouses w ON i.warehouse_id = w.id
JOIN item_attributes ia ON i.id = ia.item_id
WHERE i.id = 1;

Conclusion

Normalization is a powerful technique for maintaining data integrity, but it can lead to performance challenges in large-scale applications. Knowing the tradeoffs can help you scale your application over the long term, treating denormalization as just another strategy that can help you scale. If denormalization isn't a good fit, consider reviewing indexes (including composite indexes), result caching, and query optimization to improve performance. Thank you for reading, and please reach out if you have any questions!

References

Microsoft 365 - Description of the database normalization basics

Coding Horror - Maybe Normalizing Isn't Normal

informIT - When You Can't Change a SQL Database Design

PureStorage - Denormalized vs. Normalized Data.

MySQL - Buffer Pool

MySQL - InnoDB Disk I/O

MySQL - Internal Temporary Table Use in MySQL

MySQL - Locks Set by Different SQL Statements in InnoDB

Percona - Understanding Hash Joins in MySQL 8

Percona - Horizontal Scaling in MySQL – Sharding Followup

PlanetScale - How to Scale your Database and when to Shard MySQL

Awesome - Database Design

Awesome - MySQL
