How to Handle Large Amounts of Data


WBOY

Jun 07, 2016 pm 02:59 PM


Speeding Up Processing of Very Large Databases: Table Partitioning

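With a huge volume of data, queries are not the only pain point; deletes hurt as well.

Table partitioning helps noticeably, especially for deletes, which become both convenient and fast. Once a table has been partitioned by rule, you can simply TRUNCATE a partition, and its data and indexes are removed quickly.

As for query speed, indexes are indispensable. (How to build efficient indexes is outside the scope of this article.)

Load balancing is another option. PostgreSQL combined with postgresforest can give very good results. (Its core idea is also table partitioning.)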

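PostgreSQL supports basic table partitioning. This section describes why you might want table partitioning and how to implement it in your database design.

Overview

Partitioning means splitting what is logically one large table into several smaller physical pieces. It can provide several benefits:

Query performance can improve dramatically for certain kinds of queries.

Update performance can improve too, because each piece of the table has indexes that are smaller than an index on the entire data set would be. When an index no longer fits entirely in memory, both reads and writes on the index cause more disk accesses.

Bulk deletes can be done by simply removing one of the partitions, provided that requirement was planned into the partitioning design. DROP TABLE is much faster than a bulk DELETE, not least because it avoids the VACUUM overhead that follows.

Seldom-used data can be moved to cheaper, slower storage media.

These benefits are normally worthwhile only when a table would otherwise grow very large. The exact point at which a table benefits from partitioning depends on the application, but a rule of thumb is that the table's size should exceed the physical memory of the database server.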

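Currently, PostgreSQL supports partitioning via table inheritance. Each partition must be created as a child table of a single parent table. The parent table itself is normally empty; it exists only to represent the entire data set. You should be familiar with inheritance (see Section 5.8) before attempting to implement partitioning.

The following forms of partitioning can be implemented in PostgreSQL:

Range partitioning

The table is partitioned into "ranges" defined by one or more key columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.

List partitioning

The table is partitioned by explicitly listing which key values appear in each partition.

Hash partitioning is not yet supported.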

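Implementing Partitioning

To set up a partitioned table, do the following:

Create the "master" table, from which all of the partitions will inherit.

This table will contain no data. Do not define any check constraints on it unless you intend them to apply equally to all partitions. There is also no point in defining any indexes or unique constraints on it.

Create several "child" tables that each inherit from the master table. Normally, these tables will not add any columns to the set inherited from the master.

We will refer to the child tables as partitions, though they are in every way ordinary PostgreSQL tables.

Add table constraints to the partition tables to define the key values allowed in each partition.

Typical examples would be: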

CHECK ( x = 1 )

CHECK ( county IN ( 'Oxfordshire', 'Buckinghamshire', 'Warwickshire' ))

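CHECK ( outletID >= 100 AND outletID < 200 )

Make sure the constraints guarantee that there is no overlap between the key values permitted in different partitions. A common mistake is to set up range constraints like this: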

CHECK ( outletID BETWEEN 100 AND 200 )

CHECK ( outletID BETWEEN 200 AND 300 )

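This is wrong, because it is not clear which partition the key value 200 belongs in.

Note that there is no difference in syntax between range and list partitioning; those terms are descriptive only.

For each partition, create an index on the key column(s), as well as any other indexes you might want. (The key index is not strictly necessary, but in most scenarios it is helpful. If you intend the key values to be unique, you should always create a unique or primary-key constraint for each partition.)

Optionally, define a rule or trigger to redirect modifications of the master table to the appropriate partition.

Make sure the constraint_exclusion configuration parameter is enabled in postgresql.conf. Without it, queries will not be optimized as desired.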
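The overlap mistake described in the steps above can be caught mechanically before any DDL is run. The following is a minimal Python sketch, not a PostgreSQL feature: it models each partition's key range as a half-open interval [lo, hi), and the helper names (`ranges_overlap`, `find_overlaps`, `between`) are hypothetical.

```python
def ranges_overlap(a, b):
    """Two half-open ranges [lo, hi) overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def find_overlaps(ranges):
    """Return neighbouring pairs (in sorted order) that share at least one key value.

    For non-empty ranges, if any two overlap then some pair of neighbours
    in sorted order overlaps too, so an empty result means they are disjoint.
    """
    ordered = sorted(ranges)
    return [(p, q) for p, q in zip(ordered, ordered[1:]) if ranges_overlap(p, q)]


def between(lo, hi):
    """Model an integer `BETWEEN lo AND hi` (inclusive) as a half-open range."""
    return (lo, hi + 1)


# The mistaken constraints: both partitions accept the key value 200.
bad = [between(100, 200), between(200, 300)]
# The intended constraints: x >= 100 AND x < 200, then x >= 200 AND x < 300.
good = [(100, 200), (200, 300)]

print(find_overlaps(bad))   # one overlapping pair is reported
print(find_overlaps(good))  # nothing is reported
```

Writing every range constraint in the half-open form (`>=` on the lower bound, `<` on the upper bound) is what makes adjacent partitions meet without sharing a key value.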

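For example, suppose we are constructing a database for a large ice cream company. The company measures peak temperatures every day as well as ice cream sales in each region. Conceptually, we want a table like this: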

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);

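We know that most queries will access only the last week's, month's or quarter's data, since the main use of this table is to prepare online reports for management. To reduce the amount of old data that needs to be stored, we decide to keep only the most recent three years' worth of data. At the beginning of each month we will remove the oldest month's data.

In this situation we can use partitioning to help us meet all of our different requirements for the measurement table. Following the steps outlined above, partitioning can be set up as follows:

The master table is the measurement table, declared exactly as above.

Next we create one partition for each month: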

CREATE TABLE measurement_yy04mm02 ( ) INHERITS (measurement);

CREATE TABLE measurement_yy04mm03 ( ) INHERITS (measurement);

...

CREATE TABLE measurement_yy05mm11 ( ) INHERITS (measurement);

CREATE TABLE measurement_yy05mm12 ( ) INHERITS (measurement);

CREATE TABLE measurement_yy06mm01 ( ) INHERITS (measurement);

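Each of the partitions is a complete table in its own right, but it inherits its definition from the measurement table.

This solves one of our problems: deleting old data. Each month, all we need to do is perform a DROP TABLE on the oldest child table and create a new child table for the new month's data.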
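The monthly rotation just described can be sketched as a small script. This is only an illustration under stated assumptions: the helper names (`partition_name`, `shift_month`, `monthly_rotation`) and the `keep_months` parameter are hypothetical, the script only builds the SQL text rather than running it, and the table names follow the measurement_yyYYmmMM pattern used in this example.

```python
from datetime import date


def partition_name(year, month):
    """Build a child-table name in the measurement_yyYYmmMM style of the example."""
    return f"measurement_yy{year % 100:02d}mm{month:02d}"


def shift_month(year, month, delta):
    """Move a (year, month) pair forward or backward by delta months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def monthly_rotation(today, keep_months=36):
    """Return the statements to run at the start of the month containing `today`.

    Assumption: three years of data means 36 monthly partitions are kept.
    """
    old_year, old_month = shift_month(today.year, today.month, -keep_months)
    return [
        f"DROP TABLE {partition_name(old_year, old_month)};",
        f"CREATE TABLE {partition_name(today.year, today.month)} ( ) INHERITS (measurement);",
    ]


for statement in monthly_rotation(date(2006, 2, 1)):
    print(statement)
```

The sketch leaves out the CHECK constraint and index that each partition still needs.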

We must add non-overlapping table constraints, so that our table creation script becomes:

CREATE TABLE measurement_yy04mm02 (
    CHECK ( logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01' )
) INHERITS (measurement);

CREATE TABLE measurement_yy04mm03 (
    CHECK ( logdate >= DATE '2004-03-01' AND logdate < DATE '2004-04-01' )
) INHERITS (measurement);

...

CREATE TABLE measurement_yy05mm11 (
    CHECK ( logdate >= DATE '2005-11-01' AND logdate < DATE '2005-12-01' )
) INHERITS (measurement);

CREATE TABLE measurement_yy05mm12 (
    CHECK ( logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01' )
) INHERITS (measurement);

CREATE TABLE measurement_yy06mm01 (
    CHECK ( logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01' )
) INHERITS (measurement);

We probably want indexes on the key columns as well:

CREATE INDEX measurement_yy04mm02_logdate ON measurement_yy04mm02 (logdate);

CREATE INDEX measurement_yy04mm03_logdate ON measurement_yy04mm03 (logdate);

...

CREATE INDEX measurement_yy05mm11_logdate ON measurement_yy05mm11 (logdate);

CREATE INDEX measurement_yy05mm12_logdate ON measurement_yy05mm12 (logdate);

CREATE INDEX measurement_yy06mm01_logdate ON measurement_yy06mm01 (logdate);

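We choose not to add further indexes at this time.

If data will be added only to the latest partition, we can set up a very simple rule to insert data. We must redefine this rule each month so that it always points to the current partition.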

CREATE OR REPLACE RULE measurement_current_partition AS

ON INSERT TO measurement

DO INSTEAD

INSERT INTO measurement_yy06mm01 VALUES ( NEW.city_id,

NEW.logdate,

NEW.peaktemp,

NEW.unitsales );

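We might want to insert data and have the server automatically locate the partition into which the row should be added. We could do this with a more complex set of rules like the following.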

CREATE RULE measurement_insert_yy04mm02 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01' )
DO INSTEAD
    INSERT INTO measurement_yy04mm02 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );

...

CREATE RULE measurement_insert_yy05mm12 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01' )
DO INSTEAD
    INSERT INTO measurement_yy05mm12 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );

CREATE RULE measurement_insert_yy06mm01 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01' )
DO INSTEAD
    INSERT INTO measurement_yy06mm01 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );

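Note that the WHERE clause in each rule exactly matches the CHECK constraint for its partition.

As we can see, a complex partitioning scheme could require a substantial amount of DDL. In the above example we would be creating a new partition each month, so it may be wise to write a script that generates the required DDL automatically.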
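A DDL-generating script of the kind suggested above might look like the following Python sketch. It is one possible shape, not a prescribed tool: the function names (`next_month`, `partition_ddl`) are hypothetical, the output is specific to the measurement example (table, column, and rule names are taken from the statements in this article), and the script only prints SQL text without connecting to a database.

```python
def next_month(year, month):
    """Return the (year, month) pair that follows the given month."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def partition_ddl(year, month):
    """Generate the CREATE TABLE, CREATE INDEX and CREATE RULE text for one month."""
    end_year, end_month = next_month(year, month)
    suffix = f"yy{year % 100:02d}mm{month:02d}"
    table = f"measurement_{suffix}"
    # Half-open date range: first day of this month up to, but excluding, the next month.
    bounds = (
        f"logdate >= DATE '{year:04d}-{month:02d}-01' "
        f"AND logdate < DATE '{end_year:04d}-{end_month:02d}-01'"
    )
    return [
        f"CREATE TABLE {table} (\n    CHECK ( {bounds} )\n) INHERITS (measurement);",
        f"CREATE INDEX {table}_logdate ON {table} (logdate);",
        f"CREATE RULE measurement_insert_{suffix} AS\n"
        f"ON INSERT TO measurement WHERE\n    ( {bounds} )\n"
        f"DO INSTEAD\n    INSERT INTO {table} VALUES "
        f"( NEW.city_id, NEW.logdate, NEW.peaktemp, NEW.unitsales );",
    ]


print("\n\n".join(partition_ddl(2006, 2)))
```

Because the CHECK constraint and the rule's WHERE clause are built from the same `bounds` string, they cannot drift apart, which is the property the text asks for.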

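The following caveats apply:

There is currently no way to verify that all of the CHECK constraints are mutually exclusive. Care is required by the database designer.

There is currently no simple way to specify that rows must not be inserted into the master table. A CHECK (false) constraint on the master table would be inherited by all child tables, so that cannot be used for this purpose. One possibility is to set up an ON INSERT trigger on the master table that always raises an error. (Alternatively, such a trigger could be used to redirect the data into the proper child table, instead of using a set of rules as suggested above.)

Partitioning can also be arranged using a UNION ALL view: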

CREATE VIEW measurement AS

SELECT * FROM measurement_yy04mm02

UNION ALL SELECT * FROM measurement_yy04mm03

...

UNION ALL SELECT * FROM measurement_yy05mm11

UNION ALL SELECT * FROM measurement_yy05mm12

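Partitioning and Constraint Exclusion

Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example:

SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner examines the constraints of each partition and tries to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan.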
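The reasoning the planner applies can be imitated in a few lines of Python. This is a simplified model for intuition only, not the planner's actual algorithm: it handles a single `logdate >= constant` predicate, the `PARTITIONS` table and the `partitions_to_scan` function are hypothetical, and the master table is left out (it has no CHECK constraint, so it can never be excluded).

```python
from datetime import date

# Each child partition with the half-open range [lo, hi) from its CHECK constraint.
PARTITIONS = {
    "measurement_yy05mm11": (date(2005, 11, 1), date(2005, 12, 1)),
    "measurement_yy05mm12": (date(2005, 12, 1), date(2006, 1, 1)),
    "measurement_yy06mm01": (date(2006, 1, 1), date(2006, 2, 1)),
}


def partitions_to_scan(partitions, lower_bound):
    """Keep the partitions that might hold a row with logdate >= lower_bound.

    A partition whose CHECK says logdate < hi cannot contain a matching row
    when hi <= lower_bound, so it is dropped from the list.
    """
    return [name for name, (_lo, hi) in partitions.items() if hi > lower_bound]


print(partitions_to_scan(PARTITIONS, date(2006, 1, 1)))
```

The proof needs a known comparison value at planning time, which this model gets as the `lower_bound` argument.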

You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off. A typical default plan for this type of table setup is:

SET constraint_exclusion = off;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=158.66..158.68 rows=1 width=0)
   ->  Append  (cost=0.00..151.88 rows=2715 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy04mm02 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy04mm03 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
...
         ->  Seq Scan on measurement_yy05mm12 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy06mm01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)

Some or all of the partitions might use index scans instead of full-table sequential scans, but the point here is that there is no need to scan the older partitions at all to answer this query. When we enable constraint exclusion, we get a significantly cheaper plan that delivers the same answer:

SET constraint_exclusion = on;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=63.47..63.48 rows=1 width=0)
   ->  Append  (cost=0.00..60.75 rows=1086 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy06mm01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)

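Note that constraint exclusion is driven only by CHECK constraints, not by the presence of indexes. It is therefore not necessary to define indexes on the key columns. Whether an index is needed on a given partition depends on whether the queries that scan it generally scan a large part of the partition or just a small part. An index is helpful in the latter case but not in the former.

The following caveats apply:

Constraint exclusion works only when the query's WHERE clause contains constants. A parameterized query will not be optimized, since the planner cannot know at planning time which partitions the parameter value will select at run time. For the same reason, "stable" functions such as CURRENT_DATE must be avoided. Joining the partition key to a column of another table will not be optimized either.

Avoid cross-data-type comparisons in the CHECK constraints, as the planner will currently fail to prove such conditions false. For example, the following constraint will work if x is an integer column, but not if x is a bigint: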

CHECK ( x = 1 )

For a bigint column we must use a constraint like:

CHECK ( x = 1::bigint )

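The problem is not limited to the bigint data type; it can occur whenever the default data type of the constant does not match the data type of the column it is compared with. Cross-data-type comparisons in the supplied queries are usually OK, just not in the CHECK conditions.

UPDATE and DELETE commands against the master table do not currently perform constraint exclusion.

All constraints on all partitions of the master table are considered for constraint exclusion, so a large number of partitions is likely to increase query planning time considerably.

Don't forget that you still need to run ANALYZE on each partition individually. A command like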

ANALYZE measurement;

will only process the master table.

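Regarding the UNION ALL view shown earlier, its definition ends with the branch for the newest partition:

UNION ALL SELECT * FROM measurement_yy06mm01;

However, constraint exclusion is currently not supported for partitioned tables defined in this manner. Also, the need to recreate the view adds an extra step to adding and dropping individual partitions of the data set.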
