This article covers a MySQL-related topic: database zipper tables. A zipper table is a data model used mainly in data warehouse design, defined by the way the table stores data. As the name suggests, the "zipper" records history: it links together every state a record has passed through. Let's take a look; I hope it helps.
Background of zipper tables
In the process of designing the data model of the data warehouse, we often encounter such requirements:
1. The data volume is relatively large.
2. Some fields in the table are updated over time, such as a user's address, a product's description, or an order's status.
3. Historical snapshots at a point in time or over a period need to be queried, for example the status of an order at some moment in the past, or how many times a user's record changed during a given period.
4. The proportion and frequency of changes are modest, for example a table of 10 million members with roughly 100,000 additions and changes per day.
5. Keeping a full copy of the table every day would store a great deal of unchanged information in each copy, which is a huge waste of storage.
There are several options for this kind of table:
- Option 1: keep only the latest copy each day, for example using Sqoop to extract the latest full data into Hive daily.
- Option 2: keep a full snapshot (slice) of the data every day.
- Option 3: use a zipper table.
Comparison of the above solutions
Option 1
This option needs little explanation: it is very simple to implement. Each day, drop the previous day's data and extract the latest copy again.
The advantages are obvious: it saves space, and it is convenient for common use, since no time partition has to be added when querying the table.
The shortcoming is equally obvious: there is no historical data. The only way to look up old records is through some other means, such as pulling them from a transaction (flow) table.
Option 2
Keeping a full slice every day is a relatively safe option, and the historical data is all there.
The disadvantage is the storage cost: if a full copy of the table is kept every day, each copy stores a great deal of unchanged information, which is a huge waste of storage.
Of course some trade-offs can be made, such as keeping only the last month of data, but requirements are relentless, and the life cycle of data is not something we can fully control.
Zipper table
The zipper table largely balances these needs.
First, it makes a trade-off on space: it does not occupy as little space as Option 1, but its daily increment may be only a thousandth or even a ten-thousandth of Option 2's.
It also covers the uses of Option 2: it can serve the latest data, and, with filter conditions added, historical data as well.
So a zipper table is well worth using.
Zipper table concept
The zipper table is a data model, defined mainly by the way a table stores data in data warehouse design. As the name suggests, the "zipper" records history: all the changes of a thing from its beginning to its current state. Zipper tables avoid the massive storage that keeping every day's full records would require, and they are a common way of handling slowly changing dimensions (SCD type 2).
Baidu Baike's definition: a zipper table is a table that maintains both historical states and the latest state of the data. Depending on the zipper granularity, a zipper table is essentially an optimized snapshot with the unchanged records removed; the records as of any zipping time can easily be restored from it.
Zipper table algorithm
1. Collect the day's full data into an ND (new day) table.
2. Take yesterday's full data from the history table into an OD (old day) table.
3. Compare the two tables on all fields: ND − OD is the day's new and changed data, i.e. the daily increment, denoted W_I.
4. Compare the two tables on all fields the other way: OD − ND is the data whose state has ended and whose END_DATE must be closed out, denoted W_U.
5. Insert the entire contents of W_I into the history table as new records, with start_date set to the current day and end_date set to the max value, e.g. '9999-12-31'.
6. Apply W_U to the history table as an update: start_date stays unchanged, and end_date is set to the current day; this is the "closing the chain" operation. Comparing the history table (OD) with W_U on all fields except START_DATE and END_DATE, the intersection of the two has its END_DATE changed to the current day, marking those records as no longer current.
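The six steps above can be sketched in memory with Python. This is a minimal illustration, not the article's SQL: rows are keyed by a single business key, `state` stands in for all compared fields, and the names (`zipper_merge`, `MAX_DATE`) are my own. It follows the half-open convention used later in the MySQL section, where a closed record's end_date equals the day the change was detected.

```python
MAX_DATE = "9999-12-31"

def zipper_merge(history, new_day, today):
    """history: list of dicts with key, state, start_date, end_date.
    new_day: dict mapping key -> state (today's full extract).
    Mutates and returns history."""
    # OD side: yesterday's still-open records (end_date at the max value)
    open_rows = {r["key"]: r for r in history if r["end_date"] == MAX_DATE}
    for key, state in new_day.items():
        old = open_rows.get(key)
        if old is not None and old["state"] == state:
            continue                      # unchanged: nothing to synchronize
        if old is not None:               # W_U: close out the old record
            old["end_date"] = today
        history.append({                  # W_I: insert the new record
            "key": key, "state": state,
            "start_date": today, "end_date": MAX_DATE,
        })
    return history
```

Running two daily merges produces one closed record per changed key plus one open record per current state, exactly the shape of the history tables shown below.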
Zipper table example 1
As a simple example, suppose there is an order table.

On June 20 it contains 3 records:

| Order creation date | Order number | Order status |
| --- | --- | --- |
| 2012-06-20 | 001 | Create order |
| 2012-06-20 | 002 | Create order |
| 2012-06-20 | 003 | Payment completed |
By June 21 there are 5 records in the table:

| Order creation date | Order number | Order status |
| --- | --- | --- |
| 2012-06-20 | 001 | Payment completed |
| 2012-06-20 | 002 | Create order |
| 2012-06-20 | 003 | Payment completed |
| 2012-06-21 | 004 | Create order |
| 2012-06-21 | 005 | Create order |
By June 22 there are 6 records in the table:

| Order creation date | Order number | Order status |
| --- | --- | --- |
| 2012-06-20 | 001 | Payment completed |
| 2012-06-20 | 002 | Create order |
| 2012-06-20 | 003 | Shipped |
| 2012-06-21 | 004 | Create order |
| 2012-06-21 | 005 | Payment completed |
| 2012-06-22 | 006 | Create order |
Ways this table could be kept in the data warehouse:
1. Keep only one full copy. The data then matches the June 22 records, and a question such as "what was the status of order 001 on June 21" cannot be answered.
2. Keep one full copy per day. The warehouse table then holds 14 records in total, but many records that never changed (such as orders 002 and 004) are saved repeatedly; with a large data volume this wastes a great deal of storage.
If instead the table is kept in the data warehouse as a historical zipper table, it looks like this:

| Order creation date | Order number | Order status | dw_begin_date | dw_end_date |
| --- | --- | --- | --- | --- |
| 2012-06-20 | 001 | Create order | 2012-06-20 | 2012-06-20 |
| 2012-06-20 | 001 | Payment completed | 2012-06-21 | 9999-12-31 |
| 2012-06-20 | 002 | Create order | 2012-06-20 | 9999-12-31 |
| 2012-06-20 | 003 | Payment completed | 2012-06-20 | 2012-06-21 |
| 2012-06-20 | 003 | Shipped | 2012-06-22 | 9999-12-31 |
| 2012-06-21 | 004 | Create order | 2012-06-21 | 9999-12-31 |
| 2012-06-21 | 005 | Create order | 2012-06-21 | 2012-06-21 |
| 2012-06-21 | 005 | Payment completed | 2012-06-22 | 9999-12-31 |
| 2012-06-22 | 006 | Create order | 2012-06-22 | 9999-12-31 |
Notes:
1. dw_begin_date marks the start of the record's life cycle; dw_end_date marks its end.
2. dw_end_date = '9999-12-31' means the record is currently valid.
3. To query all currently valid records: select * from order_his where dw_end_date = '9999-12-31';
4. To query the historical snapshot of 2012-06-21: select * from order_his where dw_begin_date <= '2012-06-21' and dw_end_date >= '2012-06-21'; this statement returns the following records:
| Order creation date | Order number | Order status | dw_begin_date | dw_end_date |
| --- | --- | --- | --- | --- |
| 2012-06-20 | 001 | Payment completed | 2012-06-21 | 9999-12-31 |
| 2012-06-20 | 002 | Create order | 2012-06-20 | 9999-12-31 |
| 2012-06-20 | 003 | Payment completed | 2012-06-20 | 2012-06-21 |
| 2012-06-21 | 004 | Create order | 2012-06-21 | 9999-12-31 |
| 2012-06-21 | 005 | Create order | 2012-06-21 | 2012-06-21 |
This matches the records in the source table on June 21 exactly:

| Order creation date | Order number | Order status |
| --- | --- | --- |
| 2012-06-20 | 001 | Payment completed |
| 2012-06-20 | 002 | Create order |
| 2012-06-20 | 003 | Payment completed |
| 2012-06-21 | 004 | Create order |
| 2012-06-21 | 005 | Create order |
As you can see, a historical zipper table both meets the need for historical data and saves storage to a great extent.
Zipper table example 2

A person's history table may contain only a handful of records covering their whole life, which avoids the massive storage that recording the person's status every day would require:

| Name | Start Date | End Date | Status |
| --- | --- | --- | --- |
| client | 19000101 | 19070901 | At home |
| client | 19070901 | 19130901 | A Elementary School |
| client | 19130901 | 19160901 | B Junior High School |
| client | 19160901 | 19190901 | C High School |
| client | 19190901 | 19230901 | D University |
| client | 19230901 | 19601231 | E Company |
| client | 19601231 | 29991231 | Retired at home |

Note that every record above includes its start date but not its end date. On 19070901, for example, the client is already at A Elementary School, not at home. Except for the last record, whose status still holds today, each record is no longer in effect on its own end date. This convention can be summarized as "count the start, not the end."
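The "count the start, not the end" rule above is just a half-open interval lookup, which can be sketched in a few lines of Python. The function name is my own; YYYYMMDD strings compare correctly as plain strings, so no date parsing is needed.

```python
# Records mirror the example table: (start inclusive, end exclusive, status).
RECORDS = [
    ("19000101", "19070901", "At home"),
    ("19070901", "19130901", "A Elementary School"),
    ("19130901", "19160901", "B Junior High School"),
    ("19160901", "19190901", "C High School"),
    ("19190901", "19230901", "D University"),
    ("19230901", "19601231", "E Company"),
    ("19601231", "29991231", "Retired at home"),
]

def status_at(date, records=RECORDS):
    """Return the status whose [start, end) interval contains `date`."""
    for start, end, status in records:
        if start <= date < end:
            return status
    return None
```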
Zipper table implementation

1. Define two temporary tables: one for the day's full data, one for the rows to be inserted or updated (the statements below use Teradata-style volatile tables):

```sql
CREATE VOLATILE TABLE VT_xxxx_NEW AS xxxx WITH NO DATA ON COMMIT PRESERVE ROWS;
CREATE VOLATILE SET TABLE VT_xxxx_CHG, NO LOG AS xxxx WITH NO DATA ON COMMIT PRESERVE ROWS;
```

2. Load the day's full data:

```sql
INSERT INTO VT_xxxx_NEW (xx)
SELECT xx, cur_date, max_date FROM xxxx_source;
```

3. Extract the new or changed rows from the xxxx_NEW temporary table into the xxxx_CHG temporary table:

```sql
INSERT INTO VT_xxxx_CHG (xx)
SELECT xx FROM VT_xxxx_NEW
WHERE (xx) NOT IN (SELECT xx FROM xxxx_HIS WHERE end_date = 'max_date');
```

4. In the history table, close out the records invalidated by the changes, moving end_date from the max value to the current date:

```sql
UPDATE A1
FROM xxxx_HIS A1, VT_xxxx_CHG A2
SET End_Date = 'current_date'
WHERE A1.xx = A2.xx AND A1.End_Date = 'max_date';
```

5. Insert the new or changed rows into the target table:

```sql
INSERT INTO xxxx_HIS SELECT * FROM VT_xxxx_CHG;
```
Take product data as an example. There is a product table t_product with the following structure:

| Column name | Type | Description |
| --- | --- | --- |
| goods_id | varchar(50) | Product ID |
| goods_status | varchar(50) | Product status (pending review, for sale, on sale, deleted) |
| createtime | varchar(50) | Product creation date |
| modifytime | varchar(50) | Product modification date |

The data on December 20, 2019 is as follows:

| goods_id | goods_status | createtime | modifytime |
| --- | --- | --- | --- |
| 001 | Pending review | 2019-12-18 | 2019-12-20 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
A product's status changes over time, and we need to preserve the history of all of those changes.

Option 1: snapshot each day's data into the data warehouse

This option keeps a full copy every day, synchronizing all data into the warehouse, so many records that have not changed are stored repeatedly.
December 20 (4 rows)

| goods_id | goods_status | createtime | modifytime |
| --- | --- | --- | --- |
| 001 | Pending review | 2019-12-18 | 2019-12-20 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
December 21 (10 rows)

| goods_id | goods_status | createtime | modifytime |
| --- | --- | --- | --- |
| *December 20 snapshot:* | | | |
| 001 | Pending review | 2019-12-18 | 2019-12-20 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
| *December 21 snapshot:* | | | |
| 001 | For sale (changed from pending review) | 2019-12-18 | 2019-12-21 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
| 005 (new) | Pending review | 2019-12-21 | 2019-12-21 |
| 006 (new) | Pending review | 2019-12-21 | 2019-12-21 |
December 22 (18 rows)

| goods_id | goods_status | createtime | modifytime |
| --- | --- | --- | --- |
| *December 20 snapshot:* | | | |
| 001 | Pending review | 2019-12-18 | 2019-12-20 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
| *December 21 snapshot:* | | | |
| 001 | For sale (changed from pending review) | 2019-12-18 | 2019-12-21 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | On sale | 2019-12-20 | 2019-12-20 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
| 005 | Pending review | 2019-12-21 | 2019-12-21 |
| 006 | Pending review | 2019-12-21 | 2019-12-21 |
| *December 22 snapshot:* | | | |
| 001 | For sale | 2019-12-18 | 2019-12-21 |
| 002 | For sale | 2019-12-19 | 2019-12-20 |
| 003 | Deleted (changed from on sale) | 2019-12-20 | 2019-12-22 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 |
| 005 | Pending review | 2019-12-21 | 2019-12-21 |
| 006 | Deleted (changed from pending review) | 2019-12-21 | 2019-12-22 |
| 007 | Pending review | 2019-12-22 | 2019-12-22 |
| 008 | Pending review | 2019-12-22 | 2019-12-22 |
MySQL warehouse implementation

MySQL initialization

Create the lalian database and the product table in MySQL as the raw data layer:

```sql
-- create the database
create database if not exists lalian;
-- create the product table
create table if not exists `lalian`.`t_product`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50)    -- product modification time
);
```

Create ods and dw layers in MySQL to simulate the warehouse:

```sql
-- ods product table
create table if not exists `lalian`.`ods_t_product`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50),   -- product modification time
  cdat varchar(10)          -- simulates a hive partition
) default character set = 'utf8';
-- dw product table
create table if not exists `lalian`.`dw_t_product`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50),   -- product modification time
  cdat varchar(10)          -- simulates a hive partition
) default character set = 'utf8';
```

Incremental import: December 20

Load the December 20 data (4 rows) into the raw layer:

```sql
insert into `lalian`.`t_product`(goods_id, goods_status, createtime, modifytime) values
('001', 'Pending review', '2019-12-18', '2019-12-20'),
('002', 'For sale',       '2019-12-19', '2019-12-20'),
('003', 'On sale',        '2019-12-20', '2019-12-20'),
('004', 'Deleted',        '2019-12-15', '2019-12-20');
```

Note: because MySQL is used here to simulate the warehouse, the data is loaded with plain insert into statements. In production the warehouse would typically be Hive, with kettle, sqoop, or datax used to synchronize the data.

```sql
-- from the raw layer into the ods layer
insert into lalian.ods_t_product
select *, '20191220' from lalian.t_product;
-- from ods into the dw layer
insert into lalian.dw_t_product
select * from lalian.ods_t_product where cdat = '20191220';
```

Check the result in the dw layer:

```sql
select * from lalian.dw_t_product where cdat = '20191220';
```
| goods_id | goods_status | createtime | modifytime | cdat |
| --- | --- | --- | --- | --- |
| 1 | Pending review | 2019/12/18 | 2019/12/20 | 20191220 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191220 |
| 3 | On sale | 2019/12/20 | 2019/12/20 | 20191220 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191220 |
Incremental import: December 21

Apply the December 21 changes to the raw layer (the table then holds 6 rows):

```sql
UPDATE `lalian`.`t_product` SET goods_status = 'For sale', modifytime = '2019-12-21' WHERE goods_id = '001';
INSERT INTO `lalian`.`t_product`(goods_id, goods_status, createtime, modifytime) VALUES
('005', 'Pending review', '2019-12-21', '2019-12-21'),
('006', 'Pending review', '2019-12-21', '2019-12-21');
```

Load the data into the ods and dw layers:

```sql
-- from the raw layer into the ods layer
insert into lalian.ods_t_product
select *, '20191221' from lalian.t_product;
-- from ods into the dw layer
insert into lalian.dw_t_product
select * from lalian.ods_t_product where cdat = '20191221';
```

Check the result in the dw layer:

```sql
select * from lalian.dw_t_product where cdat = '20191221';
```
| goods_id | goods_status | createtime | modifytime | cdat |
| --- | --- | --- | --- | --- |
| 1 | For sale | 2019/12/18 | 2019/12/21 | 20191221 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191221 |
| 3 | On sale | 2019/12/20 | 2019/12/20 | 20191221 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191221 |
| 5 | Pending review | 2019/12/21 | 2019/12/21 | 20191221 |
| 6 | Pending review | 2019/12/21 | 2019/12/21 | 20191221 |
Incremental import: December 22

Apply the December 22 changes to the raw layer (the table then holds 8 rows):

```sql
UPDATE `lalian`.`t_product` SET goods_status = 'Deleted', modifytime = '2019-12-22' WHERE goods_id = '003';
UPDATE `lalian`.`t_product` SET goods_status = 'Deleted', modifytime = '2019-12-22' WHERE goods_id = '006';
INSERT INTO `lalian`.`t_product`(goods_id, goods_status, createtime, modifytime) VALUES
('007', 'Pending review', '2019-12-22', '2019-12-22'),
('008', 'Pending review', '2019-12-22', '2019-12-22');
```

Load the data into the ods and dw layers:

```sql
-- from the raw layer into the ods layer
insert into lalian.ods_t_product
select *, '20191222' from lalian.t_product;
-- from ods into the dw layer
insert into lalian.dw_t_product
select * from lalian.ods_t_product where cdat = '20191222';
```

Check the result in the dw layer:

```sql
select * from lalian.dw_t_product where cdat = '20191222';
```
| goods_id | goods_status | createtime | modifytime | cdat |
| --- | --- | --- | --- | --- |
| 1 | For sale | 2019/12/18 | 2019/12/21 | 20191222 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191222 |
| 3 | Deleted | 2019/12/20 | 2019/12/22 | 20191222 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191222 |
| 5 | Pending review | 2019/12/21 | 2019/12/21 | 20191222 |
| 6 | Deleted | 2019/12/21 | 2019/12/22 | 20191222 |
| 7 | Pending review | 2019/12/22 | 2019/12/22 | 20191222 |
| 8 | Pending review | 2019/12/22 | 2019/12/22 | 20191222 |
Check the full contents of the dw layer:

```sql
select * from lalian.dw_t_product;
```
| goods_id | goods_status | createtime | modifytime | cdat |
| --- | --- | --- | --- | --- |
| 1 | Pending review | 2019/12/18 | 2019/12/20 | 20191220 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191220 |
| 3 | On sale | 2019/12/20 | 2019/12/20 | 20191220 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191220 |
| 1 | For sale | 2019/12/18 | 2019/12/21 | 20191221 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191221 |
| 3 | On sale | 2019/12/20 | 2019/12/20 | 20191221 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191221 |
| 5 | Pending review | 2019/12/21 | 2019/12/21 | 20191221 |
| 6 | Pending review | 2019/12/21 | 2019/12/21 | 20191221 |
| 1 | For sale | 2019/12/18 | 2019/12/21 | 20191222 |
| 2 | For sale | 2019/12/19 | 2019/12/20 | 20191222 |
| 3 | Deleted | 2019/12/20 | 2019/12/22 | 20191222 |
| 4 | Deleted | 2019/12/15 | 2019/12/20 | 20191222 |
| 5 | Pending review | 2019/12/21 | 2019/12/21 | 20191222 |
| 6 | Deleted | 2019/12/21 | 2019/12/22 | 20191222 |
| 7 | Pending review | 2019/12/22 | 2019/12/22 | 20191222 |
| 8 | Pending review | 2019/12/22 | 2019/12/22 | 20191222 |
From the case above we can see that this table keeps a full copy every day, and every full copy stores a lot of unchanged information. With a large data volume this is a huge waste of storage. Designing the table as a zipper table instead both reflects the historical states of the data and saves storage to the greatest extent.

Option 2: use a zipper table to keep historical snapshots

A zipper table stores no redundant data: a row is saved only when it changes, so compared with a full synchronization every time it saves storage accordingly. Historical snapshots can still be queried. Two extra columns (dw_start_date, dw_end_date) record the life cycle of each data row.
Product zipper table data on December 20:

| goods_id | goods_status | createtime | modifytime | dw_start_date | dw_end_date |
| --- | --- | --- | --- | --- | --- |
| 001 | Pending review | 2019-12-18 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 002 | For sale | 2019-12-19 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 003 | On sale | 2019-12-20 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 | 2019-12-20 | 9999-12-31 |

- The December 20 data is brand new and is imported into the dw table as-is.
- dw_start_date marks the start of a row's life cycle: the date from which the row is valid (the effective date).
- dw_end_date marks the end of a row's life cycle: the row is valid up to, but not including, that date (the expiration date). A dw_end_date of 9999-12-31 means the row is the latest version and does not expire until 9999-12-31.
Product zipper table data on December 21:

| goods_id | goods_status | createtime | modifytime | dw_start_date | dw_end_date |
| --- | --- | --- | --- | --- | --- |
| 001 | Pending review | 2019-12-18 | 2019-12-20 | 2019-12-20 | 2019-12-21 |
| 002 | For sale | 2019-12-19 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 003 | On sale | 2019-12-20 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 001 (changed) | For sale | 2019-12-18 | 2019-12-21 | 2019-12-21 | 9999-12-31 |
| 005 (new) | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 9999-12-31 |
| 006 (new) | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 9999-12-31 |

The zipper table stores no redundant data: as long as a row has not changed, nothing needs to be synchronized.

- The status of product 001 changed (pending review → for sale), so its original row's dw_end_date is changed from 9999-12-31 to 2019-12-21, meaning the pending-review state was valid from 2019-12-20 (inclusive) to 2019-12-21 (exclusive).
- A new record is saved for 001's new state, with dw_start_date 2019-12-21 and dw_end_date 9999-12-31.
- New rows 005 and 006 get dw_start_date 2019-12-21 and dw_end_date 9999-12-31.
Product zipper table data on December 22:

| goods_id | goods_status | createtime | modifytime | dw_start_date | dw_end_date |
| --- | --- | --- | --- | --- | --- |
| 001 | Pending review | 2019-12-18 | 2019-12-20 | 2019-12-20 | 2019-12-21 |
| 002 | For sale | 2019-12-19 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 003 | On sale | 2019-12-20 | 2019-12-20 | 2019-12-20 | 2019-12-22 |
| 004 | Deleted | 2019-12-15 | 2019-12-20 | 2019-12-20 | 9999-12-31 |
| 001 | For sale | 2019-12-18 | 2019-12-21 | 2019-12-21 | 9999-12-31 |
| 005 | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 9999-12-31 |
| 006 | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 2019-12-22 |
| 003 (changed) | Deleted | 2019-12-20 | 2019-12-22 | 2019-12-22 | 9999-12-31 |
| 006 (changed) | Deleted | 2019-12-21 | 2019-12-22 | 2019-12-22 | 9999-12-31 |
| 007 (new) | Pending review | 2019-12-22 | 2019-12-22 | 2019-12-22 | 9999-12-31 |
| 008 (new) | Pending review | 2019-12-22 | 2019-12-22 | 2019-12-22 | 9999-12-31 |

The zipper table stores no redundant data: as long as a row has not changed, nothing needs to be synchronized.

- The status of product 003 changed (on sale → deleted), so its original row's dw_end_date is changed from 9999-12-31 to 2019-12-22, meaning the on-sale state was valid from 2019-12-20 (inclusive) to 2019-12-22 (exclusive). Product 006 is closed out the same way.
- A new record is saved for 003's new state, with dw_start_date 2019-12-22 and dw_end_date 9999-12-31.
- New rows 007 and 008 get dw_start_date 2019-12-22 and dw_end_date 9999-12-31.
MySQL implementation of the zipper-table snapshot

Workflow:

- Add two extra columns to the existing dw-layer table.
- Synchronize only the rows modified that day into the ods layer.
- Implement the zipper-table algorithm.
- The zipper table's data is: the day's latest data UNION ALL the historical data.
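The "latest data UNION ALL history" recomputation can be tried end to end using Python's built-in sqlite3 as a stand-in for MySQL. This is a reduced sketch: the tables carry only a status column plus the zipper dates, and the two-product data set is a subset of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dw (goods_id TEXT, goods_status TEXT,"
            " dw_start_date TEXT, dw_end_date TEXT)")
cur.execute("CREATE TABLE ods (goods_id TEXT, goods_status TEXT)")
# dw after the Dec 20 full load (subset of the example)
cur.executemany("INSERT INTO dw VALUES (?,?,?,?)", [
    ("001", "Pending review", "2019-12-20", "9999-12-31"),
    ("002", "For sale",       "2019-12-20", "9999-12-31"),
])
# only the rows modified on Dec 21 reach ods
cur.executemany("INSERT INTO ods VALUES (?,?)", [
    ("001", "For sale"),
    ("005", "Pending review"),
])
# close out changed rows in dw, then UNION ALL the day's latest rows
rows = cur.execute("""
    SELECT t1.goods_id, t1.goods_status, t1.dw_start_date,
           CASE WHEN t2.goods_id IS NOT NULL AND t1.dw_end_date > '2019-12-21'
                THEN '2019-12-21' ELSE t1.dw_end_date END
    FROM dw t1 LEFT JOIN ods t2 ON t1.goods_id = t2.goods_id
    UNION ALL
    SELECT goods_id, goods_status, '2019-12-21', '9999-12-31' FROM ods
""").fetchall()
```

The result holds the closed-out pending-review row for 001, its new open row, the untouched 002 row, and the new 005 row, matching the December 21 zipper table shown earlier.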
Code

Create the lalian database and the product table in MySQL as the raw data layer:

```sql
-- create the database
create database if not exists lalian;
-- create the product table
create table if not exists `lalian`.`t_product2`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50)    -- product modification time
) default character set = 'utf8';
```
Create ods and dw layers in MySQL to simulate the warehouse:

```sql
-- ods product table
create table if not exists `lalian`.`ods_t_product2`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50),   -- product modification time
  cdat varchar(10)          -- simulates a hive partition
) default character set = 'utf8';
-- dw product table
create table if not exists `lalian`.`dw_t_product2`(
  goods_id varchar(50),     -- product ID
  goods_status varchar(50), -- product status
  createtime varchar(50),   -- product creation time
  modifytime varchar(50),   -- product modification time
  dw_start_date varchar(12),-- effective date
  dw_end_date varchar(12),  -- expiration date
  cdat varchar(10)          -- simulates a hive partition
) default character set = 'utf8';
```
Full import: December 20, 2019

Load the December 20 data (4 rows) into the raw layer:

```sql
insert into `lalian`.`t_product2`(goods_id, goods_status, createtime, modifytime) values
('001', 'Pending review', '2019-12-18', '2019-12-20'),
('002', 'For sale',       '2019-12-19', '2019-12-20'),
('003', 'On sale',        '2019-12-20', '2019-12-20'),
('004', 'Deleted',        '2019-12-15', '2019-12-20');
```

Load the data into the ods layer of the warehouse:

```sql
insert into lalian.ods_t_product2
select *, '20191220' from lalian.t_product2 where modifytime >= '2019-12-20';
```

Load the data from ods into dw, initializing the zipper columns:

```sql
insert into lalian.dw_t_product2
select goods_id, goods_status, createtime, modifytime, modifytime, '9999-12-31', cdat
from lalian.ods_t_product2 where cdat = '20191220';
```
Incremental import: December 21, 2019

Apply the December 21 changes to the raw layer:

```sql
UPDATE `lalian`.`t_product2` SET goods_status = 'For sale', modifytime = '2019-12-21' WHERE goods_id = '001';
INSERT INTO `lalian`.`t_product2`(goods_id, goods_status, createtime, modifytime) VALUES
('005', 'Pending review', '2019-12-21', '2019-12-21'),
('006', 'Pending review', '2019-12-21', '2019-12-21');
```

Synchronize only the day's modified rows from the raw layer to the ods layer:

```sql
insert into lalian.ods_t_product2
select *, '20191221' from lalian.t_product2 where modifytime >= '2019-12-21';
```

From ods to dw, recompute dw_end_date:

```sql
select t1.goods_id, t1.goods_status, t1.createtime, t1.modifytime,
       t1.dw_start_date,
       case when (t2.goods_id is not null and t1.dw_end_date > '2019-12-21')
            then '2019-12-21' else t1.dw_end_date end as dw_end_date,
       t1.cdat
from lalian.dw_t_product2 t1
left join (select * from lalian.ods_t_product2 where cdat = '20191221') t2
  on t1.goods_id = t2.goods_id
union
select goods_id, goods_status, createtime, modifytime, modifytime, '9999-12-31', cdat
from lalian.ods_t_product2 where cdat = '20191221';
```

The execution result is as follows:
| goods_id | goods_status | createtime | modifytime | dw_start_date | dw_end_date | cdat |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Pending review | 2019-12-18 | 2019-12-20 | 2019-12-20 | 2019-12-21 | 20191220 |
| 2 | For sale | 2019-12-19 | 2019-12-20 | 2019-12-20 | 9999-12-31 | 20191220 |
| 3 | On sale | 2019-12-20 | 2019-12-20 | 2019-12-20 | 9999-12-31 | 20191220 |
| 4 | Deleted | 2019-12-15 | 2019-12-20 | 2019-12-20 | 9999-12-31 | 20191220 |
| 1 | For sale | 2019-12-18 | 2019-12-21 | 2019-12-21 | 9999-12-31 | 20191221 |
| 5 | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 9999-12-31 | 20191221 |
| 6 | Pending review | 2019-12-21 | 2019-12-21 | 2019-12-21 | 9999-12-31 | 20191221 |
A zipper history table both captures the historical states of the data and saves storage to the greatest extent. When building a zipper table, its granularity has to be decided. For example, a zipper table may record only one state per day: if a record's state changes three times in one day, only the last state is kept. A table at this day granularity already solves most problems.
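The day granularity described above amounts to a pre-processing step before the zipper merge: collapse each key's intra-day changes down to its final state. The function name and event tuples below are illustrative, not from the article.

```python
def last_state_per_day(events):
    """events: iterable of (day, key, state) tuples in arrival order.
    Keep only the final state of each key on each day."""
    result = {}
    for day, key, state in events:  # later events overwrite earlier ones
        result[(day, key)] = state
    return result
```

Only the surviving (day, key) states are then fed into the daily zipper merge.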
The above is the detailed content of Let's talk about database zipper tables. For more information, please follow other related articles on the PHP Chinese website!