Only when the data model organizes and stores the data in an orderly manner can big data achieve high performance. , low cost, high efficiency and high quality use.
1) Clear data structure: Each data segment Each layer has its scope, so that we can locate and understand it more easily when using the table.
Structural data relationships: There are complex data relationships between source systems. For example, customer information exists in the core system, credit system, financial management system, and capital system at the same time. How should we make decisions when taking numbers? The data warehouse will conduct unified modeling of data on the same topic and sort out complex data relationships into clear data models, which can avoid the above problems when used.
2) Data lineage tracking: In simple terms, it can be understood like this. What we ultimately give business integrity is There is only one business table that can be used directly, but it comes from many sources. If there is a problem with one of the source tables, we hope to be able to quickly and accurately locate the problem and understand its scope of harm.
3) Data reuse to reduce repeated development: Standardize data layering and develop some common middleware Layer data can reduce a lot of repeated calculations. Based on the layer-by-layer processing principle of data, the lower layer contains all the data required for upper-layer data processing. This processing method avoids each data developer from re-extracting data from the source system for processing. Through the introduction of the summary layer, repeated calculations of downstream user logic are avoided, saving users' development time and energy, and also saving calculation and storage. It greatly reduces unnecessary data redundancy, enables reuse of calculation results, and greatly reduces storage and computing costs.
#4) Simplify complex problems. Decompose a complex task into multiple steps to complete. Each layer only handles a single step, which is relatively simple and easy to understand. It is also easy to maintain the accuracy of the data. When there is a problem with the data, you don't need to repair all the data. You only need to start repairing it from the problematic step.
#5) Shield the (influence) of the original data and shield the impact on the business. When the business or system changes, there is no need to change the business once before re-accessing the data. Improve data stability and continuity.
Shield the complexity of the source business system: the source system may be extremely complex, and table naming, field naming, field meanings, etc. may be diverse and standardized through the DW layer And shield all these complexities to ensure the convenience and standardization of data use by downstream data users. If the source system business changes, the relevant changes will be processed by the DW layer, which is transparent to downstream users without changing the code and logic of downstream users.
Maintainability of data warehouse: The layered design allows problems on a certain layer to be solved only at that layer, without changing the code and code of the next layer. logic.
Big data systems require data model approaches to help better organize and store data to achieve the best balance between performance, cost, efficiency and quality !
ETL (extraction transformation loading) is responsible for transforming distributed and heterogeneous data sources. After the data is extracted to the temporary middle layer, it is cleaned, transformed, integrated, and finally loaded into the data warehouse or data mart. ETL is the core and soul of implementing a data warehouse. The design and implementation of ETL rules account for about 60% to 80% of the entire data warehouse construction workload.
1) Data extractionIncludes initial data loading and data refresh: initial data loading mainly focuses on how to establish dimension tables and fact tables , and put the corresponding data into these data tables; data refresh focuses on how to append and update the corresponding data in the data warehouse when the source data changes (for example, you can create scheduled tasks, or triggers form for regular data refresh).
2) Data cleaning is mainly aimed at ambiguities, duplications, and incompleteness that appear in the source database , data that violates business or logic rules and other issues are processed uniformly. That is, cleaning out data that is not in line with the business or is useless. For example, you can write hive or MR to clean the data whose length does not meet the requirements.
3) Data transformation (transformation) is mainly to convert the cleaned data into a data warehouse Required data: The data dictionary or data format of the same data field from different source systems may be different (for example, it is called id in table A and ids in table B). A unified data dictionary needs to be provided for them in the data warehouse. and format, to normalize the data content; on the other hand, the contents of some fields required by the data warehouse may not be available in the source system, but need to be determined based on the contents of multiple fields in the source system.
4) Data loading is to import the last processed data into the corresponding storage space (hbase, mysql, etc.) to facilitate the provision of data to the data mart for visualization.
Generally, large companies have their own encapsulated data platform and task scheduling platform for data security and operation convenience. The bottom layer encapsulates big data clusters such as hadoop clusters. Spark clusters, sqoop, hive, zookeeper, hbase, etc. only provide web interfaces, and provide different permissions to different employees, and then perform different operations and calls on the cluster. Taking the data warehouse as an example, the data warehouse is divided into several logical levels. In this way, for different levels of data operations, different levels of tasks can be created and executed in different levels of task flows (a cluster in a large company usually has thousands, or even tens of thousands, of scheduled tasks waiting to be executed every day, so the divisions are different. Hierarchical task flow, tasks at different levels are executed in corresponding task flows, which will make management and maintenance more convenient).
The internal division of the data warehouse layer is not layering for the sake of layering, layering is to solve ETL tasks And various issues such as the organization of workflow, the flow of data, the control of read and write permissions, and the satisfaction of different needs.
The more common practice in the industry is to divide the entire data warehouse layer into many layers such as DWD, DWT, DWS, DIM, DM, etc. However, we have never been able to tell what the clear boundaries between these layers are, or we can clearly explain the boundaries between them. However, the complex business scenarios prevent us from actually implementing them.
So generally speaking, three layers are the most basic for data stratification. As for how to divide the DW layer, it is based on the specific business needs and company Define the scene yourself.
##The data center contains a lot of content. If it corresponds to specific work, it can include the following content:
Data warehouse standards can be divided into four layer. However, please note that this division and naming is not unique. Generally, data warehouses have four levels, but different companies may have different names. But the core concepts all come from the four-layer data model.
##01 Paste source layer (ODS, Operational Data Store)Data introduction layer (ODS, Operational Data Store, also known as data base layer): The original data is almost unprocessed It is stored in the data warehouse system and is basically consistent in structure with the source system. It is the data preparation area of the data warehouse. The main responsibility of this layer is to synchronize and store basic data.
# Generally speaking, the data in the ODS layer and the data in the source system are isomorphic, and the main purpose is to simplify subsequent data processing. In terms of data granularity, the data granularity of the ODS layer is fine. Tables in the ODS layer usually include two types, one used to store the current data that needs to be loaded, and the other used to store the historical data after processing. Historical data is generally stored for 3-6 months and then needs to be cleared to save space. However, different projects should be treated differently. If the amount of data in the source system is not large, it can be retained for a longer period of time or even saved in full.
Note: At this layer, it should not be simple data access, but should consider certain data cleaning, such as the processing of abnormal fields and standardization of field naming. , unification of time fields, etc. Generally these are easily overlooked, but they are crucial. It will be very useful especially when we do automatic generation of various features later.
Note: Some companies’ ODS layer will not do much data filtering and will be placed in the DWD layer for processing. Some companies will do relatively refined filtering of data at the ODS layer from the beginning. This is not clearly defined and depends on each company's own ideas and technical specifications.
#Generally when enterprises develop, they will do some basic processing when storing raw data into ODS.
Data source distinction
Data is stored in partitions based on time, usually by day. There are also companies that use three-level partitions of year, month, and day for storage.
# Perform the most basic data processing, such as discarding format errors, filtering out key information loss, etc.
Data real-time offline 1) Main source of data: 2) Data storage strategy (incremental, full) In actual application , you can choose to use incremental, full storage or zipper storage. In order to satisfy historical data analysis If required, you can add the time dimension as a partition field in the ODS layer table. Incremental storage in days, with business dates as partitions, and each partition stores daily incremental business data. For example: ##On January 1, user A visited Company A E-commerce store B and company A's e-commerce log generate a record t1. On January 2, user A visited company A's e-commerce store C again, and company A's e-commerce log generated a record t2. Using incremental storage, t1 will be stored in the partition on January 1st, and t2 will be stored in the partition on January 2nd. #On January 1, user A purchased product B on company A’s e-commerce website, and the transaction log will generate a record t1. On January 2, user A returns product B again, and the transaction log will update the t1 record. Using incremental storage, the initial purchased t1 record will be stored in the partition on January 1, and the updated t1 will be stored in the partition on January 2 in this partition. # ODS tables with strong transaction nature such as transactions and logs are suitable for incremental storage. This type of table has a large amount of data, and the storage cost of using full storage is high. In addition, downstream applications of such tables have less demand for full historical data access (such demands can be obtained through subsequent aggregation by the data warehouse). For example, a log ODS table does not have a data update business process, so all incremental partitions UNION together to form a full set of data. Full storage in days Storage uses business dates as partitions, and each partition stores the full amount of business data up to the business date. For example, on January 1st, seller A released two products, B and C, on company A’s e-commerce website. The front-end product table will generate two records. t1, t2. On January 2, seller A removed product B from the shelves and released product D at the same time. The front-end product table will update record t1 and create a new record t3. Using the full storage method, two records of t1 and t2 are stored in the partition on January 1, and the updated t1, t2, and t3 records are stored in the partition on January 2. For slowly changing dimensional data with small amounts of data, such as product categories, full storage can be used directly. Zip storage by adding two A timestamp field (start_dt and end_dt) is used to record all change data with daily granularity. Usually the partition field is also these two timestamp fields. Scheme Concept: Also called the interface layer (stage), used for Store daily incremental data and change data Data generation method: directly receive source data from kafka, which requires business tables to generate update, delete, and insert data every day. Only the business table for insert data is generated, and the data is directly entered into the detail layer. Discussion plan: Only put the canal log directly into the buffer layer. If there are other businesses with zipper data, also put it into the buffer layer. Log storage method: using impala appearance and parquet file format to facilitate reading of data that requires MR processing. #Log deletion method: long-term storage, only the data of the last few days can be stored. Discussion plan: direct long-term storage. Table schema: Generally, partitions are created by day, and partitioned by is generally stored by day. #Library and table naming. Library name: ods, table name: The initial consideration format is ods date business table name, to be determined. #hive’s external table corresponds to the business table. hive external table, the file storing data may not be in the default location of hive's hdfs, and when the hive corresponding table is deleted, the corresponding data file is not will be deleted. In this way, for enterprise development, it can prevent valuable data from being deleted from the hive business table due to table deletion operations. On the contrary, the data files are stored in the default location corresponding to hive. When the table is deleted, the corresponding file will also be deleted. will be deleted. Data warehouse layer (DW) layer: data warehouse The layer is the core design layer when we build a data warehouse. This layer will build various data models based on themes from the data obtained from the ODS layer. Each theme corresponds to a macro analysis area. The data warehouse layer excludes useless for decision-making. of data, providing a concise view of a specific topic. All historical data in the BI system will be saved at the DW layer, such as 10 years of data. DW stores detailed fact data, dimension table data and public indicator summary data. Among them, detailed fact data and dimension table data are generally generated based on ODS layer data processing. Public indicator summary data is generally generated based on dimension table data and detailed fact data. The DW layer is further subdivided into the dimension layer (DIM), detailed data layer (DWD) and summary data layer (DWS), using the dimensional model method as the theory Basically, you can define the relationship between the primary key of the dimensional model and the foreign key in the fact model, reducing data redundancy and improving the ease of use of the detailed data table. In the summary data layer, dimensions in the reused statistical granularity can also be associated, and more wide-table methods can be used to build the public indicator data layer to improve the reusability of public indicators and reduce repeated processing. Dimension layer (DIM, Dimension): Dimension is used as the modeling driver, and based on the business meaning of each dimension, it is defined by adding dimension attributes, associated dimensions, etc. Calculate logic, complete the process of attribute definition and establish consistent data analysis dimension tables. In order to avoid redundantly associated dimension attributes in the dimensional model, a dimension table is built based on the snowflake model. Detailed data layer (DWD, Data Warehouse Detail): Taking business process as the modeling driver, based on the characteristics of each specific business process, the finest granularity is constructed Detailed fact table. Some important attribute fields can be made appropriately redundant, that is, wide tables can be processed. Summary data layer (DWS, Data Warehouse Summary): Using the analyzed subject object as the modeling driver, based on the indicator requirements of upper-layer applications and products, a public-granular summary indicator table is constructed. Use wide table methods to physicalize the model, build statistical indicators with naming standards and consistent calibers, provide public indicators for the upper layer, and establish summary wide tables and detailed fact tables. Theme domain: Business process-oriented, abstract collection of business activity events, such as order placement, payment, and refund are all business processes. Theme partitioning for the common level of detail (DWD). #Data domain: For business analysis, it is an abstract collection of business processes or dimensions. Data domain partitioning for the common summary layer (DWS). The DWD layer is driven by business processes. The DWS layer, DWT layer and ADS layer are requirements driven. DWD: data warehouse details data detail layer. Mainly perform some data cleaning and standardization operations on the ODS data layer. Data cleaning: remove null values, dirty data, enumeration value conversion, and those that exceed the limit range. DWB: data warehouse base data base layer, which stores objective data. It is generally used as an intermediate layer and can be considered as a data layer for a large number of indicators. DWS: data warehouse service data service layer, based on the basic data on DWB, integrated and summarized into a service data layer that analyzes a certain subject area, usually wide surface. Used to provide subsequent business queries, OLAP analysis, data distribution, etc. User behavior, light aggregation Mainly for Do some light summarization of ODS/DWD layer data. 1) Public dimension layer (DIM, Dimension) DIM: This The layer is relatively simple. For example, you can understand that information such as country code and country name, geographical location, Chinese name, and flag picture are stored in the DIM layer. # Based on the concept of dimensional modeling, establish consistent dimensions for the entire enterprise. Reduce the risk of inconsistent data calculation calibers and algorithms. The public dimension summary layer (DIM) is mainly composed of dimension tables (dimension tables). Dimension is a logical concept, a perspective from which to measure and observe business. Dimension tables are tables that physicalize tables built on the data platform based on dimensions and their attributes, and adopt the principle of wide table design. Therefore, building a common dimension summary layer (DIM) first requires defining the dimensions. High cardinality dimensional data: generally a data table similar to a user data table or a product data table. The amount of data may be tens of millions or hundreds of millions. Low cardinality dimension data: usually a configuration table, such as the Chinese meaning corresponding to the enumeration value, or the date dimension table. The amount of data may be single digits or tens of thousands. Design dimension table: ##After completing the dimension definition, you can Supplement it and generate the dimension table. The design of dimension tables requires attention: It is recommended that the dimension form table information should not exceed 10 million pieces. When joining a dimension table with other tables, it is recommended that you use Map Join ## Avoid updating dimension table data too frequently. Slowly changing dimensions: zipper table Common Dimension Summary Layer (DIM) Dimension Table Specification Public dimension summary layer (DIM) dimension table naming specification: dim_{Business segment name/pub}_{Dimension definition}[_{Custom naming label}], the so-called pub is related to the specific business segment Dimensions that are irrelevant or common to all business sectors, such as the time dimension. For example: public area dimension table dim_pub_area product dimension table dim_asale_itm ##The level of business details expressed by a record in the fact table is called granularity. Generally, granularity can be expressed in two ways: one is the level of detail represented by the combination of dimension attributes, and the other is the specific business meaning represented. Transparent! Common modeling methods and practical examples in the data warehouse field. Modeling methods and principles ##Need to build a dimensional model, generally using a star schema Model, the presented state is generally a constellation model (composed of multiple fact tables, the dimension table is public and can be shared by multiple fact tables); In order to support data re-run, an additional data business date field can be added, tables can be divided by day, and the incremental ODS layer data and the DWD related table of the previous day can be used for merge processing? #The granularity is that one line of information represents one behavior, such as placing an order. Dimensional modeling steps Select business process: In the business system, select Interested business lines, such as ordering business, payment business, refund business, logistics business, one business line corresponds to one fact table. If it is a small or medium-sized company, try to select all business processes. If DWD is a large company (more than 1,000 tables), choose the business line related to the needs. Declaration granularity: Data granularity refers to the level of refinement or comprehensiveness of the data stored in the data warehouse. Declaring granularity means precisely defining what a row of data in the fact table represents. You should choose the smallest granularity possible to meet a variety of needs. A typical granularity statement is as follows: each item in the order is treated as a row in the order fact table, and the granularity is each time. The number of orders per week is given as a row, with the granularity being weekly. The number of orders per month is presented as a row, with the granularity being monthly. If the granularity at the DWD layer is weekly or monthly, then there will be no way to count fine-grained indicators later. So it is recommended to use the smallest granularity. Determine dimensions: The main function of dimensions is to describe business facts, which mainly represent information such as "who, where, when". The principle for determining dimensions is: whether indicators of relevant dimensions should be analyzed in subsequent requirements. For example, statistics are needed to determine when more orders were placed, which region placed more orders, and which user placed more orders. The dimensions that need to be determined include: time dimension, regional dimension, and user dimension. Dimension table: Dimension degradation needs to be performed according to the star schema principle in dimensional modeling. Determine the facts: The word "fact" here refers to the measurement value in the business (number of times, number of pieces, number of pieces, amount, etc.). Accumulation), such as order amount, number of orders placed, etc. In the DWD layer, the business process is used as the modeling driver, and the most fine-grained detail layer fact table is constructed based on the characteristics of each specific business process. The fact table can be appropriately widened. Note: The DWD layer is driven by business processes. The DWS layer, DWT layer and ADS layer are all demand-driven and have nothing to do with dimensional modeling. DWS and DWT both build wide tables and build tables according to themes. The theme is equivalent to the perspective from which the problem is viewed. Corresponds to the dimension table. About the topic: The data in the data warehouse is organized by topic. A theme is an abstract concept used to synthesize, classify, analyze and utilize data in enterprise information systems at a high level. Each theme basically corresponds to a macro analysis field. For example, financial analysis is an analysis field, so the theme of this data warehouse application is "financial analysis". About the subject domain: The subject domain is usually a closely related data topic collection. These data topics can be divided into different subject areas based on business concerns (that is, the boundaries of topics determined after analyzing a topic) About the division of subject areas: The determination of subject areas must be completed jointly by end users (business) and data warehouse designers. When dividing subject areas, different entry points may cause some debates and reconstructions. etc., the points to consider may be some of the following aspects: In short, if the starting point logic is different, there can be different division logics. An iterative approach can be adopted during the construction process. Instead of focusing on completing the abstraction of all topics at once, you can start with clearly defined topics and then gradually summarize them into a standard model for your own industry. Topic: parties, marketing, finance, contract agreement, organization, address, channel, product, ##What are the financial business themes? They can be divided into four themes: 2) DWD (data warehouse detail) data detail layer, granular fact layer DWD is the isolation layer between the business layer and the data warehouse. This layer mainly solves some data quality issues and data integrity issues. The detailed table is used to store detailed data converted from the original table of the ODS layer. The data of the DWD layer should be consistent, accurate, and clean data, that is, Clean the source system data ODS layer data (remove null values, dirty data, data exceeding the limit range, change row storage to column storage, change compression format), normalize, dimension degradation, desensitization and other operations. For example, user data information comes from many different tables, and problems such as delayed data loss often occur. In order to facilitate each user to better use the data, we can create a shield at this layer. This layer also contains unified dimensional data. Detailed-grained fact layer (DWD): Taking the business process as the modeling driver, building the finest-grained detail layer based on the characteristics of each specific business process Fact table. Combining the data usage characteristics of the enterprise, some important dimension attribute fields of the detailed fact table can be made appropriately redundant, that is, wide table processing. Tables at the fine-grained fact level are often also called logical fact tables. Responsible for the most granular data of the data, lightly summarized based on the DWD layer, combined with common dimensions (time, location, organizational level, user , commodities, etc.) This layer generally maintains the same data granularity as the ODS layer, and provides a certain data quality guarantee, and performs data processing on the basis of ODS. Processing to provide cleaner data. At the same time, in order to improve the usability of the data detail layer, this layer will adopt some dimension degradation techniques. When a dimension does not have any data required by the data warehouse, the dimension can be degraded and the dimension can be degraded to the fact table, reducing the number of fact tables and Dimension table associations. For example: Order id, there is no need to use a dimension table for such a large dimension. Storage, and the order ID is generally very important when we perform data analysis, so we redundant the order ID in the fact table. This dimension is a degenerate dimension. #The data at this layer generally follows the third normal form or dimensional modeling of the database, and its data granularity is usually the same as that of ODS. All historical data in the BI system will be saved at the PDW layer, such as 10 years of data. The following work needs to be done before the data is loaded into this layer: denoising, deduplication, extraction, business extraction, unit unification, field cutting, business identification . Types of data to be cleaned: The task of data cleaning is to filter the data that does not meet the requirements, and submit the filtering results to the business department to confirm whether it is filtered out or corrected by the business unit before extraction. #What does the DWD layer do? ①Data cleaning and filtering ##Remove abandoned fields and information with incorrect format Remove information that has lost key fields Filter meaningless data in core fields , for example, the order id in the order table is null, and the payment id in the payment table is empty Desensitize sensitive data such as mobile phone numbers and ID numbers Remove data that does not contain time information (this depends on the specific business of the company, but generally the data will have a timestamp, which facilitates subsequent processing of information in the time dimension) Analysis, processing and extraction) Some companies will also flatten the data at this layer, but this depends on the business needs. This is because kylin is suitable for processing The flattened data is not suitable for processing nested table data information Some companies will also cut the data session, which is usually the log data of the app. Other business scenarios may not be suitable. This is because the app enters the background mode. For example, the user opens the app for 10 minutes in the morning, then the app goes to the background and is opened again at night. At this time, the session is still one, and it should actually be cut. ( There are also companies that record the records of the app entering the background and entering the front desk again, so as to perform session cutting) ②Data mapping and conversion Convert GPS latitude and longitude into detailed addresses of provinces and cities. Common GPS quick queries in the industry generally use geohash mapping in the geographical location knowledge base, and then convert the GPS that needs to be compared into geohash and then compare the geohash in the knowledge base to find out the geographical location information. Of course, there are also companies that use open api, such as Amap. , Baidu Map’s API maps GPS and geographical location information, but this requires money to reach a certain number of times, so everyone knows that will also use the IP address Convert to detailed address of province, city and city. There are many fast search libraries, but the basic principle is binary search, because the ip address can be converted into a long integer. A typical example is the ip2region library which converts time Standardize the data for year, month, day or even week and quarter dimension information , because the data processed by big data may come from different departments of the company, different projects, and different On the client side, the same business data fields, data types, null values, etc. may be different at this time. At this time, the DWD layer needs to be smoothed. Otherwise, it will cause a lot of trouble during subsequent processing. Such as boolean, some use 0 1 identification, and some use true false identification Such as string null value, Some use "", some use null, just use null For example, the date format is more different and needs to be determined based on actual business data, but it is generally formatted into standard formats such as YYYY-MM-dd HH:mm:ss # Dimension degradation: Perform dimension degradation and dimensionality reduction on the tables transferred from business data. (Product level one, level two, level three, province, city, county, year, month, day) The redundant order ID is in the fact table How much data is reasonable to clean out: 1 Clean out 1 out of 10,000 pieces of data. Reasonable number of tables: 10,000 tables become 3,000 tables, 3,000 tables become 1,000 tables Detailed granular fact table design principles: Plan ##Discussion plan: The data synthesis method is: Full: Every day, the full data from the day before yesterday and yesterday’s new data at the detail level are combined into a new data table, overwriting the old table. At the same time, use historical mirroring to store a historical mirror in a new table by week/month/year. Log storage method: direct data uses impala outer surface and parquet file format. It is recommended to use internal tables. The following layers are all data generated from impala. It is recommended that all Use internal tables for static/dynamic partitioning. Table schema: Generally create partitions by day, and select partition fields according to specific business if there is no concept of time. partitioned by is generally stored according to days. #Library and table naming. Library name: dwd, table name: The initial consideration format is dwd date business table name, to be determined. Old data update method: directly overwrite Detailed granularity fact layer (DWD) Specification The naming specification is: dwd_{Business section/pub}_{Data domain abbreviation}_{Business process abbreviation}[_{Custom table naming label Abbreviation}] _{Single partition incremental full identifier}, pub indicates that the data includes data from multiple business sectors. The incremental and full identifier of a single partition is usually: i represents the increment and f represents the full amount. ##For example: dwd_asale_trd_ordcrt_trip_di (A e-commerce company air travel ticket order order fact table, daily refresh increment) dwd_asale_itm_item_df (A e-commerce product snapshot fact table, The full amount is refreshed every day). In this tutorial, the DWD layer is mainly composed of three tables: 3) DWS (data warehouse service) data service layer, summary layer wide table ##Based on the DWD detailed data layer, we will organize our data according to some analysis scenarios, analysis entities, etc., and organize it into some summary data layer DWS by theme. Detailed granularity ==> Summary granularity DWS layer (data summary layer) Wide table, subject-oriented summary, has relatively few dimensions. DWS performs coarse-grained summary and aggregation by each dimension ID based on the basic data of the DWD layer, such as aggregation by transaction source and transaction type. Integrate and summarize service data for analysis of a certain subject area, usually in a wide table. #Based on DWD, light summary by day. Statistics of the daily behavior of each subject object (for example, purchase behavior, statistics of product repurchase rate). There will be relatively few data tables in this layer, and most of them are wide tables (one table will cover more business content, and there will be more fields in the table) ). According to the topics, such as orders, users, etc., a wide table with many fields is generated to provide subsequent business queries, OLAP analysis, data distribution, etc. Integrate multiple middle layer data to form fact tables based on topics, such as user fact tables, channel fact tables, terminal fact tables, asset fact tables, etc. Fact tables are generally wide tables, achieving enterprise-level data consistency at this layer. First divide the business subject into sales domain, inventory domain, customer domain, purchasing domain, etc. The second step is to determine the fact table for each subject domain. and dimension tables. Usually based on business needs, it is divided into traffic, orders, users, etc., and a wide table with many fields is generated to provide subsequent business queries, OLAP analysis, data distribution, etc. The total sales of a certain category (for example: kitchenware) in each province in the last day, the names of the top 10 sales products in the category, and the distribution of user purchasing power in each province. Therefore, we can summarize the data of the last day from the perspective of the final successful transaction of goods, categories, buyers, etc. #For example, the number of products purchased by users at different login IPs in each time period, etc. A light aggregation here will make the calculation more efficient. On this basis, it will be much faster if the behavior of only 7 days, 30 days, and 90 days is calculated. We hope that 80% of business can be calculated through our DWS layer instead of ODS. #What does the DWS layer do? dws summarizes the data in the dwd layer by theme and puts it into a table according to the theme, For example, under the user theme, user registration information, user shipping address, and user credit data will be placed in the same table, and these correspond to multiple tables in the dwd layer, divided according to business, such as traffic, orders , users, etc., generate wide tables with many fields Theme modeling, perform data modeling around a certain business theme, and extract relevant data . For example: ##①DWS layer 1-3 wide tables per topic (handling more than 70% of the requirements for 100-200 indicators) Specific wide table names: user behavior wide table, user purchase product detail behavior wide table, product wide table, logistics wide table, after-sales, etc. #②Which wide table is the widest? Approximately how many fields are there? #The widest is the user behavior wide table. There are about 60-200 fields ③Specific user behavior wide table field names Comment, reward, collect, follow--product, follow--people, like, share, good price news, article release, active, sign-in, re-sign card, lucky house, gift, gold coin, e-commerce click, gmv ④Analyzed indicators Daily active, monthly active, weekly active, retention , retention rate, new additions (day, week, year), conversion rate, churn, return, 3 consecutive days of login within seven days (likes, collections, comments, purchases, additional purchases, orders, activities), 3 consecutive weeks ( Month) login, GMV (transaction amount, order placement), repurchase rate, repurchase rate ranking, likes, comments, collections, number of people receiving discounts, using discounts, silence, whether it is worth buying, number of refunds, refunds Style rate topn Popular items Daily activity: 1 million; monthly active users: 2-3 times the daily active users, 3 million How many total registered users are there? Between 10 million and 30 million GMV: 100,000 orders per day (50 – 100 yuan) 5 million-10 million With a daily activity of 1 million, there are about 100,000 people purchasing every day, with an average consumption of 100 yuan per person, and a daily GMV of 10 million. 10 %-20% 1 million-2 million Repurchase of certain daily commodities; (toilet paper, facial mask, toothpaste) 10%-20% Computers, monitors, watches 1% ##Product details=》Add to cart=》 Place an order=》Pay ##5%-10% 60-70% 90%-95% Activity: 10-20% ##Plan: Concept: Also known as data mart or wide table. According to business divisions, such as traffic, orders, users, etc., a wide table with many fields is generated to provide subsequent business queries, OLAP analysis, data distribution, etc. #Data generation method: generated by light summary layer and detailed layer data calculation. #Log storage method: using impala internal table, parquet file format. Table schema: Generally create partitions by day, and select partition fields according to specific business if there is no concept of time. #Library and table naming. Library name: dws, table name: The initial consideration format is: dws date business table name, to be determined. Old data update method: direct overwrite Public summary fact table specification 公共汇总事实表命名规范:dws_{业务板块缩写/pub}_{数据域缩写}_{数据粒度缩写}[_{自定义表命名标签缩写}]_{统计时间周期范围缩写}。关于统计实际周期范围缩写,缺省情况下,离线计算应该包括最近一天(_1d),最近N天(_nd)和历史截至当天(_td)三个表。如果出现_nd的表字段过多需要拆分时,只允许以一个统计周期单元作为原子拆分。即一个统计周期拆分一个表,例如最近7天(_1w)拆分一个表。不允许拆分出来的一个表存储多个统计周期。 对于小时表(无论是天刷新还是小时刷新),都用_hh来表示。对于分钟表(无论是天刷新还是小时刷新),都用_mm来表示。 举例如下: dws_asale_trd_byr_subpay_1d(买家粒度交易分阶段付款一日汇总事实表) dws_asale_trd_byr_subpay_td(买家粒度分阶段付款截至当日汇总表) dws_asale_trd_byr_cod_nd(买家粒度货到付款交易汇总事实表) dws_asale_itm_slr_td(卖家粒度商品截至当日存量汇总表) dws_asale_itm_slr_hh(卖家粒度商品小时汇总表)---维度为小时 dws_asale_itm_slr_mm(卖家粒度商品分钟汇总表)---维度为分钟 问:数据集市层是不是没地方放了,各个业务的数据集市表是应该在 dws 还是在 app? 答:这个问题不太好回答,我感觉主要就是明确一下数据集市层是干什么的,如果你的数据集市层放的就是一些可以供业务方使用的宽表,放在 app 层就行。如果你说的数据集市层是一个比较泛一点的概念,那么其实 dws、dwd、app 这些合起来都算是数据集市的内容。 数据应用层(ADS,Application Data Store):存放数据产品个性化的统计指标数据,报表数据。主要是提供给数据产品和数据分析使用的数据,通常根据业务需求,划分成流量、订单、用户等,生成字段比较多的宽表,用于提供后续的业务查询,OLAP分析,数据分发等。从数据粒度来说,这层的数据是汇总级的数据,也包括部分明细数据。从数据的时间跨度来说,通常是DW层的一部分,主要的目的是为了满足用户分析的需求,而从分析的角度来说,用户通常只需要分析近几年的即可。从数据的广度来说,仍然覆盖了所有业务数据。 在 DWS 之上,我们会面向应用场景去做一些更贴近应用的 APP 应用数据层,这些数据应该是高度汇总的,并且能够直接导入到我们的应用服务去使用。 应用层(ADS):应用层主要是各个业务方或者部门基于DWD和DWS建立的数据集市(Data Market, DM),一般来说应用层的数据来源于DW层,而且相对于DW层,应用层只包含部门或者业务方面自己关心的明细层和汇总层的数据。 该层主要是提供数据产品和数据分析使用的数据。一般就直接对接OLAP分析,或者业务层数据调用接口了 数据应用层APP:面向业务定制的应用数据主要提供给数据铲平和数据分析使用的数据,一般会放在ES,MYSQL,Oracle,Redis等系统供线上系统使用,也可以放在Hive 或者 Druid 中供数据分析和数据挖掘使用。 APP 层:为应用层,这层数据是完全为了满足具体的分析需求而构建的数据,也是星形或雪花结构的数据。如我们经常说的报表数据,或者说那种大宽表,一般就放在这里。包括前端报表、分析图表、KPI、仪表盘、OLAP、专题等分析,面向最终结果用户; 概念:应用层是根据业务需要,由前面三层数据统计而出的结果,可以直接提供查询展现,或导入至Mysql中使用。 数据生成方式:由明细层、轻度汇总层,数据集市层生成,一般要求数据主要来源于集市层。 日志存储方式:使用impala内表,parquet文件格式。 表schema:一般按天创建分区,没有时间概念的按具体业务选择分区字段。 库与表命名。库名:暂定ads,另外根据业务不同,不限定一定要一个库。 旧数据更新方式:直接覆盖。 ADS 层复购率统计 ①如何分析用户活跃? 在启动日志中统计不同设备 id 出现次数。 ②如何分析用户新增? 用活跃用户表 left join 用户新增表,用户新增表中 mid 为空的即为用户新增。 ③如何分析用户 1 天留存? 留存用户=前一天新增 join 今天活跃 用户留存率=留存用户/前一天新增 ④如何分析沉默用户? (登录时间为 7 天前,且只出现过一次) 按照设备 id 对日活表分组,登录次数为 1,且是在一周前登录。 ⑤如何分析本周回流用户? 本周活跃 left join 本周新增 left join 上周活跃,且本周新增 id 和上周活跃 id 都为 null。 ⑥如何分析流失用户? (登录时间为 7 天前) 按照设备 id 对日活表分组,且七天内没有登录过。 ⑦如何分析最近连续 3 周活跃用户数? 按照设备 id 对周活进行分组,统计次数大于 3 次。 ⑧How to analyze the number of active users for three consecutive days in the last seven days? 7 days of continuous collection, likes, purchases, additional purchases, payments, browsing, product clicks, returns 1 month for 7 consecutive days Two consecutive weeks TMP: Every Each level of calculation will have many temporary tables, and a DW TMP layer is specially designed to store the temporary tables of our data warehouse. The above is the detailed content of From ODS to ADS, a detailed explanation of data warehouse stratification!. For more information, please follow other related articles on the PHP Chinese website!
02 Data warehouse layer (DW, data warehouse)
##Trading product information fact table: dwd_asale_trd_itm_di.
CREATE TABLE IF NOT EXISTS dwd_asale_trd_itm_di
(
item_id BIGINT COMMENT '商品ID',
item_title STRING COMMENT '商品名称',
item_price DOUBLE COMMENT '商品价格',
item_stuff_status BIGINT COMMENT '商品新旧程度_0全新1闲置2二手',
item_prov STRING COMMENT '商品省份',
item_city STRING COMMENT '商品城市',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
commodity_id BIGINT COMMENT '品类ID',
commodity_name STRING COMMENT '品类名称',
buyer_id BIGINT COMMENT '买家ID',
)
COMMENT '交易商品信息事实表'
PARTITIONED BY (ds STRING COMMENT '日期')
LIFECYCLE 400;
##1/2/3, weekly retention, monthly retentiondrop table
if exists dws_sale_detail_daycount;
create external table dws_sale_detail_daycount(
user_id string comment '用户 id',
--用户信息
user_gender string comment '用户性别',
user_age string comment '用户年龄',
user_level string comment '用户等级',
buyer_nick string comment '买家昵称',
mord_prov string comment '地址',
--下单数、 商品数量, 金额汇总
login_count bigint comment '当日登录次数',
cart_count bigint comment '加入购物车次数',
order_count bigint comment '当日下单次数',
order_amount decimal(16,2) comment '当日下单金额',
payment_count bigint comment '当日支付次数',
payment_amount decimal(16,2) comment '当日支付金额',
confirm_paid_amt_sum_1d double comment '最近一天订单已经确认收货的金额总和'
order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comment '下单明细统计'
) comment '每日购买行为'
partitioned by(`dt`
string)
stored as parquet
location '/warehouse/gmall/dws/dws_sale_detail_daycount/'
tblproperties("parquet.compression" = "lzo");
CREATE TABLE IF NOT EXISTS dws_asale_trd_itm_ord_1d
(
item_id BIGINT COMMENT '商品ID',
--商品信息,产品信息
item_title STRING COMMENT '商品名称',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
--mord_prov STRING COMMENT '收货人省份',
--商品售出金额汇总
confirm_paid_amt_sum_1d DOUBLE COMMENT '最近一天订单已经确认收货的金额总和'
)
COMMENT '商品粒度交易最近一天汇总事实表'
PARTITIONED BY (ds STRING COMMENT '分区字段YYYYMMDD')
LIFECYCLE 36000;
03 应用层(ADS)applicationData Service应用数据服务
CREATE TABLE app_usr_interact( user_id string COMMENT '用户id',
nickname string COMMENT '用户昵称',
register_date string COMMENT '注册日期',
register_from string COMMENT '注册来源',
remark string COMMENT '细分渠道',
province string COMMENT '注册省份',
pl_cnt bigint COMMENT '评论次数',
ds_cnt bigint COMMENT '打赏次数',
sc_add bigint COMMENT '添加收藏',
sc_cancel bigint COMMENT '取消收藏',
gzg_add bigint COMMENT '关注商品',
gzg_cancel bigint COMMENT '取消关注商品',
gzp_add bigint COMMENT '关注人',
gzp_cancel bigint COMMENT '取消关注人',
buzhi_cnt bigint COMMENT '点不值次数',
zhi_cnt bigint COMMENT '点值次数',
zan_cnt bigint COMMENT '点赞次数',
share_cnts bigint COMMENT '分享次数',
bl_cnt bigint COMMENT '爆料数',
fb_cnt bigint COMMENT '好价发布数',
online_cnt bigint COMMENT '活跃次数',
checkin_cnt bigint COMMENT '签到次数',
fix_checkin bigint COMMENT '补签次数',
house_point bigint COMMENT '幸运屋金币抽奖次数',
house_gold bigint COMMENT '幸运屋积分抽奖次数',
pack_cnt bigint COMMENT '礼品兑换次数',
gold_add bigint COMMENT '获取金币',
gold_cancel bigint COMMENT '支出金币',
surplus_gold bigint COMMENT '剩余金币',
event bigint COMMENT '电商点击次数',
gmv_amount bigint COMMENT 'gmv',
gmv_sales bigint COMMENT '订单数'
)
PARTITIONED BY( dt string)
--stat_dt
date COMMENT '互动日期',
04 Hierarchical calling specifications