The concept of quantile value
In statistics and data analysis, quantiles are used to describe how a data set is distributed. The most common choice is quartiles, which split the sorted data into four equal parts using three cut points: the first quartile (Q1), the second quartile (Q2, which is the median) and the third quartile (Q3). The interquartile range (IQR) is the difference Q3 - Q1. One quarter of the data is smaller than Q1, one quarter is larger than Q3, and the middle 50% lies between Q1 and Q3. More precisely, after a set of data is sorted in ascending order, Q1 is the value at the 25% position, Q2 is the value in the middle position, and Q3 is the value at the 75% position. In data analysis, quantiles help us understand the distribution of the data: whether it is skewed to one side and how dispersed it is. When the distribution is uneven, quantiles represent the differences in the data more accurately than the mean alone.
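To make the quartile definitions concrete, here is a minimal Python sketch using the standard library. The sample values are made up for illustration, and the "inclusive" interpolation method is an assumption: other quantile conventions can give slightly different cut points on the same data.

```python
import statistics

# Hypothetical sample data; any list of numbers works.
values = [2, 4, 4, 5, 6, 8, 8, 9, 10]

# n=4 asks for quartiles; the function returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range is the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 4.0 6.0 8.0 4.0
```

Q2 here is the same value that statistics.median(values) returns, which matches the statement that the second quartile is the median.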
The face values of coupons issued by merchants fall in the range [1, 20], and each coupon is marked with its face value. To control coupon costs accurately, we need to track coupon issuance in real time. By monitoring the number of coupons issued, the average amount issued, and the quantile values of the amount issued (that is, the average amount within different intervals), we get a much clearer picture of coupon issuance.
The business team has defined the following metrics and asked the data team to provide them. All metrics use one minute as the statistical granularity:
Issuance volume: the total number of coupons issued
Average coupon amount issued: total face value issued / total number of coupons issued
0.1 quantile mean of coupon amount issued: sort the coupons issued in each minute by face value, largest first, then take the average face value of the top 10% [for example, if the face values of the issued coupons are 10, 9, 8, 8, 6, 5, 4, 4, 2, 2, the 0.1 quantile mean is 10]
0.2 quantile mean of coupon amount issued: sort the coupons issued in each minute by face value, largest first, then take the average face value of the top 20% [for example, if the face values of the issued coupons are 10, 9, 8, 8, 6, 5, 4, 4, 2, 2, the 0.2 quantile mean is (10 + 9) / 2 = 9.5]
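The two quantile-mean definitions above can be checked with a short Python sketch. The function name top_fraction_mean is hypothetical, and rounding the row count half up is an assumption chosen to mirror how MySQL round() treats positive exact values.

```python
import math

def top_fraction_mean(face_values, fraction):
    """Mean of the largest `fraction` of the values (0.1 means the top 10%)."""
    ordered = sorted(face_values, reverse=True)  # largest face value first
    # Number of coupons to keep, rounded half up (assumed rounding rule).
    take = math.floor(len(ordered) * fraction + 0.5)
    if take == 0:
        return None  # too few coupons for this fraction to select any row
    top = ordered[:take]
    return sum(top) / len(top)

coupons = [10, 9, 8, 8, 6, 5, 4, 4, 2, 2]
print(top_fraction_mean(coupons, 0.1))  # 10.0
print(top_fraction_mean(coupons, 0.2))  # 9.5
```

Both results match the worked examples in the metric definitions.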
Metrics such as the issuance volume and the average coupon amount are straightforward to compute in MySQL. But how do we query the quantile values in MySQL?
Sorting in MySQL
row_number() over ( partition by a1.min order by metric_value desc) as orderNum
Here metric_value is the face value of an issued coupon. The window function above (window functions require MySQL 8.0 or later) numbers the coupons issued in each minute in descending order of face value, so orderNum is each coupon's rank within its minute.
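For readers less familiar with window functions, the following Python sketch imitates what row_number() with a per-minute partition produces. The function name rank_within_minute and the (minute, metric_value) tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

def rank_within_minute(rows):
    """rows: iterable of (minute, metric_value) pairs.

    Returns (minute, metric_value, order_num) tuples, where order_num restarts
    at 1 for every minute and follows descending metric_value.
    """
    by_minute = defaultdict(list)
    for minute, value in rows:
        by_minute[minute].append(value)  # "partition by min"

    ranked = []
    for minute in sorted(by_minute):
        ordered = sorted(by_minute[minute], reverse=True)  # "order by ... desc"
        for order_num, value in enumerate(ordered, start=1):
            ranked.append((minute, value, order_num))
    return ranked

sample = [(0, 5), (0, 10), (1, 3), (0, 8), (1, 7)]
for row in rank_within_minute(sample):
    print(row)
# (0, 10, 1)
# (0, 8, 2)
# (0, 5, 3)
# (1, 7, 1)
# (1, 3, 2)
```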
Top N in MySQL
SELECT * FROM sales ORDER BY amount DESC LIMIT 10;
Obviously, this top-N approach can neither sort within each minute nor take the top N%. To know how many rows N% corresponds to, we first need the total number of coupons issued in each minute, and then multiply it by N%. (In the queries that follow, table stands for the coupon issuance detail table.)
select hour, min, count(1) as cn
from table
where dt=20230423 and hour=11 and min>=0 and min<=30
group by hour, min
Then, we multiply the statistical results by N%
select dt, a2.hour, a2.min as min, metric_value,
       round(cn * 0.1) as cn,  -- 0.1 stands for N%; use 0.2 for the top 20%
       orderNum
from (
    -- rank each coupon within its minute, largest face value first
    select dt, hour, a1.min as min, metric_value,
           row_number() over (partition by a1.min order by metric_value desc) as orderNum
    from table a1
    where dt=20230423 and hour=11 and min>=0 and min<=30
) as a2
inner join (
    -- number of coupons issued in each minute
    select hour, min, count(1) as cn
    from table c
    where dt=20230423 and hour=11 and min>=0 and min<=30
    group by hour, min
) a3 on a2.hour=a3.hour and a2.min=a3.min
In this way, each row carries both cn (the number of rows needed for the quantile calculation) and orderNum (the row's rank within its minute by face value). Keeping only the rows whose orderNum does not exceed cn gives the top N% of the data, and averaging metric_value over those rows gives the quantile mean.
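The comparison between cn and orderNum can be sketched in Python as follows. This is an illustration of the logic under stated assumptions, not the production query: the function name, the (minute, metric_value) input layout, and the half-up rounding of cn are all assumed.

```python
import math
from collections import defaultdict

def quantile_mean_per_minute(rows, fraction):
    """rows: iterable of (minute, metric_value); fraction: e.g. 0.1 for the top 10%.

    Returns {minute: mean of the top `fraction` of face values in that minute}.
    """
    by_minute = defaultdict(list)
    for minute, value in rows:
        by_minute[minute].append(value)

    result = {}
    for minute, values in by_minute.items():
        # cn: how many rows this minute contributes (count * N%, rounded half up).
        cn = math.floor(len(values) * fraction + 0.5)
        ranked = sorted(values, reverse=True)
        # Keep a row only if its rank (orderNum) does not exceed cn.
        kept = [v for order_num, v in enumerate(ranked, start=1) if order_num <= cn]
        if kept:  # a minute with cn == 0 produces no result row
            result[minute] = sum(kept) / len(kept)
    return result

rows = [(0, v) for v in [10, 9, 8, 8, 6, 5, 4, 4, 2, 2]] + [(1, v) for v in [20, 1, 1, 1, 1]]
print(quantile_mean_per_minute(rows, 0.2))  # {0: 9.5, 1: 20.0}
```

Note that a minute with very few coupons can round to cn = 0, in which case no rows are selected and that minute is absent from the result.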
Combining these steps gives the following SQL for the quantile mean:
select dt, hour, min, round(avg(metric_value)) as metric_value
from (
    select dt, a2.hour, a2.min as min, metric_value,
           round(cn * ?) as cn,  -- ? is the fraction to keep, e.g. 0.1 or 0.2
           orderNum
    from (
        select dt, hour, a1.min as min, metric_value,
               row_number() over (partition by a1.min order by metric_value desc) as orderNum
        from table a1
        where dt=20230423 and hour=11 and min>=0 and min<=30
    ) as a2
    inner join (
        select hour, min, count(1) as cn
        from table a1
        where dt=20230423 and hour=11 and min>=0 and min<=30
        group by hour, min  -- required so that cn is counted per minute
    ) as a3 on a2.hour=a3.hour and a2.min=a3.min
) as q
where cn >= orderNum  -- keep ranks 1..cn, i.e. the top N% of each minute
group by dt, hour, min
order by dt, hour, min
A record falls within the range used for the quantile calculation when its rank orderNum does not exceed cn. For example, to calculate the 0.1 quantile mean, we need the top 10% of each minute's coupon issuance data. After grouping by minute and sorting by face value, each record is marked with its rank. The number of coupons issued in that minute multiplied by 10% gives cn, which is the number of records needed to compute that minute's 0.1 quantile mean.
Explanation
Before we used MySQL to calculate the quantile values, a Java program queried each minute's coupon issuance detail, sorted it, and computed the mean. The biggest problem with doing this in the program is that when the issuance volume is large, querying the quantile metrics over a period of time puts great pressure on the program. We hit exactly this problem in our business: every query for 2 hours of quantile data loaded more than a million rows into the Java program, which is a serious risk for a data query service. To solve this, the quantile query had to be implemented in MySQL.
Effect
Approach: the program queries the detail data and calculates the quantile value --> MySQL queries the quantile value directly
Performance: from more than 1 min --> within 15 s, a large improvement