
How to use nginx lua to collect data in website statistics

Release: 2023-05-28 17:32:48

Webmasters and operators often use website data analysis tools such as Google Analytics, Baidu Statistics, and Tencent Analytics. Before any statistics can be computed, the data must first be collected. This article analyzes the principles of data collection and then builds a data collection system.

Data Collection Principle Analysis

Simply put, a website statistics tool needs to collect the behavior of users browsing the target website (opening a web page, clicking a button, adding items to the shopping cart, and so on) together with additional data attached to that behavior (such as the order amount generated by an order). Early website statistics tools usually collected only one kind of user behavior: opening a page. What the user did on the page afterwards could not be collected. This strategy is sufficient for common analysis perspectives such as basic traffic analysis, source analysis, content analysis, and visitor attributes. However, with the widespread use of ajax and the growing demand from e-commerce websites for goal-oriented statistical analysis, this traditional collection strategy is no longer adequate.
Later, Google introduced customizable data collection scripts in Google Analytics. Through the extensible interface that Google Analytics defines, users only need to write a small amount of javascript code to track and analyze custom events and custom metrics. Products such as Baidu Statistics and Sogou Analytics have since copied the Google Analytics model.
In fact, the basic principles and processes of the two collection modes are the same; the latter simply collects more information through javascript. Let's look at the basic principles of data collection used by today's website statistics tools.

Process Overview

First, a user action triggers an http request from the browser to the page being tracked. Let's assume the action is opening a web page. When the page opens, the javascript snippet embedded in it is executed. Anyone who has used such tools knows that website statistics tools generally require a small piece of javascript code to be added to the page. This snippet usually creates a script tag dynamically and points its src at a separate js file. That js file (the green node in Figure 1) is then requested and executed by the browser, and it is usually the real data collection script. Once data collection is complete, the js requests a back-end data collection script (backend in Figure 1). This is usually a dynamic script disguised as an image, which may be written in php, python, or another server-side language. The js passes the collected data to the back-end script through http parameters. The back-end script parses the parameters and writes them to the access log in a fixed format, and it may also plant tracking cookies on the client through the http response.
The above is the general process of data collection. The following uses Google Analytics as an example to analyze each stage in more detail.

Buried script execution phase

To use Google Analytics (hereinafter GA), you need to insert a javascript fragment that it provides into the page. This fragment is often called the buried code.

_gaq is GA's global array, used to place various configurations. The format of each configuration is:

    _gaq.push(['Action', 'param1', 'param2', …]);

Action specifies the configuration action, followed by a list of related parameters. The default embedding code given by GA will give two preset configurations. _setAccount is used to set the website identification ID. This identification ID is assigned when registering GA. _trackPageview tells GA to track a page visit. For more configuration, please refer to: https://developers.google.com/analytics/devguides/collection/gajs/. In fact, this _gaq is used as a FIFO queue, and the configuration code does not need to appear before the buried code. For details, please refer to the instructions in the above link.
As far as this article is concerned, the mechanism of _gaq is not the focus. The focus is the anonymous function that follows it, which is what the buried code really does. Its main purpose is to load an external js file (ga.js): it creates a script element with document.createElement, points its src at the ga.js that matches the page's protocol (http or https), and finally inserts the element into the page's DOM tree.
Note that ga.async = true means calling the external js file asynchronously, that is, not blocking the browser's parsing, and executing it asynchronously after the external js download is completed. This attribute is newly introduced in HTML5.
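
The loading step described above can be sketched as follows. This is a minimal illustration rather than GA's exact code: the function names scriptUrlFor and injectAsyncScript are made up for this example, and the host names follow the pattern of the classic ga.js snippet.

```javascript
// Pick the ga.js URL that matches the page's protocol (http or https).
function scriptUrlFor(protocol) {
  var prefix = protocol === 'https:' ? 'https://ssl' : 'http://www';
  return prefix + '.google-analytics.com/ga.js';
}

// Create a <script> element, mark it async, and insert it before the
// first existing <script> so it downloads without blocking parsing.
// `doc` is the page's document object (passed in so it can be stubbed).
function injectAsyncScript(doc, src) {
  var el = doc.createElement('script');
  el.type = 'text/javascript';
  el.async = true;
  el.src = src;
  var first = doc.getElementsByTagName('script')[0];
  first.parentNode.insertBefore(el, first);
  return el;
}
```

In a browser this would be called as injectAsyncScript(document, scriptUrlFor(document.location.protocol)). Passing the document in as a parameter is only a convenience for trying the sketch outside a page.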

Data collection script execution phase

The data collection script (ga.js) will be executed after being requested. This script generally does the following things:
1. Collect information through the browser's built-in javascript objects, such as the page title (via document.title), referrer (the previous URL, via document.referrer), the user's screen resolution (via window.screen), cookie information (via document.cookie), and other information.
2. Parse _gaq to collect configuration information. This may include user-defined event tracking, business data (such as product numbers on e-commerce websites, etc.).
3. Parse and splice the data collected in the above two steps according to the predefined format.
4. Request a back-end script and put the information in the http request parameter and carry it to the back-end script.
The only difficulty here is step 4. The common way for javascript to request a back-end script is ajax, but ajax is restricted by the browser's same-origin policy and cannot make cross-domain requests by default. Here ga.js runs in the domain of the website being tracked, while the back-end script is in another domain (GA's back-end statistics script is http://www.google-analytics.com/__utm.gif), so ajax does not work. A common method is to create an Image object in the js script and point its src attribute at the back-end script with the parameters attached, which produces a cross-domain request to the back end. This is why back-end scripts are often disguised as gif files. Through http packet capture, you can see the request from ga.js to __utm.gif.

You can see that ga.js carries a lot of information when requesting __utm.gif. For example, utmsr=1280×1024 is the screen resolution, and utmac=UA-35712773-1 is the GA identification ID parsed from _gaq.
It is worth noting that __utm.gif is not only requested when the buried code is executed. If event tracking is configured with _trackEvent, this script is also requested when the event occurs.
Since ga.js has been compressed and obfuscated and is hard to read, we will not analyze it here. A script with similar functions is implemented in the implementation section later.

Back-end script execution phase

GA's __utm.gif is a script disguised as a gif. This kind of back-end script generally needs to complete the following things:
1. Parse the information of http request parameters.
2. Obtain some information from the server (WebServer) that the client cannot obtain, such as visitor IP, etc.
3. Write the information to the log according to the format.
4. Generate a 1×1 empty gif image as the response content and set the Content-type of the response header to image/gif.
5. Set any required cookie information in the response header through Set-Cookie.
Cookies are set because tracking unique visitors usually works like this: if the request carries no tracking cookie, a globally unique cookie is generated according to some rule and planted on the user; otherwise the tracking cookie obtained from the request is placed in Set-Cookie again so that the same user's cookie stays unchanged (see Figure 4).

Although this approach is not perfect (for example, a user who clears cookies or changes browsers is counted as two users), it is currently widely used. Note that if there is no need to track the same user across sites, you can plant cookies with js under the domain of the website being tracked (GA does this). If you want to identify users uniformly across the whole network, you can plant the cookie under the server's domain through the back-end script (our implementation does this later).
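
The reuse-or-generate rule above can be sketched in a few lines. The sketch is written in JavaScript for Node.js purely for illustration; the helper names readCookie and trackVisitor and the shape of the req object are assumptions, while the cookie name __utrace and the md5(timestamp + IP + client info) rule are the ones used by the implementation later in this article.

```javascript
const crypto = require('crypto');

// Extract one cookie value from a raw Cookie request header, or null.
function readCookie(header, name) {
  const pairs = (header || '').split(';');
  for (const pair of pairs) {
    const idx = pair.indexOf('=');
    if (idx > 0 && pair.slice(0, idx).trim() === name) {
      return pair.slice(idx + 1).trim();
    }
  }
  return null;
}

// Reuse the visitor's tracking ID if the request carries one; otherwise
// generate a new one. Either way, return the Set-Cookie value to send
// back so the same visitor keeps the same ID.
// `req` is a hypothetical object: { cookieHeader, now, ip, userAgent }.
function trackVisitor(req) {
  let uid = readCookie(req.cookieHeader, '__utrace');
  if (!uid) {
    uid = crypto.createHash('md5')
      .update(String(req.now) + req.ip + req.userAgent)
      .digest('hex');
  }
  return { uid: uid, setCookie: '__utrace=' + uid + '; path=/' };
}
```

A returning visitor therefore always gets back the ID it sent, and only a request without the cookie causes a new ID to be generated.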

Design and implementation of the system

Based on the above principles, I built an access log collection system myself.
I call this system MyAnalytics.

Determine the information collected

For the sake of simplicity, I am not going to implement GA's complete data collection model. Instead, only a small set of fields is collected; they are the ones listed in the log format below.

Buried code

For the buried code I borrow GA's model, but for now the configuration object is not used as a FIFO queue.

The statistics script is named ma.js and is served from the subdomain analytics.codinglabs.org. There is one small problem: I do not have an https server, so the code will not work properly if it is deployed on an https site. Let's ignore that here.

Front-end statistics script

I wrote a statistical script ma.js that is not very complete but can complete the basic work:

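    (function () {
        var params = {};
        // data from the document object
        if (document) {
            params.domain = document.domain || '';
            params.url = document.URL || '';
            params.title = document.title || '';
            params.referrer = document.referrer || '';
        }
        // data from the window object
        if (window && window.screen) {
            params.sh = window.screen.height || 0;
            params.sw = window.screen.width || 0;
            params.cd = window.screen.colorDepth || 0;
        }
        // data from the navigator object
        if (navigator) {
            params.lang = navigator.language || '';
        }
        // parse the _maq configuration (skipped if the page never defined it)
        if (typeof _maq !== 'undefined' && _maq) {
            for (var i = 0; i < _maq.length; i++) {
                switch (_maq[i][0]) {
                    case '_setAccount':
                        params.account = _maq[i][1];
                        break;
                    default:
                        break;
                }
            }
        }
        // build the query string
        var args = '';
        for (var key in params) {
            if (args != '') {
                args += '&';
            }
            args += key + '=' + encodeURIComponent(params[key]);
        }
        // request the back-end script through an Image object
        var img = new Image(1, 1);
        img.src = 'http://analytics.codinglabs.org/1.gif?' + args;
    })();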

The whole script is wrapped in an anonymous function so that the global environment is not polluted. How it works was explained in the principles section and is not repeated here. 1.gif is the back-end script.

Log format

Each log record occupies one line, with fields separated by the invisible character ^A (ASCII code 0x01; under Linux it can be typed with ctrl-v ctrl-a; "^A" below stands for the invisible character 0x01). The format is as follows:
Time^AIP^ADomain name^AURL^APage title^AReferrer^AScreen height^AScreen width^AColor depth^ALanguage^AClient information^AUser ID^AWebsite ID
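
To make the record layout concrete, here is a small sketch that builds and parses one such line. It is only an illustration: the property names in FIELDS and the two function names are invented for this example, and only the field order and the 0x01 separator come from the format above.

```javascript
// Field order follows the log format above; the property names are illustrative.
const FIELDS = ['time', 'ip', 'domain', 'url', 'title', 'referrer',
                'sh', 'sw', 'cd', 'lang', 'userAgent', 'uid', 'account'];
const SEP = '\x01'; // the invisible ^A separator

// Join the fields of a record into one log line; missing fields stay empty.
function formatLogLine(record) {
  return FIELDS.map(function (f) {
    return record[f] == null ? '' : String(record[f]);
  }).join(SEP);
}

// Split a log line back into a record keyed by field name.
function parseLogLine(line) {
  const values = line.split(SEP);
  const record = {};
  FIELDS.forEach(function (f, i) {
    record[f] = values[i] === undefined ? '' : values[i];
  });
  return record;
}
```

Because 0x01 practically never occurs in URLs, titles, or user agent strings, splitting on it is safer than splitting on spaces or commas, which is presumably why it was chosen as the separator.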

Backend script

The back end is implemented with nginx. First, the log format is defined:

    log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";

    location /1.gif {
        # disguise the response as a gif
        default_type image/gif;
        # no access log here; logging happens in the /i-log subrequest
        access_log off;

        access_by_lua "
            -- the user-tracking cookie is named __utrace
            local uid = ngx.var.cookie___utrace
            if not uid then
                -- none yet: generate one as md5(timestamp + IP + client info)
                uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. (ngx.var.http_user_agent or ''))
            end
            ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
            if ngx.var.arg_domain then
                -- log via a subrequest to /i-log, passing the parameters and the tracking cookie
                ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
            end
        ";

        # never cache this request
        add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
        add_header Pragma "no-cache";
        add_header Cache-Control "no-cache, max-age=0, must-revalidate";

        # return a 1×1 empty gif as the response body
        empty_gif;
    }

    location /i-log {
        # the parameters arrive URL-encoded, so unescape them into variables
        set_unescape_uri $u_domain $arg_domain;
        set_unescape_uri $u_url $arg_url;
        set_unescape_uri $u_title $arg_title;
        set_unescape_uri $u_referrer $arg_referrer;
        set_unescape_uri $u_sh $arg_sh;
        set_unescape_uri $u_sw $arg_sw;
        set_unescape_uri $u_cd $arg_cd;
        set_unescape_uri $u_lang $arg_lang;
        set_unescape_uri $u_utrace $arg_utrace;
        set_unescape_uri $u_account $arg_account;

        # subrequests are not logged by default, so turn that on
        log_subrequest on;
        # write to ma.log using the tick format
        access_log /path/to/logs/directory/ma.log tick;

        # output an empty string
        echo '';
    }

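A log collection system has to handle a large volume of access logs. Over time the file grows rapidly, and keeping everything in one file becomes hard to manage. Logs are therefore usually split by time period, for example one file per day or per hour. To make the effect obvious, I split the log every hour here, using crontab to invoke a shell script periodically. The script is as follows: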

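    # _prefix is a placeholder: set it to your nginx install prefix
    _prefix="/path/to/nginx"
    time=`date +%Y%m%d%H`
    mv ${_prefix}/logs/ma.log ${_prefix}/logs/ma/ma-${time}.log
    # USR1 tells nginx to reopen its log files
    kill -USR1 `cat ${_prefix}/logs/nginx.pid`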

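Then add an entry like the following to the system crontab (for example /etc/crontab) so that the script runs at minute 59 of every hour:

    59 * * * * root /path/to/directory/rotatelog.sh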

Then I opened the log file with tail and refreshed the page. Because no access log buffer is set, a new log entry appeared immediately:
1351060731.360^A0.0.0.0^Awww.codinglabs.org^Ahttp://www.codinglabs.org/^ACodingLabs^A^A1024^A1280^A24^Azh-CN^AMozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4^A4d612be64366768d32e623d594e82678^AU-1-1


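Note that the raw log should retain as much information as possible and should not be heavily filtered or processed. For example, MyAnalytics above keeps the millisecond-level timestamp rather than a formatted time; formatting time is the job of downstream systems, not of the log collection system. Downstream systems can derive a great deal from the raw log: an IP database can locate a visitor's region, and the user agent yields the visitor's operating system and browser. Combined with more complex analysis models, this enables traffic, source, visitor, region, and path analysis. Of course, the raw log is usually not analyzed directly; it is cleaned, formatted, and stored elsewhere, such as MySQL or HBase, before analysis.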

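    # page views: total number of log lines
    awk -F^A '{print $1}' ma-2012102409.log | wc -l
    # unique visitors: distinct user IDs (field 12); uniq alone only drops adjacent duplicates
    awk -F^A '{print $12}' ma-2012102409.log | sort -u | wc -l
    # distinct IPs (field 2)
    awk -F^A '{print $2}' ma-2012102409.log | sort -u | wc -l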
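
For readers who prefer a script over awk, the same three counts can be sketched as below. This is a rough illustration under stated assumptions: the function name summarize and the shape of its return value are invented here, and the field positions are taken from the log format defined earlier.

```javascript
// Summarize ^A-separated log lines: page views, unique visitors, unique IPs.
function summarize(lines) {
  const visitors = new Set();
  const ips = new Set();
  let pageViews = 0;
  for (const line of lines) {
    if (line === '') continue;   // skip blank lines such as a trailing newline
    const fields = line.split('\x01');
    pageViews += 1;
    ips.add(fields[1]);          // field 2: IP
    visitors.add(fields[11]);    // field 12: user ID
  }
  return {
    pageViews: pageViews,
    uniqueVisitors: visitors.size,
    uniqueIps: ips.size
  };
}
```

The lines would typically come from reading an hourly file and splitting it on newlines. A Set counts each visitor once even when the same ID appears in non-adjacent lines, which is the case a bare uniq would miss.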