I would like to know what kind of algorithm Webmaster Home uses for its website IP/PV statistics.
What is the general idea and direction? My impression is that statistics gathered without an SDK are not very accurate. Is that understanding correct? Please let me know.
1) Simple and crude: ignore the browser's user-agent, cookies, and everything else, and count a PV directly every time one is generated. Advantage: simple. Disadvantage: the numbers may not be real, and can easily be faked.
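A minimal sketch of this "count everything" approach (hypothetical names; no deduplication of any kind):

```python
from collections import defaultdict

# Naive PV counting: every request increments the counter,
# with no check on IP, cookie, or user-agent.
pv_counts = defaultdict(int)

def record_pv(page: str) -> int:
    """Count one page view unconditionally and return the new total."""
    pv_counts[page] += 1
    return pv_counts[page]
```

Because nothing is deduplicated, hitting refresh in a loop inflates the count, which is exactly the disadvantage described above.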
2) A slightly more careful approach distinguishes new visitors from returning ones. You can study the Baidu Analytics SDK: it collects the user's browser information, operating system, region, and so on. In other words, this data can be obtained through the browser's JavaScript and its interaction with the server, so the backend has access to it. A site like Webmaster Home, which wants to count real user visits and do some behavioral analysis, will likely combine the user's IP, cookie (i.e., session) information, and user-agent for statistical analysis. Note that the IP here is a NAT-mapped address: home broadband users are usually assigned a private address inside the carrier's network to conserve IPv4 space. Even so, the combination of user-agent, IP, and cookie can more or less uniquely identify one user.
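One common way to turn that combination into a single visitor identifier is to hash the three signals together (a sketch, not necessarily what Webmaster Home does):

```python
import hashlib

def visitor_fingerprint(ip: str, user_agent: str, cookie_id: str) -> str:
    """Combine IP, user-agent, and cookie/session id into one stable hash.

    Any single signal is weak (NAT makes many users share an IP,
    cookies can be cleared), but together they approximate a unique visitor.
    """
    raw = f"{ip}|{user_agent}|{cookie_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

The fingerprint is stable for the same visitor and cheap to compare or store as a deduplication key.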
3) Going further: from a design perspective, read-count information is not the highest-priority content on a page (the business content itself is), but it is still meaningful. So the question becomes: at the database design level, do counters like read counts need write locks for mutual exclusion? It is worth understanding the CAP theorem here.
4) The solution may therefore be caching, or it may be IP checks plus cookie detection; you have to experiment to find out. Personally, I think the most likely design for read counts, and the one Autohome uses, is asynchronous counting: after you generate a real read, the counter is only incremented by 1 after background processing.
Some ideas for implementing this:
Allow each IP to add at most two reads, or add a deeper level of logic on top, such as clearing the IP records the next day; the rule then becomes that each IP gets two chances per day to increase the read count.
Within a fixed time window (say 30 minutes), no matter how many times the same browser visits, the read count only increases once.
Verify user-agent, cookie, and other information, and insert one visitor record into table A for each view.
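The first two ideas can be combined into one check (a sketch with assumed values: a 30-minute window per visitor fingerprint and a cap of two counted reads per IP per day):

```python
WINDOW_SECONDS = 30 * 60  # assumed 30-minute dedup window
DAILY_IP_CAP = 2          # assumed: each IP may add at most 2 reads per day

last_seen: dict[str, float] = {}          # fingerprint -> last counted time
ip_daily_count: dict[tuple, int] = {}     # (ip, date) -> reads counted today

def should_count(ip: str, fingerprint: str, now: float, today: str) -> bool:
    """Apply both rules: per-fingerprint time window and per-IP daily cap."""
    if now - last_seen.get(fingerprint, -WINDOW_SECONDS - 1) < WINDOW_SECONDS:
        return False  # same visitor within the window: do not count
    if ip_daily_count.get((ip, today), 0) >= DAILY_IP_CAP:
        return False  # this IP already used up its two reads today
    last_seen[fingerprint] = now
    ip_daily_count[(ip, today)] = ip_daily_count.get((ip, today), 0) + 1
    return True
```

Clearing `ip_daily_count` at midnight (here modeled by the `today` key) gives each IP its two chances again the next day.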
Weibo's implementation: I work on Weibo, so I'll describe its practice. Read counts, like counts, and per-visitor access limits are all implemented in Redis, then synced to the database during the nightly off-peak period (following certain rules, in batches, and so on).
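A rough sketch of the Redis-counter pattern. To keep it self-contained, an in-memory stand-in is used here instead of a real Redis client; in production you would use the equivalent redis-py calls (`incr`, `set` with `nx=True`) against a Redis server:

```python
class FakeRedis:
    """In-memory stand-in for the two Redis commands the sketch needs."""
    def __init__(self):
        self.store: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]

    def setnx(self, key: str, value: int) -> bool:
        """Set the key only if it does not exist; return whether it was set."""
        if key in self.store:
            return False
        self.store[key] = value
        return True

r = FakeRedis()

def record_read(post_id: str, visitor: str) -> int:
    """Count the read only if this visitor hasn't been seen for this post."""
    if r.setnx(f"seen:{post_id}:{visitor}", 1):
        return r.incr(f"reads:{post_id}")
    return r.store.get(f"reads:{post_id}", 0)

def nightly_flush() -> dict:
    """Off-peak job: collect counters to write to the database in batches."""
    return {k: v for k, v in r.store.items() if k.startswith("reads:")}
```

In a real deployment the `seen:` keys would get a TTL (e.g. `ex=86400`) so the per-visitor limit resets, and `nightly_flush` would persist the counters and reset them.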
If the user is logged in, they are counted only once; if they are a guest, the same visitor is identified by IP, timestamp, cookie, and so on, and counted only once as well.
This prevents inflating the count by repeated browsing.
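That logged-in-versus-guest rule can be sketched as follows (hypothetical names; the guest key here is a naive concatenation rather than a proper fingerprint):

```python
def visitor_key(user_id, ip: str, user_agent: str, cookie_id: str) -> str:
    """Logged-in users are identified by account id; guests by a
    best-effort combination of IP, user-agent, and cookie."""
    if user_id is not None:
        return f"user:{user_id}"
    return f"guest:{ip}:{user_agent}:{cookie_id}"

counted: set[str] = set()
read_count = 0

def count_once(key: str) -> int:
    """Increase the read count only the first time a visitor key appears."""
    global read_count
    if key not in counted:
        counted.add(key)
        read_count += 1
    return read_count
```

Repeated views from the same key never increase the count, which is exactly the anti-inflation behavior described above.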