
Algorithms have no borders: domestic algorithms are equally good


[CSDN Report] A few days ago, CSDN translated an article published on High Scalability by Matt Abrams, deputy director of data analytics at AddThis. In it, Abrams described how AddThis counted one billion distinct objects using only 1.5 KB of memory, fully demonstrating the power of a good algorithm.

The article drew wide attention on Weibo, and we learned that Yitao has equally impressive algorithms of its own. CSDN therefore interviewed Zhang Yang of the Yitao Data Department (he studied at Yantai University and Beihang University, received a master's degree in computer theory from Beihang University in 2011, joined Taobao the same year, and now works in the Yitao Data Department) and asked him to explain Yitao's related algorithms.

Photo: Zhang Yang, engineer at the Yitao Data Department

CSDN: First of all, please introduce yourself and your day-to-day work.

Zhang Yang: My name is Zhang Yang, and my nickname in the company is Ye Feng. I am an ordinary coder in the Yitao Data Department. Like thousands of other coders, I spend every day typing code and writing programs, and I regard it as the second greatest pleasure in life (the first being eating). I am very interested in technologies such as PHP, Nginx, data mining, machine learning, algorithms, compilers, and distributed storage and computing, and I also like mathematics and history. I love the job of writing programs and hope to make programming a lifelong career. Besides writing programs, I like to study mathematics and algorithms, and I enjoy summarizing what I have learned into articles published on my blog to share with everyone. Interested friends can visit my blog: codinglabs.org.

My position in the Yitao Data Department is front-end development, but my "front-end development" is broader than that of a typical front-end engineer. Besides HTML, CSS, and JavaScript, I also develop back-end programs in PHP and Lua, and I occasionally write C and algorithm code out of interest and need (I like writing C and algorithms very much and enjoy it greatly). I also do some operations work, such as setting up server environments and maintaining online servers.

CSDN: What sparked your interest in algorithms?

Zhang Yang: It probably comes from my interest in mathematics; I have always liked mathematical things. I first formally encountered algorithms in my sophomore year of college, when I bought a copy of Introduction to Algorithms and truly began to understand the basics of the field: asymptotic complexity, algorithm analysis, dynamic programming, greedy algorithms, NP problems, and so on. Reading it felt amazing; every algorithm in the book shines with human ingenuity, and studying them gave me a satisfaction and pleasure that is hard to put into words. In my later studies and work, I kept seeing, through practical applications, how algorithms solve real problems in all kinds of fields and push human civilization forward, which deepened my respect for them.

CSDN: Why did the Yitao Data Department develop this cardinality estimation algorithm?

Zhang Yang: The Yitao Data Department mainly does data analysis and mining in the e-commerce field and integrates these technologies closely with the business to build data products and services; data analysis, recommendation systems, and the like are all things we do. These data products serve not only external users but also support operations inside the company and the group.

In e-commerce data analysis, computing certain key indicators is a very common task, for example unique visitors (UV), the number of distinct visitors under given time and space constraints. Generally we first mark each unique visitor by some means (for example, with a cookie) and then record that mark in every access log entry. Computing UV is then equivalent to counting the number of distinct elements in a multiset of user tags, which in mathematical terms is its cardinality.
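To make the notion concrete, here is a minimal sketch of exact UV counting in PHP. The log entries and visitor tags are invented for illustration; a real system would read the tags from access logs.

<?php
// Exact UV counting: the cardinality of a multiset of visitor tags.
// The tags below are invented for illustration.
$accessLog = ['u1', 'u2', 'u1', 'u3', 'u2', 'u1']; // tags recorded via cookies

$seen = [];
foreach ($accessLog as $tag) {
    $seen[$tag] = true;   // duplicate tags collapse onto a single key
}
echo count($seen), "\n";  // UV = 3 distinct visitors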

There are two difficulties in computing cardinality:

First, it does not lend itself to real-time stream computation. Some of our products provide real-time UV, that is, the number of unique visitors from some point in time (say, midnight today) up to now. To achieve this, a data structure with fast lookups (such as a B-tree) must be maintained in memory for each UV counter, so that when a new visit arrives on the real-time stream we can quickly check whether this visitor has been seen before; that check determines whether the UV value increases by 1 or stays unchanged. If we want to provide this service for one million stores at once, we would have to keep one million B-trees in memory, and if we also compute UV along different source dimensions, that number multiplies rapidly. This is a big challenge for our servers' computing and memory resources.
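The following sketch shows that naive approach, with a PHP associative array standing in for the per-store B-tree; the store ids, tags, and the function name recordVisit are illustrative assumptions, not our production code.

<?php
// Naive real-time UV: one exact lookup structure per store.
// All identifiers and numbers here are invented for illustration.
$uvSets = [];  // store id => set of visitor tags seen so far

function recordVisit(array &$uvSets, string $storeId, string $tag): int
{
    if (!isset($uvSets[$storeId][$tag])) {  // fast membership test
        $uvSets[$storeId][$tag] = true;     // first visit: UV grows by 1
    }
    return count($uvSets[$storeId]);        // current UV for this store
}

echo recordVisit($uvSets, 'shop-1', 'u1'), "\n"; // 1
echo recordVisit($uvSets, 'shop-1', 'u1'), "\n"; // still 1: repeat visitor
echo recordVisit($uvSets, 'shop-1', 'u2'), "\n"; // 2
// Multiply such a structure by 1,000,000 stores (and again by every
// source dimension) and resident memory quickly becomes the bottleneck.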

Second, traditional cardinality computations cannot be combined effectively. For example, even after the UVs of the previous hour and of this hour have each been computed, obtaining the total UV of the two hours together still requires expensive recomputation. A bitmap data structure can be merged quickly, but its space complexity is too high, because the number of possible combinations of time periods grows exponentially with the number of periods. So in the face of big data, neither B-trees nor plain bitmaps are an effective solution.
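By contrast with B-trees, a plain bitmap merges with a single bitwise OR, as the small sketch below shows; the bit positions are invented, and a real scheme would map each visitor id to a fixed bit.

<?php
// Plain-bitmap merging: the union of two periods is a bitwise OR.
// Bit positions are invented; a real scheme maps visitor ids to bits.
$hour1 = str_repeat("\0", 8192);  // 2^16 bits for hour 1
$hour2 = str_repeat("\0", 8192);  // 2^16 bits for hour 2

function setBit(string &$bitmap, int $pos): void
{
    $bitmap[$pos >> 3] = chr(ord($bitmap[$pos >> 3]) | (1 << ($pos & 7)));
}

function countBits(string $bitmap): int
{
    $ones = 0;
    for ($i = 0, $len = strlen($bitmap); $i < $len; $i++) {
        $ones += substr_count(decbin(ord($bitmap[$i])), '1');
    }
    return $ones;
}

setBit($hour1, 42); setBit($hour1, 99);  // two visitors in hour 1
setBit($hour2, 99); setBit($hour2, 7);   // one repeat, one new in hour 2

$merged = $hour1 | $hour2;      // PHP ORs two strings byte by byte
echo countBits($merged), "\n";  // 3 distinct visitors across both hours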

Against this background, Wang Xiaozhe (alias Qingwu), a technical expert in the Yitao Data Department, studied the relevant cardinality estimation algorithms along with Clearspring's Java implementation (stream-lib), and took the lead in introducing a cardinality estimation algorithm into our holographic effect platform (codenamed Moonlight Treasure Box). It solved the technical problem of counting huge numbers of UVs with a small amount of memory, and it carried the real-time effectiveness computation for all venue slots on Tmall and Taobao during the Double Eleven and Double Twelve promotions.

To make these algorithms available to non-Java projects as well, Wang Xiaozhe and I wrote a C implementation, ccard-lib, based on the relevant papers and with reference to stream-lib; Zhang Wei (alias Minzhan), an engineer in the Yitao Data Department, then wrapped it in a PHP extension. This C implementation is now used in several products of the Yitao Data Department and has been open-sourced on GitHub.

CSDN: Can you introduce the Yitao Data Department's cardinality estimation algorithms to readers in detail?

Zhang Yang: The algorithm we mainly use is Adaptive Counting, which appeared in the paper "Fast and accurate traffic matrix measurement using adaptive cardinality counting". In ccard-lib I have also implemented the other common cardinality estimation algorithms, such as Linear Counting, LogLog Counting, and HyperLogLog Counting.
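A compact sketch of the Adaptive Counting decision rule, as the paper describes it, looks roughly like this; the function name is mine, the 0.051 empty-bucket threshold and the constant alpha ~= 0.39701 come from the published analysis, and this is an illustration rather than the actual ccard-lib code.

<?php
// Adaptive Counting, sketched: use the Linear Counting estimator while
// many LogLog buckets are still empty, and switch to the LogLog
// estimator once the bucket array fills up. Illustration only.
function adaptiveEstimate(array $buckets): float
{
    $m = count($buckets);   // number of buckets (a power of two)
    $empty = 0;
    foreach ($buckets as $b) {
        if ($b === 0) {
            $empty++;       // buckets never hit by any hashed element
        }
    }
    if ($empty / $m >= 0.051) {
        // Linear Counting regime: n ~= -m * ln(V), V = empty fraction
        return -$m * log($empty / $m);
    }
    // LogLog regime: n ~= alpha_m * m * 2^(average bucket value)
    $alpha = 0.39701;       // asymptotic alpha_m from the paper
    return $alpha * $m * pow(2, array_sum($buckets) / $m);
}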

These are probabilistic algorithms: by sacrificing a controlled amount of accuracy (the accuracy is controllable, and mathematical analysis yields methods for controlling it), they dramatically reduce the computing resources required. For example, we can estimate UVs in the hundreds of millions with only 8 KB of memory, with error no more than 2%, an enormous memory saving compared with B-trees or raw bitmaps. At the same time, cardinality estimation works on a hash-transformed bitmap space, which saves memory significantly while still supporting efficient merging, thereby solving both of the difficulties mentioned above.
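The merging property is easy to see in a few lines: the sketch of the union of two streams is simply the bucket-wise maximum of their register arrays, so precomputed hourly sketches can be combined without revisiting the raw logs. The toy register values below are invented.

<?php
// Merging two LogLog-style sketches: the union's sketch is the
// bucket-wise maximum of the register arrays. Toy values, 4 buckets.
function mergeSketches(array $a, array $b): array
{
    return array_map('max', $a, $b);  // per-bucket max = union sketch
}

$hour1 = [3, 0, 5, 1];
$hour2 = [2, 4, 1, 1];
print_r(mergeSketches($hour1, $hour2));  // [3, 4, 5, 1]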

When using 2^16 (64K) bits, the estimated results are as follows:

Linear Counting with Murmurhash:
actual: 50000, estimated: 50062, error: 0.12%
actual: 100000, estimated: 99924, error: 0.08%
actual: 150000, estimated: 149865, error: 0.09%
actual: 200000, estimated: 199916, error: 0.04%
actual: 250000, estimated: 250123, error: 0.05%
actual: 300000, estimated: 299942, error: 0.02%
actual: 350000, estimated: 349801, error: 0.06%
actual: 400000, estimated: 400101, error: 0.03%
actual: 450000, estimated: 449955, error: 0.01%
actual: 500000, estimated: 500065, error: 0.01%

Linear Counting with Lookup3hash:
actual: 50000, estimated: 49835, error: 0.33%
actual: 100000, estimated: 99461, error: 0.54%
actual: 150000, estimated: 149006, error: 0.66%
actual: 200000, estimated: 198501, error: 0.75%
actual: 250000, estimated: 248365, error: 0.65%
actual: 300000, estimated: 298065, error: 0.65%
actual: 350000, estimated: 347504, error: 0.71%
actual: 400000, estimated: 397292, error: 0.68%
actual: 450000, estimated: 446700, error: 0.73%
actual: 500000, estimated: 495944, error: 0.81%

Hyperloglog Counting with Murmurhash:
actual: 50000, estimated: 50015, error: 0.03%
actual: 100000, estimated: 100048, error: 0.05%
actual: 150000, estimated: 149709, error: 0.19%
actual: 200000, estimated: 201595, error: 0.80%
actual: 250000, estimated: 250168, error: 0.07%
actual: 300000, estimated: 299864, error: 0.05%
actual: 350000, estimated: 348571, error: 0.41%
actual: 400000, estimated: 398583, error: 0.35%
actual: 450000, estimated: 448632, error: 0.30%
actual: 500000, estimated: 498330, error: 0.33%

Hyperloglog Counting with Lookup3hash:
actual: 50000, estimated: 49628, error: 0.74%
actual: 100000, estimated: 99357, error: 0.64%
actual: 150000, estimated: 148880, error: 0.75%
actual: 200000, estimated: 200475, error: 0.24%
actual: 250000, estimated: 249362, error: 0.26%
actual: 300000, estimated: 299119, error: 0.29%
actual: 350000, estimated: 349225, error: 0.22%
actual: 400000, estimated: 398805, error: 0.30%
actual: 450000, estimated: 448373, error: 0.36%
actual: 500000, estimated: 498183, error: 0.36%

Adaptive Counting with Murmurhash:
actual: 50000, estimated: 50015, error: 0.03%
actual: 100000, estimated: 100048, error: 0.05%
actual: 150000, estimated: 149709, error: 0.19%
actual: 200000, estimated: 201059, error: 0.53%
actual: 250000, estimated: 249991, error: 0.00%
actual: 300000, estimated: 300067, error: 0.02%
actual: 350000, estimated: 349610, error: 0.11%
actual: 400000, estimated: 399875, error: 0.03%
actual: 450000, estimated: 450348, error: 0.08%
actual: 500000, estimated: 500977, error: 0.20%

Adaptive Counting with Lookup3hash:
actual: 50000, estimated: 49628, error: 0.74%
actual: 100000, estimated: 99357, error: 0.64%
actual: 150000, estimated: 148880, error: 0.75%
actual: 200000, estimated: 199895, error: 0.05%
actual: 250000, estimated: 249563, error: 0.17%
actual: 300000, estimated: 299047, error: 0.32%
actual: 350000, estimated: 348665, error: 0.38%
actual: 400000, estimated: 399266, error: 0.18%
actual: 450000, estimated: 450196, error: 0.04%
actual: 500000, estimated: 499516, error: 0.10%

Loglog Counting with Murmurhash:
actual: 50000, estimated: 59857, error: 19.71%
actual: 100000, estimated: 103108, error: 3.11%
actual: 150000, estimated: 150917, error: 0.61%
actual: 200000, estimated: 201059, error: 0.53%
actual: 250000, estimated: 249991, error: 0.00%
actual: 300000, estimated: 300067, error: 0.02%
actual: 350000, estimated: 349610, error: 0.11%
actual: 400000, estimated: 399875, error: 0.03%
actual: 450000, estimated: 450348, error: 0.08%
actual: 500000, estimated: 500977, error: 0.20%

Loglog Counting with Lookup3hash:
actual: 50000, estimated: 59870, error: 19.74%
actual: 100000, estimated: 103044, error: 3.04%
actual: 150000, estimated: 150435, error: 0.29%
actual: 200000, estimated: 199895, error: 0.05%
actual: 250000, estimated: 249563, error: 0.17%
actual: 300000, estimated: 299047, error: 0.32%
actual: 350000, estimated: 348665, error: 0.38%
actual: 400000, estimated: 399266, error: 0.18%
actual: 450000, estimated: 450196, error: 0.04%
actual: 500000, estimated: 499516, error: 0.10%

Due to space constraints I cannot describe the details of these algorithms here. I previously published a translated article on my blog, but it too is only an overview. Note, though, what the results above already show: LogLog Counting has a large error at small cardinalities, which is exactly the weakness Adaptive Counting repairs by falling back to Linear Counting while the bucket array is sparse. I am already preparing blog posts that introduce cardinality estimation algorithms in detail, including the mathematical underpinnings and some interpretation of the papers; friends who are interested are welcome to follow my blog.

CSDN: On Weibo you describe yourself as a "severe case of code-cleanliness obsession", which is an amusing label. Can we take it that you care a great deal about code standards? How do you keep your code standardized in everyday development?

Zhang Yang: That label is partly self-mockery. I really do care about code formatting: if I see code that is non-standard or untidy, even a single extra blank line, I feel very uncomfortable. Deep down I have a perfectionist streak about code format.

That said, there are two sides to this. For something specialized that I develop on my own, such as an algorithm library, I can hold to that perfectionism, but it is not really appropriate where many people collaborate. Realistically, business code can never stay beautiful forever; you have to strike a balance between delivery pace and code quality. As for maintaining standards, I have always believed you cannot rely purely on programmers' self-discipline or on preaching coding conventions; enforcing them through tools (such as lint) and process is more effective.

CSDN: What difficulties remain to be overcome in future work?

Zhang Yang: The difficulties to overcome come mainly from two directions.

One is improving the algorithms themselves. No algorithm in this world is flawless. Take the cardinality estimation algorithms above: although they greatly reduce memory usage, memory consumption can still balloon if the number of dimensions explodes, and merging bitmaps is not free either; sometimes in-memory and on-disk bitmaps must be merged, and once the bitmap volume grows too large, disk I/O becomes the bottleneck. How to optimize and improve the algorithms for concrete scenarios is therefore a hard problem. One approach is to study the relevant papers and draw on the latest results from major research institutions and companies worldwide. The other is to improve the algorithms ourselves, which requires a very deep grasp of the algorithms and the mathematical analysis behind them, and thus demands a high theoretical level from the engineers involved.

The other is combining algorithms with business products. Algorithms are, after all, rather abstract things, and there is a long way to go before one is concretely applied in a product. Finding the best fit between algorithms and products is also one of the key points, and one of the difficulties, of this work.

2012 has passed: we survived the "end of the world" and ushered in a new chapter. In 2013 we enter a new era of Internet development. Data of every kind floods the Internet, and big data has become a problem every Internet company must face. How to extract as much useful information as possible while consuming the fewest resources is a question every Internet company should be thinking about. After these two recent articles on algorithms, I believe readers will have formed their own ideas. Of course, every algorithm has its strengths and weaknesses; in daily work we still have to choose algorithms according to the actual situation rather than generalize. (Author: Wang Xudong / Reviewer: Bao Yan)
