算法无国界国内算法同样牛-php手册-php.cn

Home

php教程

php手册

算法无国界国内算法同样牛

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 13, 2016 am 09:42 AM

company domestic foreign data analysis of algorithm compile

【CSDN报道】几天前，CSDN编译了国外AddThis公司的数据分析副总监Matt Abrams在High Scalability上发表的一篇文章，Matt Abrams在这篇文章中向读者介绍了AddThis仅用了1.5KB内存就计算了十亿个不同的对象，充分展示了算法的魅力。

这篇文章在微博上得到了广泛关注，并得知一淘的算法也同样出彩。为此，CSDN采访了一淘数据部的张洋（他曾先后就读于烟台大学和北京航空航天大学，2011年在北京航空航天大学取得计算机理论硕士学位，同年加入淘宝，目前在一淘数据部工作），请他讲解一下一淘的相关算法。

图：一淘数据部工程师张洋

CSDN：首先请您介绍一下自己以及平时的工作？

张洋：我叫张洋，在公司的花名是夜沨。目前是一淘数据部一名普通码农，和千千万万码农一样，每天以敲代码写程序为工作，同时也将其视为人生第二大乐趣（第一大乐趣是吃）。我对PHP、Nginx、数据挖掘、机器学习、算法、编译器和分布式存储计算等技术兴趣浓厚，喜爱数学和历史。我很喜欢写程序这个工作，也希望能将编程作为毕生的职业。写程序之余也喜欢研究数学和算法，同时我很乐于将自己学到的东西总结成文章发表在博客上和大家分享，有兴趣的朋友可以来我博客逛逛：codinglabs.org。

我在一淘数据部的职位是前端开发，但是我这个“前端开发”比一般意义上的前端工程师做的事要杂一些，除了负责HTML、CSS和JavaScript外，也开发PHP、Lua的后台程序，偶尔也会根据兴趣和需要来开发一些C和算法的程序（我很喜欢写C和算法，十分乐在其中），同时我还做一些运维工作，例如搭建服务器环境和维护线上服务器。

CSDN：是什么原因促使您对算法感兴趣的？

张洋：可能是源自我对数学的兴趣吧，我一直很喜欢数理性的东西。正式接触算法是大二的时候，当时买了一本算法导论，才真正开始了解渐近复杂度、算法分析、动态规划、贪心算法、NP问题等一系列算法领域最基本的东西。看的时候就觉得很神奇，感觉书中的每个算法都闪耀着人类的智慧，阅读和学习这些东西给我带来一种难以用语言表达的满足感和快感。在后来的学习和工作中我不断从实际应用中了解和领会算法是如何解决各个领域的实际问题，推动人类文明的发展，这更加深了我对算法的崇敬。

CSDN：一淘数据部为什么会开发这个基数估计算法？

张洋：一淘数据部主要在电子商务领域做一些数据的分析挖掘，并将这些技术与业务紧密结合形成一些数据产品和服务，例如数据分析、推荐系统等我们都有做。这些数据产品既对外服务，也会对公司或集团内部的运作提供支持。

在电子商务的数据分析领域有一些很关键的指标（例如unique visitor，简称UV，指在一定的时间空间维度约束下独立访客的数量）的计算是很常见的任务。一般来说我们首先会通过某种手段给每一个独立访客做一个标记（例如通过cookie），然后会在所有访问日志中记录下访客的标记，这样一来，UV的计算就等价为在一个可重复的用户标记集合中计算不重复元素的个数，也就是数学上的基数。

基数的计算有两个难点：

一是不利于实时流计算的实现。例如我们的一些产品中经常会提供实时UV，也就是从某个时间点开始（例如今天零点）到目前的独立访客数。为了做到这点，需在内存中为每一个UV数值维护一个查找性能高的数据结构（例如B树），这样当实时流中新来一个访问时，能快速查找这个访客是否已经来过，由此确定UV值是增加1还是不变。如果我们要为100万家店铺同时提供这种服务，就要在内存中维护100万个B树，而如果还要分不同来源维度计算UV的话，这个数量还会迅速膨胀。这对我们的服务器计算资源和内存资源都是一个很大的挑战。

第二点就是传统的基数计算方法无法有效合并。例如，前一小时和这一小时的UV虽然分别计算出来了，但是要看这两个小时的总UV依然要重新进行一遍复杂的计算。使用bitmap数据结构的方案虽然可以快速合并，但是空间复杂度太高，因为时间段的任意组合数量与时间段数量呈幂级关系，所以不论是B树还是简单的bitmap在大数据面前都不是一个有效的方案。

基于以上背景，一淘数据部的技术专家王晓哲（花名清无）研究了基数估计的相关算法及Clearspring的一个java实现（stream-lib），并率先在我们的全息效果平台（代号月光宝盒）的项目中引入了基数估计算法，目前已成功实现利用少量内存对大量UV进行计算的技术难题，并承担了双十一和双十二大促中天猫和淘宝所有会场坑位的效果实时计算任务。

为了方便更多的非Java项目使用此类算法，王晓哲和我根据相关论文并参考stream-lib给出了一个C版本的实现ccard-lib，接着一淘数据部的工程师张维（花名民瞻）又实现了PHP的扩展。目前这个C的实现已经在一淘数据部多个产品中开始使用，并且也已经通过github进行了开源。

CSDN：能不能向读者详细介绍一下一淘数据部的基数估计算法？

张洋：我们使用的算法主要是Adaptive Counting算法，这个算法出现在 “Fast and accurate traffic matrix measurement using adaptive cardinality counting” 这篇论文里，但是我同时在ccard-lib里也实现了Linear Counting、LogLog Counting和HyperLogLog Counting等常见的基数估计算法。

这些算法是概率算法，就是通过牺牲一定的准确性（但是精度可控，并可以通过数学分析给出控制精度的方法），来大幅节省计算的资源使用。例如我们仅仅使用8k的内存就可以对一个数亿量级的UV进行估计，而误差不超过2%，这比使用B树或原始bitmap要大幅节省内存。同时基数估计算法用到了经过哈希变换的bitmap空间，在大幅节省内存的同时依然可以实现高效合并，这就同时解决了上面提到的两个难点。

使用2^16（64K）位时，估算结果如下：

Linear Counting with Murmurhash:

actual: 50000, estimated: 50062, error: 0.12%

actual: 100000, estimated: 99924, error: 0.08%

actual: 150000, estimated: 149865, error: 0.09%

actual: 200000, estimated: 199916, error: 0.04%

actual: 250000, estimated: 250123, error: 0.05%

actual: 300000, estimated: 299942, error: 0.02%

actual: 350000, estimated: 349801, error: 0.06%

actual: 400000, estimated: 400101, error: 0.03%

actual: 450000, estimated: 449955, error: 0.01%

actual: 500000, estimated: 500065, error: 0.01%

Linear Counting with Lookup3hash:

actual: 50000, estimated: 49835, error: 0.33%

actual: 100000, estimated: 99461, error: 0.54%

actual: 150000, estimated: 149006, error: 0.66%

actual: 200000, estimated: 198501, error: 0.75%

actual: 250000, estimated: 248365, error: 0.65%

actual: 300000, estimated: 298065, error: 0.65%

actual: 350000, estimated: 347504, error: 0.71%

actual: 400000, estimated: 397292, error: 0.68%

actual: 450000, estimated: 446700, error: 0.73%

actual: 500000, estimated: 495944, error: 0.81%

Hyperloglog Counting with Murmurhash:

actual: 50000, estimated: 50015, error: 0.03%

actual: 100000, estimated: 100048, error: 0.05%

actual: 150000, estimated: 149709, error: 0.19%

actual: 200000, estimated: 201595, error: 0.80%

actual: 250000, estimated: 250168, error: 0.07%

actual: 300000, estimated: 299864, error: 0.05%

actual: 350000, estimated: 348571, error: 0.41%

actual: 400000, estimated: 398583, error: 0.35%

actual: 450000, estimated: 448632, error: 0.30%

actual: 500000, estimated: 498330, error: 0.33%

Hyperloglog Counting with Lookup3hash:

actual: 50000, estimated: 49628, error: 0.74%

actual: 100000, estimated: 99357, error: 0.64%

actual: 150000, estimated: 148880, error: 0.75%

actual: 200000, estimated: 200475, error: 0.24%

actual: 250000, estimated: 249362, error: 0.26%

actual: 300000, estimated: 299119, error: 0.29%

actual: 350000, estimated: 349225, error: 0.22%

actual: 400000, estimated: 398805, error: 0.30%

actual: 450000, estimated: 448373, error: 0.36%

actual: 500000, estimated: 498183, error: 0.36%

Adaptive Counting with Murmurhash:

actual: 50000, estimated: 50015, error: 0.03%

actual: 100000, estimated: 100048, error: 0.05%

actual: 150000, estimated: 149709, error: 0.19%

actual: 200000, estimated: 201059, error: 0.53%

actual: 250000, estimated: 249991, error: 0.00%

actual: 300000, estimated: 300067, error: 0.02%

actual: 350000, estimated: 349610, error: 0.11%

actual: 400000, estimated: 399875, error: 0.03%

actual: 450000, estimated: 450348, error: 0.08%

actual: 500000, estimated: 500977, error: 0.20%

Adaptive Counting with Lookup3hash:

actual: 50000, estimated: 49628, error: 0.74%

actual: 100000, estimated: 99357, error: 0.64%

actual: 150000, estimated: 148880, error: 0.75%

actual: 200000, estimated: 199895, error: 0.05%

actual: 250000, estimated: 249563, error: 0.17%

actual: 300000, estimated: 299047, error: 0.32%

actual: 350000, estimated: 348665, error: 0.38%

actual: 400000, estimated: 399266, error: 0.18%

actual: 450000, estimated: 450196, error: 0.04%

actual: 500000, estimated: 499516, error: 0.10%

Loglog Counting with Murmurhash:

actual: 50000, estimated: 59857, error: 19.71%

actual: 100000, estimated: 103108, error: 3.11%

actual: 150000, estimated: 150917, error: 0.61%

actual: 200000, estimated: 201059, error: 0.53%

actual: 250000, estimated: 249991, error: 0.00%

actual: 300000, estimated: 300067, error: 0.02%

actual: 350000, estimated: 349610, error: 0.11%

actual: 400000, estimated: 399875, error: 0.03%

actual: 450000, estimated: 450348, error: 0.08%

actual: 500000, estimated: 500977, error: 0.20%

Loglog Counting with Lookup3hash:

actual: 50000, estimated: 59870, error: 19.74%

actual: 100000, estimated: 103044, error: 3.04%

actual: 150000, estimated: 150435, error: 0.29%

actual: 200000, estimated: 199895, error: 0.05%

actual: 250000, estimated: 249563, error: 0.17%

actual: 300000, estimated: 299047, error: 0.32%

actual: 350000, estimated: 348665, error: 0.38%

actual: 400000, estimated: 399266, error: 0.18%

actual: 450000, estimated: 450196, error: 0.04%

actual: 500000, estimated: 499516, error: 0.10%

限于篇幅，我在这里不能具体描述这些算法的细节，之前我在博客上发表了一篇翻译的文章，不过内容也是概括性描述。但是我已经在准备写博文详细介绍基数估计算法了，那里面会包括算法的数理细节以及对论文的一些解读，欢迎有兴趣的朋友关注我的博客。

CSDN：看到您微博上自称“代码洁癖重度患者”，这是一个很有趣的称呼，那么是否可以理解为您对代码的规范性很在意，您在平时在编码过程中如何保持代码的规范？

张洋：这么说其实是有点自嘲的意思吧。对代码格式我确实是很在意的，如果看到代码不规范、不整齐甚至多一个空行我都会觉得非常不舒服，骨子里对代码格式有一种完美主义倾向。

不过这个事情要分两面看，如果是我自己开发的比较专的东西，如算法库，可以坚持这种完美主义，但需要多人合作的场合实际上是不太合适的。实事求是的说，业务代码总是不可能一直很漂亮，需要在业务进度和代码质量中间做一个权衡。在保持代码规范方面，我始终认为不能完全靠程序员的自觉和代码规范的宣讲，通过工具（例如lint）和流程去保证会更有效一些。

CSDN：还有哪些困难是需要在未来工作中克服的？

张洋：需要克服的困难主要来自两方面吧。

一方面是算法本身改进的困难，这世界不存在完美无暇的算法，例如上面的基数估计算法，虽然大大降低了内存使用，但是如果维度爆炸的话，内存使用仍然会很夸张，而且合并bitmap也不是没有代价，有时需要进行内存和磁盘bitmap的合并，当bitmap量过大时磁盘IO会称为瓶颈，因此如何结合具体场景来优化和改进算法就成为一个难点。一个方法是查阅相关论文，了解和借鉴目前全球各大研究机构和公司对相关算法的最新研究成果。另一个方法就是自己进行改进，这块需要对算法本身极其相关的数学分析有非常深入掌握，因此对相关工程师的理论水平要求较高。

另一方面就是算法和业务产品的结合方案。算法毕竟是较为形式化的东西，要具体应用到产品中还有很长一段路要走。寻求算法与产品的最佳契合点和结合方案也是工作中的重点和难点之一。

2012已经过去，我们度过了世界末日，迎来世界新篇章。在2013年，我们也会进入互联网发展的新时代，各种数据充斥在网络中，大数据成为各个互联网公司都要面对的问题之一。如何消耗最小的资源来获得尽可能多的有用信息，这应该是每个互联网公司都要考虑的问题。通过最近关于算法的两篇文章，想必各位读者都能心中有数。当然，每种算法都有各自的优缺点，我们还是要根据在平时工作中的实际使用情况来对算法进行选择，不能一概而论。（王旭东/作者包研/审校）

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

2 weeks ago By DDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

WWE 2K25: How To Unlock Everything In MyRise

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7458

CakePHP Tutorial

1376

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

CLIP-BEVFormer: Explicitly supervise the BEVFormer structure to improve long-tail detection performance Mar 26, 2024 pm 12:41 PM

Written above & the author’s personal understanding: At present, in the entire autonomous driving system, the perception module plays a vital role. The autonomous vehicle driving on the road can only obtain accurate perception results through the perception module. The downstream regulation and control module in the autonomous driving system makes timely and correct judgments and behavioral decisions. Currently, cars with autonomous driving functions are usually equipped with a variety of data information sensors including surround-view camera sensors, lidar sensors, and millimeter-wave radar sensors to collect information in different modalities to achieve accurate perception tasks. The BEV perception algorithm based on pure vision is favored by the industry because of its low hardware cost and easy deployment, and its output results can be easily applied to various downstream tasks.

Implementing Machine Learning Algorithms in C++: Common Challenges and Solutions Jun 03, 2024 pm 01:25 PM

Common challenges faced by machine learning algorithms in C++ include memory management, multi-threading, performance optimization, and maintainability. Solutions include using smart pointers, modern threading libraries, SIMD instructions and third-party libraries, as well as following coding style guidelines and using automation tools. Practical cases show how to use the Eigen library to implement linear regression algorithms, effectively manage memory and use high-performance matrix operations.

Explore the underlying principles and algorithm selection of the C++sort function Apr 02, 2024 pm 05:36 PM

The bottom layer of the C++sort function uses merge sort, its complexity is O(nlogn), and provides different sorting algorithm choices, including quick sort, heap sort and stable sort.

Which company does Blue Star Travel Yao belong to? Mar 22, 2024 pm 03:41 PM

Blue Star Travel Ballad has been on the game hot list after the recent release of a promotional video. Many players are very curious about which company Blue Star Travel Ballad is made from. In fact, it is a new game from Shanghai 2D manufacturer Manjiu. The editor will explain it to you below. Here is the introduction of Blue Star Yuanluyao Game Company, come and take a look together. Which company does Blue Star Travel Yao come from? Answer: It was launched by Manjiu Network. 1. First of all, Blue Star Travel Yao is a game launched by Manju’s Big World RPG. A promotional video was released on March 20. 2. This product will get its version number in October 2023. The game's trademark and operating unit are both registered under the name of a company called. The latter was established in February 2023, and its official website shows that its headquarters is in Singapore. 3. The 11-minute promotional video released this time revealed this

Can artificial intelligence predict crime? Explore CrimeGPT's capabilities Mar 22, 2024 pm 10:10 PM

The convergence of artificial intelligence (AI) and law enforcement opens up new possibilities for crime prevention and detection. The predictive capabilities of artificial intelligence are widely used in systems such as CrimeGPT (Crime Prediction Technology) to predict criminal activities. This article explores the potential of artificial intelligence in crime prediction, its current applications, the challenges it faces, and the possible ethical implications of the technology. Artificial Intelligence and Crime Prediction: The Basics CrimeGPT uses machine learning algorithms to analyze large data sets, identifying patterns that can predict where and when crimes are likely to occur. These data sets include historical crime statistics, demographic information, economic indicators, weather patterns, and more. By identifying trends that human analysts might miss, artificial intelligence can empower law enforcement agencies

Which company does the Hands app belong to? Mar 13, 2024 am 11:10 AM

Hands-on is a brand new chat and dating software, so which company is the Hand-on-hand app? This software was created by Tianjin Laifu Cultural Development Co., Ltd. You can download it from Xiaomi Mall and Apple Mall. This introduction to the Hands-on app creation company can tell you the specific methods. The following is a detailed introduction, so take a look. Which company is the Qianshou app? Answer: Tianjin Laifu Cultural Development Co., Ltd. Detailed description: On the official software website https://www.qianshouapp.cn/, you can see the company name at the bottom. Software introduction: 1. It can filter according to the conditions that users like, and can find the objects they need faster. 2. It can help users search for the objects they need faster.

Improved detection algorithm: for target detection in high-resolution optical remote sensing images Jun 06, 2024 pm 12:33 PM

01 Outlook Summary Currently, it is difficult to achieve an appropriate balance between detection efficiency and detection results. We have developed an enhanced YOLOv5 algorithm for target detection in high-resolution optical remote sensing images, using multi-layer feature pyramids, multi-detection head strategies and hybrid attention modules to improve the effect of the target detection network in optical remote sensing images. According to the SIMD data set, the mAP of the new algorithm is 2.2% better than YOLOv5 and 8.48% better than YOLOX, achieving a better balance between detection results and speed. 02 Background & Motivation With the rapid development of remote sensing technology, high-resolution optical remote sensing images have been used to describe many objects on the earth’s surface, including aircraft, cars, buildings, etc. Object detection in the interpretation of remote sensing images

Application of algorithms in the construction of 58 portrait platform May 09, 2024 am 09:01 AM

1. Background of the Construction of 58 Portraits Platform First of all, I would like to share with you the background of the construction of the 58 Portrait Platform. 1. The traditional thinking of the traditional profiling platform is no longer enough. Building a user profiling platform relies on data warehouse modeling capabilities to integrate data from multiple business lines to build accurate user portraits; it also requires data mining to understand user behavior, interests and needs, and provide algorithms. side capabilities; finally, it also needs to have data platform capabilities to efficiently store, query and share user profile data and provide profile services. The main difference between a self-built business profiling platform and a middle-office profiling platform is that the self-built profiling platform serves a single business line and can be customized on demand; the mid-office platform serves multiple business lines, has complex modeling, and provides more general capabilities. 2.58 User portraits of the background of Zhongtai portrait construction

See all articles

算法无国界 国内算法同样牛

Hot AI Tools

Undresser.AI Undress

AI Clothes Remover

Undress AI Tool

Clothoff.io

AI Hentai Generator

Hot Article

Hot Tools

Notepad++7.3.1

SublimeText3 Chinese version

Zend Studio 13.0.1

Dreamweaver CS6

SublimeText3 Mac version

Hot Topics

算法无国界国内算法同样牛