


Groundbreaking CVM algorithm cracks a counting problem more than 40 years old! Computer scientists flip coins to count the unique words in 'Hamlet'
Counting sounds simple, but it can be very difficult to do in practice.
Imagine that you are sent to a pristine tropical rainforest to conduct a wildlife census. Whenever you see an animal, you take a photo.
Your digital camera records only the total number of photos taken, but what you want is the number of unique animals, and the camera keeps no such statistic.
So, what is the best way to obtain this count of unique animals?
The obvious answer is to keep a list of every animal you have seen so far and compare each new animal in a photo against that list.
However, this common counting method becomes impractical when the amount of information runs to billions of entries.
Computer scientists from the Indian Statistical Institute, the University of Nebraska-Lincoln (UNL), and the National University of Singapore have proposed a new algorithm: CVM.
It can approximate the number of distinct entries in a long list while remembering only a small number of entries.
Paper address: https://arxiv.org/pdf/2301.10191
This algorithm works for any list whose items arrive one at a time, such as words in a speech, merchandise on a conveyor belt, or cars on the interstate.
The CVM algorithm is named after the initials of its three authors and represents significant progress on the "distinct elements problem".
This problem has troubled computer scientists for more than 40 years.
It asks for an efficient way to monitor a stream of elements (the total number of which may exceed available memory) and estimate the number of unique elements in it.
So, how does the CVM algorithm solve the problem?
The pioneering CVM algorithm: the secret lies in "randomization"
Suppose you are listening to the audiobook of "Hamlet".
The play has a total of 30,557 words. How many of them are distinct?
To find the answer, you could pause as you listen and write each word down in an alphabetical list, skipping words already on the list. When you reach the end, you just count the words on the list.
This method works, but it demands too much "memory".
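As a point of reference, the list-keeping approach can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_exact` and the sample sentence are made up for the example. The thing to notice is that the set grows with the number of distinct words, which is exactly the memory cost described above.

```python
def count_distinct_exact(stream):
    """Count distinct items by remembering every item seen so far."""
    seen = set()            # grows by one entry for every new distinct item
    for item in stream:
        seen.add(item)      # adding an item that is already present is a no-op
    return len(seen)


# Hypothetical sample input, not taken from the article.
words = "to be or not to be that is the question".split()
print(count_distinct_exact(words))  # 8
```

For a stream with billions of distinct entries, `seen` would need to hold all of them, which is what the approximate method avoids.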
Researcher Vinodchandran Variyam said, "In a typical data-streaming situation, there may be millions of items to track. You may not want to store all the information.
This is where the CVM algorithm can provide a simpler approach."
The trick is "randomization".
Vinodchandran Variyam helped invent the CVM algorithm for estimating the number of distinct elements in a data stream.
How many unique words are there in "Hamlet"? The coin-flip challenge
Go back to "Hamlet" and assume that your working memory can only hold 100 words.
Once the audio starts playing, you write down the first 100 distinct words you hear, skipping any repeats.
When all 100 slots are full, you toss a coin for each word:
Heads, you keep the word. Tails, you delete it.
After this preliminary round, you will be left with about 50 distinct words.
Now you move on to what the team calls Round 1: you continue through "Hamlet", adding new words as you go.
If you encounter a word that is already on the list, flip a coin again: tails, delete the word; heads, it stays. Continue in this way until there are 100 words on your memory whiteboard.
Then roughly half of the words are randomly deleted again, based on the results of 100 coin tosses. Round 1 ends here.
Next comes the second round, Round 2.
It proceeds like Round 1, except that it is now harder to keep a word. When you encounter a repeated word, flip the coin again.
If it is tails, delete the word as before. But if it is heads, flip the coin a second time. The word is retained only if the second flip is also heads.
Once the memory board is full, the round ends, and about half of the words are again deleted based on 100 coin tosses.
In Round 3, you need three heads in a row to keep a word.
In Round 4, you need four heads in a row, and so on.
Eventually, in round k, you will reach the end of "Hamlet".
The point of this exercise is to ensure that every word has the same probability of being on the list at the end: 1/2^k.
Suppose that when the "Hamlet" audio ends, you have 61 words on your list and the process took six rounds.
You can estimate the number of distinct words by dividing 61 by the probability 1/2^6, that is, 61 × 64, which gives 3,904 in this case.
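The coin-flip game can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference code: the names `cvm_estimate`, `memory_size` and `rng` are invented here, and the logic follows the pseudocode in the linked paper as I understand it. One detail differs from the walkthrough: in this sketch the round's coin flips are applied to every incoming word, new or repeated, which is what should give each distinct word the same 1/2^k chance of being on the list.

```python
import random


def cvm_estimate(stream, memory_size, rng=None):
    """Estimate the number of distinct items in `stream` while storing
    fewer than `memory_size` items between steps (illustrative sketch)."""
    rng = rng or random.Random()
    kept = set()   # the "whiteboard"
    p = 1.0        # keep probability: 1/2^k in round k
    for item in stream:
        kept.discard(item)             # forget any earlier decision about this item
        if rng.random() < p:           # same odds as k heads in a row
            kept.add(item)
        if len(kept) == memory_size:   # whiteboard full: the round ends
            # Purge: each stored item survives one more coin toss.
            kept = {x for x in kept if rng.random() < 0.5}
            p /= 2
            if len(kept) == memory_size:
                # Assumption: treat the very unlikely "nothing purged" case as a failure.
                raise RuntimeError("purge removed nothing; run again")
    return len(kept) / p               # e.g. 61 / (1/2**6) = 3904


# Hypothetical sample input, not taken from the article.
words = "to be or not to be that is the question".split()
print(cvm_estimate(words, memory_size=100))  # 8.0: memory never fills, so the count is exact
```

With a small `memory_size` the result varies from run to run, so averaging several runs, as in the experiments reported below, is one way to smooth out the randomness.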
The accuracy of the algorithm improves with the amount of memory
Chakraborty, Variyam and Meel proved mathematically that the accuracy of the CVM algorithm improves as the size of the memory grows.
By ordinary exact counting, "Hamlet" has exactly 3,967 unique words.
In experiments using a memory of 100 words, the average estimate over five runs was 3,955 words.
With a memory of 1,000 words, the average estimate improved to 3,964.
Variyam said, "If (the amount of memory) is large enough to accommodate all words, then we can achieve 100% accuracy."
William Kuszmaul of Harvard University said, "This is a good example of how even very basic and widely studied problems can sometimes have simple but not obvious solutions that are still waiting to be discovered."