You have probably heard that your chances of winning a lottery are very slim. As with all things probability, though, multiple trials can tilt the odds in your favor: the more lotteries you participate in, the better your chances of winning at least one. That is still by no means a guarantee that you will eventually win, but with a uniform distribution, and going by the law of large numbers (in this case meaning a large number of lotteries), we can arrive at outcomes that are relatively more likely.
It's important to understand that each new lottery is independent of every other, and the same ticket number could, in principle, win many different lotteries. You could also be unlucky and pick a losing number in every lottery, regardless of how many times you tried. You have two options now:

1. Pick a different ticket number for each lottery you play.
2. Pick one ticket number and stick with it for every lottery you play.
Theoretically (and mathematically), both scenarios have an equal likelihood of occurring. However, scenario 2 gives you a slight practical edge: as the number of trials approaches infinity, every number will eventually be selected. The problem with scenario 1 is that you have to hope that the number you happen to hold at the time matches the number that wins. With scenario 2, you're sure that, as the trials tend to infinity, your number will at some point "win". For this blog post, we will use scenario 2.
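To put a number on "a bit better": with n possible ticket numbers, the probability of winning at least one of k independent lotteries (holding the same ticket each time) is

$$
P(\text{at least one win}) = 1 - \left(1 - \frac{1}{n}\right)^{k},
$$

which climbs toward 1 as k grows but never actually reaches it. That is exactly why more attempts improve your odds without ever guaranteeing a win.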
So, do you think you can answer this question before I tell you the answer?
"If all lotteries around you had slots for exactly 1 million people and you picked the same ticket [x] for everyone you played, how many lotteries would you have to play to finally be a winner?" (Feel free to comment on what your initial answer was)
The answer is...
About 14.4 million times.
The rest of this blog post will be about how I arrived at that value, how the simulations were done, and some caveats. Things will get more technical from here.
The ticket numbers in a lottery of 1 million people would range from 1 to 1,000,000 (or 0 to 999,999). Players can only pick a number within that range for each lottery, and the winning ticket can only come from that range. Essentially, we are working with a set of 1 million numbers.
Since a player can pick any number within that range, we need to satisfy the condition of every item in the set being drawn at least once: once every number has been called at least once, that covers any possible ticket number a player could have picked. We also don't care how many times each number is drawn, which makes a "set" the ideal Python data structure for our simulation. We will start with an empty set and populate it with a randomly generated number at each iteration until the set contains every number within the specified range. Since Python sets never hold duplicates, we don't have to worry about ensuring uniqueness.
```python
import random

def calculate_lottery_chances(lottery_players_count):
    """Simulate lotteries until every possible ticket number has won at least once."""
    number_set = set()
    count = 0
    while len(number_set) < lottery_players_count:
        # draw this lottery's winning ticket uniformly at random
        gen_number = random.randint(1, lottery_players_count)
        number_set.add(gen_number)
        count += 1
    return count
```
For a lottery of 1,000,000 people, the function call would look like calculate_lottery_chances(1000000), and it would return the number of lottery attempts before winning. Arranging the code in this manner makes it easy to extend.
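Just to illustrate that extensibility (the smaller sizes below are my own examples, not part of the experiment), the same function works unchanged for lotteries of any size:

```python
# illustrative smaller lotteries; the sizes are arbitrary
for size in (10, 1_000, 100_000):
    print(f"{size:>7} players -> won after {calculate_lottery_chances(size)} lotteries")
```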
In a nutshell, the root cause of the problem is "variation". The first time I ran the function, I got "13.1 million" times as my value. I reran it and got something along the lines of 13.9 million. I did this even more times and got widely varying answers; at one point, I got 15 million. It was clear that I would need to repeat the run and take an average. Following the existing pattern so far, I figured that as the number of iterations to average over tended towards infinity, I would get closer to one reliable answer. There was a need for something that could do this, and do it fast, which led me to write this function:
```python
def average_over_n_times(function, function_arg, n):
    """
    This returns the average of the returned value of a function
    when it is called n times, with its (one) arg
    """
    total = 0
    for _ in range(n):
        total += function(function_arg)
    return round(total / n)
```
Subsequently, everything is wired together as:

```python
num_of_trials = average_over_n_times(calculate_lottery_chances, lottery_players_count, n)
```
Where "n" would represent the number of times to average results by. This, however, brings up another problem that will be discussed in the next section.
The larger the value of n, the closer we get to an "average-case" result. However, considering that there are still no absolutes or certainties, performing this series of tasks too many times stops being productive. I say this for the following reasons:

- Each single simulation is already expensive (it has to draw at least a million numbers), so total runtime grows linearly with n.
- The gains diminish: the noise in the average shrinks roughly with the square root of n, so quadrupling n only halves the spread.
- Because the process is random, no finite n will ever produce one exact, final answer anyway.
Keeping these in mind, I tested "n" with the values: 10, 20, 30, 50, 100, 1000, and 5000 times.
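Putting the pieces together, a sweep over those values looks roughly like this (the driver loop is my own sketch; each value of n was actually run separately):

```python
# sketch of the sweep; in practice each n was a separate invocation
lottery_players_count = 1_000_000
for n in (10, 20, 30, 50, 100, 1000, 5000):
    num_of_trials = average_over_n_times(calculate_lottery_chances, lottery_players_count, n)
    print(f"n={n}: ~{num_of_trials} lotteries before winning")
```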
At this point, you're probably wondering why the word "PyTorch" from the title of this blog post hasn't even been mentioned. Well, although I mentioned testing n with different values, it wasn't the same code I used for all the tests.
These were computationally heavy experiments, and my CPU had a word with me. The code snippets I shared earlier were written in a single file with zero external package dependencies, and I ran that file in a bash shell with the time command prepended to track execution times. Here's what the execution times looked like using just the CPU:
| n | Time (min and sec) |
|---|---|
| 10 | 1m34.494s |
| 20 | 3m2.591s |
| 30 | 5m19.903s |
| 50 | 10m58.844s |
| 100 | 14m56.157s |
At n = 1000, I couldn't get the program to finish anymore. I wasn't sure if it broke halfway and failed to stop the execution, but I canceled it after 4 hours and 57 minutes. There are a few factors I feel influenced this, which I discuss in the "caveats" section. Anyway, my fans were blaring, and I knew I may have pushed my laptop's modestly powered CPU a bit too far. I refused to accept defeat, and while thinking of what I could do to at least run 4-digit iterations, I remembered something a friend of mine who worked with PyTorch had told me:
"GPUs are generally more efficient at computationally intensive than CPUs"
PyTorch can offload exactly that kind of work to the GPU, making it the perfect tool for the job.
For our purposes, PyTorch would handle the calculations, so refactoring the existing calculate_lottery_chances() code meant swapping the CPU-reliant numerical operations for suitable PyTorch data structures. In a nutshell:

- The Python set becomes a boolean tensor on the GPU, marking which ticket numbers have been drawn.
- random.randint becomes torch.randint, generating winning numbers directly on the GPU device.
- The set-size check becomes a tensor operation, torch.all, asking whether every number has been drawn.
A minimal sketch of the refactored calculate_lottery_chances looks like this:
```python
import torch

device = torch.device("xpu")  # Intel GPU; see the device note below

def calculate_lottery_chances(lottery_players_count):
    # boolean tensor marking which ticket numbers (0 to count - 1) have won so far
    drawn = torch.zeros(lottery_players_count, dtype=torch.bool, device=device)
    count = 0
    while not torch.all(drawn):
        # torch.randint's upper bound is exclusive, matching the 0 - 999,999 range
        gen_number = torch.randint(0, lottery_players_count, (1,), device=device)
        drawn[gen_number] = True
        count += 1
    return count
```
I set my device as "xpu" because my computer uses an Intel Graphics GPU, which PyTorch supports.
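If your hardware differs, the device string will too. Assuming a recent PyTorch build, a defensive selection might look like this (the fallback chain is my own addition, not part of my original script):

```python
import torch

# prefer Intel's "xpu" backend, then NVIDIA's "cuda", then plain CPU
if torch.xpu.is_available():
    device = torch.device("xpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```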
To ensure my GPU was used during execution, I opened my Windows task manager and navigated to the "performance" section before running. On running, I saw a noticeable spike in GPU resource usage.
For context, here's the before vs. after from the Task Manager screenshots: before running, GPU usage sat at 1%; while the script was running, it jumped to 49%.
As for the runtimes across varying values of n, the GPU was several times faster: it consistently handled every n below 100 in under a minute, and it managed to complete a run with n = 5000 (five thousand!).
Here's a table of runtimes using the GPU:
| n | Time (min and sec) |
|---|---|
| 10 | 0m13.920s |
| 20 | 0m18.797s |
| 30 | 0m24.749s |
| 50 | 0m34.076s |
| 100 | 1m12.726s |
| 1000 | 16m9.831s |
To get a visual sense of how large the performance gap between the GPU and CPU was for this experiment, here's a data visualization:
The x-axis is capped at 100 because beyond that I could no longer get realistically "timely" output from the CPU, leaving nothing to compare the GPU against. Running the experiment with n in the range of 1000 to 5000 gave me about "14.4 million times" as a result more often than not. That's how I got the answer from earlier.
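Incidentally, there's a neat way to sanity-check that number (this aside is mine, not part of the original experiment): the condition "every ticket number must be drawn at least once" is the classic coupon collector problem, whose expected number of draws for n = 1,000,000 is

$$
E[T] = n \cdot H_n \approx n(\ln n + \gamma) \approx 10^6 \times (13.816 + 0.577) \approx 14.39 \text{ million},
$$

where H_n is the n-th harmonic number and gamma is the Euler-Mascheroni constant (about 0.5772), comfortably close to the simulated 14.4 million.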
This experiment made assumptions and relied on certain ways of doing things, and my inexperience with PyTorch may well mean a more efficient approach existed. Here are some factors to consider that may have influenced either the accuracy of my findings or the execution times:

- Both random.randint and torch.randint are pseudorandom generators; the simulation assumes their output is uniformly distributed across the range.
- Other processes competing for my laptop's resources, along with possible thermal throttling during the long runs, could have skewed the timings.
- My refactor was a first attempt at PyTorch; a more heavily vectorized approach could likely have been faster still.
Finally, I'd like to point out that this was my first time using PyTorch for anything, and I was quite impressed with the performance.
When I went down this rabbit hole, I didn't expect to see such performance gains. I learned the idea behind tensors and a few things about the machinery that supports even more computationally complex tasks. You have the freedom to use, replicate, or modify the code snippets as you please.
Thank you for indulging me, and I hope you had a fun read.
Till next time,
Cheers.