Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs
Introduction
The task is to reduce the efficiency of a Monte-Carlo simulation program by exploiting the Intel Sandybridge processor architecture. This processor has an out-of-order pipeline with features like register renaming and store buffering, making it challenging to reduce instruction-level parallelism (ILP) and introduce hazards.
Program Analysis
The program is a Monte-Carlo simulation that calculates the price of European vanilla call and put options. The key components of the program are:
- A loop that iterates a specified number of times
- Gaussian random number generation
- Black-Scholes Option Pricing Formula
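For orientation, the three components above can be sketched as a single function. This is a minimal baseline under stated assumptions, not the original program: the name `monte_carlo_price`, its parameters, the fixed seed, and the use of `std::normal_distribution` for the Gaussian draws are all illustrative choices, and the terminal price is simulated under the usual Black-Scholes (geometric Brownian motion) model.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical baseline: Monte-Carlo price of a European call or put.
// S: spot price, K: strike, r: risk-free rate, v: volatility, T: years to expiry.
double monte_carlo_price(std::uint64_t num_sims, double S, double K,
                         double r, double v, double T, bool is_call,
                         std::uint64_t seed = 42) {
    assert(num_sims > 0);
    std::mt19937_64 rng(seed);                        // assumed generator
    std::normal_distribution<double> gauss(0.0, 1.0); // Gaussian draws

    // Loop-invariant terms, computed once outside the hot loop.
    const double S_adjust = S * std::exp(T * (r - 0.5 * v * v));
    const double vol_sqrt_T = std::sqrt(v * v * T);

    double payoff_sum = 0.0;
    for (std::uint64_t i = 0; i < num_sims; ++i) {
        // One simulated terminal price per iteration.
        const double S_cur = S_adjust * std::exp(vol_sqrt_T * gauss(rng));
        const double payoff = is_call ? S_cur - K : K - S_cur;
        payoff_sum += std::max(payoff, 0.0);
    }
    // Discount the average payoff back to today.
    return (payoff_sum / static_cast<double>(num_sims)) * std::exp(-r * T);
}
```

Every pessimization discussed below targets some part of this loop: the counter, the random draw, the floating-point arithmetic, or the accumulation into `payoff_sum`.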
Deoptimization Techniques
The following techniques can be used to reduce program efficiency:
- False dependencies: Introduce unnecessary dependencies between instructions so that independent work is serialized into one long dependency chain.
- Memory bottlenecks: Cause cache misses and memory access delays by misaligning data or using non-contiguous memory access patterns.
- Long-latency instructions: Prefer instructions with long latencies, such as division and square root, and place them on the critical path where out-of-order execution cannot hide them.
- Less efficient operations: Replace cheap operations with more expensive equivalents, such as dividing by a value instead of multiplying by its reciprocal.
- Branch mispredictions: Introduce unpredictable branches to cause pipeline flushes.
- Store-forwarding stalls: Follow a narrow store with a wider load of the same location, for example by XORing the high byte of a double to flip its sign and then reloading the whole value.
- Instruction cache misses: Break up routines into small chunks to cause instruction cache misses.
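To make the store-forwarding item concrete, here is a small sketch of a sign flip done the slow way. The function name `negate_via_byte_store` is hypothetical, and the code assumes a little-endian machine with 64-bit IEEE-754 doubles, which holds on x86. The `volatile` qualifier is only there to stop the compiler from folding the memory traffic into a single register operation.

```cpp
#include <cassert>
#include <cstdint>

// Flip the sign of x through memory instead of with one XOR in a register.
// Assumes little-endian, 64-bit IEEE-754 doubles (the sign bit lives in byte 7).
double negate_via_byte_store(double x) {
    static_assert(sizeof(double) == 8, "sketch expects a 64-bit double");
    assert(x == x);  // illustrative precondition: reject NaN inputs

    volatile double slot = x;  // 8-byte store
    volatile unsigned char* bytes =
        reinterpret_cast<volatile unsigned char*>(&slot);

    // 1-byte load, then 1-byte store to the byte that holds the sign bit.
    bytes[7] = static_cast<unsigned char>(bytes[7] ^ 0x80u);

    // 8-byte load that overlaps the recent 1-byte store: the load needs data
    // from two different stores, so it cannot be satisfied by forwarding.
    return slot;
}
```

Calling this in place of a plain negation inside the simulation loop puts the stall on the critical path of every iteration; the size of the penalty depends on the microarchitecture and is not measured here.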
Specific Suggestions
Based on the above techniques, here are some specific suggestions to pessimize the program:
- Use std::atomic for loop counters and misalign them.
- Induce false sharing among non-atomic variables.
- Multi-thread with a single shared std::atomic loop counter.
- Rewrite expressions with associative/distributive equivalents to increase work.
- Use intrinsic functions in ways that cause pipeline stalls rather than avoid them.
- Use inline assembly to break up the uop cache.
- Use CPUID/RDTSC to time each iteration and induce serialization.
- Traverse arrays in non-contiguous order and use arrays with padding and misaligned elements.
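- Use double precision instead of float where it costs more: on Sandybridge this mainly affects division and square root, since addition and multiplication have the same latency for both.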
- Force conversions from integer to float and back again.
- Disable compiler optimizations with -O0, and use -march=i386 for slower instructions (with GCC this requires a 32-bit build, for example with -m32).
- Set CPU affinity frequently to different CPUs.
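The shared-counter suggestion can be sketched as follows. This is an illustrative driver, not part of the original program: `run_with_shared_counter` is a hypothetical name, and the second atomic, `done`, is a placeholder standing in for the real per-iteration simulation work.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical pessimized driver: every thread claims iterations one at a
// time from a single shared counter, so the cache line holding `next`
// bounces between cores on every iteration.
std::uint64_t run_with_shared_counter(std::uint64_t num_sims,
                                      unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> next{0};  // the one shared loop counter
    std::atomic<std::uint64_t> done{0};  // placeholder for per-iteration work

    auto worker = [&] {
        // fetch_add defaults to sequentially consistent ordering, which on
        // x86 compiles to a lock-prefixed read-modify-write.
        while (next.fetch_add(1) < num_sims) {
            done.fetch_add(1);  // one simulated path would be computed here
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return done.load();  // number of iterations actually executed
}
```

Because each `fetch_add` on `next` returns a distinct value, exactly `num_sims` iterations run in total regardless of the thread count; the threads simply spend much of their time contending for the same cache line. On some toolchains this needs to be built with `-pthread`.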