Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs
Goal: Pessimize a program so that it runs slower, by exploiting knowledge of the Intel i7 pipeline.
Problem:
The assignment offered two options: a Whetstone program or a Monte-Carlo program. The student chose the Monte-Carlo simulation, but their pessimization efforts increased the running time by only about a second.
Question:
How can the student further pessimize the code to achieve a more significant slowdown?
Answer:
General Strategies:
- Introduce unpredictable branches to increase mispredict penalties.
- Lengthen loop-carried dependency chains to reduce instruction-level parallelism.
- Use slower FP operations and divs, especially exp and log functions.
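As a minimal sketch of the second and third strategies, the two functions below compute the same sum. The names sum_fast and sum_slow are hypothetical and not taken from the assignment code. The slow version keeps a single accumulator and routes every step through exp and log, so each iteration has to wait for the previous one. It assumes the program is built without -ffast-math, since a compiler is then allowed to fold log(exp(x)) back into x.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Baseline: four independent accumulators, so the CPU can overlap the adds.
double sum_fast(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < v.size(); ++i) a0 += v[i];  // leftover elements
    return (a0 + a1) + (a2 + a3);
}

// Pessimized: one accumulator, and every step passes through exp and log.
// log(exp(t)) equals t up to rounding, so the result is only approximately
// the same, and exp overflows once the running sum grows past roughly 709.
double sum_slow(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) {
        acc = std::log(std::exp(acc + x));  // long-latency, loop-carried chain
    }
    return acc;
}
```

Timing both functions on the same large vector of small values should show the gap; the exact ratio depends on the compiler, flags, and math library.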
Uarch-Specific Ideas:
With intrinsics:
- Use movnti (a non-temporal store) to evict data from cache.
- Use integer shuffles between FP math operations to cause bypass delays.
- Mix legacy SSE and 256-bit AVX instructions without using vzeroupper, which causes transition stalls on these CPUs.
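The integer-shuffle idea can be sketched with SSE2 intrinsics, which are available on every x86-64 target. The function name bounce_add is hypothetical. Each iteration does an FP add and then passes the result through an integer-domain shuffle before the next add. Whether the final binary really contains pshufd is an assumption: the compiler is free to substitute an FP-domain shuffle, so the generated assembly is worth checking.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Adds 1.0 to both lanes n times. Between adds, the value is swapped
// lane-for-lane by an integer shuffle, which is intended to force the data
// to cross between the FP and integer execution domains.
double bounce_add(double start, int n) {
    __m128d acc = _mm_set1_pd(start);
    const __m128d one = _mm_set1_pd(1.0);
    for (int i = 0; i < n; ++i) {
        acc = _mm_add_pd(acc, one);            // FP domain
        __m128i bits = _mm_castpd_si128(acc);  // type change only, no instruction
        // Swap the two 64-bit halves with an integer shuffle (pshufd).
        bits = _mm_shuffle_epi32(bits, _MM_SHUFFLE(1, 0, 3, 2));
        acc = _mm_castsi128_pd(bits);          // back to the FP type
    }
    return _mm_cvtsd_f64(acc);                 // low lane
}
```

Both lanes hold the same value, so the swap does not change the numeric result; it only adds latency to the loop-carried chain.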
With (inline) asm:
- Align code differently from what the compiler would choose, so that hot loops no longer fit well in the uop cache.
- Use self-modifying code to trigger pipeline clears (machine nukes).
Inducing Cache Misses and Memory Slowdowns:
- Perform narrow stores followed by a wider load of the same bytes to cause store-forwarding stalls.
- Replace local vars with members of a big struct to control memory layout.
- Arrange memory layout to increase cache misses and page-split loads.
- Use misaligned variables to span cache-line or page boundaries.
- Loop over arrays in non-contiguous order.
- Consider using linked lists instead of arrays.
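Two of the items above can be sketched in a few lines. Both function names are hypothetical. narrow_store_wide_load writes a value one byte at a time and immediately reads all eight bytes back in one load, which is the pattern that is expected to defeat store forwarding. strided_sum walks a row-major matrix in column order, so consecutive accesses are cols elements apart. The volatile accesses are there only to discourage the compiler from merging the byte stores; that it keeps them as separate instructions is an assumption to confirm in the assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Eight 1-byte stores followed by one 8-byte load of the same location.
std::uint64_t narrow_store_wide_load(std::uint64_t value) {
    std::uint64_t slot = 0;
    const unsigned char* src = reinterpret_cast<const unsigned char*>(&value);
    volatile unsigned char* dst = reinterpret_cast<unsigned char*>(&slot);
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[i];  // narrow store
    }
    // Wide load that overlaps all eight preceding stores.
    return *static_cast<volatile std::uint64_t*>(&slot);
}

// Sums a rows x cols matrix stored row-major, but visits it column by column.
double strided_sum(const std::vector<double>& m, std::size_t rows,
                   std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];  // stride of cols elements per access
        }
    }
    return total;
}
```

The strided traversal only hurts once the matrix is much larger than the caches; for small inputs it runs about as fast as the contiguous order.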
Other Techniques:
- Use std::atomic loop counters for slower atomic operations.
- Compile with -m32 or -march=i386 to force slower code generation.
- Use long double (80-bit extended precision on x86) for calculations; this forces slower x87 code and prevents SSE/AVX vectorization.
- Frequently set CPU affinity to different CPUs.
- Make excessive system calls to add user/kernel transition and context-switching overhead.
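The atomic loop counter is the easiest of these to try. A minimal sketch, with the hypothetical name count_with_atomic: the counter is a std::atomic updated with the default sequentially consistent ordering, which on x86 is expected to compile to a lock-prefixed read-modify-write on every iteration instead of a plain register increment.

```cpp
#include <atomic>
#include <cstdint>

// Runs the loop body n times using an atomic counter.
// fetch_add returns the value before the increment.
std::uint64_t count_with_atomic(std::uint64_t n) {
    std::atomic<std::uint64_t> i{0};
    std::uint64_t work = 0;
    while (i.fetch_add(1) < n) {  // seq_cst atomic RMW per iteration
        work += 1;                // stand-in for the real loop body
    }
    return work;
}
```

In the real program the stand-in body would be one step of the simulation; the atomic increment then adds a fixed cost to every iteration.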
Final Notes:
- While these techniques effectively slow down the code, their level of "diabolical incompetence" depends on the justification given.
- The assignment instructor may have intended for students to learn about pipeline hazards and dependencies, rather than merely applying these techniques blindly.