Your code computes positional population counts in a two-level loop, and you want to speed up the inner loop with assembly. The loop iterates over a byte slice and calls __mm_add_epi32_inplace_purego to add the per-position counts into an array.
To optimize the inner loop, you can implement __mm_add_epi32_inplace_purego in assembly. Below is the suggested optimized version of the function:
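<pre class="brush:php;toolbar:false"><code class="assembly">.text
.globl __mm_add_epi32_inplace_purego
__mm_add_epi32_inplace_purego:
    movq rdi, [rsi]
    movq rsi, [rdi+8]
    addq rsi, rdi
    movups (%rsi, %rax, 8), %xmm0
    addq , %rsi
    movups (%rsi, %rax, 8), %xmm1
    paddusbd %xmm0, %xmm0
    paddusbd %xmm1, %xmm1
    vextracti128</code></pre>
Explanation:
This listing is meant to replace the function body with packed SSE instructions. As published it is cut off after vextracti128 and will not assemble: it mixes Intel and AT&T operand syntax, one addq is missing its immediate operand, and paddusbd is not an x86 instruction (the packed 32-bit add that corresponds to _mm_add_epi32 is paddd).
To go further, the entire loop can be written in assembly:
<pre class="brush:php;toolbar:false"><code class="assembly">.text
.globl __optimized_population_count_loop
__optimized_population_count_loop:
    movq rdi, [rsi]
    leaq (0, %rdi, 4), %rdx    # multiply rdi by 4, rdx = counts
    movq rsp, r11
    and rsp, -16
    subq r15, r11
    movq r15, r9
    mov rdi, (%rsi)
    movq r15, rsi
    mov %rsi, rsi
    pxor %eax, %eax
    dec %rsi
.loop:
    inc %rsi
    addq , rsi
    cmp rsi, rdi
    cmovge %rsi, rsi
    movsw (%rdi, %rax, 2), %ax
    movsw (%rsi, %rax, 2), %dx
    movw %ax, (%rdx)
    movw %dx, 2(%rdx)
.end_loop:</code></pre>
Explanation:
The complete loop is meant to run in assembly as well. This listing has similar problems: an addq is again missing its immediate operand, pxor is applied to the general-purpose register %eax although it only operates on MMX or XMM registers, and the routine never jumps back to .loop or returns. Treat both listings as outlines of the approach, not as drop-in code.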
A hand-written assembly version can improve the performance of the positional population count, but the gain should be measured against the existing implementation rather than assumed.