How Can SIMD Instructions Be Used to Optimize the atoi Function?

DDD

Released: 2024-12-30 04:13:09

How to Implement atoi Using SIMD

This article explores an algorithm for implementing atoi, the function that converts the string representation of an integer into its numerical value, using Single Instruction, Multiple Data (SIMD) instructions. Because SIMD instructions operate on several digits at once, this approach can potentially outperform a loop that handles one character at a time.

The Algorithm

The proposed algorithm consists of the following steps:

Initialize a vector of length N: Create a vector of length N, where N is the maximum number of digits you want to support. Fill it with the powers of 10 in descending order, so that the first (most significant) digit lines up with the largest power (e.g., [10^(N-1), 10^(N-2), ..., 10^0]).
Convert each character in the buffer to an integer: Convert each character in the input string to its corresponding integer value and store it in another vector.
Multiply significant digits by powers of 10: Take each element from the vector of significant digits and multiply it by the corresponding element from the vector of powers of 10. Sum the results of these multiplications to obtain the numerical value of the string.

Specifically, for each digit in the input string:

Extract the digit value (0 to 9) by subtracting 48, the ASCII code of '0', from the character's ASCII code.
Multiply the digit value by the corresponding power of 10.
Add the result to the sum of the previously computed values.
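
The three steps above can be written as a plain scalar loop, which is useful as a reference to check any SIMD version against. This is only a minimal sketch: the function name parse_digits_scalar is hypothetical, and it assumes the input holds nothing but the characters '0' to '9' and that the value fits in 32 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Scalar reference: value = sum of (c - '0') * 10^k, where k is the
// character's distance from the right end of the string.
// Assumption: `digits` contains only '0'..'9'; no sign, no whitespace.
std::uint32_t parse_digits_scalar(std::string_view digits) {
    std::uint32_t sum = 0;
    std::uint32_t power = 1;  // 10^0 belongs to the last character
    for (std::size_t i = digits.size(); i-- > 0;) {
        // Step 1: character to digit value.
        std::uint32_t value = static_cast<std::uint32_t>(digits[i] - '0');
        // Steps 2 and 3: weight the digit and accumulate.
        sum += value * power;
        power *= 10;
    }
    return sum;
}
```

A SIMD version performs the same multiplications, but for many digits per instruction rather than one per loop iteration.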

Implementation Considerations

When implementing this algorithm in SIMD code, we can take advantage of the inherent parallelism of SIMD instructions to process multiple digits simultaneously. The code has to be written for a specific SIMD instruction set: the listing below relies on pcmpistri (SSE4.2), pmulld (SSE4.1), and pshufb and phaddd (SSSE3), so it needs a CPU with SSE4.2 support.

Potential Optimization:

The multiplication by powers of 10 and the summation of the products do not have to be separate passes. Integer multiply-add instructions such as pmaddwd multiply each element by its weight and add adjacent pairs of products in a single instruction, so part of the summation is folded into the multiplication. The implementation below uses pmaddwd for this reason.
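
To make the idea of combining the multiply and the add concrete, here is a portable model of the lane arithmetic that a pmaddwd-style instruction performs. It is a sketch, not intrinsic code: the name madd_pairs is hypothetical, and the loop stands in for what the hardware does for all lanes in one instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Models a pmaddwd-style operation: eight signed 16-bit lanes are multiplied
// by eight 16-bit weights, and each adjacent pair of products is added into
// one 32-bit lane.
std::array<std::int32_t, 4> madd_pairs(const std::array<std::int16_t, 8>& a,
                                       const std::array<std::int16_t, 8>& w) {
    std::array<std::int32_t, 4> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = std::int32_t{a[2 * i]} * w[2 * i] +
                 std::int32_t{a[2 * i + 1]} * w[2 * i + 1];
    }
    return out;
}
```

With the digits stored least significant first and the weights 1, 10, 100, 1000 repeated twice, every output lane holds a partial sum of two weighted digits, so the following horizontal add has half as many elements to combine.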

An Alternative Suggestion

As Peter Cordes suggested in the comments, the final add/xor instruction pair, which applies the sign to the result, can be replaced with a single imul (integer multiply) instruction. This may reduce code size and could also improve performance.
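
The two ways of applying the sign can be compared in a few lines of C++. This is a sketch under stated assumptions: the function names are hypothetical, the mask variant assumes a register that holds 0 for a positive number and -1 for a negative one (as EDX does in the listing below), and the multiply variant assumes the sign is kept as +1 or -1 instead.

```cpp
#include <cstdint>

// add/xor variant: mask is 0 (keep) or -1, i.e. all bits set (negate).
// For mask == -1, (x - 1) ^ 0xFFFFFFFF == ~(x - 1) == -x in two's complement.
std::int32_t apply_sign_mask(std::int32_t magnitude, std::int32_t mask) {
    std::uint32_t x = static_cast<std::uint32_t>(magnitude);
    std::uint32_t m = static_cast<std::uint32_t>(mask);
    return static_cast<std::int32_t>((x + m) ^ m);
}

// Multiply variant: sign is +1 or -1, one multiplication does the job.
std::int32_t apply_sign_multiply(std::int32_t magnitude, std::int32_t sign) {
    return magnitude * sign;
}
```

Both functions return the same value for the same input. Which one is faster depends on the target CPU, since a multiply typically has higher latency than an add or xor but takes one instruction instead of two.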

Implementation in GNU Assembler with Intel Syntax

Here is a sample implementation of the algorithm in GNU Assembler with Intel syntax:

.intel_syntax noprefix
.data
  .align 64
    ddqDigitRange: .byte  '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0
    ddqShuffleMask:.byte  15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 
    ddqFactor1:    .word  1,10,100,1000, 1,10,100,1000  
    ddqFactor2:    .long  1,10000,100000000,0
.text    
_start:
   mov   esi, lpInputNumberString
   /* (**A**) indicate negative number in EDX */
   mov   eax, -1
   xor   ecx, ecx
   xor   edx, edx
   mov   bl,  byte ptr [esi]
   cmp   bl,  '-'
   cmove edx, eax
   cmp   bl,  '+'
   cmove ecx, eax
   sub   esi, edx
   sub   esi, ecx
   /* (**B**)remove leading zeros */
   xor   eax,eax               /* return value ZERO */
  remove_leading_zeros:
   inc   esi
   cmp   byte ptr [esi-1], '0'  /* skip leading zeros */
  je remove_leading_zeros
   cmp   byte ptr [esi-1], 0    /* catch empty string/number */
  je FINISH
   dec   esi
   /* check for valid digit-chars and invert from front to back */
   pxor      xmm2, xmm2         
   movdqa    xmm0, xmmword ptr [ddqDigitRange]
   movdqu    xmm1, xmmword ptr [esi]
   pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) imm8=Unsigned bytes, Ranges, Negative Polarity(-), returns index of first non-digit char (= number of digits) in ECX */
  jo FINISH             /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */
   mov al , '0'         /* value to subtract from chars */
   sub ecx, 16          /* len-16=negative to zero for shuffle mask */
   movd      xmm0, ecx
   pshufb    xmm0, xmm2 /* broadcast CL to all 16 BYTEs */
   paddb     xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */
   pshufb    xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */
   movd      xmm0, eax                         /* AL='0' from above */
   pshufb    xmm0, xmm2                        /* broadcast AL to XMM0 */
   psubusb   xmm1, xmm0                        /* (**1**) */
   movdqa    xmm0, xmm1
   punpcklbw xmm0, xmm2                        /* (**2**) */
   punpckhbw xmm1, xmm2
   pmaddwd   xmm0, xmmword ptr [ddqFactor1]    /* (**3**) */
   pmaddwd   xmm1, xmmword ptr [ddqFactor1]
   phaddd    xmm0, xmm1                        /* (**4**) */
   pmulld    xmm0, xmmword ptr [ddqFactor2]    /* (**5**) */
   pshufd    xmm1, xmm0, 0b11101110            /* (**6**) */
   paddd     xmm0, xmm1
   pshufd    xmm1, xmm0, 0b01010101            /* (**7**) */
   paddd     xmm0, xmm1
   movd      eax, xmm0
   /* negate if negative number */              
   add       eax, edx                          /* (**8**) */
   xor       eax, edx
  FINISH:
   /* EAX is return (u)int value */
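
The markers (**C**), (**D**) and (**1**) to (**7**) in the listing are easier to follow with a portable model of the data flow. The sketch below is one reading of the assembly, not a translation the author provided: the name parse_model is hypothetical, each loop stands in for one SIMD instruction acting on all lanes, and the input is assumed to start at the first significant digit (sign and leading zeros already skipped).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

std::uint32_t parse_model(std::string_view s) {
    // (C) Count the leading digit characters, at most 16 (one XMM register).
    std::size_t len = 0;
    while (len < 16 && len < s.size() && s[len] >= '0' && s[len] <= '9') ++len;

    // (D) Reverse the digits so lane i holds the digit of weight 10^i;
    //     lanes past the end of the number stay zero.
    // (1) Subtract '0' to turn each character into a value 0..9.
    std::array<std::uint8_t, 16> d{};
    for (std::size_t i = 0; i < len; ++i)
        d[i] = static_cast<std::uint8_t>(s[len - 1 - i] - '0');

    // (2)+(3) Widen to 16 bits, multiply by {1,10,100,1000} (repeating)
    //         and add adjacent pairs of products.
    constexpr std::uint32_t f1[4] = {1, 10, 100, 1000};
    std::array<std::uint32_t, 8> pairs{};
    for (std::size_t i = 0; i < 8; ++i)
        pairs[i] = d[2 * i] * f1[(2 * i) % 4] + d[2 * i + 1] * f1[(2 * i + 1) % 4];

    // (4) Horizontal add: each lane now holds one 4-digit group (0..9999).
    std::array<std::uint32_t, 4> groups{};
    for (std::size_t i = 0; i < 4; ++i)
        groups[i] = pairs[2 * i] + pairs[2 * i + 1];

    // (5) Scale the groups. The zero factor drops lane 3, whose digits
    //     (weights 10^12 and up) cannot fit in 32 bits anyway.
    constexpr std::uint32_t f2[4] = {1, 10000, 100000000, 0};

    // (6)+(7) Sum the four lanes, wrapping modulo 2^32 like a 32-bit register.
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < 4; ++i)
        sum += groups[i] * f2[i];
    return sum;
}
```

For the input "12345" the model builds the lanes 5, 4, 3, 2, 1, forms the groups 2345 and 1, and returns 2345 + 1 * 10000 = 12345. Reversing the digits first is what lets a fixed table of factors work for any input length.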

Conclusion

This SIMD implementation of atoi can improve performance when parsing large amounts of numeric text. Instead of looping once per digit, it validates, reorders, and weights up to 16 characters with a short, branch-light sequence of vector instructions. The size of the gain depends on the target CPU and on how long the typical input numbers are.