How to efficiently implement atoi using SIMD instructions?-C++-php.cn

How to efficiently implement atoi using SIMD instructions?

Susan Sarandon

Release： 2024-11-26 20:16:14

Original

582 people have browsed it

How to efficiently implement atoi using SIMD instructions?

How to implement atoi using SIMD?

Problem:
I would like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON. The algorithm I came up with is as follows:

Initialize a vector of length N in the following fashion:
[10^N..10^1]
Convert each character in the buffer to an integer and place them in another vector.
Take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results.

My algorithm is correct? Is there a better way? Is there a reference implementation for atoi using any SIMD instructions set?

Answer:
The algorithm is correct and complete. It works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295.

A reference implementation is provided, written in GNU Assembler with intel syntax.

The properties of this code are as follows:

A leading '-' char indicates a negative number (as sensible), a leading ' ' char is ignored
Leading zeros (with or without sign char) are ignored
Overflow is ignored - bigger numbers just wraparound
Zero length strings result in value 0 = -0
Invalid characters are recognized and the conversion ends at the first invalid char
At least 16 bytes after the last leading zero must be accessible and possible security implications of reading after EOS are left to the caller
Only SSE4.2 is needed

The approach of the algorithm is as follows:

Pointer to number string is expected in ESI
Check if first char is '-', then indicate if negative number in EDX (A)
Check for leading zeros and EOS (B)
Check string for valid digits and get strlen() of valid chars (C)
Reverse string so that power of
10^0 is always at BYTE 15
10^1 is always at BYTE 14
10^2 is always at BYTE 13
10^3 is always at BYTE 12
10^4 is always at BYTE 11
...
and mask out all following chars (D)
Subtract saturated '0' from each of the 16 possible chars (1)
Take 16 consecutive byte-values and and split them to WORDs
in two XMM-registers (2)
P O N M L K J I | H G F E D C B A ->
H G F E | D C B A (XMM0)
P O N M | L K J I (XMM1)
Multiply each WORD by its place-value modulo 10000 (1,10,100,1000)
(factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM)
(2) so we can horizontally combine twice before another multiply.
The PMADDWD instruction can do this and the next step:
Horizontally add adjacent WORDs to DWORDs (3)
H1000 G100 F10 E1 | D1000 C100 B10 A1 (XMM0)
P1000 O100 N10 M1 | L1000 K100 J10 I1 (XMM1)
Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (4)
xmmDst[31-0] = xmm0[63-32] xmm0[31-0]
xmmDst[63-32] = xmm0[127-96] xmm0[95-64]
xmmDst[95-64] = xmm1[63-32] xmm1[31-0]
xmmDst[127-96] = xmm1[127-96] xmm1[95-64]
Values in XMM0 are multiplied with the factors (5)
P1000 O100 N10 M1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings)
L1000 K100 J10 I1 (DWORD factor 100000000)
H1000 G100 F10 E1 (DWORD factor 10000)
D1000 C100 B10 A1 (DWORD factor 1)
The last step is adding these four DWORDs together with 2PHADDD emulated by 2(PSHUFD PADDD)
- xmm0[31-0] = xmm0[63-32] xmm0[31-0] (6)
  xmm0[63-32] = xmm0[127-96] xmm0[95-64]
  (the upper QWORD contains the same and is ignored)
- xmm0[31-0] = xmm0[63-32] xmm0[31-0] (7)
If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(8)

The result of Intel-IACA Throughput Analysis for Haswell 32-bit:

Throughput Analysis Report

Block Throughput: 16.10 Cycles Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |

| Cycles | 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

- instruction micro-ops not bound to a port
  ^ - Micro Fusion happened
  
  - ESP Tracking sync uop was issued
  
  @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
  ! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |

| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |

| 0* | | | | | | | | | | xor eax, eax
| 0* | | | | | | | | | | xor ecx, ecx
| 0* | | | | | | | | | | xor edx, edx
| 1 | | 0.1 | | | | | 0.9 |

The above is the detailed content of How to efficiently implement atoi using SIMD instructions?. For more information, please follow other related articles on the PHP Chinese website!