What is the implementation principle of integers in the Python virtual machine?-Python Tutorial-php.cn

Data structure

The implementation data structure of the int type inside cpython is as follows:

typedef struct _longobject PyLongObject;
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
#define PyObject_VAR_HEAD      PyVarObject ob_base;
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Copy after login

The above data structure is represented graphically as shown below:

What is the implementation principle of integers in the Python virtual machine?

ob_refcnt represents the number of reference counts of the object. This is very useful for garbage collection. Later we will analyze the garbage collection part of the virtual machine in depth. .
ob_type, indicates the data type of this object. In python, sometimes it is necessary to judge the data type of the data. For example, the two keywords of isinstance and type will be used. field.
ob_size, this field indicates how many elements there are in this integer object array ob_digit.
The digit type is actually a macro definition of the uint32_t type, representing 32-bit integer data.

In-depth analysis of the semantics of PyLongObject fields

First of all, we know that integers in python will not overflow. This is why PyLongObject uses arrays. In the internal implementation of cpython, integers include 0, positive numbers, and negative numbers. Regarding this, there are the following regulations in cpython:

##ob_size, which stores the length of the array, ob_size When it is greater than 0, a positive number is stored; when ob_size is less than 0, a negative number is stored.
ob_digit, stores the absolute value of the integer. As we mentioned earlier, ob_digit is a 32-bit data, but only the first 30 bits are used internally in cpython, just to avoid overflow problems.

Let’s use a few examples to understand the above rules in depth:

What is the implementation principle of integers in the Python virtual machine?

In the above figure, ob_size is greater than 0, indicating that This number is a positive number, and ob_digit points to an int32 data. The value of the number is equal to 10, so the above number represents the integer 10.

What is the implementation principle of integers in the Python virtual machine?

Similarly, ob_size is less than 0, and ob_digit is equal to 10, so the data in the above figure represents -10.

What is the implementation principle of integers in the Python virtual machine?

The above is an example where the length of the ob_digit array is 2. The data represented above is as follows:

1⋅2

0^1⋅21^1⋅22^{... 1⋅2}29^0⋅230^0⋅231 ^1⋅232

Because we only use the first 30 bits for each array element, so the second integer data corresponds to 230, everyone You can understand the entire calculation process based on the above results.

What is the implementation principle of integers in the Python virtual machine?

The above is very simple:

−(1⋅2

0^1⋅21^1⋅22^{... 1⋅2}29^0⋅230^0⋅231^1⋅2 32⁾

Small integer pool

In order to avoid frequently creating some commonly used integers and speed up program execution, we can cache some commonly used integers first, if necessary Just return this data directly. The relevant code in cpython is as follows: (The interval of cached data in the small integer pool is [-5, 256])

#define NSMALLPOSINTS           257
#define NSMALLNEGINTS           5
 
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

Copy after login

What is the implementation principle of integers in the Python virtual machine?

We use the following code to test , to see if the data in the small integer pool is used. If so, the return value of their id() is the same for the data in the small integer pool. The id built-in function returns the memory address of the Python object.

>>> a = 1
>>> b = 2
>>> c = 1
>>> id(a), id(c)
(4343136496, 4343136496)
>>> a = -6
>>> c = -6
>>> id(a), id(c)
(4346020624, 4346021072)
>>> a = 257
>>> b = 257
>>> id(a), id(c)
(4346021104, 4346021072)
>>>

Copy after login

What we can see from the above results is that for the values in the interval [-5, 256], the return value of id is indeed the same, and the return value not within this interval is different. of.

We can also implement a small trick with this feature, which is to find the memory space occupied by a PyLongObject object, because we can use the first memory address of the two data -5 and 256, and then add this address By subtracting, you can get the size of the memory space occupied by 261 PyLongObjects (note that although there are 262 data in the small integer pool, the last data is the first address of the memory, not the last address, so there are only 261 data), so we You can find the memory size of a PyLongObject object.

>>> a = -5
>>> b = 256
>>> (id(b) - id(a)) / 261
32.0
>>>

Copy after login

From the above output, we can see that a PyLongObject object occupies 32 bytes. We can use the following C program to view the actual memory space occupied by a PyLongObject.

#include "Python.h"
#include <stdio.h>
 
int main()
{
  printf("%ld\n", sizeof(PyLongObject));
  return 0;
}

Copy after login

The output of the above program is as follows:

What is the implementation principle of integers in the Python virtual machine?

The above two results are equal, so our idea is also verified.

The core code for obtaining data from the small integer pool is as follows:

static PyObject *
get_small_int(sdigit ival)
{
    PyObject *v;
    assert(-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS);
    v = (PyObject *)&small_ints[ival + NSMALLNEGINTS];
    Py_INCREF(v);
    return v;
}

Copy after login

整数的加法实现

如果你了解过大整数加法就能够知道，大整数加法的具体实现过程了，在 cpython 内部的实现方式其实也是一样的，就是不断的进行加法操作然后进行进位操作。

#define Py_ABS(x) ((x) < 0 ? -(x) : (x)) // 返回 x 的绝对值
#define PyLong_BASE	((digit)1 << PyLong_SHIFT)
#define PyLong_MASK	((digit)(PyLong_BASE - 1))
 
 
static PyLongObject *
x_add(PyLongObject *a, PyLongObject *b)
{
    // 首先获得两个整型数据的 size 
    Py_ssize_t size_a = Py_ABS(Py_SIZE(a)), size_b = Py_ABS(Py_SIZE(b));
    PyLongObject *z;
    Py_ssize_t i;
    digit carry = 0;
    // 确保 a 保存的数据 size 是更大的
    /* Ensure a is the larger of the two: */
    if (size_a < size_b) {
        { PyLongObject *temp = a; a = b; b = temp; }
        { Py_ssize_t size_temp = size_a;
            size_a = size_b;
            size_b = size_temp; }
    }
    // 创建一个新的 PyLongObject 对象，而且数组的长度是 size_a + 1
    z = _PyLong_New(size_a+1);
    if (z == NULL)
        return NULL;
    // 下面就是整个加法操作的核心
    for (i = 0; i < size_b; ++i) {
        carry += a->ob_digit[i] + b->ob_digit[i];
        // 将低 30 位的数据保存下来
        z->ob_digit[i] = carry & PyLong_MASK;
        // 将 carry 右移 30 位，如果上面的加法有进位的话 刚好可以在下一次加法当中使用（注意上面的 carry）
        // 使用的是 += 而不是 =
        carry >>= PyLong_SHIFT; // PyLong_SHIFT = 30
    }
    // 将剩下的长度保存 （因为 a 的 size 是比 b 大的）
    for (; i < size_a; ++i) {
        carry += a->ob_digit[i];
        z->ob_digit[i] = carry & PyLong_MASK;
        carry >>= PyLong_SHIFT;
    }
    // 最后保存高位的进位
    z->ob_digit[i] = carry;
    return long_normalize(z); // long_normalize 这个函数的主要功能是保证 ob_size 保存的是真正的数据的长度 因为可以是一个正数加上一个负数 size 还变小了
}
 
PyLongObject *
_PyLong_New(Py_ssize_t size)
{
    PyLongObject *result;
    /* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
       sizeof(digit)*size.  Previous incarnations of this code used
       sizeof(PyVarObject) instead of the offsetof, but this risks being
       incorrect in the presence of padding between the PyVarObject header
       and the digits. */
    if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
        PyErr_SetString(PyExc_OverflowError,
                        "too many digits in integer");
        return NULL;
    }
    // offsetof 会调用 gcc 的一个内嵌函数 __builtin_offsetof 
    // offsetof(PyLongObject, ob_digit)  这个功能是得到 PyLongObject 对象 字段 ob_digit 之前的所有字段所占的内存空间的大小
    result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
                             size*sizeof(digit));
    if (!result) {
        PyErr_NoMemory();
        return NULL;
    }
    // 将对象的 result 的引用计数设置成 1
    return (PyLongObject*)PyObject_INIT_VAR(result, &PyLong_Type, size);
}
 
 
static PyLongObject *
long_normalize(PyLongObject *v)
{
    Py_ssize_t j = Py_ABS(Py_SIZE(v));
    Py_ssize_t i = j;
 
    while (i > 0 && v->ob_digit[i-1] == 0)
        --i;
    if (i != j)
        Py_SIZE(v) = (Py_SIZE(v) < 0) ? -(i) : i;
    return v;
}

Copy after login

The above is the detailed content of What is the implementation principle of integers in the Python virtual machine?. For more information, please follow other related articles on the PHP Chinese website!