Detailed introduction to the pack and unpack functions in PHP (with code)

PHP has two important but little-known functions: pack and unpack. In scenarios such as network programming and reading or writing image files, they are almost indispensable. Given how important file I/O, network programming, and byte-stream processing in general are, mastering these two functions is a foundation for advanced PHP programming.

This article first explains the difference between bytes and characters and why these two functions need to exist. It then covers their basic usage and typical use cases, to give readers a general understanding and a foundation for practical use.

Bytes and characters

PHP's strength is that it is simple and easy to use: fluency with the string and array functions covers most everyday needs. Strings come up constantly in daily work, so PHP developers are familiar with characters, and those with some experience have a reasonable grasp of character encodings. Many PHP developers, however, are unfamiliar with the concept that goes hand in hand with characters: bytes.

This is not their fault. The concept of a byte (stream) rarely appears in the PHP world: there is no byte keyword (and of course no char), and the official documentation seldom discusses bytes; there is no native array support (the commonly used array is actually a hash table). PHP strings, however, can play the role that byte arrays (byte[]) play in other languages.

How are bytes and characters related, and how do they differ? Simply put, a byte is the smallest unit a computer stores and operates on, while a character is the smallest unit a person reads; bytes are a storage (physical) concept and characters are a logical one; bytes are the data itself and characters are its meaning; characters are made up of bytes.

A few examples illustrate the difference: "中国" (China) contains 2 characters, which take 4 bytes in GBK encoding and 6 bytes in UTF-8; the number "1234567890" contains 10 characters but needs only 4 bytes when stored as an int32; the picture below occupies 42582 bytes, yet described in words as "my wife" it takes only 3 characters in the original Chinese:

[Image: a picture occupying 42582 bytes]
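
The first two comparisons can be checked with a short sketch. It assumes the source file is saved as UTF-8; the variable names are illustrative only. strlen counts bytes, while a regular expression with the /u modifier matches whole UTF-8 characters.

```php
<?php
// Assumption: this file is saved as UTF-8, so "中国" is stored as 6 bytes.
$text = "中国";

// strlen() counts bytes, not characters.
$byteCount = strlen($text);

// With the /u modifier, "." matches one whole UTF-8 character.
$charCount = preg_match_all('/./su', $text);

// A decimal string needs one byte per digit; a 32-bit integer needs 4 bytes.
$digits = "1234567890";
$asText = strlen($digits);
$asInt32 = strlen(pack("l", (int) $digits));

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("%d bytes as text, %d bytes as int32\n", $asText, $asInt32);
```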

Here is another common example of the difference between characters and bytes. In development we often use the md5 algorithm to obtain a hash of some data. The algorithm returns 128 bits of data (16 bytes). To make the value easy to read, people conventionally write it in hexadecimal, which gives the familiar 32-character string (case-insensitive). The 32-character string is not an inherent result of the md5 algorithm; the 16 bytes of data are its essence. If you wish, you can represent the hash as a number less than 2^128, or base64-encode the 16 bytes. So the relationship between the familiar 32-character hash value and the 16 bytes returned by md5 is that one is a character representation and the other is the underlying data (a byte array). In PHP, passing true as the second argument of md5, or as the third argument of hash, returns the 16 bytes of raw data.
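
A minimal sketch of the md5 point, using only built-in functions; the input string "hello" is arbitrary. It shows that the hexadecimal string and the base64 string are two character representations of the same 16 bytes.

```php
<?php
$data = "hello";

$hex = md5($data);                      // hexadecimal text representation
$raw = md5($data, true);                // raw bytes
$rawViaHash = hash("md5", $data, true); // same raw bytes via hash()
$b64 = base64_encode($raw);             // another text representation

printf("hex: %d chars, raw: %d bytes, base64: %d chars\n",
    strlen($hex), strlen($raw), strlen($b64));

// The hex string is just the raw bytes written out in hexadecimal.
var_dump(bin2hex($raw) === $hex);
var_dump($raw === $rawViaHash);
```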

Related concepts include byte order, character encoding, etc., which will not be expanded upon in this article.

Introduction

PHP has dozens of functions dedicated to string processing; counting regular-expression, date/time and other functions, well over a hundred functions handle strings. Byte processing, by contrast, gets little attention and only a handful of functions. Besides the commonly used ord/chr, the hash functions that can return raw bytes, and functions such as openssl_random_pseudo_bytes in the openssl extension, which actually process or return bytes, the two most important byte-processing functions are pack and unpack.

This section introduces the pack function by way of a question.

Question

Consider a simple question: how is 42, the ultimate answer to the universe, represented in memory (in other words, how do we obtain its byte array)?

Because 42 is an integer, it may occupy 1, 2, 4 or 8 bytes depending on the hardware. Here we fix the integer at 4 bytes, so an equivalent statement of the problem is: how do we convert an integer into a byte array (machine byte order, 4 bytes)?

Analysis

Because the value spans multiple bytes, byte order has to be considered. 42 is no greater than 255, so it fits in one byte and the other three bytes are all 0. It follows that on a big-endian machine (the low-order byte is stored at the highest address) the four bytes are 0 0 0 42, and on a little-endian machine they are 42 0 0 0.

How do we find out the machine's byte order? PHP provides no dedicated function for this, nor can it read bytes directly through addresses the way C can. So how can PHP determine the byte order, or convert data to bytes at all?

Solution

At the PHP application level, converting data to bytes (a byte array) is the specialty of pack, and converting bytes back to data is the specialty of unpack. Apart from these two functions, there is almost no way to convert between data and byte arrays (binary data); corrections are welcome if you know of one.

Now let's use the pack function to obtain the byte array of 42 as it is laid out in memory. The code is as follows:

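// Pack an integer as a signed 32-bit value in machine byte order
function intToBytes(int $num) : string {
    return pack("l", $num);
}

// Print each byte of a binary string as a decimal number
function outputBytes(string $bytes) {
    echo "bytes: ";
    for ($i = 0; $i < strlen($bytes); ++$i) {
        echo ord($bytes[$i]), " ";
    }
    echo PHP_EOL;
}

outputBytes(intToBytes(42));

// Program output:
bytes: 42 0 0 0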

My computer has an Intel CPU, and the x86 architecture is little-endian, so the output is as expected.

Taking this a step further, how do we determine the machine's byte order? With the pack function the answer is very simple:

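function bigEndian() : bool {
    // Pack 0x1200 as a 16-bit value in machine byte order
    $data = 0x1200;
    $bytes = pack("s", $data);

    // On a big-endian machine the high-order byte 0x12 comes first
    return ord($bytes[0]) === 0x12;
}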

Calling the function returns whether the machine is big-endian.
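
When a specific byte order is required regardless of the machine, pack also has format codes with a fixed order: "N" (32-bit, big-endian) and "V" (32-bit, little-endian). The sketch below compares them with the machine-order code "L"; the helper name byteValues is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper: list the byte values of a binary string.
function byteValues(string $bytes): array {
    return array_values(unpack("C*", $bytes));
}

$big = byteValues(pack("N", 42));     // unsigned 32-bit, always big-endian
$little = byteValues(pack("V", 42));  // unsigned 32-bit, always little-endian
$machine = byteValues(pack("L", 42)); // unsigned 32-bit, machine byte order

echo implode(" ", $big), PHP_EOL;    // 0 0 0 42
echo implode(" ", $little), PHP_EOL; // 42 0 0 0

// The machine-order result matches one of the two fixed orders.
echo $machine === $big ? "big-endian" : "little-endian", " machine", PHP_EOL;
```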

The above is a simple use of the pack function. Next we introduce pack and unpack in turn.

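pack and unpack

The pack function

pack means "to package". As the name suggests, the job of pack is to pack data into a byte array according to a format. Its prototype is: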

pack ( string $format [, mixed $... ] ) : string

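In form it resembles the printf family of functions: the first argument is a format string and the remaining arguments are the values to format. The difference is that pack's format string may contain nothing other than format codes and repeaters, so no % sign is needed.

The examples above used the two format codes "l" and "s". pack's format codes fall into three main groups:

Strings: a, A, etc.; convert data to a string, much like sprintf, e.g. the integer 32 becomes the string "32";
Hex strings: h and H; convert a hexadecimal string to bytes, differing in whether the low or the high nibble comes first; functionally similar to hex2bin;
The six basic types char/short/int/long/float/double: c/s/i/l, etc.; convert data to the byte array of the corresponding type; apart from char, no other function can (yet) do this;

Note: the difference between char (c/C) and a/A is that a/A take a character (string) as input, whereas c/C expect an integer less than 256; passing a non-numeric character yields 0.

Repeaters are simple: either a number or "*". For example, "i2" converts two arguments as integers, and "c*" converts all remaining arguments as char.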
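
The three groups of format codes and the two kinds of repeater can be tried out in a short sketch. The values are arbitrary; "n" (16-bit, big-endian) is used for the numeric example so that the result does not depend on the machine's byte order.

```php
<?php
// String codes: "a" pads with NUL bytes, "A" pads with spaces.
$nulPadded = pack("a4", "ab");   // "ab" followed by two NUL bytes
$spacePadded = pack("A4", "ab"); // "ab" followed by two spaces
$number = pack("a*", 32);        // the integer is converted to the string "32"

// Hex codes: "H" puts the first digit in the high nibble, "h" in the low nibble.
$highFirst = pack("H*", "2a");   // one byte, 0x2a
$lowFirst = pack("h*", "2a");    // one byte, 0xa2

// Numeric codes with repeaters: "*" consumes all remaining arguments,
// a number consumes exactly that many.
$chars = pack("c*", 72, 105);    // the bytes of "Hi"
$shorts = pack("n2", 1, 2);      // two 16-bit big-endian values

echo bin2hex($nulPadded), PHP_EOL; // 61620000
echo bin2hex($highFirst), " ", bin2hex($lowFirst), PHP_EOL; // 2a a2
echo $chars, " ", bin2hex($shorts), PHP_EOL; // Hi 00010002
```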

unpack

unpack is the inverse of pack: it parses a byte array into meaningful data. Its prototype is:

unpack ( string $format , string $data [, int $offset = 0 ] ) : array

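With unpack, the things to watch are the first argument and the return value. The return value is easy to understand: pack turns the array of arguments after the format string (think of the argument array passed to call_user_func_array) into a byte array; unpack does the opposite and extracts the data, giving back the argument array that went in.

It returns an array, so what are its keys? This is where the format string ($format) differs between pack and unpack: unpack should name the data it extracts, with "/" separating the groups. Because this lets the format string contain characters other than format codes and repeaters, the "/" separator between groups is essential for telling them apart.

An example: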

$bytes = pack("iaa*", 42, ":", "The answer to life, the universe and everything");

outputBytes($bytes);


$result = unpack("inumber/acolon/a*word", $bytes);
print_r($result);

// Program output:
bytes: 42 0 0 0 58 84 104 101 32 97 110 115 119 101 114 32 116 111 32 108 105 102 101 44 32 116 104 101 32 117 110 105 118 101 114 115 101 32 97 110 100 32 101 118 101 114 121 116 104 105 110 103
Array
(
    [number] => 42
    [colon] => :
    [word] => The answer to life, the universe and everything
)

What happens if the extracted data is not named? For example, if the unpack format in the example above is "i/a/a*", what is the result? It is:

Array
(
    [1] => The answer to life, the universe and everything
)

Why? The official documentation says:

Caution If you do not name an element, numeric indices starting from 1 are used. Be aware that if you have more than one unnamed element, some data is overwritten because the numbering restarts from 1 for each element.

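In other words: if you do not name the data, the default indices 1, 2, 3... are used as keys. When there are several groups of data, every group starts again from the same index, so earlier data is overwritten.

So now you can see why "i/a/a*" leaves only the last group of data.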
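
The key rules can be seen side by side in a small sketch. The names number, word and byte are arbitrary labels chosen for this illustration.

```php
<?php
$bytes = pack("Ca*", 42, "hi");

// Unnamed groups: both use key 1, so the second overwrites the first.
$unnamed = unpack("C/a*", $bytes);

// Named groups: every value is kept under its own key.
$named = unpack("Cnumber/a*word", $bytes);

// A repeater on a numeric code appends 1, 2, ... to the name.
$indexed = unpack("C2byte", "\x01\x02");

print_r($unnamed); // only key 1, holding "hi"
print_r($named);   // keys "number" and "word"
print_r($indexed); // keys "byte1" and "byte2"
```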

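Application scenarios

When reading images or Word/Excel files, or parsing binlogs, binary IP database files and the like, pack and unpack are almost indispensable. This article uses protocol parsing in network programming as an example.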
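
As a brief aside on the file-reading case, here is a minimal sketch of parsing the start of a PNG file with unpack. To stay self-contained it builds the first 24 bytes by hand instead of reading a real image; the dimensions 640 x 480 are made up. A PNG file begins with an 8-byte signature, followed by the IHDR chunk: a 4-byte big-endian length, the 4-byte type "IHDR", then the width and height as 4-byte big-endian integers.

```php
<?php
// Hand-built header for illustration only; a real program would read
// these bytes from an image file, e.g. with fread().
$header = "\x89PNG\r\n\x1a\n" . pack("Na4NN", 13, "IHDR", 640, 480);

// Name every field so that no value overwrites another.
$info = unpack("a8signature/Nlength/a4type/Nwidth/Nheight", $header);

printf("%s chunk: %d x %d pixels\n", $info["type"], $info["width"], $info["height"]);
```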

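Suppose our TCP packet format is: the first four bytes hold the size of the data that follows, and the remaining bytes are the data itself. The client's (sender's) send function could then look like this: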

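public function send($data) {
  // Assume $data has already been serialized, encrypted, etc. and is a byte array
  // Compute the message length
  $len = strlen($data);
  // Pack the length as an unsigned 32-bit integer; the "N" format writes it
  // directly in network (big-endian) order, so no separate byte swap is needed
  $header = pack("N", $len);
  // Assemble the packet
  $binary = $header . $data;
  // Call fwrite/socket_send etc. to write the data to the kernel buffer
  ...
}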

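The framing step on its own can be sketched as a standalone function. The name frame is hypothetical, and the sketch assumes the length header is sent in network (big-endian) order using the "N" format.

```php
<?php
// Hypothetical helper: prefix a payload with its length as a
// 4-byte big-endian (network order) header.
function frame(string $payload): string {
    return pack("N", strlen($payload)) . $payload;
}

$packet = frame("hello");

// Four header bytes (00 00 00 05) followed by the bytes of "hello".
echo bin2hex($packet), PHP_EOL; // 0000000568656c6c6f
```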

The server (receiver) parses the incoming data stream according to the protocol:

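public function decodable($session, $buffer) {
  $dataLen = strlen($buffer);
  // Invalid packet: shorter than the 4-byte header
  if ($dataLen < 4) {
    // Close the connection, log the IP, etc.
    ....
    return NOT_OK;
  }
  // Take the first four bytes
  $header = substr($buffer, 0, 4);
  // Parse the data length. "N" reads the header as an unsigned 32-bit integer in
  // network (big-endian) order, which also covers the conversion to host order.
  // unpack returns an array; an unnamed element is stored under key 1
  $len = unpack("N", $header)[1];
  // A single message may not exceed 8 MB, e.g. to limit the size of uploaded images
  if ($len > 8 * 1024 * 1024) {
    // Close the connection, etc.
    return NOT_OK;
  }

  // Check whether the packet satisfies the protocol
  if ($dataLen - 4 >= $len) {
    return OK;
  }
  // Not all of the data has arrived yet; keep waiting
  return NEED_DATA;
}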

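A receive buffer often holds several messages, or only part of one. The following sketch shows one way to split such a buffer, assuming each message is a 4-byte big-endian length followed by that many bytes of data. The function name extractFrames is hypothetical, and the size limit and error handling from the method above are left out for brevity.

```php
<?php
// Hypothetical helper: split a receive buffer into complete payloads
// plus the leftover bytes that still wait for more data.
function extractFrames(string $buffer): array {
    $payloads = [];
    $offset = 0;
    $size = strlen($buffer);
    while ($size - $offset >= 4) {
        // Read the 4-byte big-endian length header
        $len = unpack("N", substr($buffer, $offset, 4))[1];
        if ($size - $offset - 4 < $len) {
            break; // payload not fully received yet
        }
        $payloads[] = substr($buffer, $offset + 4, $len);
        $offset += 4 + $len;
    }
    return [$payloads, substr($buffer, $offset)];
}

// Two complete messages followed by a partial third one
$buffer = pack("N", 2) . "hi" . pack("N", 3) . "php" . pack("N", 9) . "part";
[$payloads, $rest] = extractFrames($buffer);

print_r($payloads);            // the payloads "hi" and "php"
echo strlen($rest), PHP_EOL;   // 8 bytes: the header and 4 bytes of the third message
```

The leftover bytes would be kept and prepended to the next chunk read from the socket.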

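With pack and unpack, we can handle the message protocol and send and parse binary byte streams without trouble.

If you use \n as the message delimiter, you may not need pack and unpack. But passing text directly in network communication is the minority case (it amounts to plain-text transmission); most of the time, parsing binary data streams still relies on pack and unpack.

Summary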

Apart from memory allocation, the most important system calls deal with file I/O and network connections, and both fundamentally operate on byte streams. pack and unpack give PHP the ability to work with bytes at a low level, which is very useful when processing binary data. PHP developers who want to move beyond web programming should master these two functions.
