Introduction
There is a library Construct in python, which can be used to parse binary data. It is very convenient to use this tool to analyze network packets, formatted data files, etc.
If I had used this tool when analyzing the sqlite database file format a while ago, it would have saved a lot of trouble.
However, Construct2.9 has changed a lot from previous versions. The original python code written in Construct basically needs to be rewritten. In the process of viewing the source code of Construct, I found that the basic implementation idea of Construct is the recursive descent analysis method. Use the object's construction method to dynamically define the data structure, and implement the parsing of binary data in the parse method.
I plan to use php to parse binary data, but the implementation idea is completely different from python's Construct.
Recommended PHP video tutorial: https://www.php.cn/course/list/29/type/2.html
Basic ideas
Since python's Construct uses python's object syntax to implement the definition and parsing of dynamic hierarchical structures, some structure definitions look obscure due to the limitation of python's object syntax.
My idea is to define a small language specifically used to describe dynamic hierarchical structures, so that it can take care of people's common expression habits as much as possible.
Take this small project as an example, based on the structure definition syntax in C language, and on this basis, conditionalization and loop structure definition are added.
There is a common structure in binary data. The first few bytes store the length of the subsequent data block, followed by the data block. It is not convenient to express this kind of structure with the structure definition of C language. The length of the entire data block varies and cannot be determined at compile time, but can only be determined during parsing. Therefore, it is necessary to extend the structure definition syntax of C language.
The first step is to implement the analysis of the C language structure
In this step, we do not consider the dynamic hierarchical structure definition, but implement this small The core part of the language, at least it must be able to understand the structure definition of C language, and be able to parse binary data based on this structure definition. After this step is completed, the definition and analysis of the dynamic structure will be implemented.
This project is based on the ADOS scripting language engine briefly introduced in the previous file.
Let’s take a look at our task first
Here is a structure definition file that is completely C language specification
blockStruct.h
struct student { char name[2]; int num; int age; char addr[3]; }; struct teacher { char name[2]; int num; char addr[3]; };
Binary data block to be parsed
"\x41\x42\x01\x00\x00\x00\x02\x00\x00\x00\x41\x42\x43\x41\x42\x01\x00\x00\x00\x41\x42\x43"
Hope to get the parsing result
[value] => Array ( [name] => AB [num] => 1 [age] => 2 [addr] => ABC ) [value] => Array ( [name] => AB [num] => 1 [addr] => ABC )
Interested friends can pay attention to the relationship between the three
The following It is a lexical rule file that only implements the compilation of C language structures
<?php /*! * struckwkr的词法规则 * 45022300@qq.com * Version 0.9.0 * * Copyright 2019, Zhu Hui * Released under the MIT license */ $GLOBALS['structwkr_lexRules'] = [ ['/^\"(.*?)\"/','_cons' ,'"'], ['/^\(/','_lp' ,'('], ['/^\)/','_rp' ,')'], ['/^\[/','_lb' ,'['], ['/^\]/','_rb' ,']'], ['/^\{/','_lcb' ,'{'], ['/^\}/','_rcb' ,'}'], ['/^;/','_semi' ,';'], ['/^,/','_comma' ,','], ['/^==/','_bieq' ,'='], ['/^!=/','_uneq' ,'!'], ['/^\>=/','_greq' ,'>'], ['/^\<=/','_leeq' ,'<'], ['/^=/','_equa' ,'='], ['/^\>/','_grea' ,'>'], ['/^\</','_less' ,'<'], ['/^\+/','_add' ,'+'], ['/^-/','_sub' ,'-'], ['/^\*/','_mul' ,'*'], ['/^\//','_div' ,'/'], ['/^%/','_mod' ,'$'], ['/^&&/','_and' ,'&'], ['/^\|\|/','_or' ,'|'], ['/^!/','_not' ,'!'], ['/^struct/','_strukey' ,'s'], ['/^char/','_char' ,'c'], ['/^int/','_int' ,'i'], ['/^float/','_float' ,'f'], ['/^double/','_double' ,'d'], ['/^[0-9]+([.]{1}[0-9]+){0,1}/','_num',''], ['/^\./','_dot' ,'.'], ['/^[\x{4e00}-\x{9fa5}A-Za-z_][\x{4e00}-\x{9fa5}A-Za-z0-9_]*\b/u','_iden',''], ['/^\s*/','_null',''] ];
The following is a grammatical rule file that only implements the compilation of C language structures
<?php /*! * structwkr的语法规则处理器 * 45022300@qq.com * Version 0.9.0 * * Copyright 2019, Zhu Hui * Released under the MIT license */ namespace Ados; require_once 'const.php'; require_once __SCRIPTCORE__.'syntax_rule/base_rules_handler.php'; class StructwkrRulesHandler extends BaseRulesHandler{ //语法分析的起始记号,归约到最后得到就是若干个结构体定义的列表 function startToken(){ return '_structList'; } //附加类型 function extraType($extraArray){ if(count($extraArray)>0){ return $extraArray[0]; }else{ return ''; } } //求出放在附加信息中的数组长度 function elementSize($extraArray){ if(count($extraArray)>0){ return intval($extraArray[0]); }else{ return 0; } } //处理源代码中的多条声明语句 function handleStatementList($stack,$coder,$listIndex,$statementIndex){ $t1= $this->topItem($stack,$statementIndex); if($listIndex>0){ $t2= $this->topItem($stack,$listIndex); //将元素名所在项的附加信息数组找出来 $extraArray=$t2[TokenExtraIndex]; }else{ //$listIndex=0表示,statementList是上级节点,附加信息数组应该是空数组 $extraArray=[]; } //将statement中的附加信息添加到statementList的附加信息中去 $extraArray[]=$t1[TokenExtraIndex]; return ['#',$extraArray]; } //处理源代码中的一条声明语句 function handleStatement($stack,$coder,$typeItemIndex,$idenItemIndex){ $t1= $this->topItem($stack,$typeItemIndex); $elementType = $t1[TokenValueIndex]; $t2= $this->topItem($stack,$idenItemIndex); $elementName = $t2[TokenValueIndex]; //将元素名所在项的附加信息数组找出来 $extraArray=$t2[TokenExtraIndex]; $valLen =$extraArray[0]; //附加信息中包含 元素名称,元素类型,数据长度 return [$elementName,[$elementName,$elementType,$valLen]]; } //语法规则处理函数名由规则右边部分与规则左边部分拼接而成 //语法规则定义的先后决定了归约时匹配的顺序,要根据实际的语法安排 //如果不熟悉语法,随意调整语法规则的先后次序将有可能导致语法错误 // struct list {{{ function _structList_0_structList_struct($stack,$coder){ $coder->pushBlockTail(); return ['#',[]]; } function _structList_0_struct($stack,$coder){ $coder->pushBlockTail(); return ['#',[]]; } // struct list }}} // struct {{{ function _struct_0_structName_blockStatement_semi($stack,$coder){ $t1= $this->topItem($stack,3); $structName = $t1[TokenValueIndex]; $t2= $this->topItem($stack,2); $extraArray=$t2[TokenExtraIndex]; return [$structName,$extraArray]; } // struct }}} // struct name {{{ function _structName_0_strukey_iden($stack,$coder){ $t1= $this->topItem($stack,1); $structName = $t1[TokenValueIndex]; $coder->pushBlockHeader($structName); return $this->pass($stack,1); } // struct name }}} // blockStatement {{{ function _blockStatement_0_lcb_statementList_rcb($stack,$coder){ return $this->pass($stack,2); } // blockStatement }}} // statement list {{{ function _statementList_0_statementList_statement($stack,$coder){ return $this->handleStatementList($stack,$coder,2,1); } function _statementList_0_statement($stack,$coder){ //此处0表示statementList是上一级节点,要做特殊处理 return $this->handleStatementList($stack,$coder,0,1); } // statement list }}} // statement {{{ function _statement_0_double_term_semi($stack,$coder){ $t1= $this->topItem($stack,2); $elementName = $t1[TokenValueIndex]; $parseFuncName = 'parseDouble'; $coder->pushBlockBody($parseFuncName,$elementName); return $this->handleStatement($stack,$coder,3,2); } function _statement_0_float_term_semi($stack,$coder){ $t1= $this->topItem($stack,2); $elementName = $t1[TokenValueIndex]; $parseFuncName = 'parseFloat'; $coder->pushBlockBody($parseFuncName,$elementName); return $this->handleStatement($stack,$coder,3,2); } function _statement_0_char_term_semi($stack,$coder){ $t1= $this->topItem($stack,2); $elementName = $t1[TokenValueIndex]; $size = $this->elementSize($t1[TokenExtraIndex]); $parseFuncName = 'parseFixStr'; $coder->pushBlockBody($parseFuncName,$elementName,$size); return $this->handleStatement($stack,$coder,3,2); } function _statement_0_int_term_semi($stack,$coder){ $t1= $this->topItem($stack,2); $elementName = $t1[TokenValueIndex]; $parseFuncName = 'parseInt'; $coder->pushBlockBody($parseFuncName,$elementName,4); return $this->handleStatement($stack,$coder,3,2); } // statement }}} // term {{{ function _term_0_term_array($stack,$coder){ $t1= $this->topItem($stack,1); $valLen = $t1[TokenValueIndex]; //将数据长度放入附加信息 $t2= $this->topItem($stack,2); return [$t2[TokenValueIndex],[$valLen]]; } function _term_0_iden($stack,$coder){ $t1= $this->topItem($stack,1); //未指定数据长度时将长度值设为0 $valLen = '0'; $t2= $this->topItem($stack,2); return [$t1[TokenValueIndex],[$valLen]]; } // term }}} // array {{{ function _array_0_lb_num_rb($stack,$coder){ return $this->pass($stack,2); } // array }}} } // end of class
The following is the basic data Parsing functions for types, integers, and strings (note, used for experiments, not yet covering all basic data types)
<?php //用于解析基本数据类型的内置函数 namespace Ados; //解析一个字节,得到无符号整数值 function parseByte($context,$size = 0){ $pos=$context['pos']; $data = substr($context['data'], $pos); if(strlen($data)>0){ $raw = substr($data, 0,1); $value = unpack("C1",$raw)[1]; return ['value'=>$value,'size'=>1,'error'=>0,'msg'=>'ok']; }else{ return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'not enough for parseByte']; } } //解析一个有符号整数 function parseInt($context,$size = 4){ if(!($size == 2 or $size == 4 or $size == 8 )){ return ['value'=>False,'size'=>0,'error'=>2,'msg'=>'not a valid size']; } if($size == 2) $format = "s1"; if($size == 4) $format = "l1"; if($size == 8) $format = "q1"; $pos=$context['pos']; $data = substr($context['data'], $pos); if(strlen($data)>=$size){ $raw = substr($data, 0,$size); $value = unpack($format,$raw)[1]; return ['value'=>$value,'size'=>$size,'error'=>0,'msg'=>'ok']; }else{ return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'no data for parseInt']; } } //解析一个大端无符号整数 function parseBUInt($context,$size = 4){ if(!($size == 2 or $size == 4 or $size == 8 )){ return ['value'=>False,'size'=>0,'error'=>2,'msg'=>'not a valid size']; } if($size == 2) $format = "n1"; if($size == 4) $format = "N1"; if($size == 8) $format = "J1"; $pos=$context['pos']; $data = substr($context['data'], $pos); if(strlen($data)>=$size){ $raw = substr($data, 0,$size); $value = unpack($format,$raw)[1]; return ['value'=>$value,'size'=>$size,'error'=>0,'msg'=>'ok']; }else{ return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'no data for parseBUInt']; } } //解析一个小端无符号整数 function parseLUInt($context,$size = 4){ if(!($size == 2 or $size == 4 or $size == 8 )){ return ['value'=>False,'size'=>0,'error'=>2,'msg'=>'not a valid size']; } if($size == 2) $format = "v1"; if($size == 4) $format = "NL"; if($size == 8) $format = "P1"; $pos=$context['pos']; $data = substr($context['data'], $pos); if(strlen($data)>=$size){ $raw = substr($data, 0,$size); $value = unpack($format,$raw)[1]; return ['value'=>$value,'size'=>$size,'error'=>0,'msg'=>'ok']; }else{ return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'no data for parseLUInt']; } } //解析一个null结束的字符串 function parseString($context,$size=0){ $pos=$context['pos']; $data = substr($context['data'], $pos); $p=0; $raw = substr($data, $p,1); $byte = unpack("C1",$raw)[1]; $result =''; while($byte){ $result.=$raw; $p++; if($p>=strlen($data)){ return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'not find null end for parseString']; } $raw = substr($data, $p,1); $byte = unpack("C1",$raw)[1]; } return ['value'=>$result,'size'=>$p,'error'=>0,'msg'=>'ok']; } //解析一个定长字符串 function parseFixStr($context,$size=0){ $pos=$context['pos']; $data = substr($context['data'], $pos); //var_dump($data); if(strlen($data)>=$size){ $result =''; for($i=0;$i<$size;$i++){ $raw = substr($data, $i,1); $value = unpack("a1",$raw)[1]; $result.=$value; } return ['value'=>$result,'size'=>$size,'error'=>0,'msg'=>'ok']; } return ['value'=>False,'size'=>0,'error'=>1,'msg'=>'not enough for parseFixedString']; }
The following is a working function for template replacement
<?php //用于进行模板替换的工作函数 namespace Ados; defined('__STRUCT_PARSE_TEMP__') or define('__STRUCT_PARSE_TEMP__', './'); define('_blockParsTempFile_',__STRUCT_PARSE_TEMP__.'parseFunc.template.php'); //取出两个记号串之间的内容 function strBetweenToke($src,$toke1,$toke2){ $p1 = strpos($src,$toke1); $p2 = strpos($src,$toke2); return substr($src,$p1+strlen($toke1),$p2-$p1-strlen($toke1)); } //取得块分析函数的header function blockHeaderStr(){ $src = file_get_contents(_blockParsTempFile_); return strBetweenToke($src,'/*blockHeader{{*/','/*blockHeader}}*/'); } //取得块分析函数的body function blockBodyStr(){ $src = file_get_contents(_blockParsTempFile_); return strBetweenToke($src,'/*blockBody{{*/','/*blockBody}}*/'); } //取得块分析函数的tail function blockTailStr(){ $src = file_get_contents(_blockParsTempFile_); return strBetweenToke($src,'/*blockTail{{*/','/*blockTail}}*/'); } define('_blockHeaderStr_',blockHeaderStr()); define('_blockBodyStr_',blockBodyStr()); define('_blockTailStr_',blockTailStr()); function makeBlockHeader($blockName){ return str_replace('parseBlock_Temp', 'parse'.$blockName, _blockHeaderStr_); } function makeblockBody($parseFuncName,$filedName='',$filedSize=0){ $tmp = str_replace('parseByte', $parseFuncName, _blockBodyStr_); $tmp = str_replace('$filedName', $filedName, $tmp); $tmp = str_replace('$filedSize', $filedSize, $tmp); return $tmp; } function makeblockTail(){ return _blockTailStr_ ; } /* echo blockHeaderStr(); echo blockBodyStr(); echo blockTailStr(); echo makeBlockHeader('Test1'); echo makeblockBody('parseInt'); echo makeblockBody('parseStr'); echo makeblockTail(); */
With that After these preparations, a grammar-guided encoder can be implemented to generate the final parsing function.
The following is a simple encoder that generates script code that can be used to parse binary data when syntactically analyzing the structure definition.
<?php /*! * structwkr编码器, * * 45022300@qq.com * Version 0.9.0 * * Copyright 2019, Zhu Hui * Released under the MIT license */ namespace Ados; require_once __SCRIPTCORE__.'coder/base_coder.php'; require_once __STRUCT_PARSE_TEMP__.'templateReplaceFuncs.php'; class StructwkrCoder extends BaseCoder{ public function __construct($engine) { if($engine){ $this->engine = $engine; }else{ exit('the engine is not valid in StructwkrCoder construct.'); } } //编译得到的最终结果 public function codeLines(){ if(count($this->codeLines)<1){ return ''; } $script=''; for ($i=0;$i< count($this->codeLines);$i+=1) { $script.=$this->codeLines[$i]; } return $script; } //输出编译后的结果 public function printCodeLines(){ echo $this->codeLines(); } //添加一个块解析函数头 public function pushBlockHeader($structName){ $structName=ucfirst($structName); $content = makeBlockHeader($structName); array_push($this->codeLines, $content); $lineIndex=$this->lineIndex; $this->lineIndex+=1; return $lineIndex; } //添加一个块解析函数体 public function pushBlockBody($parseFuncName,$filedName='',$filedSize=0){ $content = makeblockBody($parseFuncName,$filedName,$filedSize); array_push($this->codeLines, $content); $lineIndex=$this->lineIndex; $this->lineIndex+=1; return $lineIndex; } //添加一个块解析函数尾 public function pushBlockTail(){ $content = makeblockTail(); array_push($this->codeLines, $content); $lineIndex=$this->lineIndex; $this->lineIndex+=1; return $lineIndex; } }
Automatically generated script file for parsing
Compile the C language structure, and finally get a series of functions that can be used to parse binary data. For example, two structures are defined in the above structure definition file
struct student { char name[2]; int num; int age; char addr[3]; }; struct teacher { char name[2]; int num; char addr[3]; };
Then the parsing functions parseStudent
and parseTeacher
corresponding to these two structures will be generated.
The following is the automatically generated test script
False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } $expRes = parseInt($context,4); if($expRes['error']==0){ $filed = 'num'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } $expRes = parseInt($context,4); if($expRes['error']==0){ $filed = 'age'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } $expRes = parseFixStr($context,3); if($expRes['error']==0){ $filed = 'addr'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } return ['value'=>$valueArray,'size'=>$totalSize,'error'=>0,'msg'=>'ok']; } function parseTeacher($context,$size=0){ $valueArray=[]; $totalSize = 0; $expRes = parseFixStr($context,2); if($expRes['error']==0){ $filed = 'name'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } $expRes = parseInt($context,4); if($expRes['error']==0){ $filed = 'num'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } $expRes = parseFixStr($context,3); if($expRes['error']==0){ $filed = 'addr'; if($filed){ $valueArray[$filed]=$expRes['value']; }else{ $valueArray[]=$expRes['value']; } $context['pos']+=$expRes['size']; $totalSize+= $expRes['size']; }else{ return ['value'=>False,'size'=>0,'error'=>$expRes['error'],'msg'=>$expRes['msg']]; } return ['value'=>$valueArray,'size'=>$totalSize,'error'=>0,'msg'=>'ok']; }
Run this test script and the results are as follows:
Array ( [value] => Array ( [name] => AB [num] => 1 [age] => 2 [addr] => ABC ) [size] => 13 [error] => 0 [msg] => ok ) Array ( [value] => Array ( [name] => AB [num] => 1 [addr] => ABC ) [size] => 9 [error] => 0 [msg] => ok )
OK! The first step of the task has been completed, and then consider implementing it Parsing of a structure with conditional judgment.
For more related questions, please visit the PHP Chinese website: https://www.php.cn/
The above is the detailed content of PHP implements functions similar to the Construct library in Python (1) Basic design ideas. For more information, please follow other related articles on the PHP Chinese website!