MySQL Source Code Study: Lexical Analysis with MYSQLlex


Jun 01, 2016 pm 01:44 PM


Lexical Analysis: MYSQLlex

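After a client sends an SQL statement to the server, the server first performs lexical analysis, then syntax analysis and semantic analysis, builds the execution tree, and generates the execution plan. Lexical analysis is the first stage. It does not matter much for understanding how MySQL is implemented, but it is worth studying as groundwork.

Lexical analysis splits the input statement into tokens and works out what each token means. At its core, tokenizing is a process of regular-expression matching. The most popular tokenizing tool is probably lex, which builds a tokenizer from a simple set of rules and is usually used together with yacc. For the basics of lex and yacc, see the IBM article "Yacc 与Lex 快速入门"; for a deeper treatment, see the book 《LEX与YACC》.

MySQL, however, does not use lex for lexical analysis, although it does use yacc for syntax analysis. Since yacc needs a lexical analysis function named yylex, the following macro definitions appear at the very top of sql_yacc.cc: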

/* Substitute the variable and function names. */

#define yyparse MYSQLparse

#define yylex MYSQLlex

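MYSQLlex here is the focus of this article: MySQL's own lexical analyzer. The source version is 5.1.48. The function is too long to paste here; it lives in sql_lex.cc.

The first time we enter the lexer, state defaults to MY_LEX_START, the start state. The meaning of each state macro can mostly be guessed from its name; MY_LEX_IDENT, for example, is an identifier. Pseudocode for handling the START state: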

case MY_LEX_START:

{

/* skip whitespace */

/* read the first significant character c */

state = state_map[c];

break;

}
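The START step above can be sketched as a small standalone function. This is only an illustration under assumptions: the enum, `classify` and `lex_start` are hypothetical names, and `classify` uses plain comparisons where the server uses the precomputed state_map lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced set of lexer states (names chosen for illustration). */
enum lex_state { LEX_CHAR, LEX_IDENT, LEX_NUMBER, LEX_SKIP, LEX_EOL };

/* Classify one byte; the real lexer does this with a table lookup instead. */
static enum lex_state classify(unsigned char c)
{
    if (c == '\0') return LEX_EOL;
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') return LEX_SKIP;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') return LEX_IDENT;
    if (c >= '0' && c <= '9') return LEX_NUMBER;
    return LEX_CHAR;
}

/*
 * Mimics the START step: skip whitespace, then return the state selected by
 * the first significant character. *pos is advanced past that character,
 * except at end of input, where it stays on the terminating NUL.
 */
static enum lex_state lex_start(const char *sql, size_t *pos)
{
    assert(sql != NULL && pos != NULL);
    while (classify((unsigned char) sql[*pos]) == LEX_SKIP)
        (*pos)++;
    unsigned char c = (unsigned char) sql[*pos];
    if (c != '\0')
        (*pos)++;
    return classify(c);
}
```

Calling `lex_start("   select 1", &pos)` with `pos` starting at 0 skips the three spaces, consumes the `s`, and reports the identifier state, which is the hand-off to the next state that the pseudocode describes.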

I was puzzled: where did state_map suddenly come from? I found where it is assigned, at the start of the function:

uchar *state_map= cs->state_map;

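cs?! Surely not Counter-Strike! A quick watch in the debugger shows that cs is my_charset_latin1, and then it clicked: cs is the latin1 character set, presumably short for "character set". So why can state_map determine the state directly? Its values are assigned in the init_state_maps function, as shown below: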

/* Fill state_map with states to get a faster parser */

for (i=0; i < 256 ; i++)

{

if (my_isalpha(cs,i))

state_map[i]=(uchar) MY_LEX_IDENT;

else if (my_isdigit(cs,i))

state_map[i]=(uchar) MY_LEX_NUMBER_IDENT;

#if defined(USE_MB) && defined(USE_MB_IDENT)

else if (my_mbcharlen(cs, i)>1)

state_map[i]=(uchar) MY_LEX_IDENT;

#endif

else if (my_isspace(cs,i))

state_map[i]=(uchar) MY_LEX_SKIP;

else

state_map[i]=(uchar) MY_LEX_CHAR;

}

state_map[(uchar)'_']=state_map[(uchar)'$']=(uchar) MY_LEX_IDENT;

state_map[(uchar)'\'']=(uchar) MY_LEX_STRING;

state_map[(uchar)'.']=(uchar) MY_LEX_REAL_OR_POINT;

state_map[(uchar)'>']=state_map[(uchar)'=']=state_map[(uchar)'!']= (uchar) MY_LEX_CMP_OP;

state_map[(uchar)'<']= (uchar) MY_LEX_LONG_CMP_OP;

state_map[(uchar)'&']=state_map[(uchar)'|']=(uchar) MY_LEX_BOOL;

state_map[(uchar)'#']=(uchar) MY_LEX_COMMENT;

state_map[(uchar)';']=(uchar) MY_LEX_SEMICOLON;

state_map[(uchar)':']=(uchar) MY_LEX_SET_VAR;

state_map[0]=(uchar) MY_LEX_EOL;

state_map[(uchar)'\\']= (uchar) MY_LEX_ESCAPE;

state_map[(uchar)'/']= (uchar) MY_LEX_LONG_COMMENT;

state_map[(uchar)'*']= (uchar) MY_LEX_END_LONG_COMMENT;

state_map[(uchar)'@']= (uchar) MY_LEX_USER_END;

state_map[(uchar) '`']= (uchar) MY_LEX_USER_VARIABLE_DELIMITER;

state_map[(uchar)'"']= (uchar) MY_LEX_STRING_OR_DELIMITER;

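Look at the for loop first. The 256 presumably stands for the 256 possible character values, and each one is handled by these rules: a letter gets state = MY_LEX_IDENT; a digit gets state = MY_LEX_NUMBER_IDENT; whitespace gets state = MY_LEX_SKIP; everything else gets MY_LEX_CHAR.

After the for loop, some special characters are handled individually. Our statement, "select @@version_comment limit 1", contains the special character @, whose state is overridden to MY_LEX_USER_END.

How do functions such as my_isalpha decide which category a character belongs to? Stepping in shows that they are macros: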

#define my_isalpha(s, c) (((s)->ctype+1)[(uchar) (c)] & (_MY_U | _MY_L))

Wtf, now there is a ctype as well, with c used as the index into it. _MY_U and _MY_L are defined as follows:

#define _MY_U 01 /* Upper case */

#define _MY_L 02 /* Lower case */

What does ctype actually hold? In the source file ctype-latin1.c we find the initial value of the my_charset_latin1 character set:

CHARSET_INFO my_charset_latin1=

{

8,0,0, /* number */

MY_CS_COMPILED | MY_CS_PRIMARY, /* state */

"latin1", /* cs name */

"latin1_swedish_ci", /* name */

"", /* comment */

NULL, /* tailoring */

ctype_latin1,

to_lower_latin1,

to_upper_latin1,

sort_order_latin1,

NULL, /* contractions */

NULL, /* sort_order_big*/

cs_to_uni, /* tab_to_uni */

NULL, /* tab_from_uni */

my_unicase_default, /* caseinfo */

NULL, /* state_map */

NULL, /* ident_map */

1, /* strxfrm_multiply */

1, /* caseup_multiply */

1, /* casedn_multiply */

1, /* mbminlen */

1, /* mbmaxlen */

0, /* min_sort_char */

255, /* max_sort_char */

' ', /* pad char */

0, /* escape_with_backslash_is_dangerous */

&my_charset_handler,

&my_collation_8bit_simple_ci_handler

};

So ctype = ctype_latin1, and ctype_latin1 has the value:

static uchar ctype_latin1[] = {

0,

32, 32, 32, 32, 32, 32, 32, 32, 32, 40, 40, 40, 40, 40, 32, 32,

32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

72, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,

132,132,132,132,132,132,132,132,132,132, 16, 16, 16, 16, 16, 16,

16,129,129,129,129,129,129, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 16, 16, 16, 16, 16,

16,130,130,130,130,130,130, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16, 16, 16, 16, 32,

16, 0, 16, 2, 16, 16, 16, 16, 16, 16, 1, 16, 1, 0, 1, 0,

0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 16, 2, 0, 2, 1,

72, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,

16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 16, 1, 1, 1, 1, 1, 1, 1, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 16, 2, 2, 2, 2, 2, 2, 2, 2

};

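At this point it clicked again: these values are all precomputed. The leading 0 is not a real entry, which is why the definition of my_isalpha(s, c) adds 1 to ctype before indexing. From the definitions of _MY_U and _MY_L we can tell that the bits of each value are set according to what the corresponding ASCII code means. Take the character 'A': its ASCII code is 65 and it is an uppercase letter, so it must carry _MY_U, meaning bit 0 must be 1. The 66th element of ctype (skipping the meaningless leading 0) is 129 = 10000001 in binary, and bit 0 (counting from the right) is indeed 1, so it is an uppercase letter. Whoever wrote this code was seriously good; I doubt I would ever have thought of using bits this way. That is enough about state.

Back to the lexical analysis. The first letter is s, so state = MY_LEX_IDENT (IDENT is short for identifier). We break out, loop again, and the switch enters the MY_LEX_IDENT branch: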
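The bit test can be checked by hand with a few lines of C. This is a sketch under assumptions: the `FLAG_*` names and the helper functions are hypothetical, `FLAG_X` (hex digit, octal 0200) is assumed because it is consistent with 129 = 128 + 1 for 'A', and `excerpt` copies only the seven table entries for '@' through 'F'.

```c
#include <assert.h>

/* Flag values as quoted in the article (octal literals). */
#define FLAG_U 01    /* upper case */
#define FLAG_L 02    /* lower case */
#define FLAG_X 0200  /* hex digit: assumed name and value, see lead-in */

/* Excerpt of the flag table covering only '@' (64) through 'F' (70). */
static const unsigned char excerpt[] = { 16, 129, 129, 129, 129, 129, 129 };

/* Look up the flag byte for a character inside the excerpt's range. */
static unsigned char flags_for(unsigned char c)
{
    assert(c >= '@' && c <= 'F');
    return excerpt[c - '@'];
}

/* Non-zero when the flag byte marks a letter (upper or lower case). */
static int flags_is_alpha(unsigned char flags) { return flags & (FLAG_U | FLAG_L); }

/* Non-zero when the flag byte marks an upper-case letter. */
static int flags_is_upper(unsigned char flags) { return flags & FLAG_U; }
```

For 'A' the flag byte is 129, so `flags_is_upper` yields a non-zero result, while '@' has the value 16, which contains neither letter bit.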

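case MY_LEX_IDENT:

{

/* read from s up to the next space */

if (the word just read is a keyword)

{

next_state = MY_LEX_START;

return tokval; /* unique identifier of the keyword */

}

else

{

return IDENT_QUOTED or IDENT; /* an ordinary identifier */

}

}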

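SELECT is of course a keyword here. As for why, the next installment on syntax analysis will explain.

After SELECT comes @@version_comment. Its first character is @, so the START branch sets state = MY_LEX_USER_END;

we then enter the MY_LEX_USER_END branch, shown below: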
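The identifier branch can be sketched as "scan a word, then look it up in a keyword table". This is a minimal illustration, not MySQL's implementation: the token codes, the two-entry keyword table and the function names are hypothetical, and the lookup is a linear scan chosen for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_IDENT, TOK_SELECT, TOK_LIMIT };  /* hypothetical token codes */

/* Tiny hypothetical keyword table (lower case); a real one is far larger. */
static const struct { const char *word; enum token tok; } keywords[] = {
    { "select", TOK_SELECT },
    { "limit",  TOK_LIMIT  },
};

/* Case-insensitive comparison of a length-delimited word with a keyword. */
static int same_word(const char *s, size_t len, const char *kw)
{
    if (strlen(kw) != len) return 0;
    for (size_t i = 0; i < len; i++)
        if (tolower((unsigned char) s[i]) != kw[i]) return 0;
    return 1;
}

/* Scan one word starting at sql[*pos], advance *pos past it, classify it. */
static enum token scan_word(const char *sql, size_t *pos)
{
    assert(sql != NULL && pos != NULL);
    size_t start = *pos;
    while (isalnum((unsigned char) sql[*pos]) || sql[*pos] == '_' || sql[*pos] == '$')
        (*pos)++;
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
        if (same_word(sql + start, *pos - start, keywords[k].word))
            return keywords[k].tok;
    return TOK_IDENT;  /* not a keyword: an ordinary identifier */
}
```

With this sketch, `SELECT` maps to the keyword token regardless of case, while `version_comment` falls through to the plain identifier token.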

case MY_LEX_USER_END: // end '@' of user@hostname

switch (state_map[lip->yyPeek()]) {

case MY_LEX_STRING:

case MY_LEX_USER_VARIABLE_DELIMITER:

case MY_LEX_STRING_OR_DELIMITER:

break;

case MY_LEX_USER_END:

lip->next_state=MY_LEX_SYSTEM_VAR;

break;

default:

lip->next_state=MY_LEX_HOSTNAME;

break;

}

I smiled: two @ signs mean a system variable. Next we enter the MY_LEX_SYSTEM_VAR branch:

case MY_LEX_SYSTEM_VAR:

yylval->lex_str.str=(char*) lip->get_ptr();

yylval->lex_str.length=1;

lip->yySkip(); // Skip '@'

lip->next_state= (state_map[lip->yyPeek()] ==

MY_LEX_USER_VARIABLE_DELIMITER ?

MY_LEX_OPERATOR_OR_IDENT :

MY_LEX_IDENT_OR_KEYWORD);

return((int) '@');

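What this does is skip the @ and set next_state to MY_LEX_IDENT_OR_KEYWORD. After that comes the handling of MY_LEX_IDENT_OR_KEYWORD, that is, of version_comment, which should follow the same path as SELECT except that it is not a keyword. The rest is left to the interested reader (it reminds me of what singers always say: everybody join in, haha).

MySQL's lexer has quite a few states, and digging into all of them takes some time. But this is not the core of MySQL, so I will stop here. The next installment will walk through syntax analysis for the SQL statement above.

PS: I have long wanted to study MySQL properly but keep getting sidetracked by one thing or another, all my own fault of course. I hope to get further this time.

PS again: this article only reflects my own learning notes; corrections are welcome.

Excerpted from 心中无码, bitsCN.com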
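The one-character lookahead that separates "@@" from the other uses of "@" can be sketched as follows. This is a simplified illustration: the enum and function names are hypothetical, and the quote characters are compared directly, whereas the source compares the states stored in state_map.

```c
#include <assert.h>
#include <stddef.h>

enum at_kind { AT_QUOTED_NAME, AT_SYSTEM_VAR, AT_HOSTNAME };  /* hypothetical names */

/* Given a pointer to an '@', decide what follows by peeking one character. */
static enum at_kind classify_at(const char *p)
{
    assert(p != NULL && p[0] == '@');
    switch (p[1]) {
    case '\'':
    case '"':
    case '`':
        return AT_QUOTED_NAME;  /* a quoted name follows; no state change */
    case '@':
        return AT_SYSTEM_VAR;   /* "@@" introduces a system variable */
    default:
        return AT_HOSTNAME;     /* the source sets MY_LEX_HOSTNAME here */
    }
}
```

For "@@version_comment" the peeked character is a second '@', so the sketch takes the system-variable path, matching the next_state assignment in the original branch.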
