Problem Description
Empty strings "" sometimes appear in the result when a string is split with JavaScript's split method, especially when a regular expression is used as the delimiter.
Related questions
Javascript regular expression produces empty string group when grouping strings?
In the question above, the asker split a string with a regular expression and got several empty strings "" in the result. The code is as follows:
So, what is the reason for these empty strings?
Problem Analysis
After searching on Google, I found few related results, and even fewer with a detailed explanation. Most give a brief introduction and then link to the ECMAScript specification. It seems that if you want to know the real reason, you have to bite the bullet and read the specification.
Related standards
So, as usual, let's start with the ECMAScript standard itself.
The relevant chapter describes the execution steps of the split method in detail. If you are interested, you can read it carefully step by step. Here I will only explain the steps related to generating an empty string. If anything is wrong, corrections are welcome.
Related steps
Extract some steps:
The most important step in the whole process is the loop in step 13. The main things this loop does are as follows:
•Define p and q. They start out equal, and q is set back to p after every successful split (the initial definition happens outside the loop);
•Call the SplitMatch(S, q, R) method to match the delimiter at position q;
•Depending on the result returned, a different branch is executed; the main branch is branch ⅲ;
•Branch ⅲ is divided into 8 small steps that fill the results into the pre-defined array A;
•In these 8 small steps, step 1 takes a substring of the original string, starting at p (inclusive) and ending at q (exclusive). Note: this is the step in which an empty string can be generated; I refer back to it below as the step that extracts the substring.
•Add the substring from the previous step to array A;
•The remaining steps update the relevant variables and continue with the next iteration. (Step 7 saves the capture groups of the regular expression into array A and has nothing to do with generating empty strings.)
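The loop described above can be sketched in JavaScript roughly as follows. This is a simplified illustration under stated assumptions, not the specification's text or any engine's code: it handles only string delimiters, ignores the limit argument and the empty-input special cases, and the names splitSketch and splitMatchString are made up for this example.

```javascript
// Hypothetical helper: a simplified SplitMatch for string delimiters only.
// Returns null on failure, otherwise an object with endIndex and captures.
function splitMatchString(S, q, sep) {
  if (S.startsWith(sep, q)) {
    return { endIndex: q + sep.length, captures: [] };
  }
  return null;
}

// Sketch of the step-13 loop followed by step 14.
function splitSketch(S, sep) {
  const A = [];
  const size = S.length;
  let p = 0;
  let q = p;
  while (q < size) {
    const z = splitMatchString(S, q, sep);
    if (z === null) {
      q = q + 1;                   // no match at q: try the next position
    } else {
      const e = z.endIndex;
      if (e === p) {
        q = q + 1;                 // empty match at p: move on
      } else {
        A.push(S.substring(p, q)); // step 1: yields "" when p === q
        p = e;
        for (const c of z.captures) A.push(c); // step 7: capture groups
        q = p;
      }
    }
  }
  A.push(S.substring(p, size));    // step 14: the remaining tail
  return A;
}

console.log(splitSketch('abbbc', 'b'));
```

For the inputs tried here the sketch returns the same array as the built-in split, which makes it easy to step through with a debugger and watch p and q become equal.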
SplitMatch(S, q, R)
Next, we need to understand what the SplitMatch(S, q, R) method does. This method is mentioned further down in the split specification. What it mainly does is to perform corresponding operations according to the type of separator:
•If the delimiter is of type RegExp, call the internal method [[Match]] of RegExp to match the string. If the match fails, failure is returned. Otherwise, a result of type MatchResult is returned.
•If the delimiter is a string, it is compared with the characters of S starting at position q. If they do not match, failure is returned; otherwise, a result of type MatchResult is returned.
MatchResult
The steps above introduce a value of type MatchResult. According to the specification, a successful result has two properties, endIndex and captures. The value of endIndex is the index of the last matched character plus 1, that is, the position right after the match. captures can be thought of as an array: when the delimiter is a regular expression, its elements are the values captured by the groups; when the delimiter is a string, it is an empty array.
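To make endIndex and captures more concrete, here is a rough JavaScript approximation of SplitMatch. It is only a sketch: the function name splitMatch, the use of null for failure, and the plain object standing in for MatchResult are assumptions of this example. The sticky flag y is used so that the regular expression is tried only at position q, which approximates what the internal [[Match]] call does.

```javascript
// Hypothetical approximation of SplitMatch(S, q, R).
// Returns null for failure, otherwise { endIndex, captures }.
function splitMatch(S, q, R) {
  if (R instanceof RegExp) {
    // Build a sticky copy so exec only attempts a match exactly at index q.
    const flags = R.flags.replace(/[gy]/g, '') + 'y';
    const sticky = new RegExp(R.source, flags);
    sticky.lastIndex = q;
    const m = sticky.exec(S);
    if (m === null) return null;
    // After a successful sticky match, lastIndex is the position after the match.
    return { endIndex: sticky.lastIndex, captures: m.slice(1) };
  }
  const sep = String(R);
  if (!S.startsWith(sep, q)) return null;
  return { endIndex: q + sep.length, captures: [] };
}

const stringResult = splitMatch('abc', 1, 'b');      // string delimiter: no captures
const regexResult = splitMatch('abc', 1, /(b)(x)?/); // second group does not participate
console.log(stringResult, regexResult);
```

Note that a group which does not take part in the match shows up as undefined in captures.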
Next
As the steps above show, the pieces of the split result are produced in the step that extracts a substring (apart from the captured groups of a regular expression). That step takes the part of the string between a given start position (inclusive) and end position (exclusive). So when does it return ""? One special case is when the start position and the end position are equal. This is only a guess, because the specification does not spell out the steps for extracting a substring.
We have come this far, why not take another step forward?
So I searched the V8 source code to see whether I could find the concrete implementation, and I did find the relevant code (source code link).
Here is an excerpt:
// (beginning of the function omitted in this excerpt)
    if (limit === 0) return [];
    // ECMA-262 says that if separator is undefined, the result should
    // be an array of size 1 containing the entire string.
    if (IS_UNDEFINED(separator)) return [subject];
    var separator_length = separator_string.length;
    // The separator is an empty string: return the array of characters directly
    if (separator_length === 0) return %StringToArray(subject, limit);
    var result = %StringSplit(subject, separator_string, limit);
    return result;
  }
  if (limit === 0) return [];
  // When the separator is a regular expression, call StringSplitOnRegExp
  return StringSplitOnRegExp(subject, separator, limit, length);
}
// Some code omitted here
In the code I found that the %_SubString method is called to extract the substring when the array is filled. Unfortunately, I could not find its definition; if you find it, please let me know. However, I did find that StringSubstring, which implements JavaScript's substring method, calls %_SubString and returns its result. So if 'abc'.substring(1,1) returns "", then %_SubString returns "" when the start position and the end position are the same. Try it and you will see the result.
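The check suggested above takes only a few lines. This snippet uses nothing beyond the standard substring method; the variable names are chosen just for this example.

```javascript
const sample = 'abc';

// Start and end positions are equal, so no characters are selected.
const emptyAtStart = sample.substring(0, 0);
const emptyInMiddle = sample.substring(1, 1);
const emptyAtEnd = sample.substring(3, 3);

console.log(JSON.stringify(emptyInMiddle)); // ""
console.log(emptyAtStart === '' && emptyAtEnd === ''); // true
```

The result is an empty string wherever the two positions coincide, including at the very start and the very end of the string.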
So, when will the starting position be equal to the ending position (i.e. q === p)? I analyzed step by step according to the above steps and finally found:
•The delimiter matches again immediately after a previous match, so two matches are adjacent in S. Such as: 'abbbc'.split('b'), 'abbbc'.split(/(b){1}/)
•One or several characters at the beginning of the string match the delimiter. Such as: 'abc'.split('a'), 'abc'.split(/ab/)
•One or several characters at the end of the string match the delimiter. The related step is step 14, where p equals the length of the string. Such as: 'abc'.split('c'), 'abc'.split(/bc/)
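The three situations can be checked directly in a console. The snippet below simply runs the examples from the list; the variable names are chosen for this illustration only.

```javascript
// Adjacent matches: an empty string between each pair of neighbouring delimiters.
const adjacent = 'abbbc'.split('b');         // ["a", "", "", "c"]
const adjacentRe = 'abbbc'.split(/(b){1}/);  // ["a", "b", "", "b", "", "b", "c"]

// Match at the beginning: the first element is empty.
const leading = 'abc'.split('a');            // ["", "bc"]
const leadingRe = 'abc'.split(/ab/);         // ["", "c"]

// Match at the end: the last element is empty.
const trailing = 'abc'.split('c');           // ["ab", ""]
const trailingRe = 'abc'.split(/bc/);        // ["a", ""]

console.log(adjacent, adjacentRe, leading, leadingRe, trailing, trailingRe);
```

In the second example the extra "b" elements come from the capture group, not from the substring step; only the "" elements are produced by equal start and end positions.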
In addition, when a regular expression with capture groups is used as the delimiter, undefined may appear in the returned result if a group does not take part in the match.
Such as: 'abc'.split(/(d)*/)
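Here is that example run step by step. The pattern /(d)*/ matches the empty string between characters, and because the group (d) never matches anything, its capture is undefined each time. The variable names are only for this illustration.

```javascript
const withUndefined = 'abc'.split(/(d)*/);

// One undefined entry is inserted after each split point.
console.log(withUndefined.length);           // 5
console.log(withUndefined[1] === undefined); // true
console.log(withUndefined.filter((x) => x !== undefined).join('')); // "abc"
```

Be careful when printing such a result with JSON.stringify, because it shows undefined array elements as null.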
Go back and look at the example at the beginning. Does it meet the above conditions?
Digression
This is the first time I have read the ECMAScript standard specification so carefully. The process of reading it is indeed painful, but after understanding it, I feel very happy. Thank you also to the questioner for raising this question and for following up.
By the way, when a regular expression is used as the delimiter, the global modifier g is ignored, which was a bonus discovery.
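A quick way to see this is to split the same string with and without the g flag. The input string and variable names below are made up for this example.

```javascript
const csv = 'a,b,c';

const splitWithoutG = csv.split(/,/);
const splitWithG = csv.split(/,/g);

// Even a pre-set lastIndex on a global regex does not change where splitting starts.
const globalRe = /,/g;
globalRe.lastIndex = 3;
const splitWithLastIndex = csv.split(globalRe);

console.log(splitWithoutG, splitWithG, splitWithLastIndex);
```

All three calls give the same array, so there is no need to add g to a regular expression just for split.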