©
Dokumen ini menggunakan Manual laman web PHP Cina Lepaskan
(PHP 4, PHP 5)
preg_match_all — 执行一个全局正则表达式匹配
$pattern
, string $subject
[, array &$matches
[, int $flags
= PREG_PATTERN_ORDER
[, int $offset
= 0
]]] )
搜索subject
中所有匹配pattern
给定正则表达式
的匹配结果并且将它们以flag
指定顺序输出到matches
中.
在第一个匹配找到后, 子序列继续从最后一次匹配位置搜索.
pattern
要搜索的模式,字符串形式。
subject
输入字符串。
matches
多维数组,作为输出参数输出所有匹配结果, 数组排序通过flags
指定。
flags
可以结合下面标记使用(注意不能同时使用 PREG_PATTERN_ORDER
和
PREG_SET_ORDER
):
PREG_PATTERN_ORDER
结果排序为 $matches[0] 保存完整模式的所有匹配, $matches[1] 保存第一个子组的所有匹配,以此类推。
<?php
preg_match_all ( "|<[^>]+>(.*)</[^>]+>|U" ,
"<b>example: </b><div align=left>this is a test</div>" ,
$out , PREG_PATTERN_ORDER );
echo $out [ 0 ][ 0 ] . ", " . $out [ 0 ][ 1 ] . "\n" ;
echo $out [ 1 ][ 0 ] . ", " . $out [ 1 ][ 1 ] . "\n" ;
?>
以上例程会输出:
<b>example: </b>, <div align=left>this is a test</div> example: , this is a test
因此, $out[0] 是包含匹配完整模式的字符串的数组, $out[1] 是包含闭合标签内的字符串的数组。
PREG_SET_ORDER
结果排序为 $matches[0] 包含第一次匹配得到的所有匹配(包含子组), $matches[1] 是包含第二次匹配到的所有匹配(包含子组)的数组,以此类推。
<?php
preg_match_all ( "|<[^>]+>(.*)</[^>]+>|U" ,
"<b>example: </b><div align=\"left\">this is a test</div>" ,
$out , PREG_SET_ORDER );
echo $out [ 0 ][ 0 ] . ", " . $out [ 0 ][ 1 ] . "\n" ;
echo $out [ 1 ][ 0 ] . ", " . $out [ 1 ][ 1 ] . "\n" ;
?>
以上例程会输出:
<b>example: </b>, example: <div align="left">this is a test</div>, this is a test
PREG_OFFSET_CAPTURE
如果这个标记被传递,每个发现的匹配返回时会增加它相对目标字符串的偏移量。
注意这会改变matches
中的每一个匹配结果字符串元素,使其
成为一个第0个元素为匹配结果字符串,第1个元素为
匹配结果字符串在subject
中的偏移量。
如果没有给定排序标记,假定设置为 PREG_PATTERN_ORDER
。
offset
通常, 查找时从目标字符串的开始位置开始。可选参数offset
用于
从目标字符串中指定位置开始搜索(单位是字节)。
Note:
使用
offset
参数不同于传递 substr($subject, $offset) 的 结果到 preg_match_all() 作为目标字符串,因为pattern
可以包含断言比如^, $ 或者 (?<=x) 。 示例查看 preg_match() 。
返回完整匹配次数(可能是0),或者如果发生错误返回 FALSE
。
版本 | 说明 |
---|---|
5.4.0 |
参数matches 成为可选的。
|
5.3.6 |
如果 offset
大于
subject 的程度,将返回 FALSE 。
|
5.2.2 | 子命名分组语法可以接受(?<name>),(?'name')以及 (?P<name>)了。 之前版本仅接受(?P<name>)方式。 |
4.3.3 |
增加了offset 参数。
|
4.3.0 |
增加了 PREG_OFFSET_CAPTURE 标记。
|
Example #1 查找所有文本中的电话号码。
<?php
preg_match_all ( "/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x" ,
"Call 555-1212 or 1-800-555-1212" , $phones );
?>
Example #2 查找匹配的HTML标签(贪婪)
<?php
//\\2是一个后向引用的示例. 这会告诉pcre它必须匹配正则表达式中第二个圆括号(这里是([\w]+))
//匹配到的结果. 这里使用两个反斜线是因为这里使用了双引号.
$html = "<b>bold text</b><a href=howdy.html>click me</a>" ;
preg_match_all ( "/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/" , $html , $matches , PREG_SET_ORDER );
foreach ( $matches as $val ) {
echo "matched: " . $val [ 0 ] . "\n" ;
echo "part 1: " . $val [ 1 ] . "\n" ;
echo "part 2: " . $val [ 2 ] . "\n" ;
echo "part 3: " . $val [ 3 ] . "\n" ;
echo "part 4: " . $val [ 4 ] . "\n\n" ;
}
?>
以上例程会输出:
matched: <b>bold text</b> part 1: <b> part 2: b part 3: bold text part 4: </b>matched: <a href=howdy.html>click me</a> part 1: <a href=howdy.html> part 2: a part 3: click me part 4: </a>
Example #3 使用子命名组
<?php
$str = <<<FOO
a: 1
b: 2
c: 3
FOO;
preg_match_all ( '/(?P<name>\w+): (?P<digit>\d+)/' , $str , $matches );
// preg_match_all('/(?<name>\w+): (?<digit>\d+)/', $str, $matches);
print_r ( $matches );
?>
以上例程会输出:
Array ( [0] => Array ( [0] => a: 1 [1] => b: 2 [2] => c: 3 ) [name] => Array ( [0] => a [1] => b [2] => c ) [1] => Array ( [0] => a [1] => b [2] => c ) [digit] => Array ( [0] => 1 [1] => 2 [2] => 3 ) [2] => Array ( [0] => 1 [1] => 2 [2] => 3 ))
[#1] matt at lvl99 dot com [2015-11-19 14:30:58]
I had been crafting and testing some regexp patterns online using the tools Regex101 and a `preg_match_all()` tester and found that the regexp patterns I wrote worked fine on them, just not in my code.
My problem was not double-escaping backslash characters:
<?php
// Input test
$input = "\"something\",\"something here\",\"some\nnew\nlines\",\"this is the end\"";
// Work with online regexp testers, doesn't work in PHP
preg_match_all( "/(?:,|^)(?<!\\)\".*?(?<!\\)\"(?:(?=,)|$)/s", $input, $matches );
// Works with online regexp testers, does work in PHP
preg_match_all( "/(?:,|^)(?<!\\\\)\".*?(?<!\\\\)\"(?:(?=,)|$)/s", $input, $matches );
?>
[#2] stas kuryan aka stafox [2015-11-09 08:57:49]
Here is a awesome online regex editor https://regex101.com/
which helps you test your regular expressions (prce, js, python) with real-time highlighting of regex match on data input.
[#3] vojjov dot artem at ya dot ru [2015-05-26 15:40:10]
// Here is function that allows you to preg_match_all array of patters
function getMatches($pattern, $subject) {
$matches = array();
if (is_array($pattern)) {
foreach ($pattern as $p) {
$m = getMatches($p, $subject);
foreach ($m as $key => $match) {
if (isset($matches[$key])) {
$matches[$key] = array_merge($matches[$key], $m[$key]);
} else {
$matches[$key] = $m[$key];
}
}
}
} else {
preg_match_all($pattern, $subject, $matches);
}
return $matches;
}
$patterns = array(
'/<span>(.*?)<\/span>/',
'/<a href=".*?">(.*?)<\/a>/'
);
$html = '<span>some text</span>';
$html .= '<span>some text in another span</span>';
$html .= '<a href="path/">here is the link</a>';
$html .= '<address>address is here</address>';
$html .= '<span>here is one more span</span>';
$matches = getMatches($patterns, $html);
print_r($matches); // result is below
[#4] Daniel Klein [2015-05-17 05:26:58]
The code that john at mccarthy dot net posted is not necessary. If you want your results grouped by individual match simply use:
<?php
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);
?>
E.g.
<?php
preg_match_all('/([GH])([12])([!?])/', 'G1? H2!', $matches); // Default PREG_PATTERN_ORDER
// $matches = array(0 => array(0 => 'G1?', 1 => 'H2!'),
// 1 => array(0 => 'G', 1 => 'H'),
// 2 => array(0 => '1', 1 => '2'),
// 3 => array(0 => '?', 1 => '!'))
preg_match_all('/([GH])([12])([!?])/', 'G1? H2!', $matches, PREG_SET_ORDER);
// $matches = array(0 => array(0 => 'G1?', 1 => 'G', 2 => '1', 3 => '?'),
// 1 => array(0 => 'H2!', 1 => 'H', 2 => '2', 3 => '!'))
?>
[#5] DarkSide [2014-04-15 14:03:35]
This is very useful to combine matches:
$a = array_combine($matches[1], $matches[2]);
[#6] ajeet dot nigam at icfaitechweb dot com [2014-01-22 18:34:25]
Here http://tryphpregex.com/ is a php based online regex editor which helps you test your regular expressions with real-time highlighting of regex match on data input.
[#7] fab [2012-09-24 09:14:45]
Here is a function that replaces all occurrences of a number in a string by the number--
<?php
function decremente_chaine($chaine)
{
//r??cup??rer toutes les occurrences de nombres et leurs indices
preg_match_all("/[0-9]+/",$chaine,$out,PREG_OFFSET_CAPTURE);
//parcourir les occurrences
for($i=0;$i<sizeof($out[0]);$i++)
{
$longueurnombre = strlen((string)$out[0][$i][0]);
$taillechaine = strlen($chaine);
// d??couper la chaine en 3 morceaux
$debut = substr($chaine,0,$out[0][$i][1]);
$milieu = ($out[0][$i][0])-1;
$fin = substr($chaine,$out[0][$i][1]+$longueurnombre,$taillechaine);
// si c'est 10,100,1000 etc. on d??cale tout de 1 car le r??sultat comporte un chiffre de moins
if(preg_match('#[1][0]+$#', $out[0][$i][0]))
{
for($j = $i+1;$j<sizeof($out[0]);$j++)
{
$out[0][$j][1] = $out[0][$j][1] -1;
}
}
$chaine = $debut.$milieu.$fin;
}
return $chaine;
}
?>
[#8] fseverin at free dot fr [2012-09-22 17:52:18]
As I intended to create for my own purpose a clean PHP class to act on XML files, combining the use of DOM and simplexml functions, I had that small problem, but very annoying, that the offsets in a path is not numbered the same in both.
That is to say, for example, if i get a DOM xpath object it appears like:
/ANODE/ANOTHERNODE/SOMENODE[9]/NODE[2]
and as a simplexml object would be equivalent to:
ANODE->ANOTHERNODE->SOMENODE[8]->NODE[1]
So u see what I mean? I used preg_match_all to solve that problem, and finally I got this after some hours of headlock (as I'm french the names of variables are in French sorry), hoping it could be useful to some of you:
<?php
function decrease_string($string)
{
preg_match_all("/[0-9]+/",$chaine,$out,PREG_OFFSET_CAPTURE);
for($i=0;$i<sizeof($out[0]);$i++)
{
$longueurnombre = strlen((string)$out[0][$i][0]);
$taillechaine = strlen($chaine);
// cut the string in 3 pieces
$debut = substr($chaine,0,$out[0][$i][1]);
$milieu = ($out[0][$i][0])-1;
$fin = substr($chaine,$out[0][$i][1]+$longueurnombre,$taillechaine);
if(preg_match('#[1][0]+$#', $out[0][$i][0]))
{
for($j = $i+1;$j<sizeof($out[0]);$j++)
{
$out[0][$j][1] = $out[0][$j][1] -1;
}
}
$chaine = $debut.$milieu.$fin;
}
return $chaine;
}
?>
[#9] marc [2012-05-04 15:48:06]
Better use preg_replace to convert text in a clickable link with tag <a>
$html = preg_replace('"\b(http://\S+)"', '<a href="$1">$1</a>', $text);
[#10] satyavvd at ymail dot com [2011-09-20 04:08:11]
Extract fields out of csv string : ( since before php5.3 you can't use str_getcsv function )
Here is the regex :
<?php
$csvData = <<<EOF
10,'20',"30","'40","'50'","\"60","70,80","09\\/18,/\"2011",'a,sdfcd'
EOF
$reg = <<<EOF
/
(
(
([\'\"])
(
(
[^\'\"]
|
(\\\\.)
)*
)
(\\3)
|
(
[^,]
|
(\\\\.)
)*
),)
/x
EOF;
preg_match_all($reg,$csvData,$matches);
// to extract csv fields
print_r($matches[2]);
?>
[#11] john at mccarthy dot net [2011-02-18 11:21:42]
I needed a function to rotate the results of a preg_match_all query, and made this. Not sure if it exists.
<?php
function turn_array($m)
{
for ($z = 0;$z < count($m);$z++)
{
for ($x = 0;$x < count($m[$z]);$x++)
{
$rt[$x][$z] = $m[$z][$x];
}
}
return $rt;
}
?>
Example - Take results of some preg_match_all query:
Array
(
[0] => Array
(
[1] => Banff
[2] => Canmore
[3] => Invermere
)
[1] => Array
(
[1] => AB
[2] => AB
[3] => BC
)
[2] => Array
(
[1] => 51.1746254
[2] => 51.0938416
[3] => 50.5065193
)
[3] => Array
(
[1] => -115.5719757
[2] => -115.3517761
[3] => -116.0321884
)
[4] => Array
(
[1] => T1L 1B3
[2] => T1W 1N2
[3] => V0B 2G0
)
)
Rotate it 90 degrees to group results as records:
Array
(
[0] => Array
(
[1] => Banff
[2] => AB
[3] => 51.1746254
[4] => -115.5719757
[5] => T1L 1B3
)
[1] => Array
(
[1] => Canmore
[2] => AB
[3] => 51.0938416
[4] => -115.3517761
[5] => T1W 1N2
)
[2] => Array
(
[1] => Invermere
[2] => BC
[3] => 50.5065193
[4] => -116.0321884
[5] => V0B 2G0
)
)
[#12] buuh [2010-12-06 02:03:08]
if you want to extract all {token}s from a string:
<?php
$pattern = "/{[^}]*}/";
$subject = "{token1} foo {token2} bar";
preg_match_all($pattern, $subject, $matches);
print_r($matches);
?>
output:
Array
(
[0] => Array
(
[0] => {token1}
[1] => {token2}
)
)
[#13] no at bo dot dy [2010-09-08 11:23:02]
For parsing queries with entities use:
<?php
preg_match_all("/(?:^|(?<=\&(?![a-z]+\;)))([^\=]+)=(.*?)(?:$|\&(?![a-z]+\;))/i",
$s, $m, PREG_SET_ORDER );
?>
[#14] avengis at gmail dot com [2009-09-23 02:25:32]
The next function works with almost any complex xml/xhtml string
<?php
function close_tags($text) {
$patt_open = "%((?<!</)(?<=<)[\s]*[^/!>\s]+(?=>|[\s]+[^>]*[^/]>)(?!/>))%";
$patt_close = "%((?<=</)([^>]+)(?=>))%";
if (preg_match_all($patt_open,$text,$matches))
{
$m_open = $matches[1];
if(!empty($m_open))
{
preg_match_all($patt_close,$text,$matches2);
$m_close = $matches2[1];
if (count($m_open) > count($m_close))
{
$m_open = array_reverse($m_open);
foreach ($m_close as $tag) $c_tags[$tag]++;
foreach ($m_open as $k => $tag) if ($c_tags[$tag]--<=0) $text.='</'.$tag.'>';
}
}
}
return $text;
}
?>
[#15] royaltm75 at gmail dot com [2009-09-13 14:44:00]
I have received complains, that my html2a() code (see below) doesn't work in some cases.
It is however not the problem with algorithm or procedure, but with PCRE recursive stack limits.
If you use recursive PCRE (?R) you should remember to increase those two ini settings:
ini_set('pcre.backtrack_limit', 10000000);
ini_set('pcre.recursion_limit', 10000000);
But be warned: (from php.ini)
;Please note that if you set this value to a high number you may consume all
;the available process stack and eventually crash PHP (due to reaching the
;stack size limit imposed by the Operating System).
I have written this example mainly to demonstrate the power of PCRE LANGUAGE, not the power of it's implementation :)
But if you like it, use it, of course on your own risk.
[#16] elyknosrac at gmail dot com [2009-07-18 15:51:12]
Using preg_match_all I made a pretty handy function.
<?php
function reg_smart_replace($pattern, $replacement, $subject, $replacementChar = "$$$", $limit = -1)
{
if (! $pattern || ! $subject || ! $replacement ) { return false; }
$replacementChar = preg_quote($replacementChar);
preg_match_all ( $pattern, $subject, $matches);
if ($limit > -1) {
foreach ($matches as $count => $value )
{
if ($count + 1 > $limit ) { unset($matches[$count]); }
}
}
foreach ($matches[0] as $match) {
$rep = ereg_replace($replacementChar, $match, $replacement);
$subject = ereg_replace($match, $rep, $subject);
}
return $subject;
}
?>
This function can turn blocks of text into clickable links or whatever. Example:
<?php
reg_smart_replace(EMAIL_REGEX, '<a href="mailto:$$$">$$$</a>', $description)
?>
will turn all email addresses into actual links.
Just substitute $$$ with the text that will be found by the regex. If you can't use $$$ then use the 4th parameter $replacementChar
[#17] ad [2009-04-01 05:18:13]
i have made up a simple function to extract a number from a string..
I am not sure how good it is, but it works.
It gets only the numbers 0-9, the "-", " ", "(", ")", "."
characters.. This is as far as I know the most widely used characters for a Phone number.
<?php
function clean_phone_number($phone) {
if (!empty($phone)) {
//var_dump($phone);
preg_match_all('/[0-9\(\)+.\- ]/s', $phone, $cleaned);
foreach($cleaned[0] as $k=>$v) {
$ready .= $v;
}
var_dump($ready);
die;
if (mb_strlen($cleaned) > 4 && mb_strlen($cleaned) <=25) {
return $cleaned;
}
else {
return false;
}
}
return false;
}
?>
[#18] royaltm75 at NOSPAM dot gmail dot com [2009-02-21 02:55:15]
The power of pregs is limited only by your *imagination* :)
I wrote this html2a() function using preg recursive match (?R) which provides quite safe and bulletproof html/xml extraction:
<?php
[^\<]*)|(?R))*)
function html2a ( $html ) {
if ( !preg_match_all( '
@
\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>
((?:(?>
\<\/\s*?\\1(?:\b[^\>]*)?\>
|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>
@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
return $html;
$i = 0;
$ret = array();
foreach ($m as $set) {
if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
$ret[] = $val;
$val = $set[1][1] < 0
? array( 'tag' => strtolower($set[4][0]) )
: array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
if ( preg_match_all( '
/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix
', isset($set[5]) && $set[2][1] < 0
? $set[5][0]
: $set[2][0]
,$attrs, PREG_SET_ORDER ) ) {
foreach ($attrs as $a) {
$val['attr'][$a[1]]=$a[count($a)-1];
}
}
$ret[] = $val;
$i = $set[0][1]+strlen( $set[0][0] );
}
$l = strlen($html);
if ( $i < $l )
if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
$ret[] = $val;
return $ret;
}
?>
Now let's try it with this example: (there are some really nasty xhtml compliant bugs, but ... we shouldn't worry)
<?php
$html = <<<EOT
some leftover text...
< DIV class=noCompliant style = "text-align:left;" >
... and some other ...
< dIv > < empty> </ empty>
<p> This is yet another text <br >
that wasn't <b>compliant</b> too... <br />
</p>
<div class="noClass" > this one is better but we don't care anyway </div ><P>
<input type= "text" name ='my "name' value = "nothin really." readonly>
end of paragraph </p> </Div> </div> some trailing text
EOT;
$a = html2a($html);
//now we will make some neat html out of it
echo a2html($a);
function a2html ( $a, $in = "" ) {
if ( is_array($a) ) {
$s = "";
foreach ($a as $t)
if ( is_array($t) ) {
$attrs="";
if ( isset($t['attr']) )
foreach( $t['attr'] as $k => $v )
$attrs.=" ${k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" );
$s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in." " ).$in."</".$t['tag'] : "/" ).">\n";
} else
$s.= $in.$t."\n";
} else {
$s = empty($a) ? "" : $in.$a."\n";
}
return $s;
}
?>
This produces:
some leftover text...
<div class="noCompliant" style="text-align:left;">
... and some other ...
<div>
<empty>
</empty>
<p>
This is yet another text
<br/>
that wasn't
<b>
compliant
</b>
too...
<br/>
</p>
<div class="noClass">
this one is better but we don't care anyway
</div>
<p>
<input type="text" name='my "name' value="nothin really." readonly="readonly"/>
end of paragraph
</p>
</div>
</div>
some trailing text
[#19] meaneye at mail dot com [2008-10-15 02:56:15]
Recently I had to write search engine in hebrew and ran into huge amount of problems. My data was stored in MySQL table with utf8_bin encoding.
So, to be able to write hebrew in utf8 table you need to do
<?php
$prepared_text = addslashes(urf8_encode($text));
?>
But then I had to find if some word exists in stored text. This is the place I got stuck. Simple preg_match would not find text since hebrew doesnt work that easy. I've tried with /u and who kows what else.
Solution was somewhat logical and simple...
<?php
$db_text = bin2hex(stripslashes(utf8_decode($db_text)));
$word = bin2hex($word);
$found = preg_match_all("/($word)+/i", $db_text, $matches);
?>
I've used preg_match_all since it returns number of occurences. So I could sort search results acording to that.
Hope someone finds this useful!
[#20] MonkeyMan [2008-10-07 01:25:53]
Here is a way to match everything on the page, performing an action for each match as you go. I had used this idiom in other languages, where its use is customary, but in PHP it seems to be not quite as common.
<?php
function custom_preg_match_all($pattern, $subject)
{
$offset = 0;
$match_count = 0;
while(preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $offset))
{
// Increment counter
$match_count++;
// Get byte offset and byte length (assuming single byte encoded)
$match_start = $matches[0][1];
$match_length = strlen(matches[0][0]);
// (Optional) Transform $matches to the format it is usually set as (without PREG_OFFSET_CAPTURE set)
foreach($matches as $k => $match) $newmatches[$k] = $match[0];
$matches = $new_matches;
// Your code here
echo "Match number $match_count, at byte offset $match_start, $match_length bytes long: ".$matches[0]."\r\n";
// Update offset to the end of the match
$offset = $match_start + $match_length;
}
return $match_count;
}
?>
Note that the offsets returned are byte values (not necessarily number of characters) so you'll have to make sure the data is single-byte encoded. (Or have a look at paolo mosna's strByte function on the strlen manual page).
I'd be interested to know how this method performs speedwise against using preg_match_all and then recursing through the results.
[#21] sledge NOSPAM [2008-06-19 13:46:28]
Perhaps you want to find the positions of all anchor tags. This will return a two dimensional array of which the starting and ending positions will be returned.
<?php
function getTagPositions($strBody)
{
define(DEBUG, false);
define(DEBUG_FILE_PREFIX, "/tmp/findlinks_");
preg_match_all("/<[^>]+>(.*)<\/[^>]+>/U", $strBody, $strTag, PREG_PATTERN_ORDER);
$intOffset = 0;
$intIndex = 0;
$intTagPositions = array();
foreach($strTag[0] as $strFullTag) {
if(DEBUG == true) {
$fhDebug = fopen(DEBUG_FILE_PREFIX.time(), "a");
fwrite($fhDebug, $fulltag."\n");
fwrite($fhDebug, "Starting position: ".strpos($strBody, $strFullTag, $intOffset)."\n");
fwrite($fhDebug, "Ending position: ".(strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag))."\n");
fwrite($fhDebug, "Length: ".strlen($strFullTag)."\n\n");
fclose($fhDebug);
}
$intTagPositions[$intIndex] = array('start' => (strpos($strBody, $strFullTag, $intOffset)), 'end' => (strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag)));
$intOffset += strlen($strFullTag);
$intIndex++;
}
return $intTagPositions;
}
$strBody = 'I have lots of <a href="http://my.site.com">links</a> on this <a href="http://my.site.com">page</a> that I want to <a href="http://my.site.com">find</a> the positions.';
$strBody = strip_tags(html_entity_decode($strBody), '<a>');
$intTagPositions = getTagPositions($strBody);
print_r($intTagPositions);
?>
[#22] spambegone at cratemedia dot com [2008-04-20 23:39:55]
I found simpleXML to be useful only in cases where the XML was extremely small, otherwise the server would run out of memory (I suspect there is a memory leak or something?). So while searching for alternative parsers, I decided to try a simpler approach. I don't know how this compares with cpu usage, but I know it works with large XML structures. This is more a manual method, but it works for me since I always know what structure of data I will be receiving.
Essentially I just preg_match() unique nodes to find the values I am looking for, or I preg_match_all to find multiple nodes. This puts the results in an array and I can then process this data as I please.
I was unhappy though, that preg_match_all() stores the data twice (requiring twice the memory), one array for all the full pattern matches, and one array for all the sub pattern matches. You could probably write your own function that overcame this. But for now this works for me, and I hope it saves someone else some time as well.
// SAMPLE XML
<RETS ReplyCode="0" ReplyText="Operation Successful">
<COUNT Records="14" />
<DELIMITER value="09" />
<COLUMNS>PropertyID</COLUMNS>
<DATA>521897</DATA>
<DATA>677208</DATA>
<DATA>686037</DATA>
</RETS>
<?PHP
// SAMPLE FUNCTION
function parse_xml($xml) {
// GET DELIMITER (single instance)
$match_res = preg_match('/<DELIMITER value ?= ?"(.*)" ?\/>/', $xml, $matches);
if(!empty($matches[1])) {
$results["delimiter"] = chr($matches[1]);
} else {
// DEFAULT DELIMITER
$results["delimiter"] = "\t";
}
unset($match_res, $matches);
// GET MULTIPLE DATA NODES (multiple instances)
$results["data_count"] = preg_match_all("/<DATA>(.*)<\/DATA>/", $xml, $matches);
// GET MATCHES OF SUB PATTERN, DISCARD THE REST
$results["data"]=$matches[1];
unset($match_res, $matches);
// UNSET XML TO SAVE MEMORY (should unset outside the function as well)
unset($xml);
// RETURN RESULTS ARRAY
return $results;
}
?>
[#23] bruha [2008-03-04 00:13:21]
To count str_length in UTF-8 string i use
$count = preg_match_all("/[[:print:]\pL]/u", $str, $pockets);
where
[:print:] - printing characters, including space
\pL - UTF-8 Letter
/u - UTF-8 string
other unicode character properties on http://www.pcre.org/pcre.txt
[#24] dolbegraeb [2008-01-28 16:30:06]
please note, that the function of "mail at SPAMBUSTER at milianw dot de" can result in invalid xhtml in some cases. think i used it in the right way but my result is sth like this:
<img src="./img.jpg" alt="nice picture" />foo foo foo foo </img>
correct me if i'm wrong.
i'll see when there's time to fix that. -.-
[#25] mr davin [2007-07-12 14:57:51]
<?php
// Returns an array of strings where the start and end are found
function findinside($start, $end, $string) {
preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $string, $m);
return $m[1];
}
$start = "mary has";
$end = "lambs.";
$string = "mary has 6 lambs. phil has 13 lambs. mary stole phil's lambs. now mary has all the lambs.";
$out = findinside($start, $end, $string);
print_r ($out);
?>
[#26] phektus at gmail dot com [2007-06-26 23:22:21]
If you'd like to include DOUBLE QUOTES on a regular expression for use with preg_match_all, try ESCAPING THRICE, as in: \\\"
For example, the pattern:
'/<table>[\s\w\/<>=\\\"]*<\/table>/'
Should be able to match:
<table>
<row>
<col align="left" valign="top">a</col>
<col align="right" valign="bottom">b</col>
</row>
</table>
.. with all there is under those table tags.
I'm not really sure why this is so, but I tried just the double quote and one or even two escape characters and it won't work. In my frustration I added another one and then it's cool.
[#27] chuckie [2006-12-06 06:20:42]
This is a function to convert byte offsets into (UTF-8) character offsets (this is reagardless of whether you use /u modifier:
<?php
function mb_preg_match_all($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = PREG_PATTERN_ORDER, $pn_offset = 0, $ps_encoding = NULL) {
// WARNING! - All this function does is to correct offsets, nothing else:
//
if (is_null($ps_encoding))
$ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match)
foreach($ha_match as &$ha_match)
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
//
// (code is independent of PREG_PATTER_ORDER / PREG_SET_ORDER)
return $ret;
}
?>
[#28] phpnet at sinful-music dot com [2006-02-20 00:53:03]
Here's some fleecy code to 1. validate RCF2822 conformity of address lists and 2. to extract the address specification (the part commonly known as 'email'). I wouldn't suggest using it for input form email checking, but it might be just what you want for other email applications. I know it can be optimized further, but that part I'll leave up to you nutcrackers. The total length of the resulting Regex is about 30000 bytes. That because it accepts comments. You can remove that by setting $cfws to $fws and it shrinks to about 6000 bytes. Conformity checking is absolutely and strictly referring to RFC2822. Have fun and email me if you have any enhancements!
<?php
function mime_extract_rfc2822_address($string)
{
//rfc2822 token setup
$crlf = "(?:\r\n)";
$wsp = "[\t ]";
$text = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
$quoted_pair = "(?:\\\\$text)";
$fws = "(?:(?:$wsp*$crlf)?$wsp+)";
$ctext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F" .
"!-'*-[\\]-\\x7F]";
$comment = "(\\((?:$fws?(?:$ctext|$quoted_pair|(?1)))*" .
"$fws?\\))";
$cfws = "(?:(?:$fws?$comment)*(?:(?:$fws?$comment)|$fws))";
//$cfws = $fws; //an alternative to comments
$atext = "[!#-'*+\\-\\/0-9=?A-Z\\^-~]";
$atom = "(?:$cfws?$atext+$cfws?)";
$dot_atom_text = "(?:$atext+(?:\\.$atext+)*)";
$dot_atom = "(?:$cfws?$dot_atom_text$cfws?)";
$qtext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!#-[\\]-\\x7F]";
$qcontent = "(?:$qtext|$quoted_pair)";
$quoted_string = "(?:$cfws?\"(?:$fws?$qcontent)*$fws?\"$cfws?)";
$dtext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!-Z\\^-\\x7F]";
$dcontent = "(?:$dtext|$quoted_pair)";
$domain_literal = "(?:$cfws?\\[(?:$fws?$dcontent)*$fws?]$cfws?)";
$domain = "(?:$dot_atom|$domain_literal)";
$local_part = "(?:$dot_atom|$quoted_string)";
$addr_spec = "($local_part@$domain)";
$display_name = "(?:(?:$atom|$quoted_string)+)";
$angle_addr = "(?:$cfws?<$addr_spec>$cfws?)";
$name_addr = "(?:$display_name?$angle_addr)";
$mailbox = "(?:$name_addr|$addr_spec)";
$mailbox_list = "(?:(?:(?:(?<=:)|,)$mailbox)+)";
$group = "(?:$display_name:(?:$mailbox_list|$cfws)?;$cfws?)";
$address = "(?:$mailbox|$group)";
$address_list = "(?:(?:^|,)$address)+";
//output length of string (just so you see how f**king long it is)
echo(strlen($address_list) . " ");
//apply expression
preg_match_all("/^$address_list$/", $string, $array, PREG_SET_ORDER);
return $array;
};
?>
[#29] mnc at u dot nu [2006-02-02 22:05:14]
PREG_OFFSET_CAPTURE always seems to provide byte offsets, rather than character position offsets, even when you are using the unicode /u modifier.