Replace text in a string and ignore matches in HTML tags
P粉676821490 2024-03-27 19:23:55

For a given string (usually a paragraph), I want to replace some words/phrases, but ignore them if they happen to be surrounded by tags in some way. This also needs to be case insensitive.

As an example:

You can find a link here <a href="#">link</a> and a lot 
of things in different styles. Public platform can appear in bold: 
<b>public platform</b>, and we also have italics here too: <i>italics</i>. 
While I like soft pillows I am picky about soft <i>pillows</i>. 
While I want to find fox, I din't want foxes to show up.
The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.

Suppose I want to replace these words:

  • link: appears 2 times. The first is plain text (match), the second is inside an A tag (ignore).
  • Public platform: appears 2 times. The first is plain text (match, case insensitive), the second is inside a B tag (ignore).
  • soft pillows: 1 plain-text match (the second "pillows" is inside an I tag).
  • fox: 1 plain-text match. Only whole words count, so "foxes" is ignored.
  • fruits: appears 2 times. The first is plain text (match), the second is inside a span tag together with other text (ignore).

As background; I'm searching for phrase matches (not individual words) and linking the matches to related pages.

I want to avoid producing nested HTML (bold tags inside links and vice versa) or other errors, such as overlapping tags (e.g.: the <a href="#">phrase <b>goes</a> here</b>).

I tried a few things, such as searching a sanitized copy of the text with the HTML removed. While this told me there was a match, I ran into a whole new problem: mapping the match back to the original content.


reply all(1)
P粉594941301

I found a mention of regex negative lookahead, and after racking my brain I came up with this regex (it assumes your HTML is VALID, with properly paired tags):

// The function is deliberately a bit ugly, to show how the pattern comes together.
// (Written as a plain function so it runs on its own; add "public" back inside a class.)
function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
{
  // Default sample text; the markup on the "inner inner text" line is a reconstruction.
  $string = $sourceText ?? "Inner text
  You can find a link here <a href=\"#\">link</a> and a lot 
  of things in different styles. Public platform can appear in bold: 
  <b>public platform</b>, and we also have italics here too: <i>italics</i>. 
  While I like soft pillows I am picky about soft <i>pillows</i>. 
  While I want to find fox, I din't want foxes to show up.
  The text \"shiny fruits\" is in a span tag:  one of the <span>shiny fruits</span>.
  The inner text like this <p>inner <span>inner text</span> </p> here to test too, event inner text
  omg thats sad... or not
  ";
  // it would be nice to use [[:punct:]], but that class also treats < and > as punctuation marks
  $punctuation = "\.,!\?:;\|\/=\"#"; // this part might need additional attention, but you get the point
  $stringPart = "\b" . preg_quote($toReplace, "/") . "\b"; // whole words only, phrase escaped for the regex
  $excludeSequence = "(?![\w\n\s>$punctuation]*?";
  $excludeOutside = "$excludeSequence<\/)"; // not followed by a closing tag; note the closing )
  $excludeTag = "$excludeSequence>)"; // not inside a tag itself; note the closing )
  $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";

  return preg_replace($pattern, $dummyText, $string);
}

Example output with default parameters

"""
     (REPLACED TEXT HERE)\r\n
     You can find a link here <a href="#">link</a> and a lot \r\n
     of things in different styles. Public platform can appear in bold: \r\n
     <b>public platform</b>, and we also have italics here too: <i>italics</i>. \r\n
     While I like soft pillows I am picky about soft <i>pillows</i>. \r\n
     While I want to find fox, I din't want foxes to show up.\r\n
     The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.\r\n
     The (REPLACED TEXT HERE) like this <p>inner <span>inner text</span> </p> here to test too, event (REPLACED TEXT HERE)\r\n
     omg thats sad... or not     
     """

Now step by step

  1. \b$toReplace\b: no partial matches (if the text only contains "pillows", we don't want to match "pillow").
  2. $excludeOutside: if the match is followed by any run of \w word characters, \s whitespace, \n newlines or punctuation that ends in a closing tag </, we don't want that match. This is the negative lookahead (?![\w\n\s>$punctuation]*?<\/). We can be sure the run will not cross into a new tag, because < is not part of the described sequence.
  3. The $excludeTag variable is basically the same as $excludeOutside, but covers the case where $toReplace could be part of an HTML tag itself, such as a.
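The steps above can be checked against a few phrases from the question. This is a minimal sketch, not the answer's exact function: `replaceOutsideTags` is a hypothetical compact restatement of the same pattern, the bracketed replacement text is only a visible marker, and the `<img>` case is an added illustration of the "inside a tag" lookahead.

```php
<?php
// Compact restatement of the pattern so this snippet runs on its own.
function replaceOutsideTags(string $text, string $phrase, string $replacement): string
{
    $run = '[\w\s>\.,!\?:;\|\/="#]*?';                 // what may sit between the match and a tag
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b' // whole words only
             . '(?!' . $run . '<\/)'                   // not followed by a closing tag
             . '(?!' . $run . '>)'                     // not inside a tag itself
             . '/i';
    return preg_replace($pattern, $replacement, $text);
}

$text = 'You can find a link here <a href="#">link</a>. '
      . 'Public platform can appear in bold: <b>public platform</b>. '
      . "While I want to find fox, I din't want foxes to show up.";

foreach (['link', 'public platform', 'fox'] as $phrase) {
    $text = replaceOutsideTags($text, $phrase, '[' . strtoupper($phrase) . ']');
}
echo $text, PHP_EOL;
// You can find a [LINK] here <a href="#">link</a>. [PUBLIC PLATFORM] can appear in bold:
// <b>public platform</b>. While I want to find [FOX], I din't want foxes to show up.
// (printed as a single line)

// The phrase sits in an attribute here, so the last lookahead leaves it alone.
echo replaceOutsideTags('<img alt="fox"> end', 'fox', '[FOX]'), PHP_EOL;
// <img alt="fox"> end
```

Only the plain-text occurrences change: the tagged ones are followed by a closing tag, and "foxes" fails the word-boundary check.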
Please note that this code cannot handle text that itself contains < or >; using these symbols may cause unexpected behavior.
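If stray > characters in the text are a concern, one alternative is to split the string into tag and text tokens and only replace in text that is not inside any element. The following is a rough sketch under several assumptions, not part of the original answer: `replaceInTopLevelText` and the `/wiki/fox` URL are hypothetical, the list of void elements is partial, HTML comments are treated as ordinary text, and it needs PHP 7.4+ for arrow functions. Like the regex above, it expects properly paired tags.

```php
<?php
// Hypothetical helper: applies $replace to whole-word, case-insensitive
// matches of $phrase that are not inside any element.
function replaceInTopLevelText(string $html, string $phrase, callable $replace): string
{
    // With DELIM_CAPTURE, even-indexed tokens are text and odd-indexed ones are tags.
    $tokens  = preg_split('/(<\/?[a-z][^<>]*>)/i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    $void    = ['br', 'hr', 'img', 'input', 'wbr']; // partial list of elements without a closing tag
    $depth   = 0;
    $out     = '';

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m);
            if ($m[1] === '/') {
                $depth = max(0, $depth - 1); // closing tag
            } elseif (!in_array(strtolower($m[2]), $void, true) && substr($token, -2) !== '/>') {
                $depth++;                    // opening tag
            }
            $out .= $token;
        } elseif ($depth === 0) {
            // Text outside every element: safe to replace.
            $out .= preg_replace_callback($pattern, fn ($match) => $replace($match[0]), $token);
        } else {
            $out .= $token;                  // text inside an element stays untouched
        }
    }
    return $out;
}

$link   = fn (string $word) => '<a href="/wiki/fox">' . $word . '</a>';
$result = replaceInTopLevelText('A fox > a <b>fox</b>, but no foxes.', 'fox', $link);
echo $result, PHP_EOL;
// A <a href="/wiki/fox">fox</a> > a <b>fox</b>, but no foxes.
```

With the lookahead pattern, the first "fox" in this input would be skipped because a > follows it. The trade-off is that text wrapped in any element, including a plain paragraph tag, is never replaced.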