For a given string (usually a paragraph), I want to replace some words/phrases, but ignore them if they happen to be surrounded by tags in some way. This also needs to be case insensitive.
As an example:
You can find a link here <a href="#">link</a> and a lot of things in different styles. Public platform can appear in bold: <b>public platform</b>, and we also have italics here too: <i>italics</i>. While I like soft pillows I am picky about soft <i>pillows</i>. While I want to find fox, I don't want foxes to show up. The text "shiny fruits" is in a span tag: one of the <span>shiny fruits</span>.
Suppose I want to replace these words:
link: appears 2 times. The first is plain text (matches), the second is inside A tags (ignored).
Public platform: the first is plain text (matches, case insensitive), the second is inside B tags (ignored).
soft pillows: 1 plain text match.
fox: 1 plain text match. Only complete words count.
fruits: the first is plain text (matches), the second is inside span tags with other text (ignored).

As background: I'm searching for phrase matches (not individual words) and linking the matches to related pages.
I want to avoid nested HTML (bold tags within links and vice versa) or other errors (e.g. the <a href="#">phrase <b>goes</a> here</b>).
I tried a few things, such as searching for a sanitized copy of the text that had the HTML content removed, and while this told me there was a match, I ran into a whole new problem of mapping it back to the original content.
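To make the mapping problem concrete, here is a small Python sketch (the sample string and variable names are made up for illustration). It strips the tags, finds the phrase in the stripped copy, and shows that the offset no longer points at the phrase in the original markup.

```python
import re

# Hypothetical sample input, not taken from the question.
original = 'See <b>bold</b> text and a link here.'

# Naive tag stripping: drop anything that looks like <...>.
stripped = re.sub(r"<[^>]+>", "", original)

pos_in_stripped = stripped.find("link")
pos_in_original = original.find("link")

# The two offsets differ by the total length of the removed tags,
# so an index found in the stripped copy cannot be reused directly.
offset_shift = pos_in_original - pos_in_stripped
slice_at_stripped_pos = original[pos_in_stripped:pos_in_stripped + len("link")]

print(pos_in_stripped, pos_in_original, repr(slice_at_stripped_pos))
```

Every tag that precedes the match shifts the offset further, which is why searching a sanitized copy tells you that a match exists but not where to put the replacement.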
I found a mention of regex negative lookahead, and after racking my brain I arrived at this regex (assuming your HTML tags are VALID and properly paired).
Example output with default parameters
Now step by step
Whole words only (for pillowS, we wouldn't need pillow).
If the match is followed by a run of \w word symbols, \s spaces or \n newlines, optionally with ending punctuation, and that run ends at the start of a tag, we don't need this match. This is the negative lookahead that begins (?![\w\n\s>$Punctuation]*?
Here we can be sure that the scan will not go into a new tag, because < is not in the described sequence ($excludeOutside variable).
The $excludeTag variable is basically the same as $excludeOutside, but applies to cases where $toReplace can be the HTML tag itself, such as a.
Please note that this code cannot replace text containing < or >, and using these symbols may cause unexpected behavior.
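
For anyone who wants to experiment, here is a minimal Python sketch of the same negative lookahead idea. It is not the regex from this answer: the function name `replace_outside_tags`, its parameters, and the simplified character classes (`[^<]` instead of an explicit list like `$excludeOutside`, plus an extra lookahead that skips tag names and attributes) are assumptions made for illustration. It also assumes valid, properly paired tags.

```python
import re
from typing import Callable


def replace_outside_tags(source: str, phrase: str,
                         make_replacement: Callable[[str], str]) -> str:
    """Replace whole-word, case-insensitive matches of `phrase` that are
    not immediately enclosed by a tag (hypothetical helper, a sketch only)."""
    pattern = re.compile(
        r"(?<!\w)"            # not preceded by a word character
        + re.escape(phrase)   # the literal phrase
        + r"(?!\w)"           # not followed by a word character (fox vs foxes)
        + r"(?![^<>]*>)"      # not inside a tag's own markup (name, attributes)
        + r"(?![^<]*</)",     # not followed by plain text that runs into a closing tag
        re.IGNORECASE,
    )
    # A callable keeps the matched text's original casing available
    # and avoids backslash handling in replacement strings.
    return pattern.sub(lambda m: make_replacement(m.group(0)), source)


if __name__ == "__main__":
    sample = ('You can find a link here <a href="#">link</a>. '
              'Public platform can appear in bold: <b>public platform</b>. '
              'I want fox, not foxes.')
    for words in ("link", "public platform", "fox"):
        sample = replace_outside_tags(sample, words, lambda s: f"[{s}]")
    print(sample)
```

One consequence of the second lookahead is that text sitting directly inside a wrapper element, for example the last words before a closing `</p>`, is treated as enclosed and skipped. That fits the "ignore it if tags surround it" requirement, but it may be stricter than you want if the whole paragraph is wrapped in a tag.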