Regular expression to remove spaces inside invalid HTML tags - e.g. "</ b>" should be "</b>"
P粉884667022
2023-09-02 19:56:28
<p>I have some HTML that is messed up by spaces within tags and want to make it valid again - for example: </p>
<pre class="brush:php;toolbar:false;">< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div ></pre>
<p>This should be converted to valid HTML which, when rendered, is expected to produce:</p>
<p>
<pre class="snippet-code-html lang-html prettyprint-override"><code><div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div></code></pre>
</p>
<p>Any <code>></code> or <code><</code> in the text that is preceded or followed by spaces should remain unchanged - for example, <code>1 > 0</code> should be kept as is instead of being compressed to <code>1>0</code>.</p>
<p>I realize this may require several regular expressions, which is fine</p>
<p>Here is what I have so far:</p>
<p><code><\s?\/\s*</code> This partially fixes <code></ b></ div ></code> to <code></b></div ></code>, but I'm still working on the rest.</p>
<p>For example, I could take a more drastic approach, but that would also break the text portion between the tags, not just the tag names themselves.</p>
There is no reasonable way to salvage a document as broken as the one you posted. However, if you first replace > and similar characters in the text with their corresponding entities (e.g. &gt;), you can feed the document to a proper library such as DomDocument, which will handle the rest.
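The entity replacement mentioned above can be sketched as follows. This is only an illustration in JavaScript, not the answer's original code: the helper names `escapeText` and `escapeTextBetweenTags` are hypothetical, and the sketch assumes the tags themselves are already well-formed and that anything shaped like `<name ...>` or `</name>` is a tag.

```javascript
// Assumption: anything shaped like <name ...> or </name> counts as a tag;
// everything between two tags is treated as plain text.
const TAG = /<\/?[A-Za-z][^<>]*>/g;

// Hypothetical helper: escape angle brackets inside a run of text.
function escapeText(text) {
  return text.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: copy tags through unchanged and escape only the text between them.
function escapeTextBetweenTags(html) {
  let result = "";
  let last = 0;
  for (const match of html.matchAll(TAG)) {
    result += escapeText(html.slice(last, match.index)) + match[0];
    last = match.index + match[0].length;
  }
  return result + escapeText(html.slice(last));
}

const escaped = escapeTextBetweenTags(
  "<div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>"
);
console.log(escaped);
```

Text that itself looks like a tag (for instance a comparison written as `a <b> c`) would be misread by this sketch, which is part of why such a document cannot be repaired reliably.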
This regular expression also works:
It captures the valid parts of an HTML tag in four groups and replaces each match with just those groups, which drops the remaining parts (the spaces).
Regex101 Demo
/(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g
(<) - captures the opening angle bracket (group 1)
\s* - matches any whitespace
(\/?) - captures an optional forward slash (group 2)
\s* - matches any whitespace after the slash
([^<>]*\S) - captures the content within the tag, without trailing whitespace (group 3)
\s* - matches whitespace after the content and before the closing angle bracket
(>) - captures the closing angle bracket (group 4)
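As a quick check, the pattern described above can be applied to the sample input from the question. This is a minimal sketch: the function name `tidyTags` is hypothetical, and the replacement string `$1$2$3$4` is an assumption about how the four groups are joined back together.

```javascript
// The four capture groups hold the parts to keep; the \s* runs between them are discarded.
const TAG_WITH_SPACES = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;

// Hypothetical helper: rebuild every matched tag from its four groups only.
function tidyTags(html) {
  return html.replace(TAG_WITH_SPACES, "$1$2$3$4");
}

const broken =
  "< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >";
const tidied = tidyTags(broken);
console.log(tidied);
// <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>
```

The stray `>` characters in the text survive because every match has to start at a `<`. A literal `<` in the text followed later by a `>` would still be treated as a tag, so this only suits input like the sample.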