I've been struggling for a while now trying to get the correct regular expression for the following task:
I want to extract data from table tags in an HTML file using Python. My approach is to do the following recursively (storing the HTML lines between tags as strings):
s = "
s = re.sub('<{1}( is not '<' and is not '>').*>{1}', '', s)
My question is how to implement the part in brackets. Thanks.
I tried
import re
test_str = '<td style="color:blue">Hello</td>'
test_str = re.sub('<{1}^[<>].*>{1}','',test_str)
print(test_str)
You can see that my test string remains the same. What did I do wrong?
I expect the above code to give me test_str = "Hello</td>", which I'll feed back into this method; it then strips the "</td>", giving me "Hello".
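To negate a character class, place ^ immediately after the opening [, as in [^<>]. In your pattern the ^ sits outside the brackets, where it acts as a start-of-string anchor, so nothing can match after the leading <. Additionally, you do not need to specify {1} for characters that occur once.

However, please note that it is more appropriate to use a dedicated HTML parser like BeautifulSoup instead of regular expressions to get data from HTML.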
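
As a minimal sketch of both suggestions, the snippet below strips tags from the question's test string with a negated character class, then reads the same cell with a parser. Two things here are assumptions rather than part of the answer: the pattern uses [^<>]* in place of [^<>].*, because a greedy .* would run on to the last > and delete the text as well, and the parser is the standard library's html.parser, swapped in for BeautifulSoup only to keep the example dependency-free. The names tag_pattern, stripped and CellText are hypothetical.

```python
import re
from html.parser import HTMLParser

test_str = '<td style="color:blue">Hello</td>'

# [^<>] is a negated character class: any character except '<' or '>'.
# The {1} quantifiers are dropped because a literal already matches exactly once.
tag_pattern = re.compile(r'<[^<>]*>')
stripped = tag_pattern.sub('', test_str)
print(stripped)


class CellText(HTMLParser):
    """Hypothetical helper: collects the text found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty cell whenever a <td> opens.
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears while inside a <td>.
        if self.in_cell:
            self.cells[-1] += data


parser = CellText()
parser.feed(test_str)
parser.close()
print(parser.cells)
```

re.sub replaces every match in one call, so the opening and closing tags are both removed without feeding the result back in. The parser version keeps working when an attribute value happens to contain a > character, which the regex does not handle.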