
Extract All Html Tag Closed With A Regex Expression

I work in R, and I want to extract all closed HTML tags from a PlainTextDocument. I use gsub with a regex: gsub('',' ',fm,perl=TRUE,ignore.case=

Solution 1:

The wording of your question is unclear, and your regex doesn't make much sense, but if you just want to match anything that looks like an HTML tag, this should do it:

"<[^<>]+>"

That will match both opening and closing tags (e.g., <tag attr="value"> and </tag>). If you want to match only self-closing tags (e.g., <tag />), this should work:

"<[^<>]+/>"

Others have suggested that the slash (/) has special meaning and needs to be escaped, but that's not true. If you were using Perl, you might use this command to do the substitution:

s/<[^<>]+\/>/ /g

But the slash itself has no special meaning; I only had to escape it because I used it as the regex delimiter. I could just as easily use a different delimiter:

s~<[^<>]+/>~ ~g

But R doesn't support regexes at the language level like Perl does; the regex and the replacement are written in the form of string literals, just like they are (for example) in Java and C#. And unlike PHP, it doesn't require you to add delimiters anyway, as in:

preg_replace("/<[^<>]+\/>/", " ", $subject)

But even PHP allows you to choose your own delimiter:

preg_replace('~<[^<>]+/>~', ' ', $subject)
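
For comparison, here is a minimal sketch of how the same substitutions might look in R, where the pattern is an ordinary string with no delimiters. The sample string `html` and the result variable names are made up for illustration; `perl = TRUE` mirrors the call in the question.

```r
# Hypothetical sample input; any character vector works the same way.
html <- '<p class="x">Hello<br />world</p>'

# Every tag (opening, closing, self-closing) becomes a space.
all_tags <- gsub("<[^<>]+>", " ", html, perl = TRUE)

# Only self-closing tags; the pattern is a plain string, no delimiters.
self_closing <- gsub("<[^<>]+/>", " ", html, perl = TRUE)

# Escaping the slash is allowed but changes nothing:
# "\\/" in R source is the regex \/, which matches a literal slash.
self_closing_escaped <- gsub("<[^<>]+\\/>", " ", html, perl = TRUE)

print(all_tags)
print(self_closing)
print(identical(self_closing, self_closing_escaped))
```

The escaped and unescaped patterns give identical results, which is consistent with the point that the slash has no special meaning in the regex itself.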

Before anyone calls me out on this, I know <[^<>]+> is flawed--that there is in fact no such thing as a correct regex for HTML tags. This will do in many cases, but the only truly reliable way to parse HTML is with a dedicated HTML parser.
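
As a small illustration of that flaw, the sketch below feeds the pattern an attribute value that contains a ">" character. The input string is a made-up example, not taken from the question.

```r
# Hypothetical input: the title attribute contains a ">" character.
html <- '<a title="x > y">link</a>'

# The pattern stops at the first ">", which sits inside the attribute value,
# so part of the opening tag is left behind in the output.
stripped <- gsub("<[^<>]+>", " ", html, perl = TRUE)

print(stripped)
```

The leftover fragment of the opening tag in the output is the kind of case where a dedicated HTML parser would be the safer choice.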


Solution 2:

The slash likely needs to be escaped: \\/

