
Build Regex To Find And Replace Invalid Html Attributes

The sad truth about this post is that I have poor regex skills. I recently came across some code in an old project that I seriously want to do something about. Here it is: strDoc

Solution 1:

I think it's better not to cram everything into a single mega-regex. I'd prefer several steps:

  1. Identify tag: <([^>]+)/?>
  2. Walk through the tag string and replace malformed attributes one at a time: replace the pattern \s+([\w]+)\s*=\s*(['"]?)(\S+)(\2) with $1="$3" (with a space after the last quote). I think .NET lets you track the boundaries of each match (Match.Index and Match.Length), which can help avoid re-searching the part of the tag that is already corrected.
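
The two steps above could be sketched roughly as follows. This is only a sketch under some assumptions: the module and function names (TwoStepSketch, NormalizeAttributes, FixTag) are made up for illustration, and the attribute pattern is a stricter variant rather than the one quoted in step 2, because \S+ cannot span a quoted value that contains spaces. It always emits double quotes and does not handle a > or a double quote inside a value.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TwoStepSketch
    ' Step 1: find each opening tag. Group 1 is the tag name, group 2 the raw attribute text.
    Private ReadOnly TagPattern As New Regex("<(\w+)([^>]*)>")

    ' Step 2: one attribute whose value is double-quoted, single-quoted, or bare.
    ' (Hypothetical stricter pattern, not the one quoted in the answer.)
    Private ReadOnly AttributePattern As New Regex("\s+(\w+)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))")

    Public Function NormalizeAttributes(ByVal html As String) As String
        ' Only text inside a matched tag is ever passed to the attribute pattern.
        Return TagPattern.Replace(html, AddressOf FixTag)
    End Function

    Private Function FixTag(ByVal tag As Match) As String
        ' Exactly one of groups 2-4 takes part in each match; the others expand to "".
        Dim fixedAttributes As String = AttributePattern.Replace(tag.Groups(2).Value, " $1=""$2$3$4""")
        Return "<" & tag.Groups(1).Value & fixedAttributes & ">"
    End Function

    Sub Main()
        ' Prints: <font color="red" size="4">text</font>
        Console.WriteLine(NormalizeAttributes("<font color=red size=4>text</font>"))
    End Sub
End Module
```

Because the attribute pattern only ever sees the text captured by the tag pattern, stray = signs in the content between tags are left alone.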

Solution 2:

Drop the word 'attribute', i.e.

Dim test As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"

This finds every ='something' string, which is fine as long as the pages contain no other code, e.g. JavaScript.
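
To make the JavaScript caveat concrete, here is a small sketch that runs the pattern above over a fragment containing a script block. The module name, the FindValues helper and the sample markup are invented for the illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module UnanchoredPatternSketch
    ' The pattern from this answer: an equals sign followed by a quoted or bare value.
    Private Const ValuePattern As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"

    ' Hypothetical helper: returns the captured value of every match in the text.
    Public Function FindValues(ByVal text As String) As String()
        Dim matches As MatchCollection = Regex.Matches(text, ValuePattern)
        Dim values(matches.Count - 1) As String
        For i As Integer = 0 To matches.Count - 1
            values(i) = matches(i).Groups("attribute").Value
        Next
        Return values
    End Function

    Sub Main()
        ' Prints 100, left and y: two real attribute values, then the JavaScript assignment.
        For Each value As String In FindValues("<td width=100 align='left'><script>x='y';</script>")
            Console.WriteLine(value)
        Next
    End Sub
End Module
```

The third hit comes from the script text rather than from an attribute, which is exactly the case the pattern cannot tell apart without first locating the tags.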

Solution 3:

I had trouble with the final update (8/21/09): it would replace

<font color=red size=4>

with

<font color="red" size="4>"

(placing the closing quote of the second attribute outside the tag's closing bracket)

I changed the attributes string in EvaluateTag to:

Dim attributes As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>|\s]+))"

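The change is the [^>|\s] class near the end, which stops an unquoted value at > or whitespace. (Inside a character class the | is a literal pipe, not alternation, so [^>\s] is sufficient.)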

This returns my desired results of: <font color="red" size="4">

It works on my exhaustive testcase of one.
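
For anyone who wants to see the difference side by side, here is a small sketch. It assumes the earlier version used \S+ for the unquoted branch; that is a guess based on the output I was getting, not a copy of the original code. The module and function names are invented for the illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BareValueComparison
    ' Assumed earlier form: the bare-value branch is \S+, which also swallows the closing ">".
    Public Const GreedyPattern As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"

    ' Bounded form: a bare value stops at ">" or whitespace.
    Public Const BoundedPattern As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"

    Public Function QuoteValues(ByVal tag As String, ByVal pattern As String) As String
        ' $2 is the named group g1, because .NET numbers named groups after unnamed ones.
        Return Regex.Replace(tag, pattern, "=""$2""")
    End Function

    Sub Main()
        Dim tag As String = "<font color=red size=4>"
        ' Prints: <font color="red" size="4>"
        Console.WriteLine(QuoteValues(tag, GreedyPattern))
        ' Prints: <font color="red" size="4">
        Console.WriteLine(QuoteValues(tag, BoundedPattern))
    End Sub
End Module
```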

Solution 4:

Here is the final product. I hope this helps somebody!

Imports System
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"">Some stuff""""""in between tags==="""" that could be there</tag>" & _
            "<sometag border=2 width=""100%"" /><another that=""is"" completely=""normal"">with some content, of course</another>"

        Console.WriteLine(ConvertMarkupAttributeQuoteType(input, "'"))
        Console.ReadKey()
    End Sub

    ' Finds each tag and hands it to EvaluateTag, so text between tags is never touched.
    Public Function ConvertMarkupAttributeQuoteType(ByVal html As String, ByVal quoteChar As String) As String
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) EvaluateTag(m, quoteChar)))
    End Function

    ' Rewrites every attribute value in one tag so that it is wrapped in quoteChar.
    Private Function EvaluateTag(ByVal match As Match, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        ' $2 refers to the named group g1: .NET numbers named groups after unnamed ones.
        Return Regex.Replace(match.Value, attributes, String.Format("={0}$2{0}", quoteChar))
    End Function

End Module

I felt it was best to keep the tag finder and the attribute-fixing regex separate from each other, in case I want to change how either of them works in the future. Thanks for all your input.
