Skip to content Skip to sidebar Skip to footer

Get Beautifulsoup To Correctly Parse Php Tags Or Ignore Them

I currently need to parse a lot of .phtml files, get specific html tags and add a custom data attribute to them. I'm using python beautifulsoup to parse the entire document and add

Solution 1:

Your best bet is to remove all of the PHP elements before giving it to BeautifulSoup to parse. This can be done using a regular expression to spot all PHP sections and replace them with safe placeholder text.

After carrying out all of your modifications using BeautifulSoup, the PHP expressions can then be replaced.

As the PHP can be anywhere, i.e. also within a quoted string, it is best to use a simple unique string placeholder rather than trying to wrap it in an HTML comment (see php_sig).

re.sub() can be given a function. Each time the a substitution is made, the original PHP code is stored in an array (php_elements). Then the reverse is done afterwards, i.e. search for all instances of php_sig and replace them with the next element from php_elements. If all goes well, php_elements should be empty at the end, if it is not then your modifications have resulted in a place holder being removed.

from bs4 import BeautifulSoup
import re

html = """<html><body><?php$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?><header><h3><ahref="<?phpecho$viewAllUrl; ?>"class="noContentLink white"><?phpecho"{$title} ({$sideBarCoStarsCount})"; ?></a></h3></body>"""

php_sig = '!!!PHP!!!'
php_elements = []

def php_remove(m):
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    return php_elements.pop(0)

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S+re.M)

soup = BeautifulSoup(html, "html.parser")

# Make modifications to the soup
# Do not remove any elements containing PHP elements

# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())

print(html)

Post a Comment for "Get Beautifulsoup To Correctly Parse Php Tags Or Ignore Them"