python - BeautifulSoup doesn't find correctly parsed elements
I'm using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon a bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors: multiple `<html></html>` tags, a `<title>` outside `<head>`, etc.

However, html5lib usually works well in these cases. In fact, when I do:

```python
soup = BeautifulSoup(document, "html5lib")
```

and pretty-print `soup`, I see the following output: http://pastebin.com/8bkapx88

which contains a lot of `<a>` tags.

However, `soup.find_all("a")` returns an empty list. With `lxml` I get the same.

So: has anybody stumbled upon this problem before? What is going on? How do I get the links that html5lib found but isn't returning with `find_all()`?
When it comes to parsing not-well-formed and tricky HTML, the parser choice is very important:
There are differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.
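This is easy to see on a tiny fragment. The snippet below is a sketch (the malformed fragment and variable names are made up for illustration); `html.parser` ships with Python, while `html5lib` and `lxml` are optional third-party parsers, so the loop skips any that aren't installed:

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: an unclosed <a> and an implicit second <p>.
broken = "<p><a href='/x'>link<p>more"

for parser in ("html.parser", "html5lib", "lxml"):
    try:
        tree = BeautifulSoup(broken, parser)
    except Exception as exc:  # parser library not installed
        print(parser, "-> unavailable:", exc)
        continue
    # Each parser repairs the fragment differently, producing a different tree.
    print(parser, "->", tree.decode())
```

Running this side by side shows each parser "fixing" the broken nesting in its own way, which is exactly why the same `find_all` call can behave differently depending on the parser.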
`html.parser` worked for me:
```python
from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
```

Demo:
```python
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
```

See also:
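If you regularly feed dirty HTML to Beautiful Soup, one pragmatic pattern is to try several parsers and keep the first tree that actually contains the elements you expect. A minimal sketch, assuming a helper name `best_soup` of my own invention (not part of the Beautiful Soup API) and treating "no `<a>` tags at all" as a failed parse:

```python
from bs4 import BeautifulSoup

def best_soup(document, parsers=("html.parser", "lxml", "html5lib")):
    # Hypothetical helper: try each parser in turn and return the
    # first tree that contains <a> tags.
    soup = None
    for parser in parsers:
        try:
            soup = BeautifulSoup(document, parser)
        except Exception:
            continue  # that parser library isn't installed
        if soup.find_all("a"):
            return soup
    return soup  # last successful parse, even if it found no links

html = "<html><title>bad</title><html><body><a href='/a'>link</a></body>"
print(len(best_soup(html).find_all("a")))
```

The "contains `<a>` tags" test is specific to this question; in general you would check for whatever element matters to your scraper.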