python - BeautifulSoup doesn't find correctly parsed elements


I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon a very bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors: multiple <html></html> elements, a <title> outside the <head>, etc.

However, html5lib usually works well even in these cases. In fact, when I do:

    soup = BeautifulSoup(document, "html5lib")

and pretty-print soup, I see the following output: http://pastebin.com/8bkapx88

which contains a lot of <a> tags.

However, when I call soup.find_all("a"), I get an empty list. With lxml I get the same result.

So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but that find_all isn't returning?

When it comes to parsing HTML that is not well-formed and tricky, the choice of parser is very important. From the Beautiful Soup documentation:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.
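To see how the parsers differ on a given document, one option is to count the <a> tags each parser yields. The following is a minimal sketch, not part of the original answer. The helper name count_links_per_parser and the sample markup are made up for illustration, and it assumes bs4 is installed. lxml and html5lib are optional, and a parser that is missing is reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_links_per_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <a> tags found, or None if not installed}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


if __name__ == "__main__":
    # Hypothetical, slightly broken markup: the <p> is never closed.
    sample = '<p>one <a href="/x">x</a> <a href="/y">y</a>'
    for name, count in count_links_per_parser(sample).items():
        print(name, count)
```

Running the same helper on the downloaded page content shows quickly which parsers disagree on that particular document.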

html.parser worked for me:

    from bs4 import BeautifulSoup
    import requests

    # Download the raw page and parse it with Python's built-in parser
    document = requests.get('http://www.wvdnr.gov/').content
    soup = BeautifulSoup(document, "html.parser")
    print(soup.find_all('a'))

Demo:

    >>> from bs4 import BeautifulSoup
    >>> import requests
    >>> document = requests.get('http://www.wvdnr.gov/').content
    >>>
    >>> soup = BeautifulSoup(document, "html5lib")
    >>> len(soup.find_all('a'))
    0
    >>> soup = BeautifulSoup(document, "lxml")
    >>> len(soup.find_all('a'))
    0
    >>> soup = BeautifulSoup(document, "html.parser")
    >>> len(soup.find_all('a'))
    147
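Since it is hard to know in advance which parser copes with a given broken document, one possible approach is to try several in order and keep the first one that actually returns links. This is only a sketch under stated assumptions: the function name find_links_with_fallback and the parser order are hypothetical choices, not something the original answer or the library prescribes. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_links_with_fallback(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Try each parser in order; return (parser_name, hrefs) for the first
    parser that finds at least one <a> tag with an href attribute."""
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Parser not installed; move on to the next candidate.
            continue
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        if hrefs:
            return name, hrefs
    # No parser produced any links.
    return None, []
```

For the page in the question, this could be called as find_links_with_fallback(document), where document is the content fetched with requests as in the demo above.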


