python - BeautifulSoup doesn't find correctly parsed elements
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML
documents, and I stumbled upon a bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors: multiple <html></html> blocks, a <title>
outside <head>, etc.
However, html5lib usually works well in these cases. In fact, when I do:

soup = BeautifulSoup(document, "html5lib")

and pretty-print soup, I see the following output: http://pastebin.com/8bkapx88

which contains a lot of <a> tags.

However, soup.find_all("a") returns an empty list. With lxml I get the same.

So: has anyone stumbled on this problem before? What is going on? How do I
get the links that html5lib found but find_all isn't returning?
When it comes to parsing not-well-formed and tricky HTML, the parser choice
is important. As the Beautiful Soup documentation puts it in "Differences
between parsers":

There are differences between HTML parsers. If you give Beautiful Soup a
perfectly-formed HTML document, these differences won't matter. One parser
will be faster than another, but they'll all give you a data structure that
looks exactly like the original HTML document. But if the document is not
perfectly-formed, different parsers will give different results.
html.parser worked for me:

from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
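A possible culprit in cases like this (an assumption on my part, not something confirmed in the thread) is a version mismatch between beautifulsoup4 and the third-party parsers: stale html5lib or lxml installs have been known to produce trees that pretty-print fine yet return nothing from find_all(). Checking the installed versions is a cheap first diagnostic, sketched here with the stdlib's importlib.metadata:

```python
from importlib import metadata

def report_versions(packages=("beautifulsoup4", "html5lib", "lxml")):
    """Return one 'name version' line per package, or note it is missing."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg} {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{pkg} not installed")
    return lines

for line in report_versions():
    print(line)
```

If the versions look old, upgrading the whole stack together (beautifulsoup4 plus whichever parser you use) and re-running the demo above is a reasonable next step.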