python - BeautifulSoup doesn't find correctly parsed elements
I'm using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon a bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors: multiple `<html></html>` tags, a `<title>` outside `<head>`, etc.

However, html5lib usually works well in these cases. In fact, when I do:

```python
soup = BeautifulSoup(document, "html5lib")
```

and pretty-print `soup`, I see the following output: http://pastebin.com/8bkapx88

which contains a lot of `<a>` tags.

However, `soup.find_all("a")` returns an empty list. With `lxml` I get the same.

So: has anybody stumbled upon this problem before? What is going on? How do I get the links that html5lib found but isn't returning with `find_all()`?
When it comes to parsing not-well-formed and tricky HTML, the parser choice is very important:
There are differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.
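This is easy to see on a tiny fragment. The snippet below is a sketch (the malformed fragment and variable names are made up for illustration); `html.parser` ships with Python, while `html5lib` and `lxml` are optional third-party parsers, so the loop skips any that aren't installed:

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: an unclosed <a> and an implicit second <p>.
broken = "<p><a href='/x'>link<p>more"

for parser in ("html.parser", "html5lib", "lxml"):
    try:
        tree = BeautifulSoup(broken, parser)
    except Exception as exc:  # parser library not installed
        print(parser, "-> unavailable:", exc)
        continue
    # Each parser repairs the fragment differently, producing a different tree.
    print(parser, "->", tree.decode())
```

Running this side by side shows each parser "fixing" the broken nesting in its own way, which is exactly why the same `find_all` call can behave differently depending on the parser.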
`html.parser` worked for me:
```python
from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
```

Demo:
```python
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
```

See also:
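If you regularly feed dirty HTML to Beautiful Soup, one pragmatic pattern is to try several parsers and keep the first tree that actually contains the elements you expect. A minimal sketch, assuming a helper name `best_soup` of my own invention (not part of the Beautiful Soup API) and treating "no `<a>` tags at all" as a failed parse:

```python
from bs4 import BeautifulSoup

def best_soup(document, parsers=("html.parser", "lxml", "html5lib")):
    # Hypothetical helper: try each parser in turn and return the
    # first tree that contains <a> tags.
    soup = None
    for parser in parsers:
        try:
            soup = BeautifulSoup(document, parser)
        except Exception:
            continue  # that parser library isn't installed
        if soup.find_all("a"):
            return soup
    return soup  # last successful parse, even if it found no links

html = "<html><title>bad</title><html><body><a href='/a'>link</a></body>"
print(len(best_soup(html).find_all("a")))
```

The "contains `<a>` tags" test is specific to this question; in general you would check for whatever element matters to your scraper.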