python - How to crawl a site that redirects to "/"
I am using Scrapy to crawl several websites, and the spider is not allowed to jump across domains. In this scenario, redirects make the crawler stop immediately. In most cases I know how to handle them, but this one is weird.
The culprit is: http://www.cantonsd.org/
I checked the redirect pattern with http://www.wheregoes.com/, which tells me the site redirects to "/". This prevents the spider from ever entering the parse function. How can I handle this?
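As an aside, one way to double-check the redirect chain locally instead of relying on wheregoes.com is to follow it with the third-party requests library (an assumption on my part; any HTTP client that exposes redirect history would do):

    import requests

    # Follow the redirect chain; resp.history holds the intermediate
    # 3xx responses in order.
    resp = requests.get("http://www.cantonsd.org/", allow_redirects=True)
    for hop in resp.history:
        print("%s -> %s" % (hop.status_code, hop.headers.get("Location")))
    print("final: %s %s" % (resp.status_code, resp.url))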
EDIT: Here is the code.
I invoke the spider using the APIs provided by Scrapy here: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script The only difference is that my spider is custom. It is created as follows:
    spider = DomainSimpleSpider(
        start_urls=[start_url],
        allowed_domains=[allowed_domain],
        url_id=url_id,
        cur_state=cur_state,
        state_id_url_map=id_url,
        allow=re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
        tags=('a', 'area', 'frame'),
        attrs=('href', 'src'),
        response_type_whitelist=[r"text/html", r"application/xhtml+xml", r"application/xml"],
        state_abbr=state_abbrs[cur_state]
    )
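For context, a minimal sketch of how such a spider instance might be driven by the run-from-a-script pattern on the linked documentation page, assuming the Crawler-based API of pre-1.0 Scrapy releases (DomainSimpleSpider and its constructor arguments come from my project, not from Scrapy itself):

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the Twisted reactor when the spider closes so the script exits.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)  # the DomainSimpleSpider instance built above
    crawler.start()
    log.start()
    reactor.run()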
I think the problem is that allowed_domains sees that "/" is not part of the list (which contains cantonsd.org) and shuts everything down.
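If that hypothesis is right, the offsite spider middleware is silently dropping the redirected request. A hedged sketch of one way to confirm it, by subclassing Scrapy's OffsiteMiddleware to log every request it filters (the import path is the pre-1.0 one, matching the documentation linked above; myproject.middlewares is a hypothetical module path):

    from scrapy import log
    from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware

    class LoggingOffsiteMiddleware(OffsiteMiddleware):
        """Offsite filter that logs every request it drops."""

        def should_follow(self, request, spider):
            allowed = super(LoggingOffsiteMiddleware, self).should_follow(request, spider)
            if not allowed:
                log.msg("Offsite filtered: %s" % request.url, level=log.DEBUG)
            return allowed

    # In settings.py: swap the stock middleware for the logging one.
    SPIDER_MIDDLEWARES = {
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.LoggingOffsiteMiddleware': 500,
    }

Note that the stock middleware already emits a "Filtered offsite request" message once per domain, so running the crawl with the log level set to DEBUG may be enough to confirm this before reaching for a subclass.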
I'm not reporting the full spider code because it is never invoked at all, so it can't be the problem.