python - How to crawl a site that redirects to "/"


I am using Scrapy to crawl several websites. The spider is not allowed to jump across domains. In this scenario, redirects make the crawler stop immediately. In most cases I know how to handle it, but this is a weird one.

the culprit is: http://www.cantonsd.org/

i checked redirect pattern http://www.wheregoes.com/ , tells me redirects "/". prevents spider enter parse function. how can handle this?

Edit: here is the code.

I invoke the spider using the API provided by Scrapy here: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script. The only difference is that my spider is custom. It is created as follows:

spider = domainsimplespider(
    start_urls=[start_url],
    allowed_domains=[allowed_domain],
    url_id=url_id,
    cur_state=cur_state,
    state_id_url_map=id_url,
    allow=re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
    tags=('a', 'area', 'frame'),
    attrs=('href', 'src'),
    response_type_whitelist=[r"text/html", r"application/xhtml+xml", r"application/xml"],
    state_abbr=state_abbrs[cur_state]
)

I think the problem is that allowed_domains sees that "/" is not part of the list (which contains cantonsd.org) and shuts everything down.
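That hypothesis can be sketched with a simplified stand-in for an allowed-domains check (hypothetical helper, not Scrapy's actual OffsiteMiddleware): an unresolved "/" has no hostname at all, so it can never match the allowed list, while the same target resolved against the original request URL matches fine.

```python
from urllib.parse import urlparse, urljoin

def is_offsite(url, allowed_domains):
    """Simplified allowed-domains check: a URL is offsite unless its
    hostname equals, or is a subdomain of, an allowed domain."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d)
                   for d in allowed_domains)

allowed = ["cantonsd.org"]

# Unresolved relative target: empty hostname, so it looks offsite.
print(is_offsite("/", allowed))  # True

# Resolved against the request URL, it is onsite again.
resolved = urljoin("http://www.cantonsd.org/", "/")
print(is_offsite(resolved, allowed))  # False
```

If this is what is happening, resolving the Location header against the response URL before the domain check (or inspecting redirects yourself, e.g. by disabling redirect handling for that request) would be one way around it.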

I'm not reporting the full spider code because it is never invoked at all, so it can't be the problem.
