python - How to crawl a site that redirects to "/"
I am using Scrapy to crawl several websites, and the spider is not allowed to jump across domains. In this scenario, redirects make the crawler stop immediately. In most cases I know how to handle them, but this one is weird.
The culprit is: http://www.cantonsd.org/
I checked the redirect pattern with http://www.wheregoes.com/, which tells me the site redirects to "/". This prevents the spider from ever entering the parse function. How can I handle this?
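As an aside, one way to double-check the redirect chain locally instead of relying on wheregoes.com is to follow it with the third-party requests library (an assumption on my part; any HTTP client that exposes redirect history would do):

    import requests

    # Follow the redirect chain; resp.history holds the intermediate
    # 3xx responses in order.
    resp = requests.get("http://www.cantonsd.org/", allow_redirects=True)
    for hop in resp.history:
        print("%s -> %s" % (hop.status_code, hop.headers.get("Location")))
    print("final: %s %s" % (resp.status_code, resp.url))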
EDIT: Here is the code.
I invoke the spider using the APIs provided by Scrapy here: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script The only difference is that my spider is custom. It is created as follows:
    spider = DomainSimpleSpider(
        start_urls=[start_url],
        allowed_domains=[allowed_domain],
        url_id=url_id,
        cur_state=cur_state,
        state_id_url_map=id_url,
        allow=re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
        tags=('a', 'area', 'frame'),
        attrs=('href', 'src'),
        response_type_whitelist=[r"text/html", r"application/xhtml+xml", r"application/xml"],
        state_abbr=state_abbrs[cur_state]
    )
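For context, a minimal sketch of how such a spider instance might be driven by the run-from-a-script pattern on the linked documentation page, assuming the Crawler-based API of pre-1.0 Scrapy releases (DomainSimpleSpider and its constructor arguments come from my project, not from Scrapy itself):

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the Twisted reactor when the spider closes so the script exits.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)  # the DomainSimpleSpider instance built above
    crawler.start()
    log.start()
    reactor.run()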
I think the problem is that allowed_domains sees that "/" is not part of the list (which contains cantonsd.org) and shuts everything down.
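If that hypothesis is right, the offsite spider middleware is silently dropping the redirected request. A hedged sketch of one way to confirm it, by subclassing Scrapy's OffsiteMiddleware to log every request it filters (the import path is the pre-1.0 one, matching the documentation linked above; myproject.middlewares is a hypothetical module path):

    from scrapy import log
    from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware

    class LoggingOffsiteMiddleware(OffsiteMiddleware):
        """Offsite filter that logs every request it drops."""

        def should_follow(self, request, spider):
            allowed = super(LoggingOffsiteMiddleware, self).should_follow(request, spider)
            if not allowed:
                log.msg("Offsite filtered: %s" % request.url, level=log.DEBUG)
            return allowed

    # In settings.py: swap the stock middleware for the logging one.
    SPIDER_MIDDLEWARES = {
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.LoggingOffsiteMiddleware': 500,
    }

Note that the stock middleware already emits a "Filtered offsite request" message once per domain, so running the crawl with the log level set to DEBUG may be enough to confirm this before reaching for a subclass.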
I'm not reporting the full spider code because it is never invoked at all, so it can't be the problem.