How to call phantomjs from PHP: crawling dynamic JS-generated HTML data, method two, using phantomjs


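This series of articles mainly records and explains pyspider sample code, in the hope of prompting better contributions from others.
The official site for pyspider sample code is http://demo.pyspider.org/.
There are so many samples there that it is hard to know where to start.
So I have picked out some of the more classic examples and explain them briefly, hoping this helps beginners.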

About the example:


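If some of the data or text on a page is generated by JavaScript, pyspider cannot extract it from the page directly.
pyspider fetches the page source but does not execute the JavaScript in it; handing the page to phantomjs solves the problem of executing that JavaScript.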


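Usage:

Method one: pass fetch_type="js" to the self.crawl function so that phantomjs is called to execute the JavaScript.

Method two: add the decorator @config(fetch_type="js") to the callback function.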
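The difference between the two methods can be sketched with a small toy model. The `config` decorator and the `build_task` helper below are simplified stand-ins written for illustration; they are assumptions and not pyspider's own code. The sketch assumes that a decorator stores default options on the callback and that options passed on an individual call take precedence.

```python
# Toy model only: these stand-ins imitate the idea, they are not pyspider's code.

def config(**options):
    """Attach default crawl options to a callback (stand-in for @config)."""
    def wrapper(func):
        func._config = dict(options)
        return func
    return wrapper


def build_task(url, callback, **kwargs):
    """Merge per-call options with the defaults stored on the callback."""
    task = dict(kwargs)
    for key, value in getattr(callback, "_config", {}).items():
        task.setdefault(key, value)  # explicit per-call options win
    task["url"] = url
    task["callback"] = callback.__name__
    return task


def plain_page(response):
    return None


@config(fetch_type="js")
def js_page(response):
    return None


# Method one: pass fetch_type="js" on the individual call.
task_one = build_task("http://example.com/a", plain_page, fetch_type="js")
# Method two: the decorator supplies fetch_type="js" for every call that uses js_page.
task_two = build_task("http://example.com/b", js_page)
print(task_one["fetch_type"], task_two["fetch_type"])
```

Method one suits a single request that needs rendering, while method two keeps the setting next to the callback so every request routed to it is rendered the same way.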

Example code:

1. Example for the www.sciencedirect.com website

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Created on 2014-10-31 13:05:52

import re

# Newer pyspider releases use: from pyspider.libs.base_handler import *
from libs.base_handler import *


class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    # Project-wide defaults applied to every request.
    crawl_config = {
        "headers": {
            "User-Agent": "BaiDu_Spider",
        },
        "timeout": 300,
        "connect_timeout": 100
    }

    def on_start(self):
        # Seed the crawl with two article pages and one journal index page.
        self.crawl('http://www.sciencedirect.com/science/article/pii/S1568494612005741', timeout=300, connect_timeout=100,
                   callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/article/pii/S0167739X12000581', timeout=300, connect_timeout=100,
                   age=0, callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/journal/09659978', timeout=300, connect_timeout=100,
                   age=0, callback=self.index_page)

    # Pages handled by this callback are fetched through phantomjs.
    @config(fetch_type="js")
    def index_page(self, response):
        for each in response.doc('a').items():
            url = each.attr.href
            # Follow only links that point at article pages.
            if url is not None and re.match(r'http://www.sciencedirect.com/science/article/pii/\w+$', url):
                self.crawl(url, callback=self.detail_page, timeout=300, connect_timeout=100)

    @config(fetch_type="js")
    def detail_page(self, response):
        # Queue any article links found on this page as well.
        self.index_page(response)
        # Follow the first related-article link.
        self.crawl(response.doc('#relArtList > li > .cLink').attr.href, callback=self.index_page, timeout=300, connect_timeout=100)
        return {
            "url": response.url,
            "title": response.doc('.svTitle').text(),
            "authors": [x.text() for x in response.doc('.authorName').items()],
            "abstract": response.doc('.svAbstract > p').text(),
            "keywords": [x.text() for x in response.doc('.keyword span').items()],
        }
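The link filter in index_page can be tried on its own, without running a crawl. The sketch below uses only Python's standard `re` module; the helper name `is_article_url` and the sample list are hypothetical, chosen for illustration. The pattern is written as a raw string with the dots escaped, which is a slight tightening of the pattern in the handler.

```python
import re

# Same idea as the pattern in index_page; the raw string keeps "\w" intact
# and the escaped dots match literal dots only.
ARTICLE_URL = re.compile(r"http://www\.sciencedirect\.com/science/article/pii/\w+$")


def is_article_url(url):
    """Return True when url is a string that looks like an article page."""
    return url is not None and ARTICLE_URL.match(url) is not None


candidates = [
    "http://www.sciencedirect.com/science/article/pii/S1568494612005741",
    "http://www.sciencedirect.com/science/journal/09659978",
    None,  # a link without an href attribute yields None
]
article_urls = [u for u in candidates if is_article_url(u)]
print(article_urls)
```

Checking for None before matching matters because `re.match` raises a TypeError when given None instead of a string.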

When content does not show up in pyspider, use the phantomjs approach directly when writing the crawler script.
