Python crawler: scraping Toutiao (今日头条) videos with Selenium + PhantomJS and PyQuery

The steps for scraping Toutiao videos with Selenium + PhantomJS + PyQuery are as follows:

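Environment setup
Install the required libraries: pip install selenium pyquery
Download PhantomJS and add its directory to the PATH environment variable, or pass the path of the PhantomJS executable directly when creating the driver.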
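As a minimal sketch of the path check described above, the helper below looks for a PhantomJS executable either at an explicitly configured location or on PATH. It is written for Python 3; the function name find_phantomjs and its parameter are illustrative and do not appear in the original code.

```python
import os
import shutil


def find_phantomjs(explicit_path=None):
    """Return a usable PhantomJS executable path, or None if none is found."""
    # An explicitly configured path takes priority over the PATH lookup.
    if explicit_path:
        if os.path.isfile(explicit_path) and os.access(explicit_path, os.X_OK):
            return explicit_path
        return None
    # shutil.which searches the directories listed in PATH.
    return shutil.which("phantomjs")


if __name__ == "__main__":
    path = find_phantomjs()
    if path is None:
        print("PhantomJS not found; add it to PATH or pass its full path")
    else:
        print("Using PhantomJS at", path)
```

With Selenium 3.x, the returned path can be passed as webdriver.PhantomJS(executable_path=path); failing early with a clear message is usually easier to debug than a driver start-up error.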
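Code implementation
Initialization: the __init__ method sets up logging, reads the search keywords, opens the database connection, and creates the browser driver.
Video download: the down_video method downloads the video at a given URL to a local file.
Main scraping logic: the scrapy_date method iterates over the search URL generated for each keyword, loads the page with Selenium, parses the HTML with PyQuery, extracts the video links, and downloads them.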
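The keyword-to-URL step can be sketched on its own as follows. This is a Python 3 version (urllib.parse.quote in place of Python 2's urllib.quote); the function name build_start_urls, the example.com prefix, and the sample keywords are assumptions for illustration, since the real values come from the setting.conf file.

```python
from urllib.parse import quote


def build_start_urls(search_url_prefix, keywords):
    """Append each percent-encoded keyword to the search URL prefix."""
    urls = []
    for word in keywords:
        word = word.strip()
        if not word:
            continue  # skip empty entries such as the one after a trailing ';'
        urls.append(search_url_prefix + quote(word))
    return urls


# Hypothetical values; the real ones are read from the configuration file.
keywords = "funny cats;python tutorial;".split(";")
for url in build_start_urls("https://example.com/search/?keyword=", keywords):
    print(url)
```

Percent-encoding matters here because the keywords are typically Chinese text, which cannot be placed in a URL unescaped.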
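Notes
Make sure the PhantomJS path is correct, or that PhantomJS has been added to the system PATH.
Adjust the wait time to suit your network conditions; if the page has not finished rendering when the HTML is read, some results will be missed.
Toutiao's page structure may change, so check the CSS selectors periodically and update them when they stop matching.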
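One way to make the wait time less fragile is to poll the rendered page until the expected content appears, instead of sleeping for a fixed interval. The sketch below assumes only that the driver object has Selenium's execute_script method; wait_for_html and FakeDriver are illustrative names, and FakeDriver exists solely so the example runs without a browser.

```python
import time


def wait_for_html(driver, is_ready, timeout=10.0, interval=0.5):
    """Poll the rendered page until is_ready(html) is true.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.execute_script("return document.documentElement.outerHTML")
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page content not ready after %.1f s" % timeout)
        time.sleep(interval)


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.calls = 0

    def execute_script(self, script):
        self.calls += 1
        # Pretend the result cards only appear on the third poll.
        if self.calls >= 3:
            return '<div class="articleCard"></div>'
        return "<html></html>"


html = wait_for_html(FakeDriver(), lambda h: "articleCard" in h,
                     timeout=5, interval=0.01)
print("articleCard" in html)
```

Selenium's own WebDriverWait offers the same polling pattern for element-based conditions; the hand-rolled version here just keeps the example free of third-party imports.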
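Suggested improvements
Add exception handling so that a single failed page or download does not abort the whole run.
Consider replacing PhantomJS with a more modern browser driver such as ChromeDriver: PhantomJS is no longer maintained, and Selenium 4 no longer includes webdriver.PhantomJS.
Validate each downloaded video to confirm that the file is complete and usable.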
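The exception-handling and validation suggestions can be combined in the download step. The following Python 3 sketch writes to a temporary .part file, compares the byte count with the Content-Length header when the server sends one, and only then gives the file its final name. The names save_stream and download_video are illustrative, and a size check is only a basic integrity test, not proof that the video is playable.

```python
import io
import os
import shutil
import tempfile
import urllib.error
import urllib.request


def save_stream(stream, dest_path, expected_size=None):
    """Copy a binary stream to dest_path and verify the byte count."""
    part_path = dest_path + ".part"
    try:
        with open(part_path, "wb") as out:
            shutil.copyfileobj(stream, out)  # copies in chunks, not all at once
        actual_size = os.path.getsize(part_path)
        if actual_size == 0:
            raise IOError("downloaded file is empty")
        if expected_size is not None and actual_size != expected_size:
            raise IOError("expected %d bytes, got %d" % (expected_size, actual_size))
        os.replace(part_path, dest_path)  # only complete files get the final name
    except Exception:
        if os.path.exists(part_path):
            os.remove(part_path)  # do not leave truncated files behind
        raise
    return actual_size


def download_video(url, dest_path, timeout=30):
    """Download url to dest_path; return True on success, False on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
            expected = int(length) if length and length.isdigit() else None
            save_stream(response, dest_path, expected)
        return True
    except (urllib.error.URLError, IOError, ValueError) as exc:
        print("download failed for %s: %s" % (url, exc))
        return False


# Demo with an in-memory stream instead of a real network response.
payload = b"fake video bytes"
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.mp4")
    print(save_stream(io.BytesIO(payload), target, expected_size=len(payload)))
```

Returning False instead of raising lets the crawl loop log the failure and move on to the next video.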
Complete code example
# coding=utf-8
# Written for Python 2 (urllib.quote / urllib.urlopen) and Selenium 3.x or earlier.
# On Python 3 the equivalents are urllib.parse.quote and urllib.request.urlopen.
import os
import time
import urllib

from selenium import webdriver
import selenium.webdriver.support.ui as ui
from pyquery import PyQuery as pq

# IniFile, LogFile and mongoDB are the author's own helper modules,
# not packages from PyPI.
import IniFile
import LogFile
import mongoDB


class toutiaoSpider(object):
    def __init__(self):
        # Log file and config file live in the parent of the working directory.
        logfile = os.path.join(os.path.dirname(os.getcwd()), time.strftime('%Y-%m-%d') + '.txt')
        self.log = LogFile.LogFile(logfile)
        configfile = os.path.join(os.path.dirname(os.getcwd()), 'setting.conf')
        cf = IniFile.ConfigFile(configfile)
        webSearchUrl = cf.GetValue("toutiao", "webSearchUrl")
        self.keyword_list = cf.GetValue("section", "information_keywords").split(';')
        self.db = mongoDB.mongoDbBase()
        # One search URL per keyword.
        self.start_urls = []
        for word in self.keyword_list:
            self.start_urls.append(webSearchUrl + urllib.quote(word))
        self.driver = webdriver.PhantomJS()
        self.wait = ui.WebDriverWait(self.driver, 2)
        self.driver.maximize_window()

    def down_video(self, videourl):
        """Download the video at videourl into the 'video' directory."""
        if len(videourl) > 0:
            fileName = time.strftime('%Y%m%d%H%M%S') + '.mp4'
            u = urllib.urlopen(videourl)
            data = u.read()
            strpath = os.path.join(os.path.dirname(os.getcwd()), 'video')
            if not os.path.isdir(strpath):
                os.makedirs(strpath)
            with open(os.path.join(strpath, fileName), 'wb') as f:
                f.write(data)

    def scrapy_date(self):
        strsplit = '------------------------------------------------------------------------------------'
        try:
            for link in self.start_urls:
                self.driver.get(link)
                time.sleep(1)  # give the page time to render
                selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
                doc = pq(selenium_html)
                infoList = []
                self.log.WriteLog(strsplit)
                print(strsplit)
                # Collect the detail-page URL of every result card.
                Elements = doc('div[class="articleCard"]')
                for element in Elements.items():
                    href = element.find('a[class="link title"]').attr('href')
                    if href:
                        infoList.append('http://www.toutiao.com' + href)
                # Open each detail page and download its video, if it has one.
                for url in infoList:
                    self.driver.get(url)
                    htext = self.driver.execute_script("return document.documentElement.outerHTML")
                    dochtml = pq(htext)
                    videourl = dochtml('video[class="vjs-tech"]').find('source').attr('src')
                    if videourl:
                        self.down_video(videourl)
        finally:
            # Always shut the browser down, even if a page fails.
            self.driver.quit()


obj = toutiaoSpider()
obj.scrapy_date()
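Summary
This approach uses Selenium to simulate browser behavior and PyQuery to parse the resulting HTML in order to scrape and download Toutiao videos.
For it to keep running reliably over time, pay attention to the environment configuration, the robustness of the code, and changes in the page structure.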
