Tutorial: Using CasperJS from Python to Fetch JavaScript-Rendered HTML

PHP中文网 • February 28, 2025, 02:37:02 • Programming

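Summary: CasperJS and Python are not directly coupled here. The approach relies on CasperJS driving PhantomJS (a headless WebKit browser) to obtain the HTML content, and Python only launches it. Crawling pages whose HTML is generated by client-side JavaScript has long been very difficult for crawlers. Java has HtmlUnit; in Python, we can use the standalone, cross-platform CasperJS.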
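To see why a plain HTTP fetch is not enough, here is a minimal sketch using only the Python standard library. The sample page, the `proxies` id, and the helper names `DivText` and `static_text` are made up for illustration. The page fills a `<div>` from a script, so a parser that does not execute JavaScript finds the `<div>` empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is injected by a script at load time.
RAW_HTML = """<html><body>
<div id="proxies"></div>
<script>document.getElementById('proxies').innerHTML = '192.0.2.1:8080';</script>
</body></html>"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="proxies"> in the static markup."""

    def __init__(self):
        super().__init__()
        self.in_target = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == 'proxies':
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.text.append(data)


def static_text(html):
    """Return the text of the target div as seen without running any script."""
    parser = DivText()
    parser.feed(html)
    parser.close()
    return ''.join(parser.text).strip()


if __name__ == '__main__':
    # The address exists only inside the script source, not inside the div.
    print(repr(static_text(RAW_HTML)))
```

A headless browser such as PhantomJS runs the script first, so the HTML it hands back already contains the injected content.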

Create site.js (the interface script; input: a URL, output: an HTML file)

    //USAGE: E:\toolkit\n1k0-casperjs-e3a77d0\bin>python casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile='temp.html'

    var fs = require('fs');
    var casper = require('casper').create({
        pageSettings: {
            loadImages: false,
            loadPlugins: false,
            userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36 LBBROWSER'
        },
        logLevel: "debug", // log level
        verbose: true      // print log messages to the console
    });
    var url = casper.cli.raw.get('url');
    var outputfile = casper.cli.raw.get('outputfile');
    // Request the page, then write the rendered HTML to the output file
    casper.start(url, function () {
        fs.write(outputfile, this.getHTML(), 'w');
    });

    casper.run();
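On the Python side, calling this script comes down to spawning a subprocess and reading the file it writes. The following is a minimal sketch, not the article's own code: the helper names `build_casper_command` and `render_page` are hypothetical, and it assumes a `casperjs` executable is reachable on `PATH`, that `site.js` is in the working directory, and that the output file is UTF-8.

```python
import subprocess


def build_casper_command(url, output_file, casperjs='casperjs', script='site.js'):
    """Build the argument list for site.js (flags match the script's CLI options)."""
    return [casperjs, script, '--url=%s' % url, '--outputfile=%s' % output_file]


def render_page(url, output_file, timeout=120, **kwargs):
    """Run CasperJS, then return the rendered HTML read back from output_file.

    Raises subprocess.CalledProcessError if CasperJS exits with a non-zero
    status and subprocess.TimeoutExpired if it runs longer than `timeout`.
    """
    cmd = build_casper_command(url, output_file, **kwargs)
    subprocess.run(cmd, check=True, timeout=timeout)
    # Assumption: the file written by the script is UTF-8 encoded.
    with open(output_file, encoding='utf-8') as f:
        return f.read()


if __name__ == '__main__':
    print(build_casper_command('http://spys.ru/free-proxy-list/IE/', 'temp.html'))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`. On Windows, where CasperJS is often started through `python casperjs` or a batch file, the `casperjs` argument may need to be adjusted accordingly.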


The Python code, checkout_proxy.py

    import json
    import sys
    #import requests
    #import requests.utils, pickle
    from bs4 import BeautifulSoup
    import os.path,os
    import threading
    #from multiprocessing import Process, Manager
    from datetime import datetime
    import traceback
    import logging
    import re,random
    import subprocess
    import shutil
    import platform

    output_file = os.path.join(os.path.dirname(os.path.realpath(__file__)),'proxy.txt')
    global_log = 'http_proxy' + datetime.now().strftime('%Y-%m-%d') + '.log'
    if not os.path.exists(os.path.join(os.path.dirname(os.path.realpath(__file__)),'logs')):
        os.mkdir(os.path.join(os.path.dirname(os.path.realpath(__file__)),'logs'))
    global_log = os.path.join(os.path.dirname(os.path.realpath(__file__)),'logs',global_log)

    logging.basicConfig(level=logging.DEBUG,format='[%(asctime)s] [%(levelname)s] [%(module)s] [%(funcName)s] [%(lineno)d] %(message)s',filename=global_log,filemode='a')
    log = logging.getLogger(__name__)
    #manager = Manager()
    #PROXY_LIST = manager.list()
    mutex = threading.Lock()   # guards PROXY_LIST across threads
    PROXY_LIST = []


    def isWindows():
        return "Windows" in str(platform.uname())


    def getTagsByAttrs(tagName,pageContent,attrName,attrRegValue):
        # Find tags in an HTML string whose attribute matches a regex
        soup = BeautifulSoup(pageContent)
        return soup.find_all(tagName, { attrName : re.compile(attrRegValue) })


    def getTagsByAttrsExt(tagName,filename,attrName,attrRegValue):
        # Same as getTagsByAttrs, but reads the HTML from a file
        if os.path.isfile(filename):
            with open(filename,'r') as f:
                soup = BeautifulSoup(f)
            return soup.find_all(tagName, { attrName : re.compile(attrRegValue) })
        else:
            return None


    class Site1Thread(threading.Thread):
        def __init__(self,outputFilePath):
            threading.Thread.__init__(self)
            self.outputFilePath = outputFilePath
            self.fileName = str(random.randint(100,1000)) + ".html"
            self.name = 'Site1Thread'

        def run(self):
            # Make sure site.js sits next to the casperjs launcher
            site1_file = os.path.join(os.path.dirname(os.path.realpath(__file__)),'site.js')
            site2_file = os.path.join(self.outputFilePath,'site.js')
            if not os.path.isfile(site2_file) and os.path.isfile(site1_file):
                shutil.copy(site1_file,site2_file)
            #proc = subprocess.Popen(["bash","-c", "cd %s && ./casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile=%s" % (self.outputFilePath,self.fileName) ],stdout=subprocess.PIPE)
            if isWindows():
                proc = subprocess.Popen(["cmd","/c", "%s/casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile=%s" % (self.outputFilePath,self.fileName) ],stdout=subprocess.PIPE)
            else:
                proc = subprocess.Popen(["bash","-c", "cd %s && ./casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile=%s" % (self.outputFilePath,self.fileName) ],stdout=subprocess.PIPE)
            out=proc.communicate()[0]
            htmlFileName = ''
            # The output path is unpredictable on Windows, so check every possible location
            if os.path.isfile(self.fileName):
                htmlFileName = self.fileName
            elif os.path.isfile(os.path.join(self.outputFilePath,self.fileName)):
                htmlFileName = os.path.join(self.outputFilePath,self.fileName)
            elif os.path.isfile(os.path.join(os.path.dirname(os.path.realpath(__file__)),self.fileName)):
                htmlFileName = os.path.join(os.path.dirname(os.path.realpath(__file__)),self.fileName)
            if (not os.path.isfile(htmlFileName)):
                print('Failed to get html content from http://spys.ru/free-proxy-list/IE/')
                print(out)
                sys.exit(3)
            with mutex:
                PROXYList= getTagsByAttrsExt('font',htmlFileName,'class','spy14$')
                for proxy in PROXYList:
                    tdContent = proxy.decode_contents()
                    # NOTE: the character class is empty in the source listing;
                    # '[<:>]' is an assumed reconstruction of the delimiters.
                    lineElems = re.split('[<:>]',tdContent)
                    if re.compile(r'\d+').search(lineElems[-1]) and re.compile(r'(\d+.\d+.\d+)').search(lineElems[0]):
                        print(lineElems[0],lineElems[-1])
                        PROXY_LIST.append("%s:%s" % (lineElems[0],lineElems[-1]))
            try:
                if os.path.isfile(htmlFileName):
                    os.remove(htmlFileName)
            except OSError:
                pass


    if __name__ == '__main__':
        try:
            if(len(sys.argv))
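The parsing and locking steps in the thread above can be sketched in isolation with only the standard library. This is an illustration under stated assumptions, not the remainder of the original script: the names `parse_proxy_cell`, `CollectorThread`, and `collect` are hypothetical, the split pattern `[<:>]` is an assumption about the intended delimiter set, and the sample cell markup is invented rather than copied from spys.ru.

```python
import re
import threading

IP_RE = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')
PORT_RE = re.compile(r'^\d+$')


def parse_proxy_cell(cell_html):
    """Return 'ip:port' for one cell's inner HTML, or None if it is not a proxy."""
    # Assumed delimiters: the IP is the first piece, the port the last one.
    parts = re.split(r'[<:>]', cell_html)
    ip, port = parts[0].strip(), parts[-1].strip()
    if IP_RE.match(ip) and PORT_RE.match(port):
        return '%s:%s' % (ip, port)
    return None


class CollectorThread(threading.Thread):
    """Parse one batch of cells and append the hits to a shared list."""

    def __init__(self, cells, results, lock):
        super().__init__()
        self.cells = cells
        self.results = results
        self.lock = lock

    def run(self):
        found = [p for p in map(parse_proxy_cell, self.cells) if p]
        # Hold the lock only while touching the shared list.
        with self.lock:
            self.results.extend(found)


def collect(cell_batches):
    """Run one thread per batch and return all parsed proxies, sorted."""
    results, lock = [], threading.Lock()
    threads = [CollectorThread(batch, results, lock) for batch in cell_batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == '__main__':
    sample = [['192.0.2.1<font class="spy2">:</font>8080', 'RU'],
              ['198.51.100.7<font class="spy2">:</font>3128']]
    print(collect(sample))
```

Using `with lock:` releases the lock even if parsing raises, which a bare `acquire()`/`release()` pair does not guarantee. Parsing outside the lock keeps the critical section short.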


Published by PHP中文网. When reposting, please credit the source: https://www.chuangxiangniao.com/p/2294226.html




