How to Process HTML Files with Regular Expressions in Python

PHP中文网 • February 26, 2025, 17:54:33 • Programming

Processing HTML files with regular expressions in Python

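The finditer method finds every match in a string. If you have already used findall, you know that it returns a list made up of the matching strings. finditer instead returns an iterator that yields one match object per match, in the order the matches occur. These match objects can be visited with a for loop, which is why group 1 can be printed in the code below.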
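The difference between the two methods is easiest to see side by side. The following is a minimal sketch: the sample line is made up for illustration, and the pattern simply mirrors the `testRE` pattern used in the starter script.

```python
import re

# Hypothetical sample line; the pattern mirrors testRE from the starter script.
line = "Logic programming with SICStus and more logic"
pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns a list of strings (the text of group 1 for each match).
print(pattern.findall(line))

# finditer yields match objects, so both positions and groups are available.
for m in pattern.finditer(line):
    print(m.start(), m.group(1))
```

Because finditer hands back match objects rather than plain strings, it is the better choice whenever more than one group, or the position of a match, is needed.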

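You need to write Python regular expressions that recognize particular patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables), applies these REs to each line of the file, and prints out any matches found.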

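1. Write a pattern that recognizes HTML tags, and print each one as "TAG:TAG string" (e.g. "TAG:b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets of each tag (< and >) always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try this out, and work out why it is not a good solution. Then write a better solution that fixes the problem.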

2. Modify your code so that it distinguishes opening and closing tags (e.g. <p> vs. </p>), printing OPENTAG and CLOSETAG.
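The two tasks above can be tried out on a single line before touching the data file. This is only a sketch under assumptions: the sample line is invented, and the `tag` pattern here is one possible way to separate the optional "/" from the tag label, not necessarily the form the exercise expects.

```python
import re

line = "<p>Some <b>bold</b> text</p>"  # hypothetical sample line

# Greedy: '.*' runs on to the LAST '>' in the line, swallowing several tags.
greedy = re.compile(r"<.*>")
print(greedy.findall(line))  # a single match that spans the whole line

# Negated character class: nothing inside a tag may be '>'.
# Group 1 captures the optional '/', group 2 the tag label.
tag = re.compile(r"<(/?)(\w+)[^>]*>")
for m in tag.finditer(line):
    kind = "CLOSETAG" if m.group(1) else "OPENTAG"
    print(f"{kind}: {m.group(2)}")
```

The greedy version fails because "." also matches ">", so the match only stops at the final bracket on the line. Excluding ">" from the tag body (or using the non-greedy ".*?") keeps each match to a single tag.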


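    import re

    #------------------------------

    testRE = re.compile('(logic|sicstus)', re.I)

    # Tag patterns (assumed forms; each '<...>' stays within one tag)
    testI = re.compile(r'<.*?>', re.I)       # any tag, non-greedy
    testO = re.compile(r'<[^/>][^>]*>')      # opening tag: no '/' after '<'
    testC = re.compile(r'</(\S*?)[^>]*>')    # closing tag: '/' after '<'

    with open('RGX_DATA.html') as infs:
        linenum = 0
        for line in infs:
            linenum += 1
            if line.strip() == '':
                continue
            print('\n  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

            # search reports only the first match in the line
            m = testRE.search(line)
            if m:
                print('** TEST-RE:', m.group(1))

            # finditer reports every match in the line
            mm = testRE.finditer(line)
            for m in mm:
                print('** TEST-RE:', m.group(1))

            # Print each tag with its angle brackets removed
            index = testI.finditer(line)
            for i in index:
                print('Tag:', i.group().replace('<', '').replace('>', ''))

            open1 = testO.finditer(line)
            for m in open1:
                print('opening:', m.group().replace('<', '').replace('>', ''))

            close1 = testC.finditer(line)
            for n in close1:
                print('closing:', n.group().replace('<', '').replace('>', ''))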


Note that some HTML tags have parameters, for example:

    <table border=1 cellspacing=0 cellpadding=8>

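Make sure that your patterns find and print tag labels correctly for tags both with and without parameters. Now extend your code so that, for open tags, it prints both the tag label and the parameters, e.g.: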

    OPENTAG: table
    PARAM: border=1
    PARAM: cellspacing=0
    PARAM: cellpadding=8

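    open1 = testO.finditer(line)
    for m in open1:
        # print('opening:', m.group().replace('<', '').replace('>', ''))
        # Drop the angle brackets, then split on whitespace:
        # the first item is the tag label, the rest are parameters
        firstm = m.group().replace('<', '').replace('>', '').split()
        num = 0
        for otherm in firstm:
            if num == 0:
                print('opening:', otherm)
            else:
                print('param:', otherm)
            num += 1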


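In regular expressions, a backreference indicates that the substring matched by an earlier part of the regex must appear again. A backreference has the form \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the backreference. It would match the string "the the", for example, where "the" appears twice. Using a backreference, write a pattern that matches when a line contains a paired open and close tag, e.g. <b>in bold</b>.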
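A short sketch may help make backreferences concrete. The sample strings are invented, the first pattern adds `\b` word-boundary anchors as a small variation on the regex quoted above, and the `pair` pattern is just one way of applying the same idea to tags.

```python
import re

# \1 must repeat exactly the text that group 1 matched.
doubled = re.compile(r"\b(\w+) \1\b")
print(doubled.search("Paris in the the spring").group(1))

# Same idea for tags: the close tag must carry the label captured by group 1.
pair = re.compile(r"<(\w+)[^>]*>(.*?)</\1\s*>")
line = "<b>in bold</b> and <i>in italics</i>"  # hypothetical sample line
for m in pair.finditer(line):
    print(m.group(1), "->", m.group(2))
```

The non-greedy `(.*?)` between the tags matters here: with a greedy `.*`, two pairs of the same tag on one line would be merged into a single match.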

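Consider that we might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We will not attempt that here, and instead consider a simpler case: removing any HTML tags that we find in each line of the input data file.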

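If you have already defined an RE that recognizes HTML tags, you should be able to use it to produce the text with the tags removed, and print it to the screen labelled STRIPPED.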
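As a rough illustration of tag stripping, a compiled pattern's `sub` method can replace every tag with the empty string. The sample line and the exact tag pattern below are assumptions made for this sketch.

```python
import re

# One possible tag pattern: optional '/', a label, then anything up to '>'.
tag = re.compile(r"</?\w+[^>]*>")

line = '<p>See the <a href="notes.html">course notes</a>.</p>'  # hypothetical
stripped = tag.sub("", line)
print("STRIPPED:", stripped)
```

By default `sub` replaces every non-overlapping match, so no loop is needed to remove all the tags on a line.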

    import re

    #------------------------------
    # PART 1:

    # Key thing is to avoid matching strings that include
    # multiple tags, e.g. treating '<p><b>' as a single
    # tag. Can do this in several ways. Firstly, use
    # non-greedy matching, so get shortest possible match
    # including the two angle brackets:

    tag = re.compile('</?(.*?)>')

    # The above treats the '/' of a close tag as a separate
    # optional component - so that this doesn't turn up as
    # part of the match '.group(1)', which is meant to return
    # the tag label.

    # Following alternative solution uses a negated character
    # class to explicitly prevent this including '>':

    tag = re.compile('</?([^>]+)>')

    # Finally, following version separates finding the tag
    # label string from any (optional) parameters that might
    # also appear before the close angle bracket:

    tag = re.compile(r'</?(\w+\b)([^>]+)?>')

    # Note that use of '\b' (as word boundary anchor) here means
    # we must mark the regex string as a 'raw' string (r'..').

    #------------------------------
    # PART 2:

    # Following closeTag definition requires the first char
    # after the open angle bracket to be '/', while openTag
    # definition excludes this by requiring first char to be
    # a 'word char' (\w):

    openTag  = re.compile(r'<(\w[^>]*)>')
    closeTag = re.compile(r'</([^>]*)>')

    # Following revised definitions are more carefully stated
    # for correct extraction of tag label (separately from
    # any parameters):

    openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
    closeTag = re.compile(r'</(\w+\b)\s*>')

    #------------------------------
    # PART 3:

    # Above openTag definition will already get the string
    # encompassing any parameters, and return it as
    # m.group(2), i.e. defn:
    #     openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
    # If assume that parameters are continuous non-whitespace
    # chars separated by whitespace chars, then we can divide
    # them up using split - and that's how we handle them
    # here. (In reality, parameter strings can be a lot more
    # messy than this, but we won't try to deal with that.)

    #------------------------------
    # PART 4:

    openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

    # Note use of non-greedy matching for the text falling
    # *between* the open/close tag pair - to avoid false
    # results where have two similar tag pairs on same line.

    #------------------------------
    # PART 5: URLS

    # This is quite tricky. The URL expressions in the file
    # are of two kinds, of which the first is a string
    # between double quotes ("..") which may include
    # whitespace. For this case we might have a regex:

    url = re.compile(r'href=("[^">]+")', re.I)

    # The second case does not have quotes, and does not
    # allow whitespace, consisting of a continuous sequence
    # of non-whitespace material (that ends when you reach a
    # space or close bracket '>'). This might be:

    url = re.compile(r'href=([^">\s]+)', re.I)

    # We can combine these two cases as follows, and still
    # get the expression back as group(1):

    url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

    # Note that I've done nothing here to exclude 'mailto:'
    # links as being accepted as URLS.

    #------------------------------

    with open('RGX_DATA.html') as infs:
        linenum = 0
        for line in infs:
            linenum += 1
            if line.strip() == '':
                continue
            print('\n  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

            # PART 1: find HTML tags
            # (The following uses 'finditer' to find ALL matches
            # within the line)

            mm = tag.finditer(line)
            for m in mm:
                print('** TAG:', m.group(1), ' + [%s]' % m.group(2))

            # PART 2,3: find open/close tags (+ params of open tags)

            mm = openTag.finditer(line)
            for m in mm:
                print('** OPENTAG:', m.group(1))
                if m.group(2):
                    for param in m.group(2).split():
                        print('    PARAM:', param)

            mm = closeTag.finditer(line)
            for m in mm:
                print('** CLOSETAG:', m.group(1))

            # PART 4: find open/close tag pairs appearing on same line

            mm = openCloseTagPair.finditer(line)
            for m in mm:
                print('** PAIR [%s]: "%s"' % (m.group(1), m.group(3)))

            # PART 5: find URLs:

            mm = url.finditer(line)
            for m in mm:
                print('** URL:', m.group(1))

            # PART 6: Strip out HTML tags (note that .sub will do all
            # possible substitutions, unless number is limited by count
            # keyword arg - which is fortunately what we want here)

            stripped = tag.sub('', line)
            print('** STRIPPED:', stripped, end='')
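The combined URL pattern from PART 5 can also be checked in isolation. In this small sketch the sample line is made up, with one quoted and one unquoted `href` value, and the `.strip('"')` step is an optional extra that is not part of the solution above.

```python
import re

# Combined pattern from PART 5: a quoted value (may contain spaces)
# or an unquoted run of non-whitespace characters.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

line = '<a href="my page.html">one</a> <A HREF=index.html>two</A>'  # hypothetical
urls = [m.group(1) for m in url.finditer(line)]
print(urls)

# The quoted form keeps its quotes in group(1); strip them if unwanted.
print([u.strip('"') for u in urls])
```

The `re.I` flag is what lets the same pattern accept both `href` and `HREF`.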


Published by PHP中文网. When reposting, please credit the source: https://www.chuangxiangniao.com/p/2235153.html
