How to Parse HTML and XML Documents with XPath Expressions

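Preparation

This article again uses lxml to parse HTML/XML data. You can install it with pip; if you are not familiar with pip, see the earlier article "How to Manage Python Packages".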

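First, use urllib2 (a Python 2 module; in Python 3 the equivalent is urllib.request) to fetch an HTML page, for example the first-quarter 2017 trading records of 中航电子 (stock code 600372) mentioned in the previous article: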

>>> import urllib2

>>> url = 'http://quotes.money.163.com/trade/lsjysj_600372.html?year=2017&season=1'

>>> rsp = urllib2.urlopen(url).read()

Parse it with the lxml.html package:

>>> from lxml import html

>>> doc = html.document_fromstring(rsp)

>>> doc.tag

'html'

As you can see, doc now represents the html element.
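If the page at that URL is not reachable, the same parsing step can be tried offline. The sketch below assumes Python 3 with lxml installed, and uses a small made-up HTML string (PAGE is hypothetical, not the original page) in place of the downloaded response:

```python
from lxml import html

# Hypothetical stand-in for the downloaded page content.
PAGE = (
    '<html><head><title>Demo</title>'
    '<meta name="robots" content="index, follow"></head>'
    '<body><p>Hello</p></body></html>'
)

# document_fromstring() always returns the root <html> element.
doc = html.document_fromstring(PAGE)
root_tag = doc.tag

# Iterating over an element yields its child elements.
child_tags = [child.tag for child in doc]

print(root_tag, child_tags)
```

The later examples only need such a doc object, so any HTML string with a head and a body is enough to follow along.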

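What Is XPath

XPath, short for XML Path Language, is a query language for selecting nodes from an XML document; it can also be used to compute values from node content.
The XPath standard is defined by the W3C.
Although XPath has gone through several versions (v1.0 in 1999, v2.0 in 2007, v3.0 in 2014, v3.1 in 2017), v1.0 is still the most widely used, and it is the version that lxml's xpath() implements. Its official documentation is at: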

https://www.w3.org/TR/xpath/

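XPath Expressions

To XPath, an XML document is a tree of nodes, which include element nodes, attribute nodes, and text nodes.
The basic syntactic construct in XPath is the expression. Evaluating an expression returns one of four kinds of result:

A node-set (a collection of nodes without duplicates)

A boolean (true/false)

A number (floating point)

A string

Consider the following examples: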

>>> doc.xpath('child::*')

[<Element head at 0x10f478ec0>, <Element body at 0x10f478ba8>]

>>> doc.xpath('3 < 2')

False

>>> doc.xpath('3 = 3')

True

>>> doc.xpath('3 + 2')

5.0

>>> doc.xpath('2 * 3')

6.0

>>> doc.xpath('6 div 3')

2.0

>>> doc.xpath('3 mod 2')

1.0

>>> doc.xpath('/html/head/meta[@name="robots"]/@name')

['robots']
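To make the four result kinds concrete, here is a minimal sketch showing how lxml maps each one to a Python type. It assumes Python 3 with lxml installed; the inline HTML is made up for illustration and is not the page used above.

```python
from lxml import html

# Hypothetical miniature document.
doc = html.document_fromstring(
    '<html><head><meta name="robots" content="all"></head>'
    '<body><p>one</p><p>two</p></body></html>'
)

nodes = doc.xpath('//p')                # node-set -> list of elements
flag = doc.xpath('count(//p) = 2')      # boolean  -> bool
number = doc.xpath('count(//p) * 3')    # number   -> float
text = doc.xpath('string(//p[1])')      # string   -> str

# Selecting attribute nodes yields a list of their string values.
attrs = doc.xpath('/html/head/meta[@name="robots"]/@name')

print(nodes, flag, number, text, attrs)
```

Note that a node-set always comes back as a list, even when it holds a single node or nothing at all.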

Paths and Steps

The most important kind of XPath expression is the location path, which is used to select nodes. For example:

>>> doc.xpath('/html/head/meta[@name="robots"]')

[<Element meta at 0x10f478ba8>]

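As you can see, a path consists of several steps separated by slashes ("/"); each step selects, relative to its context node, the nodes that satisfy its conditions.
Paths are either relative or absolute, for example:

>>> doc.xpath('/html/head/meta[@name="robots"]')  # absolute path, starting from the root node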

[<Element meta at 0x10f478ba8>]

>>> doc.xpath('head/meta[@name="robots"]')  # relative path, starting from the current node

[<Element meta at 0x10f478ba8>]

To query several paths at once, use the union operator "|":

>>> doc.xpath('/html/head | /html/body')

[<Element head at 0x10f478c58>, <Element body at 0x10f478c00>]
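The difference between absolute and relative paths is easiest to see by evaluating them from different context nodes. The following is a small sketch under the assumption of Python 3 with lxml; the document content is invented for the example.

```python
from lxml import html

# Hypothetical miniature document.
doc = html.document_fromstring(
    '<html><head><meta name="robots" content="all"></head>'
    '<body><p>text</p></body></html>'
)
head = doc.head  # the <head> element, used as a second context node

# An absolute path gives the same result from any context node.
absolute_from_doc = doc.xpath('/html/head/meta')
absolute_from_head = head.xpath('/html/head/meta')

# A relative path depends on where it starts.
relative_from_doc = doc.xpath('head/meta')    # html -> head -> meta
relative_from_head = head.xpath('meta')       # head -> meta
wrong_context = head.xpath('head/meta')       # head has no head child

# The union operator merges the results of two paths.
union = doc.xpath('/html/body | /html/head')

print(len(absolute_from_doc), len(wrong_context), [e.tag for e in union])
```

Because wrong_context is simply an empty list rather than an error, a relative path evaluated from the wrong element tends to fail silently.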

Axes

The paths above all use the abbreviated syntax. There is also an unabbreviated form, for example:

>>> doc.xpath('/child::html/child::head/child::meta[@name="robots"]')

[<Element meta at 0x10f478ba8>]

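In this form, take child::meta[@name="robots"] as an example. What does this step mean? For this step the context node is head, and the step selects, from the meta children of head, those whose name attribute equals robots. Here child is the axis name, meta is the node test (the kind of node to select), and @name="robots" is the condition to satisfy.
In fact, every step consists of these three parts, whose roles are:

An axis: specifies how the nodes to select are related to the current node; for example, child above means that children of the current node are selected

A node test: specifies the kind of node this step selects; for example, child::meta above selects elements named meta

Zero or more predicates: filter the selection down to specific nodes; for example, [@name="robots"] above keeps only nodes whose name attribute equals robots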
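The three parts of a step can be checked by writing the same query in abbreviated and unabbreviated form and comparing the results. This is a sketch assuming Python 3 with lxml; the sample markup is hypothetical.

```python
from lxml import html

# Hypothetical miniature document with two meta elements.
doc = html.document_fromstring(
    '<html><head><meta name="robots" content="all">'
    '<meta name="author" content="x"></head><body></body></html>'
)

# Abbreviated: the child axis is implied and @ stands for attribute::
short = doc.xpath('/html/head/meta[@name="robots"]')

# Unabbreviated: axis::node-test[predicate] spelled out in every step.
full = doc.xpath(
    '/child::html/child::head/child::meta[attribute::name="robots"]'
)

# "//" is shorthand for "/descendant-or-self::node()/".
short_any = doc.xpath('//meta')
full_any = doc.xpath('/descendant-or-self::node()/child::meta')

print(short == full, short_any == full_any)
```

Both comparisons hold because the abbreviated syntax is pure shorthand; it selects exactly the same nodes.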

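Besides the child axis, XPath defines a number of other axes: ancestor, ancestor-or-self, attribute, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, and self.

Here are some examples:

1) Getting child nodes: getchildren() is equivalent to child::*

>>> doc.xpath('child::*')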

[<Element head at 0x10e8bad60>, <Element body at 0x10e8e5d60>]

>>> doc.getchildren()

[<Element head at 0x10e8bad60>, <Element body at 0x10e8e5d60>]

2) Getting the current node: "." is equivalent to self::node()

>>> doc.xpath(".")

[<Element html at 0x10dc1d4c8>]

>>> doc.xpath("self::node()")

[<Element html at 0x10dc1d4c8>]

3) Getting the parent node: ".." is equivalent to parent::node()

>>> doc.head.xpath("..")

[<Element html at 0x10dc1d4c8>]

>>> doc.head.xpath("parent::node()")

[<Element html at 0x10dc1d4c8>]

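4) The ancestor and descendant axes

These select all ancestors and all descendants of the current element, respectively. In the examples below, meta is the googlebot meta element under head:

>>> meta = doc.xpath('/html/head/meta[@name="googlebot"]')[0]

>>> meta.xpath('ancestor::*')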

[<Element html at 0x10dc1d4c8>, <Element head at 0x10e8bad60>]

>>> meta.xpath('ancestor::head')

[<Element head at 0x10e8bad60>]

>>> doc.xpath('descendant::table')

>>> doc.xpath('descendant::table[@id="tcdatafields"]')

>>> doc.xpath('//table[@id="tcdatafields"]')

5) The ancestor-or-self and descendant-or-self axes

These select the current element together with all of its ancestors, and the current element together with all of its descendants, respectively. For example:

>>> meta.xpath('ancestor-or-self::*')

[<Element html at 0x10dc1d4c8>, <Element head at 0x10e8bad60>, <Element meta at 0x10e8bae68>]

6) The child and parent axes

These select all child elements and the parent element of the current element, respectively:

>>> doc.xpath('child::*')

[<Element head at 0x10e8bad60>, <Element body at 0x10e8baf18>]

>>> doc.xpath('child::head')

[<Element head at 0x10e8bad60>]

>>> head = doc.head

>>> head.xpath('child::meta[1]')

[<Element meta at 0x10e8baf18>]

>>> head.xpath('child::meta[position()<3]')

[<Element meta at 0x10e8baf18>, <Element meta at 0x10e8bad08>]

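7) The attribute axis

This selects all attributes of the current element. For example, here are the name and content attributes of the meta element, with their values: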

>>> meta.items()

[('name', 'googlebot'), ('content', 'index, follow')]

Get the values of all attributes:

>>> meta.xpath('attribute::*')

['googlebot', 'index, follow']

Get the value of the name attribute:

>>> meta.xpath('attribute::name')

['googlebot']

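8) The following and preceding axes

These select, respectively, all elements that come after the current element in document order (excluding its descendants) and all elements that come before it (excluding its ancestors). For example:

>>> meta.xpath('following::*')

>>> meta.xpath('preceding::*')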

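9) The following-sibling and preceding-sibling axes

These select all following siblings and all preceding siblings of the current element, respectively. For example:

>>> meta.xpath('preceding-sibling::*')

>>> meta.xpath('following-sibling::*')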

10) The self axis

This selects the current element itself:

>>> doc.xpath("self::*")

[<Element html at 0x10dc1d4c8>]
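Several of the axis examples above are shown without output. The sketch below runs the same kinds of queries against a tiny made-up document so that each result can be inspected. It assumes Python 3 with lxml; the markup and the helper tags() are illustrative and not part of the original session.

```python
from lxml import html

# Hypothetical document: head holds title, meta, link; body holds one p.
doc = html.document_fromstring(
    '<html><head><title>t</title>'
    '<meta name="robots" content="all">'
    '<link rel="stylesheet" href="a.css"></head>'
    '<body><p>x</p></body></html>'
)
meta = doc.xpath('/html/head/meta')[0]


def tags(nodes):
    """Return the tag names of a list of elements."""
    return [n.tag for n in nodes]


ancestors = tags(meta.xpath('ancestor::*'))
ancestors_or_self = tags(meta.xpath('ancestor-or-self::*'))
siblings_before = tags(meta.xpath('preceding-sibling::*'))
siblings_after = tags(meta.xpath('following-sibling::*'))
following = tags(meta.xpath('following::*'))   # everything after meta
preceding = tags(meta.xpath('preceding::*'))   # before meta, minus ancestors
attribute_values = meta.xpath('attribute::*')
descendants = tags(doc.xpath('descendant::p'))

print(ancestors, siblings_before, siblings_after, following, preceding)
```

Comparing following with siblings_after shows the difference between the two axes: following also reaches into body, while following-sibling stays inside head.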

Using Predicates

A predicate is the part of a step enclosed in square brackets [...]. Predicates make precise lookups possible. Consider the following examples:

>>> doc.xpath('/html/head/meta')

[<Element meta at 0x10e8baf70>, <Element meta at 0x10e8bae10>, <Element meta at 0x10e8bacb0>, <Element meta at 0x10e8baf18>, <Element meta at 0x10e8bad08>, <Element meta at 0x10e8bafc8>, <Element meta at 0x10e8bae68>]

1) Positional predicates

>>> doc.xpath('/html/head/meta[1]')

[<Element meta at 0x10e8baf70>]

>>> doc.xpath('/html/head/meta[2]')

[<Element meta at 0x10e8bae10>]

>>> doc.xpath('/html/head/meta[last()]')

[<Element meta at 0x10e8bae68>]

>>> doc.xpath('/html/head/meta[last()-1]')

[<Element meta at 0x10e8bae10>]

>>> doc.xpath('/html/head/meta[position()<3]')

[<Element meta at 0x10e8baf70>, <Element meta at 0x10e8bae10>]

Note: these examples use the last() and position() functions. XPath supports many more functions, and combining them gives very powerful processing capabilities.
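Since element addresses are hard to tell apart, a version that prints a readable attribute may make the positional predicates easier to follow. This sketch assumes Python 3 with lxml; the four meta elements and the helper names() are invented for the example.

```python
from lxml import html

# Hypothetical head with four meta elements named a, b, c, d.
doc = html.document_fromstring(
    '<html><head>'
    '<meta name="a"><meta name="b"><meta name="c"><meta name="d">'
    '</head><body></body></html>'
)


def names(expr):
    """Evaluate expr and return the name attribute of each match."""
    return [m.get('name') for m in doc.xpath(expr)]


first = names('/html/head/meta[1]')              # positions start at 1
last = names('/html/head/meta[last()]')
second_to_last = names('/html/head/meta[last()-1]')
first_two = names('/html/head/meta[position()<3]')

print(first, last, second_to_last, first_two)
```

Unlike Python list indexes, XPath positions are 1-based, so meta[1] is the first match and there is no meta[0].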

2) Attribute predicates

meta elements that have a name attribute:

>>> doc.xpath('/html/head/meta[@name]')

[<Element meta at 0x10e8bacb0>, <Element meta at 0x10e8baf70>, <Element meta at 0x10e8bae10>, <Element meta at 0x10e8bae68>]

meta elements whose name attribute has the value robots:

>>> doc.xpath('/html/head/meta[@name="robots"]')

[<Element meta at 0x10e8bacb0>]

meta elements that have at least one attribute of any name:

>>> doc.xpath('/html/head/meta[@*]')
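The three attribute predicates differ only in how strict the test is, which a small count-based sketch can show. It assumes Python 3 with lxml; the four meta elements are made up so that each predicate matches a different number of them.

```python
from lxml import html

# Hypothetical head: two named metas, one with only content, one bare.
doc = html.document_fromstring(
    '<html><head>'
    '<meta name="robots" content="all">'
    '<meta name="author" content="x">'
    '<meta content="no name here">'
    '<meta>'
    '</head><body></body></html>'
)

all_meta = doc.xpath('/html/head/meta')
has_name = doc.xpath('/html/head/meta[@name]')             # attribute exists
is_robots = doc.xpath('/html/head/meta[@name="robots"]')   # attribute equals
has_any_attr = doc.xpath('/html/head/meta[@*]')            # any attribute

print(len(all_meta), len(has_name), len(is_robots), len(has_any_attr))
```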

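3) Function predicates

XPath has many built-in functions. Using them well can make lookups much more efficient, for example:

- Using text() (strictly speaking a node test rather than a function)

>>> doc.xpath('//td[text()="2017-03-21"]')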

[<Element td at 0x10e8bacb0>]

- Using the contains function

>>> [ td.text for td in doc.xpath('//td[contains(text(), "2017-03-2")]')]

['2017-03-29', '2017-03-28', '2017-03-27', '2017-03-24', '2017-03-23', '2017-03-22', '2017-03-21', '2017-03-20']

- Using the starts-with function

>>> [ td.text for td in doc.xpath('//td[starts-with(text(),"2017-02-2")]')]

['2017-02-28', '2017-02-27', '2017-02-24', '2017-02-23', '2017-02-22', '2017-02-21', '2017-02-20']

- Using numeric comparisons with and/or

>>> [ td.text for td in doc.xpath('//td[text()>21.0 and text()<23.0]')]

['21.02']

>>> [ td.text for td in doc.xpath('//td[text()< -2.5 or text()>21.0]')]

['21.02', '-2.64']
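The function predicates can be reproduced without the live page. The sketch below assumes Python 3 with lxml and uses a three-row table whose values are invented (only loosely modelled on the date and price cells above); texts() is a hypothetical helper.

```python
from lxml import html

# Hypothetical table: each row holds a date cell and a numeric cell.
doc = html.document_fromstring(
    '<html><body><table>'
    '<tr><td>2017-03-21</td><td>21.02</td></tr>'
    '<tr><td>2017-03-20</td><td>20.50</td></tr>'
    '<tr><td>2017-02-28</td><td>-2.64</td></tr>'
    '</table></body></html>'
)


def texts(expr):
    """Evaluate expr and return the text of each matching cell."""
    return [td.text for td in doc.xpath(expr)]


exact = texts('//td[text()="2017-03-21"]')
march = texts('//td[contains(text(), "2017-03-2")]')
february = texts('//td[starts-with(text(), "2017-02")]')

# In numeric comparisons the cell text is converted to a number;
# text that is not numeric becomes NaN and never matches.
in_range = texts('//td[text() > 21.0 and text() < 23.0]')
extremes = texts('//td[text() < -2.5 or text() > 21.0]')

print(exact, march, february, in_range, extremes)
```

The date cells drop out of the numeric queries because a string such as 2017-03-21 does not convert to a number.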

Wildcards

XPath also supports wildcards: "*" matches any element, "@*" matches any attribute, and node() matches any node of any type:

>>> head.xpath('./*')

[<Element title at 0x10e8baf18>, <Element meta at 0x10e8bacb0>, <Element meta at 0x10e8baf70>, <Element meta at 0x10e8bad08>, <Element meta at 0x10e8bafc8>, <Element meta at 0x10e8bae10>, <Element meta at 0x10e8c6050>, <Element meta at 0x10e8bae68>, <Element link at 0x10e8c60a8>, <Element link at 0x10e8c6100>]

>>> head.xpath('./meta[@*]')

[<Element meta at 0x10e8baf18>, <Element meta at 0x10e8bacb0>, <Element meta at 0x10e8baf70>, <Element meta at 0x10e8bad08>, <Element meta at 0x10e8bafc8>, <Element meta at 0x10e8bae10>, <Element meta at 0x10e8bae68>]

>>> head.xpath('./node()')

['\r\n', <Element title at 0x10e8badb8>, '\r\n', <Element meta at 0x10e8baf18>, '\r\n', <Element meta at 0x10e8bacb0>, '\r\n', <Element meta at 0x10e8baf70>, '\r\n', <Element meta at 0x10e8bad08>, '\r\n', <Element meta at 0x10e8bafc8>, '\r\n', <Element meta at 0x10e8bae10>, '\r\n', <Element meta at 0x10e8bae68>, '\r\n', <Element link at 0x10e8c6050>, '\n', <Element link at 0x10e8c60a8>, '\r\n']
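As the output above shows, node() returns text nodes as plain strings mixed in with the elements. Here is a compact sketch of the three wildcards side by side, assuming Python 3 with lxml; the div and its content are hypothetical, chosen so that the text nodes are easy to recognise.

```python
from lxml import html

# Hypothetical element with two attributes, two text nodes, one child.
doc = html.document_fromstring(
    '<html><body>'
    '<div id="box" class="note">Hello <b>bold</b> world</div>'
    '</body></html>'
)
div = doc.xpath('//div')[0]

child_elements = div.xpath('./*')     # elements only
attribute_values = div.xpath('./@*')  # all attribute values
child_nodes = div.xpath('./node()')   # text nodes and elements together

# Text nodes come back as strings, elements as element objects.
kinds = ['text' if isinstance(n, str) else n.tag for n in child_nodes]

print([e.tag for e in child_elements], attribute_values, kinds)
```

Filtering with isinstance(n, str), as done for kinds, is one simple way to separate text nodes from elements in a node() result.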

That's all for today.