Converting Character Entity References to Unicode Characters

Date: 2011-07-01 12:31:51

Let's first look at how character entity references (HTML entities) and numeric character references (NCRs) compare:

An HTML entity looks like &lt;, while an NCR looks like &#60; (decimal) or &#x3C; (hexadecimal). All three represent the "<" character.

The HTML specification defines character entity references, and its section "24.2.1 The list of characters" lists how HTML entities map to NCRs, for example:

<!ENTITY nbsp CDATA "&#160;" -- no-break space = non-breaking space, U+00A0 ISOnum -->
<!ENTITY iexcl CDATA "&#161;" -- inverted exclamation mark, U+00A1 ISOnum -->
<!ENTITY yen CDATA "&#165;" -- yen sign = yuan sign, U+00A5 ISOnum -->

So how do we convert HTML entities and NCRs into ordinary characters in Python?

Before answering that, let's briefly review a few building blocks:

The group method

group([group1,…])

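group is a method of match objects. It returns one or more subgroups of the match. With a single argument the result is a string; with several arguments it is a tuple with one item per argument. group1 defaults to 0, which returns the entire match. If a groupN argument is in the range [1..99], the result is the string matched by the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError is raised. If a group sits in a part of the pattern that did not take part in the match, its value is None. If a group matched several times, only its last match is returned.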
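The rules above can be seen in a small sketch. The pattern and the sample strings are made up for illustration and are not from the original article:

```python
import re

# Two required groups and one optional group (hypothetical pattern).
m = re.match(r"(\w+) (\w+)(?: (\d+))?", "Isaac Newton")

whole = m.group()        # no argument: same as group(0), the entire match
first = m.group(1)       # one argument: a string
pair = m.group(1, 2)     # several arguments: a tuple
missing = m.group(3)     # group 3 did not take part in the match: None

# A group that matches repeatedly keeps only its last match.
last = re.match(r"(?:(\d),?)+", "1,2,3").group(1)

# An out-of-range group number raises IndexError.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True
```

Here whole is the full string, pair holds both names, and last is the final digit that the repeated group captured.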

The re.sub function

re.sub(pattern, replace, string[, count])

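sub is the re module's search-and-replace function: it finds the substrings of the target string that match the regular expression and replaces them with the specified string.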

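pattern: the regular expression to match
replace: the replacement string or function. If replace is a function, it is called once for every match; it receives a single match object as its argument and returns the replacement string.
string: the target string
count: the maximum number of replacements; if omitted, every match is replaced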

Here is an example of re.sub() in use:

import re

def dashrepl(matchobj):
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'

re.sub('-{1,2}', dashrepl, 'pro----gram-files')
# result: 'pro--gram files'
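To complement the example above, here is a minimal sketch of the other parameters: a plain string as the replacement, the count limit, and a function replacement written as a lambda. The sample strings are invented for illustration.

```python
import re

text = "a-b-c-d"

# String replacement, no count: every match is replaced.
all_replaced = re.sub("-", "+", text)

# count=2: only the first two matches are replaced.
first_two = re.sub("-", "+", text, count=2)

# Function replacement: called once per match with the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), "3 apples, 10 pears")
```

The function form is the one that matters for the rest of this article, because each entity needs its own computed replacement.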

htmlentitydefs

The htmlentitydefs module has three attributes:

entitydefs: A dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1.
name2codepoint: A dictionary that maps HTML entity names to the Unicode codepoints. New in version 2.3.
codepoint2name: A dictionary that maps Unicode codepoints to HTML entity names.

Their contents look roughly like this:

entitydefs = {'AElig': '\xc6', 'Aacute': '\xc1', 'Acirc': '\xc2', ...}
name2codepoint = {'AElig': 198, 'Aacute': 193, 'Acirc': 194, ...}
codepoint2name = {34: 'quot', 38: 'amp', 60: 'lt', 62: 'gt', ...}

For our purposes the most useful attribute is name2codepoint. Take &lt; as an example: its name is lt, and name2codepoint['lt'] gives us its code point, 60.
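The lookup can be tried directly. This sketch assumes Python 3, where the htmlentitydefs module is available under the name html.entities; on Python 2 the import would be from htmlentitydefs instead.

```python
# Assumption: Python 3, where htmlentitydefs was renamed to html.entities.
from html.entities import name2codepoint, codepoint2name

lt_code = name2codepoint["lt"]      # entity name -> code point
yen_code = name2codepoint["yen"]
name_for_60 = codepoint2name[60]    # code point -> entity name

# Unknown names are simply absent, so a plain lookup raises KeyError.
has_bogus = "bogus" in name2codepoint
```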

The unichr function

unichr is a built-in function (unichr(int)), not a string method. It converts an integer to the corresponding Unicode character, for example: unichr(60) -> u'\u003c', that is, u'<'
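For readers on Python 3, the same conversion is done by the built-in chr, since unichr no longer exists there. A minimal sketch under that assumption:

```python
# Assumption: Python 3, where chr() plays the role of Python 2's unichr().
code_point = 60
char = chr(code_point)     # integer -> one-character string
round_trip = ord(char)     # ord() is the inverse of chr()
yen = chr(0xA5)            # U+00A5, the code point listed for &yen;
```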

With these pieces in place, the conversion function looks like this:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def convert(matchobj):
        text = matchobj.group(0)
        if text[:2] == "&#":
            # Numeric Character Reference
            try:
                if text[:3] == "&#x":
                    # Hexadecimal form: strip "&#x" and the trailing ";"
                    return unichr(int(text[3:-1], 16))
                else:
                    # Decimal form: strip "&#" and the trailing ";"
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # Character entities references
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        # Unrecognized references fall through and are returned unchanged
        return text

    # Return Unicode characters
    return re.sub(r"&#?\w+;", convert, text)
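The function above targets Python 2. As a rough sketch of how it might be ported, the version below assumes Python 3, swapping htmlentitydefs for html.entities and unichr for chr. The sample string is invented for illustration. Python 3.4 and later also ship html.unescape in the standard library, which is probably the better choice in real code.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs


def unescape(text):
    """Replace HTML entities and NCRs in text with the characters they name."""
    def convert(matchobj):
        ref = matchobj.group(0)
        if ref[:2] == "&#":
            # Numeric character reference, hexadecimal or decimal
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass  # malformed or out-of-range number: leave as is
        else:
            # Named entity such as &lt;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass  # unknown name: leave as is
        return ref

    return re.sub(r"&#?\w+;", convert, text)


sample = "1 &lt; 2 &#38; 3 &#x3E; 2 &bogus; &yen;5"
result = unescape(sample)

# re.sub does not rescan replaced text, so a double-escaped
# entity is only unescaped one level.
one_level = unescape("&amp;lt;")
```

Unknown names such as the made-up &bogus; pass through untouched, which mirrors the "return text" fallback in the original function.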
