>>> import re
>>> s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况'
>>> re_a = re.compile(r'\w+')
>>> re_a.search(s)
<re.Match object; span=(0, 23), match='请你查找在职员工自入职以来123的薪水涨anx'>
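Yes, that's right: the regular expression \w+ actually matched Chinese characters.
Isn't the usual definition supposed to be this?
\w matches letters, digits, and the underscore; equivalent to '[A-Za-z0-9_]'.
So why does it match Chinese in Python?
For comparison, here is the same pattern in Node.js: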
Welcome to Node.js v16.17.0.
Type ".help" for more information.
> {
... const s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况';
... const reg = /\w+/g;
... const ms = s.matchAll(reg);
... for (const m of ms) console.log(m);
... }
[
'123',
index: 13,
input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
groups: undefined
]
[
'anx',
index: 20,
input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
groups: undefined
]
[
'd',
index: 24,
input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
groups: undefined
]
Only after reading the documentation did I notice what is going on:
\w
- For Unicode (str) patterns:
  Matches Unicode word characters; this includes alphanumeric characters (as defined by str.isalnum()) as well as the underscore (_). If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
- For 8-bit (bytes) patterns:
  Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
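So what \w matches is not fixed. The [a-zA-Z0-9_] we habitually assume only applies when the re.A (re.ASCII) flag is passed.
Since str in the Python used here is natively Unicode, the second description (bytes patterns) can be set aside for now.
But how should the phrase "Matches Unicode word characters" in the documentation be read? As matching every Unicode character? Clearly not: it means the characters for which str.isalnum() returns True, plus the underscore.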
str.isalnum()
Return True if all characters in the string are alphanumeric and there is at least one character, False otherwise. A character c is alphanumeric if one of the following returns True: c.isalpha(), c.isdecimal(), c.isdigit(), or c.isnumeric().
>>> 'ans'.isalnum()
True
>>> '我是说'.isalnum()
True
>>> '12345'.isalnum()
True
>>> '...,_'.isalnum()
False
>>> '...,_-'.isalnum()
False
>>> re_a.search('abc我是_谁')
<re.Match object; span=(0, 7), match='abc我是_谁'>
>>> re_a.search('abc我;是_谁')
<re.Match object; span=(0, 4), match='abc我'>
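The rule quoted above can be checked character by character. The following is a minimal sketch, not part of the original session: the helper name is_word_char and the sample string are made up for illustration, and the check only covers the characters in that sample.

```python
import re

WORD = re.compile(r'\w')

def is_word_char(ch: str) -> bool:
    """Documented rule for str patterns: alphanumeric per str.isalnum(), or '_'."""
    return ch.isalnum() or ch == '_'

# Hypothetical sample mixing ASCII letters, Chinese, punctuation and digits.
sample = 'abc我;是_谁-123'

# For each character, compare what the regex says with the documented rule.
rows = [(ch, WORD.fullmatch(ch) is not None, is_word_char(ch)) for ch in sample]
for ch, by_regex, by_rule in rows:
    print(repr(ch), by_regex, by_rule)
```

For this sample the two columns agree on every character: ';' and '-' are rejected by both, while the Chinese characters and '_' are accepted by both.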
Note that the following classes behave the same way:
\d
- For Unicode (str) patterns:
  Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.
- For 8-bit (bytes) patterns:
  Matches any decimal digit; this is equivalent to [0-9].
\s
- For Unicode (str) patterns:
  Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
- For 8-bit (bytes) patterns:
  Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
These commonly used shorthand classes all share this behavior.
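To make the \d and \s descriptions concrete, here is a small sketch. The test strings are my own choice (full-width digits, Arabic-Indic digits, and the ideographic space U+3000), not something from the original session.

```python
import re

# ASCII digits, full-width digits (U+FF11..U+FF13), Arabic-Indic digits (U+0663..U+0665)
text = '123 \uff11\uff12\uff13 \u0663\u0664\u0665'

unicode_digits = re.findall(r'\d+', text)            # default str-pattern behavior
ascii_digits = re.findall(r'\d+', text, flags=re.A)  # restricted to [0-9]

# U+3000 IDEOGRAPHIC SPACE, common in Chinese and Japanese text
spaced = 'a\u3000b'
unicode_space = re.search(r'\s', spaced) is not None
ascii_space = re.search(r'\s', spaced, flags=re.A) is not None

print(unicode_digits)               # three runs of digits
print(ascii_digits)                 # only the ASCII run
print(unicode_space, ascii_space)   # True False
```

This matters for input validation: a pattern such as \d+ without re.A will accept digits that a downstream system expecting [0-9] may not.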
# with re.A added
>>> reg = re.compile(r'\w+', flags=re.A)
>>> reg.search(s)
<re.Match object; span=(13, 16), match='123'>
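The same restriction can also be written inside the pattern with the inline flag (?a), or obtained by matching against bytes. The sketch below assumes UTF-8 encoding for the bytes variant, with variable names chosen only for illustration; it reproduces the three matches from the Node.js output above.

```python
import re

s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况'

# Inline flag: (?a) at the start of the pattern is equivalent to flags=re.A.
inline = re.findall(r'(?a)\w+', s)

# Bytes pattern: \w is ASCII-only by default. Every byte of a UTF-8 encoded
# Chinese character is >= 0x80, so none of them count as word characters.
from_bytes = re.findall(rb'\w+', s.encode('utf-8'))

print(inline)      # ['123', 'anx', 'd']
print(from_bytes)  # [b'123', b'anx', b'd']
```

Note that offsets differ between the two variants: matches on the encoded bytes report byte positions, not character positions.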
As you can see, Python's regular expressions differ from JavaScript's in quite a few respects.