Python Regular Expressions: The "\w" Pitfall

>>> import re
>>> s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况'
>>> re_a = re.compile(r'\w+')
>>> re_a.search(s)
<re.Match object; span=(0, 23), match='请你查找在职员工自入职以来123的薪水涨anx'>

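Yes, really: the pattern \w+ matched Chinese characters.

Shouldn't the usual behavior be:

\w matches letters, digits, and the underscore; equivalent to '[A-Za-z0-9_]'.

So why does it match Chinese characters in Python? For comparison, here is the same pattern in Node.js: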

Welcome to Node.js v16.17.0.
Type ".help" for more information.
> {
...     const s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况';
...     const reg = /\w+/g;
...     const ms = s.matchAll(reg);
...     for (const m of ms) console.log(m);
... }
[
  '123',
  index: 13,
  input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
  groups: undefined
]
[
  'anx',
  index: 20,
  input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
  groups: undefined
]
[
  'd',
  index: 24,
  input: '请你查找在职员工自入职以来123的薪水涨anx-d幅情况',
  groups: undefined
]

Only after going through the Python documentation did I notice the cause:

\w
  • For Unicode (str) patterns:

Matches Unicode word characters; this includes alphanumeric characters (as defined by str.isalnum()) as well as the underscore (_). If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

  • For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

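So what \w matches is not fixed. The familiar [a-zA-Z0-9_] behavior only applies to str patterns when the re.A (re.ASCII) flag is set.

Since the Python in use here handles strings as Unicode natively, the second description (for bytes patterns) can be set aside for now.

But how should "Matches Unicode word characters" in the documentation be read? Does it match every Unicode character? Clearly not. It means the characters for which str.isalnum() returns True, plus the underscore.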

str.isalnum()

Return True if all characters in the string are alphanumeric and there is at least one character, False otherwise. A character c is alphanumeric if one of the following returns True: c.isalpha(), c.isdecimal(), c.isdigit(), or c.isnumeric().

>>> 'ans'.isalnum()
True
>>> '我是说'.isalnum()
True
>>> '12345'.isalnum()
True
>>> '...,_'.isalnum()
False
>>> '...,_-'.isalnum()
False
>>> re_a.search('abc我是_谁')
<re.Match object; span=(0, 7), match='abc我是_谁'>
>>> re_a.search('abc我;是_谁')
<re.Match object; span=(0, 4), match='abc我'>
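
To make the relationship between \w and str.isalnum() easier to see, here is a minimal sketch that checks a few characters one at a time. The sample characters and the names `word`, `samples`, and `results` are chosen only for illustration; they are not from the documentation.

```python
import re

# A str pattern with no flags: \w follows the Unicode word-character rules.
word = re.compile(r'\w')

# Illustrative sample characters (an assumption of this sketch).
samples = ['a', '7', '_', '我', '-', ';', ' ']

# For each character, record (does \w match it?, does isalnum() accept it?).
results = {ch: (word.fullmatch(ch) is not None, ch.isalnum()) for ch in samples}

for ch, (matches_w, is_alnum) in results.items():
    print(repr(ch), 'matches \\w:', matches_w, '| isalnum():', is_alnum)
```

For these samples the two columns agree everywhere except the underscore, which \w accepts even though '_'.isalnum() is False.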

Also note:

\d
  • For Unicode (str) patterns:

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.

  • For 8-bit (bytes) patterns:

Matches any decimal digit; this is equivalent to [0-9].

\s
  • For Unicode (str) patterns:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

  • For 8-bit (bytes) patterns:

Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

These commonly used character classes (\w, \d, \s) all share this Unicode-aware behavior.
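
The same effect can be sketched for \d and \s. The test strings below are assumptions picked for illustration: they contain full-width digits (U+FF11 to U+FF13), Arabic-Indic digits (U+0664, U+0665), and a non-breaking space (U+00A0), written as escapes so they are unambiguous.

```python
import re

# One ASCII digit, three full-width digits, two Arabic-Indic digits.
digits_text = 'a1 \uff11\uff12\uff13 \u0664\u0665'

# Unicode mode (default for str patterns): any character in category Nd counts.
unicode_digits = re.findall(r'\d+', digits_text)
# ASCII mode: only [0-9] counts.
ascii_digits = re.findall(r'\d+', digits_text, flags=re.ASCII)

# Two words separated by a non-breaking space instead of a regular space.
space_text = 'x\u00a0y'

# Unicode mode treats the non-breaking space as whitespace; ASCII mode does not.
unicode_split = re.split(r'\s+', space_text)
ascii_split = re.split(r'\s+', space_text, flags=re.ASCII)

print(len(unicode_digits), len(ascii_digits))
print(unicode_split, ascii_split)
```

This matters when validating input such as numeric IDs: without re.ASCII, \d+ accepts digit characters beyond 0-9.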

# With re.A
>>> reg = re.compile(r'\w+', flags=re.A)
>>> reg.search(s)
<re.Match object; span=(13, 16), match='123'>
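
Besides passing the flag to re.compile, there are a couple of other ways to get ASCII-only matching. The sketch below compares three of them on the same string; the variable names are illustrative, and which style to prefer is a matter of taste.

```python
import re

s = '请你查找在职员工自入职以来123的薪水涨anx-d幅情况'

# 1. The ASCII flag (re.A is the short alias of re.ASCII).
by_flag = re.findall(r'\w+', s, flags=re.ASCII)

# 2. The inline flag (?a) at the start of the pattern.
by_inline = re.findall(r'(?a)\w+', s)

# 3. An explicit character class, which does not depend on any flag.
by_class = re.findall(r'[A-Za-z0-9_]+', s)

print(by_flag)
print(by_inline)
print(by_class)
```

All three produce the same three matches as the Node.js output earlier: '123', 'anx', and 'd'. The explicit class is the most portable choice when the same pattern has to behave identically across languages.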

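As this shows, Python's regular expressions differ from JavaScript's in a number of ways. Here, Python's \w is Unicode-aware by default for str patterns, while in the Node.js session above \w matched only ASCII letters and digits.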