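Remove noise from text based on the underlying Unicode encoding, using each character's general category as returned by unicodedata.category().

Reference: http://ju.outofmemory.cn/entry/374250
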
Category code | Description |
---|---|
Lu | Letter, uppercase |
Ll | Letter, lowercase |
Lt | Letter, titlecase |
Lm | Letter, modifier |
Lo | Letter, other |
Mn | Mark, nonspacing |
Mc | Mark, spacing combining |
Me | Mark, enclosing |
Nd | Number, decimal digit |
Nl | Number, letter |
No | Number, other |
Pc | Punctuation, connector |
Pd | Punctuation, dash |
Ps | Punctuation, open |
Pe | Punctuation, close |
Pi | Punctuation, initial quote (may behave like Ps or Pe depending on usage) |
Pf | Punctuation, final quote (may behave like Ps or Pe depending on usage) |
Po | Punctuation, other |
Sm | Symbol, math |
Sc | Symbol, currency |
Sk | Symbol, modifier |
So | Symbol, other |
Zs | Separator, space |
Zl | Separator, line |
Zp | Separator, paragraph |
Cc | Other, control |
Cf | Other, format |
Cs | Other, surrogate |
Co | Other, private use |
Cn | Other, not assigned (including noncharacters) |
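
The table above can be put to work with a small filter. The following is a minimal sketch, not a definitive implementation: the function name `remove_noise` and the default choice of which major classes count as "noise" (`C`, `P`, `S`) are assumptions made for illustration. Only `unicodedata.category()` comes from the standard library.

```python
import unicodedata


def remove_noise(text, drop=("C", "P", "S")):
    """Drop characters whose major category (first letter of the code) is in `drop`.

    `drop` is a hypothetical default: C = control/format/etc., P = punctuation,
    S = symbols. Note that "\n" and "\t" are category Cc, so they are removed
    as well unless "C" is left out of `drop`.
    """
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in drop
    )


# Inspect the category of a few individual characters.
for ch in ["A", "a", "1", "$", " "]:
    print(repr(ch), unicodedata.category(ch))

# Zero-width space (Cf), NUL (Cc), and punctuation (Po) are stripped;
# letters (Lu, Ll, Lo) and the ordinary space (Zs) are kept.
print(remove_noise("Hello,\u200b world!\x00"))
print(remove_noise("你好,世界!"))
```

Filtering on the first letter of the code keeps the rule short. To target a narrower set, for example only `Cc` and `Cf`, compare the full two-letter code instead.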