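Remove noise from text based on each character's underlying Unicode general category, as reported by unicodedata.category().

Reference: http://ju.outofmemory.cn/entry/374250

| Category code | Description |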
|---|---|
| Lu | Letter, uppercase |
| Ll | Letter, lowercase |
| Lt | Letter, titlecase |
| Lm | Letter, modifier |
| Lo | Letter, other |
| Mn | Mark, nonspacing |
| Mc | Mark, spacing combining |
| Me | Mark, enclosing |
| Nd | Number, decimal digit |
| Nl | Number, letter |
| No | Number, other |
| Pc | Punctuation, connector |
| Pd | Punctuation, dash |
| Ps | Punctuation, open |
| Pe | Punctuation, close |
| Pi | Punctuation, initial quote (may behave like Ps or Pe depending on usage) |
| Pf | Punctuation, final quote (may behave like Ps or Pe depending on usage) |
| Po | Punctuation, other |
| Sm | Symbol, math |
| Sc | Symbol, currency |
| Sk | Symbol, modifier |
| So | Symbol, other |
| Zs | Separator, space |
| Zl | Separator, line |
| Zp | Separator, paragraph |
| Cc | Other, control |
| Cf | Other, format |
| Cs | Other, surrogate |
| Co | Other, private use |
| Cn | Other, not assigned (including noncharacters) |
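
The table above can drive a simple noise filter. The following is a minimal sketch, not a definitive implementation: the helper name `remove_noise`, the choice to treat the `C*` (Other) and `S*` (Symbol) classes as noise, and the decision to keep `\n` and `\t` are all assumptions made for illustration. Only `unicodedata.category()` comes from the note itself.

```python
import unicodedata

# Assumption: characters in the "Other" (C*) and "Symbol" (S*) classes are noise.
NOISE_MAJOR_CLASSES = frozenset({"C", "S"})
# Assumption: newline and tab are worth keeping even though they are category Cc.
KEEP_CHARS = frozenset({"\n", "\t"})


def remove_noise(text, noise_classes=NOISE_MAJOR_CLASSES, keep=KEEP_CHARS):
    """Drop characters whose major Unicode class (first letter of the
    two-letter category code) is in noise_classes. Hypothetical helper."""
    return "".join(
        ch
        for ch in text
        if ch in keep or unicodedata.category(ch)[0] not in noise_classes
    )


# Look up the category code of a few individual characters.
for ch in "Aa1 $中":
    print(repr(ch), unicodedata.category(ch))

# U+200B (zero width space, Cf), U+0000 (Cc) and U+00A9 (copyright sign, So)
# are filtered out; letters, digits, punctuation and spaces are kept.
sample = "Hello\u200b, world\x00! \u00a92020\n"
cleaned = remove_noise(sample)
print(repr(cleaned))
```

Filtering on the first letter of the code removes a whole class at once. To target individual categories instead (for example only `Cc` and `Cf`), compare the full two-letter code returned by `unicodedata.category()`.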