Commit d46fecd
Normalize all PTB-produced tokens, not just the German ones, using NFC
Testing on 0.1% of Wikipedia (from a few years ago), this slows down the English tokenizer by about 1.5%.
The German umlaut unit test still passes as well.
1 parent 58a2288 commit d46fecd
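For readers unfamiliar with NFC, the following is a minimal sketch of what normalizing a token to NFC means, using the JDK's `java.text.Normalizer`. It is not the code from this commit: the class name `NfcTokenDemo` and the helper `normalizeToken` are hypothetical, and whether the tokenizer uses `java.text.Normalizer` or checks `isNormalized` first is an assumption.

```java
import java.text.Normalizer;

// Hypothetical illustration; not the tokenizer's actual code.
final class NfcTokenDemo {

  // Returns the token in Unicode NFC (canonical composition).
  // The isNormalized check is an assumed fast path: already-normalized
  // tokens (most English text) are returned unchanged.
  static String normalizeToken(String token) {
    if (Normalizer.isNormalized(token, Normalizer.Form.NFC)) {
      return token;
    }
    return Normalizer.normalize(token, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    // 'u' followed by a combining diaeresis: two code points for one umlaut.
    String decomposed = "u\u0308ber";
    // Precomposed u-umlaut: a single code point.
    String composed = "\u00FCber";

    // The two spellings render identically but are different strings.
    System.out.println(decomposed.equals(composed));
    // After NFC normalization they compare equal.
    System.out.println(normalizeToken(decomposed).equals(composed));
  }
}
```

Without a step like this, the two spellings of an umlaut would produce different tokens for the same word, which is presumably what the German umlaut unit test guards against.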
File tree
3 files changed, +4 -61 lines changed
- src/edu/stanford/nlp
  - international/german/process
  - process

Lines changed: 0 additions & 61 deletions
| Original file lines | Diff lines | Change |
|---|---|---|
| 45-47 | 45-47 | unchanged context |
| 48-105 | | 58 lines removed |
| 106-108 | 48-50 | unchanged context |
| 134-136 | 76-78 | unchanged context |
| 137-139 | | 3 lines removed |
| 140-142 | 79-81 | unchanged context |

| Original file lines | Diff lines | Change |
|---|---|---|
| 27-29 | 27-29 | unchanged context |
| | 30 | 1 line added |
| 30-32 | 31-33 | unchanged context |
| 488-490 | 489-491 | unchanged context |
| | 492 | 1 line added |
| 491-493 | 493-495 | unchanged context |

| Original file lines | Diff lines | Change |
|---|---|---|
| 31-33 | 31-33 | unchanged context |
| | 34 | 1 line added |
| 34-36 | 35-37 | unchanged context |
| 88334-88336 | 88335-88337 | unchanged context |
| | 88338 | 1 line added |
| 88337-88339 | 88339-88341 | unchanged context |