Daniel Kelleher committed on 03 Aug 15
Support German Scharfes 'S' symbol when tokenising

This symbol (ß) is unusual because its so-called 'capital' version, as determined by the JVM, is 'SS', which is longer than the lower-case version. This throws the indexes out of alignment within the TokenStream class when using case-insensitive tokenising.

The solution is to override the match method in CaseInsensitiveToken so that it converts only the current token to upper-case, rather than storing an upper-case version of the entire input string, whose indexes may not line up with those of the original.
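
The index drift, and the idea behind the fix, can be sketched in a few lines of Java. This is an illustration only, not the code from this commit: the class `SharpSDemo` and the helper `matchesAt` are hypothetical names, and the actual `CaseInsensitiveToken` and `TokenStream` implementations may differ.

```java
import java.util.Locale;

// Hypothetical demo class; not part of the project touched by this commit.
class SharpSDemo {

    // Assumed stand-in for a case-insensitive match: upper-case only the
    // slice of the input being tested, so offsets always refer to the
    // original (unmodified) input string.
    static boolean matchesAt(String input, int offset, String keyword) {
        int end = offset + keyword.length();
        if (offset < 0 || end > input.length()) {
            return false;
        }
        String slice = input.substring(offset, end);
        return slice.toUpperCase(Locale.ROOT).equals(keyword.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String input = "stra\u00DFe and more"; // \u00DF is the sharp S
        String upper = input.toUpperCase(Locale.ROOT);

        // Upper-casing the whole input lengthens it, shifting later indexes.
        System.out.println(input.length());       // 15
        System.out.println(upper.length());       // 16
        System.out.println(input.indexOf("and")); // 7
        System.out.println(upper.indexOf("AND")); // 8

        // Matching against a slice of the original input keeps offsets valid.
        System.out.println(matchesAt(input, 7, "AND")); // true
    }
}
```

Because only the candidate slice is upper-cased, every offset stays relative to the original input, so a longer upper-case form earlier in the string cannot shift later token positions.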

(cherry picked from commit 9a8ac56)


4.x + 8 more