Java: looking for the fastest way to check a String for the presence of Unicode chars in a certain range
I need to implement a crude language identification algorithm. In my world, there are 2 languages: English and not-English. I have an ArrayList of strings and need to determine whether each string is in English or in another language that has Unicode chars in a certain range. I want to check each string against that range using some type of "presence" test. If a string passes the test, it is not English; otherwise it's English. I want to try 2 types of tests:
- test-any: if any char in the string falls within the range, the string passes the test
- test-all: if all chars in the string fall within the range, the string passes the test
Since the array might be long, I need to implement this efficiently. What is the fastest way of doing this in Java?
thx
Update: I am checking for non-English by looking at a specific range of Unicode values rather than checking whether the characters are ASCII, in part to take care of the "resume" problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that implement test-any or test-all (or a similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel was invented before me and is better anyway.
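On the built-in side, one candidate is `String.codePoints()` combined with `IntStream.anyMatch` / `allMatch`. The following is a minimal sketch, assuming Java 8 or later. The class name `RangeTests`, the method names, and the Cyrillic block used as the range are illustrative assumptions, not anything from an existing library.

```java
import java.util.Arrays;
import java.util.List;

class RangeTests {
    // test-any: true if at least one code point lies in [low, high]
    static boolean anyInRange(String s, int low, int high) {
        return s.codePoints().anyMatch(cp -> cp >= low && cp <= high);
    }

    // test-all: true if every code point lies in [low, high]
    // (note: allMatch returns true for an empty string)
    static boolean allInRange(String s, int low, int high) {
        return s.codePoints().allMatch(cp -> cp >= low && cp <= high);
    }

    public static void main(String[] args) {
        // Assumed example range: the Cyrillic block, U+0400 to U+04FF
        int low = 0x0400, high = 0x04FF;
        // Cyrillic letters are written as Unicode escapes to avoid source-encoding issues
        List<String> words = Arrays.asList(
                "hello",
                "\u043f\u0440\u0438\u0432\u0435\u0442",
                "hello \u043f\u0440\u0438\u0432\u0435\u0442");
        for (String w : words) {
            System.out.println(w + " any=" + anyInRange(w, low, high)
                    + " all=" + allInRange(w, low, high));
        }
    }
}
```

Both stream operations short-circuit, so they stop at the first deciding character. Whether this is as fast as a hand-written loop would have to be measured on the real data; streams usually carry some overhead.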
Here's how I ended up implementing test-any:
    // test-any
    String str = "wordtotest";
    int uRangeLow = 1234;  // can be any range, e.g. http://www.utf8-chartable.de/unicode-utf8-table.pl
    int uRangeHigh = 2345;
    for (int iLetter = 0; iLetter < str.length(); ) {
        int cp = str.codePointAt(iLetter);
        if (cp >= uRangeLow && cp <= uRangeHigh) {
            // word is not English
            return;
        }
        // advance by 1 or 2 chars so a surrogate pair is read only once
        iLetter += Character.charCount(cp);
    }
    // word is English
    return;
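The test-all variant can follow the same loop shape, returning as soon as one char falls outside the range. This is only a sketch: the class `LanguageFilter`, the helper `notEnglish`, the Cyrillic range, and the decision to treat an empty string as failing the test are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LanguageFilter {
    // test-all: true only if every code point of str lies in [low, high].
    // Assumption: an empty string fails the test.
    static boolean allInRange(String str, int low, int high) {
        if (str.isEmpty()) {
            return false;
        }
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (cp < low || cp > high) {
                return false; // first char outside the range decides the result
            }
            i += Character.charCount(cp); // step over surrogate pairs
        }
        return true;
    }

    // Collects the strings that pass test-all, i.e. the "not English" ones.
    static List<String> notEnglish(List<String> words, int low, int high) {
        List<String> result = new ArrayList<>();
        for (String w : words) {
            if (allInRange(w, low, high)) {
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example range: the Cyrillic block, U+0400 to U+04FF
        List<String> words = Arrays.asList("hello", "\u0436\u0437", "a\u0436");
        System.out.println(notEnglish(words, 0x0400, 0x04FF));
    }
}
```

Because the loop exits on the first deciding char, the cost per string is at most one pass over its chars, with no intermediate objects allocated.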