(don't know if I rather should post this under general programming.. language is java anyway)
Hi,
I'm writing my master thesis on document layout recognition and I'm writing a program on recognizing dictionary-pages.
I have a list of what could be lemmas (~the "lookup word"). {cab, calculator, car, cat}.
The list is supposed to be sorted in alphabetic order. But sometimes it isn't and I've gotten a word/words in to the list, which isn't supposed to be a lemma, that is, something that has fulfilled all the criterias of being a lemma, but in reality isn't, has snuck in: {cab, calculator, cast, car, call, cat}.
Now, I have to find the word that has been recognized as a lemma but isn't ('cause it doesn't follow the alphabetic order).
I kind of need to find the path most of the words are following and then remove the outcasts.
My main approach is to get sublists and find the part of the list where the faulty word is but then I kind of get stuck.
Please, please help, or comment if I haven't explained good enough. I'm going crazy here!
Somehow I imagine this would be great to solve with recursion - something like this:
Code:/** * check if the lemmas are listed in alphabetic order. If not, remove fake lemmas. * Sorry for the lack of understandable code * @param lemmings * @param finished * @return correct lemma-list */ private List<String> checkingLemmas(List<String> lemmings, ArrayList<String> finished){ if(isSorted(lemmings)) return lemmings; if(lemmings.size() == 2){ //what to do here? // I don't have anything to compare to anymore so I don't know which of the two to remove. //went too far... } if(lemmings.size() == 3){ ArrayList<String> x = new ArrayList<String>(); x.add(lemmings.get(0)); x.add(lemmings.get(lemmings.size()-1)); } List<String> list1 = lemmings.subList(0, (lemmings.size()/2)); List<String> list2 = lemmings.subList((lemmings.size()/2), (lemmings.size()-1)); if(isSorted(list1)){ finished.addAll(list1); Collections.sort(finished); } else if(!isSorted(list1)){ finished.addAll(checkingLemmas(list1, finished)); } if(isSorted(list2)){ finished.addAll(list2); Collections.sort(finished); } else if(!isSorted(list2)){ finished.addAll(checkingLemmas(list2, finished)); } return finished; } private boolean isSorted(List<String> list){ ArrayList<String> sorted = new ArrayList<String>(); sorted.addAll(list); Collections.sort(sorted); if(list.equals(sorted)){ return true; } return false; }
Wouldn't it be easier to use the String.compareTo method? It return 0 if 2 strings are equal, <0 when string 1 is 'smaller than string 2 and >0 when string 1 is 'greater than string 2:
example program:Code:String a="at"; String b="bath"; a.compareTo(b) --> result is <0 b.compareTo(a) --> result is >0
output of this program:Code:import java.util.ArrayList; public class Main { public static void main(String[] arg) { Main main = new Main(); } public Main(){ ArrayList<String> list = new ArrayList<String>(); list.add("cat"); list.add("car"); list.add("caz"); list.add("cc"); list.add("cb"); list.add("ccc"); list.add("cd"); list.add("cef"); list.add("cff"); list.add("cee"); list.add("cgg"); list = removeIntruders(list); for(String item : list){ System.out.println(item); } } private static ArrayList<String> removeIntruders(ArrayList<String> list) { for(int i=0 ; i<list.size()-1; i++){ String first = list.get(i); String second = list.get(i+1); if(first.compareTo(second)>0) list.remove(i+1); } return list; } }
Code:cat caz cc ccc cd cef cff cgg
I do have a question however. If the correct list = {a,b,c,d,e,f,g} and with the intruder it's {a,b,z,c,d,e,f,g} then you would always end up with {a,b,z} as result as until there it follow the alphabetical order... if this is possible then i think it becomes a bit more difficult to catch the intruder.
Exactly that is my problem.I do have a question however. If the correct list = {a,b,c,d,e,f,g} and with the intruder it's {a,b,z,c,d,e,f,g} then you would always end up with {a,b,z} as result as until there it follow the alphabetical order... if this is possible then i think it becomes a bit more difficult to catch the intruder.
Believe me, I've tried your method, but I was too tired to think of the compare to-method (thanks btw), and I have used my isSorted instead, but it doesn't matter, as long as I only compare the element with the preceeding elements in the list, I will always get stuck there. So I have to make the method look at the whole list and decide, which element doesn't fit in.
I've been thinking about, assuming that the first and the last elements are correct and then just check, if everything in between is alphabetically ordered.
Note, that the method doesn't have to be a hundred percent fool proof. I know that I will have a marginal at the end. But I'd like to make it pretty, and by only comparing to preceeding elements, I would sometimes cut off the list waay to early and the correctness would be far worse than without the check.
hmm, and what if you go backwards trough the list...
{a,b,z,c,d,e,f,g}
would become {a,b,c,d,e,f,g}...
{a,b,a,c,d,e,f,g} --> {a,c,d,e,f,g}
So going trough the list forward and backwards, look which of the 2 has most elements and take that one will solve it..... when it contains 1 error. I got to think a bit more for multiple errors![]()
Yup, multipe errors there are.got to think a bit more for multiple errors
I wrote an algorithm that appears to accomplish what you're trying to do. The way it does this is by evaluating an arbitrary list of words and giving it a "score" of how ordered it is as a double between 0 and 1. It's basically determined by how much of a percentage of the terms are in order (IE followed by a lemma), vs those that are not. It keeps that number as a baseline, and then removes each word and determines it's percentage. These percentages are compared, and any word removal that results in an overall increase in the order percent will be removed, the ones that result in a decrease remain. This is done for each word, and it seems to remove the intruders pretty well from my rudimentary experiments. I'll throw together an applet for it to let peoples experiment, because I don't think this is the right algorithm, to me there's still problems. Anyway, that's an idea, I'll post it soon.
Wow I changed my sig!
Thanx ZekeDragon, it would be great if you'd post it when it's finished.Anyway, that's an idea, I'll post it soon.
Anyway, I have found one (giant) solution now. The method is huge but it does miracles and I have still not been able to find any faults in it. I'm not great in programming with low complexity (I'm actually also quite lousy on figuring the complexity out, so I won't tell you if it's fast or not) but fact is, that I'm trying this solution on about 1200 pages of dictionary-pages and it works like a charm.
The idea is to take out the longest sorted substring of the list and then take it from there.
I think the comments should explain the whole process
If anyone has an idea for making it even more effective and great, please tellCode:/** * Check if the lemmas are in alphabetic order, remove the words that have been recognized as lemmas but aren't */ private void checkLemmas(){ //{ABZDEFGPGKLOPVRSTAUH} -> {{ABZ}, {DEFG}, {P}, {GKLOPV}, {RST}. {AU}, {H}} ArrayList<ArrayList<String>> subLists = new ArrayList<ArrayList<String>>(); ArrayList<String> temp = new ArrayList<String>(); for(int i = 0; i<lemmas.size(); i++){ String possible = lemmas.get(i).getLexeme().toLowerCase(); if(temp.isEmpty()){ //behöver jag inte ha sen när jag har basfallet! temp.add(possible); } else if(possible.compareTo(temp.get(temp.size()-1))>0){ temp.add(possible); } else{ ArrayList<String> temp2 = new ArrayList<String>(); temp2.addAll(temp); subLists.add(temp2); temp.clear(); temp.add(possible); } } ArrayList<String> temp2 = new ArrayList<String>(); temp2.addAll(temp); subLists.add(temp2); temp.clear(); if(subLists.size() == 1) System.out.println("SOOORTED!!!"); //get the longest substring - {GKLOPV} ArrayList<String> longest = new ArrayList<String>(); for(int i = 0; i<subLists.size(); i++){ if(longest.isEmpty()){ longest.addAll(subLists.get(i)); } else if(subLists.get(i).size() > longest.size()){ longest.clear(); longest.addAll(subLists.get(i)); } } //here we compare our longest substring to the string before and after //we remove the "intruders" //and concat the before-string to the longest, and the after-string to the longest //explanations are to find in the comments. //ex. subLists: {{ABZ}, {DEFG}, {P}, {CKLOPV}, {RST}, {AU}, {H}} while(subLists.size() != 1){ int i = subLists.indexOf(longest); //if longest just contains one element, we don't have a pattern, no use trying, only occurred once in a file of 700 pages. if(longest.size() == 1) break; //if there are sublists left to the longest sublist if(i != 0){ ArrayList<String> before = subLists.get(i-1); //{P}, {CKLOPV} if(before.size() == 1){ //if P < K - remove G if(before.get(0).compareTo(longest.get(1))<0){ longest.remove(0); subLists.get(i).remove(0); longest.addAll(0, before); //I've been having troubled with my variables subLists.get(i).addAll(0, before); subLists.remove(before); } //if P > K - remove P -> {{ABZ}, {DEFG}, {CKLOPV}, {RST}. {AU}} else { subLists.remove(before); } } //{{ABZ}, {DEFG}, {CKLOPV}, {RST}, {AU}, {H}} else{ for(int depth = 1, depth1 = 0; depth <= longest.size() && depth1<longest.size();depth++){ //depth = size of longest or max 4 - max can be changed, but I still haven't encountered enough reasons why if(depth == longest.size()){ depth1++; depth = 1; } if(depth == 4){ depth1++; depth = 1; } //depth1 sets in when depth isn't enough, if depth1 ha passed 2 (can also go deeper if one finds it necessary) //then we just skips the sublist before if(depth1 == 2){ subLists.remove(before); break; } //{DEFG}, {CKLOPV} //if G (in before) < K - remove C -> {DEFGKLOPV} if(depth1<before.size() && before.get(before.size()-(depth1+1)). compareTo(longest.get(depth))<0){ for(int k = 0; k<depth; k++){ longest.remove(0); subLists.get(i).clear(); subLists.get(i).addAll(longest); } //if we would have had to go on depth1 >0 for(int l = 0; l<depth1; l++){ subLists.get(i-1).remove(before.size()-1); } longest.addAll(0, before); subLists.get(i).addAll(0, before); subLists.remove(before); break; } //{ABZ}, {DEFGKLOPRST} //if B < D - remove Z -> {AB} + {DEFGKLOPRST} -> {{ABDEFGKLOPRST}} else if(depth<before.size() && before.get(before.size()-(depth+1)). compareTo(longest.get(depth1))<0){ for(int k = 0; k<depth; k++){ subLists.get(i-1).remove(before.size()-1); } //if we would have had to go on depth1 >0 for(int l = 0; l<depth1; l++){ longest.remove(0); subLists.get(i).clear(); subLists.get(i).addAll(longest); } longest.addAll(0, subLists.get(i-1)); subLists.get(i).addAll(0, subLists.get(i-1)); subLists.remove(before); break; } } } } else{ System.out.println("left side finished: " + subLists); } i = subLists.indexOf(longest); //if there are sublists right to the longest one if(i != subLists.size()-1){ ArrayList<String> after = subLists.get(i+1); //{{ABDEFGKLOPRSTU}, {H}} if(after.size() == 1){ //if H > T - remove U (false) if(after.get(0).compareTo(longest.get(longest.size()-2))>0){ subLists.get(i).remove(longest.size()-1); longest.remove(longest.size()-1); longest.addAll(after); subLists.get(i).addAll(after); subLists.remove(after); } //if H < T - remove H -> {{ABDEFGKLOPRSTU}} else { subLists.remove(after); } } else{ //{{ABZ}, {DEFGKLOPV}, {RST}, {AU}, {H}} for(int depth = 1, depth1 = 0; depth <= longest.size() && depth1<longest.size();depth++){ //depth = size of longest or max 4 - max can be changed, but I still haven't encountered enough reasons why if(depth == longest.size()){ depth1++; depth = 1; } if(depth == 4){ depth1++; depth = 1; } //depth1 sets in when depth isn't enough, if depth1 ha passed 2 (can also go deeper if one finds it necessary) //then we just skips the sublist before if(depth1 == 2){ subLists.remove(after); break; } // {ABDEFGKLOPV}, {RST} //if S > V - remove R (false) // {ABDEFGKLOPRST}, {AU} //if U > T - remove A -> {{ABDEFGKLOPRST} + {U}} -> {{ABDEFGKLOPRSTU}} if(depth<after.size() && after.get(depth). compareTo(longest.get(longest.size()-(depth1+1)))>0){ for(int k = 0; k<depth; k++){ subLists.get(i+1).remove(0); } for(int l = 0; l<depth1; l++){ longest.remove(longest.size()-1); subLists.get(i).clear(); subLists.get(i).addAll(longest); } longest.addAll(after); subLists.get(i).addAll(after); subLists.remove(after); break; } // {DEFGKLOPV}, {RST} //if R > P - remove V -> {DEFGKLOP} + {RST} -> {DEFGKLOPRST} else if(depth1<after.size() && after.get(depth1). compareTo(longest.get(longest.size()-(depth+1)))>0){ for(int k = 0; k<depth; k++){ longest.remove(longest.size()-1); subLists.get(i).clear(); subLists.get(i).addAll(longest); } for(int l = 0; l<depth1; l++){ subLists.get(i+1).remove(0); } longest.addAll(subLists.get(i+1)); subLists.get(i).addAll(subLists.get(i+1)); subLists.remove(subLists.get(i+1)); break; } } } } else{ System.out.println("right side finished " + subLists); } } if(subLists.size() == 1) System.out.println("/while - finished SORTING!"); }![]()
I found a problem in that there are some situations where it is arbitrary as to whether or not which order is intended by the user, here's an example of such a list:
a, c, b, d, e
Is the "b" the intruder, or is the "c"? It's impossible to tell algorithmically, so I have to have a default answer in the case of an ambiguity, because right now my algorithm eliminates both of them.Which one is it you need?
Wow I changed my sig!
Now, you just have to come up with the trickiest questions, don't you
I've been having the same problem, if you look at my algorithm, you might as well put all the checks in a different order and come up with a different result in cases like these.
It's kind of impossible to tell. But as I've said before, it won't be a problem if the algorithm does some mistakes.
Here I'd say that the b is the intruder, I think, if I had to pick one, mostly because I would like to get the most elements as possible as near each other as possible a-cde rather than ab-de. Then if the first element would be waay far of, like - a, p, g, r, s - you would get - a - prs where you might be able to see that the 'a' doesn't fit in either.
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks