Jump to content

"Intruders" in supposed to be list in alphabetic order

- - - - -

This topic has been archived. This means that you cannot reply to this topic.
8 replies to this topic

#1
Marion

Marion

    Newbie

  • Members
  • Pip
  • 6 posts
(don't know if I rather should post this under general programming.. language is java anyway)
Hi,

I'm writing my master thesis on document layout recognition and I'm writing a program on recognizing dictionary-pages.

I have a list of what could be lemmas (~the "lookup word"). {cab, calculator, car, cat}.

The list is supposed to be sorted in alphabetic order. But sometimes it isn't and I've gotten a word/words in to the list, which isn't supposed to be a lemma, that is, something that has fulfilled all the criterias of being a lemma, but in reality isn't, has snuck in: {cab, calculator, cast, car, call, cat}.

Now, I have to find the word that has been recognized as a lemma but isn't ('cause it doesn't follow the alphabetic order).
I kind of need to find the path most of the words are following and then remove the outcasts.

My main approach is to get sublists and find the part of the list where the faulty word is but then I kind of get stuck.

Please, please help, or comment if I haven't explained good enough. I'm going crazy here!

Somehow I imagine this would be great to solve with recursion - something like this:

/**
 * check if the lemmas are listed in alphabetic order. If not, remove fake lemmas.
 * Sorry for the lack of understandable code
 * @param lemmings
 * @param finished
 * @return correct lemma-list
 */
private List<String> checkingLemmas(List<String> lemmings, ArrayList<String> finished){
	if(isSorted(lemmings))
		return lemmings;
	if(lemmings.size() == 2){
		//what to do here?
		// I don't have anything to compare to anymore so I don't know which of the two to remove.
		//went too far...
	 			
	}
	if(lemmings.size() == 3){
		ArrayList<String> x = new ArrayList<String>();
		x.add(lemmings.get(0));
		x.add(lemmings.get(lemmings.size()-1));
	}
 	List<String> list1 = lemmings.subList(0, (lemmings.size()/2));
	List<String> list2 = lemmings.subList((lemmings.size()/2), (lemmings.size()-1));
 	if(isSorted(list1)){
 			finished.addAll(list1);
 			Collections.sort(finished);
	}
	else if(!isSorted(list1)){
		finished.addAll(checkingLemmas(list1, finished));
	}
	if(isSorted(list2)){
		finished.addAll(list2);
		Collections.sort(finished);
	}
	else if(!isSorted(list2)){
		finished.addAll(checkingLemmas(list2, finished));
	}
	return finished;
}

private boolean isSorted(List<String> list){
	 		ArrayList<String> sorted = new ArrayList<String>();
	 		sorted.addAll(list);
	 		Collections.sort(sorted);
	 		if(list.equals(sorted)){
	 			return true;
	 		}
	 		return false;
	 		
	 	}


#2
wim DC

wim DC

    Writes binary right handed and hex left handed

  • Members
  • PipPipPipPipPipPipPipPipPip
  • 2,084 posts
Wouldn't it be easier to use the String.compareTo method? It return 0 if 2 strings are equal, <0 when string 1 is 'smaller than string 2 and >0 when string 1 is 'greater than string 2:
String a="at";
String b="bath";
a.compareTo(b)  --> result is <0
b.compareTo(a)  --> result is >0

example program:
import java.util.ArrayList;

public class Main {
    public static void main(String[] arg)
    {
        Main main = new Main();
    }

    public Main(){
        ArrayList<String> list = new ArrayList<String>();
        list.add("cat");
        list.add("car");
        list.add("caz");
        list.add("cc");
        list.add("cb");
        list.add("ccc");
        list.add("cd");
        list.add("cef");
        list.add("cff");
        list.add("cee");
        list.add("cgg");


        list = removeIntruders(list);
        for(String item : list){
            System.out.println(item);
        }
    }

    private static ArrayList<String> removeIntruders(ArrayList<String> list) {
        for(int i=0 ; i<list.size()-1; i++){
            String first = list.get(i);
            String second = list.get(i+1);
            if(first.compareTo(second)>0)
                list.remove(i+1);
        }

        return list;
    }
}
output of this program:
cat
caz
cc
ccc
cd
cef
cff
cgg


I do have a question however. If the correct list = {a,b,c,d,e,f,g} and with the intruder it's {a,b,z,c,d,e,f,g} then you would always end up with {a,b,z} as result as until there it follow the alphabetical order... if this is possible then i think it becomes a bit more difficult to catch the intruder.

#3
Marion

Marion

    Newbie

  • Members
  • Pip
  • 6 posts

Quote

I do have a question however. If the correct list = {a,b,c,d,e,f,g} and with the intruder it's {a,b,z,c,d,e,f,g} then you would always end up with {a,b,z} as result as until there it follow the alphabetical order... if this is possible then i think it becomes a bit more difficult to catch the intruder.

Exactly that is my problem.

Believe me, I've tried your method, but I was too tired to think of the compare to-method (thanks btw :thumbup1:), and I have used my isSorted instead, but it doesn't matter, as long as I only compare the element with the preceeding elements in the list, I will always get stuck there. So I have to make the method look at the whole list and decide, which element doesn't fit in.

I've been thinking about, assuming that the first and the last elements are correct and then just check, if everything in between is alphabetically ordered.

Note, that the method doesn't have to be a hundred percent fool proof. I know that I will have a marginal at the end. But I'd like to make it pretty, and by only comparing to preceeding elements, I would sometimes cut off the list waay to early and the correctness would be far worse than without the check.

#4
wim DC

wim DC

    Writes binary right handed and hex left handed

  • Members
  • PipPipPipPipPipPipPipPipPip
  • 2,084 posts
hmm, and what if you go backwards trough the list...
{a,b,z,c,d,e,f,g}
would become {a,b,c,d,e,f,g}...

{a,b,a,c,d,e,f,g} --> {a,c,d,e,f,g}

So going trough the list forward and backwards, look which of the 2 has most elements and take that one will solve it..... when it contains 1 error. I got to think a bit more for multiple errors ;)

#5
Marion

Marion

    Newbie

  • Members
  • Pip
  • 6 posts

Quote

got to think a bit more for multiple errors

Yup, multipe errors there are.

#6
ZekeDragon

ZekeDragon

    Writes binary right handed and hex left handed

  • Moderators
  • 2,103 posts
I wrote an algorithm that appears to accomplish what you're trying to do. The way it does this is by evaluating an arbitrary list of words and giving it a "score" of how ordered it is as a double between 0 and 1. It's basically determined by how much of a percentage of the terms are in order (IE followed by a lemma), vs those that are not. It keeps that number as a baseline, and then removes each word and determines it's percentage. These percentages are compared, and any word removal that results in an overall increase in the order percent will be removed, the ones that result in a decrease remain. This is done for each word, and it seems to remove the intruders pretty well from my rudimentary experiments. I'll throw together an applet for it to let peoples experiment, because I don't think this is the right algorithm, to me there's still problems. Anyway, that's an idea, I'll post it soon.
Wow I changed my sig!

#7
Marion

Marion

    Newbie

  • Members
  • Pip
  • 6 posts

Quote

Anyway, that's an idea, I'll post it soon.
Thanx ZekeDragon, it would be great if you'd post it when it's finished.

Anyway, I have found one (giant) solution now. The method is huge but it does miracles and I have still not been able to find any faults in it. I'm not great in programming with low complexity (I'm actually also quite lousy on figuring the complexity out, so I won't tell you if it's fast or not) but fact is, that I'm trying this solution on about 1200 pages of dictionary-pages and it works like a charm.

The idea is to take out the longest sorted substring of the list and then take it from there.

I think the comments should explain the whole process

/**
 * Check if the lemmas are in alphabetic order, remove the words that have been recognized as lemmas but aren't
 */
private void checkLemmas(){
	//{ABZDEFGPGKLOPVRSTAUH} -> {{ABZ}, {DEFG}, {P}, {GKLOPV}, {RST}. {AU}, {H}}
	ArrayList<ArrayList<String>> subLists = new ArrayList<ArrayList<String>>();
	ArrayList<String> temp = new ArrayList<String>();
	for(int i = 0; i<lemmas.size(); i++){
		String possible = lemmas.get(i).getLexeme().toLowerCase(); 
		if(temp.isEmpty()){		//behöver jag inte ha sen när jag har basfallet!
			temp.add(possible);
		}
		else if(possible.compareTo(temp.get(temp.size()-1))>0){
			temp.add(possible);
		}
		else{
			ArrayList<String> temp2 = new ArrayList<String>();
			temp2.addAll(temp);
			subLists.add(temp2);
			temp.clear();
			temp.add(possible);
		}
	}
	ArrayList<String> temp2 = new ArrayList<String>();
	temp2.addAll(temp);
        subLists.add(temp2);
	temp.clear();
	if(subLists.size() == 1)
		System.out.println("SOOORTED!!!");
			
	 		
	//get the longest substring - {GKLOPV}
	ArrayList<String> longest = new ArrayList<String>();
	for(int i = 0; i<subLists.size(); i++){
		if(longest.isEmpty()){
			longest.addAll(subLists.get(i));
		}
		else if(subLists.get(i).size() > longest.size()){
			longest.clear();
			longest.addAll(subLists.get(i));
		}
	}
	 		
	//here we compare our longest substring to the string before and after
	//we remove the "intruders"
	//and concat the before-string to the longest, and the after-string to the longest
	//explanations are to find in the comments. 
	//ex. subLists: {{ABZ}, {DEFG}, {P}, {CKLOPV}, {RST}, {AU}, {H}}
	while(subLists.size() != 1){
		int i = subLists.indexOf(longest);
		//if longest just contains one element, we don't have a pattern, no use trying, only occurred once in a file of 700 pages.
	 	if(longest.size() == 1)
	 		break;
	 	
	 	//if there are sublists left to the longest sublist
	 		if(i != 0){
	 			ArrayList<String> before = subLists.get(i-1);
	 			//{P}, {CKLOPV}
	 			if(before.size() == 1){
	 				//if P < K - remove G
	 				if(before.get(0).compareTo(longest.get(1))<0){
	 					longest.remove(0);
						subLists.get(i).remove(0);
						longest.addAll(0, before);
						//I've been having troubled with my variables
						subLists.get(i).addAll(0, before);
						subLists.remove(before);
	 				}
	 				//if P > K - remove P -> {{ABZ}, {DEFG}, {CKLOPV}, {RST}. {AU}}
	 				else {
	 					subLists.remove(before);
	 				}
	 			}
	 			//{{ABZ}, {DEFG}, {CKLOPV}, {RST}, {AU}, {H}}
	 			else{
	 				for(int depth = 1, depth1 = 0; depth <= longest.size() 
                                                     && depth1<longest.size();depth++){
	 					//depth = size of longest or max 4 - max can be changed, but I still haven't encountered enough reasons why
	 					if(depth == longest.size()){
	 						depth1++;
	 						depth = 1;
	 					}	
	 					if(depth == 4){
	 						depth1++;
	 						depth = 1;
	 					}	
	 					//depth1 sets in when depth isn't enough, if depth1 ha passed 2 (can also go deeper if one finds it necessary)
	 					//then we just skips the sublist before
	 					if(depth1 == 2){
	 					        subLists.remove(before);
	 						break;
	 					}
	 							
	 					//{DEFG}, {CKLOPV}
	 							
	 					//if G (in before) < K - remove C -> {DEFGKLOPV}
	 					if(depth1<before.size() && before.get(before.size()-(depth1+1)).
                                                   compareTo(longest.get(depth))<0){
	 					       
                                                        for(int k = 0; k<depth; k++){
	 						        longest.remove(0);
	 							subLists.get(i).clear();
	 							subLists.get(i).addAll(longest);
	 						}
	 						//if we would have had to go on depth1 >0
	 						for(int l = 0; l<depth1; l++){
	 						       subLists.get(i-1).remove(before.size()-1);
	 						}
	 						longest.addAll(0, before);
	 						subLists.get(i).addAll(0, before);
	 						subLists.remove(before);
	 						break;
	 					}
	 					//{ABZ}, {DEFGKLOPRST}
	 							
	 					//if B < D - remove Z -> {AB} + {DEFGKLOPRST} -> {{ABDEFGKLOPRST}}
	 					else if(depth<before.size() && before.get(before.size()-(depth+1)).
                                                             compareTo(longest.get(depth1))<0){
	 						for(int k = 0; k<depth; k++){
	 							subLists.get(i-1).remove(before.size()-1);
	 						}
	 						//if we would have had to go on depth1 >0
	 						for(int l = 0; l<depth1; l++){
	 							longest.remove(0);
	 							subLists.get(i).clear();
	 							subLists.get(i).addAll(longest);
	 						}
	 						longest.addAll(0, subLists.get(i-1));
	 						subLists.get(i).addAll(0, subLists.get(i-1));
	 						subLists.remove(before);
	 						break;
	 					}		
	 				}
	
	 			}
	 		}
	 		else{
	 			System.out.println("left side finished: " + subLists);
	 		}
	 		i = subLists.indexOf(longest);
	 		//if there are sublists right to the longest one
	 		if(i != subLists.size()-1){
	 			ArrayList<String> after = subLists.get(i+1);
	 			//{{ABDEFGKLOPRSTU}, {H}}
	 			if(after.size() == 1){
	 			//if H > T - remove U (false)
	 			if(after.get(0).compareTo(longest.get(longest.size()-2))>0){
		 			subLists.get(i).remove(longest.size()-1);
	 				longest.remove(longest.size()-1);
		 			longest.addAll(after);
		 			subLists.get(i).addAll(after);
		 			subLists.remove(after);
	 			}
	 			//if H < T - remove H -> {{ABDEFGKLOPRSTU}}
	 			else {
	 				subLists.remove(after);
	 			}
	 		}
	 		else{
	 		    //{{ABZ}, {DEFGKLOPV}, {RST}, {AU}, {H}}
	 			for(int depth = 1, depth1 = 0; depth <= longest.size() && depth1<longest.size();depth++){
					//depth = size of longest or max 4 - max can be changed, but I still haven't encountered enough reasons why
					if(depth == longest.size()){
						depth1++;
						depth = 1;
					}	
					if(depth == 4){
						depth1++;
						depth = 1;
					}	
					//depth1 sets in when depth isn't enough, if depth1 ha passed 2 (can also go deeper if one finds it necessary)
					//then we just skips the sublist before
					if(depth1 == 2){
						subLists.remove(after);
						break;
					}
            				// {ABDEFGKLOPV}, {RST}
	 				//if S > V - remove R (false)
	 					
	          			// {ABDEFGKLOPRST}, {AU}
	 				//if U > T - remove A -> {{ABDEFGKLOPRST} + {U}} -> {{ABDEFGKLOPRSTU}}
	 				if(depth<after.size() && after.get(depth).
                                                compareTo(longest.get(longest.size()-(depth1+1)))>0){
	 					for(int k = 0; k<depth; k++){
	 						subLists.get(i+1).remove(0);
	 					}
	 					for(int l = 0; l<depth1; l++){
	 						longest.remove(longest.size()-1);
	 						subLists.get(i).clear();
	 						subLists.get(i).addAll(longest);
	 					}
	 					longest.addAll(after);
	 					subLists.get(i).addAll(after);
	 					subLists.remove(after);
	 					break;
	 				}
	 				// {DEFGKLOPV}, {RST}
	 				//if R > P - remove V -> {DEFGKLOP} + {RST} -> {DEFGKLOPRST}
	 				else if(depth1<after.size() && after.get(depth1).
                                                     compareTo(longest.get(longest.size()-(depth+1)))>0){
	 					for(int k = 0; k<depth; k++){
	 						longest.remove(longest.size()-1);
	 						subLists.get(i).clear();
	 						subLists.get(i).addAll(longest);
	 					}
	 					for(int l = 0; l<depth1; l++){
	 						subLists.get(i+1).remove(0);
	 					}
	 					longest.addAll(subLists.get(i+1));
	 					subLists.get(i).addAll(subLists.get(i+1));
	 					subLists.remove(subLists.get(i+1));
	 					break;
	 				}		
	 			}
	
	 		}
	 	}
	 	else{
	 		System.out.println("right side finished " + subLists);
	 	}
	 				
	 }
	 if(subLists.size() == 1)
	 	System.out.println("/while - finished SORTING!");

}

If anyone has an idea for making it even more effective and great, please tell :)

#8
ZekeDragon

ZekeDragon

    Writes binary right handed and hex left handed

  • Moderators
  • 2,103 posts
I found a problem in that there are some situations where it is arbitrary as to whether or not which order is intended by the user, here's an example of such a list:

a, c, b, d, e

Is the "b" the intruder, or is the "c"? It's impossible to tell algorithmically, so I have to have a default answer in the case of an ambiguity, because right now my algorithm eliminates both of them. :P Which one is it you need?
Wow I changed my sig!

#9
Marion

Marion

    Newbie

  • Members
  • Pip
  • 6 posts
Now, you just have to come up with the trickiest questions, don't you :P

I've been having the same problem, if you look at my algorithm, you might as well put all the checks in a different order and come up with a different result in cases like these.

It's kind of impossible to tell. But as I've said before, it won't be a problem if the algorithm does some mistakes.

Here I'd say that the b is the intruder, I think:confused:, if I had to pick one, mostly because I would like to get the most elements as possible as near each other as possible a-cde rather than ab-de. Then if the first element would be waay far of, like - a, p, g, r, s - you would get - a - prs where you might be able to see that the 'a' doesn't fit in either.