Counting instances of words in a file, best way?


#1
Root23

    Programmer

  • Members
  • 144 posts
I'm working on a homework assignment that wants us to add all the Java keywords to a HashSet, and then go through a Java file and count how many times a keyword appears in it.

I'm just trying to think of the best way to go about counting how many times 50-ish different words appear in a document.

I found an example on stackoverflow that was looking for a single word in a document, and it used something like this:
for (String word : words) {
    if (word.equals(stringSearch)) {
        System.out.println("Word was found at position " + indexfound + " on line " + linecount);
    }
}
Is there a similar way to do something like that, but compare the word in the document against every word in the HashSet? I don't need to count the occurrences of each word separately; as long as the word in the doc is any of the keywords, I can increment the counter.

Any help would be appreciated. This homework assignment kinda has me blindsided because we haven't really done anything that involved iterating through a document. So I'm trying to look up all the info on reading a document, and at the same time figure out my logic for testing each word against the list of words in the set.

All I have so far is an array of the words, then added those to a hashset, and created a variable for the file name (which is passed via command line).

Thanks for any help/insight.

#2
gregwarner

    Programming God

  • Members
  • 853 posts
  • Location: Arkansas
You shouldn't need to compare each word in the document with every word in the hashset. That wastes time. The whole point of a hash set is constant-time lookup (on average): each word is "hashed" to a position in the table, so checking whether it is there doesn't depend on how many words you stored. If you use a hash table instead (a HashMap in Java), you can also store a value at that position. My suggestion would be to store an integer there to count how many times that word has appeared. The pseudocode would look like this:

create a hash table of integers.
loop for every "word" in the file:
    look up the hash key "word" in the hash table.
    if the hash key exists,
        retrieve that value and increment it by one.
        store that value back in the hash table at the same hash key: "word".
    else (the hash key doesn't exist),
        create a new value at the hash key: "word" and set it to 1.
end loop

At the end of your program, you can retrieve a list of all the keys in the hash table, and look up their values one by one, which will give you the count of each word. Hash tables make things simple! :)
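For what it's worth, here is a minimal Java sketch of that pseudocode. It assumes a java.util.HashMap<String, Integer> as the hash table and splits on plain whitespace; the names WordCounts and countWords are made up for illustration, and it counts every word, so for the keyword assignment you would still filter against your keyword set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class WordCounts {
    // Counts how many times each whitespace-separated word appears in the text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer parser = new StringTokenizer(text); // default delimiters: " \t\n\r\f"
        while (parser.hasMoreTokens()) {
            String word = parser.nextToken();
            Integer current = counts.get(word);   // look up the hash key "word"
            if (current != null) {
                counts.put(word, current + 1);    // key exists: increment and store back
            } else {
                counts.put(word, 1);              // key doesn't exist: start at 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("int x = 0 ; int y = x ;");
        // Walk all keys and print the count stored for each one.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Note that a HashMap doesn't promise any particular order when you walk its keys, so the printed lines can come out in any order.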
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

– Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid


#3
Root23

Thanks for the info. You did a good job of giving me a pretty clear direction to take. I'll have to dig through my book to figure out how to do a few things, but I should be good to go after that.

(I've only had to iterate through a text file once before, in my first Java class [I'm almost done with my second Java class now], so needless to say I don't remember how!)


#4
Root23

Just to get something complete, I worked off the little I had already done, i.e. left it as a HashSet. I'm going to go through and redo this using a hash table though, just for the extra practice.

I've got that working, except for one thing I'm curious about:
StringTokenizer parser = new StringTokenizer(currentLine, " \t\n\r\f.,;:!?'");
That basically has \t, \n, \r, \f, . , ; : ! ? and ', but not ", because a bare " gets seen as the end of the string. So when I do put in the closing ", the compiler throws an 'unclosed string literal' error.
How can I solve that problem?

#5
lethalwire

    while(false){ ... }

  • Members
  • 748 posts
  • Programming Language: Java, PHP
  • Learning: Java, PHP
StringTokenizer parser = new StringTokenizer(currentLine, " \t\n\r\f.,;:!?'\"");
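The backslash-escaped quote \" is what lets a double-quote character sit inside the delimiter string without ending the literal. A rough sketch of how that line might plug into the HashSet check follows; KeywordTally, countKeywords and the three-word keyword set are assumptions for illustration, not the actual homework code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

class KeywordTally {
    // Same delimiters as above; \" is an escaped double quote inside the literal.
    static final String DELIMITERS = " \t\n\r\f.,;:!?'\"";

    // Returns how many tokens on the line are members of the keyword set.
    static int countKeywords(String currentLine, Set<String> keywords) {
        int count = 0;
        StringTokenizer parser = new StringTokenizer(currentLine, DELIMITERS);
        while (parser.hasMoreTokens()) {
            if (keywords.contains(parser.nextToken())) { // one hash lookup per token
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A tiny illustrative subset of the Java keywords.
        Set<String> keywords = new HashSet<>(Arrays.asList("if", "return", "int"));
        System.out.println(countKeywords("if (x) return \"int\";", keywords));
    }
}
```

With these delimiters the quotes are stripped off, so a keyword sitting inside a string literal still gets counted; parentheses and braces are not delimiters either, so a token like (x) stays glued together.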


#6
Root23

Thanks for pointing out the obvious... ha.

+rep to you guys.

I needed to fix the delimiters, since I'm guessing I'm supposed to count just the real keywords that are used in the file, and not any word that happens to match a keyword, say if a keyword appeared inside a string but wasn't used as a keyword. Works well now.
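One possible way to skip keywords that only appear inside string literals is sketched below. This is an assumption about how it could be done, not necessarily the change made here: it blanks out double-quoted literals with a regex before tokenizing, and the class name, method name and extended delimiter list are all hypothetical. It does not handle comments or char literals.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

class RealKeywordCounter {
    // Matches a double-quoted literal, allowing backslash escapes such as \" inside it.
    static final String STRING_LITERAL = "\"(?:\\\\.|[^\"\\\\])*\"";

    // Hypothetical delimiter list: whitespace plus common punctuation and operators.
    static final String DELIMITERS = " \t\n\r\f.,;:!?'\"(){}[]=+-*/<>&|";

    static int countRealKeywords(String currentLine, Set<String> keywords) {
        // Replace each string literal with a space so its contents are never tokenized.
        String stripped = currentLine.replaceAll(STRING_LITERAL, " ");
        int count = 0;
        StringTokenizer parser = new StringTokenizer(stripped, DELIMITERS);
        while (parser.hasMoreTokens()) {
            if (keywords.contains(parser.nextToken())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("if", "while", "return"));
        String line = "String s = \"if while\"; if (s != null) return;";
        System.out.println(countRealKeywords(line, keywords));
    }
}
```

Because this works one line at a time, it assumes a string literal never spans lines, which holds for ordinary Java string literals.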

Edited by Root23, 26 April 2011 - 05:48 PM.




