Java scanning a text file


#1
shabu

Hello,

I'm scanning a text file and creating a Word object for each word in it. I set a custom delimiter on my Scanner because my 'rules' for what counts as a word differ from the default. That part works, and I now have an ArrayList of Word objects.

My problem is with the next step. I need to rescan the file and, whenever I encounter one of the words from my list, store the paragraph number and line number of that occurrence in the Word object. I already have a method to do that.

To be more specific, I can't figure out how to set up my counters for the line number and paragraph number. Both are meant to start at 1, but the line number needs to reset back to 1 each time I get to a new paragraph. Paragraphs are separated by one or more blank lines. Also, the text file is in UNIX format.

Here is my code which first scans in all the words and adds them to the list, then removes the duplicates:

File file = new File("prog4.dat");
Scanner scanInput = new Scanner(new FileReader(file));
scanInput.useDelimiter("[^a-zA-Z0-9\\-\']+");
ArrayList<Word> words = new ArrayList<Word>();

Word theWord;
while(scanInput.hasNext()){
    String next = scanInput.next();
    theWord = new Word(next);
    words.add(theWord);
}

for(int i = 0; i < words.size(); i++){
    for(int j = (i+1); j < words.size(); j++){
        if (words.get(i).getWord().equalsIgnoreCase(words.get(j).getWord()))
            words.remove(words.get(j));
    }
}


Here is my flawed code (where I try to add the paragraph-line pairs to each word):

Scanner scanInput2 = new Scanner(new FileReader(file));
scanInput2.useDelimiter("[^a-zA-Z0-9\\-\']+");

String word;
int parNum = 1;
int lineNum = 1;
while (scanInput2.hasNextLine()){
    word = scanInput2.next();
    String line = scanInput2.nextLine();
    for (Word w: words){
        if (line.contains(word)){
            w.addPair(parNum, lineNum);
        }
    }
    lineNum++;
}

I'm not even sure where to increment/how to increment the paragraph count.

Any help with this would be greatly appreciated!

Thank you!

#2
ZekeDragon

First off, since you're removing duplicates from the ArrayList by hand, you may be better off simply using a Set instead; if you really need the result as a list, pass the Set to ArrayList's Collection constructor. This way, you don't have to write any of the removal code.
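
As a rough sketch of the Set idea: this works on plain Strings, since the Word class isn't shown in the thread, and the class and method names are made up for illustration. A TreeSet built with String.CASE_INSENSITIVE_ORDER treats "The" and "the" as the same entry, which matches the equalsIgnoreCase comparison in the original loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class UniqueWords {
    // Hypothetical helper: drops duplicates while ignoring case.
    static List<String> uniqueIgnoringCase(List<String> rawWords) {
        // The comparator decides what counts as "the same word" for the set.
        Set<String> unique = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        unique.addAll(rawWords);
        // ArrayList's Collection constructor copies the set into a list.
        return new ArrayList<String>(unique);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The", "cat", "the", "Cat", "sat");
        System.out.println(uniqueIgnoringCase(raw)); // prints [cat, sat, The]
    }
}
```

Note that a TreeSet hands the words back in sorted order rather than file order. If file order matters, a LinkedHashSet of lower-cased words is one alternative.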

shabu said:

Both are meant to start at 1, but the line number needs to reset back to 1 each time I get to a new paragraph. Paragraphs are separated by one or more blank lines.
I'm interpreting this to mean that a paragraph is defined as a section of text that is delimited by two or more \n characters in succession, where there can be any amount of whitespace between the two \n characters. So first you must come up with an algorithm to separate these paragraphs, before we even worry about counting lines and such. Here, just use a simple ArrayList<String> to store the paragraphs, and separate them in a method.
    // Needs java.util.ArrayList, java.util.Arrays and java.util.Iterator.
    public ArrayList<String> separateParagraphs(String contents)
    {
        // Split wherever a newline is followed by one or more blank
        // (or whitespace-only) lines, then build an ArrayList from the array.
        ArrayList<String> retLst = new ArrayList<String>(
                Arrays.asList(contents.split("\\n(?:[ \\t\\r]*\\n)+")));
        // Eliminate any empty entries (e.g. from blank lines at the start of the file)
        for (Iterator<String> iter = retLst.iterator(); iter.hasNext(); )
        {
            if (iter.next().trim().isEmpty())
            {
                iter.remove();
            }
        }
        return retLst;
    }
Now we'll separate each paragraph into its lines, indicated by single \n characters.
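// Needs java.util.ArrayList and java.util.Arrays.
public ArrayList<String> separateLines(String paragraph)
{
    // split() returns a String[], so wrap it with Arrays.asList first.
    return new ArrayList<String>(Arrays.asList(paragraph.split("\n")));
}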
Then, as you're looking through each line, you already know the paragraph and line numbers simply from your loop indices.
// Needs java.util.ArrayList, java.util.Arrays, java.util.HashSet and java.util.Set.
// paragraphs is the ArrayList<String> returned by separateParagraphs.
for (int i = 0; i < paragraphs.size(); ++i)
{
    ArrayList<String> lines = separateLines(paragraphs.get(i));
    for (int j = 0; j < lines.size(); ++j)
    {
        Set<String> words = new HashSet<String>(Arrays.asList(lines.get(j).split(" ")));
        words.remove("");
        // Here you'd go through each word in words (not repeated as they're in a set)
        // and you now know the paragraph number is i + 1 and the line number is j + 1.
    }
}

None of this code is tested, but it should work...
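
For comparison, here is a minimal sketch of a different approach to the same counting problem: a single pass over the lines with two counters, instead of splitting into paragraphs first. It assumes a few things not in the thread. The names WordLocator and locate are made up, and because the Word class and its addPair method aren't shown, a Map from lower-cased word to {paragraph, line} pairs stands in for them. The word rule is copied from the Scanner delimiter in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WordLocator {
    // Same word rule as the delimiter in the question: letters, digits, '-' and '\''.
    private static final String NON_WORD = "[^a-zA-Z0-9\\-']+";

    // Hypothetical stand-in for Word.addPair: maps each lower-cased word to
    // its {paragraph, line} pairs. Both numbers start at 1.
    static Map<String, List<int[]>> locate(List<String> lines) {
        Map<String, List<int[]>> found = new LinkedHashMap<String, List<int[]>>();
        int parNum = 0;              // no paragraph seen yet
        int lineNum = 0;
        boolean inParagraph = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                inParagraph = false; // any run of blank lines ends the paragraph
                continue;
            }
            if (!inParagraph) {      // first non-blank line after a gap
                inParagraph = true;
                parNum++;
                lineNum = 0;         // line numbering restarts in each paragraph
            }
            lineNum++;
            for (String token : line.split(NON_WORD)) {
                if (token.isEmpty()) continue; // split yields "" if the line starts with a delimiter
                String key = token.toLowerCase();
                List<int[]> pairs = found.get(key);
                if (pairs == null) {
                    pairs = new ArrayList<int[]>();
                    found.put(key, pairs);
                }
                pairs.add(new int[] { parNum, lineNum });
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "on the mat", "", "", "The end");
        for (int[] pair : locate(lines).get("the")) {
            System.out.println("the: paragraph " + pair[0] + ", line " + pair[1]);
        }
        // prints:
        // the: paragraph 1, line 1
        // the: paragraph 1, line 2
        // the: paragraph 2, line 1
    }
}
```

For a real file, the list of lines could come from java.nio.file.Files.readAllLines. A word that appears twice on one line is recorded twice here; whether that is wanted depends on the assignment.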