Jump to content

Searching program

- - - - -

This topic has been archived. This means that you cannot reply to this topic.
10 replies to this topic

#1
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
I have been assigned a project to make a program that searches through documents , basically it will be like a mini search engine. What i have to do is first
parse an html page (removing all tags and stuff) and save it in a txt file [DONE]
now how will i search through them , for example if 2 txt file contains some same information then how will i make the program to display that data from both the files?
which data structure should i use ? and what can be done to make the searching fast.

#2
alienkinetics

alienkinetics

    Programmer

  • Members
  • PipPipPipPip
  • 154 posts
If this is memory based or are you are going to develop a SQL database?

If this was a database I'd 3 tables:

Files = A table of all filenames (Id: Int, Name: String)
Words = A table of all words found (Id: Int, Word: String)
Links = Position of words found in files (Id: Int, FileId: Int, WordId: Int, Position: Int)

I'd find all words and for each word I would store the filename and location where that word was found.

The next problems will be writing the SQL if you are looking to support boolean searchs. ie: "apple orange", "apple" and "orange", "apple" or "orange".

here is some pascal code that generates a very basic search database. It would't be hard to extend. The HTML parser is very basic. It should include attribute parsing.


program Project;


{$APPTYPE CONSOLE}


uses SysUtils, Classes;


procedure Parse(const AFileName: TFileName; const ADataBase: TStrings);

var

  LStart, P, LWord: PChar;

  LHTML: String;

  S: String;

begin

  with TMemoryStream.Create do

  try

    LoadFromFile(AFileName);

    SetString(LHTML, PChar(Memory), Size);

    LStart := PChar(LHTML);

    P := LStart;

    while P[0] <> #0 do begin

      while P[0] in [#1..#32] do Inc(P);

      if P[0] = '<' then begin

        repeat Inc(P) until P[0] in ['>', #0];

        if P[0] <> #0 then Inc(P);

      end else

      if P[0] in ['a'..'z', 'A'..'Z'] then begin

        LWord := P;

        repeat Inc(P) until not(P[0] in ['a'..'z', 'A'..'Z', '''']);

        SetString(S, LWord, P - LWord);

        ADataBase.Values[S] := ADataBase.Values[S] +

          '(' + AFileName + ':' + IntToStr(LWord-LStart) + ')';

      end;

    end;

  finally

    Free;

  end;

end;


procedure ScanFiles(const APath: TFileName; const ADataBase: TStrings);

var LSearchRec: TSearchRec;

begin

  if FindFirst(APath, faAnyFile, LSearchRec) = 0 then try

    repeat

      WriteLn(LSearchRec.Name);

      Parse(LSearchRec.Name, ADataBase);

    until FindNext(LSearchRec) <> 0;

  finally

    FindClose(LSearchRec);

  end;

end;

The code generates this:

apple=(f1.html:6)

orange=(f1.html:17)(f2.html:16)

pear=(f1.html:29)(f2.html:28)

here=(f2.html:6)

So for any word, I know the file and location. You can see why you would need a filename table to reduce the number of times the file names are referenced.

#3
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
thanks alot alienkinetics for the information you gave i have another question as i want it to be a memory based ,so what if i want to use a data structure that gets populated from a text file of the information given in it?cause that is how searching algo thingy will work :/ , correct me if wrong

#4
alienkinetics

alienkinetics

    Programmer

  • Members
  • PipPipPipPip
  • 154 posts
Just use XML. hand code the XML Format that you think will handle your requirements. Load it into Java using what ever XML api you prefer.

Java API for XML Processing - Wikipedia, the free encyclopedia

Code your search algorithm. If you have any speed issues, then store all the <word> nodes in a HashTable. Unless you want to support partial searches, in which case you will have to use a sorted string array.

<data>

<files>

<file id="0" name="index.html"/>

<file id="1" name="about.html"/>

<file id="2" name="home.html"/>

<file id="3" name="contacts.html"/>

</files>

<words>

<word text="apple">

<link fileid="0" position="10"/>

<link fileid="1" position="3"/>

<link fileid="2" position="54"/>

</word>

<word text="orange">

<link fileid="2" position="5"/>

<link fileid="3" position="4"/>

</word>

</words>

</data>


#5
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
**** this seems to advance for me :( , the XML thingy :(

#6
alienkinetics

alienkinetics

    Programmer

  • Members
  • PipPipPipPip
  • 154 posts
Use a simple file structure then. something like INI will work. but, then you need to have your data structure defined first.

Start defining your internat data structures and im sure someone will help you write a file loader for them

#7
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
thanks again :) , some questions if you can clear out .
as in your first post you mentioned 3 tables will be used , if i use 2 Hashtables( for files and words) , then how will i retrieve data from it? plus what can be done for the 3rd one(Links)?

#8
alienkinetics

alienkinetics

    Programmer

  • Members
  • PipPipPipPip
  • 154 posts
To handle "one-to-many" with a database you tend to need that 3rd link file.
But, if you are memory based you dont have to.

The files can be stored in an ArrayList. the words should be in a HashTable so you get get details from a word quickly.

Each word should contain an array of links which provide a file and file position where the word was found.

I would populate the data structure via java code until im happy with the design, then I'd look into populating it from a text file. It shoudn't be that hard to have a simple text file format.

#9
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
then for the links , if i make my own class , will that be suitable?or can you save an object in a list ?

#10
alienkinetics

alienkinetics

    Programmer

  • Members
  • PipPipPipPip
  • 154 posts

Quote

...if i make my own class , will that be suitable?

If it works and you are comfortable with it, then its suitable.

There is no right answer when designing data structures. I could propose a structure, but mine would be one of a bunch of possiblities.

#11
ahmed

ahmed

    Programming Professional

  • Members
  • PipPipPipPipPip
  • 304 posts
lol its fine i might get some ideas :)