Closed Thread
Page 1 of 2 12 LastLast
Results 1 to 10 of 11

Thread: Searching program

  1. #1
    ahmed is offline Programming Professional
    Join Date
    Oct 2008
    Posts
    300
    Rep Power
    0

    Searching program

    I have been assigned a project to make a program that searches through documents , basically it will be like a mini search engine. What i have to do is first
    parse an html page (removing all tags and stuff) and save it in a txt file [DONE]
    now how will i search through them , for example if 2 txt file contains some same information then how will i make the program to display that data from both the files?
    which data structure should i use ? and what can be done to make the searching fast.

  2. CODECALL Circuit advertisement
    Join Date
    Always
    Posts
    Many

     
  3. #2
    alienkinetics's Avatar
    alienkinetics is offline Programmer
    Join Date
    Feb 2010
    Location
    Australia
    Posts
    154
    Rep Power
    0

    Re: Searching program

    If this is memory based or are you are going to develop a SQL database?

    If this was a database I'd 3 tables:

    Files = A table of all filenames (Id: Int, Name: String)
    Words = A table of all words found (Id: Int, Word: String)
    Links = Position of words found in files (Id: Int, FileId: Int, WordId: Int, Position: Int)

    I'd find all words and for each word I would store the filename and location where that word was found.

    The next problems will be writing the SQL if you are looking to support boolean searchs. ie: "apple orange", "apple" and "orange", "apple" or "orange".

    here is some pascal code that generates a very basic search database. It would't be hard to extend. The HTML parser is very basic. It should include attribute parsing.

    Code:
    program Project;
    
    {$APPTYPE CONSOLE}
    
    uses SysUtils, Classes;
    
    procedure Parse(const AFileName: TFileName; const ADataBase: TStrings);
    var
      LStart, P, LWord: PChar;
      LHTML: String;
      S: String;
    begin
      with TMemoryStream.Create do
      try
        LoadFromFile(AFileName);
        SetString(LHTML, PChar(Memory), Size);
        LStart := PChar(LHTML);
        P := LStart;
        while P[0] <> #0 do begin
          while P[0] in [#1..#32] do Inc(P);
          if P[0] = '<' then begin
            repeat Inc(P) until P[0] in ['>', #0];
            if P[0] <> #0 then Inc(P);
          end else
          if P[0] in ['a'..'z', 'A'..'Z'] then begin
            LWord := P;
            repeat Inc(P) until not(P[0] in ['a'..'z', 'A'..'Z', '''']);
            SetString(S, LWord, P - LWord);
            ADataBase.Values[S] := ADataBase.Values[S] +
              '(' + AFileName + ':' + IntToStr(LWord-LStart) + ')';
          end;
        end;
      finally
        Free;
      end;
    end;
    
    procedure ScanFiles(const APath: TFileName; const ADataBase: TStrings);
    var LSearchRec: TSearchRec;
    begin
      if FindFirst(APath, faAnyFile, LSearchRec) = 0 then try
        repeat
          WriteLn(LSearchRec.Name);
          Parse(LSearchRec.Name, ADataBase);
        until FindNext(LSearchRec) <> 0;
      finally
        FindClose(LSearchRec);
      end;
    end;
    The code generates this:

    Code:
    apple=(f1.html:6)
    orange=(f1.html:17)(f2.html:16)
    pear=(f1.html:29)(f2.html:28)
    here=(f2.html:6)
    So for any word, I know the file and location. You can see why you would need a filename table to reduce the number of times the file names are referenced.

  4. #3
    ahmed is offline Programming Professional
    Join Date
    Oct 2008
    Posts
    300
    Rep Power
    0

    Re: Searching program

    thanks alot alienkinetics for the information you gave i have another question as i want it to be a memory based ,so what if i want to use a data structure that gets populated from a text file of the information given in it?cause that is how searching algo thingy will work :/ , correct me if wrong

  5. #4
    alienkinetics's Avatar
    alienkinetics is offline Programmer
    Join Date
    Feb 2010
    Location
    Australia
    Posts
    154
    Rep Power
    0

    Re: Searching program

    Just use XML. hand code the XML Format that you think will handle your requirements. Load it into Java using what ever XML api you prefer.

    Java API for XML Processing - Wikipedia, the free encyclopedia

    Code your search algorithm. If you have any speed issues, then store all the <word> nodes in a HashTable. Unless you want to support partial searches, in which case you will have to use a sorted string array.

    Code:
    <data>
    <files>
    <file id="0" name="index.html"/>
    <file id="1" name="about.html"/>
    <file id="2" name="home.html"/>
    <file id="3" name="contacts.html"/>
    </files>
    <words>
    <word text="apple">
    <link fileid="0" position="10"/>
    <link fileid="1" position="3"/>
    <link fileid="2" position="54"/>
    </word>
    <word text="orange">
    <link fileid="2" position="5"/>
    <link fileid="3" position="4"/>
    </word>
    </words>
    </data>

  6. #5
    ahmed is offline Programming Professional
    Join Date
    Oct 2008
    Posts
    300
    Rep Power
    0

    Re: Searching program

    **** this seems to advance for me , the XML thingy

  7. #6
    alienkinetics's Avatar
    alienkinetics is offline Programmer
    Join Date
    Feb 2010
    Location
    Australia
    Posts
    154
    Rep Power
    0

    Re: Searching program

    Use a simple file structure then. something like INI will work. but, then you need to have your data structure defined first.

    Start defining your internat data structures and im sure someone will help you write a file loader for them

  8. #7
    ahmed is offline Programming Professional
    Join Date
    Oct 2008
    Posts
    300
    Rep Power
    0

    Re: Searching program

    thanks again , some questions if you can clear out .
    as in your first post you mentioned 3 tables will be used , if i use 2 Hashtables( for files and words) , then how will i retrieve data from it? plus what can be done for the 3rd one(Links)?

  9. #8
    alienkinetics's Avatar
    alienkinetics is offline Programmer
    Join Date
    Feb 2010
    Location
    Australia
    Posts
    154
    Rep Power
    0

    Re: Searching program

    To handle "one-to-many" with a database you tend to need that 3rd link file.
    But, if you are memory based you dont have to.

    The files can be stored in an ArrayList. the words should be in a HashTable so you get get details from a word quickly.

    Each word should contain an array of links which provide a file and file position where the word was found.

    I would populate the data structure via java code until im happy with the design, then I'd look into populating it from a text file. It shoudn't be that hard to have a simple text file format.

  10. #9
    ahmed is offline Programming Professional
    Join Date
    Oct 2008
    Posts
    300
    Rep Power
    0

    Re: Searching program

    then for the links , if i make my own class , will that be suitable?or can you save an object in a list ?

  11. #10
    alienkinetics's Avatar
    alienkinetics is offline Programmer
    Join Date
    Feb 2010
    Location
    Australia
    Posts
    154
    Rep Power
    0

    Re: Searching program

    ...if i make my own class , will that be suitable?
    If it works and you are comfortable with it, then its suitable.

    There is no right answer when designing data structures. I could propose a structure, but mine would be one of a bunch of possiblities.

Closed Thread
Page 1 of 2 12 LastLast

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. Replies: 7
    Last Post: 11-17-2010, 02:24 PM
  2. Searching for...
    By Bioshox in forum PHP Development
    Replies: 3
    Last Post: 04-27-2010, 01:17 PM
  3. Searching designer
    By Coldhearth in forum Website Design
    Replies: 2
    Last Post: 02-15-2010, 12:39 AM
  4. HI.. i'm just new here and searching something
    By hardaez in forum C and C++
    Replies: 5
    Last Post: 01-18-2008, 09:28 AM
  5. Searching a file
    By The Midnighter in forum C and C++
    Replies: 2
    Last Post: 11-27-2007, 01:05 PM

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts