I have been assigned a project to make a program that searches through documents , basically it will be like a mini search engine. What i have to do is first
parse an html page (removing all tags and stuff) and save it in a txt file [DONE]
now how will i search through them , for example if 2 txt file contains some same information then how will i make the program to display that data from both the files?
which data structure should i use ? and what can be done to make the searching fast.
If this is memory based or are you are going to develop a SQL database?
If this was a database I'd 3 tables:
Files = A table of all filenames (Id: Int, Name: String)
Words = A table of all words found (Id: Int, Word: String)
Links = Position of words found in files (Id: Int, FileId: Int, WordId: Int, Position: Int)
I'd find all words and for each word I would store the filename and location where that word was found.
The next problems will be writing the SQL if you are looking to support boolean searchs. ie: "apple orange", "apple" and "orange", "apple" or "orange".
here is some pascal code that generates a very basic search database. It would't be hard to extend. The HTML parser is very basic. It should include attribute parsing.
The code generates this:Code:program Project; {$APPTYPE CONSOLE} uses SysUtils, Classes; procedure Parse(const AFileName: TFileName; const ADataBase: TStrings); var LStart, P, LWord: PChar; LHTML: String; S: String; begin with TMemoryStream.Create do try LoadFromFile(AFileName); SetString(LHTML, PChar(Memory), Size); LStart := PChar(LHTML); P := LStart; while P[0] <> #0 do begin while P[0] in [#1..#32] do Inc(P); if P[0] = '<' then begin repeat Inc(P) until P[0] in ['>', #0]; if P[0] <> #0 then Inc(P); end else if P[0] in ['a'..'z', 'A'..'Z'] then begin LWord := P; repeat Inc(P) until not(P[0] in ['a'..'z', 'A'..'Z', '''']); SetString(S, LWord, P - LWord); ADataBase.Values[S] := ADataBase.Values[S] + '(' + AFileName + ':' + IntToStr(LWord-LStart) + ')'; end; end; finally Free; end; end; procedure ScanFiles(const APath: TFileName; const ADataBase: TStrings); var LSearchRec: TSearchRec; begin if FindFirst(APath, faAnyFile, LSearchRec) = 0 then try repeat WriteLn(LSearchRec.Name); Parse(LSearchRec.Name, ADataBase); until FindNext(LSearchRec) <> 0; finally FindClose(LSearchRec); end; end;
So for any word, I know the file and location. You can see why you would need a filename table to reduce the number of times the file names are referenced.Code:apple=(f1.html:6) orange=(f1.html:17)(f2.html:16) pear=(f1.html:29)(f2.html:28) here=(f2.html:6)
thanks alot alienkinetics for the information you gave i have another question as i want it to be a memory based ,so what if i want to use a data structure that gets populated from a text file of the information given in it?cause that is how searching algo thingy will work :/ , correct me if wrong
Just use XML. hand code the XML Format that you think will handle your requirements. Load it into Java using what ever XML api you prefer.
Java API for XML Processing - Wikipedia, the free encyclopedia
Code your search algorithm. If you have any speed issues, then store all the <word> nodes in a HashTable. Unless you want to support partial searches, in which case you will have to use a sorted string array.
Code:<data> <files> <file id="0" name="index.html"/> <file id="1" name="about.html"/> <file id="2" name="home.html"/> <file id="3" name="contacts.html"/> </files> <words> <word text="apple"> <link fileid="0" position="10"/> <link fileid="1" position="3"/> <link fileid="2" position="54"/> </word> <word text="orange"> <link fileid="2" position="5"/> <link fileid="3" position="4"/> </word> </words> </data>
**** this seems to advance for me, the XML thingy
![]()
Use a simple file structure then. something like INI will work. but, then you need to have your data structure defined first.
Start defining your internat data structures and im sure someone will help you write a file loader for them
thanks again, some questions if you can clear out .
as in your first post you mentioned 3 tables will be used , if i use 2 Hashtables( for files and words) , then how will i retrieve data from it? plus what can be done for the 3rd one(Links)?
To handle "one-to-many" with a database you tend to need that 3rd link file.
But, if you are memory based you dont have to.
The files can be stored in an ArrayList. the words should be in a HashTable so you get get details from a word quickly.
Each word should contain an array of links which provide a file and file position where the word was found.
I would populate the data structure via java code until im happy with the design, then I'd look into populating it from a text file. It shoudn't be that hard to have a simple text file format.
then for the links , if i make my own class , will that be suitable?or can you save an object in a list ?
If it works and you are comfortable with it, then its suitable....if i make my own class , will that be suitable?
There is no right answer when designing data structures. I could propose a structure, but mine would be one of a bunch of possiblities.
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks