I have been assigned a project to make a program that searches through documents , basically it will be like a mini search engine. What i have to do is first
parse an html page (removing all tags and stuff) and save it in a txt file [DONE]
now how will i search through them , for example if 2 txt file contains some same information then how will i make the program to display that data from both the files?
which data structure should i use ? and what can be done to make the searching fast.
Searching program
Started by ahmed, Feb 08 2010 07:08 AM
10 replies to this topic
#1
Posted 08 February 2010 - 07:08 AM
|
|
|
#2
Posted 08 February 2010 - 08:07 AM
If this is memory based or are you are going to develop a SQL database?
If this was a database I'd 3 tables:
Files = A table of all filenames (Id: Int, Name: String)
Words = A table of all words found (Id: Int, Word: String)
Links = Position of words found in files (Id: Int, FileId: Int, WordId: Int, Position: Int)
I'd find all words and for each word I would store the filename and location where that word was found.
The next problems will be writing the SQL if you are looking to support boolean searchs. ie: "apple orange", "apple" and "orange", "apple" or "orange".
here is some pascal code that generates a very basic search database. It would't be hard to extend. The HTML parser is very basic. It should include attribute parsing.
The code generates this:
So for any word, I know the file and location. You can see why you would need a filename table to reduce the number of times the file names are referenced.
If this was a database I'd 3 tables:
Files = A table of all filenames (Id: Int, Name: String)
Words = A table of all words found (Id: Int, Word: String)
Links = Position of words found in files (Id: Int, FileId: Int, WordId: Int, Position: Int)
I'd find all words and for each word I would store the filename and location where that word was found.
The next problems will be writing the SQL if you are looking to support boolean searchs. ie: "apple orange", "apple" and "orange", "apple" or "orange".
here is some pascal code that generates a very basic search database. It would't be hard to extend. The HTML parser is very basic. It should include attribute parsing.
program Project;
{$APPTYPE CONSOLE}
uses SysUtils, Classes;
procedure Parse(const AFileName: TFileName; const ADataBase: TStrings);
var
LStart, P, LWord: PChar;
LHTML: String;
S: String;
begin
with TMemoryStream.Create do
try
LoadFromFile(AFileName);
SetString(LHTML, PChar(Memory), Size);
LStart := PChar(LHTML);
P := LStart;
while P[0] <> #0 do begin
while P[0] in [#1..#32] do Inc(P);
if P[0] = '<' then begin
repeat Inc(P) until P[0] in ['>', #0];
if P[0] <> #0 then Inc(P);
end else
if P[0] in ['a'..'z', 'A'..'Z'] then begin
LWord := P;
repeat Inc(P) until not(P[0] in ['a'..'z', 'A'..'Z', '''']);
SetString(S, LWord, P - LWord);
ADataBase.Values[S] := ADataBase.Values[S] +
'(' + AFileName + ':' + IntToStr(LWord-LStart) + ')';
end;
end;
finally
Free;
end;
end;
procedure ScanFiles(const APath: TFileName; const ADataBase: TStrings);
var LSearchRec: TSearchRec;
begin
if FindFirst(APath, faAnyFile, LSearchRec) = 0 then try
repeat
WriteLn(LSearchRec.Name);
Parse(LSearchRec.Name, ADataBase);
until FindNext(LSearchRec) <> 0;
finally
FindClose(LSearchRec);
end;
end;
The code generates this:
apple=(f1.html:6) orange=(f1.html:17)(f2.html:16) pear=(f1.html:29)(f2.html:28) here=(f2.html:6)
So for any word, I know the file and location. You can see why you would need a filename table to reduce the number of times the file names are referenced.
#3
Posted 08 February 2010 - 09:07 AM
thanks alot alienkinetics for the information you gave i have another question as i want it to be a memory based ,so what if i want to use a data structure that gets populated from a text file of the information given in it?cause that is how searching algo thingy will work :/ , correct me if wrong
#4
Posted 08 February 2010 - 10:11 AM
Just use XML. hand code the XML Format that you think will handle your requirements. Load it into Java using what ever XML api you prefer.
Java API for XML Processing - Wikipedia, the free encyclopedia
Code your search algorithm. If you have any speed issues, then store all the <word> nodes in a HashTable. Unless you want to support partial searches, in which case you will have to use a sorted string array.
Java API for XML Processing - Wikipedia, the free encyclopedia
Code your search algorithm. If you have any speed issues, then store all the <word> nodes in a HashTable. Unless you want to support partial searches, in which case you will have to use a sorted string array.
<data> <files> <file id="0" name="index.html"/> <file id="1" name="about.html"/> <file id="2" name="home.html"/> <file id="3" name="contacts.html"/> </files> <words> <word text="apple"> <link fileid="0" position="10"/> <link fileid="1" position="3"/> <link fileid="2" position="54"/> </word> <word text="orange"> <link fileid="2" position="5"/> <link fileid="3" position="4"/> </word> </words> </data>
#5
Posted 08 February 2010 - 11:04 AM
**** this seems to advance for me :( , the XML thingy :(
#6
Posted 08 February 2010 - 11:36 AM
Use a simple file structure then. something like INI will work. but, then you need to have your data structure defined first.
Start defining your internat data structures and im sure someone will help you write a file loader for them
Start defining your internat data structures and im sure someone will help you write a file loader for them
#7
Posted 08 February 2010 - 12:02 PM
thanks again :) , some questions if you can clear out .
as in your first post you mentioned 3 tables will be used , if i use 2 Hashtables( for files and words) , then how will i retrieve data from it? plus what can be done for the 3rd one(Links)?
as in your first post you mentioned 3 tables will be used , if i use 2 Hashtables( for files and words) , then how will i retrieve data from it? plus what can be done for the 3rd one(Links)?
#8
Posted 08 February 2010 - 12:33 PM
To handle "one-to-many" with a database you tend to need that 3rd link file.
But, if you are memory based you dont have to.
The files can be stored in an ArrayList. the words should be in a HashTable so you get get details from a word quickly.
Each word should contain an array of links which provide a file and file position where the word was found.
I would populate the data structure via java code until im happy with the design, then I'd look into populating it from a text file. It shoudn't be that hard to have a simple text file format.
But, if you are memory based you dont have to.
The files can be stored in an ArrayList. the words should be in a HashTable so you get get details from a word quickly.
Each word should contain an array of links which provide a file and file position where the word was found.
I would populate the data structure via java code until im happy with the design, then I'd look into populating it from a text file. It shoudn't be that hard to have a simple text file format.
#9
Posted 08 February 2010 - 09:41 PM
then for the links , if i make my own class , will that be suitable?or can you save an object in a list ?
#10
Posted 08 February 2010 - 11:17 PM
Quote
...if i make my own class , will that be suitable?
If it works and you are comfortable with it, then its suitable.
There is no right answer when designing data structures. I could propose a structure, but mine would be one of a bunch of possiblities.
#11
Posted 09 February 2010 - 12:27 AM
lol its fine i might get some ideas :)


Sign In
Create Account


Back to top









