Need PHP script to extract info from HTML files

11 replies to this topic

#1
vonneffdobermans

    Newbie

  • Members
  • 15 posts
Hello everyone... Hope you are all doing well...

Okay, What I am trying to do is this:

I have HTML files that I need to pull the info out of and put into a CSV file for Microsoft Excel.

Each file has a few lines of info in it. I need each piece in a separate column so that I can use it in Microsoft Excel.

Below is what is in most of the HTML files. Some do not have the info, so the script needs to be able to move on when it is missing. I have also marked where I would want each piece of info to go, for example...

********************************************************

AQUARIUM PHARMACEUTICALS - P/C ALGAEFIX W/ FREE ECOFIX - (column A)

Royal ID: AAP169P - (column B)

UPC: 317163161692 - (column C)

Vendor: AQUARIUM PHARMACEUTICALS - (column D)

****Need to have Royal ID, UPC, and Vendor as headers, with the info following in the cells below.*****

*****(Everything below the vendor should all go into the same cell, keeping the line breaks, etc. the same, in column E.)*****

AlgaeFix effectively controls the growth of many types of algae, including blanketweed. Will not harm live plants or koi and goldfish. EcoFix makes pond water clean and clear. Breaks down dead algae. Increases oxygen levels in pond water.

AlgaeFix effectively controls many types of green or green water algae, string or hair algae and blanketweed in ponds that contain live plants. Controls existing algae and helps resolve additional algae blooms. Keeps ornamental ponds and water gardens clean & clear. EcoFix helps create a healthy ecosystem for pond fish. By digesting sludge, and reducing organics, EcoFix increases oxygen levels and makes pond water clean and clear.

* 16 fl oz PondCare® AlgaeFix® with free 8 fl oz PondCare® EcoFix
* 169B Treats 4,800 US Gal (18,168 L) (147A Treats Up to 2,000 US Gal)
* Restricted for sale in Canada, UK

*********************************************************

That's about it in a nutshell... I had someone help me with this before, and it worked great. But somehow I lost the files :(

I will be using this on my home computer, if that makes any difference.


I would appreciate any help I could get on this.

Thanks in advance

Vonneffdobermans
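
For reference, here is a minimal sketch of the requested layout, assuming Python and its standard `csv` module (the thread never shows the lost script, so this is not a reconstruction of it). The column names and the `rows_to_csv` helper are illustrative. The point it shows is that the writer quotes any field containing line breaks, so the whole description can stay in one cell (column E) when the file is opened in Excel.

```python
import csv
import io

# Column names are assumptions based on the layout requested above.
HEADERS = ["Name", "Royal ID", "UPC", "Vendor", "Description"]


def rows_to_csv(rows):
    """Return CSV text: a header row followed by one record per product."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)  # the default dialect is Excel-compatible
    writer.writerow(HEADERS)
    for row in rows:
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow([row.get(name, "") for name in HEADERS])
    return buffer.getvalue()


sample = {
    "Name": "AQUARIUM PHARMACEUTICALS - P/C ALGAEFIX W/ FREE ECOFIX",
    "Royal ID": "AAP169P",
    "UPC": "317163161692",
    "Vendor": "AQUARIUM PHARMACEUTICALS",
    # Line breaks inside the description are kept within a single quoted cell.
    "Description": "First paragraph of the description.\n\n* First bullet\n* Second bullet",
}
csv_text = rows_to_csv([sample])
print(csv_text)
```

To save the result to disk, open the output file with `newline=""` before writing `csv_text`, as the `csv` module documentation recommends.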

#2
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
If the format of the files is consistent, you may want to use a regular expression to grab the data. jEdit has multifile searches that could get the data pretty quickly. I can look into what you would need when I get to work, but it should be pretty straightforward.
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#3
vonneffdobermans

    Newbie

  • Members
  • 15 posts
Well, there are some like the one I posted with all the info in them.
And some have just the name and vendor, and some have nothing...

Is that a problem?

I will upload some of the files tomorrow so you can see what I am talking about.

The last script I had would place the vendor, product ID, UPC, etc. as the headers, and the info would go in the cells underneath the headers. That way, when I open it in Excel, it is all in order for me to edit.

The main things I need out of the files are the file name or product ID in a cell (they are the same) and the full description of the item. If the description is not there, then I really do not need that HTML file's data saved.
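
The rule described here (drop any file that has no product ID or no description) could be sketched as follows, assuming Python. The `keep_record` helper and the dictionary keys are hypothetical names chosen for illustration, and the second and third sample records are made up to represent the incomplete files mentioned above.

```python
def keep_record(record):
    """Keep a record only if it has a product ID and a non-blank description."""
    product_id = (record.get("Royal ID") or "").strip()
    description = (record.get("Description") or "").strip()
    return bool(product_id) and bool(description)


records = [
    {"Royal ID": "AAP169P", "Description": "AlgaeFix effectively controls many types of algae."},
    {"Royal ID": "XYZ000", "Description": "   "},  # hypothetical: name and vendor only
    {},                                            # hypothetical: file with no info at all
]
kept = [r for r in records if keep_record(r)]
print([r["Royal ID"] for r in kept])
```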

#4
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
Include a sample of what the data from about 3 files should look like. I may be able to give you something where you can use it more or less forever in the future.

#5
vonneffdobermans

    Newbie

  • Members
  • 15 posts
Sure thing...

I will upload 3 files and a sample of what the output should look like, in both CSV and XLS format.

#6
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
These instructions are for jEdit:

Search for ".*" in Directory, Settings: Regular Expression, Filter: *.htm, in appropriate directory
Macros: Misc: HyperSearch Results to Buffer
Search for "^(.*)<body>" replace with "" in Current buffer as Regular Expression: replace all
Search for "^C:\\(.*)\n" replace with "" in Current buffer as Regular Expression: replace all
Search for "^(.*)<h3>Royal ID: (.*?)</h3>(.*)$" replace with "$2,$1$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),(.*)<h2>(.*?)</h2>(.*)$" replace with "$1,$3,$2$4" in Current buffer as Regular Expression: replace all
Search for "^(.*),<h3>UPC: (.*?)</h3>(.*)$" replace with "$1,$2,,$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),<h3>Vendor: (.*?)</h3>(.*)$" replace with "$1,$2,,$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),<p>(.*)$" replace with "$1,$2" in Current buffer as Regular Expression: replace all
Search for "^(.*)</p><p>(.*)$" replace with "$1\n,,,,,,\n,,,,,,$2" in Current buffer as Regular Expression: replace all (repeat until there are 0 replacements made)
Search for "^(.*)</p>(.*)$" replace with "$1" in Current buffer as Regular Expression: replace all

At this point, you can do a non-regular-expression search/replace for the escape sequences and add the header row.
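
The same idea can be scripted instead of run by hand in an editor. Below is a minimal sketch in Python, assuming the tag layout implied by the search patterns above: the product name in `<h2>`, the Royal ID, UPC, and Vendor each in their own `<h3>`, and the description in `<p>` paragraphs. The attached files are not part of this thread, so `sample_page` is a made-up page in that shape, and `extract_record` and `clean` are illustrative names. Regular expressions only work here because the files are assumed to be consistent; a real HTML parser would be safer for messier pages.

```python
import html
import re

# One pattern per labeled field; a field that is missing simply does not match.
FIELD_PATTERNS = {
    "Name": re.compile(r"<h2>(.*?)</h2>", re.S | re.I),
    "Royal ID": re.compile(r"<h3>\s*Royal ID:\s*(.*?)</h3>", re.S | re.I),
    "UPC": re.compile(r"<h3>\s*UPC:\s*(.*?)</h3>", re.S | re.I),
    "Vendor": re.compile(r"<h3>\s*Vendor:\s*(.*?)</h3>", re.S | re.I),
}
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip leftover tags, decode entities such as &reg;, and trim whitespace."""
    return html.unescape(TAG.sub("", fragment)).strip()


def extract_record(page):
    """Return a dict of fields; anything not found is an empty string."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page)
        record[name] = clean(match.group(1)) if match else ""
    paragraphs = [clean(p) for p in PARAGRAPH.findall(page)]
    # Join paragraphs with blank lines so the description keeps its line breaks.
    record["Description"] = "\n\n".join(p for p in paragraphs if p)
    return record


# Hypothetical page in the assumed layout.
sample_page = """<html><body>
<h2>AQUARIUM PHARMACEUTICALS - P/C ALGAEFIX W/ FREE ECOFIX</h2>
<h3>Royal ID: AAP169P</h3>
<h3>UPC: 317163161692</h3>
<h3>Vendor: AQUARIUM PHARMACEUTICALS</h3>
<p>AlgaeFix effectively controls the growth of many types of algae.</p>
<p>16 fl oz PondCare&reg; AlgaeFix&reg; with free 8 fl oz PondCare&reg; EcoFix</p>
</body></html>"""

record = extract_record(sample_page)
print(record["Royal ID"], record["UPC"], record["Vendor"])
```

Because every field falls back to an empty string, a file with only a name and vendor, or with nothing at all, still yields a record rather than an error.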

#7
TkTech

    The Crazy One

  • Moderators
  • 1,396 posts
Winged pretty much summed it up. I can create a quick application for you that will automate it and export the result as an Excel file. (I have prebuilt tools, but they require a version of Office newer than 2000.)

#8
vonneffdobermans

    Newbie

  • Members
  • 15 posts

TkTech said:

Winged pretty much summed it up. I can create a quick application for you that will automate it and export the result as an Excel file. (I have prebuilt tools, but they require a version of Office newer than 2000.)

So if you did that, would I have to put in all of the search steps myself? I tried it, but I am new to this and could not get it to work correctly...

If you can do this so that it looks like the output files I uploaded, then that would be great...
Are you talking about Microsoft Office 2000? I have 2007 and 2000. Which one will do?
With key...

Let me know...

and let me know if this would work like I was saying...

thanks for all the help you guys...

Edited by vonneffdobermans, 18 December 2008 - 12:49 PM.


#9
TkTech

    The Crazy One

  • Moderators
  • 1,396 posts
I can make it very simple: select a file or folder, press convert, and choose a place to save the .xls it produces (you can open it right in Excel). No need to enter anything. I just require that YOU have a copy of Excel that's newer than 2000. The one you have is perfect.

PM if you are interested ^^
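
The "select a folder and convert" step can be approximated without a GUI. This is a minimal sketch, assuming Python and its standard `pathlib` module; it is not the tool offered here, which was never posted in the thread. The `find_html_files` and `read_pages` helpers are hypothetical names, and the UTF-8 encoding is a guess because the real files' encoding is unknown. The file name without its extension is returned alongside the text, since the product ID is said to match the file name.

```python
import tempfile
from pathlib import Path


def find_html_files(folder):
    """Return the .htm and .html files directly inside folder, sorted by name."""
    root = Path(folder)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in {".htm", ".html"}
    )


def read_pages(folder):
    """Yield (file name without extension, page text) for each HTML file."""
    for path in find_html_files(folder):
        # errors="replace" keeps going if a file contains bytes that are not UTF-8.
        yield path.stem, path.read_text(encoding="utf-8", errors="replace")


# Demonstration on a throwaway folder with one HTML file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "AAP169P.htm").write_text("<h3>Royal ID: AAP169P</h3>", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    pages = list(read_pages(tmp))

print(pages)
```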

#10
vonneffdobermans

    Newbie

  • Members
  • 15 posts

TkTech said:

I can make it very simple: select a file or folder, press convert, and choose a place to save the .xls it produces (you can open it right in Excel). No need to enter anything. I just require that YOU have a copy of Excel that's newer than 2000. The one you have is perfect.

PM if you are interested ^^


Sounds good to me... I do not have enough posts yet to send a PM, but I will get busy here and contact you...

#11
vonneffdobermans

    Newbie

  • Members
  • 15 posts
Well... Nobody got back to me... Is there anyone else out there who can do this all automatically?

Any help would be greatly appreciated...

#12
vonneffdobermans

    Newbie

  • Members
  • 15 posts
Okay, it seems that TkTech WAS going to help me out, but he never got back to me.

So is there anyone else who can do this for me in any language that I can use on my computer, PHP or whatever?

I want it just like TkTech said in his last post, as automatic as I can get it.

Thanks in advance, and I hope to hear from someone soon.