Hello everyone, hope you are all doing well.
Okay, what I am trying to do is this:
I have HTML files that I need to pull the info out of and put into a CSV file for Microsoft Excel.
Each file has a few lines of info in it, and I need each piece in a separate column so that I can use it in Excel.
Below is what is in most of the HTML files. Some do not have the info, so the script needs to be able to move on when a file is missing it. I have also marked where I would want each piece of info to go, for example:
********************************************************
AQUARIUM PHARMACEUTICALS - P/C ALGAEFIX W/ FREE ECOFIX - (column A)
Royal ID: AAP169P - (column B)
UPC: 317163161692 - (column C)
Vendor: AQUARIUM PHARMACEUTICALS - (column D)
****Need to have Royal ID, UPC, and Vendor as headers, with the info following in the cells below them.*****
*****(Everything below Vendor should all go into the same cell, keeping the line breaks, etc. the same.) in (column E)*****
AlgaeFix effectively controls the growth of many types of algae, including blanketweed. Will not harm live plants or koi and goldfish. EcoFix makes pond water clean and clear. Breaks down dead algae. Increases oxygen levels in pond water.
AlgaeFix effectively controls many types of green or green water algae, string or hair algae and blanketweed in ponds that contain live plants. Controls existing algae and helps resolve additional algae blooms. Keeps ornamental ponds and water gardens clean & clear. EcoFix helps create a healthy ecosystem for pond fish. By digesting sludge, and reducing organics, EcoFix increases oxygen levels and makes pond water clean and clear.
* 16 fl oz PondCare® AlgaeFix® with free 8 fl oz PondCare® EcoFix
* 169B Treats 4,800 US Gal (18,168 L) (147A Treats Up to 2,000 US Gal)
* Restricted for sale in Canada, UK
*********************************************************
That's about it in a nutshell. I had someone help me with this before, and it worked great, but somehow I lost the files. :(
I will be using this on my home computer, if that makes any difference.
I would appreciate any help I could get on this.
Thanks in advance
Vonneffdobermans
need php script to extract info from html files
Started by vonneffdobermans, Dec 16 2008 06:48 PM
11 replies to this topic
#1
Posted 16 December 2008 - 06:48 PM
#2
Posted 16 December 2008 - 06:52 PM
If the format of the files is consistent, you may want to use a regular expression to grab the data. jEdit has multifile searches that could get the data pretty quickly. I can look into what you would need when I get to work, but it should be pretty straightforward.
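As a rough illustration of the regular-expression approach, here is a minimal Python sketch. It assumes each label sits in its own `<h3>` tag (for example `<h3>UPC: ...</h3>`), which is the layout implied by the search patterns posted later in this thread. The sample page, the `FIELD_PATTERNS` table, and the `extract_fields` helper are made up for the example. A missing label yields an empty string, so incomplete files do not stop the run.

```python
import re

# Hypothetical sample page; the real files may be laid out differently.
SAMPLE = """<html><body>
<h2>AQUARIUM PHARMACEUTICALS - P/C ALGAEFIX W/ FREE ECOFIX</h2>
<h3>Royal ID: AAP169P</h3>
<h3>UPC: 317163161692</h3>
<h3>Vendor: AQUARIUM PHARMACEUTICALS</h3>
</body></html>"""

# One pattern per labelled field (assumed <h3>Label: value</h3> layout).
FIELD_PATTERNS = {
    label: re.compile(r"<h3>\s*" + re.escape(label) + r":\s*(.*?)\s*</h3>",
                      re.IGNORECASE | re.DOTALL)
    for label in ("Royal ID", "UPC", "Vendor")
}


def extract_fields(page):
    """Return {label: value}; a label that is not found maps to ''."""
    found = {}
    for label, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page)
        found[label] = match.group(1) if match else ""
    return found


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Regular expressions like these only hold up while the files really are consistent; if the markup varies a lot, a proper HTML parser is the safer choice.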
#3
Posted 16 December 2008 - 07:03 PM
Well, there are some like the one I posted with all the info in them.
And some have just the name and vendor, and some have nothing...
Is that a problem?
I will upload some of the files tomorrow so you can see what I am talking about.
The last script I had would place the vendor, product ID, UPC, etc. as the headers, with the info in the cells underneath them. That way, when I open it in Excel, it is all in order for me to edit.
The main things I need out of the files are the name of the file or the product ID (they are the same) in one cell, and the full description of the item. If that is not there, then I really do not need that HTML file's data saved.
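The "skip it if the info is not there" rule can be written as a small filter. This is only a sketch of one reading of the requirement: it treats a file as worth keeping when it has an identifier (product ID or file name) and a non-empty description. The `should_keep` function and the dictionary keys are hypothetical names chosen for the example.

```python
def should_keep(record):
    """Keep a record only if it has an identifier and a description.

    `record` is assumed to be a dict with optional keys
    'royal_id', 'file_name' and 'description' (hypothetical names).
    """
    identifier = record.get("royal_id", "").strip() or record.get("file_name", "").strip()
    description = record.get("description", "").strip()
    return bool(identifier) and bool(description)


if __name__ == "__main__":
    records = [
        {"royal_id": "AAP169P", "description": "AlgaeFix effectively controls algae."},
        {"royal_id": "", "file_name": "AAP169P", "description": "   "},
        {},
    ]
    print([should_keep(r) for r in records])
```

Files with only a name and vendor, or with nothing at all, would then be dropped before the CSV is written.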
#4
Posted 16 December 2008 - 07:13 PM
Include a sample of what the data from about 3 files should look like. I may be able to give you something where you can use it more or less forever in the future.
#5
Posted 16 December 2008 - 08:08 PM
Sure thing...
I will upload 3 files and a sample what the output should look like in both CSV and XLS format
Attached Files
#6
Posted 17 December 2008 - 07:44 AM
These instructions are for jEdit:
Search for ".*" in Directory, Settings: Regular Expression, Filter: *.htm, in appropriate directory
Macros: Misc: HyperSearch Results to Buffer
Search for "^(.*)<body>" replace with "" in Current buffer as Regular Expression: replace all
Search for "^C:\\(.*)\n" replace with "" in Current buffer as Regular Expression: replace all
Search for "^(.*)<h3>Royal ID: (.*?)</h3>(.*)$" replace with "$2,$1$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),(.*)<h2>(.*?)</h2>(.*)$" replace with "$1,$3,$2$4" in Current buffer as Regular Expression: replace all
Search for "^(.*),<h3>UPC: (.*?)</h3>(.*)$" replace with "$1,$2,,$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),<h3>Vendor: (.*?)</h3>(.*)$" replace with "$1,$2,,$3" in Current buffer as Regular Expression: replace all
Search for "^(.*),<p>(.*)$" replace with "$1,$2" in Current buffer as Regular Expression: replace all
Search for "^(.*)</p><p>(.*)$" replace with "$1\n,,,,,,\n,,,,,,$2" in Current buffer as Regular Expression: replace all (repeat until there are 0 replacements made)
Search for "^(.*)</p>(.*)$" replace with "$1" in Current buffer as Regular Expression: replace all
At this point, you can do a non-regular expression search/replace for the escape sequences and add the header row.
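The same steps can also be scripted rather than done by hand in an editor. The following is a minimal Python sketch (Python rather than PHP, since any language was said to be acceptable), not a tested converter. It assumes the tag layout implied by the patterns above: the product name in `<h2>`, the labelled fields in `<h3>`, and the description in one or more `<p>` blocks. The names `first_match`, `parse_product`, and `rows_to_csv` are invented for the example. The `csv` module quotes any cell that contains line breaks, so the whole description stays in one cell.

```python
import csv
import html
import io
import re

HEADERS = ["Name", "Royal ID", "UPC", "Vendor", "Description"]
FLAGS = re.IGNORECASE | re.DOTALL


def first_match(pattern, page):
    """Return the first captured group, unescaped and trimmed, or ''."""
    match = re.search(pattern, page, FLAGS)
    return html.unescape(match.group(1)).strip() if match else ""


def parse_product(page):
    """Turn one HTML page into a row: name, Royal ID, UPC, vendor, description."""
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", page, FLAGS)
    # Drop any tags left inside a paragraph and decode entities such as &reg;.
    cleaned = [html.unescape(re.sub(r"<[^>]+>", "", p)).strip() for p in paragraphs]
    description = "\n".join(p for p in cleaned if p)
    return [
        first_match(r"<h2[^>]*>(.*?)</h2>", page),
        first_match(r"<h3[^>]*>\s*Royal ID:(.*?)</h3>", page),
        first_match(r"<h3[^>]*>\s*UPC:(.*?)</h3>", page),
        first_match(r"<h3[^>]*>\s*Vendor:(.*?)</h3>", page),
        description,
    ]


def rows_to_csv(rows):
    """Return CSV text with a header row; multi-line cells are quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADERS)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical page, shortened from the example in the first post.
    page = ("<html><body><h2>P/C ALGAEFIX W/ FREE ECOFIX</h2>"
            "<h3>Royal ID: AAP169P</h3><h3>UPC: 317163161692</h3>"
            "<h3>Vendor: AQUARIUM PHARMACEUTICALS</h3>"
            "<p>First paragraph.</p><p>Second paragraph.</p></body></html>")
    print(rows_to_csv([parse_product(page)]))
```

A field that is missing from a page comes out as an empty cell instead of shifting the other columns.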
#7
Posted 17 December 2008 - 05:13 PM
Winged pretty much summed it up. I can create a quick application for you that will automate it and export an Excel file. (I have prebuilt tools, but they require a version of Office newer than 2000.)
#8
Posted 18 December 2008 - 12:01 PM
TkTech said:
Winged pretty much summed it up, I can create a quick application for you that will automate it and export as an excel file. (I have prebuilt tools, but I require post-Office 2000)
So if you did that, would I have to put in all of the search patterns myself? I tried it, but I am new to this and could not get it to work correctly.
If you can do this so that the result looks like the output files I uploaded, that would be great.
You are talking about Microsoft Office 2000? I have both 2007 and 2000; which one will do?
With key...
Let me know...
and let me know if this would work like I was saying...
thanks for all the help you guys...
Edited by vonneffdobermans, 18 December 2008 - 12:49 PM.
#9
Posted 18 December 2008 - 07:14 PM
I can make it very simple: either select a file or folder, press Convert, and choose a place to save the .xls it produces (you can open it right in Excel). No need to enter anything. I just require that YOU have a copy of Excel that's newer than 2000. The one you have is perfect.
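For the "point it at a folder and convert" workflow, a batch loop along these lines might do. This is a hedged Python sketch, not the tool offered here: it writes a CSV rather than an .xls, and `convert_folder` and its `parse` argument are hypothetical. `parse` stands for whatever function pulls the cells out of one page and returns `None` for files that should be skipped. The sketch also assumes the pages can be read as UTF-8; change the encoding if the files were saved differently.

```python
import csv
from pathlib import Path


def convert_folder(folder, out_csv, parse, headers):
    """Parse every .htm/.html file in `folder` and write the kept rows to `out_csv`.

    `parse` is supplied by the caller: it receives a file's text and returns
    a list of cell values, or None when the file should be skipped.
    The file name (without extension) is written as the first column.
    Returns the number of data rows written.
    """
    written = 0
    # utf-8-sig adds a byte-order mark, which helps recent Excel versions
    # recognise UTF-8 text (for example the registered-trademark sign).
    with open(out_csv, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(headers)
        for path in sorted(Path(folder).glob("*.htm*")):
            text = path.read_text(encoding="utf-8", errors="replace")
            row = parse(text)
            if row is None:
                continue  # nothing useful in this file, move on
            writer.writerow([path.stem] + list(row))
            written += 1
    return written
```

A call such as `convert_folder("pages", "products.csv", my_parser, ["File", "Name", "Royal ID", "UPC", "Vendor", "Description"])` would then produce one file that opens directly in Excel; the folder name, output name, and `my_parser` are placeholders.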
PM if you are interested ^^
#10
Posted 18 December 2008 - 07:33 PM
TkTech said:
I can make it very simple, either select a file or folder, press convert, and choose a place to save the .xls it produces (you can open it right into excel) no need to entire anything. I just require that YOU have a copy of excel thats newer than 2000. The one you have is perfect.
PM if you are interested ^^
Sounds good to me. I do not have enough posts yet to PM, but I will get busy here and contact you.
#11
Posted 22 December 2008 - 10:32 AM
Well... nobody got back to me. Is there anyone else out there who can do this all automatically?
Any help would be greatly appreciated...
#12
Posted 20 January 2009 - 01:06 PM
Okay, it seems that TkTech WAS going to help me out, but he never got back to me.
So is there anyone else who can do this for me in any language that I can use on my computer, PHP or whatever?
I want it just like TkTech said in his last post, as automatic as I can get it.
Thanks in advance, and I hope to hear from someone soon.












