Hello, first time on the forum.
I have to convert a bunch of HTML files (hundreds of them) in a certain folder into Excel files. What would be the best programming language to do this?
Once it's done, I plan to run the code on schedule on a monthly basis, unattended.
A batch file that renames them from XXX.html to XXX.xls would be my first inclination.
Are you trying to convert it to a binary Excel file, or just get it to open in Excel? Excel will open and render HTML as a spreadsheet (this trick is done a LOT by web apps that need to serve up Excel reports).
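For what it's worth, the rename trick can be sketched in a few lines of Python. This is only an illustration: the function name copy_html_as_xls is made up, and it copies rather than renames so the original .html files are kept. The content stays HTML, so recent Excel versions may warn that the file format and extension do not match before opening it.

```python
import shutil
from pathlib import Path

def copy_html_as_xls(folder):
    """Copy every *.html file in `folder` to a sibling *.xls file.

    Hypothetical helper for illustration. The file content is untouched;
    only the extension differs, so Excel opens it as an HTML spreadsheet.
    """
    created = []
    for src in sorted(Path(folder).glob("*.html")):
        dst = src.with_suffix(".xls")
        shutil.copyfile(src, dst)  # keep the original .html in place
        created.append(dst)
    return created
```

Calling copy_html_as_xls("C:/somefolder") returns the list of .xls paths it created.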
You could write a utility in Java using the Apache POI library to convert the HTML to native Excel. You could also do something similar with a .NET utility. Getting it to run on a schedule will depend somewhat on the OS.
You have an html file in a local folder. If you are trying to convert a <table> from this file to a spreadsheet file, use biterscripting.
Read the file into a str variable $html.
var str html ; cat "C:/somefolder/somefile.html" > $html
$html now has a table starting at <table...> and ending at </table>. (If this file has more than one <table>, see later.)
Collect the rows one by one into a str variable $rows.
var str rows
while ( { sen -r -c "^<tr&</tr\>^" $html } > 0 )
do
    stex -r -c "^<tr&</tr\>^" $html >> $rows
    echo "\n" >> $rows # End of row
done
$rows now has all the rows separated by newlines.
Collect the columns one by one into a str variable $columns. Note that this will contain all the rows as well; we are just inserting a comma after each column within each row. We can do this all at once for all rows and all columns.
var str columns
while ( { sen -r -c "^<td&</td\>^" $rows } > 0 )
do
    stex -r -c "^<td&</td\>^" $rows >> $columns
    echo "," >> $columns # End of column
done
$columns now has all rows separated by newlines, and all columns within each row separated by commas.
$columns still has HTML tags. Remove them. biterscripting has a sample script for this, SS_WebPageToText.
echo $columns > "C:/intermediatefile.txt"
script "C:/Scripts/SS_WebPageToText.txt" page("C:/intermediatefile.txt") > "C:/table.csv"
C:/table.csv is now a CSV (comma separated values) file, which can be opened in any spreadsheet program.
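If you would rather not install anything beyond Python, the same row-then-column idea can be sketched with the standard library's html.parser and csv modules. This is a different technique from the biterscripting commands above, not a translation of them. The names TableParser and table_to_csv and the index parameter (which table to pick, counting from 0) are my own. It assumes the tables are not nested and that every <tr>, <td> and <th> has its closing tag.

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the text pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Tags inside a cell are dropped; only their text is kept.
        if self._cell is not None:
            self._cell.append(data)

def table_to_csv(html, index=0):
    """Return table number `index` (0-based) of `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.tables[index])
    return out.getvalue()
```

Using the csv module rather than inserting commas by hand means a cell that itself contains a comma gets quoted instead of splitting into two columns.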
You say you have hundreds of files. Put the above code into a script and pass an input argument $file. (The command cat "C:/somefolder/somefile.html" will become cat $file in the script.) Pass each file one by one using the following:
var str filelist
lf -rn "*.html" "C:/somefolder" > $filelist
while ($filelist <> "")
do
    var str file ; lex "1" $filelist > $file
    # Call your script with $file here.
done
If a $file contains more than one <table>, and you want to extract, say, the second <table>, use the following:
cat $file > $html
# Throw away everything before the second instance of <table .
stex -c "]^<table^2" $html > null
# Throw away everything after the immediate next instance of </table>.
stex -c "^</table>^[" $html > null
$html is now ready for the rest of the processing above.
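For comparison, the folder loop above might look like this in Python. It is only a sketch: convert_folder is a made-up name, and convert stands for whatever function you write to turn one file's HTML text into CSV text. It looks at the top level of the folder only (swap glob for rglob to include subfolders) and assumes the files are UTF-8.

```python
from pathlib import Path

def convert_folder(folder, convert, pattern="*.html"):
    """Apply `convert` to each matching file; write the result as .csv.

    `convert` is a hypothetical callback: it takes the file's text and
    returns the CSV text to save next to the source file.
    """
    written = []
    for src in sorted(Path(folder).glob(pattern)):
        text = src.read_text(encoding="utf-8", errors="replace")
        dst = src.with_suffix(".csv")
        # newline="" keeps the line endings exactly as `convert` produced them.
        with open(dst, "w", encoding="utf-8", newline="") as f:
            f.write(convert(text))
        written.append(dst)
    return written
```

A script that ends with one call to convert_folder can then be run unattended by whatever scheduler your OS provides.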
Get biterscripting if you don't have it, from biterscripting.com . I think it is still free.
J