Jump to content


Check out our Community Blogs

Register and join over 40,000 other developers!


Recent Status Updates

View All Updates

Photo
- - - - -

need to convert html file to excel, what is the best way?

HTML

  • Please log in to reply
6 replies to this topic

#1 fftw_ayi

fftw_ayi

    CC Lurker

  • Just Joined
  • Pip
  • 3 posts

Posted 25 February 2009 - 07:40 AM

Hello, first time on the forum.
I have to convert a bunch of html files (like hundreds) in a certain folder location and convert them into excel files. What would be the best programming language to do this?

Once it's done, I plan to run the code on schedule on a monthly basis, unattended.
  • 0

#2 WingedPanther73

WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 25 February 2009 - 08:24 AM

A batch file that renames them from XXX.html to XXX.xls would be my first inclination.
  • 0

Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/


#3 fftw_ayi

fftw_ayi

    CC Lurker

  • Just Joined
  • Pip
  • 3 posts

Posted 25 February 2009 - 08:28 AM

A batch file that renames them from XXX.html to XXX.xls would be my first inclination.


Simply renaming it does not make it an excel file. If I could do this manually, I would have to open the html file in excel and do a save as while explicitly choosing .xls for the Save as type.
  • 0

#4 WingedPanther73

WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 25 February 2009 - 08:55 AM

Are you trying to convert it to a binary Excel file, or just get it to open in Excel? Excel will open and render HTML as a spreadsheet (this trick is done a LOT by web apps that need to serve up Excel reports).
  • 0

Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/


#5 fftw_ayi

fftw_ayi

    CC Lurker

  • Just Joined
  • Pip
  • 3 posts

Posted 25 February 2009 - 09:03 AM

Are you trying to convert it to a binary Excel file, or just get it to open in Excel? Excel will open and render HTML as a spreadsheet (this trick is done a LOT by web apps that need to serve up Excel reports).


You are correct. I need it to be a binary Excel file. If I simply rename it, I can double click it and the file will open in excel, but it is not an excel file.
  • 0

#6 WingedPanther73

WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 25 February 2009 - 05:45 PM

You could create a utility with Java using the POI utility to convert the HTML to native Excel. You could also do something similar with a .NET utility. Getting it to run on a schedule will depend somewhat on the OS.
  • 0

Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/


#7 JenniC

JenniC

    CC Lurker

  • New Member
  • Pip
  • 6 posts

Posted 14 March 2009 - 08:53 AM

You have an html file in a local folder. If you are trying to convert a <table> from this file to a spreadsheet file, use biterscripting.

Read the into a str variable $html.

var str html ; cat "C:/somefolder/somefile.html" > $html


$html now has a table starting at <table...> ending at </table>. ( If this file has more than one <table>, see later.)

Collect rows one by one into a str variable $row

var str rows
while ( { sen -r -c "^<tr&</tr\>^" $html } > 0 )
do
stex -r -c "^<tr&</tr\>^" $html >> $rows
echo "\n" >> $rows # End of row
done


$rows now has all the rows separated by newlines.

Collect columns one by one into a str variable $columns.
Note, this will contain all rows also - we are just inserting a comma
after each column within each row. We can do this all at once for all
rows and all columns.

var str columns
while ( { sen -r -c "^<td&</td\>^" $rows } > 0 )
do
stex -r -c "^<td&</td\>^" $rows >> $columns
echo "," >> $columns # End of column
done


$columns now has all rows separated by newline, all columns within each
row separated by commas.

$columns still has html tags. Remove them. biterscripting has a sample script for this SS_WebPageToText.

echo $columns > "C:/intermediatefile.txt"
script "C:/Scripts/SS_WebPageToText.txt" page("C:/intermediatefile.txt") > "C:/table.csv"


C:/table.csv now has a CSV (comma separated values) file, which can be opened in any spreadsheet program.

You say, you have hundreds of files. Put the above code into a script and pass an input argument $file . (The command cat "C:/somefolder/somefile.html" will become cat $file in the script.) Pass each file one by one using the following:

var str filelist
lf -rn "*.html" "C:/somefolder" > $filelist
while ($filelist <> "")
do
var str file ; lex "1" $filelist
# Call your script with $file here.
done


If a $file will contain more than one <table>, and you want to extract, say, the second <table>, extract the second table using the following.

cat $file > $html
# Throw away everything before the second instance of <table .
stex -c "]^<table^2" $html > null
# Throw away everything after the immediate next instance of </table>.
stex -c "^</table>^[" $html > null


$html is now ready to do the rest of the processing above.

Get biterscripting if you don't have it, from biterscripting.com . I think it is still free.

J
  • 0





Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download