Reading Ms Word's .docx File Format

c# docx Microsoft word reader file format tutorial

2 replies to this topic

#1 BlackRabbit

BlackRabbit

    CodeCall Legend

  • Expert Member
  • 3871 posts
  • Location:Argentina
  • Programming Language:C, C++, C#, PHP, JavaScript, Transact-SQL, Bash, Others
  • Learning:Java, Others

Posted 08 June 2012 - 08:38 PM

We have all come across the .docx file format that Microsoft introduced in its word processor, just as we have come across the .xlsx file format for spreadsheets (which I covered in a previous tutorial). Today's goal is to understand what the .docx format is about and, of course, to build ourselves a reader for it in C#. Are you ready?


What is the .docx file format?

It is nothing more than an Office Open XML format which, as its name tells us, is a set of XML files packed together in a ZIP archive. One of those XML files contains the actual document text, and the others support it with templates, formatting, tables, configuration, and culture settings.
Click here for a wiki about docx file format
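
If you want to see that "XML file set" for yourself, here is a minimal sketch that lists the parts stored in a package. It assumes .NET's built-in System.IO.Compression (ZipArchive) rather than any third-party zip library, and the names DocxPackageLister, ListParts and BuildSamplePackage are made up for this illustration. The sample package is a stand-in built in memory, not a valid Word file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class DocxPackageLister
{
    // Returns the full names of all entries stored in the package stream.
    public static List<string> ListParts(Stream docxStream)
    {
        var names = new List<string>();
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }
        return names;
    }

    // Builds a tiny stand-in package in memory (hypothetical placeholder
    // content, only the entry names resemble a real .docx).
    public static MemoryStream BuildSamplePackage()
    {
        var stream = new MemoryStream();
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (string name in new[] { "[Content_Types].xml", "word/document.xml", "word/styles.xml" })
            {
                ZipArchiveEntry entry = archive.CreateEntry(name);
                using (var writer = new StreamWriter(entry.Open()))
                {
                    writer.Write("<placeholder/>");
                }
            }
        }
        stream.Position = 0;
        return stream;
    }
}
```

Calling DocxPackageLister.ListParts on a FileStream opened over a real .docx should give you the same kind of list you would see in an archive viewer.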
 


Edited by BlackRabbit, 24 November 2015 - 10:29 AM.


#2 Luthfi

Luthfi

    CC Leader

  • Expert Member
  • 1320 posts
  • Programming Language:PHP, Delphi/Object Pascal, Pascal, Transact-SQL
  • Learning:C, Java, PHP

Posted 18 June 2012 - 07:02 PM

Nice information. Will bookmark this for future reference.

#3 ImmortalGuy

ImmortalGuy

    CC Lurker

  • Just Joined
  • 1 posts

Posted 16 January 2015 - 08:07 AM

In order to read the document we are going to use the following:

. The ICSharpCode.SharpZipLib library
. The System.Xml namespace and its XML handling classes
. A sample .docx file that you will find in the attachment

So, before we start, you should download the Zip lib here

and take a look at the .NET System.Xml namespace and its methods in case this is your first encounter with it.


In the attachment you will find the file changes.docx, which is a proper .docx file from Silverlight. If we open it with WinRAR, though, you will find that there are many files inside it.

docxView.png


As you can see, there is one XML file for the document itself, where the text lives, and then there are XML files for the fonts, settings, styles, etc.
We are going to focus on document.xml only, in order to extract just the document's text.

So basically, we have to unzip the file, find document.xml and parse it. Let's do it.
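
The "find the document.xml" step does not have to rely on a fixed path, because the package's [Content_Types].xml names the main document part. Here is a minimal sketch of that lookup on its own, assuming a hand-written sample instead of a real file. MainPartLocator, FindMainPart and SampleContentTypes are names invented for this illustration; the namespace and content type strings are the standard Office Open XML ones.

```csharp
using System;
using System.Xml;

public static class MainPartLocator
{
    private const string ContentTypesNs =
        "http://schemas.openxmlformats.org/package/2006/content-types";
    private const string MainDocumentType =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";

    // Hand-written sample of a [Content_Types].xml (an assumption for the demo).
    public const string SampleContentTypes =
        "<Types xmlns=\"" + ContentTypesNs + "\">" +
        "<Default Extension=\"xml\" ContentType=\"application/xml\"/>" +
        "<Override PartName=\"/word/document.xml\" ContentType=\"" + MainDocumentType + "\"/>" +
        "<Override PartName=\"/word/styles.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml\"/>" +
        "</Types>";

    // Returns the archive-relative path of the main document part,
    // or null when no such part is declared.
    public static string FindMainPart(string contentTypesXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(contentTypesXml);

        // The elements live in a default namespace, so XPath needs a prefix for it.
        var xnm = new XmlNamespaceManager(doc.NameTable);
        xnm.AddNamespace("t", ContentTypesNs);

        var node = doc.SelectSingleNode(
            "/t:Types/t:Override[@ContentType='" + MainDocumentType + "']", xnm) as XmlElement;
        if (node == null) { return null; }

        // PartName starts with '/', while zip entry names do not.
        return node.GetAttribute("PartName").TrimStart('/');
    }
}
```

The leading slash is trimmed because zip entry names are stored without it, so the result can be compared directly against entry names.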


If you are going to build this yourself, here is what you should do:

. Create a Windows Forms application
. In the form, place a button that opens a FileOpen dialog, which you will use to choose the .docx file to be read
. Add to your project a reference to the previously downloaded ICSharpCode.SharpZipLib.dll
. Add a new class for the DocxTextReader, and paste the following code into it:

 

using System;
using System.IO;
using System.Text;
using System.Xml;
using ICSharpCode.SharpZipLib.Zip;

namespace tut_reading_docx
{
    class DocxTextReader
    {	    
	    private string file = "";
	    private string location = "";
	    
	    // constructor: takes the name of the file you want to extract the text from
	    public DocxTextReader(string theFile)   {	    file = theFile;	  }
 
	    // The do-it-all method: call it after the constructor.
	    // It tries to find and parse document.xml inside the zipped file
	    // and returns the docx's text as a string
	    public string getDocumentText()
	    {
		    if (string.IsNullOrEmpty(file))
		    {
			    throw new Exception("No Input file");
		    }
	    
		    location = getDocumentXmlFile_FromZipFile();

		    if (string.IsNullOrEmpty(location))
		    {
			    throw new Exception("Invalid Docx");
		    }

		    return ReadDocumentText();
	    }

	    // Open the archive, load the XML part found at 'location'
	    // and return the text extracted from its body
	    private string ReadDocumentText()
	    {
		    StringBuilder result = new StringBuilder();

		    string bodyXPath = "/w:document/w:body";

		    ZipFile zipped = new ZipFile(file);
		    try
		    {
			    foreach (ZipEntry entry in zipped)
			    {
				    // entry names are compared ordinally, ignoring case
				    if (!string.Equals(entry.Name, location, StringComparison.OrdinalIgnoreCase)) { continue; }

				    XmlDocument xmlDoc = new XmlDocument();
				    xmlDoc.PreserveWhitespace = true;
				    using (Stream documentXml = zipped.GetInputStream(entry))
				    {
					    xmlDoc.Load(documentXml);
				    }

				    XmlNamespaceManager xnm = new XmlNamespaceManager(xmlDoc.NameTable);
				    xnm.AddNamespace("w", @"http://schemas.openxmlformats.org/wordprocessingml/2006/main");

				    XmlNode node = xmlDoc.DocumentElement.SelectSingleNode(bodyXPath, xnm);

				    if (node != null) { result.Append(ReadNode(node)); }
				    break;
			    }
		    }
		    finally
		    {
			    // always release the file, even when there is no body or parsing throws
			    zipped.Close();
		    }

		    return result.ToString();
	    }

	    // Xml node reader helper :D
	    private string ReadNode(XmlNode node)
	    {
		    // not a good node ?
		    if (node == null || node.NodeType != XmlNodeType.Element) { return ""; }

		    StringBuilder result = new StringBuilder();
		    foreach (XmlNode child in node.ChildNodes)
		    {
			    // not an element node ?
			    if (child.NodeType != XmlNodeType.Element) { continue; }

			    // get the text, or replace the tags with the actual characters they stand for
			    switch (child.LocalName)
			    {
				    case "tab": result.Append("\t"); break;
				    case "p": result.Append(ReadNode(child)); result.Append("\r\n\r\n"); break;
				    case "cr":
				    case "br": result.Append("\r\n"); break;

				    case "t": // it's text!
					    // PreserveWhitespace is on, so InnerText already carries any
					    // significant spaces; append it unchanged instead of trimming
					    result.Append(child.InnerText);
				    break;

				    default:  result.Append(ReadNode(child));   break;
			    }
		    }

		    return result.ToString();
	    }

	    // Open the zip file, read [Content_Types].xml
	    // and return the location of the main document part (document.xml)
	    // inside the archive, or null when it cannot be found
	    private string getDocumentXmlFile_FromZipFile()
	    {
		    // SharpZipLib opens the zipped file for us
		    ZipFile zip = new ZipFile(file);
		    try
		    {
			    // walk the entries inside the zip file
			    // until we reach [Content_Types].xml
			    foreach (ZipEntry entry in zip)
			    {
				    if (!string.Equals(entry.Name, "[Content_Types].xml", StringComparison.OrdinalIgnoreCase)) { continue; }

				    XmlDocument xmlDoc = new XmlDocument();
				    xmlDoc.PreserveWhitespace = true;
				    using (Stream contentTypes = zip.GetInputStream(entry))
				    {
					    xmlDoc.Load(contentTypes);
				    }

				    // we need an XmlNamespaceManager for resolving namespaces
				    XmlNamespaceManager xnm = new XmlNamespaceManager(xmlDoc.NameTable);
				    xnm.AddNamespace("t", @"http://schemas.openxmlformats.org/package/2006/content-types");

				    // find the location of document.xml
				    XmlNode node = xmlDoc.DocumentElement.SelectSingleNode("/t:Types/t:Override[@ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"]", xnm);

				    if (node != null)
				    {
					    string partName = ((XmlElement)node).GetAttribute("PartName");
					    return partName.TrimStart(new char[] { '/' });
				    }
				    break;
			    }
		    }
		    finally
		    {
			    // close the zip on every path, including the early return above
			    zip.Close();
		    }

		    return null;
	    }

    }
	    
}
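
To see what the tag-to-character mapping in ReadNode does with real markup, here is a standalone sketch that runs the same idea on a hand-written WordprocessingML fragment. It is an illustration under assumptions, not the class above: WordTextSketch, ExtractText and Sample are made-up names, the fragment is much smaller than what Word really writes, and text runs are appended exactly as stored.

```csharp
using System;
using System.Text;
using System.Xml;

public static class WordTextSketch
{
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written fragment: two paragraphs, a tab and a line break.
    public const string Sample =
        "<w:document xmlns:w=\"" + W + "\"><w:body>" +
        "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r><w:r><w:tab/><w:t>world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Bye</w:t><w:br/><w:t>now</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Returns the plain text found under /w:document/w:body.
    public static string ExtractText(string documentXml)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.LoadXml(documentXml);

        var xnm = new XmlNamespaceManager(doc.NameTable);
        xnm.AddNamespace("w", W);

        XmlNode body = doc.SelectSingleNode("/w:document/w:body", xnm);
        var result = new StringBuilder();
        if (body != null) { Append(body, result); }
        return result.ToString();
    }

    // Walks the element tree and maps tags to the characters they stand for.
    private static void Append(XmlNode node, StringBuilder result)
    {
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element) { continue; }
            switch (child.LocalName)
            {
                case "t": result.Append(child.InnerText); break;   // text run, kept as stored
                case "tab": result.Append('\t'); break;
                case "cr":
                case "br": result.Append("\r\n"); break;
                case "p": Append(child, result); result.Append("\r\n\r\n"); break;
                default: Append(child, result); break;             // w:r and other containers
            }
        }
    }
}
```

For the sample, each paragraph ends with a blank line, the w:tab becomes a tab character and the w:br becomes a single line break, so ExtractText(Sample) gives "Hello \tworld\r\n\r\nBye\r\nnow\r\n\r\n".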


You will end up with something like this:

docxform.png


Just in case, this is the way you call the reader helper.


// Create a docxReader object
DocxTextReader docxReader = new DocxTextReader(file);

// and load the extracted text into your favorite textbox (multiline mode, of course)
tbDocxText.Text =  docxReader.getDocumentText();


So today we learnt what the .docx and Office Open XML file format is all about, we got introduced to the ICSharpCode libraries, which are very helpful for managing zipped files, and we learnt how to find our good old Word text content inside all that zipped XML. Not bad, I would say.

So, what do you say? Did you like it? I hope so. See you in the next tutorial.

 

in deep!!! Thanks so much

