How to properly use OpenXmlWriter to write large Excel files

Brian Jones already wrote something on using the OpenXmlWriter with the Open XML SDK to write large Excel files. There are a couple of things that weren’t mentioned though.

OpenXmlWriter works best for creating new things

For the purposes of explanation, we’ll assume the “new thing” is a new worksheet, although OpenXmlWriter can be used to write any Open XML SDK part (such as WorkbookPart).

The reason for this is that OpenXmlWriter is essentially writing out XML tags. The SDK offers classes and properties that we’re familiar with, but underneath it all, an Open XML spreadsheet’s internal structure are XML files zipped together. So instead of using the SDK classes to form the “XML” parts, we use the OpenXmlWriter to directly write out the XML.

The problem comes when we need to save. If you’re working with an existing worksheet, there’s already an existing XML structure in place. This means there’s a corresponding WorksheetPart (to your Worksheet class), and this WorksheetPart already has the XML structure (and more importantly, a DOM structure).

So you can’t just use the OpenXmlWriter to write the XML stuff into an existing WorksheetPart, because any changes you made will be ignored.

This is why in Brian’s code, he reads in the existing worksheet and writes the content into a new worksheet, thus using a new WorksheetPart that’s “attached” to the WorkbookPart. After he’s done reading, he deletes the old WorksheetPart.

In conclusion, either you

  • Work with the DOM structure using the SDK, which means snail-pace if you’re writing many cells
  • Work with new WorksheetPart

If you have something like worksheetPart.Worksheet.Save() mixed with the use of OpenXmlWriter, it’s not going to work.

XML attributes

The more subtle (and possibly more crucial) information that’s missing is of attributes. When you use the OpenXmlWriter, it works best if it’s used all the way. As in you don’t mix SDK class/property usage code with OpenXmlWriter code.

Basically, if there’s an SDK class property you want to set, don’t assign to that property in code. Use the OpenXmlWriter to write that property, because that property is eventually an XML attribute anyway. In this case, think of the SDK as getting in the way of making your code run faster.

In Brian’s code, notice that the RowIndex property of the Row class is not assigned. Which is fine since it’s optional. In this case, Excel tries to guess which row it is, and places the first Row class content into the first row on the spreadsheet. This is good if that’s what you want. And bad if your want your “first” Row class to be on the 3rd row instead.

You might also notice that the Cell class has no properties assigned. And no, CellFormula and CellValue are not properties of the Cell class, they’re actually child classes. The SDK allows you to access them as though they’re properties. This means the “standard” set of cell properties are not assigned:

  • The cell reference, such as “A1”
  • The cell data type, such as string or number
  • The cell style index

The code works because the cell content is a random number. And the default data type of a cell is a number (so it doesn’t need to be explicitly assigned). It will fail if you need strings or other data types.

And because there’s no cell reference given, Excel tries to guess where each cell is. Which is to say, place all the cells in that row flush to the left. This means you can’t have a cell in A1, and a blank cell in B1, and then a cell in C1, because Excel mooshes all the Cell classes together.

And because there’s no style index given, the default style is used. This is typically the minor font (such as Calibri) with no special formatting (such as font colour or bold or italics or borders).

So here’s sample code on how to use OpenXmlWriter. Note the class OpenXmlAttribute. And if you understand this, you will know that the OpenXmlWriter isn’t only for writing large Excel files. It just happens that it directly writes XML tags and attributes, and thus works the fastest for writing a lot of data.

using (SpreadsheetDocument xl = SpreadsheetDocument.Create("LargeFile.xlsx", SpreadsheetDocumentType.Workbook))
{
    List<OpenXmlAttribute> oxa;
    OpenXmlWriter oxw;

    xl.AddWorkbookPart();
    WorksheetPart wsp = xl.WorkbookPart.AddNewPart<WorksheetPart>();

    oxw = OpenXmlWriter.Create(wsp);
    oxw.WriteStartElement(new Worksheet());
    oxw.WriteStartElement(new SheetData());

    for (int i = 1; i <= 50000; ++i)
    {
        oxa = new List<OpenXmlAttribute>();
        // this is the row index
        oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));

        oxw.WriteStartElement(new Row(), oxa);

        for (int j = 1; j <= 100; ++j)
        {
            oxa = new List<OpenXmlAttribute>();
            // this is the data type ("t"), with CellValues.String ("str")
            oxa.Add(new OpenXmlAttribute("t", null, "str"));

            // it's suggested you also have the cell reference, but
            // you'll have to calculate the correct cell reference yourself.
            // Here's an example:
            //oxa.Add(new OpenXmlAttribute("r", null, "A1"));

            oxw.WriteStartElement(new Cell(), oxa);

            oxw.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

            // this is for Cell
            oxw.WriteEndElement();
        }

        // this is for Row
        oxw.WriteEndElement();
    }

    // this is for SheetData
    oxw.WriteEndElement();
    // this is for Worksheet
    oxw.WriteEndElement();
    oxw.Close();

    oxw = OpenXmlWriter.Create(xl.WorkbookPart);
    oxw.WriteStartElement(new Workbook());
    oxw.WriteStartElement(new Sheets());

    // you can use object initialisers like this only when the properties
    // are actual properties. SDK classes sometimes have property-like properties
    // but are actually classes. For example, the Cell class has the CellValue
    // "property" but is actually a child class internally.
    // If the properties correspond to actual XML attributes, then you're fine.
    oxw.WriteElement(new Sheet()
    {
        Name = "Sheet1",
        SheetId = 1,
        Id = xl.WorkbookPart.GetIdOfPart(wsp)
    });

    // this is for Sheets
    oxw.WriteEndElement();
    // this is for Workbook
    oxw.WriteEndElement();
    oxw.Close();

    xl.Close();
}

This creates a spreadsheet with 100 columns and 50000 rows of data. On my machine, it ran in about 18 seconds.

I’m not going to list all the available XML attributes, but you can find out the correct attribute name by exploring the Open XML SDK documentation, or the official ECMA documentation.

Open XML SDK class structure

I’ve gotten a few questions on the class structure of the Open XML SDK. There are articles on Open XML itself, where you work directly with XML files and tags, and zip them up yourself. Basically you work with WordprocessingML (Word), SpreadsheetML (Excel), PresentationML (PowerPoint) and possibly DrawingML (for images and stuff). Eric White did a lot of stuff on this.

There are also articles on the use of Open XML SDK itself. However, the articles I’ve found tend to give you code samples and some explanation of why you do those things, but didn’t really explain the deep “why”.

The fundamental question I was asked was “How are the Open XML SDK classes related to each other?“. A related question is “Why do I have to use that particular class?”.

While I’m familiar with the spreadsheet portion of the SDK, I believe the general class structure applies to the WordprocessingML and PresentationML parts too. So I’ll use the SpreadsheetML part as the example.

I also didn’t find any article giving names to the class categories I’m going to tell you, so I’m going to make up my own. If there’s an official Microsoft article on this, let me know. Generally speaking, there are 4 types of SDK classes:

  • Relationship Part Classes (henceforth referred to as RPCs)
  • Root Classes (henceforth referred to as RCs)
  • Content Classes (henceforth referred to as CCs)
  • Special Classes

Before I continue, keep in mind that Open XML documents are basically a bunch of XML files zipped together. Just like a relational database, certain XML files are related to each other (just like certain database tables are related to each other). Which brings us to the first type of class.

Relationship Part Classes

RPCs are the glue that holds certain SDK classes together. They are most easily recognisable by their class type name. For example, WorkbookPart and WorksheetPart.

However, not all SDK classes with names that end with “Part” are RPCs. For example, TablePart is not an RPC (it’s actually a Content Class). The corresponding RPC is actually TableDefinitionPart.

The most important part of an RPC is that it carries a relationship ID, and this relationship ID is used to tie relevant classes together. An RPC is also different in that it has as a property, a Root Class.

Root Classes

Remember that Open XML documents are a bunch of XML files zipped together? Well, an RC represents one of those XML files.

For example, the RPC WorksheetPart has as its RC, the Worksheet class. The Worksheet class holds information that basically translates into an XML file, typically sheet1.xml (and sheet2.xml and so on). The Worksheet class contains your worksheet cell data information.

Content Classes

If RCs represent an XML file, then CCs are basically XML tags.

For example, the Worksheet class contains the SheetData class, which contains the Row class(es), which in turn contains the Cell class(es). The corresponding XML tags are “worksheet”, “sheetData”, “row” and “c”.

Yes, an RC represents an XML file, and also translates to be the first XML tag of that XML file. That’s why it’s called a Root Class, because it also represents the root element (of the underlying XML structure/document).

Special Classes

These aren’t really that special. As far as I know, there are only 3 classes under this category: WordprocessingDocument, SpreadsheetDocument and PresentationDocument.

Those 3 classes form the starting point of any code relying on the Open XML SDK. You can consider them as Super Relationship Part Classes, because their properties are mainly RPCs.

An illustration

You might still be confused at this point (I don’t blame you…). Here’s a diagram for a simple Open XML spreadsheet:
Open XML SDK class structure

In green, we have the Special Class SpreadsheetDocument as the ultimate root.

In blue, we have the RPCs, 1 WorkbookPart class and 2 WorksheetPart classes. The SpreadsheetDocument class has the WorkbookPart class as a property. The WorkbookPart class contains a collection of WorksheetPart classes.

In grey, we have the RCs, 1 Workbook class and 2 Worksheet classes. The Workbook class is the RC of WorkbookPart class. The Worksheet classes are RCs of corresponding WorksheetPart classes. The Workbook class represents the workbook.xml file and the Worksheet classes (typically) represent sheet1.xml and sheet2.xml files.

In orange, we have the CCs. The Workbook class contains the Sheets class, which in turn contains 2 Sheet classes. The Sheet classes have a property holding the relationship ID of the corresponding WorksheetPart classes, which is how they’re tied to the Worksheet classes.

One of the most confusing parts…

After working with the Open XML SDK for a while, you might find yourself asking these questions:

  • Why are there so many classes?
  • Why are some of these classes devoid of any meaningful functionality?
  • Why are some of these classes duplicates of each other?

When I was first using the SDK, I felt the same way when I first used the .NET Framework: Being overwhelmed. There were many namespaces, with many classes in them, and I didn’t know which class to use for a specific purpose until I looked it up and wrote a small test program for it. Having a comprehensive help database/file for the .NET Framework was a really good idea.

And so it was with the Open XML SDK. I mean, it’s a spreadsheet. I can see a couple dozen of classes. Maybe. It turns out to be a lot of classes. That’s why there are so many code samples out there, but you don’t really know why you need to use that particular class.

And there are classes without any meaningful properties or functions. They inherit from a base Open XML class, and that’s it. For example, the Sheets class.

Then there are classes with identical properties. For example, the InlineString class, the SharedStringItem class and CommentText class. Or the Color, BackgroundColor, ForegroundColor and TabColor classes.

The answer is the same for all the above questions. The Open XML SDK is meant to abstract away the XML file structure.

There are duplicate classes because each class eventually translates into an XML tag (if it’s a CC or RC). XML requires a different tag for different purposes, hence the Open XML SDK has different classes. Even though the classes are identical in programming functionality, they become rendered as different XML tags.

There are classes without any seemingly meaningful properties or functions because their sole purpose is to have children. (Ooh Open XML SDK joke!) The Sheets class has as children, the Sheet classes. In XML, they’re correspondingly the “sheets” tag with “sheet” tags as children. The final XML tags have no XML attributes, hence the corresponding SDK classes also have no properties. Tada!

And finally, there are so many classes, because frankly speaking, you need one SDK class corresponding to each individually different XML tag. There are a lot of XML tags used in Open XML, hence so many classes. And that’s before we add in the Relationship Part Classes.

If you have any questions, leave them in a comment or contact me. And I’ll see if I can answer them in an article.

DefinedNames in Cells for Open XML spreadsheets

So a while back, a customer of mine asked me if I knew how to set names (or labels) to cells in a spreadsheet, so that a cell formula just referenced those names. Frankly, I didn’t even know I could do that. I’m not an Excel wizard, as you can tell.

In case you’re in a hurry, here’s the code and the resulting Excel spreadsheet. The code runs against the Open XML SDK.

To my surprise, the name given to the spreadsheet cell isn’t defined at the Cell class level. You’d think you would name your child and make sure to slap a name tag on your progeny just so everyone knows what to call your offspring, right? After looking through the documentation, and thinking it through, I guess it makes sense to separate it. It works by having a central depository with all the available names, contained within DefinedNames and DefinedName classes. Then the names are available throughout the spreadsheet’s workbook.

It’s sort of like the SharedStringTable, where every piece of text is stored in a SharedStringItem, and referenced with an index. This has problems, in that I don’t know what is contained in the Cell class itself, but I don’t really want to go there right now…

Let’s go through the code a bit. Here’s the part where you define DefinedName(s):

defname = new DefinedName();
defname.Name = "PrimeNum1";
defname.Text = "Sheet1!$C$2";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum2";
defname.Text = "Sheet1!$C$3";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum3";
defname.Text = "Sheet1!$C$4";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum4";
defname.Text = "Sheet1!$C$5";
defnames.Append(defname);

In case you’re unfamiliar with Excel, “Sheet1!$C$5” means the cell C5 of the sheet named “Sheet1”. The dollar sign acts as a separator, I think. In this case, the relevant cells are C2, C3, C4 and C5 (containing the 1st, 2nd, 3rd and 4th prime numbers). And then we have this part:

r = new Row();
r.RowIndex = 6;
c = new Cell();
c.CellReference = "B6";
c.DataType = CellValues.String;
c.CellValue = new CellValue("SUM");
r.Append(c);
c = new Cell();
c.CellReference = "C6";
c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");
c.CellValue = new CellValue("17");
r.Append(c);
sd.Append(r);

Note this particular line:

c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");

That’s our defined names in effect. Otherwise, we would use this:

c.CellFormula = new CellFormula("SUM(C2, C3, C4, C5)");
// or even
// c.CellFormula = new CellFormula("SUM(C2:C5)");

I teach you more about the CellFormula class, as well as a whole bunch of Open XML concepts in my programming guide.

3 facts about Open XML SDK

In the course of writing my guide on Open XML and Excel spreadsheets, I discovered some things about the Open XML SDK. There’s been some flak about the SDK or even on Open XML in general. But the price of the SDK is free, so you can’t beat that. While there are some inconveniences with using the SDK, the solutions and workarounds aren’t particularly nasty.

Here we go.

Open XML SDK is not a convenience library

If you’re looking for a neat little function that will toast bread, mop your floor, press your shirt and wish you “Have a nice day!” as you’re leaving the house, Open XML SDK is not what you’re looking for. (Actually if you find a software library that does the above, I wanna know.)

The Open XML SDK is a software development kit (hence the “SDK”). It is not a conventional software library in that sense. It provides atomic functions that allow you to do all sorts of nifty things to create/manipulate Open XML spreadsheets, word processing documents and presentation slides. The keyword is “atomic”.

This isn’t necessarily a bad thing. I know some companies will prefer not to use a software library (or third-party software). For example, government agencies or financial institutions, where data security is a concern and all software is (preferably) written in-house. The furthest they would allow is probably standard libraries.

I’m not going into a discussion about whether Open XML SDK is a “standard” library, since it’s from Microsoft. There seems to be a lot of unhappiness with Microsoft. Is it because of their monopoly? Because Bill Gates is rich? I like the .NET Framework, because it makes coding easier. Open XML SDK is built on top of that, so that makes Open XML documents easier to create and manipulate.

That said, because Open XML SDK offers atomic functions to do atomic tasks, this allows you to create your own functions to encapsulate whatever you’re trying to achieve. This flexibility is sometimes more valuable than convenience. Why do you think software companies charge a lot more for the source code of their libraries? You want to know what’s going on behind that function from a third-party library, because you don’t know if that bug is due to your own code or their code. (It’s probably your own code though…)

Open XML SDK is not a data manipulation library

There are 4 main parts to the Open XML SDK:

  • Spreadsheets
  • Word processing documents
  • Presentation slides
  • Graphics

The graphics part is a common set of class objects used by the other 3. Out of the first 3, spreadsheets are typically used to hold lots of data. Financial data, fiscal summaries, call logs, database dumps and so on. Text documents and slides hold data too, but not usually as intensive as spreadsheets.

Now it might surprise you that Open XML SDK offers no functions to manipulate any kind of data. None at all.

Need to dump 100 database records with 8 fields into a spreadsheet? You need to write code to fill in a 8-column-100-row block of cells.

Need to sort data? Do it within the database environment (there’s a SORT BY clause, you know?), or write your own sorting function. Because Open XML SDK ain’t got it.

This isn’t a big problem, but it does make writing code a little bit of a hassle. Particularly when you’re creating or manipulating a data-intensive spreadsheet. It’s not like when you’re attaching the DataReader object to a DataTable, and voila! All the data is nicely formatted and shown on a web page.

But this is also where the flexibility of writing your own functions come in.

Code execution order is important

When writing code with Open XML SDK, you have to put aside how you normally work with spreadsheets, text documents and slides. I’ll lump them together as “office documents”. Small “o” so it’s not copyright, right?

As a user, you would normally just go about your task of typing text, inserting images, styling text, cut-and-copy sections of an office document. The order of how you go about doing it is irrelevant, because you get to see the final result. And the final result is all that matters.

When working with Open XML SDK, you need to keep to an ordering of code sections. Each class in the SDK has to be appended to the appropriate parent class in the correct order, or you’ll get a corrupt file. This means your code has to be executed in the correct order too.

Go to the Open XML SDK help file. Let’s say we look at the Worksheet class. There will be a section called XML schema. In that, there will be an XML tag called “sequence”. See those “element” tags? That’s the order.

This code order isn’t a big problem. In your spreadsheet, it looks like an image is at the top, with a table of cell data below it. In code, you need to code for the table of cells first, then the image. Because that’s the order.

The solution is to ignore the visual placements of the data sections in your office document, and look at what is in your office document. Then you rearrange all the data parts according to the order, and then write code for that.

Conclusion

The Open XML SDK offers you flexibility in how you write your code and a great price (it’s free). It is backed by a large software company (Microsoft), so it’s not likely to disappear with no support in the future.

Sure it’s not as easy to use, but sometimes your situation doesn’t allow you to use a third-party library (cost, licensing, security). Hey, it works well enough. And sometimes, when you’re in a tight project, that’s all you really need. That you produce software (that produces office documents) well enough.

If you’re interested, you can check out my guide. I teach you how to create a spreadsheet with the standard functions in Excel such as text styling, image insertions, multiple worksheets, cell formulas. I also show you how to use a template in case you need something really complicated. And there’s working source code in C# and VB.NET.