How to properly use OpenXmlWriter to write large Excel files

Brian Jones already wrote something on using the OpenXmlWriter with the Open XML SDK to write large Excel files. There are a couple of things that weren’t mentioned though.

OpenXmlWriter works best for creating new things

For the purposes of explanation, we’ll assume the “new thing” is a new worksheet, although OpenXmlWriter can be used to write any Open XML SDK part (such as WorkbookPart).

The reason for this is that OpenXmlWriter is essentially writing out XML tags. The SDK offers classes and properties that we’re familiar with, but underneath it all, an Open XML spreadsheet’s internal structure are XML files zipped together. So instead of using the SDK classes to form the “XML” parts, we use the OpenXmlWriter to directly write out the XML.

The problem comes when we need to save. If you’re working with an existing worksheet, there’s already an existing XML structure in place. This means there’s a corresponding WorksheetPart (to your Worksheet class), and this WorksheetPart already has the XML structure (and more importantly, a DOM structure).

So you can’t just use the OpenXmlWriter to write the XML stuff into an existing WorksheetPart, because any changes you made will be ignored.

This is why in Brian’s code, he reads in the existing worksheet and writes the content into a new worksheet, thus using a new WorksheetPart that’s “attached” to the WorkbookPart. After he’s done reading, he deletes the old WorksheetPart.

In conclusion, either you

  • Work with the DOM structure using the SDK, which means snail-pace if you’re writing many cells
  • Work with new WorksheetPart

If you have something like worksheetPart.Worksheet.Save() mixed with the use of OpenXmlWriter, it’s not going to work.

XML attributes

The more subtle (and possibly more crucial) information that’s missing is of attributes. When you use the OpenXmlWriter, it works best if it’s used all the way. As in you don’t mix SDK class/property usage code with OpenXmlWriter code.

Basically, if there’s an SDK class property you want to set, don’t assign to that property in code. Use the OpenXmlWriter to write that property, because that property is eventually an XML attribute anyway. In this case, think of the SDK as getting in the way of making your code run faster.

In Brian’s code, notice that the RowIndex property of the Row class is not assigned. Which is fine since it’s optional. In this case, Excel tries to guess which row it is, and places the first Row class content into the first row on the spreadsheet. This is good if that’s what you want. And bad if your want your “first” Row class to be on the 3rd row instead.

You might also notice that the Cell class has no properties assigned. And no, CellFormula and CellValue are not properties of the Cell class, they’re actually child classes. The SDK allows you to access them as though they’re properties. This means the “standard” set of cell properties are not assigned:

  • The cell reference, such as “A1”
  • The cell data type, such as string or number
  • The cell style index

The code works because the cell content is a random number. And the default data type of a cell is a number (so it doesn’t need to be explicitly assigned). It will fail if you need strings or other data types.

And because there’s no cell reference given, Excel tries to guess where each cell is. Which is to say, place all the cells in that row flush to the left. This means you can’t have a cell in A1, and a blank cell in B1, and then a cell in C1, because Excel mooshes all the Cell classes together.

And because there’s no style index given, the default style is used. This is typically the minor font (such as Calibri) with no special formatting (such as font colour or bold or italics or borders).

So here’s sample code on how to use OpenXmlWriter. Note the class OpenXmlAttribute. And if you understand this, you will know that the OpenXmlWriter isn’t only for writing large Excel files. It just happens that it directly writes XML tags and attributes, and thus works the fastest for writing a lot of data.

using (SpreadsheetDocument xl = SpreadsheetDocument.Create("LargeFile.xlsx", SpreadsheetDocumentType.Workbook))
{
    List<OpenXmlAttribute> oxa;
    OpenXmlWriter oxw;

    xl.AddWorkbookPart();
    WorksheetPart wsp = xl.WorkbookPart.AddNewPart<WorksheetPart>();

    oxw = OpenXmlWriter.Create(wsp);
    oxw.WriteStartElement(new Worksheet());
    oxw.WriteStartElement(new SheetData());

    for (int i = 1; i <= 50000; ++i)
    {
        oxa = new List<OpenXmlAttribute>();
        // this is the row index
        oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));

        oxw.WriteStartElement(new Row(), oxa);

        for (int j = 1; j <= 100; ++j)
        {
            oxa = new List<OpenXmlAttribute>();
            // this is the data type ("t"), with CellValues.String ("str")
            oxa.Add(new OpenXmlAttribute("t", null, "str"));

            // it's suggested you also have the cell reference, but
            // you'll have to calculate the correct cell reference yourself.
            // Here's an example:
            //oxa.Add(new OpenXmlAttribute("r", null, "A1"));

            oxw.WriteStartElement(new Cell(), oxa);

            oxw.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

            // this is for Cell
            oxw.WriteEndElement();
        }

        // this is for Row
        oxw.WriteEndElement();
    }

    // this is for SheetData
    oxw.WriteEndElement();
    // this is for Worksheet
    oxw.WriteEndElement();
    oxw.Close();

    oxw = OpenXmlWriter.Create(xl.WorkbookPart);
    oxw.WriteStartElement(new Workbook());
    oxw.WriteStartElement(new Sheets());

    // you can use object initialisers like this only when the properties
    // are actual properties. SDK classes sometimes have property-like properties
    // but are actually classes. For example, the Cell class has the CellValue
    // "property" but is actually a child class internally.
    // If the properties correspond to actual XML attributes, then you're fine.
    oxw.WriteElement(new Sheet()
    {
        Name = "Sheet1",
        SheetId = 1,
        Id = xl.WorkbookPart.GetIdOfPart(wsp)
    });

    // this is for Sheets
    oxw.WriteEndElement();
    // this is for Workbook
    oxw.WriteEndElement();
    oxw.Close();

    xl.Close();
}

This creates a spreadsheet with 100 columns and 50000 rows of data. On my machine, it ran in about 18 seconds.

I’m not going to list all the available XML attributes, but you can find out the correct attribute name by exploring the Open XML SDK documentation, or the official ECMA documentation.

How to create an Open XML spreadsheet bar chart

Today, you’re going to learn how to insert a bar chart into an Open XML spreadsheet. Frankly speaking, the basics apply to the other types of charts, so if you master this, you can figure out how to create the other charts.

First, download the source code and the resulting Excel file.

I’m not going to go through every detailed step of the code. Instead, I’ll go through broad strokes of explanation. You’re expected to know the basics of using Open XML SDK to create cells and insert images. Refresher on cells and styles here, and inserting multiple images here.

Hardcoded data

You may have noticed that the spreadsheet cell data is hardcoded into arrays. The reason is it’s used in creating the cells, and creating a copy of the data when creating the chart. Open XML charts store the actual data it’s representing.

For charts, there are 3 parts to the data: the category entries, the “legend” entries, and the cell data. “Legend” is in quotes because most charts use that data part as the legend, but it’s not called legend entries. I believe internally, it’s referred as the series name of data series.

Absolute cell references

You may also notice that in the chart parts, cell references are absolute. For example, “Sheet1!$B$3”. It means the cell references are absolute, as opposed to relative (such as “Sheet1!B3”).

The sheet name is used

You may also notice that the sheet name is used. This means you can have a chart in sheet “Sheet1” but contain data from “Sheet2”.

Axis IDs

If you’ve explored an Open XML spreadsheet chart, you may have noticed magic numbers used as axis IDs. So how does Excel calculate them? I don’t know, but you don’t have to worry about that.

Within each chart, as long as the axis IDs are unique, you’re fine (this means within each ChartSpace class). This means theoretically, you only need 6 unique axis IDs. There are 3 axis IDs (category/date axis, value axis and series axis) for the primary axis and another 3 for the secondary axis. The secondary axis comes into play when you plot a particular data series on its own (or against an existing secondary axis).

When data series are plotted on a secondary axis, they can also be plotted as a different chart type. This is what’s known as a combination chart. For example, a column chart with a line chart.

I don’t know all the valid types of combination charts, thus the theoretical limit is 6. But I don’t think 3D charts can be part of combination charts. In practice, you only need 4 unique IDs.

And you don’t even need to fret. Just use 1 through 6 as the IDs.

Data series

Excel assumes that if your data is wider than it’s tall, then rows of horizontal cells are taken as data (with first row as category entries and the first cell of each row as the “legend” entry). Otherwise, it’s columns of vertical data (with first column as category entries and the first cell of each column as the “legend” entry). You can change this if you find your chart weird-looking on the Excel UI, but programmatically, the source code uses rows as data series.

Each data series has the category entries, the associated “legend” entry, and the actual data.

Different charts have different Open XML SDK classes representing data series. For example, the bar charts and column charts use the BarChartSeries. The pie charts (and surprisingly the doughnut charts) use the PieChartSeries. The reason is that there are different properties associated with different charts, and thus each chart series needs to take care of that. For example, line charts can have markers on each data point, and pie charts can have exploded pie slices.

Charts are also images

A chart has 2 main parts: the chart content, and the drawing part.

The chart content contains the chart data, legend, title, axes and what not. The drawing part refers to how big the chart is displayed on the worksheet, and where to display it. And in this sense, a chart technically behaves like an image.

So not only do you have to create the chart content, you have to take care of its “image-y” properties. Hence the reference link I gave on inserting multiple images earlier on.

Because of this, you have to take care of unique image IDs. Each worksheet needs a unique ID assigned to each image part. And a chart is considered an image for this purpose.

Here’s the gist. A WorksheetPart class is tied to each separate worksheet. Each WorksheetPart class has a DrawingsPart class. And a DrawingsPart class can have ImageParts (the “normal” images) and ChartParts (for charts).

For more Inception-like information, a ChartPart can have ImagePart’s too (to have picture fills for the chart title or legend for example). It’s sort of complicated, so let’s not go there in this article.

The dreaded TwoCellAnchor

Position of images (and thus charts) is done with either an AbsoluteAnchor, OneCellAnchor or TwoCellAnchor class. I explained a little why TwoCellAnchor took out more of my hair than 4 years of university mathematics here.

But for charts, you don’t really need fine-tuned positioning. So a TwoCellAnchor is fine. The column/row properties are zero-based indices of the column and row indices. The offsets are in EMUs. Theoretically, the offsets are less than the column width or row height of its corresponding column or row.

Chart anatomy

The typical chart goes like this. Each chart has a title, plot area and legend. The plot area contains a chart type (or several chart types for combination charts), and relevant axes.

Each chart type might have specific properties unique to the chart type assigned (bar charts will have the bar direction assigned for example). And data series unique to the chart type assigned (the BarChartSeries for bar and column charts for example). And any relevant axis IDs assigned (some charts don’t have axes, such as doughnut charts).

Relevant axes are typically a category (or text) axis on the horizontal, and a value axis on the vertical. If the category entries are date values, Excel might have used a date axis instead of a category axis. The source code provided assumes textual category axis is used.

For data series charted on the secondary axis, the order of appending axes is as follows:

  • Primary horizontal axis (whether category or date axis) [typically at bottom]
  • Primary vertical axis (value axis) [typically on the left]
  • Secondary vertical axis (value axis) [typically on the right]
  • Secondary horizontal axis (whether category or date axis) [typically on the top]

Bonus!

If you change the bar direction in the source code to Column, you’ll get a clustered column chart. Everything else works just fine.

In fact, all the charts look surprisingly similar in terms of code structure. Change the chart class, and the chart series class used. Fiddle with a couple of properties unique to the particular chart type. BOOM! New chart born.

The really hard part is getting the chart data series down pat. Which is why I hardcoded the data for this explanation. In practice, you’re probably getting your data from a database, so you’ll be iterating through that. Or obtaining the actual cell data somehow. So I might as well have a simple data structure so you can expand it yourself instead of hardcoding the absolute cell references everywhere (which will be a pain for you to follow).

In conclusion…

Charts are hard… I’m writing a reference manual for Open XML charts in spreadsheets. So stay tuned for that. In the meantime, my software library SpreadsheetLight takes care of all the above in a few lines of code. So take a look at that (it’s free and source code is available).

Future of spreadsheets

So I was checking some site statistics for my spreadsheet library website and I found an interesting search phrase: “open xml excel future”. The person might have been checking how important spreadsheets are in the future. Or maybe just checking if Microsoft Excel is still around.

Spreadsheets are still going to be used in future. This is based on my experience working in a medium-large telecommunications company, and based on the companies who bought my Open XML spreadsheet reference. Before I go on, let me tell you about an incident.

I went on Facebook and asked my friends to tell everyone they know of my spreadsheet library (SpreadsheetLight). One person commented that his friends are all hackers and don’t use spreadsheets. I will assume “hackers” to mean “programmers with higher than average programming skills” and “don’t use spreadsheets” to mean “hardly ever thought about them”.

Software is meant to be used by people. Usually “normal” people, not “hacker” people.

PayPal allows you to download transactional data. You can download it in CSV format. What do you do to see it clearer? You paste it into Excel. Or whatever spreadsheet software you use.

I’ve worked with satellite transactional data. There are lots of data fields. Source of transmission, destination of transmission, length of transmission, type of data. Why does this matter? Because the satellite companies charge the transactions to the customers. Even the source and destination is important (think of calling from America to China, and from Singapore to China. Wait, you might be too young to know about the overseas call charges…).

And the best way to view satellite transactional data is on a grid-like display. A spreadsheet.

So here’s a short rant. If you are a highfalutin coder who thinks you’re better than anyone, even another programmer who’s using some “inferior” or “non-open source” language or platform, get off your high tower. The only people who care about your code are other programmers. The normal people don’t care a flying fishball about your code.

I’m practical. I use whatever tools I have to do what I need to do. So far, C# is great. And there are people who don’t like the .NET Framework simply on principle that it’s from Microsoft.

I’ve never really understood the tension between the open source software people and the “closed” source software (aka enterprises and large companies) people. In particular, Microsoft seems to bear the biggest brunt of the force whereas Apple (the supposed “opposite”) is enshrined, despite the various salient restrictions Apple imposes (App Store apps approval process, closed system, proprietary hardware [iPhone, iPad]).

If I didn’t know better, I’d say some programmers hate Microsoft because Microsoft makes a lot of money. Oh there are specific reasons like closed proprietary systems, or crushing small competitors by buying them out, or whatever. I’d say the underlying cause of their distress is that Microsoft makes lots of money. And if I didn’t know better, I’d say these programmers find money a repulsive thing.

Views on money is a separate topic. My short version is: money allows you to be more of who you are. Money gives you more options. Money is not inherently evil.

Now Microsoft introduced the Open XML specifications, with the creation of the Open XML SDK for .NET Framework. People have criticised the Open XML specs, whether it’s “yet another standard” or “it’s from evil Microsoft!”.

The Open XML specs have their flaws. I should know because I’ve spent hours studying them and poring over the details. From my research, the specs are created to conform to the existing Microsoft Office software (Word, Excel and PowerPoint). If it’s in the Office software, the specs reflect it.

You can’t blame Microsoft. It’s a good move. It also made the specs very complicated. The reason is that the Office software is the accumulated knowledge and code base from years of testing and marketing.

You know what’s the most common thing to ask when creating Open XML documents? “How to create a simple Word document?” or “How to create a simple Excel spreadsheet?” Because the complexity made creating even an empty file a big pain in the derriere.

So what’s the future, really?

Some kind of standard would be used. Possibly the Open XML standard. Google Docs and Apple’s Number and other spreadsheet software will adopt it, or at least support it.

Spreadsheets are going to be around because companies and businesses use them. Even the small ones. I don’t care if you’re a one-man startup. You will eventually find a need to look at data in a grid-like pattern. That’s when you’ll use spreadsheets.

Businesses who hire programmers pay those programmers. And businesses use spreadsheets. That means somewhere along the line, you as a programmer will have to work with spreadsheets. Maybe not using spreadsheets like your users, but create them for your users.

Your users need monthly sales reports. You’re tasked to dump the database data into a spreadsheet that they can actually use. You don’t expect them to write SQL statements right? (If you do, why are you being paid?)

However, I don’t expect spreadsheets to be very complicated. Tabular data, with heading text. Simple charts (although I’m revising my view on charts. Even creating a simple bar chart with Open XML SDK is giving me a headache due to the many various options and settings available on the Excel UI). Simple styling (too much styling and your spreadsheet looks untidy and won’t be useful).

This means the full Open XML specs won’t have to be implemented. My gut guess is that at least 50% of the specs won’t be used for the simpler spreadsheets.

And in the future, simpler spreadsheets are probably going to be more common. You need to see and understand the data fast. The fancy styling just makes it harder to do that.

Setting it free

It was the hardest, most painful decision I had to make in a long while.

I had worked hard on the project. I downloaded a few open-source versions of similar projects for comparison. I went through the use cases, on how to do certain tasks with those open-source projects.

Then I went to a couple of commercial projects. I couldn’t afford to buy them, but I went through sample source code to compare the use cases as well.

I did all that for research. Then I came up with a list of features that would be useful. Then I implemented them in a way that makes it easy for programmers to use.

It was also supposed to be a “big ticket” item, something I could sell for a higher price. I’m biased of course, but I believe my software is easier to use than either the open-source projects or the commercial projects.

The open-source projects targeted people who wanted free software, or supported open source projects (vehemently sometimes), or small to medium businesses. The commercial projects targeted the big enterprises. I targeted the small to medium businesses, appealing to the programmers or the IT managers/directors of those businesses and companies.

I launched my project. It didn’t do very well.

I added more features. I wrote detailed documentation of the software. I tweaked the price. I changed the sales copy. I did some advertising. I marketed the software project as best as I could. I created a whole website around it. Months of work went into the project. Still it floundered.

In the end, I dug into the core of why the project existed, and it was because I wanted to make the lives of programmers easier. And putting the software behind a payment wall might have put a dent in that.

And so I set it free.

It was heart-wrenching. I cried. Not so much for the lack of sales, but more because no one wanted to use the software. Ok, fine, it was equal measure of “no one using it” and “I need to eat”.

So if you have a few minutes, I’d appreciate it if you’d check out my spreadsheet software library. It’s called SpreadsheetLight, and it’s free for download, and has source code available. Tell someone who might find the software useful. Thanks.

OSBC Launch

My Open XML spreadsheet course is up! Click here for details.

So here’s an interesting tidbit. Writing lesson 3 (recognising styles) was the most tedious part for me. It was unbearably boring for me. It was all I could do to slog through writing each word, coming up with screenshots and writing up example source code.

Ironically, styling a spreadsheet (Open XML or otherwise) is the hardest part of creating or modifying the spreadsheet. Maybe it’s because I’ve been looking at all the styling options so much that it’s become second nature to me. I flit between the Excel user interface and Open XML SDK classes with equal ease.

Right now, I have a different view of how teachers do their work. How do teachers continue to teach the same (or similar) materials to students every day (well, every semester, but you get the point)? At this moment, I don’t want to look at another Excel colour picker.

Anyway, if you want to learn about Open XML spreadsheets, and want a more student-friendly approach, consider OpenXML Spreadsheet Boot Camp. The course includes my Open XML reference manual as the official textbook. You will learn a lot, I promise.