How to create an Open XML spreadsheet bar chart

Today, you’re going to learn how to insert a bar chart into an Open XML spreadsheet. Frankly speaking, the basics apply to the other types of charts, so if you master this, you can figure out how to create the other charts.

First, download the source code and the resulting Excel file.

I’m not going to go through every detailed step of the code. Instead, I’ll go through broad strokes of explanation. You’re expected to know the basics of using Open XML SDK to create cells and insert images. Refresher on cells and styles here, and inserting multiple images here.

Hardcoded data

You may have noticed that the spreadsheet cell data is hardcoded into arrays. The reason is it’s used in creating the cells, and creating a copy of the data when creating the chart. Open XML charts store the actual data it’s representing.

For charts, there are 3 parts to the data: the category entries, the “legend” entries, and the cell data. “Legend” is in quotes because most charts use that data part as the legend, but it’s not called legend entries. I believe internally, it’s referred as the series name of data series.

Absolute cell references

You may also notice that in the chart parts, cell references are absolute. For example, “Sheet1!$B$3”. It means the cell references are absolute, as opposed to relative (such as “Sheet1!B3”).

The sheet name is used

You may also notice that the sheet name is used. This means you can have a chart in sheet “Sheet1” but contain data from “Sheet2”.

Axis IDs

If you’ve explored an Open XML spreadsheet chart, you may have noticed magic numbers used as axis IDs. So how does Excel calculate them? I don’t know, but you don’t have to worry about that.

Within each chart, as long as the axis IDs are unique, you’re fine (this means within each ChartSpace class). This means theoretically, you only need 6 unique axis IDs. There are 3 axis IDs (category/date axis, value axis and series axis) for the primary axis and another 3 for the secondary axis. The secondary axis comes into play when you plot a particular data series on its own (or against an existing secondary axis).

When data series are plotted on a secondary axis, they can also be plotted as a different chart type. This is what’s known as a combination chart. For example, a column chart with a line chart.

I don’t know all the valid types of combination charts, thus the theoretical limit is 6. But I don’t think 3D charts can be part of combination charts. In practice, you only need 4 unique IDs.

And you don’t even need to fret. Just use 1 through 6 as the IDs.

Data series

Excel assumes that if your data is wider than it’s tall, then rows of horizontal cells are taken as data (with first row as category entries and the first cell of each row as the “legend” entry). Otherwise, it’s columns of vertical data (with first column as category entries and the first cell of each column as the “legend” entry). You can change this if you find your chart weird-looking on the Excel UI, but programmatically, the source code uses rows as data series.

Each data series has the category entries, the associated “legend” entry, and the actual data.

Different charts have different Open XML SDK classes representing data series. For example, the bar charts and column charts use the BarChartSeries. The pie charts (and surprisingly the doughnut charts) use the PieChartSeries. The reason is that there are different properties associated with different charts, and thus each chart series needs to take care of that. For example, line charts can have markers on each data point, and pie charts can have exploded pie slices.

Charts are also images

A chart has 2 main parts: the chart content, and the drawing part.

The chart content contains the chart data, legend, title, axes and what not. The drawing part refers to how big the chart is displayed on the worksheet, and where to display it. And in this sense, a chart technically behaves like an image.

So not only do you have to create the chart content, you have to take care of its “image-y” properties. Hence the reference link I gave on inserting multiple images earlier on.

Because of this, you have to take care of unique image IDs. Each worksheet needs a unique ID assigned to each image part. And a chart is considered an image for this purpose.

Here’s the gist. A WorksheetPart class is tied to each separate worksheet. Each WorksheetPart class has a DrawingsPart class. And a DrawingsPart class can have ImageParts (the “normal” images) and ChartParts (for charts).

For more Inception-like information, a ChartPart can have ImagePart’s too (to have picture fills for the chart title or legend for example). It’s sort of complicated, so let’s not go there in this article.

The dreaded TwoCellAnchor

Position of images (and thus charts) is done with either an AbsoluteAnchor, OneCellAnchor or TwoCellAnchor class. I explained a little why TwoCellAnchor took out more of my hair than 4 years of university mathematics here.

But for charts, you don’t really need fine-tuned positioning. So a TwoCellAnchor is fine. The column/row properties are zero-based indices of the column and row indices. The offsets are in EMUs. Theoretically, the offsets are less than the column width or row height of its corresponding column or row.

Chart anatomy

The typical chart goes like this. Each chart has a title, plot area and legend. The plot area contains a chart type (or several chart types for combination charts), and relevant axes.

Each chart type might have specific properties unique to the chart type assigned (bar charts will have the bar direction assigned for example). And data series unique to the chart type assigned (the BarChartSeries for bar and column charts for example). And any relevant axis IDs assigned (some charts don’t have axes, such as doughnut charts).

Relevant axes are typically a category (or text) axis on the horizontal, and a value axis on the vertical. If the category entries are date values, Excel might have used a date axis instead of a category axis. The source code provided assumes textual category axis is used.

For data series charted on the secondary axis, the order of appending axes is as follows:

  • Primary horizontal axis (whether category or date axis) [typically at bottom]
  • Primary vertical axis (value axis) [typically on the left]
  • Secondary vertical axis (value axis) [typically on the right]
  • Secondary horizontal axis (whether category or date axis) [typically on the top]

Bonus!

If you change the bar direction in the source code to Column, you’ll get a clustered column chart. Everything else works just fine.

In fact, all the charts look surprisingly similar in terms of code structure. Change the chart class, and the chart series class used. Fiddle with a couple of properties unique to the particular chart type. BOOM! New chart born.

The really hard part is getting the chart data series down pat. Which is why I hardcoded the data for this explanation. In practice, you’re probably getting your data from a database, so you’ll be iterating through that. Or obtaining the actual cell data somehow. So I might as well have a simple data structure so you can expand it yourself instead of hardcoding the absolute cell references everywhere (which will be a pain for you to follow).

In conclusion…

Charts are hard… I’m writing a reference manual for Open XML charts in spreadsheets. So stay tuned for that. In the meantime, my software library SpreadsheetLight takes care of all the above in a few lines of code. So take a look at that (it’s free and source code is available).

Future of spreadsheets

So I was checking some site statistics for my spreadsheet library website and I found an interesting search phrase: “open xml excel future”. The person might have been checking how important spreadsheets are in the future. Or maybe just checking if Microsoft Excel is still around.

Spreadsheets are still going to be used in future. This is based on my experience working in a medium-large telecommunications company, and based on the companies who bought my Open XML spreadsheet reference. Before I go on, let me tell you about an incident.

I went on Facebook and asked my friends to tell everyone they know of my spreadsheet library (SpreadsheetLight). One person commented that his friends are all hackers and don’t use spreadsheets. I will assume “hackers” to mean “programmers with higher than average programming skills” and “don’t use spreadsheets” to mean “hardly ever thought about them”.

Software is meant to be used by people. Usually “normal” people, not “hacker” people.

PayPal allows you to download transactional data. You can download it in CSV format. What do you do to see it clearer? You paste it into Excel. Or whatever spreadsheet software you use.

I’ve worked with satellite transactional data. There are lots of data fields. Source of transmission, destination of transmission, length of transmission, type of data. Why does this matter? Because the satellite companies charge the transactions to the customers. Even the source and destination is important (think of calling from America to China, and from Singapore to China. Wait, you might be too young to know about the overseas call charges…).

And the best way to view satellite transactional data is on a grid-like display. A spreadsheet.

So here’s a short rant. If you are a highfalutin coder who thinks you’re better than anyone, even another programmer who’s using some “inferior” or “non-open source” language or platform, get off your high tower. The only people who care about your code are other programmers. The normal people don’t care a flying fishball about your code.

I’m practical. I use whatever tools I have to do what I need to do. So far, C# is great. And there are people who don’t like the .NET Framework simply on principle that it’s from Microsoft.

I’ve never really understood the tension between the open source software people and the “closed” source software (aka enterprises and large companies) people. In particular, Microsoft seems to bear the biggest brunt of the force whereas Apple (the supposed “opposite”) is enshrined, despite the various salient restrictions Apple imposes (App Store apps approval process, closed system, proprietary hardware [iPhone, iPad]).

If I didn’t know better, I’d say some programmers hate Microsoft because Microsoft makes a lot of money. Oh there are specific reasons like closed proprietary systems, or crushing small competitors by buying them out, or whatever. I’d say the underlying cause of their distress is that Microsoft makes lots of money. And if I didn’t know better, I’d say these programmers find money a repulsive thing.

Views on money is a separate topic. My short version is: money allows you to be more of who you are. Money gives you more options. Money is not inherently evil.

Now Microsoft introduced the Open XML specifications, with the creation of the Open XML SDK for .NET Framework. People have criticised the Open XML specs, whether it’s “yet another standard” or “it’s from evil Microsoft!”.

The Open XML specs have their flaws. I should know because I’ve spent hours studying them and poring over the details. From my research, the specs are created to conform to the existing Microsoft Office software (Word, Excel and PowerPoint). If it’s in the Office software, the specs reflect it.

You can’t blame Microsoft. It’s a good move. It also made the specs very complicated. The reason is that the Office software is the accumulated knowledge and code base from years of testing and marketing.

You know what’s the most common thing to ask when creating Open XML documents? “How to create a simple Word document?” or “How to create a simple Excel spreadsheet?” Because the complexity made creating even an empty file a big pain in the derriere.

So what’s the future, really?

Some kind of standard would be used. Possibly the Open XML standard. Google Docs and Apple’s Number and other spreadsheet software will adopt it, or at least support it.

Spreadsheets are going to be around because companies and businesses use them. Even the small ones. I don’t care if you’re a one-man startup. You will eventually find a need to look at data in a grid-like pattern. That’s when you’ll use spreadsheets.

Businesses who hire programmers pay those programmers. And businesses use spreadsheets. That means somewhere along the line, you as a programmer will have to work with spreadsheets. Maybe not using spreadsheets like your users, but create them for your users.

Your users need monthly sales reports. You’re tasked to dump the database data into a spreadsheet that they can actually use. You don’t expect them to write SQL statements right? (If you do, why are you being paid?)

However, I don’t expect spreadsheets to be very complicated. Tabular data, with heading text. Simple charts (although I’m revising my view on charts. Even creating a simple bar chart with Open XML SDK is giving me a headache due to the many various options and settings available on the Excel UI). Simple styling (too much styling and your spreadsheet looks untidy and won’t be useful).

This means the full Open XML specs won’t have to be implemented. My gut guess is that at least 50% of the specs won’t be used for the simpler spreadsheets.

And in the future, simpler spreadsheets are probably going to be more common. You need to see and understand the data fast. The fancy styling just makes it harder to do that.

OSBC Launch

My Open XML spreadsheet course is up! Click here for details.

So here’s an interesting tidbit. Writing lesson 3 (recognising styles) was the most tedious part for me. It was unbearably boring for me. It was all I could do to slog through writing each word, coming up with screenshots and writing up example source code.

Ironically, styling a spreadsheet (Open XML or otherwise) is the hardest part of creating or modifying the spreadsheet. Maybe it’s because I’ve been looking at all the styling options so much that it’s become second nature to me. I flit between the Excel user interface and Open XML SDK classes with equal ease.

Right now, I have a different view of how teachers do their work. How do teachers continue to teach the same (or similar) materials to students every day (well, every semester, but you get the point)? At this moment, I don’t want to look at another Excel colour picker.

Anyway, if you want to learn about Open XML spreadsheets, and want a more student-friendly approach, consider OpenXML Spreadsheet Boot Camp. The course includes my Open XML reference manual as the official textbook. You will learn a lot, I promise.

Teaching classes

I’ve been asked what possible career could I have with a maths degree. The asker usually already have an answer. Be a teacher.

When I graduated from university, it was about a year after Sep 9/11. Finding a job was hard. My friends, driven to desperation, decided to see if teaching was a viable career. I almost wanted to try too, but then I deliberately decided not to be a teacher (at least in Singapore), so I just followed my friends to the application centre.

I spent maybe an hour looking at the architecture of the centre while my friends filled in the application forms. So in Singapore, before you’re approved as a teacher, you had to go study at National Institute of Education (NIE) first. This is on top of whatever degree you’ve already earned. NIE teaches you how to be an effective teacher (we take education seriously here in Singapore).

After NIE, then you get assigned to schools. I don’t know the details, but I believe you don’t get a choice about the assignment.

Out of my university friends, only one actually became a teacher. There was once I mentioned that I forgot the double angle formulas of sine and cosine. She lectured me on that, together with another friend who was a part-time maths tutor.

I have since learned my lesson. (By not saying anything that’s maths-related. What, you think I’d go memorise the double angle formula just in case? Sheesh…)

So it’s a weird turn of events that I contemplated teaching a class of my own. It’s when I realised that classes on professional subjects are different than academic subjects. I’ve gone to a few Visual Studio courses (because they were the only courses I was approved to attend, and upper management pressured me to “Go take some courses dammit Vincent!”. Well, my manager didn’t actually say that, but the tone was there…), and the main difference was that you actually have to get something done.

So now, I’m teaching a programming course. Apparently, there are many tutorials out there teaching you how to do something in an Open XML spreadsheet (and for Word documents), but no one really teaches you how to know what settings to use.

The goal of the course: Given an Open XML spreadsheet (say, as a user requirement), reproduce that spreadsheet (with data changes of course). It’s called “OpenXML Spreadsheet Boot Camp”, and you can find out more here.

Open XML SDK class structure

I’ve gotten a few questions on the class structure of the Open XML SDK. There are articles on Open XML itself, where you work directly with XML files and tags, and zip them up yourself. Basically you work with WordprocessingML (Word), SpreadsheetML (Excel), PresentationML (PowerPoint) and possibly DrawingML (for images and stuff). Eric White did a lot of stuff on this.

There are also articles on the use of Open XML SDK itself. However, the articles I’ve found tend to give you code samples and some explanation of why you do those things, but didn’t really explain the deep “why”.

The fundamental question I was asked was “How are the Open XML SDK classes related to each other?“. A related question is “Why do I have to use that particular class?”.

While I’m familiar with the spreadsheet portion of the SDK, I believe the general class structure applies to the WordprocessingML and PresentationML parts too. So I’ll use the SpreadsheetML part as the example.

I also didn’t find any article giving names to the class categories I’m going to tell you, so I’m going to make up my own. If there’s an official Microsoft article on this, let me know. Generally speaking, there are 4 types of SDK classes:

  • Relationship Part Classes (henceforth referred to as RPCs)
  • Root Classes (henceforth referred to as RCs)
  • Content Classes (henceforth referred to as CCs)
  • Special Classes

Before I continue, keep in mind that Open XML documents are basically a bunch of XML files zipped together. Just like a relational database, certain XML files are related to each other (just like certain database tables are related to each other). Which brings us to the first type of class.

Relationship Part Classes

RPCs are the glue that holds certain SDK classes together. They are most easily recognisable by their class type name. For example, WorkbookPart and WorksheetPart.

However, not all SDK classes with names that end with “Part” are RPCs. For example, TablePart is not an RPC (it’s actually a Content Class). The corresponding RPC is actually TableDefinitionPart.

The most important part of an RPC is that it carries a relationship ID, and this relationship ID is used to tie relevant classes together. An RPC is also different in that it has as a property, a Root Class.

Root Classes

Remember that Open XML documents are a bunch of XML files zipped together? Well, an RC represents one of those XML files.

For example, the RPC WorksheetPart has as its RC, the Worksheet class. The Worksheet class holds information that basically translates into an XML file, typically sheet1.xml (and sheet2.xml and so on). The Worksheet class contains your worksheet cell data information.

Content Classes

If RCs represent an XML file, then CCs are basically XML tags.

For example, the Worksheet class contains the SheetData class, which contains the Row class(es), which in turn contains the Cell class(es). The corresponding XML tags are “worksheet”, “sheetData”, “row” and “c”.

Yes, an RC represents an XML file, and also translates to be the first XML tag of that XML file. That’s why it’s called a Root Class, because it also represents the root element (of the underlying XML structure/document).

Special Classes

These aren’t really that special. As far as I know, there are only 3 classes under this category: WordprocessingDocument, SpreadsheetDocument and PresentationDocument.

Those 3 classes form the starting point of any code relying on the Open XML SDK. You can consider them as Super Relationship Part Classes, because their properties are mainly RPCs.

An illustration

You might still be confused at this point (I don’t blame you…). Here’s a diagram for a simple Open XML spreadsheet:
Open XML SDK class structure

In green, we have the Special Class SpreadsheetDocument as the ultimate root.

In blue, we have the RPCs, 1 WorkbookPart class and 2 WorksheetPart classes. The SpreadsheetDocument class has the WorkbookPart class as a property. The WorkbookPart class contains a collection of WorksheetPart classes.

In grey, we have the RCs, 1 Workbook class and 2 Worksheet classes. The Workbook class is the RC of WorkbookPart class. The Worksheet classes are RCs of corresponding WorksheetPart classes. The Workbook class represents the workbook.xml file and the Worksheet classes (typically) represent sheet1.xml and sheet2.xml files.

In orange, we have the CCs. The Workbook class contains the Sheets class, which in turn contains 2 Sheet classes. The Sheet classes have a property holding the relationship ID of the corresponding WorksheetPart classes, which is how they’re tied to the Worksheet classes.

One of the most confusing parts…

After working with the Open XML SDK for a while, you might find yourself asking these questions:

  • Why are there so many classes?
  • Why are some of these classes devoid of any meaningful functionality?
  • Why are some of these classes duplicates of each other?

When I was first using the SDK, I felt the same way when I first used the .NET Framework: Being overwhelmed. There were many namespaces, with many classes in them, and I didn’t know which class to use for a specific purpose until I looked it up and wrote a small test program for it. Having a comprehensive help database/file for the .NET Framework was a really good idea.

And so it was with the Open XML SDK. I mean, it’s a spreadsheet. I can see a couple dozen of classes. Maybe. It turns out to be a lot of classes. That’s why there are so many code samples out there, but you don’t really know why you need to use that particular class.

And there are classes without any meaningful properties or functions. They inherit from a base Open XML class, and that’s it. For example, the Sheets class.

Then there are classes with identical properties. For example, the InlineString class, the SharedStringItem class and CommentText class. Or the Color, BackgroundColor, ForegroundColor and TabColor classes.

The answer is the same for all the above questions. The Open XML SDK is meant to abstract away the XML file structure.

There are duplicate classes because each class eventually translates into an XML tag (if it’s a CC or RC). XML requires a different tag for different purposes, hence the Open XML SDK has different classes. Even though the classes are identical in programming functionality, they become rendered as different XML tags.

There are classes without any seemingly meaningful properties or functions because their sole purpose is to have children. (Ooh Open XML SDK joke!) The Sheets class has as children, the Sheet classes. In XML, they’re correspondingly the “sheets” tag with “sheet” tags as children. The final XML tags have no XML attributes, hence the corresponding SDK classes also have no properties. Tada!

And finally, there are so many classes, because frankly speaking, you need one SDK class corresponding to each individually different XML tag. There are a lot of XML tags used in Open XML, hence so many classes. And that’s before we add in the Relationship Part Classes.

If you have any questions, leave them in a comment or contact me. And I’ll see if I can answer them in an article.