DefinedNames in Cells for Open XML spreadsheets

So a while back, a customer of mine asked me if I knew how to set names (or labels) to cells in a spreadsheet, so that a cell formula just referenced those names. Frankly, I didn’t even know I could do that. I’m not an Excel wizard, as you can tell.

In case you’re in a hurry, here’s the code and the resulting Excel spreadsheet. The code runs against the Open XML SDK.

To my surprise, the name given to the spreadsheet cell isn’t defined at the Cell class level. You’d think you would name your child and make sure to slap a name tag on your progeny just so everyone knows what to call your offspring, right? After looking through the documentation, and thinking it through, I guess it makes sense to separate it. It works by having a central depository with all the available names, contained within DefinedNames and DefinedName classes. Then the names are available throughout the spreadsheet’s workbook.

It’s sort of like the SharedStringTable, where every piece of text is stored in a SharedStringItem, and referenced with an index. This has problems, in that I don’t know what is contained in the Cell class itself, but I don’t really want to go there right now…

Let’s go through the code a bit. Here’s the part where you define DefinedName(s):

defname = new DefinedName();
defname.Name = "PrimeNum1";
defname.Text = "Sheet1!$C$2";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum2";
defname.Text = "Sheet1!$C$3";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum3";
defname.Text = "Sheet1!$C$4";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum4";
defname.Text = "Sheet1!$C$5";
defnames.Append(defname);

In case you’re unfamiliar with Excel, “Sheet1!$C$5″ means the cell C5 of the sheet named “Sheet1″. The dollar sign acts as a separator, I think. In this case, the relevant cells are C2, C3, C4 and C5 (containing the 1st, 2nd, 3rd and 4th prime numbers). And then we have this part:

r = new Row();
r.RowIndex = 6;
c = new Cell();
c.CellReference = "B6";
c.DataType = CellValues.String;
c.CellValue = new CellValue("SUM");
r.Append(c);
c = new Cell();
c.CellReference = "C6";
c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");
c.CellValue = new CellValue("17");
r.Append(c);
sd.Append(r);

Note this particular line:

c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");

That’s our defined names in effect. Otherwise, we would use this:

c.CellFormula = new CellFormula("SUM(C2, C3, C4, C5)");
// or even
// c.CellFormula = new CellFormula("SUM(C2:C5)");

I teach you more about the CellFormula class, as well as a whole bunch of Open XML concepts in my programming guide.

Multi-use variables or multiple variables?

So I’ve been working on a software project of mine. I’ll tell you more about it soon enough, but for now, it’s enough to say that I’m writing source code that generates source code.

One thing I’ve noticed is variable declaration. There are 2 extremes.

One variable used multiple times

This is the memory-efficient version. If you need the use of an integer variable, you just declare one variable. For example,

int i;
i = DoSomething() + DoSomethingElse();
DoAlpha(i);
i = DoThis() + DoThat();
DoBeta(i);

That’s just for illustrative purposes. If you’ve written a fair amount of code, I’m sure you can think of better examples. Which are probably (and usually) more elaborate and lengthier.

The drawback to this is that the variable is temporary. As the code continues its execution, previous values stored in that variable are considered to be unimportant to future executions. That’s why the value can be discarded and the variable overwritten.

Multiple variables but one-off use

Then there’s the “declare as many variables as you can (or think you need)” method. For example,

int i1 = DoSomething();
int i2 = DoSomethingElse();
int i3 = DoThis();
int i4 = DoThat();

This has the advantage of keeping the variable values “alive” through that section of code. The drawback is that you use more memory, even if seemingly trivial. I mean, that’s like 12 more bytes of memory (assuming integers still take up 32 bits when you’re reading this). That hardly makes a dent in the computer’s memory space.

The hybrid

The above 2 are extreme cases. What happens when you write code is probably a hybrid, somewhere in between the 2 extremes. For example,

int iSubtotal;
int iTotal;
iSubtotal = DoSomething();
iTotal += iSubtotal;
iSubtotal = DoThis() + DoThat();
DoSomethingElse(iSubtotal);
iTotal += iSubtotal;

You know what you declared those variables for, so you have an idea how many “unique” variables you need. This have the benefits of using the least number of variables (sort of), balanced with keeping the least number of “live” variable values around.

So why am I talking about this?

Auto-generated source code cannot generate hybrids

When you’re writing code, you have one very important advantage: You have context. A program that generates source code, such as a decompiler, does not have that.

When you’re writing code, you make variable decisions such as naming, naming conventions, how many you need and so on.

A decompiler has difficulty making decisions like those, so it has to choose one of the extremes. Typically the multiple variables route, because that’s the safest. All a decompiler can do is detect that a variable is needed, and so writes out the variable declaration in the resulting source code. It cannot decide on whether this part of the code can reuse one of the variables it has already declared (or at least has difficulty doing so).

Ok, so the cat’s out of the bag. I’m writing a decompiler. That’s not exactly true but will suffice for now (I promise I’ll tell you more soon!).

Anyway, that’s what I discovered while working on my software project. I have decided to go the multi-use variable route, because of a human (and programmer) behaviour. A human programmer has difficulty holding on to many separate variables in his head.

When a section of code requires many variables, I tend to try to limit the number of variables I remember in my head. Maybe there’s a pattern. I might remember there’s fFinancialYear1 up to fFinancialYear7. I might decide to refactor the code such that I only need one fFinancialYear floating point variable (assuming the appended numeral makes sense, and not just laziness in naming). I might separate the code section into several sections, so each section has a limited number of variables.

Maybe that’s not how most programmers work, but I find it “friendlier” than having thisIsAnAwesomeClass1 through thisIsAnAwesomeClass20, and I can’t remember which awesome class does which. I tend to work with tighter variable names (where possible and logical), and write code that’s as tight in scope as possible. So the variable values can be discarded, which means I don’t have to keep track of whether that value is still needed, even if the computer doesn’t mind having to keep track of it.

So how do you write your code where variables are concerned?

3 facts about Open XML SDK

In the course of writing my guide on Open XML and Excel spreadsheets, I discovered some things about the Open XML SDK. There’s been some flak about the SDK or even on Open XML in general. But the price of the SDK is free, so you can’t beat that. While there are some inconveniences with using the SDK, the solutions and workarounds aren’t particularly nasty.

Here we go.

Open XML SDK is not a convenience library

If you’re looking for a neat little function that will toast bread, mop your floor, press your shirt and wish you “Have a nice day!” as you’re leaving the house, Open XML SDK is not what you’re looking for. (Actually if you find a software library that does the above, I wanna know.)

The Open XML SDK is a software development kit (hence the “SDK”). It is not a conventional software library in that sense. It provides atomic functions that allow you to do all sorts of nifty things to create/manipulate Open XML spreadsheets, word processing documents and presentation slides. The keyword is “atomic”.

This isn’t necessarily a bad thing. I know some companies will prefer not to use a software library (or third-party software). For example, government agencies or financial institutions, where data security is a concern and all software is (preferably) written in-house. The furthest they would allow is probably standard libraries.

I’m not going into a discussion about whether Open XML SDK is a “standard” library, since it’s from Microsoft. There seems to be a lot of unhappiness with Microsoft. Is it because of their monopoly? Because Bill Gates is rich? I like the .NET Framework, because it makes coding easier. Open XML SDK is built on top of that, so that makes Open XML documents easier to create and manipulate.

That said, because Open XML SDK offers atomic functions to do atomic tasks, this allows you to create your own functions to encapsulate whatever you’re trying to achieve. This flexibility is sometimes more valuable than convenience. Why do you think software companies charge a lot more for the source code of their libraries? You want to know what’s going on behind that function from a third-party library, because you don’t know if that bug is due to your own code or their code. (It’s probably your own code though…)

Open XML SDK is not a data manipulation library

There are 4 main parts to the Open XML SDK:

  • Spreadsheets
  • Word processing documents
  • Presentation slides
  • Graphics

The graphics part is a common set of class objects used by the other 3. Out of the first 3, spreadsheets are typically used to hold lots of data. Financial data, fiscal summaries, call logs, database dumps and so on. Text documents and slides hold data too, but not usually as intensive as spreadsheets.

Now it might surprise you that Open XML SDK offers no functions to manipulate any kind of data. None at all.

Need to dump 100 database records with 8 fields into a spreadsheet? You need to write code to fill in a 8-column-100-row block of cells.

Need to sort data? Do it within the database environment (there’s a SORT BY clause, you know?), or write your own sorting function. Because Open XML SDK ain’t got it.

This isn’t a big problem, but it does make writing code a little bit of a hassle. Particularly when you’re creating or manipulating a data-intensive spreadsheet. It’s not like when you’re attaching the DataReader object to a DataTable, and voila! All the data is nicely formatted and shown on a web page.

But this is also where the flexibility of writing your own functions come in.

Code execution order is important

When writing code with Open XML SDK, you have to put aside how you normally work with spreadsheets, text documents and slides. I’ll lump them together as “office documents”. Small “o” so it’s not copyright, right?

As a user, you would normally just go about your task of typing text, inserting images, styling text, cut-and-copy sections of an office document. The order of how you go about doing it is irrelevant, because you get to see the final result. And the final result is all that matters.

When working with Open XML SDK, you need to keep to an ordering of code sections. Each class in the SDK has to be appended to the appropriate parent class in the correct order, or you’ll get a corrupt file. This means your code has to be executed in the correct order too.

Go to the Open XML SDK help file. Let’s say we look at the Worksheet class. There will be a section called XML schema. In that, there will be an XML tag called “sequence”. See those “element” tags? That’s the order.

This code order isn’t a big problem. In your spreadsheet, it looks like an image is at the top, with a table of cell data below it. In code, you need to code for the table of cells first, then the image. Because that’s the order.

The solution is to ignore the visual placements of the data sections in your office document, and look at what is in your office document. Then you rearrange all the data parts according to the order, and then write code for that.

Conclusion

The Open XML SDK offers you flexibility in how you write your code and a great price (it’s free). It is backed by a large software company (Microsoft), so it’s not likely to disappear with no support in the future.

Sure it’s not as easy to use, but sometimes your situation doesn’t allow you to use a third-party library (cost, licensing, security). Hey, it works well enough. And sometimes, when you’re in a tight project, that’s all you really need. That you produce software (that produces office documents) well enough.

If you’re interested, you can check out my guide. I teach you how to create a spreadsheet with the standard functions in Excel such as text styling, image insertions, multiple worksheets, cell formulas. I also show you how to use a template in case you need something really complicated. And there’s working source code in C# and VB.NET.

Revenue sharing and operations research – part 3

This is a mini-series on how revenue sharing and operations research are linked. You might want to read part 1 on the specific business problem I was solving, and part 2 for the mathematical formulation of the problem. In this final part, I’ll tell you what was the solution eventually used.

First, although I said it was basically an assignment problem in part 1, on further thought, I don’t believe it is. What I was thinking was “how to assign that $0.01 such that there are no rounding errors”. Thus the “assignment” part. I apologise for any confusion.

Second, I said the financial/maths formulation was an integer problem. But the values are money values with decimal values, right? How can it be an integer problem? Because it has to also be correct up to 2 decimal places. That means fixed arithmetic. Therefore it becomes an integer problem. Just think of the values as cents (instead of dollars correct up to 2 decimal places).

Now, if you’ve read part 2 (and I applaud you if you actually sat through that mass of maths formulations), you should have guessed that using operations research to solve the business problem was not advisable. It might not even be suitable.

However, the problem still needed to be solved. How do you get rid of any extra or missing money?

More maths…

Going back to the example I gave in part 1, there were 3 products with revenue to be split between 2 parties. So there were 6 parts. If each part generated a rounding error of +$0.01, then there was a maximum potential difference of $0.06 between the original revenue to be shared and the sum of the parts after revenue sharing calculations.

I remind you that whatever solution I came up with had to make mathematical sense and financial sense to the programmers and the users. There are actually 2 goals:

  • To have the sum of the parts be equal to the original revenue amount
  • To have each part’s amount (after rounding) be as close to the calculated value as possible

The 1st goal ensures no summation errors. After revenue sharing, there are no extra or missing money amounts. This part is actually easy to fulfill. The 2nd goal is fulfilled with a bit of adjustments. So here’s the solution.

The easy-to-understand maths solution

We split the revenue accordingly to each part first, rounding each part’s amount to 2 decimal places. Then we sort each part in ascending order. Then we sum all the part’s amounts together. If there’s a discrepancy, we correct the discrepancy by adjusting the largest amount. This calls for an example.

Here’s the original example used. Total revenue for the 3 products are:

  • ProductA: $63.13
  • ProductB: $20.75
  • ProductC: $16.12

Assuming a 30-70 percentage split, we have:

  • ProductA: $18.94 (us), $44.19 (them)
  • ProductB: $6.23 (us), $14.53 (them)
  • ProductC: $4.84 (us), $11.28 (them)

Sorting all the parts in ascending order, we have:

  • ProductC: $4.84 (us)
  • ProductB: $6.23 (us)
  • ProductC: $11.28 (them)
  • ProductB: $14.53 (them)
  • ProductA: $18.94 (us)
  • ProductA: $44.19 (them)

The sum of the parts’ amounts is $100.01, which is not equal to the original revenue being shared ($100). The discrepancy is a +$0.01. So we adjust the largest amount. Specifically, we deduct $0.01 from the largest amount (because our discrepancy is positive).

So the revenue share for the content provider for ProductA becomes $44.18, and thus the sum of the parts become $100.

This method ensures that the sum of the each part’s amounts is still equal to the original revenue, which is very important (because this is a financial operation!). This satisfies the 1st goal.

And for each part, the amount is rounded to the nearest 2 decimal place. So each part’s amount is as close to the calculated split value as possible. The only exception is the largest amount might be off a little.

Now I chose the largest amount to “absorb” any rounding discrepancy precisely because it is the largest amount. Note that the term “largest” refers to the magnitude, so if you happen to deal with negative values (it happens, even in financial situations. Consider debt as an example), use the maths absolute function to do the sorting.

Any discrepancy can be mathematically shown to be at most equal to (number of parts) multiply by $0.01 (rounding error).

D <= ± (N * $0.01)

where D is the discrepancy and N equals the number of parts.

Note that the discrepancy is bounded, which is the mathematical way of saying it has an upper and lower limit (or bound).

Note also that in a fraction, a larger numerator means a larger fraction and a smaller numerator means a smaller fraction. A larger denominator means a smaller fraction and a smaller denominator means a larger fraction.

Now, whatever the discrepancy value is, it is bounded, it is fixed, and it is a small value. If we want any amount to “absorb” this discrepancy, then the larger the amount, the smaller the resulting error fraction or error percentage.

For example, if the discrepancy is $0.01 and the amount is $1, the resulting error percentage is 1% ($0.01 / $1.00 * 100%). If the amount is $5, the resulting error percentage becomes 0.2% ($0.01 / $5.00 * 100%).

Suppose the discrepancy is $0.02. We could spread the discrepancy error among the largest 2 amounts, each amount absorbing $0.01. But this makes the programming a little more complicated than it is. Also, it makes the algorithm a bit “dynamic”, which makes tracing any calculations by a programmer or user difficult.

Implementing it in code

All the revenue amounts were stored in the database. Because of this, I recommended that any revenue sharing calculations be done within the database environment itself. Namely, with stored procedures running SQL statements.

Benefits of using stored procedures within the database environment:

  • Can sort values easily (use the SORT BY clause)
  • Can handle dynamic number of values (with temp tables or the original tables)
  • Some calculations can be grouped into a single UPDATE statement
  • All the values are in the database!

The last benefit means there’s little switching of context from the database environment to the… whatever environment. In that project, it was scheduled Unix shell scripts combined with C programs that called the stored procedures. We didn’t want the context to switch back to a Unix environment to do calculations. Doing calculations in the Unix environment with C might be fast, but there are many content providers and many products. The context switching might eat up any performance benefits. Besides, having the calculations in a few stored procedures mean better modularity and separation of functions.

Further considerations

Because we’re in the business of … uh, doing business, we might want the customer to have a better deal. Or at least an easier report to read.

In the method above, we sorted the amounts in ascending order, regardless of whether it’s us or them. So it could well mean that the largest revenue share of the content provider be used to absorb the discrepancy.

This might mean when they read the revenue sharing report, they might question why that amount is $44.18 instead of $44.19. It might be just $0.01 to you, but it’s still money!

What we can do is sort the amounts by theirs first, then ours. And within each, sort by ascending order. So we could have this instead:

  • ProductC: $11.28 (them)
  • ProductB: $14.53 (them)
  • ProductA: $44.19 (them)
  • ProductC: $4.84 (us)
  • ProductB: $6.23 (us)
  • ProductA: $18.94 (us)

In this case, we adjust our revenue share for ProductA to be $18.93 (instead of $18.94) so the revenue sum is correct. Basically, we absorb any discrepancy, using our largest revenue share amount.

And that’s the end of the discussion of revenue sharing: the business part, the maths part and the implementing/programming part. Let me know if you have any questions.

Modularity in programming guides

I’ve read many programming guide books and tutorials. The one thing I’m looking for is, “I want to do X. What is the code I need to write to do just that?”

Many times, the author of the book or tutorial had mixed in other code or concepts into that. I want to know the simplest way to print “Hello World!”. I don’t want to include any extra libraries that don’t help with that. I don’t want any custom functions that makes printing a string any easier. I just want to print a string, ok?

The point is, the author already knows how to accomplish that task you want to learn. It’s when he gets, I don’t know, bored, that he adds other concepts to make it, I don’t know, interesting.

I’m not looking for the least number of code lines to write that accomplishes that task, although it’s usually that. I’m looking for the lines of code that just do what I want.

Because sometimes, I can’t differentiate the important from the extraneous. I don’t know, that’s why I’m learning, remember? This is especially important when I need to mix and match different concepts. If what I learnt has other extra stuff mixed in, then the resulting code has “more” extra stuff.

It’s like I want to mix X and Y, but got (X + dx) and (Y + dy). And I don’t know which parts are dx or dy.

Some authors make it clear which parts are the actual lines of code to accomplish X. Some authors are great at explaining stuff. I’m saying there are many others who don’t or can’t.

So when I wrote my Open XML spreadsheet programming guide, I made sure each chapter was modular. If not, I had sufficient comments and explanations so the reader knows which parts are the important parts. Each chapter was modelled after a major feature/function in the Excel software. How to style text, how to insert images, how to add more worksheets, that kind of thing. The Excel user mixes and matches those functions, so I want the programmer using the guide to be able to do a similar thing.

I got an email from a programmer who bought my guide that he liked to pick apart code to figure it out. I wrote a few custom functions but only because it made the code more readable. The full source code is given, so the reader is free to pick apart those functions and write his own (to better suit his needs).

I believe this is attributed to Albert Einstein:

Make things as simple as possible, but not simpler.

Be careful of encapsulating too much into just one function call.

Remember the stride

I was reading this article by Raymond Chen where he mentioned this:

Most of the time, your code won’t care (there are just pixels out there that you aren’t using), but if you use the Get­Buffered­Paint­Bits function to obtain direct access to the bits, don’t forget to take the stride into account.

The word “stride” evoked a ton of memories where I used to track (laboriously) the dimensions of a bitmap. This was back in the days when I was fiddling with writing my own computer games and computer graphics. Specifically, I was working with the screen graphics buffer.

Double buffers, 8-bit colours and screen resolutions

What happens (happened?) is that for fast visuals, we need a decent frame rate, and 30 frames per second is fast enough to fool human eyes that we’re looking at smooth uninterrupted motion. Back in those days, computer monitors weren’t that advanced, so the double buffer trick was used. (Is it still used? I don’t know…).

The double buffer trick refers to having 2 sets of screen buffers. The first one is used to “blit” the contents onto the screen. I’m using blit as a verb too, but this will take too long to explain if I told the story of blitting too… While this blitting is being done, the pixels of the next frame is drawn on the second buffer. When the drawing of pixels is completed, this second buffer is used to blit to the screen.

When you force the program to use double buffering, the next frame is held until the buffer has the drawing completed. This is what “lag” means. The frame rate is sort of ignored, because only when the next frame is completely drawn, will the buffer contents be displayed. Usually this isn’t a problem because the pixel drawing for one frame is within 0.03333 seconds (based on 30 FPS). As you go for higher resolutions and more complex objects being drawn and more complex calculations being done (such as calculating hit points and bullet trajectories), this next-frame-drawing gets slowed down.

In code, what you have are 2 bitmaps in memory and 1 pointer (yay, pointers!). You actually point the pointer to either bitmap based on which bitmap’s contents are to be blit.

  • First bitmap on screen, second bitmap drawing next frame, point to first bitmap.
  • Second bitmap done drawing.
  • Point to second bitmap.
  • Draw next frame on first bitmap.
  • Continue till application (most probably a game) is done.

“So what’s the stride got to do with this? And what’s a stride?”

Well, the thing is, when you request a bitmap from within your code, you might not get the exact dimensions you want. What you get is, as Raymond mentioned:

it only promises that the bitmap will be at least the size you requested.

So the bitmap given to you can be larger in size. And the larger size is based on the stride. I’ve not done a whole lot of research, and the following explanation is based on what I remember from those game-developing days. Let’s say the stride is 4 bytes. This means the memory size of the bitmap given to you will be in multiples of 4. … Uh, yeah, I think that’s about it.

If you ask for a bitmap with dimensions such that the projected final size memory is not in multiples of 4, you will be given a bitmap such that it is.

This “problem” is compounded by the fact that you also have to take care of the red, green and blue bytes (3 contiguous bytes for 1 pixel). Sometimes, there’s an alpha component. There’s also the 8-bit colour, where the first 2 bits are for red, next 3 bits are for green, and the final 2 bits are for blue. Not 1 byte for each colour component. (Just FYI, green has more bits because we can differentiate more shades of green than red or blue).

This is why you might find that in graphics programming, you are advised to have dimensions in multiples of 2. For example, your texture images used for mapping onto 2D planes and 3D objects should preferably have dimensions in powers of 2.

Wait, why powers of 2, and not just multiples of 2? I believe it has something to do with the texture mapping functions, because those functions don’t work well when the bitmap/image/texture argument doesn’t have dimensions with powers of 2. This is why I prefer to use square images at 128, 256 or 512 pixels. Mipmaps were used to alleviate this, but that’s another topic…

And the final complication? “What? There’s more?!?” Yes.

The bitmap you requested in code, the one where you might have to take note of the stride? That bitmap might have a different dimension than the screen dimension of the computer monitor. Computer monitors weren’t quite “standard” back then (I’m starting to feel old…). The computer monitor also has its own stride (I’m basing this on the memories of my research. Don’t just take my word for it). This means blitting pixels from a bitmap buffer to the screen isn’t quite so straightforward.

For example, if you’re working on a screen resolution of 800 by 600 pixels (yes, I know that’s like ancient. Work with me here…), and then you ask for a bitmap to represent that. Well, you might get a bitmap with dimensions 1024 by 768 pixels. Maybe it’s because it’s more memory efficient that way. 1024 = 2^10, and 768 = 2^9 + 2^8. Wait, now we have sums of powers of 2?!? I don’t know a whole lot about these things… I’m just trying to make a computer game…

So based on the above example, you have an “extra” width of 224 pixels (1024 – 800) and “extra” height of 168 pixels (768 – 600). So even if the stride is taken note of, the computer might just decide to throw you more memory space. Just for the heck of it.

In summary…

The bitmap you request in code might have a different dimension than what you wanted. The computer monitor might have a different dimension to work with. You have to remember each pixel has a red, green, blue and even alpha component (each of which uses a byte of storage. Or not, in the case of 8-bit colours). Then you have to take note of the dimensions of the textures you’re using to map onto 2D/3D objects.

And none of that has very much to do with the “fun” part of game programming. You’re just trying to work with one image.

I hope you learnt something. Because I sure as heck don’t know what’s the point I was trying to make, other than bombard you with seemingly random pieces of information…

Spreadsheet Open XML V2

I was rushing to get this out. The updated version of my programming guide is out! I first launched it on 17 Jan this year, so I was hoping to meet the personal-and-unseen deadline of 17 May, so it’s a nice 4 months interval. Ah well, I’m the only one who cares anyway…

I’ve added loads more content to help you with your Open XML spreadsheet needs. Version 1 was 53 pages. Version 2 is 147 pages. There are a lot of pages with screenshots, but still… 147 pages!

You can find out more here.

The last few weeks had been interesting while I rushed to get working source code and write explanations for the guide… I need to sleep… wait, I’ve got a magazine deadline! *sigh*

Programming guide update

So. Quick update.

I’ve finished writing a bunch of code for my Open XML guide to create spreadsheets. They include:

  • Inserting comments
  • Using formulas
  • Setting header/footer options
  • Setting page setup options
  • Creating multiple worksheets
  • Freezing/Splitting panes
  • Using and styling tables (but not pivot tables)
  • How to write the cells if you have multiple sets of tabular data

The last one was a problem that a customer faced. Let’s say you have 2 sets of tabular data, and they’re side by side in the worksheet. Using Open XML SDK, you have to run through the 1st row of the 1st set of data, then the 1st row of the 2nd set of data. You can’t completely iterate over the 1st set of data first. This is because with Open XML SDK, you have to completely define the first row to be inserted into the worksheet. That will include the 1st row of the 2nd set of data.

So I’m (feverishly) typing the explanation part of the guide (the code is all done and tested). I’m telling you now because when I release this update, I’m going to increase the price. I calculated that with the amount of work done on writing the code and the explanation, and that I currently offer a 1-year free technical support together with the guide, I’m severely undercharging. Besides, the number of hours saved studying the guide versus learning from scratch is worth the price increase.

Since I currently give customers a free update within one year of purchase, you might want to get the guide now before the price goes up. Then you’ll get the current guide and do awesome stuff with it right now, and when I release the updated guide, you get it at no extra cost.

Depending on how fast I type (and how many cups of tea I imbibe), I might release the update in a week’s time or two. I don’t really have a deadline. But the latest release date will be around mid-May (because I’ll have to work on my magazine by then. But I digress…). But it might be sooner, because I want to get this off my todo list so I can work on something else.

Built-in styles for Excel Open XML

So a blog reader, Sebastien, once asked about built-in style numbers used in Open XML for Excel. I’m sorry to break the news to you. It’s not straightforward to use those built-in styles.

What are these built-in styles? Check these out:

Excel built-in styles

For instance, Sebastien was asking how to use the “Good” and “Bad” built-in styles. So the good news is, you can use those named styles. The bad news is, the various style effects (font colour, background colour, borders) aren’t automatically added in for you. From the Microsoft documentation of CellStyle:

This element represents the name and related formatting records for a named cell style in this workbook.

Annex H contains a listing of cellStyles whose corresponding formatting records are implied rather than explicitly saved in the file. In this case, a builtinId attribute is written on the cellStyle record, but no corresponding formatting records are written.

For all built-in cell styles, the builtinId determines the style, not the name. For all cell styles, Normal is applied by default.

Read the second paragraph again. It means even if you use the built-in style “Good”, you still need to create a style with the green background and green text font colour.

Since this is the case, I suggest you just stick to creating your own custom styles. It’s what’s going to happen anyway.

Now there is a magic number, 164. Your custom styles will start being numbered from 164 onwards. This implies there are 164 built-in styles (0-indexed). My search efforts only gave me a partial list.

Please refer to the ECMA-376 documentation for Office Open XML formats. Download the 1st edition, part 4. Inside, you’ll find a PDF. Turn to page 2135 (as of this writing. It’s page 2128 on the PDF itself though). You’ll find this:

Excel Open XML built-in style list

That list is probably that of “standard” styles. From reading the next few pages of the documentation, I believe most of those 164 built-in styles are due to internationalisation issues.

So my conclusion is, for the “standard” built-in styles you’re interested in, you still have to create the supporting style effects (font colour, background colour and the like). For the “non-standard” built-in styles, they are probably different based on the localisation of your Excel file. So you’re better off just creating your own custom styles, which I teach you how here and here. Or you can get my programming guide for detailed explanations and source code.

Tim Coulter and ExtremeML OpenXML library

If you’re creating Excel spreadsheets using C# and Open XML SDK, consider Tim’s excellent library ExtremeML. It can dramatically cut the number of lines of code you write. Then you can pretend you’re still feverishly working on that code while playing WoW. Tim also said this about my Open XML programming guide:

I know from personal experience how easy it is to become overwhelmed by the complexity of OpenXML, but your guide takes away the pain by presenting simple, modular solutions to many of the common challenges that developers face when creating Excel spreadsheets from code. I highly recommend this to any C# or VB.NET developer who’s getting started with OpenXML, and especially to those who have already hit the frustration barrier.

That’s nice.

Disclosure: I gave Tim a complimentary copy of my guide. He commented on my articles about Open XML, and I found him asking questions on StackOverflow. That’s where I learned the Double Underscore hack (more info in my guide).

Want to get comped copies of my products? Comment on my articles. Subscribe to my blog. Read my magazine.