Considerations for storing Excel cell value in code

You may hate Excel, but you may find a discussion of how Excel stores cell values interesting.

So I have a spreadsheet library. The biggest concern at the initial stage was how to store all the spreadsheet data efficiently. I hear people talking about millions of cells, so I’m scared. If my program stores a spreadsheet cell using 10 bytes (for example), a million cells would take up 10 million bytes in memory.

Let’s start by looking at all the different types of information you can type into a spreadsheet cell. You have:

  • booleans: TRUE or FALSE
  • numbers
  • text
  • rich text (different styled text within the entire text itself)
  • dates and times

For us programmers, “numbers” can be separated into floating point or integer types. An Excel user won’t see a difference.

So how does Excel actually store those values? I’m going to focus only on Open XML because I’m not interested in BIFF files…

  • booleans: TRUE stored as text “1” and FALSE stored as text “0”
  • numbers: stored as text
  • text: duh
  • rich text: stored in a separate shared strings list, with the index to that list stored as text here.
  • dates and times: stored as number that’s in text form

You will see everything is basically stored as text. That’s because the underlying XML files are text files. There’s a property (XML attribute) that differentiates the data, such as boolean, number, string, inline string, shared string.

So why are dates stored as a number? It’s easier to do date calculations with 41449 than “24 June 2013”. So how is this number obtained? See here.

So if you’ve been looking closely enough, Excel’s optimisation tactic is to store everything as numeric text as far as possible. So I want to follow that.

Before doing so however, I went to read what other people are doing AKA open source spreadsheet libraries. In code, they use an object to store the cell value. As in System.Object, the mother of all data types in .NET.

So you have an integer? Dump it into the object variable. Floating point? Dump into object. String of characters? Dump.

How do you read it out? Boxing and unboxing. You remember it’s a floating point value and cast it back from an object to a double variable type.

So what did I do? I have a double variable and a string variable, and I store the cell value in one or the other based on the input.

The “all in object” way has variable (no pun intended) memory size, based on the contents. Sort of. I’m not an expert in this.

My way has a fixed memory size for double’s. Each double takes up 8 bytes (for sure?). A string variable takes up variable size, but because the optimisation tactic is to store data as a number, I can assign the data to the double variable and set the string variable to null. This means the string variable size is sort of fixed too.

So this is what I do. If it’s a number, I store it in the double variable and set the string variable to null. If it’s text, I convert it to a number by using shared strings (out of scope for discussion here) and store the index into the double variable and set the string variable to null. The only cases where the string variable is actually used is if I store the text there, or if I want to store the actual number there (because “1.23456789” may not be stored exactly as that in a double variable. Go read on how floating points are implemented for details), which are rare.

According to Jon Skeet, strings take up 20 + (n/2)*4 bytes (where n is the number of characters). But a null string takes up 8 bytes (it’s either 4 or 8 bytes. I’ll assume the worse scenario).

This means for the most part, each cell has a double variable that takes up 8 bytes and a null string that takes up 8 bytes. A cell value of 10 or 3.14 or 12345678.9 takes up 16 bytes regardless.

Since 16 bytes is less than 20 + (n/2)*4 bytes, I save more memory in most cases. I also have less boxing and unboxing operations, which make things go faster.

Modulo 26 and column names

I was sitting in the lecture theatre valiantly trying to keep awake. The professor was speaking on the rigorous application and proving of the modulus function. It’s basically the remainder, but I’ve never been introduced to it in such, uh, rigor.

He brought up an example using modulo 26. And demonstrated the wrapping around of values. And the use of it in cryptology (a class I took later on, and I got tasked by the cryptography professor to write a program to do simple encryption. But another story perhaps…).

Modulo 26 is similar to finding the remainder. The difference is that the remainder is unique. This is important to our discussion.

“About what?” you may ask.

Excel column names.

There are 2 types of cell references used in spreadsheets, the R1C1 format and the A1 format. The R1C1 is simple. If the row index is 5 and the column index is 7, the result is R5C7.

The A1 format takes on a column name and the row index. Using our example again, the result is G5, because “G” is the 7th alphabet. Yes, that list of 26 alphabets.

The version of Excel currently (Excel 2007 and 2010) has up to 3 letters, with XFD as the last column name (that’s the 16384th column). What happens is you have column names A, B, and then up to Z. Then the next column name is AA, then AB and then up AZ. Then BA, BB and so on.

Basically, it’s base 26 arithmetic.

As far as I know, the typical method of getting the column name given the column index, is to run a loop. You add 1 until it hits 26, then you move to the next “position”, and then start from 1 again.

There’s nothing wrong with this method. It’s just that you have to iterate as many times as the given column index. If you’re given 16384, the loop runs 16384 times. This is regardless of the fact that the result is always the same. Given the range of values, the result can only be one of 16384 values.

So it was with this in mind, and my dabbling in game development (which said “Precalculate everything!”), that I precalculated an array of strings that are the column names. The array has 16384 items. The context is a spreadsheet software library.

Now to recoup that precalculation cost, I’d have to access my array at least 16384 times. This is where the context of the spreadsheet library comes in. Everybody (and their dog) wants to know if my library can handle millions of cells. This means if the column name calculation is in a method, that method is called millions of times. Given the iteration loop in it, that means the method with the iteration loop thing isn’t efficient (it’s O(n^2)).

However, due to technical issues, I can’t keep the array of strings. The column name array needs to be static to be available throughout the library. This causes issues if multi-threads or multi-processors or multi-whatevers comes in.

So I can’t use my static array anymore. Bummer. The O(n) of simple array access was working so well.

But I still want to have an efficient way of getting column names. So instead of iterating, I used simple division and modulus operations.

Consider 587. How do you know there’s 5 hundred? 587 / 100 equals 5 (truncating remainders). How do you know there’s 8 tens? 87 / 10 equals 8 (truncating remainders).

Yes, it’s elementary arithmetic, but it works great. So we can do the same for column names. There’s a problem though.

In the case above, we divided by 100. Why 100? Because it’s 10^2, and we’re concerned with the 2nd position after the ones position. The ones position is 10^0 by the way, which is 1.

So for our case, for the “hundreds” position, we divide by 676, which is 26^2. And for the “tens” position, we divide by 26.

Now 587 is 100*5 + 10*8 + 7. I’m going to use the notation (5,8,7) to denote this.

Now consider a column index of 3380. 3380 is equal to 676*5 + 26*0 + 0. This is (5,0,0).

However, in our case, our acceptable range of values is 1 through 26. It doesn’t contain zero. So (5,0,0) is not valid.

In the case of 587, we’re working in base 10, with the acceptable range of values being 0 to 9. This is “proper” remainder. Given any number in base 10, there’s a unique number within [0, 9] that’s the remainder.

However, for our purposes, there’s no unique number. Because we’re working in modulo 26, not just “remainder 26”.

The correct column name corresponding to column index 3380 is “DYZ”. This corresponds to (4,25,26). Or
3380 = 676*4 +26*25 + 26.

Note that 3380 is also 676*5 + 26*0 + 0.

My solution is to start from the “ones” position. If it’s greater than zero, fine. If it’s less than or equal to zero, borrow from the next larger position. Then we move to the next larger position, and check again. Continue to borrow until there are no zero values (or negatives) on the “right” side of the resulting notation (we can have “leading” zeroes).

So (5,0,0) becomes (5, -1, 0 + 26), or just (5,-1,26), borrowing 1 from the “tens” position. We cannot have -1, so that becomes (4, -1 + 26, 26), which becomes (4, 25, 26).

An interesting effect is that we typically assign 0 to A, 1 to B, and 25 to Z. In this case, 1 is assigned to A, 2 is to B, and most interestingly, both 0 and 26 map to Z. In fact, any multiple of 26 will map to Z.

Don’t think Z is special. Any multiple of 26 plus 1 also maps to A. So 1, 27, 53 and so on map to A. This is a property of the modulo thing.

Do you have a better way of converting (5,0,0) to (4,25,26)? Let me know in the comments.

Calculate Excel column width pixel interval

Brace yourself. You’re about to learn the secret behind how Excel mysteriously calculates the column width intervals.

In this article, I’m not going into the details of the column widths, but the column width intervals. There’s a difference. From the Open XML SDK specs:

width = Truncate([{Number of Characters} * {Maximum Digit Width} + {5 pixel padding}] / {Maximum Digit Width} * 256) / 256

To put it mildly, that’s a load of hogwash. In the documentation, it says that for Calibri 11 point at 96 DPI, the maximum digit width is 7 pixels. That is also another load of hogwash. It’s actually 8 pixels (well, 7 point something…).

When you move the line on the column width in Excel, just 1 pixel to the left, what is the column width? When you move it 1 pixel to the right, what’s the column width?

It turns out the each pixel interval isn’t a simple multiple of an internal column width interval.

Let’s take Calibri 11 pt 96 DPI again. With a maximum digit width of 8 pixels, each column width interval per pixel is supposedly 1/7 or 1/(max digit width -1).

But wait! It’s not actually 1/7. It’s the largest number of 1/256 multiples that is less than 1/7.

Now 1/7 is about 0.142857142857143. The actual interval is 0.140625, which is 36/256.

4/7 is about 0.571428571428571. The actual interval is 0.5703125, which is 146/256. And you will note that 146 is not equal to (4 * 36).

If you’re using Open XML SDK (or however you choose to access an Open XML Excel file), when you set the column width as 8.142857142857143, internally, Excel will save it as 8.140625.

Here’s some code:

int iPixelWidth = 8;
double fIntervalCheck;
double fInterval;
for (int step = 0; step < iPixelWidth; ++step)
{
    fIntervalCheck = (double)step / (double)(iPixelWidth - 1);
    fInterval = Math.Truncate(256.0 * fIntervalCheck) / 256.0;
    Console.WriteLine("{0:f15} {1:f15}", fIntervalCheck, fInterval);
}

So now you know how the intervals are calculated. But what about the actual column width? Hmm... perhaps another article...

P.S. I'm currently doing research for how to do autofitting for rows and columns for my spreadsheet library. I found this "secret" after fiddling with Excel files for a couple of hours... I know I'm talking about my library a lot, but it's taking up a lot of my brain space right now, so yeah...

How to properly use OpenXmlWriter to write large Excel files

Brian Jones already wrote something on using the OpenXmlWriter with the Open XML SDK to write large Excel files. There are a couple of things that weren’t mentioned though.

OpenXmlWriter works best for creating new things

For the purposes of explanation, we’ll assume the “new thing” is a new worksheet, although OpenXmlWriter can be used to write any Open XML SDK part (such as WorkbookPart).

The reason for this is that OpenXmlWriter is essentially writing out XML tags. The SDK offers classes and properties that we’re familiar with, but underneath it all, an Open XML spreadsheet’s internal structure are XML files zipped together. So instead of using the SDK classes to form the “XML” parts, we use the OpenXmlWriter to directly write out the XML.

The problem comes when we need to save. If you’re working with an existing worksheet, there’s already an existing XML structure in place. This means there’s a corresponding WorksheetPart (to your Worksheet class), and this WorksheetPart already has the XML structure (and more importantly, a DOM structure).

So you can’t just use the OpenXmlWriter to write the XML stuff into an existing WorksheetPart, because any changes you made will be ignored.

This is why in Brian’s code, he reads in the existing worksheet and writes the content into a new worksheet, thus using a new WorksheetPart that’s “attached” to the WorkbookPart. After he’s done reading, he deletes the old WorksheetPart.

In conclusion, either you

  • Work with the DOM structure using the SDK, which means snail-pace if you’re writing many cells
  • Work with new WorksheetPart

If you have something like worksheetPart.Worksheet.Save() mixed with the use of OpenXmlWriter, it’s not going to work.

XML attributes

The more subtle (and possibly more crucial) information that’s missing is of attributes. When you use the OpenXmlWriter, it works best if it’s used all the way. As in you don’t mix SDK class/property usage code with OpenXmlWriter code.

Basically, if there’s an SDK class property you want to set, don’t assign to that property in code. Use the OpenXmlWriter to write that property, because that property is eventually an XML attribute anyway. In this case, think of the SDK as getting in the way of making your code run faster.

In Brian’s code, notice that the RowIndex property of the Row class is not assigned. Which is fine since it’s optional. In this case, Excel tries to guess which row it is, and places the first Row class content into the first row on the spreadsheet. This is good if that’s what you want. And bad if your want your “first” Row class to be on the 3rd row instead.

You might also notice that the Cell class has no properties assigned. And no, CellFormula and CellValue are not properties of the Cell class, they’re actually child classes. The SDK allows you to access them as though they’re properties. This means the “standard” set of cell properties are not assigned:

  • The cell reference, such as “A1”
  • The cell data type, such as string or number
  • The cell style index

The code works because the cell content is a random number. And the default data type of a cell is a number (so it doesn’t need to be explicitly assigned). It will fail if you need strings or other data types.

And because there’s no cell reference given, Excel tries to guess where each cell is. Which is to say, place all the cells in that row flush to the left. This means you can’t have a cell in A1, and a blank cell in B1, and then a cell in C1, because Excel mooshes all the Cell classes together.

And because there’s no style index given, the default style is used. This is typically the minor font (such as Calibri) with no special formatting (such as font colour or bold or italics or borders).

So here’s sample code on how to use OpenXmlWriter. Note the class OpenXmlAttribute. And if you understand this, you will know that the OpenXmlWriter isn’t only for writing large Excel files. It just happens that it directly writes XML tags and attributes, and thus works the fastest for writing a lot of data.

using (SpreadsheetDocument xl = SpreadsheetDocument.Create("LargeFile.xlsx", SpreadsheetDocumentType.Workbook))
{
    List<OpenXmlAttribute> oxa;
    OpenXmlWriter oxw;

    xl.AddWorkbookPart();
    WorksheetPart wsp = xl.WorkbookPart.AddNewPart<WorksheetPart>();

    oxw = OpenXmlWriter.Create(wsp);
    oxw.WriteStartElement(new Worksheet());
    oxw.WriteStartElement(new SheetData());

    for (int i = 1; i <= 50000; ++i)
    {
        oxa = new List<OpenXmlAttribute>();
        // this is the row index
        oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));

        oxw.WriteStartElement(new Row(), oxa);

        for (int j = 1; j <= 100; ++j)
        {
            oxa = new List<OpenXmlAttribute>();
            // this is the data type ("t"), with CellValues.String ("str")
            oxa.Add(new OpenXmlAttribute("t", null, "str"));

            // it's suggested you also have the cell reference, but
            // you'll have to calculate the correct cell reference yourself.
            // Here's an example:
            //oxa.Add(new OpenXmlAttribute("r", null, "A1"));

            oxw.WriteStartElement(new Cell(), oxa);

            oxw.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

            // this is for Cell
            oxw.WriteEndElement();
        }

        // this is for Row
        oxw.WriteEndElement();
    }

    // this is for SheetData
    oxw.WriteEndElement();
    // this is for Worksheet
    oxw.WriteEndElement();
    oxw.Close();

    oxw = OpenXmlWriter.Create(xl.WorkbookPart);
    oxw.WriteStartElement(new Workbook());
    oxw.WriteStartElement(new Sheets());

    // you can use object initialisers like this only when the properties
    // are actual properties. SDK classes sometimes have property-like properties
    // but are actually classes. For example, the Cell class has the CellValue
    // "property" but is actually a child class internally.
    // If the properties correspond to actual XML attributes, then you're fine.
    oxw.WriteElement(new Sheet()
    {
        Name = "Sheet1",
        SheetId = 1,
        Id = xl.WorkbookPart.GetIdOfPart(wsp)
    });

    // this is for Sheets
    oxw.WriteEndElement();
    // this is for Workbook
    oxw.WriteEndElement();
    oxw.Close();

    xl.Close();
}

This creates a spreadsheet with 100 columns and 50000 rows of data. On my machine, it ran in about 18 seconds.

I’m not going to list all the available XML attributes, but you can find out the correct attribute name by exploring the Open XML SDK documentation, or the official ECMA documentation.

Collision detection in merging Excel cells

You can merge cells in Excel. “Duh!” you say. So how does Excel ensure that the cells you want to merge don’t overlap any existing merged cells? How do you check programmatically that the cells you want to merge won’t overlap any existing merged cells?

Well, that was the problem I faced while writing my spreadsheet library software (check out SpreadsheetLight here!). The inkling of an idea was formed when I visited StackOverflow. Can’t remember what I was there for, but the answer page wasn’t about Excel or spreadsheets. It was about collision detection.

Yes, the type of collision detection used in games so you don’t walk through walls and stuff.

And thus I learnt a new mathematical theorem: the separating axis theorem (or hyperplane separation theorem). From what I understand, given two N-dimensional thingies, if you can find one (N-1)-dimensional hyperplane that separates those two thingies, then those two thingies are separate (or not overlapping).

Yes, “thingies” is a technical term.

For example, given two 3D objects, if you can find one 2D plane that separates them, then the 3D objects are not overlapping. Another example is when you’re given two 2D objects, and if you can find one line (which is 1D) that separates them, then the 2D objects are not overlapping.

There are some conditions to be fulfilled, such as the objects being convex and disjoint and whatever. (yes I’m a mathematician…) I’ll leave it to you to do your own reading.

But for our purposes, how do we check if that rectangle of Excel cells we want to merge won’t overlap with any existing merged cells? We have these conditions that make our lives easier:

  • The merged cells are rectangular
  • The 4 end points (top-left, top-right, bottom-left, bottom-right) of a merged cell are in whole numbers
  • The (merged) cells map strictly to a 2D rectangular grid

Since the merged cells are rectangular, they’re also convex (I’m not going to explain why, just trust me). Since the 4 end points are in whole numbers, line boundaries can be easily calculated (it’s easy to check if something is <7 or >=8 for example). And since they map strictly to a 2D rectangular grid, the separating axis is a line. And even better, you can always find a subset of solutions of those separating axes that are either horizontal or vertical.

Ok, diagrams are in order.

Collision detection in merged cells

So in our case, as long as you can find one horizontal line or one vertical line that separates the 2 merged cells we’re checking, then the 2 merged cells are separate.

Let’s try an example. Excel uses cell references that are in the A1 format, which means row 3 column 4 becomes “D3”. The column index uses alphabets and is in base-26. Read more here.

We’re going to simplify that. Let’s say our existing set of merged cells only has one entry. This merged cell has the top-left corner at row 1 column 1, and the bottom-right corner at row 3 column 4. Suppose we want to check if this merge cell overlaps: top-left corner at row 7 column 8, bottom-right corner at row 10 column 10.

The horizontal line of 5 fits the bill, or the line y=5 if you want to be mathematical about it (but y goes from negative to positive downwards instead of the usual Cartesian y). Or y=6. Or even y=7 (note that the line can “touch” one of the merged cells, but not both. This is where the “whole number” condition comes in).

The vertical lines x=5, x=6 or x=8 also fit the bill.

Thus, our 2 merged cells don’t overlap.

So what’s the programmatic way to check? You’d be surprised at the simplicity. To make it easier, the variables with “StartRowIndex” mean the top row of the merged cell, “EndRowIndex” mean the bottom row of the merged cell. And “StartColumnIndex” mean the left-most column of the merged cell, and “EndColumnIndex” mean the right-most column of the merged cell.

if (!(iEndRowIndex < mc.StartRowIndex || iStartRowIndex > mc.EndRowIndex || iEndColumnIndex < mc.StartColumnIndex || iStartColumnIndex > mc.EndColumnIndex))
{
    // there's an overlap!
}

So the merge cell we want to check has top-left corner at (iStartRowIndex, iStartColumnIndex) and bottom-right corner at (iEndRowIndex, iEndColumnIndex).

The variable “mc” refers to a custom class I use to represent merged cells. Obviously, you’d run that condition in a loop through all your existing merged cells. If you can’t find an overlap after running through that loop, then the merge cell you’re checking is good to go.

Let’s run through the individual conditions.

  • (iEndRowIndex < mc.StartRowIndex) means our merged cell is completely above the existing merged cell
  • (iStartRowIndex > mc.EndRowIndex) means our merged cell is completely below the existing merged cell
  • (iEndColumnIndex < mc.StartColumnIndex) means our merged cell is completely to the left of the existing merged cell
  • (iStartColumnIndex > mc.EndColumnIndex) means our merged cell is completely to the right of the existing merged cell

The first 2 conditions check for the existence of a horizontal separating axis. The next 2 conditions check for the existence of a vertical separating axis.

Note the negation of the entire boolean condition. Those 4 conditions check for existence of solutions. Negation means checking for overlaps.

Note also that we use strictly greater than or strictly less than checks. If the merge cells share a row or column, then they overlap, right?

Did you know my first draft of writing the code had me checking 6 different conditions? Each condition had its own if statements. This one just had one. I’m so lucky to have learnt the separating axis theorem. I was checking if a point was above, within or below the merge cell, then I had to check if the point was to the left, within or to the right of the merge cell, and … urgh, it was horrible…

Named cell styles are still explicitly declared

Styling cells in Microsoft Excel has its difficulties (as I’ve written before). The biggest one is keeping track of all the indices. In Open XML SDK, you have the ability to have a named cell style.

At first glance, you might think that’s awesome. You just use a named cell style, and all the related styles are applied. It’s like there’s a red car that uses hybrid fuels. “Yes, I would like to have a red car that uses hybrid fuels.” Not quite.

You see, the named cell style is dependent on the implementing spreadsheet software. For example, Microsoft Excel has the “Normal”, “Bad”, “Good” and “Neutral” named cell styles. But Google Spreadsheets and OpenOffice.org Calc do not have to have those named cell styles, or even style it the same as Microsoft Excel. This is where Open XML SDK isn’t quite “open”… After much research and work, I discovered the SDK is basically Open-XML-ising Microsoft Excel (and Word and PowerPoint). I’m neutral on the stands of open source and “forcing standards”. I just use whatever there is, and make something within the limitations.

Because of the dependency, the underlying individual styles need to be declared explicitly. Actually more so because of the dependency.

So for my spreadsheet software library SpreadsheetLight, I used Excel as the guideline.

In researching Excel named cell styles, I had to look at the underlying XML files (because Open XML spreadsheets are made of XML files). While the Open XML SDK comes with a document explorer (the Productivity Tool), I needed to make notes and also that I felt the need to see the XML file itself, rather than using the explorer tool.

This gave me a problem because while XML files are supposed to be human-readable, it doesn’t make it easy to read. The “natural” XML file has no indents. Oh my Godiva chocolate, it’s so hard to read… Then I remembered I had an XML tool, XML Studio. I fired that up and a few clicks later, the XML file had nice indents and I could find out where the individual style tags were. XML Studio was amazing to use.

Disclaimer: I was given a free developer license of XML Studio by Liquid Technologies. But the software is really useful if you work with XML files a lot.

After doing my notes for a while, I discovered even that’s not enough. There were too many individual styles! I needed the indices for those styles, because only the index was referenced in the final style (CellFormat classes). I didn’t really feel up to annotating the indices… until I remembered my partially completed Open XML spreadsheet decompiler tool. When I created that tool, one of my aims was to put in comments on the index of the individual styles.

Note to Liquid Technologies: You might want to consider putting in XML comments on the index of an XML child tag with respect to its parent. But I don’t know if that’s useful to programming spheres other than Open XML…

Anyway, my hard work paid off, and SpreadsheetLight allows you to apply named cell styles. Here’s how the spreadsheet looks like:
Applying named cell styles

Note that some of the named cell styles use accent colours. The accent colours are part of the spreadsheet’s theme. So in offering named cell styles as a feature, I also had to allow you to create your own theme. And here’s the code using SpreadsheetLight:

System.Drawing.Color[] clrs = new System.Drawing.Color[12];
clrs[0] = System.Drawing.Color.White;
clrs[1] = System.Drawing.Color.Black;
clrs[2] = System.Drawing.Color.WhiteSmoke;
clrs[3] = System.Drawing.Color.DarkSlateGray;
clrs[4] = System.Drawing.Color.DarkRed;
clrs[5] = System.Drawing.Color.OrangeRed;
clrs[6] = System.Drawing.Color.DarkGoldenrod;
clrs[7] = System.Drawing.Color.DarkOliveGreen;
clrs[8] = System.Drawing.Color.Navy;
clrs[9] = System.Drawing.Color.Indigo;
clrs[10] = System.Drawing.Color.SkyBlue;
clrs[11] = System.Drawing.Color.MediumPurple;

SLDocument sl = new SLDocument("ColourWheel", "Castellar", "Harrington", clrs);

sl.SetRowHeight(6, 24);
sl.SetColumnWidth(1, 1);
sl.SetColumnWidth(2, 13);
sl.SetColumnWidth(3, 13);
sl.SetColumnWidth(4, 13);
sl.SetColumnWidth(5, 13);
sl.SetColumnWidth(6, 13);
sl.SetColumnWidth(7, 13);

sl.SetCellValue(2, 2, "Normal");
sl.ApplyNamedCellStyle(2, 2, SLNamedCellStyleValues.Normal);
sl.SetCellValue(2, 3, "Bad");
sl.ApplyNamedCellStyle(2, 3, SLNamedCellStyleValues.Bad);
sl.SetCellValue(2, 4, "Good");
sl.ApplyNamedCellStyle(2, 4, SLNamedCellStyleValues.Good);
sl.SetCellValue(2, 5, "Neutral");
sl.ApplyNamedCellStyle(2, 5, SLNamedCellStyleValues.Neutral);

sl.SetCellValue(3, 2, "Calculation");
sl.ApplyNamedCellStyle(3, 2, SLNamedCellStyleValues.Calculation);
sl.SetCellValue(3, 3, "Check Cell");
sl.ApplyNamedCellStyle(3, 3, SLNamedCellStyleValues.CheckCell);
sl.SetCellValue(3, 4, "Explanatory Text");
sl.ApplyNamedCellStyle(3, 4, SLNamedCellStyleValues.ExplanatoryText);
sl.SetCellValue(3, 5, "Input");
sl.ApplyNamedCellStyle(3, 5, SLNamedCellStyleValues.Input);

sl.SetCellValue(4, 2, "Linked Cell");
sl.ApplyNamedCellStyle(4, 2, SLNamedCellStyleValues.LinkedCell);
sl.SetCellValue(4, 3, "Note");
sl.ApplyNamedCellStyle(4, 3, SLNamedCellStyleValues.Note);
sl.SetCellValue(4, 4, "Output");
sl.ApplyNamedCellStyle(4, 4, SLNamedCellStyleValues.Output);
sl.SetCellValue(4, 5, "Warning Text");
sl.ApplyNamedCellStyle(4, 5, SLNamedCellStyleValues.WarningText);

sl.SetCellValue(6, 2, "Heading 1");
sl.ApplyNamedCellStyle(6, 2, SLNamedCellStyleValues.Heading1);
sl.SetCellValue(6, 3, "Heading 2");
sl.ApplyNamedCellStyle(6, 3, SLNamedCellStyleValues.Heading2);
sl.SetCellValue(6, 4, "Heading 3");
sl.ApplyNamedCellStyle(6, 4, SLNamedCellStyleValues.Heading3);
sl.SetCellValue(6, 5, "Heading 4");
sl.ApplyNamedCellStyle(6, 5, SLNamedCellStyleValues.Heading4);
sl.SetCellValue(6, 6, "Title");
sl.ApplyNamedCellStyle(6, 6, SLNamedCellStyleValues.Title);
sl.SetCellValue(6, 7, "Total");
sl.ApplyNamedCellStyle(6, 7, SLNamedCellStyleValues.Total);

sl.SetCellValue(8, 2, "Accent1");
sl.ApplyNamedCellStyle(8, 2, SLNamedCellStyleValues.Accent1);
sl.SetCellValue(8, 3, "Accent2");
sl.ApplyNamedCellStyle(8, 3, SLNamedCellStyleValues.Accent2);
sl.SetCellValue(8, 4, "Accent3");
sl.ApplyNamedCellStyle(8, 4, SLNamedCellStyleValues.Accent3);
sl.SetCellValue(8, 5, "Accent4");
sl.ApplyNamedCellStyle(8, 5, SLNamedCellStyleValues.Accent4);
sl.SetCellValue(8, 6, "Accent5");
sl.ApplyNamedCellStyle(8, 6, SLNamedCellStyleValues.Accent5);
sl.SetCellValue(8, 7, "Accent6");
sl.ApplyNamedCellStyle(8, 7, SLNamedCellStyleValues.Accent6);

sl.SetCellValue(9, 2, "Accent1Perc60");
sl.ApplyNamedCellStyle(9, 2, SLNamedCellStyleValues.Accent1Percentage60);
sl.SetCellValue(9, 3, "Accent2Perc60");
sl.ApplyNamedCellStyle(9, 3, SLNamedCellStyleValues.Accent2Percentage60);
sl.SetCellValue(9, 4, "Accent3Perc60");
sl.ApplyNamedCellStyle(9, 4, SLNamedCellStyleValues.Accent3Percentage60);
sl.SetCellValue(9, 5, "Accent4Perc60");
sl.ApplyNamedCellStyle(9, 5, SLNamedCellStyleValues.Accent4Percentage60);
sl.SetCellValue(9, 6, "Accent5Perc60");
sl.ApplyNamedCellStyle(9, 6, SLNamedCellStyleValues.Accent5Percentage60);
sl.SetCellValue(9, 7, "Accent6Perc60");
sl.ApplyNamedCellStyle(9, 7, SLNamedCellStyleValues.Accent6Percentage60);

sl.SetCellValue(10, 2, "Accent1Perc40");
sl.ApplyNamedCellStyle(10, 2, SLNamedCellStyleValues.Accent1Percentage40);
sl.SetCellValue(10, 3, "Accent2Perc40");
sl.ApplyNamedCellStyle(10, 3, SLNamedCellStyleValues.Accent2Percentage40);
sl.SetCellValue(10, 4, "Accent3Perc40");
sl.ApplyNamedCellStyle(10, 4, SLNamedCellStyleValues.Accent3Percentage40);
sl.SetCellValue(10, 5, "Accent4Perc40");
sl.ApplyNamedCellStyle(10, 5, SLNamedCellStyleValues.Accent4Percentage40);
sl.SetCellValue(10, 6, "Accent5Perc40");
sl.ApplyNamedCellStyle(10, 6, SLNamedCellStyleValues.Accent5Percentage40);
sl.SetCellValue(10, 7, "Accent6Perc40");
sl.ApplyNamedCellStyle(10, 7, SLNamedCellStyleValues.Accent6Percentage40);

sl.SetCellValue(11, 2, "Accent1Perc20");
sl.ApplyNamedCellStyle(11, 2, SLNamedCellStyleValues.Accent1Percentage20);
sl.SetCellValue(11, 3, "Accent2Perc20");
sl.ApplyNamedCellStyle(11, 3, SLNamedCellStyleValues.Accent2Percentage20);
sl.SetCellValue(11, 4, "Accent3Perc20");
sl.ApplyNamedCellStyle(11, 4, SLNamedCellStyleValues.Accent3Percentage20);
sl.SetCellValue(11, 5, "Accent4Perc20");
sl.ApplyNamedCellStyle(11, 5, SLNamedCellStyleValues.Accent4Percentage20);
sl.SetCellValue(11, 6, "Accent5Perc20");
sl.ApplyNamedCellStyle(11, 6, SLNamedCellStyleValues.Accent5Percentage20);
sl.SetCellValue(11, 7, "Accent6Perc20");
sl.ApplyNamedCellStyle(11, 7, SLNamedCellStyleValues.Accent6Percentage20);

sl.SetCellValue(13, 2, 12345678);
sl.ApplyNamedCellStyle(13, 2, SLNamedCellStyleValues.Comma);
sl.SetCellValue(13, 4, 12345678);
sl.ApplyNamedCellStyle(13, 4, SLNamedCellStyleValues.Comma0);
sl.SetCellValue(14, 2, 12345678);
sl.ApplyNamedCellStyle(14, 2, SLNamedCellStyleValues.Currency);
sl.SetCellValue(14, 4, 12345678);
sl.ApplyNamedCellStyle(14, 4, SLNamedCellStyleValues.Currency0);
sl.SetCellValue(15, 2, 123);
sl.ApplyNamedCellStyle(156, 2, SLNamedCellStyleValues.Percentage);

sl.SaveAs("NamedCellStyles.xlsx");

I set the column widths of the 2nd to 7th column, and the row height of the 6th row so it’s easier to see.

The main body text (the minor font) is in Harrington, and the title font (major font) is Castellar. You will note that even though the major font is supposedly used for heading and title texts, only the named cell style Title uses the major font. The headings 1 through 4 use the minor font.

Calculating Excel spreadsheet column names

I’ve been working with Open XML spreadsheets for the past, I don’t know how long… A year? I just realised that getting that Excel column header name is a frequent task. You know, given that it’s the 4th column, it’s “D”. I don’t work frequently with spreadsheets with lots of columns. So it was interesting that the 26th column is “Z” and the 27th column becomes “AA”. Basically, base-26 arithmetic, using the 26 letters of the English alphabet as tokens.

There are probably lots of code snippets out there showing you how to calculate a column name given the column index. Here’s mine:

string[] saExcelColumnHeaderNames = new string[16384];
string[] sa = new string[] { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };
string s = string.Empty;
int i, j, k, l;
i = j = k = -1;
for (l = 0; l < 16384; ++l)
{
    s = string.Empty;
    ++k;
    if (k == 26)
    {
        k = 0;
        ++j;
        if (j == 26)
        {
            j = 0;
            ++i;
        }
    }
    if (i >= 0) s += sa[i];
    if (j >= 0) s += sa[j];
    if (k >= 0) s += sa[k];
    saExcelColumnHeaderNames[l] = s;
}

That gives you a zero-based indexing version. So to get the 30th column name, you use saExcelColumnHeaderNames[29].

In case you’re wondering, 16384 is the maximum number of columns supported by Excel 2010.

You will notice that it’s not a function given the column index. I find that not as useful. Look, typically when you need the column name, you probably also need to get it frequently, usually with different parameters.

What I did was to store all the calculation results into a string array. Then you reference it with an index. The calculation function typically is a O(n) operation. With you needing to use the function multiple times, your whole algorithm probably goes up to O(n^2).

My method is also an O(n) operation. But referencing a string array is I think an O(1), meaning it’s a constant. I’ve never been good with big O notation…

This style of solving the problem is called pre-calculation. Pre-calculation is especially useful in the games development region, where speed is important. For example, selected values of sine and cosine were pre-calculated and stored in arrays, for use in the numerous 3D/2D calculations in games. Calculating sine’s and cosine’s in real-time were detrimental to a speedy game.

That’s not as useful now because you need a fuller range of floating point values as input. But the concept is still useful.

I think I read somewhere (while I was doing hobbyist game development) this quote:

Pre-calculate everything!

Maybe computers are now much faster. I don’t care. That doesn’t give you an excuse to be sloppy. It’s an optimisation that doesn’t take much effort.

If you need to calculate it, see if you can calculate it just once.

Spreadsheet Open XML V2

I was rushing to get this out. The updated version of my programming guide is out! I first launched it on 17 Jan this year, so I was hoping to meet the personal-and-unseen deadline of 17 May, so it’s a nice 4 months interval. Ah well, I’m the only one who cares anyway…

I’ve added loads more content to help you with your Open XML spreadsheet needs. Version 1 was 53 pages. Version 2 is 147 pages. There are a lot of pages with screenshots, but still… 147 pages!

You can find out more here.

The last few weeks had been interesting while I rushed to get working source code and write explanations for the guide… I need to sleep… wait, I’ve got a magazine deadline! *sigh*

Built-in styles for Excel Open XML

So a blog reader, Sebastien, once asked about built-in style numbers used in Open XML for Excel. I’m sorry to break the news to you. It’s not straightforward to use those built-in styles.

What are these built-in styles? Check these out:

Excel built-in styles

For instance, Sebastien was asking how to use the “Good” and “Bad” built-in styles. So the good news is, you can use those named styles. The bad news is, the various style effects (font colour, background colour, borders) aren’t automatically added in for you. From the Microsoft documentation of CellStyle:

This element represents the name and related formatting records for a named cell style in this workbook.

Annex H contains a listing of cellStyles whose corresponding formatting records are implied rather than explicitly saved in the file. In this case, a builtinId attribute is written on the cellStyle record, but no corresponding formatting records are written.

For all built-in cell styles, the builtinId determines the style, not the name. For all cell styles, Normal is applied by default.

Read the second paragraph again. It means even if you use the built-in style “Good”, you still need to create a style with the green background and green text font colour.

Since this is the case, I suggest you just stick to creating your own custom styles. It’s what’s going to happen anyway.

Now there is a magic number, 164. Your custom styles will start being numbered from 164 onwards. This implies there are 164 built-in styles (0-indexed). My search efforts only gave me a partial list.

Please refer to the ECMA-376 documentation for Office Open XML formats. Download the 1st edition, part 4. Inside, you’ll find a PDF. Turn to page 2135 (as of this writing. It’s page 2128 on the PDF itself though). You’ll find this:

Excel Open XML built-in style list

That list is probably that of “standard” styles. From reading the next few pages of the documentation, I believe most of those 164 built-in styles are due to internationalisation issues.

So my conclusion is, for the “standard” built-in styles you’re interested in, you still have to create the supporting style effects (font colour, background colour and the like). For the “non-standard” built-in styles, they are probably different based on the localisation of your Excel file. So you’re better off just creating your own custom styles, which I teach you how here and here. Or you can get my programming guide for detailed explanations and source code.

Man, spreadsheet code libraries are expensive!

I did more market research on commercial code libraries that create, edit or otherwise manipulate Excel spreadsheets (specifically using .NET languages (C# *ahem*)). Mother of columns and rows, they’re expensive! The mid-range products are about US$ 500 and the high-end products start at about US$ 1000. And that’s just the per-developer license. If you have more developers, or need a site-wide license, then you’re looking at thousands of dollars. If you consider license renewals, or subscription renewals, the options can boggle your mind.

So after thinking it through, I have to increase the price of my programming guide (Spreadsheet Open XML From Scratch). As ironic as it sounds, it might be the best thing I need to do. Simply put, my product is not expensive enough.

You will do well if you read some psychological books on human purchasing behaviour. It turns out, as studies show, there is a “perfect” price for every product (or service) for the intended audience (or market if you prefer) at a particular time period. Veer too far from that perfect price, either too cheap or too expensive, and you might need Herculean powers of persuasion to get the customer to buy.

And here I thought I was pricing my guide too expensive… I want to thank the people who’ve already bought my guide. You’re awesome.

So here’s the thing. I’m increasing the price of the guide from US$ 47 to US$ 97 (but please check the product page for current price since I might have changed it. Again…). And I’m doing it on 1 March. Why am I not doing it earlier? So you can buy the guide at the current price if you so choose. And because I’m busy writing my magazine and updates to the guide.

There are updates to the guide? Yes. And if you have anything you want to learn about creating spreadsheets from C# (or VB.NET) and with the Open XML SDK, now’s the time to tell me.

Then buy the guide at its current price, and get the updated version (soon). You get full working source code (not a code library, source code) that you can use with complete freedom (no license fees, no GPL’s, no license requirements, no need for attribution). Use the source code in a personal project or commercial product. No problem. And you get in-depth explanations of concepts taught for a particular task.

Buy it now before the price goes up. Leave comments here if you have specific Excel creation/editing tasks you want me to cover in the updated guide.