Considerations for storing Excel cell value in code

You may hate Excel, but you may find a discussion of how Excel stores cell values interesting.

So I have a spreadsheet library. The biggest concern at the initial stage was how to store all the spreadsheet data efficiently. I hear people talking about millions of cells, so I’m scared. If my program stores a spreadsheet cell using 10 bytes (for example), a million cells would take up 10 million bytes in memory.

Let’s start by looking at all the different types of information you can type into a spreadsheet cell. You have:

  • booleans: TRUE or FALSE
  • numbers
  • text
  • rich text (different styled text within the entire text itself)
  • dates and times

For us programmers, “numbers” can be separated into floating point or integer types. An Excel user won’t see a difference.

So how does Excel actually store those values? I’m going to focus only on Open XML because I’m not interested in BIFF files…

  • booleans: TRUE stored as text “1” and FALSE stored as text “0”
  • numbers: stored as text
  • text: duh
  • rich text: stored in a separate shared strings list, with the index to that list stored as text here.
  • dates and times: stored as number that’s in text form

You will see everything is basically stored as text. That’s because the underlying XML files are text files. There’s a property (XML attribute) that differentiates the data, such as boolean, number, string, inline string, shared string.

So why are dates stored as a number? It’s easier to do date calculations with 41449 than “24 June 2013”. So how is this number obtained? See here.

So if you’ve been looking closely enough, Excel’s optimisation tactic is to store everything as numeric text as far as possible. So I want to follow that.

Before doing so however, I went to read what other people are doing AKA open source spreadsheet libraries. In code, they use an object to store the cell value. As in System.Object, the mother of all data types in .NET.

So you have an integer? Dump it into the object variable. Floating point? Dump into object. String of characters? Dump.

How do you read it out? Boxing and unboxing. You remember it’s a floating point value and cast it back from an object to a double variable type.

So what did I do? I have a double variable and a string variable, and I store the cell value in one or the other based on the input.

The “all in object” way has variable (no pun intended) memory size, based on the contents. Sort of. I’m not an expert in this.

My way has a fixed memory size for double’s. Each double takes up 8 bytes (for sure?). A string variable takes up variable size, but because the optimisation tactic is to store data as a number, I can assign the data to the double variable and set the string variable to null. This means the string variable size is sort of fixed too.

So this is what I do. If it’s a number, I store it in the double variable and set the string variable to null. If it’s text, I convert it to a number by using shared strings (out of scope for discussion here) and store the index into the double variable and set the string variable to null. The only cases where the string variable is actually used is if I store the text there, or if I want to store the actual number there (because “1.23456789” may not be stored exactly as that in a double variable. Go read on how floating points are implemented for details), which are rare.

According to Jon Skeet, strings take up 20 + (n/2)*4 bytes (where n is the number of characters). But a null string takes up 8 bytes (it’s either 4 or 8 bytes. I’ll assume the worse scenario).

This means for the most part, each cell has a double variable that takes up 8 bytes and a null string that takes up 8 bytes. A cell value of 10 or 3.14 or 12345678.9 takes up 16 bytes regardless.

Since 16 bytes is less than 20 + (n/2)*4 bytes, I save more memory in most cases. I also have less boxing and unboxing operations, which make things go faster.

Modulo 26 and column names

I was sitting in the lecture theatre valiantly trying to keep awake. The professor was speaking on the rigorous application and proving of the modulus function. It’s basically the remainder, but I’ve never been introduced to it in such, uh, rigor.

He brought up an example using modulo 26. And demonstrated the wrapping around of values. And the use of it in cryptology (a class I took later on, and I got tasked by the cryptography professor to write a program to do simple encryption. But another story perhaps…).

Modulo 26 is similar to finding the remainder. The difference is that the remainder is unique. This is important to our discussion.

“About what?” you may ask.

Excel column names.

There are 2 types of cell references used in spreadsheets, the R1C1 format and the A1 format. The R1C1 is simple. If the row index is 5 and the column index is 7, the result is R5C7.

The A1 format takes on a column name and the row index. Using our example again, the result is G5, because “G” is the 7th alphabet. Yes, that list of 26 alphabets.

The version of Excel currently (Excel 2007 and 2010) has up to 3 letters, with XFD as the last column name (that’s the 16384th column). What happens is you have column names A, B, and then up to Z. Then the next column name is AA, then AB and then up AZ. Then BA, BB and so on.

Basically, it’s base 26 arithmetic.

As far as I know, the typical method of getting the column name given the column index, is to run a loop. You add 1 until it hits 26, then you move to the next “position”, and then start from 1 again.

There’s nothing wrong with this method. It’s just that you have to iterate as many times as the given column index. If you’re given 16384, the loop runs 16384 times. This is regardless of the fact that the result is always the same. Given the range of values, the result can only be one of 16384 values.

So it was with this in mind, and my dabbling in game development (which said “Precalculate everything!”), that I precalculated an array of strings that are the column names. The array has 16384 items. The context is a spreadsheet software library.

Now to recoup that precalculation cost, I’d have to access my array at least 16384 times. This is where the context of the spreadsheet library comes in. Everybody (and their dog) wants to know if my library can handle millions of cells. This means if the column name calculation is in a method, that method is called millions of times. Given the iteration loop in it, that means the method with the iteration loop thing isn’t efficient (it’s O(n^2)).

However, due to technical issues, I can’t keep the array of strings. The column name array needs to be static to be available throughout the library. This causes issues if multi-threads or multi-processors or multi-whatevers comes in.

So I can’t use my static array anymore. Bummer. The O(n) of simple array access was working so well.

But I still want to have an efficient way of getting column names. So instead of iterating, I used simple division and modulus operations.

Consider 587. How do you know there’s 5 hundred? 587 / 100 equals 5 (truncating remainders). How do you know there’s 8 tens? 87 / 10 equals 8 (truncating remainders).

Yes, it’s elementary arithmetic, but it works great. So we can do the same for column names. There’s a problem though.

In the case above, we divided by 100. Why 100? Because it’s 10^2, and we’re concerned with the 2nd position after the ones position. The ones position is 10^0 by the way, which is 1.

So for our case, for the “hundreds” position, we divide by 676, which is 26^2. And for the “tens” position, we divide by 26.

Now 587 is 100*5 + 10*8 + 7. I’m going to use the notation (5,8,7) to denote this.

Now consider a column index of 3380. 3380 is equal to 676*5 + 26*0 + 0. This is (5,0,0).

However, in our case, our acceptable range of values is 1 through 26. It doesn’t contain zero. So (5,0,0) is not valid.

In the case of 587, we’re working in base 10, with the acceptable range of values being 0 to 9. This is “proper” remainder. Given any number in base 10, there’s a unique number within [0, 9] that’s the remainder.

However, for our purposes, there’s no unique number. Because we’re working in modulo 26, not just “remainder 26”.

The correct column name corresponding to column index 3380 is “DYZ”. This corresponds to (4,25,26). Or
3380 = 676*4 +26*25 + 26.

Note that 3380 is also 676*5 + 26*0 + 0.

My solution is to start from the “ones” position. If it’s greater than zero, fine. If it’s less than or equal to zero, borrow from the next larger position. Then we move to the next larger position, and check again. Continue to borrow until there are no zero values (or negatives) on the “right” side of the resulting notation (we can have “leading” zeroes).

So (5,0,0) becomes (5, -1, 0 + 26), or just (5,-1,26), borrowing 1 from the “tens” position. We cannot have -1, so that becomes (4, -1 + 26, 26), which becomes (4, 25, 26).

An interesting effect is that we typically assign 0 to A, 1 to B, and 25 to Z. In this case, 1 is assigned to A, 2 is to B, and most interestingly, both 0 and 26 map to Z. In fact, any multiple of 26 will map to Z.

Don’t think Z is special. Any multiple of 26 plus 1 also maps to A. So 1, 27, 53 and so on map to A. This is a property of the modulo thing.

Do you have a better way of converting (5,0,0) to (4,25,26)? Let me know in the comments.

Calculate Excel column width pixel interval

Brace yourself. You’re about to learn the secret behind how Excel mysteriously calculates the column width intervals.

In this article, I’m not going into the details of the column widths, but the column width intervals. There’s a difference. From the Open XML SDK specs:

width = Truncate([{Number of Characters} * {Maximum Digit Width} + {5 pixel padding}] / {Maximum Digit Width} * 256) / 256

To put it mildly, that’s a load of hogwash. In the documentation, it says that for Calibri 11 point at 96 DPI, the maximum digit width is 7 pixels. That is also another load of hogwash. It’s actually 8 pixels (well, 7 point something…).

When you move the line on the column width in Excel, just 1 pixel to the left, what is the column width? When you move it 1 pixel to the right, what’s the column width?

It turns out the each pixel interval isn’t a simple multiple of an internal column width interval.

Let’s take Calibri 11 pt 96 DPI again. With a maximum digit width of 8 pixels, each column width interval per pixel is supposedly 1/7 or 1/(max digit width -1).

But wait! It’s not actually 1/7. It’s the largest number of 1/256 multiples that is less than 1/7.

Now 1/7 is about 0.142857142857143. The actual interval is 0.140625, which is 36/256.

4/7 is about 0.571428571428571. The actual interval is 0.5703125, which is 146/256. And you will note that 146 is not equal to (4 * 36).

If you’re using Open XML SDK (or however you choose to access an Open XML Excel file), when you set the column width as 8.142857142857143, internally, Excel will save it as 8.140625.

Here’s some code:

int iPixelWidth = 8;
double fIntervalCheck;
double fInterval;
for (int step = 0; step < iPixelWidth; ++step)
{
    fIntervalCheck = (double)step / (double)(iPixelWidth - 1);
    fInterval = Math.Truncate(256.0 * fIntervalCheck) / 256.0;
    Console.WriteLine("{0:f15} {1:f15}", fIntervalCheck, fInterval);
}

So now you know how the intervals are calculated. But what about the actual column width? Hmm... perhaps another article...

P.S. I'm currently doing research for how to do autofitting for rows and columns for my spreadsheet library. I found this "secret" after fiddling with Excel files for a couple of hours... I know I'm talking about my library a lot, but it's taking up a lot of my brain space right now, so yeah...

How to properly use OpenXmlWriter to write large Excel files

Brian Jones already wrote something on using the OpenXmlWriter with the Open XML SDK to write large Excel files. There are a couple of things that weren’t mentioned though.

OpenXmlWriter works best for creating new things

For the purposes of explanation, we’ll assume the “new thing” is a new worksheet, although OpenXmlWriter can be used to write any Open XML SDK part (such as WorkbookPart).

The reason for this is that OpenXmlWriter is essentially writing out XML tags. The SDK offers classes and properties that we’re familiar with, but underneath it all, an Open XML spreadsheet’s internal structure are XML files zipped together. So instead of using the SDK classes to form the “XML” parts, we use the OpenXmlWriter to directly write out the XML.

The problem comes when we need to save. If you’re working with an existing worksheet, there’s already an existing XML structure in place. This means there’s a corresponding WorksheetPart (to your Worksheet class), and this WorksheetPart already has the XML structure (and more importantly, a DOM structure).

So you can’t just use the OpenXmlWriter to write the XML stuff into an existing WorksheetPart, because any changes you made will be ignored.

This is why in Brian’s code, he reads in the existing worksheet and writes the content into a new worksheet, thus using a new WorksheetPart that’s “attached” to the WorkbookPart. After he’s done reading, he deletes the old WorksheetPart.

In conclusion, either you

  • Work with the DOM structure using the SDK, which means snail-pace if you’re writing many cells
  • Work with new WorksheetPart

If you have something like worksheetPart.Worksheet.Save() mixed with the use of OpenXmlWriter, it’s not going to work.

XML attributes

The more subtle (and possibly more crucial) information that’s missing is of attributes. When you use the OpenXmlWriter, it works best if it’s used all the way. As in you don’t mix SDK class/property usage code with OpenXmlWriter code.

Basically, if there’s an SDK class property you want to set, don’t assign to that property in code. Use the OpenXmlWriter to write that property, because that property is eventually an XML attribute anyway. In this case, think of the SDK as getting in the way of making your code run faster.

In Brian’s code, notice that the RowIndex property of the Row class is not assigned. Which is fine since it’s optional. In this case, Excel tries to guess which row it is, and places the first Row class content into the first row on the spreadsheet. This is good if that’s what you want. And bad if your want your “first” Row class to be on the 3rd row instead.

You might also notice that the Cell class has no properties assigned. And no, CellFormula and CellValue are not properties of the Cell class, they’re actually child classes. The SDK allows you to access them as though they’re properties. This means the “standard” set of cell properties are not assigned:

  • The cell reference, such as “A1”
  • The cell data type, such as string or number
  • The cell style index

The code works because the cell content is a random number. And the default data type of a cell is a number (so it doesn’t need to be explicitly assigned). It will fail if you need strings or other data types.

And because there’s no cell reference given, Excel tries to guess where each cell is. Which is to say, place all the cells in that row flush to the left. This means you can’t have a cell in A1, and a blank cell in B1, and then a cell in C1, because Excel mooshes all the Cell classes together.

And because there’s no style index given, the default style is used. This is typically the minor font (such as Calibri) with no special formatting (such as font colour or bold or italics or borders).

So here’s sample code on how to use OpenXmlWriter. Note the class OpenXmlAttribute. And if you understand this, you will know that the OpenXmlWriter isn’t only for writing large Excel files. It just happens that it directly writes XML tags and attributes, and thus works the fastest for writing a lot of data.

using (SpreadsheetDocument xl = SpreadsheetDocument.Create("LargeFile.xlsx", SpreadsheetDocumentType.Workbook))
{
    List<OpenXmlAttribute> oxa;
    OpenXmlWriter oxw;

    xl.AddWorkbookPart();
    WorksheetPart wsp = xl.WorkbookPart.AddNewPart<WorksheetPart>();

    oxw = OpenXmlWriter.Create(wsp);
    oxw.WriteStartElement(new Worksheet());
    oxw.WriteStartElement(new SheetData());

    for (int i = 1; i <= 50000; ++i)
    {
        oxa = new List<OpenXmlAttribute>();
        // this is the row index
        oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));

        oxw.WriteStartElement(new Row(), oxa);

        for (int j = 1; j <= 100; ++j)
        {
            oxa = new List<OpenXmlAttribute>();
            // this is the data type ("t"), with CellValues.String ("str")
            oxa.Add(new OpenXmlAttribute("t", null, "str"));

            // it's suggested you also have the cell reference, but
            // you'll have to calculate the correct cell reference yourself.
            // Here's an example:
            //oxa.Add(new OpenXmlAttribute("r", null, "A1"));

            oxw.WriteStartElement(new Cell(), oxa);

            oxw.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

            // this is for Cell
            oxw.WriteEndElement();
        }

        // this is for Row
        oxw.WriteEndElement();
    }

    // this is for SheetData
    oxw.WriteEndElement();
    // this is for Worksheet
    oxw.WriteEndElement();
    oxw.Close();

    oxw = OpenXmlWriter.Create(xl.WorkbookPart);
    oxw.WriteStartElement(new Workbook());
    oxw.WriteStartElement(new Sheets());

    // you can use object initialisers like this only when the properties
    // are actual properties. SDK classes sometimes have property-like properties
    // but are actually classes. For example, the Cell class has the CellValue
    // "property" but is actually a child class internally.
    // If the properties correspond to actual XML attributes, then you're fine.
    oxw.WriteElement(new Sheet()
    {
        Name = "Sheet1",
        SheetId = 1,
        Id = xl.WorkbookPart.GetIdOfPart(wsp)
    });

    // this is for Sheets
    oxw.WriteEndElement();
    // this is for Workbook
    oxw.WriteEndElement();
    oxw.Close();

    xl.Close();
}

This creates a spreadsheet with 100 columns and 50000 rows of data. On my machine, it ran in about 18 seconds.

I’m not going to list all the available XML attributes, but you can find out the correct attribute name by exploring the Open XML SDK documentation, or the official ECMA documentation.

Collision detection in merging Excel cells

You can merge cells in Excel. “Duh!” you say. So how does Excel ensure that the cells you want to merge don’t overlap any existing merged cells? How do you check programmatically that the cells you want to merge won’t overlap any existing merged cells?

Well, that was the problem I faced while writing my spreadsheet library software (check out SpreadsheetLight here!). The inkling of an idea was formed when I visited StackOverflow. Can’t remember what I was there for, but the answer page wasn’t about Excel or spreadsheets. It was about collision detection.

Yes, the type of collision detection used in games so you don’t walk through walls and stuff.

And thus I learnt a new mathematical theorem: the separating axis theorem (or hyperplane separation theorem). From what I understand, given two N-dimensional thingies, if you can find one (N-1)-dimensional hyperplane that separates those two thingies, then those two thingies are separate (or not overlapping).

Yes, “thingies” is a technical term.

For example, given two 3D objects, if you can find one 2D plane that separates them, then the 3D objects are not overlapping. Another example is when you’re given two 2D objects, and if you can find one line (which is 1D) that separates them, then the 2D objects are not overlapping.

There are some conditions to be fulfilled, such as the objects being convex and disjoint and whatever. (yes I’m a mathematician…) I’ll leave it to you to do your own reading.

But for our purposes, how do we check if that rectangle of Excel cells we want to merge won’t overlap with any existing merged cells? We have these conditions that make our lives easier:

  • The merged cells are rectangular
  • The 4 end points (top-left, top-right, bottom-left, bottom-right) of a merged cell are in whole numbers
  • The (merged) cells map strictly to a 2D rectangular grid

Since the merged cells are rectangular, they’re also convex (I’m not going to explain why, just trust me). Since the 4 end points are in whole numbers, line boundaries can be easily calculated (it’s easy to check if something is <7 or >=8 for example). And since they map strictly to a 2D rectangular grid, the separating axis is a line. And even better, you can always find a subset of solutions of those separating axes that are either horizontal or vertical.

Ok, diagrams are in order.

Collision detection in merged cells

So in our case, as long as you can find one horizontal line or one vertical line that separates the 2 merged cells we’re checking, then the 2 merged cells are separate.

Let’s try an example. Excel uses cell references that are in the A1 format, which means row 3 column 4 becomes “D3”. The column index uses alphabets and is in base-26. Read more here.

We’re going to simplify that. Let’s say our existing set of merged cells only has one entry. This merged cell has the top-left corner at row 1 column 1, and the bottom-right corner at row 3 column 4. Suppose we want to check if this merge cell overlaps: top-left corner at row 7 column 8, bottom-right corner at row 10 column 10.

The horizontal line of 5 fits the bill, or the line y=5 if you want to be mathematical about it (but y goes from negative to positive downwards instead of the usual Cartesian y). Or y=6. Or even y=7 (note that the line can “touch” one of the merged cells, but not both. This is where the “whole number” condition comes in).

The vertical lines x=5, x=6 or x=8 also fit the bill.

Thus, our 2 merged cells don’t overlap.

So what’s the programmatic way to check? You’d be surprised at the simplicity. To make it easier, the variables with “StartRowIndex” mean the top row of the merged cell, “EndRowIndex” mean the bottom row of the merged cell. And “StartColumnIndex” mean the left-most column of the merged cell, and “EndColumnIndex” mean the right-most column of the merged cell.

if (!(iEndRowIndex < mc.StartRowIndex || iStartRowIndex > mc.EndRowIndex || iEndColumnIndex < mc.StartColumnIndex || iStartColumnIndex > mc.EndColumnIndex))
{
    // there's an overlap!
}

So the merge cell we want to check has top-left corner at (iStartRowIndex, iStartColumnIndex) and bottom-right corner at (iEndRowIndex, iEndColumnIndex).

The variable “mc” refers to a custom class I use to represent merged cells. Obviously, you’d run that condition in a loop through all your existing merged cells. If you can’t find an overlap after running through that loop, then the merge cell you’re checking is good to go.

Let’s run through the individual conditions.

  • (iEndRowIndex < mc.StartRowIndex) means our merged cell is completely above the existing merged cell
  • (iStartRowIndex > mc.EndRowIndex) means our merged cell is completely below the existing merged cell
  • (iEndColumnIndex < mc.StartColumnIndex) means our merged cell is completely to the left of the existing merged cell
  • (iStartColumnIndex > mc.EndColumnIndex) means our merged cell is completely to the right of the existing merged cell

The first 2 conditions check for the existence of a horizontal separating axis. The next 2 conditions check for the existence of a vertical separating axis.

Note the negation of the entire boolean condition. Those 4 conditions check for existence of solutions. Negation means checking for overlaps.

Note also that we use strictly greater than or strictly less than checks. If the merge cells share a row or column, then they overlap, right?

Did you know my first draft of writing the code had me checking 6 different conditions? Each condition had its own if statements. This one just had one. I’m so lucky to have learnt the separating axis theorem. I was checking if a point was above, within or below the merge cell, then I had to check if the point was to the left, within or to the right of the merge cell, and … urgh, it was horrible…