You may hate Excel, but you may find a discussion of how Excel stores cell values interesting.
So I have a spreadsheet library. The biggest concern at the initial stage was how to store all the spreadsheet data efficiently. I hear people talking about millions of cells, so I’m scared. If my program stores a spreadsheet cell using 10 bytes (for example), a million cells would take up 10 million bytes in memory.
Let’s start by looking at all the different types of information you can type into a spreadsheet cell. You have:
- booleans: TRUE or FALSE
- numbers
- text
- rich text (differently styled runs of text within the same cell text)
- dates and times
For us programmers, “numbers” can be separated into floating point or integer types. An Excel user won’t see a difference.
So how does Excel actually store those values? I’m going to focus only on Open XML because I’m not interested in BIFF files…
- booleans: TRUE stored as text “1” and FALSE stored as text “0”
- numbers: stored as text
- text: duh
- rich text: stored in a separate shared strings list, with the index to that list stored as text here.
- dates and times: stored as a number in text form
You will see everything is basically stored as text. That’s because the underlying XML files are text files. There’s a property (XML attribute) that differentiates the data, such as boolean, number, string, inline string, shared string.
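To make that attribute concrete, here is a small Python sketch that reads a hand-written fragment of worksheet XML. The fragment and the helper name `describe_cell` are made up for illustration. The `t` attribute, its values (`b`, `n`, `s`, `str`, `inlineStr`) and the `v` child element come from SpreadsheetML, where a missing `t` means a number.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hand-written fragment for illustration, not taken from a real workbook.
ROW_XML = """
<row xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" r="1">
  <c r="A1" t="b"><v>1</v></c>
  <c r="B1"><v>3.14</v></c>
  <c r="C1" t="s"><v>0</v></c>
  <c r="D1"><v>41449</v></c>
</row>
"""

TYPE_NAMES = {
    "b": "boolean",
    "n": "number",
    "s": "shared string",
    "str": "formula string",
    "inlineStr": "inline string",
}


def describe_cell(cell):
    """Return (reference, type name, raw text value) for one <c> element."""
    cell_type = cell.get("t", "n")  # no t attribute means a number
    value = cell.find("m:v", NS)
    raw = value.text if value is not None else None
    return cell.get("r"), TYPE_NAMES.get(cell_type, cell_type), raw


row = ET.fromstring(ROW_XML)
cells = [describe_cell(c) for c in row.findall("m:c", NS)]
for ref, kind, raw in cells:
    print(ref, kind, repr(raw))
```

Note that the date in D1 looks like any other number at this level. What makes it display as a date is the number format in the cell's style, which this sketch ignores.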
So why are dates stored as a number? It’s easier to do date calculations with 41449 than “24 June 2013”. So how is this number obtained? It’s a day count: in Excel’s default date system, 1 January 1900 is day 1.
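A minimal sketch of that date arithmetic in Python, assuming the workbook uses Excel's default 1900 date system. The function names are mine. Excel treats 1900 as a leap year, so this shortcut of counting from 30 December 1899 is only right for serial numbers from 61 (1 March 1900) onwards.

```python
from datetime import date, timedelta

# Assumption: default 1900 date system. Day zero is set to 30 December 1899
# to absorb Excel's nonexistent 29 February 1900.
EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert a whole-number Excel serial to a calendar date."""
    return EPOCH + timedelta(days=serial)


def date_to_serial(d):
    """Convert a calendar date to its Excel serial number."""
    return (d - EPOCH).days


print(serial_to_date(41449))               # 2013-06-24
print(date_to_serial(date(2013, 6, 24)))   # 41449
```

With dates as plain numbers, "how many days between two dates" is just a subtraction of two serials.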
So if you’ve been looking closely enough, you’ll notice that Excel’s optimisation tactic is to store everything as numeric text as far as possible. So I want to follow that.
Before doing so however, I went to read what other people are doing AKA open source spreadsheet libraries. In code, they use an object to store the cell value. As in System.Object, the mother of all data types in .NET.
So you have an integer? Dump it into the object variable. Floating point? Dump into object. String of characters? Dump.
How do you read it out? Boxing and unboxing. You remember it’s a floating point value and cast it back from an object to a double variable type.
So what did I do? I have a double variable and a string variable, and I store the cell value in one or the other based on the input.
The “all in object” way has variable (no pun intended) memory size, based on the contents. Sort of. I’m not an expert in this.
My way has a fixed memory size for doubles. Each double takes up 8 bytes (a .NET double is always 64 bits). A string variable takes up variable size, but because the optimisation tactic is to store data as a number, I can assign the data to the double variable and set the string variable to null. This means the string variable size is sort of fixed too.
So this is what I do. If it’s a number, I store it in the double variable and set the string variable to null. If it’s text, I convert it to a number by using shared strings (out of scope for discussion here), store the index in the double variable and set the string variable to null. The only cases where the string variable is actually used are when I store the text there directly, or when I want to keep the number exactly as typed (because “1.23456789” may not be stored exactly as that in a double variable; go read up on how floating point is implemented for details). Both are rare.
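Here is a rough Python sketch of the scheme just described. It is not my library's actual code: the names `SharedStrings`, `Cell`, `make_cell` and `read_cell` are hypothetical, and the `kind` tag is an assumption about how you'd tell a number from a shared string index when reading a cell back.

```python
class SharedStrings:
    """Hypothetical shared strings table: each distinct text gets one index."""

    def __init__(self):
        self._index = {}
        self.items = []

    def add(self, text):
        if text not in self._index:
            self._index[text] = len(self.items)
            self.items.append(text)
        return self._index[text]


class Cell:
    """Hypothetical cell: one numeric slot plus a rarely used text slot."""

    __slots__ = ("kind", "number", "text")

    def __init__(self, kind, number, text=None):
        self.kind = kind      # "number", "shared_string" or "boolean"
        self.number = number  # the value itself, or a shared string index
        self.text = text      # stays None in the common cases


def make_cell(value, shared):
    # Check bool first: in Python a bool is also an int.
    if isinstance(value, bool):
        return Cell("boolean", 1.0 if value else 0.0)
    if isinstance(value, (int, float)):
        return Cell("number", float(value))
    return Cell("shared_string", float(shared.add(value)))


def read_cell(cell, shared):
    if cell.kind == "boolean":
        return cell.number == 1.0
    if cell.kind == "shared_string":
        return shared.items[int(cell.number)]
    return cell.number


shared = SharedStrings()
cells = [make_cell(v, shared) for v in (3.14, "hello", True, "hello")]
print([read_cell(c, shared) for c in cells])
print(shared.items)
```

The two "hello" cells share one entry in the table, and none of the four cells touches the text slot.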
This means for the most part, each cell has a double variable that takes up 8 bytes and a null string reference that takes up 8 bytes (on a 64-bit runtime). A cell value of 10 or 3.14 or 12345678.9 takes up 16 bytes regardless.
Since 16 bytes is less than 20 + (n/2)*4 bytes (a commonly quoted estimate for a .NET string of n characters), I save memory in most cases. I also have fewer boxing and unboxing operations, which makes things go faster.
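The arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a measurement: it takes the 20 + (n/2)*4 figure at face value as the size of a string of n characters, assumes n/2 is rounded down, and ignores any other per-object overhead.

```python
def string_bytes(n):
    """Estimated size of an n-character string: 20 + (n/2)*4, n/2 rounded down."""
    return 20 + (n // 2) * 4


# One 8-byte double plus one 8-byte null string reference.
FIXED_CELL_BYTES = 8 + 8

for n in (0, 1, 10, 100):
    print(n, string_bytes(n), FIXED_CELL_BYTES)

# Total for a million cells stored the fixed-size way.
million_cells = 1_000_000 * FIXED_CELL_BYTES
print(million_cells)
```

Even the empty string comes out at 20 bytes under this estimate, so the fixed 16 bytes is smaller for every length.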