Modulo 26 and column names

I was sitting in the lecture theatre valiantly trying to keep awake. The professor was speaking on the rigorous application and proving of the modulus function. It’s basically the remainder, but I’ve never been introduced to it in such, uh, rigor.

He brought up an example using modulo 26. And demonstrated the wrapping around of values. And the use of it in cryptology (a class I took later on, and I got tasked by the cryptography professor to write a program to do simple encryption. But another story perhaps…).

Modulo 26 is similar to finding the remainder. The difference is that the remainder is unique. This is important to our discussion.

“About what?” you may ask.

Excel column names.

There are 2 types of cell references used in spreadsheets, the R1C1 format and the A1 format. The R1C1 is simple. If the row index is 5 and the column index is 7, the result is R5C7.

The A1 format takes on a column name and the row index. Using our example again, the result is G5, because “G” is the 7th alphabet. Yes, that list of 26 alphabets.

The version of Excel currently (Excel 2007 and 2010) has up to 3 letters, with XFD as the last column name (that’s the 16384th column). What happens is you have column names A, B, and then up to Z. Then the next column name is AA, then AB and then up AZ. Then BA, BB and so on.

Basically, it’s base 26 arithmetic.

As far as I know, the typical method of getting the column name given the column index, is to run a loop. You add 1 until it hits 26, then you move to the next “position”, and then start from 1 again.

There’s nothing wrong with this method. It’s just that you have to iterate as many times as the given column index. If you’re given 16384, the loop runs 16384 times. This is regardless of the fact that the result is always the same. Given the range of values, the result can only be one of 16384 values.

So it was with this in mind, and my dabbling in game development (which said “Precalculate everything!”), that I precalculated an array of strings that are the column names. The array has 16384 items. The context is a spreadsheet software library.

Now to recoup that precalculation cost, I’d have to access my array at least 16384 times. This is where the context of the spreadsheet library comes in. Everybody (and their dog) wants to know if my library can handle millions of cells. This means if the column name calculation is in a method, that method is called millions of times. Given the iteration loop in it, that means the method with the iteration loop thing isn’t efficient (it’s O(n^2)).

However, due to technical issues, I can’t keep the array of strings. The column name array needs to be static to be available throughout the library. This causes issues if multi-threads or multi-processors or multi-whatevers comes in.

So I can’t use my static array anymore. Bummer. The O(n) of simple array access was working so well.

But I still want to have an efficient way of getting column names. So instead of iterating, I used simple division and modulus operations.

Consider 587. How do you know there’s 5 hundred? 587 / 100 equals 5 (truncating remainders). How do you know there’s 8 tens? 87 / 10 equals 8 (truncating remainders).

Yes, it’s elementary arithmetic, but it works great. So we can do the same for column names. There’s a problem though.

In the case above, we divided by 100. Why 100? Because it’s 10^2, and we’re concerned with the 2nd position after the ones position. The ones position is 10^0 by the way, which is 1.

So for our case, for the “hundreds” position, we divide by 676, which is 26^2. And for the “tens” position, we divide by 26.

Now 587 is 100*5 + 10*8 + 7. I’m going to use the notation (5,8,7) to denote this.

Now consider a column index of 3380. 3380 is equal to 676*5 + 26*0 + 0. This is (5,0,0).

However, in our case, our acceptable range of values is 1 through 26. It doesn’t contain zero. So (5,0,0) is not valid.

In the case of 587, we’re working in base 10, with the acceptable range of values being 0 to 9. This is “proper” remainder. Given any number in base 10, there’s a unique number within [0, 9] that’s the remainder.

However, for our purposes, there’s no unique number. Because we’re working in modulo 26, not just “remainder 26”.

The correct column name corresponding to column index 3380 is “DYZ”. This corresponds to (4,25,26). Or
3380 = 676*4 +26*25 + 26.

Note that 3380 is also 676*5 + 26*0 + 0.

My solution is to start from the “ones” position. If it’s greater than zero, fine. If it’s less than or equal to zero, borrow from the next larger position. Then we move to the next larger position, and check again. Continue to borrow until there are no zero values (or negatives) on the “right” side of the resulting notation (we can have “leading” zeroes).

So (5,0,0) becomes (5, -1, 0 + 26), or just (5,-1,26), borrowing 1 from the “tens” position. We cannot have -1, so that becomes (4, -1 + 26, 26), which becomes (4, 25, 26).

An interesting effect is that we typically assign 0 to A, 1 to B, and 25 to Z. In this case, 1 is assigned to A, 2 is to B, and most interestingly, both 0 and 26 map to Z. In fact, any multiple of 26 will map to Z.

Don’t think Z is special. Any multiple of 26 plus 1 also maps to A. So 1, 27, 53 and so on map to A. This is a property of the modulo thing.

Do you have a better way of converting (5,0,0) to (4,25,26)? Let me know in the comments.

Calculating Excel spreadsheet column names

I’ve been working with Open XML spreadsheets for the past, I don’t know how long… A year? I just realised that getting that Excel column header name is a frequent task. You know, given that it’s the 4th column, it’s “D”. I don’t work frequently with spreadsheets with lots of columns. So it was interesting that the 26th column is “Z” and the 27th column becomes “AA”. Basically, base-26 arithmetic, using the 26 letters of the English alphabet as tokens.

There are probably lots of code snippets out there showing you how to calculate a column name given the column index. Here’s mine:

string[] saExcelColumnHeaderNames = new string[16384];
string[] sa = new string[] { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };
string s = string.Empty;
int i, j, k, l;
i = j = k = -1;
for (l = 0; l < 16384; ++l)
{
    s = string.Empty;
    ++k;
    if (k == 26)
    {
        k = 0;
        ++j;
        if (j == 26)
        {
            j = 0;
            ++i;
        }
    }
    if (i >= 0) s += sa[i];
    if (j >= 0) s += sa[j];
    if (k >= 0) s += sa[k];
    saExcelColumnHeaderNames[l] = s;
}

That gives you a zero-based indexing version. So to get the 30th column name, you use saExcelColumnHeaderNames[29].

In case you’re wondering, 16384 is the maximum number of columns supported by Excel 2010.

You will notice that it’s not a function given the column index. I find that not as useful. Look, typically when you need the column name, you probably also need to get it frequently, usually with different parameters.

What I did was to store all the calculation results into a string array. Then you reference it with an index. The calculation function typically is a O(n) operation. With you needing to use the function multiple times, your whole algorithm probably goes up to O(n^2).

My method is also an O(n) operation. But referencing a string array is I think an O(1), meaning it’s a constant. I’ve never been good with big O notation…

This style of solving the problem is called pre-calculation. Pre-calculation is especially useful in the games development region, where speed is important. For example, selected values of sine and cosine were pre-calculated and stored in arrays, for use in the numerous 3D/2D calculations in games. Calculating sine’s and cosine’s in real-time were detrimental to a speedy game.

That’s not as useful now because you need a fuller range of floating point values as input. But the concept is still useful.

I think I read somewhere (while I was doing hobbyist game development) this quote:

Pre-calculate everything!

Maybe computers are now much faster. I don’t care. That doesn’t give you an excuse to be sloppy. It’s an optimisation that doesn’t take much effort.

If you need to calculate it, see if you can calculate it just once.

DefinedNames in Cells for Open XML spreadsheets

So a while back, a customer of mine asked me if I knew how to set names (or labels) to cells in a spreadsheet, so that a cell formula just referenced those names. Frankly, I didn’t even know I could do that. I’m not an Excel wizard, as you can tell.

In case you’re in a hurry, here’s the code and the resulting Excel spreadsheet. The code runs against the Open XML SDK.

To my surprise, the name given to the spreadsheet cell isn’t defined at the Cell class level. You’d think you would name your child and make sure to slap a name tag on your progeny just so everyone knows what to call your offspring, right? After looking through the documentation, and thinking it through, I guess it makes sense to separate it. It works by having a central depository with all the available names, contained within DefinedNames and DefinedName classes. Then the names are available throughout the spreadsheet’s workbook.

It’s sort of like the SharedStringTable, where every piece of text is stored in a SharedStringItem, and referenced with an index. This has problems, in that I don’t know what is contained in the Cell class itself, but I don’t really want to go there right now…

Let’s go through the code a bit. Here’s the part where you define DefinedName(s):

defname = new DefinedName();
defname.Name = "PrimeNum1";
defname.Text = "Sheet1!$C$2";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum2";
defname.Text = "Sheet1!$C$3";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum3";
defname.Text = "Sheet1!$C$4";
defnames.Append(defname);

defname = new DefinedName();
defname.Name = "PrimeNum4";
defname.Text = "Sheet1!$C$5";
defnames.Append(defname);

In case you’re unfamiliar with Excel, “Sheet1!$C$5” means the cell C5 of the sheet named “Sheet1”. The dollar sign acts as a separator, I think. In this case, the relevant cells are C2, C3, C4 and C5 (containing the 1st, 2nd, 3rd and 4th prime numbers). And then we have this part:

r = new Row();
r.RowIndex = 6;
c = new Cell();
c.CellReference = "B6";
c.DataType = CellValues.String;
c.CellValue = new CellValue("SUM");
r.Append(c);
c = new Cell();
c.CellReference = "C6";
c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");
c.CellValue = new CellValue("17");
r.Append(c);
sd.Append(r);

Note this particular line:

c.CellFormula = new CellFormula("SUM(PrimeNum1, PrimeNum2, PrimeNum3, PrimeNum4)");

That’s our defined names in effect. Otherwise, we would use this:

c.CellFormula = new CellFormula("SUM(C2, C3, C4, C5)");
// or even
// c.CellFormula = new CellFormula("SUM(C2:C5)");

I teach you more about the CellFormula class, as well as a whole bunch of Open XML concepts in my programming guide.

If you’re Chinese, don’t give your progeny full dialect names

I recently had a chat with a friend about baby names. She’s asking for opinions because, well, she’s carrying a baby girl (congratulations!) and she wanted to know what her/my friends and I thought of her choice.

The name’s not important to this story. The point is, her husband and her had decided to use only an English name and the surname (or last name if you come from those western countries). I’m using the term “English name” as opposed to “Christian name” because it’s more generic.

Now, the typical Chinese name has 2 “English” equivalents, phonetically speaking (other than a direct English name). One is the Hanyu Pinyin version. For example, my name in Hanyu Pinyin is Chen Weilie (or Wei Lie in 2 characters, but it’s sort of understood when they’re lumped together).

The other equivalent is the dialect name. My Chinese dialect group is Hokkien, which means my ancestors came from Fu Jian in China. And my dialect name is Tan Wai Lip.

Now, my preferred name is “Vincent”. Or “Vincent Tan”. “Vincent Chen” is fine too. I just think that last one sounds funny to me…

Which gives an interesting problem, because my full “English name” is Vincent Tan Wai Lip. “Tan” is my last name. It doesn’t look very “last” to me…

According to the western naming convention, I should be “Wai Lip Vincent Tan”, which means in my culture’s convention, I’m “Tan Wai Lip Vincent”. The “Wai Lip” part is not a middle name, it is my first name according to western culture, and my given name in Chinese culture.

I’m bringing this up because, many of the online transactions I’ve made requires a “first name” and “last name” field. What do I fill in for those? I usually use “Vincent” and “Tan” respectively.

This worked fine till I got to PayPal. For withdrawal purposes, they required my name, after fully resolving the “first name” and “last name” part, to be exactly the same as that in my Singapore bank account. I’ll leave you to imagine all the hassle this gave me…

I’m sure Chinese aren’t the only ones with this problem.

So, back to that pregnant friend of mine. Even though she and her husband are both Chinese, they’ve decided that their daughter shall only have the “English name” and the surname part. Their daughter will still have a full Chinese name, only that she won’t have a full dialect name.

Their main reason?

Because the nurse at the hospital is prone to giving terrible sounding dialect names…