The leap year 1900 “bug” in Excel

No it’s not really a bug in Microsoft Excel. What happens is that Excel will accept 29 Feb 1900 as a valid date.

“Wait, the year 1900 is not a leap year. 29 Feb 1900 is invalid!”

Yes, I agree. I wrote something about leap years before. From what I understand, it’s a historical issue, that

There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904.

[emphasis mine]

The “1-2-3” refers to Lotus 1-2-3, a spreadsheet program. To get the Lotus users to come over to Excel, Excel had to be able to import files in the Lotus format. Unfortunately, leap years weren’t well understood then, so Excel used the wrong leap year calculation, as did other spreadsheet software, which Microsoft acknowledges.

I’m talking about this because I’m creating Excel files in the Open XML format (for reporting purposes). In particular, I work with dates. For example, I have to create reports for call detail records that pass through satellites for my users. The date value is one of the most looked at piece of information.

Anyway, to study how dates are stored in Excel, I created an Excel file with date information such as “26/10/2009 12:34” and saved it. I used to save it in the (Excel 2003) XML format, but with Open XML and the Microsoft Office Compatibility Pack (for Office 2003 and earlier versions), I tried the .xlsx format.

I renamed the .xlsx to .zip (because Open XML files are just zip files), then unzipped the package. I looked at /xl/worksheets/sheet1.xml and looked for my date data. It disappeared! They’re just floating point numbers!

Actually, those floating point numbers represent the number of days since the epoch 1 Jan 1900.

How did I discover that? I can’t remember the details. I think I printed consecutive days such as 1 Jan 2009, 2 Jan 2009 and so on, and then checked the floating point number in the resulting XML file. I found that they differed by a difference of 1.

Then I made the brilliant mother of all light bulbs and postulated that perhaps the floating point numbers started counting from an epoch. The only epoch worth my while was 1 Jan 1900.

So I amended my test program to start generating date values
01/01/1900
02/01/1900
03/01/1900
04/01/1900
05/01/1900
and so on.

I looked at the resulting XML file and saw 1, 2, 3, 4, 5 and so on. My next brilliant bit of deduction was that the trailing decimal values must be fractional values of days. Pleased with myself, I proceeded to test that theory.

It worked fine. Till I hit 29 Feb 1900. My instincts warned me, “That doesn’t look right.” Generally, I look out for edge cases, and in the case of dates, 29 Feb. I did some calculations and woah, 29 Feb 1900 isn’t valid! Yet there it was in the Excel file, displayed and formatted correctly by Excel.

That’s when I dug around and found Joel’s article.

So how do you fix it?

DateTime dtEpoch = new DateTime(1900, 1, 1, 0, 0, 0, 0);
DateTime dt = DateTime.ParseExact("05 Mar 1900", "dd MMM yyyy", null);
TimeSpan ts = dt - dtEpoch;
double fExcelDateTime;
// Excel has "bug" of treating 29 Feb 1900 as valid
// 29 Feb 1900 is 59 days after 1 Jan 1900, so just skip to 1 Mar 1900
if (ts.Days >= 59)
{
	fExcelDateTime = ts.TotalDays + 2.0;
}
else
{
	fExcelDateTime = ts.TotalDays + 1.0;
}

If the date in question is on or after 1 Mar 1900, simply add one to the floating point value.

You might wonder why we added one more to the value in each section of the if-else portion. If the date is on or after 1 Mar 1900, shouldn’t it be +1.0 instead of +2.0?

Because it’s a counting problem. We excluded the epoch date itself when we calculated the TimeSpan ts, so we’re adding it back in.

You might wonder why you couldn’t have used 31 Dec 1899 as the base. You can. I just think 31 Dec 1899 doesn’t quite ring a bell. 1 Jan 1900 is more striking.

“But I don’t use the year 1900 at all! I mean, even the year 2000 is, like, so last century. The year 1900 is like, Babylonian!”

Hmm… ok. *shrugs*

There’s also updated material and source code, together with more information on how to work with Open XML. Click here to find out more.

Parsing date formats with dd/MM/yyyy or MM/dd/yyyy?

I was angry and frustrated. The developer tested his code, and it failed. When he sent it to me for help, I tested it, and it worked. Isn’t it supposed to be the other way around? Someone writes the code and it works for them, but when it’s reviewed by you, it fails?

That developer couldn’t get the date selection right. I wrote a web control library which includes a date selection widget. When I designed it, to simplify things, I hardcoded the date parsing to be dd/MM/yyyy format (I’ll explain my choice later on).

What happened was, the developer either got the yellow screen of death (if you’re working in ASP.NET, you know what I’m talking about), or the day and month were swapped.

Finally, I asked him if his computer settings had MM/dd/yyyy as the default short date format. And if it was, to change it to dd/MM/yyyy. It was, and he changed it, and the code worked.

The default date setting on computers running on Windows is MM/dd/yyyy format. I change it to dd/MM/yyyy format because Singapore uses that format. And my users’ main business involves a big company residing in United Kingdom, which also uses the dd/MM/yyyy format.

Here’s how you change the settings. Go to Control Panel, and click on “Regional and Language Options” or something named similarly. You’ll get something like:

Regional and Language Options

Click on the “Customize” button to get:

Customize Regional Options

In the “Short date format” dropdownlist, enter “dd/MM/yyyy”. You may not find it under the dropdownlist options. This is what some people don’t know: You can type “dd/MM/yyyy” in the dropdownlist. Click “Apply” button, then “OK” button, and you’re set.

Here’s some code to test your changes:

System.Globalization.CultureInfo c = new System.Globalization.CultureInfo("en-GB", true);
DateTime dt1 = DateTime.ParseExact("05/08/2009", "dd/MM/yyyy", null);
DateTime dt2 = DateTime.Parse("05/08/2009");
DateTime dt3 = DateTime.ParseExact("05/08/2009", "dd/MM/yyyy", c);
DateTime dt4 = DateTime.Parse("05/08/2009", c);
Console.WriteLine(dt1.ToString("D").PadLeft(30) + " | " + dt1.ToString(c));
Console.WriteLine(dt2.ToString("D").PadLeft(30) + " | " + dt2.ToString("dd/MM/yyyy"));
Console.WriteLine(dt3.ToString("D").PadLeft(30) + " | " + dt3.ToString("MM/dd/yyyy"));
Console.WriteLine(dt4.ToString("D").PadLeft(30) + " | " + dt4.ToString());

Did you get the values you expected?

Here’s the comparison output from the American MM/dd/yyyy and British dd/MM/yyyy format:
Regional settings test output

Note that the code remained the same. Only the regional settings were changed.

So what I found out was that, as long as you explicitly state the date format (or use a culture setting with your desired date format), it works whether you’re parsing in the date string, or printing out the date.

Despite changing the settings, I still had a date format issue which I asked on StackOverflow. It was some time ago, but I didn’t get a satisfactory answer. Then again, it was my colleague’s problem and he’s fine with the weird behaviour… The problem could be because he didn’t explicitly set the output format, which displayed the wrong date on the screen, even though the regional settings are correct on the web server.

A Tale of Two Timezones

I work with date and time data a lot. I write applications that handle tens of thousands of such records a day. Not sure if I can tell you the exact nature of my work, so I’ll just briefly touch on the peripheral. Besides, if I write it here, I’ll have to hunt down the 7 people who read my blog. Excluding you of course. You’re awesome.

Big Ben
[image by track5]

So a company in United Kingdom provides a … certain service. And this service produces most of the data I work with. United Kingdom and Singapore… not quite London and Paris though…

Anyway, the data is accessed by customers all around the world, so I’ve got to align the dates and times correctly. Luckily, the customers understand the data is in UTC, so I didn’t have to explain too much in the user interface. I just have to present the date/time information correct to the second.

Internally, there’s a fork in opinions. Should the times when the backend processes are run be in UTC, or Singapore time? Purists might say, “All times should be in UTC!” and they convert accordingly. Pragmatists might say, “I’m the only who reads those logs, so it should be in local time!” The customers won’t know what went on in the backend processes, but they might be interested in when their data was last refreshed (which happens to be the process date/time).

This creates an interesting problem. You have to standardise the handling of any date/time data logic, even if it’s just to get the current date/time. Sometimes, I forget whether a particular piece of data is in UTC or local time, and I had to dig out the code that handled it to find out. Because 8 hours is a big difference.

Adding to this confusion, is that the system times of the web server and database server might be off by a few hundredths of a second. So do you use DateTime.UtcNow in C# or getutcdate() in SQL? Does it even matter? I used to work for a manager who insisted all times to be based on the database server’s system time. I admire his stand. I also have to point out the many database calls wasted just to get the current time in the applications…

On the upside, I have to wait for the data to be consolidated for the previous day, before it’s sent over to Singapore for processing. UTC midnight is 8am in Singapore, which is about the time I start work. This gives me some time to react in case of emergencies. You take whatever blessings you have, however small…

Date and time format mistakes in .NET and SQL

I’ve written about how you can manipulate dates and times in .NET before. Here, I’m going to highlight a few avoidable mistakes when manipulating them in .NET and in SQL for database queries.

Case matters

I’m not going to list down all the format strings you can use. Refer to this list instead. You might find this article by Microsoft on best practices useful. I am going to highlight these two letters, h and m.

The small “h” gives you the hour, in 12-hour notation without a leading zero (for less than 10 values). “hh” gives you the 12-hour representation with a leading zero.

The capital “H” gives you the hour in 24-hour notation (or military time) without a leading zero, and “HH” gives you the 24-hour representation with a leading zero.

It’s good user-friendly practice to include the “t” or “tt” notation if you’re using the small “h” to represent the hour (for “A” / “P” or “AM” / “PM” respectively). This way, you know if it’s in the morning or night.

The letter H isn’t so bad. At least you’re still referring to the hour. When you get to M, oh you better watch out.

The small “m” and “mm” gives you the minute (of the time) without and with a leading zero respectively.

The capital “M” and “MM” gives you the month without and with a leading zero respectively. For example, September would be 9 and 09 respectively. See, totally different thing from its small lettered counterpart. There’s the “MMM” and the “MMMM” format string, and I’ll leave it to you to experiment with it.

Now to a common mistake I see: “dd/mm/yyyy“. See any problems?

I’m going to give you a starter custom format string: “dd/MM/yyyy HH:mm:ss“. Burn that into your brain. You can swap “dd” and “MM” if you want. Use “-” instead of “/” as the separator if you want.

The TO_CHAR() function in Oracle PL/SQL

For the equivalent SQL statement to format dates and times in Oracle, here it is:

TO_CHAR(sysdate, 'DD/MM/YYYY HH24:MI:SS')

Note that the function parameter is case insensitive.

Note the difference between “MM” for month and “MI” for minute. For a 12-hour representation, use “HH” or “HH12”.

Tricky formats in Sybase and SQL Server

I don’t know why Sybase and SQL Server don’t just give the ability to easily customise date and time formats. Instead of the string formats in .NET and PL/SQL, they use number codes. Number codes! *urgh*

I actually have a table of the codes and the resulting formats printed out and placed beside my desk, somewhere buried in the chaotic mess of papers. I really just remember 3 number codes: 103, 112 and 120.

select convert(char(10), getdate(), 103)
-- gives something like 17/09/2008 in dd/MM/yyyy format
select convert(char(8), getdate(), 112)
-- gives something like 20080917 in yyyyMMdd format
select convert(char(19), getdate(), 120)
-- gives something like 2008-09-17 05:04:03 in yyyy-MM-dd HH:mm:ss format

I’m using the .NET format string notation in the comments. Note that number code 120 is only available in SQL Server. The other 2 codes are available in both Sybase and SQL Server. There are other codes (which you can explore here). I just frequently use those 3, particularly 112.

If you’re used to the American date display, then this might be useful:

select convert(char(10), getdate(), 101)
-- gives something like 09/17/2008 in MM/dd/yyyy format

Note that you can do something like

select convert(char(6), getdate(), 112)
-- gives something like 200809 in yyyyMM format

Note the char(6) part (versus char(8) originally). I do this quite often too.

Now for the mistake in T-SQL. What do you think is wrong with this?

select convert(char(10), EFF_DATE, 103) EFF_DATE
from Customers
order by EFF_DATE desc

Yes, I’m being deliberately vague. Use your powers of deduction to fill in the blanks.