How to fix Open XML SDK created spreadsheet for iOS devices

Note that this is only for Open XML SDK 2.0. I haven’t tested for SDK 2.5 (or later, but there’s no “later” version at the point of this writing).

The fix

You have to manipulate the [Content_Types].xml file. This XML file is zipped together in the spreadsheet file, and is at the root of the file. If you don’t know what I’m talking about, you’re probably reading the wrong article.

Unfortunately, this can’t be done within Open XML SDK. As I understand it, the SDK runs internally on System.IO.Packaging namespace objects.

Now a package is a zip file with structure. But a zip file is not necessarily a package. (go maths logic!) See Microsoft reference (bottom of page).

As I understand it, you can’t manipulate a package’s [Content_Types].xml with any of the methods or classes from the System.IO.Packaging namespace. This XML file is supposed to be tamper-proof.

So if you can’t change it with the Packaging stuff, then you’ll have to manipulate it as a pure zip file. And as of .NET Framework 3.5 (because I’m using Open XML SDK 2.0 as reference), there’s no built-in mechanism for reading and writing zip archives (GZipStream and DeflateStream compress a single stream, which isn’t the tool you need here).

This means you need an outside zip library. I leave it to you to find your favourite library. Assuming you found one, here’s why you need to change the [Content_Types].xml file.

The one generated by Excel looks something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Default ContentType="application/xml" Extension="xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml" PartName="/xl/workbook.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml" PartName="/xl/worksheets/sheet.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.theme+xml" PartName="/xl/theme/theme.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml" PartName="/xl/styles.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml" PartName="/xl/sharedStrings.xml"/>
<Override ContentType="application/vnd.openxmlformats-package.core-properties+xml" PartName="/docProps/core.xml"/>
</Types>

The one generated by Open XML SDK 2.0 looks something like:

<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml" Extension="xml"/>
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml" PartName="/xl/worksheets/sheet.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.theme+xml" PartName="/xl/theme/theme.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml" PartName="/xl/styles.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml" PartName="/xl/sharedStrings.xml"/>
<Override ContentType="application/vnd.openxmlformats-package.core-properties+xml" PartName="/docProps/core.xml"/>
</Types>

Spot the difference time!

The workbook part is missing. Specifically, look for the XML tag with the ContentType attribute equal to “application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml”

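Open XML SDK, for whatever reason, doesn’t write an explicit Override entry with PartName="/xl/workbook.xml". Instead, it declares the workbook content type as the Default for every file with the xml extension. This is why iOS devices can’t see the spreadsheet file.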
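Here’s a minimal sketch of the zip-level patch. It’s in Python only because its standard library ships a zip module; the same idea applies with whatever .NET zip library you picked. It assumes the fix is simply to make the SDK’s file match what Excel writes: reset the xml Default to application/xml and add an explicit Override for /xl/workbook.xml. The function names patch_content_types and fix_spreadsheet are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
WORKBOOK_CT = ("application/vnd.openxmlformats-officedocument."
               "spreadsheetml.sheet.main+xml")


def patch_content_types(xml_bytes, workbook_part="/xl/workbook.xml"):
    """Return [Content_Types].xml with an explicit Override for the workbook."""
    ET.register_namespace("", CT_NS)  # keep the default namespace on output
    root = ET.fromstring(xml_bytes)

    # Assumption: the "xml" Default should be plain application/xml, as in
    # the Excel-generated file.
    for default in root.findall("{%s}Default" % CT_NS):
        if (default.get("Extension") == "xml"
                and default.get("ContentType") == WORKBOOK_CT):
            default.set("ContentType", "application/xml")

    # Add the Override for the workbook part only if it is missing.
    overrides = root.findall("{%s}Override" % CT_NS)
    if not any(o.get("PartName") == workbook_part for o in overrides):
        ET.SubElement(root, "{%s}Override" % CT_NS,
                      {"PartName": workbook_part, "ContentType": WORKBOOK_CT})

    declaration = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


def fix_spreadsheet(src_bytes):
    """Copy every zip entry unchanged except a patched [Content_Types].xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as zin:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
            for name in zin.namelist():
                data = zin.read(name)
                if name == "[Content_Types].xml":
                    data = patch_content_types(data)
                zout.writestr(name, data)
    return out.getvalue()
```

You’d read the bytes of the SDK-generated .xlsx, pass them to fix_spreadsheet, and write the result back out. I haven’t verified this sketch against an actual iOS device, so treat it as a starting point.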

I don’t know whether the fix will work on Android devices or Windows phones. But a developer emailed me that iOS 6.0 devices seem to still be able to read the spreadsheet files by Open XML SDK, but iOS 7.0 just gives up.

Microsoft, please, you created the Open XML specs. Other companies are following the specs, and the one useful tool you made to help generate Open XML documents fails to follow the specs.

Considerations for storing Excel cell value in code

You may hate Excel, but you may find a discussion of how Excel stores cell values interesting.

So I have a spreadsheet library. The biggest concern at the initial stage was how to store all the spreadsheet data efficiently. I hear people talking about millions of cells, so I’m scared. If my program stores a spreadsheet cell using 10 bytes (for example), a million cells would take up 10 million bytes in memory.

Let’s start by looking at all the different types of information you can type into a spreadsheet cell. You have:

  • booleans: TRUE or FALSE
  • numbers
  • text
  • rich text (different styled text within the entire text itself)
  • dates and times

For us programmers, “numbers” can be separated into floating point or integer types. An Excel user won’t see a difference.

So how does Excel actually store those values? I’m going to focus only on Open XML because I’m not interested in BIFF files…

  • booleans: TRUE stored as text “1” and FALSE stored as text “0”
  • numbers: stored as text
  • text: duh
  • rich text: stored in a separate shared strings list, with the index to that list stored as text here.
  • dates and times: stored as number that’s in text form

You will see everything is basically stored as text. That’s because the underlying XML files are text files. There’s a property (XML attribute) that differentiates the data, such as boolean, number, string, inline string, shared string.
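To make that concrete, here’s a small Python sketch that reads a single cell element and uses the type attribute (t) to decide how to interpret the text in the value element. It’s a simplification: the shared strings are assumed to be an already-loaded list of plain strings, and read_cell is a made-up helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_cell(cell_xml, shared_strings):
    """Interpret one <c> element according to its t attribute."""
    cell = ET.fromstring(cell_xml)
    cell_type = cell.get("t", "n")  # a missing t attribute means number
    value = cell.find(NS + "v")
    raw = value.text if value is not None else None

    if cell_type == "b":            # boolean: "1" or "0"
        return raw == "1"
    if cell_type == "n":            # number (dates are numbers too)
        return float(raw) if raw is not None else None
    if cell_type == "s":            # index into the shared strings list
        return shared_strings[int(raw)]
    if cell_type == "inlineStr":    # text stored inside the cell itself
        return "".join(t.text or "" for t in cell.iter(NS + "t"))
    return raw                      # anything else: keep the raw text


ns = 'xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"'
print(read_cell('<c %s r="A1" t="b"><v>1</v></c>' % ns, []))           # True
print(read_cell('<c %s r="A2" t="s"><v>0</v></c>' % ns, ["hello"]))    # hello
```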

So why are dates stored as a number? It’s easier to do date calculations with 41449 than “24 June 2013”. So how is this number obtained? See here.
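As a quick check of that number, here’s a Python sketch of the conversion. It assumes the default 1900 date system and a date on or after 1 March 1900 (earlier dates are off by one because Excel treats 1900 as a leap year). The epoch constant and the function name to_excel_serial are my own choices for illustration.

```python
from datetime import datetime

# Assumption: 1900 date system. Counting from 30 December 1899 makes
# serial numbers line up with Excel for dates from 1 March 1900 onwards.
EXCEL_EPOCH = datetime(1899, 12, 30)


def to_excel_serial(moment):
    """Whole days since the epoch, plus the time of day as a fraction."""
    delta = moment - EXCEL_EPOCH
    return delta.days + delta.seconds / 86400.0


print(to_excel_serial(datetime(2013, 6, 24)))         # 41449.0
print(to_excel_serial(datetime(2013, 6, 24, 12, 0)))  # 41449.5
```

The time of day is just the fractional part, so noon is .5. Microseconds are ignored in this sketch.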

So if you’ve been looking closely enough, Excel’s optimisation tactic is to store everything as numeric text as far as possible. So I want to follow that.

Before doing so however, I went to read what other people are doing AKA open source spreadsheet libraries. In code, they use an object to store the cell value. As in System.Object, the mother of all data types in .NET.

So you have an integer? Dump it into the object variable. Floating point? Dump into object. String of characters? Dump.

How do you read it out? Boxing and unboxing. You remember it’s a floating point value and cast it back from an object to a double variable type.

So what did I do? I have a double variable and a string variable, and I store the cell value in one or the other based on the input.

The “all in object” way has variable (no pun intended) memory size, based on the contents. Sort of. I’m not an expert in this.

My way has a fixed memory size for doubles. Each double takes up 8 bytes. A string variable takes up variable size, but because the optimisation tactic is to store data as a number, I can assign the data to the double variable and set the string variable to null. This means the string variable size is sort of fixed too.

So this is what I do. If it’s a number, I store it in the double variable and set the string variable to null. If it’s text, I convert it to a number by using shared strings (out of scope for discussion here) and store the index into the double variable and set the string variable to null. The only cases where the string variable is actually used are when I store the text there, or when I want to store the actual number there (because “1.23456789” may not be stored exactly as that in a double variable. Go read on how floating points are implemented for details), and those are rare.
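Here’s a rough Python sketch of that storage scheme, for illustration only. It’s an analogue of the idea, not the library’s actual code: the class names, the kind tag used to tell the cases apart, and the simple shared strings table are all assumptions.

```python
class SharedStrings:
    """Hypothetical shared strings table: each distinct text gets an index."""

    def __init__(self):
        self._items = []
        self._index = {}

    def add(self, text):
        if text not in self._index:
            self._index[text] = len(self._items)
            self._items.append(text)
        return self._index[text]

    def get(self, index):
        return self._items[index]


class Cell:
    """One numeric field and one text field; the text field is usually None."""
    __slots__ = ("kind", "number", "text")

    def __init__(self, kind, number=0.0, text=None):
        self.kind = kind      # assumed tag: "number", "boolean", "shared_string"
        self.number = number
        self.text = text


def make_cell(value, shared):
    if isinstance(value, bool):            # check bool first: bool is an int
        return Cell("boolean", 1.0 if value else 0.0)
    if isinstance(value, (int, float)):
        return Cell("number", float(value))
    # Text becomes an index into the shared strings, stored as a number.
    return Cell("shared_string", float(shared.add(value)))


def cell_value(cell, shared):
    if cell.kind == "shared_string":
        return shared.get(int(cell.number))
    if cell.kind == "boolean":
        return cell.number == 1.0
    return cell.number
```

In this sketch the text field stays None for every case shown, which is the point: the common cases only ever touch the numeric field.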

According to Jon Skeet, strings take up 20 + (n/2)*4 bytes (where n is the number of characters). But a null string reference takes up 8 bytes (it’s either 4 or 8 bytes. I’ll assume the worst case).

This means for the most part, each cell has a double variable that takes up 8 bytes and a null string that takes up 8 bytes. A cell value of 10 or 3.14 or 12345678.9 takes up 16 bytes regardless.

Since 16 bytes is less than 20 + (n/2)*4 bytes, I save more memory in most cases. I also have fewer boxing and unboxing operations, which makes things go faster.
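To see the comparison in numbers, here’s the arithmetic as a few lines of Python. It takes the formula quoted above at face value and assumes n/2 is integer division; the figures describe .NET memory, and Python is only doing the sums.

```python
def string_bytes(n_chars):
    """String size per the quoted formula, assuming n/2 rounds down."""
    return 20 + (n_chars // 2) * 4


# One double (8 bytes) plus one null string reference (8 bytes, worst case).
CELL_BYTES = 8 + 8

for n in (0, 1, 10, 100):
    print(n, string_bytes(n), CELL_BYTES)

# A million cells at the fixed size:
print(1000000 * CELL_BYTES)  # 16000000
```

Even an empty string comes out at 20 bytes under that formula, so the fixed 16 bytes is smaller for every string length.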

File upload size limit in IIS

Yay file uploads. As if letting users type stuff into the web application and giving me SQL injection nightmares wasn’t enough, now I have to let users upload files.

Peachy.

So during my investigations into the limits of file uploading, I found that I couldn’t upload a file more than 30MB on my test server. It failed faster than Superman could jump a building in a single bound, and with just as much sound.

In short, here are my findings. The default file size limit set in IIS (6 and below? Read on for more details) is 4MB. In IIS7 (on Windows Server 2008), the file size limit is 30MB (technically it’s 28.61MB because it’s 30000000 bytes but who’s keeping track. Hey you read on!).

So how do you change the limits? In the web.config file. We’re doing ASP.NET applications.

<httpRuntime executionTimeout="3600" maxRequestLength="20480" />

That will give you a timeout period of 1 hour (3600 seconds) and a file size limit of 20MB (20480 KB. Yes, that attribute is in kilobytes).

For IIS7, we do this:

<system.webServer>
    <security>
        <requestFiltering>
            <requestLimits maxAllowedContentLength="134217728" />
        </requestFiltering>
    </security>
</system.webServer>

That gives you a 128MB limit (128 * 1024 * 1024). Yes it’s in number of bytes.
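Since one setting is in kilobytes and the other in bytes, here’s a tiny Python sketch of the unit conversions used above. The helper names are made up; only the arithmetic matters.

```python
def max_request_length_kb(megabytes):
    """Value for httpRuntime maxRequestLength, which is in kilobytes."""
    return megabytes * 1024


def max_allowed_content_length_bytes(megabytes):
    """Value for requestLimits maxAllowedContentLength, which is in bytes."""
    return megabytes * 1024 * 1024


print(max_request_length_kb(20))                 # 20480
print(max_allowed_content_length_bytes(128))     # 134217728
print(round(30000000 / (1024 * 1024), 2))        # 28.61
```

The last line is the IIS7 default of 30000000 bytes expressed in megabytes.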

So why was I doing file uploads? Documents from university staff or students. The most important of which is the final doctoral thesis.

I asked how large that thesis could be, assuming it’s in PDF form. The answer I got suggested that a 40MB limit was too small. Really?

I had trouble auto-generating an Excel file of 40MB just to test the server limits. Do you know how large 40MB is?

If it’s a video or sound file, then yes I can believe it. I have video files of over 100MB, some over 200MB. But a PDF? With mostly text?

Go check out the Open XML specs from ECMA. The largest document is about 28MB. It’s over 5000 pages. I doubt any thesis can match that number of pages.

Specious spreadsheet security

If not for Chinese, it would’ve worked.

So in Excel, you can set a password for either protecting a particular worksheet, or protecting the entire workbook/spreadsheet. This password is then hashed, and the result is stored within the spreadsheet contents.

Now with the use of Open XML spreadsheets, this means the resulting hash is stored in “plain text” within XML files. Without going into too much detail, here’s the algorithm for the hash as documented in the Open XML SDK 2.0 help docs:

// Function Input:
//    szPassword: NULL-terminated C-style string
//    cchPassword: The number of characters in szPassword (not including the NULL terminator)
WORD GetPasswordHash(const CHAR *szPassword, int cchPassword) {
      WORD wPasswordHash;
      const CHAR *pch;
 
      wPasswordHash = 0;
 
      if (cchPassword > 0)
            {
            pch = &szPassword[cchPassword];
            while (pch-- != szPassword)
                  {
                  wPasswordHash = ((wPasswordHash >> 14) & 0x01) | ((wPasswordHash << 1) & 0x7fff);
                  wPasswordHash ^= *pch;
                  }
            wPasswordHash ^= (0x8000 | ('N' << 8) | 'K');
            }
      
      return(wPasswordHash);
}

This algorithm is wrong. Or at least it doesn't give the resulting hash that Excel produces.
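If you want to experiment with it, here's a straight Python port of the documented example above. To be clear, this ports the example algorithm as printed, so it inherits the same problem: don't expect it to match what Excel produces. It assumes an ASCII-only password (one byte per character), and the function name is made up.

```python
def example_password_hash(password):
    """Port of the documented example hash; returns a 16-bit value."""
    data = password.encode("ascii")  # assumption: ASCII-only passwords
    value = 0
    if len(data) > 0:
        # Walk the password from the last character to the first.
        for byte in reversed(data):
            # Rotate the low 15 bits left by one, then mix in the character.
            value = ((value >> 14) & 0x01) | ((value << 1) & 0x7FFF)
            value ^= byte
        value ^= 0x8000 | (ord("N") << 8) | ord("K")
    return value & 0xFFFF


print(hex(example_password_hash("a")))   # 0xce2a
print(hex(example_password_hash("ab")))  # 0xceee
```

Comparing these values against the hash stored in a file that Excel itself protected is an easy way to see the mismatch for yourself.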

Granted, I didn't really expect the algorithm to be correct. Because from the SDK help:

An example algorithm to hash the user input into the value stored is as follows:

It's an example algorithm, so it may or may not be the one that Excel actually uses. However, for the purposes of usability, the password supplied by the user has to be hashed using Excel's algorithm, so we have to somehow get the resulting hash. Either we get the algorithm itself, or we come up with an algorithm whose resulting hash matches Excel's.

Which is what Kohei Yoshida did. See his modified algorithm. This algorithm worked!

Given that I'm Chinese, I did what's natural: I used Chinese characters as the password.

And the modified algorithm failed. It only worked if the password consisted only of Latin letters. I tried Japanese characters. Failed too.

This is why I don't support password protection in SpreadsheetLight. I don't want to give the false impression that the worksheet/workbook encryption works. This is one feature where "partially worked" is unacceptable.

Granted Open XML spreadsheets also support other types of encryption, and you can store the name of the algorithm and salt value you used, and even the number of times the hash algorithm was run.

But.

This is in the context of a spreadsheet library.

What are spreadsheet libraries mostly used for? Automation.

Which means minimal (if at all) human interaction and intervention.

And so the question comes up.

Where do you store the password?

If a human is protecting the worksheet/workbook, she will provide the password herself, Excel hashes it for her, and she just remembers the password.

If a spreadsheet library is generating the workbook, and encrypting it, the password has to be gotten from somewhere, right? So it's stored, maybe in a text file, maybe in a database, maybe hardcoded in code, or whatever.

The point being that the password is not held solely by a human being. And a computer hard drive is easier to hack into than a human brain.

And I would prefer not to be a party to facilitating insecure spreadsheet generation. Besides, it's Open XML. The data is supposed to be open and sharable. Password protection seems to be the opposite. Even Microsoft cautions against depending solely on encrypted Excel files.

We already have social engineering used by devious people to deceive people into giving their passwords over. Storing passwords on a computer seems suicidal because computers have no common sense at all.

As a final word, I'd say using the Open XML SDK can be either verbose, or obscenely painstakingly verbose. Simple tasks need a couple of dozen lines of code, and complex tasks take at least two orders of magnitude more work. For individualistic, compartmentalisable (that's not a word, right?) tasks, you can do it from scratch. Add even a smidgeon of complexity, and you'll find a library more useful. Try mine.

SpreadsheetLight version 3

Version 3 of my spreadsheet library is now available. There’s a whole bunch of updates, including Excel 2010 conditional formatting such as data bars with negative value fill colours and icon sets with no icons.

SpreadsheetLight is possibly the most developer-friendly spreadsheet library ever. Even if I do say so myself. 🙂