Python pass by reference

Some time ago, I was given the opportunity to take a free course in Python. I was like “Sure why not? And it’s free? Woot!” (I don’t actually say “woot” but it seems like the kind of thing people say, right?)

The course is a beginner module meant for technical people (but not full-on coders). I was the only professional programmer in the class. I didn’t care, I don’t know Python. So 2 points of interest…

White space is critical

The programming languages I know consider white space as something to make the code easier to read for the human. Python elevates that to a whole new level.

I’m not even talking about the debate about whether a tab or 4 spaces is better for indentation. Python will punish you if you used 4 spaces for indentation here and then use 3 spaces for indentation there. This is especially terrible if you use an editor where a tab looks like 4 spaces so it looks like the indentation is fine but it’s not.

Parameter values are by default passed by reference

This one took a bit of getting used to. I’m used to passing values to functions and knowing that those values (or classes or structs) will never be tampered with unless I say so. For example, with the “ref” keyword in C#:

public void Foobar(ref ComplicatedClass ImComplex)
{
// ImComplex will possibly be modified
}

I’ll just slide that into my “good to know, but I’ll probably never use” list. I don’t want any weird errors slithering up because that’s hard to swallow. (I swear that’s the end of my snake jokes)

P.S. Happy Independence Day to you if you celebrate it. Coincidentally, it’s also the official day that my company closed. So Independence Day to me too.

How to fix Open XML SDK created spreadsheet for iOS devices

Note that this is only for Open XML SDK 2.0. I haven’t tested for SDK 2.5 (or later, but there’s no “later” version at the point of this writing).

The fix

You have to manipulate the [Content_Types].xml file. This XML file is zipped together in the spreadsheet file, and is at the root of the file. If you don’t know what I’m talking about, you’re probably reading the wrong article.

Unfortunately, this can’t be done within Open XML SDK. As I understand it, the SDK runs internally on System.IO.Packaging namespace objects.

Now a package is a zip file with structure. But a zip file is not necessarily a package. (go maths logic!) See Microsoft reference (bottom of page).

As I understand it, you can’t manipulate a package’s [Content_Types].xml with any of the methods or classes from the System.IO.Packaging namespace. This XML file is supposed to be tamper-proof.

So if you can’t change it with the Packaging stuff, then you’ll have to manipulate it as a pure zip file. And as at .NET Framework 3.5 (because I’m using Open XML SDK 2.0 as reference), there’s no in-built zipping mechanism for the kind of zipping algorithm you need (the Gzip and Deflate isn’t exactly the tool).

This means you need an outside zip library. I leave it to you to find your favourite library. Assuming you found one, here’s why you need to change the [Content_Types].xml file.

The one generated by Excel looks something like this:

<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Default ContentType="application/xml" Extension="xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml" PartName="/xl/workbook.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml" PartName="/xl/worksheets/sheet.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.theme+xml" PartName="/xl/theme/theme.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml" PartName="/xl/styles.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml" PartName="/xl/sharedStrings.xml"/>
<Override ContentType="application/vnd.openxmlformats-package.core-properties+xml" PartName="/docProps/core.xml"/>
</Types>

The one generated by Open XML SDK 2.0 looks something like:

<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml" Extension="xml"/>
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml" PartName="/xl/worksheets/sheet.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.theme+xml" PartName="/xl/theme/theme.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml" PartName="/xl/styles.xml"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml" PartName="/xl/sharedStrings.xml"/>
<Override ContentType="application/vnd.openxmlformats-package.core-properties+xml" PartName="/docProps/core.xml"/>
</Types>

Spot the difference time!

The workbook part is missing. Specifically, look for the XML tag with the ContentType attribute equal to “application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml”

Open XML SDK, for whatever reason, doesn’t explicitly set the PartName=”/xl/workbook.xml” part. This is why iOS devices can’t see the spreadsheet file.

I don’t know whether the fix will work on Android devices or Windows phones. But a developer emailed me that iOS 6.0 devices seem to still be able to read the spreadsheet files by Open XML SDK, but iOS 7.0 just gives up.

Microsoft, please, you created the Open XML specs. Other companies are following the specs and the one useful tool you made to help generate Open XML documents fail to follow the specs.

Considerations for storing Excel cell value in code

You may hate Excel, but you may find a discussion of how Excel stores cell values interesting.

So I have a spreadsheet library. The biggest concern at the initial stage was how to store all the spreadsheet data efficiently. I hear people talking about millions of cells, so I’m scared. If my program stores a spreadsheet cell using 10 bytes (for example), a million cells would take up 10 million bytes in memory.

Let’s start by looking at all the different types of information you can type into a spreadsheet cell. You have:

  • booleans: TRUE or FALSE
  • numbers
  • text
  • rich text (different styled text within the entire text itself)
  • dates and times

For us programmers, “numbers” can be separated into floating point or integer types. An Excel user won’t see a difference.

So how does Excel actually store those values? I’m going to focus only on Open XML because I’m not interested in BIFF files…

  • booleans: TRUE stored as text “1” and FALSE stored as text “0”
  • numbers: stored as text
  • text: duh
  • rich text: stored in a separate shared strings list, with the index to that list stored as text here.
  • dates and times: stored as number that’s in text form

You will see everything is basically stored as text. That’s because the underlying XML files are text files. There’s a property (XML attribute) that differentiates the data, such as boolean, number, string, inline string, shared string.

So why are dates stored as a number? It’s easier to do date calculations with 41449 than “24 June 2013”. So how is this number obtained? See here.

So if you’ve been looking closely enough, Excel’s optimisation tactic is to store everything as numeric text as far as possible. So I want to follow that.

Before doing so however, I went to read what other people are doing AKA open source spreadsheet libraries. In code, they use an object to store the cell value. As in System.Object, the mother of all data types in .NET.

So you have an integer? Dump it into the object variable. Floating point? Dump into object. String of characters? Dump.

How do you read it out? Boxing and unboxing. You remember it’s a floating point value and cast it back from an object to a double variable type.

So what did I do? I have a double variable and a string variable, and I store the cell value in one or the other based on the input.

The “all in object” way has variable (no pun intended) memory size, based on the contents. Sort of. I’m not an expert in this.

My way has a fixed memory size for double’s. Each double takes up 8 bytes (for sure?). A string variable takes up variable size, but because the optimisation tactic is to store data as a number, I can assign the data to the double variable and set the string variable to null. This means the string variable size is sort of fixed too.

So this is what I do. If it’s a number, I store it in the double variable and set the string variable to null. If it’s text, I convert it to a number by using shared strings (out of scope for discussion here) and store the index into the double variable and set the string variable to null. The only cases where the string variable is actually used is if I store the text there, or if I want to store the actual number there (because “1.23456789” may not be stored exactly as that in a double variable. Go read on how floating points are implemented for details), which are rare.

According to Jon Skeet, strings take up 20 + (n/2)*4 bytes (where n is the number of characters). But a null string takes up 8 bytes (it’s either 4 or 8 bytes. I’ll assume the worse scenario).

This means for the most part, each cell has a double variable that takes up 8 bytes and a null string that takes up 8 bytes. A cell value of 10 or 3.14 or 12345678.9 takes up 16 bytes regardless.

Since 16 bytes is less than 20 + (n/2)*4 bytes, I save more memory in most cases. I also have less boxing and unboxing operations, which make things go faster.

File upload size limit in IIS

Yay file uploads. As if letting the users to type in stuff into the web application giving me SQL injection nightmares weren’t enough, now I have to let users upload files.

Peachy.

So during my investigations into the limits of file uploading, I found that I couldn’t upload a file more than 30MB on my test server. It failed faster than Superman could jump a building in a single bound, and with just as much sound.

In short, here are my findings. The default file size limit set in IIS (6 and below? Read on for more details) is 4MB. In IIS7 (on Windows Server 2008), the file size limit is 30MB (technically it’s 28.61MB because it’s 30000000 bytes but who’s keeping track. Hey you read on!).

So how do you change the limits? In the web.config file. We’re doing ASP.NET applications.

<httpRuntime executionTimeout="3600" maxRequestLength="20480" />

That will give you a timeout period of 1 hour (3600 seconds) and a file size limit of 20MB (20480 KB. Yes, that attribute is in kilobytes).

For IIS7, we do this:

<security>
    <requestFiltering>
        <requestLimits maxAllowedContentLength="134217728" />
    </requestFiltering>
</security>

That gives you a 128MB limit (128 * 1024 * 1024). Yes it’s in number of bytes.

So why was I doing file uploads? Documents from university staff or students. The most important of which is the final doctoral thesis.

I asked how large can that thesis be, assuming it’s in PDF form. I got an answer where a 40MB limit seems too small. Really?

I had trouble auto-generating an Excel file of 40MB just to test the server limits. Do you know how large 40MB is?

If it’s a video or sound file, then yes I can believe it. I have video files of over 100MB, some over 200MB. But a PDF? With mostly text?

Go check out the Open XML specs from ECMA. The largest document is about 28MB. It’s over 5000 pages. I doubt any thesis can match that number of pages.

Specious spreadsheet security

If not for Chinese, it would’ve worked.

So in Excel, you can set a password for either protecting a particular worksheet, or protecting the entire workbook/spreadsheet. This password is then hashed, and the result is stored within the spreadsheet contents.

Now with the use of Open XML spreadsheets, this means the resulting hash is stored in “plain text” within XML files. Without going into too much detail, here’s the algorithm for the hash as documented in the Open XML SDK 2.0 help docs:

// Function Input:
//    szPassword: NULL-terminated C-style string
//    cchPassword: The number of characters in szPassword (not including the NULL terminator)
WORD GetPasswordHash(const CHAR *szPassword, int cchPassword) {
      WORD wPasswordHash;
      const CHAR *pch;
 
      wPasswordHash = 0;
 
      if (cchPassword > 0)
            {
            pch = &szPassword[cchPassword];
            while (pch-- != szPassword)
                  {
                  wPasswordHash = ((wPasswordHash >> 14) & 0x01) | ((wPasswordHash << 1) & 0x7fff);
                  wPasswordHash ^= *pch;
                  }
            wPasswordHash ^= (0x8000 | ('N' << 8) | 'K');
            }
      
      return(wPasswordHash);
}

This algorithm is wrong. Or at least it doesn't give the resulting hash that Excel produces.

Granted, I didn't really expect the algorithm to be correct. Because from the SDK help:

An example algorithm to hash the user input into the value stored is as follows:

It's an example algorithm, so it may or may not be the one that Excel actually use. However, for the purposes of usability, the password used by users have to be encrypted using Excel's algorithm, so we have to somehow get the resulting hash. Either we get the algorithm itself, or we simulate an algorithm such that the resulting hash matches that of Excel's.

Which is what Kohei Yoshida did. See his modified algorithm. This algorithm worked!

Given that I'm Chinese, I did what's natural: I used Chinese characters as the password.

And the modified algorithm failed. It only worked if the password consisted only of Latin alphabets. I tried Japanese characters. Failed too.

This is why I don't support password protection in SpreadsheetLight. I don't want to give the false impression that the worksheet/workbook encryption works. This is one feature where "partially worked" is unacceptable.

Granted Open XML spreadsheets also support other types of encryption, and you can store the name of the algorithm and salt value you used, and even the number of times the hash algorithm was run.

But.

This is in the context of a spreadsheet library.

What are spreadsheet libraries mostly used for? Automation.

Which means minimal (if at all) human interaction and intervention.

And so the question comes up.

Where do you store the password?

If a human is protecting the worksheet/workbook, she will provide the password herself, and then encrypts it (well Excel does it), and she just remembers the password.

If a spreadsheet library is generating the workbook, and encrypting it, the password has to be gotten from somewhere, right? So it's stored, maybe in a text file, maybe in a database, maybe hardcoded in code, or whatever.

The point being that the password is not held solely by a human being. And a computer hard drive is easier to hack into than a human brain.

And I will prefer not to be a party to facilitating insecure spreadsheet generation. Besides, it's Open XML. The data is supposed to be open and sharable. Password protection seems to be the opposite. Even Microsoft cautions the use of depending just on encrypted Excel files.

We already have social engineering used by devious people to deceive people into giving their passwords over. Storing passwords on a computer seems suicidal because computers have no common sense at all.

As a final word, I'd say using the Open XML SDK can be either verbose, or obscenely painstakingly verbose. Simple tasks need a couple of dozens of lines of code, and complex tasks take at least 2 magnitudes of work to do. For individualistic, compartmentalisable (that's not a word, right?) tasks, you can do it from scratch. Add even a smidgeon of complexity, and you'll find a library more useful. Try mine.

SpreadsheetLight version 3

Version 3 of my spreadsheet library is now available. There’s a whole bunch of updates, including Excel 2010 conditional formatting such as data bars with negative value fill colours and icon sets with no icons.

SpreadsheetLight is possibly the most developer-friendly spreadsheet library ever. Even if I do say so myself. 🙂

Modulo 26 and column names

I was sitting in the lecture theatre valiantly trying to keep awake. The professor was speaking on the rigorous application and proving of the modulus function. It’s basically the remainder, but I’ve never been introduced to it in such, uh, rigor.

He brought up an example using modulo 26. And demonstrated the wrapping around of values. And the use of it in cryptology (a class I took later on, and I got tasked by the cryptography professor to write a program to do simple encryption. But another story perhaps…).

Modulo 26 is similar to finding the remainder. The difference is that the remainder is unique. This is important to our discussion.

“About what?” you may ask.

Excel column names.

There are 2 types of cell references used in spreadsheets, the R1C1 format and the A1 format. The R1C1 is simple. If the row index is 5 and the column index is 7, the result is R5C7.

The A1 format takes on a column name and the row index. Using our example again, the result is G5, because “G” is the 7th alphabet. Yes, that list of 26 alphabets.

The version of Excel currently (Excel 2007 and 2010) has up to 3 letters, with XFD as the last column name (that’s the 16384th column). What happens is you have column names A, B, and then up to Z. Then the next column name is AA, then AB and then up AZ. Then BA, BB and so on.

Basically, it’s base 26 arithmetic.

As far as I know, the typical method of getting the column name given the column index, is to run a loop. You add 1 until it hits 26, then you move to the next “position”, and then start from 1 again.

There’s nothing wrong with this method. It’s just that you have to iterate as many times as the given column index. If you’re given 16384, the loop runs 16384 times. This is regardless of the fact that the result is always the same. Given the range of values, the result can only be one of 16384 values.

So it was with this in mind, and my dabbling in game development (which said “Precalculate everything!”), that I precalculated an array of strings that are the column names. The array has 16384 items. The context is a spreadsheet software library.

Now to recoup that precalculation cost, I’d have to access my array at least 16384 times. This is where the context of the spreadsheet library comes in. Everybody (and their dog) wants to know if my library can handle millions of cells. This means if the column name calculation is in a method, that method is called millions of times. Given the iteration loop in it, that means the method with the iteration loop thing isn’t efficient (it’s O(n^2)).

However, due to technical issues, I can’t keep the array of strings. The column name array needs to be static to be available throughout the library. This causes issues if multi-threads or multi-processors or multi-whatevers comes in.

So I can’t use my static array anymore. Bummer. The O(n) of simple array access was working so well.

But I still want to have an efficient way of getting column names. So instead of iterating, I used simple division and modulus operations.

Consider 587. How do you know there’s 5 hundred? 587 / 100 equals 5 (truncating remainders). How do you know there’s 8 tens? 87 / 10 equals 8 (truncating remainders).

Yes, it’s elementary arithmetic, but it works great. So we can do the same for column names. There’s a problem though.

In the case above, we divided by 100. Why 100? Because it’s 10^2, and we’re concerned with the 2nd position after the ones position. The ones position is 10^0 by the way, which is 1.

So for our case, for the “hundreds” position, we divide by 676, which is 26^2. And for the “tens” position, we divide by 26.

Now 587 is 100*5 + 10*8 + 7. I’m going to use the notation (5,8,7) to denote this.

Now consider a column index of 3380. 3380 is equal to 676*5 + 26*0 + 0. This is (5,0,0).

However, in our case, our acceptable range of values is 1 through 26. It doesn’t contain zero. So (5,0,0) is not valid.

In the case of 587, we’re working in base 10, with the acceptable range of values being 0 to 9. This is “proper” remainder. Given any number in base 10, there’s a unique number within [0, 9] that’s the remainder.

However, for our purposes, there’s no unique number. Because we’re working in modulo 26, not just “remainder 26”.

The correct column name corresponding to column index 3380 is “DYZ”. This corresponds to (4,25,26). Or
3380 = 676*4 +26*25 + 26.

Note that 3380 is also 676*5 + 26*0 + 0.

My solution is to start from the “ones” position. If it’s greater than zero, fine. If it’s less than or equal to zero, borrow from the next larger position. Then we move to the next larger position, and check again. Continue to borrow until there are no zero values (or negatives) on the “right” side of the resulting notation (we can have “leading” zeroes).

So (5,0,0) becomes (5, -1, 0 + 26), or just (5,-1,26), borrowing 1 from the “tens” position. We cannot have -1, so that becomes (4, -1 + 26, 26), which becomes (4, 25, 26).

An interesting effect is that we typically assign 0 to A, 1 to B, and 25 to Z. In this case, 1 is assigned to A, 2 is to B, and most interestingly, both 0 and 26 map to Z. In fact, any multiple of 26 will map to Z.

Don’t think Z is special. Any multiple of 26 plus 1 also maps to A. So 1, 27, 53 and so on map to A. This is a property of the modulo thing.

Do you have a better way of converting (5,0,0) to (4,25,26)? Let me know in the comments.

Being a software god is tough

“Can this value be negative?” asked my colleague.

We were in a meeting with a product manager to get project requirements. The software application was to calculate settlement revenue between our company and our company’s partners, who were content providers. We charge the public customers for the content, then we share the revenue with the content providers.

The ideal situation was that all the numbers are nice and neat, people pay on time, everyone plays nice and so on and so forth. But reality isn’t this simple.

The value in question was that the product manager wanted to have a mechanism for him to introduce adjustments. You know, in case something happens and we should bill the content provider more, meaning we share less revenue with them.

But you know, you can’t just give the content provider a net number. They’d want to know why they’re receiving less money. So the adjustment had to be a line item on the settlement report.

Since I wasn’t the senior developer there, I decided to voice the concern to my colleague later on. But my colleague anticipated it. “Can this value be negative?” asked my colleague.

The product manager thought about it, and said yes, it could be negative. It means our company had to pay the content provider more money. Say we calculated last month’s settlement wrongly and we’re correcting that this month *cough*.

Let me tell you, finance people are getting blase with the number of zeroes in financial figures, but put one hyphen in front of a number and the whole financial sector goes into collective apoplexy. Sheesh…

My point is that unthinkable input values are always possible. Like negative values. Sometimes, I think developers forget the entire infinity of numbers on the other side of the real line…

I’m continuing sort of a discussion of the book “Geekonomics” by David Rice. Rice wrote that writing software was akin to creating an entire world in which the developer was, well, the supreme being.

Every single rule is determined by the developer. Every limit. Every calculation. Every display.

Well, everything that the developer is aware of anyway.

If the developer didn’t remember to put in the laws of gravity, that cannonball would’ve fired straight into space instead of landing nicely on the enemy tanks in that simulation war game.

Mother Nature takes care of anomalies in her stride. Entire species of dinosaurs dying out? No problem. Hey this mammalian species looks interesting. Let me give them a chance.

Software anomalies (read: bugs) aren’t so easily absorbed… With the world increasingly overrun by software, where medical devices, stock markets, airline ticket prices (supposedly tuned by software to remain competitive by comparing prices of other airlines), traffic lights, cars (Google cars are driven by software), online commerce and possibly even your toaster (if it isn’t already wired to the Internet), software anomalies can have a huge impact.

Software is written by developers. A developer is human. A human is flawed. Hence software is by design, flawed.

The Architect in The Matrix designed all the software in the world of Matrix (ok, maybe there’s delegation…). Right down to leaves falling and wind blowing in your face and steak tasting like steak (and not chicken) and pigeons flying like they’re supposed to.

But the Architect is also the creation of a human being. The Architect simulated the “real” world flawlessly, but only so far as human knowledge goes. What if Newtonian physics didn’t exist? If the imperial system or the metric system wasn’t invented, would the Architect invent some other measurement system?

In the end, the entire Matrix was bugged by an anomaly. Agent Smith.

And before that, the Matrix was bugged by another systemic anomaly. Which was sort of solved by creating the notion of The One. Hence Neo. (Ironically, the anomaly Agent Smith was the creation of the then current The One, the solution to solving the original systemic anomaly).

Get this. The Architect had to solve the systemic anomaly by flushing the entire Matrix, killing everyone except The One and the people chosen by The One to repopulate the Matrix.

Software is precise and elegant. Until it meets humans. Then everything hits the fan.

The Architect couldn’t solve his own software bugs (unless you call purging the entire Matrix as “solving”). Being a software god is tough.

Would you say the computer software entity known as The Architect is proficient in writing code? Because coming up soon, I’m going to write about another topic, that of software developer licensing, something that Rice also touched on. I’m talking “pass the bar or you don’t get to practice law” kind of accreditation.

Calculate Excel column width pixel interval

Brace yourself. You’re about to learn the secret behind how Excel mysteriously calculates the column width intervals.

In this article, I’m not going into the details of the column widths, but the column width intervals. There’s a difference. From the Open XML SDK specs:

width = Truncate([{Number of Characters} * {Maximum Digit Width} + {5 pixel padding}] / {Maximum Digit Width} * 256) / 256

To put it mildly, that’s a load of hogwash. In the documentation, it says that for Calibri 11 point at 96 DPI, the maximum digit width is 7 pixels. That is also another load of hogwash. It’s actually 8 pixels (well, 7 point something…).

When you move the line on the column width in Excel, just 1 pixel to the left, what is the column width? When you move it 1 pixel to the right, what’s the column width?

It turns out the each pixel interval isn’t a simple multiple of an internal column width interval.

Let’s take Calibri 11 pt 96 DPI again. With a maximum digit width of 8 pixels, each column width interval per pixel is supposedly 1/7 or 1/(max digit width -1).

But wait! It’s not actually 1/7. It’s the largest number of 1/256 multiples that is less than 1/7.

Now 1/7 is about 0.142857142857143. The actual interval is 0.140625, which is 36/256.

4/7 is about 0.571428571428571. The actual interval is 0.5703125, which is 146/256. And you will note that 146 is not equal to (4 * 36).

If you’re using Open XML SDK (or however you choose to access an Open XML Excel file), when you set the column width as 8.142857142857143, internally, Excel will save it as 8.140625.

Here’s some code:

int iPixelWidth = 8;
double fIntervalCheck;
double fInterval;
for (int step = 0; step < iPixelWidth; ++step)
{
    fIntervalCheck = (double)step / (double)(iPixelWidth - 1);
    fInterval = Math.Truncate(256.0 * fIntervalCheck) / 256.0;
    Console.WriteLine("{0:f15} {1:f15}", fIntervalCheck, fInterval);
}

So now you know how the intervals are calculated. But what about the actual column width? Hmm... perhaps another article...

P.S. I'm currently doing research for how to do autofitting for rows and columns for my spreadsheet library. I found this "secret" after fiddling with Excel files for a couple of hours... I know I'm talking about my library a lot, but it's taking up a lot of my brain space right now, so yeah...

Clopen source

A few years ago, when I was working in a job, I attended a social media course. There was a training budget to use up, and I couldn’t find any technical courses worth attending. So I attended the social media one.

At one point in the one-day course, the trainer was showing the attendees how easy it was to create a Wikipedia entry. He showed the update history. He showed how to edit other Wikipedia entries.

Now I’m going to tell you about a book I read recently. Don’t worry, it’s related. The book is “Geekonomics” by David Rice. There’s a chapter on open source (specifically related to software development).

Rice wrote that the openness of the open source movement might also be its downfall. Because anyone can contribute to an open source project, anyone does.

He also wrote that open source project contributors are typically not paid (in the form of money). They contribute for geek cred. And if you don’t contribute anything, you don’t get any geek cred at all.

And so developers typically contribute to features the developers want themselves or that the features are cool, and not because the features are user-requested or even helpful to the project. I mean they’re not paid, so they might as well do something cool.

Remember, no contribution, no geek cred.

Now Wikipedia is successful because if any “amateur” goes in to create “useless” entries or update existing entries with wrong information, there’s someone else who’s willing to go in and change it. The long tail of contributors work here because the entries aren’t typically arcane. And that contributors are motivated enough to make Wikipedia better.

Open source software projects don’t work quite as well. Amateur developers add useless or unnecessary features. No one wants to go in and edit code. Because developers do it for geek cred, if you go in and delete their stuff, even if their stuff is unnecessary or possibly even detrimental, the original developer/contributor is going to be upset with you. Because you’re removing their geek cred.

This had put me in a bit of a quandary. For the purposes of this article, I’ll define “open source” as:

  • Source code is available
  • Licensing is such that users are able to study, change, redistribute the software and the source code

I’m not sure about the licensing part. Any license that qualifies as open-source is good enough for me.

And “closed source” will be defined as “not everything that has to do with the software is made available”. I’ll explain the meaning in a bit.

I made a spreadsheet library and made it open source. It’s open source because the source code is available for anyone to read. I’ve licensed it with the MIT License, which basically allows you to do whatever you want with very few restrictions.

It’s also closed, because I didn’t include the .csproj file and the strong name key file in the downloadable package. There are various reasons, but the main ones are that I intend to keep the branding and that there’s only one “original” SpreadsheetLight library running out there (mine). Anyone who forks my source code can compile it, but the resulting DLL won’t be the “original”, so to speak.

So I consider my project as clopen source. “Clopen” is a mathematical term used in topology to mean both open and closed. I’m not going to bore you with the details. Go Google the definition yourself.

The main point I had was that, while my project was open source, I don’t quite encourage open collaboration. Open collaboration is not a requirement for a project to be considered open source. And it is open source, because you can view all the source code.

My main motivation was that I had a specific vision for the project, which was to make the spreadsheet library as easy to use as possible for developers who are on a tight schedule and don’t have time to learn how to make spreadsheets with a third party library. This meant I had to maintain strict control over things like method signatures and even method/property/class names.

Can you imagine having just anyone coming in to change the source code? Or just adding a feature because they need it, but the intended developer audience doesn’t need it?

Not every contributor is interested in being aligned with the project’s vision and purpose.