Commenter apit gave me a problem which I am unable to help with: matrix multiplication using systolic arrays.

I searched and I came up with very little. About the only thing I understood was that the “array” was not the variable array in code. It’s an array of processors. **Systolic arrays are used for parallel processing**.

So how is it used for matrix multiplication?

Hmm… I’ve never heard of systolic arrays before this. So I asked my friend and he found a PowerPoint presentation that gave the best explanation that I could understand.

Systolic arrays & their applications, by Jonathan Break.

I believe apit didn’t understand that each “array cell” doesn’t have a memory of past operations, only an intermediate result.

If you happen to understand it, and can explain it better than I do, please chime in with your comment. Thanks.

Ok. You’ve got me working on this problem, and I guess it has intrigued me to create a parallel software solution for this.

Essentially this is how it works:

Each “cell” is a processing unit. In software (or virtual) terms, we call it a “Task” that can run a “Function” assigned to it. In the case of matrix multiplication, the “Function” has 2 inputs and 2 outputs with the function f(x, y) = x * y, and the processing unit “Task” has some memory (RAM, in hardware terms), a “variable”, to sum the results. Then you reconstruct the result back into the output matrix.
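A minimal Python sketch of one such cell (the class and method names are my own invention, just to make the idea concrete):

```python
class Cell:
    """One processing unit: takes x (from the left) and y (from above),
    accumulates x * y, and passes both operands onward."""

    def __init__(self):
        self.acc = 0  # the cell's only memory: a running partial sum

    def step(self, x, y):
        self.acc += x * y  # f(x, y) = x * y, summed into local memory
        return x, y        # forward x to the right, y downward
```

The key point is that `step` has no knowledge of past inputs beyond the accumulated sum, which matches the “no past memory of operations, only an intermediate result” behaviour.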

There is more intricate setup needed to weave each “cell” into “Continuations” so that the data flows from one cell to the next automatically. Of course, rearranging the data is another step that must be taken before flowing it through each “cell”.
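For a plain systolic array, that rearrangement is usually a “skew”: row i of the input stream is delayed by i time steps so operands meet at the right cell at the right time. A sketch (using zero-padding to stand in for empty clock ticks, which is my own convention here):

```python
def skew_rows(A):
    """Delay row i of matrix A by i time steps, padding with zeros,
    so its values arrive at the systolic array's left edge staggered."""
    n = len(A)
    return [[0] * i + row + [0] * (n - 1 - i) for i, row in enumerate(A)]
```

Columns of the second matrix get the same treatment on the way into the top edge of the array.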

What this means is that each “Task” or “Cell” can potentially run in parallel at the same time, with a time complexity of 2n + 1 steps, provided you have sufficient CPU cores to run everything at once.
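The step count falls out of the skewed schedule: the operand pair A[i][k], B[k][j] reaches cell (i, j) at time step i + j + k, so the last pair meets at step 3n − 3. A rough sequential simulation of that schedule (a sketch only, not hardware-accurate; for n = 3 the 3n − 2 total below matches the 2n + 1 count above):

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing A x B by replaying
    the schedule: cell (i, j) consumes A[i][k] and B[k][j] at time i+j+k."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]  # each cell's local memory
    for t in range(3 * n - 2):         # last operands meet at t = 3n - 3
        for i in range(n):
            for j in range(n):
                k = t - i - j          # which operand pair reaches (i, j) now
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc
```

In a real parallel version, everything inside one iteration of the `t` loop would run simultaneously, one task per cell.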

Utilizing the Cannon Technique (Cannon’s algorithm), we can reduce the complexity to n, where n is the size of the matrix (3 in this case). But with the Cannon Technique, the data needs to be rearranged slightly differently before it flows in.
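A sketch of that different rearrangement, simulated sequentially (this is Cannon’s algorithm for square matrices; the rotations below stand in for the nearest-neighbour shifts a real processor grid would do, and the naming is mine):

```python
def cannon_matmul(A, B):
    """Cannon's algorithm on an n x n grid of virtual processors.

    Alignment: row i of A rotates left by i, column j of B rotates up by j.
    Then n compute-and-rotate steps complete the product."""
    n = len(A)
    # initial alignment: a rotation (data wraps around), not zero-padding
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):                 # n steps instead of ~3n
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]  # every cell computes at once
        # rotate A one step left, B one step up
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

After the initial alignment every cell always holds a matching A/B pair, which is why the main loop needs only n steps.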

I’m trying to work on a proof of concept by creating a virtual “cell”, similar to what IBM’s Cell Broadband Engine (CBE) calls an “SPU” or “Synergistic Processing Unit”, that accepts inputs and outputs together with some memory in place. This “cell” will simulate a single “Task” or “Thread” in the software world.
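One way such a virtual “cell” could look as a thread with input and output queues (a sketch under my own assumptions; none of these names come from the CBE’s actual API):

```python
import queue
import threading

class CellTask:
    """A virtual 'cell' run as its own thread: reads an (x, y) pair from
    its input queues, accumulates x * y, and forwards both operands."""

    def __init__(self, x_in, y_in, x_out, y_out):
        self.x_in, self.y_in = x_in, y_in
        self.x_out, self.y_out = x_out, y_out
        self.acc = 0  # the cell's local memory
        self.thread = threading.Thread(target=self.run)

    def run(self):
        while True:
            x = self.x_in.get()
            if x is None:          # sentinel: no more data
                break
            y = self.y_in.get()
            self.acc += x * y
            self.x_out.put(x)      # pass x to the cell on the right
            self.y_out.put(y)      # pass y to the cell below
```

Wiring n x n of these together, with each cell’s output queues serving as its neighbours’ input queues, would give the “Continuations” described earlier.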

This is called “vector processing”, and it’s essentially how programmers will have to start thinking in the future.

Anyway, I don’t know if I’m making sense, but it’s almost 1am, so I’m going to sleep on my 1st naive iteration and refactor it in the morning to look a little like what I described above.

Justin, you are awesome! Thanks for your input.