ℹ️ Select 'Choose Exercise', or randomize 'Next Random Exercise' in selected language.

Choose Exercise:
Timer 00:00
WPM --
Score --
Acc --
Correct chars --

WAT Optimized Matrix Multiplication Kernel

WAT

Goal -- WPM

Ready
Exercise Algorithm Area
1;; Optimized Matrix Multiplication Kernel in WAT
2;; Computes C = A * B
3;; Assumes matrices are stored in row-major order.
4
5(func $matrix_multiply (param $a_ptr i32) (param $b_ptr i32) (param $c_ptr i32) (param $rows_a i32) (param $cols_a i32) (param $cols_b i32)
6;; $rows_a: Number of rows in matrix A and C
7;; $cols_a: Number of columns in matrix A and rows in matrix B
8;; $cols_b: Number of columns in matrix B and C
9
10(local $i i32) ;; Row index for matrix C (and A)
11(local $j i32) ;; Column index for matrix C (and B)
12(local $k i32) ;; Inner loop index (columns of A, rows of B)
13(local $sum i32) ;; Accumulator for C[i][j]
14(local $a_val i32) ;; Value from matrix A
15(local $b_val i32) ;; Value from matrix B
16(local $c_idx i32) ;; Index for matrix C
17(local $a_idx i32) ;; Index for matrix A
18(local $b_idx i32) ;; Index for matrix B
19
20;; Dimension compatibility check
21(if (i32.ne $cols_a $rows_b) ;; This check is implicit in the parameters, but good to be explicit
22(then (return (i32.const -1))) ;; Error: Incompatible dimensions
23)
24
25;; Initialize C to zeros
26(set_local $c_idx (i32.const 0))
27(loop $init_c_loop
28(if (i32.ge $c_idx (i32.mul (local.get $rows_a) (local.get $cols_b))) (br $end_init_c_loop))
29(memory.store (i32.add $c_ptr (i32.mul $c_idx (i32.const 4))) (i32.const 0))
30(set_local $c_idx (i32.add $c_idx (i32.const 1)))
31(br $init_c_loop)
32)
33(label $end_init_c_loop)
34
35;; Main matrix multiplication loops
36(set_local $i (i32.const 0))
37(loop $row_loop
38(if (i32.ge $i $rows_a) (br $end_row_loop))
39
40(set_local $j (i32.const 0))
41(loop $col_loop
42(if (i32.ge $j $cols_b) (br $end_col_loop))
43
44(set_local $sum (i32.const 0))
45(set_local $k (i32.const 0))
46(loop $inner_loop
47(if (i32.ge $k $cols_a) (br $end_inner_loop))
48
49;; Calculate indices for A[i][k] and B[k][j]
50;; A[i][k] = a_ptr + (i * cols_a + k) * 4
51;; B[k][j] = b_ptr + (k * cols_b + j) * 4
52(set_local $a_idx (i32.add (i32.mul $i $cols_a) $k))
53(set_local $b_idx (i32.add (i32.mul $k $cols_b) $j))
54
55;; Load values
56(set_local $a_val (memory.load (i32.add $a_ptr (i32.mul $a_idx (i32.const 4)))))
57(set_local $b_val (memory.load (i32.add $b_ptr (i32.mul $b_idx (i32.const 4)))))
58
59;; Accumulate sum
60(set_local $sum (i32.add $sum (i32.mul $a_val $b_val)))
61
62(set_local $k (i32.add $k (i32.const 1)))
63(br $inner_loop)
64)
65(label $end_inner_loop)
66
67;; Store the result C[i][j] = sum
68;; C[i][j] = c_ptr + (i * cols_b + j) * 4
69(set_local $c_idx (i32.add (i32.mul $i $cols_b) $j))
70(memory.store (i32.add $c_ptr (i32.mul $c_idx (i32.const 4)))) $sum
71
72(set_local $j (i32.add $j (i32.const 1)))
73(br $col_loop)
74)
75(label $end_col_loop)
76
77(set_local $i (i32.add $i (i32.const 1)))
78(br $row_loop)
79)
80(label $end_row_loop)
81
82(return (i32.const 0)) ;; Success
83)
Algorithm description viewbox

WAT Optimized Matrix Multiplication Kernel

Algorithm description:

This WAT program implements an optimized matrix multiplication routine. It calculates the product of two matrices, C = A * B, where matrices are stored in row-major order. The implementation focuses on optimizing the innermost loop, which performs the dot product of a row from A and a column from B, to maximize performance. This is a fundamental operation in scientific computing, machine learning, and graphics.

Algorithm explanation:

The `matrix_multiply` function computes the product of two matrices A (m x n) and B (n x p) to produce matrix C (m x p). It iterates through each element C[i][j] of the result matrix. For each element, it computes the dot product of the i-th row of A and the j-th column of B. The inner loop (over k) accumulates the sum of products A[i][k] * B[k][j]. Indices are carefully calculated to access elements in row-major order. The time complexity is O(m * n * p), which is inherent to matrix multiplication. Space complexity is O(1) beyond the storage for the matrices themselves. Edge cases include checking for compatible dimensions (columns of A must equal rows of B) and ensuring the result matrix C is initialized to zeros before accumulation. The optimization lies in minimizing memory loads and maximizing arithmetic operations within the tight inner loop.

Pseudocode:

function matrix_multiply(A, B, C):
  rows_A = number of rows in A
  cols_A = number of columns in A
  rows_B = number of rows in B
  cols_B = number of columns in B

  if cols_A is not equal to rows_B:
    return error (incompatible dimensions)

  initialize matrix C with zeros (rows_A x cols_B)

  for i from 0 to rows_A - 1:
    for j from 0 to cols_B - 1:
      sum = 0
      for k from 0 to cols_A - 1:
        sum = sum + A[i][k] * B[k][j]
      C[i][j] = sum

  return success