A cool LLVM optimization quirk when implementing CRC32

LLVM did a cool optimization and I'm writing about it.

Rosetta Code

I was browsing Rosetta Code's Rust tasks a while ago when I came across the CRC32 article. I enjoy comparing different implementations of the same task in the languages I know, like C, C++, and Rust.

I tried running the Rust code and optimizing it: you see, the author mentions that "This does not perform any caching of the lookup table for simplicity". I wanted to use this as an opportunity to try my hand at Rust's const expressions.

Indeed, the specific algorithm that needs to be implemented in this task is the one used in Gzip, PNG, and many others. A lookup table is used to speed up computations. This is how, in pseudocode, the table is generated:

table = (a table of 256 32-bit integers)
for i from 0 to 255 {
	rem = i
	for j from 0 to 7 {
		if (rem & 1) {
			rem = rem RIGHTSHIFT 1
			rem = rem XOR 0xEDB88320
		} else {
			rem = rem RIGHTSHIFT 1
		}
	}
	table[i] = rem
}
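
Since the table is fully determined ahead of time, it's a natural candidate for Rust's const expressions. Here is a minimal sketch of what that could look like (the names are mine, and since const fn bodies can't use closures or iterator adapters, plain while loops stand in for the pseudocode's for loops):

// A sketch: generating the CRC32 lookup table at compile time.
const fn make_crc32_table() -> [u32; 256] {
	let mut table = [0u32; 256];
	let mut i = 0;
	while i < 256 {
		let mut rem = i as u32;
		let mut j = 0;
		while j < 8 {
			if rem & 1 == 1 {
				rem = rem >> 1;
				rem ^= 0xedb88320;
			} else {
				rem = rem >> 1;
			}
			j += 1;
		}
		table[i] = rem;
		i += 1;
	}
	table
}

// Evaluated once, at compile time.
const CRC32_TABLE: [u32; 256] = make_crc32_table();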

What caught my attention is the right shift being performed in both branches of the check for rem & 1. Surely there's a way to make that more concise! In the Rust submission, the same table generation code is written more idiomatically, using an iterator and a match statement:

crc32_table[n as usize] = (0..8).fold(n as u32, |acc, _| {
	match acc & 1 {
		1 => 0xedb88320 ^ (acc >> 1),
		_ => acc >> 1,
	}
});
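
For context, in the submission this fold sits inside a loop that fills in the whole table; reconstructed from the snippet (not quoted verbatim), the surrounding code looks roughly like this:

let mut crc32_table = [0u32; 256];
for n in 0..256 {
	// Each entry is the 8-round remainder for byte value n.
	crc32_table[n as usize] = (0..8).fold(n as u32, |acc, _| {
		match acc & 1 {
			1 => 0xedb88320 ^ (acc >> 1),
			_ => acc >> 1,
		}
	});
}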

Interestingly, since match is an expression in Rust, it allows for a change that would be more awkward to write in C. Here's how the match can be made less redundant:

match acc & 1 {
	1 => 0xedb88320,
	_ => 0,
} ^ (acc >> 1)

Here, the temporary variable that would have been required in C is made implicit. Pretty neat, I think.

As a side note, the catch-all match arm using an underscore is used instead of specifying 0 because Rust expects a match to handle all possible values of an expression, and the compiler is unaware that zero and one are the only values that acc & 1 can yield.
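
To illustrate (a made-up minimal example, not from the Rosetta Code submission), spelling the second arm as 0 gets rejected:

fn step(acc: u32) -> u32 {
	match acc & 1 {
		1 => 0xedb88320 ^ (acc >> 1),
		0 => acc >> 1,
		// error[E0004]: non-exhaustive patterns:
		// `2_u32..=u32::MAX` not covered
	}
}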

But what about the compiled code?

Surely a compiler wouldn't emit this shift twice just to save one register, right? I wondered whether Rust's LLVM backend optimized this on its own. Using the amazing Compiler Explorer, I made a minimal reproduction. First, the setup code:

pub fn main() {
	let crc32 = compute(std::hint::black_box(0));

	std::hint::black_box(crc32);
}

The calls to black_box are there to prevent the compiler from optimizing the function away entirely, or from assuming that it is always called with the value 0.

#[inline(never)]
pub fn compute(acc: u32) -> u32 {
	match acc & 1 {
		[...]
	}
}

The more redundant match statement with two instances of right-shifting generates the following assembly code when compiled with optimization level 3, the standard optimization level for production code:

; edi is the argument to compute()
example::compute:
	mov     ecx, edi        ; Copy the value of 'acc' into ecx
	shr     ecx             ; Shift ecx right by one bit, and store it back into ecx
	mov     eax, ecx        ; Move the value of ecx into eax
	xor     eax, -306674912 ; Bitwise XOR eax with 0xedb88320 and store it into eax
	test    dil, 1          ; Bitwise AND the lowest 8 bits of edi with 1, effectively storing NOT(acc & 1) into the zero flag (ZF)
	cmove   eax, ecx        ; Copy ecx into eax if the zero flag is set
	ret                     ; eax is the return register

This code is the equivalent of computing both (acc >> 1) and (acc >> 1) ^ 0xedb88320 (only performing the shift once, as expected), then choosing which one to return based on the value of (acc & 1). It uses three registers and one conditional instruction, although I couldn't tell you whether that has a significant performance impact, since cmove is not a branch.
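
Re-expressed in Rust (my own reconstruction of what the assembly does, with a hypothetical function name), it computes both candidates up front, then selects one without branching:

fn compute_cmove(acc: u32) -> u32 {
	let shifted = acc >> 1;           // shr ecx
	let xored = shifted ^ 0xedb88320; // xor eax, -306674912
	if acc & 1 == 0 { shifted } else { xored } // test + cmove
}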

The more concise statement, with only one instance of right-shifting, generates this assembly code:

example::compute:
	mov     eax, edi        ; Copy the value of 'acc' into eax
	and     eax, 1          ; Bitwise AND the value of eax with 1 and store it back into eax
	neg     eax             ; Compute the two's complement of eax and store it back into eax (this is the cool bit!!)
	and     eax, -306674912 ; Bitwise AND the value of eax with 0xedb88320
	shr     edi             ; Shift edi right by one bit, and store it back into edi
	xor     eax, edi        ; Xor eax and edi together and store the result into eax
	ret                     ; eax is the return register again

This code is cool! It seems that LLVM figured out that negating a value of 0 or 1 yields a bitmask of either all zeroes (negating 0) or all ones (negating 1), which can be used to "conditionally" XOR a value without any conditional instruction. The pseudocode equivalent of the assembly is:

return (acc >> 1) ^ (0xedb88320 & -(acc & 1))

This version, despite performing the exact same task, does it with only two registers and no conditional instructions.
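
To convince yourself that the mask trick is sound, a quick check (mine, not from the article) compares both formulations; note that Rust has no unary minus on unsigned integers, so wrapping_neg() plays the role of the assembly's neg:

fn main() {
	for acc in [0u32, 1, 2, 3, 0xEDB88320, u32::MAX - 1, u32::MAX] {
		let branchy = match acc & 1 {
			1 => 0xedb88320 ^ (acc >> 1),
			_ => acc >> 1,
		};
		// -(acc & 1) from the pseudocode, as two's complement wrap-around
		let branchless = (acc >> 1) ^ (0xedb88320 & (acc & 1).wrapping_neg());
		assert_eq!(branchy, branchless);
	}
}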

Finally, the experimental GCC codegen backend for Rust generates assembly similar to the second LLVM output, no matter which of the two match statements it is given.

Why?

Two pieces of Rust code with the exact same meaning generated different optimized code, one of them arguably superior. I'm no expert on optimizing-compiler internals, so I couldn't say why this happens, but I think it's fascinating.

Here in Rust land, we often assume that the code we write will just magically get optimized. That's because, most of the time, it does. But no matter how high-level the code you write is, if you want it to be performant, you will have to care about low-level, nitty-gritty details like this. I've read from an LLVM engineer that Rust's iterators in particular are pretty hard to optimize correctly.

CRC32 lookup table generation is not very likely to be performance-critical, given that you can generate the table at compile time or only once at runtime. However, this kind of bit manipulation is everywhere in performance-critical algorithms. Inspecting the assembly output of performance-critical code, whether in a debugger or in Compiler Explorer, is a great habit to have, and it makes me want to know more about this low-level stuff.