Why use 'xor eax, eax'?

The instruction 'xor eax, eax' zeroes the EAX register, a common idiom in x86 assembly language. XOR performs a bitwise exclusive OR on its operands, and XORing any value with itself yields zero, so using the same register for both operands clears it. The idiom is preferred over 'mov eax, 0' because it is smaller (2 bytes, 31 c0, versus 5 bytes, b8 00 00 00 00) and because many x86 CPUs recognize it as a zeroing idiom that does not depend on the register's previous value.

daeken

about 1 month ago

1 reply

Back in 2005 or 2006, I was working at a little startup with "DVD Jon" Johansen and we'd have Quake 3 tournaments to break up the monotony of reverse-engineering and juggling storage infrastructure. His name was always "xor eax,eax" and I always just had to laugh at the idea of getting zeroed out by someone with that name. (Which happened a lot -- I was good, but he was much better!)

VectorLock

about 1 month ago

I was there but never got in on the Quake 3 fun; mp3t**

OgsyedIE

about 1 month ago

1 reply

The page crashes after 3 seconds, 100% of the time, on the latest version of Android Chrome and works fine on Brave, fyi.

robmccoll

about 1 month ago

This is not my experience on the latest version of Chrome Android (142.0.7444.171). It did not crash for me.

pansa2

about 1 month ago

3 replies

> Unlike other partial register writes, when writing to an e register like eax, the architecture zeros the top 32 bits for free.

I’m familiar with 32-bit x86 assembly from writing it 10-20 years ago. So I was aware of the benefit of xor in general, but the above quote was new to me.

I don’t have any experience with 64-bit assembly - is there a guide anywhere that teaches 64-bit specifics like the above? Something like “x64 for those who know x86”?

sparkie

about 1 month ago

2 replies

It's not only xor that does this: most 32-bit operations zero-extend their result into the full 64-bit register. AMD did this for backward compatibility, so existing programs would mostly continue working, unlike Intel's earlier attempt at 64 bits, which was an entirely new design.

The reason `xor eax,eax` is preferred to `xor rax,rax` is due to how the instructions are encoded - it saves one byte which in turn reduces instruction cache usage.

When using 64-bit operations, a REX prefix is required on the instruction (byte 0x40..0x4F), which serves two purposes: the MSB of the low nybble (W) being set (i.e., REX prefixes 0x48..0x4F) indicates a 64-bit operation, and the low 3 bits of the low nybble allow using registers r8-r15 by providing an extra bit for the ModRM register field and the base and index fields in the SIB byte, as only 3 bits (8 registers) are provided by x86.

A recent addition, APX, adds an additional 16 registers (r16-r31), which need 2 additional bits. There's a REX2 prefix for this (0xD5 ...), which is a two byte prefix to the instruction. REX2 replaces the REX prefix when accessing r16-r31, still contains the W bit, but it also includes an `M0` bit, which says which of the two main opcode maps to use, which replaces the 0x0F prefix, so it has no additional cost over the REX prefix when accessing the second opcode map.
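A minimal sketch of the REX encoding described above, written in Python purely for illustration. The helper names `rex` and `encode_xor_same` are made up for this example, and only the register-direct form of `xor r, r` (opcode 0x31) is modelled; REX2/APX is left out.

```python
def rex(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte with the bit layout 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def encode_xor_same(reg, wide):
    """Encode `xor reg, reg` (opcode 0x31 /r, register-direct ModRM).

    reg  -- register number 0..15 (0 = rax/eax, 8 = r8/r8d)
    wide -- True for the 64-bit form, False for the 32-bit form
    """
    ext = reg >> 3                      # fourth register bit, goes into REX.R and REX.B
    low = reg & 7                       # the three bits that fit in ModRM
    modrm = 0xC0 | (low << 3) | low     # mod=11 (register), reg=low, rm=low
    prefix = rex(w=int(wide), r=ext, b=ext)
    # A bare 0x40 prefix would change nothing here, so it is omitted.
    head = bytes([prefix]) if prefix != 0x40 else b""
    return head + bytes([0x31, modrm])


print(encode_xor_same(0, wide=False).hex(" "))  # xor eax, eax -> 31 c0
print(encode_xor_same(0, wide=True).hex(" "))   # xor rax, rax -> 48 31 c0
print(encode_xor_same(8, wide=False).hex(" "))  # xor r8d, r8d -> 45 31 c0
```

The 32-bit form of the low eight registers needs no prefix at all, which is where the one-byte saving over `xor rax, rax` comes from.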

cesarb

about 1 month ago

1 reply

> It's not only xor that does this, but most 32-bit operations zero-extend the result of the 64-bit register. AMD did this for backward compatibility.

It's not just that, zero-extending or sign-extending the result is also better for out-of-order implementations. If parts of the output register are preserved, the instruction needs an extra dependency on the original value.
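To make the dependency point concrete, here is a small Python model of register writes at different widths. It is only a sketch: `write_reg` is a hypothetical helper, and it covers the low 8-, 16-, 32- and 64-bit views (al, ax, eax, rax) but not the high-byte registers such as ah.

```python
MASK64 = (1 << 64) - 1


def write_reg(old, value, bits):
    """Return the new 64-bit register contents after a write of `bits` width."""
    if bits == 64:
        return value & MASK64
    if bits == 32:
        # 32-bit writes zero-extend: the old value is not an input at all.
        return value & 0xFFFFFFFF
    if bits in (8, 16):
        # Narrow writes merge into the old value, so they depend on it.
        mask = (1 << bits) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("unsupported width")


rax = 0x1122334455667788
print(hex(write_reg(rax, 0, 32)))  # like xor eax, eax -> 0x0
print(hex(write_reg(rax, 0, 16)))  # like xor ax, ax   -> 0x1122334455660000
```

Only the 8- and 16-bit branches read `old`, which is the extra dependency a partial write carries.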

ychen306

about 1 month ago

This. It's for renaming.

nickelpro

about 1 month ago

Except for `xchg eax, eax`, which was the canonical nop on x86. Because it was supposed to do nothing, having it zero out the top 32-bits of rax would be quite surprising. So it doesn't.

Instead you need to use the multi-byte, general purpose encoding of `xchg` for `xchg eax, eax` to get the expected behavior.

veltas

about 1 month ago

Chapter 3 of volume 1, ctrl+f for "64-bit mode", has a lot of the essentials including e.g. the stuff about zeroing out the top half of the register.

https://www.intel.com/content/www/us/en/developer/articles/t...

matt_d

about 1 month ago

See https://github.com/MattPD/cpplinks/blob/master/assembly.x86.... - mostly focused on x86-64 (and some of the talks/tutorials offer pretty good overview)

snvzz

about 1 month ago

4 replies

Because, unlike RISC-V, x86 has no x0 register.

jabl

about 1 month ago

5 replies

From your past posting history, I presume that you're implying this makes RISC-V better?

Do we have any data showing that having a dedicated zero register is better than a short and canonical instruction for zeroing an arbitrary register?

kevin_thibedeau

about 1 month ago

1 reply

It's a definite liability on a machine with only 8 general purpose registers. Losing 12% of the register space for a constant would be a waste of hardware.

menaerus

about 1 month ago

3 replies

8 registers? Ever heard of register renaming?

Polizeiposaune

about 1 month ago

1 reply

Ever heard of a loop that needed to keep more than 7 variables live? Register renaming helps with pipelining and out-of-order execution, but instructions in the program can only reference the architectural registers - go beyond that and you end up needing to spill some values to (architectural) memory.

There's a reason why AMD added r8-r15 to the architecture, and why Intel is adding r16-r31.

menaerus

about 1 month ago

2 replies

I have but that was not the point? My first point was exactly that there are more ISA registers and not only 8, and therefore the question mark. My second point was about register renaming which, contrary to what you say, does mitigate the artifacts of running out of registers by spilling the variables to the stack memory. It does it by eliminating the false dependencies between variables/registers and xor eax, eax is a great candidate for that.

saagarjha

about 1 month ago

1 reply

Register renaming does not let you avoid spills.

menaerus

about 1 month ago

1 reply

Ok, it obviously doesn't increase the number of ISA registers. What I am suggesting is something else - imagine a situation in which the compiler understands that the spill over will take place, and therefore rearranges the code such that it reduces the pressure on the registers. It can do that if it can break the data dependencies between the variables for instance. Or it can do that by unrolling the loops or by moving the initialization closer to where the variable is being used, no? I am pretty certain that compilers are already doing these kinds of transformations, and in a sense this is taking advantage of register renaming but indirectly.

saagarjha

29 days ago

Yes, this is how compilers work today.

pwg

about 1 month ago

Register renaming allows the CPU to execute in parallel instructions it might otherwise need to serialize.

But it does nothing to help you, the programmer, when your algorithm really needs to have 9 registers worth of data in registers and your CPU only has 8 architectural registers available to you. At that point, you either spill manually, or you take the performance hit from keeping the ninth value in memory instead of a register.

account42

about 1 month ago

That's irrelevant, the zero register would be taking a slot in the limited register addressing bits in instructions, not replace a physical register on the chip.

kevin_thibedeau

about 1 month ago

8086 doesn't have that.

dooglius

about 1 month ago

1 reply

I think one could just pick a convention where a particular GP register is zeroed at program startup and just make your own zero register that way, getting all the benefits at very small cost. The microarchitecture AIUI has a dedicated zero register so any processor-level optimizations would still apply.

pklausler

about 1 month ago

That’s what was done on the CDC 6600 with two handy values, B0 (0) and B1 (1).

gruez

about 1 month ago

2 replies

It's basically the eternal debate of RISC vs CISC (x86). RISC proponents claim RISC is better because it's simpler to decode. CISC proponents retort that CISC means code can be more compact, which helps with cache hits.

bluGill

about 1 month ago

1 reply

In the real world there is no CISC or RISC anymore. RISC is always extended to some new feature and suddenly becomes more complex. Meanwhile CISC is just a decoder over a RISC processor. Either way you get the best of both worlds: simple hardware (the RISC internals) and CISC instructions that do what you need.

Don't get too carried away in the above, x86 is still a lot more complex than ARM or RISC-V. However the complexity is only a tiny part of a CPU and so it doesn't matter.

snvzz

about 1 month ago

You seem to be confusing ISA and microarchitecture.

Modern ISAs try really hard to be independent from microarchitecture.

snvzz

about 1 month ago

>CISC proponents retort that CISC means code can be more compact,

RISC-V has the most compact code on 64bit, with margin to boot.

On 32bit, it used to be behind Thumb2, but it's the best as of the bit manipulation and extra compressed extensions circa 2021.

phire

about 1 month ago

1 reply

The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

You don't need a mov instruction, you just OR with $zero. You don't need a load immediate instruction you just ADDI/ORI with $zero. You don't need a Neg instruction, you just SUB with $zero. All your Compare-And-Branch instructions get a compare with $zero variant for free.

I refuse to say this "zero register" approach is better, it is part of a wide design with many interacting features. But once you have 31 registers, it's quite cheap to allocate one register to be zero, and may actually save encoding space elsewhere. (And encoding space is always an issue with fixed width instructions).

AArch64 takes the concept further: they have a register that sometimes acts as the zero register (when used in ALU instructions) and other times is the stack pointer (when used in memory instructions and a few special stack instructions).

phkahler

about 1 month ago

3 replies

>> The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

Which is funny because IMHO RISC-V instruction encoding is garbage. It was all optimized around the idea of fixed length 32-bit instructions. This leads to weird sized immediates (12 bits?) and 2 instructions to load a 32 bit constant. No support for 64 bit immediates. Then they decided to have "compressed" instructions that are 16 bits, so it's somewhat variable length anyway.

IMHO once all the vector, AI and graphics instructions are nailed down they should make RISC-VI where it's almost the same but re-encoding the instructions. Have sensible 16-bit ones, 32-bit, and use immediate constants after the opcodes. It seems like there is a lot they could do to clean it up - obviously not as much as x86 ;-)

zozbot234

about 1 month ago

1 reply

There's not a strong case for redoing the RISC-V encoding with a new RISC-VI unless they run out of 32-bit encoding space outright, due to e.g. extensive new vector-like or AI-like instructions. And then they could free up a huge amount of encoding space trivially by moving to a 2-address format throughout with Rd=Rs1 and using a simple instruction fusion approach MOV Rd ← Rs1; OP Rd ← etc. for the former 3-address case.

(Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.)

phkahler

about 1 month ago

1 reply

>> Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.

I really like the idea of composition or standard prefixes. My favorite is the idea of replacing cmp/branch with "if". Where the condition is a predicate for the following instruction. For RISC-V it would eat a large part of the 16bit opcodes. Some form of load/store might be a good use for the remaining 16bit ops. Other things that might be a good prefix could be encoding data types (8,16,32,64 bit, sign extended, float, double) or a source/destination register. It might be interesting to see how a full ISA might be decomposed into smaller instruction fragments.

zozbot234

about 1 month ago

1 reply

> "if". Where the condition is a predicate for the following instruction

This is just a forward skip, which is optimized to a predicated insn already in some implementations.

phkahler

about 1 month ago

>> > "if". Where the condition is a predicate for the following instruction

>> This is just a forward skip, which is optimized to a predicated insn already in some implementations.

True, but make it a 16bit prefix and apply to all (or selected) instructions.

adgjlsfhk1

about 1 month ago

1 reply

IMO the RISC-V decoding is really elegant (arguably excepting the C extension). Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory). Most 64 bit constants in use can be sign extended from much smaller values, and for those that can't, supporting 72 bit (or bigger) instructions just to be able to load a 64 bit immediate will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism), and will only be 2 cycles faster than an L1 cache load (if the instruction is hot). 32 bit immediate would be kind of nice, but the benefit is pretty small. An x86 instruction with 32 bit immediate is 6 bytes, while the 2 RISC-V instructions are 8 bytes. There have been proposals to add 48 bit instructions, which would let RISC-V have 32 bit immediate support with the same 6 bytes as x86 (and 12-byte, 2-instruction 64 bit loads vs 10 bytes for x86 in the very rare situations where doing so will be faster than a load).

ISA design is always a tradeoff, https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

Tuna-Fish

about 1 month ago

2 replies

> Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory)

Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant in the instruction fetch stream is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance.

> Most 64 bit constants in use can be sign extended from much smaller values

You should obviously also have smaller load instructions.

> will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism)

No, just have more fetch throughput.

> and will only be 2 cycles faster than a L1 cache load

Only on tiny machines will L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting it from the fetch.

And that's when it's not a jump target, when it's a jump target suddenly loading it using a load instruction adds 12+ cycles of latency.

> TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

camel-cdr

about 1 month ago

> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

AFAIK, the reason RISC-V supports alternative link registers is that it allows for efficient -msave-restore, keeps the encoding orthogonal to LUI/AUIPC and using the smaller immediate didn't impact codegen much.

adgjlsfhk1

about 1 month ago

> No, just have more fetch throughput.

Fetch throughput isn't unlimited. Modern x86 CPUs only have ~16-32B/cycle (from L2 once you're out of the uop cache). If you decode a single 10 byte instruction you're already using up a huge amount of the available decode bandwidth.

There absolutely are cases where a 64 bit load instruction would be an advantage, but ISA design is always a case of tradeoffs. Allowing 10 byte instructions has real cost in decode complexity, instruction bandwidth requirements, ensuring cacheline/pageline alignment etc. You have to weigh against that how frequent the instruction would be as well as what your alternative options are. Most immediates are small, and many others can be efficiently synthesized via 2 other instructions (e.g. shifts/xors/nots) and any synthesis that is 2 instructions or fewer will be cheaper than doing a load anyway. As a result you would end up massively complicating your architecture/decoders to benefit a fairly rare instruction which probably isn't worthwhile. It's notable that aarch64 makes the same tradeoff here and Apple's M series processors have an IPC advantage over the best x86.

> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

This mostly seems like a mistake to me. The rationale probably is that you need the other instructions anyway (not all jumps are returns), so adding a jal that doesn't take a register would take a decent percentage of the opspace, but the extra 5 bits would be very nice.

kruador

about 1 month ago

ARM64 also has fixed length 32-bit instructions. Yes, immediates are normally small and it's not particularly orthogonal as to how many bits are available.

The largest MOV available is 16 bits, but those 16 bits can be shifted by 0, 16, 32 or 48 bits, so the worst case for a 64-bit immediate is 4 instructions. Or the compiler can decide to put the data in a PC-relative pool and use ADR or ADRP to calculate the address.

ADD immediate is 12 bits but can optionally apply a 12-bit left-shift to that immediate, so for immediates up to 24 bits it can be done in two instructions.

ARM64 decoding is also pretty complex, far less orthogonal than ARM32. Then again, ARM32 was designed to be decodable on a chip with 25,000 transistors, not where you can spend thousands of transistors to decode a single instruction.
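The 16-bit chunking described above can be sketched in a few lines of Python. This is an illustration, not how any particular compiler does it: `movz_movk_sequence` is a hypothetical helper, and it ignores alternatives such as MOVN or bitmask immediates that real toolchains also consider.

```python
def movz_movk_sequence(value):
    """Split a 64-bit constant into MOVZ/MOVK steps of 16 bits each.

    Returns a list of (mnemonic, imm16, shift) tuples. Zero chunks are
    skipped, so fewer than four instructions are used when possible.
    """
    value &= (1 << 64) - 1
    chunks = [(value >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]
    steps = []
    for index, imm16 in enumerate(chunks):
        if imm16 == 0:
            continue
        # The first emitted instruction zeroes the rest of the register
        # (MOVZ); later ones keep the other bits in place (MOVK).
        mnemonic = "movz" if not steps else "movk"
        steps.append((mnemonic, imm16, index * 16))
    # An all-zero constant still needs one instruction.
    return steps or [("movz", 0, 0)]


for mnemonic, imm16, shift in movz_movk_sequence(0x1234_0000_ABCD_0042):
    print(f"{mnemonic} x0, #{imm16:#x}, lsl #{shift}")
```

A constant with all four chunks non-zero, such as 0xFFFFFFFFFFFFFFFF, comes out as four steps in this sketch, which is the worst case mentioned above.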

wongarsu

about 1 month ago

MIPS for example also has one, along with a similar number of registers (~32). So it's not like RISC-V took a radical new position here, they were able to look back at what worked and what didn't, and decided that for their target a zero register was the right tradeoff. It's certainly the more "elegant" solution. A zero register is useful as input or output register for all kinds of operations, not just for zeroing.

crote

about 1 month ago

4 replies

And the other way around: RISC-V doesn't have a move instruction so that's done as "dst = src + 0", and it doesn't have a nop instruction so that's done as "x0 = x0 + 0". There's like a dozen of them.

It's quite interesting what neat tricks roll out once you've got a guaranteed zero register - it greatly reduces the number of distinct instructions you need for what is basically the same operation.
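A toy Python model shows how these fall out of a hardwired zero register. Everything here is a simplified sketch: the `RegisterFile` class and the helper functions are invented for illustration, registers are 32 bits wide, and the 12-bit limit on real `addi` immediates is not enforced.

```python
class RegisterFile:
    """Toy 32-entry register file where x0 always reads as zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._regs[n]

    def write(self, n, value):
        if n != 0:                      # writes to x0 are discarded
            self._regs[n] = value & 0xFFFFFFFF


# Two "real" instructions.
def addi(rf, rd, rs1, imm):
    rf.write(rd, rf.read(rs1) + imm)


def sub(rf, rd, rs1, rs2):
    rf.write(rd, rf.read(rs1) - rf.read(rs2))


# Pseudo-instructions built from them via x0.
def li(rf, rd, imm):
    addi(rf, rd, 0, imm)                # rd = x0 + imm


def mv(rf, rd, rs):
    addi(rf, rd, rs, 0)                 # rd = rs + 0


def neg(rf, rd, rs):
    sub(rf, rd, 0, rs)                  # rd = x0 - rs


def nop(rf):
    addi(rf, 0, 0, 0)                   # x0 = x0 + 0, no visible effect


rf = RegisterFile()
li(rf, 5, 42)
mv(rf, 6, 5)
neg(rf, 7, 5)
nop(rf)
print(rf.read(6), hex(rf.read(7)), rf.read(0))  # 42 0xffffffd6 0
```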

Findecanor

about 1 month ago

There is a `c.mv` instruction in the compressed set, which most RISC-V processors implement.

That, and `add rd, rs, x0` could (like the zeroing idiom on x86), run entirely in the register-renaming stage of a processor.

RISC-V does actually have quite a few such idioms. Some idioms are multi-instruction sequences ("macro ops") that could get folded into single micro-ops ("macro-op fusion"/"instruction fusion"): for example `lui` followed by `addi` for loading a 32-bit constant, and left shift followed by right shift for extracting a bitfield.
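For the `lui` + `addi` pair, the one subtlety is that `addi` sign-extends its 12-bit immediate, so the upper part sometimes has to be bumped by one. A small Python sketch of the split, with `split_lui_addi` and `materialize` as hypothetical helper names and a 32-bit register assumed:

```python
def split_lui_addi(value):
    """Split a 32-bit constant into (hi20, lo12) for `lui` + `addi`.

    `addi` sign-extends its 12-bit immediate, so when bit 11 of the
    constant is set the upper part is rounded up to compensate.
    """
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000                  # reinterpret as signed 12-bit
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def materialize(hi20, lo12):
    """Model what the instruction pair computes in a 32-bit register."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


hi, lo = split_lui_addi(0xDEADBEEF)
print(f"lui a0, {hi:#x}; addi a0, a0, {lo}")  # lui a0, 0xdeadc; addi a0, a0, -273
```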

kruador

about 1 month ago

ARM64 assembly has a MOV instruction, but for most of the ways it's used, it's an alias in the assembler to something else. For example, MOV between two registers actually generates ORR rd, rZR, rm, i.e. rd := (zero-register) OR rm. Or, a MOV with a small immediate is ORR rd, rZR, #imm.

If trying to set the stack pointer, or copy the stack pointer, instead the underlying instruction is ADD SP, Xn, #0 i.e. SP = Xn + 0. This is because the stack pointer and zero register are both encoded as register 31 (11111). Some instructions allow you to use the zero register, others the stack pointer. Presumably ORR uses the zero register and ADD the stack pointer.

NOP maps to HINT #0. There are 128 HINT values available; anything not implemented on this processor executes as a NOP.

There are other operations that are aliased like CMP Xm, Xn is really an alias for SUBS XZR, Xm, Xn: subtract Xn from Xm, store the result in the zero register [i.e. discard it], and set the flags. RISC-V doesn't have flags, of course. ARM Ltd clearly considered them still useful.

There are other oddities, things like 'rotate right' is encoded as 'extract register from pair of registers', but it specifies the same source register twice.

Disassemblers do their best to hide this from you. ARM list a 'preferred decoding' for any instruction that has aliases, to map back to a more meaningful alias wherever possible.

dist1ll

about 1 month ago

Another one is "jalr x0, imm(x0)", which turns an indirect branch into a direct jump to address "imm" in a single instruction w/o clobbering a register. Pretty neat.

pwg

about 1 month ago

The DEC Alpha chip was the same. It also had a hardwired zero register (although IIRC the zero register was r31 instead of r0) and about half the addressing modes and a whole bunch of "assembly instructions" were created by interesting uses of that zero register.

gpderetta

about 1 month ago

x86 doesn't need a zero register as it can encode constants in the instruction itself.

Findecanor

about 1 month ago

x86 has no architectural zero register, but a x86 CPU could have a microarchitectural zero register.

And when the instruction decoder in such a CPU with register renaming sees `xor eax, eax`, it just makes `eax` point to the zero register for instructions after it. It does not have to put any instruction into the pipeline, and it takes effectively 0 cycles. That is what makes the "zeroing idiom" so powerful.
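As a rough illustration of that idea, here is a toy rename stage in Python. It is a sketch under heavy assumptions: the `Renamer` class, its two-operand instruction format, and the notion of physical register 0 being the hardwired zero are all invented for this example, and real microarchitectures differ in many details.

```python
class Renamer:
    """Toy rename stage: maps architectural names to physical registers."""

    ZERO = 0                            # physical register assumed hardwired to 0

    def __init__(self, names):
        self.next_free = 1
        self.issued = []                # micro-ops passed on to execution
        self.table = {name: self._allocate() for name in names}

    def _allocate(self):
        phys = self.next_free
        self.next_free += 1
        return phys

    def rename(self, op, dst, src):
        """Handle a two-operand instruction `op dst, src`."""
        if op == "xor" and dst == src:
            # Zeroing idiom: repoint dst at the zero register. Nothing is
            # issued and the old value of dst is never read.
            self.table[dst] = self.ZERO
            return
        sources = (self.table[dst], self.table[src])
        phys = self._allocate()
        self.table[dst] = phys
        self.issued.append((op, phys, sources))


r = Renamer(["eax", "ebx"])
r.rename("add", "eax", "ebx")   # issued, reads the old eax and ebx
r.rename("xor", "eax", "eax")   # handled entirely in the rename table
r.rename("add", "ebx", "eax")   # issued, reads ebx and the zero register
print(r.issued)
```

Only the two `add` instructions end up in `issued`; the `xor` just changes a table entry.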

omnicognate

about 1 month ago

2 replies

It happens to be the first instruction of the first snippet in the wonderful xchg rax, rax.

https://www.xorpd.net/pages/xchg_rax/snip_00.html

dooglius

about 1 month ago

1 reply

Not sure what I am looking at here. Is this just a bunch of different ways to zero registers?

omnicognate

about 1 month ago

It's a collection of interesting assembly snippets ("gems and riddles" in the author's words) presented without commentary. People have posted annotated "solutions" online, but figuring out what the snippets do and why they are interesting is the fun of it.

It's also available as an inscrutable printed book on Amazon.

mubou2

about 1 month ago

2 replies

That music when you click "int" is awesome. Reminds me of the good ol' days of keygens.

Audiophilip

about 1 month ago

It's a chiptune-style xm module, "Funky Stars" by Quazar: https://soundcloud.com/scene_music/funky-stars

therein

about 1 month ago

Keygen music will always have a special place in my heart. This is a good one.

I do wonder who was the first cracker that thought of including a keygen music that started the tradition.

I also miss how different groups competed with each other and boasted about theirs while dissing others in readmes.

Readmes would have a .NFO suffix, and that would try to load in some Windows tool, but you had to open them in Notepad. Good times.

eb0la

about 1 month ago

6 replies

I remember a lot of code zeroing registers, dating back at least to the IBM PC XT days (before the 80286).

If you decode the instruction, it makes sense to use XOR:

- mov ax, 0 - needs 4 bytes (66 b8 00 00)
- xor ax,ax - needs 3 bytes (66 31 c0)

This extra byte did matter on a machine with less than 1 megabyte of memory.

On 386 processors it was also the case:

- mov eax,0 - needs 5 bytes (b8 00 00 00 00)
- xor eax,eax - needs 2 bytes (31 c0)

Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.

vardump

about 1 month ago

1 reply

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.
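The byte counts in this exchange can be reproduced with a short Python sketch. `encode_zeroing` is a hypothetical helper that only knows the two accumulator encodings discussed here (opcode b8 for `mov`, 31 c0 for `xor`) and the 0x66 operand-size prefix.

```python
def encode_zeroing(mode_bits, operand_bits):
    """Return (mov_bytes, xor_bytes) for zeroing the accumulator.

    mode_bits    -- default operand size of the running code: 16 or 32
    operand_bits -- size of the register being zeroed: 16 (ax) or 32 (eax)
    """
    # 0x66 selects the non-default operand size for a single instruction.
    prefix = b"\x66" if operand_bits != mode_bits else b""
    imm = (0).to_bytes(operand_bits // 8, "little")
    mov = prefix + b"\xb8" + imm        # mov (e)ax, 0
    xor = prefix + b"\x31\xc0"          # xor (e)ax, (e)ax
    return mov, xor


for mode, size in [(16, 16), (32, 16), (32, 32)]:
    mov, xor = encode_zeroing(mode, size)
    print(f"{mode}-bit code, {size}-bit register: mov={len(mov)} xor={len(xor)} bytes")
```

The 4-byte and 3-byte figures quoted above correspond to the middle case, a 16-bit register zeroed from 32-bit code; in 16-bit real mode the prefix disappears and the sizes drop to 3 and 2.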

eb0la

about 1 month ago

My fault: I just compiled the instruction with an assembler instead of looking up the actual instruction from documentation.

It makes much more sense: resetting ax and bx (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster to fetch by the x86 than the 3-octet version I wrote before.

RHSeeger

about 1 month ago

1 reply

> the IBM PC XT days (before the 80286)

Fun fact - the IBM PC XT also came in a 286 model (the XT 286).

eb0la

about 1 month ago

1 reply

You're right. I forgot that!

RHSeeger

about 1 month ago

To be fair, I only remember because that was the 2nd computer I owned.

Anarch157a

about 1 month ago

1 reply

I don't know enough of the 8086 so I don't know if this works the same, but on the Z80 (which means it was probably true for the 8080 too), XOR A would also clear pretty much all bits on the flag register, meaning the flags would be in a known state before doing something that could affect them.

vanderZwan

about 1 month ago

Which I guess is the same reason why modern Intel CPU pipelines can rely on it for pipelining.

Someone

about 1 month ago

1 reply

> If you decode the instruction, it makes sense to use XOR:

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

Except, apparently, on the Pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:

“But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”

qingcharles

about 1 month ago

1 reply

That's weird, I looked it up earlier and found the P6 (Pentium Pro) was the first to actually make the xor optimization into a zero clock operation.

https://fanael.github.io/archives/topic-microarchitecture-ar...

Someone

about 1 month ago

A few paragraphs down from that:

“I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.

Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.”

Sharlin

about 1 month ago

2 replies

As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there's only a few tens of kBs of L1 instruction cache.

Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

cogman10

about 1 month ago

2 replies

L1 instruction cache is backed by L2 and L3 caches.

For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if you have the X3D version).

I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

The more important part, at this point, is it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special cases. You can fit a lot of 8 byte instructions into 1280kb of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for loop` with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop. That's to avoid splitting a cache line.

Sharlin

about 1 month ago

2 replies

> For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core)

Ryzen 9 CPUs have 1280kB of L1 in total. 80kB (48+32) per core, and the 9 series is the first in the entire history of Ryzens to have some other number than 64 (32+32) kilobytes of L1 per core. The 16MB L2 figure is also total. 1MB per core, same as the 7 series. AMD obviously touts the total, not per-core, amounts in their marketing materials because it looks more impressive.

kbolino

about 1 month ago

Also, rather importantly, the L1i cache is still only 32 kB. You can't execute instructions stored in the L1d cache, which is the one that got bigger.

monocasa

about 1 month ago

Yeah, the reason for that is that it's expensive in PPA for the size of an L1 cache to exceed number of ways times page size. The jump to 48kB was also a jump to 12 way set associative.

As an aside, zen 1 did actually have a 64kB (and only 4 way!) L1I cache, but changed to the page size times way count restriction with zen 2, reducing the L1 size by half.

You can also see this on the Apple side, where their giant 192kB L1I caches are 12 ways with a 16kB page size.
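The arithmetic behind the "ways times page size" rule of thumb, as a quick Python check. The function names are made up for this sketch, and the figures are the ones quoted in this thread rather than independently verified specifications.

```python
def max_l1_size_kib(ways, page_kib):
    """Largest L1 size (in kB) allowed by the ways-times-page-size rule."""
    return ways * page_kib


def fits_rule(size_kib, ways, page_kib):
    """True if a cache of `size_kib` stays within the rule."""
    return size_kib <= max_l1_size_kib(ways, page_kib)


examples = [
    ("48 kB, 12-way, 4 kB pages", 48, 12, 4),
    ("64 kB, 4-way, 4 kB pages (Zen 1 L1I)", 64, 4, 4),
    ("192 kB, 12-way, 16 kB pages", 192, 12, 16),
]
for label, size, ways, page in examples:
    limit = max_l1_size_kib(ways, page)
    print(f"{label}: limit {limit} kB, fits={fits_rule(size, ways, page)}")
```

By this check the 64 kB 4-way case is the only one that exceeds its limit (16 kB), consistent with it being described as the exception.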

gpderetta

about 1 month ago

Instruction caches also prefetch very well, as long as branch prediction is good. Of course on a misprediction you might also suffer a cache miss in addition to the normal penalty.

umanwizard

about 1 month ago

1 reply

> nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.

Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.

Sharlin

about 1 month ago

Thanks, that makes sense.

chasd00

about 1 month ago

1 reply

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

iirc doesn't word alignment matter? I have no idea if this is how the IBM PC XT was aligned but if you had 4 byte words then it doesn't matter if you save a byte with xor because you wouldn't be able to use it for anything else anyway. again, iirc.

Narishma

about 1 month ago

No, the 8088 used in the PC has a 2 byte word size. More importantly, it only has an 8-bit data bus, so alignment didn't really matter because it fetched instructions one byte at a time.

deadcore

about 1 month ago

2 replies

Matt Godbolt also uploads to his self titled Youtube channel: https://www.youtube.com/watch?v=eLjZ48gqbyg

vanderZwan

about 1 month ago

1 reply

Not sure why you got downvoted for pointing that out - it might be linked at the end of the article but people can still miss that.

deadcore

about 1 month ago

*shrugs* the internet being the internet I suppose.

There was "See the video that accompanies this post." but NGL was just posting in case anyone didn’t have time to read or missed it.

brucehoult

about 1 month ago

1 reply

He also runs a site with a bunch of different compilers and versions :p

mattgodbolt

about 1 month ago

1 reply

That's just some weird side hobby of his.

brucehoult

about 1 month ago

Dude. You've become a verb.

Dwedit

about 1 month ago

2 replies

Because "sub eax,eax" looks stupid. (and also clears the carry flag, unlike "xor eax, eax")

tom_

about 1 month ago

2 replies

xor clears the carry as well? In fact, looks like xor and sub affect the same set of flags!

xor:

> The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined.

sub:

> The OF, SF, ZF, AF, PF, and CF flags are set according to the result.

(I don't have an x64 system handy, but hopefully the reference manual can be trusted. I dimly remembered this, or something like it, tripping me up after coming from programming for the 6502.)

sfink

about 1 month ago

Strangely, the only difference on the flags is that AF (auxiliary carry) is undefined for `xor eax, eax` but guaranteed to be zeroed for `sub eax, eax`. I don't know what that means in practice, though I'm guessing that at the very least the hardware would not treat it as a dependency on the previous value.
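The flag rules quoted from the manual can be modelled in a few lines. This is a sketch of the documented behaviour for 32-bit operands, not of any real hardware; `None` stands in for "undefined", and the function names are made up.

```python
MASK = 0xFFFFFFFF

def parity_flag(result: int) -> int:
    """PF is set when the low byte of the result has an even number of 1 bits."""
    return int(bin(result & 0xFF).count("1") % 2 == 0)

def result_flags(result: int) -> dict:
    """Flags that both instructions set according to the result."""
    return {"ZF": int(result == 0), "SF": result >> 31, "PF": parity_flag(result)}

def xor32(a: int, b: int):
    r = (a ^ b) & MASK
    flags = result_flags(r)
    flags.update(CF=0, OF=0, AF=None)  # CF/OF cleared, AF undefined
    return r, flags

def sub32(a: int, b: int):
    r = (a - b) & MASK
    flags = result_flags(r)
    flags["CF"] = int(a < b)                         # borrow out
    flags["OF"] = ((a ^ b) & (a ^ r)) >> 31          # signed overflow
    flags["AF"] = int((a & 0xF) < (b & 0xF))         # borrow out of bit 3
    return r, flags

eax = 0xDEADBEEF
print("xor eax, eax ->", xor32(eax, eax))
print("sub eax, eax ->", sub32(eax, eax))
```

For any starting value of eax both calls give a zero result with CF=0, OF=0, SF=0, ZF=1, PF=1; the only difference is that AF is 0 for sub and undefined for xor.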

trollbridge

about 1 month ago

This is a good thing since the pipeline now doesn’t have to track the state of the flags since they all got zero’d.

HackerThemAll

about 1 month ago

If I remember correctly, sub used to be slower than xor on some ancient architectures.

sylware

about 1 month ago

1 reply

Remnant of RISC attempt without a zero register.

sylware

about 1 month ago

Come on... that was a joke... this karma system...

fooker

about 1 month ago

2 replies

It's funny how machine code is a high level language nowadays, for this example the CPU recognizes the zeroing pattern and does something quite a bit different.

Reubensson

about 1 month ago

5 replies

What do you mean that the CPU does something different? Isn't the CPU doing what is being asked, that being xor with consequence of zeroing when given two same values?

IsTom

about 1 month ago

I think OP means that it has come a long way from the simple mental model of µops being a direct execution of operations and with all the register renamings and so on

fooker

about 1 month ago

Same consequence yes.

But it will not execute xor, nor will it actually zero out eax in most cases.

It'll do something similar to constant propagation with the information that whenever xor eax, eax occurs; all uses of eax go through a simpler execution path until eax is overwritten.

12_throw_away

about 1 month ago

> with consequence of zeroing when given two same values

Right, it has the same consequence, but it doesn't actually perform the stated operation. ASM is now just a high level language that tells the computer to "please give me the same state that a PDP-11-like computer would give me upon executing these instructions."

dooglius

about 1 month ago

FTA:

> And, having done that it removes the operation from the execution queue - that is the xor takes zero execution cycles! It’s essentially optimised out by the CPU

horsawlarway

about 1 month ago

No.

It's emulating the zero result when it recognizes this pattern, usually by playing clever tricks with virtual registers.
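A toy register renamer illustrates the idea. Everything here is hypothetical (the class, the single shared zero register, the tuple format for uops); real cores differ in the details, so treat it as a sketch of the concept only.

```python
class ToyRenamer:
    ZERO_PREG = 0  # assumed: one physical register that always reads as zero

    def __init__(self):
        self.map = {}        # architectural register -> physical register id
        self.next_preg = 1
        self.issued = []     # uops that actually reach an execution unit

    def _fresh(self) -> int:
        preg = self.next_preg
        self.next_preg += 1
        return preg

    def _lookup(self, reg: str) -> int:
        if reg not in self.map:
            self.map[reg] = self._fresh()
        return self.map[reg]

    def rename(self, op: str, dst: str, src: str):
        # Zeroing idiom: just repoint the name at the zero register.
        if op in ("xor", "sub") and dst == src:
            self.map[dst] = self.ZERO_PREG
            return None  # nothing is sent to the execution units
        sources = (self._lookup(dst), self._lookup(src))
        new_preg = self._fresh()
        self.map[dst] = new_preg
        uop = (op, new_preg, sources)
        self.issued.append(uop)
        return uop

r = ToyRenamer()
r.rename("add", "eax", "ebx")
r.rename("xor", "eax", "eax")   # handled entirely in the rename table
r.rename("add", "ecx", "eax")   # reads the zero register as its eax source
print(r.map)
print(r.issued)
```

Three instructions go in, but only the two adds are issued; the xor changes nothing except the mapping for eax.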

dheatov

about 1 month ago

1 reply

It's really impressive how powerful and efficient it has become. However, I find it so much more difficult to build a mental model of it. I've been struggling with atomics and r/w barriers as there are sooo many ways the instructions could've been executed (or not executed!).

fooker

about 1 month ago

It's a consequence of keeping our general purpose single threaded programming model the same for five decades.

It has its merits, but the underlying hardware has changed.

Intel tried to push this responsibility to the compiler with Itanium but that failed catastrophically, so we're back to the CPU pretending it's 1985.

jgrahamc

about 1 month ago

7 replies

In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine.

Yeah, sadly the 6502 didn't allow you to do EOR A; while the Z80 did allow XOR A. If I remember correctly XOR A was AF and LD A, 0 was 3E 01. So saved a whole byte! And I think the XOR was 3 clock cycles faster than the LD. So less space taken up by the instruction and faster.

I have a very distinct memory in my first job (writing x86 assembly) of the CEO walking up behind my desk and pointing out that I'd done MOV AX, 0 when I could have done XOR AX, AX.

vanderZwan

about 1 month ago

2 replies

Hah, we commented on the exact same paragraph within a minute of each other! My memory agrees with your memory, although I think that should be 3E 00. Let me look that up:

https://jnz.dk/z80/ld_r_n.html

https://jnz.dk/z80/xor_r.html

Yep, if I'm reading this right that's 3E 00, since the second byte is the immediate value.

One difference between XOR and LD is that LD A, 0 does not affect flags, which sometimes mattered.
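The size, timing and flag differences discussed above can be put side by side in a small sketch. The opcode bytes are the ones from this subthread (with the corrected 3E 00); the T-state counts (4 and 7) and the resulting flag byte are taken from standard Z80 documentation as I remember it, and the two-register "machine" is purely illustrative.

```python
# Opcode bytes from the thread; T-states assumed from the usual Z80 tables.
Z80 = {
    "XOR A":   {"bytes": bytes([0xAF]),       "t_states": 4},
    "LD A, 0": {"bytes": bytes([0x3E, 0x00]), "t_states": 7},
}

def execute(instr: str, state: dict) -> dict:
    """Apply one of the two instructions to a tiny {A, F} register state."""
    new = dict(state)
    new["A"] = 0
    if instr == "XOR A":
        new["F"] = 0x44  # Z and P/V set, every other flag (including carry) clear
    # LD A, 0 leaves F untouched.
    return new

before = {"A": 0x5A, "F": 0x01}  # hypothetical state with carry set
for name, info in Z80.items():
    after = execute(name, before)
    print(f"{name:8s} {len(info['bytes'])} byte(s), {info['t_states']} T-states, "
          f"A={after['A']:02X} F={after['F']:02X}")
```

So XOR A is one byte and three T-states cheaper, at the cost of clobbering the flags, which is the case where LD A, 0 still earns its place.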

sfink

about 1 month ago

3 replies

What is this "LD A, 0" syntax? Is it a z80 thing?

One of the random things burned into my memory for 6502 assembly is that LDA is $A9. I never separated the instruction from the register; it's not like they were general purpose. But that might be because I learned programming from the 2 books that came with my C64, a BASIC manual and a machine code reference manual, and that's how they did it.

I learned assembly programming by reading through the list of supported instructions. That, and typing in games from Compute's Gazette and manually disassembling the DATA instructions to understand how they worked. Oh, and the zero-page reference.

Good times.

Narishma

about 1 month ago

1 reply

> One of the random things burned into my memory for 6502 assembly is that LDA is $A9. I never separated the instruction from the register; it's not like they were general purpose.

You had LDA and LDX and LDY as separate instructions while the Z80 assembler had a single LD instruction with different operands. It's the same thing really.

sfink

about 1 month ago

Right, though the LD? and ST? instructions were kind of exceptions. You could only do arithmetic and stack and bitwise ops (and, or, eor, shift, rotate) with A, never X nor Y. Increment and decrement were X/Y only. You couldn't even add two registers together without stashing one in memory.

vanderZwan

about 1 month ago

1 reply

> What is this "LD A, 0" syntax? Is it a z80 thing?

Well, I never wrote any 6502 so I can't compare, but yes, you could load immediate values into any register except the flag register on the Z80. Was that not a thing on the 6502?

jgrahamc

about 1 month ago

The 6502 instruction set was really limited but there were three registers: A, X, Y and there were immediate load instructions for each: LDA #0, LDX #0, LDY #0.

jgrahamc

about 1 month ago

> What is this "LD A, 0" syntax? Is it a z80 thing?

On the 6502 you had three instructions LDA, LDX, LDY where the register name is essentially part of the instruction name. On the Z80 you had a lot of "load" instruction so you had LD and then many different operands: loading 8-bit registers, loading 16-bit, writing to memory, reading from memory, reading/writing from memory using a register as an index. So, made more sense on Z80 to have "LD" whereas LDA/LDX/LDY worked fine on 6502.

jgrahamc

about 1 month ago

You're right. Of course, it's 3E 00. Not sure how I remembered 3E 01. My only excuse is that it was 40 years ago!

wavemode

about 1 month ago

5 replies

> CEO walking up behind my desk and pointing out that I'd done MOV AX, 0 when I could have done XOR AX, AX

Now that's what I call micromanagement.

(sorry couldn't resist)

mkornaukhov

about 1 month ago

1 reply

Similarly, the CEO couldn't resist the outstanding optimization of memory and execution speed!

6510

about 1 month ago

2 replies

No one believes this story.

jgrahamc

about 1 month ago

3 replies

I am sad you don't believe this story. The CEO was very technical and this is exactly the sort of thing he would spot.

bombcar

about 1 month ago

2 replies

People don't realize that in the era of dinosaurs where MASM ruled and assembly walked the earth, there basically WEREN'T CEOs who didn't know the details, because all the companies doing this stuff were pretty small at the time (and the CEO may have been writing it himself a few years before).

Analemma_

about 1 month ago

1 reply

There was a time when Bill Gates wrote code for Microsoft, and he was actually quite good at it.

nomel

about 1 month ago

1 reply

Not sure why this was voted down. He was very technical, especially for the time: https://www.thecrimson.com/article/2025/6/7/bill-gates-reuni...

eru

about 1 month ago

He also wrote and published a paper on pancake sorting.

assimpleaspossi

about 1 month ago

1 reply

In the era of dinosaurs, neither MASM nor Windows existed but we still did assembly or micro-coding (machine coding) or flipped switches.

bombcar

about 1 month ago

Pre-MASM Dinos probably weren't doing xor ax, ax.

OrderlyTiamat

about 1 month ago

1 reply

My first part time dev job as a student featured me walking in on our CEO who showed me he was recompiling his kernel to enable some features. I'm quite sure he was just doing that to impress the students, but at least he knew how to!

eru

about 1 month ago

Perhaps his secretary showed him?

6510

about 1 month ago

Similarly, if you told people in the 80's that it would be the opposite in the future no one would believe it either.

Not even the developers are very technical in the future!

Woah, really? And they still manage to write good software?

Of course not, if good software would be standing next to their bed at 4 am they would scream who are you what are you doing here? help! help! Someone, make it go away!

eru

about 1 month ago

CEO doesn't need to mean some big boss. If you have a three person startup, the CEO might just be your co-founding buddy.

crest

about 1 month ago

I had to pad the code for alignment reasons. ;-)

ksherlock

about 1 month ago

I mean, he IS the Chief EORfficer

xigoi

about 1 month ago

The real joke is that a CEO had actual technical knowledge instead of just being there for decoration.

jgrahamc

about 1 month ago

He was right though. We were memory and cycle constrained and I'd wasted both!

anonzzzies

about 1 month ago

1 reply

3E 00 : I was on MSX and never had an assembler when young, so I only remember the hex and never actually knew the instructions; I wrote programs/games as data 3E,00,CD,etc without comments saying LD A, as I never knew those at the time.

unnah

about 1 month ago

6 replies

Umm... how did you manage to learn those hex codes? You just read a lot of machine code and it started to make sense?

jgrahamc

about 1 month ago

1 reply

I started out writing machine code without an assembler and so had to hand assemble a lot of stuff. After a while you end up just knowing the common codes and can write your program directly. This was also useful because it was possible to write or modify programs directly through an interface sometimes called a "front panel" where you could change individual bytes in memory.

Back in 1985 I did some hand-coding like this because I didn't have access to an assembler: https://blog.jgc.org/2013/04/how-i-coded-in-1985.html and I typed the whole program in through the keypad.

stevekemp

about 1 month ago

Same here. On/For the ZX Spectrum, looking up the hex-codes in the back of the orange book. At least it was spiral-bound to make it easier.

Later still I'd be patching binaries to ensure their serial-checks passed, on Intel.

kragen

about 1 month ago

The instruction sets were a lot simpler at the time. The 8080 instruction set listing is only a few pages, and some of that is instructions you rarely use like RRC and DAA. The operand fields are always in the same place. My own summary of the instruction set is at https://dercuano.github.io/notes/8080-opcode-map.html#addtoc....

af78

about 1 month ago

I had a similar experience of writing machine code for Z80-based computers (Amstrad CPC) in the 90's, as a teenager. I didn't have an assembler so I manually converted mnemonics to hex. I still remember a few opcodes: CD for CALL, C9 for RET, 01 for LD BC, 21 for LD HL... Needless to say, the process was tedious and error-prone. Calculating relative jumps was a pain. So was keeping track of offsets and addresses of variables and jump targets. I tended to insert nops to avoid having to recalculate everything in case I needed to modify some code... I can't say I miss these times.

I'm quite sure none of my friends knew any CPU opcode; however, people usually remembered a few phone numbers.
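For anyone who never had to do it by hand, this is roughly the relative-jump calculation being described, sketched for the Z80's unconditional JR (opcode 18, followed by a signed 8-bit displacement measured from the address after the 2-byte instruction). The function name and addresses are made up for the example.

```python
def jr_bytes(instr_addr: int, target_addr: int) -> bytes:
    """Hand-assemble `JR target` located at instr_addr."""
    # The displacement is relative to the address of the next instruction.
    disp = target_addr - (instr_addr + 2)
    if not -128 <= disp <= 127:
        raise ValueError(f"target out of JR range (displacement {disp})")
    return bytes([0x18, disp & 0xFF])  # two's complement displacement byte

print(jr_bytes(0x8000, 0x8010).hex(" "))  # forward jump
print(jr_bytes(0x8000, 0x8000).hex(" "))  # jump to itself
```

Insert or delete a single byte between the jump and its target and every displacement that spans it has to be recomputed, which is why padding with NOPs was tempting.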

senderista

about 1 month ago

It wasn't unusual in the 80s to type in machine code listings to a PC; I remember doing this as an 8-year-old from magazines, but I didn't understand any of the stuff I was typing in.

amirhirsch

about 1 month ago

I implemented a PDP-11 in 2007-10 and I can still read PDP-11 Octal

anonzzzies

about 1 month ago

Typing from mags, getting interested in how the magic works by learning to use a hex monitor and trying out things. I was a kid so time enough.

stevefan1999

about 1 month ago

4 replies

> In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine.

Correct. Most ciphers of that era were Feistel ciphers like DES/3DES, and even RC4 uses XOR. Later AES/Rijndael, CRC and ECC (Elliptic Curve Cryptography) also make heavy use of XOR, but in finite field terms: arithmetic over GF(2), where addition is mod 2 and therefore reduces to XOR.

OhMeadhbh

about 1 month ago

1 reply

I was going to say "but RC4 and AES were published well after the 6502's heyday," but NESes were completely rocking it in '87 (and I'm told 65XX cores were used as the basis for several hard drive controllers of the era.) Alas, the closest I ever came to encryption on a (less than 32-bit system) was lucifer on an IBM channel controller in the forever-ago and debugging RC5 on an 8085.

kjs3

about 1 month ago

1 reply

> I'm told 65XX cores were used as the basis for several hard drive controllers of the era

Western Design Center is still (apparently) making a profit at least in part licensing 6502 core IP for embedded stuff. There's probably a 6502 buried and unrecognized in all sorts of low-cost control applications laying around you.

> RC5 on an 8085

Oof. Well played.

PaulHoule

about 1 month ago

3 replies

I dunno. The 6502 has been a $2 part for a long time but needs RAM and some glue logic, for a similar price you can get an AVR-8 [1] or ESP-32 [2] and get some RAM and GPIO.

[1] faster, more registers than the IBM 360, << 64k RAM

[2] much faster, 32bit, >> 64k RAM

brucehoult

about 1 month ago

1 reply

65C02s are $8 now. That didn't stop me buying one when I was stuck at home during COVID. And a 6809 too.

But forget AVR. Yeah, for a buck or so the ATTiny85 was my go-to small MCU five years ago, and the $5 328 for bigger tasks.

But for the last three years both can be replaced by a 48 MHz 32 bit RISC-V CH32V003 for $0.10 for the 8 pin package (like ATTiny85, and also no external components needed) and $0.20 for the 20 pin package with basically the same number of GPIOs as the 328. At 2k RAM and 16K flash it's the same RAM and a little less flash than the ATMega328 -- but not as much as you'd think as RISC-V handles 16 and 32 bit values and pointers sooo much better.

And now you have the CH32V002/4/5/6 with enhanced CPU and more RAM and/or flash -- up to 8K RAM and 62K flash on the 006 -- and still for around the $0.10-$0.20 price

https://www.lcsc.com/product-detail/C42431288.html

OhMeadhbh

29 days ago

1 reply

Hi Bruce! If you make it back to the states we'll have to drink a beer and wax poetic about the 6809. Do you know if anyone ever implemented the embedded RISC-V profile in hardware? Not everything I do on small systems needs a 48MHz 32-bit. But if I could get away with a low I/O count, why not use the $0.10 part? Also pretty sure I saw 8051 based SoCs going for $2. I bet if you looked hard enough you could find something like a 6502 for about the same price.

There's probably no reason not to get some of the CH32VXXX's to play with. Every now and again I have an application that needs very low power and I'm happy to spring for an MSP430. But every time I buy an MSP430, TI EoLs the specific model I bought.

brucehoult

28 days ago

Heeey, how's the Cruz treating you? If it still is.

I don't know why you'd ever want to pay a cent more for a 6502 or 8051 or AVR than for a RISC-V or ARM (e.g. Puya PY32F002A). Especially when the CH32V002/4/6 run on anything from 2V to 5V (plus a margin) which is pretty rare, and they don't need any external components.

I don't know whether the M6809 designers were the first to ever analyse a body of real software to find instruction and addressing mode frequencies and the distribution of immediates in order to optimise the encoding of a new ISA -- in a way that the 8086 people clearly didn't [1], but I think they were the first to publish about it, and I was fascinated by their BYTE articles at the time.

MSP430 is also a fun ISA. I just wish they were cheaper, and the cheap ones had more than 512 bytes of RAM. FRAM is funky. I also loooove the instruction encoding e.g. `add.w r11,r10` is `0x5B0A` where `5` is `add`, `B` is src register, `0` means reg to reg word size, `A` is dst register. Just beautiful. Far nicer for emulating on a 6502 or z80 than Arm or RISC-V too. The R2/R3 const generation is a bit whack though.
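The nibble breakdown described above is easy to check mechanically. This sketch assumes the usual MSP430 double-operand layout (opcode in bits 15-12, source register in 11-8, the Ad, B/W and As mode bits in 7-4, destination register in 3-0); the function name is made up.

```python
def decode_double_operand(word: int) -> dict:
    """Split a 16-bit MSP430 double-operand instruction word into its fields."""
    return {
        "opcode": (word >> 12) & 0xF,  # 0x5 is add
        "src":    (word >> 8) & 0xF,   # source register number
        "ad":     (word >> 7) & 0x1,   # destination addressing mode
        "bw":     (word >> 6) & 0x1,   # 0 = word, 1 = byte
        "as":     (word >> 4) & 0x3,   # source addressing mode
        "dst":    word & 0xF,          # destination register number
    }

fields = decode_double_operand(0x5B0A)
print(fields)
```

For `0x5B0A` this yields opcode 5, source register 11, destination register 10, and all mode bits zero (register to register, word size), so each hex digit maps onto one field.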

[1] e.g. on one hand deciding it was worth squeezing a 5 bit offset from any of 4 registers into a 2-byte instruction, while also providing 8 and 16 bit offsets with 3 and 4 byte instructions. They were also confident enough to relegate the 6800's SEC/CLC/SEI/CLI/SEV/CLV to two-byte instructions (with a mask so you could do multiple at once). But not confident enough to do the same with DAA, or SEX. They kept the M6800 encoding for DAA (and for as much else as possible e.g. keeping the opcodes for indexed addressing, but expanding from one option to dozens), but SEX was new to them and they could have experimented with it.

rzzzt

about 1 month ago

There are uC versions like the W65C134S: https://www.westerndesigncenter.com/wdc/w65c134s-chip.php

kjs3

about 1 month ago

> I dunno.

You don't know what, exactly? You can go to the web site and see what they are selling.

> The 6502 has been a $2 part for a long time

I doubt that for an IP license at any volume such a thing would make sense.

> but needs RAM and some glue logic

Sure? Embedded in whatever you're building.

> for a similar price you can get...

Oh, sorry...my bad. You were doing it the HN way: "Don't actually read what was written for comprehension...just take your first knee jerk and tell them how you would obviously do it better.".

ASalazarMX

about 1 month ago

Reading cryptography was that advanced at that time, I'm even more surprised that the venerable Norton Utilities for MS-DOS required a password, that was simply XORed with some constant and embedded in the executables. If the reserved space was zeroes, it considered it a fresh install and demanded a new password.

If it had been properly encrypted my young cracker self would have had no opportunity.

Sesse__

about 1 month ago

Well, running in CTR mode is really common now, and that ends up XORing the generated keystream into the plaintext… (CTR mode is essentially converting block ciphers into stream ciphers, if you want to see it that way.)
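A toy version of that idea, to show where the XOR ends up. To stay self-contained this sketch uses SHA-256 over key, nonce and counter as a stand-in for the block cipher, so it is not real CTR-mode AES and should not be used for anything; the names and sample values are made up.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Generate `length` keystream bytes from successive counter blocks."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        # Stand-in for E_k(nonce || counter); a real implementation would use AES.
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the keystream."""
    stream = keystream(key, nonce, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))

key, nonce = b"example key", b"example nonce"
message = b"xor eax, eax zeroes the register"
ciphertext = ctr_xor(key, nonce, message)
print(ciphertext.hex())
print(ctr_xor(key, nonce, ciphertext))
```

Because XOR is its own inverse, the same function both encrypts and decrypts, which is the stream-cipher view mentioned above.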

stevefan1999

about 1 month ago

Self-correction: It is GF(2^8) and not GF(2), but GF(2^8) primitive operations (such as carryless multiplication) can be reduced into a bunch of table lookups and/or GF(2) operations, which is how AES crypto accelerators are done in hardware.
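For reference, here is a minimal shift-and-XOR sketch of GF(2^8) multiplication with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). Addition in this field is plain XOR; the function name is made up, and real implementations typically use lookup tables or dedicated instructions instead.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes as elements of GF(2^8) modulo the given polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # "add" the current multiple of a (addition is XOR)
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= poly        # reduce when the degree reaches 8
        b >>= 1
    return result

print(hex(gf256_mul(0x57, 0x83)))
print(hex(0x57 ^ 0x83))  # field addition of the same two elements
```

The worked example from the AES specification, {57} times {83}, comes out as {c1} with this routine.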

mmphosis

about 1 month ago

1 reply

Try to keep the value 0 in the Y register.

  echo tya|asm|mondump -r|6502
                                A=AA X=00 Y=00 S=00 P=22 PC=0300  0
  0300- 98        TYA           A=00 X=00 Y=00 S=00 P=22 PC=0301  2

brucehoult

about 1 month ago

1 reply

That's 1 byte smaller than `LDA #0`, but not faster. And you don't have enough registers to waste them -- being able to do `STZ` and the `(zp)` addressing mode without having to keep 0 in X or Y were small but soooo convenient things in the 65C02.

snvzz

about 1 month ago

You might like the PC Engine, a game console based on the 65C02*.

*Actually a custom chip also containing some peripherals.

favorited

about 1 month ago

"Prefer `xor a` instead of `ld a, 0`" is basically the first optimization that you learn when doing SM83 assembly.

https://github.com/pret/pokecrystal/wiki/Optimizing-assembly...

user3939382

about 1 month ago

I’m building a new 6502 machine

dintech

about 1 month ago

1 reply

My brain read this as "Why not ear wax?"

kragen

about 1 month ago

    xor wax, wax    ; clear wax
    xor sax, sax    ; clear sax
    xor fax, fax    ; tru tru

jabedude

about 1 month ago

2 replies

similarly IIRC, on (some generations of) x86 chips, NOP is sugar around `XCHG EAX, EAX` which is effectively a do-nothing operation

bitwize

about 1 month ago

This is pretty much all x86 chips as far as I'm aware: opcode 0x90 which is equivalent to XCHG EAX,EAX.

The 8080 and Z80's NOP was at opcode 0. Which was neat because you could make a "NOP slide" simply by zeroing out memory.

kccqzy

about 1 month ago

There are multiple variants of nop mainly because you sometimes need the nop instruction to take up a certain number of bytes for alignment purposes. You have the 1-byte nop, but there is also the 9-byte nop.
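As an illustration of padding with variable-length NOPs, the sketch below fills a gap using the 1-byte to 9-byte forms. The byte sequences are the multi-byte NOP encodings recommended in Intel's manuals as I recall them; the greedy padding strategy and the function name are just assumptions for the example, and assemblers may choose differently.

```python
# Recommended NOP encodings by length (1 to 9 bytes).
NOPS = {
    1: "90",
    2: "66 90",
    3: "0f 1f 00",
    4: "0f 1f 40 00",
    5: "0f 1f 44 00 00",
    6: "66 0f 1f 44 00 00",
    7: "0f 1f 80 00 00 00 00",
    8: "0f 1f 84 00 00 00 00 00",
    9: "66 0f 1f 84 00 00 00 00 00",
}

def nop_padding(length: int) -> bytes:
    """Fill `length` bytes using as few NOP instructions as possible."""
    out = bytearray()
    while length > 0:
        size = min(length, 9)          # greedily take the longest form
        out += bytes.fromhex(NOPS[size])
        length -= size
    return bytes(out)

print(nop_padding(1).hex(" "))
print(nop_padding(11).hex(" "))
```

Fewer, longer NOPs mean fewer instructions to decode for the same amount of padding, which is the reason the longer forms exist.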

pclmulqdq

about 1 month ago

In modern CPUs, a lot of these are recognized as zeroing idioms and they end up doing the same thing (often a register renaming trick). Using the shortest one makes sense. If you use a really weird zeroing pattern, you can also see it as a backend uop while many of these zeroing idioms are elided by the frontend on some cores.

bitwize

about 1 month ago

Because mov eax, 0 requires fetching a constant and prolongs instruction fetching/execution. XOR A was a trick I learned back in the Z80 days.

fortran77

about 1 month ago

Back when I did IBM 370 BAL Assembly Language, we did the same thing to clear a register to zero.

  XR   15,15         XOR REGISTER 15 WITH REGISTER 15

vs

  L    15,=F'0'      LOAD REGISTER 15 WITH 0

This was alleged to be faster on the 370 because XR operated entirely within the CPU registers, and L (Load) fetched data from memory (i.e., the constant came from program memory).

silverfrost

about 1 month ago

Back on the Z80 'xor a' is the shortest sequence to zero A

vanderZwan

about 1 month ago

> In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine.

Meanwhile, people like me who got started with a Z80 instead immediately knew why, since XOR A is the smallest and fastest way to clear the accumulator and flag register. Funny how that also shows how specific this is to a particular CPU lineage or its offshoots.
