In case you didn't know, most instructions take one or two cycles to execute. (For you anal-retentive people, this is actually just their apparent latency. The number of cycles is actually much larger and pipeline-dependent.)
Anyway, I'm going to take my instruction timings from an Intel Core2 L64. Let's take a look:
Clearly the speed of your program does not hinge on how few instructions you use. Do you have any idea how much time you waste by dividing? Great thing is, this problem isn't unavoidable. There's ways of getting around it to cut down the number of cycles, which is critical in loops.Code:add 1 cycle mov 1 cycle shl 1 cycle mul 8 cycles div 116 cycles
Multiplying
When multiplying by powers of two, we can use SHL:
This completes in one cycle instead of eight. Downside is that MUL stores the result in EDX:EAX, meaning your result could be a huge 64-bit number. With SHL, if you multiply by too much you could very well end up with zero as your answer.Code:SHL EAX, 1 ; EAX *= 2 SHL EAX, 2 ; EAX *= 4 ...
For numbers less than 10, we can also use LEA. If you ever find the need to multiply, add, and store, you can use this handy-dandy instruction to get it all done in one cycle, and probably with fewer registers too. For example:
This snippet multiplies EAX by five (clobbers ECX) and adds ten for a grand total of four cycles. Still better than the nine it would take with a MUL and an ADD. But what if we don't have any other free registers? We're going to have to use the stack, which'll add two more cycles onto our snippet for a total of six. Not saving too much there.Code:MOV ECX, EAX SHL EAX, 2 ADD EAX, ECX ADD EAX, 10
However, our friend LEA can do all that in one cycle, without touching ECX:
Note that we have a few restrictions:Code:LEA EAX, [EAX*4 + EAX + 10]
- We can only multiply one register by 1, 2, 4, or 8.
- We can only add one register. (No subtracting, sorry.)
- We can only add one signed constant, either 8 or 32 bits (whatever fits).
You can get creative and use it to add two registers and store the result in another register:
Note that we have a problem: You can't multiply a register by anything more than 9, and you can't do 6 or 7 either. Well, we can use two instructions for 6:Code:LEA EDX, [EAX + ECX] ; EDX = EAX+ECX LEA EBX, [EDX*4 + EAX - 1492] ; EBX = EDX*4 + EAX - 1492
Grand total of 2 clock cycles, which is 1/4 of what it takes using MUL.Code:; multiply EAX by 6 LEA EAX, [EAX*2 + EAX] ; EAX *= 3 SHL EAX, 1
Multiplying by 7 is a little more annoying. Here's one way to do it without clobbering any other registers:
Even though this code looks like a lot more work, it's actually twice as fast as a single MUL.Code:PUSH EAX SHL EAX, 3 SUB EAX, [ESP] ADD ESP, 4
Multiplying by larger numbers requires a series of shifts, adds and LEAs which isn't really worth it anymore. Way back when MUL used to take around fifteen cycles, it was possible. But eight? Eh...highly depends on how cunning you are. Plus you need to know your algebra.
Multiply EAX by 47 (same as EAX*32 + EAX*16 - EAX)
Four cycles! Not bad. If we want to save ECX we can use the stack:Code:MOV ECX, EAX ; save original EAX in ECX SHL EAX, 4 ; EAX *= 16 LEA EAX, [EAX*2 + EAX] ; EAX = EAX*16*2 + EAX*16 SUB EAX, ECX
Six cycles. Not that much of a benefit, but in a frequently called loop it'll be worth it.Code:PUSH EAX ; save original eax SHL EAX, 4 ; EAX *= 16 LEA EAX, [EAX*2 + EAX] ; EAX = EAX*16*2 + EAX*16 SUB EAX, [ESP] ADD ESP, 4
Now something harder: multiply EAX by 19. This is the same as EAX*16 + EAX*2 + EAX. This isn't going to work, though. Better redistribute the multiples to EAX*8 + EAX*4 + EAX.
Ha! Exactly the same as multiplying by 47, except for how much we shift by.Code:MOV ECX, EAX SHL EAX, 2 LEA EAX, [EAX*2 + EAX] ADD EAX, ECX
But enough of that. Let's move on to dividing.
Dividing
Not a whole ton you can do here, but as long as you're dividing by the right kind of numbers you can speed this up A LOT. Downside: you really can only do this quickly by shifting right, which means that you don't get a remainder. DIV gives you the quotient in EAX and the remainder in EDX. But if you don't need that remainder, you can shift using SAR and SHR. You need to know just one thing: is the number you're dividing signed or unsigned? (If you don't know the difference you kinda shouldn't be here.) Anyway, if you're dealing with signed numbers use SAR (shift arithmetic right), if you're using unsigned numbers use SHR (shift right).
These are useful for dividing by powers of two, just as SHL is useful for multiplying by powers of two. If you want to divide by something that's not a power of two...sorry, but as far as I know you're up the creek without a paddle. IF, however, you want to do fractions, hold on a sec. I'll get to that.
Let's try something easy first - dividing by eight.
Dividing by any power of 2 is this simple, and 1/116 of the time it'd take with a DIV. Small savings, right? (That's my humor for ya.)Code:; unsigned numbers SHR EAX, 3 ; signed numbers SAR EAX, 3
Fractions
Now about those fractions...we can only divide by powers of two, but we can at the same time multiply by another number. A little algebra:
The last one takes nine to thirteen cycles by my count. Neat, huh?Code:A/2 - A/8 = (8A - 2A)/16 = 6A/16 = 3A/8 A/16 + A/2 = (2A + 16A)/32 = 18A/32 = 9A/16 5A/16 + 47A = 757A/16
Caveat: You're going to get an integer answer unless you decide to use fixed-point arithmetic.
Sources: "Instruction latencies and throughput for AMD and Intel x86 processors" by Torbjör Granlund. (C) 2009.
sudo rm -rf /
There are currently 1 users browsing this thread. (0 members and 1 guests)
Bookmarks