Puzzle: PowerPC Flag Simulation on x86

This week’s puzzle is to copy the carry flag to the high bit of ah.  You may destroy any other register, the flags, and the other 24 bits of eax.  Shortest sequence wins.

For an optional, very long description of what this is all about, read on.  You don’t need to read it to participate in the contest =)

Let’s say that you are writing Rosetta for Apple to run PowerPC Mac programs on new x86-based hardware.  Naturally, speed is a huge goal, so you will use dynamic recompilation of PowerPC machine code to x86 to achieve this.  This is the basis of this week’s puzzle.

Like x86, PowerPC has a flag register, called “cr0”.  (There are other flag registers but we can ignore them here.)  Because cr0 is frequently used by PowerPC code, you assign an x86 register to always represent the simulated cr0.  You cannot use the x86 flags register itself because PowerPC can execute many instructions, such as “or”, without changing its flags, while x86 cannot.  When you translate PowerPC conditional jumps, you will need to get the bits of the simulated cr0 into the x86 flags register so that you may use an x86 conditional jump to simulate the PowerPC conditional jump.

PowerPC “compare” instructions work much like x86, but there is a problem: On PowerPC, the compare instruction selects between signed and unsigned comparison, whereas on x86, the signed/unsigned selection is done by the choice of conditional jump.  Because a “compare” instruction and a conditional branch may be arbitrarily far apart with many intervening instructions between them, it is not possible in the general case to use data flow analysis to decide how to translate it.  It must be simulated more exactly.

Now comes the puzzle.  How do we recompile PowerPC compares and conditional branches into x86 code?  The “compare” needs to do one of two things.  One, it could set some flag such that when you do a conditional branch, you can know whether the previous comparison was signed or unsigned and decide whether to branch accordingly.  Two, it could set up the simulated cr0 register such that the branch does not need to know whether the comparison was signed or unsigned.

This is a complicated description of the problem, so I’m going to show an example solution.  We go with the second approach: we set up the simulated cr0 at compare time so that the branch does not need to know the type of comparison used earlier.  We define eax and edx to be scratch registers (that is, not directly mapped with simulated PowerPC registers).  We also define our simulated PowerPC cr0 (flags) register as x86 register ecx.  (A real implementation would probably put the flags into a memory variable rather than a register because cr0 is used less frequently than the more common registers.)

We compare the value of esi, which we define as being a simulated PowerPC general register, to 5.  We then store bits in ecx to store the information necessary to implement the conditional jump later.  Because signed comparisons are more common than unsigned comparisons, we consider signed comparisons to be the “default”.  It is the responsibility of unsigned comparisons to set ecx the same way that signed comparisons would set it.

We use the x86 instructions “lahf” (flags -> ah) and “sahf” (ah -> flags) to access the low 8 bits of the x86 flags register, which is where we can find the interesting comparison results.  We cannot use pushfd/popfd to do this because they cause a kernel trap in some cases and are extremely slow.

The flags we care about are the “sign” and “zero” flags.  With them, we can do all comparisons: <, >, ≤, ≥, and = can all be tested as some combination of these two flags.  Both these flags are in the low 8 bits of the flag register, and they can be tested against with a single conditional branch, so they make an optimal solution for the signed case.

For a signed comparison, we do this:

cmp esi, 5 ; actual simulated comparison
lahf
mov ecx, eax
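
To make the byte that lahf stores concrete, here is a minimal Python sketch of the flags a 32-bit cmp produces and how they land in ah.  The names cmp_flags and lahf are hypothetical helpers for illustration only, and the model assumes we only care about CF, ZF, SF, and OF (the parity and auxiliary-carry bits are left at 0 for simplicity).

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Model the CF, ZF, SF and OF results of a 32-bit `cmp a, b`."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    cf = int(a < b)              # unsigned borrow
    zf = int(result == 0)
    sf = result >> 31            # top bit of the 32-bit result
    # Signed overflow: operands have different signs and the
    # result's sign differs from the first operand's sign.
    of = ((a ^ b) & (a ^ result)) >> 31
    return {"CF": cf, "ZF": zf, "SF": sf, "OF": of}

def lahf(flags):
    """Pack flags the way lahf does: SF is bit 7, ZF is bit 6, CF is bit 0.

    Bit 1 always reads as 1.  PF (bit 2) and AF (bit 4) are not modeled.
    OF lives in bit 11 of the flags register, so it never reaches ah.
    """
    return (flags["SF"] << 7) | (flags["ZF"] << 6) | 0x02 | flags["CF"]

print(hex(lahf(cmp_flags(3, 5))))   # 0x83: SF and CF set, since 3 < 5
print(hex(lahf(cmp_flags(5, 5))))   # 0x42: ZF set, since 5 == 5
```

Note that the overflow flag is computed by cmp but is not part of the byte that lahf captures.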

Later, when we want to branch based on a previous comparison’s result, we do this:

mov eax, ecx
sahf
(branch) label

(branch) is the type of condition to check:

< becomes jl
> becomes jg
≤ becomes jle
≥ becomes jge
= becomes je

These branches all use the sign and zero flags directly to determine whether to branch.
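
For reference, here is a small Python sketch of the conditions these x86 branches test, as documented for the Jcc instructions.  The function name branch_taken is a hypothetical helper.  One thing worth keeping in mind: jl, jge, jle, and jg also read the overflow flag, which sahf does not restore, so the sketch takes OF as an explicit input defaulting to 0.

```python
def branch_taken(branch, sf, zf, of=0):
    """Return True if the given x86 conditional jump would be taken."""
    conditions = {
        "jl":  sf != of,                # less (signed)
        "jge": sf == of,                # greater or equal (signed)
        "jle": zf == 1 or sf != of,     # less or equal (signed)
        "jg":  zf == 0 and sf == of,    # greater (signed)
        "je":  zf == 1,                 # equal
    }
    return conditions[branch]

# With OF clear, the branches reduce to tests on SF and ZF alone.
print(branch_taken("jl", sf=1, zf=0))         # True
print(branch_taken("jg", sf=0, zf=0))         # True
# If OF happens to be set when the branch runs, the outcome flips.
print(branch_taken("jl", sf=1, zf=0, of=1))   # False
```

When OF is known to be 0, these conditions depend only on the two bits that travel through ah.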

In the unsigned case, the x86 “carry” flag works for unsigned comparisons like the “sign” flag for signed comparisons.  By copying the “carry” flag to the “sign” flag, we allow the jl/jg/jle/jge/je branches to work for unsigned comparisons as well.  Carry is bit 0 of the x86 flags register, and sign is bit 7.  This is a simple way to implement unsigned comparisons:

cmp esi, 5
lahf
mov al, ah
shl al, 7
and ah, 0x7F
or ah, al
mov ecx, eax
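
As a sanity check, the four instructions between lahf and the final mov can be modeled in a few lines of Python.  This is only a sketch: carry_to_sign is a hypothetical name, and ah and al are treated as plain 8-bit integers.

```python
def carry_to_sign(ah):
    """Step through the sequence that copies CF (bit 0) over SF (bit 7)."""
    al = ah                  # mov al, ah
    al = (al << 7) & 0xFF    # shl al, 7   (only bit 0 survives, now in bit 7)
    ah &= 0x7F               # and ah, 0x7F (clear the old sign bit)
    ah |= al                 # or ah, al   (install carry as the new sign bit)
    return ah

print(hex(carry_to_sign(0x03)))   # 0x83: carry set, so sign becomes set
print(hex(carry_to_sign(0x82)))   # 0x2: carry clear, so sign is cleared
```

The whole sequence is equivalent to (ah & 0x7F) | ((ah & 1) << 7); the other bits of ah, including the zero flag in bit 6, pass through unchanged.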

Is there a faster way to implement this, especially for the unsigned case?

Rules

  • You may destroy any register.
  • You may use fixed memory locations as variables.
  • You may not use pushfd/popfd, because they can cause a kernel trap and are extremely slow.
  • You are not limited to improving the example; implementations of an entirely different solution to the same problem (such as the first approach suggested) are acceptable.
  • Small improvements count a lot, because presumably the code will execute many millions of times.

Michael Steil came up with the example solution shown here for his thesis work on SoftPear, and I only made a moderate enhancement to it.

-Myria

7 thoughts on “Puzzle: PowerPC Flag Simulation on x86”

  1. setc ah
    ror ah, 1

    (Not any shorter at 5 bytes, but this way seems a little more readable at first glance.)

  2. “This week’s puzzle is to copy the carry flag to the high bit of ah. You may destroy any other register, the flags, and the other 24 bits of eax.”

    What about the other 7 bits of ah? Is that why some answers worked hard to preserve those bits?

  3. All of the rcl/rcr-based solutions chop off a bit from AH when shifting CF in. This is the smallest solution that I can think of that doesn’t suffer this problem:

    salc ; D6
    and al, 0x80 ; 24 80
    or ah, al ; 0A E0

    That’s 5 bytes. The next solution is the fastest (that I can think of):

    sbb ecx, ecx ; 1B C9
    and ecx, 0x00008000 ; 81 E1 00 80 00 00
    or eax, ecx ; 0B C1

  4. Also, there are some big problems with using jl/jge/jg/jle since you aren’t even setting OF before your signed branches, and those instructions calculate OF ^ SF as part of the condition.

    Honestly, I would do it the other way around by copying OF ^ SF to CF and using unsigned branches. That way you can use setl+shr. It’s fast and takes only 5 bytes. An example follows:

    ; Do comparison
    cmp esi, 5

    ; Save flags — 6 bytes and no partial register stalls
    setl al
    shr al, 1
    lahf

    ; Restore flags and branch — 3-7 bytes
    sahf
    jb foo ; esi < 5

