Asking the kernel how to make a syscall

Imagine you’re an i386 user mode application on a modern operating system, and you want to make a syscall, for example to request some memory or create a new thread. Syscalls can be made in various ways on the i386 family of CPUs (int, call gates, sysenter, syscall), and each CPU tends to support only a subset of them. The kernel could simply settle on “int”, which every CPU supports, but that would be a waste of resources on modern CPUs, because sysenter is a lot faster.

The Windows XP kernel, for example, therefore detects the CPU type and tells user mode applications which mechanism to use. At a constant location in every address space, it maps a read-only page containing a small stub that can be called from user mode like a library function and that does nothing more than transparently make the syscall.

do_syscall:
sysenter
ret
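
The detection step can be sketched in C as follows. This is only an illustration, not the Windows XP code: the enum and function names are made up for this sketch, and the caller is assumed to have already executed CPUID leaf 1 and to pass in the resulting EDX value. The one hard fact used is that CPUID reports sysenter/sysexit support (the SEP flag) in bit 11 of EDX for leaf 1.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; not taken from any real kernel. */
enum syscall_method {
    SYSCALL_VIA_INT,      /* software interrupt, works on every i386 CPU */
    SYSCALL_VIA_SYSENTER  /* fast path, only if the CPU advertises it    */
};

/* CPUID leaf 1 reports sysenter/sysexit support (SEP) in bit 11 of EDX. */
#define CPUID_EDX_SEP (UINT32_C(1) << 11)

/* Pick the mechanism the stub page should use, given the EDX feature
 * bits returned by CPUID leaf 1 (obtained elsewhere by the caller). */
enum syscall_method choose_syscall_method(uint32_t cpuid1_edx)
{
    if (cpuid1_edx & CPUID_EDX_SEP)
        return SYSCALL_VIA_SYSENTER;
    return SYSCALL_VIA_INT;
}
```

The kernel would then write either the sysenter stub or an int-based one into the shared page, so applications never need to know which one they got.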

So far so good. But what if, for compatibility reasons, you cannot just map this page at a constant location? A microkernel like L4 is, among other things (to make a long story short), designed to run unmodified applications written for many different operating systems at the same time. So there is no location in the 4 GB address space where we can safely map a page of code without risking a compatibility break with some operating system.

So the question is, how can we ask the kernel how to make syscalls, if the kernel cannot put the info in our memory, and we obviously cannot make a syscall to ask the kernel…

The idea is to trap into kernel mode, by doing something illegal, so the kernel can put the information in a register and return to user mode. A division by zero is such a trap – but then the kernel would not be able to distinguish between this special syscall and a real division by zero exception. Using an illegal instruction doesn’t help either, because no i386 opcode is guaranteed to be illegal in the future.

The L4 guys came up with “lock nop”. “lock” is a prefix that makes sure that in the following instruction, the memory bus is not shared with any other CPU in an SMP configuration. But “lock” may only be used with one of 17 specific instructions – all other instructions following a “lock” will cause an “undefined opcode” exception and trap into the kernel, which can easily look up whether it was “lock nop” that caused the exception.
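
On the kernel side, the check might look roughly like the C sketch below. It is not L4's actual handler: the trap frame layout, the function name and the value placed in EAX are all invented for illustration, and the code bytes are assumed to have been safely copied from the faulting address already. The only encoding facts used are that the lock prefix is the byte 0xF0 and nop is 0x90.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical saved user state; a real kernel's frame has more fields. */
struct trap_frame {
    uint32_t eip;  /* address of the faulting instruction   */
    uint32_t eax;  /* register used here to return the info */
};

#define OPCODE_LOCK 0xF0  /* lock prefix byte */
#define OPCODE_NOP  0x90  /* nop opcode byte  */

/* Placeholder for whatever the kernel wants to tell user mode. */
#define HYPOTHETICAL_SYSCALL_INFO 0x1234u

/* Called from the "undefined opcode" exception path with the bytes found
 * at the faulting EIP. Returns true if the fault was the magic
 * "lock nop" request and has been handled, false if it is a genuine
 * undefined opcode that must be reported to the application. */
bool handle_undefined_opcode(struct trap_frame *frame,
                             const uint8_t *code, size_t len)
{
    if (len < 2 || code[0] != OPCODE_LOCK || code[1] != OPCODE_NOP)
        return false;

    frame->eax = HYPOTHETICAL_SYSCALL_INFO; /* hand the answer back     */
    frame->eip += 2;                        /* skip the two-byte opcode */
    return true;
}
```

Advancing EIP past the two bytes matters: an exception of this kind leaves EIP pointing at the faulting instruction, so returning without the adjustment would just trap again.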

(Now here is a question: I found a hint on the net that “lock nop” didn’t do anything on some early Intel i386 CPUs – does anyone have more information?)

7 thoughts on “Asking the kernel how to make a syscall”

  1. ud2 (opcode 0f0b and 0fb9) is an opcode guaranteed to cause an “undefined opcode” exception, see e.g. http://www.x86.org/secrets/opcodes/ud2.htm

    I’m not 100% sure that there is no obscure (and ancient) x86 clone which has a different opcode at 0f0b or 0fb9, but I’d strongly vote against using “lock nop”. There is no guarantee that Intel or AMD will not put a new instruction there in the future (like they did with “rep nop”).

    Another smart way to cause an exception is “div edx”. This will always cause a “division by zero” exception, independent of the value of edx (proof is left as an exercise for the reader). Because this instruction has no real purpose other than dividing by zero, it’s highly unlikely that it is used in real applications.

    You can go one step further by using “lock div edx” to be absolutely sure 🙂

  2. So my question is, what’s wrong with a trap!?

    Sure, you can use some magic instruction sequence to trap into the kernel and do your operation, but I don’t see what it buys you.

    Every architecture I know enough details of has some method of performing a trap/syscall/whatever instruction. Why not use this to find out more information as to other supported syscall methods?

    Sure, you could argue that you can’t virtualise arbitrary existing binaries because there’s a chance you collide with syscall numbers and what not, but you get the same problem when everyone is using magic illegal instruction traps, too 🙂 (or you release version N…)

    Still confused about it is all

    • Now my other girlfriends want to read it too!!!! Just suuuuuuuuuuuuuper!!!! Now we can swoon together when we talk and read about Will and Jacinda!!!! Will really seems like such a hunk to me (if he actually existed)!!! Is there going to be a film of the book or not? Because if so, I hope it won’t be in German again like with Robijnrood!!! Some of my friends already don’t want to watch the film with me anymore (once it’s out), even though they promised!!!

  3. You’re contradicting yourself: first you say that “no illegal opcode will remain illegal forever” (FALSE: UD2 was designed for this), then you suggest using “lock nop”, which is illegal, but only for now (only UD2 is guaranteed to be always illegal).

    So really, what you want to be doing is to use UD2. Also, the problem of “how to ask the kernel to make syscalls” you describe only exists on operating systems which need to check if it’s safe to use SYSENTER. Just don’t use it, and choose another interrupt.

    Best regards,
    Alex Ionescu

  4. The real question is, why do you care?

    A syscall is a contract between two pieces of code; the system and the application. At the point the application starts, the contract has already been established, meaning in practical terms that the system knows how the application is going to invoke a system call.

    Thus there is no need for a “universal” exit; the context for the application is simply configured to behave appropriately according to the contract.

  5. Sebastian: div edx doesn’t result in a ‘division by zero’, but the more general ‘divide error’. This is because the quotient is too large to fit in eax – what you’re calculating is essentially (edx*2^32+epsilon)/edx = 2^32+epsilon, which obviously can’t fit in 32 bits. Unless edx is zero, which gives a ‘divide by zero’. Same for 8, 16 and 64 bit operands, using div ah, div dx and div rdx, where the high half of the dividend is used as the divisor.

    The bizarre thing about the x86 div instruction is that it’s an n-bit dividend and only an n/2 bit quotient, so that it is even possible to overflow the output register. It’s not often that you need both the quotient and remainder in the same instruction, so the additional output register could have been used to increase the range of the result.
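
The overflow argument can be checked with a small C model of the 32-bit div instruction. This is a sketch of the documented semantics (64-bit dividend in EDX:EAX, 32-bit quotient and remainder), not real trap handling, and the function names are made up for illustration. A return value of false stands in for the divide error the CPU would raise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of 32-bit DIV: divides the 64-bit value EDX:EAX by `divisor`.
 * Returns false where the CPU would raise a divide error, i.e. when the
 * divisor is zero or the quotient does not fit in 32 bits. */
bool div32(uint32_t edx, uint32_t eax, uint32_t divisor,
           uint32_t *quotient, uint32_t *remainder)
{
    if (divisor == 0)
        return false;

    uint64_t dividend = ((uint64_t)edx << 32) | eax;
    uint64_t q = dividend / divisor;
    if (q > UINT32_MAX)
        return false;  /* quotient overflows EAX */

    *quotient = (uint32_t)q;
    *remainder = (uint32_t)(dividend % divisor);
    return true;
}

/* "div edx" uses the high half of the dividend as the divisor. For any
 * nonzero edx the quotient is at least 2^32, so this faults every time. */
bool div_edx_faults(uint32_t edx, uint32_t eax)
{
    uint32_t q, r;
    return !div32(edx, eax, edx, &q, &r);
}
```

For example, div_edx_faults(5, 123) models dividing 5*2^32 + 123 by 5, whose quotient is 2^32 + 24 and therefore too large for EAX.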

  6. I usually view them with yumex. I tick "installed", go to Kernel, and mark the ones I want to uninstall, along with the rest of the obsolete kernel-devel packages, plus any kmod or akmod packages I’m no longer interested in keeping. Regards, Emiliano, Badajoz
