I understand that there might be a good reason for Intel to add virtualization extensions to their CPU architecture. Instead of fixing the x86 architecture to (optionally) make it Popek-Goldberg compliant and have all critical instructions trap if not run in Ring 0, they added non-root mode, a very big hammer that allows me to switch my CPU state completely to that of the guest and switches back to my original host state on a certain event in the guest. Well, it’s a great toy for people who want to play with CPU internals.
Therefore Intel had to add the VMCS, a 4 KB block in memory that holds the complete CPU state of both the host and the guest (segment registers, GDT and IDT pointer, certain MSRs etc.) as well as some control bits (for example, when to exit).
I also understand that Intel doesn’t allow me to just read and write memory in the VMCS, but abstracts accessing the virtualization state using a vmread/vmwrite interface. This way, the actual layout of this 4 KB page is an implementation detail and can be changed on later CPUs. It also allows for field indexes that are more spread out and encode what kind of field it is.
So I understand very well why Intel encodes into the VMCS field index whether it’s a control field (0), a read-only field (1), part of the guest state (2) or part of the host state (3), and whether it’s a 16 bit (0), 32 bit (2), 64 bit (1) or native-sized (3) field. This way, for example, all 16 bit guest state fields (like the guest’s CS) have indexes starting at 0x0800, and all 64 bit host state fields (like the host’s EFER MSR) start at 0x2C00.
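To make the convention concrete, here is a minimal sketch of a decoder for these field indexes. The numeric codes are the ones given above; the bit positions (field type in bits 11:10, width in bits 14:13) are my reading of the field-encoding description in the Intel manual and are not stated in this post, and the function and table names are made up for illustration.

```python
# Sketch: decode the category and width encoded in a VMCS field index.
# Assumption: type lives in bits 11:10 and width in bits 14:13, per the
# Intel manual's field-encoding description (not spelled out in the post).

FIELD_TYPES = {0: "control", 1: "read-only", 2: "guest state", 3: "host state"}
FIELD_WIDTHS = {0: "16 bit", 1: "64 bit", 2: "32 bit", 3: "native-sized"}


def decode_vmcs_field(encoding):
    """Return (type, width) names for a VMCS field index."""
    field_type = (encoding >> 10) & 0x3
    width = (encoding >> 13) & 0x3
    return FIELD_TYPES[field_type], FIELD_WIDTHS[width]


print(decode_vmcs_field(0x0800))  # ('guest state', '16 bit')
print(decode_vmcs_field(0x2C00))  # ('host state', '64 bit')
```

Both results match the two examples in the paragraph above: 0x0800 is where the 16 bit guest state fields begin, and 0x2C00 is where the 64 bit host state fields begin.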
Now what I don’t understand is why it is so hard to be consistent with this convention (Intel Manual 3B, Appendix H).
- VMCS Link Pointer (0x2800): In the first revision of VT, it had already been decided that there should be a mechanism for having a second 4 KB page in case later versions of VT need more than 4 KB of state. For this, there is the “VMCS Link Pointer”, which is a 64 bit physical address. Guess what category this belongs to? Guest state.
- “Guest Address Space Size” bit in the “VM Entry Controls” Field (0x4012): This is clearly guest state and not a control field.
- “Host Address Space Size” bit in the “VM Exit Controls” Field (0x400C): This is clearly host state and not a control field.
- VMX-preemption timer value (0x482E): This timer controls after how many ticks execution of the guest should end and control should be returned to the hypervisor. Intel put this into the “guest state” bucket: All other guest state fields are properties of the i386/x86_64 architecture that need to be switched, but not this one. This should really be a control field.
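The mismatches in this list can be checked mechanically. The sketch below extracts the category bits from each of the four indexes and compares them with the category argued for above. It assumes the category sits in bits 11:10 of the index (my reading of the Intel manual, not something stated in this post); the names are hypothetical, and `None` marks the link pointer, for which no alternative category is named here.

```python
# Sketch: compare the category Intel encoded with the one argued for above.
# Assumption: the category is held in bits 11:10 of the field index.

TYPE_NAMES = ("control", "read-only", "guest state", "host state")


def vmcs_field_type(encoding):
    """Return the category name encoded in a VMCS field index."""
    return TYPE_NAMES[(encoding >> 10) & 0x3]


# (field, index, category the list above argues for; None = not named)
FIELDS = [
    ("VMCS Link Pointer", 0x2800, None),
    ("VM Entry Controls (Guest Address Space Size bit)", 0x4012, "guest state"),
    ("VM Exit Controls (Host Address Space Size bit)", 0x400C, "host state"),
    ("VMX-preemption timer value", 0x482E, "control"),
]

for name, encoding, argued in FIELDS:
    encoded = vmcs_field_type(encoding)
    note = "" if argued is None else f" (arguably: {argued})"
    print(f"{encoding:#06x} {name}: {encoded}{note}")
```

Note that for 0x4012 and 0x400C the complaint is about a single bit inside an otherwise legitimate control field, so the decoder can only report the category of the containing field.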
And here is another favorite of mine: the “Primary Execution Controls” field. The 32 bits specify which events in the guest will exit guest execution and trap into the hypervisor (Table 21-6). These events are, among others:
- exit on HLT
- exit on INVLPG
- exit on MOV CR3
- exit on PAUSE
Setting these bits to 1 enables the traps. So if you set all bits to 0, you basically have an unrestricted guest, and if you set all bits to 1, you have the most controlled guest, and you get a notification about every event in the guest. Or so you might think. Actually, there are two bits in the field that don’t work like this:
- Use MSR bitmaps
- Use I/O bitmaps
If these bits are set to 1, the CPU checks a whitelist to see whether a given MSR or I/O access is allowed through without trapping. If they are set to 0, all MSR and I/O accesses trap. Compared to all other bits, that’s backwards. Oh great.
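The inverted polarity is easy to see in a toy model. The sketch below covers only HLT and MSR accesses and is a simplification under stated assumptions: the bit positions (7 for HLT exiting, 28 for “Use MSR bitmaps”) are taken from the Intel manual rather than from this post, and `trapped_msrs` is a hypothetical stand-in for the real MSR bitmap page, which has a more involved layout.

```python
# Toy model of two bits in the Primary Execution Controls field.
# Assumption: bit positions as listed in the Intel manual.
HLT_EXITING = 1 << 7
USE_MSR_BITMAPS = 1 << 28

IA32_EFER = 0xC0000080  # example MSR index


def hlt_exits(controls):
    """Normal polarity: bit set means HLT traps into the hypervisor."""
    return bool(controls & HLT_EXITING)


def msr_access_exits(controls, trapped_msrs, msr):
    """Inverted polarity: bit clear means every MSR access traps.

    trapped_msrs is a hypothetical stand-in for the MSR bitmap: the set of
    MSR indices that should still trap when bitmaps are in use.
    """
    if not controls & USE_MSR_BITMAPS:
        return True
    return msr in trapped_msrs


all_zero, all_ones = 0x00000000, 0xFFFFFFFF
print(hlt_exits(all_zero), msr_access_exits(all_zero, set(), IA32_EFER))
print(hlt_exits(all_ones), msr_access_exits(all_ones, set(), IA32_EFER))
```

With all bits clear, HLT runs freely but every MSR access traps; with all bits set and an empty bitmap, HLT traps but MSR accesses go through. So neither extreme gives the “fully unrestricted” or “fully controlled” guest you would expect.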
Since Steve Jobs seems to be happy to explain his personal opinion on everything lately, I wrote him an email asking him about this, and he replied:
Return-path: <sjobs@apple.com>
Received: from bulkin002-bge351000.mac.com ([unknown] [10.150.69.129]) by ms231.mac.com (Sun Java(tm) System Messaging Server 7u3-12.01 64bit (built Oct 15 2009)) with ESMTP id <0L2X00HTAZ3Q6GF1@ms231.mac.com> for XXX@mac.com; Mon, 24 May 2010 13:47:50 -0700 (PDT)
Original-recipient: rfc822;XXX@mac.com
Received: from relay13.apple.com ([17.128.113.29]) by bulkin002.mac.com (Sun Java(tm) System Messaging Server 6.3-7.02 (built Jun 27 2008; 32bit)) with ESMTP id <0L2X001EVZ3QKED0@bulkin002.mac.com> for XXX@mac.com (ORCPT XXX@mac.com); Mon, 24 May 2010 13:47:50 -0700 (PDT)
X-AuditID: 1180721d-b7c17fe00000693e-19-4bfae5f6545a
Received: from [17.201.27.84] (using TLS with cipher AES128-SHA (AES128-SHA/128 bits)) (Client did not present a certificate) by relay13.apple.com (Apple SCV relay) with SMTP id DB.14.26942.6F6EAFB4; Mon, 24 May 2010 13:47:50 -0700 (PDT)
From: Steve Jobs <sjobs@apple.com>
Content-type: text/plain
Content-transfer-encoding: 7bit
Subject: Re: Intel VT VMCS Layout
Date: Mon, 24 May 2010 13:47:48 -0700
Message-id: <3E789F1B-7E13-FFD2-80F6-8E8D4CDDE7FB@apple.com>
To: Michael Steil <XXX@mac.com>
MIME-version: 1.0 (Apple Message framework v1077)
X-Mailer: Apple Mail (2.1077)
X-Brightmail-Tracker: AAAAAQAAAZE=

The whole VMCS is a big mess, I hate it.

> Hi Steve, what do you think about the ordering of the VMCS fields in
> Intel's VT extenions?
>
> Michael
It’s fake! Steve Jobs uses Panther!
I wonder how much Steve actually knows about hardware 🙂
I have a few questions: do you think that the consensus of the people in your profession is that Intel have both the fastest and most awful to program desktop processors? Putting the topic of the post aside, what do you think of the x86 instruction set in general? What architecture do you think has the most elegant instruction set? And would you welcome a complete switch to that from the mainstream x86?
Oh Steve knows plenty about hardware. He used to solder the Apple I’s himself, and gave the Mac team a hard time about chip layout on the motherboard. Wait, what’s that sniggering, isn’t this enough for you? Pfft.
As for Intel’s dirtiest yet most persistent gift to the world – the x86 ISA – it’s proof that elegance, alas, doesn’t count for much when it is but an implementation detail. Intel’s designs have been beaten on elegance numerous times, but what they do best is make fabs. Those are what steamrollered PowerPC along with several other rivals.
Thanks Michael, this post made my day 🙂
@John Muir:
Intel didn’t design the ISA bus.
Technically it’s no problem to use pull-down resistors (instead of pull-ups, whether discrete resistors or internal ones in the interrupt controller chip) on active-high IRQ lines, and then the whole IRQ setting/conflict issue would be solved.
(Actually the IDE/PATA interface (which basically is a derivative of a part of the ISA bus) on the Commodore Amiga 1200 and 4000 uses pull-down resistors on the IRQ line, and therefore doesn’t require any hardware to disable/enable the hard disk IRQ.)
All other ISA problems, like the lack of autoconfig, are independent of the CPU architecture.
@MiaM:
You’re kidding right? John’s talking about the x86 Instruction Set Architecture (aka instruction set), not the ISA Bus.
I wonder how long this thing will survive. Will it just grow and grow over hundreds of years? You can’t just replace it with something else (Itanium anyone?) as the base of existing programs, business apps, programmers, tools, processes and procedures built around x86 is unfathomably huge.
Even if virtualization makes the underlying meat DNA irrelevant, we’ll still be virtualizing an x86 so the world can run.
Even if DC and Marvel hooked up I don’t think they could come up with a super hero to save us from it.
I think jumping architectures will be more feasible with time even if there will also be more applications in native code. It’s not the 80’s any more, and porting code is more and more turning into a matter of just selecting another target architecture and clicking the build button, and less of having to write assembly for some specific architecture and hardware.
I would love to see ARM compete with Intel in desktop performance and break into that market.
@Paul:
s?x86?System/360?g
“Oh Steve knows plenty about hardware. ”
Yes, but what they are talking about is software. I mean, how familiar is Steve Jobs with x86?
The elegance of the Intel ISA is hidden. Elegance flows behind the scenes in the micro-ops.