Documentation/x86_64/mm.txt

The paging design used on the x86-64 linux kernel port in 2.4.x provides:

o       per process virtual address space limit of 512 Gigabytes
o       top of userspace stack located at address 0x0000007fffffffff
o       PAGE_OFFSET = 0xffff800000000000
o       start of the kernel = 0xffffffff80000000
o       global RAM per system 2^64-PAGE_OFFSET-sizeof(kernel) = 128 Terabytes - 2 Gigabytes
o       no need for any common code change
o       no need to use highmem to handle the 128 Terabytes of RAM
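
The figures in the list above can be cross-checked with a little
arithmetic. The sketch below is only an illustration: the constant values
come from the list, while the variable names are made up here and
"kernel_window" assumes that sizeof(kernel) means the whole area from the
start of the kernel to the top of the address space.

```python
# Values taken from the list above; all names are illustrative only.
PAGE_OFFSET = 0xffff800000000000
KERNEL_START = 0xffffffff80000000
USER_STACK_TOP = 0x0000007fffffffff

GIB = 1 << 30
TIB = 1 << 40

# Everything from PAGE_OFFSET up to the top of the 64-bit address space.
negative_half = (1 << 64) - PAGE_OFFSET
# Assumed meaning of sizeof(kernel): the area from KERNEL_START to the top.
kernel_window = (1 << 64) - KERNEL_START
# RAM that the direct mapping can cover.
max_ram = negative_half - kernel_window
# The stack top is the last byte of the per-process address space.
user_limit = USER_STACK_TOP + 1

print(negative_half // TIB, "TiB in the negative half")
print(kernel_window // GIB, "GiB reserved at the top")
print(user_limit // GIB, "GiB per process")
```

With the values above, max_ram is exactly 128 Terabytes minus 2 Gigabytes.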

Description:

        Userspace sees and is able to modify only the 3rd/2nd/1st level
        pagetables (pgd_offset() implicitly walks the 1st slot of the 4th
        level pagetable and returns an entry in the 3rd level pagetable).
        This is where the per-process 512 Gigabytes limit comes from.

        For the common code the pgd is the PDPE, the pmd is the PDE and
        the pte is the PTE. The PML4E remains invisible to the common
        code.
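
        As a rough illustration of where the 512 Gigabytes come from, the
        sketch below splits a virtual address into the indexes used at each
        level. It assumes the usual x86-64 layout of 4k pages and 9 index
        bits per level (512 eight-byte entries per table); the function and
        constant names are invented for this example and are not kernel
        interfaces.

```python
PAGE_SHIFT = 12       # 4k pages
BITS_PER_LEVEL = 9    # 512 eight-byte entries fill one 4k table


def split_user_address(vaddr):
    """Return the (pml4, pgd, pmd, pte, offset) indexes of a virtual address."""
    mask = (1 << BITS_PER_LEVEL) - 1
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pte = (vaddr >> PAGE_SHIFT) & mask
    pmd = (vaddr >> (PAGE_SHIFT + BITS_PER_LEVEL)) & mask
    pgd = (vaddr >> (PAGE_SHIFT + 2 * BITS_PER_LEVEL)) & mask
    pml4 = (vaddr >> (PAGE_SHIFT + 3 * BITS_PER_LEVEL)) & mask
    return pml4, pgd, pmd, pte, offset


# Address bits covered by the three levels that the common code walks.
USER_BITS = PAGE_SHIFT + 3 * BITS_PER_LEVEL
```

        USER_BITS is 39, so the three visible levels span 2^39 bytes.
        split_user_address(0x0000007fffffffff) gives (0, 511, 511, 511, 4095):
        the last entry of every visible level, still inside PML4 slot 0.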

        The kernel uses all the low 47 bits (bits 0-46) of the negative half
        of the virtual address space to build the direct mapping using
        2 Mbytes page size. The kernel virtual addresses have bit number
        47 always set to 1 (and in turn bits 48-63 are set to 1 too,
        due to sign extension). This is where the 128 Terabytes - 2 Gigabytes
        global limit of RAM comes from.
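
        A minimal sketch of the direct mapping described above, assuming
        it is a plain constant offset that ends where the kernel starts.
        The helper names are hypothetical; they only loosely mirror the
        role of the kernel __va()/__pa() macros.

```python
PAGE_OFFSET = 0xffff800000000000
KERNEL_START = 0xffffffff80000000
LARGE_PAGE_SIZE = 2 << 20   # the direct mapping uses 2 Mbytes pages


def phys_to_virt(paddr):
    """Direct-mapping virtual address of a physical address."""
    vaddr = PAGE_OFFSET + paddr
    if paddr < 0 or vaddr >= KERNEL_START:
        raise ValueError("physical address outside the direct mapping")
    return vaddr


def virt_to_phys(vaddr):
    """Inverse of phys_to_virt for a direct-mapping address."""
    return vaddr - PAGE_OFFSET


def is_kernel_address(vaddr):
    """True when bit 47 and the sign-extended bits 48-63 are all set."""
    return (vaddr >> 47) == 0x1ffff
```

        The last physical byte that passes the check is
        128 Terabytes - 2 Gigabytes - 1, which lands one byte below the
        start of the kernel.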

        Since the per-process limit is 512 Gigabytes (due to the 3 level
        pagetable limitation of the kernel common code), the highest virtual
        address mapped into userspace is 0x7fffffffff and it makes sense to
        use it as the top of the userspace stack to allow the stack to grow
        as much as possible.

        Setting PAGE_OFFSET to 2^39 (just after the last userspace
        virtual address) wouldn't make much difference compared to
        setting PAGE_OFFSET to 0xffff800000000000 because we have a
        hole in the virtual address space. The last byte mapped by
        slot 255 (counting from 0) of the 4th level pagetable is at virtual
        address 0x00007fffffffffff and the first byte mapped by slot 256 of
        the 4th level pagetable is at address 0xffff800000000000. Due to this
        hole we can't trivially build a direct mapping across all the
        512 slots of the 4th level pagetable, so we simply use only the
        second (negative) half of the 4th level pagetable for that purpose
        (that provides us with 128 Terabytes of contiguous virtual addresses).
        Strictly speaking we could build a direct mapping across the hole too,
        using some DISCONTIGMEM trick, but we don't need such a large
        direct mapping right now.
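
        The hole can be made visible by computing the address range mapped
        by each slot of the 4th level pagetable. This is only a sketch,
        assuming 48 implemented virtual address bits with bit 47 copied
        into bits 48-63; the names are invented for the example and slots
        are counted from 0.

```python
PML4_SHIFT = 39   # each 4th level slot maps 2^39 bytes (512 Gigabytes)
VA_BITS = 48      # implemented virtual address bits (assumption)


def sign_extend(vaddr):
    """Copy bit 47 into bits 48-63 to get a canonical 64-bit address."""
    if vaddr & (1 << (VA_BITS - 1)):
        vaddr |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return vaddr


def pml4_slot_range(slot):
    """First and last canonical address mapped by slot 0-511."""
    first = slot << PML4_SHIFT
    last = first + (1 << PML4_SHIFT) - 1
    return sign_extend(first), sign_extend(last)
```

        pml4_slot_range(255) ends at 0x00007fffffffffff while
        pml4_slot_range(256) starts at 0xffff800000000000, so consecutive
        slots are not consecutive in the virtual address space.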

Future:

        During 2.5.x we can break the 512 Gigabytes per-process limit,
        possibly by removing from the common code any knowledge of the
        architecture dependent physical layout of the virtual to physical
        mapping.

        Once the 512 Gigabytes limit is removed the userspace stack will
        be moved (most probably to virtual address 0x00007fffffffffff).
        Nothing will break in userspace due to that move, just as nothing
        breaks on IA32 when compiling the kernel with CONFIG_2G.

Linus agreed on not breaking common code and to live with the 512 Gigabytes
per-process limitation for the 2.4.x timeframe and he has given me and Andi
some very useful hints... (thanks! :)

Thanks also to H. Peter Anvin for his interesting and useful suggestions on
the x86-64-discuss lists!

Other memory management related issues follow:

PAGE_SIZE:

        If somebody is wondering why these days we still have such a small
        4k pagesize (16 or 32 kbytes would be much better for performance
        of course), the PAGE_SIZE has to remain 4k for 32bit apps to
        provide a 100% backwards compatible IA32 API (we can't allow silent
        fs corruption or at best a loss of coherency with the page cache
        by allocating MAP_SHARED areas in MAP_ANONYMOUS memory with a
        do_mmap_fake). I think it could be possible to have a dynamic page
        size between 32bit and 64bit apps but it would need extremely
        intrusive changes in the common code, first of all for the page
        cache, and we sure don't want to depend on them right now even if
        the hardware would support that.

PAGETABLE SIZE:

        In turn we can't afford to have pagetables larger than 4k because
        we might not be able to allocate them due to physical memory
        fragmentation, and failing to allocate the kernel stack is a minor
        issue compared to failing the allocation of a pagetable. If we
        fail the allocation of a pagetable the only thing we can do is to
        sched_yield polling the freelist (deadlock prone) or to segfault
        the task (not even the sighandler would be sure to run).

KERNEL STACK:

        1st stage:

        The kernel stack will be at first allocated with an order 2 allocation
        (16k) (the utilization of the stack on a 64bit platform really
        isn't exactly double that of a 32bit platform because the local
        variables may not all be 64bit wide, but it is not much less). This
        will make things even worse than they are right now on IA32 with
        respect to fork/clone failing due to memory fragmentation.

        2nd stage:

        We'll benchmark whether reserving one register as the task_struct
        pointer improves the performance of the kernel (instead of
        recalculating the task_struct pointer starting from the stack
        pointer each time). My guess is that recalculating will be faster
        but it is worth a try.
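
        For reference, "recalculating the task_struct pointer from the
        stack pointer" amounts to masking off the low bits of the stack
        pointer. The sketch below assumes the same layout as on IA32 (the
        task_struct sits at the bottom of a naturally aligned stack
        allocation) applied to the order 2 allocation of the 1st stage;
        the names are illustrative and not the actual port.

```python
PAGE_SIZE = 4096
STACK_ORDER = 2                          # order 2 allocation (1st stage)
THREAD_SIZE = PAGE_SIZE << STACK_ORDER   # 16k


def task_struct_from_sp(sp):
    """Round the stack pointer down to the start of its THREAD_SIZE block."""
    return sp & ~(THREAD_SIZE - 1)
```

        For example a stack pointer of 0xffff800000127f58 yields
        0xffff800000124000. Reserving a register would replace this single
        mask operation, which is why the gain is uncertain.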

                If reserving one register for the task_struct pointer
                turns out to be faster we can as well split the task_struct
                from the kernel stack. The task_struct can be a slab
                allocation or a PAGE_SIZEd allocation, and the kernel stack
                can then be allocated with an order 1 allocation. Really this
                is risky, since 8k on a 64bit platform is going to be less
                than 7k on a 32bit platform, but we could try it out. This
                would reduce the fragmentation problem by an order of
                magnitude, making it equal to the current IA32.

                We must also consider that x86-64 seems to provide in hardware
                a per-irq stack that could allow us to remove the irq handler
                footprint from the regular per-process stack, so it could allow
                us to live with a smaller kernel stack compared to the other
                linux architectures.

        3rd stage:

        Before going into production, if we still have the order 2
        allocation we can add a sysctl that allows the kernel stack to be
        allocated with vmalloc when memory is fragmented. This has to
        remain turned off during benchmarks :) but it should be ok in real
        life.

Order of PAGE_CACHE_SIZE and other allocations:

        In the long run we can increase PAGE_CACHE_SIZE to be
        an order 2 allocation, and the slab/buffercache etc.
        could all be done with order 2 allocations as well. To make the
        above work we would have to change lots of common code, so it can
        be done only once the basic port is in a production state. Having
        a working PAGE_CACHE_SIZE would be a benefit also for
        IA32 and other architectures of course.

Andrea <andrea@suse.de> SuSE