X-Git-Url: http://git.onelab.eu/?a=blobdiff_plain;f=Documentation%2FDocBook%2Fkernel-hacking.tmpl;h=582032eea87228352079b8b7002117151a79cc41;hb=43bc926fffd92024b46cafaf7350d669ba9ca884;hp=49a9ef82d575fe99c60fe1ad4b382cf221a2ce86;hpb=cee37fe97739d85991964371c1f3a745c00dd236;p=linux-2.6.git

diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef82d..582032eea 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@
- Paul
- Rusty
+ Rusty
 Russell
@@ -20,7 +19,7 @@
- 2001
+ 2005
 Rusty Russell
@@ -64,7 +63,7 @@
 Introduction
- Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
+ Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
 Kernel Hacking. This document describes the common routines and general requirements for kernel code: its goal is to serve as a primer for Linux kernel development for experienced C
@@ -96,13 +95,13 @@
- not associated with any process, serving a softirq, tasklet or bh;
+ not associated with any process, serving a softirq or tasklet;
- running in kernel space, associated with a process;
+ running in kernel space, associated with a process (user context);
@@ -114,11 +113,12 @@
- There is a strict ordering between these: other than the last
- category (userspace) each can only be pre-empted by those above.
- For example, while a softirq is running on a CPU, no other
- softirq will pre-empt it, but a hardware interrupt can. However,
- any other CPUs in the system execute independently.
+ There is an ordering between these. The bottom two can preempt
+ each other, but above that is a strict hierarchy: each can only be
+ preempted by the ones above it. For example, while a softirq is
+ running on a CPU, no other softirq will preempt it, but a hardware
+ interrupt can. However, any other CPUs in the system execute
+ independently.
@@ -130,10 +130,10 @@
 User Context
- User context is when you are coming in from a system call or
- other trap: you can sleep, and you own the CPU (except for
- interrupts) until you call schedule().
- In other words, user context (unlike userspace) is not pre-emptable.
+ User context is when you are coming in from a system call or other
+ trap: like userspace, you can be preempted by more important tasks
+ and by interrupts. You can sleep, by calling
+ schedule().
@@ -153,7 +153,7 @@
- Beware that if you have interrupts or bottom halves disabled
+ Beware that if you have preemption or softirqs disabled
 (see below), in_interrupt() will return a false positive.
@@ -168,10 +168,10 @@
 keyboard are examples of real hardware which produce interrupts at any time. The kernel runs interrupt handlers, which services the hardware. The kernel
- guarantees that this handler is never re-entered: if another
+ guarantees that this handler is never re-entered: if the same
 interrupt arrives, it is queued (or dropped). Because it disables interrupts, this handler has to be fast: frequently it
- simply acknowledges the interrupt, marks a `software interrupt'
+ simply acknowledges the interrupt, marks a 'software interrupt'
 for execution and exits.
@@ -188,60 +188,52 @@
- Software Interrupt Context: Bottom Halves, Tasklets, softirqs
+ Software Interrupt Context: Softirqs and Tasklets
 Whenever a system call is about to return to userspace, or a
- hardware interrupt handler exits, any `software interrupts'
+ hardware interrupt handler exits, any 'software interrupts'
 which are marked pending (usually by hardware interrupts) are run (kernel/softirq.c).
 Much of the real interrupt handling work is done here. Early in
- the transition to SMP, there were only `bottom
+ the transition to SMP, there were only 'bottom
 halves' (BHs), which didn't take advantage of multiple CPUs. Shortly after we switched from wind-up computers made of match-sticks and snot,
- we abandoned this limitation.
+ we abandoned this limitation and switched to 'softirqs'.
 include/linux/interrupt.h lists the
- different BH's. No matter how many CPUs you have, no two BHs will run at
- the same time. This made the transition to SMP simpler, but sucks hard for
- scalable performance. A very important bottom half is the timer
- BH (include/linux/timer.h): you
- can register to have it call functions for you in a given length of time.
+ different softirqs.
A very important softirq is the
+ timer softirq (include/linux/timer.h): you can
+ register to have it call functions for you in a given length of
+ time.
- 2.3.43 introduced softirqs, and re-implemented the (now
- deprecated) BHs underneath them. Softirqs are fully-SMP
- versions of BHs: they can run on as many CPUs at once as
- required. This means they need to deal with any races in shared
- data using their own locks. A bitmask is used to keep track of
- which are enabled, so the 32 available softirqs should not be
- used up lightly. (Yes, people will
- notice).
-
-
-
- tasklets (include/linux/interrupt.h)
- are like softirqs, except they are dynamically-registrable (meaning you
- can have as many as you want), and they also guarantee that any tasklet
- will only run on one CPU at any time, although different tasklets can
- run simultaneously (unlike different BHs).
+ Softirqs are often a pain to deal with, since the same softirq
+ will run simultaneously on more than one CPU. For this reason,
+ tasklets (include/linux/interrupt.h) are more
+ often used: they are dynamically-registrable (meaning you can have
+ as many as you want), and they also guarantee that any tasklet
+ will only run on one CPU at any time, although different tasklets
+ can run simultaneously.
- The name `tasklet' is misleading: they have nothing to do with `tasks',
+ The name 'tasklet' is misleading: they have nothing to do with 'tasks',
 and probably more to do with some bad vodka Alexey Kuznetsov had at the time.
- You can tell you are in a softirq (or bottom half, or tasklet)
+ You can tell you are in a softirq (or tasklet)
 using the in_softirq() macro (include/linux/interrupt.h).
@@ -288,11 +280,10 @@
 A rigid stack limit
- The kernel stack is about 6K in 2.2 (for most
- architectures: it's about 14K on the Alpha), and shared
- with interrupts so you can't use it all. Avoid deep
- recursion and huge local arrays on the stack (allocate
- them dynamically instead).
+ Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
+ about 14K on most 64-bit archs, and often shared with interrupts
+ so you can't use it all. Avoid deep recursion and huge local
+ arrays on the stack (allocate them dynamically instead).
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
 If all your routine does is read or write some parameter, consider
- implementing a sysctl interface instead.
+ implementing a sysfs interface instead.
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
- You will eventually lock up your box if you break these rules.
+ You should always compile your kernel
+ CONFIG_DEBUG_SPINLOCK_SLEEP on, and it will warn
+ you if you break these rules. If you do break
+ the rules, you will eventually lock up your box.
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 success).
- [Yes, this moronic interface makes me cringe. Please submit a
- patch and become my hero --RR.]
+ [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
 The functions may sleep implicitly. This should never be called
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- If you see a kmem_grow: Called nonatomically from int
- warning message you called a memory allocation function
- from interrupt context without GFP_ATOMIC.
- You should really fix that. Run, don't walk.
+ If you see a sleeping function called from invalid
+ context warning message, then maybe you called a
+ sleeping allocation function from interrupt context without
+ GFP_ATOMIC. You should really fix that.
+ Run, don't walk.
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- <function>udelay()</function>/<function>mdelay()</function>
+ <title><function>mdelay()</function>/<function>udelay()</function>
 <filename class="headerfile">include/asm/delay.h</filename>
 <filename class="headerfile">include/linux/delay.h</filename>
- The udelay() function can be used for small pauses.
- Do not use large values with udelay() as you risk
+ The udelay() and ndelay() functions can be used for small pauses.
+ Do not use large values with them as you risk
 overflow - the helper function mdelay() is useful
- here, or even consider schedule_timeout().
+ here, or consider msleep().
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 These routines disable soft interrupts on the local CPU, and restore them. They are reentrant; if soft interrupts were disabled before, they will still be disabled after this pair
- of functions has been called. They prevent softirqs, tasklets
- and bottom halves from running on the current CPU.
+ of functions has been called. They prevent softirqs and tasklets
+ from running on the current CPU.
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/asm/smp.h
- smp_processor_id() returns the current
- processor number, between 0 and NR_CPUS (the
- maximum number of CPUs supported by Linux, currently 32). These
- values are not necessarily continuous.
+ get_cpu() disables preemption (so you won't
+ suddenly get moved to another CPU) and returns the current
+ processor number, between 0 and NR_CPUS. Note
+ that the CPU numbers are not necessarily continuous. You return
+ it again with put_cpu() when you are done.
+
+
+ If you know you cannot be preempted by another task (ie. you are
+ in interrupt context, or have preemption disabled) you can use
+ smp_processor_id().
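Reviewer's sketch, not part of the patch: the get_cpu()/put_cpu() text added in the hunk above describes a pairing, "disable preemption and read the CPU number", then "hand it back". The toy userspace model below only imitates that pairing with a plain counter so the calling pattern is visible. Every `sketch_` name is hypothetical, nothing here is kernel API, and nothing here actually disables preemption.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

/* Toy stand-ins for kernel state; all names are invented for this sketch. */
static int sketch_preempt_count;        /* > 0 models "preemption disabled" */
static int sketch_current_cpu = 2;      /* pretend we are running on CPU 2 */
static unsigned long sketch_hits[SKETCH_NR_CPUS];

/* Models get_cpu(): pin ourselves first, then read the CPU number. */
static int sketch_get_cpu(void)
{
	sketch_preempt_count++;
	return sketch_current_cpu;
}

/* Models put_cpu(): must balance exactly one earlier sketch_get_cpu(). */
static void sketch_put_cpu(void)
{
	assert(sketch_preempt_count > 0);
	sketch_preempt_count--;
}

/* Typical shape of a caller: touch per-CPU data only between the pair. */
static unsigned long sketch_count_hit(void)
{
	int cpu = sketch_get_cpu();
	unsigned long now = ++sketch_hits[cpu];

	sketch_put_cpu();
	return now;
}
```

The point of the pattern is that the CPU number is only trustworthy between the two calls; in this model that window is the span where `sketch_preempt_count` is non-zero.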
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 After boot, the kernel frees up a special section; functions marked with __init and data structures marked with
- __initdata are dropped after boot is complete (within
- modules this directive is currently ignored). __exit
+ __initdata are dropped after boot is complete: similarly
+ modules discard this memory after initialization. __exit
 is used to declare a function which is only required on exit: the function will be dropped if this file is not compiled as a module. See the header file for use. Note that it makes no sense for a function marked with __init to be exported to modules with EXPORT_SYMBOL() - this will break.
-
- Static data structures marked as __initdata must be initialised
- (as opposed to ordinary static data which is zeroed BSS) and cannot be
- const.
-
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 The function can return a negative error number to cause module loading to fail (unfortunately, this has no effect if
- the module is compiled into the kernel). For modules, this is
- called in user context, with interrupts enabled, and the
- kernel lock held, so it can sleep.
+ the module is compiled into the kernel). This function is
+ called in user context with interrupts enabled, so it can sleep.
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 reached zero. This function can also sleep, but cannot fail: everything must be cleaned up by the time it returns.
+
+
+ Note that this macro is optional: if it is not present, your
+ module will not be removable (except for 'rmmod -f').
+
+
+
+
+ <function>try_module_get()</function>/<function>module_put()</function>
+ <filename class="headerfile">include/linux/module.h</filename>
+
+
+ These manipulate the module usage count, to protect against
+ removal (a module also can't be removed if another module uses one
+ of its exported symbols: see below).
Before calling into module
+ code, you should call try_module_get() on
+ that module: if it fails, then the module is being removed and you
+ should act as if it wasn't there. Otherwise, you can safely enter
+ the module, and call module_put() when you're
+ finished.
+
+
+
+ Most registerable structures have an
+ owner field, such as in the
+ file_operations structure. Set this field
+ to the macro THIS_MODULE.
+
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 There is a macro to do this: wait_event_interruptible()
- include/linux/sched.h The
+ include/linux/wait.h The
 first argument is the wait queue head, and the second is an expression which is evaluated; the macro returns 0 when this expression is true, or
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 Call wake_up()
- include/linux/sched.h;,
+ include/linux/wait.h;,
 which will wake up every process in the queue. The exception is if one has TASK_EXCLUSIVE set, in which case
- the remainder of the queue will not be woken.
+ the remainder of the queue will not be woken. There are other variants
+ of this basic function available in the same header.
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 first class of operations work on atomic_t include/asm/atomic.h; this
- contains a signed integer (at least 24 bits long), and you must use
+ contains a signed integer (at least 32 bits long), and you must use
 these functions to manipulate or read atomic_t variables. atomic_read() and atomic_set() get and set the counter,
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 Note that these functions are slower than normal arithmetic, and
- so should not be used unnecessarily. On some platforms they
- are much slower, like 32-bit Sparc where they use a spinlock.
+ so should not be used unnecessarily.
- The second class of atomic operations is atomic bit operations on a
- long, defined in
+ The second class of atomic operations is atomic bit operations on an
+ unsigned long, defined in
 include/linux/bitops.h. These operations generally take a pointer to the bit pattern, and a bit
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 test_and_clear_bit() and test_and_change_bit() do the same thing, except return true if the bit was previously set; these are
- particularly useful for very simple locking.
+ particularly useful for atomically setting flags.
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 than BITS_PER_LONG. The resulting behavior is strange on big-endian platforms though so it is a good idea not to do this.
-
-
- Note that the order of bits depends on the architecture, and in
- particular, the bitfield passed to these operations must be at
- least as large as a long.
-
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/linux/module.h
- This is the classic method of exporting a symbol, and it works
- for both modules and non-modules. In the kernel all these
- declarations are often bundled into a single file to help
- genksyms (which searches source files for these declarations).
- See the comment on genksyms and Makefiles below.
+ This is the classic method of exporting a symbol: dynamically
+ loaded modules will be able to use the symbol as normal.
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 symbols exported by EXPORT_SYMBOL_GPL() can only be seen by modules with a MODULE_LICENSE() that specifies a GPL
- compatible license.
+ compatible license. It implies that the function is considered
+ an internal implementation issue, and not really an interface.
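Reviewer's sketch, not part of the patch: the bit-operation hunks above say that the test_and_*_bit() family works on an unsigned long bit pattern, returns true if the bit was previously set, and is useful for atomically setting flags. A userspace analogue of those semantics can be written with C11 atomics. The `sketch_` names are hypothetical and this is not how the kernel implements them; it only shows the "modify and report the old value in one step" behaviour.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Atomically set bit nr in the pattern at addr; report whether it was set. */
static bool sketch_test_and_set_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long old = atomic_fetch_or(&addr[nr / SKETCH_BITS_PER_LONG], mask);

	return (old & mask) != 0;
}

/* Atomically clear bit nr in the pattern at addr; report whether it was set. */
static bool sketch_test_and_clear_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long old = atomic_fetch_and(&addr[nr / SKETCH_BITS_PER_LONG], ~mask);

	return (old & mask) != 0;
}

/* A one-word flag set, zero-initialised because it has static storage. */
static _Atomic unsigned long sketch_flags;
```

Because the old value comes back from the same atomic operation that changes the bit, only one of several racing callers of `sketch_test_and_set_bit()` sees `false`, which is what makes the pattern usable for claiming a flag.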
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/linux/list.h
- There are three sets of linked-list routines in the kernel
- headers, but this one seems to be winning out (and Linus has
- used it). If you don't have some particular pressing need for
- a single list, it's a good choice. In fact, I don't care
- whether it's a good choice or not, just use it so we can get
- rid of the others.
+ There used to be three sets of linked-list routines in the kernel
+ headers, but this one is the winner. If you don't have some
+ particular pressing need for a single list, it's a good choice.
+
+
+
+ In particular, list_for_each_entry is useful.
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 convention, and return 0 for success, and a negative error number (eg. -EFAULT) for failure. This can be
- unintuitive at first, but it's fairly widespread in the networking
- code, for example.
+ unintuitive at first, but it's fairly widespread in the kernel.
- The filesystem code uses ERR_PTR()
+ Using ERR_PTR()
- include/linux/fs.h; to
+ include/linux/err.h; to
 encode a negative error number into a pointer, and IS_ERR() and PTR_ERR() to get it back out again: avoids a separate pointer parameter for
@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
 supported, due to lack of general use, but the following are considered standard (see the GCC info page section "C Extensions" for more details - Yes, really the info page, the
- man page is only a short summary of the stuff in info):
+ man page is only a short summary of the stuff in info).
@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
- Function names as strings (__FUNCTION__)
+ Function names as strings (__FUNCTION__).
@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
 Usually you want a configuration option for your kernel hack.
- Edit Config.in in the appropriate directory
- (but under arch/ it's called
- config.in). The Config Language used is not
- bash, even though it looks like bash; the safe way is to use only
- the constructs that you already see in
- Config.in files (see
- Documentation/kbuild/kconfig-language.txt).
- It's good to run "make xconfig" at least once to test (because
- it's the only one with a static parser).
-
-
-
- Variables which can be Y or N use bool followed by a
- tagline and the config define name (which must start with
- CONFIG_). The tristate function is the same, but
- allows the answer M (which defines
- CONFIG_foo_MODULE in your source, instead of
- CONFIG_FOO) if CONFIG_MODULES
- is enabled.
+ Edit Kconfig in the appropriate directory.
+ The Config language is simple to use by cut and paste, and there's
+ complete documentation in
+ Documentation/kbuild/kconfig-language.txt.
 You may well want to make your CONFIG option only visible if CONFIG_EXPERIMENTAL is enabled: this serves as a warning to users. There many other fancy things you can do: see
- the various Config.in files for ideas.
+ the various Kconfig files for ideas.
-
-
- Edit the Makefile: the CONFIG variables are
- exported here so you can conditionalize compilation with `ifeq'.
- If your file exports symbols then add the names to
- export-objs so that genksyms will find them.
-
-
- There is a restriction on the kernel build system that objects
- which export symbols must have globally unique names.
- If your object does not have a globally unique name then the
- standard fix is to move the
- EXPORT_SYMBOL() statements to their own
- object with a unique name.
- This is why several systems have separate exporting objects,
- usually suffixed with ksyms.
-
-
+ In your description of the option, make sure you address both the
+ expert user and the user who knows nothing about your feature. Mention
+ incompatibilities and issues here.
Definitely
+ end your description with if in doubt, say N
+ (or, occasionally, `Y'); this is for people who have no
+ idea what you are talking about.
- Document your option in Documentation/Configure.help. Mention
- incompatibilities and issues here. Definitely
- end your description with if in doubt, say N
- (or, occasionally, `Y'); this is for people who have no
- idea what you are talking about.
+ Edit the Makefile: the CONFIG variables are
+ exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
+ xxx.o" line. The syntax is documented in
+ Documentation/kbuild/makefiles.txt.
@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
- include/linux/brlock.h:
+ include/asm-i386/delay.h:
-extern inline void br_read_lock (enum brlock_indices idx)
-{
-    /*
-     * This causes a link-time bug message if an
-     * invalid index is used:
-     */
-    if (idx >= __BR_END)
-        __br_lock_usage_bug();
-
-    read_lock(&__brlock_array[smp_processor_id()][idx]);
-}
+#define ndelay(n) (__builtin_constant_p(n) ? \
+ ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
+ __ndelay(n))
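Reviewer's sketch, not part of the patch: the ndelay() macro in the last hunk uses GCC's `__builtin_constant_p()` to pick one path for compile-time constants and another for run-time values. The userspace imitation below keeps the same shape so the dispatch can be observed. All `sketch_` names are hypothetical stand-ins. One deliberate difference, stated as an assumption: the removed br_read_lock() comment speaks of a "link-time bug message", which suggests the kernel's `__bad_*` helpers are left undefined on purpose; here the helper is defined and aborts, so the sketch links at any optimisation level.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the constant path: returns the scaled value it was given. */
static unsigned long sketch_const_delay(unsigned long loops)
{
	return loops;
}

/* Stand-in for the run-time path: returns the raw nanosecond count. */
static unsigned long sketch_runtime_delay(unsigned long ns)
{
	return ns;
}

/* Stand-in for the "oversized constant" trap; defined here so we can link. */
static unsigned long sketch_bad_delay(void)
{
	abort();
}

/* Same structure as the ndelay() macro in the hunk above. */
#define sketch_ndelay(n) (__builtin_constant_p(n) ? \
	((n) > 20000 ? sketch_bad_delay() : sketch_const_delay((n) * 5ul)) : \
	sketch_runtime_delay(n))

/* A volatile read is never a compile-time constant, so this always takes
 * the run-time branch regardless of optimisation level. */
static unsigned long sketch_delay_at_runtime(unsigned long ns)
{
	volatile unsigned long v = ns;

	return sketch_ndelay(v);
}
```

A literal argument such as `sketch_ndelay(100)` goes through the constant branch and comes back scaled by 5, while `sketch_delay_at_runtime(100)` comes back unscaled, which makes the two paths easy to tell apart.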