DRAFT DRAFT DRAFT WORK IN PROGRESS DRAFT DRAFT DRAFT This is work in progress and likely to change. Roland McGrath --- User Debugging Data & Event Rendezvous ---- --------- ---- - ----- ---------- See linux/utrace.h for all the declarations used here. See also linux/tracehook.h for the utrace_regset declarations. The UTRACE is infrastructure code for tracing and controlling user threads. This is the foundation for writing tracing engines, which can be loadable kernel modules. The UTRACE interfaces provide three basic facilities: * Thread event reporting Tracing engines can request callbacks for events of interest in the thread: signals, system calls, exit, exec, clone, etc. * Core thread control Tracing engines can prevent a thread from running (keeping it in TASK_TRACED state), or make it single-step or block-step (when hardware supports it). Engines can cause a thread to abort system calls, they change the behaviors of signals, and they can inject signal-style actions at will. * Thread machine state access Tracing engines can read and write a thread's registers and similar per-thread CPU state. Tracing engines ------- ------- The basic actors in UTRACE are the thread and the tracing engine. A tracing engine is some body of code that calls into the utrace_* interfaces, represented by a struct utrace_engine_ops. (Usually it's a kernel module, though the legacy ptrace support is a tracing engine that is not in a kernel module.) The UTRACE interface operates on individual threads (struct task_struct). If an engine wants to treat several threads as a group, that is up to its higher-level code. Using the UTRACE starts out by attaching an engine to a thread. struct utrace_attached_engine * utrace_attach(struct task_struct *target, int flags, const struct utrace_engine_ops *ops, unsigned long data); Calling utrace_attach is what sets up a tracing engine to trace a thread. Use UTRACE_ATTACH_CREATE in flags, and pass your engine's ops. Check the return value with IS_ERR. If successful, it returns a struct pointer that is the handle used in all other utrace_* calls. The data argument is stored in the utrace_attached_engine structure, for your code to use however it wants. int utrace_detach(struct task_struct *target, struct utrace_attached_engine *engine); The utrace_detach call removes an engine from a thread. No more callbacks will be made after this returns success. An attached engine does nothing by default. An engine makes something happen by setting its flags. int utrace_set_flags(struct task_struct *target, struct utrace_attached_engine *engine, unsigned long flags); The synchronization issues related to these two calls are discussed further below in "Teardown Races". Action Flags ------ ----- There are two kinds of flags that an attached engine can set: event flags, and action flags. Event flags register interest in particular events; when an event happens and an engine has the right event flag set, it gets a callback. Action flags change the normal behavior of the thread. The action flags available are: UTRACE_ACTION_QUIESCE The thread will stay quiescent (see below). As long as any engine asserts the QUIESCE action flag, the thread will not resume running in user mode. (Usually it will be in TASK_TRACED state.) Nothing will wake the thread up except for SIGKILL (and implicit SIGKILLs such as a core dump in another thread sharing the same address space, or a group exit, fatal signal, or exec in another thread in the same thread group). UTRACE_ACTION_SINGLESTEP When the thread runs, it will run one instruction and then trap. (Exiting a system call or entering a signal handler is considered "an instruction" for this.) This is available on most machines. This can be used only if ARCH_HAS_SINGLE_STEP is #define'd by and evaluates to nonzero. UTRACE_ACTION_BLOCKSTEP When the thread runs, it will run until the next branch taken, and then trap. (Exiting a system call or entering a signal handler is considered taking a branch for this.) When the SINGLESTEP flag is set, BLOCKSTEP has no effect. This is only available on some machines. This can be used only if ARCH_HAS_BLOCK_STEP is #define'd by and evaluates to nonzero. UTRACE_ACTION_NOREAP When the thread exits or stops for job control, its parent process will not receive a SIGCHLD and the parent's wait calls will not wake up or report the child as dead. Even a self-reaping thread will remain a zombie. Note that this cannot prevent the reaping done when an exec is done by another thread in the same thread group; in that event, a REAP event (and callback if requested) will happen regardless of this flag. A well-behaved tracing engine does not want to interfere with the parent's normal notifications. This is provided mainly for the ptrace compatibility code to implement the traditional behavior. Event flags are specified using the macro UTRACE_EVENT(TYPE). Each event type is associated with a report_* callback in struct utrace_engine_ops. A tracing engine can leave unused callbacks NULL. The only callbacks required are those used by the event flags it sets. Many engines can be attached to each thread. When a thread has an event, each engine gets a report_* callback if it has set the event flag for that event type. Engines are called in the order they attached. Each callback takes arguments giving the details of the particular event. The first two arguments two every callback are the struct utrace_attached_engine and struct task_struct pointers for the engine and the thread producing the event. Usually this will be the current thread that is running the callback functions. The return value of report_* callbacks is a bitmask. Some bits are common to all callbacks, and some are particular to that callback and event type. The value zero (UTRACE_ACTION_RESUME) always means the simplest thing: do what would have happened with no tracing engine here. These are the flags that can be set in any report_* return value: UTRACE_ACTION_NEWSTATE Update the action state flags, described above. Those bits from the return value (UTRACE_ACTION_STATE_MASK) replace those bits in the engine's flags. This has the same effect as calling utrace_set_flags, but is a more efficient short-cut. To change the event flags, you must call utrace_set_flags. UTRACE_ACTION_DETACH Detach this engine. This has the effect of calling utrace_detach, but is a more efficient short-cut. UTRACE_ACTION_HIDE Hide this event from other tracing engines. This is only appropriate to do when the event was induced by some action of this engine, such as a breakpoint trap. Some events cannot be hidden, since every engine has to know about them: exit, death, reap. The return value bits in UTRACE_ACTION_OP_MASK indicate a change to the normal behavior of the event taking place. If zero, the thread does whatever that event normally means. For report_signal, other values control the disposition of the signal. Quiescence ---------- To control another thread and access its state, it must be "quiescent". This means that it is stopped and won't start running again while we access it. A quiescent thread is stopped in a place close to user mode, where the user state can be accessed safely; either it's about to return to user mode, or it's just entered the kernel from user mode, or it has already finished exiting (EXIT_ZOMBIE). Setting the UTRACE_ACTION_QUIESCE action flag will force the attached thread to become quiescent soon. After setting the flag, an engine must wait for an event callback when the thread becomes quiescent. The thread may be running on another CPU, or may be in an uninterruptible wait. When it is ready to be examined, it will make callbacks to engines that set the UTRACE_EVENT(QUIESCE) event flag. As long as some engine has UTRACE_ACTION_QUIESCE set, then the thread will remain stopped. SIGKILL will wake it up, but it will not run user code. When the flag is cleared via utrace_set_flags or a callback return value, the thread starts running again. (See also "Teardown Races", below.) During the event callbacks (report_*), the thread in question makes the callback from a safe place. It is not quiescent, but it can safely access its own state. Callbacks can access thread state directly without setting the QUIESCE action flag. If a callback does want to prevent the thread from resuming normal execution, it *must* use the QUIESCE action state rather than simply blocking; see "Core Events & Callbacks", below. Thread control ------ ------- These calls must be made on a quiescent thread (or the current thread): int utrace_inject_signal(struct task_struct *target, struct utrace_attached_engine *engine, u32 action, siginfo_t *info, const struct k_sigaction *ka); Cause a specified signal delivery in the target thread. This is not like kill, which generates a signal to be dequeued and delivered later. Injection directs the thread to deliver a signal now, before it next resumes in user mode or dequeues any other pending signal. It's as if the tracing engine intercepted a signal event and its report_signal callback returned the action argument as its value (see below). The info and ka arguments serve the same purposes as their counterparts in a report_signal callback. const struct utrace_regset * utrace_regset(struct task_struct *target, struct utrace_attached_engine *engine, const struct utrace_regset_view *view, int which); Get access to machine state for the thread. The struct utrace_regset_view indicates a view of machine state, corresponding to a user mode architecture personality (such as 32-bit or 64-bit versions of a machine). The which argument selects one of the register sets available in that view. The utrace_regset call must be made before accessing any machine state, each time the thread has been running and has then become quiescent. It ensures that the thread's state is ready to be accessed, and returns the struct utrace_regset giving its accessor functions. XXX needs front ends for argument checks, export utrace_native_view Core Events & Callbacks ---- ------ - --------- Event reporting callbacks have details particular to the event type, but are all called in similar environments and have the same constraints. Callbacks are made from safe spots, where no locks are held, no special resources are pinned, and the user-mode state of the thread is accessible. So, callback code has a pretty free hand. But to be a good citizen, callback code should never block for long periods. It is fine to block in kmalloc and the like, but never wait for i/o or for user mode to do something. If you need the thread to wait, set UTRACE_ACTION_QUIESCE and return from the callback quickly. When your i/o finishes or whatever, you can use utrace_set_flags to resume the thread. Well-behaved callbacks are important to maintain two essential properties of the interface. The first of these is that unrelated tracing engines not interfere with each other. If your engine's event callback does not return quickly, then another engine won't get the event notification in a timely manner. The second important property is that tracing be as noninvasive as possible to the normal operation of the system overall and of the traced thread in particular. That is, attached tracing engines should not perturb a thread's behavior, except to the extent that changing its user-visible state is explicitly what you want to do. (Obviously some perturbation is unavoidable, primarily timing changes, ranging from small delays due to the overhead of tracing, to arbitrary pauses in user code execution when a user stops a thread with a debugger for examination. When doing asynchronous utrace_attach to a thread doing a system call, more troublesome side effects are possible.) Even when you explicitly want the pertrubation of making the traced thread block, just blocking directly in your callback has more unwanted effects. For example, the CLONE event callbacks are called when the new child thread has been created but not yet started running; the child can never be scheduled until the CLONE tracing callbacks return. (This allows engines tracing the parent to attach to the child.) If a CLONE event callback blocks the parent thread, it also prevents the child thread from running (even to process a SIGKILL). If what you want is to make both the parent and child block, then use utrace_attach on the child and then set the QUIESCE action state flag on both threads. A more crucial problem with blocking in callbacks is that it can prevent SIGKILL from working. A thread that is blocking due to UTRACE_ACTION_QUIESCE will still wake up and die immediately when sent a SIGKILL, as all threads should. Relying on the utrace infrastructure rather than on private synchronization calls in event callbacks is an important way to help keep tracing robustly noninvasive. EVENT(REAP) Dead thread has been reaped Callback: void (*report_reap)(struct utrace_attached_engine *engine, struct task_struct *tsk); This means the parent called wait, or else this was a detached thread or a process whose parent ignores SIGCHLD. This cannot happen while the UTRACE_ACTION_NOREAP flag is set. This is the only callback you are guaranteed to get (if you set the flag; but see "Teardown Races", below). Unlike other callbacks, this can be called from the parent's context rather than from the traced thread itself--it must not delay the parent by blocking. This callback is different from all others, it returns void. Once you get this callback, your engine is automatically detached and you cannot access this thread or use this struct utrace_attached_engine handle any longer. This is the place to clean up your data structures and synchronize with your code that might try to make utrace_* calls using this engine data structure. The struct is still valid during this callback, but will be freed soon after it returns (via RCU). In all other callbacks, the return value is as described above. The common UTRACE_ACTION_* flags in the return value are always observed. Unless otherwise specified below, other bits in the return value are ignored. EVENT(QUIESCE) Thread is quiescent Callback: u32 (*report_quiesce)(struct utrace_attached_engine *engine, struct task_struct *tsk); This is the least interesting callback. It happens at any safe spot, including after any other event callback. This lets the tracing engine know that it is safe to access the thread's state, or to report to users that it has stopped running user code. EVENT(CLONE) Thread is creating a child Callback: u32 (*report_clone)(struct utrace_attached_engine *engine, struct task_struct *parent, unsigned long clone_flags, struct task_struct *child); A clone/clone2/fork/vfork system call has succeeded in creating a new thread or child process. The new process is fully formed, but not yet running. During this callback, other tracing engines are prevented from using utrace_attach asynchronously on the child, so that engines tracing the parent get the first opportunity to attach. After this callback returns, the child will start and the parent's system call will return. If CLONE_VFORK is set, the parent will block before returning. EVENT(VFORK_DONE) Finished waiting for CLONE_VFORK child Callback: u32 (*report_vfork_done)(struct utrace_attached_engine *engine, struct task_struct *parent, pid_t child_pid); Event reported for parent using CLONE_VFORK or vfork system call. The child has died or exec'd, so the vfork parent has unblocked and is about to return child_pid. UTRACE_EVENT(EXEC) Completed exec Callback: u32 (*report_exec)(struct utrace_attached_engine *engine, struct task_struct *tsk, const struct linux_binprm *bprm, struct pt_regs *regs); An execve system call has succeeded and the new program is about to start running. The initial user register state is handy to be tweaked directly, or utrace_regset can be used for full machine state access. UTRACE_EVENT(EXIT) Thread is exiting Callback: u32 (*report_exit)(struct utrace_attached_engine *engine, struct task_struct *tsk, long orig_code, long *code); The thread is exiting and cannot be prevented from doing so, but all its state is still live. The *code value will be the wait result seen by the parent, and can be changed by this engine or others. The orig_code value is the real status, not changed by any tracing engine. UTRACE_EVENT(DEATH) Thread has finished exiting Callback: u32 (*report_death)(struct utrace_attached_engine *engine, struct task_struct *tsk); The thread is really dead now. If the UTRACE_ACTION_NOREAP flag remains set after this callback, it remains an unreported zombie; If the flag was not set already, then it is too late to set it now--its parent has already been sent SIGCHLD. Otherwise, it might be reaped by its parent, or self-reap immediately. Though the actual reaping may happen in parallel, a report_reap callback will always be ordered after a report_death callback. UTRACE_EVENT(SYSCALL_ENTRY) Thread has entered kernel for a system call Callback: u32 (*report_syscall_entry)(struct utrace_attached_engine *engine, struct task_struct *tsk, struct pt_regs *regs); The system call number and arguments can be seen and modified in the registers. The return value register has -ENOSYS, which will be returned for an invalid system call. The macro tracehook_abort_syscall(regs) will abort the system call so that we go immediately to syscall exit, and return -ENOSYS (or whatever the register state is changed to). If tracing enginges keep the thread quiescent here, the system call will not be performed until it resumes. UTRACE_EVENT(SYSCALL_EXIT) Thread is leaving kernel after a system call Callback: u32 (*report_syscall_exit)(struct utrace_attached_engine *engine, struct task_struct *tsk, struct pt_regs *regs); The return value can be seen and modified in the registers. If the thread is allowed to resume, it will see any pending signals and then return to user mode. UTRACE_EVENT(SIGNAL) Signal caught by user handler UTRACE_EVENT(SIGNAL_IGN) Signal with no effect (SIG_IGN or default) UTRACE_EVENT(SIGNAL_STOP) Job control stop signal UTRACE_EVENT(SIGNAL_TERM) Fatal termination signal UTRACE_EVENT(SIGNAL_CORE) Fatal core-dump signal UTRACE_EVENT_SIGNAL_ALL All of the above (bitmask) Callback: u32 (*report_signal)(struct utrace_attached_engine *engine, struct task_struct *tsk, u32 action, siginfo_t *info, const struct k_sigaction *orig_ka, struct k_sigaction *return_ka); There are five types of signal events, but all use the same callback. These happen when a thread is dequeuing a signal to be delivered. (Not immediately when the signal is sent, and not when the signal is blocked.) No signal event is reported for SIGKILL; no tracing engine can prevent it from killing the thread immediately. The specific event types allow an engine to trace signals based on what they do. UTRACE_EVENT_SIGNAL_ALL is all of them OR'd together, to trace all signals (except SIGKILL). A subset of these event flags can be used e.g. to catch only fatal signals, not handled ones, or to catch only core-dump signals, not normal termination signals. The action argument says what the signal's default disposition is: UTRACE_SIGNAL_DELIVER Run the user handler from sigaction. UTRACE_SIGNAL_IGN Do nothing, ignore the signal. UTRACE_SIGNAL_TERM Terminate the process. UTRACE_SIGNAL_CORE Terminate the process a write a core dump. UTRACE_SIGNAL_STOP Absolutely stop the process, a la SIGSTOP. UTRACE_SIGNAL_TSTP Job control stop (no stop if orphaned). This selection is made from consulting the process's sigaction and the default action for the signal number, but may already have been changed by an earlier tracing engine (in which case you see its override). A return value of UTRACE_ACTION_RESUME means to carry out this action. If instead UTRACE_SIGNAL_* bits are in the return value, that overrides the normal behavior of the signal. The signal number and other details of the signal are in info, and this data can be changed to make the thread see a different signal. A return value of UTRACE_SIGNAL_DELIVER says to follow the sigaction in return_ka, which can specify a user handler or SIG_IGN to ignore the signal or SIG_DFL to follow the default action for info->si_signo. The orig_ka parameter shows the process's sigaction at the time the signal was dequeued, and return_ka initially contains this. Tracing engines can modify return_ka to change the effects of delivery. For other UTRACE_SIGNAL_* return values, return_ka is ignored. UTRACE_SIGNAL_HOLD is a flag bit that can be OR'd into the return value. It says to push the signal back on the thread's queue, with the signal number and details possibly changed in info. When the thread is allowed to resume, it will dequeue and report it again. Teardown Races -------- ----- Ordinarily synchronization issues for tracing engines are kept fairly straightforward by using quiescence (see above): you make a thread quiescent and then once it makes the report_quiesce callback it cannot do anything else that would result in another callback, until you let it. This simple arrangement avoids complex and error-prone code in each one of a tracing engine's event callbacks to keep them serialized with the engine's other operations done on that thread from another thread of control. However, giving tracing engines complete power to keep a traced thread stuck in place runs afoul of a more important kind of simplicity that the kernel overall guarantees: nothing can prevent or delay SIGKILL from making a thread die and release its resources. To preserve this important property of SIGKILL, it as a special case can break quiescence like nothing else normally can. This includes both explicit SIGKILL signals and the implicit SIGKILL sent to each other thread in the same thread group by a thread doing an exec, or processing a fatal signal, or making an exit_group system call. A tracing engine can prevent a thread from beginning the exit or exec or dying by signal (other than SIGKILL) if it is attached to that thread, but once the operation begins, no tracing engine can prevent or delay all other threads in the same thread group dying. As described above, the report_reap callback is always the final event in the life cycle of a traced thread. Tracing engines can use this as the trigger to clean up their own data structures. The report_death callback is always the penultimate event a tracing engine might see, except when the thread was already in the midst of dying when the engine attached. Many tracing engines will have no interest in when a parent reaps a dead process, and nothing they want to do with a zombie thread once it dies; for them, the report_death callback is the natural place to clean up data structures and detach. To facilitate writing such engines robustly, given the asynchrony of SIGKILL, and without error-prone manual implementation of synchronization schemes, the utrace infrastructure provides some special guarantees about the report_death and report_reap callbacks. It still takes some care to be sure your tracing engine is robust to teardown races, but these rules make it reasonably straightforward and concise to handle a lot of corner cases correctly. The first sort of guarantee concerns the core data structures themselves. struct utrace_attached_engine is allocated using RCU, as is task_struct. If you call utrace_attach under rcu_read_lock, then the pointer it returns will always be valid while in the RCU critical section. (Note that utrace_attach can block doing memory allocation, so you must consider the real critical section to start when utrace_attach returns. utrace_attach can never block when not given the UTRACE_ATTACH_CREATE flag bit). Conversely, you can call utrace_attach outside of rcu_read_lock and though the pointer can become stale asynchronously if the thread dies and is reaped, you can safely pass it to a subsequent utrace_set_flags or utrace_detach call and will just get an -ESRCH error return. However, you must be sure the task_struct remains valid, either via get_task_struct or via RCU. The utrace infrastructure never holds task_struct references of its own. Though neither rcu_read_lock nor any other lock is held while making a callback, it's always guaranteed that the task_struct and the struct utrace_attached_engine passed as arguments remain valid until the callback function returns. The second guarantee is the serialization of death and reap event callbacks for a given thread. The actual reaping by the parent (release_task call) can occur simultaneously while the thread is still doing the final steps of dying, including the report_death callback. If a tracing engine has requested both DEATH and REAP event reports, it's guaranteed that the report_reap callback will not be made until after the report_death callback has returned. If the report_death callback itself detaches from the thread (with utrace_detach or with UTRACE_ACTION_DETACH in its return value), then the report_reap callback will never be made. Thus it is safe for a report_death callback to clean up data structures and detach. The final sort of guarantee is that a tracing engine will know for sure whether or not the report_death and/or report_reap callbacks will be made for a certain thread. These teardown races are disambiguated by the error return values of utrace_set_flags and utrace_detach. Normally utrace_detach returns zero, and this means that no more callbacks will be made. If the thread is in the midst of dying, utrace_detach returns -EALREADY to indicate that the report_death callback may already be in progress; when you get this error, you know that any cleanup your report_death callback does is about to happen or has just happened--note that if the report_death callback does not detach, the engine remains attached until the thread gets reaped. If the thread is in the midst of being reaped, utrace_detach returns -ESRCH to indicate that the report_reap callback may already be in progress; this means the engine is implicitly detached when the callback completes. This makes it possible for a tracing engine that has decided asynchronously to detach from a thread to safely clean up its data structures, knowing that no report_death or report_reap callback will try to do the same. utrace_detach returns -ESRCH when the struct utrace_attached_engine has already been detached, but is still a valid pointer because of rcu_read_lock. If RCU is used properly, a tracing engine can use this to safely synchronize its own independent multiple threads of control with each other and with its event callbacks that detach. In the same vein, utrace_set_flags normally returns zero; if the target thread was quiescent before the call, then after a successful call, no event callbacks not requested in the new flags will be made, and a report_quiesce callback will always be made if requested. It fails with -EALREADY if you try to clear UTRACE_EVENT(DEATH) when the report_death callback may already have begun, if you try to clear UTRACE_EVENT(REAP) when the report_reap callback may already have begun, if you try to newly set UTRACE_ACTION_NOREAP when the target may already have sent its parent SIGCHLD, or if you try to newly set UTRACE_EVENT(DEATH), UTRACE_EVENT(QUIESCE), or UTRACE_ACTION_QUIESCE, when the target is already dead or dying. Like utrace_detach, it returns -ESRCH when the thread has already been detached (including forcible detach on reaping). This lets the tracing engine know for sure which event callbacks it will or won't see after utrace_set_flags has returned. By checking for errors, it can know whether to clean up its data structures immediately or to let its callbacks do the work.