DRAFT DRAFT DRAFT       WORK IN PROGRESS        DRAFT DRAFT DRAFT

This is work in progress and likely to change.


        Roland McGrath

---

                User Debugging Data & Event Rendezvous
                ---- --------- ---- - ----- ----------

See linux/utrace.h for all the declarations used here.
See also linux/tracehook.h for the utrace_regset declarations.

UTRACE is infrastructure code for tracing and controlling user
threads.  It is the foundation for writing tracing engines, which
can be loadable kernel modules.  The UTRACE interfaces provide three
basic facilities:

* Thread event reporting

  Tracing engines can request callbacks for events of interest in
  the thread: signals, system calls, exit, exec, clone, etc.

* Core thread control

  Tracing engines can prevent a thread from running (keeping it in
  TASK_TRACED state), or make it single-step or block-step (when
  hardware supports it).  Engines can cause a thread to abort system
  calls, they can change the behavior of signals, and they can inject
  signal-style actions at will.

* Thread machine state access

  Tracing engines can read and write a thread's registers and
  similar per-thread CPU state.


        Tracing engines
        ------- -------

The basic actors in UTRACE are the thread and the tracing engine.
A tracing engine is some body of code that calls into the utrace_*
interfaces, represented by a struct utrace_engine_ops.  (Usually it's a
kernel module, though the legacy ptrace support is a tracing engine
that is not in a kernel module.)  The UTRACE interface operates on
individual threads (struct task_struct).  If an engine wants to
treat several threads as a group, that is up to its higher-level
code.  Using UTRACE starts out by attaching an engine to a thread.

        struct utrace_attached_engine *
        utrace_attach(struct task_struct *target, int flags,
                      const struct utrace_engine_ops *ops,
                      unsigned long data);

Calling utrace_attach is what sets up a tracing engine to trace a
thread.  Use UTRACE_ATTACH_CREATE in flags, and pass your engine's ops.
Check the return value with IS_ERR.
If successful, it returns a struct pointer that is the handle used in
all other utrace_* calls.  The data argument is stored in the
utrace_attached_engine structure, for your code to use however it wants.

        void utrace_detach(struct task_struct *target,
                           struct utrace_attached_engine *engine);

The utrace_detach call removes an engine from a thread.
No more callbacks will be made after this returns.


An attached engine does nothing by default.
An engine makes something happen by setting its flags.

        void utrace_set_flags(struct task_struct *target,
                              struct utrace_attached_engine *engine,
                              unsigned long flags);


        Action Flags
        ------ -----

There are two kinds of flags that an attached engine can set: event
flags, and action flags.  Event flags register interest in particular
events; when an event happens and an engine has the right event flag
set, it gets a callback.  Action flags change the normal behavior of
the thread.  The action flags available are:

        UTRACE_ACTION_QUIESCE

                The thread will stay quiescent (see below).
                As long as any engine asserts the QUIESCE action flag,
                the thread will not resume running in user mode.
                (Usually it will be in TASK_TRACED state.)
                Nothing will wake the thread up except for SIGKILL
                (and implicit SIGKILLs such as a core dump in
                another thread sharing the same address space, or a
                group exit or fatal signal in another thread in the
                same thread group).

        UTRACE_ACTION_SINGLESTEP

                When the thread runs, it will run one instruction
                and then trap.  (Exiting a system call or entering a
                signal handler is considered "an instruction" for this.)
                This can be used only if ARCH_HAS_SINGLE_STEP is
                #define'd and evaluates to nonzero.

        UTRACE_ACTION_BLOCKSTEP

                When the thread runs, it will run until the next branch,
                and then trap.  (Exiting a system call or entering a
                signal handler is considered a branch for this.)
                When the SINGLESTEP flag is set, BLOCKSTEP has no effect.
                This is only available on some machines (actually none yet).
                This can be used only if ARCH_HAS_BLOCK_STEP is
                #define'd and evaluates to nonzero.

        UTRACE_ACTION_NOREAP

                When the thread exits or stops for job control, its
                parent process will not receive a SIGCHLD and the
                parent's wait calls will not wake up or report the
                child as dead.  A well-behaved tracing engine does not
                want to interfere with the parent's normal notifications.
                This is provided mainly for the ptrace compatibility
                code to implement the traditional behavior.

Event flags are specified using the macro UTRACE_EVENT(TYPE).
Each event type is associated with a report_* callback in struct
utrace_engine_ops.  A tracing engine can leave unused callbacks NULL.
The only callbacks required are those used by the event flags it sets.

Many engines can be attached to each thread.  When a thread has an
event, each engine gets a report_* callback if it has set the event flag
for that event type.  Engines are called in the order they attached.

Each callback takes arguments giving the details of the particular
event.  The first two arguments to every callback are the struct
utrace_attached_engine and struct task_struct pointers for the engine
and the thread producing the event.  Usually this will be the current
thread that is running the callback functions.

The return value of report_* callbacks is a bitmask.  Some bits are
common to all callbacks, and some are particular to that callback and
event type.  The value zero (UTRACE_ACTION_RESUME) always means the
simplest thing: do what would have happened with no tracing engine here.
These are the flags that can be set in any report_* return value:

        UTRACE_ACTION_NEWSTATE

                Update the action state flags, described above.  Those
                bits from the return value (UTRACE_ACTION_STATE_MASK)
                replace those bits in the engine's flags.  This has the
                same effect as calling utrace_set_flags, but is a more
                efficient short-cut.  To change the event flags, you
                must call utrace_set_flags.

        UTRACE_ACTION_DETACH

                Detach this engine.
                This has the effect of calling utrace_detach, but is
                a more efficient short-cut.

        UTRACE_ACTION_HIDE

                Hide this event from other tracing engines.  This is
                only appropriate to do when the event was induced by
                some action of this engine, such as a breakpoint trap.
                Some events cannot be hidden, since every engine has to
                know about them: exit, death, reap.

The return value bits in UTRACE_ACTION_OP_MASK indicate a change to the
normal behavior of the event taking place.  If zero, the thread does
whatever that event normally means.  For report_signal, other values
control the disposition of the signal.


        Quiescence
        ----------

To control another thread and access its state, it must be "quiescent".
This means that it is stopped and won't start running again while we access
it.  A quiescent thread is stopped in a place close to user mode, where the
user state can be accessed safely; either it's about to return to user
mode, or it's just entered the kernel from user mode, or it has already
finished exiting (TASK_ZOMBIE).  Setting the UTRACE_ACTION_QUIESCE action
flag will force the attached thread to become quiescent soon.  After
setting the flag, an engine must wait for an event callback when the thread
becomes quiescent.  The thread may be running on another CPU, or may be in
an uninterruptible wait.  When it is ready to be examined, it will make
callbacks to engines that set the UTRACE_EVENT(QUIESCE) event flag.

As long as some engine has UTRACE_ACTION_QUIESCE set, then the thread will
remain stopped.  SIGKILL will wake it up, but it will not run user code.
When the flag is cleared via utrace_set_flags or a callback return value,
the thread starts running again.

During the event callbacks (report_*), the thread in question makes the
callback from a safe place.  It is not quiescent, but it can safely access
its own state.  Callbacks can access thread state directly without setting
the QUIESCE action flag.
If a callback does want to prevent the thread from resuming normal
execution, it *must* use the QUIESCE action state rather than simply
blocking; see "Core Events & Callbacks", below.


        Thread control
        ------ -------

These calls must be made on a quiescent thread (or the current thread):

        int utrace_inject_signal(struct task_struct *target,
                                 struct utrace_attached_engine *engine,
                                 u32 action, siginfo_t *info,
                                 const struct k_sigaction *ka);

Cause a specified signal delivery in the target thread.  This is not
like kill, which generates a signal to be dequeued and delivered later.
Injection directs the thread to deliver a signal now, before it next
resumes in user mode or dequeues any other pending signal.  It's as if
the tracing engine intercepted a signal event and its report_signal
callback returned the action argument as its value (see below).  The
info and ka arguments serve the same purposes as their counterparts in
a report_signal callback.

        const struct utrace_regset *
        utrace_regset(struct task_struct *target,
                      struct utrace_attached_engine *engine,
                      const struct utrace_regset_view *view,
                      int which);

Get access to machine state for the thread.  The struct utrace_regset_view
indicates a view of machine state, corresponding to a user mode
architecture personality (such as 32-bit or 64-bit versions of a machine).
The which argument selects one of the register sets available in that view.
The utrace_regset call must be made before accessing any machine state,
each time the thread has been running and has then become quiescent.
It ensures that the thread's state is ready to be accessed, and returns
the struct utrace_regset giving its accessor functions.

XXX needs front ends for argument checks, export utrace_native_view


        Core Events & Callbacks
        ---- ------ - ---------

Event reporting callbacks have details particular to the event type, but
are all called in similar environments and have the same constraints.
Callbacks are made from safe spots, where no locks are held, no special
resources are pinned, and the user-mode state of the thread is accessible.
So, callback code has a pretty free hand.  But to be a good citizen,
callback code should never block for long periods.  It is fine to block in
kmalloc and the like, but never wait for i/o or for user mode to do
something.  If you need the thread to wait, set UTRACE_ACTION_QUIESCE and
return from the callback quickly.  When your i/o finishes or whatever, you
can use utrace_set_flags to resume the thread.

Well-behaved callbacks are important to maintain two essential properties
of the interface.  The first of these is that unrelated tracing engines do
not interfere with each other.  If your engine's event callback does not
return quickly, then another engine won't get the event notification in a
timely manner.  The second important property is that tracing be as
noninvasive as possible to the normal operation of the system overall and
of the traced thread in particular.  That is, attached tracing engines
should not perturb a thread's behavior, except to the extent that changing
its user-visible state is explicitly what you want to do.  (Obviously some
perturbation is unavoidable, primarily timing changes, ranging from small
delays due to the overhead of tracing, to arbitrary pauses in user code
execution when a user stops a thread with a debugger for examination.  When
doing asynchronous utrace_attach to a thread doing a system call, more
troublesome side effects are possible.)  Even when you explicitly want the
perturbation of making the traced thread block, just blocking directly in
your callback has more unwanted effects.  For example, the CLONE event
callbacks are called when the new child thread has been created but not yet
started running; the child can never be scheduled until the CLONE tracing
callbacks return.  (This allows engines tracing the parent to attach to the
child.)
If a CLONE event callback blocks the parent thread, it also
prevents the child thread from running (even to process a SIGKILL).  If
what you want is to make both the parent and child block, then use
utrace_attach on the child and then set the QUIESCE action state flag on
both threads.  A more crucial problem with blocking in callbacks is that it
can prevent SIGKILL from working.  A thread that is blocking due to
UTRACE_ACTION_QUIESCE will still wake up and die immediately when sent a
SIGKILL, as all threads should.  Relying on the utrace infrastructure
rather than on private synchronization calls in event callbacks is an
important way to help keep tracing robustly noninvasive.


UTRACE_EVENT(REAP)              Dead thread has been reaped
Callback:
        void (*report_reap)(struct utrace_attached_engine *engine,
                            struct task_struct *tsk);

This means the parent called wait, or else this was a detached thread or
a process whose parent ignores SIGCHLD.  This cannot happen while the
UTRACE_ACTION_NOREAP flag is set.  This is the only callback you are
guaranteed to get (if you set the flag).

Unlike other callbacks, this can be called from the parent's context
rather than from the traced thread itself--it must not delay the parent by
blocking.  This callback is different from all others: it returns void.
Once you get this callback, your engine is automatically detached and you
cannot access this thread or use this struct utrace_attached_engine handle
any longer.  This is the place to clean up your data structures and
synchronize with your code that might try to make utrace_* calls using this
engine data structure.  The struct is still valid during this callback,
but will be freed soon after it returns (via RCU).

In all other callbacks, the return value is as described above.
The common UTRACE_ACTION_* flags in the return value are always observed.
Unless otherwise specified below, other bits in the return value are ignored.
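
The cleanup duty of report_reap described above can be pictured with a
small user-space model.  This is only a sketch under stated assumptions:
the struct below is a stand-in for the kernel's struct
utrace_attached_engine, its "data" field name is assumed from the
utrace_attach description, and my_engine_state, my_state_alloc and
live_states are hypothetical names.  A real engine would use kmalloc and
kfree and would also synchronize with its own code paths.

```c
#include <assert.h>
#include <stdlib.h>

/* Opaque stand-in for the kernel's struct task_struct. */
struct task_struct;

/*
 * Stand-in for struct utrace_attached_engine.  The text only says the
 * data argument of utrace_attach is stored in this structure; the field
 * name "data" is an assumption of this sketch.
 */
struct utrace_attached_engine {
	unsigned long data;
};

/* Hypothetical per-engine private state, passed to utrace_attach as data. */
struct my_engine_state {
	int events_seen;
};

/* Number of private states currently allocated, so cleanup can be checked. */
static int live_states;

/* Allocate private state; a real engine does this before utrace_attach. */
static struct my_engine_state *my_state_alloc(void)
{
	struct my_engine_state *st = calloc(1, sizeof(*st));

	if (st)
		live_states++;
	return st;
}

/*
 * Shaped like the report_reap callback: it returns void, does not block,
 * and releases everything that hangs off the engine handle, because the
 * handle cannot be used once this returns.
 */
static void my_report_reap(struct utrace_attached_engine *engine,
			   struct task_struct *tsk)
{
	struct my_engine_state *st = (struct my_engine_state *)engine->data;

	(void)tsk;		/* the thread is gone; do not touch it */
	engine->data = 0;	/* drop the reference before freeing */
	if (st) {
		live_states--;
		free(st);
	}
}
```

Clearing the data field before freeing is a design choice of the sketch: it
keeps a stale pointer from surviving in the handle during the short time
the struct remains valid.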

UTRACE_EVENT(QUIESCE)           Thread is quiescent
Callback:
        u32 (*report_quiesce)(struct utrace_attached_engine *engine,
                              struct task_struct *tsk);

This is the least interesting callback.  It happens at any safe spot,
including after any other event callback.  This lets the tracing engine
know that it is safe to access the thread's state, or to report to users
that it has stopped running user code.

UTRACE_EVENT(CLONE)             Thread is creating a child
Callback:
        u32 (*report_clone)(struct utrace_attached_engine *engine,
                            struct task_struct *parent,
                            unsigned long clone_flags,
                            struct task_struct *child);

A clone/clone2/fork/vfork system call has succeeded in creating a new
thread or child process.  The new process is fully formed, but not yet
running.  During this callback, other tracing engines are prevented from
using utrace_attach asynchronously on the child, so that engines tracing
the parent get the first opportunity to attach.  After this callback
returns, the child will start and the parent's system call will return.
If CLONE_VFORK is set, the parent will block before returning.

UTRACE_EVENT(VFORK_DONE)        Finished waiting for CLONE_VFORK child
Callback:
        u32 (*report_vfork_done)(struct utrace_attached_engine *engine,
                                 struct task_struct *parent,
                                 pid_t child_pid);

Event reported for parent using CLONE_VFORK or vfork system call.
The child has died or exec'd, so the vfork parent has unblocked
and is about to return child_pid.

UTRACE_EVENT(EXEC)              Completed exec
Callback:
        u32 (*report_exec)(struct utrace_attached_engine *engine,
                           struct task_struct *tsk,
                           const struct linux_binprm *bprm,
                           struct pt_regs *regs);

An execve system call has succeeded and the new program is about to
start running.  The initial user register state can be tweaked directly
through regs, or utrace_regset can be used for full machine state access.
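
As an illustration of how a report_clone callback might look at its
clone_flags argument, here is a minimal user-space sketch.  The two
SKETCH_* constants mirror the Linux values of CLONE_VFORK and
CLONE_THREAD and are defined locally so the code builds outside the
kernel tree; classify_child, expects_vfork_done and enum child_kind are
hypothetical helpers, not part of the utrace interface.

```c
#include <assert.h>

/* Local copies of the Linux clone flag values used by this sketch. */
#define SKETCH_CLONE_VFORK	0x00004000UL
#define SKETCH_CLONE_THREAD	0x00010000UL

enum child_kind {
	CHILD_THREAD,	/* new thread in the parent's thread group */
	CHILD_VFORK,	/* new process; parent blocks until exec or exit */
	CHILD_PROCESS	/* ordinary new process */
};

/* Decide what kind of child a successful clone call has created. */
static enum child_kind classify_child(unsigned long clone_flags)
{
	if (clone_flags & SKETCH_CLONE_THREAD)
		return CHILD_THREAD;
	if (clone_flags & SKETCH_CLONE_VFORK)
		return CHILD_VFORK;
	return CHILD_PROCESS;
}

/*
 * The parent blocks only when CLONE_VFORK is set, so only then is there
 * a wait for a later VFORK_DONE event to mark the end of.
 */
static int expects_vfork_done(unsigned long clone_flags)
{
	return (clone_flags & SKETCH_CLONE_VFORK) != 0;
}
```

An engine tracing the parent could use such a classification inside
report_clone to decide whether to call utrace_attach on the child while
it still has the first opportunity to do so.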

UTRACE_EVENT(EXIT)              Thread is exiting
Callback:
        u32 (*report_exit)(struct utrace_attached_engine *engine,
                           struct task_struct *tsk,
                           long orig_code, long *code);

The thread is exiting and cannot be prevented from doing so, but all its
state is still live.  The *code value will be the wait result seen by
the parent, and can be changed by this engine or others.  The orig_code
value is the real status, not changed by any tracing engine.

UTRACE_EVENT(DEATH)             Thread has finished exiting
Callback:
        u32 (*report_death)(struct utrace_attached_engine *engine,
                            struct task_struct *tsk);

The thread is really dead now.  If the UTRACE_ACTION_NOREAP flag is set
after this callback, it remains an unreported zombie.  Otherwise, it might
be reaped by its parent, or self-reap immediately.  Though the actual
reaping may happen in parallel, a report_reap callback will always be
ordered after a report_death callback.

UTRACE_EVENT(SYSCALL_ENTRY)     Thread has entered kernel for a system call
Callback:
        u32 (*report_syscall_entry)(struct utrace_attached_engine *engine,
                                    struct task_struct *tsk,
                                    struct pt_regs *regs);

The system call number and arguments can be seen and modified in the
registers.  The return value register has -ENOSYS, which will be returned
for an invalid system call.  The macro tracehook_abort_syscall(regs) will
abort the system call so that we go immediately to syscall exit, and return
-ENOSYS (or whatever the register state is changed to).  If tracing
engines keep the thread quiescent here, the system call will not be
performed until it resumes.

UTRACE_EVENT(SYSCALL_EXIT)      Thread is leaving kernel after a system call
Callback:
        u32 (*report_syscall_exit)(struct utrace_attached_engine *engine,
                                   struct task_struct *tsk,
                                   struct pt_regs *regs);

The return value can be seen and modified in the registers.  If the
thread is allowed to resume, it will see any pending signals and then
return to user mode.
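
The syscall entry semantics above (the return register preset to -ENOSYS,
and an abort that skips straight to syscall exit) can be modeled in user
space.  This is a sketch, not kernel code: struct mock_regs and its field
names are invented because the real struct pt_regs is architecture
specific, mock_abort_syscall merely stands in for
tracehook_abort_syscall(regs), and the deny-one-call policy is a
hypothetical example.

```c
#include <assert.h>
#include <errno.h>

/* Invented stand-in for the register state an engine sees at entry. */
struct mock_regs {
	long nr;	/* system call number */
	long ret;	/* return value register */
	int aborted;	/* set by the stand-in for tracehook_abort_syscall */
};

/* Stand-in for tracehook_abort_syscall(regs): skip the call itself. */
static void mock_abort_syscall(struct mock_regs *regs)
{
	regs->aborted = 1;
}

/* Hypothetical engine policy: refuse one system call with -EPERM. */
static void deny_at_entry(struct mock_regs *regs, long denied_nr)
{
	if (regs->nr == denied_nr) {
		mock_abort_syscall(regs);
		regs->ret = -EPERM;	/* replaces the preset -ENOSYS */
	}
}

/*
 * Model of one traced system call: the return register starts out as
 * -ENOSYS, the entry callback runs, and the call is performed only if
 * no engine aborted it.
 */
static long run_traced_syscall(long nr, long denied_nr, long real_result)
{
	struct mock_regs regs = { nr, -ENOSYS, 0 };

	deny_at_entry(&regs, denied_nr);
	if (!regs.aborted)
		regs.ret = real_result;	/* the system call itself */
	return regs.ret;
}
```

An engine that aborts without touching the return register leaves the
preset -ENOSYS in place, which matches the default described in the text.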

UTRACE_EVENT(SIGNAL)            Signal caught by user handler
UTRACE_EVENT(SIGNAL_IGN)        Signal with no effect (SIG_IGN or default)
UTRACE_EVENT(SIGNAL_STOP)       Job control stop signal
UTRACE_EVENT(SIGNAL_TERM)       Fatal termination signal
UTRACE_EVENT(SIGNAL_CORE)       Fatal core-dump signal
UTRACE_EVENT_SIGNAL_ALL         All of the above (bitmask)
Callback:
        u32 (*report_signal)(struct utrace_attached_engine *engine,
                             struct task_struct *tsk,
                             u32 action, siginfo_t *info,
                             const struct k_sigaction *orig_ka,
                             struct k_sigaction *return_ka);

There are five types of signal events, but all use the same callback.
These happen when a thread is dequeuing a signal to be delivered.
(Not immediately when the signal is sent, and not when the signal is
blocked.)  No signal event is reported for SIGKILL; no tracing engine can
prevent it from killing the thread immediately.  The specific event types
allow an engine to trace signals based on what they do.
UTRACE_EVENT_SIGNAL_ALL is all of them OR'd together, to trace all signals
(except SIGKILL).  A subset of these event flags can be used e.g. to catch
only fatal signals, not handled ones, or to catch only core-dump signals,
not normal termination signals.

The action argument says what the signal's default disposition is:

        UTRACE_SIGNAL_DELIVER   Run the user handler from sigaction.
        UTRACE_SIGNAL_IGN       Do nothing, ignore the signal.
        UTRACE_SIGNAL_TERM      Terminate the process.
        UTRACE_SIGNAL_CORE      Terminate the process and write a core dump.
        UTRACE_SIGNAL_STOP      Absolutely stop the process, a la SIGSTOP.
        UTRACE_SIGNAL_TSTP      Job control stop (no stop if orphaned).

This selection is made from consulting the process's sigaction and the
default action for the signal number, but may already have been changed by
an earlier tracing engine (in which case you see its override).  A return
value of UTRACE_ACTION_RESUME means to carry out this action.  If instead
UTRACE_SIGNAL_* bits are in the return value, that overrides the normal
behavior of the signal.

The signal number and other details of the signal are in info, and
this data can be changed to make the thread see a different signal.
A return value of UTRACE_SIGNAL_DELIVER says to follow the sigaction in
return_ka, which can specify a user handler or SIG_IGN to ignore the
signal or SIG_DFL to follow the default action for info->si_signo.
The orig_ka parameter shows the process's sigaction at the time the
signal was dequeued, and return_ka initially contains this.  Tracing
engines can modify return_ka to change the effects of delivery.
For other UTRACE_SIGNAL_* return values, return_ka is ignored.

UTRACE_SIGNAL_HOLD is a flag bit that can be OR'd into the return
value.  It says to push the signal back on the thread's queue, with
the signal number and details possibly changed in info.  When the
thread is allowed to resume, it will dequeue and report it again.
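
The way several engines shape a signal's disposition (each engine is
called in attach order and sees any override made by an earlier one) can
be sketched as a user-space model.  The SK_* numeric values are
placeholders chosen for this sketch; only the fact that
UTRACE_ACTION_RESUME is zero comes from the text, and the real
UTRACE_SIGNAL_* encodings in linux/utrace.h may differ.  The two policies
and the helper names are hypothetical.

```c
#include <assert.h>

/* Placeholder encodings; see linux/utrace.h for the real ones. */
#define SK_ACTION_RESUME	0x0u	/* zero, as stated in the text */
#define SK_SIGNAL_DELIVER	0x1u
#define SK_SIGNAL_IGN		0x2u
#define SK_SIGNAL_TERM		0x3u
#define SK_SIGNAL_CORE		0x4u
#define SK_SIGNAL_MASK		0xfu

/* Shape of one engine's decision: current action in, return value out. */
typedef unsigned int (*signal_policy)(unsigned int action);

/* Hypothetical policy: turn core dumps into plain termination. */
static unsigned int no_core_dumps(unsigned int action)
{
	if (action == SK_SIGNAL_CORE)
		return SK_SIGNAL_TERM;
	return SK_ACTION_RESUME;	/* carry out the current action */
}

/* Hypothetical policy: make plain termination signals harmless. */
static unsigned int ignore_term(unsigned int action)
{
	if (action == SK_SIGNAL_TERM)
		return SK_SIGNAL_IGN;
	return SK_ACTION_RESUME;
}

/* Signal bits in a return value override; a zero return keeps the action. */
static unsigned int next_action(unsigned int action, unsigned int ret)
{
	unsigned int override = ret & SK_SIGNAL_MASK;

	return override ? override : action;
}

/* Run the engines in attach order; each sees the previous override. */
static unsigned int resolve_action(unsigned int action,
				   const signal_policy *engines, int n)
{
	int i;

	for (i = 0; i < n; i++)
		action = next_action(action, engines[i](action));
	return action;
}
```

With these two policies the attach order matters: attached as
no_core_dumps then ignore_term, a core-dump signal ends up ignored, while
in the opposite order it ends up as a plain termination.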