--- /dev/null
+DRAFT DRAFT DRAFT WORK IN PROGRESS DRAFT DRAFT DRAFT
+
+This is work in progress and likely to change.
+
+
+ Roland McGrath <roland@redhat.com>
+
+---
+
+ User Debugging Data & Event Rendezvous
+ ---- --------- ---- - ----- ----------
+
+See linux/utrace.h for all the declarations used here.
+See also linux/tracehook.h for the utrace_regset declarations.
+
+UTRACE is infrastructure code for tracing and controlling user
+threads. This is the foundation for writing tracing engines, which
+can be loadable kernel modules. The UTRACE interfaces provide three
+basic facilities:
+
+* Thread event reporting
+
+ Tracing engines can request callbacks for events of interest in
+ the thread: signals, system calls, exit, exec, clone, etc.
+
+* Core thread control
+
+ Tracing engines can prevent a thread from running (keeping it in
+ TASK_TRACED state), or make it single-step or block-step (when
+ hardware supports it). Engines can cause a thread to abort system
+ calls, change the behavior of signals, and inject signal-style
+ actions at will.
+
+* Thread machine state access
+
+ Tracing engines can read and write a thread's registers and
+ similar per-thread CPU state.
+
+
+ Tracing engines
+ ------- -------
+
+The basic actors in UTRACE are the thread and the tracing engine.
+A tracing engine is some body of code that calls into the utrace_*
+interfaces, represented by a struct utrace_engine_ops. (Usually it's a
+kernel module, though the legacy ptrace support is a tracing engine
+that is not in a kernel module.) The UTRACE interface operates on
+individual threads (struct task_struct). If an engine wants to
+treat several threads as a group, that is up to its higher-level
+code. Using UTRACE begins by attaching an engine to a thread.
+
+ struct utrace_attached_engine *
+ utrace_attach(struct task_struct *target, int flags,
+ const struct utrace_engine_ops *ops, unsigned long data);
+
+Calling utrace_attach sets up a tracing engine to trace a thread.
+Use UTRACE_ATTACH_CREATE in flags, and pass your engine's ops.
+Check the return value with IS_ERR. If successful, it returns a
+struct pointer that is the handle used in all other utrace_* calls.
+The data argument is stored in the utrace_attached_engine structure,
+for your code to use however it wants.
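+
+The attach-and-check pattern can be illustrated outside the kernel. What
+follows is a userspace model only. It copies the usual <linux/err.h>
+error-pointer convention that IS_ERR relies on, while model_attach,
+struct model_engine and MODEL_ATTACH_CREATE are hypothetical stand-ins,
+not the utrace API.
+
+```c
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+/* Userspace copy of the kernel error-pointer convention: a negative
+ * errno is stored in the pointer itself, in the top MAX_ERRNO
+ * addresses, where no real object can live. */
+#define MAX_ERRNO 4095
+
+static inline void *ERR_PTR(long error) { return (void *)error; }
+static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
+static inline int IS_ERR(const void *ptr)
+{
+	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
+}
+
+/* Hypothetical stand-ins for this model only. */
+#define MODEL_ATTACH_CREATE 0x1
+
+struct model_engine {
+	unsigned long data;	/* the engine's private word */
+};
+
+/* Hypothetical model of utrace_attach: with MODEL_ATTACH_CREATE it
+ * allocates a new handle; without it, this model simply fails. */
+static struct model_engine *model_attach(int flags, unsigned long data)
+{
+	struct model_engine *engine;
+
+	if (!(flags & MODEL_ATTACH_CREATE))
+		return ERR_PTR(-ENOENT);
+	engine = malloc(sizeof(*engine));
+	if (engine == NULL)
+		return ERR_PTR(-ENOMEM);
+	engine->data = data;
+	return engine;
+}
+```
+
+The caller never compares the handle against NULL; it asks IS_ERR and,
+on failure, recovers the errno with PTR_ERR.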
+
+ void utrace_detach(struct task_struct *target,
+ struct utrace_attached_engine *engine);
+
+The utrace_detach call removes an engine from a thread.
+No more callbacks will be made after this returns.
+
+
+An attached engine does nothing by default.
+An engine makes something happen by setting its flags.
+
+ void utrace_set_flags(struct task_struct *target,
+ struct utrace_attached_engine *engine,
+ unsigned long flags);
+
+
+ Action Flags
+ ------ -----
+
+There are two kinds of flags that an attached engine can set: event
+flags, and action flags. Event flags register interest in particular
+events; when an event happens and an engine has the right event flag
+set, it gets a callback. Action flags change the normal behavior of
+the thread. The action flags available are:
+
+ UTRACE_ACTION_QUIESCE
+
+ The thread will stay quiescent (see below).
+ As long as any engine asserts the QUIESCE action flag,
+ the thread will not resume running in user mode.
+ (Usually it will be in TASK_TRACED state.)
+ Nothing will wake the thread up except for SIGKILL
+ (and implicit SIGKILLs such as a core dump in
+ another thread sharing the same address space, or a
+ group exit or fatal signal in another thread in the
+ same thread group).
+
+ UTRACE_ACTION_SINGLESTEP
+
+ When the thread runs, it will run one instruction
+ and then trap. (Exiting a system call or entering a
+ signal handler is considered "an instruction" for this.)
+ This can be used only if ARCH_HAS_SINGLE_STEP is #define'd
+ by <asm/tracehook.h> and evaluates to nonzero.
+
+ UTRACE_ACTION_BLOCKSTEP
+
+ When the thread runs, it will run until the next branch,
+ and then trap. (Exiting a system call or entering a
+ signal handler is considered a branch for this.)
+ When the SINGLESTEP flag is set, BLOCKSTEP has no effect.
+ This is only available on some machines (actually none yet).
+ This can be used only if ARCH_HAS_BLOCK_STEP is #define'd
+ by <asm/tracehook.h> and evaluates to nonzero.
+
+ UTRACE_ACTION_NOREAP
+
+ When the thread exits or stops for job control, its
+ parent process will not receive a SIGCHLD and the
+ parent's wait calls will not wake up or report the
+ child as dead. A well-behaved tracing engine does not
+ want to interfere with the parent's normal notifications.
+ This is provided mainly for the ptrace compatibility
+ code to implement the traditional behavior.
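+
+How several engines' action flags combine can be sketched as a small
+userspace model. It assumes, as the descriptions above suggest, that the
+thread's effective action state is the OR of every attached engine's
+flags. The ACT_* bit values are made up for the model; the real
+constants are in linux/utrace.h.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical bit values for this model only. */
+#define ACT_QUIESCE    0x1UL
+#define ACT_SINGLESTEP 0x2UL
+#define ACT_BLOCKSTEP  0x4UL
+#define ACT_NOREAP     0x8UL
+
+/* The thread's effective action state: the OR of all engines' flags. */
+static unsigned long combined_actions(const unsigned long *flags, size_t n)
+{
+	unsigned long all = 0;
+
+	for (size_t i = 0; i < n; i++)
+		all |= flags[i];
+	return all;
+}
+
+/* The thread stays stopped while any engine asserts QUIESCE
+ * (SIGKILL, not modelled here, would still wake it). */
+static int stays_quiescent(const unsigned long *flags, size_t n)
+{
+	return (combined_actions(flags, n) & ACT_QUIESCE) != 0;
+}
+
+enum step_kind { STEP_NONE, STEP_SINGLE, STEP_BLOCK };
+
+/* SINGLESTEP wins: BLOCKSTEP has no effect when SINGLESTEP is set. */
+static enum step_kind effective_step(const unsigned long *flags, size_t n)
+{
+	unsigned long all = combined_actions(flags, n);
+
+	if (all & ACT_SINGLESTEP)
+		return STEP_SINGLE;
+	if (all & ACT_BLOCKSTEP)
+		return STEP_BLOCK;
+	return STEP_NONE;
+}
+```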
+
+Event flags are specified using the macro UTRACE_EVENT(TYPE).
+Each event type is associated with a report_* callback in struct
+utrace_engine_ops. A tracing engine can leave unused callbacks NULL.
+The only callbacks required are those used by the event flags it sets.
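+
+The sketch below models one bit per event type and checks that an engine
+supplies a callback for every event flag it sets. MODEL_EVENT, the EV_*
+numbers and struct model_ops are hypothetical; the real UTRACE_EVENT
+macro and struct utrace_engine_ops are declared in linux/utrace.h.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+
+/* Hypothetical event numbering for this model only. */
+enum { EV_QUIESCE, EV_CLONE, EV_EXEC, EV_EXIT };
+#define MODEL_EVENT(type) (1UL << EV_##type)
+
+/* Reduced stand-in for struct utrace_engine_ops: NULL means unused. */
+struct model_ops {
+	unsigned int (*report_quiesce)(void);
+	unsigned int (*report_clone)(void);
+	unsigned int (*report_exec)(void);
+	unsigned int (*report_exit)(void);
+};
+
+/* True if every event flag set in 'events' has a callback in 'ops'. */
+static bool ops_cover_events(const struct model_ops *ops,
+			     unsigned long events)
+{
+	if ((events & MODEL_EVENT(QUIESCE)) && ops->report_quiesce == NULL)
+		return false;
+	if ((events & MODEL_EVENT(CLONE)) && ops->report_clone == NULL)
+		return false;
+	if ((events & MODEL_EVENT(EXEC)) && ops->report_exec == NULL)
+		return false;
+	if ((events & MODEL_EVENT(EXIT)) && ops->report_exit == NULL)
+		return false;
+	return true;
+}
+
+/* Example engine that only cares about exec. */
+static unsigned int my_report_exec(void)
+{
+	return 0;	/* resume: do what would have happened anyway */
+}
+```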
+
+Many engines can be attached to each thread. When a thread has an
+event, each engine gets a report_* callback if it has set the event flag
+for that event type. Engines are called in the order they attached.
+
+Each callback takes arguments giving the details of the particular
+event. The first two arguments to every callback are the struct
+utrace_attached_engine and struct task_struct pointers for the engine
+and the thread producing the event. Usually this will be the current
+thread that is running the callback functions.
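+
+A minimal userspace model of the dispatch rule: engines are kept in
+attach order, and an event is reported only to engines with the matching
+event flag. The types are hypothetical, and where a real callback would
+receive the engine and task_struct pointers, this model just records the
+engine's id.
+
+```c
+#include <stddef.h>
+
+#define MAX_ENGINES 8
+
+/* Hypothetical per-thread bookkeeping for this model. */
+struct model_engine {
+	int id;
+	unsigned long events;	/* event flags the engine has set */
+};
+
+struct model_thread {
+	struct model_engine engines[MAX_ENGINES];	/* in attach order */
+	size_t nr_engines;
+};
+
+/* Report one event: walk the engines in attach order and "call" only
+ * those with the event's flag set, recording their ids in 'called'.
+ * Returns how many engines were called. */
+static size_t report_event(const struct model_thread *t,
+			   unsigned long event_bit, int *called)
+{
+	size_t n = 0;
+
+	for (size_t i = 0; i < t->nr_engines; i++)
+		if (t->engines[i].events & event_bit)
+			called[n++] = t->engines[i].id;
+	return n;
+}
+```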
+
+The return value of report_* callbacks is a bitmask. Some bits are
+common to all callbacks, and some are particular to that callback and
+event type. The value zero (UTRACE_ACTION_RESUME) always means the
+simplest thing: do what would have happened with no tracing engine here.
+These are the flags that can be set in any report_* return value:
+
+ UTRACE_ACTION_NEWSTATE
+
+ Update the action state flags, described above. Those
+ bits from the return value (UTRACE_ACTION_STATE_MASK)
+ replace those bits in the engine's flags. This has the
+ same effect as calling utrace_set_flags, but is a more
+ efficient short-cut. To change the event flags, you must
+ call utrace_set_flags.
+
+ UTRACE_ACTION_DETACH
+
+ Detach this engine. This has the effect of calling
+ utrace_detach, but is a more efficient short-cut.
+
+ UTRACE_ACTION_HIDE
+
+ Hide this event from other tracing engines. This is
+ only appropriate to do when the event was induced by
+ some action of this engine, such as a breakpoint trap.
+ Some events cannot be hidden, since every engine has to
+ know about them: exit, death, reap.
+
+The return value bits in UTRACE_ACTION_OP_MASK indicate a change to the
+normal behavior of the event taking place. If zero, the thread does
+whatever that event normally means. For report_signal, other values
+control the disposition of the signal.
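+
+The handling of the common return bits can be sketched as a userspace
+model. The bit layout, struct model_engine and apply_report_result are
+hypothetical; the model only shows NEWSTATE replacing the state bits,
+DETACH dropping the engine, and HIDE stopping later engines from hearing
+about events that may be hidden.
+
+```c
+#include <stdbool.h>
+
+/* Hypothetical bit layout for this model only. */
+#define ACT_STATE_MASK 0x00ffUL	/* action state bits */
+#define ACT_QUIESCE    0x0001UL
+#define ACT_NEWSTATE   0x0100UL
+#define ACT_DETACH     0x0200UL
+#define ACT_HIDE       0x0400UL
+
+struct model_engine {
+	unsigned long flags;	/* action state in the low byte, events higher */
+	bool attached;
+};
+
+/* Apply one report_* return value to the engine that returned it.
+ * 'hideable' is false for exit, death and reap. Returns true if the
+ * remaining engines should still be told about the event. */
+static bool apply_report_result(struct model_engine *e, unsigned long ret,
+				bool hideable)
+{
+	if (ret & ACT_NEWSTATE)	/* replace only the action state bits */
+		e->flags = (e->flags & ~ACT_STATE_MASK) |
+			   (ret & ACT_STATE_MASK);
+	if (ret & ACT_DETACH)
+		e->attached = false;
+	return !(hideable && (ret & ACT_HIDE));
+}
+```
+
+Note that a NEWSTATE return with no state bits clears QUIESCE, while
+event flags stored outside the state mask are never touched.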
+
+
+ Quiescence
+ ----------
+
+To control another thread and access its state, an engine needs that
+thread to be "quiescent". This means that it is stopped and won't start
+running again while we access it. A quiescent thread is stopped in a
+place close to user mode, where the user state can be accessed safely;
+either it's about to return to user
+mode, or it's just entered the kernel from user mode, or it has already
+finished exiting (TASK_ZOMBIE). Setting the UTRACE_ACTION_QUIESCE action
+flag will force the attached thread to become quiescent soon. After
+setting the flag, an engine must wait for an event callback when the thread
+becomes quiescent. The thread may be running on another CPU, or may be in
+an uninterruptible wait. When it is ready to be examined, it will make
+callbacks to engines that set the UTRACE_EVENT(QUIESCE) event flag.
+
+As long as some engine has UTRACE_ACTION_QUIESCE set, the thread will
+remain stopped. SIGKILL will wake it up, but it will not run user code.
+When the flag is cleared via utrace_set_flags or a callback return value,
+the thread starts running again.
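+
+The request-then-wait protocol can be modelled sequentially. This is a
+simplified userspace sketch with hypothetical names and bits: in it, the
+quiesce report is delivered only when some engine is holding the thread,
+and the engine treats the thread's state as accessible only after that
+report has arrived, not merely after setting the flag.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+
+/* Hypothetical bits for this model only. */
+#define ACT_QUIESCE   0x001UL
+#define EVENT_QUIESCE 0x100UL
+
+struct model_engine {
+	unsigned long flags;
+	bool may_access;	/* set only by the quiesce report */
+};
+
+/* Engine side: ask the thread to stop. Its state is NOT safe to touch
+ * yet; the engine must wait for the quiesce report. */
+static void request_quiesce(struct model_engine *e)
+{
+	e->flags |= ACT_QUIESCE | EVENT_QUIESCE;
+}
+
+/* Engine side: clear QUIESCE so the thread can run again. */
+static void resume_thread(struct model_engine *e)
+{
+	e->flags &= ~ACT_QUIESCE;
+	e->may_access = false;
+}
+
+/* Thread side, at a safe spot: if any engine asserts QUIESCE, deliver
+ * the quiesce report to engines that asked for it and stay stopped. */
+static bool reach_safe_spot(struct model_engine *engines, size_t n)
+{
+	bool stop = false;
+
+	for (size_t i = 0; i < n; i++)
+		if (engines[i].flags & ACT_QUIESCE)
+			stop = true;
+	if (stop)
+		for (size_t i = 0; i < n; i++)
+			if (engines[i].flags & EVENT_QUIESCE)
+				engines[i].may_access = true;	/* report_quiesce */
+	return stop;
+}
+```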
+
+During the event callbacks (report_*), the thread in question makes the
+callback from a safe place. It is not quiescent, but it can safely access
+its own state. Callbacks can access thread state directly without setting
+the QUIESCE action flag. If a callback does want to prevent the thread
+from resuming normal execution, it *must* use the QUIESCE action state
+rather than simply blocking; see "Core Events & Callbacks", below.
+
+
+ Thread control
+ ------ -------
+
+These calls must be made on a quiescent thread (or the current thread):
+
+ int utrace_inject_signal(struct task_struct *target,
+ struct utrace_attached_engine *engine,
+ u32 action, siginfo_t *info,
+ const struct k_sigaction *ka);
+
+Cause a specified signal delivery in the target thread. This is not
+like kill, which generates a signal to be dequeued and delivered later.
+Injection directs the thread to deliver a signal now, before it next
+resumes in user mode or dequeues any other pending signal. It's as if
+the tracing engine intercepted a signal event and its report_signal
+callback returned the action argument as its value (see below). The
+info and ka arguments serve the same purposes as their counterparts in
+a report_signal callback.
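+
+The ordering difference between injection and kill can be shown with a
+toy signal state. This is a hypothetical userspace model: real pending
+signals are not dequeued in plain FIFO order, and utrace_inject_signal
+also takes an action, siginfo and sigaction that the model leaves out.
+
+```c
+#include <stdbool.h>
+
+#define MAX_PENDING 32
+
+/* Hypothetical model of a thread's signal state. */
+struct model_sigstate {
+	int pending[MAX_PENDING];	/* queued signals, FIFO in this model */
+	int nr_pending;
+	bool has_injected;
+	int injected;
+};
+
+/* Like kill(): queue the signal to be dequeued later. */
+static void model_send(struct model_sigstate *s, int sig)
+{
+	if (s->nr_pending < MAX_PENDING)
+		s->pending[s->nr_pending++] = sig;
+}
+
+/* Like an injection: act on this before dequeuing anything else. */
+static void model_inject(struct model_sigstate *s, int sig)
+{
+	s->injected = sig;
+	s->has_injected = true;
+}
+
+/* The next signal the thread acts on, or 0 if there is none. */
+static int model_next_signal(struct model_sigstate *s)
+{
+	int sig;
+
+	if (s->has_injected) {
+		s->has_injected = false;
+		return s->injected;
+	}
+	if (s->nr_pending == 0)
+		return 0;
+	sig = s->pending[0];
+	for (int i = 1; i < s->nr_pending; i++)
+		s->pending[i - 1] = s->pending[i];
+	s->nr_pending--;
+	return sig;
+}
+```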
+
+ const struct utrace_regset *
+ utrace_regset(struct task_struct *target,
+ struct utrace_attached_engine *engine,
+ const struct utrace_regset_view *view,
+ int which);
+
+Get access to machine state for the thread. The struct utrace_regset_view
+indicates a view of machine state, corresponding to a user mode
+architecture personality (such as 32-bit or 64-bit versions of a machine).
+The which argument selects one of the register sets available in that view.
+The utrace_regset call must be made before accessing any machine state,
+each time the thread has been running and has then become quiescent.
+It ensures that the thread's state is ready to be accessed, and returns
+the struct utrace_regset giving its accessor functions.
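+
+A heavily simplified model of a view and its register sets follows. The
+struct layouts, model_regset_lookup and the get signature are assumptions
+for illustration only; the real struct utrace_regset and struct
+utrace_regset_view are declared in linux/tracehook.h and carry more
+fields. The model rejects an out-of-range which before any state is read.
+
+```c
+#include <stddef.h>
+#include <string.h>
+
+/* Hypothetical, much-simplified register set and view. */
+struct model_regset {
+	const char *name;
+	unsigned int n;		/* number of registers */
+	unsigned int size;	/* bytes per register */
+	int (*get)(const void *state, unsigned int pos, unsigned int count,
+		   void *buf);
+};
+
+struct model_regset_view {
+	const char *name;	/* e.g. a 32-bit or 64-bit personality */
+	const struct model_regset *regsets;
+	unsigned int n;
+};
+
+/* Select regset 'which' from a view, rejecting out-of-range values. */
+static const struct model_regset *
+model_regset_lookup(const struct model_regset_view *view, int which)
+{
+	if (which < 0 || (unsigned int)which >= view->n)
+		return NULL;
+	return &view->regsets[which];
+}
+
+/* Trivial accessor: copy 'count' bytes starting at byte offset 'pos'.
+ * The caller is trusted to stay within n * size bytes. */
+static int copy_get(const void *state, unsigned int pos, unsigned int count,
+		    void *buf)
+{
+	memcpy(buf, (const char *)state + pos, count);
+	return 0;
+}
+```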
+
+XXX needs front ends for argument checks, export utrace_native_view
+
+
+ Core Events & Callbacks
+ ---- ------ - ---------
+
+Event reporting callbacks have details particular to the event type, but
+are all called in similar environments and have the same constraints.
+Callbacks are made from safe spots, where no locks are held, no special
+resources are pinned, and the user-mode state of the thread is accessible.
+So, callback code has a pretty free hand. But to be a good citizen,
+callback code should never block for long periods. It is fine to block in
+kmalloc and the like, but never wait for i/o or for user mode to do
+something. If you need the thread to wait, set UTRACE_ACTION_QUIESCE and
+return from the callback quickly. When your i/o finishes or whatever, you
+can use utrace_set_flags to resume the thread.
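+
+The non-blocking pattern can be sketched in a userspace model. The bit
+values, report_slow_event and io_done are hypothetical; apply_result
+models the NEWSTATE rule described earlier, and io_done stands in for
+calling utrace_set_flags from wherever the i/o completes.
+
+```c
+#include <stdbool.h>
+
+/* Hypothetical bits for this model only. */
+#define ACT_STATE_MASK 0x00ffUL
+#define ACT_QUIESCE    0x0001UL
+#define ACT_NEWSTATE   0x0100UL
+
+struct model_engine {
+	unsigned long flags;
+	bool io_pending;	/* stands in for slow work in flight */
+};
+
+/* Models the NEWSTATE rule: replace the action state bits. */
+static void apply_result(struct model_engine *e, unsigned long ret)
+{
+	if (ret & ACT_NEWSTATE)
+		e->flags = (e->flags & ~ACT_STATE_MASK) |
+			   (ret & ACT_STATE_MASK);
+}
+
+/* Event callback: start the slow work, ask for the thread to be held,
+ * and return at once instead of waiting here. */
+static unsigned long report_slow_event(struct model_engine *e)
+{
+	e->io_pending = true;
+	return ACT_NEWSTATE | ACT_QUIESCE;
+}
+
+/* Completion path in some other context: clearing QUIESCE here stands
+ * in for calling utrace_set_flags to let the thread resume. */
+static void io_done(struct model_engine *e)
+{
+	e->io_pending = false;
+	e->flags &= ~ACT_QUIESCE;
+}
+```
+
+Because the thread is held by the QUIESCE flag rather than by a private
+wait, a SIGKILL can still wake it while the i/o is outstanding.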
+
+Well-behaved callbacks are important to maintain two essential properties
+of the interface. The first of these is that unrelated tracing engines do
+not interfere with each other. If your engine's event callback does not
+return quickly, then another engine won't get the event notification in a
+timely manner. The second important property is that tracing be as
+noninvasive as possible to the normal operation of the system overall and
+of the traced thread in particular. That is, attached tracing engines
+should not perturb a thread's behavior, except to the extent that changing
+its user-visible state is explicitly what you want to do. (Obviously some
+perturbation is unavoidable, primarily timing changes, ranging from small
+delays due to the overhead of tracing, to arbitrary pauses in user code
+execution when a user stops a thread with a debugger for examination.
+When doing asynchronous utrace_attach to a thread doing a system call,
+more troublesome side effects are possible.)
+
+Even when you explicitly want the perturbation of making the traced
+thread block, just blocking directly in your callback has more unwanted
+effects. For example, the CLONE event callbacks are called when the new
+child thread has been created but not yet started running; the child can
+never be scheduled until the CLONE tracing callbacks return. (This allows
+engines tracing the parent to attach to the child.) If a CLONE event
+callback blocks the parent thread, it also prevents the child thread from
+running (even to process a SIGKILL). If what you want is to make both the
+parent and child block, then use utrace_attach on the child and then set
+the QUIESCE action state flag on both threads.
+
+A more crucial problem with blocking in callbacks is that it can prevent
+SIGKILL from working. A thread that is blocking due to
+UTRACE_ACTION_QUIESCE will still wake up and die immediately when sent a
+SIGKILL, as all threads should. Relying on the utrace infrastructure
+rather than on private synchronization calls in event callbacks is an
+important way to help keep tracing robustly noninvasive.
+
+
+EVENT(REAP) Dead thread has been reaped
+Callback:
+ void (*report_reap)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk);
+
+This means the parent called wait, or else this was a detached thread or
+a process whose parent ignores SIGCHLD. This cannot happen while the
+UTRACE_ACTION_NOREAP flag is set. This is the only callback you are
+guaranteed to get (if you set the flag).
+
+Unlike other callbacks, this can be called from the parent's context
+rather than from the traced thread itself, so it must not delay the parent
+by blocking. This callback also differs from all others in returning void.
+Once you get this callback, your engine is automatically detached and you
+cannot access this thread or use this struct utrace_attached_engine handle
+any longer. This is the place to clean up your data structures and
+synchronize with your code that might try to make utrace_* calls using this
+engine data structure. The struct is still valid during this callback,
+but will be freed soon after it returns (via RCU).
+
+In all other callbacks, the return value is as described above.
+The common UTRACE_ACTION_* flags in the return value are always observed.
+Unless otherwise specified below, other bits in the return value are ignored.
+
+
+EVENT(QUIESCE) Thread is quiescent
+Callback:
+ u32 (*report_quiesce)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk);
+
+This is the least interesting callback. It happens at any safe spot,
+including after any other event callback. This lets the tracing engine
+know that it is safe to access the thread's state, or to report to users
+that it has stopped running user code.
+
+EVENT(CLONE) Thread is creating a child
+Callback:
+ u32 (*report_clone)(struct utrace_attached_engine *engine,
+ struct task_struct *parent,
+ unsigned long clone_flags,
+ struct task_struct *child);
+
+A clone/clone2/fork/vfork system call has succeeded in creating a new
+thread or child process. The new process is fully formed, but not yet
+running. During this callback, other tracing engines are prevented from
+using utrace_attach asynchronously on the child, so that engines tracing
+the parent get the first opportunity to attach. After this callback
+returns, the child will start and the parent's system call will return.
+If CLONE_VFORK is set, the parent will block before returning.
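+
+A "follow children" policy, where each engine tracing the parent attaches
+to the new child before it runs, can be modelled as below. Everything
+here is hypothetical userspace code; a real engine would call
+utrace_attach on the child task_struct from its report_clone callback.
+
+```c
+#include <stddef.h>
+
+#define MAX_ENGINES 4
+#define EVENT_CLONE 0x1UL	/* hypothetical bit */
+
+/* Hypothetical per-thread bookkeeping for this model. */
+struct model_engine {
+	int owner;		/* which tracer this engine belongs to */
+	unsigned long events;
+};
+
+struct model_thread {
+	struct model_engine engines[MAX_ENGINES];
+	size_t n;
+};
+
+static int model_attach(struct model_thread *t, int owner,
+			unsigned long events)
+{
+	if (t->n == MAX_ENGINES)
+		return -1;
+	t->engines[t->n].owner = owner;
+	t->engines[t->n].events = events;
+	t->n++;
+	return 0;
+}
+
+/* Runs before the child is allowed to start: every parent engine with
+ * EVENT_CLONE set attaches to the child with the same event flags. */
+static void report_clone(const struct model_thread *parent,
+			 struct model_thread *child)
+{
+	for (size_t i = 0; i < parent->n; i++)
+		if (parent->engines[i].events & EVENT_CLONE)
+			model_attach(child, parent->engines[i].owner,
+				     parent->engines[i].events);
+}
+```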
+
+EVENT(VFORK_DONE) Finished waiting for CLONE_VFORK child
+Callback:
+ u32 (*report_vfork_done)(struct utrace_attached_engine *engine,
+ struct task_struct *parent, pid_t child_pid);
+
+Reported for a parent that used CLONE_VFORK or the vfork system call.
+The child has died or exec'd, so the vfork parent has unblocked
+and is about to return child_pid.
+
+UTRACE_EVENT(EXEC) Completed exec
+Callback:
+ u32 (*report_exec)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk,
+ const struct linux_binprm *bprm,
+ struct pt_regs *regs);
+
+An execve system call has succeeded and the new program is about to
+start running. The initial user register state in regs can be tweaked
+directly, or utrace_regset can be used for full machine state access.
+
+UTRACE_EVENT(EXIT) Thread is exiting
+Callback:
+ u32 (*report_exit)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk,
+ long orig_code, long *code);
+
+The thread is exiting and cannot be prevented from doing so, but all its
+state is still live. The *code value will be the wait result seen by
+the parent, and can be changed by this engine or others. The orig_code
+value is the real status, not changed by any tracing engine.
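+
+How several engines share *code while each still sees the real status
+can be modelled as a chain of hooks. The hook type and both example
+engines are hypothetical; the model assumes engines run in attach order
+and uses the Linux wait-status encoding of a normal exit.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical model: each engine's report_exit may rewrite *code. */
+typedef void (*exit_report_fn)(long orig_code, long *code);
+
+/* Run the exit reports in attach order; the result is what the parent's
+ * wait would see, while orig_code is passed unchanged to every engine. */
+static long run_exit_reports(const exit_report_fn *fns, size_t n,
+			     long orig_code)
+{
+	long code = orig_code;
+
+	for (size_t i = 0; i < n; i++)
+		fns[i](orig_code, &code);
+	return code;
+}
+
+/* Example engine: make a successful exit look like exit(1). In a Linux
+ * wait status, a normal exit with status s is encoded as s << 8. */
+static void fake_failure(long orig_code, long *code)
+{
+	if (orig_code == 0)
+		*code = 1 << 8;
+}
+
+/* Example engine that only observes the real status. */
+static void observe_only(long orig_code, long *code)
+{
+	(void)orig_code;
+	(void)code;
+}
+```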
+
+UTRACE_EVENT(DEATH) Thread has finished exiting
+Callback:
+ u32 (*report_death)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk);
+
+The thread is really dead now. If the UTRACE_ACTION_NOREAP flag is set
+after this callback, it remains an unreported zombie. Otherwise, it might
+be reaped by its parent, or self-reap immediately. Though the actual
+reaping may happen in parallel, a report_reap callback will always be
+ordered after a report_death callback.
+
+UTRACE_EVENT(SYSCALL_ENTRY) Thread has entered kernel for a system call
+Callback:
+ u32 (*report_syscall_entry)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk,
+ struct pt_regs *regs);
+
+The system call number and arguments can be seen and modified in the
+registers. The return value register has -ENOSYS, which will be
+returned for an invalid system call. The macro tracehook_abort_syscall(regs)
+will abort the system call so that we go immediately to syscall exit,
+and return -ENOSYS (or whatever the register state is changed to). If
+tracing engines keep the thread quiescent here, the system call will
+not be performed until it resumes.
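+
+The effect of aborting a system call at entry can be modelled as below.
+The register layout, model_abort_syscall and the rule that a negative
+number means "skip the call" are assumptions for this model only; the
+real register layout and tracehook_abort_syscall are per-architecture.
+
+```c
+#include <errno.h>
+
+/* Hypothetical register model; real layouts are per-architecture. */
+struct model_regs {
+	long syscall_nr;
+	long arg0;
+	long retval;		/* holds -ENOSYS on entry */
+};
+
+/* Hypothetical stand-in for tracehook_abort_syscall(regs): in this
+ * model a negative number means "skip straight to syscall exit". */
+static void model_abort_syscall(struct model_regs *regs)
+{
+	regs->syscall_nr = -1;
+}
+
+/* Example entry report: refuse one system call number with 'err'. */
+static void report_entry(struct model_regs *regs, long refused_nr, long err)
+{
+	if (regs->syscall_nr == refused_nr) {
+		model_abort_syscall(regs);
+		regs->retval = err;
+	}
+}
+
+/* Kernel side of the model: perform the call unless it was aborted. */
+static void model_do_syscall(struct model_regs *regs)
+{
+	if (regs->syscall_nr < 0)
+		return;			/* aborted: retval is left as is */
+	regs->retval = regs->arg0 * 2;	/* stand-in for the real work */
+}
+```
+
+Leaving retval alone after the abort would return -ENOSYS, as described
+above; the example engine overwrites it with its own error instead.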
+
+UTRACE_EVENT(SYSCALL_EXIT) Thread is leaving kernel after a system call
+Callback:
+ u32 (*report_syscall_exit)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk,
+ struct pt_regs *regs);
+
+The return value can be seen and modified in the registers. If the
+thread is allowed to resume, it will see any pending signals and then
+return to user mode.
+
+UTRACE_EVENT(SIGNAL) Signal caught by user handler
+UTRACE_EVENT(SIGNAL_IGN) Signal with no effect (SIG_IGN or default)
+UTRACE_EVENT(SIGNAL_STOP) Job control stop signal
+UTRACE_EVENT(SIGNAL_TERM) Fatal termination signal
+UTRACE_EVENT(SIGNAL_CORE) Fatal core-dump signal
+UTRACE_EVENT_SIGNAL_ALL All of the above (bitmask)
+Callback:
+ u32 (*report_signal)(struct utrace_attached_engine *engine,
+ struct task_struct *tsk,
+ u32 action, siginfo_t *info,
+ const struct k_sigaction *orig_ka,
+ struct k_sigaction *return_ka);
+
+There are five types of signal events, but all use the same callback.
+These happen when a thread is dequeuing a signal to be delivered.
+(Not immediately when the signal is sent, and not when the signal is
+blocked.) No signal event is reported for SIGKILL; no tracing engine
+can prevent it from killing the thread immediately. The specific
+event types allow an engine to trace signals based on what they do.
+UTRACE_EVENT_SIGNAL_ALL is all of them OR'd together, to trace all
+signals (except SIGKILL). A subset of these event flags can be used
+e.g. to catch only fatal signals, not handled ones, or to catch only
+core-dump signals, not normal termination signals.
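+
+The selection of signal event types can be modelled as a mapping from a
+signal's disposition to one event bit. The bit values are invented, and
+reporting both stop dispositions as the SIGNAL_STOP event is an
+assumption based on the list above.
+
+```c
+#include <signal.h>
+#include <stdbool.h>
+
+/* Hypothetical event bits for this model only. */
+#define EV_SIGNAL      0x01UL	/* caught by a user handler */
+#define EV_SIGNAL_IGN  0x02UL	/* no effect */
+#define EV_SIGNAL_STOP 0x04UL	/* job control stop */
+#define EV_SIGNAL_TERM 0x08UL	/* fatal termination */
+#define EV_SIGNAL_CORE 0x10UL	/* fatal core dump */
+#define EV_SIGNAL_ALL  0x1fUL	/* all five bits above */
+
+enum disposition { DISP_DELIVER, DISP_IGN, DISP_TERM, DISP_CORE,
+		   DISP_STOP, DISP_TSTP };
+
+/* Which event type a dequeued signal is reported under. */
+static unsigned long signal_event_for(enum disposition d)
+{
+	switch (d) {
+	case DISP_DELIVER:
+		return EV_SIGNAL;
+	case DISP_IGN:
+		return EV_SIGNAL_IGN;
+	case DISP_TERM:
+		return EV_SIGNAL_TERM;
+	case DISP_CORE:
+		return EV_SIGNAL_CORE;
+	case DISP_STOP:
+	case DISP_TSTP:
+		return EV_SIGNAL_STOP;
+	}
+	return 0;
+}
+
+/* Would an engine with these event flags get a report_signal call? */
+static bool engine_sees_signal(unsigned long events, int signo,
+			       enum disposition d)
+{
+	if (signo == SIGKILL)	/* never reported */
+		return false;
+	return (events & signal_event_for(d)) != 0;
+}
+```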
+
+The action argument says what the signal's default disposition is:
+
+ UTRACE_SIGNAL_DELIVER Run the user handler from sigaction.
+ UTRACE_SIGNAL_IGN Do nothing, ignore the signal.
+ UTRACE_SIGNAL_TERM Terminate the process.
+ UTRACE_SIGNAL_CORE Terminate the process and write a core dump.
+ UTRACE_SIGNAL_STOP Absolutely stop the process, a la SIGSTOP.
+ UTRACE_SIGNAL_TSTP Job control stop (no stop if orphaned).
+
+This selection is made by consulting the process's sigaction and the
+default action for the signal number, but may already have been
+changed by an earlier tracing engine (in which case you see its override).
+A return value of UTRACE_ACTION_RESUME means to carry out this action.
+If instead UTRACE_SIGNAL_* bits are in the return value, that overrides
+the normal behavior of the signal.
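+
+The override chain can be modelled as below: each engine sees the action
+as left by the engines before it, and a zero return keeps it. The ACT_*
+values and the hook signature are hypothetical; the real report_signal
+callback also receives the engine, task, siginfo and sigactions.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical encoding for this model: ACT_KEEP (0) means "carry out
+ * the action as it stands"; anything else overrides it. */
+enum sigact { ACT_KEEP = 0, ACT_DELIVER, ACT_IGN, ACT_TERM, ACT_CORE,
+	      ACT_STOP, ACT_TSTP };
+
+typedef enum sigact (*signal_report_fn)(enum sigact seen);
+
+/* Engines run in attach order; each sees the action as left by the
+ * engines before it. */
+static enum sigact run_signal_reports(const signal_report_fn *fns, size_t n,
+				      enum sigact action)
+{
+	for (size_t i = 0; i < n; i++) {
+		enum sigact ret = fns[i](action);
+
+		if (ret != ACT_KEEP)
+			action = ret;
+	}
+	return action;
+}
+
+/* Example engine: ignore fatal signals. */
+static enum sigact ignore_fatal(enum sigact seen)
+{
+	return (seen == ACT_TERM || seen == ACT_CORE) ? ACT_IGN : ACT_KEEP;
+}
+
+/* Example engine: stop the thread instead of ignoring a signal. */
+static enum sigact stop_if_ignored(enum sigact seen)
+{
+	return seen == ACT_IGN ? ACT_STOP : ACT_KEEP;
+}
+```
+
+With ignore_fatal attached first, stop_if_ignored sees ACT_IGN for a
+fatal signal and turns it into a stop; in the opposite order it sees the
+original termination and leaves it alone.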
+
+The signal number and other details of the signal are in info, and
+this data can be changed to make the thread see a different signal.
+A return value of UTRACE_SIGNAL_DELIVER says to follow the sigaction in
+return_ka, which can specify a user handler or SIG_IGN to ignore the
+signal or SIG_DFL to follow the default action for info->si_signo.
+The orig_ka parameter shows the process's sigaction at the time the
+signal was dequeued, and return_ka initially contains this. Tracing
+engines can modify return_ka to change the effects of delivery.
+For other UTRACE_SIGNAL_* return values, return_ka is ignored.
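+
+What a DELIVER return does with return_ka can be modelled with a reduced
+sigaction holding only the handler. struct model_ka, deliver_with and
+force_default are hypothetical stand-ins for struct k_sigaction and an
+engine's report_signal; only SIG_IGN and SIG_DFL come from <signal.h>.
+
+```c
+#include <signal.h>
+
+/* Hypothetical, reduced stand-in for struct k_sigaction. */
+struct model_ka {
+	void (*handler)(int);
+};
+
+enum outcome { RUN_HANDLER, IGNORED, DEFAULT_ACTION };
+
+/* What a DELIVER result does with return_ka in this model. */
+static enum outcome deliver_with(const struct model_ka *return_ka)
+{
+	if (return_ka->handler == SIG_IGN)
+		return IGNORED;
+	if (return_ka->handler == SIG_DFL)
+		return DEFAULT_ACTION;
+	return RUN_HANDLER;
+}
+
+/* Stand-in for the program's own handler. */
+static void user_handler(int sig)
+{
+	(void)sig;
+}
+
+/* Example engine policy: make SIGUSR1 take its default action instead
+ * of running the program's handler. orig_ka is left untouched. */
+static void force_default(int signo, const struct model_ka *orig_ka,
+			  struct model_ka *return_ka)
+{
+	(void)orig_ka;
+	if (signo == SIGUSR1)
+		return_ka->handler = SIG_DFL;
+}
+```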
+
+UTRACE_SIGNAL_HOLD is a flag bit that can be OR'd into the return
+value. It says to push the signal back on the thread's queue, with
+the signal number and details possibly changed in info. When the
+thread is allowed to resume, it will dequeue and report it again.
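+
+The HOLD behavior can be modelled with a small queue. The queue type,
+MODEL_SIGNAL_HOLD and finish_report are hypothetical, and pushing the
+held signal to the front so it is dequeued first on resume is an
+assumption of the model, not something stated above.
+
+```c
+#include <stdbool.h>
+
+#define QMAX 16
+#define MODEL_SIGNAL_HOLD 0x1000UL	/* hypothetical bit */
+
+/* Hypothetical pending-signal queue; the front is dequeued next. */
+struct model_queue {
+	int sig[QMAX];
+	int n;
+};
+
+static bool push_front(struct model_queue *q, int sig)
+{
+	if (q->n == QMAX)
+		return false;
+	for (int i = q->n; i > 0; i--)
+		q->sig[i] = q->sig[i - 1];
+	q->sig[0] = sig;
+	q->n++;
+	return true;
+}
+
+static int dequeue(struct model_queue *q)	/* 0 if empty */
+{
+	int sig;
+
+	if (q->n == 0)
+		return 0;
+	sig = q->sig[0];
+	for (int i = 1; i < q->n; i++)
+		q->sig[i - 1] = q->sig[i];
+	q->n--;
+	return sig;
+}
+
+/* After report_signal: with HOLD set, push the (possibly changed)
+ * signal back so it is dequeued and reported again on resume. */
+static bool finish_report(struct model_queue *q, int signo,
+			  unsigned long ret)
+{
+	if (ret & MODEL_SIGNAL_HOLD)
+		return push_front(q, signo);
+	return false;
+}
+```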