DRAFT DRAFT DRAFT	WORK IN PROGRESS	DRAFT DRAFT DRAFT

This is work in progress and likely to change.


	Roland McGrath <roland@redhat.com>

---

		User Debugging Data & Event Rendezvous
		---- --------- ---- - ----- ----------

See linux/utrace.h for all the declarations used here.
See also linux/tracehook.h for the utrace_regset declarations.

The UTRACE is infrastructure code for tracing and controlling user
threads.  This is the foundation for writing tracing engines, which
can be loadable kernel modules.  The UTRACE interfaces provide three
basic facilities:

* Thread event reporting

  Tracing engines can request callbacks for events of interest in
  the thread: signals, system calls, exit, exec, clone, etc.

* Core thread control

  Tracing engines can prevent a thread from running (keeping it in
  TASK_TRACED state), or make it single-step or block-step (when
  hardware supports it).  Engines can cause a thread to abort system
  calls, they change the behaviors of signals, and they can inject
  signal-style actions at will.

* Thread machine state access

  Tracing engines can read and write a thread's registers and
  similar per-thread CPU state.


	Tracing engines
	------- -------

The basic actors in UTRACE are the thread and the tracing engine.
A tracing engine is some body of code that calls into the utrace_*
interfaces, represented by a struct utrace_engine_ops.  (Usually it's a
kernel module, though the legacy ptrace support is a tracing engine
that is not in a kernel module.)  The UTRACE interface operates on
individual threads (struct task_struct).  If an engine wants to
treat several threads as a group, that is up to its higher-level
code.  Using the UTRACE starts out by attaching an engine to a thread.

	struct utrace_attached_engine *
	utrace_attach(struct task_struct *target, int flags,
		      const struct utrace_engine_ops *ops, unsigned long data);

Calling utrace_attach is what sets up a tracing engine to trace a
thread.  Use UTRACE_ATTACH_CREATE in flags, and pass your engine's ops.
Check the return value with IS_ERR.  If successful, it returns a
struct pointer that is the handle used in all other utrace_* calls.
The data argument is stored in the utrace_attached_engine structure,
for your code to use however it wants.

	void utrace_detach(struct task_struct *target,
			   struct utrace_attached_engine *engine);

The utrace_detach call removes an engine from a thread.
No more callbacks will be made after this returns.


An attached engine does nothing by default.
An engine makes something happen by setting its flags.

	void utrace_set_flags(struct task_struct *target,
			      struct utrace_attached_engine *engine,
			      unsigned long flags);


	Action Flags
	------ -----

There are two kinds of flags that an attached engine can set: event
flags, and action flags.  Event flags register interest in particular
events; when an event happens and an engine has the right event flag
set, it gets a callback.  Action flags change the normal behavior of
the thread.  The action flags available are:

	UTRACE_ACTION_QUIESCE

		The thread will stay quiescent (see below).
		As long as any engine asserts the QUIESCE action flag,
		the thread will not resume running in user mode.
		(Usually it will be in TASK_TRACED state.)
		Nothing will wake the thread up except for SIGKILL
		(and implicit SIGKILLs such as a core dump in
		another thread sharing the same address space, or a
		group exit or fatal signal in another thread in the
		same thread group).

	UTRACE_ACTION_SINGLESTEP

		When the thread runs, it will run one instruction
		and then trap.  (Exiting a system call or entering a
		signal handler is considered "an instruction" for this.)
		This can be used only if ARCH_HAS_SINGLE_STEP #define'd
		by <asm/tracehook.h> and evaluates to nonzero.

	UTRACE_ACTION_BLOCKSTEP

		When the thread runs, it will run until the next branch,
		and then trap.  (Exiting a system call or entering a
		signal handler is considered a branch for this.)
		When the SINGLESTEP flag is set, BLOCKSTEP has no effect.
		This is only available on some machines (actually none yet).
		This can be used only if ARCH_HAS_BLOCK_STEP #define'd
		by <asm/tracehook.h> and evaluates to nonzero.

	UTRACE_ACTION_NOREAP

		When the thread exits or stops for job control, its
		parent process will not receive a SIGCHLD and the
		parent's wait calls will not wake up or report the
		child as dead.  A well-behaved tracing engine does not
		want to interfere with the parent's normal notifications.
		This is provided mainly for the ptrace compatibility
		code to implement the traditional behavior.

Event flags are specified using the macro UTRACE_EVENT(TYPE).
Each event type is associated with a report_* callback in struct
utrace_engine_ops.  A tracing engine can leave unused callbacks NULL.
The only callbacks required are those used by the event flags it sets.

Many engines can be attached to each thread.  When a thread has an
event, each engine gets a report_* callback if it has set the event flag
for that event type.  Engines are called in the order they attached.

Each callback takes arguments giving the details of the particular
event.  The first two arguments two every callback are the struct
utrace_attached_engine and struct task_struct pointers for the engine
and the thread producing the event.  Usually this will be the current
thread that is running the callback functions.

The return value of report_* callbacks is a bitmask.  Some bits are
common to all callbacks, and some are particular to that callback and
event type.  The value zero (UTRACE_ACTION_RESUME) always means the
simplest thing: do what would have happened with no tracing engine here.
These are the flags that can be set in any report_* return value:

	UTRACE_ACTION_NEWSTATE

		Update the action state flags, described above.  Those
		bits from the return value (UTRACE_ACTION_STATE_MASK)
		replace those bits in the engine's flags.  This has the
		same effect as calling utrace_set_flags, but is a more
		efficient short-cut.  To change the event flags, you must
		call utrace_set_flags.

	UTRACE_ACTION_DETACH

		Detach this engine.  This has the effect of calling
		utrace_detach, but is a more efficient short-cut.

	UTRACE_ACTION_HIDE

		Hide this event from other tracing engines.  This is
		only appropriate to do when the event was induced by
		some action of this engine, such as a breakpoint trap.
		Some events cannot be hidden, since every engine has to
		know about them: exit, death, reap.

The return value bits in UTRACE_ACTION_OP_MASK indicate a change to the
normal behavior of the event taking place.  If zero, the thread does
whatever that event normally means.  For report_signal, other values
control the disposition of the signal.


	Quiescence
	----------

To control another thread and access its state, it must be "quiescent".
This means that it is stopped and won't start running again while we access
it.  A quiescent thread is stopped in a place close to user mode, where the
user state can be accessed safely; either it's about to return to user
mode, or it's just entered the kernel from user mode, or it has already
finished exiting (TASK_ZOMBIE).  Setting the UTRACE_ACTION_QUIESCE action
flag will force the attached thread to become quiescent soon.  After
setting the flag, an engine must wait for an event callback when the thread
becomes quiescent.  The thread may be running on another CPU, or may be in
an uninterruptible wait.  When it is ready to be examined, it will make
callbacks to engines that set the UTRACE_EVENT(QUIESCE) event flag.

As long as some engine has UTRACE_ACTION_QUIESCE set, then the thread will
remain stopped.  SIGKILL will wake it up, but it will not run user code.
When the flag is cleared via utrace_set_flags or a callback return value,
the thread starts running again.

During the event callbacks (report_*), the thread in question makes the
callback from a safe place.  It is not quiescent, but it can safely access
its own state.  Callbacks can access thread state directly without setting
the QUIESCE action flag.  If a callback does want to prevent the thread
from resuming normal execution, it *must* use the QUIESCE action state
rather than simply blocking; see "Core Events & Callbacks", below.


	Thread control
	------ -------

These calls must be made on a quiescent thread (or the current thread):

	int utrace_inject_signal(struct task_struct *target,
				 struct utrace_attached_engine *engine,
				 u32 action, siginfo_t *info,
				 const struct k_sigaction *ka);

Cause a specified signal delivery in the target thread.  This is not
like kill, which generates a signal to be dequeued and delivered later.
Injection directs the thread to deliver a signal now, before it next
resumes in user mode or dequeues any other pending signal.  It's as if
the tracing engine intercepted a signal event and its report_signal
callback returned the action argument as its value (see below).  The
info and ka arguments serve the same purposes as their counterparts in
a report_signal callback.

	const struct utrace_regset *
	utrace_regset(struct task_struct *target,
		      struct utrace_attached_engine *engine,
		      const struct utrace_regset_view *view,
		      int which);

Get access to machine state for the thread.  The struct utrace_regset_view
indicates a view of machine state, corresponding to a user mode
architecture personality (such as 32-bit or 64-bit versions of a machine).
The which argument selects one of the register sets available in that view.
The utrace_regset call must be made before accessing any machine state,
each time the thread has been running and has then become quiescent.
It ensures that the thread's state is ready to be accessed, and returns
the struct utrace_regset giving its accessor functions.

XXX needs front ends for argument checks, export utrace_native_view


	Core Events & Callbacks
	---- ------ - ---------

Event reporting callbacks have details particular to the event type, but
are all called in similar environments and have the same constraints.
Callbacks are made from safe spots, where no locks are held, no special
resources are pinned, and the user-mode state of the thread is accessible.
So, callback code has a pretty free hand.  But to be a good citizen,
callback code should never block for long periods.  It is fine to block in
kmalloc and the like, but never wait for i/o or for user mode to do
something.  If you need the thread to wait, set UTRACE_ACTION_QUIESCE and
return from the callback quickly.  When your i/o finishes or whatever, you
can use utrace_set_flags to resume the thread.

Well-behaved callbacks are important to maintain two essential properties
of the interface.  The first of these is that unrelated tracing engines not
interfere with each other.  If your engine's event callback does not return
quickly, then another engine won't get the event notification in a timely
manner.  The second important property is that tracing be as noninvasive as
possible to the normal operation of the system overall and of the traced
thread in particular.  That is, attached tracing engines should not perturb
a thread's behavior, except to the extent that changing its user-visible
state is explicitly what you want to do.  (Obviously some perturbation is
unavoidable, primarily timing changes, ranging from small delays due to the
overhead of tracing, to arbitrary pauses in user code execution when a user
stops a thread with a debugger for examination.  When doing asynchronous
utrace_attach to a thread doing a system call, more troublesome side
effects are possible.)  Even when you explicitly want the pertrubation of
making the traced thread block, just blocking directly in your callback has
more unwanted effects.  For example, the CLONE event callbacks are called
when the new child thread has been created but not yet started running; the
child can never be scheduled until the CLONE tracing callbacks return.
(This allows engines tracing the parent to attach to the child.)  If a
CLONE event callback blocks the parent thread, it also prevents the child
thread from running (even to process a SIGKILL).  If what you want is to
make both the parent and child block, then use utrace_attach on the child
and then set the QUIESCE action state flag on both threads.  A more crucial
problem with blocking in callbacks is that it can prevent SIGKILL from
working.  A thread that is blocking due to UTRACE_ACTION_QUIESCE will still
wake up and die immediately when sent a SIGKILL, as all threads should.
Relying on the utrace infrastructure rather than on private synchronization
calls in event callbacks is an important way to help keep tracing robustly
noninvasive.


EVENT(REAP)		Dead thread has been reaped
Callback:
	void (*report_reap)(struct utrace_attached_engine *engine,
			    struct task_struct *tsk);

This means the parent called wait, or else this was a detached thread or
a process whose parent ignores SIGCHLD.  This cannot happen while the
UTRACE_ACTION_NOREAP flag is set.  This is the only callback you are
guaranteed to get (if you set the flag).

Unlike other callbacks, this can be called from the parent's context
rather than from the traced thread itself--it must not delay the parent by
blocking.  This callback is different from all others, it returns void.
Once you get this callback, your engine is automatically detached and you
cannot access this thread or use this struct utrace_attached_engine handle
any longer.  This is the place to clean up your data structures and
synchronize with your code that might try to make utrace_* calls using this
engine data structure.  The struct is still valid during this callback,
but will be freed soon after it returns (via RCU).

In all other callbacks, the return value is as described above.
The common UTRACE_ACTION_* flags in the return value are always observed.
Unless otherwise specified below, other bits in the return value are ignored.


EVENT(QUIESCE)		Thread is quiescent
Callback:
	u32 (*report_quiesce)(struct utrace_attached_engine *engine,
			      struct task_struct *tsk);

This is the least interesting callback.  It happens at any safe spot,
including after any other event callback.  This lets the tracing engine
know that it is safe to access the thread's state, or to report to users
that it has stopped running user code.

EVENT(CLONE)		Thread is creating a child
Callback:
	u32 (*report_clone)(struct utrace_attached_engine *engine,
			    struct task_struct *parent,
			    unsigned long clone_flags,
			    struct task_struct *child);

A clone/clone2/fork/vfork system call has succeeded in creating a new
thread or child process.  The new process is fully formed, but not yet
running.  During this callback, other tracing engines are prevented from
using utrace_attach asynchronously on the child, so that engines tracing
the parent get the first opportunity to attach.  After this callback
returns, the child will start and the parent's system call will return.
If CLONE_VFORK is set, the parent will block before returning.

EVENT(VFORK_DONE)	Finished waiting for CLONE_VFORK child
Callback:
	u32 (*report_vfork_done)(struct utrace_attached_engine *engine,
				 struct task_struct *parent, pid_t child_pid);

Event reported for parent using CLONE_VFORK or vfork system call.
The child has died or exec'd, so the vfork parent has unblocked
and is about to return child_pid.

UTRACE_EVENT(EXEC)		Completed exec
Callback:
	u32 (*report_exec)(struct utrace_attached_engine *engine,
			   struct task_struct *tsk,
			   const struct linux_binprm *bprm,
			   struct pt_regs *regs);

An execve system call has succeeded and the new program is about to
start running.  The initial user register state is handy to be tweaked
directly, or utrace_regset can be used for full machine state access.

UTRACE_EVENT(EXIT)		Thread is exiting
Callback:
	u32 (*report_exit)(struct utrace_attached_engine *engine,
			   struct task_struct *tsk,
			   long orig_code, long *code);

The thread is exiting and cannot be prevented from doing so, but all its
state is still live.  The *code value will be the wait result seen by
the parent, and can be changed by this engine or others.  The orig_code
value is the real status, not changed by any tracing engine.

UTRACE_EVENT(DEATH)		Thread has finished exiting
Callback:
	u32 (*report_death)(struct utrace_attached_engine *engine,
			    struct task_struct *tsk);

The thread is really dead now.  If the UTRACE_ACTION_NOREAP flag is set
after this callback, it remains an unreported zombie.  Otherwise, it might
be reaped by its parent, or self-reap immediately.  Though the actual
reaping may happen in parallel, a report_reap callback will always be
ordered after a report_death callback.

UTRACE_EVENT(SYSCALL_ENTRY)	Thread has entered kernel for a system call
Callback:
	u32 (*report_syscall_entry)(struct utrace_attached_engine *engine,
				    struct task_struct *tsk,
				    struct pt_regs *regs);

The system call number and arguments can be seen and modified in the
registers.  The return value register has -ENOSYS, which will be
returned for an invalid system call.  The macro tracehook_abort_syscall(regs)
will abort the system call so that we go immediately to syscall exit,
and return -ENOSYS (or whatever the register state is changed to).  If
tracing enginges keep the thread quiescent here, the system call will
not be performed until it resumes.

UTRACE_EVENT(SYSCALL_EXIT)	Thread is leaving kernel after a system call
Callback:
	u32 (*report_syscall_exit)(struct utrace_attached_engine *engine,
				   struct task_struct *tsk,
				   struct pt_regs *regs);

The return value can be seen and modified in the registers.  If the
thread is allowed to resume, it will see any pending signals and then
return to user mode.

UTRACE_EVENT(SIGNAL)		Signal caught by user handler
UTRACE_EVENT(SIGNAL_IGN)		Signal with no effect (SIG_IGN or default)
UTRACE_EVENT(SIGNAL_STOP)	Job control stop signal
UTRACE_EVENT(SIGNAL_TERM)	Fatal termination signal
UTRACE_EVENT(SIGNAL_CORE)	Fatal core-dump signal
UTRACE_EVENT_SIGNAL_ALL		All of the above (bitmask)
Callback:
	u32 (*report_signal)(struct utrace_attached_engine *engine,
			     struct task_struct *tsk,
			     u32 action, siginfo_t *info,
			     const struct k_sigaction *orig_ka,
			     struct k_sigaction *return_ka);

There are five types of signal events, but all use the same callback.
These happen when a thread is dequeuing a signal to be delivered.
(Not immediately when the signal is sent, and not when the signal is
blocked.)  No signal event is reported for SIGKILL; no tracing engine
can prevent it from killing the thread immediately.  The specific
event types allow an engine to trace signals based on what they do.
UTRACE_EVENT_SIGNAL_ALL is all of them OR'd together, to trace all
signals (except SIGKILL).  A subset of these event flags can be used
e.g. to catch only fatal signals, not handled ones, or to catch only
core-dump signals, not normal termination signals.

The action argument says what the signal's default disposition is:

	UTRACE_SIGNAL_DELIVER	Run the user handler from sigaction.
	UTRACE_SIGNAL_IGN	Do nothing, ignore the signal.
	UTRACE_SIGNAL_TERM	Terminate the process.
	UTRACE_SIGNAL_CORE	Terminate the process a write a core dump.
	UTRACE_SIGNAL_STOP	Absolutely stop the process, a la SIGSTOP.
	UTRACE_SIGNAL_TSTP	Job control stop (no stop if orphaned).

This selection is made from consulting the process's sigaction and the
default action for the signal number, but may already have been
changed by an earlier tracing engine (in which case you see its override).
A return value of UTRACE_ACTION_RESUME means to carry out this action.
If instead UTRACE_SIGNAL_* bits are in the return value, that overrides
the normal behavior of the signal.

The signal number and other details of the signal are in info, and
this data can be changed to make the thread see a different signal.
A return value of UTRACE_SIGNAL_DELIVER says to follow the sigaction in
return_ka, which can specify a user handler or SIG_IGN to ignore the
signal or SIG_DFL to follow the default action for info->si_signo.
The orig_ka parameter shows the process's sigaction at the time the
signal was dequeued, and return_ka initially contains this.  Tracing
engines can modify return_ka to change the effects of delivery.
For other UTRACE_SIGNAL_* return values, return_ka is ignored.

UTRACE_SIGNAL_HOLD is a flag bit that can be OR'd into the return
value.  It says to push the signal back on the thread's queue, with
the signal number and details possibly changed in info.  When the
thread is allowed to resume, it will dequeue and report it again.