relayfs - a high-speed data relay filesystem
============================================

relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large amounts of data from kernel space
to user space.

The main idea behind relayfs is that every data flow is put into a
separate "channel" and each channel is a file.  In practice, each
channel is a separate memory buffer allocated from within kernel space
upon channel instantiation. Software needing to relay data to user
space would open a channel or a number of channels, depending on its
needs, and would log data to that channel. All the buffering and
locking mechanics are taken care of by relayfs.  The actual format and
protocol used for each channel is up to relayfs' clients.

relayfs makes no provisions for copying the same data to more than a
single channel. This is for the clients of the relay to take care of,
and so is any form of data filtering. The purpose is to keep relayfs
as simple as possible.


Usage
=====

In addition to the relayfs kernel API described below, relayfs
implements basic file operations.  Here are the file operations that
are available and some comments regarding their behavior:

open()	 enables user to open an _existing_ channel.  A channel can be
	 opened in blocking or non-blocking mode, and can be opened
	 for reading as well as for writing.  Readers will by default
	 be auto-consuming.

mmap()	 results in channel's memory buffer being mmapped into the
	 caller's memory space.

read()	 since we are dealing with circular buffers, the user is only
	 allowed to read forward.  Some apps may want to loop around
	 read() waiting for incoming data - if there is no data
	 available, read will put the reader on a wait queue until
	 data is available (blocking mode).  Non-blocking reads return
	 -EAGAIN if data is not available.


write()	 writing from user space operates exactly as relay_write() does
	 (described below).

poll()	POLLIN/POLLRDNORM/POLLOUT/POLLWRNORM/POLLERR supported.

close()  decrements the channel's refcount.  When the refcount reaches
	 0 i.e. when no process or kernel client has the file open
	 (see relay_close() below), the channel buffer is freed.


In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted.  For example,

	mount -t relayfs relayfs /mountpoint
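
Once the filesystem is mounted, a user-space reader following the
read() and poll() behavior described above might look like the sketch
below.  The helper name drain_channel() is hypothetical and not part
of relayfs; it assumes the descriptor was opened non-blocking, and it
works on any pollable file descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read up to cap bytes from a non-blocking descriptor into out.
 * Waits at most timeout_ms for data to arrive, then reads until the
 * descriptor has no more data (EAGAIN), reaches EOF, or out is full.
 * Returns the number of bytes read, or -1 on error.
 */
static long drain_channel(int fd, char *out, size_t cap, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	size_t total = 0;
	int ready;

	assert(out != NULL);

	ready = poll(&pfd, 1, timeout_ms);
	if (ready < 0)
		return -1;
	if (ready == 0)
		return 0;		/* timed out, nothing to read */

	while (total < cap) {
		ssize_t n = read(fd, out + total, cap - total);

		if (n > 0) {
			total += (size_t)n;
			continue;
		}
		if (n == 0)
			break;		/* end of file */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			break;		/* no more data right now */
		if (errno == EINTR)
			continue;	/* interrupted, retry */
		return -1;
	}
	return (long)total;
}
```

A caller would typically open the channel file under the mount point
with O_RDONLY | O_NONBLOCK and call drain_channel() in a loop.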


The relayfs kernel API
======================

relayfs channels are implemented as circular buffers subdivided into
'sub-buffers'.  kernel clients write data into the channel using
relay_write(), and are notified via a set of callbacks when
significant events occur within the channel.  'Significant events'
include:

- a sub-buffer has been filled i.e. the current write won't fit into the
  current sub-buffer, and a 'buffer-switch' is triggered, after which
  the data is written into the next buffer (if the next buffer is
  empty).  The client is notified of this condition via two callbacks,
  one providing an opportunity to perform start-of-buffer tasks, the
  other end-of-buffer tasks.

- data is ready for the client to process.  The client can choose to
  be notified either on a per-sub-buffer basis (bulk delivery) or
  per-write basis (packet delivery).

- data has been written to the channel from user space.  The client can
  use this notification to accept and process 'commands' sent to the
  channel via write(2).

- the channel has been opened/closed/mapped/unmapped from user space.
  The client can use this notification to trigger actions within the
  kernel application, such as enabling/disabling logging to the
  channel.  It can also return result codes from the callback,
  indicating that the operation should fail e.g. in order to prevent
  more than one user space open or mmap.

- the channel needs resizing, or needs to update its
  state based on the results of the resize.  Resizing the channel is
  up to the kernel client to actually perform.  If the channel is
  configured for resizing, the client is notified when the unread data
  in the channel passes a preset threshold, giving it the opportunity
  to allocate a new channel buffer and replace the old one.

Reader objects
--------------

Channel readers use an opaque rchan_reader object to read from
channels.  For VFS readers (those using read(2) to read from a
channel), these objects are automatically created and used internally;
only kernel clients that need to directly read from channels, or whose
userspace applications use mmap to access channel data, need to know
anything about rchan_readers - others may skip this section.

A relay channel can have any number of readers, each represented by an
rchan_reader instance, which is used to encapsulate reader settings
and state.  rchan_reader objects should be treated as opaque by kernel
clients.  To create a reader object for directly accessing a channel
from kernel space, call the add_rchan_reader() kernel API function:

rchan_reader *add_rchan_reader(rchan_id, auto_consume)

This function returns an rchan_reader instance if successful, which
should then be passed to relay_read() when the kernel client is
interested in reading from the channel.

The auto_consume parameter indicates whether a read done by this
reader will automatically 'consume' that portion of the unread channel
buffer when relay_read() is called (see below for more details).

To close the reader, call

remove_rchan_reader(reader)

which will remove the reader from the list of current readers.


To create a reader object representing a userspace mmap reader in the
kernel application, call the add_map_reader() kernel API function:

rchan_reader *add_map_reader(rchan_id)

This function returns an rchan_reader instance if successful, whose
main purpose is as an argument to be passed into
relay_buffers_consumed() when the kernel client becomes aware that
data has been read by a user application using mmap to read from the
channel buffer.  There is no auto_consume option in this case, since
only the kernel client/user application knows when data has been read.

To close the map reader, call

remove_map_reader(reader)

which will remove the reader from the list of current readers.

Consumed count
--------------

A relayfs channel is a circular buffer, which means that if there is
no reader reading from it or a reader reading too slowly, at some
point the channel writer will 'lap' the reader and data will be lost.
In normal use, readers will always be able to keep up with writers and
the buffer is thus never in danger of becoming full.  In many
applications, it's sufficient to ensure that this is practically
speaking always the case, by making the buffers large enough.  These
types of applications can simply open the channel as
RELAY_MODE_CONTINUOUS (the default anyway), not worry about the
meaning of 'consume', and skip the rest of this section.

If it's important for the application that a kernel client never allow
writers to overwrite unread data, the channel should be opened using
RELAY_MODE_NO_OVERWRITE and must be kept apprised of the count of
bytes actually read by the (typically) user-space channel readers.
This count is referred to as the 'consumed count'.  read(2) channel
readers automatically update the channel's 'consumed count' as they
read.  If the usage mode is to have only read(2) readers, which is
typically the case, the kernel client doesn't need to worry about any
of the relayfs functions having to do with 'bytes consumed' and can
skip the rest of this section.  (Note that it is possible to have
multiple read(2) or auto-consuming readers, but like having multiple
readers on a pipe, these readers will race with each other i.e. it's
supported, but doesn't make much sense).

If the kernel client cannot rely on an auto-consuming reader to keep
the 'consumed count' up-to-date, then it must do so manually, by
making the appropriate calls to relay_buffers_consumed() or
relay_bytes_consumed().  In most cases, this should only be necessary
for bulk mmap clients - almost all packet clients should be covered by
having auto-consuming read(2) readers.  For mmapped bulk clients, for
instance, there are no auto-consuming VFS readers, so the kernel
client needs to make the call to relay_buffers_consumed() after
sub-buffers are read.
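
The bookkeeping can be pictured with a small user-space model.  This
is not relayfs code: struct chan_model and its helpers are
hypothetical names, and the model simplifies the rule slightly by
refusing any write that would push the unconsumed byte count past the
total buffer size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a channel's byte accounting; not a relayfs type. */
struct chan_model {
	size_t total_size;	/* bufsize * nbufs */
	size_t produced;	/* bytes written so far */
	size_t consumed;	/* bytes readers have reported as read */
	int no_overwrite;	/* models RELAY_MODE_NO_OVERWRITE */
};

/* Bytes written but not yet reported as consumed. */
static size_t model_unconsumed(const struct chan_model *c)
{
	assert(c->consumed <= c->produced);
	return c->produced - c->consumed;
}

/* Returns count on success, 0 if the write is refused (channel full). */
static size_t model_write(struct chan_model *c, size_t count)
{
	if (c->no_overwrite && model_unconsumed(c) + count > c->total_size)
		return 0;
	c->produced += count;	/* continuous mode may lap the reader */
	return count;
}

/* What a manual 'bytes consumed' update amounts to in this model. */
static void model_consume(struct chan_model *c, size_t count)
{
	assert(count <= model_unconsumed(c));
	c->consumed += count;
}
```

In the model, a no-overwrite channel stops accepting writes until
model_consume() is called, while a continuous channel never refuses.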

Kernel API
----------

Here's a summary of the API relayfs provides to in-kernel clients:

int    relay_open(channel_path, bufsize, nbufs, channel_flags,
		  channel_callbacks, start_reserve, end_reserve,
		  rchan_start_reserve, resize_min, resize_max, mode,
		  init_buf, init_buf_size)
int    relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote_pos)
rchan_reader *add_rchan_reader(channel_id, auto_consume)
int    remove_rchan_reader(rchan_reader *reader)
rchan_reader *add_map_reader(channel_id)
int    remove_map_reader(rchan_reader *reader)
int    relay_read(reader, buf, count, wait, *actual_read_offset)
void   relay_buffers_consumed(reader, buffers_consumed)
void   relay_bytes_consumed(reader, bytes_consumed, read_offset)
int    relay_bytes_avail(reader)
int    rchan_full(reader)
int    rchan_empty(reader)
int    relay_info(channel_id, *channel_info)
int    relay_close(channel_id)
int    relay_realloc_buffer(channel_id, nbufs, async)
int    relay_replace_buffer(channel_id)
int    relay_reset(int rchan_id)

----------
int relay_open(channel_path, bufsize, nbufs,
	 channel_flags, channel_callbacks, start_reserve,
	 end_reserve, rchan_start_reserve, resize_min, resize_max, mode,
	 init_buf, init_buf_size)

relay_open() is used to create a new entry in relayfs.  This new entry
is created according to channel_path.  channel_path contains the
absolute path to the channel file on relayfs.  If, for example, the
caller sets channel_path to "/xlog/9", a "xlog/9" entry will appear
within relayfs automatically and the "xlog" directory will be created
in the filesystem's root.  relayfs does not implement any policy on
its content, except to disallow the opening of two channels using the
same file. There is, nevertheless, a set of guidelines for using
relayfs. Basically, each facility using relayfs should use a top-level
directory identifying it. The entry created above, for example,
presumably belongs to the "xlog" software.

The remaining parameters for relay_open() are as follows:

- channel_flags - an ORed combination of attribute values controlling
  common channel characteristics:

	- logging scheme - relayfs uses 2 mutually exclusive schemes
	  for logging data to a channel.  The 'lockless scheme'
	  reserves and writes data to a channel without the need of
	  any type of locking on the channel.  This is the preferred
	  scheme, but may not be available on a given architecture (it
	  relies on the presence of a cmpxchg instruction).  It's
	  specified by the RELAY_SCHEME_LOCKLESS flag.  The 'locking
	  scheme' either obtains a lock on the channel for writing or
	  disables interrupts, depending on whether the channel was
	  opened for SMP or global usage (see below).  It's specified
	  by the RELAY_SCHEME_LOCKING flag.  While a client may want
	  to explicitly specify a particular scheme to use, it's more
	  convenient to specify RELAY_SCHEME_ANY for this flag, which
	  will allow relayfs to choose the best available scheme i.e.
	  lockless if supported.

       - overwrite mode (default is RELAY_MODE_CONTINUOUS) -
	 If RELAY_MODE_CONTINUOUS is specified, writes to the channel
	 will succeed regardless of whether there are up-to-date
	 consumers or not.  If RELAY_MODE_NO_OVERWRITE is specified,
	 the channel becomes 'full' when the total amount of buffer
	 space unconsumed by readers equals or exceeds the total
	 buffer size.  With the buffer in this state, writes to the
	 buffer will fail - clients need to check the return code from
	 relay_write() to determine if this is the case and act
	 accordingly - 0 or a negative value indicates the write failed.

       - SMP usage - this applies only when the locking scheme is in
	 use.  If RELAY_USAGE_SMP is specified, it's assumed that the
	 channel will be used in a per-CPU fashion and consequently,
	 the only locking that will be done for writes is to disable
	 local irqs.  If RELAY_USAGE_GLOBAL is specified, it's assumed
	 that writes to the buffer can occur within any CPU context,
	 and spin_lock_irqsave() will be used to lock the buffer.

       - delivery mode - if RELAY_DELIVERY_BULK is specified, the
	 client will be notified via its deliver() callback whenever a
	 sub-buffer has been filled.  Alternatively,
	 RELAY_DELIVERY_PACKET will cause delivery to occur after the
	 completion of each write.  See the description of the channel
	 callbacks below for more details.

       - timestamping - if RELAY_TIMESTAMP_TSC is specified and the
	 architecture supports it, efficient TSC 'timestamps' can be
	 associated with each write, otherwise more expensive
	 gettimeofday() timestamping is used.  At the beginning of
	 each sub-buffer, a gettimeofday() timestamp and the current
	 TSC, if supported, are read, and are passed on to the client
	 via the buffer_start() callback.  This allows correlation of
	 the current time with the current TSC for subsequent writes.
	 Each subsequent write is associated with a 'time delta',
	 which is either the current TSC, if the channel is using
	 TSCs, or the difference between the buffer_start gettimeofday
	 timestamp and the gettimeofday time read for the current
	 write.  Note that relayfs never writes either a timestamp or
	 time delta into the buffer unless explicitly asked to (see
	 the description of relay_write() for details).
 
- bufsize - the size of the 'sub-buffers' making up the circular channel
  buffer.  For the lockless scheme, this must be a power of 2.

- nbufs - the number of 'sub-buffers' making up the circular
  channel buffer.  This must be a power of 2.

  The total size of the channel buffer is bufsize * nbufs rounded up 
  to the next kernel page size.  If the lockless scheme is used, both
  bufsize and nbufs must be a power of 2.  If the locking scheme is
  used, the bufsize can be anything and nbufs must be a power of 2.  If
  RELAY_SCHEME_ANY is used, the bufsize and nbufs should be a power of 2.

  NOTE: if nbufs is 1, relayfs will bypass the normal size
  checks and will allocate an rvmalloced buffer of size bufsize.
  This buffer will be freed when relay_close() is called, if the channel
  isn't still being referenced.

- channel_callbacks - a table of callback functions called when events occur
  within the data relay that clients need to know about:
          
	  - int buffer_start(channel_id, current_write_pos, buffer_id,
	    start_time, start_tsc, using_tsc) -

	    called at the beginning of a new sub-buffer, the
	    buffer_start() callback gives the client an opportunity to
	    write data into space reserved at the beginning of a
	    sub-buffer.  The client should only write into the buffer
	    if it specified a value for start_reserve and/or
	    rchan_start_reserve (see below) when the channel was
	    opened.  In the latter case, the client can determine
	    whether to write its one-time rchan_start_reserve data by
	    examining the value of buffer_id, which will be 0 for the
	    first sub-buffer.  The address that the client can write
	    to is contained in current_write_pos (the client by
	    definition knows how much it can write i.e. the value it
	    passed to relay_open() for start_reserve/
	    rchan_start_reserve).  start_time contains the
	    gettimeofday() value for the start of the buffer and
	    start_tsc contains the TSC read at the same time.  The
	    using_tsc param indicates whether or not start_tsc is valid
	    (it wouldn't be if TSC timestamping isn't being used).

	    The client should return the number of bytes it wrote to
	    the channel, 0 if none.

	  - int buffer_end(channel_id, current_write_pos, end_of_buffer,
	    end_time, end_tsc, using_tsc)

	    called at the end of a sub-buffer, the buffer_end()
	    callback gives the client an opportunity to perform
	    end-of-buffer processing.  Note that the current_write_pos
	    is the position where the next write would occur, but
	    since the current write wouldn't fit (which is the trigger
	    for the buffer_end event), the buffer is considered full
	    even though there may be unused space at the end.  The
	    end_of_buffer param pointer value can be used to determine
	    exactly the size of the unused space.  The client should
	    only write into the buffer if it specified a value for
	    end_reserve when the channel was opened.  If the client
	    doesn't write anything i.e. returns 0, the unused space at
	    the end of the sub-buffer is available via relay_info() -
	    this data may be needed by the client later if it needs to
	    process raw sub-buffers (an alternative would be to save
	    the unused bytes count value in end_reserve space at the
	    end of each sub-buffer during buffer_end processing and
	    read it when needed at a later time.  The other
	    alternative would be to use read(2), which makes the
	    unused count invisible to the caller).  end_time contains
	    the gettimeofday() value for the end of the buffer and
	    end_tsc contains the TSC read at the same time.  The using_tsc
	    param indicates whether or not end_tsc is valid (it
	    wouldn't be if TSC timestamping isn't being used).

	    The client should return the number of bytes it wrote to
	    the channel, 0 if none.

	  - void deliver(channel_id, from, len)

	    called when data is ready for the client.  This callback
	    is used to notify a client when a sub-buffer is complete
	    (in the case of bulk delivery) or a single write is
	    complete (packet delivery).  A bulk delivery client might
	    wish to then signal a daemon that a sub-buffer is ready.
	    A packet delivery client might wish to process the packet
	    or send it elsewhere.  The from param is a pointer to the
	    delivered data and len specifies how many bytes are ready.

	  - void user_deliver(channel_id, from, len)

	    called when data has been written to the channel from user
	    space.  This callback is used to notify a client when a
	    successful write from userspace has occurred, independent
	    of whether bulk or packet delivery is in use.  This can be
	    used to allow userspace programs to communicate with the
	    kernel client through the channel via out-of-band write(2)
	    'commands' instead of via ioctls, for instance.  The from
	    param is a pointer to the delivered data and len specifies
	    how many bytes are ready.  Note that this callback occurs
	    after the bytes have been successfully written into the
	    channel, which means that channel readers must be able to
	    deal with the 'command' data which will appear in the
	    channel data stream just as any other userspace or
	    non-userspace write would.

	  - int needs_resize(channel_id, resize_type,
	                     suggested_buf_size, suggested_n_bufs)

	    called when a channel's buffers are in danger of becoming
	    full i.e. the number of unread bytes in the channel passes
	    a preset threshold, or when the current capacity of a
	    channel's buffer is no longer needed.  Also called to
	    notify the client when a channel's buffer has been
	    replaced.  If resize_type is RELAY_RESIZE_EXPAND or
	    RELAY_RESIZE_SHRINK, the kernel client should arrange to
	    call relay_realloc_buffer() with the suggested buffer size
	    and buffer count, which will allocate (but will not
	    replace the old one) a new buffer of the recommended size
	    for the channel.  When the allocation has completed,
	    needs_resize() is again called, this time with a
	    resize_type of RELAY_RESIZE_REPLACE.  The kernel client
	    should then arrange to call relay_replace_buffer() to
	    actually replace the old channel buffer with the newly
	    allocated buffer.  Finally, once the buffer replacement
	    has completed, needs_resize() is again called, this time
	    with a resize_type of RELAY_RESIZE_REPLACED, to inform the
	    client that the replacement is complete and additionally
	    confirming the current sub-buffer size and number of
	    sub-buffers.  Note that a resize can be canceled if
	    relay_realloc_buffer() is called with the async param
	    non-zero and the resize conditions no longer hold.  In
	    this case, the RELAY_RESIZE_REPLACED suggested number of
	    sub-buffers will be the same as the number of sub-buffers
	    that existed before the RELAY_RESIZE_SHRINK or EXPAND i.e.
	    values indicating that the resize didn't actually occur.

	  - int fileop_notify(channel_id, struct file *filp, enum relay_fileop)

	    called when a userspace file operation has occurred or
	    will occur on a relayfs channel file.  These notifications
	    can be used by the kernel client to trigger actions within
	    the kernel client when the corresponding event occurs,
	    such as enabling logging only when a userspace application
	    opens or mmaps a relayfs file and disabling it again when
	    the file is closed or unmapped.  The kernel client can
	    also return its own return value, which can affect the
	    outcome of file operation - returning 0 indicates that the
	    operation should succeed, and returning a negative value
	    indicates that the operation should fail, and that
	    the returned value should be returned to the ultimate
	    caller e.g. returning -EPERM from the open fileop will
	    cause the open to fail with -EPERM.  Among other things,
	    the return value can be used to restrict a relayfs file
	    from being opened or mmap'ed more than once.  The currently
	    implemented fileops are:

	    RELAY_FILE_OPEN - a relayfs file is being opened.  Return
			      0 to allow it to succeed, negative to
			      have it fail.  A negative return value will
			      be passed on unmodified to the open fileop.
	    RELAY_FILE_CLOSE - a relayfs file is being closed.  The return
			      value is ignored.
	    RELAY_FILE_MAP - a relayfs file is being mmap'ed.  Return 0
			     to allow it to succeed, negative to have
			     it fail.  A negative return value will be
			     passed on unmodified to the mmap fileop.
	    RELAY_FILE_UNMAP - a relayfs file is being unmapped.  The return
			      value is ignored.

	  - int ioctl(rchan_id, cmd, arg)

	    called when an ioctl call is made using a relayfs file
	    descriptor.  The cmd and arg are passed along to this
	    callback unmodified for it to do as it wishes with.  The
	    return value from this callback is used as the return value
	    of the ioctl call.

  If the callbacks param passed to relay_open() is NULL, a set of
  default do-nothing callbacks will be defined for the channel.
  Likewise, any NULL rchan_callback function contained in a non-NULL
  callbacks struct will be filled in with a default callback function
  that does nothing.

- start_reserve - the number of bytes to be reserved at the start of
  each sub-buffer.  The client can do what it wants with this number
  of bytes when the buffer_start() callback is invoked.  Typically
  clients would use this to write per-sub-buffer header data.

- end_reserve - the number of bytes to be reserved at the end of each
  sub-buffer.  The client can do what it wants with this number of
  bytes when the buffer_end() callback is invoked.  Typically clients
  would use this to write per-sub-buffer footer data.

- rchan_start_reserve - the number of bytes to be reserved, in
  addition to start_reserve, at the beginning of the first sub-buffer
  in the channel.  The client can do what it wants with this number of
  bytes when the buffer_start() callback is invoked.  Typically
  clients would use this to write per-channel header data.

- resize_min - if set, this signifies that the channel is
  auto-resizeable.  The value specifies the size that the channel will
  try to maintain as a normal working size, and that it won't go
  below.  The client makes use of the resizing callbacks and
  relay_realloc_buffer() and relay_replace_buffer() to actually effect
  the resize.

- resize_max - if set, this signifies that the channel is
  auto-resizeable.  The value specifies the maximum size the channel
  can have as a result of resizing.

- mode - if non-zero, specifies the file permissions that will be given
  to the channel file.  If 0, the default rw user perms will be used.

- init_buf - if non-NULL, rather than allocating the channel buffer,
  this buffer will be used as the initial channel buffer.  The kernel
  API function relay_discard_init_buf() can later be used to have
  relayfs allocate a normal mmappable channel buffer and switch over
  to using it after copying the init_buf contents into it.  Currently,
  the size of init_buf must be exactly bufsize * nbufs.  The caller
  is responsible for managing the init_buf memory.  This feature is
  typically used for init-time channel use and should normally be
  specified as NULL.

- init_buf_size - the total size of init_buf, if init_buf is specified
  as non-NULL.  Currently, the size of init_buf must be exactly
  bufsize * nbufs.
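
The default-callback rule and the use of fileop_notify() to allow
only one open at a time can be sketched in user space as follows.
This is only a model: struct cb_model is a hypothetical, trimmed-down
stand-in for the real callback table, the MODEL_FILE_* values are
made up for the example, and -EPERM is used simply because it is the
error code mentioned in the callback description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down callback table; not the relayfs struct. */
struct cb_model {
	int  (*fileop_notify)(int channel_id, int fileop);
	void (*deliver)(int channel_id, const char *from, size_t len);
};

enum { MODEL_FILE_OPEN, MODEL_FILE_CLOSE };

/* Do-nothing defaults: file operations always succeed. */
static int default_fileop_notify(int channel_id, int fileop)
{
	(void)channel_id;
	(void)fileop;
	return 0;
}

static void default_deliver(int channel_id, const char *from, size_t len)
{
	(void)channel_id;
	(void)from;
	(void)len;
}

/* Fill every NULL slot (or a NULL table) with the default callback. */
static struct cb_model resolve_callbacks(const struct cb_model *user)
{
	struct cb_model cb = { default_fileop_notify, default_deliver };

	if (user && user->fileop_notify)
		cb.fileop_notify = user->fileop_notify;
	if (user && user->deliver)
		cb.deliver = user->deliver;
	assert(cb.fileop_notify && cb.deliver);
	return cb;
}

/* Example client callback: refuse a second concurrent open. */
static int open_count;

static int single_open_notify(int channel_id, int fileop)
{
	(void)channel_id;
	if (fileop == MODEL_FILE_OPEN) {
		if (open_count > 0)
			return -EPERM;	/* the open would fail with -EPERM */
		open_count++;
	} else if (fileop == MODEL_FILE_CLOSE && open_count > 0) {
		open_count--;
	}
	return 0;
}
```

A real client would keep the open count per channel rather than in a
single static variable, and would protect it against concurrent
access.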

Upon successful completion, relay_open() returns a channel id
to be used for all other operations with the relay. All buffers
managed by the relay are allocated using rvmalloc/rvfree to allow
for easy mmapping to user-space.
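
The sizing rules for bufsize and nbufs can be checked with a few
lines of arithmetic.  The sketch below is an illustration, not
relayfs code: the function names are hypothetical, the page size is
passed in as a parameter, and "rounded up to the next kernel page
size" is read here as rounding up to a multiple of the page size.
Overflow of bufsize * nbufs is not checked.

```c
#include <assert.h>
#include <stddef.h>

static int is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Total allocation: bufsize * nbufs rounded up to a page multiple. */
static size_t channel_alloc_size(size_t bufsize, size_t nbufs,
				 size_t page_size)
{
	size_t total = bufsize * nbufs;

	assert(is_power_of_2(page_size));
	return (total + page_size - 1) & ~(page_size - 1);
}

/*
 * nbufs must always be a power of 2; bufsize must be one only when
 * the lockless scheme is in use.
 */
static int sizes_valid(size_t bufsize, size_t nbufs, int lockless)
{
	if (bufsize == 0 || !is_power_of_2(nbufs))
		return 0;
	if (lockless && !is_power_of_2(bufsize))
		return 0;
	return 1;
}
```

For example, four sub-buffers of 3000 bytes are acceptable for the
locking scheme only, and with 4096-byte pages they occupy 12288
bytes.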

----------
int relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote_pos)

relay_write() reserves space in the channel and writes count bytes of
data pointed to by data_ptr to it.  Automatically performs any
necessary locking, depending on the scheme and SMP usage in effect (no
locking is done for the lockless scheme regardless of usage).  It
returns the number of bytes written, or 0/negative on failure.  If
time_delta_offset is >= 0, the internal time delta calculated when
the slot was reserved will be written at that offset.  This is the
TSC or gettimeofday() delta between the current write and the
beginning of the buffer, whichever method is being used by the
channel.  Trying to write a count larger than the bufsize
specified to relay_open() (taking into account the reserved
start-of-buffer and end-of-buffer space as well) will fail.  If
wrote_pos is non-NULL, it will receive the location the data was
written to, which may be needed for some applications but is not
normally interesting.  Most applications should pass in NULL for this
param.
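
The interaction between write size, reserved space and sub-buffer
switching can be modelled in user space.  The sketch below is a toy
model under stated assumptions, not the relayfs implementation: all
names (struct wmodel, MODEL_BUFSIZE and so on) are hypothetical, and
locking, callbacks, timestamps and no-overwrite mode are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUFSIZE 16	/* sub-buffer size */
#define MODEL_NBUFS   2		/* sub-buffer count, a power of 2 */

struct wmodel {
	char data[MODEL_BUFSIZE * MODEL_NBUFS];
	size_t start_reserve;	/* bytes reserved at sub-buffer start */
	size_t end_reserve;	/* bytes reserved at sub-buffer end */
	size_t cur_buf;		/* sub-buffer currently written to */
	size_t offset;		/* write offset within cur_buf */
	size_t switches;	/* number of buffer switches so far */
};

static void wmodel_init(struct wmodel *w, size_t start_reserve,
			size_t end_reserve)
{
	assert(start_reserve + end_reserve < MODEL_BUFSIZE);
	memset(w, 0, sizeof(*w));
	w->start_reserve = start_reserve;
	w->end_reserve = end_reserve;
	w->offset = start_reserve;
}

/* Returns bytes written, or 0 if count can never fit a sub-buffer. */
static size_t wmodel_write(struct wmodel *w, const void *src, size_t count)
{
	size_t usable_end = MODEL_BUFSIZE - w->end_reserve;

	if (count == 0 || count > usable_end - w->start_reserve)
		return 0;
	if (w->offset + count > usable_end) {
		/* Buffer switch: the write won't fit, use the next one. */
		w->cur_buf = (w->cur_buf + 1) & (MODEL_NBUFS - 1);
		w->offset = w->start_reserve;
		w->switches++;
	}
	memcpy(w->data + w->cur_buf * MODEL_BUFSIZE + w->offset, src, count);
	w->offset += count;
	return count;
}
```

With 2 bytes reserved at each end of a 16-byte sub-buffer, at most 12
bytes fit in one write, and a write that does not fit in the space
remaining triggers a switch and leaves unused bytes behind.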

----------
struct rchan_reader *add_rchan_reader(int rchan_id, int auto_consume)

add_rchan_reader creates and initializes a reader object for a
channel.  An opaque rchan_reader object is returned on success, and is
passed to relay_read() when reading the channel.  If the boolean
auto_consume parameter is 1, the reader is defined to be
auto-consuming.  auto-consuming reader objects are automatically
created and used for VFS read(2) readers.

----------
int remove_rchan_reader(struct rchan_reader *reader)

remove_rchan_reader finds and removes the given reader from the
channel.  This function is needed only for non-VFS readers.  VFS
read(2) readers are automatically removed when the corresponding file
object is closed.

----------
struct rchan_reader *add_map_reader(int rchan_id)

Creates and initializes an rchan_reader object for channel map
readers.  The reader object is needed when calling
relay_bytes_consumed() or relay_buffers_consumed() once the kernel
client learns from its mmap user clients that data has been read.

----------
int remove_map_reader(reader)

Finds and removes the given map reader from the channel.  This function
is useful only for map readers.

----------
int relay_read(reader, buf, count, wait, *actual_read_offset)

Reads count bytes from the channel, or as much as is available within
the sub-buffer currently being read.  The read offset that will be
read from is the position contained within the reader object.  If the
wait flag is set, buf is non-NULL, and there is nothing available, it
will wait until there is.  If the wait flag is 0 and there is nothing
available, -EAGAIN is returned.  If buf is NULL, the value returned is
the number of bytes that would have been read.  actual_read_offset is
the value that should be passed as the read offset to
relay_bytes_consumed, needed only if the reader is not auto-consuming
and the channel is MODE_NO_OVERWRITE, but in any case, it must not be
NULL.

----------
int relay_bytes_avail(reader)

Returns the number of bytes available relative to the reader's current
read position within the corresponding sub-buffer, 0 if there is
nothing available.  Note that this doesn't return the total bytes
available in the channel buffer; it is enough, however, to know if
anything is available, or how many bytes might be returned from the
next read.

----------
void relay_buffers_consumed(reader, buffers_consumed)

Adds to the channel's consumed buffer count.  buffers_consumed should
be the number of buffers newly consumed, not the total number
consumed.  NOTE: kernel clients don't need to call this function if
the reader is auto-consuming or the channel is MODE_CONTINUOUS.

In order for the relay to detect the 'buffers full' condition for a
channel, it must be kept up-to-date with respect to the number of
buffers consumed by the client.  If the addition of the value of the
buffers_consumed param to the current bufs_consumed count for the channel
would exceed the bufs_produced count for the channel, the channel's
bufs_consumed count will be set to the bufs_produced count for the
channel.  This allows clients to 'catch up' if necessary.
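
The 'catch up' rule amounts to a clamped addition.  A minimal sketch,
using a hypothetical struct bufcount_model rather than the real
channel structure:

```c
#include <assert.h>

/* Hypothetical stand-in for the channel's buffer counters. */
struct bufcount_model {
	unsigned int produced;	/* models bufs_produced */
	unsigned int consumed;	/* models bufs_consumed */
};

/* Add newly consumed buffers, never letting consumed pass produced. */
static void model_buffers_consumed(struct bufcount_model *m,
				   unsigned int newly_consumed)
{
	assert(m->consumed <= m->produced);
	if (newly_consumed > m->produced - m->consumed)
		m->consumed = m->produced;	/* clamp: caught up */
	else
		m->consumed += newly_consumed;
}
```

A client that has lost track can therefore pass a large value to mark
everything produced so far as consumed.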

----------
void relay_bytes_consumed(reader, bytes_consumed, read_offset)

Adds to the channel's consumed count.  bytes_consumed should be the
number of bytes actually read e.g. return value of relay_read() and
the read_offset should be the actual offset the bytes were read from
e.g. the actual_read_offset set by relay_read().  NOTE: kernel clients
don't need to call this function if the reader is auto-consuming or
the channel is MODE_CONTINUOUS.

In order for the relay to detect the 'buffers full' condition for a
channel, it must be kept up-to-date with respect to the number of
bytes consumed by the client.  For packet clients, it makes more sense
to update after each read rather than after each complete sub-buffer
read.  The bytes_consumed count updates bufs_consumed when a buffer
has been consumed so this count remains consistent.

----------
int relay_info(channel_id, *channel_info)

relay_info() fills in an rchan_info struct with channel status and
attribute information such as usage modes, sub-buffer size and count,
the allocated size of the entire buffer, buffers produced and
consumed, current buffer id, count of writes lost due to buffers full
condition.

The virtual address of the channel buffer is also available here, for
those clients that need it.

Clients may need to know how many 'unused' bytes there are at the end
of a given sub-buffer.  This would only be the case if the client 1)
didn't either write this count to the end of the sub-buffer or
otherwise note it (it's available as the difference between the buffer
end and current write pos params in the buffer_end callback) (if the
client returned 0 from the buffer_end callback, it's assumed that this
is indeed the case) 2) isn't using the read() system call to read the
buffer.  In other words, if the client isn't annotating the stream and
is reading the buffer by mmaping it, this information would be needed
in order for the client to 'skip over' the unused bytes at the ends of
sub-buffers.

Additionally, for the lockless scheme, clients may need to know
whether a particular sub-buffer is actually complete.  An array of
boolean values, one per sub-buffer, contains non-zero if the buffer is
complete, zero otherwise.
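
A client reading an mmapped buffer might use the unused-byte counts
and completion flags roughly as sketched below.  This is a user-space
illustration only: copy_valid_bytes() is a hypothetical helper, and
the two arrays are passed in as plain parameters because the exact
rchan_info field names are not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the valid bytes of each complete sub-buffer into out, skipping
 * the unused bytes at the end of each one.  unused[i] is the count of
 * unused bytes at the end of sub-buffer i; complete[i] is non-zero if
 * sub-buffer i is complete.  Returns the number of bytes copied.
 */
static size_t copy_valid_bytes(const char *buf, size_t bufsize,
			       size_t nbufs, const size_t *unused,
			       const int *complete, char *out, size_t cap)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nbufs; i++) {
		size_t valid;

		if (!complete[i])
			continue;	/* still being written, skip */
		assert(unused[i] <= bufsize);
		valid = bufsize - unused[i];
		if (valid > cap - total)
			break;		/* out is full */
		memcpy(out + total, buf + i * bufsize, valid);
		total += valid;
	}
	return total;
}
```

The sketch ignores any start_reserve or end_reserve areas; a client
that uses them would also need to skip those bytes.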

----------
int relay_close(channel_id)

relay_close() is used to close the channel.  It finalizes the last
sub-buffer (the one currently being written to) and marks the channel
as finalized.  The channel buffer and channel data structure are then
freed automatically when the last reference to the channel is given
up.

----------
int relay_realloc_buffer(channel_id, nbufs, async)

Allocates a new channel buffer using the specified sub-buffer count
(note that resizing can't change sub-buffer sizes).  If async is
non-zero, the allocation is done in the background using a work queue.
When the allocation has completed, the needs_resize() callback is
called with a resize_type of RELAY_RESIZE_REPLACE.  This function
doesn't replace the old buffer with the new - see
relay_replace_buffer().

This function is called by kernel clients in response to a
needs_resize() callback call with a resize type of RELAY_RESIZE_EXPAND
or RELAY_RESIZE_SHRINK.  That callback also includes a suggested
new_bufsize and new_nbufs which should be used when calling this
function.

Returns 0 on success, or errcode if the channel is busy or if
the allocation couldn't happen for some reason.

NOTE: if async is not set, this function should not be called with a
lock held, as it may sleep.

----------
int relay_replace_buffer(channel_id)

Replaces the current channel buffer with the new buffer allocated by
relay_realloc_buffer and contained in the channel struct.  When the
replacement is complete, the needs_resize() callback is called with
RELAY_RESIZE_REPLACED.  This function is called by kernel clients in
response to a needs_resize() callback having a resize type of
RELAY_RESIZE_REPLACE.

Returns 0 on success, or errcode if the channel is busy or if the
replacement or previous allocation didn't happen for some reason.

NOTE: This function will not sleep, so it can be called in any context
and with locks held.  The client should, however, ensure that the channel
isn't actively being read from or written to.
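
The resize handshake described for relay_realloc_buffer() and
relay_replace_buffer() might be driven from a client's needs_resize()
handler roughly as follows.  This is a user-space sketch only: the
handler's parameter list, the integer argument types and the numeric
values of the RELAY_RESIZE_* constants are assumptions, and the two
relay functions are replaced by stubs that merely record the calls.

```c
#include <assert.h>

/* Resize types named in the text; the numeric values here are assumed. */
enum {
	RELAY_RESIZE_EXPAND = 1,
	RELAY_RESIZE_SHRINK,
	RELAY_RESIZE_REPLACE,
	RELAY_RESIZE_REPLACED
};

/* User-space stand-ins for the two calls; they only record what happened. */
static int realloc_calls, replace_calls, last_nbufs, last_async;

static int relay_realloc_buffer(int channel_id, int nbufs, int async)
{
	(void)channel_id;
	realloc_calls++;
	last_nbufs = nbufs;
	last_async = async;
	return 0;
}

static int relay_replace_buffer(int channel_id)
{
	(void)channel_id;
	replace_calls++;
	return 0;
}

/* Hypothetical handler; the real callback's parameter list may differ. */
static int sketch_needs_resize(int channel_id, int resize_type, int new_nbufs)
{
	switch (resize_type) {
	case RELAY_RESIZE_EXPAND:
	case RELAY_RESIZE_SHRINK:
		/* async = 1 so the allocation happens on a work queue */
		return relay_realloc_buffer(channel_id, new_nbufs, 1);
	case RELAY_RESIZE_REPLACE:
		/* the client must ensure the channel is idle at this point */
		return relay_replace_buffer(channel_id);
	case RELAY_RESIZE_REPLACED:
	default:
		return 0;	/* nothing further to do */
	}
}
```

Requesting the allocation asynchronously keeps the handler usable from
contexts that cannot sleep; a client that knows it can sleep could pass
0 for async instead.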

----------
int relay_reset(rchan_id)

relay_reset() has the effect of erasing all data from the buffer and
restarting the channel in its initial state.  The buffer itself is not
freed, so any mappings are still in effect.  NOTE: Care should be
taken that the channel isn't actually being used by anything when
this call is made.

----------
int rchan_full(reader)

Returns 1 if the channel is full with respect to the reader, 0 if not.

----------
int rchan_empty(reader)

Returns 1 if the channel is empty with respect to the reader, 0 if not.

----------
int relay_discard_init_buf(rchan_id)

Allocates an mmappable channel buffer, copies the contents of init_buf
into it, and sets the current channel buffer to the newly allocated
buffer.  This function is used only in conjunction with the init_buf
and init_buf_size params to relay_open(), and is typically used when
the ability to write into the channel at init-time is needed.  The
basic usage is to specify an init_buf and init_buf_size to relay_open,
then call this function when it's safe to switch over to a normally
allocated channel buffer.  'Safe' means that the caller is in a
context that can sleep and that nothing is actively writing to the
channel.  Returns 0 if successful, negative otherwise.


Writing directly into the channel
=================================

Using the relay_write() API function as described above is the
preferred means of writing into a channel.  In some cases, however,
in-kernel clients might want to write directly into a relay channel
rather than have relay_write() copy it into the buffer on the client's
behalf.  Clients wishing to do this should follow the model used to
implement relay_write itself.  The general sequence is:

- get a pointer to the channel via rchan_get().  This increments the
  channel's reference count.
- call relay_lock_channel().  This will perform the proper locking for
  the channel given the scheme in use and the SMP usage.
- reserve a slot in the channel via relay_reserve()
- write directly to the reserved address
- call relay_commit() to commit the write
- call relay_unlock_channel()
- call rchan_put() to release the channel reference

In particular, clients should make sure they call rchan_get() and
rchan_put() and not hold on to references to the channel pointer.
Also, forgetting to use relay_lock_channel()/relay_unlock_channel()
has no effect if the lockless scheme is being used, but could result
in corrupted buffer contents if the locking scheme is used.
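
The ordering of the steps above can be sketched with a user-space mock.
Everything prefixed mock_ is hypothetical and merely stands in for
rchan_get(), relay_lock_channel(), relay_reserve(), relay_commit(),
relay_unlock_channel() and rchan_put(), whose real signatures are not
shown in this document.  The point is the get/lock/reserve/write/
commit/unlock/put sequence and that the lock and reference are released
on the failure path too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a channel; not the kernel's structure. */
struct mock_chan {
	char buf[64];
	size_t write_pos;	/* next free byte */
	size_t committed;	/* bytes made visible to readers */
	int refcount;
	int locked;
};

static struct mock_chan *mock_get(struct mock_chan *c) { c->refcount++; return c; }
static void mock_put(struct mock_chan *c) { c->refcount--; }
static void mock_lock(struct mock_chan *c) { assert(!c->locked); c->locked = 1; }
static void mock_unlock(struct mock_chan *c) { assert(c->locked); c->locked = 0; }

/* Reserve 'len' bytes; returns NULL if the slot would not fit. */
static char *mock_reserve(struct mock_chan *c, size_t len)
{
	char *slot;

	if (len > sizeof(c->buf) - c->write_pos)
		return NULL;
	slot = c->buf + c->write_pos;
	c->write_pos += len;
	return slot;
}

static void mock_commit(struct mock_chan *c, size_t len) { c->committed += len; }

/* Follows the listed sequence.  Returns 0 on success, -1 if no room. */
static int direct_write(struct mock_chan *chan, const void *data, size_t len)
{
	struct mock_chan *c = mock_get(chan);	/* take a reference */
	char *slot;
	int ret = -1;

	mock_lock(c);
	slot = mock_reserve(c, len);
	if (slot) {
		memcpy(slot, data, len);	/* write directly into the slot */
		mock_commit(c, len);
		ret = 0;
	}
	mock_unlock(c);
	mock_put(c);				/* drop the reference */
	return ret;
}
```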


Limitations
===========

Writes made via the write() system call are currently limited to 2
pages worth of data.  There is no such limit on the in-kernel API
function relay_write().
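
A user-space program that needs to push more than this through write()
could split the data into smaller pieces.  The helper below is a minimal
sketch using only POSIX write(); write_chunked() is a hypothetical name,
and whether splitting one logical record across several writes is
acceptable depends on the format the channel's client expects.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write 'len' bytes to 'fd' in pieces of at most 'max_chunk' bytes.
 * Returns 0 on success, -1 on error (errno is left as set by write()).
 */
static int write_chunked(int fd, const char *buf, size_t len, size_t max_chunk)
{
	assert(max_chunk > 0);
	while (len > 0) {
		size_t want = len < max_chunk ? len : max_chunk;
		ssize_t n = write(fd, buf, want);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		if (n == 0)
			return -1;		/* no progress; avoid spinning */
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}
```

For the limit described above, a caller might pass
2 * (size_t)sysconf(_SC_PAGESIZE) as max_chunk.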

User applications can currently only mmap the complete buffer (it
doesn't really make sense to mmap only part of it, given its purpose).


Latest version
==============

The latest version can be found at:

http://www.opersys.com/relayfs

Example relayfs clients, such as dynamic printk and the Linux Trace
Toolkit, can also be found there.


Credits
=======

The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:

Michel Dagenais		<michel.dagenais@polymtl.ca>
Richard Moore		<richardj_moore@uk.ibm.com>
Bob Wisniewski		<bob@watson.ibm.com>
Karim Yaghmour		<karim@opersys.com>
Tom Zanussi		<zanussi@us.ibm.com>

Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports, and for contributing the klog code.