Documentation/filesystems/relayfs.txt

   1
   2 relayfs - a high-speed data relay filesystem
   3 ============================================
   4
   5 relayfs is a filesystem designed to provide an efficient mechanism for
   6 tools and facilities to relay large amounts of data from kernel space
   7 to user space.
   8
   9 The main idea behind relayfs is that every data flow is put into a
  10 separate "channel" and each channel is a file.  In practice, each
  11 channel is a separate memory buffer allocated from within kernel space
  12 upon channel instantiation. Software needing to relay data to user
  13 space would open a channel or a number of channels, depending on its
  14 needs, and would log data to that channel. All the buffering and
  15 locking mechanics are taken care of by relayfs.  The actual format and
  16 protocol used for each channel is up to relayfs' clients.
  17
  18 relayfs makes no provisions for copying the same data to more than a
  19 single channel. This is for the clients of the relay to take care of,
  20 and so is any form of data filtering. The purpose is to keep relayfs
  21 as simple as possible.
  22
  23
  24 Usage
  25 =====
  26
  27 In addition to the relayfs kernel API described below, relayfs
  28 implements basic file operations.  Here are the file operations that
  29 are available and some comments regarding their behavior:
  30
  31 open()   enables a user to open an _existing_ channel.  A channel can be
  32          opened in blocking or non-blocking mode, and can be opened
  33          for reading as well as for writing.  Readers will by default
  34          be auto-consuming.
  35
  36 mmap()   results in channel's memory buffer being mmapped into the
  37          caller's memory space.
  38
  39 read()   since we are dealing with circular buffers, the user is only
  40          allowed to read forward.  Some apps may want to loop around
  41          read() waiting for incoming data - if there is no data
  42          available, read will put the reader on a wait queue until
  43          data is available (blocking mode).  Non-blocking reads return
  44          -EAGAIN if data is not available.
  45
  46
  47 write()  writing from user space operates exactly as relay_write() does
  48          (described below).
  49
  50 poll()   POLLIN/POLLRDNORM/POLLOUT/POLLWRNORM/POLLERR supported.
  51
  52 close()  decrements the channel's refcount.  When the refcount reaches
  53          0 i.e. when no process or kernel client has the file open
  54          (see relay_close() below), the channel buffer is freed.
  55
  56
  57 In order for a user application to make use of relayfs files, the
  58 relayfs filesystem must be mounted.  For example,
  59
  60         mount -t relayfs relayfs /mountpoint
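
As a rough illustration of the read() and poll() behavior described
above, the following userspace sketch opens a channel file in
non-blocking mode, waits for data with poll(), and drains whatever is
available until read() reports EAGAIN.  The helper names
(open_channel, wait_readable, drain_fd) are hypothetical and are not
part of relayfs; the code only relies on standard POSIX calls and
works on any non-blocking file descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a channel file for non-blocking reads; returns an fd or -1. */
int open_channel(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/*
 * Wait up to timeout_ms for the descriptor to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);

    if (r < 0)
        return -1;
    return (r > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/*
 * Read up to cap bytes into out.  Stops when the buffer is full, at
 * end of file, or when a non-blocking read reports that no data is
 * available (EAGAIN).  Returns the number of bytes read, or -1 on any
 * other error.
 */
ssize_t drain_fd(int fd, char *out, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, out + total, cap - total);

        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                      /* end of file */
        if (errno == EINTR)
            continue;                   /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* nothing available right now */
        return -1;
    }
    return (ssize_t)total;
}
```

A reader might call open_channel("/mountpoint/xlog/9"), assuming
relayfs is mounted on /mountpoint and a kernel client has created the
"/xlog/9" channel used as an example later in this document, and then
alternate between wait_readable() and drain_fd().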
  61
  62
  63 The relayfs kernel API
  64 ======================
  65
  66 relayfs channels are implemented as circular buffers subdivided into
  67 'sub-buffers'.  Kernel clients write data into the channel using
  68 relay_write(), and are notified via a set of callbacks when
  69 significant events occur within the channel.  'Significant events'
  70 include:
  71
  72 - a sub-buffer has been filled i.e. the current write won't fit into the
  73   current sub-buffer, and a 'buffer-switch' is triggered, after which
  74   the data is written into the next buffer (if the next buffer is
  75   empty).  The client is notified of this condition via two callbacks,
  76   one providing an opportunity to perform start-of-buffer tasks, the
  77   other end-of-buffer tasks.
  78
  79 - data is ready for the client to process.  The client can choose to
  80   be notified either on a per-sub-buffer basis (bulk delivery) or
  81   per-write basis (packet delivery).
  82
  83 - data has been written to the channel from user space.  The client can
  84   use this notification to accept and process 'commands' sent to the
  85   channel via write(2).
  86
  87 - the channel has been opened/closed/mapped/unmapped from user space.
  88   The client can use this notification to trigger actions within the
  89   kernel application, such as enabling/disabling logging to the
  90   channel.  It can also return result codes from the callback,
  91   indicating that the operation should fail e.g. in order to restrict
  92   more than one user space open or mmap.
  93
  94 - the channel needs resizing, or needs to update its
  95   state based on the results of the resize.  Resizing the channel is
  96   up to the kernel client to actually perform.  If the channel is
  97   configured for resizing, the client is notified when the unread data
  98   in the channel passes a preset threshold, giving it the opportunity
  99   to allocate a new channel buffer and replace the old one.
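
The resize notification in the last item above is a multi-step
exchange, spelled out in the needs_resize() callback description later
in this document.  The sketch below models only the client's side of
that exchange as a small state machine.  The toy_ names, the action
enum and the client structure are hypothetical stand-ins; the real
callback receives a channel id and a suggested buffer size as well,
and the real client would call relay_realloc_buffer() and
relay_replace_buffer() where the sketch merely returns an action.

```c
#include <assert.h>

/* Hypothetical mirror of the resize_type values named in this document. */
enum toy_resize {
    TOY_RESIZE_EXPAND,
    TOY_RESIZE_SHRINK,
    TOY_RESIZE_REPLACE,
    TOY_RESIZE_REPLACED
};

/* What the client should arrange to do next. */
enum toy_action { TOY_DO_REALLOC, TOY_DO_REPLACE, TOY_DO_NOTHING };

struct toy_client {
    unsigned nbufs;          /* confirmed number of sub-buffers */
    unsigned pending_nbufs;  /* size requested by a resize in progress */
};

enum toy_action toy_needs_resize(struct toy_client *cl, enum toy_resize type,
                                 unsigned suggested_nbufs)
{
    assert(cl);

    switch (type) {
    case TOY_RESIZE_EXPAND:
    case TOY_RESIZE_SHRINK:
        /* Step 1: allocate a new buffer; the old one stays in use. */
        cl->pending_nbufs = suggested_nbufs;
        return TOY_DO_REALLOC;
    case TOY_RESIZE_REPLACE:
        /* Step 2: allocation finished, swap the buffers. */
        return TOY_DO_REPLACE;
    case TOY_RESIZE_REPLACED:
        /*
         * Step 3: record the confirmed geometry.  After a canceled
         * resize this is simply the old value again.
         */
        cl->nbufs = suggested_nbufs;
        cl->pending_nbufs = 0;
        return TOY_DO_NOTHING;
    }
    return TOY_DO_NOTHING;
}
```

Keeping the confirmed and the pending size apart means a canceled
resize needs no special case: the final notification just reports the
size that was already in effect.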
 100
 101 Reader objects
 102 --------------
 103
 104 Channel readers use an opaque rchan_reader object to read from
 105 channels.  For VFS readers (those using read(2) to read from a
 106 channel), these objects are automatically created and used internally;
 107 only kernel clients that need to directly read from channels, or whose
 108 userspace applications use mmap to access channel data, need to know
 109 anything about rchan_readers - others may skip this section.
 110
 111 A relay channel can have any number of readers, each represented by an
 112 rchan_reader instance, which is used to encapsulate reader settings
 113 and state.  rchan_reader objects should be treated as opaque by kernel
 114 clients.  To create a reader object for directly accessing a channel
 115 from kernel space, call the add_rchan_reader() kernel API function:
 116
 117 rchan_reader *add_rchan_reader(rchan_id, auto_consume)
 118
 119 This function returns an rchan_reader instance if successful, which
 120 should then be passed to relay_read() when the kernel client is
 121 interested in reading from the channel.
 122
 123 The auto_consume parameter indicates whether a read done by this
 124 reader will automatically 'consume' that portion of the unread channel
 125 buffer when relay_read() is called (see below for more details).
 126
 127 To close the reader, call
 128
 129 remove_rchan_reader(reader)
 130
 131 which will remove the reader from the list of current readers.
 132
 133
 134 To create a reader object representing a userspace mmap reader in the
 135 kernel application, call the add_map_reader() kernel API function:
 136
 137 rchan_reader *add_map_reader(rchan_id)
 138
 139 This function returns an rchan_reader instance if successful, whose
 140 main purpose is as an argument to be passed into
 141 relay_buffers_consumed() when the kernel client becomes aware that
 142 data has been read by a user application using mmap to read from the
 143 channel buffer.  There is no auto_consume option in this case, since
 144 only the kernel client/user application knows when data has been read.
 145
 146 To close the map reader, call
 147
 148 remove_map_reader(reader)
 149
 150 which will remove the reader from the list of current readers.
 151
 152 Consumed count
 153 --------------
 154
 155 A relayfs channel is a circular buffer, which means that if there is
 156 no reader reading from it or a reader reading too slowly, at some
 157 point the channel writer will 'lap' the reader and data will be lost.
 158 In normal use, readers will always be able to keep up with writers and
 159 the buffer is thus never in danger of becoming full.  In many
 160 applications, it's sufficient to ensure that this is practically
 161 speaking always the case, by making the buffers large enough.  These
 162 types of applications can basically open the channel as
 163 RELAY_MODE_CONTINUOUS (the default anyway) and not worry about the
 164 meaning of 'consume' and skip the rest of this section.
 165
 166 If it's important for the application that a kernel client never allow
 167 writers to overwrite unread data, the channel should be opened using
 168 RELAY_MODE_NO_OVERWRITE and must be kept apprised of the count of
 169 bytes actually read by the (typically) user-space channel readers.
 170 This count is referred to as the 'consumed count'.  read(2) channel
 171 readers automatically update the channel's 'consumed count' as they
 172 read.  If the usage mode is to have only read(2) readers, which is
 173 typically the case, the kernel client doesn't need to worry about any
 174 of the relayfs functions having to do with 'bytes consumed' and can
 175 skip the rest of this section.  (Note that it is possible to have
 176 multiple read(2) or auto-consuming readers, but like having multiple
 177 readers on a pipe, these readers will race with each other i.e. it's
 178 supported, but doesn't make much sense).
 179
 180 If the kernel client cannot rely on an auto-consuming reader to keep
 181 the 'consumed count' up-to-date, then it must do so manually, by
 182 making the appropriate calls to relay_buffers_consumed() or
 183 relay_bytes_consumed().  In most cases, this should only be necessary
 184 for bulk mmap clients - almost all packet clients should be covered by
 185 having auto-consuming read(2) readers.  For mmapped bulk clients, for
 186 instance, there are no auto-consuming VFS readers, so the kernel
 187 client needs to make the call to relay_buffers_consumed() after
 188 sub-buffers are read.
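
To make the bookkeeping concrete, here is a toy model of the produced
and consumed counts, counted in whole sub-buffers rather than bytes.
It is only a sketch: the structure and function names are
hypothetical, and clamping the consumed count so that it never passes
the produced count is an assumption of this model rather than
documented relayfs behavior.

```c
#include <assert.h>

struct toy_counts {
    unsigned nbufs;         /* sub-buffers in the channel */
    int      no_overwrite;  /* 1 models RELAY_MODE_NO_OVERWRITE */
    unsigned produced;      /* sub-buffers filled by the writer */
    unsigned consumed;      /* sub-buffers reported as read */
    unsigned lost;          /* unread sub-buffers overwritten */
};

/*
 * Writer side: try to fill one more sub-buffer.
 * Returns 1 on success, 0 if the channel is full (no-overwrite mode).
 */
int toy_produce(struct toy_counts *c)
{
    assert(c->nbufs > 0);

    if (c->produced - c->consumed >= c->nbufs) {
        if (c->no_overwrite)
            return 0;       /* full: the write fails */
        c->consumed++;      /* the writer laps the reader ... */
        c->lost++;          /* ... and the oldest unread data is gone */
    }
    c->produced++;
    return 1;
}

/*
 * Reader side, in the spirit of relay_buffers_consumed(): add the
 * number of newly consumed sub-buffers, not a running total.
 */
void toy_buffers_consumed(struct toy_counts *c, unsigned newly_consumed)
{
    unsigned unread = c->produced - c->consumed;

    if (newly_consumed > unread)
        newly_consumed = unread;    /* assumption: clamp to produced */
    c->consumed += newly_consumed;
}
```

With nbufs set to 2 and no_overwrite set, a third toy_produce() call
fails until toy_buffers_consumed() reports at least one sub-buffer as
read; with no_overwrite clear, the same call succeeds and increments
lost instead.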
 189
 190 Kernel API
 191 ----------
 192
 193 Here's a summary of the API relayfs provides to in-kernel clients:
 194
 195 int    relay_open(channel_path, bufsize, nbufs, channel_flags,
 196                   channel_callbacks, start_reserve, end_reserve,
 197                   rchan_start_reserve, resize_min, resize_max, mode,
 198                   init_buf, init_buf_size)
 199 int    relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote_pos)
 200 rchan_reader *add_rchan_reader(channel_id, auto_consume)
 201 int    remove_rchan_reader(rchan_reader *reader)
 202 rchan_reader *add_map_reader(channel_id)
 203 int    remove_map_reader(rchan_reader *reader)
 204 int    relay_read(reader, buf, count, wait, *actual_read_offset)
 205 void   relay_buffers_consumed(reader, buffers_consumed)
 206 void   relay_bytes_consumed(reader, bytes_consumed, read_offset)
 207 int    relay_bytes_avail(reader)
 208 int    rchan_full(reader)
 209 int    rchan_empty(reader)
 210 int    relay_info(channel_id, *channel_info)
 211 int    relay_close(channel_id)
 212 int    relay_realloc_buffer(channel_id, nbufs, async)
 213 int    relay_replace_buffer(channel_id)
 214 int    relay_reset(int rchan_id)
 215
 216 ----------
 217 int relay_open(channel_path, bufsize, nbufs, channel_flags,
 218          channel_callbacks, start_reserve, end_reserve, rchan_start_reserve,
 219          resize_min, resize_max, mode, init_buf, init_buf_size)
 220
 221 relay_open() is used to create a new entry in relayfs.  This new entry
 222 is created according to channel_path.  channel_path contains the
 223 absolute path to the channel file on relayfs.  If, for example, the
 224 caller sets channel_path to "/xlog/9", a "xlog/9" entry will appear
 225 within relayfs automatically and the "xlog" directory will be created
 226 in the filesystem's root.  relayfs does not implement any policy on
 227 its content, except to disallow the opening of two channels using the
 228 same file. There is, nevertheless, a set of guidelines for using
 229 relayfs. Basically, each facility using relayfs should use a top-level
 230 directory identifying it. The entry created above, for example,
 231 presumably belongs to the "xlog" software.
 232
 233 The remaining parameters for relay_open() are as follows:
 234
 235 - channel_flags - an ORed combination of attribute values controlling
 236   common channel characteristics:
 237
 238         - logging scheme - relayfs uses two mutually exclusive schemes
 239           for logging data to a channel.  The 'lockless scheme'
 240           reserves and writes data to a channel without the need of
 241           any type of locking on the channel.  This is the preferred
 242           scheme, but may not be available on a given architecture (it
 243           relies on the presence of a cmpxchg instruction).  It's
 244           specified by the RELAY_SCHEME_LOCKLESS flag.  The 'locking
 245           scheme' either obtains a lock on the channel for writing or
 246           disables interrupts, depending on whether the channel was
 247           opened for SMP or global usage (see below).  It's specified
 248           by the RELAY_SCHEME_LOCKING flag.  While a client may want
 249           to explicitly specify a particular scheme to use, it's more
 250           convenient to specify RELAY_SCHEME_ANY for this flag, which
 251           will allow relayfs to choose the best available scheme i.e.
 252           lockless if supported.
 253
 254        - overwrite mode (default is RELAY_MODE_CONTINUOUS) -
 255          If RELAY_MODE_CONTINUOUS is specified, writes to the channel
 256          will succeed regardless of whether there are up-to-date
 257          consumers or not.  If RELAY_MODE_NO_OVERWRITE is specified,
 258          the channel becomes 'full' when the total amount of buffer
 259          space unconsumed by readers equals or exceeds the total
 260          buffer size.  With the buffer in this state, writes to the
 261          buffer will fail - clients need to check the return code from
 262          relay_write() to determine if this is the case and act
 263          accordingly - 0 or a negative value indicates the write failed.
 264
 265        - SMP usage - this applies only when the locking scheme is in
 266          use.  If RELAY_USAGE_SMP is specified, it's assumed that the
 267          channel will be used in a per-CPU fashion and consequently,
 268          the only locking that will be done for writes is to disable
 269          local irqs.  If RELAY_USAGE_GLOBAL is specified, it's assumed
 270          that writes to the buffer can occur within any CPU context,
 271          and spin_lock_irqsave() will be used to lock the buffer.
 272
 273        - delivery mode - if RELAY_DELIVERY_BULK is specified, the
 274          client will be notified via its deliver() callback whenever a
 275          sub-buffer has been filled.  Alternatively,
 276          RELAY_DELIVERY_PACKET will cause delivery to occur after the
 277          completion of each write.  See the description of the channel
 278          callbacks below for more details.
 279
 280        - timestamping - if RELAY_TIMESTAMP_TSC is specified and the
 281          architecture supports it, efficient TSC 'timestamps' can be
 282          associated with each write, otherwise more expensive
 283          gettimeofday() timestamping is used.  At the beginning of
 284          each sub-buffer, a gettimeofday() timestamp and the current
 285          TSC, if supported, are read, and are passed on to the client
 286          via the buffer_start() callback.  This allows correlation of
 287          the current time with the current TSC for subsequent writes.
 288          Each subsequent write is associated with a 'time delta',
 289          which is either the current TSC, if the channel is using
 290          TSCs, or the difference between the buffer_start gettimeofday
 291          timestamp and the gettimeofday time read for the current
 292          write.  Note that relayfs never writes either a timestamp or
 293          time delta into the buffer unless explicitly asked to (see
 294          the description of relay_write() for details).
 295
 296 - bufsize - the size of the 'sub-buffers' making up the circular channel
 297   buffer.  For the lockless scheme, this must be a power of 2.
 298
 299 - nbufs - the number of 'sub-buffers' making up the circular
 300   channel buffer.  This must be a power of 2.
 301
 302   The total size of the channel buffer is bufsize * nbufs rounded up
 303   to the next kernel page size.  If the lockless scheme is used, both
 304   bufsize and nbufs must be a power of 2.  If the locking scheme is
 305   used, the bufsize can be anything and nbufs must be a power of 2.  If
 306   RELAY_SCHEME_ANY is used, the bufsize and nbufs should be a power of 2.
 307
 308   NOTE: if nbufs is 1, relayfs will bypass the normal size
 309   checks and will allocate an rvmalloced buffer of size bufsize.
 310   This buffer will be freed when relay_close() is called, if the channel
 311   isn't still being referenced.
 312
 313 - callbacks - a table of callback functions called when events occur
 314   within the data relay that clients need to know about:
 315
 316           - int buffer_start(channel_id, current_write_pos, buffer_id,
 317             start_time, start_tsc, using_tsc) -
 318
 319             called at the beginning of a new sub-buffer, the
 320             buffer_start() callback gives the client an opportunity to
 321             write data into space reserved at the beginning of a
 322             sub-buffer.  The client should only write into the buffer
 323             if it specified a value for start_reserve and/or
 324             rchan_start_reserve (see below) when the channel was
 325             opened.  In the latter case, the client can determine
 326             whether to write its one-time rchan_start_reserve data by
 327             examining the value of buffer_id, which will be 0 for the
 328             first sub-buffer.  The address that the client can write
 329             to is contained in current_write_pos (the client by
 330             definition knows how much it can write i.e. the value it
 331             passed to relay_open() for start_reserve/
 332             rchan_start_reserve).  start_time contains the
 333             gettimeofday() value for the start of the buffer and
 334             start_tsc contains the TSC read at the same time.  The using_tsc
 335             param indicates whether or not start_tsc is valid (it
 336             wouldn't be if TSC timestamping isn't being used).
 337
 338             The client should return the number of bytes it wrote to
 339             the channel, 0 if none.
 340
 341           - int buffer_end(channel_id, current_write_pos, end_of_buffer,
 342             end_time, end_tsc, using_tsc)
 343
 344             called at the end of a sub-buffer, the buffer_end()
 345             callback gives the client an opportunity to perform
 346             end-of-buffer processing.  Note that the current_write_pos
 347             is the position where the next write would occur, but
 348             since the current write wouldn't fit (which is the trigger
 349             for the buffer_end event), the buffer is considered full
 350             even though there may be unused space at the end.  The
 351             end_of_buffer param pointer value can be used to determine
 352             exactly the size of the unused space.  The client should
 353             only write into the buffer if it specified a value for
 354             end_reserve when the channel was opened.  If the client
 355             doesn't write anything i.e. returns 0, the unused space at
 356             the end of the sub-buffer is available via relay_info() -
 357             this data may be needed by the client later if it needs to
 358             process raw sub-buffers (an alternative would be to save
 359             the unused bytes count value in end_reserve space at the
 360             end of each sub-buffer during buffer_end processing and
 361             read it when needed at a later time.  The other
 362             alternative would be to use read(2), which makes the
 363             unused count invisible to the caller).  end_time contains
 364             the gettimeofday() value for the end of the buffer and
 365             end_tsc contains the TSC read at the same time.  The using_tsc
 366             param indicates whether or not end_tsc is valid (it
 367             wouldn't be if TSC timestamping isn't being used).
 368
 369             The client should return the number of bytes it wrote to
 370             the channel, 0 if none.
 371
 372           - void deliver(channel_id, from, len)
 373
 374             called when data is ready for the client.  This callback
 375             is used to notify a client when a sub-buffer is complete
 376             (in the case of bulk delivery) or a single write is
 377             complete (packet delivery).  A bulk delivery client might
 378             wish to then signal a daemon that a sub-buffer is ready.
 379             A packet delivery client might wish to process the packet
 380             or send it elsewhere.  The from param is a pointer to the
 381             delivered data and len specifies how many bytes are ready.
 382
 383           - void user_deliver(channel_id, from, len)
 384
 385             called when data has been written to the channel from user
 386             space.  This callback is used to notify a client when a
 387             successful write from userspace has occurred, independent
 388             of whether bulk or packet delivery is in use.  This can be
 389             used to allow userspace programs to communicate with the
 390             kernel client through the channel via out-of-band write(2)
 391             'commands' instead of via ioctls, for instance.  The from
 392             param is a pointer to the delivered data and len specifies
 393             how many bytes are ready.  Note that this callback occurs
 394             after the bytes have been successfully written into the
 395             channel, which means that channel readers must be able to
 396             deal with the 'command' data which will appear in the
 397             channel data stream just as any other userspace or
 398             non-userspace write would.
 399
 400           - int needs_resize(channel_id, resize_type,
 401                              suggested_buf_size, suggested_n_bufs)
 402
 403             called when a channel's buffers are in danger of becoming
 404             full i.e. the number of unread bytes in the channel passes
 405             a preset threshold, or when the current capacity of a
 406             channel's buffer is no longer needed.  Also called to
 407             notify the client when a channel's buffer has been
 408             replaced.  If resize_type is RELAY_RESIZE_EXPAND or
 409             RELAY_RESIZE_SHRINK, the kernel client should arrange to
 410             call relay_realloc_buffer() with the suggested buffer size
 411             and buffer count, which will allocate (but will not
 412             replace the old one) a new buffer of the recommended size
 413             for the channel.  When the allocation has completed,
 414             needs_resize() is again called, this time with a
 415             resize_type of RELAY_RESIZE_REPLACE.  The kernel client
 416             should then arrange to call relay_replace_buffer() to
 417             actually replace the old channel buffer with the newly
 418             allocated buffer.  Finally, once the buffer replacement
 419             has completed, needs_resize() is again called, this time
 420             with a resize_type of RELAY_RESIZE_REPLACED, to inform the
 421             client that the replacement is complete and additionally
 422             confirming the current sub-buffer size and number of
 423             sub-buffers.  Note that a resize can be canceled if
 424             relay_realloc_buffer() is called with the async param
 425             non-zero and the resize conditions no longer hold.  In
 426             this case, the RELAY_RESIZE_REPLACED suggested number of
 427             sub-buffers will be the same as the number of sub-buffers
 428             that existed before the RELAY_RESIZE_SHRINK or EXPAND i.e.
 429             values indicating that the resize didn't actually occur.
 430
 431           - int fileop_notify(channel_id, struct file *filp, enum relay_fileop)
 432
 433             called when a userspace file operation has occurred or
 434             will occur on a relayfs channel file.  These notifications
 435             can be used by the kernel client to trigger actions within
 436             the kernel client when the corresponding event occurs,
 437             such as enabling logging only when a userspace application
 438             opens or mmaps a relayfs file and disabling it again when
 439             the file is closed or unmapped.  The kernel client can
 440             also return its own return value, which can affect the
 441             outcome of the file operation - returning 0 indicates that
 442             the operation should succeed, and returning a negative value
 443             indicates that the operation should fail, and that
 444             the returned value should be returned to the ultimate
 445             caller e.g. returning -EPERM from the open fileop will
 446             cause the open to fail with -EPERM.  Among other things,
 447             the return value can be used to restrict a relayfs file
 448             from being opened or mmap'ed more than once.  The currently
 449             implemented fileops are:
 450
 451             RELAY_FILE_OPEN  - a relayfs file is being opened.  Return
 452                                0 to allow it to succeed, negative to
 453                                have it fail.  A negative return value will
 454                                be passed on unmodified to the open fileop.
 455             RELAY_FILE_CLOSE - a relayfs file is being closed.  The return
 456                                value is ignored.
 457             RELAY_FILE_MAP   - a relayfs file is being mmap'ed.  Return 0
 458                                to allow it to succeed, negative to have
 459                                it fail.  A negative return value will be
 460                                passed on unmodified to the mmap fileop.
 461             RELAY_FILE_UNMAP - a relayfs file is being unmapped.  The return
 462                                value is ignored.
 463
 464           - int ioctl(rchan_id, cmd, arg)
 465
 466             called when an ioctl call is made using a relayfs file
 467             descriptor.  The cmd and arg are passed along to this
 468             callback unmodified for it to do as it wishes with.  The
 469             return value from this callback is used as the return value
 470             of the ioctl call.
 471
 472   If the callbacks param passed to relay_open() is NULL, a set of
 473   default do-nothing callbacks will be defined for the channel.
 474   Likewise, any NULL rchan_callback function contained in a non-NULL
 475   callbacks struct will be filled in with a default callback function
 476   that does nothing.
 477
 478 - start_reserve - the number of bytes to be reserved at the start of
 479   each sub-buffer.  The client can do what it wants with this number
 480   of bytes when the buffer_start() callback is invoked.  Typically
 481   clients would use this to write per-sub-buffer header data.
 482
 483 - end_reserve - the number of bytes to be reserved at the end of each
 484   sub-buffer.  The client can do what it wants with this number of
 485   bytes when the buffer_end() callback is invoked.  Typically clients
 486   would use this to write per-sub-buffer footer data.
 487
 488 - rchan_start_reserve - the number of bytes to be reserved, in
 489   addition to start_reserve, at the beginning of the first sub-buffer
 490   in the channel.  The client can do what it wants with this number of
 491   bytes when the buffer_start() callback is invoked.  Typically
 492   clients would use this to write per-channel header data.
 493
 494 - resize_min - if set, this signifies that the channel is
 495   auto-resizeable.  The value specifies the size that the channel will
 496   try to maintain as a normal working size, and that it won't go
 497   below.  The client makes use of the resizing callbacks and
 498   relay_realloc_buffer() and relay_replace_buffer() to actually effect
 499   the resize.
 500
 501 - resize_max - if set, this signifies that the channel is
 502   auto-resizeable.  The value specifies the maximum size the channel
 503   can have as a result of resizing.
 504
 505 - mode - if non-zero, specifies the file permissions that will be given
 506   to the channel file.  If 0, the default rw user perms will be used.
 507
 508 - init_buf - if non-NULL, rather than allocating the channel buffer,
 509   this buffer will be used as the initial channel buffer.  The kernel
 510   API function relay_discard_init_buf() can later be used to have
 511   relayfs allocate a normal mmappable channel buffer and switch over
 512   to using it after copying the init_buf contents into it.  Currently,
 513   the size of init_buf must be exactly bufsize * nbufs.  The caller
 514   is responsible for managing the init_buf memory.  This feature is
 515   typically used for init-time channel use and should normally be
 516   specified as NULL.
 517
 518 - init_buf_size - the total size of init_buf, if init_buf is specified
 519   as non-NULL.  Currently, the size of init_buf must be exactly
 520   bufsize * nbufs.
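
Returning to the fileop_notify() callback described in the callbacks
list above, the following sketch shows one way a client could allow
only a single user space open and switch logging on and off with it.
It is a simplified model, not the real callback: the state structure,
the toy_ names and the choice of -EBUSY are assumptions, and the real
callback also receives a channel id and a struct file pointer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the fileops listed in this document. */
enum toy_fileop {
    TOY_FILE_OPEN,
    TOY_FILE_CLOSE,
    TOY_FILE_MAP,
    TOY_FILE_UNMAP
};

struct toy_state {
    int open_count;       /* user space opens currently active */
    int logging_enabled;  /* whether the client writes to the channel */
};

int toy_fileop_notify(struct toy_state *st, enum toy_fileop op)
{
    assert(st);

    switch (op) {
    case TOY_FILE_OPEN:
        if (st->open_count > 0)
            return -EBUSY;          /* refuse a second open */
        st->open_count++;
        st->logging_enabled = 1;    /* start logging on first open */
        return 0;
    case TOY_FILE_CLOSE:
        if (st->open_count > 0)
            st->open_count--;
        if (st->open_count == 0)
            st->logging_enabled = 0;
        return 0;                   /* ignored for close */
    case TOY_FILE_MAP:
    case TOY_FILE_UNMAP:
        return 0;                   /* mapping is not restricted here */
    }
    return 0;
}
```

Because a negative return value is passed back unmodified, the second
open(2) in user space would fail with whatever error the client picks.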
 521
 522 Upon successful completion, relay_open() returns a channel id
 523 to be used for all other operations with the relay. All buffers
 524 managed by the relay are allocated using rvmalloc/rvfree to allow
 525 for easy mmapping to user-space.
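
The sizing rules given for bufsize and nbufs above can be condensed
into a small check.  This is a sketch under stated assumptions: the
scheme enum and function name are hypothetical, the page size is
passed in by the caller, RELAY_SCHEME_ANY is treated as strictly as
the lockless scheme, and the nbufs == 1 case simply returns bufsize
without rounding, following the NOTE above.  Multiplication overflow
is not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the logging scheme flags. */
enum toy_scheme { TOY_SCHEME_LOCKLESS, TOY_SCHEME_LOCKING, TOY_SCHEME_ANY };

static int is_pow2(size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Returns the total channel buffer size in bytes, or 0 if the
 * combination of bufsize, nbufs and scheme is not allowed.
 */
size_t toy_channel_size(size_t bufsize, size_t nbufs,
                        enum toy_scheme scheme, size_t page_size)
{
    assert(page_size != 0);

    if (bufsize == 0 || nbufs == 0)
        return 0;
    if (nbufs == 1)
        return bufsize;             /* normal size checks bypassed */
    if (!is_pow2(nbufs))
        return 0;                   /* nbufs: always a power of 2 */
    if (scheme != TOY_SCHEME_LOCKING && !is_pow2(bufsize))
        return 0;                   /* lockless needs power-of-2 bufsize */

    /* Round bufsize * nbufs up to a whole number of pages. */
    size_t total = bufsize * nbufs;
    return (total + page_size - 1) / page_size * page_size;
}
```

For example, with 4096-byte pages, four sub-buffers of 3000 bytes are
accepted only for the locking scheme and occupy 12288 bytes after
rounding.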
 526
 527 ----------
 528 int relay_write(channel_id, *data_ptr, count, time_delta_offset, **wrote_pos)
 529
 530 relay_write() reserves space in the channel and writes count bytes of
 531 data pointed to by data_ptr to it.  Automatically performs any
 532 necessary locking, depending on the scheme and SMP usage in effect (no
 533 locking is done for the lockless scheme regardless of usage).  It
 534 returns the number of bytes written, or 0/negative on failure.  If
 535 time_delta_offset is >= 0, the internal time delta calculated when
 536 the slot was reserved will be written at that
 537 offset.  This is the TSC or gettimeofday() delta between the current
 538 write and the beginning of the buffer, whichever method is being used
 539 by the channel.  Trying to write a count larger than the bufsize
 540 specified to relay_open() (taking into account the reserved
 541 start-of-buffer and end-of-buffer space as well) will fail.  If
 542 wrote_pos is non-NULL, it will receive the location the data was
 543 written to, which may be needed for some applications but is not
 544 normally interesting.  Most applications should pass in NULL for this
 545 param.
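
The size limit and the buffer switch described above can be pictured
with a toy userspace model.  This is a sketch, not the relayfs
implementation: the toy_ names and the tiny fixed sizes are invented
for illustration, there is no locking, no callbacks and no time
delta, and the model always behaves like RELAY_MODE_CONTINUOUS, i.e.
it wraps around and overwrites old data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_BUFSIZE = 16, TOY_NBUFS = 2 };   /* illustrative sizes */

struct toy_chan {
    unsigned char buf[TOY_NBUFS][TOY_BUFSIZE];
    size_t start_reserve;   /* bytes kept free at the start of each sub-buffer */
    size_t end_reserve;     /* bytes kept free at the end of each sub-buffer */
    size_t cur;             /* sub-buffer currently being written */
    size_t offset;          /* next write position within it */
    size_t switches;        /* number of buffer switches so far */
    size_t unused;          /* bytes left unused (before end_reserve) in
                               the most recently closed sub-buffer */
};

void toy_open(struct toy_chan *c, size_t start_reserve, size_t end_reserve)
{
    assert(start_reserve + end_reserve < TOY_BUFSIZE);
    memset(c, 0, sizeof *c);
    c->start_reserve = start_reserve;
    c->end_reserve = end_reserve;
    c->offset = start_reserve;
}

/* Returns the number of bytes written, or 0 if the write cannot fit. */
int toy_write(struct toy_chan *c, const void *data, size_t count)
{
    size_t limit = TOY_BUFSIZE - c->end_reserve;

    /* A write larger than the usable part of a sub-buffer always fails. */
    if (count == 0 || count > limit - c->start_reserve)
        return 0;

    /* The write won't fit in the current sub-buffer: switch buffers. */
    if (c->offset + count > limit) {
        c->unused = limit - c->offset;      /* buffer_end would run here */
        c->cur = (c->cur + 1) % TOY_NBUFS;
        c->offset = c->start_reserve;       /* buffer_start would run here */
        c->switches++;
    }

    memcpy(&c->buf[c->cur][c->offset], data, count);
    c->offset += count;
    return (int)count;
}
```

With two bytes reserved at each end, only 12 of the 16 bytes in a
sub-buffer are usable, so a 13-byte write fails outright, while a
second 8-byte write triggers a switch and leaves 4 bytes unused in the
first sub-buffer.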
 546
 547 ----------
 548 struct rchan_reader *add_rchan_reader(int rchan_id, int auto_consume)
 549
 550 add_rchan_reader creates and initializes a reader object for a
 551 channel.  An opaque rchan_reader object is returned on success, and is
 552 passed to relay_read() when reading the channel.  If the boolean
 553 auto_consume parameter is 1, the reader is defined to be
 554 auto-consuming.  auto-consuming reader objects are automatically
 555 created and used for VFS read(2) readers.
 556
 557 ----------
 558 int remove_rchan_reader(struct rchan_reader *reader)
 559
 560 remove_rchan_reader finds and removes the given reader from the
 561 channel.  It is needed only for readers created with add_rchan_reader().  VFS
 562 read(2) readers are automatically removed when the corresponding file
 563 object is closed.
 564
 565 ----------
 566 struct rchan_reader *add_map_reader(int rchan_id)
 567
 568 Creates and initializes an rchan_reader object for a channel map
 569 reader.  The kernel client passes it to relay_bytes_consumed() or
 570 relay_buffers_consumed() when its mmap user client reports that
 571 data has been read.
 572
 573 ----------
 574 int remove_map_reader(reader)
 575
 576 Finds and removes the given map reader from the channel.  This function
 577 is useful only for map readers.
 578
 579 ----------
 580 int relay_read(reader, buf, count, wait, *actual_read_offset)
 581
 582 Reads count bytes from the channel, or as much as is available within
 583 the sub-buffer currently being read.  The read offset that will be
 584 read from is the position contained within the reader object.  If the
 585 wait flag is set, buf is non-NULL, and there is nothing available, it
 586 will wait until there is.  If the wait flag is 0 and there is nothing
 587 available, -EAGAIN is returned.  If buf is NULL, the value returned is
 588 the number of bytes that would have been read.  actual_read_offset is
 589 the value that should be passed as the read offset to
 590 relay_bytes_consumed, needed only if the reader is not auto-consuming
 591 and the channel is MODE_NO_OVERWRITE, but in any case, it must not be
 592 NULL.
 593
 594 ----------
 595
 596 int relay_bytes_avail(reader)
 597
 598 Returns the number of bytes available relative to the reader's current
 599 read position within the corresponding sub-buffer, 0 if there is
 600 nothing available.  Note that this doesn't return the total bytes
 601 available in the channel buffer.  It is enough, however, to know
 602 whether anything is available, or how many bytes might be returned
 603 from the next read.
 604
 605 ----------
 606 void relay_buffers_consumed(reader, buffers_consumed)
 607
 608 Adds to the channel's consumed buffer count.  buffers_consumed should
 609 be the number of buffers newly consumed, not the total number
 610 consumed.  NOTE: kernel clients don't need to call this function if
 611 the reader is auto-consuming or the channel is MODE_CONTINUOUS.
 612
 613 In order for the relay to detect the 'buffers full' condition for a
 614 channel, it must be kept up-to-date with respect to the number of
 615 buffers consumed by the client.  If the addition of the value of the
 616 buffers_consumed param to the current bufs_consumed count for the channel
 617 would exceed the bufs_produced count for the channel, the channel's
 618 bufs_consumed count will be set to the bufs_produced count for the
 619 channel.  This allows clients to 'catch up' if necessary.
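
The 'catch up' rule described above can be modelled in user space as
follows.  This is only a sketch: struct chan_counts and
model_buffers_consumed() are hypothetical names invented for
illustration, not part of the relayfs API, and the model assumes the
consumed count never starts out ahead of the produced count.

```c
/* Hypothetical stand-in for the per-channel counters; the real
 * channel structure is not shown in this document. */
struct chan_counts {
	unsigned int bufs_produced;
	unsigned int bufs_consumed;
};

/* Model of the 'catch up' rule: add the newly consumed buffers, but
 * never let the consumed count run ahead of the produced count. */
void model_buffers_consumed(struct chan_counts *c,
			    unsigned int newly_consumed)
{
	if (newly_consumed > c->bufs_produced - c->bufs_consumed)
		c->bufs_consumed = c->bufs_produced;
	else
		c->bufs_consumed += newly_consumed;
}
```

With 5 buffers produced and 3 consumed, reporting 10 newly consumed
buffers leaves the model's consumed count at 5 rather than 13.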
 620
 621 ----------
 622 void relay_bytes_consumed(reader, bytes_consumed, read_offset)
 623
 624 Adds to the channel's consumed count.  bytes_consumed should be the
 625 number of bytes actually read e.g. return value of relay_read() and
 626 the read_offset should be the actual offset the bytes were read from
 627 e.g. the actual_read_offset set by relay_read().  NOTE: kernel clients
 628 don't need to call this function if the reader is auto-consuming or
 629 the channel is MODE_CONTINUOUS.
 630
 631 In order for the relay to detect the 'buffers full' condition for a
 632 channel, it must be kept up-to-date with respect to the number of
 633 bytes consumed by the client.  For packet clients, it makes more sense
 634 to update after each read rather than after each complete sub-buffer
 635 read.  The bytes_consumed count updates bufs_consumed when a buffer
 636 has been consumed so this count remains consistent.
 637
 638 ----------
 639 int relay_info(channel_id, *channel_info)
 640
 641 relay_info() fills in an rchan_info struct with channel status and
 642 attribute information such as usage modes, sub-buffer size and count,
 643 the allocated size of the entire buffer, buffers produced and
 644 consumed, current buffer id, and the count of writes lost due to the
 645 'buffers full' condition.
 646
 647 The virtual address of the channel buffer is also available here, for
 648 those clients that need it.
 649
 650 Clients may need to know how many 'unused' bytes there are at the end
 651 of a given sub-buffer.  This is only the case if the client 1) neither
 652 wrote this count to the end of the sub-buffer nor otherwise noted it
 653 (it's available as the difference between the buffer end and current
 654 write pos params in the buffer_end callback; if the client returned 0
 655 from the buffer_end callback, it's assumed that it didn't note it), and
 656 2) isn't using the read() system call to read the
 657 buffer.  In other words, if the client isn't annotating the stream and
 658 is reading the buffer by mmapping it, this information is needed
 659 in order for the client to 'skip over' the unused bytes at the ends of
 660 sub-buffers.
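
As an illustration of 'skipping over' unused bytes, the sketch below
copies only the valid part of each sub-buffer.  It is a user-space
model, not relayfs code: struct subbuf_view and copy_valid_bytes() are
hypothetical names, and the per-sub-buffer unused counts are assumed
to have been obtained by the client beforehand.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one sub-buffer as seen by an mmap reader. */
struct subbuf_view {
	const unsigned char *start;	/* first byte of the sub-buffer */
	size_t size;			/* sub-buffer size in bytes */
	size_t unused;			/* unused bytes at the end */
};

/* Copy the valid bytes of each sub-buffer into out, skipping the
 * unused tail of each one.  Returns the number of bytes copied. */
size_t copy_valid_bytes(const struct subbuf_view *bufs, size_t nbufs,
			unsigned char *out, size_t out_len)
{
	size_t copied = 0;
	size_t i;

	for (i = 0; i < nbufs; i++) {
		size_t valid;

		if (bufs[i].unused > bufs[i].size)
			continue;	/* inconsistent entry, ignore it */
		valid = bufs[i].size - bufs[i].unused;
		if (valid > out_len - copied)
			valid = out_len - copied;	/* out is full */
		memcpy(out + copied, bufs[i].start, valid);
		copied += valid;
	}
	return copied;
}
```

A client that annotates its stream, or that reads via read(), would
not need anything like this.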
 661
 662 Additionally, for the lockless scheme, clients may need to know
 663 whether a particular sub-buffer is actually complete.  An array of
 664 boolean values, one per sub-buffer, contains non-zero if the buffer is
 665 complete, zero otherwise.
 666
 667 ----------
 668 int relay_close(channel_id)
 669
 670 relay_close() is used to close the channel.  It finalizes the last
 671 sub-buffer (the one currently being written to) and marks the channel
 672 as finalized.  The channel buffer and channel data structure are then
 673 freed automatically when the last reference to the channel is given
 674 up.
 675
 676 ----------
 677 int relay_realloc_buffer(channel_id, nbufs, async)
 678
 679 Allocates a new channel buffer using the specified sub-buffer count
 680 (note that resizing can't change sub-buffer sizes).  If async is
 681 non-zero, the allocation is done in the background using a work queue.
 682 When the allocation has completed, the needs_resize() callback is
 683 called with a resize_type of RELAY_RESIZE_REPLACE.  This function
 684 doesn't replace the old buffer with the new - see
 685 relay_replace_buffer().
 686
 687 This function is called by kernel clients in response to a
 688 needs_resize() callback call with a resize type of RELAY_RESIZE_EXPAND
 689 or RELAY_RESIZE_SHRINK.  That callback also includes a suggested
 690 new_bufsize and new_nbufs which should be used when calling this
 691 function.
 692
 693 Returns 0 on success, or errcode if the channel is busy or if
 694 the allocation couldn't happen for some reason.
 695
 696 NOTE: if async is not set, this function should not be called with a
 697 lock held, as it may sleep.
 698
 699 ----------
 700 int relay_replace_buffer(channel_id)
 701
 702 Replaces the current channel buffer with the new buffer allocated by
 703 relay_realloc_buffer and contained in the channel struct.  When the
 704 replacement is complete, the needs_resize() callback is called with
 705 RELAY_RESIZE_REPLACED.  This function is called by kernel clients in
 706 response to a needs_resize() callback having a resize type of
 707 RELAY_RESIZE_REPLACE.
 708
 709 Returns 0 on success, or errcode if the channel is busy or if the
 710 replacement or previous allocation didn't happen for some reason.
 711
 712 NOTE: This function will not sleep, so it can be called in any
 713 context and with locks held.  The client should, however, ensure that
 714 the channel isn't actively being read from or written to.
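
The resize handshake described for relay_realloc_buffer() and
relay_replace_buffer() can be summarized as a small decision function.
The sketch below is a user-space model only: the MODEL_* enumerators
and model_needs_resize() are hypothetical, with placeholder values
standing in for the real RELAY_RESIZE_* definitions.

```c
/* Placeholder values standing in for the RELAY_RESIZE_* resize
 * types; the real definitions are not shown in this document. */
enum model_resize_type {
	MODEL_RESIZE_EXPAND,	/* stands in for RELAY_RESIZE_EXPAND */
	MODEL_RESIZE_SHRINK,	/* stands in for RELAY_RESIZE_SHRINK */
	MODEL_RESIZE_REPLACE,	/* stands in for RELAY_RESIZE_REPLACE */
	MODEL_RESIZE_REPLACED	/* stands in for RELAY_RESIZE_REPLACED */
};

enum model_action {
	MODEL_CALL_REALLOC,	/* client calls relay_realloc_buffer() */
	MODEL_CALL_REPLACE,	/* client calls relay_replace_buffer() */
	MODEL_DONE		/* new buffer is in place */
};

/* What a client's needs_resize() callback is expected to do next,
 * according to the descriptions above. */
enum model_action model_needs_resize(enum model_resize_type type)
{
	switch (type) {
	case MODEL_RESIZE_EXPAND:
	case MODEL_RESIZE_SHRINK:
		return MODEL_CALL_REALLOC;
	case MODEL_RESIZE_REPLACE:
		return MODEL_CALL_REPLACE;
	case MODEL_RESIZE_REPLACED:
	default:
		return MODEL_DONE;
	}
}
```

A real callback would also have to respect the context rules above:
relay_realloc_buffer() may sleep unless async is set, while
relay_replace_buffer() does not sleep.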
 715
 716 ----------
 717 int relay_reset(rchan_id)
 718
 719 relay_reset() has the effect of erasing all data from the buffer and
 720 restarting the channel in its initial state.  The buffer itself is not
 721 freed, so any mappings are still in effect.  NOTE: Care should be
 722 taken that the channel isn't actually being used by anything when
 723 this call is made.
 724
 725 ----------
 726 int rchan_full(reader)
 727
 728 Returns 1 if the channel is full with respect to the reader, 0 if not.
 729
 730 ----------
 731 int rchan_empty(reader)
 732
 733 Returns 1 if the channel is empty with respect to the reader, 0 if not.
 734
 735 ----------
 736 int relay_discard_init_buf(rchan_id)
 737
 738 Allocates an mmappable channel buffer, copies the contents of init_buf
 739 into it, and sets the current channel buffer to the newly allocated
 740 buffer.  This function is used only in conjunction with the init_buf
 741 and init_buf_size params to relay_open(), and is typically used when
 742 the ability to write into the channel at init-time is needed.  The
 743 basic usage is to specify an init_buf and init_buf_size to relay_open,
 744 then call this function when it's safe to switch over to a normally
 745 allocated channel buffer.  'Safe' means that the caller is in a
 746 context that can sleep and that nothing is actively writing to the
 747 channel.  Returns 0 if successful, negative otherwise.
 748
 749
 750 Writing directly into the channel
 751 =================================
 752
 753 Using the relay_write() API function as described above is the
 754 preferred means of writing into a channel.  In some cases, however,
 755 in-kernel clients might want to write directly into a relay channel
 756 rather than have relay_write() copy it into the buffer on the client's
 757 behalf.  Clients wishing to do this should follow the model used to
 758 implement relay_write itself.  The general sequence is:
 759
 760 - get a pointer to the channel via rchan_get().  This increments the
 761   channel's reference count.
 762 - call relay_lock_channel().  This will perform the proper locking for
 763   the channel given the scheme in use and the SMP usage.
 764 - reserve a slot in the channel via relay_reserve()
 765 - write directly to the reserved address
 766 - call relay_commit() to commit the write
 767 - call relay_unlock_channel()
 768 - call rchan_put() to release the channel reference
 769
 770 In particular, clients should make sure they call rchan_get() and
 771 rchan_put() and not hold on to references to the channel pointer.
 772 Also, forgetting to use relay_lock_channel()/relay_unlock_channel()
 773 has no effect if the lockless scheme is being used, but could result
 774 in corrupted buffer contents if the locking scheme is used.
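
The call order in the list above can be illustrated with a toy,
single-threaded user-space model.  Everything prefixed toy_ below is a
hypothetical stand-in invented for this sketch; the real rchan_get(),
relay_lock_channel(), relay_reserve() and relay_commit() take
arguments that are not documented in this section, so no attempt is
made to reproduce their signatures.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a channel.  None of these fields are the real
 * relayfs ones; they only exist to mirror the call order. */
struct toy_chan {
	int refcount;
	int locked;
	size_t reserved;	/* bytes handed out by toy_reserve() */
	size_t committed;	/* bytes made visible by toy_commit() */
	char buf[64];
};

void toy_get(struct toy_chan *c)    { c->refcount++; }
void toy_put(struct toy_chan *c)    { c->refcount--; }
void toy_lock(struct toy_chan *c)   { c->locked = 1; }
void toy_unlock(struct toy_chan *c) { c->locked = 0; }

/* Reserve len bytes; returns NULL if they do not fit. */
char *toy_reserve(struct toy_chan *c, size_t len)
{
	char *slot;

	if (len > sizeof(c->buf) - c->reserved)
		return NULL;
	slot = c->buf + c->reserved;
	c->reserved += len;
	return slot;
}

/* Mark len previously reserved bytes as written. */
void toy_commit(struct toy_chan *c, size_t len)
{
	c->committed += len;
}

/* get, lock, reserve, write, commit, unlock, put.
 * Returns the number of bytes written, or 0 on failure. */
size_t toy_direct_write(struct toy_chan *c, const void *data, size_t len)
{
	size_t written = 0;
	char *slot;

	toy_get(c);
	toy_lock(c);
	slot = toy_reserve(c, len);
	if (slot) {
		memcpy(slot, data, len);	/* write to reserved slot */
		toy_commit(c, len);
		written = len;
	}
	toy_unlock(c);
	toy_put(c);
	return written;
}
```

Note that the unlock and put steps run on the failure path as well, so
the model never leaves the channel locked or holds on to a reference.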
 775
 776
 777 Limitations
 778 ===========
 779
 780 Writes made via the write() system call are currently limited to 2
 781 pages worth of data.  There is no such limit on the in-kernel API
 782 function relay_write().
 783
 784 User applications can currently only mmap the complete buffer (it
 785 doesn't really make sense to mmap only part of it, given its purpose).
 786
 787
 788 Latest version
 789 ==============
 790
 791 The latest version can be found at:
 792
 793 http://www.opersys.com/relayfs
 794
 795 Example relayfs clients, such as dynamic printk and the Linux Trace
 796 Toolkit, can also be found there.
 797
 798
 799 Credits
 800 =======
 801
 802 The ideas and specs for relayfs came about as a result of discussions
 803 on tracing involving the following:
 804
 805 Michel Dagenais         <michel.dagenais@polymtl.ca>
 806 Richard Moore           <richardj_moore@uk.ibm.com>
 807 Bob Wisniewski          <bob@watson.ibm.com>
 808 Karim Yaghmour          <karim@opersys.com>
 809 Tom Zanussi             <zanussi@us.ibm.com>
 810
 811 Also thanks to Hubertus Franke for a lot of useful suggestions and bug
 812 reports, and for contributing the klog code.