A threaded, continuations-based I/O event library for manycore NUMA machines
dankamongmen/libtorque
==================================================
libtorque, a multithreaded I/O event library
Copyright © 2009--2021 Nick Black <dank@qemfd.net>
Render this document with fixed-width fonts!
==================================================

[libtorque ASCII-art banner]
"...tear the roof off the sucka..."
continuation-based unix i/o for manycore numa
© nick black 20xx

Wiki                   - https://nick-black.com/dankwiki/index.php/Libtorque
Mailing list           - http://groups.google.com/group/libtorque-devel
GitHub project page    - http://github.com/dankamongmen/libtorque
Primary git repository - git://github.com/dankamongmen/libtorque.git
Bugzilla               - https://nick-black.com/bugzilla/buglist.cgi?product=libtorque

  I. History and licensing
 II. Minimum requirements
      1. architecture
      2. operating system
      3. compiler
      4. libc/pthreads
      5. cpuset
      6. numa
      7. cuda
      8. ssl
      9. dns
      A. doc
      B. misc
III. Building libtorque
 IV. Design issues
      1. design docs
  V. Writing libtorque applications
      1. overview
      2. common mistakes
 VI. FAQ
      1. building
      2. general use
      3. file descriptors
      4. signals

----------------------------------------------------------------libtorque-----
-=+ I. History and licensing +=-
----------------------------------------------------------------libtorque-----

libtorque was conceived as a project for Professor Richard Vuduc's Fall 2009
"CSE 6230: Tools and Applications for High Performance Computing" at the
Georgia Institute of Technology.
The original proposal for libtorque is available at:

    https://nick-black.com/tabpower/cse6230proposal.pdf

libtorque is licensed under version 2 of the Apache License:

    http://www.apache.org/licenses/LICENSE-2.0.html

A copy can be found in the toplevel file COPYING.

Development of libtorque would have been impossible without the extraordinary
grace, patience and benevolence of management at McAfee Research, particularly
Dmitri Alperovitch and Dr. Sven Krasser.

----------------------------------------------------------------libtorque-----
-=+ II. Minimum requirements +=-
----------------------------------------------------------------libtorque-----

--architecture requirements---------------------------------------------------
Only x86 processors with the CPUID instruction are currently supported (most
everything from the Pentium Pro onwards). Further hardware support is intended.

--operating system requirements-----------------------------------------------
libtorque has been tested on Linux (versions 2.6.19 through 3.2.6), and
FreeBSD (version 7.1). It might work on earlier versions of Linux. Support for
other operating systems, and earlier versions, is intended.

--compiler requirements-------------------------------------------------------
libtorque is reliant upon GNU Make and the GNU Compiler Collection. gcc is
tracked quite closely, and only recent versions might be supported at any
time; 4.3 is the minimum gcc version explicitly supported or recommended.

gcc 4.2, and also llvm using the 4.2 frontend, appear to work if some -W
options are removed from WFLAGS. The results have not been extensively tested.

--libc/pthreads requirements--------------------------------------------------
On Linux, the GNU C Library is required, using the NPTL threading
implementation (NPTL is the default on 2.6 kernels since GNU libc 2.3.2).
Versions 2.5 through 2.10 have been tested.
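
Whether the running C library provides NPTL can be checked at runtime. The
following is a minimal sketch, assuming glibc's confstr(3) extension
_CS_GNU_LIBPTHREAD_VERSION; the helper name uses_nptl() is hypothetical and is
not part of libtorque.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not a libtorque interface). Returns 1 if the C library
 * reports an NPTL threading implementation, 0 if it reports something else,
 * and -1 if the query is unavailable (for instance on a non-GNU libc). */
int uses_nptl(void){
#ifdef _CS_GNU_LIBPTHREAD_VERSION
	char buf[64];
	size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof(buf));

	if(n == 0 || n > sizeof(buf)){
		return -1; /* name not supported, or answer did not fit */
	}
	/* glibc typically answers with a string such as "NPTL 2.x" */
	return strncmp(buf, "NPTL", 4) == 0;
#else
	return -1;
#endif
}
```

A caller would treat a return of 1 as meeting the requirement above, and
anything else as a reason to inspect the system by hand.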
On FreeBSD, only the libthr threading implementation is explicitly supported
or recommended (this is the default in FreeBSD 7, and the only supported mode
in FreeBSD 8). If rebuilding world, ensure NO_LIBTHR is not active in
make.conf. If using another pthread library as the default, bind libpthread
references to libthr via the following entries in /etc/libmap.conf:

    libpthread.so.2 libthr.so.2
    libpthread.so   libthr.so

If using a 32-bit version of the library on a 64-bit system, place these same
lines in /etc/libmap32.conf. The mapping may be restricted to libtorque if
necessary (this author recommends general use of the libthr implementation).

--cpuset requirements---------------------------------------------------------
On FreeBSD, the native code added during 7.1 development is used.

On Linux, administrative support for cpusets requires CONFIG_CPUSET to be
enabled in the kernel (if cpuset partitioning is in effect, a "cpuset" or
"cgroups" filesystem will be mounted on /dev/cpuset). Affinities can and will
still be used by libtorque without this support, but it will be difficult to
partition processing and memory elements up among processes. Affinities have
been part of Linux since 2.5.8. See the Linux kernel's
Documentation/cpusets.txt and libtorque bug #14
(https://nick-black.com/bugzilla/show_bug.cgi?id=14) for more info. If cgroups
are used, you likely also want CONFIG_GROUP_SCHED.

The SGI libcpuset library (http://oss.sgi.com/projects/cpusets/) was
evaluated, but I decided against it due to stability, portability and
maintenance issues. Version 1.0 was tested.

--numa requirements-----------------------------------------------------------
On Linux, the libNUMA library (http://oss.sgi.com/projects/libnuma/) is used.
Version 2.0.3 has been tested. CONFIG_NUMA must be enabled in the kernel; if
NUMA is properly supported, devices/system/node* directories will be present
in mounted sysfs filesystems.

FreeBSD does not, to my knowledge, expose NUMA details as of 7.2.
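
The sysfs check described above can be automated. This is a minimal sketch,
assuming sysfs is mounted at a caller-supplied root (commonly /sys) and that
the first node directory is named node0; the helper sysfs_has_numa_node() is
hypothetical and is not part of libtorque or libNUMA.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper (not a libtorque interface). Returns 1 if
 * <sysfs_root>/devices/system/node/node0 exists and is a directory,
 * 0 otherwise (including when the path would not fit in the buffer). */
int sysfs_has_numa_node(const char *sysfs_root){
	char path[512];
	struct stat st;
	int n;

	n = snprintf(path, sizeof(path), "%s/devices/system/node/node0",
			sysfs_root);
	if(n < 0 || (size_t)n >= sizeof(path)){
		return 0; /* formatting error or truncated path */
	}
	if(stat(path, &st) != 0){
		return 0; /* no such entry: NUMA details are not exposed */
	}
	return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

Calling sysfs_has_numa_node("/sys") before linking against libNUMA gives a
quick hint as to whether LIBTORQUE_WITHOUT_NUMA will be needed.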

--cuda requirements-----------------------------------------------------------
On Linux, the "Driver API" libcuda library
(http://www.nvidia.com/object/cuda_get.html) is used. Version 2.3 has been
tested.

--ssl requirements------------------------------------------------------------
OpenSSL is supported. Version 0.9.8 has been tested. GnuTLS support is being
considered.

--dns requirements------------------------------------------------------------
GNU adns is supported. Version 1.4 has been tested. C-ares support is being
considered. We might roll our own, one designed for highly concurrent
operation.

--doc requirements------------------------------------------------------------
Building the man pages (distributed in DocBook XML) requires xsltproc (part of
the GNOME project's libxslt) and DocBook. A network connection is required if
the DocBook DTDs and XSL stylesheets are not installed; building the
documentation will be much faster with local copies. Install:

 - docbook-xml, docbook-xsl, xsltproc (Debian)
 - textproc/docbook-xml, textproc/docbook-xsl, textproc/xsltproc (FreeBSD)

Building the other documentation (papers, figures, etc) requires GraphViz's
dot(1) utility. Versions 2.20.2 through 2.26.0 have been tested. Install:

 - graphviz (Debian)
 - graphics/graphviz (FreeBSD)

--misc requirements-----------------------------------------------------------
Exuberant Ctags are required to build the tagfile. Install:

 - devel/ctags (FreeBSD)
 - exuberant-ctags (Debian)

----------------------------------------------------------------libtorque-----
-=+ III. Building libtorque +=-
----------------------------------------------------------------libtorque-----

If you have downloaded a release tarball, "configure" will already be present.
If you're building from a source checkout, you'll need the GNU Autotools. Run
"autoreconf -fi" to (re)generate "configure".

Run "./configure" and "make" to build the library, and "make install" to
install it.

Environment variables can affect the build by overriding defaults:

 DESTDIR     (Installation prefix. Default: /usr/local)
 DOCPREFIX   (Doc installation prefix. Default: /usr/local/share (Linux),
              /usr/local (FreeBSD))
 CC          (C compiler executable. Default: gcc-4.4 (Linux),
              gcc44 (FreeBSD))
 TAGBIN      (Source tag generator. Default: exctags if on path, otherwise
              ctags)
 XSLTPROC    (XSL processor. Default: xsltproc)
 MARCH/MTUNE (Code generation settings. See below)

Build policy can be modified by defining certain variables:

 LIBTORQUE_WITHOUT_ADNS    (do not build in GNU adns support)
 LIBTORQUE_WITHOUT_CUDA    (do not build in NVIDIA CUDA support)
 LIBTORQUE_WITHOUT_OPENSSL (do not build in OpenSSL support)
 LIBTORQUE_WITHOUT_NUMA    (do not build in libNUMA support)
 LIBTORQUE_WITHOUT_EV      (do not build libev-based testing binaries)
 LIBTORQUE_WITHOUT_WERROR  (do not compile with -Werror -- use is discouraged)

Changing environment variables ought be followed by the 'clean' target; this
is one of the very few times the 'clean' target must be used.

By default, libtorque is built optimizing for the buildhost's µ-architecture
and ISA, using gcc 4.3's "native" option to -march and -mtune. If you don't
have gcc 4.3 or greater, you'll need to define appropriate march and mtune
values for your system (see gcc's "Submodel Options" info page). Libraries
intended to be run on arbitrary x86 hardware must be built with MARCH
explicitly defined as "generic", and MTUNE unset. The resulting libraries will
be wretchedly suboptimal on the vast majority of x86 processors.

From the toplevel, invoke GNU make. On Linux, 'make' is almost always GNU
make. On FreeBSD, the devel/gmake Port supplies GNU make as 'gmake'. This will
build the libtorque library, and run the supplied unit tests. Unit test
failures are promoted to full build failures. The install target can then be
run to install the library.

Note: The 'install' target depends on unit testing targets, and thus will not
install a known-unsafe library.
This might be undesirable when hacking on the library, and testing with
another application. The 'unsafe-install' target is provided to facilitate
such operation. Its use is not typically recommended.

The 'deinstall' target will remove the files installed by that version of
libtorque (it cannot remove files installed only by previous versions). Since
libtorque does not install any active configuration files, use of 'deinstall'
is thus recommended prior to updating and rebuilding libtorque. Non-existence
of files is not considered an error by the 'deinstall' target.

libtorque can be brought up to date via 'git pull'. The 'clean' target ought
never be necessary to run, save when hacking on the build process itself (or
changing build parameters, as noted above), or (re)moving source files.

----------------------------------------------------------------libtorque-----
-=+ IV. Design Issues +=-
----------------------------------------------------------------libtorque-----

 - Execution unit detection, differentiation, and effective use. This might
   have to deal with symmetric multiprocessing, one or many multicore
   packages, simultaneous multithreading (i.e. HyperThreading), heterogeneous
   cores, limited cpusets, and processors which are removed from or added to
   the workset at runtime. Power management capabilities, functional units,
   memory and I/O paths and interconnection properties all play roles in data
   placement and event scheduling. Instruction set details ought not matter so
   much.

   libtorque will initially operate as the sole user of any processing units
   it is allocated; consideration of other processes, if it exists, is
   incidental. Later, this might change. We might support prioritizing within
   a cpuset, so that for instance two libtorque programs can share the
   entirety of a cpuset, but stomp on each other minimally. It would of course
   generally be best to combine these various components into a single
   libtorque program.

 - Memory detection, differentiation and effective use. This might have to
   deal with unified vs split caches, n-way associativities, line sizes, total
   store sizes, page sizes and types, prefetching, eviction policies, DMA into
   DRAM or even cache SRAM, multiprocessor coherence and sharing, inclusive
   and exclusive levels, bank count, and TLB sizes. It is unexpected that
   libtorque will take into consideration memory pipelining, writethrough vs
   writeback, memory bandwidth, or absolute latency.

   libtorque will, for instance, want to generally schedule two functionally
   pipelined gyres on a shared die, whereas functionally parallel codes might
   be usually scheduled irrespective of die-sharing. Stacks can freely alias
   one another across exclusive, independent caches, but ought not relative to
   a shared cache. Meanwhile, multiple states scheduled on a given thread
   ought not be aliasing. These issues combine in complex, interesting ways as
   the eventspace becomes irregular, and states must be moved among processors
   (for instance, select a processor serving no aliasing states if one's
   available).

 - Not only event-handling, but also event receipt must be scheduled. Any
   given set of threads can invoke event discovery, on shared or distinct sets
   of events, where shared events could employ shared or distinct kernel-side
   event sets. Multiple listeners on an event means more flexibility, but also
   more communication and wasted work; it is likely better to move the event.
   If no more than one thread can wait for an event, and either one-shot
   handling or edge-triggering is used, a majority of locking and possible
   contention can be excised from the core.

--design docs-----------------------------------------------------------------
Various design documents can be found in the doc/ subdirectory. Included among
them are:

 doc/mteventqueues - "Event Queues and Threads"
 doc/termination   - "Termination"

----------------------------------------------------------------libtorque-----
-=+ V. Writing libtorque applications +=-
----------------------------------------------------------------libtorque-----

--overview--------------------------------------------------------------------
The only interfaces available to users of libtorque are those in libtorque.h,
which attempts to be authoritative and current regarding technical details.
Numerous example applications live in tools/testing/ and various src/
directories. That having been said:

 - A torque_ctx is required to use any libtorque functionality. A program may
   use more than one torque_ctx, although this constrains event handling and
   is thus not generally optimal. This support exists because:

    - multiple libraries used by an app might each use libtorque
    - multiple-architecture processes might one day need it
    - it seems unlikely that refusing to support multiple contexts would lead
      to any bugs being discovered more quickly

   This is not primarily a security- or billing-related issue; to effect QoS
   and accounting, multiple libtorque applications ought be run in distinct
   operating system containers. Alternatively, use libtorque's priority system
   in conjunction with handrolled stats.

 - A torque_ctx can be created only via torque_init(). It cannot be used after
   passing it to torque_stop(). Side-effects of torque_init() include:

    - (re-)detection of system topology and processor details
    - populating allocated processors with an event thread each (note that N
      libtorque contexts in a process lead to N threads per processor,
      assuming the process's cpuset doesn't change between initializations)
    - SIGPIPE is ignored if it was previously handled via the default action
    - SIGTERM will be intercepted by some instance of libtorque subject to
      operating system-specific rules. See kqueue(2) or signalfd(2) (this also
      applies to any signals registered via torque_addsignal())
    - allocation of moderate amounts of memory and a handful of file
      descriptors

 - Add event sources to the libtorque context via torque_add*().
   Fundamental event sources include:

    - file descriptors (rx / tx)
    - signals (rx)
    - timers (absolute or relative; see timerfd(2) or kqueue(2))
    - filesystem events (see inotify(7) or kqueue(2))
    - pthread condition variables

   Synthesized atop these are numerous derived sources (event systems)...

    - SSLv3/TLSv1 servers and clients
    - DNS queries
    - Network events (via netlink/PF_ROUTE sockets)

   ...and also stream transforms:

    - SSLv3/TLSv1
    - gzip/bzip2
    - architecture-adaptive buffering

   Event sources may be registered with more than one libtorque context; the
   events will be repeated to (and thus handled by) all associated contexts.

 - Once registered, libtorque will immediately begin facilitating callbacks
   for the specified event. No more than one libtorque thread will dispatch a
   given event's handler at once (though subsequent events may be handled by
   any thread). Locking is thus only necessary for mutable data referenced by
   multiple events' handlers (including, for instance, thread-unsafe libraries
   called by potentially-concurrent handlers).

   This is efficiently implemented via exclusive use of edge-triggered I/O
   notification (we'd otherwise need locks in the event dispatching).
   Edge-triggered I/O is covered in epoll_wait(2) and kqueue(2); most
   important to note is that all available data must be read in each callback,
   or events will cease to be generated. This means every dequeuing operation
   (read(2), accept(2), etc) must be repeated until either:

    - An attempt to dequeue returns with EAGAIN or EWOULDBLOCK. Further
      read-type events will be processed and dispatched as they occur.
    - Further handling would block on some other resource, a mutex for
      instance or perhaps buffer space. Ensure appropriate related
      continuations are registered, and compose a read-type callback across
      them (this is the most general definition of an event queue as mentioned
      in epoll_wait(2)).
    - EOF is reached (read() returns 0).
      Either close(2) the descriptor or, if still writing, ensure appropriate
      related continuations are registered, as no more read-type events will
      be dispatched.
    - The connection is invalidated, in which case it must be close(2)d lest
      it possibly be leaked (there is no assurance of further read-type
      events).

 - It is (currently) critical that handlers not block. Only non-blocking or
   asynchronous I/O operations ought be used, and preferably only file
   descriptors explicitly marked non-blocking. Rather than sleeping on a
   contended mutex, update the continuation and yield the processing context.
   Remember that non-blocking operation is typically meaningless in the
   context of a disk-backed read(2); asynchronous I/O is thus preferable for
   disk reads (especially since a "non-blocking" read retried or failed at the
   block layer can block arbitrarily). Major computations upon which handling
   is dependent ought be implemented via libtorque's opportunistic or
   dedicated compute-thread infrastructure and more fine-grained
   continuations.

 - Handlers can themselves call libtorque functions (even torque_stop()), even
   on their own contexts (this is of course necessary for any kind of
   accept(2)ing socket).

--common mistakes-------------------------------------------------------------
 - Using torque_addfd() for a listen(2)ing socket. The proper activation is
   torque_addfd_unbuffered(). The accept(2)ing callback will never be invoked
   from a default (buffered) fd.

 - Failing to account for EINTR or short returns. Whether a system call
   interrupted by signal delivery is automatically restarted depends on the
   operating system and libc, the capability operand, the system call, and the
   current signal-handling state. Since libtorque intercepts signals prior to
   uninterested threads' receipt (see signalfd(2) and kqueue(2)), applications
   needn't worry about yielding on EINTR, nor unbounded looping thereon.
   Also, EINTR is indicated only when no data had been moved upon signal
   delivery; a short result is returned otherwise.

 - Failure to mask signals registered with libtorque in all other threads.

----------------------------------------------------------------libtorque-----
-=+ VI. FAQ +=-
----------------------------------------------------------------libtorque-----

--building--------------------------------------------------------------------
Q: I get errors about NUMA-related functionality.
A: See the NUMA requirements in section II.6. If you can't provide the
   required minimum support, build with LIBTORQUE_WITHOUT_NUMA.

--general use-----------------------------------------------------------------
Q: Can an (unthreaded) program use libtorque even if it doesn't use -pthread
   during compilation?
A: Yes. Most of the binaries built as part of libtorque don't use -pthread;
   see CFLAGS vs MT_CFLAGS in the GNUmakefile.

--file descriptors------------------------------------------------------------
Q: Why is torque_addfd() failing on very high fds?
A: Did your file descriptor rlimit change after the relevant torque_ctx was
   created? torque_init() detects and uses the file descriptor rlimit to shape
   some internal arrays, and will reject file descriptors outside this range.

--signals---------------------------------------------------------------------
Q: Why can't I listen to SIGTERM, SIGKILL or SIGSTOP?
A: SIGKILL and SIGSTOP can't be caught or used through signalfd/kqueue, so any
   attempt to use them will be rejected. Libtorque uses SIGTERM internally, so
   attempts to use it will also be rejected (technically, it uses
   EVTHREAD_TERM, which is #defined on current platforms to SIGTERM).

Q: Will externally-generated signals be delivered to libtorque threads, or
   other threads, or both?
A: libtorque uses POSIX threads, and reflects those semantics. An IPC signal
   will be delivered to an arbitrary thread which is not masking that signal.
   By default, libtorque threads mask all possible signals (all save KILL,
   STOP, and TERM -- see above), and thus signals will prefer other threads.
   When a signal is registered with libtorque, that signal will be unmasked in
   at least one libtorque thread. In summary: to ensure delivery to
   non-libtorque threads, don't register the relevant signals with libtorque,
   and mask them prior to calling torque_init(). To ensure handling within
   libtorque, mask the relevant signals (in all threads) prior to calling
   torque_init(), and register them for handling. Always be sure to keep
   SIGTERM blocked.

Q: Will signal handlers be called if libtorque is listening for that signal?
A: Signals delivered to libtorque threads will be consumed. It doesn't matter
   anyway, since you ought have the relevant signals blocked in other threads.
   Furthermore, libtorque might modify the (process-wide) signal handler.