Add LDAP intro document

inejge · Dec 20, 2022 · 95d2704 · 95d2704
1 parent 00a513e
commit 95d2704
Showing 1 changed file with 341 additions and 0 deletions.
diff --git a/LDAP-primer.md b/LDAP-primer.md
@@ -0,0 +1,341 @@
+# An LDAP Primer
+
+Copyright (c) 2022 Ivan Nejgebauer. Distributed under the Creative Commons
+Attribution-NonCommercial-ShareAlike 4.0 International (CC-BY-NC-SA) license.
+
+# 0. Foreword
+
+"LDAP" stands for Lightweight Directory Access Protocol. If you don't recognize
+the acronym, or don't find anything in the full name meaningful, this document
+is probably not for you. It's a high-level introduction to a niche technical
+topic.
+
+With that out of the way, why another LDAP introduction? As the author of an
+[LDAP library](https://crates.io/crates/ldap3), one thing that has long
+bothered me was the inability to refer the non-knowledgeable reader to basic
+documentation that I felt to be both approachable and complete enough. Existing
+LDAP literature is famously hermetic, making the initial steps toward understanding
+much more difficult than necessary; even the introductory texts can't resist
+the temptation of using advanced terminology before having it explained.
+
+The aspiration of this primer is to introduce the topic of LDAP well enough
+to provide a foothold for further exploration, without requiring the reader's
+previous familiarity with anything beyond JSON, TCP, and TLS.
+
+# 1. Introduction
+
+The term "LDAP" is used for two related but distinct concepts:
+
+1. A database with a well-defined data model. ("We keep our user identities
+   in LDAP.") The database and the service are often called a _directory_.
+   It can be implemented as a separate specialized server, like OpenLDAP,
+   or as part of a set of system components, like Active Directory.
+
+2. The standardized protocol to access the database. ("Use LDAP to query
+   Active Directory.") Since the protocol is standardized and the data model
+   well-known, even a service with a radically different data store can
+   use it to offer access to its data.
+
+Developers need good working knowledge of both aspects to use LDAP
+effectively. The principal difficulty with gaining that knowledge lies
+in learning a sizeable set of interrelated concepts using rather alien
+terminology, stemming from LDAP's roots in the X.500 standardization
+effort. The IETF/ISO standard wars of the mid- to late 1980s are
+now forgotten, but one can glimpse some of the reasons for internet's
+decisive victory by perusing any available X.500 document and noting
+the impenetrable text combined with a jungle of references to other
+such documents.
+
+This primer will try to lessen the pain by always being clear whether it's
+talking about LDAP, the database or LDAP, the protocol, and introducing
+the minimum of necessary terminology to explain any issue. For a deeper
+understanding, familiarity with the original specifications is still
+immensely useful, and the interested reader shouldn't shy away even from
+the ISO/ITU-T standards. X.501 (The Directory: Models), 1993 edition, is
+[readily available online](https://www.itu.int/rec/T-REC-X.501-199311-S/en).
+If one makes peace with the style, it's evident that many LDAP concepts are
+a simplified extrapolation of those in X.501.
+
+# 2. The LDAP data model
+
+## 2.1. Entries and attributes
+
+Let's imagine a toy system for storing user identities. For each user there
+will be a data set containing the username, a password, the user's full name
+and their last name. (This is not very realistic; especially, the password
+should _never_ be stored in clear text.) If there is a "John Doe" with the
+username "john.doe" and password "secret", their data set represented as a
+JSON object might look like this:
+
+    {
+        "username": "john.doe",
+        "full_name": "John Doe",
+        "last_name": "Doe",
+        "password": "secret"
+    }
+
+To make the data closer to the LDAP model, one would first rename the keys.
+Still keeping it in JSON:
+
+    {
+        "uid": "john.doe",
+        "cn": "John Doe",
+        "sn": "Doe",
+        "userPassword": "secret"
+    }
+
+LDAP represents data in a different format, called LDIF (LDAP Data Interchange
+Format.) Directly translating the above JSON to LDIF gives:
+
+    uid: john.doe
+    cn: John Doe
+    sn: Doe
+    userPassword: secret
+
+In LDAP, a data set like this one is called an __entry__. An LDAP database is
+a collection of entries. What were the keys in the JSON object are called
+__attributes__ in LDAP. 
+
+The above entry is incomplete. If filled out to a minimal complete form, it might
+look like this:
+
+    dn: uid=john.doe,ou=People,dc=example,dc=com
+    objectClass: top
+    objectClass: inetOrgPerson
+    uid: john.doe
+    cn: John Doe
+    sn: Doe
+    userPassword: secret
+
+The `dn` attribute will be explained in the next section. Here, just note that
+there are multiple lines starting with "objectClass". That's because certain
+LDAP attributes can be multi-valued, which is represented in LDIF as separate
+lines with the same attribute name. Attribute values in an entry form a set:
+there must not be any duplicate values for a given attribute.
+
+Attribute names can be more complicated than described here, and more will be
+said in a later section.
+
+## 2.2. Entry naming and organization
+
+Entries in an LDAP database are logically organized as a tree, called the __directory
+information tree__ (DIT), and every entry has an attribute describing its position
+in the tree. The attribute is named `dn` for "distinguished name", and is unique
+for an entry. Let's repeat the full entry from the previous section and highlight
+the naming attributes:
+
+<pre><code><b>dn</b>: uid=john.doe,ou=People,<span style="color: blue;">dc=example,dc=com</span>
+objectClass: top
+objectClass: inetOrgPerson
+<b>uid</b>: john.doe
+cn: John Doe
+sn: Doe
+userPassword: secret
+</code></pre>
+
+A __distinguished name__ (DN) is a sequence of components describing the path
+through the logical tree from its root to the entry itself. In the LDIF value
+representation, the components are comma-separated and written left-to-right
+from the leaf to the root. (In the above DN, the component closest to the
+root is "dc=com".) Each component is called a __relative distinguished name__
+(RDN), and (in the simplest case, covered here) consists of a single attribute
+name and a value. The attribute and the value of the leftmost RDN, which names
+the entry, must exist in the entry itself. Above, the entry having the RDN
+"uid=john.doe" means that it must have the attribute `uid` with the value
+"john.doe".
+
+Originally, the X.500 directory standard envisioned a distributed directory,
+not unlike the domain name system, but with a much richer data model. For a
+variety of reasons, this never happened, and the directories remained disparate.
+The X.500 naming system has countries and registered organizations as DN root
+elements, which wasn't well suited to domain names used for naming on the internet,
+so a method of using domain names in DNs was standardized in RFC 2247 and is
+predominantly used in LDAP today.
+
+The DN above has the rightmost part in a different color. That part directly
+corresponds to the domain __example.com__. In a DIT organized with such
+a mapping, all entries are at or below the domain, which is then called the
+__suffix__ or __naming context__. A single DIT can contain multiple naming
+contexts.
+
+There is a special entry at the very root of the DIT. It has an empty DN
+and is called the __root DSE__ (DSA Specific Entry; "DSA" stands for "directory
+service agent", another name for the directory server.) It's used to store basic
+configuration and capability data about the directory server.
+
+## 2.3. Entry structure and schemas
+
+The set of attributes in an entry is not arbitrary, but dictated by the entry's
+__object classes__, each of which defines a set of mandatory and/or optional
+attributes. One, and only one, of the entry's object classes is __structural__,
+and ordinarily fixed for the lifetime of the entry. All other object classes are
+__auxiliary__ and may be added to or removed from an entry. Object classes form
+a hierarchy, in which all structural classes are ultimately derived from the
+_top_ class. That's why there will always be an `objectClass` attribute with
+that value in an entry.
+
+In the John Doe entry used as an example, the class _inetOrgPerson_ is derived
+from _organizationalPerson_, in turn a descendant of  _person_, which mandates
+the `cn` and `sn` attributes. Furthermore, `userPassword` is an optional attribute
+of _person_, while `uid` is an optional attribute of _inetOrgPerson_, although
+it's mandatory in this entry since it's a naming attribute.
+
+Despite having descriptive names, object classes and attributes are actually
+identified by __object identifiers__ (OIDs), which are arrays of integers, written
+as decimal numbers separated by dots. For example, the identifier `2.5.4.3`
+is mapped to two names: `cn` and `commonName`. The example entry used `cn`,
+but could have used `commonName` for exactly the same purpose. Attribute names
+are not case-sensitive: `commonName`, `commonname` and `COMMONNAME` refer to
+the same attribute.
+
+Object identifiers are one of the main visible legacies of LDAP's X.500 origins.
+The main branches of the OID namespace are controlled by the international
+standards organizations, ISO and ITU-T (formerly CCITT.) Obtaining one's own OID
+subtree for experimentation or customization is a somewhat involved process, but
+is also possible without bureaucracy by [using a UUID](https://oidref.com/2.25)
+in the `2.25` OID arc.
+
+Aside from the identifier and the names, an attribute definition also describes
+the syntax of attribute values and the ways those values can be compared for
+(in)equality, ordering and substring matching, by specifying the OIDs of the
+corresponding syntax and matching rules. Knowing the rules for a particular
+attribute is necessary for constructing searches, since not all search operations
+are compatible with all attribute types.
+
+A collection of attribute and object class definitions is called a __schema__. A
+directory will support a number of standardized schemas and provide the ways to
+extend the schema collection. It's important to note that standard attribute
+definitions should not be changed, because external tools and integrations might
+depend on them.
+
+# 3. LDAP, the protocol
+
+## 3.1. Connection lifecycle and basic authentication
+
+LDAP standards describe the transport options, operations and message formats of
+the protocol. The latter two are tightly coupled to the data model, so understanding
+the model makes learning the protocol much easier. Contrary to the many early internet
+protocols, LDAP has a binary encoding, necessitating the use of language-specific
+APIs or precompiled access tools to communicate with the server.
+
+LDAP uses TCP and port 389 to listen to client connection requests. Unless upgraded
+later, those connections are completely in the clear, and vulnerable to eavesdropping.
+It is therefore strongly recommended to secure the connection by using TLS or some
+other method, like Kerberos. Here one encounters some long-standing unaddressed
+specification deficiencies.
+
+The only standardized way of activating TLS is by first connecting in the clear to
+port 389, then issuing a special operation which starts the TLS handshake. This method,
+called StartTLS, is considered less secure than unconditional TLS establishment, and
+is discouraged in current usage. There is a de-facto standard for unconditional TLS
+connections, where the server listens on port 636. It's supported by everyone, but for
+some reason was never standardized, and it's doubtful that it will ever be.
+
+LDAP has a family of URL schemes which is pretty widely used for specifying connection
+and search parameters. In the most basic usage, the scheme part is __ldap__ for clear
+connections, __ldaps__ for unconditional TLS, and __ldapi__ for UNIX domain sockets
+(definitely nonstandard but widely available on Unix-like systems.) The host part of
+the URL specifies the hostname or IP address of the server; the port, if present,
+likewise, with the default value depending on the scheme (389 for __ldap__, 636 for
+__ldaps__.) The __ldapi__ scheme is different, because the host part represents the
+path of the server socket, and no port can be given.
+
+Upon establishing the connection, the client is anonymous. Servers usually limit access
+to some attributes or whole subtrees of the DIT to anonymous users. To overcome this,
+the client may need to authenticate, or __bind__, to the directory. Binding is one of
+a handful of standardized protocol operations. The usual binding method is called a
+_simple bind_, and uses a distinguished name (_bind DN_) and a password as parameters.
+The bind DN plays the role of username. The precise form of that DN depends on server
+configuration. It should be obvious that sending the password should only be done over
+a TLS-protected connection.
+
+When the client is finished, it may use the __unbind__ operation or simply drop
+the connection. The name "unbind" is slightly misleading, suggesting the opposite
+of "bind" while keeping the connection open; this is not the case, since unbinding shuts
+down the connection unconditionally.
+
+## 3.2. The Search operation
+
+LDAP is mostly used in read-heavy applications, where data retrieval dominates the usage.
+The only way to retrieve one or more entries via LDAP is to use the __search__ operation.
+It has a plethora of parameters and options, of which the following four are usually
+all it's necessary to provide:
+
+1. The _search base_, which is the starting point in the DIT for the operation. It must
+   lie within a naming context; one exception is the search for the root DSE.
+
+2. The _scope_, which bounds the number of entries which the operation will consider. It
+   can have the values _base_, meaning only the entry named as the search base; _one_ (for
+   "one level"), which means all entries on the single level immediately below the
+   search base, but excluding the base itself; and _sub_ (for "subtree"), containing
+   everything at and below the search base.
+
+3. The _filter_, an expression computed for all candidate entries, selecting those for which
+   it evaluates to true. An empty filter is syntactically invalid, although some LDAP tools
+   automatically substitute it with `(objectClass=*)`, meaning "an entry containing
+   a non-empty `objectClass` attribute," which is all of them.
+
+4. The list of _attributes_ to retrieve from the matching entries. If none are specified, only
+   the matching entry DNs are returned. The special name `*` means "all attributes," or more
+   precisely, all _user_ attributes, since there is an additional set of __operational attributes__
+   in each entry, maintained by the directory itself and not modifiable by the user. The set of
+   all operational attributes is requested by the special name `+`.
+
+The search operation, like all but two other LDAP operations, returns a result structure, one of
+whose elements is a numeric result code. Zero signifes success, while most non-zero values
+are errors.
+
+Filters are arguably the most important components of search requests, since they enable finding
+and retrieving the relevant entry or a small subset of entries among, potentially, thousands.
+Their representation in the request is binary, but a string format is standardized in RFC 4515
+and supported by all APIs, command line tools and applications. It's a bit quirky and heavy on
+the parentheses, especially if multiple terms and negations are involved, but not too difficult
+to understand and write for most day-to-day uses.
+
+Let's suppose that the John Doe entry also has an email address, given by the following LDIF
+fragment:
+
+    mail: john.doe@example.com
+
+(The email could be added to the entry as previously specified, since the object class
+`inetOrgPerson` has `mail` among its optional attributes.) If one wanted to search for
+a user's entry using the value entred in the login form, the resulting filter might be:
+
+    (uid=john.doe)
+
+However, if the user can enter both the username and the email address, the filter could be
+changed to account for both:
+
+    (|(uid=john.doe)(mail=john.doe))
+
+The filter would match an entry which has either `uid` or `mail` with the value "john.doe".
+In this case, the `uid` term would match. If the user entered "john.doe@example.com", the
+resulting filter would be:
+
+    (|(uid=john.doe@example.com)(mail=john.doe@example.com))
+
+This filter would match the same entry, but in this case the `mail` term would evaluate
+to true.
+
+## 3.3. Other LDAP operations
+
+LDAP provides the full set of operations for maintaining the DIT: adding, modifying and deleting
+entries, the analogues of INSERT, UPDATE and DELETE operations on a relational database. A notable
+difference is that LDAP changes don't operate in bulk, but on a single entry. The protocol guarantees
+that all modifications of an entry performed in a single operation will be atomic: either all of them
+are visible, or none. The guarantee doesn't extend to a series of modifications or other operations
+that change the DIT.
+
+The __modify__ operation consists of a sequence of modification requests for a single entry. Each
+request can add, delete or replace the values of a single attribute. Logically, the requests must
+be performed in the order listed within the operation. Individual requests may temporarily violate
+the schema (e.g., if all values of a mandatory attribute are deleted), but the final result must
+conform to the schema. Changing or deleting the structural object class or the naming attribute
+are not allowed.
+
+For changing the RDN or moving an entry elsewhere in the DIT there is the __modify DN__ operation.
+Schema conformance dictates that the naming attribute with the value used in the leaf RDN must
+exist in the entry, so renaming an entry to a RDN that is not already present would require two
+operations: one to add a new value and one to perform a renaming. To account for this situation,
+__modify DN__ will add the new naming attribute and value to the renamed entry automatically,
+and will also optionally delete the previous naming attribute-value pair.