Design doc for making Ceph pool creation/deletion optional #1900
Conversation
LGTM, this follows the spirit of #1868. Before merge, it looks like the build is failing, and you'll need to run make codegen since types.go was updated.
SkipPoolCreation bool `json:"skipPoolCreation"`

// Do not erase Ceph filesystem and pools when this filesystem is removed in Rook.
PreservePoolsOnRemove bool `json:"preservePoolsOnRemove"`
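For illustration, a filesystem spec carrying these two proposed flags might look roughly like this; the pool and MDS settings are placeholders rather than anything taken from this PR:

spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  metadataServer:
    activeCount: 1
  skipPoolCreation: true
  preservePoolsOnRemove: true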
At the time the pool management was implemented in Rook, there was some discussion about the lifetime of the pools and whether they should be deleted when the CRD was deleted. Without a confirmation flag such as --yes-i-really-really-mean-it, the risk of deleting data unintentionally is higher. So even if Rook creates the pools, it's good to have a separate flag that says to keep my pools.
I'm uncomfortable with making Rook partially in control of a given resource, for example, Pool. I think this complicates the reconciliation loops in the operator unnecessarily. What if instead we went with a model where each resource (Pool, Cluster, etc.) is either completely controlled by Rook, or not at all? For example, a Ceph Filesystem could reference either a Pool CRD (managed by Rook) or an externally managed Pool (identified by name). You could do the same with a Ceph Cluster. Using this approach, a given resource is either declaratively managed or not, but it's never partially declaratively managed 😀 This might push us down a path of refactoring some CRD types.
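As a purely hypothetical sketch of the split described here (the poolRef and externalPoolName fields are invented for illustration and are not part of Rook):

spec:
  metadataPool:
    poolRef: cephfs-metadata          # a Pool CRD managed by Rook
  dataPools:
    - externalPoolName: cephfs-data   # an externally managed pool, identified by name
  metadataServer:
    activeCount: 1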
@bassam separating out "Filesystem" (which currently means filesystem, pools, and MDS daemons) has some logic to it, but it's going to make the YAML quite a bit more verbose. It would be a lot more flexible than we really need (for our Ceph product), as what we really want is a boolean: either manage the logical Ceph resources, or don't. The intermediate cases (e.g. create their pools by hand, but their filesystem in Rook) are probably not useful in practice. I'm not sure how the complexity ends up looking on the operator side; I guess we have to have mechanisms to prevent people deleting a filesystem in use by an MDS cluster, and ensure that controllers are setting up their respective bits in the right order.

@travisn @jbw976 you guys might have a better gut feel than I do about how much work it would be to make that larger change to how filesystems and objectstores work.
Agree. I think this is a generic problem with Kubernetes and YAML -- it tends to be verbose. There are a number of solutions on top of YAML (like jsonnet, kustomize, etc.) that can help.

IMO, we have the wrong factoring currently: Filesystem and the two associated pools need to be more independent resources. I suspect we coupled them more tightly to improve the YAML experience.
Having the partial management behavior seems independent of whether the pool CRD is factored out. I don't see how factoring out the pool into a separate CRD would help here; you would still need another property in the filesystem CRD that indicates that the pool is externally managed. The design choice to embed the pools in the filesystem and objectstore CRDs was made for the following reasons.

The advantages of having them inline seem much better than the complications of separating them out into the pool CRD.
Rook can still automatically create and manage the pools needed for a Filesystem if that's desired; in that case they are not externally managed pools. I'm just making the case that a given resource should not be partially managed. For example, you could go with something like the split sketched above.

I'm trying to see if we can make resource boundaries also become automated/manual management boundaries.
I don't think that syntax works well; it has several shortcomings.

If you reduce this down to the point where we don't specify separate policies for metadata vs. data, and we don't specify the pool names at all, then you end up back where we started with a boolean flag for whether to manage the pools (or two flags to separately control creation and deletion).

Another alternative would be to explicitly separate the MDS cluster definition from the Filesystem definition, such that we (external management interfaces) wouldn't create a "Filesystem" in Rook at all, we'd just create an MDSCluster or similar object (and the "Filesystem" wouldn't need its pool creation to be optional). Then we'd have the question of whether a Filesystem should also wrap an MDSCluster object, or whether users should have to separately define a Filesystem and an MDSCluster to go with it -- back to the "making the YAML complicated" issue, which I find hard to justify unless there are key use cases that aren't covered by just adding the proposed bool fields.
At a higher design level... The Kubernetes patterns around desired state include policies for how to handle a resource. When there is a resource with complex handling requirements, a policy is defined for it and its controller is expected to apply that policy (the PV reclaim policy is one example).

In the case of pool lifetime, it seems very reasonable to have a policy that says the pool should not be deleted when the owner resource is deleted. This is in the same spirit as the PV reclaim policy: sometimes you want to make sure you don't lose your data, and the user should be able to set that policy.

In the case of the other setting, which tells Rook to skip pool creation or expect to be given the name of a pool managed by some external entity, neither option is in the spirit of desired state. Desired state should not have a policy that says "don't manage me". Such a desired state quickly leads to a situation where Rook cannot control related resources, and it complicates the test matrix. With the desired state approach, Rook should be expected to do everything in its power to make sure its resources are fully functional. We should be teaching Rook how to handle all the desired state settings for pools, file systems, and object stores and let it manage the settings. If an external entity wants to interface with Rook, it would do so through the CRDs to express the desired state. Anything the external entity does to work around Rook's desired state management is going to lead down a slippery slope and cause many more headaches in the long run.

@jcsp I'm still unclear about what is needed here when the user is going through the dashboard. Can we not teach the dashboard (or mgr) to fully use the Rook CRDs to create the filesystems, pools, apply replica settings, etc.? Rook wasn't designed to partially manage; it was designed to fully manage.
We can, but we definitely shouldn't. If flows like pool creation had to go through Rook, ceph-ansible, and DeepSea, it would balloon the size of the code and the test matrix. The UI functionality would be limited to the lowest common denominator of what the orchestrators knew how to do (they all have different subsets of functionality). Remember also that Ceph is a moving target: filesystems and pools are both going to change in the near future (e.g. sharing data pools, pg autoscaling). It would be seriously painful if the Ceph dashboard had to deal with the inevitable differing levels of orchestrator support for changes, rather than simply exposing the new features when they were ready in Ceph.

It's not that we want to toggle arbitrary subsets of Rook functionality: it's a single split, between logical Ceph entities (pools, filesystems, auth users, RGW zones) and orchestrated services (mons, OSDs, MDSs, RGW daemons). The patch is tiny because it's such a natural separation: the things we're toggling were already separate functions, because they're conceptually separate. This isn't a slippery slope of complexity in Rook.

On the Ceph side, we reflect that separation in the nascent orchestrator interface (ceph/ceph#21046). The scope of the orchestrator interface is "everything Ceph needs external help to do". Everything else is accessible to the dashboard using existing calls. The cost of sending things through external tools is worthwhile for things we can't do ourselves, but it's not worthwhile for things we can easily do with a call straight to the mon.
The desired state model ultimately doesn't really fit here. After sleeping on it, and with those details of the problem, it does seem necessary to have a policy that says "don't manage me".

Or perhaps now our two settings can merge into a single policy.
Ah, no -- to me the flag means "don't touch it at all"; I was borrowing the name from #1868, and you're right that it gives the wrong impression about handling changes.

I'd be happy with having a simpler single attribute to enable/disable management of the logical Ceph bits. The naming is awkward because while we talk about pools, we mostly really mean "pool and filesystem" (i.e. when the flag is false, the Rook FilesystemSpec is just driving an MDS cluster). Maybe we should bite the bullet of calling it something a little abstract like "manageCephEntities", as it is mostly expected to be used by code rather than humans.

I'm neutral on whether the "retainPool" mode is needed -- I don't think we'd be using that from ceph-mgr, but I can imagine how people writing their own YAML might quite like it.
OK, if it applies to the filesystem management, I come back to a bool, although the name is still awkward. If that flag is set, can the name of the filesystem to be managed by the MDS be assumed to be the name of the filesystem CRD? And Rook wouldn't need to know anything about the pool names at that point.
@jcsp do you have updates to the policy/settings names based on the conversation in this thread?
I've changed this to a simplified form that uses an onlyManageDaemons boolean.
@jcsp What would be the behavior if the pools and filesystem aren't created before creating the CRDs with the new daemons? I imagine the daemons would crash/exit if the pools aren't available. These changes make sense to me, even if it is a different type of desired state that skips settings. We would also want to document the settings in the object store and file system CRDs (ceph-object-store-crd.md and ceph-filesystem-crd.md), and we might consider wording around what level of support this setting has.

@bassam any more concerns about the desired state settings?
In the CephFS case, the daemon will just wait (remain in standby) until a filesystem that matches its standby_for_fscid setting appears. I'm a bit fuzzy on how Rook actually populates that setting; for a non-existent filesystem it won't be able to resolve a name to an fscid, so I'm not sure it would even get as far as starting the daemon. Not sure what the RGW behaviour is; I know that it creates its own pools in some cases. This is not a new question for Rook though -- it's just the same as if someone started daemons using e.g. systemd, and the daemons were configured to use FS/pools that didn't exist.
// The data pool settings
- DataPools []PoolSpec `json:"dataPools"`
+ DataPools []PoolSpec `json:"dataPools,omitempty"`
I think this would be a non-breaking change from v0.8, correct?
Yep, the same yaml file should always work before or after the change
// The mds pod info
MetadataServer MetadataServerSpec `json:"metadataServer"`

// Don't touch pools/filesystem, just run MDS daemons
OnlyManageDaemons bool `json:"onlyManageDaemons"`
I assume if this flag is true, the values for DataPools and MetadataPool must be empty, correct? I wonder if it would be cleaner to define another FilesystemSpec like so:
type FilesystemPoolSpec struct {
    // The metadata pool settings
    MetadataPool PoolSpec `json:"metadataPool,omitempty"`
    // The data pool settings
    DataPools []PoolSpec `json:"dataPools,omitempty"`
}

// FilesystemSpec represents the spec of a file system
type FilesystemSpec struct {
    // The data and metadata pool settings. If empty, Rook will not create or manage pools
    FilesystemPoolSpec FilesystemPoolSpec `json:"pools,omitempty"`
    // The mds pod info
    MetadataServer MetadataServerSpec `json:"metadataServer"`
}
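In manifest form, that grouping would look roughly like this (a sketch with placeholder values, not taken from this PR):

spec:
  pools:
    metadataPool:
      replicated:
        size: 3
    dataPools:
      - replicated:
          size: 3
  metadataServer:
    activeCount: 1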
Why would we require the pool specs to be empty? If that setting is true, it just means the operator would ignore them if they are set. We have several other places in the CRDs where settings are ignored depending on other settings. For example, if useAllNodes: true, the entire nodes element is ignored.
The pools do not have to be empty. In practice, they would probably be non-empty if a user initially created their pools using Rook, and then set onlyManageDaemons=true right before deleting their Rook filesystem, to preserve their data. They could also clear the pools element, but I don't see any reason to force them to.
What does it mean for OnlyManageDaemons to be set true and the pool specs to be specified? I'm attempting to find a clean way to delineate between a declaratively managed and an unmanaged resource.

"if useAllNodes: true, the entire nodes element is ignored."

That should be a validation error, right? If useAllNodes is true, nodes should be omitted. You can specify these kinds of validation rules now in CRD specs.

"if a user initially created their pools using Rook, and then set onlyManageDaemons=true right before deleting their Rook filesystem, to preserve their data."

This is the kind of partial management that concerns me, i.e. create the pools with Rook and then Rook stops managing them due to a flag.
"you can specify these kinds of validation rules now in CRD specs."

Although it feels a bit overkill, I don't have any fundamental problem with making this stricter -- could you help me out with a pointer to how to do this stuff with k8s CRDs?
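For context, this kind of rule is declared through the CRD's OpenAPI v3 validation schema. A minimal sketch follows; the group, version, and field names are plausible values for Rook at the time rather than anything taken from this PR, and cross-field constraints such as "both pools set or both omitted" are harder to express in this schema and typically need an admission webhook instead:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: filesystems.ceph.rook.io
spec:
  group: ceph.rook.io
  version: v1beta1
  scope: Namespaced
  names:
    kind: Filesystem
    plural: filesystems
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            onlyManageDaemons:
              type: boolean
            metadataServer:
              properties:
                activeCount:
                  type: integer
                  minimum: 1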
What do you see as the advantage of moving the metadataPool and dataPools under a pools element? Is that to make it more obvious that either both pools should be empty or both should be set? A complication with that change is that we would need to implement a migration, since that would be a breaking change to the CRD.
"Is that to make it more obvious that both pools should be empty or else both pools should be set?"

Yes. Said differently, Rook is either managing the pools or doesn't know about them.
Is the same goal not accomplished with the omitempty flag on metadataPool and dataPools? I do see that your suggestion could be clearer, but CRD migration has a non-trivial cost.
You can achieve the same goal with omitempty, plus eventually adding CRD validation to make sure both are specified or both are omitted. We would not need OnlyManageDaemons in this approach, and this takes us back to the original suggestion earlier in the PR :-)
Agreed, we could certainly use the resource boundary in this case. I preferred the explicit setting because it has implications not just for skipping pool management, but also for skipping creation of the filesystem resource. Currently the filesystem resource has no settings, so it is not visible in the CRD to the user, and it seemed awkward to skip creating it based on not specifying pools. But since it's hidden from the user anyway, it seems reasonable to infer.

Before picking this up again, I want to make sure I'm clear on the approach from the last few comments.

The main downside (apart from complexity) to this is that it breaks the existing user experience for creating a filesystem without specifying any pool names. Currently, if someone leaves pool names out, the names get auto-generated based on the filesystem name, and the pools get created. After this change, if someone leaves pool names out then no filesystem or pools will be created.
@jcsp Sounds good, except one question to clarify... Why does the pool spec need the name of the pool now? I was thinking that if the pool specs are specified, Rook would still create the default pool names. If the pool specs are not included, then the pools would be whatever name the external orchestrator decides. Are you saying there is a need for creating the pools with a name determined by the user even when the pools are specified? If the user is able to specify the name of the pools, it would complicate the pool management since the user could update the CRD with new pool names. Is renaming pools supported?
Oops, I was getting confused by Rook's two... I was worrying that PoolSpecs would be nil when users were creating filesystems with all-defaults, but it looks like the PoolSpec validation already requires that the user explicitly populate at least one replicated/EC option, so that's not an issue. I'm still a bit concerned about the UX for people using...
In this case where the pools don't get created, will the mds daemon pod crash loop? Hopefully that would be a sign that something is going wrong and they would look in the operator log for clues. The operator log could indicate that pools are expected to be created externally. Agreed that this isn't a great way to find errors, but such is the life of a k8s cluster admin.
Great! Can we split out the code to a different PR and just make this one a design PR?
The MDS itself would just sit there in standby mode until it sees the filesystem ID that it's been configured to serve. However, Rook won't get that far. It'll error out of CreateFilesystem when it can't resolve the filesystem name to an ID.
Should Rook check if the pools exist even when it doesn't create them? Then the problem could be caught earlier. And is Rook creating the filesystem when the pools are skipped? I was thinking the filesystem would be skipped as well.
In the case of filesystems, Rook is already de facto checking their existence at the point where it tries to resolve a name to an ID. We can make that message a bit friendlier for the benefit of anyone misusing the feature and inspecting their operator log to find out why a pool-less FilesystemSpec isn't working.

No, Rook won't create the Ceph filesystem if the Rook filesystem was specified without pool settings. But it'll still go through CreateFilesystem (the one in mds.go) -- I didn't realise when I wrote that comment that there were three CreateFilesystem functions in Rook, so didn't think to say which one I meant ;-)
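To make the end state concrete: under this design, a filesystem spec with pool settings is fully managed by Rook, while one without them only runs the MDS daemons against externally created pools and filesystem. A rough sketch with placeholder values, not taken from the PR:

# Fully managed: Rook creates the Ceph filesystem and its pools
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  metadataServer:
    activeCount: 1

# Externally managed: no pool settings, so Rook only runs the MDS daemons
spec:
  metadataServer:
    activeCount: 1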
I've updated the document; one last review, please?
@@ -0,0 +1,120 @@
# Enable external Ceph management tools
Could you add a line that says the target version? There was some discussion that this would help when glancing at design docs down the road, so we know if they are stale. See the top of this doc for an example.
design/external-management.md

## Proposal

In FilesystemSpec and ObjectStoreSpec, group the metadata and
data pool attributes into a new struct called FilesystemPoolsSpec.
I'm concerned about moving the pools under the new pools spec since that would require a migration of CRDs. Do we really need to do that, or can we leave the pools where they are? It seems like we can get the same behavior simply by allowing the existing metadataPool and dataPools elements to be empty.
"Do we really need to do that, or can we leave the pools where they are?"

I would be happy to leave them where they are, but I got the impression that @bassam was objecting to that, and that the desire was to have a harder separation between the cases where pools are specified and where they aren't.
@travisn no. I walked away from the discussion above thinking we would just add omitempty to the pool elements. No need for OnlyManageDaemons or anything else.
(of Ceph entities like pools and filesystems)

Partially fixes: rook#1868

Signed-off-by: John Spray <john.spray@redhat.com>
Updated. This should now reflect the most recent discussion and be ready to merge. I've filled in the target version as 0.9.
@bassam ping, is this ready to go from your point of view?
Signed-off-by: John Spray <john.spray@redhat.com>
This partially addresses #1868 -- it's the part about making pool creation optional, rather than any of the broader ideas around adoption/migration paths.
This is a design doc with some bonus code that implements it. Please review the design doc and treat the code as illustrative -- the reason I've skipped ahead to the implementation is that my ceph-mgr work already uses it, so I have the patches handy.
Resolves #1868
Additions to docs and tests pending consensus on design doc.
Code generation (make codegen) has been run to update object specifications, if necessary.
make vendor does not cause changes.