-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FTP support to jobstores #5142
Conversation
""" | ||
FTP access with upload. | ||
|
||
Taken and modified from https://github.com/ohsu-comp-bio/cwl-tes/blob/03f0096f9fae8acd527687d3460a726e09190c3a/cwl_tes/ftp.py#L8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The license at cwl-tes is also Apache 2.0. Should I add a section to the Toil license that specifically states all code under this class is under cwl-tes's Apache 2.0 license?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to repeat the license.
If there are code quotes, then you could mention the copyright of the original authors / copyright holders:
Copyright 2018 EMBL-EBI <contracts@ebi.ac.uk> (Michael R. Crusoe's original implementation, written under contract)
Copyright 2019 Adam Struck <adamjstruck@gmail.com>
Copyright 2020 Joris Vankerschaver <jvankerschaver@enthought.com>
Copyright 2020 Dab Brill <dbrill@enthought.com>
Copyright 2020 Manabu ISHII <manabu.ishii.rb@gmail.com>
Otherwise if this implementation is different enough, you don't need to mention the authors.
I suggest this metric: if we wanted to change the license of Toil, would you have to get permission from these copyright holders because the code is essentially the same? Or is the code so adapted or different for other reasons that it is a new work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One approach is just add their Apache 2.0 license's copyright line to Toil's LICENSE file along with the others.
You could also drop it in a different file and import it to here, and in that file have Copyright 2017 Oregon Health and Science University
and the copyright boilerplate we use for our files. I think.
Also the #8
at the end of this link is pointing to the wrong line number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's easier to do the latter by putting the code in a separate file, so I'll do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some more documentation about how FTP-over-SSL (which I think is rather unusual) is being used here, and how we're handling non-SSL FTP servers vs. those without certificates (which themselves might actually be downgrade attacks).
We need to write to a user who doesn't necessarily know that FTP can happen over SSL.
""" | ||
FTP access with upload. | ||
|
||
Taken and modified from https://github.com/ohsu-comp-bio/cwl-tes/blob/03f0096f9fae8acd527687d3460a726e09190c3a/cwl_tes/ftp.py#L8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One approach is just add their Apache 2.0 license's copyright line to Toil's LICENSE file along with the others.
You could also drop it in a different file and import it to here, and in that file have Copyright 2017 Oregon Health and Science University
and the copyright boilerplate we use for our files. I think.
Also the #8
at the end of this link is pointing to the wrong line number.
def __init__( | ||
self, cache: Optional[dict[Any, ftplib.FTP]] = None, insecure: bool = False | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have docstrings for this and all the other functions in the class, ideally. It's especially important with a parameter called insecure
; what is not being secured, and what is it not being secured from? Does it mean we're vulnerable to password sniffing and someone tampering with the FTP data in transit, or are we now executing arbitrary code from every file we download, or what?
user = env_user | ||
if env_passwd: | ||
passwd = env_passwd | ||
ftp.login(user or "", passwd or "", secure=not self.insecure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this using an undocumented secure
parameter? What does that do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftp in this case is an FTP_TLS object, which supports connecting over TLS. It's not on the official FTP_TLS python documentation, nor is there an associated docstring, but the code itself simply sets up an SSL connection to the FTP server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should cut the feature where we require user approval for an insecure FTP connection. (We might also want to then cut/flag off .netrc support?) If we do implement that, we also need to implement it for the actual requests, not just pass off to urllib
.
I also think we should address the cached FTP connections that seem to live forever. Maybe we want them to time out somehow? Or maybe we can just bypass the cache or not store a persistent FtpFsAccess
in a class-level variable.
cls._setup_ftp() | ||
# mypy is unable to understand that ftp must exist by this point | ||
assert cls.ftp is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you structured this as a method that returns an FtpFsAccess and not one that has a side effect of setting a class-level field, you wouldn't need all these asserts and you also wouldn't need to remember two member names.
ftp = None | ||
|
||
@classmethod | ||
def _setup_ftp(cls) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird that the ftp
member is public, but the function that must be called before it is actually used has an underscore prefix.
src/toil/jobStores/ftp_utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be better placed in src/toil/lib
.
@@ -1872,15 +1874,28 @@ class JobStoreSupport(AbstractJobStore, metaclass=ABCMeta): | |||
stores. | |||
""" | |||
|
|||
ftp = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep a class-level FtpFsAccess
instance that is never destroyed, and the FtpFsAccess
instance caches open FTP connections to servers, when do we close our FTP connections? It might be never, and that's probably not what we want.
src/toil/jobStores/ftp_utils.py
Outdated
:param cache: cache of generated FTP objects | ||
:param insecure: Whether to connect over FTP with TLS | ||
""" | ||
self.cache = cache or {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to be able to turn off using a connection cache at all, if we're going to make an FtpFsAccess that is never destroyed.
src/toil/jobStores/ftp_utils.py
Outdated
user = env_user | ||
if env_passwd: | ||
passwd = env_passwd | ||
ftp.login(user or "", passwd or "", secure=not self.insecure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so according to the obviously reliable https://tedboy.github.io/python_stdlib/_modules/ftplib.html#FTP_TLS.login what secure
does here is call auth()
if you haven't already set up SSL/TLS on the connection, before doing the login.
So it looks like unless the user sets TOIL_FTP_USE_SSL=False
themselves, they're going to get an error here whenever they use an ftp://
URL, unless the server actually supports SSL/TLS.
One server people might want to use is ftp://ftp-trace.ncbi.nlm.nih.gov
. I've just checked and it doesn't support FTP over SSL/TLS.
Another thing we don't do is call the prot_p()
method on the connection to actually secure the data channel. And later we fall back to urlopen()
and give it the full FTP URL with username and password, and I don't think that urlopen()
can use or enforce SSL. We'd have to tinker with urlopen
's handler as recommended in https://stackoverflow.com/a/73171311.
So I think that by default we should allow unencrypted FTP connections without user intervention to accept them. That's the current behavior.
If we want to change Toil to enforce secure FTP by default, then we also need to enforce security on the actual reads. And we probably would need a way to individually allow particular insecure servers?
…ssues/5134-ftp-get-size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The promised SSL enforcement doesn't actually work yet because of the fallback to urlopen()
.
@@ -1872,18 +1874,29 @@ class JobStoreSupport(AbstractJobStore, metaclass=ABCMeta): | |||
stores. | |||
""" | |||
|
|||
@classmethod | |||
def _setup_ftp(cls) -> FtpFsAccess: | |||
return FtpFsAccess(insecure=strtobool(os.environ.get('TOIL_FTP_USE_SSL', 'False')) is False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it looks like we don't reuse FTP connections at all. But I guess that's fine? There's nothing like a time-based @memoize
; we'd need a thread to watch it and that would be a pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can leave a comment that this doesn't reuse FTP connections, and that implementing it later may be a good idea
""" | ||
if "r" in mode: | ||
host, port, user, passwd, path = self._parse_url(fn) | ||
handle = urlopen("ftp://{}:{}@{}:{}/{}".format(user, passwd, host, port, path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't respect TOIL_FTP_USE_SSL
.
If we implemented open()
on top of https://docs.python.org/3/library/ftplib.html#ftplib.FTP.retrbinary and supporting files larger than memory, we'd need some kind of pipe file object and a thread to feed the data into it.
We have some pipe implementations over in
toil/src/toil/jobStores/utils.py
Line 153 in a5657cf
class ReadablePipe(ABC): |
open()
needs to produce a file object that Just Works and have all the cleanup happen automatically when it goes out of scope and is closed. We might need to use closing()
to help with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed to put into another issue as it appears to be too large of an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the SSL enforcement feature I think this looks OK.
Closes #5134
Also turns all url_exists requests and size requests to HEAD requests instead, as mentioned here: #5114 (comment)
In my local testing, this significantly improves FTP performance. (Previously an existence check on a small text file on an insecure FTP server took around 5 seconds, though I'm unsure why it would take so long originally)
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thing
in the Toil repo, or from an external repo.camelCase
that want to be insnake_case
.docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist