-
Notifications
You must be signed in to change notification settings - Fork 25.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
builtin: implement, document and test url-parse #1715
base: master
Are you sure you want to change the base?
Conversation
It will be used in more places so it should be placed in url.h. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Welcome to GitGitGadgetHi @matheusmoreira, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that either:
You can CC potential reviewers by adding a footer to the PR description with the following syntax:
Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code. Contributing the patchesBefore you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel
Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):
To send a new iteration, just add another PR comment with the contents: Need help?New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel, |
/allow |
User matheusmoreira is now allowed to use GitGitGadget. |
146cfe3
to
33e1284
Compare
Define general parsing function that supports all Git URLs including scp style URLs such as hostname:~user/repo. Has the same interface as the URL normalization function and uses the same data structures, facilitating its use. It's adapted from the algorithm used to process URLs in connect.c, so it should support the same inputs. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Git commands can accept a rather wide variety of URLs syntaxes. The range of accepted inputs might expand even more in the future. This makes the parsing of URL components difficult since standard URL parsers cannot be used. Extracting the components of a git URL would require implementing all the schemes that git itself supports, not to mention tracking its development continuously in case new URL schemes are added. The url-parse builtin command is designed to solve this problem by exposing git's native URL parsing facilities as a plumbing command. Other programs can then call upon git itself to parse the git URLs and extract their components. This should be quite useful for scripts. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
This function either successfully parses an URL or dies with an error message. Since this is a plumbing command, the error message is not translated. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Create an enumeration containing all possible git URL components which may be selected by the user. The URL_NONE component is used when the user did not request the parsing of any component. In this case, the command will return successfully if the URL parses. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
The extract function returns a newly allocated string whose contents are the specified git URL component. The string must be freed later. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Converts a git URL component name to its corresponding enumeration value so that it can be conveniently used internally by the url-parse command. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Create the data structures expected by the git option parser. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Prepare to handle input by parsing the command line options and removing them from the arguments vector. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Parse all the git URLs given as input on the command line. Die if an URL cannot be parsed. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Parse the specified git URL component from each of the given git URLs and print them to standard output, one per line. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
The new url-parse builtin validates git URLs and optionally extracts their components. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Test git URL parsing, validation and component extraction on all documented git URL schemes and syntaxes. Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
/submit |
Submitted as pull.1715.git.git.1714343461.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Torsten Bögershausen wrote (reply to this): On Sun, Apr 28, 2024 at 10:30:48PM +0000, Matheus Moreira via GitGitGadget wrote:
> Git commands accept a wide variety of URLs syntaxes, not just standard URLs.
> This can make parsing git URLs difficult since standard URL parsers cannot
> be used. Even if an external parser were implemented, it would have to track
> git's development closely in case support for any new URL schemes are added.
>
> These patches introduce a new url-parse builtin command that exposes git's
> native URL parsing algorithms as a plumbing command, allowing other programs
> to then call upon git itself to parse the git URLs and their components.
>
> This should be quite useful for scripts. For example, a script might want to
> add remotes to repositories, naming them according to the domain name where
> the repository is hosted. This new builtin allows it to parse the git URL
> and extract its host name which can then be used as input for other
> operations. This would be difficult to implement otherwise due to git's
> support for scp style URLs.
>
All in all, having a URL parser as such is a good thing, thanks for working
on that.
There are, however, some notes and questions, up for discussion:
- are there any plans to integrate the parser into connect.c and fetch ?
Speaking as a person, who manage to break the parsing of URLs once,
with the good intention to improve things, I need to learn that
test cases are important.
Some work can be seen in t5601-clone.sh
Especially, when dealing with literal IPv6 addresses, the ones with []
and the simplified ssh syntax 'myhost:src' are interesting to test.
Git itself strives to be RFC compliant when parsing URLs, but
we do not fully guarantee to be "fully certified".
And some features using the [] syntax to embedd a port number
inside the simplified ssh syntax had not been documented,
but used in practise, and are now part of the test suite.
See "[myhost:123]:src" in t5601
- Or is this new tool just a helper, to verify "good" URL's,
and not accepting our legacy parser quirks ?
Then we still should see some IPv6 tests ?
Or may be not, as we prefer hostnames these days ?
- One minor comment:
in 02/13 we read:
+enum protocol {
+ PROTO_UNKNOWN = 0,
+ PROTO_LOCAL,
+ PROTO_FILE,
+ PROTO_SSH,
+ PROTO_GIT,
The RFC 1738 uses the term "scheme" here, and using the very generic
term "protocol" may lead to name clashes later.
Would something like "git_scheme" or so be better ?
- One minor comment:
In 13/13 we read:
+ git url-parse "file:///" &&
+ git url-parse "file://"
I think that the "///" version is superflous, it should already
be covered by the "//" version
|
User |
On the Git mailing list, Matheus Afonso Martins Moreira wrote (reply to this): Thank you for your feedback.
> are there any plans to integrate the parser into connect.c and fetch ?
Yes.
That was my intention but I was not confident enough to touch connect.c
before getting feedback from the community, since it's critical code
and it is my first contribution.
I do want to merge all URL parsing in git into this one function though,
thereby creating a "single point of truth". This is so that if the algorithm
is modified the changes are visible to the URL parser builtin as well.
> Speaking as a person, who manage to break the parsing of URLs once,
> with the good intention to improve things, I need to learn that
> test cases are important.
Absolutely agree.
When adding test cases, I looked at the possibilities enumerated in urls.txt
and generated test cases based on those. I also looked at the urlmatch.h
test cases. However...
> Some work can be seen in t5601-clone.sh
... I did not think to check those.
> Especially, when dealing with literal IPv6 addresses,
> the ones with [] and the simplified ssh syntax 'myhost:src'
> are interesting to test.
You're right about that. I shall prepare an updated v2 patchset
with more test cases, and also any other changes/improvements
requested by maintainers.
> And some features using the [] syntax to embedd a port number
> inside the simplified ssh syntax had not been documented,
> but used in practise, and are now part of the test suite.
> See "[myhost:123]:src" in t5601
Indeed, I did not read anything of the sort when I checked it.
Would you like me to commit a note to this effect to urls.txt ?
> Or is this new tool just a helper, to verify "good" URL's,
> and not accepting our legacy parser quirks ?
It is my intention that this builtin be able to accept, parse
and decompose all types of URLs that git itself can accept.
> Then we still should see some IPv6 tests ?
I will add them!
> Or may be not, as we prefer hostnames these days ?
I would have to defer that choice to someone more experienced
with the codebase. Please advise on how to proceed.
> The RFC 1738 uses the term "scheme" here, and using the very generic
> term "protocol" may lead to name clashes later.
> Would something like "git_scheme" or so be better ?
Scheme does seem like a better word if it's the terminology used by RFCs.
I can change that in a new version if necessary.
That code is based on the existing connect.c parsing code though.
> I think that the "///" version is superflous, it should already
> be covered by the "//" version
I thought it was a good idea because of existing precedent:
my first approach to creating the test cases was to copy the
ones from t0110-urlmatch-normalization.sh which did have many
cases such as those. Then as I developed the code I came to
believe that it was not necessary: I call url_normalize
in the url_parse function and url_normalize is already being
tested. I think I just forgot to delete those lines.
Reading that file over once again, it does have IPv6 address
test cases. So I should probably go over it again.
Thanks again for the feedback,
Matheus |
On the Git mailing list, Torsten Bögershausen wrote (reply to this): On Mon, Apr 29, 2024 at 07:04:40PM -0300, Matheus Afonso Martins Moreira wrote:
> Thank you for your feedback.
>
> > are there any plans to integrate the parser into connect.c and fetch ?
>
> Yes.
>
> That was my intention but I was not confident enough to touch connect.c
> before getting feedback from the community, since it's critical code
> and it is my first contribution.
Welcome to the Git community.
I wasn't aware of t0110 as a test case...
>
> I do want to merge all URL parsing in git into this one function though,
> thereby creating a "single point of truth". This is so that if the algorithm
> is modified the changes are visible to the URL parser builtin as well.
>
That is a good thing to do. Be prepared for a longer journey, since we have
this legacy stuff to deal with. But I am happy to help with reviews, even
if that may take some days,
[]
> When adding test cases, I looked at the possibilities enumerated in urls.txt
> and generated test cases based on those. I also looked at the urlmatch.h
> test cases. However...
>
> > Some work can be seen in t5601-clone.sh
>
> ... I did not think to check those.
>
> > Especially, when dealing with literal IPv6 addresses,
> > the ones with [] and the simplified ssh syntax 'myhost:src'
> > are interesting to test.
>
> You're right about that. I shall prepare an updated v2 patchset
> with more test cases, and also any other changes/improvements
> requested by maintainers.
>
> > And some features using the [] syntax to embedd a port number
> > inside the simplified ssh syntax had not been documented,
> > but used in practise, and are now part of the test suite.
> > See "[myhost:123]:src" in t5601
>
> Indeed, I did not read anything of the sort when I checked it.
> Would you like me to commit a note to this effect to urls.txt ?
On short: please not.
This kind of syntax was never ment to be used.
The official "ssh://myhost:123/src" is recommended.
When IPv6 parsing was added, people discovered that it could be
used to "protect" the ':' from being a seperator between the hostname
and the path, and can be used to seperate the hostname from the port.
Once that was used in real live, it was too late to change it.
If we now get a better debug tool, it could mention that this is
a legacy feature, and recommend the longer "ssh://" syntax.
>
> > Or is this new tool just a helper, to verify "good" URL's,
> > and not accepting our legacy parser quirks ?
>
> It is my intention that this builtin be able to accept, parse
> and decompose all types of URLs that git itself can accept.
>
> > Then we still should see some IPv6 tests ?
>
> I will add them!
>
> > Or may be not, as we prefer hostnames these days ?
>
> I would have to defer that choice to someone more experienced
> with the codebase. Please advise on how to proceed.
Re-reading this email conversation,
I think that we should support (in the future),
what we support today.
Having a new parser tool means, that there is a chance to reject
those URLs with the note/hint, that they are depracted, and should
be replaced by a proper one.
From my point of view this means that all existing test case should pass
even with the new parser, as a general approach.
Deprecating things is hard, may take years, and may be done in a seperate
task/patch series. Or may be part of this one, in seperate commits.
>
> > The RFC 1738 uses the term "scheme" here, and using the very generic
> > term "protocol" may lead to name clashes later.
> > Would something like "git_scheme" or so be better ?
>
> Scheme does seem like a better word if it's the terminology used by RFCs.
> I can change that in a new version if necessary.
> That code is based on the existing connect.c parsing code though.
>
> > I think that the "///" version is superflous, it should already
> > be covered by the "//" version
>
> I thought it was a good idea because of existing precedent:
> my first approach to creating the test cases was to copy the
> ones from t0110-urlmatch-normalization.sh which did have many
> cases such as those. Then as I developed the code I came to
> believe that it was not necessary: I call url_normalize
> in the url_parse function and url_normalize is already being
> tested. I think I just forgot to delete those lines.
>
> Reading that file over once again, it does have IPv6 address
> test cases. So I should probably go over it again.
>
> Thanks again for the feedback,
>
> Matheus
> |
@@ -0,0 +1,59 @@ | |||
git-url-parse(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Ghanshyam Thakkar wrote (reply to this):
On Sun, 28 Apr 2024, Matheus Afonso Martins Moreira via GitGitGadget <gitgitgadget@gmail.com> wrote:
> +* Print the path:
> ++
> +------------
> +$ git url-parse --component path https://example.com/user/repo
> +/usr/repo
s/usr/user/
Thanks.
User |
@@ -3,6 +3,7 @@ | |||
#include "hex-ll.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Ghanshyam Thakkar wrote (reply to this):
On Sun, 28 Apr 2024, Matheus Afonso Martins Moreira via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
>
> Define general parsing function that supports all Git URLs
> including scp style URLs such as hostname:~user/repo.
> Has the same interface as the URL normalization function
> and uses the same data structures, facilitating its use.
> It's adapted from the algorithm used to process URLs in connect.c,
> so it should support the same inputs.
>
> Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
> ---
> urlmatch.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> urlmatch.h | 1 +
> 2 files changed, 91 insertions(+)
>
> diff --git a/urlmatch.c b/urlmatch.c
> index 1d0254abacb..5a442e31fa2 100644
> --- a/urlmatch.c
> +++ b/urlmatch.c
> @@ -3,6 +3,7 @@
> #include "hex-ll.h"
> #include "strbuf.h"
> #include "urlmatch.h"
> +#include "url.h"
>
> #define URL_ALPHA "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
> #define URL_DIGIT "0123456789"
> @@ -438,6 +439,95 @@ char *url_normalize(const char *url, struct url_info *out_info)
> return url_normalize_1(url, out_info, 0);
> }
>
> +enum protocol {
> + PROTO_UNKNOWN = 0,
> + PROTO_LOCAL,
> + PROTO_FILE,
> + PROTO_SSH,
> + PROTO_GIT,
> +};
> +
> +static enum protocol url_get_protocol(const char *name, size_t n)
> +{
> + if (!strncmp(name, "ssh", n))
> + return PROTO_SSH;
> + if (!strncmp(name, "git", n))
> + return PROTO_GIT;
> + if (!strncmp(name, "git+ssh", n)) /* deprecated - do not use */
> + return PROTO_SSH;
> + if (!strncmp(name, "ssh+git", n)) /* deprecated - do not use */
> + return PROTO_SSH;
> + if (!strncmp(name, "file", n))
> + return PROTO_FILE;
> + return PROTO_UNKNOWN;
> +}
> +
> +char *url_parse(const char *url_orig, struct url_info *out_info)
> +{
> + struct strbuf url;
> + char *host, *separator;
> + char *detached, *normalized;
> + enum protocol protocol = PROTO_LOCAL;
> + struct url_info local_info;
> + struct url_info *info = out_info? out_info : &local_info;
> + bool scp_syntax = false;
> +
> + if (is_url(url_orig)) {
> + url_orig = url_decode(url_orig);
> + } else {
> + url_orig = xstrdup(url_orig);
> + }
> +
> + strbuf_init(&url, strlen(url_orig) + sizeof("ssh://"));
> + strbuf_addstr(&url, url_orig);
> +
> + host = strstr(url.buf, "://");
> + if (host) {
> + protocol = url_get_protocol(url.buf, host - url.buf);
> + host += 3;
> + } else {
> + if (!url_is_local_not_ssh(url.buf)) {
> + scp_syntax = true;
> + protocol = PROTO_SSH;
> + strbuf_insertstr(&url, 0, "ssh://");
> + host = url.buf + 6;
> + }
> + }
Interesting.
`
$ ./git url-parse -c protocol file:/test/test
ssh
`
seems like only having a single slash after the 'protocol:' prints
'ssh' always (I think this may not even be a valid url). After this 'else'
block, the url turns into 'ssh://file/test/test'. Will examine the details
later. Not that it's your code's doing, and rather the result of
url_is_local_not_ssh(). But just wanted to point this out and ask if this
should error out or is this an intended behavior that I can't figure out.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Torsten Bögershausen wrote (reply to this):
[]
> Interesting.
>
> `
> $ ./git url-parse -c protocol file:/test/test
> ssh
> `
>
> seems like only having a single slash after the 'protocol:' prints
> 'ssh' always (I think this may not even be a valid url). After this 'else'
> block, the url turns into 'ssh://file/test/test'. Will examine the details
> later. Not that it's your code's doing, and rather the result of
> url_is_local_not_ssh(). But just wanted to point this out and ask if this
> should error out or is this an intended behavior that I can't figure out.
ssh is the correct answer, try something like
`git clone localhost:/home/myself/project/git.git`
It is the scp syntax, supported by Git as well.
From `man scp`
scp copies files between hosts on a network.
[]
The source and target may be specified as a local pathname,
a remote host with optional path in the form
[user@]host:[path],
or a URI in the form scp://[user@]host[:port][/path].
Local file names can be made explicit using absolute or relative pathnames
to avoid scp treating file names containing ‘:’ as host specifiers.
So yes, they share similar problems
with the ':' that could mean different things when using the short form.
Git commands accept a wide variety of URLs syntaxes,
not just standard URLs. This can make parsing git URLs
difficult since standard URL parsers cannot be used.
Even if an external parser were implemented, it would
have to track git's development closely in case support
for any new URL schemes are added.
These patches introduce a new url-parse builtin command
that exposes git's native URL parsing algorithms as a
plumbing command, allowing other programs to then call
upon git itself to parse the git URLs and their components.
This should be quite useful for scripts. For example, a script
might want to add remotes to repositories, naming them according
to the domain name where the repository is hosted. This new builtin
allows it to parse the git URL and extract its host name which can
then be used as input for other operations. This would be difficult
to implement otherwise due to git's support for scp style URLs.
Signed-off-by: Matheus Afonso Martins Moreira matheus@matheusmoreira.com
cc: Torsten Bögershausen tboegi@web.de
cc: Ghanshyam Thakkar shyamthakkar001@gmail.com