-
-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scripts with shebangs but no file extension are skipped #115
Comments
Intentional design choice as you suggest. The first being because the only way scc can determine the file is via the extension and the second is performance. This solves the issue of loading binary files such as scc itself because it has no extension. They usually get identified as a binary file in processing, but this cuts that down. That said this is something I have been looking at recently. I was toying around with scc and found the following issue #114 Part of fixing the above would mean that scc is able to determine which files are full matches or not. At this point files with no extension could be labelled as something like "PotentialSheBang" and inside the worker have the first 100 bytes or so inspected to try and determine the language. Tokei does this BTW https://github.com/XAMPPRocky/tokei/blob/2c41fdb182b7ba1bcc7c9d1cfc0de569538e7f0d/src/language/language_type.hbs.rs#L458 however I am not sure if that is the correct approach as it appears it would fail with your example If you can help provide a list of test cases for this I will ensure it makes it into the next release. |
Hmm, that Tokei logic is very limited. I think the correct logic is (limited to the first X bytes or so):
cloc has a good list of interpreters it maps to languages - https://github.com/AlDanial/cloc/blob/master/cloc#L7636 I think the minimum set of special ones are perl, perl5, python, python2, python3, php, php5 (I don't believe they did a separate binary for php7). I'm not sure it's worth going down the rabbit hole of mapping python3.3, 3.4, etc... I can definitely help with test cases. What did you have in mind for test case format - a list of shebangs and what language they should map to, or a collection of sample code files, or something else? |
Raises the question does #! work with BOM. I would imagine so but something to check. I cannot read the cloc code sorry. I need to ensure I don't accidentally copy anything due to its use of GPL. Not opposed to GPL but I make this MIT/Unlicense for a specific reason. Just want to ensure that I don't accidentally introduce any issues. Usually it is safe to look at the tokei code though because there is no way to implement the same thing in Go due to the lack of optionals and functional programming. For test cases all I really need is a list of mappings for test cases. EG
The more important ones being the positive cases. I have been looking at the wiki page for #! and trying to work out the edge cases. I believe something like the below should work.
The lookup may need to look for the above using |
A bit of checking around suggests that #! does not work with BOM since it predates unicode. Id have to check the implementation itself to be sure though. |
Confirmed. That makes the implementation a little easier. |
Ah sorry, I hadn't noticed the licensing difference. 🙂
It seems like there's two main ways of doing it - maintaining a list of interpreters to language and iterating through each, seeing if they match the shebang line (as you described). Or going the other way around and parsing out the correct field of the shebang and looking it up first in the languageDatabase map (since most interpreter names match language name), and then in a 'wonky cases' map (with a handful of things like python3, node, etc.). I think it depends on if you want to default to trying to find a match for the potentially long-tail of languages supported, or only keep a whitelist. Here are some csv test cases. In the Perl one I enumerated the edge cases (different locations, env vs no env, command-line flags, and whitespace). In the others, I only have a single test case:
And depending on if you do or do not want to fallback to the languageDatabase, these are languages whose interpreter name matches the language name:
|
The licencing thing is just me being paranoid, but its one of those things you don't want to accidentally make a mistake with. I think I would rather whitelist for the moment. Its faster to process for a start and it cuts down on the edge cases which always seem to bite me otherwise. Thanks for supplying the above. I will resolve the issue with #114 first which sets this up to be easier to implement. |
Not quite ready for merge yet, but @elindsey if you want please checkout the following branch https://github.com/boyter/scc/tree/114 and build. This should recognise the #! for languages based on the list you provided. I didn't include java,scala,swift,groovy,dart because I was unable to find any examples of them in the wild. I am going to refactor it to move the definitions into the language.json file I think because there is no reason to have it separated out. Assuming that's all good then I will look to merge it in. |
Local testing looks good to me! I hit an edge case that looks new to this feature though: Two folders, test and .test, both contain two files with the same contents, file and file.pl.
Folder level counts are correct:
Individual file with extension counts are correct:
But a no extension file in a hidden folder produces a zero count:
|
Odd... I am unable to replicate the above.
What shell are you using? Its possible that it may be trying to be intelligent with its processing which is causing your result. |
I've repro'd on macos 10.14.6 with bash 5.0.11/zsh 5.3/tcsh 6.18.01 and ubuntu 18.04 with bash 4.4.20; it doesn't look environment related. Does it work correctly in your environment when the hidden folder isn't the first entry in the path?
|
Looks like it's this logic https://github.com/boyter/scc/blob/114/processor/file.go#L192 Should this be calling filepath.Base on the second argument? https://github.com/boyter/scc/blob/114/processor/file.go#L81 |
Can confirm I am able to replicate now. Seems the deeper folder is the trick. As for the cause... I did change that code because with the new #! lookup it needs to assume that any file without an extension or with . at the start is potentially a #! This meant it was picking up additional files such as those in the .git folder which should be excluded by the denylist. Adding Base on the end of the second might resolve it. Will add some test cases to catch this first then try fixing it. |
Ah I see the issue. Its due to passing in the full name of the file, which means the file job is dealing with the full path to the file. You are close, it needs to pass in the filename on the first argument and the path on the second. Ill add a test case for this to resolve it. |
Wait I have that backwards. Yes you are 100% correct. |
Fix should be sitting in the branch again for you to try out. Appears to be working correctly for me now. |
Looks good to me! I'm really excited for this, thanks for adding it. |
Merged into master. |
Describe the bug
I'm not sure if this is a bug, a feature request, or an intentional decision for performance. 🙂
Scripts with a shebang but no file extension are not counted.
To Reproduce
Expected behavior
It would be neat if scc could use shebang lines for language categorization.
The text was updated successfully, but these errors were encountered: