Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripts with shebangs but no file extension are skipped #115

Closed
elindsey opened this issue Sep 29, 2019 · 18 comments
Closed

Scripts with shebangs but no file extension are skipped #115

elindsey opened this issue Sep 29, 2019 · 18 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@elindsey
Copy link
Contributor

Describe the bug
I'm not sure if this is a bug, a feature request, or an intentional decision for performance. 🙂

Scripts with a shebang but no file extension are not counted.

To Reproduce

[elindsey@worktop:~/.bin]
$ cat test.pl 
#!/usr/bin/env perl

print "Hello, World!\n";
[elindsey@worktop:~/.bin]
$ scc test.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/.bin]
$ mv test.pl test
[elindsey@worktop:~/.bin]
$ scc test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/.bin]
$ cloc test
       1 text file.
       1 unique file.                              
       0 files ignored.

github.com/AlDanial/cloc v 1.84  T=0.01 s (191.1 files/s, 573.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Perl                             1              1              0              2
-------------------------------------------------------------------------------

Expected behavior

It would be neat if scc could use shebang lines for language categorization.

@elindsey elindsey changed the title Scripts with shebangs no file extension are skipped Scripts with shebangs but no file extension are skipped Sep 29, 2019
@boyter
Copy link
Owner

boyter commented Sep 29, 2019

Intentional design choice as you suggest. The first being because the only way scc can determine the file is via the extension and the second is performance. This solves the issue of loading binary files such as scc itself because it has no extension. They usually get identified as a binary file in processing, but this cuts that down.

That said this is something I have been looking at recently. I was toying around with scc and found the following issue #114

Part of fixing the above would mean that scc is able to determine which files are full matches or not.

At this point files with no extension could be labelled as something like "PotentialSheBang" and inside the worker have the first 100 bytes or so inspected to try and determine the language.

Tokei does this BTW https://github.com/XAMPPRocky/tokei/blob/2c41fdb182b7ba1bcc7c9d1cfc0de569538e7f0d/src/language/language_type.hbs.rs#L458 however I am not sure if that is the correct approach as it appears it would fail with your example #!/usr/bin/env perl

If you can help provide a list of test cases for this I will ensure it makes it into the next release.

@boyter boyter added enhancement New feature or request help wanted Extra attention is needed labels Sep 29, 2019
@elindsey
Copy link
Contributor Author

Hmm, that Tokei logic is very limited. I think the correct logic is (limited to the first X bytes or so):

  1. Detect if the first two characters are a shebang (no whitespace is allowed before the shebang)
  2. Skip arbitrary whitespace to the first token (whitespace is allowed between the shebang and interpreter)
  3. 'basename' the token
  4. If basenamed token is 'env', skip arbitrary whitespace to the next token and use that for language lookup. Else, use basenamed token for the language lookup

cloc has a good list of interpreters it maps to languages - https://github.com/AlDanial/cloc/blob/master/cloc#L7636

I think the minimum set of special ones are perl, perl5, python, python2, python3, php, php5 (I don't believe they did a separate binary for php7). I'm not sure it's worth going down the rabbit hole of mapping python3.3, 3.4, etc...

I can definitely help with test cases. What did you have in mind for test case format - a list of shebangs and what language they should map to, or a collection of sample code files, or something else?

@boyter
Copy link
Owner

boyter commented Sep 30, 2019

Raises the question does #! work with BOM. I would imagine so but something to check.

I cannot read the cloc code sorry. I need to ensure I don't accidentally copy anything due to its use of GPL. Not opposed to GPL but I make this MIT/Unlicense for a specific reason. Just want to ensure that I don't accidentally introduce any issues. Usually it is safe to look at the tokei code though because there is no way to implement the same thing in Go due to the lack of optionals and functional programming.

For test cases all I really need is a list of mappings for test cases. EG

#!/usr/bin/env perl -> perl
#!/usr/bin/perl > perl
#!/usr/bin/php > php
#!/usr/bin/php5 > php
#!/usr/bin/nothing -> Unknown

The more important ones being the positive cases.

I have been looking at the wiki page for #! and trying to work out the edge cases. I believe something like the below should work.

For files with extension
Check the first two bytes (ignoreing BOM) for #!
If present check the next 50 bytes or to newline if present
Check for presense of perl, php, python etc... and if so map it

The lookup may need to look for the above using /perl or perl with space to ensure it matchs correctly though,

@boyter
Copy link
Owner

boyter commented Sep 30, 2019

A bit of checking around suggests that #! does not work with BOM since it predates unicode. Id have to check the implementation itself to be sure though.

@boyter
Copy link
Owner

boyter commented Sep 30, 2019

# bboyter @ SurfaceBook2 in ~/Projects [8:38:49]
$ ./withbom
./withbom: 1: ./withbom: #!/usr/bin/python: not found
Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/%{ <-- HERE (.*?)}/ at /usr/bin/print line 528.                                                                       Error: no such file "Hello"
# bboyter @ SurfaceBook2 in ~/Projects [8:38:59] C:2
$ ./withoutbom
Hello                                           

Confirmed. That makes the implementation a little easier.

@elindsey
Copy link
Contributor Author

elindsey commented Oct 1, 2019

Ah sorry, I hadn't noticed the licensing difference. 🙂

I believe something like the below should work.
Check for presense of perl, php, python etc... and if so map it

It seems like there's two main ways of doing it - maintaining a list of interpreters to language and iterating through each, seeing if they match the shebang line (as you described). Or going the other way around and parsing out the correct field of the shebang and looking it up first in the languageDatabase map (since most interpreter names match language name), and then in a 'wonky cases' map (with a handful of things like python3, node, etc.). I think it depends on if you want to default to trying to find a match for the potentially long-tail of languages supported, or only keep a whitelist.
Since 99% of things should be covered by shell/perl/python/ruby, I'm not sure it matters very much either way.

Here are some csv test cases. In the Perl one I enumerated the edge cases (different locations, env vs no env, command-line flags, and whitespace). In the others, I only have a single test case:

perl,#!/usr/bin/perl                                                                                                                                                           
perl,#!  /usr/bin/perl
perl,#!/usr/bin/perl -w
perl,#!/usr/bin/env perl
perl,#!  /usr/bin/env   perl
perl,#!/usr/bin/env perl -w
perl,#!  /usr/bin/env   perl   -w  
perl,#!/opt/local/bin/perl
perl,#!/usr/bin/perl5

php,#!/usr/bin/php
php,#!/usr/bin/php5

python,#!/usr/bin/python
python,#!/usr/bin/python2
python,#!/usr/bin/python3

awk,#!/usr/bin/awk
awk,#!/usr/bin/gawk
awk,#!/usr/bin/mawk

csh,#!/bin/csh
csh,#!/bin/tcsh

d,#!/usr/bin/env rdmd

erlang,#!/usr/bin/env escript
javascript,#!/usr/bin/env node
lisp,#!/usr/local/bin/sbcl
lisp,#!/usr/bin/env sbcl
scheme,#!/usr/bin/env racket

And depending on if you do or do not want to fallback to the languageDatabase, these are languages whose interpreter name matches the language name:

java,#!/opt/java/jdk-11/bin/java --source 11
bash,/bin/bash
dart,/usr/bin/env dart
fish,/bin/fish
groovy,/usr/bin/groovy
korn,/bin/ksh
lua,/usr/bin/env lua
ruby,/usr/bin/ruby
scala,/usr/bin/env scala
sed,usr/bin/sed                                                                                                                                                                
shell,/bin/sh
swift,/usr/bin/env swift
tcl,/usr/bin/env tcl
zsh,/bin/zsh

@boyter
Copy link
Owner

boyter commented Oct 1, 2019

The licencing thing is just me being paranoid, but its one of those things you don't want to accidentally make a mistake with.

I think I would rather whitelist for the moment. Its faster to process for a start and it cuts down on the edge cases which always seem to bite me otherwise.

Thanks for supplying the above. I will resolve the issue with #114 first which sets this up to be easier to implement.

@boyter
Copy link
Owner

boyter commented Oct 3, 2019

Not quite ready for merge yet, but @elindsey if you want please checkout the following branch https://github.com/boyter/scc/tree/114 and build. This should recognise the #! for languages based on the list you provided. I didn't include java,scala,swift,groovy,dart because I was unable to find any examples of them in the wild.

I am going to refactor it to move the definitions into the language.json file I think because there is no reason to have it separated out. Assuming that's all good then I will look to merge it in.

@elindsey
Copy link
Contributor Author

elindsey commented Oct 3, 2019

Local testing looks good to me!

I hit an edge case that looks new to this feature though:

Two folders, test and .test, both contain two files with the same contents, file and file.pl.

[elindsey@worktop:~/Downloads]
$ ls test/
file  file.pl
[elindsey@worktop:~/Downloads]
$ ls .test/
file  file.pl
[elindsey@worktop:~/Downloads]
$ cat test/file
#!/usr/bin/env perl

print "Hello, World\n";
[elindsey@worktop:~/Downloads]
$ cat test/file.pl 
#!/usr/bin/env perl

print "Hello, World\n";

Folder level counts are correct:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

Individual file with extension counts are correct:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/file.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/file.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

But a no extension file in a hidden folder produces a zero count:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

@boyter
Copy link
Owner

boyter commented Oct 3, 2019

Odd... I am unable to replicate the above.

# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:37]
$ find .
./.test
./.test/test
./test
./test/test

# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:38]
$ scc --no-cocomo
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         8        4         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         8        4         2        2          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:42]
$ scc --no-cocomo test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:46]
$ scc --no-cocomo .test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:51]
$ scc --no-cocomo test/test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:54]
$ scc --no-cocomo .test/test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────

What shell are you using? Its possible that it may be trying to be intelligent with its processing which is causing your result.

@elindsey
Copy link
Contributor Author

elindsey commented Oct 4, 2019

I've repro'd on macos 10.14.6 with bash 5.0.11/zsh 5.3/tcsh 6.18.01 and ubuntu 18.04 with bash 4.4.20; it doesn't look environment related.

Does it work correctly in your environment when the hidden folder isn't the first entry in the path?

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc .test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ mkdir tmp
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ mv .test/ tmp/
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc tmp/.test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

@elindsey
Copy link
Contributor Author

elindsey commented Oct 4, 2019

Looks like it's this logic https://github.com/boyter/scc/blob/114/processor/file.go#L192

Should this be calling filepath.Base on the second argument? https://github.com/boyter/scc/blob/114/processor/file.go#L81

@boyter
Copy link
Owner

boyter commented Oct 4, 2019

Can confirm I am able to replicate now. Seems the deeper folder is the trick. As for the cause... I did change that code because with the new #! lookup it needs to assume that any file without an extension or with . at the start is potentially a #! This meant it was picking up additional files such as those in the .git folder which should be excluded by the denylist.

Adding Base on the end of the second might resolve it. Will add some test cases to catch this first then try fixing it.

@boyter
Copy link
Owner

boyter commented Oct 4, 2019

Ah I see the issue. Its due to passing in the full name of the file, which means the file job is dealing with the full path to the file.

You are close, it needs to pass in the filename on the first argument and the path on the second. Ill add a test case for this to resolve it.

@boyter
Copy link
Owner

boyter commented Oct 4, 2019

Wait I have that backwards. Yes you are 100% correct.

@boyter
Copy link
Owner

boyter commented Oct 4, 2019

Fix should be sitting in the branch again for you to try out. Appears to be working correctly for me now.

@elindsey
Copy link
Contributor Author

elindsey commented Oct 4, 2019

Looks good to me! I'm really excited for this, thanks for adding it.

@boyter
Copy link
Owner

boyter commented Oct 13, 2019

Merged into master.

@boyter boyter closed this as completed Oct 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants