Scripts with shebangs but no file extension are skipped #115

elindsey · 2019-09-29T15:03:46Z

Describe the bug
I'm not sure if this is a bug, a feature request, or an intentional decision for performance. 🙂

Scripts with a shebang but no file extension are not counted.

To Reproduce

[elindsey@worktop:~/.bin]
$ cat test.pl 
#!/usr/bin/env perl

print "Hello, World!\n";
[elindsey@worktop:~/.bin]
$ scc test.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/.bin]
$ mv test.pl test
[elindsey@worktop:~/.bin]
$ scc test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/.bin]
$ cloc test
       1 text file.
       1 unique file.                              
       0 files ignored.

github.com/AlDanial/cloc v 1.84  T=0.01 s (191.1 files/s, 573.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Perl                             1              1              0              2
-------------------------------------------------------------------------------

Expected behavior

It would be neat if scc could use shebang lines for language categorization.

boyter · 2019-09-29T22:11:56Z

It is an intentional design choice, as you suggest, for two reasons. The first is that the only way scc can determine a file's language is via its extension; the second is performance. Skipping extensionless files avoids loading binaries such as scc itself, which has no extension. Those usually get identified as binary during processing anyway, but this cuts that work down.

That said this is something I have been looking at recently. I was toying around with scc and found the following issue #114

Part of fixing the above would mean that scc is able to determine which files are full matches or not.

At this point files with no extension could be labelled as something like "PotentialSheBang" and inside the worker have the first 100 bytes or so inspected to try and determine the language.

Tokei does this BTW https://github.com/XAMPPRocky/tokei/blob/2c41fdb182b7ba1bcc7c9d1cfc0de569538e7f0d/src/language/language_type.hbs.rs#L458 however I am not sure if that is the correct approach as it appears it would fail with your example #!/usr/bin/env perl

If you can help provide a list of test cases for this I will ensure it makes it into the next release.

elindsey · 2019-09-30T16:57:10Z

Hmm, that Tokei logic is very limited. I think the correct logic is (limited to the first X bytes or so):

Detect if the first two characters are a shebang (no whitespace is allowed before the shebang)
Skip arbitrary whitespace to the first token (whitespace is allowed between the shebang and interpreter)
'basename' the token
If basenamed token is 'env', skip arbitrary whitespace to the next token and use that for language lookup. Else, use basenamed token for the language lookup
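
The four steps above can be sketched in Go roughly as follows. This is a minimal illustration, not scc's implementation: `shebangInterpreter` is a hypothetical helper name, it expects the caller to pass in only the first line (already limited to the first X bytes), and it does not handle flags placed between `env` and the interpreter.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// shebangInterpreter returns the interpreter name found on a script's first
// line, or "" if the line is not a shebang. Hypothetical helper for
// illustration only.
func shebangInterpreter(firstLine string) string {
	// Step 1: the shebang must be the very first two characters.
	if !strings.HasPrefix(firstLine, "#!") {
		return ""
	}
	// Step 2: strings.Fields skips arbitrary whitespace between tokens.
	fields := strings.Fields(firstLine[2:])
	if len(fields) == 0 {
		return ""
	}
	// Step 3: basename the first token.
	name := path.Base(fields[0])
	// Step 4: for env, the next token names the interpreter.
	if name == "env" {
		if len(fields) < 2 {
			return ""
		}
		name = path.Base(fields[1])
	}
	return name
}

func main() {
	for _, line := range []string{
		"#!/usr/bin/perl",
		"#!  /usr/bin/env   perl   -w  ",
		" #!/usr/bin/perl", // leading whitespace: not a shebang
	} {
		fmt.Printf("%q -> %q\n", line, shebangInterpreter(line))
	}
}
```

The returned name would then be fed into whatever interpreter-to-language lookup is chosen.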

cloc has a good list of interpreters it maps to languages - https://github.com/AlDanial/cloc/blob/master/cloc#L7636

I think the minimum set of special ones are perl, perl5, python, python2, python3, php, php5 (I don't believe they did a separate binary for php7). I'm not sure it's worth going down the rabbit hole of mapping python3.3, 3.4, etc...

I can definitely help with test cases. What did you have in mind for test case format - a list of shebangs and what language they should map to, or a collection of sample code files, or something else?

boyter · 2019-09-30T22:29:06Z

That raises the question: does #! work with a BOM? I would imagine so, but it is something to check.

I cannot read the cloc code sorry. I need to ensure I don't accidentally copy anything due to its use of GPL. Not opposed to GPL but I make this MIT/Unlicense for a specific reason. Just want to ensure that I don't accidentally introduce any issues. Usually it is safe to look at the tokei code though because there is no way to implement the same thing in Go due to the lack of optionals and functional programming.

For test cases, all I really need is a list of mappings, e.g.

#!/usr/bin/env perl -> perl
#!/usr/bin/perl -> perl
#!/usr/bin/php -> php
#!/usr/bin/php5 -> php
#!/usr/bin/nothing -> Unknown

The more important ones being the positive cases.

I have been looking at the wiki page for #! and trying to work out the edge cases. I believe something like the below should work.

For files without an extension:
Check the first two bytes (ignoring BOM) for #!
If present, check the next 50 bytes, or up to the newline if one comes first
Check for the presence of perl, php, python etc... and if so map it

The lookup may need to look for the above using /perl or perl with a leading space to ensure it matches correctly though.
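
A rough Go sketch of that substring approach, under stated assumptions: `guessLanguage` and the `interpreters` list are made-up names for illustration, and the list holds only three entries. It reads up to the newline or 50 bytes after the #!, then looks for each name preceded by "/" or a space.

```go
package main

import (
	"bytes"
	"fmt"
)

// interpreters is a hypothetical allowlist; names and labels are
// illustrative only.
var interpreters = []struct{ name, language string }{
	{"perl", "Perl"},
	{"php", "PHP"},
	{"python", "Python"},
}

// guessLanguage inspects the shebang line (capped at 50 bytes after "#!")
// and returns a language label, or "" when nothing matches.
func guessLanguage(content []byte) string {
	if !bytes.HasPrefix(content, []byte("#!")) {
		return ""
	}
	line := content[2:]
	if i := bytes.IndexByte(line, '\n'); i >= 0 {
		line = line[:i]
	}
	if len(line) > 50 {
		line = line[:50]
	}
	for _, in := range interpreters {
		// Require a leading "/" or " " so a name buried inside another
		// word (for example "superperl") does not match.
		if bytes.Contains(line, []byte("/"+in.name)) ||
			bytes.Contains(line, []byte(" "+in.name)) {
			return in.language
		}
	}
	return ""
}

func main() {
	fmt.Println(guessLanguage([]byte("#!/usr/bin/env perl\nprint 1;\n")))
	fmt.Println(guessLanguage([]byte("#!/usr/bin/php5\n")))
	fmt.Println(guessLanguage([]byte("#!/usr/bin/nothing\n")) == "")
}
```

Because this is a prefix match, php5 and python3 map without extra entries, but so would an unrelated binary such as /usr/bin/perlbrew; that trade-off is worth keeping in mind.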

boyter · 2019-09-30T22:32:55Z

A bit of checking around suggests that #! does not work with a BOM since it predates Unicode. I'd have to check the implementation itself to be sure though.

boyter · 2019-09-30T22:40:06Z

# bboyter @ SurfaceBook2 in ~/Projects [8:38:49]
$ ./withbom
./withbom: 1: ./withbom: #!/usr/bin/python: not found
Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/%{ <-- HERE (.*?)}/ at /usr/bin/print line 528.
Error: no such file "Hello"
# bboyter @ SurfaceBook2 in ~/Projects [8:38:59] C:2
$ ./withoutbom
Hello

Confirmed. That makes the implementation a little easier.
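
In code terms this means a plain two-byte prefix check is enough, with no BOM stripping beforehand. A small Go illustration follows (`hasShebang` is a hypothetical name; the BOM here is the UTF-8 one, bytes EF BB BF):

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasShebang reports whether content starts with "#!" at byte zero.
// A BOM in front of the "#!" makes this false, mirroring the kernel
// behaviour observed in the transcript above.
func hasShebang(content []byte) bool {
	return bytes.HasPrefix(content, []byte("#!"))
}

func main() {
	plain := []byte("#!/usr/bin/python\nprint('Hello')\n")
	withBOM := append(append([]byte{}, utf8BOM...), plain...)

	fmt.Println(hasShebang(plain))   // true
	fmt.Println(hasShebang(withBOM)) // false
}
```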

elindsey · 2019-10-01T02:40:15Z

Ah sorry, I hadn't noticed the licensing difference. 🙂

> I believe something like the below should work.
> Check for presence of perl, php, python etc... and if so map it

It seems like there are two main ways of doing it. One is maintaining a list of interpreter-to-language mappings and iterating through each, seeing if it matches the shebang line (as you described). The other goes the other way around: parse out the correct field of the shebang and look it up first in the languageDatabase map (since most interpreter names match the language name), and then in a 'wonky cases' map (with a handful of things like python3, node, etc.). I think it depends on whether you want to default to trying to find a match for the potentially long tail of languages supported, or only keep a whitelist.
Since 99% of things should be covered by shell/perl/python/ruby, I'm not sure it matters very much either way.
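
The second option (parse first, then a two-stage lookup) could look something like this in Go. Both maps are small hypothetical stand-ins; in particular `languageDatabase` here is not scc's real data structure, just a map with the same name used for illustration.

```go
package main

import "fmt"

// languageDatabase is a stand-in for a map keyed by language name.
// Contents are illustrative assumptions only.
var languageDatabase = map[string]string{
	"perl": "Perl",
	"ruby": "Ruby",
	"lua":  "Lua",
}

// wonkyCases covers interpreters whose name differs from the language name.
var wonkyCases = map[string]string{
	"perl5":   "Perl",
	"python3": "Python",
	"node":    "JavaScript",
}

// lookupLanguage tries the main database first, then the special cases.
func lookupLanguage(interpreter string) (string, bool) {
	if lang, ok := languageDatabase[interpreter]; ok {
		return lang, true
	}
	lang, ok := wonkyCases[interpreter]
	return lang, ok
}

func main() {
	for _, name := range []string{"ruby", "python3", "nothing"} {
		lang, ok := lookupLanguage(name)
		fmt.Printf("%s -> %q (found=%v)\n", name, lang, ok)
	}
}
```

A pure whitelist is the same code with the first lookup removed, which is why the choice mostly comes down to how many long-tail languages should match by default.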

Here are some csv test cases. In the Perl one I enumerated the edge cases (different locations, env vs no env, command-line flags, and whitespace). In the others, I only have a single test case:

perl,#!/usr/bin/perl
perl,#!  /usr/bin/perl
perl,#!/usr/bin/perl -w
perl,#!/usr/bin/env perl
perl,#!  /usr/bin/env   perl
perl,#!/usr/bin/env perl -w
perl,#!  /usr/bin/env   perl   -w  
perl,#!/opt/local/bin/perl
perl,#!/usr/bin/perl5

php,#!/usr/bin/php
php,#!/usr/bin/php5

python,#!/usr/bin/python
python,#!/usr/bin/python2
python,#!/usr/bin/python3

awk,#!/usr/bin/awk
awk,#!/usr/bin/gawk
awk,#!/usr/bin/mawk

csh,#!/bin/csh
csh,#!/bin/tcsh

d,#!/usr/bin/env rdmd

erlang,#!/usr/bin/env escript
javascript,#!/usr/bin/env node
lisp,#!/usr/local/bin/sbcl
lisp,#!/usr/bin/env sbcl
scheme,#!/usr/bin/env racket

And depending on if you do or do not want to fallback to the languageDatabase, these are languages whose interpreter name matches the language name:

java,#!/opt/java/jdk-11/bin/java --source 11
bash,#!/bin/bash
dart,#!/usr/bin/env dart
fish,#!/bin/fish
groovy,#!/usr/bin/groovy
korn,#!/bin/ksh
lua,#!/usr/bin/env lua
ruby,#!/usr/bin/ruby
scala,#!/usr/bin/env scala
sed,#!/usr/bin/sed
shell,#!/bin/sh
swift,#!/usr/bin/env swift
tcl,#!/usr/bin/env tcl
zsh,#!/bin/zsh
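
Rows in this `language,shebang` format lend themselves to a table-driven check. Below is a rough Go sketch of how they might be consumed; `detect`, `runCases`, and the `allowlist` map are hypothetical names with a deliberately partial list of interpreters, not anything from scc.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// allowlist maps interpreter names to the labels used in the CSV rows.
// Partial and illustrative only.
var allowlist = map[string]string{
	"perl": "perl", "perl5": "perl",
	"php": "php", "php5": "php",
	"python": "python", "python2": "python", "python3": "python",
	"awk": "awk", "gawk": "awk", "mawk": "awk",
	"node": "javascript", "escript": "erlang",
}

// detect returns the label for a shebang line, or "" if unknown.
func detect(line string) string {
	if !strings.HasPrefix(line, "#!") {
		return ""
	}
	f := strings.Fields(line[2:])
	if len(f) > 1 && path.Base(f[0]) == "env" {
		f = f[1:] // skip env, use the next token
	}
	if len(f) == 0 {
		return ""
	}
	return allowlist[path.Base(f[0])]
}

// runCases parses "language,shebang" rows and returns the rows that fail.
func runCases(csv string) []string {
	var failures []string
	for _, row := range strings.Split(csv, "\n") {
		if strings.TrimSpace(row) == "" {
			continue
		}
		parts := strings.SplitN(row, ",", 2)
		if len(parts) != 2 || detect(parts[1]) != parts[0] {
			failures = append(failures, row)
		}
	}
	return failures
}

func main() {
	cases := "perl,#!/usr/bin/perl\n" +
		"perl,#!  /usr/bin/env   perl   -w  \n" +
		"awk,#!/usr/bin/gawk\n" +
		"javascript,#!/usr/bin/env node\n"
	fmt.Println("failures:", len(runCases(cases)))
}
```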

boyter · 2019-10-01T22:22:03Z

The licencing thing is just me being paranoid, but it's one of those things you don't want to accidentally make a mistake with.

I think I would rather whitelist for the moment. It's faster to process for a start, and it cuts down on the edge cases which always seem to bite me otherwise.

Thanks for supplying the above. I will resolve the issue with #114 first which sets this up to be easier to implement.

boyter · 2019-10-03T09:39:53Z

Not quite ready for merge yet, but @elindsey if you want, please check out the following branch https://github.com/boyter/scc/tree/114 and build. This should recognise the #! for languages based on the list you provided. I didn't include java, scala, swift, groovy, or dart because I was unable to find any examples of them in the wild.

I am going to refactor it to move the definitions into the language.json file I think because there is no reason to have it separated out. Assuming that's all good then I will look to merge it in.

elindsey · 2019-10-03T17:58:46Z

Local testing looks good to me!

I hit an edge case that looks new to this feature though:

Two folders, test and .test, both contain two files with the same contents, file and file.pl.

[elindsey@worktop:~/Downloads]
$ ls test/
file  file.pl
[elindsey@worktop:~/Downloads]
$ ls .test/
file  file.pl
[elindsey@worktop:~/Downloads]
$ cat test/file
#!/usr/bin/env perl

print "Hello, World\n";
[elindsey@worktop:~/Downloads]
$ cat test/file.pl 
#!/usr/bin/env perl

print "Hello, World\n";

Folder level counts are correct:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         6        2         2        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

Individual file with extension counts are correct:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/file.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/file.pl 
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

But a no extension file in a hidden folder produces a zero count:

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc ~/Downloads/.test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

boyter · 2019-10-03T22:10:43Z

Odd... I am unable to replicate the above.

# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:37]
$ find .
./.test
./.test/test
./test
./test/test

# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:38]
$ scc --no-cocomo
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         2         8        4         2        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        2         8        4         2        2          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:42]
$ scc --no-cocomo test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:46]
$ scc --no-cocomo .test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:51]
$ scc --no-cocomo test/test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────


# bboyter @ SurfaceBook2 in ~/Go/src/github.com/boyter/scc/examples/temp on git:114 x [8:07:54]
$ scc --no-cocomo .test/test
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         4        2         1        1          0
───────────────────────────────────────────────────────────────────────────────

What shell are you using? It's possible that it may be trying to be intelligent with its processing, which is causing your result.

elindsey · 2019-10-04T01:56:58Z

I've repro'd on macos 10.14.6 with bash 5.0.11/zsh 5.3/tcsh 6.18.01 and ubuntu 18.04 with bash 4.4.20; it doesn't look environment related.

Does it work correctly in your environment when the hidden folder isn't the first entry in the path?

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc .test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Perl                         1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         3        1         1        1          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $19
Estimated Schedule Effort 0.247114 months
Estimated People Required 0.009168
───────────────────────────────────────────────────────────────────────────────

[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ mkdir tmp
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ mv .test/ tmp/
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
[elindsey@worktop:~/Downloads/src/github.com/boyter/scc]
$ ./scc tmp/.test/file
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
Total                        0         0        0         0        0          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $0
Estimated Schedule Effort 0.000000 months
Estimated People Required NaN
───────────────────────────────────────────────────────────────────────────────

elindsey · 2019-10-04T02:10:38Z

Looks like it's this logic https://github.com/boyter/scc/blob/114/processor/file.go#L192

Should this be calling filepath.Base on the second argument? https://github.com/boyter/scc/blob/114/processor/file.go#L81

boyter · 2019-10-04T07:07:03Z

Can confirm I am able to replicate now. Seems the deeper folder is the trick. As for the cause... I did change that code because with the new #! lookup it needs to assume that any file without an extension or with . at the start is potentially a #!. This meant it was picking up additional files such as those in the .git folder, which should be excluded by the denylist.

Adding Base on the end of the second might resolve it. Will add some test cases to catch this first then try fixing it.

boyter · 2019-10-04T07:33:19Z

Ah I see the issue. It's due to passing in the full name of the file, which means the file job is dealing with the full path to the file.

You are close, it needs to pass in the filename on the first argument and the path on the second. I'll add a test case for this to resolve it.

boyter · 2019-10-04T07:37:28Z

Wait I have that backwards. Yes you are 100% correct.

boyter · 2019-10-04T07:47:45Z

Fix should be sitting in the branch again for you to try out. Appears to be working correctly for me now.

elindsey · 2019-10-04T14:34:46Z

Looks good to me! I'm really excited for this, thanks for adding it.

boyter · 2019-10-13T21:14:26Z

Merged into master.

elindsey changed the title ~~Scripts with shebangs no file extension are skipped~~ Scripts with shebangs but no file extension are skipped Sep 29, 2019

boyter added the enhancement and help wanted labels Sep 29, 2019

boyter mentioned this issue Oct 1, 2019

Issue with file's without extensions #114

Closed

boyter closed this as completed Oct 13, 2019
