Find similar code in Git repositories
Gemini is a tool for searching for similar 'items' in source code repositories. The supported granularity levels of items are:
- repositories (TBD)
- files
- functions
Gemini consists of three commands:
./hash <path-to-repos-or-siva-files>
./query <path-to-file>
./report
You need to prefix the commands with docker-compose exec gemini
if you run it in Docker. Read below on how to start Gemini in Docker or standalone mode.
To pre-process a number of repositories for quick duplicate detection, run:
./hash ./src/test/resources/siva
The input format of the repositories is the same as in src-d/Engine.
To pre-process repositories for a search of similar functions, run:
./hash -m func ./src/test/resources/siva
To find all duplicates of a single file, run:
./query <path-to-single-file>
To find all similar functions defined in a file, run:
./query -m func <path-to-single-file>
If you are interested in similarities of only one function defined in the file, you can run:
./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
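For example, assuming a hypothetical Go file consumer.go with a function Consume defined on line 42, the invocation would look like:
./query -m func ./consumer.go:Consume:42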
To find all duplicate files and similar functions across all repositories, run:
./report
All repositories must be hashed beforehand, and a community detection library must be installed.
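A minimal end-to-end sketch, reusing the test repositories shipped with the project:
./hash ./src/test/resources/siva
./report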
Start containers:
docker-compose up -d
The local directories repositories and query are available as /repositories and /query inside the container.
Examples:
docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report
You would need:
- JVM 1.8
- Apache Cassandra or ScyllaDB
- Apache Spark
- Python 3
- Bblfshd v2.5.0+
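For example, a bblfshd instance can be started in Docker roughly as follows; the exact image tag and the driver-install step follow the bblfshd documentation and should be treated as assumptions:
# start bblfshd (image tag is an assumption; check the bblfshd releases)
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd:v2.5.0
# install the recommended language drivers
docker exec -it bblfshd bblfshctl driver install --recommended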
By default, all commands use:
- an Apache Cassandra or ScyllaDB instance available at localhost:9042
- Apache Spark, available through $SPARK_HOME
# save some repos in .siva files using Borges
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git" > repo-list.txt
# get Borges from https://github.com/src-d/borges/releases
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt
# start Apache Cassandra
docker run -p 9042:9042 \
--name cassandra -d rinscy/cassandra:3.11
# or ScyllaDB, with a workaround for https://github.com/gocql/gocql/issues/987
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
--name some-scylla -d scylladb/scylla:2.0.0 \
--broadcast-address 127.0.0.1 --listen-address 0.0.0.0 --broadcast-rpc-address 127.0.0.1 \
--memory 2G --smp 1
# to get access to the DB for development
docker exec -it some-scylla cqlsh
Just set the URL of the Spark master through an env var:
MASTER="spark://<spark-master-url>" ./hash <path>
All three commands accept parameters for database connection and logging:
- -h/--host - Cassandra/ScyllaDB hostname, default 127.0.0.1
- -p/--port - Cassandra/ScyllaDB port, default 9042
- -k/--keyspace - Cassandra/ScyllaDB keyspace, default hashes
- -v/--verbose - produce more verbose output, default false
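For example, to query against a Cassandra/ScyllaDB instance on another host with verbose logging (the hostname is hypothetical):
./query -h scylla.internal -p 9042 -k hashes -v <path-to-single-file>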
For the query and hash commands, parameters for bblfsh/feature-extractor configuration are available:
- -m/--mode - similarity modes: file or func, default file
- --bblfsh-host - Babelfish server host, default 127.0.0.1
- --bblfsh-port - Babelfish server port, default 9432
- --features-extractor-host - feature extractor host, default 127.0.0.1
- --features-extractor-port - feature extractor port, default 9001
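For example, to hash in function mode against a bblfshd and a feature extractor running on another machine (the addresses are hypothetical):
./hash -m func --bblfsh-host 10.0.0.5 --bblfsh-port 9432 \
  --features-extractor-host 10.0.0.5 --features-extractor-port 9001 \
  <path-to-repos-or-siva-files>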
Hash command specific arguments:
- -l/--limit - limit the number of repositories to process; by default, all repositories are processed
- -f/--format - format of the stored repositories: siva, bare, or standard, default siva
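For example, to hash at most 100 repositories stored in the standard (checked-out) format:
./hash -l 100 -f standard <path-to-repos-or-siva-files>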
If the env var DEV is set, ./sbt is used to compile and run all non-Spark commands: ./query and ./report.
This is convenient for local development: not requiring a separate compile step allows for a dev workflow similar to the experience with interpreted languages.
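A sketch of that workflow, assuming any non-empty value of DEV enables it:
# run the report command straight from sources via sbt
DEV=1 ./report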
To build the final .jars for all commands, run:
./sbt assemblyPackageDependency
./sbt assembly
Instead of one fat jar we build two, separating all the dependencies from the actual application code to allow for lower build times in case of simple changes.
To run the tests, which rely on the dependencies listed above, run:
./sbt test
The latest generated gRPC code is already checked in under src/main/scala/tech/sourced/featurext.
In case you update any of the src/main/proto/*.proto files, you will need to re-generate the gRPC code for the feature extractors:
./src/main/resources/generate_from_proto.sh
To generate new protobuf message fixtures for tests, you may use bblfsh-sdk-tools:
bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>
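For example, for a hypothetical Go source file:
bblfsh-sdk-tools fixtures -p .proto -l go ./consumer.go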
Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.