Proof of concept for generating synthetic phishing emails using a fine-tuned GPT-2 model.
Warning
This project is intended for educational purposes only.
Misuse of this code is to generate phishing emails or any other malicious activities is strictly prohibited.
iArcanic not liable for any misuse or damages resulting from the use of this code. By using this code, you agree to use it responsibly and ethically.
Via Docker Compose:
docker-compose up --build
Via Docker CLI:
docker build -t ai-phishing-email-gen .
docker run ai-phishing-email-gen
Ensure the Docker engine is installed on your system with version 18.06.0 or higher.
You can download and install the Docker engine from the official Docker website.
Note
- Especially on Linux, make sure your user has the required permissions to interact with the Docker daemon.
- If you are unable to do this, either append
sudo
in front of eachdocker
command or switch to root usingsudo -s
.
Ensure that Docker Compose is installed on your system with version 1.28.0 or higher.
You can download and install Docker Compose from the official Docker website.
- CPU: Multi-core processor (4 cores or more)
- RAM: At least 16GB of RAM
- Disk space: Minimum of 50GB of free disk space
- Graphics: Integrated graphics (for non-GPU training)
- CPU: High-performance multi-core processor (8 cores or more)
- RAM: 32GB or more of RAM
- Disk space: Minimum if 100GB of free disk space
- GPU: NVIDIA GPU with CUDA support (e.g., NVIDIA RTX 2080 or better) for faster training
- Graphics: Dedicated GPU for significant performance improvement
Note
Training on a CPU will be significantly slower than on a GPU. It is highly recommended to use a machine with a compatible NVIDIA GPU for training large models.
Ensure sufficient disk space to store the model checkpoints and datasets.
If running in a virtualized environment or container, allocate resources according to the above recommendations to ensure optimal performance.
- Clone the repository to your local machine.
git clone https://github.com/iArcanic/ai-phishing-email-gen-poc
- Navigate to the project's root directory.
cd ai-phishing-email-gen-poc
- Build and run the Docker container.
docker-compose up --build
Note
With Docker Compose, you can also optionally use the following:
- If you want to build the images each time (or changed a Dockerfile), use:
docker-compose up --build
- If you want to run all the services in the background, use:
docker-compose up -d
After, you can optionally view Docker images, status of containers, and interact with running containers using the Docker Desktop application.
- View the Docker Container's logs
The following logs should be displayed upon running the container:
Loading the data...
Preprocessing the data...
Splitting datasets...
Initialising GPT2 Model...
GPT2 model initialised.
Tokenizing the train dataset...
Map: 100%|██████████| 4323/4323 [00:02<00:00, 2010.93 examples/s]
Train dataset tokenized.
Tokenizing the evaluation dataset...
Map: 100%|██████████| 1081/1081 [00:00<00:00, 2470.80 examples/s]
Evaluation dataset tokenized.
Setting up training arguments...
Training arguments set.
Initialising the trainer...
Trainer initialised.
Training the GPT2 model...
0%| | 10/3243 [00:47<4:06:58, 4.58s/it]
{'loss': 8.376, 'grad_norm': 196.69766235351562, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.0}
Note
The training of the GPT-2 model can take a very long time without a GPU.
Testing this on a Ubuntu Server 16-core machine with 32 GB memory (no GPU) resulted in a time of 2 days.
Upon successful model training, the following logs should be displayed:
100%|██████████| 3243/3243 [32:26:09<00:00, 36.01s/it] ning_rate': 5.468465184104995e-08, 'epoch': 3.0}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
{'eval_loss': 0.7192820906639099, 'eval_runtime': 313.6607, 'eval_samples_per_second': 3.446, 'eval_steps_per_second': 0.864, 'epoch': 3.0}
{'train_runtime': 116769.8894, 'train_samples_per_second': 0.111, 'train_steps_per_second': 0.028, 'train_loss': 0.7957250362849698, 'epoch': 3.0}
Training completed successfully.
Saving the final model and tokenizer...
Model and tokenizer saved successfully.
Loading the trained model and tokenizer...
Model and tokenizer loaded successfully.
Generating text with prompt: 'Dear customer, we have noticed unusual activity in your bank account.'...
Text generation completed successfully.
Generated text 1:
Dear customer, we have noticed unusual activity in your bank account. we are trying to contact you.
please verify your account and regain full access to your å£350 weekly sae. call 0871871850 now!
Note
The prompt can be customised via the prompt
variable in src/main.py.
- Destroy the Docker Container after use
docker-compose down
- Datasets in the
data
folder are taken from TanusreeSharma/phishingdata-Analysis. - GPT-2 model from Hugging Face.