A deep learning project for musical interval recognition using PyTorch. This system generates synthetic musical intervals, processes them into mel spectrograms, and trains a CNN model to recognize different musical intervals.
intervals-ai/
│
├── intervals_ai/
│ ├── dataset/
│ │ ├── dataset.py # Dataset handling
│ │ └── preprocessor.py # Audio preprocessing
│ ├── generation/
│ │ ├── config.py # Configuration classes
│ │ ├── intervals.py # Interval generation
│ │ └── notes.py # Note generation utilities
│ ├── models/
│ │ ├── cnn.py # CNN model architecture
│ │ ├── evaluation.py # Model evaluation
│ │ └── trainer.py # Training logic
│ ├── utils/
│ │ ├── audio.py # Audio utilities
│ │ └── visualization.py # Visualization utilities
│ └── main.py # Main training script
│
├── generated_data/ # Generated audio samples
│ └── interval_*.wav # Generated interval audio files
│
├── experiments/ # Training runs
│ └── run_YYYYMMDD_HHMMSS/ # Timestamped experiment directory
│ ├── checkpoints/ # Model checkpoints
│ ├── plots/ # Generated visualizations
│ └── stats/ # Training statistics and metrics
│
├── README.md
├── pyproject.toml # Poetry dependencies
└── poetry.lock # Poetry lock file
- Install Python 3.10 or newer
- Install Poetry (dependency management):
curl -sSL https://install.python-poetry.org | python3 -
- Clone the repository:
git clone https://github.com/mj300405/intervals_ai.git
cd intervals-ai
- Install dependencies:
poetry install
- Activate the virtual environment:
poetry shell
- Run the training script:
python intervals_ai/main.py
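Alternatively, the script can be run through Poetry without activating the shell first:
poetry run python intervals_ai/main.py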
The script will:
- Generate a dataset of 5000 interval samples (if not already present)
- Train a CNN model for up to 150 epochs with early stopping (a sketch of such a training loop follows this list)
- Save results in a timestamped experiment directory
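Training follows a standard supervised loop with early stopping on validation loss. The sketch below is illustrative only and is not the project's trainer.py; it assumes `model`, `train_loader`, and `val_loader` have already been built, and reuses the epoch limit, patience, and learning rate listed under the model details further down.

    # Illustrative early-stopping training loop (not the project's trainer.py).
    # Assumes `model`, `train_loader`, and `val_loader` already exist.
    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, max_epochs=150, patience=15, lr=1e-3):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        best_val_loss, epochs_without_improvement = float("inf"), 0

        for epoch in range(max_epochs):
            # Training pass
            model.train()
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()

            # Validation pass
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for x, y in val_loader:
                    x, y = x.to(device), y.to(device)
                    val_loss += criterion(model(x), y).item() * x.size(0)
            val_loss /= len(val_loader.dataset)

            # Early stopping: keep the best model, stop after `patience` stale epochs
            if val_loss < best_val_loss:
                best_val_loss, epochs_without_improvement = val_loss, 0
                torch.save(model.state_dict(), "best_model.pt")
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break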
Each training run creates a timestamped directory under experiments/:
experiments/run_YYYYMMDD_HHMMSS/
├── checkpoints/
│ ├── best_model.pt # Best performing model
│ └── checkpoint_epoch_*.pt # Regular checkpoints
│
├── plots/
│ ├── confusion_matrix.png # Model performance matrix
│ ├── training_history.png # Loss and accuracy curves
│ └── audio_features_*.png # Audio visualizations
│
└── stats/
├── run_config.json # Run configuration
├── training_history.json # Detailed training metrics
├── evaluation_results.json # Test set results
├── metrics_summary.txt # Performance summary
└── training_summary.txt # Training overview
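The JSON files under stats/ can be inspected directly. For example, to pretty-print the test-set results of a run (the run name below is the placeholder from the layout above; replace it with a real directory):

    # Load and print the saved evaluation results for one run.
    import json
    from pathlib import Path

    run_dir = Path("experiments") / "run_YYYYMMDD_HHMMSS"  # replace with a real run
    with open(run_dir / "stats" / "evaluation_results.json") as f:
        results = json.load(f)
    print(json.dumps(results, indent=2))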
- Input: Mel spectrograms (128 mel bands)
- Architecture: CNN with 3 convolutional layers (a comparable sketch follows this list)
- Training:
- 5000 total samples
- 70/15/15 train/val/test split
- 150 epochs maximum
- Early stopping with 15 epochs patience
- Learning rate: 0.001
- Batch size: 32
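The sketch below shows what such a model can look like. Only the input shape (single-channel mel spectrograms with 128 bands) and the number of classes (the 12 supported intervals listed further down) come from this README; the layer widths, kernel sizes, and pooling are illustrative assumptions, not the contents of models/cnn.py.

    # Minimal sketch of a comparable 3-conv-layer CNN over mel spectrograms.
    import torch
    import torch.nn as nn

    class IntervalCNN(nn.Module):
        def __init__(self, n_classes: int = 12):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # collapse frequency and time to a fixed size
            )
            self.classifier = nn.Linear(64, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, n_mels=128, time_frames)
            return self.classifier(self.features(x).flatten(1))

    # Sanity check with a dummy batch: 32 spectrograms, 128 mel bands, 256 frames
    logits = IntervalCNN()(torch.randn(32, 1, 128, 256))  # -> shape (32, 12)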
The system generates intervals using these parameters (a minimal synthesis sketch follows the list):
- Sample rate: 44100 Hz
- Duration: 2.0 seconds
- Note gap: 0.2 seconds
- Frequency range: 55 Hz (A1) to 1760 Hz (A6)
- Supported intervals:
- Minor/Major Second
- Minor/Major Third
- Perfect Fourth
- Tritone
- Perfect Fifth
- Minor/Major Sixth
- Minor/Major Seventh
- Octave
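For concreteness, here is one way to synthesize such a clip with numpy and soundfile. The plain sine tones, the note-gap-note layout, and the output file name are assumptions for illustration; only the sample rate, duration, gap, and frequency range come from the list above.

    # Illustrative interval synthesis (sine tones, equal temperament).
    import numpy as np
    import soundfile as sf

    SAMPLE_RATE = 44_100
    DURATION = 2.0      # total clip length in seconds
    NOTE_GAP = 0.2      # silence between the two notes

    def sine(freq: float, seconds: float) -> np.ndarray:
        t = np.linspace(0.0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
        return 0.5 * np.sin(2 * np.pi * freq * t)

    def generate_interval(root_hz: float, semitones: int) -> np.ndarray:
        # Assumes the 2.0 s clip is split as note + gap + note of equal note lengths.
        note_len = (DURATION - NOTE_GAP) / 2
        gap = np.zeros(int(SAMPLE_RATE * NOTE_GAP))
        upper_hz = root_hz * 2 ** (semitones / 12)  # equal-temperament ratio
        return np.concatenate([sine(root_hz, note_len), gap, sine(upper_hz, note_len)])

    rng = np.random.default_rng(0)
    # Keep both notes inside the 55 Hz (A1) to 1760 Hz (A6) range for a perfect fifth
    root = rng.uniform(55.0, 1760.0 / 2 ** (7 / 12))
    sf.write("interval_perfect_fifth.wav", generate_interval(root, 7), SAMPLE_RATE)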
- Training Plots:
- Loss curves (training and validation)
- Accuracy curves (training and validation)
- Confusion matrix
- Audio Visualizations (see the plotting sketch after this list):
- Waveform
- Spectrogram
- Mel spectrogram
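These audio visualizations can be reproduced for any generated clip with librosa and matplotlib. The file name below is a placeholder for one of the interval_*.wav files under generated_data/, and the output image name is just an example.

    # Plot the waveform and 128-band mel spectrogram of one generated sample.
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    y, sr = librosa.load("generated_data/interval_perfect_fifth.wav", sr=None)

    fig, (ax_wave, ax_mel) = plt.subplots(2, 1, figsize=(10, 6))
    librosa.display.waveshow(y, sr=sr, ax=ax_wave)
    ax_wave.set_title("Waveform")

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax_mel)
    ax_mel.set_title("Mel spectrogram (128 bands)")
    fig.colorbar(img, ax=ax_mel, format="%+2.0f dB")
    fig.tight_layout()
    fig.savefig("audio_features_example.png")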
Main dependencies include:
- PyTorch (deep learning)
- librosa (audio processing)
- numpy (numerical computations)
- matplotlib (visualization)
- soundfile (audio I/O)
- scikit-learn (evaluation metrics)
For a complete list, see pyproject.toml.
MIT License
- First run will generate the dataset, which may take some time
- Subsequent runs will reuse the existing dataset
- Each run creates a new experiment directory
- GPU will be used if available, otherwise CPU
- Training progress is saved regularly, allowing for recovery
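A typical pattern for the last two points is sketched below. It assumes the saved checkpoint is a plain state_dict, which may not match the project's actual checkpoint format, and reuses the IntervalCNN sketch from earlier as a stand-in model.

    # Device selection and loading a saved checkpoint for recovery or inference.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = IntervalCNN().to(device)  # substitute your own nn.Module
    state = torch.load(
        "experiments/run_YYYYMMDD_HHMMSS/checkpoints/best_model.pt",  # use a real run
        map_location=device,
    )
    model.load_state_dict(state)
    model.eval()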