To aonaran, who is learning to perceive the profound depths of the world. from e
edepth is a cutting-edge deep learning model designed to estimate depth from various input sources, including single images, videos, and live camera feeds. Depth estimation is a crucial task in computer vision, with applications in autonomous driving, robotics, augmented reality, and more. edepth addresses this task by predicting the distance of objects from the camera using convolutional neural networks (CNNs).
The edepth model architecture is inspired by DenseNet and U-Net architectures, which have shown success in image segmentation tasks. The model consists of an encoder-decoder structure.
The encoder extracts features from the input data using multiple dense blocks, each containing convolutional layers with shared feature maps concatenated across layers. Transition layers follow dense blocks to reduce the number of channels and spatial dimensions.
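A dense block of this kind can be sketched in PyTorch roughly as follows. The layer count, normalization, and channel handling here are illustrative assumptions rather than edepth's exact implementation; only the growth rate of 32 comes from the hyperparameters listed later in this README.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: each layer receives the concatenation of all earlier feature maps."""
    def __init__(self, inChannels: int, growthRate: int = 32, numLayers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(numLayers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(inChannels + i * growthRate),
                nn.ReLU(inplace=True),
                nn.Conv2d(inChannels + i * growthRate, growthRate, kernel_size=3, padding=1),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier feature maps
            features.append(out)
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """Transition layer: 1x1 convolution to shrink channels, pooling to halve spatial size."""
    def __init__(self, inChannels: int, outChannels: int):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.BatchNorm2d(inChannels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inChannels, outChannels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(x)
```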
Between the encoder and decoder, the model includes fully connected layers to process the features:
- `fullyConnectedI`: Linear layer transforming the encoder output to a fixed-size vector.
- `fullyConnectedII`: Linear layer transforming the fixed-size vector back to the size expected by the decoder.
The decoder reconstructs the depth map from the encoded features using upsampling layers.
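Put together, the bottleneck and decoder might look like the sketch below. The `fullyConnectedI`/`fullyConnectedII` names and the 512-neuron bottleneck follow the description and hyperparameters in this README, while the spatial sizes and the number of upsampling stages are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Fully connected bottleneck between encoder and decoder (fullyConnectedI / fullyConnectedII)."""
    def __init__(self, encoderChannels: int, spatialSize: int, neurons: int = 512):
        super().__init__()
        flat = encoderChannels * spatialSize * spatialSize      # assumes square feature maps
        self.fullyConnectedI = nn.Linear(flat, neurons)          # encoder output -> fixed-size vector
        self.fullyConnectedII = nn.Linear(neurons, flat)         # fixed-size vector -> decoder input
        self.encoderChannels, self.spatialSize = encoderChannels, spatialSize

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        v = torch.relu(self.fullyConnectedI(x.flatten(1)))
        out = torch.relu(self.fullyConnectedII(v))
        return out.view(b, self.encoderChannels, self.spatialSize, self.spatialSize)

class Decoder(nn.Module):
    """Decoder: repeated upsample + convolution stages that reconstruct a one-channel depth map."""
    def __init__(self, inChannels: int, stages: int = 4):
        super().__init__()
        blocks, channels = [], inChannels
        for _ in range(stages):
            blocks += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            channels //= 2
        blocks.append(nn.Conv2d(channels, 1, kernel_size=1))     # single channel: predicted depth
        self.decode = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(x)
```

Flattening through a fixed-size linear bottleneck is what ties the network to a fixed input resolution, which is why the roadmap below mentions removing the fully connected layers to support variable input sizes.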
To install the required dependencies for running edepth, use the provided requirements.txt
file. This file lists all necessary Python 3.12.* packages and their versions.
pip install -r requirements.txt
To clone the repository and set up edepth on your local machine, follow these steps:
git clone https://github.com/ehsanasgharzde/edepth.git
cd edepth
Create a virtual environment and install the required packages:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
Load a pre-trained model and run inference on an image, a video file, or a live camera feed:

from edepth import edepth

model = edepth()
model.eload('path/to/pretrained_model.pt')  # load pre-trained weights

model.egenerate(source='image', inputFilePath='path/to/image.jpg', show=True)  # single image
model.egenerate(source='video', inputFilePath='path/to/video.mp4', show=True)  # video file
model.egenerate(source='live', show=True)  # live camera feed
Train the edepth model using the provided training data:
from edepth import edepth
model = edepth()
model.etrain(trainLoader, validationLoader, epochs=10)
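`etrain` expects PyTorch data loaders. The repository does not show how they are built, so the sketch below is only one way to construct them, assuming a folder of RGB images paired with same-named depth maps resized to 224x224; `DepthDataset` and the directory paths are hypothetical.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class DepthDataset(Dataset):
    """Hypothetical dataset pairing RGB images with depth maps stored under matching file names."""
    def __init__(self, imageDir: str, depthDir: str, size: int = 224):
        self.imageDir, self.depthDir = imageDir, depthDir
        self.names = sorted(os.listdir(imageDir))
        self.toTensor = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int):
        name = self.names[idx]
        image = self.toTensor(Image.open(os.path.join(self.imageDir, name)).convert("RGB"))
        depth = self.toTensor(Image.open(os.path.join(self.depthDir, name)).convert("L"))
        return image, depth

# Batch size of 16 matches the hyperparameters listed below.
trainLoader = DataLoader(DepthDataset("data/train/images", "data/train/depths"), batch_size=16, shuffle=True)
validationLoader = DataLoader(DepthDataset("data/val/images", "data/val/depths"), batch_size=16)
```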
The following hyperparameters, dataset, and hardware were used to achieve the performance reported later in this README; a sketch of how this training configuration fits together follows the list below:
- Growth Rate: 32
- Neurons: 512
- Epochs: 1000 planned (training stopped after 96 completed epochs when hardware temperature limits were reached)
- Batch Size: 16
- Gradient Clip: 2.0
- Optimizer: `swats.SWATS(self.parameters(), lr=0.0001)`
- Activation: `nn.ReLU()`
- Loss: `nn.MSELoss()`
- Scheduler: `torch.optim.lr_scheduler.ReduceLROnPlateau(self.optimizer, 'min', patience=100, factor=0.5)`
- Processor: Intel® Core™ i7-5500U CPU @ 2.40GHz × 4
- GPU: HAINAN (LLVM 15.0.7, DRM 2.50, 6.5.0-35-generic) / Mesa Intel® HD
- RAM: 16 GB
- Storage: 1 TB SSD
- Epochs Completed: 96
- Validation Loss at 96th Epoch: 0.27274127925435704
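For reference, here is a minimal sketch of how the optimizer, loss, scheduler, and gradient clipping listed above could be wired together around an edepth instance. Judging by the `self.` references in the list, edepth configures these internally; the sketch simply shows the pieces combined externally, and the training-step function is an illustration rather than edepth's actual `etrain` internals.

```python
import swats
import torch
import torch.nn as nn
from edepth import edepth

model = edepth()
optimizer = swats.SWATS(model.parameters(), lr=0.0001)        # SWATS: switches from Adam to SGD
criterion = nn.MSELoss()                                      # per-pixel mean squared error
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, 'min', patience=100, factor=0.5)               # halve LR when validation loss plateaus

def trainStep(images, depths):
    """One optimization step with the gradient clip of 2.0 listed above."""
    optimizer.zero_grad()
    loss = criterion(model(images), depths)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(validationLoss)
```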
The dataset for edepth was gathered by downloading videos from YouTube, with details available in this Google Spreadsheet. The images and corresponding depth maps were created using the Marigold depth estimation model, available here.
- Number of Images and Labels (Depth Maps): 954
- Average Image Size (resized to 224x224): 4.5MB
- Average Label (Depth Map) Size (resized to 224x224): 6MB
Model State: Download and place in the checkpoints folder.
Note: This model state was specifically trained for testing purposes on drone shots of cell tower antennas in daylight conditions. It may not be suitable for other use cases. Users are encouraged to train the model themselves for their specific applications and datasets to achieve optimal performance.
Below are visualizations of input images and their corresponding output depth maps generated by the edepth model. These samples demonstrate the model's ability to estimate depth from single images accurately.
- Output Colorized Depth Map, estimation time: 0.153 seconds
- Output Grayscale Depth Map, estimation time: 0.128 seconds
- Output Colorized Depth Map, estimation time: 0.141 seconds
- Output Grayscale Depth Map, estimation time: 0.133 seconds
edepth can process video files and generate depth maps for each frame. Here are some example results:
- Original Video
- Depth Map Video
Note: edepth calculates accuracy by comparing the predicted depth values to the true depth values for each pixel in the input image.
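The exact per-pixel comparison is not spelled out here, so the snippet below only illustrates one common way to score a predicted depth map against ground truth: the fraction of pixels whose relative error stays within a threshold (the delta < 1.25 metric). The function name and threshold are assumptions, not necessarily the formula edepth uses.

```python
import torch

def pixelAccuracy(predicted: torch.Tensor, target: torch.Tensor, threshold: float = 1.25) -> float:
    """Fraction of pixels whose predicted depth is within `threshold` ratio of the true depth.

    Standard delta < 1.25 metric; assumes strictly positive depth values.
    """
    eps = 1e-6                                                             # avoid division by zero
    ratio = torch.maximum(predicted / (target + eps), target / (predicted + eps))
    return (ratio < threshold).float().mean().item()
```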
- Processing Speed: edepth can process images at a rate of 21 images per second and videos at 25 frames per second.
- Accuracy: The model achieves an average accuracy of 99% on standard depth estimation benchmarks.
- Model Size: The edepth model has a total of 1.3 million parameters, making it efficient for both training and inference.
edepth offers flexibility in its architecture, allowing users to adjust parameters such as input channels, growth rate, and depth range. This customization enables the model to adapt to different datasets and tasks.
The model provides methods for training and evaluating depth estimation tasks. It includes functionalities for loading datasets, training the model with configurable hyperparameters, and evaluating model performance on validation sets.
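As an illustration of that configurability, instantiation might look like the following; the keyword names `inputChannels`, `growthRate`, and `depthRange` are hypothetical placeholders, so check the edepth constructor signature for the actual parameter names.

```python
from edepth import edepth

# Hypothetical constructor arguments; the real parameter names may differ.
model = edepth(inputChannels=3, growthRate=32, depthRange=(0.1, 100.0))
model.etrain(trainLoader, validationLoader, epochs=10)  # loaders as in the training example above
```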
edepth is capable of processing live camera feeds in real time, making it suitable for dynamic and interactive applications. It supports images, videos, and live feeds, providing a comprehensive solution for depth estimation across different types of media.
- Images: processes 21 images per second; achieves an average accuracy of 73% on the 954-image dataset after 96 training epochs.
- Video: processes 25 to 30 frames per second; maintains high accuracy across consecutive frames.
- Live feed: real-time processing with minimal latency; consistent accuracy for dynamic scenes.
- cudnnenv: Manage CUDA and cuDNN versions for optimized deep learning performance.
- scikit-image: Utilize for image processing tasks like preprocessing and post-processing depth maps.
- huggingface_hub: Explore pretrained models and datasets for depth estimation tasks.
- accelerate: Enhance training efficiency with utilities for distributed and mixed precision training.
- diffusers (planned): Analyze model robustness and interpretability for depth estimation.
- transformers: Adapt transformer-based architectures for computer vision tasks, including depth estimation.
- denoisers (planned): Improve depth map quality through advanced denoising techniques.
- Update input shape handling and remove the fully connected layers (planned): make edepth support variable input sizes efficiently.
Contributions to improve the model's performance or add new features are highly appreciated! Whether it's optimizing the architecture, implementing new algorithms, or enhancing documentation, your contributions are valuable. To contribute, fork the repository, make your changes, and submit a pull request.
- Fork the repository
- Create a new branch (`git checkout -b feature-branch`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature-branch`)
- Create a new Pull Request
For any questions, suggestions, or collaboration opportunities, feel free to reach out to Ehsan Asgharzadeh.
This project is licensed under the MIT License. See the LICENSE file for details.
- Ehsan Asgharzadeh - GitHub, LinkedIn
- Contact: ehsanasgharzadeh.asg@gmail.com
- Version: 1.0.1
- License: https://ehsanasgharzadeh.ir - MIT