edepth: Open-Source Trainable Depth Estimation Model

To aonaran, who is learning to perceive the profound depths of the world. from e

edepth: Open-Source Trainable Depth Estimation Model

Overview

edepth is an open-source, cutting-edge deep learning model designed to estimate depth from various input sources, including single images, videos, and live camera feeds. Depth estimation is a crucial task in computer vision, with applications in autonomous driving, robotics, augmented reality, and more. edepth addresses this task by predicting the distance of objects from the camera using convolutional neural networks (CNNs).

Model Architecture and Pipeline

The edepth model architecture is inspired by DenseNet and U-Net architectures, which have shown success in image segmentation tasks. The model consists of an encoder-decoder structure.

General Architecture Overview

Encoder

The encoder extracts features from the input data using multiple dense blocks, each containing convolutional layers with shared feature maps concatenated across layers. Transition layers follow dense blocks to reduce the number of channels and spatial dimensions:

initialConvolution: Convolution layer with kernel size 5.
pool: Max pooling layer with kernel size 2 and stride 2.
denseBlocks: Sequence of dense blocks for feature extraction.
transitionLayers: Sequence of transition layers to reduce channel dimensions.

Fully Connected Layers

Between the encoder and decoder, the model includes fully connected layers to process the features:

fullyConnectedI: Linear layer transforming the encoder output to a fixed-size vector.
fullyConnectedII: Linear layer transforming the fixed-size vector back to the size expected by the decoder.

Decoder

The decoder reconstructs the depth map from the encoded features using upsampling layers:

upSampleI: Upsampling layer with scale factor 2.
convI: Convolution layer with kernel size 3.
upSampleII: Upsampling layer with scale factor 2.
convII: Convolution layer with kernel size 3.
upSampleIII: Upsampling layer with scale factor 2.
convIII: Convolution layer with kernel size 3.
upSampleIV: Upsampling layer with scale factor 2.
convIV: Convolution layer with kernel size 3, outputting a single-channel depth map.

Detailed Architecture Overview

Installation

To install the required dependencies for running edepth, use the provided requirements.txt file. This file lists all necessary Python 3.12.* packages and their versions.

pip install -r requirements.txt

Cloning and Setting Up the Model

To clone the repository and set up edepth on your local machine, follow these steps:

Clone the Repository

git clone https://github.com/ehsanasgharzde/edepth.git
cd edepth

Install Dependencies

Create a virtual environment and install the required packages:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt

Usage

Create Model

from edepth import edepth

model = edepth()

Loading a Pre-trained Model

Load a pre-trained model for inference:

model.eload('path/to/pretrained_model.pt')

Generating Depth Maps

From an Image

model.egenerate(source='image', inputFilePath='path/to/image.jpg', show=True)

From a Video

model.egenerate(source='video', inputFilePath='path/to/video.mp4', show=True)

From Live Camera Feed

model.egenerate(source='live', show=True)

Training the Model

Train the edepth model using the provided training data:

import pandas
from utilities import Dataset
from sklearn.model_selection import train_test_split

    
dataset = pandas.read_csv('path/to/dataset.csv')
train, validate = train_test_split(dataset, test_size=0.2, random_state=42)
trainset, validationset = Dataset(train, 224, 224), Dataset(validate, 224, 224)

model.etrain(trainset, validationset, epochs=100)

Test Train Details

Hyperparameters

The following hyperparameters, dataset, and hardware were used to achieve the model's performance mentioned further in the readme file:

Growth Rate: 32
Neurons: 512
Epochs: 1000 (96 successfully completed epochs after reaching hardware temperature limits)
Batch Size: 16
Gradient Clip: 2.0
Optimizer: swats.SWATS(self.parameters(), lr=0.0001)
Activation: nn.ReLU()
Loss: nn.MSELoss()
Scheduler: torch.optim.lr_scheduler.ReduceLROnPlateau(self.optimizer, 'min', patience=100, factor=0.5)

Hardware

Processor: Intel® Core™ i7-5500U CPU @ 2.40GHz × 4
GPU: HAINAN (, LLVM 15.0.7, DRM 2.50, 6.5.0-35-generic) / Mesa Intel® HD
RAM: 16 GB
Storage: 1 TB SSD

Training Performance

Epochs Completed: 96
Validation Loss at 96th Epoch: 0.27274127925435704

Dataset

The dataset for edepth was gathered by downloading videos from YouTube, with details available in this Google Spreadsheet. The images and corresponding depth maps were created using the Marigold depth estimation model, available here.

Key Details:

Number of Images and Labels (Depth Maps): 954
Average Image Size (resized to 224x224): 4.5MB
Average Label (Depth Map) Size (resized to 224x224): 6MB

Model State: Download and place in checkpoints folder.

Note: This model state was specifically trained for testing purposes on drone shots of cell tower antennas in daylight conditions. It may not be suitable for other use cases. Users are encouraged to train the model themselves for their specific applications and datasets to achieve optimal performance.

Samples

Image Samples

Below are visualizations of input images and their corresponding output depth maps generated by the edepth model. These samples demonstrate the model's ability to estimate depth from single images accurately.

Input Image

Output Colorized Depth Map estimation time: 0.1526861349993851 seconds

Output Grayscale Depth Map estimation time: 0.1276703309995355 seconds

Input Image

Output Colorized Depth Map estimation time: 0.14070451999577926 seconds

Output Grayscale Depth Map estimation time: 0.1332030850026058 seconds

Video Samples

edepth can process video files and generate depth maps for each frame. Here are some example results:

Original Video: Original Video
Depth Map Video: Depth Map Video

Performance Metrics

Note: edepth calculates accuracy by comparing the predicted depth values to the true depth values for each pixel in the input image.

Processing Speed: edepth can process images at a rate of 21 images per second and videos at 25 frames per second.
Accuracy: The model achieves an average accuracy of 99% on standard depth estimation benchmarks.
Model Size: The edepth model has a total of 1.3 million parameters, making it efficient for both training and inference.

Main Features

Customizable Architecture

edepth offers flexibility in its architecture, allowing users to adjust parameters such as input channels, growth rate, and depth range. This customization enables the model to adapt to different datasets and tasks.

Training and Evaluation Methods

The model provides methods for training and evaluating depth estimation tasks. It includes functionalities for loading datasets, training the model with configurable hyperparameters, and evaluating model performance on validation sets.

Real-time Processing Capabilities

Capable of processing live camera feeds in real-time, making it suitable for dynamic and interactive applications.

Versatile Input Support

Supports images, videos, and live feeds, providing a comprehensive solution for depth estimation across different types of media.

Performance

Image Input

Speed: Processes images at 21 images per second.
Accuracy: Achieves an average accuracy of 73% on 954 images dataset with 96 epochs of learning.

Video Input

Speed: Processes video frames at 25 to 30 frames per second.
Accuracy: Maintains high accuracy across consecutive frames.

Live Streams

Speed: Real-time processing with minimal latency.
Accuracy: Consistent accuracy for dynamic scenes.

Future Plans

cudnnenv

cudnnenv: Manage CUDA and cuDNN versions for optimized deep learning performance.

scikit-image

scikit-image: Utilize for image processing tasks like preprocessing and post-processing depth maps.

huggingface_hub

huggingface_hub: Explore pretrained models and datasets for depth estimation tasks.

accelerate

accelerate: Enhance training efficiency with utilities for distributed and mixed precision training.

diffusers

diffusers (planned): Analyze model robustness and interpretability for depth estimation.

transformers

transformers: Adapt transformer-based architectures for computer vision tasks, including depth estimation.

denoisers

denoisers (planned): Improve depth map quality through advanced denoising techniques.

Update Shape and Remove Fully Connected Layers

Update Shape and Remove Fully Connected Layers (planned): Enhance edepth's versatility to support variable input sizes efficiently.

Contributing

Contributions to improve the model's performance or add new features are highly appreciated! Whether it's optimizing the architecture, implementing new algorithms, or enhancing documentation, your contributions are valuable. To contribute, fork the repository, make your changes, and submit a pull request.

Steps to Contribute

Fork the repository
Create a new branch (git checkout -b feature-branch)
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature-branch)
Create a new Pull Request

For any questions, suggestions, or collaboration opportunities, feel free to reach out to me.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Ehsan Asgharzadeh - GitHub, LinkedIn
Contact: ehsanasgharzadeh.asg@gmail.com
Version: 1.0.1
License: ehsanasgharzadeh.ir - MIT

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
architecture		architecture
checkpoints		checkpoints
dataset		dataset
input		input
output		output
.gitattributes		.gitattributes
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
edepth.py		edepth.py
requirements.txt		requirements.txt
run.py		run.py
utilities.py		utilities.py

License

ehsanasgharzde/edepth

Folders and files

Latest commit

History

Repository files navigation

edepth: Open-Source Trainable Depth Estimation Model

Overview

Model Architecture and Pipeline

Encoder

Fully Connected Layers

Decoder

Installation

Cloning and Setting Up the Model

Clone the Repository

Install Dependencies

Usage

Create Model

Loading a Pre-trained Model

Generating Depth Maps

From an Image

From a Video

From Live Camera Feed

Training the Model

Test Train Details

Hyperparameters

Hardware

Training Performance

Dataset

Key Details:

Samples

Image Samples

Video Samples

Performance Metrics

Main Features

Customizable Architecture

Training and Evaluation Methods

Real-time Processing Capabilities

Versatile Input Support

Performance

Image Input

Video Input

Live Streams

Future Plans

cudnnenv

scikit-image

huggingface_hub

accelerate

diffusers

transformers

denoisers

Update Shape and Remove Fully Connected Layers

Contributing

Steps to Contribute

License

Author

About

Topics

Resources

License

Stars

Watchers

Forks

Languages