doc: Update README and add all other README's

1 parent: cfeb5e0 · commit: f2d8280
Showing 6 changed files with 452 additions and 1 deletion.

**DUPNET.md**

# DupNet

### Dataset

We need a labeled dataset for training, but we don't want to go through every potential overlap and label the ground truth by hand.
Fortunately, we don't need to. Let's see why.
Using our [tile indexing scheme](RESEARCH.md) for a single image, we assert the following two statements about any two tiles i, j:

1. If `i == j`, the two tiles are duplicates: a tile is trivially a copy of itself.
2. If `i != j`, the two tiles represent the same point in time, but their scenes are disjoint, so by definition they should not be duplicates.

These two statements hold by construction, before we even look at an image.
But as we have seen, some images contain multiple tiles with the same hash (e.g. clouds, or black or blue borders).

We build up the dataset by looping over all pairs of tiles in each image.
For each pair, we store the image name, the indexes of the two tiles, and the truth label (1 if the indexes are the same, 0 if they are different).
We skip all solid tiles as well as pairs of tiles we have already recorded (e.g. if we already have the pair (2, 4) stored, we skip (4, 2)).
If we want to swap the tile order, we can do that during training.
Each image can therefore contribute a maximum of 36 datapoints (9 dups and 27 non-dups) to the final dataset.
To help balance the labels, we use the BlockMeanHash of each tile to filter out datapoints that are clearly non-duplicate.
This brings the total down to ~4 million datapoints, though this number can vary depending on how we filter with BlockMeanHash.
Regardless, 4 million datapoints is overkill. We always end up randomly sampling down to somewhere between 100k and 200k datapoints,
which we then further split into training, validation, and testing sets.
We find that this is more than enough data to train a model that outperforms any of the [image metric](IMAGE_METRICS.md) based algorithms.
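
As a rough sketch of that loop (the `tiles` mapping and `is_solid` predicate are hypothetical stand-ins for the project's actual helpers, and the BlockMeanHash filtering step is omitted):

```python
from itertools import combinations_with_replacement

def build_pairs(img_id, tiles, is_solid):
    """Enumerate unordered tile pairs for one image.

    tiles    -- dict mapping tile index -> tile pixel array
    is_solid -- predicate marking solid-color tiles
    Yields (img_id, i, j, label) rows, where label is 1 iff i == j.
    Pairs are kept in canonical order (i <= j), so (4, 2) is
    recorded once as (2, 4).
    """
    for i, j in combinations_with_replacement(sorted(tiles), 2):
        if is_solid(tiles[i]) or is_solid(tiles[j]):
            continue  # skip solid tiles entirely
        yield img_id, i, j, int(i == j)
```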
---
### Augmentation
We use the following 3 image augmentation methods during training:

1. [JPEG Compression](#jpeg-compression)
2. [HLS shifting](#hls-shifting)
3. [Flips and Rotations](#flips-and-rotations)

#### JPEG Compression
Create a new folder for the `256` jpegs and save each file using the original img_id as the prefix, followed by the index according to its location in table 2.
Then compare each sliced tile with its corresponding saved `256` jpeg tile.
We know they are the same image, but because of jpeg compression there are slight variations in some of the pixels.
We get this augmentation for free when we create the `256 x 256` tile dataset.
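
For reference, the same perturbation can be reproduced in memory with OpenCV's JPEG codec. This is a minimal sketch; the quality setting is an assumption, not the project's actual encoder configuration:

```python
import cv2
import numpy as np

def jpeg_roundtrip(tile: np.ndarray, quality: int = 90) -> np.ndarray:
    """Encode a tile to JPEG bytes and decode it back.

    The decoded tile differs from the input by small compression
    artifacts, mimicking the difference between a sliced tile and
    its saved `256` jpeg counterpart.
    """
    ok, buf = cv2.imencode(".jpg", tile, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok, "JPEG encoding failed"
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```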

#### HLS shifting
We create a class in [hls_shift](notebooks/eda/hls_shift.ipynb) to compare perturbed and unperturbed versions of the same image.
The perturbed version results from shifting one of the HLS channels of the original unperturbed image
by an integer offset, which falls within $\pm 180$ for the hue channel and $\pm 100$ for the lightness and saturation channels.

The idea here is to perturb the hue, lightness, and/or saturation such that the image still "looks" the same,
but when comparing the sum of the pixelwise deltas, the counts can be in the tens to hundreds of thousands.
For this to work, we need upper and lower bounds on the offsets for the three HLS channels.
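
A minimal sketch of the shift itself, assuming OpenCV's 8-bit HLS conventions (hue wraps on [0, 180); lightness and saturation clip to [0, 255]); the notebook's actual class may differ:

```python
import cv2
import numpy as np

def hls_shift(img_bgr: np.ndarray, channel: int, offset: int) -> np.ndarray:
    """Shift one HLS channel of a BGR image by an integer offset.

    channel: 0 = hue, 1 = lightness, 2 = saturation.
    """
    hls = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HLS).astype(np.int32)
    if channel == 0:
        hls[..., 0] = (hls[..., 0] + offset) % 180  # hue wraps around
    else:
        hls[..., channel] = np.clip(hls[..., channel] + offset, 0, 255)
    return cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2BGR)
```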

We used GIMP to create a small dataset of HLS-perturbed images.
GIMP has a nice UI that we could use to save out multiple versions of the same image.
With each version, we scaled the H, L, or S value by $\lambda$ before saving it.
See the corresponding [README](data/persistent/gimp_hls/README.md) for more details.

We then opened up both the original image and the gimp-scaled copies with OpenCV and
experimented with different HLS offsets to determine at what point an
image becomes different enough that it should no longer be regarded as a duplicate.

#### Flips and Rotations

We will add horizontal and/or vertical flips.
We'll restrict rotations to 0, 90, 180, and 270 degrees so we don't have to deal with cropping or resizing.
Ensure that **both** images in a pair are flipped and rotated the same way.
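
A sketch of what "the same way" means here, using numpy; the random-parameter choices are illustrative, not the project's training code:

```python
import random
import numpy as np

def flip_rotate_pair(a: np.ndarray, b: np.ndarray, rng: random.Random):
    """Apply one randomly drawn flip/rotation to BOTH tiles of a pair."""
    k = rng.randrange(4)          # number of 90-degree rotations
    flip_h = rng.random() < 0.5   # horizontal flip?
    flip_v = rng.random() < 0.5   # vertical flip?

    def transform(t: np.ndarray) -> np.ndarray:
        t = np.rot90(t, k)
        if flip_h:
            t = np.fliplr(t)
        if flip_v:
            t = np.flipud(t)
        return t

    # The label is preserved only because both tiles get the
    # identical transform.
    return transform(a), transform(b)
```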
---
### Model

TODO: Describe in words

![](notebooks/figures/dupnet_v0.png)

**IMAGE_METRICS.md**

## Per Image Metrics

We only need to calculate the per-image metrics once, when we first run the install script.
The installer saves the results to intermediate files, which can then be loaded much faster than recalculating the metrics from scratch.

Below is a list of the algorithms we tested for finding duplicates.

One option is to compare per-image metrics computed with various image "similarity" algorithms:
- [image hashes](notebooks/eda/3_image_hashes.ipynb)
- [image entropy](notebooks/eda/4_image_entropy.ipynb)
- [image histograms](notebooks/eda/image_histograms.ipynb)

Unfortunately, no single one of these, nor any combination, works particularly well across the entire dataset.
They produce far too many false positives and false negatives to be useful.

### Image Hashes

* [md5 hash](https://docs.python.org/3/library/hashlib.html)
* [block-mean hash](https://www.phash.org/docs/pubs/thesis_zauner.pdf)
* [color-moment hash](http://www.naturalspublishing.com/files/published/54515x71g3omq1.pdf)

Use the `hashlib` python package to calculate the md5 checksum.
Perceptual image hash functions are available through the contrib add-on package beginning with OpenCV 3.3.0.
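
A sketch of all three hashes for a single tile (the function name is ours; it assumes `opencv-contrib-python` is installed):

```python
import hashlib
import cv2

def tile_hashes(path: str):
    """Compute the three hashes listed above for one tile on disk."""
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()  # exact byte-level identity

    img = cv2.imread(path)
    # cv2.img_hash ships with the contrib package (OpenCV >= 3.3.0)
    bmh = cv2.img_hash.BlockMeanHash_create().compute(img)
    cmh = cv2.img_hash.ColorMomentHash_create().compute(img)
    return md5, bmh.flatten(), cmh.flatten()
```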

### Image Entropy

* Cross Entropy?
* Shannon Entropy?
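
If we go with Shannon entropy, scikit-image has a ready-made per-image measure; a quick sketch (the tile path is just an example file from train_256):

```python
from skimage.color import rgb2gray
from skimage.io import imread
from skimage.measure import shannon_entropy

# Shannon entropy (base 2) of the grayscale intensity distribution.
tile = rgb2gray(imread("data/processed/train_256/00003e153_0.jpg"))
print(shannon_entropy(tile))
```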

### Grey-Level Co-occurrence Matrix ([wiki](https://en.wikipedia.org/wiki/Co-occurrence_matrix))

* energy
* contrast
* homogeneity
* correlation

Available through [scikit-image](https://scikit-image.org/docs/stable/api/skimage.feature.html#greycomatrix)
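
A sketch of extracting the four properties above with scikit-image (newer releases spell it `graycomatrix`; versions before 0.19 use `greycomatrix`):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_tile: np.ndarray) -> dict:
    """The four GLCM texture properties for an 8-bit grayscale tile."""
    glcm = graycomatrix(gray_tile, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return {p: float(graycoprops(glcm, p)[0, 0])
            for p in ("energy", "contrast", "homogeneity", "correlation")}
```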

See also: [Harris Geospatial](https://www.harrisgeospatial.com/docs/backgroundtexturemetrics.html).

### Other Useful Per-Image Metrics

* Is solid?
* Ship counts

## Overlap Metrics

* Binary pixel difference
* Absolute pixel difference
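
One plausible reading of these two metrics in numpy (a sketch; the `tol` threshold for absorbing JPEG noise is our assumption):

```python
import numpy as np

def overlap_diffs(a: np.ndarray, b: np.ndarray, tol: int = 0):
    """Pixel-difference metrics between two aligned overlap regions.

    binary   -- count of pixels differing by more than tol in any channel
    absolute -- sum of absolute channel-wise differences
    """
    delta = np.abs(a.astype(np.int32) - b.astype(np.int32))
    binary = int(np.count_nonzero(delta.max(axis=-1) > tol))
    absolute = int(delta.sum())
    return binary, absolute
```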

### Other Useful Sources

* <https://en.wikipedia.org/wiki/Relative_change_and_difference>
* <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.8550&rep=rep1&type=pdf>

**INSTALL.md**

# Setup and Install
Use the following guide to reproduce my results and run the jupyter notebooks.

## Clone this repo
Clone this repo and cd into the project:
```shell script
$ git clone https://github.com/WillieMaddox/Airbus_SDC_dup.git
$ cd Airbus_SDC_dup
```

## Setup the environment
I use virtualenv along with virtualenvwrapper to set up an isolated python environment:
```shell script
$ mkvirtualenv --python=python3.6 Airbus_SDC_dup
```
You should now see your command prompt prefixed with `(Airbus_SDC_dup)`, indicating you are in the virtualenv.

If you use conda, you can instead try the make script to create your environment:
```shell script
$ make create_environment
```
I do not use conda, so I haven't had a chance to verify that this works.

## Install requirements
From the root of the project, install all requirements:
```shell script
(Airbus_SDC_dup) $ pip install -r requirements.txt
```
or
```shell script
$ make requirements
```

## Download the data

The dataset for this project is hosted on Kaggle: [Airbus Ship Detection Challenge](https://www.kaggle.com/c/airbus-ship-detection/overview).
You'll need to sign in with your Kaggle username. If you don't have an account, it's free to sign up.

You can extract the dataset to wherever you like. I extracted it to `data/raw/train_768`:

```
├── Makefile                <- Makefile with commands like `make data` or `make train`
├── README.md               <- This README.
├── data
│   ├── raw                 <- Data dump from the Airbus_SDC Kaggle competition goes in here.
│   │   ├── train_768       <- The images from train_v2.zip go in here.
│   │   │   ├── 00003e153.jpg
│   │   │   ├── 0001124c7.jpg
│   │   │   ├── ...
│   │   │   └── ffffe97f3.jpg
│   │   └── train_ship_segmentations_v2.csv <- The run length encoded ship labels.
│   ├── ...
├── ...
```

## Preprocess tiles and interim data

Once the raw dataset has been downloaded and extracted, run the image preprocessing scripts.

First generate the 256 x 256 image tiles:
```shell script
$ make data
```
Note: The `data/processed/train_256` folder takes up ??? GB of disk space. It takes approx 30 min to run on my dev system. YMMV.
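
Conceptually, `make data` slices each 768 x 768 image into nine 256 x 256 tiles named `{img_id}_{idx}.jpg`. Here is a sketch of that slicing; the row-major index order is my assumption (the real ordering is defined by the tile indexing scheme in RESEARCH.md):

```python
import os
import cv2

def slice_tiles(img_path: str, img_id: str, out_dir: str, size: int = 256):
    """Split one 768 x 768 image into nine 256 x 256 tiles."""
    img = cv2.imread(img_path)
    for row in range(3):
        for col in range(3):
            idx = row * 3 + col  # assumed row-major tile index
            tile = img[row * size:(row + 1) * size,
                       col * size:(col + 1) * size]
            cv2.imwrite(os.path.join(out_dir, f"{img_id}_{idx}.jpg"), tile)
```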

Next generate the image feature metadata:
```shell script
$ make features
```
Note: The `data/interim` folder takes up ??? GB of disk space. It takes approx 2 hrs to run on my dev system. YMMV.

The newly generated files will be placed into the interim and processed directories.
Once complete, your directory structure should look like the following:
```
├── ...
├── data
│   ├── raw
│   ├── interim
│   │   ├── image_bm0hash_grids.pkl
│   │   ├── image_cm0hash_grids.pkl
│   │   ├── image_issolid_grids.pkl
│   │   ├── image_md5hash_grids.pkl
│   │   ├── image_shipcnt_grids.pkl
│   │   ├── overlap_bmh_tile_scores_1.pkl
│   │   ├── overlap_bmh_tile_scores_2.pkl
│   │   ├── overlap_bmh_tile_scores_3.pkl
│   │   ├── overlap_bmh_tile_scores_4.pkl
│   │   ├── overlap_bmh_tile_scores_6.pkl
│   │   ├── overlap_bmh_tile_scores_9.pkl
│   │   ├── overlap_cmh_tile_scores_1.pkl
│   │   ├── overlap_cmh_tile_scores_2.pkl
│   │   ├── ...
│   │   └── overlap_shp_tile_scores_9.pkl
│   ├── processed
│   │   └── train_256
│   │       ├── 00003e153_0.jpg
│   │       ├── 00003e153_1.jpg
│   │       ├── 00003e153_2.jpg
│   │       ├── 00003e153_3.jpg
│   │       ├── 00003e153_4.jpg
│   │       ├── 00003e153_5.jpg
│   │       ├── 00003e153_6.jpg
│   │       ├── 00003e153_7.jpg
│   │       ├── 00003e153_8.jpg
│   │       ├── 0001124c7_0.jpg
│   │       ├── ...
│   │       └── ffffe97f3_8.jpg
│   ├── ...
├── ...
```

**README.md**

# Airbus_SDC_dup
Detecting duplicate regions of overlapping satellite imagery.

* [Installing](INSTALL.md)
* [Research](RESEARCH.md)
* [Image Metrics](IMAGE_METRICS.md)
* [DupNet](DUPNET.md)
* [Todo](TODO.md)