doc: Update README and add all other README's
WillieMaddox committed Sep 26, 2019
1 parent cfeb5e0 commit f2d8280
Showing 6 changed files with 452 additions and 1 deletion.
69 changes: 69 additions & 0 deletions DUPNET.md
@@ -0,0 +1,69 @@
# DupNet

### Dataset

We need a labeled dataset for training, but we don't want to go through every potential overlap and hand-label the ground truth.
Fortunately, we don't need to. Let's see why.
Using our [tile indexing scheme](RESEARCH.md) for a single image, we assert the following 2 statements about any 2 tiles i, j:

1. If `i == j`, the two tiles are duplicates. The tile is simply a copy of itself.
2. If `i != j`, the two tiles represent the same point in time, but their scenes are disjoint and hence by definition they should not be duplicates.

Without even looking at an image, these two statements should always hold.
But as we have seen, some images contain multiple tiles with the same hash (e.g. clouds, black or blue border).

We build up the dataset by looping over all pairs of tiles in each image.
For each pair, we store the image name, the indexes of the two tiles, and the truth (a 1 if the indexes are the same and a 0 if they are different).
We skip all solid tiles as well as pairs of tiles we have already recorded (e.g. if we already have tiles (2, 4) stored, we skip tiles (4, 2)).
If we want to swap the tile order, we can do that during training.
Each image can therefore contribute a maximum of 36 datapoints (9 dups and 27 non-dups) to the final dataset.
To help balance the labels, we use the BlockMeanHash of each tile to filter out datapoints that are clearly non-duplicate.
This brings the total down to ~4 million datapoints, though the exact number varies depending on how we filter with BlockMeanHash.
Regardless, 4 million datapoints is overkill. We always end up randomly sampling down to somewhere between 100k and 200k datapoints,
which we then split into training, validation, and testing sets.
We find that this is more than enough data to sufficiently train a model capable of outperforming any of the [image metric](IMAGE_METRICS.md) based algorithms.
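
The pair-generation loop itself is simple. Below is a minimal sketch of the idea; the helper names (`get_tile`, `is_solid`, `build_datapoints`), the row-major tile indexing, and the `max_hamming` BlockMeanHash threshold are illustrative assumptions, not the project's actual implementation. The exact pair bookkeeping follows the tile indexing scheme in [RESEARCH.md](RESEARCH.md); this sketch simply enumerates every unordered pair.

```python
import itertools
from pathlib import Path

import cv2

def get_tile(img, idx, tile_size=256):
    """Slice tile `idx` out of a 768 x 768 image (row-major indexing assumed)."""
    row, col = divmod(idx, 3)
    return img[row * tile_size:(row + 1) * tile_size,
               col * tile_size:(col + 1) * tile_size]

def is_solid(tile):
    """A tile is 'solid' if every pixel has the same value (e.g. a black border)."""
    return tile.min() == tile.max()

def build_datapoints(img_path, max_hamming=10):
    """Yield (img_id, i, j, label) records for a single image."""
    bmh = cv2.img_hash.BlockMeanHash_create()
    img = cv2.imread(str(img_path))
    tiles = [get_tile(img, idx) for idx in range(9)]
    hashes = [bmh.compute(t) for t in tiles]
    for i, j in itertools.combinations_with_replacement(range(9), 2):
        if is_solid(tiles[i]) or is_solid(tiles[j]):
            continue
        label = int(i == j)
        # Drop non-dup pairs whose hashes are far apart; they are "clearly
        # non-duplicate" and would otherwise swamp the positive labels.
        if label == 0 and bmh.compare(hashes[i], hashes[j]) > max_hamming:
            continue
        yield Path(img_path).stem, i, j, label
```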

---

### Augmentation

We use the following 3 image augmentation methods during training:

1. [JPEG Compression](#jpeg-compression)
2. [HLS shifting](#hls-shifting)
3. [Flips and Rotations](#flips-and-rotations)

#### JPEG Compression
Create a new folder for the `256` jpegs and save each file using the original img_id as the prefix, followed by the tile's index according to its location in table 2.
Then compare each sliced tile with its corresponding saved `256` jpeg tile.
We know they are the same image, but because of jpeg compression there are slight variations in some of the pixels.
We get this for free when we create the `256 x 256` tile dataset.
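
As a quick sanity check, the same perturbation can be reproduced in memory by re-encoding a tile with OpenCV. This is only an illustrative sketch (the quality setting and file path are example values); the real augmentation simply reads the saved `256` jpegs.

```python
import cv2
import numpy as np

def jpeg_roundtrip(tile, quality=90):
    """Re-encode a tile as JPEG in memory and decode it again.

    The result "looks" identical, but lossy compression shifts some pixel values.
    """
    ok, buf = cv2.imencode('.jpg', tile, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

tile = cv2.imread('data/processed/train_256/00003e153_0.jpg')
recompressed = jpeg_roundtrip(tile)
print('pixels changed:', int(np.count_nonzero(cv2.absdiff(tile, recompressed))))
```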

#### HLS shifting
We create a class in [hls_shift](notebooks/eda/hls_shift.ipynb) to compare perturbed and unperturbed versions of the same image.
The perturbed version results from shifting one of the HLS channels of the original unperturbed image
by an integer offset that falls within $\pm 180$ for the hue channel and $\pm 100$ for the lightness and saturation channels.

The idea here is to perturb the hue, lightness, and/or saturation such that the image still "looks" the same,
but when comparing the sum of the pixelwise deltas, the counts can be in the tens to hundreds of thousands.
For this to work, we need the upper and lower bounds on the three HLS channels.

We used GIMP to create a small dataset of HLS-perturbed images.
GIMP has a nice UI that we used to save out multiple versions of the same image.
With each version, we shifted the H, L, or S value by $\lambda$ before saving it.
See the corresponding [README](data/persistent/gimp_hls/README.md) for more details.

We then opened both the original image and the GIMP-shifted copies with OpenCV and
experimented with different HLS offsets to determine at what point the
image is different enough to be regarded as not a duplicate.
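
For reference, an equivalent perturbation can be applied directly with OpenCV. The function below is a rough sketch, not the class used in the notebook; note that the offsets here are in OpenCV's 8-bit HLS ranges (hue 0-179, lightness/saturation 0-255) rather than GIMP's units.

```python
import cv2
import numpy as np

def hls_shift(img_bgr, channel, offset):
    """Shift one HLS channel of a BGR image by `offset` and convert back.

    channel: 0 = hue, 1 = lightness, 2 = saturation.
    Hue wraps around; lightness and saturation are clipped to [0, 255].
    """
    hls = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HLS).astype(np.int32)
    if channel == 0:
        hls[..., 0] = (hls[..., 0] + offset) % 180
    else:
        hls[..., channel] = np.clip(hls[..., channel] + offset, 0, 255)
    return cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2BGR)

img = cv2.imread('data/raw/train_768/00003e153.jpg')
shifted = hls_shift(img, channel=2, offset=30)    # small saturation boost
delta = int(cv2.absdiff(img, shifted).sum())      # pixelwise delta can be huge
```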

#### Flips and Rotations

We will add horizontal and/or vertical flips.
We'll restrict rotations to 0, 90, 180, and 270 degrees so we don't have to deal with cropping or resizing.
We make sure that **both** tiles in a pair are flipped and rotated the same way.
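
A minimal sketch of such a paired transform (the function name and probabilities are illustrative):

```python
import random

import numpy as np

def random_flip_rotate(tile_a, tile_b, rng=random):
    """Apply the same random flip / 90-degree rotation to both tiles in a pair."""
    if rng.random() < 0.5:                      # horizontal flip
        tile_a, tile_b = np.fliplr(tile_a), np.fliplr(tile_b)
    if rng.random() < 0.5:                      # vertical flip
        tile_a, tile_b = np.flipud(tile_a), np.flipud(tile_b)
    k = rng.randrange(4)                        # 0, 90, 180 or 270 degrees
    return np.rot90(tile_a, k), np.rot90(tile_b, k)
```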

---

### Model

TODO: Describe in words

![](notebooks/figures/dupnet_v0.png)


58 changes: 58 additions & 0 deletions IMAGE_METRICS.md
@@ -0,0 +1,58 @@
## Per Image Metrics

We only need to calculate the per-image metrics once when we first run the install script.
The installer saves the results to intermediate files, which can then be loaded much faster than recalculating the metrics from scratch.

Below we show a list of algorithms we tested to find duplicates.

One option is to compare per-image metrics computed with various image "similarity" algorithms:
- [image hashes](notebooks/eda/3_image_hashes.ipynb)
- [image entropy](notebooks/eda/4_image_entropy.ipynb)
- [image histograms](notebooks/eda/image_histograms.ipynb)

Unfortunately, no single one of these, nor any combination of them, works particularly well across the entire dataset.
They produce far too many false positives and false negatives to be useful.

### Image Hashes

* [md5 hash](https://docs.python.org/3/library/hashlib.html)
* [block-mean hash](https://www.phash.org/docs/pubs/thesis_zauner.pdf)
* [color-moment hash](http://www.naturalspublishing.com/files/published/54515x71g3omq1.pdf)

Use the `hashlib` Python package to calculate the md5 checksum.
Perceptual image hash functions are available through the contrib add-on package beginning with OpenCV 3.3.0.
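
A short sketch of how these hashes can be computed (the file path is just an example):

```python
import hashlib

import cv2

def md5_hash(path):
    """Exact-duplicate check: identical bytes give identical digests."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Perceptual hashes from the opencv-contrib img_hash module.
bmh = cv2.img_hash.BlockMeanHash_create()
cmh = cv2.img_hash.ColorMomentHash_create()

img = cv2.imread('data/processed/train_256/00003e153_0.jpg')
block_mean = bmh.compute(img)    # short uint8 hash; compare with bmh.compare()
color_moment = cmh.compute(img)  # vector of color moments (float64)
```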

### Image Entropy

* Cross Entropy?
* Shannon Entropy? (see the sketch below)
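
For example, Shannon entropy can be computed directly from a tile's gray-level histogram. This is a minimal sketch; scikit-image also offers `skimage.measure.shannon_entropy`.

```python
import numpy as np

def shannon_entropy(gray_tile, bins=256):
    """Shannon entropy (in bits) of a tile's gray-level histogram."""
    hist, _ = np.histogram(gray_tile, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                     # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```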

### Grey-Level Co-occurrence Matrix ([wiki](https://en.wikipedia.org/wiki/Co-occurrence_matrix))

* energy
* contrast
* homogeneity
* correlation

Available through [scikit-image](https://scikit-image.org/docs/stable/api/skimage.feature.html#greycomatrix)

See also, [Harris geospatial](https://www.harrisgeospatial.com/docs/backgroundtexturemetrics.html)
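
A minimal sketch of extracting the four texture metrics listed above, using a single pixel offset (newer scikit-image versions rename these functions to `graycomatrix`/`graycoprops`):

```python
from skimage.feature import greycomatrix, greycoprops

def glcm_features(gray_tile):
    """Texture metrics from a grey-level co-occurrence matrix (uint8 input)."""
    glcm = greycomatrix(gray_tile, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return {prop: float(greycoprops(glcm, prop).mean())
            for prop in ('energy', 'contrast', 'homogeneity', 'correlation')}
```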

### Other Useful Per-Image Metrics

* Is solid?
* Ship counts

## Overlap Metrics

* Binary pixel difference
* Absolute pixel difference (see the sketch below)
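
A minimal sketch of both metrics, assuming two same-sized BGR tiles (the exact definitions used in the notebooks may differ):

```python
import cv2
import numpy as np

def overlap_pixel_metrics(tile_a, tile_b):
    """Pixelwise comparison of two supposedly overlapping tiles."""
    diff = cv2.absdiff(tile_a, tile_b)
    return {
        'binary_pixel_difference': int(np.count_nonzero(diff.any(axis=-1))),  # count of differing pixels
        'absolute_pixel_difference': int(diff.sum()),                         # total intensity difference
    }
```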

### Other Useful Sources

<https://en.wikipedia.org/wiki/Relative_change_and_difference>

<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.8550&rep=rep1&type=pdf>



111 changes: 111 additions & 0 deletions INSTALL.md
@@ -0,0 +1,111 @@
# Setup and Install
Use the following guide to reproduce my results and run the Jupyter notebooks.

## Clone this repo
Clone this repo and cd into the project.
```shell script
$ git clone https://github.com/WillieMaddox/Airbus_SDC_dup.git
$ cd Airbus_SDC_dup
```

## Setup the environment
I use virtualenv along with virtualenvwrapper to set up an isolated Python environment:
```shell script
$ mkvirtualenv --python=python3.6 Airbus_SDC_dup
```
You should now see your command prompt prefixed with `(Airbus_SDC_dup)`, indicating you are in the virtualenv.

If using conda, you can instead try using the make script to create your environment.
```shell script
$ make create_environment
```
I do not use conda, so I haven't had a chance to verify that this works.

## Install requirements
From the root of the project, install all requirements.
```shell script
(Airbus_SDC_dup) $ pip install -r requirements.txt
```
or
```shell script
$ make requirements
```

## Download the data

The dataset for this project is hosted on Kaggle: [Airbus Ship Detection Challenge](https://www.kaggle.com/c/airbus-ship-detection/overview).
You'll need to sign in with your Kaggle username. If you don't have an account, it's free to sign up.

You can extract the dataset to wherever you like. I extracted it to `data/raw/train_768`.

```
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- This README.
├── data
│   ├── raw            <- Data dump from the Airbus_SDC Kaggle competition goes in here.
│   │   ├── train_768  <- The images from train_v2.zip go in here.
│   │   │   ├── 00003e153.jpg
│   │   │   ├── 0001124c7.jpg
│   │   │   ├── ...
│   │   │   └── ffffe97f3.jpg
│   │   └── train_ship_segmentations_v2.csv  <- The run length encoded ship labels.
│   ├── ...
├── ...
```

## Preprocess tiles and interim data

Once the raw dataset has been downloaded and extracted, run the image preprocessing scripts.

First generate the 256 x 256 image tiles:
```shell script
$ make data
```
Note: The `data/processed/train_256` folder takes up ??? GB of disk space. It takes approx 30 min to run on my dev system. YMMV.
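
Conceptually, `make data` slices each 768 x 768 training image into nine 256 x 256 tiles named `<img_id>_<idx>.jpg`. The snippet below is only a sketch of that step (row-major tile indexing assumed); the Makefile target drives the project's own script.

```python
from pathlib import Path

import cv2

SRC = Path('data/raw/train_768')
DST = Path('data/processed/train_256')
DST.mkdir(parents=True, exist_ok=True)

for img_path in SRC.glob('*.jpg'):
    img = cv2.imread(str(img_path))
    for idx in range(9):                  # tiles indexed 0-8
        row, col = divmod(idx, 3)
        tile = img[row * 256:(row + 1) * 256, col * 256:(col + 1) * 256]
        cv2.imwrite(str(DST / f'{img_path.stem}_{idx}.jpg'), tile)
```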

Next generate the image feature metadata:
```shell script
$ make features
```
Note: The `data/interim` folder takes up ??? GB of disk space. It takes approx 2 hrs to run on my dev system. YMMV.

The newly generated files will be placed into the interim and processed directories.
Once complete, your directory structure should look like the following:
```
├── ...
├── data
│   ├── raw
│   ├── interim
│   │   ├── image_bm0hash_grids.pkl
│   │   ├── image_cm0hash_grids.pkl
│   │   ├── image_issolid_grids.pkl
│   │   ├── image_md5hash_grids.pkl
│   │   ├── image_shipcnt_grids.pkl
│   │   ├── overlap_bmh_tile_scores_1.pkl
│   │   ├── overlap_bmh_tile_scores_2.pkl
│   │   ├── overlap_bmh_tile_scores_3.pkl
│   │   ├── overlap_bmh_tile_scores_4.pkl
│   │   ├── overlap_bmh_tile_scores_6.pkl
│   │   ├── overlap_bmh_tile_scores_9.pkl
│   │   ├── overlap_cmh_tile_scores_1.pkl
│   │   ├── overlap_cmh_tile_scores_2.pkl
│   │   ├── ...
│   │   └── overlap_shp_tile_scores_9.pkl
│   ├── processed
│   │   └── train_256
│   │       ├── 00003e153_0.jpg
│   │       ├── 00003e153_1.jpg
│   │       ├── 00003e153_2.jpg
│   │       ├── 00003e153_3.jpg
│   │       ├── 00003e153_4.jpg
│   │       ├── 00003e153_5.jpg
│   │       ├── 00003e153_6.jpg
│   │       ├── 00003e153_7.jpg
│   │       ├── 00003e153_8.jpg
│   │       ├── 0001124c7_0.jpg
│   │       ├── ...
│   │       └── ffffe97f3_8.jpg
│   ├── ...
├── ...
```

8 changes: 7 additions & 1 deletion README.md
@@ -1,2 +1,8 @@
# Airbus_SDC_dup
Finding and tagging duplicate satellite images based on overlapping sub tiles.
Detecting duplicate regions of overlapping satellite imagery.

* [Installing](INSTALL.md)
* [Research](RESEARCH.md)
* [Image Metrics](IMAGE_METRICS.md)
* [DupNet](DUPNET.md)
* [Todo](TODO.md)