doc: Update README and add all other README's
WillieMaddox committed Sep 26, 2019
1 parent cfeb5e0 commit f2d8280
Showing 6 changed files with 452 additions and 1 deletion.
69 changes: 69 additions & 0 deletions DUPNET.md
@@ -0,0 +1,69 @@
# DupNet

### Dataset

We need a labeled dataset for training, but we don't want to go through every potential overlap and hand-label the ground truth.
Fortunately, we don't need to. Let's see why.
Using our [tile indexing scheme](RESEARCH.md) for a single image, we assert the following 2 statements about any 2 tiles i, j:

1. If `i == j`, the two tiles are duplicates. The tile is simply a copy of itself.
2. If `i != j`, the two tiles represent the same point in time, but their scenes are disjoint and hence by definition they should not be duplicates.

Without even looking at an image, these two statements should always hold.
But as we have seen, some images contain multiple tiles with the same hash (e.g. clouds, black or blue border).

We build up the dataset by looping over all pairs of tiles in each image.
For each pair, we store the image name, the indexes of the two tiles, and the truth (a 1 if the indexes are the same and a 0 if they are different).
We skip all solid tiles as well as pairs of tiles we have already recorded (e.g. if we already have tiles (2, 4) stored, we skip tiles (4, 2)).
If we want to swap the tile order, we can do that during training.
Each image can therefore contribute a maximum of 36 datapoints (9 dups and 27 non-dups) to the final dataset.
To help balance the labels, we use the BlockMeanHash of each tile to filter out datapoints that are clearly non-duplicate.
This brings the total down to ~4 million datapoints, though the exact number varies depending on how we filter with BlockMeanHash.
Regardless, 4 million datapoints is overkill. We always end up randomly sampling down to somewhere between 100k and 200k datapoints,
which we then split into training, validation, and testing sets.
We find that this is more than enough data to sufficiently train a model capable of outperforming any of the [image metric](IMAGE_METRICS.md) based algorithms.
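
The pair-generation loop itself is simple. Below is a minimal sketch of the idea; the helper names (`get_tile`, `is_solid`, `build_datapoints`), the row-major tile indexing, and the `max_hamming` BlockMeanHash threshold are illustrative assumptions, not the project's actual implementation. The exact pair bookkeeping follows the tile indexing scheme in [RESEARCH.md](RESEARCH.md); this sketch simply enumerates every unordered pair.

```python
import itertools
from pathlib import Path

import cv2

def get_tile(img, idx, tile_size=256):
    """Slice tile `idx` out of a 768 x 768 image (row-major indexing assumed)."""
    row, col = divmod(idx, 3)
    return img[row * tile_size:(row + 1) * tile_size,
               col * tile_size:(col + 1) * tile_size]

def is_solid(tile):
    """A tile is 'solid' if every pixel has the same value (e.g. a black border)."""
    return tile.min() == tile.max()

def build_datapoints(img_path, max_hamming=10):
    """Yield (img_id, i, j, label) records for a single image."""
    bmh = cv2.img_hash.BlockMeanHash_create()
    img = cv2.imread(str(img_path))
    tiles = [get_tile(img, idx) for idx in range(9)]
    hashes = [bmh.compute(t) for t in tiles]
    for i, j in itertools.combinations_with_replacement(range(9), 2):
        if is_solid(tiles[i]) or is_solid(tiles[j]):
            continue
        label = int(i == j)
        # Drop non-dup pairs whose hashes are far apart; they are "clearly
        # non-duplicate" and would otherwise swamp the positive labels.
        if label == 0 and bmh.compare(hashes[i], hashes[j]) > max_hamming:
            continue
        yield Path(img_path).stem, i, j, label
```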

---

### Augmentation

We use the following 3 image augmentation methods during training:

1. [JPEG Compression](#jpeg-compression)
2. [HLS shifting](#hls-shifting)
3. [Flips and Rotations](#flips-and-rotations)

#### JPEG Compression
Create a new folder for the `256` jpegs and save each file using the original img_id as the prefix, followed by the tile's index according to its location in table 2.
Then compare each sliced tile with its corresponding saved `256` jpeg tile.
We know they are the same image, but because of jpeg compression there are slight variations in some of the pixels.
We get this for free when we create the `256 x 256` tile dataset.
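
As a quick sanity check, the same perturbation can be reproduced in memory by re-encoding a tile with OpenCV. This is only an illustrative sketch (the quality setting and file path are example values); the real augmentation simply reads the saved `256` jpegs.

```python
import cv2
import numpy as np

def jpeg_roundtrip(tile, quality=90):
    """Re-encode a tile as JPEG in memory and decode it again.

    The result "looks" identical, but lossy compression shifts some pixel values.
    """
    ok, buf = cv2.imencode('.jpg', tile, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

tile = cv2.imread('data/processed/train_256/00003e153_0.jpg')
recompressed = jpeg_roundtrip(tile)
print('pixels changed:', int(np.count_nonzero(cv2.absdiff(tile, recompressed))))
```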

#### HLS shifting
We create a class in [hls_shift](notebooks/eda/hls_shift.ipynb) to compare perturbed and unperturbed versions of the same image.
The perturbed version results from shifting one of the HLS channels of the original unperturbed image
by an integer offset that falls within $\pm 180$ for the hue channel and $\pm 100$ for the lightness and saturation channels.

The idea here is to perturb the hue, lightness, and/or saturation such that the image still "looks" the same,
but when comparing the sum of the pixelwise deltas, the counts can be in the tens to hundreds of thousands.
For this to work, we need the upper and lower bounds on the three HLS channels.

We used GIMP to create a small dataset of HLS-perturbed images.
GIMP has a nice UI that we used to save out multiple versions of the same image.
With each version, we shifted the H, L, or S value by $\lambda$ before saving it.
See the corresponding [README](data/persistent/gimp_hls/README.md) for more details.

We then opened both the original image and the GIMP-shifted copies with OpenCV and
experimented with different HLS offsets to determine at what point the
image is different enough to be regarded as not a duplicate.
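
For reference, an equivalent perturbation can be applied directly with OpenCV. The function below is a rough sketch, not the class used in the notebook; note that the offsets here are in OpenCV's 8-bit HLS ranges (hue 0-179, lightness/saturation 0-255) rather than GIMP's units.

```python
import cv2
import numpy as np

def hls_shift(img_bgr, channel, offset):
    """Shift one HLS channel of a BGR image by `offset` and convert back.

    channel: 0 = hue, 1 = lightness, 2 = saturation.
    Hue wraps around; lightness and saturation are clipped to [0, 255].
    """
    hls = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HLS).astype(np.int32)
    if channel == 0:
        hls[..., 0] = (hls[..., 0] + offset) % 180
    else:
        hls[..., channel] = np.clip(hls[..., channel] + offset, 0, 255)
    return cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2BGR)

img = cv2.imread('data/raw/train_768/00003e153.jpg')
shifted = hls_shift(img, channel=2, offset=30)    # small saturation boost
delta = int(cv2.absdiff(img, shifted).sum())      # pixelwise delta can be huge
```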

#### Flips and Rotations

We will add horizontal and/or vertical flips.
We'll restrict rotations to 0, 90, 180, and 270 degrees so we don't have to deal with cropping or resizing.
We make sure that **both** tiles in a pair are flipped and rotated the same way.
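
A minimal sketch of such a paired transform (the function name and probabilities are illustrative):

```python
import random

import numpy as np

def random_flip_rotate(tile_a, tile_b, rng=random):
    """Apply the same random flip / 90-degree rotation to both tiles in a pair."""
    if rng.random() < 0.5:                      # horizontal flip
        tile_a, tile_b = np.fliplr(tile_a), np.fliplr(tile_b)
    if rng.random() < 0.5:                      # vertical flip
        tile_a, tile_b = np.flipud(tile_a), np.flipud(tile_b)
    k = rng.randrange(4)                        # 0, 90, 180 or 270 degrees
    return np.rot90(tile_a, k), np.rot90(tile_b, k)
```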

---

### Model

TODO: Describe in words

![](notebooks/figures/dupnet_v0.png)


58 changes: 58 additions & 0 deletions IMAGE_METRICS.md
@@ -0,0 +1,58 @@
## Per Image Metrics

We only need to calculate the per-image metrics once when we first run the install script.
The installer saves the results to intermediate files, which can then be loaded much faster than recalculating the metrics from scratch.

Below we show a list of algorithms we tested to find duplicates.

One option is to compare per-image metrics computed with various image "similarity" algorithms:
- [image hashes](notebooks/eda/3_image_hashes.ipynb)
- [image entropy](notebooks/eda/4_image_entropy.ipynb)
- [image histograms](notebooks/eda/image_histograms.ipynb)

Unfortunately, no single one of these, nor any combination of them, works particularly well across the entire dataset.
They produce far too many false positives and false negatives to be useful.

### Image Hashes

* [md5 hash](https://docs.python.org/3/library/hashlib.html)
* [block-mean hash](https://www.phash.org/docs/pubs/thesis_zauner.pdf)
* [color-moment hash](http://www.naturalspublishing.com/files/published/54515x71g3omq1.pdf)

Use the `hashlib` Python package to calculate the md5 checksum.
Perceptual image hash functions are available through the contrib add-on package beginning with OpenCV 3.3.0.
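
A short sketch of how these hashes can be computed (the file path is just an example):

```python
import hashlib

import cv2

def md5_hash(path):
    """Exact-duplicate check: identical bytes give identical digests."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Perceptual hashes from the opencv-contrib img_hash module.
bmh = cv2.img_hash.BlockMeanHash_create()
cmh = cv2.img_hash.ColorMomentHash_create()

img = cv2.imread('data/processed/train_256/00003e153_0.jpg')
block_mean = bmh.compute(img)    # short uint8 hash; compare with bmh.compare()
color_moment = cmh.compute(img)  # vector of color moments (float64)
```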

### Image Entropy

* Cross Entropy?
* Shannon Entropy? (see the sketch below)
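
For example, Shannon entropy can be computed directly from a tile's gray-level histogram. This is a minimal sketch; scikit-image also offers `skimage.measure.shannon_entropy`.

```python
import numpy as np

def shannon_entropy(gray_tile, bins=256):
    """Shannon entropy (in bits) of a tile's gray-level histogram."""
    hist, _ = np.histogram(gray_tile, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                     # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```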

### Grey-Level Co-occurrence Matrix ([wiki](https://en.wikipedia.org/wiki/Co-occurrence_matrix))

* energy
* contrast
* homogeneity
* correlation

Available through [scikit-image](https://scikit-image.org/docs/stable/api/skimage.feature.html#greycomatrix)

See also, [Harris geospatial](https://www.harrisgeospatial.com/docs/backgroundtexturemetrics.html)
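
A minimal sketch of extracting the four texture metrics listed above, using a single pixel offset (newer scikit-image versions rename these functions to `graycomatrix`/`graycoprops`):

```python
from skimage.feature import greycomatrix, greycoprops

def glcm_features(gray_tile):
    """Texture metrics from a grey-level co-occurrence matrix (uint8 input)."""
    glcm = greycomatrix(gray_tile, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return {prop: float(greycoprops(glcm, prop).mean())
            for prop in ('energy', 'contrast', 'homogeneity', 'correlation')}
```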

### Other Useful Per-Image Metrics

* Is solid?
* Ship counts

## Overlap Metrics

* Binary pixel difference
* Absolute pixel difference (see the sketch below)
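
A minimal sketch of both metrics, assuming two same-sized BGR tiles (the exact definitions used in the notebooks may differ):

```python
import cv2
import numpy as np

def overlap_pixel_metrics(tile_a, tile_b):
    """Pixelwise comparison of two supposedly overlapping tiles."""
    diff = cv2.absdiff(tile_a, tile_b)
    return {
        'binary_pixel_difference': int(np.count_nonzero(diff.any(axis=-1))),  # count of differing pixels
        'absolute_pixel_difference': int(diff.sum()),                         # total intensity difference
    }
```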

### Other Useful Sources

<https://en.wikipedia.org/wiki/Relative_change_and_difference>

<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.8550&rep=rep1&type=pdf>



111 changes: 111 additions & 0 deletions INSTALL.md
@@ -0,0 +1,111 @@
# Setup and Install
Use the following guide to reproduce my results and run the Jupyter notebooks.

## Clone this repo
Clone this repo and cd into the project.
```shell script
$ git clone https://github.com/WillieMaddox/Airbus_SDC_dup.git
$ cd Airbus_SDC_dup
```

## Setup the environment
I use virtualenv along with virtualenvwrapper to set up an isolated Python environment:
```shell script
$ mkvirtualenv --python=python3.6 Airbus_SDC_dup
```
You should now see your command prompt prefixed with `(Airbus_SDC_dup)`, indicating you are in the virtualenv.

If using conda, you can instead try using the make script to create your environment.
```shell script
$ make create_environment
```
I do not use conda, so I haven't had a chance to verify that this works.

## Install requirements
From the root of the project, install all requirements.
```shell script
(Airbus_SDC_dup) $ pip install -r requirements.txt
```
or
```shell script
$ make requirements
```

## Download the data

The dataset for this project is hosted on Kaggle: [Airbus Ship Detection Challenge](https://www.kaggle.com/c/airbus-ship-detection/overview).
You'll need to sign in with your Kaggle username. If you don't have an account, it's free to sign up.

You can extract the dataset to wherever you like. I extracted it to `data/raw/train_768`.

```
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- This README.
├── data
│   ├── raw            <- Data dump from the Airbus_SDC Kaggle competition goes in here.
│   │   ├── train_768  <- The images from train_v2.zip go in here.
│   │   │   ├── 00003e153.jpg
│   │   │   ├── 0001124c7.jpg
│   │   │   ├── ...
│   │   │   └── ffffe97f3.jpg
│   │   └── train_ship_segmentations_v2.csv  <- The run length encoded ship labels.
│   ├── ...
├── ...
```

## Preprocess tiles and interim data

Once the raw dataset has been downloaded and extracted, run the image preprocessing scripts.

First generate the 256 x 256 image tiles:
```shell script
$ make data
```
Note: The `data/processed/train_256` folder takes up ??? GB of disk space. It takes approx 30 min to run on my dev system. YMMV.
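
Conceptually, `make data` slices each 768 x 768 training image into nine 256 x 256 tiles named `<img_id>_<idx>.jpg`. The snippet below is only a sketch of that step (row-major tile indexing assumed); the Makefile target drives the project's own script.

```python
from pathlib import Path

import cv2

SRC = Path('data/raw/train_768')
DST = Path('data/processed/train_256')
DST.mkdir(parents=True, exist_ok=True)

for img_path in SRC.glob('*.jpg'):
    img = cv2.imread(str(img_path))
    for idx in range(9):                  # tiles indexed 0-8
        row, col = divmod(idx, 3)
        tile = img[row * 256:(row + 1) * 256, col * 256:(col + 1) * 256]
        cv2.imwrite(str(DST / f'{img_path.stem}_{idx}.jpg'), tile)
```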

Next generate the image feature metadata:
```shell script
$ make features
```
Note: The `data/interim` folder takes up ??? GB of disk space. It takes approx 2 hrs to run on my dev system. YMMV.

The newly generated files will be placed into the interim and processed directories.
Once complete, your directory structure should look like the following:
```
├── ...
├── data
│   ├── raw
│   ├── interim
│   │   ├── image_bm0hash_grids.pkl
│   │   ├── image_cm0hash_grids.pkl
│   │   ├── image_issolid_grids.pkl
│   │   ├── image_md5hash_grids.pkl
│   │   ├── image_shipcnt_grids.pkl
│   │   ├── overlap_bmh_tile_scores_1.pkl
│   │   ├── overlap_bmh_tile_scores_2.pkl
│   │   ├── overlap_bmh_tile_scores_3.pkl
│   │   ├── overlap_bmh_tile_scores_4.pkl
│   │   ├── overlap_bmh_tile_scores_6.pkl
│   │   ├── overlap_bmh_tile_scores_9.pkl
│   │   ├── overlap_cmh_tile_scores_1.pkl
│   │   ├── overlap_cmh_tile_scores_2.pkl
│   │   ├── ...
│   │   └── overlap_shp_tile_scores_9.pkl
│   ├── processed
│   │   └── train_256
│   │       ├── 00003e153_0.jpg
│   │       ├── 00003e153_1.jpg
│   │       ├── 00003e153_2.jpg
│   │       ├── 00003e153_3.jpg
│   │       ├── 00003e153_4.jpg
│   │       ├── 00003e153_5.jpg
│   │       ├── 00003e153_6.jpg
│   │       ├── 00003e153_7.jpg
│   │       ├── 00003e153_8.jpg
│   │       ├── 0001124c7_0.jpg
│   │       ├── ...
│   │       └── ffffe97f3_8.jpg
│   ├── ...
├── ...
```

8 changes: 7 additions & 1 deletion README.md
@@ -1,2 +1,8 @@
# Airbus_SDC_dup
Finding and tagging duplicate satellite images based on overlapping sub tiles.
Detecting duplicate regions of overlapping satellite imagery.

* [Installing](INSTALL.md)
* [Research](RESEARCH.md)
* [Image Metrics](IMAGE_METRICS.md)
* [DupNet](DUPNET.md)
* [Todo](TODO.md)