
Commit

merge and fix
davidbuniat committed Nov 30, 2020
2 parents 12a1007 + 34ba0d0 commit 49fdd3b
Showing 41 changed files with 1,529 additions and 635 deletions.
23 changes: 22 additions & 1 deletion .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -23,7 +23,9 @@ workflows:
branches:
ignore: /.*/
- deploy:
context: pypi
context:
- pypi
- snark-docker
requires:
- test
filters:
@@ -50,6 +52,11 @@ jobs:
pip install -r requirements-optional.txt
pip install -r requirements.txt
pip install -e .
- run:
name: "Checking code style"
command: |
pip install flake8
flake8 . --count --exit-zero --max-complexity=10 --statistics
- run:
name: "Running tests"
command: |
@@ -68,6 +75,8 @@ jobs:
deploy:
docker:
- image: circleci/python:3.8
environment:
IMAGE_NAME: snarkai/hub
steps:
- checkout
- run:
@@ -89,6 +98,18 @@ jobs:
name: "Upload dist to PyPi"
command: |
twine upload dist/*
- run:
name: "Build Docker Hub Image"
command: |
docker build -t $IMAGE_NAME:latest .
- run:
name: "Deploy to Docker Hub"
command: |
echo "$DOCKER_HUB_PASSWORD" | docker login -u "$DOCKER_HUB_USERNAME" --password-stdin
IMAGE_TAG=${CIRCLE_TAG}
docker tag $IMAGE_NAME:latest $IMAGE_NAME:$IMAGE_TAG
docker push $IMAGE_NAME:latest
docker push $IMAGE_NAME:$IMAGE_TAG
- slack/status:
fail_only: true
webhook: $SLACK_WEBHOOK
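The tagging logic in the new deploy step is plain shell parameter expansion and can be dry-run locally without touching Docker — a minimal sketch (the tag value here is illustrative; in CI it comes from `CIRCLE_TAG`):

```shell
# Mirror of the deploy step's tag resolution, without calling docker
IMAGE_NAME=snarkai/hub
CIRCLE_TAG=v1.2.0          # supplied by CircleCI on tagged builds

IMAGE_TAG=${CIRCLE_TAG}
echo "$IMAGE_NAME:$IMAGE_TAG"
```

On a tagged build this yields the versioned tag pushed alongside `latest`.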
12 changes: 12 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,12 @@
name: Lint

on: [push, pull_request]

jobs:
build:
runs-on: ubuntu-20.04

steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: pre-commit/action@v2.0.0
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,6 @@
repos:
- repo: https://github.com/psf/black
rev: 20.8b1
hooks:
- id: black
args: ["--target-version", "py36"]
76 changes: 76 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at . All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -15,14 +15,14 @@ To contribute a feature/fix:
## How you can help

* Adding new datasets following [this example](https://docs.activeloop.ai/en/latest/concepts/dataset.html#how-to-upload-a-dataset)
* Fix an issue from Github Issues
* Fix an issue from GitHub Issues
* Add a feature. For an extended feature please create an issue to discuss.


## Formatting and Linting
Hub uses Black and Flake8 to ensure a consistent code format throughout the project.
If you are using VSCode, replace `.vscode/settings.json` content with the following:
```
```json
{
"[py]": {
"editor.formatOnSave": true
18 changes: 16 additions & 2 deletions README.md
@@ -9,17 +9,26 @@
</a>
<a href="https://pypi.org/project/hub/"><img src="https://badge.fury.io/py/hub.svg" alt="PyPI version" height="18"></a>
<a href="https://pypi.org/project/hub/"><img src="https://img.shields.io/pypi/dm/hub.svg" alt="PyPI downloads" height="18"></a>
<a href="https://app.circleci.com/pipelines/github/activeloopai/Hub">
<img alt="CircleCI" src="https://img.shields.io/circleci/build/github/activeloopai/Hub?logo=circleci">
</a>
<a href="https://codecov.io/gh/activeloopai/Hub/branch/master"><img src="https://codecov.io/gh/activeloopai/Hub/branch/master/graph/badge.svg" alt="codecov" height="18"></a>
<a href="https://twitter.com/intent/tweet?text=The%20fastest%20way%20to%20access%20and%20manage%20PyTorch%20and%20Tensorflow%20datasets%20is%20open-source&url=https://activeloop.ai/&via=activeloopai&hashtags=opensource,pytorch,tensorflow,data,datascience,datapipelines,sqlforimages,activeloop">
<img alt="tweet" src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social">
</a>
<br />
<a href="https://join.slack.com/t/hubdb/shared_invite/zt-ivhsj8sz-GWv9c5FLBDVw8vn~sxRKqQ">
<img src="https://user-images.githubusercontent.com/13848158/97266254-9532b000-1841-11eb-8b06-ed73e99c2e5f.png" height="35" />
</a>

</p>

<h3 align="center">
The fastest way to access and manage datasets for PyTorch and TensorFlow
</h3>

Hub provides fast access to the state-of-the-art datasets for Deep Learning, enabling data scientists to manage them, build scalable data pipelines and connect to Pytorch and Tensorflow
Hub provides the fastest access to the state-of-the-art datasets for Deep Learning, enabling data scientists to manage them, build scalable data pipelines and connect to Pytorch and Tensorflow.


### Contributors
@@ -97,6 +106,11 @@ import hub

ds = hub.load("username/basic")
```
### Look at Hub in action on Google Colab
- MNIST Classification with Hub and PyTorch
&nbsp;
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LUeZG20A4X4WZX2AYHdI4F6InG6Jb51i?usp=sharing)

For more advanced data pipelines like uploading large datasets or applying many transformations, please see [docs](http://docs.activeloop.ai).

## Things you can do with Hub
48 changes: 48 additions & 0 deletions docs/source/concepts/tensor.md
@@ -0,0 +1,48 @@
# Tensor

Hub Tensors are scalable NumPy-like arrays stored on the cloud and accessible over the internet as if they were local NumPy arrays. Their chunked structure makes interacting with them fast.

A Tensor represents a single array containing a homogeneous data type. It could contain a list of text files, audio, images, or video data. The first dimension represents the batch dimension.

One can specify a `dtag` for the element to describe its nature.


## Initialize
You can initialize a tensor like this and get the first element.
```python
from hub import tensor

t = tensor.from_zeros((10, 512, 512), dtype="uint8")
t[0].compute()
```

You can also initialize the tensor object from a numpy array.

```python
import numpy as np
from hub import tensor

t = tensor.from_array(np.zeros((10, 512, 512)))
```


## Concat or Stack

Concatenating or stacking tensors works as in other frameworks.

```python
from hub import tensor

t1 = tensor.from_zeros((10, 512, 512), dtype="uint8")
t2 = tensor.from_zeros((20, 512, 512), dtype="uint8")
tensors = [t1, t2]

tensor.concat(tensors, axis=0, chunksize=-1)
tensor.stack(tensors, axis=0, chunksize=-1)
```

## API
```eval_rst
.. autoclass:: hub.dataset.Tensor
:members:
```
9 changes: 5 additions & 4 deletions docs/source/integrations/pytorch.md
@@ -1,9 +1,10 @@
# PyTorch

Here is an example to transform the dataset into pytorch form.
Here is an example to transform the dataset into Pytorch form.

```python
from hub import Dataset
import torch
from hub import dataset

# Create dataset
ds = Dataset(
@@ -16,8 +17,8 @@ ds = Dataset(
},
)

# Load to pytorch
ds = ds.to_pytorch()
# Transform into Pytorch
ds = ds.to_pytorch(transform=None)
ds = torch.utils.data.DataLoader(
ds,
batch_size=8,
4 changes: 2 additions & 2 deletions docs/source/integrations/tensorflow.md
@@ -1,6 +1,6 @@
# Tensorflow

Here is an example to transform the dataset into tensorflow form.
Here is an example to transform the dataset into Tensorflow form.

```python
from hub import Dataset
@@ -15,7 +15,7 @@ ds = Dataset(
},
)

# tansform into Tensorflow dataset
# transform into Tensorflow dataset
ds = ds.to_tensorflow().batch(8)

# Iterate over the data
4 changes: 2 additions & 2 deletions docs/source/storage/tutorials.md
@@ -71,7 +71,7 @@ ds["input", 1:3] = np.ones((2, 25, 25))
```

## Idea of chunking
Chunks are the most important part of Hub arrays. Imagine that you have a really large array stored in the cloud and want to access only some significantly smaller part of it. Let us say you have an array of 100000 images with shape ```(100000, 1024, 1024, 3)```. If we stored this array wholly without dividing into multiple chunks then in order to request only few images from it we would need to load the entire array into RAM which would be impossible and even if some computer would have that big RAM, downloading the whole array would take a lot of time. Instead we store the array in chunks and we only downlaod the chunks that contain the requested part of the array.
Chunks are the most important part of Hub arrays. Imagine that you have a really large array stored in the cloud and want to access only some significantly smaller part of it. Let us say you have an array of 100000 images with shape ```(100000, 1024, 1024, 3)```. If we stored this array wholly without dividing into multiple chunks then in order to request only few images from it we would need to load the entire array into RAM which would be impossible and even if some computer would have that big RAM, downloading the whole array would take a lot of time. Instead we store the array in chunks and we only download the chunks that contain the requested part of the array.

## How to choose a proper chunk size
Choosing a proper chunk size is crucial for performance. Chunks must be big enough that downloading one takes much longer than the ~1 ms overhead of a request to the cloud, and small enough that multiple chunks fit into RAM. Usually, we can have up to 1 chunk per thread.
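The chunk-selection idea described above can be sketched in a few lines — a hypothetical helper for chunking along the first axis, not Hub's actual implementation:

```python
import math

def chunks_for_slice(start, stop, chunk_size):
    """Return the indices of chunks that overlap the half-open range [start, stop)."""
    first = start // chunk_size
    last = math.ceil(stop / chunk_size)
    return list(range(first, last))

# Requesting images 100:130 from an array chunked every 16 images along
# the first axis touches only chunks 6-8, not all 100000 images.
print(chunks_for_slice(100, 130, 16))  # → [6, 7, 8]
```

Only the overlapping chunks are downloaded; everything else stays in the cloud.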
@@ -91,7 +91,7 @@ Compresslevel is a float number from 0 to 1. Where 1 is the fastest and 0 is the smallest size,
You can easily find information about all of our supported compressors, their effectiveness, and performance on the internet.
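To see the speed-versus-size trade-off concretely, here is a small sketch using Python's built-in `zlib` (integer levels 1-9, not Hub's float compresslevel API):

```python
import zlib

# Highly repetitive payload, standing in for redundant image-like data
data = bytes(range(256)) * 1000

fast = zlib.compress(data, 1)   # level 1: fastest, larger output
small = zlib.compress(data, 9)  # level 9: slowest, smallest output

assert zlib.decompress(small) == data  # compression is lossless
print(len(data), len(fast), len(small))
```

Higher levels spend more CPU searching for matches, which is why write throughput drops as the output shrinks.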

## Integration with Pytorch and TensorFlow
Hub datasets can easily be transformed into Pytoch and Tensorflow formats.
Hub datasets can easily be transformed into Pytorch and Tensorflow formats.
Pytorch:
```python
datahub = hub.fs("./data/cache").connect()
2 changes: 1 addition & 1 deletion docs/source/tutorials/pytorch.md
@@ -1,6 +1,6 @@
# Pytorch Integration

In this tutorial we will retreive our dataset from the local cache and integrate it with `Pytorch` for further use.
In this tutorial we will retrieve our dataset from the local cache and integrate it with `Pytorch` for further use.

For a detailed guide on dataset generation and storage see [this tutorial](samples.md).

2 changes: 1 addition & 1 deletion docs/source/why.md
@@ -10,7 +10,7 @@ We realized that there are a few problems related with current workflow in deep
2. **Code dependency on local folder structure**. People use a folder structure to store images or videos. As a result, the data input pipeline has to take into consideration the raw folder structure which creates unnecessary & error-prone code dependency of the dataset folder structure.


3. **Managing preprocessing pipelines**. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training.But it’s usually hard to manage & version control the preprocessed images locally when there are multiple preprocessing pipelies and the dataset is very big.
3. **Managing preprocessing pipelines**. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training.But it’s usually hard to manage & version control the preprocessed images locally when there are multiple preprocessing pipelines and the dataset is very big.


4. **Visualization**. It's difficult to visualize the raw data or preprocessed dataset on servers.
7 changes: 7 additions & 0 deletions examples/load.py
@@ -0,0 +1,7 @@
import hub
from hub import Dataset

# ds = Dataset("s3://snark-hub-dev/public/davis/mnist-new")
path = "s3://snark-hub-dev/public/davis/mnist-new"

ds = hub.load(path)

