
Commit

merge and fix
davidbuniat committed Nov 30, 2020
2 parents 12a1007 + 34ba0d0 commit 49fdd3b
Showing 41 changed files with 1,529 additions and 635 deletions.
23 changes: 22 additions & 1 deletion .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -23,7 +23,9 @@ workflows:
branches:
ignore: /.*/
- deploy:
context: pypi
context:
- pypi
- snark-docker
requires:
- test
filters:
@@ -50,6 +52,11 @@ jobs:
pip install -r requirements-optional.txt
pip install -r requirements.txt
pip install -e .
- run:
name: "Checking code style"
command: |
pip install flake8
flake8 . --count --exit-zero --max-complexity=10 --statistics
- run:
name: "Running tests"
command: |
@@ -68,6 +75,8 @@ jobs:
deploy:
docker:
- image: circleci/python:3.8
environment:
IMAGE_NAME: snarkai/hub
steps:
- checkout
- run:
@@ -89,6 +98,18 @@ jobs:
name: "Upload dist to PyPi"
command: |
twine upload dist/*
- run:
name: "Build Docker Hub Image"
command: |
docker build -t $IMAGE_NAME:latest .
- run:
name: "Deploy to Docker Hub"
command: |
echo "$DOCKER_HUB_PASSWORD" | docker login -u "$DOCKER_HUB_USERNAME" --password-stdin
IMAGE_TAG=${CIRCLE_TAG}
docker tag $IMAGE_NAME:latest $IMAGE_NAME:$IMAGE_TAG
docker push $IMAGE_NAME:latest
docker push $IMAGE_NAME:$IMAGE_TAG
- slack/status:
fail_only: true
webhook: $SLACK_WEBHOOK
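The tagging logic in the new deploy step is plain shell parameter expansion and can be dry-run locally without touching Docker — a minimal sketch (the tag value here is illustrative; in CI it comes from `CIRCLE_TAG`):

```shell
# Mirror of the deploy step's tag resolution, without calling docker
IMAGE_NAME=snarkai/hub
CIRCLE_TAG=v1.2.0          # supplied by CircleCI on tagged builds

IMAGE_TAG=${CIRCLE_TAG}
echo "$IMAGE_NAME:$IMAGE_TAG"
```

On a tagged build this yields the versioned tag pushed alongside `latest`.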
12 changes: 12 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,12 @@
name: Lint

on: [push, pull_request]

jobs:
build:
runs-on: ubuntu-20.04

steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: pre-commit/action@v2.0.0
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,6 @@
repos:
- repo: https://github.com/psf/black
rev: 20.8b1
hooks:
- id: black
args: ["--target-version", "py36"]
76 changes: 76 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at . All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -15,14 +15,14 @@ To contribute a feature/fix:
## How you can help

* Adding new datasets following [this example](https://docs.activeloop.ai/en/latest/concepts/dataset.html#how-to-upload-a-dataset)
* Fix an issue from Github Issues
* Fix an issue from GitHub Issues
* Add a feature. For an extended feature please create an issue to discuss.


## Formatting and Linting
Hub uses Black and Flake8 to ensure a consistent code format throughout the project.
If you are using VSCode, replace `.vscode/settings.json` content with the following:
```
```json
{
"[py]": {
"editor.formatOnSave": true
18 changes: 16 additions & 2 deletions README.md
@@ -9,17 +9,26 @@
</a>
<a href="https://pypi.org/project/hub/"><img src="https://badge.fury.io/py/hub.svg" alt="PyPI version" height="18"></a>
<a href="https://pypi.org/project/hub/"><img src="https://img.shields.io/pypi/dm/hub.svg" alt="PyPI downloads" height="18"></a>
<a href="https://app.circleci.com/pipelines/github/activeloopai/Hub">
<img alt="CircleCI" src="https://img.shields.io/circleci/build/github/activeloopai/Hub?logo=circleci">
</a>
<a href="https://codecov.io/gh/activeloopai/Hub/branch/master"><img src="https://codecov.io/gh/activeloopai/Hub/branch/master/graph/badge.svg" alt="codecov" height="18"></a>
<a href="https://twitter.com/intent/tweet?text=The%20fastest%20way%20to%20access%20and%20manage%20PyTorch%20and%20Tensorflow%20datasets%20is%20open-source&url=https://activeloop.ai/&via=activeloopai&hashtags=opensource,pytorch,tensorflow,data,datascience,datapipelines,sqlforimages,activeloop">
<img alt="tweet" src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social">
</a>
<br />
<a href="https://join.slack.com/t/hubdb/shared_invite/zt-ivhsj8sz-GWv9c5FLBDVw8vn~sxRKqQ">
<img src="https://user-images.githubusercontent.com/13848158/97266254-9532b000-1841-11eb-8b06-ed73e99c2e5f.png" height="35" />
</a>

</p>

<h3 align="center">
The fastest way to access and manage datasets for PyTorch and TensorFlow
</h3>

Hub provides fast access to the state-of-the-art datasets for Deep Learning, enabling data scientists to manage them, build scalable data pipelines and connect to Pytorch and Tensorflow
Hub provides the fastest access to the state-of-the-art datasets for Deep Learning, enabling data scientists to manage them, build scalable data pipelines and connect to Pytorch and Tensorflow.


### Contributors
@@ -97,6 +106,11 @@ import hub

ds = hub.load("username/basic")
```
### Look at Hub in action on Google Colab
- MNIST Classification with Hub and PyTorch
&nbsp;
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LUeZG20A4X4WZX2AYHdI4F6InG6Jb51i?usp=sharing)

For more advanced data pipelines like uploading large datasets or applying many transformations, please see [docs](http://docs.activeloop.ai).

## Things you can do with Hub
48 changes: 48 additions & 0 deletions docs/source/concepts/tensor.md
@@ -0,0 +1,48 @@
# Tensor

Hub Tensors are scalable NumPy-like arrays stored on the cloud and accessible over the internet as if they were local NumPy arrays. Their chunked structure makes interacting with them fast.

A Tensor represents a single array containing a homogeneous data type. It could contain a list of text files, audio, images, or video data. The first dimension represents the batch dimension.

One can specify a `dtag` for the element to describe its nature.


## Initialize
You can initialize a tensor like this and get the first element.
```python
from hub import tensor

t = tensor.from_zeros((10, 512, 512), dtype="uint8")
t[0].compute()
```

You can also initialize the tensor object from a numpy array.

```python
import numpy as np
from hub import tensor

t = tensor.from_array(np.zeros((10, 512, 512)))
```


## Concat or Stack

Concatenating or stacking tensors works as in other frameworks.

```python
from hub import tensor

t1 = tensor.from_zeros((10, 512, 512), dtype="uint8")
t2 = tensor.from_zeros((20, 512, 512), dtype="uint8")
tensors = [t1, t2]

tensor.concat(tensors, axis=0, chunksize=-1)
tensor.stack(tensors, axis=0, chunksize=-1)
```

## API
```eval_rst
.. autoclass:: hub.dataset.Tensor
:members:
```
9 changes: 5 additions & 4 deletions docs/source/integrations/pytorch.md
@@ -1,9 +1,10 @@
# PyTorch

Here is an example to transform the dataset into pytorch form.
Here is an example to transform the dataset into Pytorch form.

```python
from hub import Dataset
import torch
from hub import dataset

# Create dataset
ds = Dataset(
@@ -16,8 +17,8 @@ ds = Dataset(
},
)

# Load to pytorch
ds = ds.to_pytorch()
# Transform into Pytorch
ds = ds.to_pytorch(transform=None)
ds = torch.utils.data.DataLoader(
ds,
batch_size=8,
4 changes: 2 additions & 2 deletions docs/source/integrations/tensorflow.md
@@ -1,6 +1,6 @@
# Tensorflow

Here is an example to transform the dataset into tensorflow form.
Here is an example to transform the dataset into Tensorflow form.

```python
from hub import Dataset
@@ -15,7 +15,7 @@ ds = Dataset(
},
)

# tansform into Tensorflow dataset
# transform into Tensorflow dataset
ds = ds.to_tensorflow().batch(8)

# Iterate over the data
4 changes: 2 additions & 2 deletions docs/source/storage/tutorials.md
@@ -71,7 +71,7 @@ ds["input", 1:3] = np.ones((2, 25, 25))
```

## Idea of chunking
Chunks are the most important part of Hub arrays. Imagine that you have a really large array stored in the cloud and want to access only some significantly smaller part of it. Let us say you have an array of 100000 images with shape ```(100000, 1024, 1024, 3)```. If we stored this array wholly without dividing into multiple chunks then in order to request only few images from it we would need to load the entire array into RAM which would be impossible and even if some computer would have that big RAM, downloading the whole array would take a lot of time. Instead we store the array in chunks and we only downlaod the chunks that contain the requested part of the array.
Chunks are the most important part of Hub arrays. Imagine that you have a really large array stored in the cloud and want to access only some significantly smaller part of it. Let us say you have an array of 100000 images with shape ```(100000, 1024, 1024, 3)```. If we stored this array wholly without dividing into multiple chunks then in order to request only few images from it we would need to load the entire array into RAM which would be impossible and even if some computer would have that big RAM, downloading the whole array would take a lot of time. Instead we store the array in chunks and we only download the chunks that contain the requested part of the array.

## How to choose a proper chunk size
Choosing a proper chunk size is crucial for performance. Chunks must be big enough that downloading one takes much longer than the ~1 ms overhead of a request to the cloud, and small enough that multiple chunks fit into RAM. Usually, we can have up to 1 chunk per thread.
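The chunk-selection idea described above can be sketched in a few lines — a hypothetical helper for chunking along the first axis, not Hub's actual implementation:

```python
import math

def chunks_for_slice(start, stop, chunk_size):
    """Return the indices of chunks that overlap the half-open range [start, stop)."""
    first = start // chunk_size
    last = math.ceil(stop / chunk_size)
    return list(range(first, last))

# Requesting images 100:130 from an array chunked every 16 images along
# the first axis touches only chunks 6-8, not all 100000 images.
print(chunks_for_slice(100, 130, 16))  # → [6, 7, 8]
```

Only the overlapping chunks are downloaded; everything else stays in the cloud.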
@@ -91,7 +91,7 @@ Compresslevel is a float number from 0 to 1. Where 1 is the fastest and 0 is the smallest size,
You can easily find information about all of our supported compressors, their effectiveness, and performance on the internet.
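To see the speed-versus-size trade-off concretely, here is a small sketch using Python's built-in `zlib` (integer levels 1-9, not Hub's float compresslevel API):

```python
import zlib

# Highly repetitive payload, standing in for redundant image-like data
data = bytes(range(256)) * 1000

fast = zlib.compress(data, 1)   # level 1: fastest, larger output
small = zlib.compress(data, 9)  # level 9: slowest, smallest output

assert zlib.decompress(small) == data  # compression is lossless
print(len(data), len(fast), len(small))
```

Higher levels spend more CPU searching for matches, which is why write throughput drops as the output shrinks.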

## Integration with Pytorch and TensorFlow
Hub datasets can easily be transformed into Pytoch and Tensorflow formats.
Hub datasets can easily be transformed into Pytorch and Tensorflow formats.
Pytorch:
```python
datahub = hub.fs("./data/cache").connect()
2 changes: 1 addition & 1 deletion docs/source/tutorials/pytorch.md
@@ -1,6 +1,6 @@
# Pytorch Integration

In this tutorial we will retreive our dataset from the local cache and integrate it with `Pytorch` for further use.
In this tutorial we will retrieve our dataset from the local cache and integrate it with `Pytorch` for further use.

For a detailed guide on dataset generation and storage see [this tutorial](samples.md).

2 changes: 1 addition & 1 deletion docs/source/why.md
@@ -10,7 +10,7 @@ We realized that there are a few problems related with current workflow in deep
2. **Code dependency on local folder structure**. People use a folder structure to store images or videos. As a result, the data input pipeline has to take into consideration the raw folder structure which creates unnecessary & error-prone code dependency of the dataset folder structure.


3. **Managing preprocessing pipelines**. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training.But it’s usually hard to manage & version control the preprocessed images locally when there are multiple preprocessing pipelies and the dataset is very big.
3. **Managing preprocessing pipelines**. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training.But it’s usually hard to manage & version control the preprocessed images locally when there are multiple preprocessing pipelines and the dataset is very big.


4. **Visualization**. It's difficult to visualize the raw data or preprocessed dataset on servers.
7 changes: 7 additions & 0 deletions examples/load.py
@@ -0,0 +1,7 @@
import hub
from hub import Dataset

# ds = Dataset("s3://snark-hub-dev/public/davis/mnist-new")
path = "s3://snark-hub-dev/public/davis/mnist-new"

ds = hub.load(path)

