Add README.md & Refactor

Aliang-CN · Jun 2, 2019 · 8046083
1 parent 1a927f7
commit 8046083
Showing 12 changed files with 5,537 additions and 10 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,129 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+.vscode
+data/
+runs/
+src/__pycache__
diff --git a/README.md b/README.md
@@ -0,0 +1,86 @@
+# GATNE
+
+### [Project](https://sites.google.com/view/gatne) | [Arxiv](https://arxiv.org/abs/1905.01669)
+
+Representation Learning for Attributed Multiplex Heterogeneous Network.
+
+[Yukuo Cen](https://sites.google.com/view/yukuocen), Xu Zou, Jianwei Zhang, [Hongxia Yang](https://sites.google.com/site/hystatistics/home), [Jingren Zhou](http://www.cs.columbia.edu/~jrzhou/), [Jie Tang](http://keg.cs.tsinghua.edu.cn/jietang/)
+
+Accepted to KDD 2019 Research Track!
+
+## Prerequisites
+
+- Linux or macOS
+- Python 3
+- TensorFlow >= 1.8
+- NVIDIA GPU + CUDA cuDNN
+
+## Getting Started
+
+### Installation
+
+Clone this repo.
+
+```bash
+git clone https://github.com/THUDM/GATNE
+cd GATNE
+```
+
+Install the dependencies with:
+
+```bash
+pip install -r requirements.txt
+```
+
+### Dataset
+
+These datasets are sampled from the original datasets.
+
+- Amazon contains 10,166 nodes and 148,865 edges. [Source](http://jmcauley.ucsd.edu/data/amazon)
+- Twitter contains 10,000 nodes and 331,899 edges. [Source](https://snap.stanford.edu/data/higgs-twitter.html)
+- YouTube contains 2,000 nodes and 1,310,617 edges. [Source](http://socialcomputing.asu.edu/datasets/YouTube)
+- Alibaba contains 6,163 nodes and 17,865 edges.
+
+You can download the preprocessed datasets by running `python scripts/download_preprocessed_data.py`. (The Alibaba dataset is yet to be released.)
+If you are in a region where Dropbox is blocked (e.g. Mainland China), try `python scripts/download_preprocessed_data.py --cn`.
+
+### Training
+
+#### Training on the existing datasets
+
+You can use `./scripts/run_example.sh` or `python src/main.py --input example_data` to train the GATNE-T model on the example data. (If you share the server with others or want to use specific GPUs, you may need to set `CUDA_VISIBLE_DEVICES`.)
+
+To train on the Amazon dataset, run `python src/main.py --input data/amazon` to train the GATNE-T model, or `python src/main.py --input data/amazon --features data/feature.txt` to train the GATNE-I model.
+
+You can use the following commands to train GATNE-T on the Twitter and YouTube datasets. On the Twitter dataset, we evaluate only the edges of the first edge type, because the other edge types have too few edges.
+`python src/main.py --input data/twitter --eval-type 1`
+`python src/main.py --input data/youtube`
+
+As the Twitter and YouTube datasets do not have node attributes, you can generate heuristic features for them, such as DeepWalk embeddings. You can then train the GATNE-I model on these two datasets by adding the `--features` argument.
+
+#### Training on your own datasets
+
+If you want to train GATNE-T/I on your own dataset, you should prepare the following three (or four) files:
+- train.txt: Each line represents an edge and contains three tokens `<edge_type> <node1> <node2>`, where each token can be either a number or a string.
+- valid.txt: Each line represents an edge or a non-edge and contains four tokens `<edge_type> <node1> <node2> <label>`, where `<label>` is either 1 or 0, denoting an edge or a non-edge.
+- test.txt: The same format as valid.txt.
+- feature.txt (optional): The first line contains two numbers `<num> <dim>`, representing the number of nodes and the feature dimension. From the second line on, each line describes the features of one node, i.e., `<node> <f_1> <f_2> ... <f_dim>`.
+
+If your dataset contains several node types and you want to use meta-path based random walk, you should also provide an additional file as follows:
+- node_type.txt: Each line contains two tokens `<node> <node_type>`, where `<node_type>` should be consistent with the meta-path schema in the training command, i.e., `--schema node_type_1-node_type_2-...-node_type_k-node_type_1`. (Note that the first node type in the schema should equal the last node type.)
+
+
+If you have any difficulty getting things working in the steps above, feel free to open an issue. You can expect a reply within 24 hours.
+
+## Cite
+
+Please cite our paper if you find this code useful for your research:
+
+```
+@article{cen2019representation,
+  title={Representation Learning for Attributed Multiplex Heterogeneous Network},
+  author={Cen, Yukuo and Zou, Xu and Zhang, Jianwei and Yang, Hongxia and Zhou, Jingren and Tang, Jie},
+  journal={arXiv preprint arXiv:1905.01669},
+  year={2019}
+}
+```
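Note on the dataset formats described in the README above: before training on your own data, it can help to check the files against the stated formats. The following is a minimal sketch, assuming whitespace-separated tokens. The helper names `parse_edge_line`, `parse_features`, and `check_schema` are hypothetical and are not part of the GATNE code base.

```python
from typing import Dict, Iterable, List, Tuple


def parse_edge_line(line: str, with_label: bool = False) -> Tuple:
    """Parse one line of train.txt (3 tokens) or valid.txt/test.txt (4 tokens)."""
    tokens = line.split()
    expected = 4 if with_label else 3
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} tokens, got {len(tokens)}: {line!r}")
    if not with_label:
        edge_type, node1, node2 = tokens
        return edge_type, node1, node2
    edge_type, node1, node2, label = tokens
    if label not in ("0", "1"):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return edge_type, node1, node2, int(label)


def parse_features(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse feature.txt: a '<num> <dim>' header, then '<node> <f_1> ... <f_dim>'."""
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows or len(rows[0]) != 2:
        raise ValueError("first line must be '<num> <dim>'")
    num, dim = int(rows[0][0]), int(rows[0][1])
    features: Dict[str, List[float]] = {}
    for tokens in rows[1:]:
        if len(tokens) != dim + 1:
            raise ValueError(f"node {tokens[0]!r}: expected {dim} feature values")
        features[tokens[0]] = [float(x) for x in tokens[1:]]
    if len(features) != num:
        raise ValueError(f"header declares {num} nodes, found {len(features)}")
    return features


def check_schema(schema: str) -> List[str]:
    """Split a meta-path schema and check that it starts and ends on the same type."""
    node_types = schema.split("-")
    if len(node_types) < 2 or node_types[0] != node_types[-1]:
        raise ValueError("first and last node type in the schema must match")
    return node_types
```

For example, `parse_edge_line("1 a b 0", with_label=True)` accepts a valid.txt line, while `check_schema("U-I")` raises an error because the schema does not return to its starting node type.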