gtools-0.9.0 (2017-11-01); OSX version, gcontract, gtoplevelsof
Features

- The plugin now works on OSX
- `gcontract` is a fast alternative to `contract`
- `gtoplevelsof` is a new command that allows the user to glean the most
  common levels of a set of variables.  Similar to `gcontract` with a `gsort`
  of frequency by descending order thereafter, but `gtoplevelsof` does not
  modify the source data and saves a matrix with the results after its run.
- `gdistinct` now saves its results to a matrix when there are multiple
  variables.
- Improved and normalized documentation

Bug fixes

- OSX version; fixes #11
- `gisid` now silent w/o benchmark or verbose; fixes #20
- Added quotes to `cd cwd` in `gtools`; fixes #22
- `gcontract` available; fixes #23
mcaceresb committed Nov 1, 2017
1 parent cc1f5e9 commit 8ce741a
Showing 84 changed files with 7,451 additions and 8,045 deletions.
2 changes: 1 addition & 1 deletion .appveyor.yml
@@ -1,4 +1,4 @@
version: "generic-0.1.0-{build}"
version: "generic-0.2.0-{build}"

environment:
matrix:
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
releases
testing
doc/mkdocs/site/
147 changes: 93 additions & 54 deletions README.md
@@ -11,7 +11,7 @@ _Gtools_: Faster Stata for big data. This package provides a hash-based
implementation of collapse, egen, isid, levelsof, and unique/distinct using C
plugins for a massive speed improvement.

`version 0.8.4 29Oct2017`
`version 0.9.0 31Oct2017`
Builds: Linux [![Travis Build Status](https://travis-ci.org/mcaceresb/stata-gtools.svg?branch=develop)](https://travis-ci.org/mcaceresb/stata-gtools),
Windows (Cygwin) [![Appveyor Build status](https://ci.appveyor.com/api/projects/status/2bh1q9bulx3pl81p/branch/develop?svg=true)](https://ci.appveyor.com/project/mcaceresb/stata-gtools)

@@ -21,31 +21,32 @@ Faster Stata for Group Operations
This package's aim is to provide a fast implementation of group commands in
Stata using hashes and C plugins. This includes (benchmarked using Stata/IC):

| Function | Replaces | Speedup (IC) | Unsupported | Extras |
| ----------- | --------------- | ----------------- | --------------- | -------------------------------- |
| `gcollapse` | `collapse` | 9 to 300 (+) | Weights | Quantiles, `merge`, label output |
| `gegen` | `egen` | 9 to 26 (+, .) | Weights, labels | Quantiles |
| `gisid` | `isid` | 8 to 30 | `using`, `sort` | `if`, `in` |
| `glevelsof` | `levelsof` | 3 to 13 | | Multiple variables |
| `gunique` | `unique` | 4 to 26 | `by` | |
| `gdistinct` | `distinct` | 4 to 26 | | |
| Function | Replaces | Speedup (IC) | Unsupported | Extras |
| -------------- | --------------- | ----------------- | --------------- | -------------------------------- |
| `gcollapse` | `collapse` | 9 to 300 (+) | Weights | Quantiles, `merge`, label output |
| `gcontract` | `contract` | 5 to 7 | Weights | |
| `gegen` | `egen` | 9 to 26 (+, .) | Weights, labels | Quantiles |
| `gisid` | `isid` | 8 to 30 | `using`, `sort` | `if`, `in` |
| `glevelsof` | `levelsof` | 3 to 13 | | Multiple variables |
| `gunique` | `unique` | 4 to 26 | `by` | |
| `gdistinct` | `distinct` | 4 to 26 | | |
| `gtoplevelsof` | | | | |

<small>Commands were benchmarked on a Linux laptop with Stata/IC; gains in Stata/MP are smaller.</small>

<small>(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups.</small>

<small>(.) Only `egen group` was benchmarked rigorously.</small>

In addition, all commands take gsort-style input, that is
In addition, most commands take gsort-style input, that is

```
[+|-]varname [[+|-]varname ...]
```

This often does not matter (e.g. gegen summary stats, gisid, gunique) but it
saves a second sort in other places (e.g. gcollapse, gegen group, glevelsof).
If you plan to use the plugin extensively, check out the [FAQs](#faqs) for
caveats and details on the plugin.
`gisid`, `gunique`, and `gdistinct` are exceptions because the order does not
matter for those commands. If you plan to use the plugin extensively, check
out the [FAQs](#faqs) for caveats and details on the plugin.
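
To illustrate the gsort-style syntax above, here is a minimal Python sketch that splits such a string into variable names and sort directions. It is not the parser gtools uses; `parse_gsort` is a hypothetical helper, the variable names are made up for the example, and the sketch assumes each sign is attached directly to its variable name, as in the notation above.

```python
def parse_gsort(spec):
    """Split a gsort-style string into (varname, descending) pairs.

    Hypothetical helper for illustration only; assumes any sign is
    attached to the variable name (e.g. "-price", not "- price").
    """
    parsed = []
    for token in spec.split():
        if token[0] in "+-":
            sign, name = token[0], token[1:]
        else:
            # No sign means ascending order, the same as "+"
            sign, name = "+", token
        if not name:
            raise ValueError(f"missing variable name in {token!r}")
        parsed.append((name, sign == "-"))
    return parsed


print(parse_gsort("-price +mpg foreign"))
```

Each pair records whether that variable should be sorted in descending order, so a single pass over the list is enough to set up a mixed ascending and descending sort.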

### Hashing

@@ -73,11 +74,13 @@ sorting the groups, copying a sort index back to Stata, and having Stata do
the final swaps. The plugin runs fast, but the copy overhead plus the Stata
swaps often make the function slower than Stata's native `sort`.

By contrast, Stata's `gsort` is not efficient. To sort data, you need to make
pair-wise comparisons. For real numbers, this is just `a > b`. However, a generic
comparison function can be written as `compare(a, b) > 0`. This is true if a
is greater than b and false otherwise. To invert the sort order, one need only
use `compare(b, a) > 0`, which is what gtools does internally.
The other functions are faster because they avoid all of that overhead.
Stata's `gsort`, by contrast, is not efficient. To sort data, you need to make
pair-wise comparisons. For real numbers, this is just `a > b`. However, a
generic comparison function can be written as `compare(a, b) > 0`. This is
true if a is greater than b and false otherwise. To invert the sort order, one
need only use `compare(b, a) > 0`, which is what gtools does internally.

However, Stata creates a variable that is the inverse of the sort variable.
This is equivalent, but the overhead makes it slower than `hashsort`.
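
The comparison trick described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's C code: `compare` and `sort_values` are hypothetical names, and the last line is a loose analogue of building an inverted copy of the sort variable.

```python
from functools import cmp_to_key


def compare(a, b):
    """Generic three-way comparison: positive if a > b, negative if a < b."""
    return (a > b) - (a < b)


def sort_values(values, descending=False):
    """Sort with compare(a, b); swap the arguments to invert the order."""
    if descending:
        key = cmp_to_key(lambda a, b: compare(b, a))
    else:
        key = cmp_to_key(compare)
    return sorted(values, key=key)


values = [3, 1, 2]
print(sort_values(values))                    # ascending
print(sort_values(values, descending=True))   # descending, no extra copy

# Loose analogue of the "inverse variable" approach: build a negated
# copy of the data first, then sort on that copy in ascending order.
negated = [-v for v in values]
print([-v for v in sorted(negated)])
```

Swapping the arguments costs nothing extra per comparison, whereas the negated copy needs an additional pass over the data and additional memory before the sort even starts.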
@@ -385,41 +388,60 @@ Very complex stats, one variable:
We benchmark `gegen id = group(varlist)` vs egen and fegen, obs = 10,000,000,
J = 10,000 (in seconds)

| egen | fegen | gegen | ratio (e/g) | ratio (f/g) | varlist
| ---- | ----- | ----- | ----------- | ----------- | -------
| 22.2 | 4.1 | 1.14 | 19.4 | 3.6 | str_12
| 21.6 | 5.96 | 1.59 | 13.5 | 3.74 | str_12 str_32
| 23 | 7.31 | 1.95 | 11.8 | 3.74 | str_12 str_32 str_4
| 18.4 | 2.94 | .813 | 22.6 | 3.61 | double1
| 18.4 | 3.24 | .883 | 20.9 | 3.67 | double1 double2
| 19.1 | 3.36 | .945 | 20.2 | 3.56 | double1 double2 double3
| 16.6 | 1.84 | .634 | 26.2 | 2.91 | int1
| 18.3 | 2.05 | .735 | 24.9 | 2.79 | int1 int2
| 19.6 | 2.53 | .895 | 21.9 | 2.83 | int1 int2 int3
| 20.2 | . | 1.51 | 13.4 | . | int1 str_32 double1
| 22 | . | 2.07 | 10.6 | . | int1 str_32 double1 int2 str_12 double2
| 24.1 | . | 2.61 | 9.24 | . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
| egen | fegen | gegen | ratio (e/g) | ratio (f/g) | varlist
| ---- | ----- | ----- | ----------- | ----------- | -------
| 22.2 | 4.1 | 1.14 | 19.4 | 3.6 | str_12
| 21.6 | 5.96 | 1.59 | 13.5 | 3.74 | str_12 str_32
| 23 | 7.31 | 1.95 | 11.8 | 3.74 | str_12 str_32 str_4
| 18.4 | 2.94 | .813 | 22.6 | 3.61 | double1
| 18.4 | 3.24 | .883 | 20.9 | 3.67 | double1 double2
| 19.1 | 3.36 | .945 | 20.2 | 3.56 | double1 double2 double3
| 16.6 | 1.84 | .634 | 26.2 | 2.91 | int1
| 18.3 | 2.05 | .735 | 24.9 | 2.79 | int1 int2
| 19.6 | 2.53 | .895 | 21.9 | 2.83 | int1 int2 int3
| 20.2 | . | 1.51 | 13.4 | . | int1 str_32 double1
| 22 | . | 2.07 | 10.6 | . | int1 str_32 double1 int2 str_12 double2
| 24.1 | . | 2.61 | 9.24 | . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3

`gegen` is ~9-26 times faster than `egen` and ~2.5-4 times faster than `fegen`.

### `contract`

Benchmark vs contract, obs = 10,000,000, J = 10,000 (in seconds).

| contract | gcontract | ratio (c/g) | varlist
| -------- | --------- | ----------- | -------
| 15.9 | 2.36 | 6.75 | str_12
| 16.4 | 3.16 | 5.2 | str_12 str_32
| 18 | 3.29 | 5.46 | str_12 str_32 str_4
| 13.9 | 1.95 | 7.14 | double1
| 14.1 | 2.09 | 6.76 | double1 double2
| 14.1 | 2.28 | 6.19 | double1 double2 double3
| 12.3 | 1.83 | 6.69 | int1
| 13.8 | 2 | 6.88 | int1 int2
| 15.2 | 2.21 | 6.88 | int1 int2 int3
| 15.3 | 2.89 | 5.31 | int1 str_32 double1
| 17 | 3.82 | 4.45 | int1 str_32 double1 int2 str_12 double2
| 19.4 | 4.07 | 4.76 | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3

### `isid`

Benchmark vs isid, obs = 10,000,000; all calls include an index to ensure uniqueness.

| isid | fisid | gisid | ratio (i/g) | ratio (f/g) | varlist
| ---- | ----- | ----- | ----------- | ----------- | -------
| 37.8 | 24.6 | 2.24 | 16.9 | 11 | str_12
| 41.5 | 29.9 | 2.4 | 17.3 | 12.5 | str_12 str_32
| 44.8 | 34 | 2.75 | 16.3 | 12.4 | str_12 str_32 str_4
| 30.4 | 14.3 | 1.86 | 16.4 | 7.72 | double1
| 31.6 | 14.9 | 1.95 | 16.2 | 7.63 | double1 double2
| 32.7 | 15.1 | 2.01 | 16.3 | 7.49 | double1 double2 double3
| 31.3 | 14.5 | 1.04 | 30.1 | 13.9 | int1
| 32.6 | 15.1 | 1.25 | 26.1 | 12.1 | int1 int2
| 34.1 | 15.4 | 2.04 | 16.7 | 7.57 | int1 int2 int3
| 38.5 | . | 2.35 | 16.4 | . | int1 str_32 double1
| 45 | . | 2.91 | 15.4 | . | int1 str_32 double1 int2 str_12 double2
| 51 | . | 3.29 | 15.5 | . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
| isid | fisid | gisid | ratio (i/g) | ratio (f/g) | varlist
| ---- | ----- | ----- | ----------- | ----------- | -------
| 37.8 | 24.6 | 2.24 | 16.9 | 11 | str_12
| 41.5 | 29.9 | 2.4 | 17.3 | 12.5 | str_12 str_32
| 44.8 | 34 | 2.75 | 16.3 | 12.4 | str_12 str_32 str_4
| 30.4 | 14.3 | 1.86 | 16.4 | 7.72 | double1
| 31.6 | 14.9 | 1.95 | 16.2 | 7.63 | double1 double2
| 32.7 | 15.1 | 2.01 | 16.3 | 7.49 | double1 double2 double3
| 31.3 | 14.5 | 1.04 | 30.1 | 13.9 | int1
| 32.6 | 15.1 | 1.25 | 26.1 | 12.1 | int1 int2
| 34.1 | 15.4 | 2.04 | 16.7 | 7.57 | int1 int2 int3
| 38.5 | . | 2.35 | 16.4 | . | int1 str_32 double1
| 45 | . | 2.91 | 15.4 | . | int1 str_32 double1 int2 str_12 double2
| 51 | . | 3.29 | 15.5 | . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3

Benchmark vs isid, obs = 10,000,000, J = 10,000 (in seconds)

@@ -850,7 +872,6 @@ fixed size.
In particular I use the [Spooky Hash](http://burtleburtle.net/bob/hash/spooky.html)
devised by Bob Jenkins, which is a 128-bit hash. Stata caps observations
at 20 billion or so, meaning a 128-bit hash collision is _de facto_ impossible.
Nevertheless, the function does check for hash collisions and will fall back
on `collapse` and `egen` when it encounters a collision. An internal
mechanism for resolving potential collisions is in the works. See [issue
2](https://github.com/mcaceresb/stata-gtools/issues/2) for a discussion.
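
To put a number on "_de facto_ impossible", the following Python sketch evaluates the standard birthday (union) bound on the probability of at least one collision. It assumes the hash output is uniformly distributed over 128 bits and takes the roughly 20 billion observation cap quoted above as the number of distinct keys; `collision_bound` is a hypothetical helper and is not part of gtools.

```python
from fractions import Fraction


def collision_bound(n_items, hash_bits):
    """Upper bound on P(at least one collision) among n_items hashes.

    Union bound over all pairs: (n choose 2) / 2**hash_bits, assuming
    the hash is uniform over hash_bits bits (an idealization).
    """
    pairs = n_items * (n_items - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


bound = collision_bound(20_000_000_000, 128)
print(float(bound))
```

Under these assumptions the bound is on the order of 1e-19, which is why the collision check is a safeguard and not a code path anyone should expect to hit in practice.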
@@ -898,17 +919,35 @@ overhead has been ~10% of the total runtime. If the user expects J to be
large, they can turn off this check via `forcemem`. If the user expects
J to be small, they can force collapsing to disk via `forceio`.

### TODO
### Roadmap to 1.0

- [ ] Comment ALL the code
- [ ] Write markdown documentation for the project
- [ ] Reduce the README to bare bones; point user to docs for more
- [ ] Introduction
- [ ] FAQs
- [ ] Have one subpage for each command and each of the following
- [ ] Documentation (options and basic usage)
- [ ] Examples (expansive examples showcasing all relevant options)
- [ ] Benchmarks
- [ ] Make sure sthlp documentation is normalized
- [X] Mention gtools in each command
- [X] Note gtools special commands in each help file
- [ ] Point user to FAQs and online docs
- [ ] After you've written the docs, update the sthlp files
- [ ] Add market to examples
- [ ] Have examples link to docs
- [ ] Improve coverage of debug checks.
- [ ] Have corner cases for ALL commands
- [ ] Test all the options in every command

### Ideas for improvements

- [ ] Add support for weights.
- [ ] Minimize memory use.
- [ ] Improve coverage of debug checks.
- [ ] Option `smart` to check if variables are sorted.
- [ ] Option `freq` to add obs count for each group.
- [ ] Option `greedy` to give user fine-grain control over gcollapse internals.
- [ ] Provide `sumup` and `sum` alternative, `gsum`.
- [ ] Add `gtab` as a fast version of `tabulate` with a `by` option.
- [ ] Also add functionality from `tabcustom`.
- [ ] Add support for weights.
- [ ] Add `Var`, `kurtosis`, `skewness`

License
35 changes: 23 additions & 12 deletions build.py
@@ -5,7 +5,7 @@
# Program: build.py
# Author: Mauricio Caceres Bravo <mauricio.caceres.bravo@gmail.com>
# Created: Sun Oct 15 10:26:39 EDT 2017
# Updated: Sun Oct 29 23:02:20 EDT 2017
# Updated: Tue Oct 31 14:48:08 EDT 2017
# Purpose: Main build file for gtools (copies contents into ./build and
# puts a .zip file in ./releases)

@@ -105,18 +105,22 @@ def makedirs_safe(directory):
gtools_ssc = [
"_gtools_internal.ado",
"gcollapse.ado",
"gcontract.ado",
"gegen.ado",
"gunique.ado",
"gdistinct.ado",
"glevelsof.ado",
"gtoplevelsof.ado",
"gisid.ado",
"hashsort.ado",
"gtools.ado",
"gcollapse.sthlp",
"gcontract.sthlp",
"gegen.sthlp",
"gunique.sthlp",
"gdistinct.sthlp",
"glevelsof.sthlp",
"gtoplevelsof.sthlp",
"gisid.sthlp",
"hashsort.sthlp",
"gtools.sthlp",
@@ -209,9 +213,11 @@ def makedirs_safe(directory):

testfile = open(path.join("src", "test", "gtools_tests.do")).readlines()
files = [path.join("src", "test", "test_gcollapse.do"),
path.join("src", "test", "test_gcontract.do"),
path.join("src", "test", "test_gegen.do"),
path.join("src", "test", "test_gunique.do"),
path.join("src", "test", "test_glevelsof.do"),
path.join("src", "test", "test_gtoplevelsof.do"),
path.join("src", "test", "test_gisid.do"),
path.join("src", "test", "test_hashsort.do")]

@@ -231,23 +237,28 @@ def makedirs_safe(directory):
gdir = path.join("build", "gtools")
copy2("changelog.md", gdir)

copy2(path.join("src", "gtools.pkg"), gdir)
copy2(path.join("src", "stata.toc"), gdir)
copy2(path.join("doc", "gcollapse.sthlp"), gdir)
copy2(path.join("doc", "gegen.sthlp"), gdir)
copy2(path.join("doc", "gunique.sthlp"), gdir)
copy2(path.join("doc", "gdistinct.sthlp"), gdir)
copy2(path.join("doc", "glevelsof.sthlp"), gdir)
copy2(path.join("doc", "gisid.sthlp"), gdir)
copy2(path.join("doc", "hashsort.sthlp"), gdir)
copy2(path.join("doc", "gtools.sthlp"), gdir)
copy2(path.join("src", "gtools.pkg"), gdir)
copy2(path.join("src", "stata.toc"), gdir)

copy2(path.join("doc", "stata", "gcollapse.sthlp"), gdir)
copy2(path.join("doc", "stata", "gcontract.sthlp"), gdir)
copy2(path.join("doc", "stata", "gegen.sthlp"), gdir)
copy2(path.join("doc", "stata", "gunique.sthlp"), gdir)
copy2(path.join("doc", "stata", "gdistinct.sthlp"), gdir)
copy2(path.join("doc", "stata", "glevelsof.sthlp"), gdir)
copy2(path.join("doc", "stata", "gtoplevelsof.sthlp"), gdir)
copy2(path.join("doc", "stata", "gisid.sthlp"), gdir)
copy2(path.join("doc", "stata", "hashsort.sthlp"), gdir)
copy2(path.join("doc", "stata", "gtools.sthlp"), gdir)

copy2(path.join("src", "ado", "_gtools_internal.ado"), gdir)
copy2(path.join("src", "ado", "gcollapse.ado"), gdir)
copy2(path.join("src", "ado", "gcontract.ado"), gdir)
copy2(path.join("src", "ado", "gegen.ado"), gdir)
copy2(path.join("src", "ado", "gunique.ado"), gdir)
copy2(path.join("src", "ado", "gdistinct.ado"), gdir)
copy2(path.join("src", "ado", "gdistinct.ado"), gdir)
copy2(path.join("src", "ado", "glevelsof.ado"), gdir)
copy2(path.join("src", "ado", "gtoplevelsof.ado"), gdir)
copy2(path.join("src", "ado", "gisid.ado"), gdir)
copy2(path.join("src", "ado", "hashsort.ado"), gdir)
copy2(path.join("src", "ado", "gtools.ado"), gdir)