gtools-0.9.0 (2017-11-01); OSX version, gcontract, gtoplevelsof

Features

- The plugin now works on OSX
- `gcontract` is a fast alternative to `contract`
- `gtoplevelsof` is a new command that allows the user to glean the most common levels of a set of variables. Similar to `gcontract` followed by a `gsort` on frequency in descending order, but `gtoplevelsof` does not modify the source data and saves a matrix with the results after it runs.
- `gdistinct` now saves its results to a matrix when there are multiple variables.
- Improved and normalized documentation

Bug fixes

- OSX version; fixes #11
- `gisid` now silent without benchmark or verbose; fixes #20
- Added quotes to `cd cwd` in `gtools`; fixes #22
- `gcontract` available; fixes #23
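
As a rough illustration of what the `gtoplevelsof` description above amounts to (count each combination of levels, order by descending frequency, keep the top few, leave the source data untouched), here is a minimal Python sketch. The helper name `top_levels` and the sample rows are hypothetical and are not part of gtools; the actual command is implemented as a Stata ado file backed by a C plugin.

```python
from collections import Counter


def top_levels(rows, n=10):
    """Return the n most common levels (tuples of values) with their counts.

    Hypothetical helper for illustration only; it is not part of gtools.
    The input rows are read, never modified.
    """
    counts = Counter(tuple(row) for row in rows)
    # Sort by descending frequency; break ties by level so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2), ("a", 1)]
print(top_levels(rows, n=2))
```

The result is returned as a separate list, which loosely mirrors the command saving its output to a matrix rather than replacing the data in memory.
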
mcaceresb · Nov 1, 2017 · 8ce741a
1 parent cc1f5e9
commit 8ce741a
Showing 84 changed files with 7,451 additions and 8,045 deletions.
diff --git a/.appveyor.yml b/.appveyor.yml
@@ -1,4 +1,4 @@
-version: "generic-0.1.0-{build}"
+version: "generic-0.2.0-{build}"
 
 environment:
   matrix:

diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,3 @@
 releases
 testing
+doc/mkdocs/site/
diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@ _Gtools_: Faster Stata for big data. This packages provides a hash-based
 implementation of collapse, egen, isid, levelsof, and unique/distinct using C
 plugins for a massive speed improvement.
 
-`version 0.8.4 29Oct2017`
+`version 0.9.0 31Oct2017`
 Builds: Linux [![Travis Build Status](https://travis-ci.org/mcaceresb/stata-gtools.svg?branch=develop)](https://travis-ci.org/mcaceresb/stata-gtools),
 Windows (Cygwin) [![Appveyor Build status](https://ci.appveyor.com/api/projects/status/2bh1q9bulx3pl81p/branch/develop?svg=true)](https://ci.appveyor.com/project/mcaceresb/stata-gtools)
 
@@ -21,31 +21,32 @@ Faster Stata for Group Operations
 This package's aim is to provide a fast implementation of group commands in
 Stata using hashes and C plugins. This includes (benchmarked using Stata/IC):
 
-| Function    | Replaces        | Speedup (IC)      | Unsupported     | Extras                           |
-| ----------- | --------------- | ----------------- | --------------- | -------------------------------- |
-| `gcollapse` | `collapse`      |  9 to 300 (+)     | Weights         | Quantiles, `merge`, label output |
-| `gegen`     | `egen`          |  9 to 26 (+, .)   | Weights, labels | Quantiles                        |
-| `gisid`     | `isid`          |  8 to 30          | `using`, `sort` | `if`, `in`                       |
-| `glevelsof` | `levelsof`      |  3 to 13          |                 | Multiple variables               |
-| `gunique`   | `unique`        |  4 to 26          | `by`            |                                  |
-| `gdistinct` | `distinct`      |  4 to 26          |                 |                                  |
+| Function       | Replaces        | Speedup (IC)      | Unsupported     | Extras                           |
+| -------------- | --------------- | ----------------- | --------------- | -------------------------------- |
+| `gcollapse`    | `collapse`      |  9 to 300 (+)     | Weights         | Quantiles, `merge`, label output |
+| `gcontract`    | `contract`      |  5 to 7           | Weights         |                                  |
+| `gegen`        | `egen`          |  9 to 26 (+, .)   | Weights, labels | Quantiles                        |
+| `gisid`        | `isid`          |  8 to 30          | `using`, `sort` | `if`, `in`                       |
+| `glevelsof`    | `levelsof`      |  3 to 13          |                 | Multiple variables               |
+| `gunique`      | `unique`        |  4 to 26          | `by`            |                                  |
+| `gdistinct`    | `distinct`      |  4 to 26          |                 |                                  |
+| `gtoplevelsof` |                 |                   |                 |                                  |
 
 <small>Commands were benchmarked on a Linux laptop with Stata/IC; gains in Stata/MP are smaller.</small>
 
 <small>(+) The upper end of the speed improvements are for quantiles (e.g. median, iqr, p90) and few groups.</small>
 
 <small>(.) Only `egen group` was benchmarked rigorously.</small>
 
-In addition, all commands take gsort-style input, that is
+In addition, most commands take gsort-style input, that is
 
 ```
 [+|-]varname [[+|-]varname ...]
 ```
 
-This often does not matter (e.g. gegen summary stats, gisid, gunqiue) but it
-saves a second sort in other places (e.g. gcollapse, gegen group, glevelsof).
-If you plan to use the plugin extensively, check out the [FAQs](#faqs) for
-caveats and details on the plugin.
+`gisid`, `gunique`, and `gdistinct` are exceptions because the order does not
+matter for those commands.  If you plan to use the plugin extensively, check
+out the [FAQs](#faqs) for caveats and details on the plugin.
 
 ### Hashing
 
@@ -73,11 +74,13 @@ sorting the groups, copying a sort index back to Stata, and having Stata do
 the final swaps. The plugin runs fast, but the copy overhead plus the Stata
 swaps often make the function be slower than Stata's native `sort`.
 
-By contrast, Stata's `gsort` is not efficient. To sort data, you need to make
-pair-wise comparisons. For real numbers, this is just `a > b`. However, a generic
-comparison function can be written as `compare(a, b) > 0`. This is true if a
-is greater than b and false otherwise. To invert the sort order, one need only
-use `compare(b, a) > 0`, which is what gtools does internally.
+The reason that the other functions are faster is because they don't deal with
+all that overhead.  By contrast, Stata's `gsort` is not efficient. To sort
+data, you need to make pair-wise comparisons. For real numbers, this is just
+`a > b`. However, a generic comparison function can be written as `compare(a,
+b) > 0`. This is true if a is greater than b and false otherwise. To invert
+the sort order, one need only use `compare(b, a) > 0`, which is what gtools
+does internally.
 
 However, Stata creates a variable that is the inverse of the sort variable.
 This is equivalent, but the overhead makes it slower than `hashsort`.
@@ -385,41 +388,60 @@ Very compex stats, one variable:
 We benchmark `gegen id = group(varlist)` vs egen and fegen, obs = 10,000,000,
 J = 10,000 (in seconds)
 
- | egen | fegen | gegen | ratio (e/g) | ratio (f/g) | varlist
- | ---- | ----- | ----- | ----------- | ----------- | -------
- | 22.2 |   4.1 |  1.14 |        19.4 |         3.6 | str_12
- | 21.6 |  5.96 |  1.59 |        13.5 |        3.74 | str_12 str_32
- |   23 |  7.31 |  1.95 |        11.8 |        3.74 | str_12 str_32 str_4
- | 18.4 |  2.94 |  .813 |        22.6 |        3.61 | double1
- | 18.4 |  3.24 |  .883 |        20.9 |        3.67 | double1 double2
- | 19.1 |  3.36 |  .945 |        20.2 |        3.56 | double1 double2 double3
- | 16.6 |  1.84 |  .634 |        26.2 |        2.91 | int1
- | 18.3 |  2.05 |  .735 |        24.9 |        2.79 | int1 int2
- | 19.6 |  2.53 |  .895 |        21.9 |        2.83 | int1 int2 int3
- | 20.2 |     . |  1.51 |        13.4 |           . | int1 str_32 double1
- |   22 |     . |  2.07 |        10.6 |           . | int1 str_32 double1 int2 str_12 double2
- | 24.1 |     . |  2.61 |        9.24 |           . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
+| egen | fegen | gegen | ratio (e/g) | ratio (f/g) | varlist
+| ---- | ----- | ----- | ----------- | ----------- | -------
+| 22.2 |   4.1 |  1.14 |        19.4 |         3.6 | str_12
+| 21.6 |  5.96 |  1.59 |        13.5 |        3.74 | str_12 str_32
+|   23 |  7.31 |  1.95 |        11.8 |        3.74 | str_12 str_32 str_4
+| 18.4 |  2.94 |  .813 |        22.6 |        3.61 | double1
+| 18.4 |  3.24 |  .883 |        20.9 |        3.67 | double1 double2
+| 19.1 |  3.36 |  .945 |        20.2 |        3.56 | double1 double2 double3
+| 16.6 |  1.84 |  .634 |        26.2 |        2.91 | int1
+| 18.3 |  2.05 |  .735 |        24.9 |        2.79 | int1 int2
+| 19.6 |  2.53 |  .895 |        21.9 |        2.83 | int1 int2 int3
+| 20.2 |     . |  1.51 |        13.4 |           . | int1 str_32 double1
+|   22 |     . |  2.07 |        10.6 |           . | int1 str_32 double1 int2 str_12 double2
+| 24.1 |     . |  2.61 |        9.24 |           . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
 
 `gegen` ~9-26 times faster than `egen` and ~2.5-4 times faster than `fegen`.
 
+### `contract`
+
+Benchmark vs contract, obs = 10,000,000, J = 10,000 (in seconds).
+
+| contract | gcontract | ratio (c/g) | varlist
+| -------- | --------- | ----------- | -------
+|     15.9 |      2.36 |        6.75 | str_12
+|     16.4 |      3.16 |         5.2 | str_12 str_32
+|       18 |      3.29 |        5.46 | str_12 str_32 str_4
+|     13.9 |      1.95 |        7.14 | double1
+|     14.1 |      2.09 |        6.76 | double1 double2
+|     14.1 |      2.28 |        6.19 | double1 double2 double3
+|     12.3 |      1.83 |        6.69 | int1
+|     13.8 |         2 |        6.88 | int1 int2
+|     15.2 |      2.21 |        6.88 | int1 int2 int3
+|     15.3 |      2.89 |        5.31 | int1 str_32 double1
+|       17 |      3.82 |        4.45 | int1 str_32 double1 int2 str_12 double2
+|     19.4 |      4.07 |        4.76 | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
+
 ### `isid`
 
 Benchmark vs isid, obs = 10,000,000; all calls include an index to ensure uniqueness.
 
- | isid | fisid | gisid | ratio (i/g) | ratio (f/g) | varlist
- | ---- | ----- | ----- | ----------- | ----------- | -------
- | 37.8 |  24.6 |  2.24 |        16.9 |          11 | str_12
- | 41.5 |  29.9 |   2.4 |        17.3 |        12.5 | str_12 str_32
- | 44.8 |    34 |  2.75 |        16.3 |        12.4 | str_12 str_32 str_4
- | 30.4 |  14.3 |  1.86 |        16.4 |        7.72 | double1
- | 31.6 |  14.9 |  1.95 |        16.2 |        7.63 | double1 double2
- | 32.7 |  15.1 |  2.01 |        16.3 |        7.49 | double1 double2 double3
- | 31.3 |  14.5 |  1.04 |        30.1 |        13.9 | int1
- | 32.6 |  15.1 |  1.25 |        26.1 |        12.1 | int1 int2
- | 34.1 |  15.4 |  2.04 |        16.7 |        7.57 | int1 int2 int3
- | 38.5 |     . |  2.35 |        16.4 |           . | int1 str_32 double1
- |   45 |     . |  2.91 |        15.4 |           . | int1 str_32 double1 int2 str_12 double2
- |   51 |     . |  3.29 |        15.5 |           . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
+| isid | fisid | gisid | ratio (i/g) | ratio (f/g) | varlist
+| ---- | ----- | ----- | ----------- | ----------- | -------
+| 37.8 |  24.6 |  2.24 |        16.9 |          11 | str_12
+| 41.5 |  29.9 |   2.4 |        17.3 |        12.5 | str_12 str_32
+| 44.8 |    34 |  2.75 |        16.3 |        12.4 | str_12 str_32 str_4
+| 30.4 |  14.3 |  1.86 |        16.4 |        7.72 | double1
+| 31.6 |  14.9 |  1.95 |        16.2 |        7.63 | double1 double2
+| 32.7 |  15.1 |  2.01 |        16.3 |        7.49 | double1 double2 double3
+| 31.3 |  14.5 |  1.04 |        30.1 |        13.9 | int1
+| 32.6 |  15.1 |  1.25 |        26.1 |        12.1 | int1 int2
+| 34.1 |  15.4 |  2.04 |        16.7 |        7.57 | int1 int2 int3
+| 38.5 |     . |  2.35 |        16.4 |           . | int1 str_32 double1
+|   45 |     . |  2.91 |        15.4 |           . | int1 str_32 double1 int2 str_12 double2
+|   51 |     . |  3.29 |        15.5 |           . | int1 str_32 double1 int2 str_12 double2 int3 str_4 double3
 
 Benchmark vs isid, obs = 10,000,000, J = 10,000 (in seconds)
 
@@ -850,7 +872,6 @@ fixed size.
 In particular I use the [Spooky Hash](http://burtleburtle.net/bob/hash/spooky.html)
 devised by Bob Jenkins, which is a 128-bit hash. Stata caps observations
 at 20 billion or so, meaning a 128-bit hash collision is _de facto_ impossible.
-Nevertheless, the function does check for hash collisions and will fall back
 on `collapse` and `egen` when it encounters a collision. An internal
 mechanism for resolving potential collisions is in the works. See [issue
 2](https://github.com/mcaceresb/stata-gtools/issues/2) for a discussion.
@@ -898,17 +919,35 @@ overhead has been ~10% of the total runtime. If the user expects J to be
 large, they can turn off this check via `forcemem`. If the user expects
 J to be small, they can force collapsing to disk via `forceio`.
 
-### TODO
+### Radmap to 1.0
+
+- [ ] Comment ALL the code
+- [ ] Write markdown documentation for the project
+    - [ ] Reduce the README to bare bones; point user to docs for more
+    - [ ] Introduction
+    - [ ] FAQs
+    - [ ] Have one subpage for each command and each of the following
+        - [ ] Documentation (options and basic usage)
+        - [ ] Examples (expansive examples showcasing all relevant options)
+        - [ ] Benchmarks
+- [ ] Make sure sthlp documentation is normalized
+    - [X] Mention gtools in each command
+    - [X] Note gtools special commands in each help file
+    - [ ] Point user to FAQs and online docs
+- [ ] After you've written the docs, update the sthlp files
+    - [ ] Add market to exampels
+    - [ ] Have examples link to docs
+- [ ] Improve coverage of debug checks.
+    - [ ] Have corner cases for ALL commands
+    - [ ] Test all the options in every command
 
+### Ideas for improvements
+
+- [ ] Add support for weights.
 - [ ] Minimize memory use.
-- [ ] Improve coverage of debug checks.
 - [ ] Option `smart` to check if variables are sorted.
-- [ ] Option `freq` to add obs count for each group.
 - [ ] Option `greedy` to give user fine-grain control over gcollapse internals.
 - [ ] Provide `sumup` and `sum` altetnative, `gsum`.
-- [ ] Add `gtab` as a fast version of `tabulate` with a `by` option.
-    - [ ] Also add functionality from `tabcustom`.
-- [ ] Add support for weights.
 - [ ] Add `Var`, `kurtosis`, `skewness`
 
 License
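
The README hunk above describes inverting a sort by swapping the arguments of a generic comparison, `compare(b, a) > 0`, rather than creating an inverted copy of the sort variable. A minimal Python sketch of that idea follows; the function names are illustrative assumptions, and this is not the plugin's C code.

```python
from functools import cmp_to_key


def compare(a, b):
    """Generic three-way comparison: positive if a > b, negative if a < b, else 0."""
    return (a > b) - (a < b)


def sort_values(values, descending=False):
    """Sort with a comparator; descending order only swaps the arguments."""
    # No negated or inverted copy of the data is built, so this also works
    # for strings, where "minus the value" has no meaning.
    cmp = (lambda a, b: compare(b, a)) if descending else compare
    return sorted(values, key=cmp_to_key(cmp))


words = ["pear", "apple", "fig"]
print(sort_values(words))
print(sort_values(words, descending=True))
```
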

diff --git a/build.py b/build.py
@@ -5,7 +5,7 @@
 # Program: build.py
 # Author:  Mauricio Caceres Bravo <mauricio.caceres.bravo@gmail.com>
 # Created: Sun Oct 15 10:26:39 EDT 2017
-# Updated: Sun Oct 29 23:02:20 EDT 2017
+# Updated: Tue Oct 31 14:48:08 EDT 2017
 # Purpose: Main build file for gtools (copies contents into ./build and
 #          puts a .zip file in ./releases)
 
@@ -105,18 +105,22 @@ def makedirs_safe(directory):
 gtools_ssc = [
     "_gtools_internal.ado",
     "gcollapse.ado",
+    "gcontract.ado",
     "gegen.ado",
     "gunique.ado",
     "gdistinct.ado",
     "glevelsof.ado",
+    "gtoplevelsof.ado",
     "gisid.ado",
     "hashsort.ado",
     "gtools.ado",
     "gcollapse.sthlp",
+    "gcontract.sthlp",
     "gegen.sthlp",
     "gunique.sthlp",
     "gdistinct.sthlp",
     "glevelsof.sthlp",
+    "gtoplevelsof.sthlp",
     "gisid.sthlp",
     "hashsort.sthlp",
     "gtools.sthlp",
@@ -209,9 +213,11 @@ def makedirs_safe(directory):
 
 testfile = open(path.join("src", "test", "gtools_tests.do")).readlines()
 files    = [path.join("src", "test", "test_gcollapse.do"),
+            path.join("src", "test", "test_gcontract.do"),
             path.join("src", "test", "test_gegen.do"),
             path.join("src", "test", "test_gunique.do"),
             path.join("src", "test", "test_glevelsof.do"),
+            path.join("src", "test", "test_gtoplevelsof.do"),
             path.join("src", "test", "test_gisid.do"),
             path.join("src", "test", "test_hashsort.do")]
 
@@ -231,23 +237,28 @@ def makedirs_safe(directory):
 gdir = path.join("build", "gtools")
 copy2("changelog.md", gdir)
 
-copy2(path.join("src", "gtools.pkg"),      gdir)
-copy2(path.join("src", "stata.toc"),       gdir)
-copy2(path.join("doc", "gcollapse.sthlp"), gdir)
-copy2(path.join("doc", "gegen.sthlp"),     gdir)
-copy2(path.join("doc", "gunique.sthlp"),   gdir)
-copy2(path.join("doc", "gdistinct.sthlp"),   gdir)
-copy2(path.join("doc", "glevelsof.sthlp"), gdir)
-copy2(path.join("doc", "gisid.sthlp"),     gdir)
-copy2(path.join("doc", "hashsort.sthlp"),  gdir)
-copy2(path.join("doc", "gtools.sthlp"),    gdir)
+copy2(path.join("src", "gtools.pkg"),         gdir)
+copy2(path.join("src", "stata.toc"),          gdir)
+
+copy2(path.join("doc", "stata", "gcollapse.sthlp"),    gdir)
+copy2(path.join("doc", "stata", "gcontract.sthlp"),    gdir)
+copy2(path.join("doc", "stata", "gegen.sthlp"),        gdir)
+copy2(path.join("doc", "stata", "gunique.sthlp"),      gdir)
+copy2(path.join("doc", "stata", "gdistinct.sthlp"),    gdir)
+copy2(path.join("doc", "stata", "glevelsof.sthlp"),    gdir)
+copy2(path.join("doc", "stata", "gtoplevelsof.sthlp"), gdir)
+copy2(path.join("doc", "stata", "gisid.sthlp"),        gdir)
+copy2(path.join("doc", "stata", "hashsort.sthlp"),     gdir)
+copy2(path.join("doc", "stata", "gtools.sthlp"),       gdir)
 
 copy2(path.join("src", "ado", "_gtools_internal.ado"), gdir)
 copy2(path.join("src", "ado", "gcollapse.ado"),        gdir)
+copy2(path.join("src", "ado", "gcontract.ado"),        gdir)
 copy2(path.join("src", "ado", "gegen.ado"),            gdir)
 copy2(path.join("src", "ado", "gunique.ado"),          gdir)
-copy2(path.join("src", "ado", "gdistinct.ado"),          gdir)
+copy2(path.join("src", "ado", "gdistinct.ado"),        gdir)
 copy2(path.join("src", "ado", "glevelsof.ado"),        gdir)
+copy2(path.join("src", "ado", "gtoplevelsof.ado"),     gdir)
 copy2(path.join("src", "ado", "gisid.ado"),            gdir)
 copy2(path.join("src", "ado", "hashsort.ado"),         gdir)
 copy2(path.join("src", "ado", "gtools.ado"),           gdir)