Commit 458c95e

Merge 'develop' gtools-1.5.1; matasave (gtop, glevelsof), greshape @
Features

- `greshape` supports `@` syntax for wide and long. Change the string to be matched via `match()`.
- `greshape` supports Stata varlist syntax for long to wide (may not be combined with `@` within a stub).
- `greshape` does not support varlist syntax for wide to long, but can use `match(regex)` for complex wide to long matches (see examples).
- Closes #57
- `glevelsof, mata[(name)]` saves the levels to mata. The levels are _not_ stored in `r(levels)` and option `local()` is not allowed. With `silent`, the levels are additionally not formatted.
- `glevelsof, mata numfmt()` requires `numfmt` to be a mata print format instead of a C print format.
- `gtop, ntop(.)` and `gtop, ntop(-.)` now allow printing all the levels from largest to smallest or the converse.
- `gtop, alpha` sorts the top levels in variable order. If `gtop -var, alpha` is passed, they are sorted in reverse order.
- `gtop, mata` uses temporary files on disk to read the levels from C via mata. Matrices and locals are not used, meaning `r(levels)`, `r(toplevels)`, and the results stored via the option -matrix()-, ``r(`matrix')``, are no longer available. The user can access each of these via the mata object `GtoolsByLevels` (the user can change the name of this object via `mata(name)`). The levels are stored raw in `GtoolsByLevels.charx` and `GtoolsByLevels.numx`; the levels are stored formatted in `GtoolsByLevels.printed`; the frequencies are stored in `GtoolsByLevels.toplevels`.
- `r(matalevels)` stores the name of the mata object with the levels and frequencies.
- `gtop` also stores `r(ntop)`, `r(nrows)`, and `r(alpha)` as return scalars, for the number of top levels (if `.`, this will be `r(J)`), the number of rows in the `toplevels` matrix (it may or may not include a row for "other" and a row for "missing"), and whether the top levels are sorted by their values.
- `gtop, mata numfmt()` requires `numfmt` to be a mata print format instead of a C print format.
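As a rough illustration of the `gtop, mata` notes above, the following Stata sketch uses only the options and object members named in this commit message. It assumes gtools 1.5.1 or later is installed; the bundled `auto` data and the object name `TopRep` are hypothetical choices made for the example.

```stata
* Sketch only: assumes gtools >= 1.5.1 is installed.
sysuse auto, clear

* Store levels and frequencies in the default mata object, GtoolsByLevels.
gtop rep78, mata
display "`r(matalevels)'"        // name of the mata object holding the results

* rep78 is numeric, so the raw levels are expected in numx.
mata GtoolsByLevels.numx
mata GtoolsByLevels.printed
mata GtoolsByLevels.toplevels

* Show every level and pick the object name (TopRep is a hypothetical name).
gtop rep78, ntop(.) mata(TopRep)
mata TopRep.toplevels
```

Because `r(levels)` and `r(toplevels)` are not populated when `mata` is used, any downstream code that reads them would need to read the mata object instead.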
2 parents 4cfd6dc + 944ca4c commit 458c95e

92 files changed: +4562 -1984 lines changed

.appveyor.yml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-version: "generic-1.4.1-{build}"
+version: "generic-1.5.1-{build}"

 environment:
 matrix:

README.md

Lines changed: 23 additions & 21 deletions
@@ -10,10 +10,10 @@

 Faster Stata for big data. This packages uses C plugins and hashes
 to provide a massive speed improvements to common Stata commands,
-including: collapse, reshape, winsor, pctile, xtile, contract, egen,
-isid, levelsof, duplicates, and unique/distinct.
+including: collapse, reshape, xtile, tabstat, isid, egen, pctile,
+winsor, contract, levelsof, duplicates, and unique/distinct.

-![Dev Version](https://img.shields.io/badge/stable-v1.4.1-blue.svg?longCache=true&style=flat-square)
+![Stable Version](https://img.shields.io/badge/stable-v1.5.1-blue.svg?longCache=true&style=flat-square)
 ![Supported Platforms](https://img.shields.io/badge/platforms-linux--64%20%7C%20osx--64%20%7C%20win--64-blue.svg?longCache=true&style=flat-square)
 [![Travis Build Status](https://img.shields.io/travis/mcaceresb/stata-gtools/master.svg?longCache=true&style=flat-square&label=linux)](https://travis-ci.org/mcaceresb/stata-gtools)
 [![Travis Build Status](https://img.shields.io/travis/mcaceresb/stata-gtools/master.svg?longCache=true&style=flat-square&label=osx)](https://travis-ci.org/mcaceresb/stata-gtools)
@@ -59,8 +59,8 @@ __*Gtools commands with a Stata equivalent*__
 | gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | `by()`, various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gquantiles)) |
 | | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
 | | \_pctile | 25 to 40 / 3 to 5 | | Ibid. |
-| gstats tab | tabstat | 10 to 60 / 5 to 40 (-) | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
-| gstats sum | sum, detail | 10 to 40 / 5 to 10 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
+| gstats tab | tabstat | 10 to 50 / 5 to 30 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |
+| gstats sum | sum, detail | 10 to 20 / 5 to 10 | See [remarks](#remarks) | various (see [usage](https://gtools.readthedocs.io/en/latest/usage/gstats_summarize)) |

 <small>(+) The upper end of the speed improvements are for quantiles
 (e.g. median, iqr, p90) and few groups. Weights have not been
@@ -296,8 +296,9 @@ allow weights).

 Hence both should be able to replicate all of the functionality of their
 Stata counterparts. Last, `gstats tab` allows every statistic allowed
-by `tabstat` as well as any statistic allowed by `gcollapse`, and the
-syntax for the statistics specified via `statistics()` is also the same.
+by `tabstat` as well as any statistic allowed by `gcollapse`; the
+syntax for the statistics specified via `statistics()` is the same
+as in `tabstat`.

 The following are implemented internally in C:

@@ -324,7 +325,7 @@ The following are implemented internally in C:
 | min | X | X | X |
 | range | X | X | X |
 | select | X | X | X |
-| rawselect | X | X | X |
+| rawselect | X | | X |
 | percent | X | X | X |
 | first | X | X (+) | X |
 | last | X | X (+) | X |
@@ -349,7 +350,7 @@ gegen target = pctile(var), by(varlist) p(#)
 ```

 where # is a "percentile" with arbitrary decimal places (e.g. 2.5 or 97.5).
-`gtools` also supports selecting the `#`th smallest or largest non-missing value:
+`gtools` also supports selecting the `#`th smallest or largest value:
 ```stata
 gcollapse (select#) target = var [(select-#) target = var ...] , by(varlist)
 gegen target = select(var), by(varlist) n(#)
@@ -385,13 +386,13 @@ Differences from `collapse`
 - `rawstat` allows selectively applying weights.
 - `rawselect` ignores weights for `select` (analogously to `rawsum`).
 - Option `wild` allows bulk-rename. E.g. `gcollapse mean_x* = x*, wild`
+- `gcollapse (nansum)` and `gcollapse (rawnansum)` outputs a missing
+value for sums if all inputs are missing (instead of 0).
 - `gcollapse, merge` merges the collapsed data set back into memory. This is
 much faster than collapsing a dataset, saving, and merging after. However,
 Stata's `merge ..., update` functionality is not implemented, only replace.
 (If the targets exist the function will throw an error without `replace`).
 - `gcollapse, labelformat` allows specifying the output label using placeholders.
-- `gcollapse (nansum)` and `gcollapse (rawnansum)` outputs a missing
-value for sums if all inputs are missing (instead of 0).
 - `gcollapse, sumcheck` keeps integer types with `sum` if the sum will not overflow.

 Differences from `greshape`
@@ -413,7 +414,7 @@ Differences from `greshape`
 with this functionality.
 - For that same reason, "advanced" syntax is not supported, including
 the subcommands: clear, error, query, i, j, xij, and xi.
-- `@` syntax is not (yet) supported but is planned for a future release.
+- `@` syntax can be modified via `match()`

 Differences from `xtile`, `pctile`, and `_pctile`

@@ -453,28 +454,30 @@ Differences from `tabstat`

 - Saving the output is done via `mata` instead of `r()`. No matrices
 are saved in `r()` and option `save` is not allowed. However, option
-`matasave` saves the output and `by()` info in `GstatsOutput`. See
-`mata GstatsOutput.desc()` after `gstats tab, matasave` for details.
+`matasave` saves the output and `by()` info in `GstatsOutput` (the object
+can be named via `matasave(name)`). See `mata GstatsOutput.desc()` after
+`gstats tab, matasave` for details.
 - `GstatsOutput` provides helpers for extracting rows, columns, and levels.
 - Multiple groups are allowed.
 - Options `casewise`, `longstub` are not supported.
 - Option `nototal` is on by default; `total` is planned for a future release.
+- Option `pooled` pools the source variables into one.

 Differences from `summarize, detail`

 - The behavior of `summarize` and `summarize, meanonly` can be
 recovered via options `nodetail` and `meanonly`. These two
 options are mainly for use with `by()`
 - Option `matasave` saves output and `by()` info in `GstatsOutput`,
-a mata class object. See `mata GstatsOutput.desc()` after
-`gstats sum, matasave` for details.
+a mata class object (the object can be named via `matasave(name)`).
+See `mata GstatsOutput.desc()` after `gstats sum, matasave` for details.
 - Option `noprint` saves the results but omits printing output.
 - Option `tab` prints statistics in the style of `tabstat`
-- Option `pooled` pools the source variables and computes summary
+- Option `pooled` pools the source variables and computes summary
 stats as if it was a single variable.
 - `pweights` are allowed.
 - Largest and smallest observations are weighted.
-- `rolling:`, `statsby`, and `by:` are not allowed. To use `by` pass
+- `rolling:`, `statsby:`, and `by:` are not allowed. To use `by` pass
 the option `by()`
 - `display options` are not supported.
 - Factor and time series variables are not allowed.
@@ -560,9 +563,9 @@ TODO
 ----

 - [ ] Update benchmarks for all commands. Still on 0.8 benchmarks.
-- [ ] Allow keeping both variable names and labels in `greshape spread/gather`
 - [ ] Implement `collapse()` option for `greshape`.
-- [ ] Implement variable group syntax for `greshape`.
+- [ ] `geomean` for geometric mean (`exp(mean(log(x)))` for gcollapse, gstats tab, gegen).
+- [ ] Allow keeping both variable names and labels in `greshape spread/gather`
 - [ ] Implement `selectoverflow(missing|closest)`
 - [ ] Add totals row for `J > 1` in gstats

@@ -577,7 +580,6 @@ have an ETA for them:
 - [ ] Create a Stata C hashing API with thin wrappers around core functions.
 - [ ] This will be a C library that other users can import.
 - [ ] Some functionality will be available from Stata via gtooos, api()
-- [ ] Add option to `gtop` to display top X results in alpha order
 - [ ] Improve debugging info.
 - [ ] Improve code comments when you write the API!
 - [ ] Have some type of coding standard for the base (coding style)
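
The `nansum` note under "Differences from `collapse`" in the README diff above can be checked with a small sketch. The toy data and the variable names `g`, `x`, `s`, and `ns` are hypothetical, and gtools is assumed to be installed; the statistic names and the `target = var` syntax are the ones shown in the README.

```stata
* Sketch only: assumes gtools is installed.
clear
input byte g double x
1 .
1 .
2 1
2 .
end

* Group 1 has only missing inputs; group 2 has one non-missing input.
* Per the README, (sum) gives 0 when every input is missing, while
* (nansum) gives a missing value in that case.
gcollapse (sum) s = x (nansum) ns = x, by(g)
list
```

Comparing `s` and `ns` for group 1 is the quickest way to confirm which behavior a given gtools version implements.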
