The table1c package is a light wrapper around the table1 package, with some customizations for the convenience of Certara IDD.
This vignette serves as a user guide for the package. We run through an example using a simulated dataset, Xyz-pk.csv, with data pooled from three hypothetical studies. The dataset is in the style of a NONMEM PopPK dataset and is available for download (note: due to a Chrome bug, the file may download with a .xls extension, which is incorrect; find the file where it was saved and change the extension to .csv).
(Note: The dataset has column names in all lower case letters, rather than the more traditional upper case used in NONMEM. This is preferred because we end up typing these names a lot; avoiding the multi-key combinations needed for capital letters is not only faster, but also decreases the risk of repetitive strain injury.)
Using a Data Specification File
One of the benefits of this package is the ability to separate meta-data (data about data) from scripting logic, by segregating meta-data into a central location (the data specification file), which results in scripts that are simpler, more generic and re-usable.
The data specification is written in YAML (see below), a suitable language for encoding data or meta-data, which allows it to be clear and concise. (Note: currently this file needs to be written by hand, but in the future its generation may be partially or fully automated.)
What is YAML?
YAML is a markup language for encoding structured data, similar to XML or JSON, but more geared towards human readability (you may already be familiar with YAML since it is used in the header of R markdown documents). It has the advantage of being both very easy for humans to read and write, as well as machine parseable (because although it looks natural, it actually has strict syntactic rules), with support in many popular languages, including R. You can edit YAML files in RStudio (with syntax highlighting).
The data specification file
The way YAML works is best illustrated with an example. The current directory contains the file data_spec.yaml, which contains the following:
The meaning of the file contents is intuitively clear. Indentation is used to denote hierarchical structure. Line breaks separate items from each other. Spaces are used for indentation; other than that, spaces are basically ignored (except inside strings).
Warning: Make sure you are not using tabs instead of spaces; it can be hard to tell, and YAML is sensitive to this difference. If you have an error when reading the file, this is something to check. Most editor programs have a setting that will cause the tab key to insert a number of spaces instead of a tab character (RStudio does under ‘Tools>Global Options>Code>Editing>General’). Some also have a feature that allows you to “see” the whitespace characters (in RStudio it’s in ‘Tools>Global Options>Code>Display>General’) which can help to debug the problem.
Data structures come in 2 forms: sequential (i.e. lists) and named (i.e. dictionaries). For sequential data, each element is preceded by a dash and whitespace (don’t use tabs) (e.g. - item); thus, it looks the way one would write a list in a plain-text e-mail, for instance. Named data consists of key-value pairs, where a colon and whitespace (don’t use tabs) separate the key from the value (e.g. key: value); if the value is itself a nested structure, it can appear indented starting on the next line (same for list items). Primitive types (numbers, strings) are written the way one would write them naturally. In most cases, strings do not need to be quoted (but they can be); there are some exceptions though. Strings can contain Unicode symbols. For more details on the syntax, see the YAML documentation.
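The two forms can be tried out directly from R. The following is a minimal sketch that assumes the CRAN package yaml is installed; the keys and values are made up for illustration and are not taken from data_spec.yaml.

```r
library(yaml)  # CRAN package providing a YAML parser for R

# Sequential data: each element is introduced by a dash and a space.
fruits <- yaml.load("
- apple
- banana
- cherry
")

# Named data: a colon and a space separate each key from its value.
# A nested structure is indented (with spaces, not tabs) on the next lines.
subject <- yaml.load("
id: 14
study: Xyz-hv-01
visits:
  - screening
  - baseline
")

str(fruits)   # inspect the parsed sequence
str(subject)  # inspect the parsed key-value structure
```

Note that the strings are written without quotes, and the nested list under visits is recognized purely from its indentation.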
In the example above, the whole file encodes a named structure, with 3 top-level items: dataset, labels and categoricals. The dataset item contains a single string, the name of a .csv file that contains the data to which this meta-data is associated. The labels item contains another named structure: key-value pairs of column names and associated labels. The last item, categoricals, contains information on the coding of certain variables (i.e., variables that are really categorical but have been assigned numeric codes in the dataset). When the data is presented in a table, these variables should be translated back to their original descriptive identifiers. Nested within the categoricals item is another named structure. Here, the names correspond to columns in the dataset, and the values are lists, whereby each list item relates a (numeric) code to its (string) identifier.
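As a sketch of the structure just described, a spec with this shape might look like the YAML embedded in the R string below. The three top-level items and the column names come from this guide, but the label texts and the numeric codes are illustrative assumptions, not the contents of the actual data_spec.yaml. Parsing it with yaml.load() (from the CRAN package yaml, assumed to be installed) shows how the nesting is represented in R.

```r
library(yaml)

# Illustrative spec: labels and numeric codes are assumed, not the real file.
spec_text <- "
dataset: Xyz-pk.csv

labels:
  sex: Sex
  age: Age (years)
  wt: Weight (kg)

categoricals:
  sex:
    - 0: Male
    - 1: Female
  hv:
    - 0: Patient
    - 1: Healthy Subject
"

spec <- yaml.load(spec_text)

names(spec)                  # the top-level items
str(spec$labels)             # column name -> label
str(spec$categoricals$sex)   # list of code -> identifier pairs
```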
Reading a dataset from its specification
With the data_spec.yaml file above, we can use the read_from_spec() function to read the data and have it augmented with the meta-data from the spec file:
Note that we did not need to include the name of the data file in our script, since it is contained in the spec.
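To make the idea concrete, here is a rough base-R sketch of what reading a dataset from a spec involves. The helper read_from_spec_sketch() is hypothetical and is not table1c's implementation of read_from_spec(); it assumes the CRAN package yaml is installed and that labels are stored in a "label" attribute. The toy files and their values are made up.

```r
library(yaml)

# Hypothetical helper for illustration only; NOT table1c's read_from_spec().
read_from_spec_sketch <- function(spec_file) {
  spec <- yaml::read_yaml(spec_file)

  # The data file name comes from the spec, resolved relative to the spec file.
  dat <- read.csv(file.path(dirname(spec_file), spec$dataset))

  # Recode numeric codes as factors, in the order listed in the spec.
  for (v in names(spec$categoricals)) {
    pairs <- unlist(spec$categoricals[[v]])  # names = codes, values = identifiers
    dat[[v]] <- factor(dat[[v]], levels = names(pairs), labels = unname(pairs))
  }

  # Attach each label as a "label" attribute on its column.
  for (v in names(spec$labels)) {
    attr(dat[[v]], "label") <- spec$labels[[v]]
  }
  dat
}

# Toy inputs written to a temporary directory (all values are made up).
demo_dir <- tempfile("spec_demo")
dir.create(demo_dir)
writeLines(c("id,sex,wt", "1,0,70.5", "2,1,58.2"),
           file.path(demo_dir, "toy.csv"))
writeLines(c(
  "dataset: toy.csv",
  "labels:",
  "  sex: Sex",
  "  wt: Weight (kg)",
  "categoricals:",
  "  sex:",
  "    - 0: Male",
  "    - 1: Female"
), file.path(demo_dir, "toy_spec.yaml"))

toy <- read_from_spec_sketch(file.path(demo_dir, "toy_spec.yaml"))
str(toy)
```

The script only ever names the spec file; the data file name, the labels and the factor codings all live in the spec.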
Before proceeding to describe the baseline characteristics of our study subjects, we need to make sure that each individual is only counted once. There is a convenience function for that:
Only columns that are invariant (and hence unambiguous) within each ID level are retained.
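The notion of "invariant within each ID level" can be illustrated in base R. The helper below, one_row_per_id_sketch(), is a hypothetical stand-in written for this illustration and may differ from how the package's convenience function works internally; the toy data are made up.

```r
# Hypothetical helper for illustration only; not the package's implementation.
one_row_per_id_sketch <- function(dat, id = "id") {
  # A column is invariant if it takes exactly one distinct value per ID.
  invariant <- vapply(names(dat), function(v) {
    n_distinct <- tapply(dat[[v]], dat[[id]], function(x) length(unique(x)))
    all(n_distinct == 1L)
  }, logical(1))

  # Keep the first row of each ID, and only the invariant columns.
  dat[!duplicated(dat[[id]]), invariant, drop = FALSE]
}

# Toy long-format data: sex is constant within id, time and conc are not.
long <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  sex  = c("Male", "Male", "Male", "Female", "Female"),
  time = c(0, 1, 2, 0, 1),
  conc = c(0, 4.2, 2.9, 0, 5.1)
)

baseline <- one_row_per_id_sketch(long)
baseline
```

Time-varying columns such as time and conc are dropped because no single value could represent them unambiguously at the subject level.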
Here are six random rows of the resulting dataset:
##      id      study    sex              race                 ethnic
## 928 159 Xyz-ph3-07 Female             White     Hispanic or Latino
## 118  14  Xyz-hv-01   Male             White Not Hispanic or Latino
## 442  50  Xyz-ri-02 Female Multiple or Other Not Hispanic or Latino
## 764 118 Xyz-ph3-07 Female             White Not Hispanic or Latino
## 379  43  Xyz-ri-02 Female             Asian           Not reported
## 968 169 Xyz-ph3-07 Female             White Not Hispanic or Latino
##                  hv    form  fasted age     agecat       wt       ht
## 928         Patient  Tablet Unknown  40 < 65 years 69.11780 162.6218
## 118 Healthy Subject Capsule     Fed  32 < 65 years 94.09640 178.6365
## 442         Patient  Tablet  Fasted  79 ≥ 65 years 46.29503 154.5257
## 764         Patient  Tablet Unknown  37 < 65 years 63.69094 168.2729
## 379         Patient  Tablet  Fasted  67 ≥ 65 years 75.17084 156.0623
## 968         Patient  Tablet Unknown  69 ≥ 65 years 51.20350 149.6104
##          bmi      bsa      alb       alp      alt      ast      bili     creat
## 928 26.13560 1.742997 41.39164  59.16552 20.20104 29.25946  5.831341  59.48539
## 118 29.48715 2.127224 42.39993 102.36809 29.20991 29.82011 15.139345 106.61792
## 442 19.38798 1.416590 37.22390 140.36920 11.99238 29.56350  8.078682 159.12966
## 764 22.49309 1.725678 41.66294  66.00227 10.89596 58.22654  6.864295  46.61096
## 379 30.86407 1.753188 36.73265 131.09500 17.25876 15.40761 10.866950 183.66983
## 968 22.87580 1.444328 35.25484  60.00767 23.62474 16.59708  8.345510  65.85870
##          crcl fdarenal
## 928 121.26020   Normal
## 118 117.02707   Normal
## 442  18.52046   Severe
## 764 146.88099   Normal
## 379  31.17979 Moderate
## 968  57.60809 Moderate
Note that the categorical variables (which were numeric in the original .csv file) have been translated to factors, with the appropriate textual labels, and in the desired order. Compared to the corresponding R code that would be needed to achieve this, the YAML specification is much cleaner and more concise. Similarly for the label attributes.
Note on preserving label attributes: in most cases, subsetting a data.frame results in the label attributes being stripped away. The function subsetp() (‘p’ for preserve) can be used to avoid this. (It is used internally in one_row_per_id(), for instance.)
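The stripping of label attributes can be seen in base R with a small example. The helper subset_keep_labels() below is a hypothetical sketch of the "preserve" idea and is not the implementation of subsetp(); it assumes labels are stored in a "label" attribute, and the data values are made up.

```r
# Toy data with a label attribute on one column (values are made up).
dat <- data.frame(id = 1:4, wt = c(69.1, 94.1, 46.3, 63.7))
attr(dat$wt, "label") <- "Weight (kg)"

# Ordinary row subsetting rebuilds each column with `[`, which drops the label.
plain <- dat[dat$wt > 60, ]
is.null(attr(plain$wt, "label"))

# Hypothetical helper: subset the rows, then copy the labels back.
subset_keep_labels <- function(dat, rows) {
  out <- dat[rows, , drop = FALSE]
  for (v in names(out)) {
    attr(out[[v]], "label") <- attr(dat[[v]], "label")
  }
  out
}

kept <- subset_keep_labels(dat, dat$wt > 60)
attr(kept$wt, "label")
```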
To recap, so far our script has 3 lines of code:
That is all we need to be able to start creating our tables!
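Assuming the spec file from above sits in the working directory and the ID column is named id, those three lines plausibly look like the sketch below. The function names come from this guide, but the argument forms (a file path for read_from_spec(), and the ID column passed by name to one_row_per_id()) are assumptions rather than confirmed signatures; check the package help pages before relying on them. The guard only makes the sketch safe to run when the package or the file is absent.

```r
# Sketch only: the argument forms below are assumptions, not confirmed signatures.
have_inputs <- requireNamespace("table1c", quietly = TRUE) &&
  file.exists("data_spec.yaml")

if (have_inputs) {
  library(table1c)
  dat <- read_from_spec("data_spec.yaml")  # assumed: path to the spec file
  dat <- one_row_per_id(dat, "id")         # assumed: ID column given by name
}
```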
The variables to include in the table are specified using a one-sided formula, with stratification denoted by conditioning (i.e., the name of the stratification variable appears to the right of a vertical bar).
In this case, the data contains three studies, and the descriptive statistics are presented stratified by study, and overall. In general, if there are multiple studies it makes sense to stratify by study, and if there is a single study, there is usually some other variable that it makes sense to stratify on, like treatment arm or cohort.
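The structure of such a formula can be inspected with base R alone. This sketch only shows how R represents a one-sided formula with a conditioning bar; it makes no claim about how table1 processes the formula internally. The variable names are columns of the example dataset.

```r
# One-sided formula: variables to summarize, stratified by study.
f <- ~ sex + age + wt | study

# A one-sided formula has two parts: the `~` operator and its right-hand side.
length(f) == 2

# The right-hand side is a call to `|`; its two arguments are the
# variable list (left of the bar) and the stratification (right of the bar).
rhs <- f[[2]]
identical(rhs[[1]], as.name("|"))

vars   <- all.vars(rhs[[2]])  # variables to the left of the bar
strata <- all.vars(rhs[[3]])  # variable(s) to the right of the bar

vars
strata
```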
In this example, I have split the baseline characteristics into two tables by logical grouping, simply because there are too many of them to fit comfortably in a single table. The logical groups I have used are:
And here are the results:
[Table: Summary of Baseline Characteristics in the PK Population – Demographic]
table1(~ alb + alp + alt + ast + bili + crcl + fdarenal | study, data=dat,
       footnote=abbrev_footnote("ALP", "ALT", "AST", "CrCL", "FDA", "SD", "Min", "Max", "N"))
For continuous variables, the arithmetic mean, standard deviation, median, coefficient of variation, minimum and maximum are displayed on 3 lines. For categorical variables, the frequency and percentage (within each column) are shown, one line per category. Missing values, if any, are also shown as count and percent. The following rounding rules are applied:
Continuous and categorical variables can be mixed in the same table (note that age and age category are next to each other).
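For reference, the statistics named above can be computed in base R as in the following sketch. It uses only the six sample rows printed earlier (not the full dataset), applies no rounding rules, and does not reproduce the table layout. The factor level order for fdarenal is an assumption.

```r
# Weights (wt) from the six sample rows shown earlier.
wt <- c(69.11780, 94.09640, 46.29503, 63.69094, 75.17084, 51.20350)

# Continuous variable: the six statistics listed in the text.
continuous <- c(
  mean   = mean(wt),
  sd     = sd(wt),
  median = median(wt),
  cv_pct = 100 * sd(wt) / mean(wt),  # coefficient of variation, in percent
  min    = min(wt),
  max    = max(wt)
)
continuous

# Categorical variable: frequency and percentage, one entry per category.
# The level order below is assumed for illustration.
fdarenal <- factor(
  c("Normal", "Normal", "Severe", "Normal", "Moderate", "Moderate"),
  levels = c("Normal", "Moderate", "Severe")
)
counts <- table(fdarenal)
pct    <- 100 * prop.table(counts)
counts
pct
```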
The above tables (including headings and footnotes) can be copied from the Chrome browser directly to a Word document (or Excel sheet). All formatting will be preserved (the one exception I have found is that in certain cases, superscripts or subscripts may be too small, so it might be necessary to reset the font size for the whole table to 10 pt). Pasting into PowerPoint works too, but does a poorer job of preserving the formatting, so it may be better to paste to a Word document first, and then copy that to PowerPoint (use the “Keep Source Formatting” option when pasting).
The complete example can be found on GitHub.
R session information
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] table1c_0.1 table1_1.2 rmarkdown_2.2 nvimcom_0.9-102
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.1 magrittr_1.5 htmltools_0.5.0 tools_4.0.1
## [5] yaml_2.2.1 stringi_1.4.6 knitr_1.28 Formula_1.2-3
## [9] stringr_1.4.0 xfun_0.14 digest_0.6.25 rlang_0.4.6
## [13] evaluate_0.14