I will be using Jose Morales excellent post to show how .SD’s functionality can be replicated in python’s datatable. Not all functions can be replicated; R data.table has a whole lot more functions and features that are not yet implemented in datatable.
from datatable import dt, by, sort, f, fread
DT = fread('Data_files/iris.csv') DT.head()sepal_length sepal_width petal_length petal_width species ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5 3.6 1.4 0.2 setosa 5 5.4 3.9 1.7 0.4 setosa 6 4.6 3.4 1.4 0.3 setosa 7 5 3.4 1.5 0.2 setosa 8 4.4 2.9 1.4 0.2 setosa 9 4.9 3.1 1.5 0.1 setosa Number of unique observations per column#
# DT[, lapply(.SD, uniqueN)] --> Rdatatable DT.nunique()sepal_length sepal_width petal_length petal_width species ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 35 23 43 22 3 Mean of all columns by
species
#
# DT[, lapply(.SD, mean), by = species] --> Rdatatable DT[:, f[:].mean(), by('species')]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 5.006 3.428 1.462 0.246 1 versicolor 5.936 2.77 4.26 1.326 2 virginica 6.588 2.974 5.552 2.026 Filtering# First two observations by species#
# DT[, .SD[1:2], by = species] DT[:2, :, by('species')]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 5.1 3.5 1.4 0.2 1 setosa 4.9 3 1.4 0.2 2 versicolor 7 3.2 4.7 1.4 3 versicolor 6.4 3.2 4.5 1.5 4 virginica 6.3 3.3 6 2.5 5 virginica 5.8 2.7 5.1 1.9
In datatable, rows are selected in the i
section after the grouping, unlike in R’s data.table, where rows are selected in i
before grouping, and rows selected in the .SD
after grouping.
species
#
# DT[, tail(.SD, 2), by = species] DT[-2:, :, by('species')]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 5.3 3.7 1.5 0.2 1 setosa 5 3.3 1.4 0.2 2 versicolor 5.1 2.5 3 1.1 3 versicolor 5.7 2.8 4.1 1.3 4 virginica 6.2 3.4 5.4 2.3 5 virginica 5.9 3 5.1 1.8
Again, the rows are selected after grouping by using Python’s negative index slicing.
Select the top two sorted bysepal length
in descending order#
# DT[order(-sepal_length), head(.SD, 2), by = species] DT[:2, :, by('species'), sort(-f.sepal_length)]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 5.8 4 1.2 0.2 1 setosa 5.7 4.4 1.5 0.4 2 versicolor 7 3.2 4.7 1.4 3 versicolor 6.9 3.1 4.9 1.5 4 virginica 7.9 3.8 6.4 2 5 virginica 7.7 3.8 6.7 2.2
In datatable, the sort function replicates the order
function in R’s data.table. Note the -
symbol before the sepal_length f-expression; this instructs the dataframe to sort in descending order.
sepal length
and sepal width
#
# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] DT[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 4.6 3.6 1 0.2 1 setosa 5.2 4.1 1.5 0.1 2 versicolor 5.4 3 4.5 1.5 3 versicolor 4.9 2.4 3.3 1 4 virginica 4.9 2.5 4.5 1.7 5 virginica 6.2 3.4 5.4 2.3
Just like in R’s data.table, boolean expressions can be passed to the sort function.
Filter observations above the mean ofsepal_length
by species#
# DT[, .SD[sepal_length > mean(sepal_length)], by = species] DT[:, [f[:], (f.sepal_length > f.sepal_length.mean()).alias('temp')], by('species')][f.temp==1, :-1]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 5.1 3.5 1.4 0.2 1 setosa 5.4 3.9 1.7 0.4 2 setosa 5.4 3.7 1.5 0.2 3 setosa 5.8 4 1.2 0.2 4 setosa 5.7 4.4 1.5 0.4 5 setosa 5.4 3.9 1.3 0.4 6 setosa 5.1 3.5 1.4 0.3 7 setosa 5.7 3.8 1.7 0.3 8 setosa 5.1 3.8 1.5 0.3 9 setosa 5.4 3.4 1.7 0.2 10 setosa 5.1 3.7 1.5 0.4 11 setosa 5.1 3.3 1.7 0.5 12 setosa 5.2 3.5 1.5 0.2 13 setosa 5.2 3.4 1.4 0.2 14 setosa 5.4 3.4 1.5 0.4 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 63 virginica 6.7 3.1 5.6 2.4 64 virginica 6.9 3.1 5.1 2.3 65 virginica 6.8 3.2 5.9 2.3 66 virginica 6.7 3.3 5.7 2.5 67 virginica 6.7 3 5.2 2.3
Unlike in R’s data.table, boolean expressions can not be applied within the i
section, in the presence of by
. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column.
# DT[, .SD[.N > 10], keyby = .(species, petal_width)] DT[:, [f[:], (dt.count()>10).alias('temp')], by('species', 'petal_width')][f.temp==1, :-1]species petal_width sepal_length sepal_width petal_length ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 0.2 5.1 3.5 1.4 1 setosa 0.2 4.9 3 1.4 2 setosa 0.2 4.7 3.2 1.3 3 setosa 0.2 4.6 3.1 1.5 4 setosa 0.2 5 3.6 1.4 5 setosa 0.2 5 3.4 1.5 6 setosa 0.2 4.4 2.9 1.4 7 setosa 0.2 5.4 3.7 1.5 8 setosa 0.2 4.8 3.4 1.6 9 setosa 0.2 5.8 4 1.2 10 setosa 0.2 5.4 3.4 1.7 11 setosa 0.2 4.6 3.6 1 12 setosa 0.2 4.8 3.4 1.9 13 setosa 0.2 5 3 1.6 14 setosa 0.2 5.2 3.5 1.5 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 48 virginica 1.8 6.2 2.8 4.8 49 virginica 1.8 6.1 3 4.9 50 virginica 1.8 6.4 3.1 5.5 51 virginica 1.8 6 3 4.8 52 virginica 1.8 5.9 3 5.1 Get the row with the max petal_length by species.#
# DT[, .SD[which.max(petal_length)], by = species] OR # DT[, .SD[petal_length == max(petal_length)], by = species] # get rid of temp column DT[0, :, by('species'), sort(-f.petal_length)]species sepal_length sepal_width petal_length petal_width ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 setosa 4.8 3.4 1.9 0.2 1 versicolor 6 2.7 5.1 1.6 2 virginica 7.7 2.6 6.9 2.3
In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when petal_length
is sorted in descending order.
.SD
#
# col_idx <- grep("^sepal", names(DT)) --> filter for the specicfic columns # DT[, lapply(.SD, mean), .SDcols = col_idx] # filter for the specific columns with a list comprehension names = [name for name in DT.names if name.startswith('sepal')] DT[:, f[names].mean()]sepal_length sepal_width ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 5.84333 3.05733 Removing columns from
.SD
#
# col_idx <- grep("^(petal|species)", names(DT)) # DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols # here, exclusion occurs within the list comprehension names = [name for name in DT.names if not name.startswith(('petal','species'))] DT[:, f[names].mean()]sepal_length sepal_width ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 5.84333 3.05733 Column ranges#
# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width] DT[:, f['sepal_length':'sepal_width'].mean()]sepal_length sepal_width ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ 0 5.84333 3.05733 Summary#
We’ve seen how to replicate .SD
in datatable. There are other functionalities in .SD
that are not presently possible in Python’s datatable. It is possible that in the future, .SD
will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow numpy functions and functions from other Python libraries into datatable.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4