datapyground.compute.selection
Query plan nodes that implement projection of columns.
A common request in queries is to select specific columns
and project new columns based on expressions.
An example is the SELECT clause in SQL queries.
This module implements the basic projection capabilities:
SELECT a, b, a + b AS ab_sum
Classes
- class datapyground.compute.selection.ProjectNode(select: list[str] | None, project: dict[str, Expression] | None, child: QueryPlanNode)
Project data by selecting specific columns and computing expressions.
The projection expects a list of column names to select and a dictionary of column names and expressions to project new columns.
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> from datapyground.compute import col, lit, FunctionCallExpression, PyArrowTableDataSource
>>> data = pa.record_batch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> next(ProjectNode(["a"], {"ab_sum": FunctionCallExpression(pc.add, col("a"), col("b"))},
...      PyArrowTableDataSource(data)).batches())
pyarrow.RecordBatch
a: int64
ab_sum: int64
----
a: [1,2,3]
ab_sum: [5,7,9]
- Parameters:
select – The list of column names to select. None means select all columns. [] means select only the projected columns.
project – The dict {name: Expression} to project new columns.
child – The node emitting the data to be projected.
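The three cases for select can be illustrated by looking only at which column names end up in the output. This is a minimal sketch, not the library's code: `output_columns` is a hypothetical helper, and it assumes that projected columns are always kept and are placed after the selected ones, as the doctest output above suggests.

```python
from __future__ import annotations


def output_columns(
    available: list[str],
    select: list[str] | None,
    projected: list[str],
) -> list[str]:
    """Return the output column names (hypothetical helper, assumed ordering)."""
    # None: start from every column of the incoming batch.
    # A list (possibly empty): start from exactly those names.
    base = list(available) if select is None else list(select)
    # Projected columns are appended; a name already in base is not repeated.
    return base + [name for name in projected if name not in base]


print(output_columns(["a", "b"], None, ["ab_sum"]))   # ['a', 'b', 'ab_sum']
print(output_columns(["a", "b"], [], ["ab_sum"]))     # ['ab_sum']
print(output_columns(["a", "b"], ["a"], ["ab_sum"]))  # ['a', 'ab_sum']
```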
- batches() → Generator[RecordBatch, None, None]
Apply the projection to the child node.
For each record batch yielded by the child node, sequentially apply the expressions to project new columns, and then select the requested columns.
Projection runs before selection so that projected columns can depend on previously projected columns or on columns that are not in the selection.
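The order of operations can be sketched without PyArrow, using plain dicts of lists as a stand-in for record batches. This is an illustration under stated assumptions, not the actual implementation: `project_batch`, the `Batch` alias, and the lambdas used as expressions are hypothetical, whereas the real node works on `pyarrow.RecordBatch` and `Expression` objects.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical stand-in: a "batch" maps column names to lists of values.
Batch = dict


def project_batch(
    batch: Batch,
    select: list[str] | None,
    project: dict[str, Callable[[Batch], list]] | None,
) -> Batch:
    """Apply projections in order, then keep only the requested columns."""
    working = dict(batch)  # shallow copy, the input batch is left untouched
    projected_names = []
    for name, expr in (project or {}).items():
        # Each expression sees every input column plus the columns
        # projected before it, whether or not they are selected.
        working[name] = expr(working)
        projected_names.append(name)

    if select is None:
        # None: keep every column, original and projected.
        return working
    # Selection happens last: selected columns, then projected ones.
    keep = list(select) + [n for n in projected_names if n not in select]
    return {name: working[name] for name in keep}


batch = {"a": [1, 2, 3], "b": [4, 5, 6]}
result = project_batch(
    batch,
    select=["a"],
    project={
        # Uses "b", which is not in the selection.
        "ab_sum": lambda t: [x + y for x, y in zip(t["a"], t["b"])],
        # Uses "ab_sum", which was projected just before.
        "ab_sum_x2": lambda t: [v * 2 for v in t["ab_sum"]],
    },
)
print(result)  # {'a': [1, 2, 3], 'ab_sum': [5, 7, 9], 'ab_sum_x2': [10, 14, 18]}
```

If selection ran first, "b" would already be gone when "ab_sum" is computed, which is why the sketch selects only after all projections have been applied.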