datapyground.compute.base¶
Base classes and interfaces for Compute Engine
This module defines the base components that are necessary to represent a query plan and execute it.
Classes
- class datapyground.compute.base.ColumnRef(name: str)[source]¶
References a column in a record batch.
When another expression or the engine need to operate on a specific column, we will need a way to reference that column and its data.
This expression is aware of the column and when applied to a record batch returns the data for that column.
>>> import pyarrow as pa >>> data = pa.record_batch({"my_column": pa.array([1, 2, 3]), ... "othercol": pa.array([4, 5, 6])}) >>> column = ColumnRef("my_column") >>> column.apply(data) <pyarrow.lib.Int64Array object at ...> [ 1, 2, 3 ]
- Parameters:
name – The name of the column being referenced.
- apply(batch: RecordBatch) Array | Scalar[source]¶
Get the data for the column.
- class datapyground.compute.base.Expression[source]¶
Expression to apply to a RecordBatch.
Expressions are some form of operation that has to be applied to the data of a
pyarrow.RecordBatchto create new data.Typical example of expressions are: A + B which is expected to sum column A of the RecordBatch to column B of the RecordBatch and return the result.
As our engine is Column Major, applying an expression always results in a new column, thus in a
pyarrow.Arraythat contains the data for that column.- abstract apply(batch: RecordBatch) Array | Scalar[source]¶
Apply the expression to a RecordBatch.
Expression classes must implement this method to dictate what will happen when an expression is applied.
Suppose want to implement a
SumExpressionclass that might look like:class SumExpression(Expression): def __init__(self, leftcol, rightcol): self.lcol = lcol # left column name self.rcol = rcol # right column name def apply(self, batch): return pyarrow.compute.sum( batch[self.lcol], batch[self.rcol] )
- class datapyground.compute.base.Literal(value: str | int | float)[source]¶
Literal value.
Usually the engine will automatically cast to a Literal values of Python base types.
But using the literal object is helpful in cases where there is the need to be explicit.
>>> str(Literal(42)) 'Literal(<pyarrow.Int64Scalar: 42>)'
- Parameters:
value – The literal value.
- apply(batch: RecordBatch) Array | Scalar[source]¶
Get the literal value.
- class datapyground.compute.base.QueryPlanNode[source]¶
A node of a query execution plan.
The Query plan is represented as a tree of nodes. Each node is a step in the execution and all previous steps are children of the last one.
For example a simple plan might involve loading data and filtering it:
LoadDataNode -> FilterDataNode(filter)
That would be a plan where the last step is filtering, and the LoadDataNode is a child of the filter node.
The number of children can be variable, some nodes like for example Joins, will accept one or more child nodes that have to be joined together.
Each Node accepts
pyarrow.RecordBatchdata as its input and emits a newpyarrow.RecordBatchas its output.The base QueryPlanNode class does nothing and purely acts as the interface that all nodes must implement. Actual work will be done in the subclasses.
For example a simple node that takes data and just forwards it as is after printing its content can be implemented as:
class DebugDataNode(QueryPlanNode): def __init__(self, child): self.child = child def batches(self): for b in self.child.batches(): print(b) yield b def __str__(self): return f"DebugDataNode()"
- abstract batches() Generator[RecordBatch, None, None][source]¶
Emits the batches for the next node.
Each QueryPlan node is expected to be able to generate data that has to be provided to the next node in the plan.
Usually this happens by consuming data from its child nodes, transforming it somehow, and yielding it back to the next consumer.