datapyground.compute
The DataPyground Compute Engine
The compute engine defines the in-memory format for query plans and the plan nodes supported.
The compute engine is tightly bound to Apache Arrow:
every node expects to receive pyarrow.RecordBatch objects
and emits a new RecordBatch as the result of its execution.
This makes it easy to build compute pipelines like:
(RecordBatch)-->Node1--(RecordBatch)-->Node2--(RecordBatch)-->...
The query plan nodes themselves are in charge of their own execution. This keeps the behavior close to the node and makes it easy to understand how a node is actually executed without having to look elsewhere.
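The pull-based pipeline above can be sketched in plain Python. This is a minimal illustration of the protocol, not the engine's actual classes: `RecordBatchLike`, `DataSourceNode`, and `FilterLikeNode` are hypothetical stand-ins, and the real engine passes `pyarrow.RecordBatch` objects between nodes instead of dicts.

```python
from typing import Iterator

# Stand-in for pyarrow.RecordBatch: column name -> list of values.
RecordBatchLike = dict

class DataSourceNode:
    """Leaf node: emits batches from in-memory data."""
    def __init__(self, batches: list) -> None:
        self._batches = batches

    def batches(self) -> Iterator[RecordBatchLike]:
        yield from self._batches

class FilterLikeNode:
    """Each node drives its own execution: it pulls batches from its
    child, applies its predicate row by row, and yields a new batch."""
    def __init__(self, predicate, child) -> None:
        self.predicate = predicate
        self.child = child

    def batches(self) -> Iterator[RecordBatchLike]:
        for batch in self.child.batches():
            num_rows = len(next(iter(batch.values())))
            keep = [i for i in range(num_rows) if self.predicate(batch, i)]
            yield {name: [col[i] for i in keep] for name, col in batch.items()}

source = DataSourceNode([{
    "animals": ["Flamingo", "Horse", "Brittle stars", "Centipede"],
    "n_legs": [2, 4, 5, 100],
}])
query = FilterLikeNode(lambda batch, i: batch["n_legs"][i] >= 5, source)
for batch in query.batches():
    print(batch)  # {'animals': ['Brittle stars', 'Centipede'], 'n_legs': [5, 100]}
```

Because each node exposes the same `batches()` generator interface, nodes compose freely: the output of one node is the child of the next, mirroring the RecordBatch pipeline diagram above.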
Building a query plan means combining the nodes we want
to execute, starting with one or more DataSource nodes as the
leaf nodes of the query:
>>> import pyarrow as pa
>>> data = pa.table({
... "animals": pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"]),
... "n_legs": pa.array([2, 4, 5, 100])
... })
>>>
>>> import pyarrow.compute as pc
>>> from datapyground.compute import col, PyArrowTableDataSource
>>> from datapyground.compute import FilterNode, FunctionCallExpression
>>> # SELECT * FROM data WHERE n_legs >= 5
>>> query = FilterNode(
... FunctionCallExpression(pc.greater_equal, col("n_legs"), 5),
... child=PyArrowTableDataSource(
... data
... )
... )
>>> for data in query.batches():
... print(data)
pyarrow.RecordBatch
animals: string
n_legs: int64
----
animals: ["Brittle stars","Centipede"]
n_legs: [5,100]
Modules
- Query plan nodes that implement aggregations.
- Base classes and interfaces for the Compute Engine.
- Query plan nodes that load data.
- Expressions executed by compute engine nodes.
- Query plan nodes that implement filtering of rows.
- Query plan nodes that implement join operations.
- Support for limiting or skipping data in a query plan.
- Query plan nodes that implement projection of columns.
- Query plan nodes that perform sorting of data.