datapyground.compute.datasources

Query Plan nodes that load data

The datasource nodes are expected to fetch the data from some source, convert it into the format accepted by the compute engine and forward it to the next node in the plan.

They are used to do things like loading data from CSV files or equivalent operations

Classes

class datapyground.compute.datasources.CSVDataSource(filename: str, block_size: int | None = None)[source]

Load data from a CSV file.

Given a local CSV file path, load the content, convert it into Arrow format, and emit it for the next nodes of the query plan to consume.

Parameters:
  • filename – The path of the local CSV file.

  • block_size – How big to make batches of data, Influences how many batches will be produced

batches() Generator[RecordBatch, None, None][source]

Open CSV file and emit the batches.

poll_schema() Schema[source]

Poll the schema of the CSV file.

class datapyground.compute.datasources.DataSourceNode[source]

Base class for nodes that load data from a source.

abstract poll_schema() Schema[source]

Poll the schema of the data source without loading its content.

class datapyground.compute.datasources.ParquetDataSource(filename: str, batch_size: int | None = None)[source]

Load data from a Parquet file.

Given a local parquet file path, load the content, convert it into Arrow format, and emit it for the next nodes of the query plan to consume.

Parameters:
  • filename – The path of the local parquet file.

  • batch_size – How big to make batches of data, Influences how many batches will be produced

batches() Generator[RecordBatch, None, None][source]

Open CSV file and emit the batches.

poll_schema() Schema[source]

Poll the schema of the Parquet file.

class datapyground.compute.datasources.PyArrowTableDataSource(table: Table | RecordBatch)[source]

Load data from an in-memory pyarrow.Table or pyarrow.RecordBatch.

Given a pyarrow.Table or pyarrow.RecordBatch object, allow to use its data in a query plan.

Parameters:

table – The table or recordbatch with the data to read.

batches() Generator[RecordBatch, None, None][source]

Emit the data contained in the Table for consumption by other node.

poll_schema() Schema[source]

Poll the schema of the Table.