datapyground.compute.join

Query plan nodes that implement join operations.

The join operations are implemented by aligning the keys of the two tables and only keeping the rows where the keys are equal.

An alternative implementation would be to use a hash join algorithm that builds a hash table from one of the tables and then probes the other table to find matching rows.

Inner Join

Provided by InnnerJoinNode, the class provides a complete description of the steps involved in performing an inner join operation.

>>> import pyarrow as pa
>>> from datapyground.compute import InnnerJoinNode
>>> from datapyground.compute import PyArrowTableDataSource
>>> left = PyArrowTableDataSource(pa.record_batch({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]}))
>>> right = PyArrowTableDataSource(pa.record_batch({"id": [3, 2], "age": [25, 30]}))
>>> join_node = InnnerJoinNode("id", "id", left, right)
>>> next(join_node.batches())
pyarrow.RecordBatch
id: int64
name: string
age: int64
----
id: [2,3]
name: ["Bob","Charlie"]
age: [30,25]

Classes

class datapyground.compute.join.InnnerJoinNode(left_key: str, right_key: str, left_child: QueryPlanNode, right_child: QueryPlanNode)[source]

Join two data sources using an inner join.

The inner join is performed by aligning the keys of the two tables and only keeping the rows where the keys are equal.

An alternative implementation would be to use a hash join algorithm that builds a hash table from one of the tables and then probes the other table to find matching rows. This is more efficient for large tables but PyArrow 0.17 doesn’t currently expose a way to compute hashes and we want to avoid using Python loops to implement the hash join.

Supposing we have two tables:

left:
+----+--------+
| id | name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+


right:
+----+-----+
| id | age |
+----+-----+
| 3  | 25  |
| 2  | 30  |
+----+-----+

We would perform the following steps:

  1. Compute the unique values of the join keys in both tables. This is primarily done to speed up step 2:

    left_keys = [1, 2, 3]
    right_keys = [3, 2]
    
  2. Filter the tables to only keep the rows where the keys are in the other table. This allows us to remove the rows that don’t have a match in the other table. As inner joins return only the rows that have a match in both tables:

    left:
    +----+--------+
    | id | name   |
    +----+--------+
    | 2  | Bob    |
    | 3  | Charlie|
    +----+--------+
    
    right:
    +----+-----+
    | id | age |
    +----+-----+
    | 3  | 25  |
    | 2  | 30  |
    +----+-----+
    
  3. Sort the tables so that the keys are aligned. This is necessary to make sure that to each key in the left table corresponds the same key in the right table. If the keys are not aligned the join operation wouldn’t work as we would end up computing Bob is 25 years old. and Charlie is 30 years old. While instead according to the matching keys Bob should be 30 years old and Charlie should be 25 years old. After sorting the keys instead we seen that Bob is 30 years old and Charlie is 25 years old as row[0] has the same id in both tables and row[1] has the same id in both tables:

    left:
    +----+--------+
    | id | name   |
    +----+--------+
    | 2  | Bob    |
    | 3  | Charlie|
    +----+--------+
    
    right:
    +----+-----+
    | id | age |
    +----+-----+
    | 2  | 30  |
    | 3  | 25  |
    +----+-----+
    
  4. Combine the two tables into a new table. This is done by creating a new recordbatch that contains the columns of both tables. The keys are aligned so that for each key in the left table, the same key is in the right table at the same row. This way we can join on equal keys:

    combined:
    +----+--------+-----+
    | id | name   | age |
    +----+--------+-----+
    | 2  | Bob    | 30  |
    | 3  | Charlie| 25  |
    +----+--------+-----+
    
Parameters:
  • left_key – The key to join on in the left table.

  • right_key – The key to join on in the right table.

  • left_child – The left source of data to join.

  • right_child – The right source of data to join.

batches() Generator[RecordBatch, None, None][source]

Perform the inner join operation.

Accumulates all rows of both children to perform the join operation, so it is not suitable for large datasets.