datapyground.sql.tokenize¶
A simple regular expression-based tokenizer for SQL queries.
Given a text string containing an SQL query, this module provides a simple tokenizer that converts the input text into a sequence of tokens.
Each token is an object of a different class depending on its type; this allows the parser to dispatch parsing code based on the type of token and build the abstract syntax tree (AST) of the query.
This approach has some notable limitations, primarily in the context of tokenizing literals:
- Nested quotes in string literals are not correctly supported.
- Comments are not supported.
- Escape sequences in string literals are not supported.
For a more robust parser you would typically use a dedicated library like SQLGlot or Calcite, but for the purposes of DataPyground this simple tokenizer is good enough: it showcases how SQL queries can be parsed and executed.
The main class in this module is the Tokenizer class, which is responsible
for the tokenization process itself.
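To make the dispatch idea concrete, here is a minimal illustration of how a parser might branch on the token classes documented below (the describe helper is hypothetical, not part of the module):

```python
from datapyground.sql.tokenize import IdentifierToken, KeywordToken, Token

def describe(token: Token) -> str:
    # Hypothetical helper: a parser branches on the token's class
    # in a similar way to decide how to continue parsing.
    if isinstance(token, KeywordToken):
        return f"keyword: {token.value}"
    if isinstance(token, IdentifierToken):
        return f"identifier: {token.value}"
    return f"other: {token.value}"

print(describe(IdentifierToken("age")))  # identifier: age
```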
Functions
- datapyground.sql.tokenize.GENERATE_TOKENIZATION_REGEX() Pattern[source]¶
Combine the token specification into a regex pattern for tokenization.
This will generate the regular expression that the tokenizer will use to match tokens in the input text.
The regular expression is generated by combining the regex patterns of the token specification into a single regex pattern that will match any of the tokens.
When a token is matched, the group name of the match will be the name of the token type, which is used to dispatch the parsing code based on the type of token.
For example, "SELECT" will be matched by the KEYWORD token of the specification, and since the name of the token in the specification is also the group name of the match in the constructed regular expression, the match group will be 'KEYWORD'.
- datapyground.sql.tokenize.GENERATE_TOKEN_SPECIFICATION() list[tuple[str, str]][source]¶
Provides the token specification for the tokenizer.
Each entry in the specification is a tuple of (token_name, regex_pattern). The order of the entries is important as it defines the priority of the tokens. For example, the KEYWORD token should be before the IDENTIFIER token because if a keyword is matched, it should not be matched as an identifier.
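A quick illustration of this priority rule, using two hypothetical two-token specifications:

```python
import re

# KEYWORD first: "SELECT" is recognized as a keyword.
right = re.compile(r"(?P<KEYWORD>\bSELECT\b)|(?P<IDENTIFIER>[A-Za-z_]\w*)")
# IDENTIFIER first: "SELECT" is swallowed by the identifier pattern.
wrong = re.compile(r"(?P<IDENTIFIER>[A-Za-z_]\w*)|(?P<KEYWORD>\bSELECT\b)")

print(right.match("SELECT").lastgroup)  # KEYWORD
print(wrong.match("SELECT").lastgroup)  # IDENTIFIER
```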
Classes
- class datapyground.sql.tokenize.AliasToken(value: str)[source]¶
Token representing the AS keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.EOFToken[source]¶
Special Token representing the end of the input text.
Its value is hardcoded to EOF.
- class datapyground.sql.tokenize.FromToken(value: str)[source]¶
Token representing the FROM keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.GroupByToken(value: str)[source]¶
Token representing the GROUP BY keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.IdentifierToken(value: str)[source]¶
Token representing an identifier (table name, column name, etc).
- Parameters:
value – The text value of the token.
- class datapyground.sql.tokenize.InsertToken(value: str)[source]¶
Token representing the INSERT keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.JoinOnToken(value: str)[source]¶
Token representing the ON keyword in a JOIN clause.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.JoinToken(value: str)[source]¶
Token representing the JOIN keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.JoinTypeToken(value: str)[source]¶
Token representing the INNER, OUTER, LEFT, RIGHT, and FULL keywords in a JOIN clause.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.KeywordToken(value: str)[source]¶
Base class for all SQL Keywords.
Keywords are always represented in uppercase as a convention to distinguish them from other tokens.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.LimitToken(value: str)[source]¶
Token representing the LIMIT keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.LiteralToken(value: str)[source]¶
Token representing a literal value (string, number, etc).
- Parameters:
value – The text value of the token.
- class datapyground.sql.tokenize.OffsetToken(value: str)[source]¶
Token representing the OFFSET keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.OperatorToken(value: str)[source]¶
Token representing an operator (comparison, arithmetic, etc).
Operators are always represented in uppercase if they are text operators.
- Parameters:
value – The text value of the operator token.
- class datapyground.sql.tokenize.OrderByToken(value: str)[source]¶
Token representing the ORDER BY keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.PunctuationToken(value: str)[source]¶
Token representing a punctuation character (comma, parenthesis, etc).
- Parameters:
value – The text value of the token.
- class datapyground.sql.tokenize.SelectToken(value: str)[source]¶
Token representing the SELECT keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.SortingOrderToken(value: str)[source]¶
Token representing the ASC and DESC keywords.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.Token(value: str)[source]¶
Base class for all token types.
Every token will have a value attribute that represents the text value of the token as it appears in the input query.
- Parameters:
value – The text value of the token.
- class datapyground.sql.tokenize.Tokenizer(text: str)[source]¶
A simple regular expression-based tokenizer for SQL queries.
Given a text string containing an SQL query, this class uses the regular expression generated by GENERATE_TOKENIZATION_REGEX() to process the input text and convert it into a sequence of tokens. The tokens can subsequently be used by the parser to build the abstract syntax tree (AST) of the query.
The tokenizer assumes that the input text is a valid SQL SELECT query. If the input text is not a valid SQL query, the tokenizer may raise exceptions or behave unpredictably; this is a limitation of the simple regex-based approach used here. Also, given that the parser only supports SELECT queries, the tokenizer is not designed to handle other types of queries.
The tokenizer works by matching the regular expression against the text; once a match is found, an object of the corresponding token class is created and added to the list of tokens. The tokenizer then advances to the end of the matched token and continues matching until the end of the text is reached.
For example:
      SELECT id FROM table WHERE age >= 18
      ^      ^  ^    ^     ^     ^   ^  ^
pos = 0      7  10   15    21    27  31 34
would be tokenized into a sequence of tokens like:
SelectToken, IdentifierToken, FromToken, IdentifierToken, WhereToken, IdentifierToken, OperatorToken, LiteralToken
Note
The tokenizer is not thread-safe.
- Parameters:
text – The input text containing the SQL query to tokenize.
- keyword_token_classes¶
Mapping of keyword token values to their token classes.
- get_next_token() Match[str] | None[source]¶
Match the next token in the input text from the current tokenizer position.
- advance_to(pos: int) None[source]¶
Advance the tokenizer to a new position in the input text.
- Parameters:
pos – The new position to which to advance the tokenizer.
Subsequent calls to get_next_token() will start from the new position and only match tokens that come after it. The tokenizer advances automatically during the tokenization process; there is no need to invoke this method manually.
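Putting the two methods together, a hypothetical external driver loop over the documented API might look like the sketch below. In normal use the Tokenizer drives this loop itself, and the real class constructs token objects (dispatching through keyword_token_classes); the plain (kind, value) tuples and group names here are simplifying assumptions:

```python
from datapyground.sql.tokenize import Tokenizer

tokenizer = Tokenizer("SELECT id FROM table WHERE age >= 18")

tokens = []
while (match := tokenizer.get_next_token()) is not None:
    # The group name of the match identifies the token type.
    kind, value = match.lastgroup, match.group()
    tokens.append((kind, value))
    # Move past the matched text before looking for the next token.
    tokenizer.advance_to(match.end())

print(tokens[:2])  # e.g. [('KEYWORD', 'SELECT'), ('IDENTIFIER', 'id')]
```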
- class datapyground.sql.tokenize.UpdateToken(value: str)[source]¶
Token representing the UPDATE keyword.
- Parameters:
value – The text value of the keyword token.
- class datapyground.sql.tokenize.WhereToken(value: str)[source]¶
Token representing the WHERE keyword.
- Parameters:
value – The text value of the keyword token.
Exceptions
- Exception raised when an error occurs during tokenization.