пятница, 6 января 2023 г.

Creating a simple foreign data wrapper for PostgreSQL

Today I'd like to describe how to create own minimal usable foreign data wrapper (FDW) for PostgreSQL.

Common information about FDW interfaces

I suppose you know C and have basic PostgreSQL DBA experience. Foreign data wrapper (FDW) is a mean of accesssing another data sources from PostgreSQL DBMS. In this article I'll walk you through writing minimal usable FDW (mufdw). Going through mufdw code, I'll show you how you can use PostgreSQL FDW API (its description you can find in the official documentation).

To access another data source you define FDW by specifying its name, validator function and handler function like this

CREATE FOREIGN DATA WRAPPER mufdw
  HANDLER mufdw_handler
  VALIDATOR mufdw_validator;
Both handler and validator are SQL functions. A validator is called when user mapping, foreign server or foreign table is created or modified to check if option is valid for the object. A handler just returns a struct, containing methods defining FDW (FdwRoutine). Most FDW will do foreign data wrapper creation behind the scenes when you create extension, so for administrator this could look like
CREATE EXTENSION mufdw;
Next you should define a foreign server, which basically specifies a remote server host in one or another way. For example, you can define host name, port and database name in foreign server options. Our sample mufdw allows to query another table in the same database. So server definition doesn't define any options and you can create server simply by
CREATE SERVER loopback FOREIGN DATA WRAPPER mufdw;
Now we have to define user mapping - how local user should access remote data source. Usually user mapping definition includes authentication information - like remote user name and password. As our minimal usable fdw will just wrap local tables and use the same user for scanning them (via SPI interface), our user mapping will be formal:
CREATE USER MAPPING FOR PUBLIC SERVER loopback;
Let's create a plain table which will be used as a source for our foreign table.
CREATE TABLE players(id int primary key, nick text, score int);
INSERT INTO players SELECT i, 'nick_'||i, i*10 from generate_series(1,10) i;
Foreign table is actually an object which behaves mostly like normal table, but provides access to remote data. In foreign table definition we provide options, necessary to identify this data. In our case it's local table and schema name.
CREATE FOREIGN TABLE f_players (id int, nick text, score int) server loopback options (table_name 'players', schema_name 'public');
Now we can query data from our foreign table.
SELECT * FROM players ;
 id |  nick   | score 
----+---------+-------
  1 | nick_1  |    10
  2 | nick_2  |    20
  3 | nick_3  |    30
  4 | nick_4  |    40
  5 | nick_5  |    50
  6 | nick_6  |    60
  7 | nick_7  |    70
  8 | nick_8  |    80
  9 | nick_9  |    90
 10 | nick_10 |   100
(10 rows)

What happens when you query a foreign table? PostgreSQL planner firstly looks if a table used in query is a foreign table and looks for its foreign server FdwRoutine. Pointer to FdwRoutine is recorded in RelOptInfo structure, which is used in planner to represent relation. This work is done in get_relation_info() function. Firstly planner access our fdw methods when looks for relation size in set_rel_size() (which is called for each base relation in the begining of planning (make_one_rel()). set_foreign_size() calls GetForeignRelSize() to find out relation size after applying restriction clauses. When planner generate possible paths to access a base relation in set_rel_pathlist(), it calls GetForeignPaths() function from RelOptInfo->fdwroutine to generate foreign path. Access paths are used by planner to find out possible strategy to access relation. Usually for foreign tables there will be no another paths besides foreign path. For join relation there could be several paths - foreign join path and several local join paths (for example, using nestloop or hash join method). If foreign path is the best access path for a particular relation, GetForeignPlan() function from its fdwroutine will be called to generate ForeignScan plan in create_foreignscan_plan(). Executor executes ForeignScan plan. When it initializes foreign scan state, BeginForeignScan() is called. Later tuples are fetched as needed by executor in IterateForeignScan(). If it's necessary to restart foreign scan from begining, ReScanForeignScan() is called. When relation scan is ended, EndForeignScan() is called.

Mufdw implementation

Let's look at how the minimal set of functions needed by read-only FDW is implemented in mufdw.

The basic logic behind mufdw is simple. For each foreign table we provide table name and table schema as foreign table options. When user queries remote table we open SPI cursor and fetch data from local table, identified by these schema and name. Query text is saved in foreign scan private field during planning. In BeginForeignScan() we create cursor for later use. IterateForeignScan() fetches one tuple from it and saves in executor state node scan slot. Here node is a basic "class"-like hierarchical structures used in PostgreSQL source code.

GetForeignRelSize() function

void GetForeignRelSize(PlannerInfo *root, RelOptInfo *baserel, Oid foreigntableid)

This methods obtains relation size estimates for a foreign table. Here root is the planner information about the query, baserel - information about the table and foreigntableid is the Oid of the foreign table. In this function we have to fill in baserel->tuples and baserel->rows. baserel->fdw_private can be initialized and used for private fdw purposes.

In mufdw it is implemented as mufdwGetForeignRelSize(). mufdwGetForeignRelSize() basically allocates memory for relation fdw_private structure, searches in foreign table options for "table_schema" and "table_name" and saves them for further use. It also makes minimal effort to estimate relation size, but in fact it gives default estimations, as we don't gather any statistic for mufdw foreign table.

GetForeignPaths() function

void GetForeignPaths(PlannerInfo *root, RelOptInfo *baserel, Oid foreigntableid)

This function should create foreign path and add it to baserel->pathlist. It's recommended to use create_foreignscan_path to build the ForeignPath. Arguments are the same.

In mufdw it is implemented as mufdwGetForeignPaths(). It creates basic foreign scan path and adds it to the relation path list. Cost is estimated as a sequential scan. These estimates are not accurate in any way.

GetForeignPlan() function

ForeignScan *GetForeignPlan(PlannerInfo *root, RelOptInfo *baserel, Oid foreigntableid, ForeignPath *best_path, List *tlist, List *scan_clauses, Plan *outer_plan)

This function is called in the end of query planning and should create a ForeignScan plan node. It is passed foreign path, generated by GetForeignPaths(), GetForeignJoinPaths() or GetForeignUpperPaths(). We also get the target list, which should be emitted by plan node, restriction clauses to be enforced and outer subplan of the ForeignScan.

In mufdw it is implemented as mufdwGetForeignPlan(). We put all scan clauses in the plan node qual list for recheck as mufdw doesn't perform "remote" filtering. Function gets scan clauses as list of RestrinctInfo nodes, but should use expression list for plan quals. extract_actual_clauses() is used to perform this transformation. Also we construct simple "SELECT *" query to extract data from plain table. fdw_private foreignscan field should be a list of nodes, so that copyObject() could copy them. So we wrap our C string, representing query, into String node and create fdw_private as single-member list.

BeginForeignScan() function

void BeginForeignScan(ForeignScanState *node, int eflags)

This function is called to begin foreign scan execution. It can, for example, establish connection to DBMS. Note that function will also be called when running explain for a query (in that case (eflags & EXEC_FLAG_EXPLAIN_ONLY) is true). When called from explain, function should avoid doing any externally visible actions.

In mufdw it is implemented as mufdwBeginForeignScan(). Here we create cursor for the query, constructed during plan creation and initialize internal scan state with its name. Our local open_new_cursor() function opens cursor for query using SPI and returns name of corresponding portal. The issue here is that SPI context is transient, we can't use memory, allocated in this context, in other parts of program, so should copy it to our old context.

IterateForeignScan function

TupleTableSlot *IterateForeignScan(ForeignScanState *node)

This function is used to fetch one tuple and return it in ForeignScanState's tuple slot.

In mufdw it is implemented as mufdwIterateForeignScan(). Firstly we clear tuple slot, search for our SPI cursor and fetch one tuple from it. If there's no data, it will be enough. If there is some data, we deform SPI tuple to slot's array of values and nulls and call ExecStoreVirtualTuple() to store it. The only issue is that it should be materialized, as SPI tuple will be released when SPI context is destroyed.

ReScanForeignScan function

void ReScanForeignScan(ForeignScanState *node)

This function is used to restart scan from the begining.

In mufdw it is implemented as mufdwReScanForeignScan(). It just closes old cursor and opens new one for our query, stored in foreign scan internal state.

EndForeignScan function

void EndForeignScan(ForeignScanState *node)

This function should just end scan and release resources.

In mufdw it is an empty function. We don't explicitly release resources as they will be released when transaction ends.

fdw_handler function

fdw_handler is just a SQL function which should return pointer to FdwRoutine structure, which contains references to FDW functions, which we've just described.

In mufdw it's implemeted as mufdw_handler(). Only functions, necessary for qeurying foreign tables, are implemented.

fdw_validator function

fdw_validator checks options, used during CREATE or ALTER statements for foreign data wrapeprs, foreign servers, user mappings and foreign tables. It gets list of options as an array of text as the first argument and OID representing the type of object the options are associated with. OID can be ForeignDataWrapperRelationId, ForeignServerRelationId, UserMappingRelationId, or ForeignTableRelationId.

In mufdw it's implemented as mufdw_validator. It allows only "table_name" and "schema_name" options for foreign tables and doesn't check options for other objects.

SQL code

We need a bit of SQL code to glue this all together. We should create sql-level functions for validator and handler and create foreign data wrapper. For mufdw it's done in its extension script.

Afterword

We looked at how you can create a simple read-only foreign data wrapper in C, analyzing mufdw sample foreign data wrapper. I hope, this was useful, or at least interesting.