Multi source data injection to Postgres RDS with encryption and FTS support
Multi source data injection to Postgres RDS with encryption and FTS support.
By 3manuek in Postgres
March 23, 2016
Sponsored: Pythian Inc.
Note 1: All the of this presentation is published in this repository. You will find a lot of folders and information, probably part of a blog series.
Note 2: All the work on this article is a POC (Proof of concept).
Note 3: This is something that is related for HIPAA compliant.
KMS/RDS
The POC on this article was developed before the releasing of the Key Management service for RDS.
I totally discourage to use the current approach for encrypting data. Use REST.
Introduction
I’ve been dealing with an issue that came into my desktop from people of the community, regarding RDS and HIPAA rules. There was a confusing scenario whether PostgreSQL was using FTS and encryption on RDS. There are a lot of details regarding the architecture, however I think it won’t be necessary to dig into very deeply to understand the basics of the present article moto.
HIPAA rules are complex and if you need to deal with them, you’ll probably need to go through a careful read.
tl;dr
? they tell us to store data encrypted on servers that are not in the premises.
And that’s the case of RDS. However, all the communications are encrypted using
SSL protocol, but is not enough to compliant with HIPAA rules.
CPU resources in RDS are expensive and not stable sometimes, which makes encryption and
FTS features not very well suited for this kind of service. I not saying that you
can’t implement them, just keep in mind that a standard CPU against vCPU could
have a lot difference. If you want to benchmark your local CPU against RDS vCPU,
you can run the following query inside psql
on both instances:
|
|
There are a lot of things and functions you can combine from the pgcrypto
package
(you will see that the repository contemplates all of them).
I will try to post another blog post regarding this kind of benchmarks. In the
meantime, this query should be enough to have a rough idea of the performance difference
between RDS instance vCPU and your server CPUs.
Architecture basics
For this POC we are going to store FTS and GPG keys locally, in a simple PostgreSQL instance and, using a trigger, encrypt and upload transparently to RDS using the standard FDW (Foreign Data Wrappers).
Have in mind that RDS communication is already encrypted via SSL when data flows between server/client. It’s important to clarify this, to avoid confusions between communication encryption and storing data encrypted.
The simple trigger will split the unencrypted data between a local table storing
in a tsvector
column (jsonb in the TODO), it will encrypt and push the encrypted
data into RDS using FDW (the standard postgres_fdw package).
A simple flight view of the idea can be observed in the image bellow.
RDS structure and mirrored local structure with FDW
RDS instance schema structure contains a parent table , a partitioning trigger and its trigger:
|
|
We are not going to use the partial SSN
column in the examples, but I found it very helpful to
do RDS searches over encrypted data without fall into the need of decrypting in-the-fly the SSN.
The last SSN’s 4-digits do not provide useful information if stolen.
Also, the magic of the multi-source data injection comes from the compound key using a bigint and a source tag.
Basically, you can think on the local nodes as proxies. You can insert data on every node, but the data will point to the RDS instance.
If you are planning to manage large amounts of data, you can partition the table on RDS, allowing a better organization for data management.
You will see no indexes over encrypted data
Local nodes structure:
|
|
Same table. Everytime we want to deal with the RDS table, we are going to do so using the __person__pgp_RDS
table, which is just a mapping table. We can query this table as any other usual table.
For testing purposes, I also created a table with the same structure as the above with
__person_rds_RDS_URE
table name and added the use_remote_estimate 'true'
option.
When enabled, postgres_fdw obtains the row count and estimates from the remote server.
Inserting keys locally
Just to avoid an extended article, I will skip the GPG key creation commands here. Please follow the instructions on the link at the referece section about keys.
We can insert they keys in several ways, but I found very convenient to use psql
features to do so. Once the keys are in place you can use \lo_import
command:
|
|
The next steps are very straightforward. In a real scenario, you won’t probably want to upload private keys into the table, just for practical purposes of this article I’m going to do so (only for decrypt data in the SELECT query).
pgp_key_id
will return the same key no matter if you use private or public key.
|
|
Splitting data to FTS, encrypt and push into RDS
Now, here is when the tricky part starts. We are going to achieve some functionalities:
- We are going to simulate routing using inheritance on the FTS records. That will allow us to split data as we want and, replicate using Logical Decoding feature between the nodes. I won’t include this on the current article just to avoid it to be extense.
- We are going to encrypt using the key that we select on the insert query. If you want a key per table basis, you will find easier to hardcode the key id on the
_func_get_FTS_encrypt_and_push_to_RDS
. - Once the records are encrypted, the function will insert those records to the foreign table (RDS).
- When querying the FTS table, we will be able to determine the source (something like the
routing
technique, you will find this familiar if you played with ElasticSearch). That allow us to make the FTS search transparent to the application, pointing always to the parent table. :dogewow:
Isn’t Postgres cool? :o
FTS table structures
|
|
Doing this, you avoid to have a column with a constant value in the table, consuming unnecessary space. You can have with this method, different names and tables accross the cluster, but always using the same query against local_search
. You can map/reduce the data if you want to across the nodes, with the very same query.
Is not necessary to only have 1 source or route per node. The only requirement for this is to have different routes per node (combining source and route could increase complexity, however is possible).
Main code
|
|
This functions does everything. It inserts the data on RDS and split the data on the corresponding FTS child table. For performance purposes, I didn’t want to catch exceptions at insert time (if the child table does not exists, i.e.), but you can also add this feature with an exception block as follows:
|
|
The same can be done over the foreign table. More info in “Class HV — Foreign Data Wrapper Error (SQL/MED)” (HV00R -fdw_table_not_found
).
Check “Appendix A. PostgreSQL Error Codes” on the official manual for references about error codes.
Inserting data
At insertion time, we are going to push data through a mapping table. The reason for this is that all the encrypted data is stored in bytea
datatype, and we want to have clear queries instead.
A random data query will look as:
|
|
Did you see the inner comment? Well, probably you want to split by customer
or any other alias. I’m using this ugly harcoded text just to avoid a long article.
Also, if you want to avoid harcoding as much as posible, you can consider to use a function that returns the host name or routing tag.
Querying the data
We are almost done! Now we can do some queries. Here are some examples:
Limiting the matches:
|
|
All the matches and double check from were the data came from:
|
|
And, we can’t finish the article without showing how to use the ranking (did you see those setweight functions used in the function? You got it!):
# SELECT rds.id,
# convert_from(pgp_pub_decrypt(fname::bytea, ks.priv,''::text)::bytea,'SQL_ASCII'::name),
# convert_from(pgp_pub_decrypt(lname::bytea, ks.priv,''::text)::bytea,'SQL_ASCII'::name),
# ts_rank( ls._FTS, query ) as rank
# FROM local_search ls JOIN
# __person__pgp_rds as rds ON (rds.id = ls.id AND rds.source = 'host1') JOIN
# keys ks USING (keyid),
# to_tsquery('Mario | Casas | (Casas:*A & Mario:*B) ') query
# WHERE
# ls._FTS @@ query
# ORDER BY rank DESC;
id | convert_from | convert_from | rank
----+--------------+--------------+----------
43 | Mario | Casas | 0.425549
61 | Ana | Casas | 0.303964
66 | Horacio | Casas | 0.303964
(3 rows)
Remember, think that this query is doing FTS, decryption and ranking in just one query, over a local and a remote server. You can’t say that PostgreSQL isn’t hipster enough!
I can’t continue the article without showing the query plan executed by the local host (using buffers, analyze and verbose options).
|
|
From the above Query Plan extract, we can see that the partitioning at RDS is transparent for the query. The execution node in charge of extracting data from the RDS is the Foreign Scan, which also provides the query executed remotely.
Wait a minute. Looks like the remote SQL is somehow dangerous to execute. It is not using the id! There is a reason for that, and its related on how postgres gather the foreign table statistics. If I use the remote estimations we can see how the remote SQL changes in the Query Plan:
|
|
Query Plan (Foreign Scan execution node):
...
-> Foreign Scan on public.__person__pgp_rds_ure rds (cost=100.01..108.21 rows=2 width=1018) (actual time=250.334..250.336 rows=1 loops=31)
Output: rds.id, rds.source, rds.partial_ssn, rds.ssn, rds.keyid, rds.fname, rds.lname, rds.description, rds.auth_drugs, rds.patology
Remote SQL: SELECT id, keyid, fname, lname FROM enc_schema.__person__pgp WHERE ((source = 'host1'::text)) AND (($1::bigint = id))
...
Foreign tables also need the local statistics to be updated. In the next examples
there are 3 queries: using the use_remote_estimate
, without previous ANALYZE and
without use_remote_estimate
and a query using the local estimations (__person_pgp_rds
)
after issuing ANALYZE and without URE.
|
|
After analyzing both foreign tables , the execution time difference was calculated at 11% in favor of using local estimations.
NOTE about UPDATES: it is necessary to code the UPDATE trigger also, in order to decrypt , modify and re-encrypt the data.
Json/jsonb datatype is here to help
You can collapse all the data and use json
datatype on the mapping and foreign table, allowing you to avoid the pain of pointing and decrypting data per column basis.
Put all the encrypted columns in a bytea
column on RDS. The mapping table will look as follows:
|
|
At insert time, just use a json column instead per column basis. Keep in mind that you will need to deal within the json contents.
I found using this easier for insert, but the FTS needs some clean up to avoid insert column names in the _fts
field at local_search
tables.
Also, for updates, the jsonb datatype will need extra work when extracting attributes.
Additional functions used here
In the insert statement above, you will see a user defined function that gets a random length vector of drugs. It is implemented using the following code:
|
|
References
A very awesome tutorial about FTS for PostgreSQL can be found here.
Discussion in the community mailing lis here
- Posted on:
- March 23, 2016
- Length:
- 16 minute read, 3306 words
- Categories:
- Postgres
- Series:
- Getting Started
- Tags:
- hugo-site