Engineering reference
PII handling, GDPR compliance, and anonymisation for data teams: risk audit, pseudonymisation toolkit, right to erasure implementation, and the analytics-safe data layer pattern.
Privacy engineering is not a compliance exercise bolted onto a data pipeline after it ships. It is an architectural constraint that changes the design of the pipeline from the ingestion point forward. The organisations that treat it as an afterthought pay for it twice: once when they implement controls after the fact against a schema that was not designed for them, and once when the audit or the breach makes the cost of the original decision concrete.
This reference covers the five categories of personal data in data engineering contexts, the technical controls that apply at each stage of a pipeline, Python implementations of the three most useful pseudonymisation techniques, the GDPR technical requirements translated into engineering tasks rather than legal obligations, and the analytics-safe data layer pattern that gives analysts working access to data without requiring access to raw PII.
Before implementing controls, you need to know what you have and where it is. Schema names alone are not sufficient: PII hides in JSON blobs, in poorly named columns, in logs that were never meant to be permanent, and in audit tables that nobody thought to exclude from the analytics pipeline.
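As a minimal sketch of why column names alone are not enough, the snippet below walks a JSON blob and flags string values that look like personal data. The pattern set and the function name `find_pii_in_json` are illustrative assumptions, not part of any library; real detection needs broader, locale-aware rules.

```python
import json
import re

# Illustrative patterns only (assumption): extend for phone numbers, national IDs, etc.
PII_VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii_in_json(blob: str) -> list[tuple[str, str]]:
    """Return (json_path, pattern_name) for every string value that matches a pattern."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")
        elif isinstance(node, str):
            for name, pattern in PII_VALUE_PATTERNS.items():
                if pattern.search(node):
                    hits.append((path, name))

    walk(json.loads(blob), "$")
    return hits


# A payload column whose name gives no hint that it contains PII.
sample = '{"event": "signup", "payload": {"contact": "user@example.com", "tags": ["10.0.0.7"]}}'
hits = find_pii_in_json(sample)
```

Running a scan like this over a sample of rows from each JSON or free-text column gives the inventory a starting list of locations that a schema-only search would miss.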
Not all PII requires the same engineering overhead. Classify into two tiers and apply controls proportionately:
Four steps, executed in order:
Query information_schema.columns for column names matching known PII patterns: email, phone, ip_address, ssn, national_id, passport, dob, date_of_birth.
Tag each identified column as pii_sensitivity: high or pii_sensitivity: low. dbt meta tags, Amundsen, and DataHub all support this. Without tagging, the inventory is a document that becomes stale. With tagging, the catalogue is the inventory.
Collect and process only what is necessary for the specific purpose. Implementation: drop PII columns during the extraction phase if they are not needed downstream. Calculate age or age group from date of birth at the edge; do not ingest the date of birth into the analytics pipeline.
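Deriving an age group at the edge might look like the sketch below. The band boundaries, the field names, and the helper `minimise_record` are assumptions chosen for illustration; the point is only that the date of birth never leaves the extraction step.

```python
from datetime import date


def age_band(date_of_birth: date, today: date) -> str:
    """Map a date of birth to a coarse age band (band boundaries are illustrative)."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - date_of_birth.year - (
        (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    )
    if age < 18:
        return "under_18"
    if age < 35:
        return "18_34"
    if age < 55:
        return "35_54"
    return "55_plus"


def minimise_record(raw: dict, today: date) -> dict:
    """Keep only the fields needed downstream; the date of birth is dropped."""
    return {
        "country": raw["country"],
        "age_band": age_band(raw["date_of_birth"], today),
    }


raw_record = {
    "email": "user@example.com",
    "country": "AE",
    "date_of_birth": date(1990, 6, 15),
}
minimised = minimise_record(raw_record, today=date(2025, 1, 1))
```

Building the output as an explicit allow-list of fields, rather than deleting known PII keys, means a new sensitive column added upstream is excluded by default.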
Replace direct identifiers with artificial identifiers. The data cannot be attributed to a specific individual without additional information that is kept separately and under restricted access. Implementation: hash user IDs, emails, and IP addresses at ingestion. Store the mapping table in a secure, isolated database. The analytics warehouse never sees the raw identifier.
Irreversible alteration so that individuals can no longer be identified directly or indirectly. Once genuinely anonymised, data is no longer subject to GDPR. Techniques include hashing without retaining the salt or key, k-anonymity (generalising data so each record is indistinguishable from at least k-1 others), and adding calibrated statistical noise (differential privacy). Note: pseudonymisation is not anonymisation. Pseudonymised data can be re-identified given access to the mapping table. Anonymised data, correctly implemented, cannot.
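A k-anonymity check can be sketched in a few lines: group records by their quasi-identifiers and take the size of the smallest group. The function name `k_anonymity` and the sample columns are assumptions for illustration; this measures k for a dataset, it does not perform the generalisation itself.

```python
from collections import Counter


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


rows = [
    {"age_band": "18_34", "region": "Dubai"},
    {"age_band": "18_34", "region": "Dubai"},
    {"age_band": "35_54", "region": "Dubai"},
]

# The single 35_54 record is unique on these columns, so k is 1.
k = k_anonymity(rows, ["age_band", "region"])
```

A result below the target k means the quasi-identifiers need further generalisation, or the small groups need suppressing, before the dataset is released.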
In transit: enforce TLS 1.2 or higher for all database connections and API endpoints. At rest: enable transparent data encryption (TDE) on databases using cloud key management (AWS KMS, GCP Cloud KMS). Column-level encryption for highly sensitive fields. Key management is as important as encryption: keys stored in the same system as the data they encrypt provide no protection.
Role-Based Access Control (RBAC) at the database, warehouse, and BI tool levels. Row-level security and dynamic data masking in modern warehouses (Snowflake, BigQuery) allow analysts to query tables without seeing PII fields. Developers must not have read access to production PII in lower environments. Separate credentials per service account, per purpose.
Set Time-to-Live policies on object storage. Partition data warehouses by date and drop old partitions on schedule. Implement soft-delete mechanisms followed by hard-delete batch processes. Retention is only meaningful if it is enforced automatically: manual deletion processes will be missed.
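The partition-dropping step can be driven by a small, testable function that decides which date partitions have aged out. The name `expired_partitions` and the 365-day retention period are assumptions; the actual drop statement is warehouse-specific and not shown.

```python
from datetime import date, timedelta


def expired_partitions(
    partition_dates: list[date], today: date, retention_days: int
) -> list[date]:
    """Return the partitions strictly older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


partitions = [date(2024, 1, 1), date(2024, 6, 1), date(2024, 12, 1)]

# With a 365-day window on 2025-01-01, only the 2024-01-01 partition has expired.
to_drop = expired_partitions(partitions, today=date(2025, 1, 1), retention_days=365)
```

Keeping the decision logic separate from the drop operation makes it possible to unit test the retention rule and to log exactly what a scheduled run intends to delete.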
Three techniques, each suited to different use cases. Selecting the wrong technique creates either a security gap or an operational problem at scale.
Use case: joining datasets across different systems without exposing the raw identifier. The same input always produces the same output, enabling cross-system joins. Trade-off: susceptible to dictionary attacks if the input space is small or predictable (an attacker can hash a list of candidate email addresses and compare the results). Always use a keyed hash (HMAC) with a strong, secret salt as the key.
import hashlib
import hmac


def consistent_hash_pii(pii_value: str, secret_salt: bytes) -> str:
    """
    Consistently hashes PII using HMAC-SHA256.
    secret_salt must be stored in a secure vault (e.g., AWS Secrets Manager)
    and kept consistent across all systems that need to join on this field.
    """
    if not pii_value:
        return ""
    hashed = hmac.new(
        key=secret_salt,
        msg=pii_value.encode('utf-8'),
        digestmod=hashlib.sha256
    ).hexdigest()
    return hashed


# In production: load the salt from a secure vault, never hardcode
SHARED_SECRET_SALT = b"load_from_secrets_manager"
email = "user@example.com"
pseudonym = consistent_hash_pii(email, SHARED_SECRET_SALT)
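To make the dictionary-attack trade-off concrete, the sketch below compares an unkeyed SHA-256 digest with an HMAC digest under a secret key. The helper names and the candidate list are assumptions for illustration only.

```python
import hashlib
import hmac


def unsalted_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(target_digest: str, candidates: list[str], hash_fn):
    """Hash each candidate and return the one whose digest matches, if any."""
    for candidate in candidates:
        if hmac.compare_digest(hash_fn(candidate), target_digest):
            return candidate
    return None


# An attacker holding a list of plausible email addresses.
candidates = ["alice@example.com", "user@example.com"]

leaked_unsalted = unsalted_hash("user@example.com")
leaked_keyed = keyed_hash("user@example.com", b"secret-known-only-to-the-pipeline")

# Without a key, the attacker can reproduce the digest and recover the email.
recovered = dictionary_attack(leaked_unsalted, candidates, unsalted_hash)

# With HMAC, the attacker must also know the key; a wrong key finds no match.
not_recovered = dictionary_attack(
    leaked_keyed, candidates, lambda v: keyed_hash(v, b"attacker-guess")
)
```

The protection therefore rests entirely on the secrecy of the key, which is why it belongs in a vault rather than in pipeline code or configuration.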
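Use case: when re-identification is required for legitimate purposes (customer support, compliance audits, right of access requests). No mapping table is needed: the key alone recovers the original value. Key rotation is supported, but stored values must be re-encrypted under the new key. Trade-off: requires key management infrastructure and is computationally heavier than hashing. Fernet, used below, is also non-deterministic: encrypting the same value twice yields different ciphertexts, so encrypted values cannot be used as join keys.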
from cryptography.fernet import Fernet


class PIIEncryptor:
    def __init__(self, key: bytes):
        self.cipher_suite = Fernet(key)

    def encrypt_pii(self, pii_value: str) -> str:
        if not pii_value:
            return ""
        return self.cipher_suite.encrypt(pii_value.encode('utf-8')).decode('utf-8')

    def decrypt_pii(self, encrypted_value: str) -> str:
        if not encrypted_value:
            return ""
        return self.cipher_suite.decrypt(encrypted_value.encode('utf-8')).decode('utf-8')


# Demonstration only: the key is generated in-process here.
# In production, generate it once with Fernet.generate_key(), store it in a
# secure vault, and load it at runtime. Never keep it in code.
MASTER_KEY = Fernet.generate_key()
encryptor = PIIEncryptor(MASTER_KEY)
encrypted_phone = encryptor.encrypt_pii("+971501234567")
original_phone = encryptor.decrypt_pii(encrypted_phone)
Use case: completely decoupling the identifier from its underlying value. The token in the analytics pipeline carries no information about the original identifier. Reversible only by querying the isolated token vault. Trade-off: requires maintaining state in a separate database. Can become a bottleneck in high-throughput pipelines: benchmark before deploying at scale.
import sqlite3
import uuid
from typing import Optional


class Tokeniser:
    def __init__(self, db_path='token_vault.db'):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._setup_table()

    def _setup_table(self):
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS token_vault (
                token TEXT PRIMARY KEY,
                pii_value TEXT UNIQUE
            )
        ''')
        self.conn.commit()

    def get_or_create_token(self, pii_value: str) -> str:
        if not pii_value:
            return ""
        # Reuse the existing token so the same value always maps to one token
        self.cursor.execute(
            'SELECT token FROM token_vault WHERE pii_value = ?',
            (pii_value,)
        )
        result = self.cursor.fetchone()
        if result:
            return result[0]
        # Random UUID: the token carries no information about the value
        new_token = str(uuid.uuid4())
        self.cursor.execute(
            'INSERT INTO token_vault (token, pii_value) VALUES (?, ?)',
            (new_token, pii_value)
        )
        self.conn.commit()
        return new_token

    def resolve_token(self, token: str) -> Optional[str]:
        # Returns None when the token is unknown or its mapping was deleted
        self.cursor.execute(
            'SELECT pii_value FROM token_vault WHERE token = ?',
            (token,)
        )
        result = self.cursor.fetchone()
        return result[0] if result else None
Legal obligations translated into engineering tasks.
Default to opaque: internal data pipelines run on pseudonymised identifiers. Raw PII is accessible only at the absolute edge (the service sending the email, the service generating the PDF for the customer). Bake RBAC, network isolation, and encryption configurations into Terraform or CloudFormation so new infrastructure is compliant by default. Integrate PII detection into CI/CD pipelines to prevent developers from accidentally logging sensitive fields.
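One way to approach the CI/CD check is a name-based scan over source files that flags logging calls mentioning sensitive field names. This is a heuristic sketch under stated assumptions: the field list, the regular expression, and the function `find_pii_logging` are illustrative, and a check like this will produce both false positives and false negatives.

```python
import re

# Field names treated as sensitive (assumption: align with your PII inventory).
SENSITIVE_NAMES = ("email", "phone", "ssn", "passport", "date_of_birth")

# Matches calls such as logger.info(...) or logging.debug(...) on a single line.
LOG_CALL = re.compile(
    r"\b(?:logger|logging)\.(?:debug|info|warning|error|exception|critical)\((.*)\)"
)


def find_pii_logging(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for logging calls that mention a sensitive name."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = LOG_CALL.search(line)
        if match and any(name in match.group(1).lower() for name in SENSITIVE_NAMES):
            findings.append((line_number, line.strip()))
    return findings


snippet = (
    'logger.info("order %s created", order_id)\n'
    'logger.debug("signup for %s", user.email)\n'
)
findings = find_pii_logging(snippet)
```

In a pipeline, a non-empty result would fail the build and point the developer at the offending line before the log statement reaches production.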
The technical challenge: finding and deleting a single user's records across petabytes of append-only data lakes and immutable logs is expensive and slow. Two approaches:
Use Tokenisation or Keyed Encryption. When a user requests deletion, delete their per-user encryption key or their token mapping from the central vault. The data in the lake remains, but the identifiers in it can no longer be decrypted or resolved back to the individual. The data is effectively anonymised without a physical deletion operation. This approach scales. The physical deletion approach does not.
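The vault-deletion approach can be sketched with an in-memory SQLite table standing in for the isolated token vault. The function names and the `lake_rows` structure are assumptions for illustration; a production vault would be a separate, access-controlled service.

```python
import sqlite3
import uuid

# In-memory stand-in for the isolated token vault.
vault = sqlite3.connect(":memory:")
vault.execute(
    "CREATE TABLE token_vault (token TEXT PRIMARY KEY, pii_value TEXT UNIQUE)"
)


def tokenise(pii_value: str) -> str:
    token = str(uuid.uuid4())
    vault.execute(
        "INSERT INTO token_vault (token, pii_value) VALUES (?, ?)",
        (token, pii_value),
    )
    vault.commit()
    return token


def resolve(token: str):
    row = vault.execute(
        "SELECT pii_value FROM token_vault WHERE token = ?", (token,)
    ).fetchone()
    return row[0] if row else None


def erase_subject(pii_value: str) -> int:
    """Delete the mapping for one data subject; returns the number of rows removed."""
    cursor = vault.execute(
        "DELETE FROM token_vault WHERE pii_value = ?", (pii_value,)
    )
    vault.commit()
    return cursor.rowcount


token = tokenise("user@example.com")
lake_rows = [{"user_token": token, "event": "purchase"}]  # append-only lake data

before = resolve(token)                     # the original value is recoverable
deleted = erase_subject("user@example.com")
after = resolve(token)                      # the mapping is gone
```

The lake rows are untouched, but the token in them no longer leads anywhere. Vault backups need the same deletion applied, otherwise the mapping can be restored from them.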
Maintain a deleted_users table. ETL processes filter out records belonging to users in that table before writing to the analytics layer. A batch job runs on a schedule (every 30 days) to physically purge deleted user data from raw storage. This approach requires the batch job to be maintained and tested. If the batch job fails silently, deletion compliance fails silently.
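The filter step, plus a check that guards against the silent-failure case, could be sketched as follows. The column name `hashed_user_id` and both helper names are assumptions for illustration.

```python
def filter_deleted(records: list[dict], deleted_user_ids: set[str]) -> list[dict]:
    """Drop records belonging to deleted users before writing to the analytics layer."""
    return [r for r in records if r["hashed_user_id"] not in deleted_user_ids]


def unpurged_ids(raw_records: list[dict], deleted_user_ids: set[str]) -> set[str]:
    """Return deleted users who still have rows in raw storage after a purge run."""
    return {r["hashed_user_id"] for r in raw_records} & deleted_user_ids


deleted_users = {"u2"}
raw = [
    {"hashed_user_id": "u1", "event": "login"},
    {"hashed_user_id": "u2", "event": "purchase"},
]

analytics_rows = filter_deleted(raw, deleted_users)

# Run after the scheduled purge: a non-empty result means the purge did not complete.
leftover = unpurged_ids(raw, deleted_users)
```

Alerting on a non-empty `leftover` set turns a silent purge failure into a visible one.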
Data subjects can request a copy of their personal data. The technical challenge: gathering scattered data without requiring a data engineer to write manual queries for every request. Implementation: maintain a centralised data map. Build an automated DSAR (Data Subject Access Request) pipeline triggered via API. When triggered with a user identifier, the pipeline queries all known PII locations, aggregates results into a standard format, and delivers to the customer service portal. The data map in Section 1 is the configuration file for this pipeline. Without it, the pipeline cannot be built.
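A minimal sketch of a data-map-driven DSAR export is shown below, using an in-memory SQLite database as a stand-in for the real sources. The `DATA_MAP` structure, the table and column names, and `build_dsar_export` are all hypothetical.

```python
import sqlite3

# Hypothetical data map: every known PII location, maintained alongside the inventory.
DATA_MAP = [
    {"table": "users", "id_column": "user_id", "columns": ["email", "phone"]},
    {"table": "orders", "id_column": "user_id", "columns": ["shipping_address"]},
]


def build_dsar_export(conn: sqlite3.Connection, user_id: str) -> dict:
    """Query every mapped PII location for one user and aggregate the results."""
    export = {}
    for entry in DATA_MAP:
        columns = ", ".join(entry["columns"])
        # Table and column names come from trusted configuration, never from the
        # request; only the user identifier is passed as a bound parameter.
        query = f'SELECT {columns} FROM {entry["table"]} WHERE {entry["id_column"]} = ?'
        rows = conn.execute(query, (user_id,)).fetchall()
        export[entry["table"]] = [dict(zip(entry["columns"], row)) for row in rows]
    return export


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, email TEXT, phone TEXT)")
conn.execute("CREATE TABLE orders (user_id TEXT, shipping_address TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'user@example.com', '+971501234567')")
conn.execute("INSERT INTO orders VALUES ('u1', '1 Example Street')")
conn.execute("INSERT INTO orders VALUES ('u2', '2 Other Street')")

export = build_dsar_export(conn, "u1")
```

Adding a new PII location then means adding one entry to the data map rather than changing pipeline code.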
Analysts need data. Privacy requirements restrict access to raw PII. The analytics-safe layer resolves this without preventing analytical work: it is the architecture that means analysts never need access to the raw layer to do their job.
Raw data as ingested. No transformations. Access: ETL service accounts and data engineering team leads only. No analysts. No BI tools. This layer exists for replay and reprocessing, not for consumption.
Pseudonymisation applied. user_id replaced with hashed_user_id. User-agent strings parsed into device categories and the raw string dropped. IP addresses truncated to subnet (192.168.1.100 becomes 192.168.1.0/24) to retain regional data without exact location. Sensitive columns removed or encrypted. This is the transformation layer where privacy controls are applied, not later.
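Two of these transformations can be sketched with the standard library. The function names are assumptions, the IP helper assumes IPv4 input with a /24 prefix, and the user-agent rules are deliberately crude; a maintained parser would be more accurate.

```python
import ipaddress


def truncate_ip(ip: str, prefix_length: int = 24) -> str:
    """Reduce an IPv4 address to its containing subnet, dropping the host bits."""
    # strict=False lets ip_network accept an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
    return str(network)


def device_category(user_agent: str) -> str:
    """Collapse a raw user-agent string into a coarse device category."""
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


raw_event = {
    "ip": "192.168.1.100",
    "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) Mobile/15E148",
}

# Only the derived fields are written onward; the raw values are dropped.
clean_event = {
    "ip_subnet": truncate_ip(raw_event["ip"]),
    "device": device_category(raw_event["user_agent"]),
}
```

Writing only the derived fields keeps the raw IP address and user-agent string out of every layer above the raw one.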
Aggregated or further anonymised. No direct identifiers. Row-level security applied for any remaining quasi-identifiers. This is the layer BI tools connect to. If an analyst needs a capability that requires raw PII, the answer is a controlled process for that specific use case, not broad access to the raw layer.
-- Create a masking policy for low-sensitivity fields
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_ENGINEERING', 'COMPLIANCE_TEAM')
            THEN val
        ELSE '****@****.***'
    END;

-- Apply to a column
ALTER TABLE users MODIFY COLUMN email
    SET MASKING POLICY email_mask;
Analysts querying the users table see masked email values. Data engineering and compliance roles see the actual values. The table is the same. The access control is in the masking policy, not in a separate table or view. No duplication of data. No synchronisation risk.
Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and Gulf region. Privacy engineering for GCC regulatory requirements (UAE PDPL, KSA PDPL, Bahrain PDPL) is covered in the companion GCC Data Compliance reference at data.nauman.cc. Diagnostic engagements: www.mindflex.tech.