Engineering reference

Privacy-First Data Engineering

PII handling, GDPR compliance, and anonymisation for data teams: risk audit, pseudonymisation toolkit, right to erasure implementation, and the analytics-safe data layer pattern.

Author: Nauman Shahid

Role: Principal Data Engineer

Type: Engineering reference

Privacy engineering is not a compliance exercise bolted onto a data pipeline after it ships. It is an architectural constraint that changes the design of the pipeline from the ingestion point forward. The organisations that treat it as an afterthought pay for it twice: once when they implement controls after the fact against a schema that was not designed for them, and once when the audit or the breach makes the cost of the original decision concrete.

This reference covers the five categories of personal data in data engineering contexts, the technical controls that apply at each stage of a pipeline, Python implementations of the three most useful pseudonymisation techniques, the GDPR technical requirements translated into engineering tasks rather than legal obligations, and the analytics-safe data layer pattern that gives analysts working access to data without requiring access to raw PII.

The PII Risk Audit

Before implementing controls, you need to know what you have and where it is. Schema names alone are not sufficient: PII hides in JSON blobs, in poorly named columns, in logs that were never meant to be permanent, and in audit tables that nobody thought to exclude from the analytics pipeline.

Five categories of personal data

Direct identifiers: Full name, email address, phone number, national ID, passport number. Individually sufficient to identify a specific person.
Quasi-identifiers: Date of birth, postal code, gender, job title. Individually insufficient but combinable with other fields to achieve identification. The risk is combinatorial and frequently underestimated.
Digital identifiers: IP address, MAC address, device ID (IDFA/AAID), cookie IDs, browser fingerprints. Considered personal data under GDPR. Commonly present in event logs and treated as non-sensitive.
Behavioural and observational data: Location history, purchase history, browsing logs, search queries. Often stored at granularity that enables individual identification even when direct identifiers have been removed.
Sensitive PII (Special Category): Health and medical records, biometric data, financial and credit data, political opinions, religious beliefs. Subject to highest levels of protection. Requires explicit lawful basis for processing. Must never enter an analytics layer without explicit review.

Risk classification

Not all PII requires the same engineering overhead. Classify it into two tiers and apply controls proportionately:

High-sensitivity PII: Health data, financial data, national IDs, biometric data. Treatment: strict access controls, encryption at rest and in transit, pseudonymisation immediately at ingestion, restricted retention periods.
Low-sensitivity PII: Names, email addresses, IP addresses, basic demographics. Treatment: standard access controls, pseudonymisation for analytics layer, standard retention policies.

Producing a data inventory

Four steps, executed in order:

Schema scanning: Query information_schema.columns for column names matching known PII patterns: email, phone, ip_address, ssn, national_id, passport, dob, date_of_birth.
Data profiling: Run regex patterns against a sample of table data to find PII in poorly named columns or inside JSON blobs. Schema names are aspirational; the data tells the truth.
Flow documentation: Map data from Source to Ingestion to Storage to Transformation to BI. Document the specific PII fields at each stage. The transfer points are where controls are applied and where the most common failures occur.
Classification tagging: Tag tables and columns in the data catalogue with pii_sensitivity: high or pii_sensitivity: low. dbt meta tags, Amundsen, and DataHub all support this. Without tagging, the inventory is a document that becomes stale. With tagging, the catalogue is the inventory.
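
The first two steps can be sketched in Python. This is a minimal illustration, not a complete scanner: the name patterns mirror the list above, the value regexes (email, IPv4) are deliberately simple assumptions, and flag_columns_by_name and profile_values are hypothetical helper names. In practice the column list would come from information_schema.columns and the samples from the tables themselves.

```python
import re

# Column-name fragments from the schema-scanning step above.
PII_NAME_PATTERN = re.compile(
    r"(email|phone|ip_address|ssn|national_id|passport|dob|date_of_birth)",
    re.IGNORECASE,
)

# Deliberately simple value patterns (assumptions; tune them for your data).
PII_VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def flag_columns_by_name(columns):
    """Return the column names that match a known PII name pattern."""
    return [c for c in columns if PII_NAME_PATTERN.search(c)]


def profile_values(samples):
    """Map each column to the set of PII types found in its sampled values.

    `samples` is {column_name: [value, ...]}; values are stringified so
    JSON blobs stored as text are scanned too.
    """
    findings = {}
    for column, values in samples.items():
        for value in values:
            for pii_type, pattern in PII_VALUE_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return findings


columns = ["id", "contact_email", "payload", "created_at"]
samples = {
    "payload": ['{"note": "reach me at jane@example.com"}'],
    "created_at": ["2024-01-01"],
}
name_hits = flag_columns_by_name(columns)
value_hits = profile_values(samples)
```

Here the name scan catches contact_email, while only the value scan catches the email address buried in the payload JSON, which is why both steps are needed.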

Technical Controls

Data Minimisation

Collect and process only what is necessary for the specific purpose. Implementation: drop PII columns during the extraction phase if they are not needed downstream. Calculate age or age group from date of birth at the edge; do not ingest the date of birth into the analytics pipeline.
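
A minimal sketch of deriving the age group at the edge, assuming the extraction step sees each record as a dictionary. The age bands, the allowed field set, and the function names are illustrative assumptions, not a prescribed scheme.

```python
from datetime import date


def age_group(date_of_birth: date, today: date) -> str:
    """Derive a coarse age band; the exact date of birth is not kept."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - date_of_birth.year - (
        (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    )
    if age < 18:
        return "under_18"
    if age < 35:
        return "18_34"
    if age < 55:
        return "35_54"
    return "55_plus"


def minimise(record: dict, today: date) -> dict:
    """Keep only the fields needed downstream and replace DOB with a band."""
    allowed = {"user_id", "country"}  # hypothetical downstream requirement
    slim = {k: v for k, v in record.items() if k in allowed}
    slim["age_group"] = age_group(record["date_of_birth"], today)
    return slim


raw = {
    "user_id": "u-1",
    "email": "user@example.com",
    "country": "AE",
    "date_of_birth": date(1990, 6, 15),
}
slim = minimise(raw, today=date(2024, 6, 14))
```

The email and the date of birth never leave the extraction step; only the band travels downstream.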

Pseudonymisation

Replace direct identifiers with artificial identifiers. The data cannot be attributed to a specific individual without additional information that is kept separately and under restricted access. Implementation: hash user IDs, emails, and IP addresses at ingestion. Store the mapping table in a secure, isolated database. The analytics warehouse never sees the raw identifier.

Anonymisation

Irreversible alteration so that individuals can no longer be identified directly or indirectly. Once genuinely anonymised, data is no longer subject to GDPR. Techniques include keyed hashing where the salt or key is destroyed afterwards, k-anonymity (generalising data so each record is indistinguishable from at least k-1 others), and adding calibrated statistical noise (differential privacy). None of these is automatic: quasi-identifiers left alongside a hashed column can still single a person out, so the dataset as a whole has to be assessed, not just the identifier column. Note: pseudonymisation is not anonymisation. Pseudonymised data can be re-identified given access to the mapping table. Anonymised data, correctly implemented, cannot.
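
To make the k-anonymity definition concrete, the sketch below measures k for a small dataset: the size of the smallest group of rows sharing the same quasi-identifier values. The column names and rows are invented for illustration, and the function assumes a non-empty input.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


rows = [
    {"age_band": "30-39", "postcode_area": "SW1", "diagnosis": "A"},
    {"age_band": "30-39", "postcode_area": "SW1", "diagnosis": "B"},
    {"age_band": "40-49", "postcode_area": "SW1", "diagnosis": "A"},
]
k = k_anonymity(rows, ["age_band", "postcode_area"])

# Generalising the age band merges the groups and raises k.
generalised = [dict(row, age_band="30-49") for row in rows]
k_generalised = k_anonymity(generalised, ["age_band", "postcode_area"])
```

With the original bands the third row is unique on its quasi-identifiers, so k is 1. Widening the band puts all three rows in one group, at the cost of analytical precision.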

Encryption

In transit: enforce TLS 1.2 or higher for all database connections and API endpoints. At rest: enable transparent data encryption (TDE) on databases, with keys held in a cloud key management service (AWS KMS, GCP Cloud KMS). Use column-level encryption for highly sensitive fields. Key management is as important as encryption: keys stored in the same system as the data they encrypt provide little protection, because whoever compromises that system obtains both.
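
On the client side, the TLS floor can be enforced in code as well as in server configuration. A minimal sketch using Python's standard ssl module; strict_tls_context is a hypothetical helper name, and how the context is passed to a given database driver or HTTP client depends on that library.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    context = ssl.create_default_context()  # certificate and hostname checks on by default
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
```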

Access Controls

Role-Based Access Control (RBAC) at the database, warehouse, and BI tool levels. Row-level security and dynamic data masking in modern warehouses (Snowflake, BigQuery) allow analysts to query tables without seeing PII fields. Developers must not have read access to production PII in lower environments. Separate credentials per service account, per purpose.

Data Retention and Deletion

Set Time-to-Live policies on object storage. Partition data warehouses by date and drop old partitions on schedule. Implement soft-delete mechanisms followed by hard-delete batch processes. Retention is only meaningful if it is enforced automatically: manual deletion processes will be missed.
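
The partition-dropping rule reduces to a date comparison that a scheduler can run daily. A sketch, assuming partitions are keyed by date and that a 90-day retention period applies; both are illustrative assumptions, and the actual DROP statement depends on the warehouse.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, retention_days, today):
    """Return partitions older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partition_dates if p < cutoff)


partitions = [date(2024, 1, 1), date(2024, 3, 1), date(2024, 5, 30)]
to_drop = expired_partitions(partitions, retention_days=90, today=date(2024, 6, 1))
```

Passing today in as a parameter keeps the function deterministic and easy to test, which matters for a job whose failures are otherwise invisible.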

The Pseudonymisation Toolkit

Three techniques, each suited to different use cases. Selecting the wrong technique creates either a security gap or an operational problem at scale.

Technique 1: Consistent Hashing

Use case: joining datasets across different systems without exposing the raw identifier. The same input and key always produce the same output, enabling cross-system joins. Trade-off: susceptible to dictionary attacks if the key leaks or if an unkeyed hash is used, because the input space is small or predictable (email addresses and phone numbers can be enumerated). Always use a strong secret key (called the salt in the code below) and keep it out of the warehouse.

import hashlib
import hmac

def consistent_hash_pii(pii_value: str, secret_salt: bytes) -> str:
    """
    Consistently hashes PII using HMAC-SHA256.
    secret_salt must be stored in a secure vault (e.g., AWS Secrets Manager)
    and kept consistent across all systems that need to join on this field.
    """
    if not pii_value:
        return ""

    hashed = hmac.new(
        key=secret_salt,
        msg=pii_value.encode('utf-8'),
        digestmod=hashlib.sha256
    ).hexdigest()

    return hashed

# In production: load the salt from a secure vault, never hardcode
SHARED_SECRET_SALT = b"load_from_secrets_manager"

email = "user@example.com"
pseudonym = consistent_hash_pii(email, SHARED_SECRET_SALT)
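
One practical caveat: the hash is only consistent if the input is. Two systems that store the same email with different casing or stray whitespace will produce different pseudonyms and the join will silently miss. A sketch of normalising before hashing, assuming that trimming and lowercasing is acceptable for your data (strictly, the local part of an email address can be case-sensitive); normalise_email and hash_email are hypothetical names.

```python
import hashlib
import hmac


def normalise_email(email: str) -> str:
    """Trim whitespace and lowercase so equivalent inputs hash identically."""
    return email.strip().lower()


def hash_email(email: str, secret_key: bytes) -> str:
    """HMAC-SHA256 of the normalised email, hex encoded."""
    return hmac.new(
        secret_key, normalise_email(email).encode("utf-8"), hashlib.sha256
    ).hexdigest()


key = b"placeholder-load-from-vault"  # placeholder only; never hardcode in production
a = hash_email("User@Example.com ", key)
b = hash_email("user@example.com", key)
```

The normalisation rules have to be agreed across every system that joins on the field, in the same way as the key.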

Technique 2: Keyed Encryption

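Use case: when re-identification is required for legitimate purposes (customer support, compliance audits, right of access requests). No mapping table is needed: anyone holding the key can decrypt. Key rotation is supported, but rotating means re-encrypting every stored value under the new key. Trade-off: requires key management infrastructure and is computationally heavier than hashing. Fernet, used below, is also non-deterministic: encrypting the same value twice yields different ciphertexts, so the encrypted column cannot be used as a join key.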

from cryptography.fernet import Fernet

class PIIEncryptor:
    def __init__(self, key: bytes):
        self.cipher_suite = Fernet(key)

    def encrypt_pii(self, pii_value: str) -> str:
        if not pii_value:
            return ""
        return self.cipher_suite.encrypt(pii_value.encode('utf-8')).decode('utf-8')

    def decrypt_pii(self, encrypted_value: str) -> str:
        if not encrypted_value:
            return ""
        return self.cipher_suite.decrypt(encrypted_value.encode('utf-8')).decode('utf-8')

# Demo only: a fresh key is generated on every run. In production, generate the
# key once with Fernet.generate_key() and load it from a secure vault, never from code.
MASTER_KEY = Fernet.generate_key()
encryptor = PIIEncryptor(MASTER_KEY)

encrypted_phone = encryptor.encrypt_pii("+971501234567")
original_phone = encryptor.decrypt_pii(encrypted_phone)

Technique 3: Tokenisation

Use case: completely decoupling the identifier from its underlying value. The token in the analytics pipeline carries no information about the original identifier. Reversible only by querying the isolated token vault. Trade-off: requires maintaining state in a separate database. Can become a bottleneck in high-throughput pipelines: benchmark before deploying at scale.

import sqlite3
import uuid
from typing import Optional

class Tokeniser:
    def __init__(self, db_path='token_vault.db'):
        # The vault must live in an isolated database, separate from the warehouse
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._setup_table()

    def _setup_table(self):
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS token_vault (
                token TEXT PRIMARY KEY,
                pii_value TEXT UNIQUE
            )
        ''')
        self.conn.commit()

    def get_or_create_token(self, pii_value: str) -> str:
        if not pii_value:
            return ""
        # Reuse the existing token so the same value always maps to the same token
        self.cursor.execute(
            'SELECT token FROM token_vault WHERE pii_value = ?',
            (pii_value,)
        )
        result = self.cursor.fetchone()
        if result:
            return result[0]
        # Random UUID: carries no information about the original value
        new_token = str(uuid.uuid4())
        self.cursor.execute(
            'INSERT INTO token_vault (token, pii_value) VALUES (?, ?)',
            (new_token, pii_value)
        )
        self.conn.commit()
        return new_token

    def resolve_token(self, token: str) -> Optional[str]:
        """Returns the original value, or None if the token is unknown."""
        self.cursor.execute(
            'SELECT pii_value FROM token_vault WHERE token = ?',
            (token,)
        )
        result = self.cursor.fetchone()
        return result[0] if result else None

GDPR Technical Requirements

Legal obligations translated into engineering tasks.

Privacy by design

Default to opaque: internal data pipelines run on pseudonymised identifiers. Raw PII is accessible only at the absolute edge (the service sending the email, the service generating the PDF for the customer). Bake RBAC, network isolation, and encryption configurations into Terraform or CloudFormation so new infrastructure is compliant by default. Integrate PII detection into CI/CD pipelines to prevent developers from accidentally logging sensitive fields.
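
CI checks catch log statements that name sensitive fields, but values can still slip through at runtime. As a complementary safety net, a logging filter can redact anything that looks like an email address before it reaches a handler. This is a sketch, not a complete detector: the regex is a simple assumption, RedactEmailFilter and ListHandler are hypothetical names, and ListHandler exists only so the effect can be inspected.

```python
import logging
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmailFilter(logging.Filter):
    """Replace anything that looks like an email address in log messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with args already merged in
        record.msg = EMAIL_PATTERN.sub("[REDACTED_EMAIL]", message)
        record.args = ()
        return True  # keep the record; only its content changes


captured = []


class ListHandler(logging.Handler):
    """Collects messages so the redaction can be checked."""

    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record.getMessage())


logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addFilter(RedactEmailFilter())
logger.addHandler(ListHandler())
logger.info("Sending receipt to %s", "user@example.com")
```

A filter attached to a logger runs before that logger's handlers, so the raw address never reaches the log sink.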

Right to Erasure

The technical challenge: finding and deleting a single user's records across petabytes of append-only data lakes and immutable logs is expensive and slow. Two approaches:

Crypto-shredding (recommended)

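Use Tokenisation or Keyed Encryption with a separate key per data subject. When a user requests deletion, delete their encryption key or their token mapping from the central vault. The data in the lake remains but can no longer be decrypted or linked back to the person. A single shared key, as in the Technique 2 example, cannot be shredded for one user without affecting everyone. The data is effectively anonymised without a physical deletion operation, provided the key or mapping is also removed from backups and no other identifiers remain in the records. This approach scales. The physical deletion approach does not.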
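
The token-mapping variant can be sketched with the standard library alone. An in-memory SQLite database stands in for the isolated vault, and erase_subject is a hypothetical name for the erasure entry point; a production vault would also need access control, auditing, and backup handling.

```python
import sqlite3
import uuid

vault = sqlite3.connect(":memory:")  # stands in for the isolated token vault
vault.execute(
    "CREATE TABLE token_vault (token TEXT PRIMARY KEY, pii_value TEXT UNIQUE)"
)


def tokenise(pii_value: str) -> str:
    """Return the existing token for a value or mint a new random one."""
    row = vault.execute(
        "SELECT token FROM token_vault WHERE pii_value = ?", (pii_value,)
    ).fetchone()
    if row:
        return row[0]
    token = str(uuid.uuid4())
    vault.execute("INSERT INTO token_vault VALUES (?, ?)", (token, pii_value))
    vault.commit()
    return token


def resolve(token: str):
    """Return the original value, or None if the mapping no longer exists."""
    row = vault.execute(
        "SELECT pii_value FROM token_vault WHERE token = ?", (token,)
    ).fetchone()
    return row[0] if row else None


def erase_subject(pii_value: str) -> int:
    """Delete the mapping; returns the number of vault rows removed."""
    cursor = vault.execute(
        "DELETE FROM token_vault WHERE pii_value = ?", (pii_value,)
    )
    vault.commit()
    return cursor.rowcount


token = tokenise("user@example.com")
lake_rows = [{"user_token": token, "event": "login"}]  # append-only, never rewritten
resolved_before = resolve(token)
removed = erase_subject("user@example.com")
resolved_after = resolve(token)
```

The lake row still holds the token after erasure, but nothing can map it back to the email address. One delete in the vault replaces a rewrite of every file that mentions the user.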

Tombstoning

Maintain a deleted_users table. ETL processes filter out records belonging to users in that table before writing to the analytics layer. A batch job runs on a schedule (for example, every 30 days) to physically purge deleted user data from raw storage. This approach requires the batch job to be maintained, tested, and monitored. If the batch job fails silently, deletion compliance fails silently.
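
Both halves of tombstoning fit in a few lines: the filter applied during ETL, and a freshness check that turns a silent purge failure into an alert. A sketch with hypothetical function names, assuming records carry a user_id field and that the purge job records the date of its last success.

```python
from datetime import date, timedelta


def filter_tombstoned(records, deleted_user_ids):
    """Drop records whose user_id appears in the deleted_users set."""
    deleted = set(deleted_user_ids)
    return [r for r in records if r["user_id"] not in deleted]


def purge_is_overdue(last_successful_purge: date, today: date, max_gap_days: int = 30) -> bool:
    """True when the purge job has not succeeded within the expected interval."""
    return today - last_successful_purge > timedelta(days=max_gap_days)


records = [
    {"user_id": "u1", "event": "login"},
    {"user_id": "u2", "event": "purchase"},
]
clean = filter_tombstoned(records, deleted_user_ids=["u2"])
overdue = purge_is_overdue(date(2024, 1, 1), today=date(2024, 2, 15))
```

Wiring purge_is_overdue into existing monitoring is what keeps a broken purge job from going unnoticed.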

Right of Access

Data subjects can request a copy of their personal data. The technical challenge: gathering scattered data without requiring a data engineer to write manual queries for every request. Implementation: maintain a centralised data map. Build an automated DSAR (Data Subject Access Request) pipeline triggered via API. When triggered with a user identifier, the pipeline queries all known PII locations, aggregates results into a standard format, and delivers them to the customer service portal. The data inventory produced in the PII Risk Audit is the configuration file for this pipeline. Without it, the pipeline cannot be built.
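
A sketch of the core of such a pipeline, with the data map expressed as a plain Python structure. The table names, columns, and the collect_subject_data function are illustrative assumptions, and an in-memory SQLite database stands in for the real stores; the API trigger and portal delivery are out of scope here.

```python
import json
import sqlite3

# Hypothetical data map: where PII lives and which column identifies the subject.
DATA_MAP = [
    {"table": "users", "key": "user_id", "columns": ["email", "phone"]},
    {"table": "orders", "key": "user_id", "columns": ["shipping_address"]},
]


def collect_subject_data(conn, user_id):
    """Query every location in the data map and return one combined report."""
    report = {"user_id": user_id, "records": {}}
    for entry in DATA_MAP:
        columns = ", ".join(entry["columns"])
        # Table and column names come from the trusted data map, never from user input.
        query = f"SELECT {columns} FROM {entry['table']} WHERE {entry['key']} = ?"
        rows = conn.execute(query, (user_id,)).fetchall()
        report["records"][entry["table"]] = [
            dict(zip(entry["columns"], row)) for row in rows
        ]
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, email TEXT, phone TEXT)")
conn.execute("CREATE TABLE orders (user_id TEXT, shipping_address TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'user@example.com', '+971501234567')")
conn.execute("INSERT INTO orders VALUES ('u1', 'Dubai Marina')")
conn.execute("INSERT INTO orders VALUES ('u2', 'Abu Dhabi')")

report = collect_subject_data(conn, "u1")
report_json = json.dumps(report, indent=2)
```

Adding a new PII location then means adding one entry to the data map, not writing a new query.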

The Analytics-Safe Data Layer

Analysts need data. Privacy requirements restrict access to raw PII. The analytics-safe layer resolves this without preventing analytical work: it is the architecture that means analysts never need access to the raw layer to do their job.

Three-layer pattern

Raw Layer (Bronze): Highly Restricted

Raw data as ingested. No transformations. Access: ETL service accounts and data engineering team leads only. No analysts. No BI tools. This layer exists for replay and reprocessing, not for consumption.

Cleansing Layer (Silver): ETL Service Accounts Only

Pseudonymisation applied. user_id replaced with hashed_user_id. User-agent strings parsed into device categories and the raw string dropped. IP addresses truncated to subnet (192.168.1.100 becomes 192.168.1.0/24) to retain regional data without exact location. Sensitive columns removed or encrypted. This is the transformation layer where privacy controls are applied, not later.
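
Two of these transformations can be sketched with the standard library. The /24 prefix for IPv4 matches the example above; the /48 prefix for IPv6 and the keyword-based device bucketing are assumptions for illustration, and a dedicated user-agent parser would be more accurate in practice.

```python
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: /24 for IPv4, /48 for IPv6 (the IPv6 prefix is an assumption)."""
    address = ipaddress.ip_address(ip)
    prefix = 24 if address.version == 4 else 48
    # strict=False allows host bits in the input and masks them off.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def device_category(user_agent: str) -> str:
    """Very rough keyword-based bucketing of a user-agent string."""
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


truncated = truncate_ip("192.168.1.100")
category = device_category("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)")
```

Only the truncated network and the device category are written to the Silver layer; the raw IP address and user-agent string are dropped.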

Analytics Layer (Gold): Analyst Access

Aggregated or further anonymised. No direct identifiers. Row-level security applied for any remaining quasi-identifiers. This is the layer BI tools connect to. If an analyst needs a capability that requires raw PII, the answer is a controlled process for that specific use case, not broad access to the raw layer.

Dynamic data masking example (Snowflake)

-- Create a masking policy for low-sensitivity fields
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_ENGINEERING', 'COMPLIANCE_TEAM')
            THEN val
        ELSE '****@****.***'
    END;

-- Apply to a column
ALTER TABLE users MODIFY COLUMN email
    SET MASKING POLICY email_mask;

Analysts querying the users table see masked email values. Data engineering and compliance roles see the actual values. The table is the same. The access control is in the masking policy, not in a separate table or view. No duplication of data. No synchronisation risk.

Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and Gulf region. Privacy engineering for GCC regulatory requirements (UAE PDPL, KSA PDPL, Bahrain PDPL) is covered in the companion GCC Data Compliance reference at data.nauman.cc. Diagnostic engagements: www.mindflex.tech.
