Infrastructure reference
The Data Engineering Cloud Cost Playbook
Where the money actually leaks, and how to stop it: a 20-point audit, AWS and GCP high-impact cuts, Databricks cost control, and a monitoring setup that catches anomalies before they become invoices.
When a data infrastructure bill gets large enough that someone notices, engineers look for the obvious candidates: forgotten S3 buckets, idle EC2 instances, snapshots from a project that ended two years ago. Those are worth cleaning up. They are also not where the money is going. The real cost drivers in modern data engineering are invisible: a JOIN running hourly without partition filters scanning petabytes it has no business touching, an always-on cluster provisioned for peak load that runs at five percent utilisation for twenty-three hours a day, data transfer fees that appear nowhere in the architecture diagram.
This is a systematic guide to finding the waste before it compounds. The 20-point audit works across cloud providers. The AWS, GCP, and Databricks sections cover the highest-impact cuts specific to each platform. The monitoring section covers the alerts that would have prevented the last three bill surprises.
Where the Money Goes
Five categories, ranked by waste potential. The order matters: if you start with storage, you are optimising the least important line item first.
1. Compute Credits (40–60% of waste)
Snowflake, BigQuery, and Databricks charge by compute time or bytes scanned. A single badly written JOIN running hourly without partition filters can cost thousands per month. One query. Hourly. The arithmetic is straightforward. This is the biggest leak in almost every data engineering budget.
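That arithmetic can be sketched in a few lines. The scan sizes below are hypothetical, and the $5 per terabyte rate is the on-demand figure quoted in the GCP section of this playbook; substitute the current rate for your own warehouse.

```python
def monthly_scan_cost(tb_scanned_per_run: float,
                      runs_per_day: int,
                      price_per_tb: float,
                      days_per_month: int = 30) -> float:
    """Return the monthly cost of a scheduled query billed per terabyte scanned."""
    return tb_scanned_per_run * runs_per_day * days_per_month * price_per_tb


# Hypothetical hourly JOIN scanning 0.5 TB per run at an assumed $5 per TB.
unfiltered = monthly_scan_cost(0.5, runs_per_day=24, price_per_tb=5.0)

# The same query after a partition filter cuts the scan to an assumed 0.01 TB.
filtered = monthly_scan_cost(0.01, runs_per_day=24, price_per_tb=5.0)

print(f"unfiltered: ${unfiltered:,.2f}/month, filtered: ${filtered:,.2f}/month")
```

Under these assumed numbers the unfiltered query costs about $1,800 a month and the filtered one about $36. The gap, not the exact figures, is the point.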
2. Always-On Analytical Clusters (20–30% of waste)
Redshift, EMR, or Databricks clusters provisioned for peak load but running continuously. Peak load is not the baseline. The baseline is the median workload at 2am on a Tuesday. Provisioning for peak and running continuously is paying for peak at all times.
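A rough way to put a number on this, as a sketch only: the hourly rate and the busy hours below are hypothetical, and the model treats every non-busy hour as fully wasted, which is a simplification.

```python
def idle_spend(hourly_rate: float, hours_provisioned: float, hours_busy: float) -> float:
    """Cost of the hours a cluster was running but not doing useful work."""
    if hours_busy > hours_provisioned:
        raise ValueError("busy hours cannot exceed provisioned hours")
    return hourly_rate * (hours_provisioned - hours_busy)


HOURLY_RATE = 12.0           # hypothetical cluster price per hour
always_on_hours = 24 * 30    # running continuously for a 30-day month
busy_hours = 1 * 30          # assumed: one genuinely busy hour per day

wasted = idle_spend(HOURLY_RATE, always_on_hours, busy_hours)
share = wasted / (HOURLY_RATE * always_on_hours)

print(f"idle spend: ${wasted:,.2f} ({share:.0%} of the cluster bill)")
```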
3. Data Transfer and Networking (10–15% of waste)
The silent category. Moving terabytes daily across Availability Zones or through NAT Gateways produces a bill that compounds without appearing in any architecture review. Cross-AZ traffic costs money. Routing S3 traffic through a NAT Gateway when a free VPC Gateway Endpoint exists is a configuration error with a recurring monthly cost.
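The NAT Gateway case reduces to a per-gigabyte multiplication. In this sketch the daily volume is hypothetical and the $0.045 per gigabyte processing rate is an assumption based on us-east-1 list pricing; check the rate for your region. The NAT Gateway's hourly charge is left out.

```python
def nat_processing_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Monthly NAT Gateway data processing charge for a steady daily volume."""
    return gb_per_day * days * price_per_gb


NAT_PRICE_PER_GB = 0.045      # assumption: verify against current regional pricing
GATEWAY_ENDPOINT_PRICE = 0.0  # S3 Gateway Endpoints carry no data processing charge

# Hypothetical pipeline moving 2 TB (2,000 GB) of S3 traffic per day.
via_nat = nat_processing_cost(2000, NAT_PRICE_PER_GB)
via_endpoint = nat_processing_cost(2000, GATEWAY_ENDPOINT_PRICE)

print(f"via NAT: ${via_nat:,.2f}/month, via Gateway Endpoint: ${via_endpoint:,.2f}/month")
```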
4. Managed Pipeline Services (5–10% of waste)
Over-provisioned Glue DPUs set to the 10-node default for ETL scripts that need two. Excessive Fivetran sync frequencies on slowly changing dimension tables. Managed Airflow environments right-sized for a workload that existed eighteen months ago.
5. Storage (1–5% of waste)
S3 Standard is $23 per terabyte per month. Storage is rarely the problem. Set lifecycle policies, move cold data to appropriate tiers, and spend the rest of the audit time on compute. Only prioritise storage if it exceeds 30% of the total bill.
The 20-Point Audit
Work through this in order. The highest-impact items are at the top of each category.
Compute and Warehouse
- Identify the ten most expensive queries. Check INFORMATION_SCHEMA in the data warehouse. Find queries scanning the most bytes or consuming the most compute time. Action: Rewrite with partition filters or materialise the results. One query fix here typically exceeds weeks of storage optimisation in impact.
- Check warehouse provisioning vs utilisation. AWS: CloudWatch Redshift CPU metrics. Action: Average CPU under 20% means the instance class is oversized. Scale down.
- Audit development clusters. Any cluster with dev, test, or sandbox in the name that has been running for more than 24 hours is a candidate. Action: Terminate idle clusters. Set auto-suspend policies: 15 minutes of inactivity is sufficient for most development environments.
- Cap auto-scaling limits. Serverless warehouses without a hard concurrency or node limit expose the infrastructure to unbounded spend under a bad query loop. Action: Set explicit maximums. A single unguarded query in a serverless warehouse can generate a four-figure bill before anyone notices.
- Find orphaned snapshots. Run aws ec2 describe-snapshots --owner-ids self. Action: Delete manual snapshots older than 30 days.
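For the snapshot check, a minimal sketch of the age filter, assuming the JSON output of the describe-snapshots command has been loaded into Python. The snapshot IDs and dates are made up. The function only lists candidates; it deletes nothing.

```python
from datetime import datetime, timedelta, timezone


def stale_snapshots(snapshots, now, max_age_days=30):
    """Return IDs of snapshots whose StartTime is older than max_age_days.

    `snapshots` is the "Snapshots" list from
    `aws ec2 describe-snapshots --owner-ids self --output json`.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for snap in snapshots:
        # Normalise a trailing "Z" so datetime.fromisoformat accepts the
        # timestamp on Python versions older than 3.11.
        started = datetime.fromisoformat(snap["StartTime"].replace("Z", "+00:00"))
        if started < cutoff:
            stale.append(snap["SnapshotId"])
    return stale


# Hypothetical sample shaped like the CLI output.
sample = [
    {"SnapshotId": "snap-0aaa", "StartTime": "2024-01-05T08:00:00+00:00"},
    {"SnapshotId": "snap-0bbb", "StartTime": "2024-03-20T08:00:00.000Z"},
]
now = datetime(2024, 4, 1, tzinfo=timezone.utc)
candidates = stale_snapshots(sample, now)
print(candidates)
```

Review the candidate list before deleting anything: a snapshot that backs a registered AMI or a documented retention requirement is old but not orphaned.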
Pipelines and Data Processing
- Audit pipeline frequency. A slowly changing dimension table syncing every five minutes is generating compute and transfer costs for data that changes weekly. Action: Move batch pipelines from hourly to daily wherever low latency is a stated preference rather than a genuine business requirement.
- Switch to Graviton instances for batch jobs. AWS Graviton (ARM) processors deliver approximately 20% lower cost with better performance for compatible workloads. Action: Migrate Glue and EMR task nodes to Graviton where supported.
- Find duplicate pipelines. Two dbt models materialising the same base data is a codebase archaeology problem that compounds at scale. Action: Refactor into a single upstream model. This is a code review failure that becomes a cloud cost problem.
- Review streaming infrastructure. Kafka or Kinesis running for workloads that update hourly and are consumed in batch is a pattern that appears frequently after architectural decisions made under pressure. Action: Audit latency requirements against actual consumption patterns. If no consumer needs sub-minute latency, batch is the correct answer.
- Right-size orchestration environments. Managed Airflow environments are often over-provisioned relative to current DAG count and complexity. Action: Profile actual resource usage against current provisioning. Reduce if the gap is significant.
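The frequency item is mostly a question of run counts. This sketch assumes a flat, hypothetical cost per run; in practice a daily run processes more data than a five-minute one, so treat the result as an upper bound on the saving.

```python
def monthly_runs(interval_minutes: int, days: int = 30) -> int:
    """Number of scheduled runs in a month for a fixed interval."""
    return (24 * 60 // interval_minutes) * days


def monthly_pipeline_cost(interval_minutes: int, cost_per_run: float, days: int = 30) -> float:
    """Monthly cost assuming every run costs the same (a simplification)."""
    return monthly_runs(interval_minutes, days) * cost_per_run


COST_PER_RUN = 0.08  # hypothetical compute plus transfer cost of one sync

for label, interval in [("every 5 minutes", 5), ("hourly", 60), ("daily", 24 * 60)]:
    runs = monthly_runs(interval)
    cost = monthly_pipeline_cost(interval, COST_PER_RUN)
    print(f"{label:>16}: {runs:5d} runs, ${cost:,.2f}/month")
```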
Storage and Data Transfer
- Fix Cross-AZ data transfer. AWS Cost Explorer: DataTransfer-Regional-Bytes. Action: Keep workloads in a single AZ where possible. Use VPC Endpoints for AWS services to eliminate cross-AZ transfer costs on managed service traffic.
- Replace NAT Gateway with VPC Endpoints for S3. S3 Gateway Endpoints are free. NAT Gateway charges per gigabyte of traffic routed through it. Action: Creating the VPC Endpoint takes approximately five minutes. Not creating it is paying per gigabyte for traffic that has a free routing option.
- Set S3 Lifecycle policies. Data older than 90 days on S3 Standard that has not been accessed recently belongs on Glacier or Standard-IA. Action: Implement Intelligent-Tiering for data lakes. The monitoring fee is negligible compared to the resulting savings on large datasets.
- Set CloudWatch log retention. Application logs accumulating indefinitely in CloudWatch Logs are a common discovery in cost audits. Action: Set retention policy to 14–30 days. If older logs are needed for compliance, export them to S3 before they expire.
- Audit Fivetran column selection. Syncing columns that no downstream model ever references is a common result of initial connector setup done quickly. Action: Block unused columns and tables from syncing. Fivetran pricing is based on rows synced for most connectors.
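The lifecycle item can be expressed as a small configuration builder. This is a sketch: the prefix and bucket name are hypothetical, the 90-day threshold follows the audit item, and the 365-day Glacier step is an assumption. The dictionary follows the shape accepted by boto3's put_bucket_lifecycle_configuration, but nothing here calls AWS.

```python
def lifecycle_rule(prefix: str, ia_after_days: int = 90, glacier_after_days: int = 365) -> dict:
    """Build one S3 lifecycle rule that tiers objects down as they age."""
    return {
        "ID": f"tier-down-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},  # assumed threshold
        ],
    }


# Hypothetical data lake prefix.
config = {"Rules": [lifecycle_rule("raw/")]}
print(config)

# With credentials configured, the config would be applied roughly like this:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```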
Architecture and Miscellaneous
- Deprecate unused BI dashboards. Dashboards refreshing hourly that have not been viewed in 90 days are generating query costs for nobody. Action: Pull access logs. Delete dashboards that are not being used. Notify the stated owner before deletion.
- Check EBS volume utilisation. Volumes at less than 10% utilisation are paying for capacity that is not needed. Action: Resize gp3 volumes to match actual usage plus a reasonable headroom margin.
- Enforce resource tagging. Without consistent Environment, Project, and Owner tags, cost attribution by team or project is impossible and anomalies are invisible until they appear on the invoice. Action: Block resource creation without mandatory tags via Service Control Policies (SCPs) or equivalent.
- Check multi-region replications. Cross-region S3 replication creates ongoing transfer costs and double the storage bill for replicated data. Action: Disable unless the replication serves a specific, documented disaster recovery requirement with a tested recovery objective.
- Audit personal cloud storage. Users dumping CSVs and exports to personal cloud storage folders at scale creates both a cost and a data governance problem. Action: Implement quotas and automate cleanup of personal exports that have not been accessed for 30 days.
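Before tagging is enforced by policy, an audit script can report what is already missing. A minimal sketch, using the three mandatory tags named in the tagging item; the resource IDs and tag values are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "Project", "Owner")


def missing_tags(resource_tags: dict, required=REQUIRED_TAGS) -> list:
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required if not str(resource_tags.get(key, "")).strip()]


# Hypothetical inventory: resource ID mapped to its tags.
resources = {
    "i-0abc": {"Environment": "prod", "Project": "billing", "Owner": "data-eng"},
    "i-0def": {"Environment": "dev", "Owner": " "},
    "vol-0123": {},
}

untagged = {}
for resource_id, tags in resources.items():
    gaps = missing_tags(tags)
    if gaps:
        untagged[resource_id] = gaps

print(untagged)
```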
AWS: Eight Highest-Impact Cuts
1. Redshift: Reserved Instances vs Serverless
Predictable constant baseline load belongs on Reserved Instances. Serverless is correct for spiky, unpredictable workloads. Using Serverless for a constant baseline load is paying on-demand rates for predictable consumption.
2. Redshift Concurrency Scaling
Concurrency scaling bills per second of activation. If it is constantly active, the main cluster is undersized. Monitor usage before accepting it as a cost of operations.
3. S3 Intelligent-Tiering
Enable for data lakes. Automatic movement of unaccessed objects to cheaper tiers. The monitoring fee is negligible. This is a configuration change that requires no ongoing maintenance.
4. Glue DPU Right-Sizing
The default AWS Glue job provisions 10 DPUs. Most small ETL scripts need two. Always set DPU count manually. Enable Job Bookmarks to avoid reprocessing previously processed data on reruns.
5. EMR Spot Instances for Task Nodes
Master and Core nodes: On-Demand for stability. Task nodes: Spot instances. If a Spot instance is interrupted, EMR replaces it automatically. Task nodes are stateless and replaceable.
6. The NAT Gateway Trap
Never route AWS service traffic (S3, DynamoDB) through a NAT Gateway. Create a VPC Gateway Endpoint for S3. It is free. Every gigabyte currently routed through NAT to S3 has a free alternative.
7. EC2 and RDS Graviton Migration
Graviton processors: approximately 20% lower cost, up to 40% better performance for compatible workloads. Migrate compatible analytical workloads. The performance improvement often means the workload completes faster and uses fewer compute-hours.
8. S3 Requester Pays for External Partners
If large datasets are shared with external partners, enable Requester Pays. The partner covers the data transfer cost. This is a configuration setting, not a negotiation.
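The Glue right-sizing item is easy to quantify. In this sketch the runtime and schedule are hypothetical, the $0.44 per DPU-hour rate is an assumption to check against current regional pricing, and the runtime is assumed not to change when DPUs are reduced, which holds only for small scripts that cannot use the extra parallelism.

```python
def glue_job_cost(dpus: int, runtime_minutes: float, runs_per_month: int,
                  price_per_dpu_hour: float) -> float:
    """Monthly cost of a Glue job billed per DPU-hour."""
    return dpus * (runtime_minutes / 60) * runs_per_month * price_per_dpu_hour


PRICE_PER_DPU_HOUR = 0.44  # assumption: verify against current pricing

# Hypothetical small ETL script: 6 minutes per run, scheduled hourly (720 runs).
default_cost = glue_job_cost(10, runtime_minutes=6, runs_per_month=720,
                             price_per_dpu_hour=PRICE_PER_DPU_HOUR)
right_sized = glue_job_cost(2, runtime_minutes=6, runs_per_month=720,
                            price_per_dpu_hour=PRICE_PER_DPU_HOUR)

print(f"10 DPUs: ${default_cost:,.2f}/month, 2 DPUs: ${right_sized:,.2f}/month")
```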
GCP: Eight Highest-Impact Cuts
1. BigQuery Partitioning
Never create a large BigQuery table without partitioning by a DATE or TIMESTAMP column. A single unpartitioned table query scanning the entire dataset at $5 per terabyte has no ceiling. Partitioning is not an optimisation: it is a cost control.
2. Enforce Partition Filters
Prevent accidental full-table scans: ALTER TABLE my_table SET OPTIONS (require_partition_filter = true). This prevents the query that runs during a late-night exploration session from scanning years of data.
3. BigQuery Clustering
Cluster tables by columns frequently used in WHERE or JOIN clauses. Clustering reduces bytes scanned and does not add query cost or maintenance overhead.
4. On-Demand vs Capacity Slots
If BigQuery spend exceeds $10,000 per month, evaluate switching from On-Demand (pay per terabyte scanned) to Capacity pricing (fixed slot commitment). The break-even varies by workload pattern.
5. INFORMATION_SCHEMA Cost Monitoring
Run weekly against region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT to identify users or service accounts with the highest scan volumes. Cost attribution belongs with the team generating the cost.
6. Cloud Storage Lifecycle Policies
Move raw unprocessed data to Nearline after 30 days, Coldline after 90 days. Set and forget.
7. Dataflow FlexRS for Non-Time-Sensitive Batch
For batch Dataflow jobs without latency requirements, FlexRS allows GCP to schedule execution using preemptible VMs at a significant discount. The trade-off is uncertain start time, not quality.
8. Same-Region Colocation
GCS buckets, BigQuery datasets, and Dataflow jobs must be in the exact same region (e.g., us-central1), not the same multi-region (us). Multi-region routing incurs transfer costs between regional components.
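For the INFORMATION_SCHEMA monitoring item, the attribution step can be sketched as follows, assuming rows from JOBS_BY_PROJECT have already been fetched with their user_email and total_bytes_billed columns. The sample rows are hypothetical, and the per-terabyte price is a parameter; the $5 used here is the figure quoted in this playbook and may not match current pricing.

```python
from collections import defaultdict

BYTES_PER_TIB = 2 ** 40


def cost_by_user(job_rows, price_per_tib: float):
    """Sum billed bytes per user and convert to an estimated on-demand cost.

    Returns (user, cost) pairs sorted from most to least expensive.
    """
    billed = defaultdict(int)
    for row in job_rows:
        # total_bytes_billed may be None for jobs that billed nothing.
        billed[row["user_email"]] += row.get("total_bytes_billed") or 0
    costs = {user: total / BYTES_PER_TIB * price_per_tib for user, total in billed.items()}
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows shaped like a JOBS_BY_PROJECT result.
rows = [
    {"user_email": "etl@example.com", "total_bytes_billed": 3 * BYTES_PER_TIB},
    {"user_email": "analyst@example.com", "total_bytes_billed": BYTES_PER_TIB // 2},
    {"user_email": "etl@example.com", "total_bytes_billed": 1 * BYTES_PER_TIB},
    {"user_email": "analyst@example.com", "total_bytes_billed": None},
]

ranking = cost_by_user(rows, price_per_tib=5.0)
for user, cost in ranking:
    print(f"{user}: ${cost:,.2f}")
```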
Databricks: Five Critical Controls
1. Job Clusters vs All-Purpose Clusters
This is the most common Databricks cost mistake. All-Purpose (interactive) clusters are billed at significantly higher DBU rates. Job clusters receive discounts of up to 50%. Rule: All-Purpose clusters are for notebook development only. Automated jobs run on Job clusters. No exceptions. The engineering team will not notice the difference; the invoice will.
2. Auto-Termination Policy
Set all interactive clusters to terminate after 15–30 minutes of inactivity. Developers will object to the startup time. The startup time is a fixed cost per session. An always-on cluster is a fixed cost per hour, every hour, including the hours when nobody is working. Set the policy.
3. Spot Instances for Worker Nodes
Configure clusters to use Spot instances for worker nodes. Databricks handles Spot interruptions gracefully. Driver nodes should remain On-Demand for stability. Worker nodes are stateless and replaceable within a running job.
4. Photon Usage Decision
The Photon engine consumes more DBUs. The additional cost is justified for heavy SQL workloads and complex joins, where the reduction in compute time outweighs the DBU premium. It is not justified for simple ETL pipelines doing basic filtering or data movement: the speedup will not offset the price. Configure at the workload level, not the cluster level.
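The decision can be framed as a break-even: on DBU cost alone, Photon pays off only when the speedup exceeds the DBU multiplier. The sketch below uses hypothetical values throughout (DBU rate, price, multiplier, and speedups) and ignores the underlying VM cost, which falls with runtime and so pulls the real break-even slightly lower.

```python
def dbu_cost(runtime_hours: float, dbus_per_hour: float, price_per_dbu: float,
             speedup: float = 1.0, dbu_multiplier: float = 1.0) -> float:
    """DBU cost of a job, optionally with a Photon speedup and DBU premium."""
    return (runtime_hours / speedup) * dbus_per_hour * dbu_multiplier * price_per_dbu


def photon_pays_off(speedup: float, dbu_multiplier: float) -> bool:
    """On DBU cost alone, Photon is cheaper only if speedup beats the premium."""
    return speedup > dbu_multiplier


# All figures are assumptions for illustration.
baseline = dbu_cost(2.0, dbus_per_hour=20, price_per_dbu=0.15)
heavy_sql = dbu_cost(2.0, 20, 0.15, speedup=3.0, dbu_multiplier=2.0)
simple_etl = dbu_cost(2.0, 20, 0.15, speedup=1.2, dbu_multiplier=2.0)

print(f"baseline ${baseline:.2f}, heavy SQL with Photon ${heavy_sql:.2f}, "
      f"simple ETL with Photon ${simple_etl:.2f}")
```

Measure the speedup on the actual workload before deciding; the multiplier is known from the price list, the speedup is not.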
5. Driver Node Right-Sizing
Driver nodes require large RAM allocations only when running collect() operations that bring significant data to the driver. Most jobs do not do this. Downsize driver nodes to match actual usage. Worker nodes should remain appropriately sized for the distributed processing load.
Monitoring Setup
A monthly review of the billing console is not a monitoring strategy. By the time an anomaly is visible on the monthly invoice, it has been running for weeks. The alerts below catch it in hours.
- Daily spend anomaly alert: Configure a budget alert in AWS Budgets or GCP Billing at 110% of the 30-day daily average. A single bad query running overnight is visible the next morning, not at month-end.
- Per-service budget alerts: Set separate alerts for the top three cost drivers (typically compute, transfer, and managed services). A spike in one service is a different problem from a spike across all services.
- Query cost alerting: For BigQuery, monitor INFORMATION_SCHEMA.JOBS_BY_PROJECT on a schedule. For Snowflake, use the QUERY_HISTORY view. Route alerts for any single query exceeding a defined cost threshold.
- Idle cluster detection: A CloudWatch or GCP Monitoring metric alert on clusters with CPU under 5% for more than two hours during business hours. Idle clusters during off-hours are expected. Idle clusters at 2pm are a configuration problem.
- Data transfer spike detection: Alert on daily data transfer costs exceeding 150% of the rolling 14-day average. Transfer spikes usually indicate a pipeline change that introduced cross-region or cross-AZ traffic.
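The two ratio-based alerts above (110% of the 30-day daily average for total spend, 150% of the rolling 14-day average for transfer) share one piece of logic. A minimal sketch, assuming daily cost figures have already been exported from the billing tool; the spend values are hypothetical.

```python
from statistics import mean


def is_anomalous(history, today: float, window: int, threshold: float) -> bool:
    """True if today's spend exceeds threshold times the mean of the last `window` days."""
    if len(history) < window:
        raise ValueError(f"need at least {window} days of history")
    baseline = mean(history[-window:])
    return today > threshold * baseline


# Hypothetical daily totals in dollars.
total_spend_history = [100.0] * 30
transfer_history = [40.0] * 14

# Daily spend alert: 110% of the 30-day daily average.
spend_alert = is_anomalous(total_spend_history, today=115.0, window=30, threshold=1.10)

# Transfer spike alert: 150% of the rolling 14-day average.
transfer_alert = is_anomalous(transfer_history, today=55.0, window=14, threshold=1.50)

print(f"spend alert: {spend_alert}, transfer alert: {transfer_alert}")
```

A plain mean is sensitive to earlier spikes in the window; a median or trimmed mean is a reasonable substitute if the history is noisy.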
Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and Gulf region. If the bill is large enough that a systematic audit is worth running as an engagement, the diagnostic starts at: www.mindflex.tech.