Performance & Optimization
Module Summary
Tune pipeline builds, partition strategies, incremental transforms, and resource allocation for production workloads.
Partitioning Strategies
Partitioning controls how data is physically stored on disk. Partitioning on a column that downstream queries commonly filter on (such as date or region) means those queries read only the relevant partitions, potentially making them 100x faster. (Hash partitioning, by contrast, is used to spread high-cardinality keys evenly across partitions; it does not enable this kind of pruning.)
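Partition pruning can be illustrated with a minimal pure-Python sketch. The `partition_by` helper and the sample rows below are hypothetical, not Foundry APIs; the point is that a filter on the partition column touches only one group of rows instead of scanning everything.

```python
from collections import defaultdict

# Sample rows (hypothetical data for illustration).
rows = [
    {"date": "2024-01-01", "region": "EU", "amount": 10},
    {"date": "2024-01-01", "region": "US", "amount": 20},
    {"date": "2024-01-02", "region": "EU", "amount": 30},
]

def partition_by(rows, column):
    """Group rows into partitions keyed by the partition column's value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return partitions

partitions = partition_by(rows, "date")

# Partition pruning: a query filtering on date reads only one partition.
hits = partitions["2024-01-01"]
print(len(hits))  # 2 rows scanned instead of all 3
```

On a real dataset the partitions are separate files on disk, so pruning avoids I/O entirely rather than just skipping in-memory rows.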
Choose partition columns based on how the data is queried downstream. If most consumers filter by date, partition by date. Avoid partitioning on columns with very low cardinality (e.g., boolean flags): you'll end up with two giant partitions.
Incremental Transforms
Incremental transforms process only new or changed data rather than reprocessing the entire dataset. This is critical for large tables where daily builds would otherwise take hours.
Foundry supports incremental mode by tracking the transaction history of input datasets. Your transform logic reads only the new transactions, computes the delta, and appends or merges it into the output.
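The read-delta-then-merge pattern can be sketched in plain Python. The `incremental_merge` function and the key/value shape below are simplified assumptions, not Foundry's actual transform API; in practice the delta comes from the input's unprocessed transactions.

```python
# Previously built output (key -> value) and only the rows added since
# the last build. Both are hypothetical stand-ins for dataset contents.
existing_output = {"a": 1, "b": 2}
new_transactions = [("b", 5), ("c", 7)]

def incremental_merge(output, delta_rows):
    """Apply only the new rows instead of recomputing the whole table."""
    merged = dict(output)
    for key, value in delta_rows:
        merged[key] = value  # upsert: update an existing key or append a new one
    return merged

result = incremental_merge(existing_output, new_transactions)
print(result)  # {'a': 1, 'b': 5, 'c': 7}
```

The cost of a build scales with the size of the delta, not the size of the table, which is why this matters most for large, slowly changing datasets.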
Resource Allocation and Profiling
Each build runs on a Spark cluster whose size you can configure:
- Executor count — how many parallel workers.
- Executor memory — RAM per worker.
- Driver memory — memory for the orchestrator.
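These three knobs correspond to standard Spark properties. A `spark-defaults.conf`-style sketch with illustrative values (in Foundry, the equivalent settings are applied through a build's Spark profile rather than a config file):

```
spark.executor.instances  8
spark.executor.memory     8g
spark.driver.memory       4g
```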
Use Foundry's build profiler to identify bottlenecks: data skew (one partition is 100x larger), spill to disk (not enough memory), or shuffle overhead (too many wide joins). Address the bottleneck, not the symptom.
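Skew detection amounts to comparing each partition's size against a typical size. The `find_skewed` helper below is a hypothetical sketch of the kind of check a build profiler performs, flagging partitions far larger than the median.

```python
from statistics import median

def find_skewed(partition_sizes, threshold=10.0):
    """Return indices of partitions larger than threshold x the median size."""
    mid = median(partition_sizes)
    return [i for i, size in enumerate(partition_sizes) if size > threshold * mid]

sizes = [100, 120, 95, 11000, 105]  # one partition dwarfs the rest
print(find_skewed(sizes))  # [3] is the skewed partition's index
```

A skewed partition like this means one executor does almost all the work while the rest sit idle; the fix is usually repartitioning or salting the skewed key, not simply adding more executors.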
Key Takeaways
- Partition on columns that downstream consumers filter on most.
- Incremental transforms process only new data — essential for large tables.
- Use the build profiler to identify skew, spill, and shuffle bottlenecks.
- Right-size executor count and memory based on profiling data, not guesswork.