Mastering the Top 12 SQL Interview Patterns for Data Engineers

These articles are AI-generated summaries. Please check the original sources for full details.

Top 12 SQL Interview Problems for Data Engineers, With Answers

DataDriven outlines the recurring patterns used in FAANG and fintech SQL interviews. Analysis shows that 32% of these interview questions specifically test GROUP BY functionality.

Why This Matters

There is a significant gap between knowing basic syntax and understanding data grain. Many candidates fail aggregation problems, specifically in Meta interviews, because they join tables at the wrong grain: each parent row is repeated once per matching child row, so sums over parent columns are double-counted and revenue figures can be inflated by 3x.

Key Insights

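  • Execution Order: SQL logically evaluates FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY. WHERE runs before any grouping, so PostgreSQL rejects an aggregate there with an error ("aggregate functions are not allowed in WHERE"); filter on aggregates in HAVING instead.
  • Window Functions vs Aggregation: ROW_NUMBER keeps every input row with all of its columns, whereas GROUP BY collapses rows and requires every selected non-aggregated column to appear in the GROUP BY clause.
  • NULL Handling: NOT IN returns zero rows if the subquery result contains even a single NULL, because the comparison against NULL evaluates to UNKNOWN rather than TRUE. NOT EXISTS is the semantically safer choice in production.
  • Gaps and Islands: Subtracting a sequential ROW_NUMBER from a date yields the same value for every row in a run of consecutive dates, and that constant serves as a group identifier for the run.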
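
The NULL Handling point above can be illustrated with a minimal sketch in PostgreSQL syntax. The `customers` and `orders` tables and their rows are hypothetical, built inline with VALUES so the statements run on their own; they are not taken from the source article.

```sql
-- Hypothetical sample data (assumption): one order has a NULL customer_id.
-- NOT IN: for customers 2 and 3 the comparison against the NULL is UNKNOWN,
-- so no row passes the filter and the result is empty.
WITH customers (customer_id) AS (
    VALUES (1), (2), (3)
),
orders (order_id, customer_id) AS (
    VALUES (10, 1), (11, CAST(NULL AS integer))
)
SELECT c.customer_id
FROM customers AS c
WHERE c.customer_id NOT IN (SELECT o.customer_id FROM orders AS o);

-- NOT EXISTS: the correlated equality never matches the NULL row,
-- so customers without an order are returned as expected.
WITH customers (customer_id) AS (
    VALUES (1), (2), (3)
),
orders (order_id, customer_id) AS (
    VALUES (10, 1), (11, CAST(NULL AS integer))
)
SELECT c.customer_id
FROM customers AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders AS o
    WHERE o.customer_id = c.customer_id
)
ORDER BY c.customer_id;
```

With this sample data the first query returns no rows, while the second returns customers 2 and 3.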

Working Examples

Filtering on aggregated spend with HAVING, since an aggregate in the WHERE clause is rejected.

SELECT customer_id,
       SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 500
ORDER BY total_spent DESC;

Using CTEs to aggregate at different grains to prevent double-counting revenue.

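-- Step 1: collapse line items to one row per order (order grain).
WITH order_totals AS (
    SELECT order_id,
           customer_id,
           SUM(quantity * price) AS order_revenue
    FROM orders
    JOIN order_items USING (order_id)
    GROUP BY order_id, customer_id
)
-- Step 2: aggregate orders to one row per customer (customer grain).
-- COUNT(*) is safe here because order_totals holds exactly one row per order.
SELECT customer_id,
       COUNT(*) AS num_orders,
       SUM(order_revenue) AS total_revenue
FROM order_totals
GROUP BY customer_id;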

Retrieving the latest record per entity using window functions.

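WITH ranked AS (
    SELECT *,
           -- Number each customer's rows from newest to oldest.
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM customer_updates
)
-- rn = 1 is the most recent row per customer; ties on updated_at are broken arbitrarily.
SELECT customer_id, updated_at, email, status
FROM ranked
WHERE rn = 1;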
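
Finding streaks of consecutive dates with the gaps-and-islands technique listed under Key Insights. This is a minimal sketch in PostgreSQL syntax; the `logins` table and its rows are hypothetical, and it assumes at most one row per user per date.

```sql
-- Hypothetical sample data (assumption): user 1 logs in on Jan 1-3 and Jan 6-7.
WITH logins (user_id, login_date) AS (
    VALUES (1, DATE '2024-01-01'),
           (1, DATE '2024-01-02'),
           (1, DATE '2024-01-03'),
           (1, DATE '2024-01-06'),
           (1, DATE '2024-01-07')
),
numbered AS (
    SELECT user_id,
           login_date,
           -- date minus row number is constant within a run of consecutive dates.
           -- ROW_NUMBER returns bigint, so it is cast to integer for date arithmetic.
           login_date - CAST(ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY login_date
           ) AS integer) AS island_key
    FROM logins
)
SELECT user_id,
       MIN(login_date) AS streak_start,
       MAX(login_date) AS streak_end,
       COUNT(*)        AS streak_days
FROM numbered
GROUP BY user_id, island_key
ORDER BY user_id, streak_start;
```

For the sample rows this yields two streaks: January 1 to 3 (3 days) and January 6 to 7 (2 days). If a user can have several rows on the same date, the rows would need to be de-duplicated first, since duplicates break the constant difference.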

Practical Applications

  • Funnel Leak Analysis: Using LEFT JOIN with IS NULL (Anti-Join) to identify users who signed up but never converted.

  • Sessionization: Combining LAG and a cumulative SUM to assign session IDs based on a time threshold (e.g., 30 minutes). Each user's first event has a NULL lag and must be flagged as a session start; otherwise session IDs are off by one.

  • Hierarchy Mapping: Implementing Recursive CTEs for org charts while managing circular references to prevent infinite loops.
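
The sessionization pattern above can be sketched as follows, in PostgreSQL syntax. The `events` table, its rows, and the column names are hypothetical; the 30-minute threshold is the example value from the text.

```sql
-- Hypothetical sample data (assumption): four events for one user.
WITH events (user_id, event_time) AS (
    VALUES (1, TIMESTAMP '2024-01-01 09:00'),
           (1, TIMESTAMP '2024-01-01 09:10'),
           (1, TIMESTAMP '2024-01-01 10:00'),
           (1, TIMESTAMP '2024-01-01 10:20')
),
flagged AS (
    SELECT user_id,
           event_time,
           CASE
               -- First event per user: LAG is NULL, so start a session explicitly.
               WHEN LAG(event_time) OVER w IS NULL THEN 1
               -- A gap longer than the threshold also starts a new session.
               WHEN event_time - LAG(event_time) OVER w > INTERVAL '30 minutes' THEN 1
               ELSE 0
           END AS is_new_session
    FROM events
    WINDOW w AS (PARTITION BY user_id ORDER BY event_time)
)
SELECT user_id,
       event_time,
       -- Running total of session starts gives a per-user session number.
       SUM(is_new_session) OVER (
           PARTITION BY user_id
           ORDER BY event_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS session_id
FROM flagged
ORDER BY user_id, event_time;
```

With the sample rows, the 09:00 and 09:10 events fall into session 1, and the 10:00 and 10:20 events into session 2. Handling the NULL lag in its own CASE branch is what keeps the first session numbered 1 rather than 0.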

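For the hierarchy case, one way to guard a recursive CTE against circular references is to carry the visited path in an array and refuse to revisit any ID already on it. The sketch below uses PostgreSQL syntax; the `employees` table and its rows are hypothetical and deliberately contain a loop.

```sql
-- Hypothetical sample data (assumption): bad data closes a loop 1 -> 2 -> 3 -> 1.
WITH RECURSIVE employees (employee_id, manager_id) AS (
    VALUES (1, 3),
           (2, 1),
           (3, 2),
           (4, 2)
),
reports AS (
    -- Anchor: start from one employee and walk down to everyone beneath them.
    SELECT employee_id,
           1 AS depth,
           ARRAY[employee_id] AS path
    FROM employees
    WHERE employee_id = 1
    UNION ALL
    -- Recursive step: add direct reports of the rows found so far.
    SELECT e.employee_id,
           r.depth + 1,
           r.path || e.employee_id
    FROM employees AS e
    JOIN reports AS r ON e.manager_id = r.employee_id
    -- Cycle guard: skip any employee already on the current path.
    WHERE e.employee_id <> ALL (r.path)
)
SELECT employee_id, depth, path
FROM reports
ORDER BY depth, employee_id;
```

Without the guard, this data would make the recursion run forever; with it, the query returns employees 1, 2, 3, and 4 exactly once each and then stops. PostgreSQL 14 and later also offer a built-in CYCLE clause for recursive queries, which may be preferable where available.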