MySQL Interview Guide
Comprehensive interview questions and answers to help you prepare for technical interviews.
Day 1: SQL Basics, Filtering & Sorting
Q1. Explain the execution order of a SQL query (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT). Why is this important? [3 Marks]
Answer: SQL queries are not executed in the order they are written. The actual logical execution order is:
- FROM → Identifies the source table(s) and performs JOINs
- WHERE → Filters rows before any grouping
- GROUP BY → Groups the filtered rows
- HAVING → Filters groups after aggregation
- SELECT → Selects and computes the output columns
- ORDER BY → Sorts the final result
- LIMIT / TOP → Restricts the number of rows returned
Why it matters: This order explains why you cannot use a SELECT alias in a WHERE clause (WHERE runs before SELECT), but you can use it in ORDER BY (ORDER BY runs after SELECT). It also clarifies why aggregate functions like COUNT() or SUM() cannot appear in WHERE; they must go in HAVING instead.
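The aggregate rule above can be verified end to end; here is a minimal sketch using Python's built-in sqlite3 (the employees table and its rows are invented for the demo):

```python
import sqlite3

# Illustrative data only: show that COUNT(*) is rejected in WHERE
# (WHERE runs before grouping) but accepted in HAVING (runs after).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Sales", 50000), ("Sales", 60000), ("HR", 45000)])

# An aggregate in WHERE fails at parse time in SQLite.
try:
    conn.execute(
        "SELECT department FROM employees WHERE COUNT(*) > 1 GROUP BY department")
    aggregate_in_where_ok = True
except sqlite3.OperationalError:
    aggregate_in_where_ok = False

# The same filter is legal in HAVING, which sees the grouped results.
rows = conn.execute(
    "SELECT department, COUNT(*) FROM employees "
    "GROUP BY department HAVING COUNT(*) > 1"
).fetchall()
```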
Q2. Write a query to select all employees from 'employees' table where department is 'Sales' and salary > 50000. [2 Marks]
Answer:
SELECT *
FROM employees
WHERE department = 'Sales'
AND salary > 50000;
Q3. What is the difference between DISTINCT and GROUP BY for retrieving unique values? [3 Marks]
Answer:
| Feature | DISTINCT | GROUP BY |
|---|---|---|
| Purpose | Removes duplicate rows from the result | Groups rows for aggregation |
| Use with aggregates | Cannot be combined with aggregate functions | Designed for aggregate functions (SUM, COUNT, etc.) |
| Performance | Comparable; many engines optimize pure deduplication identically for both | Extra cost only when aggregates are computed |
| Flexibility | Works on one or more columns | Can filter groups using HAVING |
Example:
-- DISTINCT: just unique departments
SELECT DISTINCT department FROM employees;
-- GROUP BY: unique departments with employee count
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department;
Rule of thumb: Use DISTINCT when you only need unique values. Use GROUP BY when you also need to perform calculations on each group.
Q4. Write a query to find the top 10 highest-paid employees. Use both LIMIT and TOP syntax. [3 Marks]
Answer:
Using LIMIT (MySQL, PostgreSQL, SQLite):
SELECT employee_id, name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;
Using TOP (SQL Server / MS Access):
SELECT TOP 10 employee_id, name, salary
FROM employees
ORDER BY salary DESC;
Using FETCH NEXT (Standard SQL / PostgreSQL / Oracle 12c+):
SELECT employee_id, name, salary
FROM employees
ORDER BY salary DESC
FETCH NEXT 10 ROWS ONLY;
Q5. Explain the difference between WHERE and HAVING clauses with an example. [2 Marks]
Answer:
- WHERE filters rows before grouping; it works on individual row values and cannot use aggregate functions.
- HAVING filters after grouping; it works on aggregated results like COUNT(), SUM(), etc.
Example:
-- WHERE filters individual rows first
-- HAVING filters the grouped results
SELECT department, COUNT(*) AS total_employees
FROM employees
WHERE salary > 30000 -- filters rows before grouping
GROUP BY department
HAVING COUNT(*) > 5; -- filters groups after aggregation
Here, only employees with salary > 30000 are considered, and only departments with more than 5 such employees are shown.
Q6. What does the BETWEEN operator do? Write a query to find orders placed between Jan 1 and Mar 31, 2024. [2 Marks]
Answer: The BETWEEN operator filters values within an inclusive range, meaning both the start and end values are included in the result. It works with numbers, dates, and text.
SELECT order_id, customer_id, order_date, amount
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
Note:
`BETWEEN '2024-01-01' AND '2024-03-31'` is equivalent to `order_date >= '2024-01-01' AND order_date <= '2024-03-31'`. For datetime columns, use `'2024-03-31 23:59:59'` or `< '2024-04-01'` to capture the full last day.
Q7. Explain the difference between IN, EXISTS, and ANY operators with examples. [4 Marks]
Answer:
IN: checks if a value matches any value in a fixed list or subquery result.
-- Finds employees in Sales or HR
SELECT name FROM employees
WHERE department IN ('Sales', 'HR');
EXISTS: checks whether a subquery returns at least one row. It is often more efficient for large datasets because it can stop as soon as a match is found.
-- Finds customers who have placed at least one order
SELECT name FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
ANY: compares a value against any value returned by a subquery using a comparison operator (=, >, <, etc.).
-- Finds employees earning more than at least one person in the Intern grade
SELECT name FROM employees
WHERE salary > ANY (
SELECT salary FROM employees WHERE grade = 'Intern'
);
| Operator | Best Use Case | Stops Early? |
|---|---|---|
| IN | Fixed list or small subquery | No |
| EXISTS | Checking row existence (correlated) | Yes |
| ANY | Comparison against a set of values | No |
Q8. What is the difference between LIKE and ILIKE? Write a query to find all customers whose names start with 'A' (case-insensitive). [3 Marks]
Answer:
- LIKE: performs case-sensitive pattern matching (standard SQL).
- ILIKE: performs case-insensitive pattern matching (PostgreSQL-specific extension).
-- PostgreSQL: using ILIKE (case-insensitive)
SELECT customer_id, name
FROM customers
WHERE name ILIKE 'A%';
-- MySQL: LIKE is case-insensitive by default for most collations
SELECT customer_id, name
FROM customers
WHERE name LIKE 'A%';
-- SQL Server: case-insensitive workaround using UPPER()
SELECT customer_id, name
FROM customers
WHERE UPPER(name) LIKE 'A%';
Note: In MySQL, case sensitivity depends on the column's collation (e.g.,
`utf8_general_ci` is case-insensitive). In PostgreSQL, use `ILIKE` for case-insensitive matching.
Q9. Write a query to find all products whose name contains 'pro' but not 'professional'. [3 Marks]
Answer:
-- PostgreSQL (ILIKE is not available in MySQL or SQL Server)
SELECT product_id, product_name
FROM products
WHERE product_name ILIKE '%pro%' -- contains 'pro' (case-insensitive)
AND product_name NOT ILIKE '%professional%'; -- excludes 'professional'
MySQL version (LIKE is case-insensitive by default):
SELECT product_id, product_name
FROM products
WHERE product_name LIKE '%pro%'
AND product_name NOT LIKE '%professional%';
SQL Server version:
SELECT product_id, product_name
FROM products
WHERE LOWER(product_name) LIKE '%pro%'
AND LOWER(product_name) NOT LIKE '%professional%';
Q10. Explain NULL handling in SQL. Why does 'WHERE column = NULL' not work? What should be used instead? [3 Marks]
Answer: In SQL, NULL represents the absence of a value. It is not zero, not an empty string, and not equal to anything, including itself. This behavior follows three-valued logic (TRUE, FALSE, UNKNOWN).
Why WHERE column = NULL fails:
-- This will NEVER return rows: NULL = NULL evaluates to UNKNOWN, not TRUE
SELECT * FROM employees WHERE manager_id = NULL; -- wrong
Correct approach: use IS NULL / IS NOT NULL:
-- Correct: find rows where manager_id is NULL
SELECT * FROM employees WHERE manager_id IS NULL;
-- Correct: find rows where manager_id is NOT NULL
SELECT * FROM employees WHERE manager_id IS NOT NULL;
Other NULL-safe handling functions:
-- COALESCE returns the first non-NULL value
SELECT COALESCE(phone, 'N/A') FROM customers;
-- NULL-safe equality in MySQL
SELECT * FROM employees WHERE manager_id <=> NULL;
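Three-valued logic is easy to see in action; this sketch uses Python's sqlite3 with an invented employees table:

```python
import sqlite3

# Illustrative data: one row with a NULL manager_id, one without.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Alice", None), ("Bob", 101)])

# '= NULL' never matches -- the comparison yields UNKNOWN, not TRUE.
eq_rows = conn.execute(
    "SELECT name FROM employees WHERE manager_id = NULL").fetchall()

# 'IS NULL' is the correct test and finds Alice.
is_rows = conn.execute(
    "SELECT name FROM employees WHERE manager_id IS NULL").fetchall()

# COALESCE substitutes a fallback for NULL.
coalesced = conn.execute(
    "SELECT COALESCE(manager_id, -1) FROM employees WHERE name = 'Alice'"
).fetchone()[0]
```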
Q11. What is the difference between COALESCE, ISNULL, and NULLIF? [2 Marks]
Answer:
| Function | Description | Dialect |
|---|---|---|
| `COALESCE(a, b, c, ...)` | Returns the first non-NULL value from a list of arguments | Standard SQL (all databases) |
| `ISNULL(expr, replacement)` | Returns replacement if expr is NULL; like COALESCE but limited to two arguments | SQL Server (MySQL's two-argument equivalent is `IFNULL`) |
| `NULLIF(a, b)` | Returns NULL if a = b, otherwise returns a; commonly used to avoid division by zero | Standard SQL (all databases) |
Examples:
-- COALESCE: returns first non-null among multiple options
SELECT COALESCE(phone, mobile, 'No Contact') FROM customers;
-- ISNULL: SQL Server specific
SELECT ISNULL(salary, 0) FROM employees;
-- NULLIF: avoid division by zero
SELECT total_sales / NULLIF(total_orders, 0) AS avg_order_value
FROM sales_summary;
Q12. Write a query to sort customers by country ascending and total_spend descending. [3 Marks]
Answer:
SELECT
customer_id,
name,
country,
total_spend
FROM customers
ORDER BY
country ASC,
total_spend DESC;
Explanation: When multiple columns are specified in ORDER BY, SQL sorts by the first column first. Rows with the same country are then sorted by `total_spend` in descending order (highest spender first within each country).
Q13. How do you implement pagination in SQL? Write a query to fetch page 3 with 20 records per page. [4 Marks]
Answer: Pagination is implemented by combining LIMIT (number of records per page) with OFFSET (number of records to skip).
Formula: OFFSET = (page_number - 1) × records_per_page
For Page 3 with 20 records: OFFSET = (3 - 1) × 20 = 40
MySQL / PostgreSQL (LIMIT + OFFSET):
SELECT customer_id, name, email
FROM customers
ORDER BY customer_id
LIMIT 20 OFFSET 40;
SQL Server (OFFSET FETCH):
SELECT customer_id, name, email
FROM customers
ORDER BY customer_id
OFFSET 40 ROWS
FETCH NEXT 20 ROWS ONLY;
Oracle (ROWNUM, older versions; note the inner ORDER BY, since ROWNUM is assigned before sorting would otherwise happen):
SELECT *
FROM (
    SELECT t.*, ROWNUM AS rn
    FROM (
        SELECT customer_id, name
        FROM customers
        ORDER BY customer_id
    ) t
    WHERE ROWNUM <= 60
)
WHERE rn > 40;
Important: Always include ORDER BY with pagination; without it, the order of rows is not guaranteed across pages.
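The offset formula and the LIMIT/OFFSET pattern can be sketched together (Python sqlite3; the 100-row customers table is fabricated, and page_offset is a hypothetical helper, not part of the guide):

```python
import sqlite3

# Hypothetical helper mirroring the formula: OFFSET = (page - 1) * per_page.
def page_offset(page, per_page):
    return (page - 1) * per_page

# Invented data: customer_ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [(i,) for i in range(1, 101)])

# Page 3 with 20 per page -> rows 41..60; ORDER BY keeps pages stable.
rows = conn.execute(
    "SELECT customer_id FROM customers ORDER BY customer_id LIMIT ? OFFSET ?",
    (20, page_offset(3, 20)),
).fetchall()
```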
Q14. What is the difference between LIMIT/OFFSET and FETCH NEXT in standard SQL? [3 Marks]
Answer:
| Feature | LIMIT / OFFSET | FETCH NEXT |
|---|---|---|
| Standard | Not in official SQL standard | Part of ISO/ANSI SQL:2008 standard |
| Supported by | MySQL, PostgreSQL, SQLite | PostgreSQL, SQL Server, Oracle 12c+, DB2 |
| Syntax style | Simpler and shorter | More verbose but portable |
| Requires ORDER BY | Recommended but not enforced (varies) | Mandatory |
LIMIT / OFFSET syntax:
SELECT name FROM customers
ORDER BY customer_id
LIMIT 20 OFFSET 40;
FETCH NEXT syntax (Standard SQL):
SELECT name FROM customers
ORDER BY customer_id
OFFSET 40 ROWS
FETCH NEXT 20 ROWS ONLY;
Recommendation: Use `FETCH NEXT` when writing database-agnostic or enterprise SQL. Use `LIMIT/OFFSET` for MySQL or quick scripts.
Q15. Given a 'customers' table, write a query to find the number of customers per country, excluding countries with less than 10 customers. [5 Marks]
Answer:
SELECT
country,
COUNT(*) AS total_customers
FROM customers
GROUP BY country
HAVING COUNT(*) >= 10
ORDER BY total_customers DESC;
Step-by-step explanation:
| Step | Clause | What it does |
|---|---|---|
| 1 | `FROM customers` | Reads all rows from the customers table |
| 2 | `GROUP BY country` | Groups rows by each unique country |
| 3 | `COUNT(*)` | Counts customers in each group |
| 4 | `HAVING COUNT(*) >= 10` | Excludes countries with fewer than 10 customers |
| 5 | `ORDER BY total_customers DESC` | Shows highest-count countries first |
Note: `WHERE` cannot be used here because it runs before grouping. `HAVING` is the correct clause to filter aggregated results.
Q16. Write a query to find the second highest salary in the employees table without using LIMIT. [5 Marks]
Answer:
Method 1: Using a Subquery (works in all databases):
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (
SELECT MAX(salary) FROM employees
);
Method 2: Using NOT IN:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary NOT IN (
SELECT MAX(salary) FROM employees
);
Method 3: Using a Correlated Subquery:
SELECT DISTINCT salary AS second_highest_salary
FROM employees e1
WHERE 1 = (
SELECT COUNT(DISTINCT salary)
FROM employees e2
WHERE e2.salary > e1.salary
);
Method 4: Using DENSE_RANK() Window Function (recommended for production):
SELECT salary AS second_highest_salary
FROM (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) ranked
WHERE rnk = 2;
Why DENSE_RANK() is preferred: It handles duplicate salaries gracefully. If two employees share the highest salary, DENSE_RANK() still correctly identifies the true second-highest distinct value.
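A runnable check (Python sqlite3; table and salaries invented) that the subquery method and DENSE_RANK() agree even with a tied top salary; the window function needs SQLite 3.25 or later:

```python
import sqlite3

# Invented data with a duplicated highest salary (90000 twice).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 90000), ("C", 80000), ("D", 70000)])

# Method 1: highest salary strictly below the maximum.
second = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

# Method 4: DENSE_RANK() collapses the two 90000 rows into rank 1.
via_rank = conn.execute(
    "SELECT salary FROM ("
    "  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk"
    "  FROM employees"
    ") WHERE rnk = 2"
).fetchone()[0]
```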
Day 2: Aggregations, GROUP BY & HAVING
Q1. Explain the difference between COUNT(*), COUNT(column_name), and COUNT(DISTINCT column_name). [4 Marks]
Answer:
| Function | What it counts | Includes NULLs? |
|---|---|---|
| `COUNT(*)` | Every row in the result set, regardless of values | Yes |
| `COUNT(column_name)` | Only rows where the specified column is NOT NULL | No |
| `COUNT(DISTINCT column_name)` | Only unique, non-NULL values in the column | No |
Example: assume this employees table:
| emp_id | name | manager_id |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 101 |
| 3 | Carol | 101 |
| 4 | Dave | NULL |
SELECT
COUNT(*) AS total_rows, -- 4
COUNT(manager_id) AS non_null_managers, -- 2
COUNT(DISTINCT manager_id) AS unique_managers -- 1
FROM employees;
Key insight: Always use `COUNT(*)` to count total rows. Use `COUNT(column)` when NULLs should be excluded. Use `COUNT(DISTINCT column)` when you need the number of unique values.
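A runnable confirmation of the three counts using the same four rows (Python sqlite3):

```python
import sqlite3

# Reproduce the example table: two NULL manager_ids, two copies of 101.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Alice", None), (2, "Bob", 101),
                  (3, "Carol", 101), (4, "Dave", None)])

# COUNT(*) = 4 rows, COUNT(manager_id) = 2 non-NULLs, DISTINCT = 1 value.
total, non_null, unique = conn.execute(
    "SELECT COUNT(*), COUNT(manager_id), COUNT(DISTINCT manager_id) "
    "FROM employees"
).fetchone()
```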
Q2. Write a query to calculate the total revenue, average order value, and number of orders per month. [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL:
SELECT
YEAR(order_date) AS order_year,
MONTH(order_date) AS order_month,
COUNT(order_id) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
GROUP BY
YEAR(order_date),
MONTH(order_date)
ORDER BY
order_year ASC,
order_month ASC;
PostgreSQL:
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS order_month,
COUNT(order_id) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(AVG(amount)::NUMERIC, 2) AS avg_order_value
FROM orders
GROUP BY
TO_CHAR(order_date, 'YYYY-MM')
ORDER BY
order_month ASC;
SQL Server:
SELECT
FORMAT(order_date, 'yyyy-MM') AS order_month,
COUNT(order_id) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
GROUP BY
FORMAT(order_date, 'yyyy-MM')
ORDER BY
order_month ASC;
Q3. What happens when you use an aggregate function with NULL values? Illustrate with SUM and AVG. [3 Marks]
Answer: SQL aggregate functions ignore NULL values automatically, except COUNT(*), which counts all rows including those with NULLs.
Example: assume this salaries table:
| emp_id | salary |
|---|---|
| 1 | 50000 |
| 2 | 60000 |
| 3 | NULL |
| 4 | 40000 |
SELECT
SUM(salary) AS total_salary, -- 150000 (NULL ignored, not treated as 0)
AVG(salary) AS avg_salary, -- 50000 (150000 / 3, NOT divided by 4)
COUNT(*) AS total_rows, -- 4
COUNT(salary) AS non_null_count -- 3
FROM salaries;
Critical point for AVG:
- AVG(salary) = SUM(salary) / COUNT(salary) = 150000 / 3 = 50000
- It does NOT divide by 4 (total rows), which would give 37500
- This can lead to misleading averages if NULLs represent missing data (e.g., unpaid employees)
Safe approach: treat NULL as 0 using COALESCE:
SELECT AVG(COALESCE(salary, 0)) AS avg_including_nulls -- 37500
FROM salaries;
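The same table can be loaded into SQLite to confirm every number above (Python sqlite3):

```python
import sqlite3

# Reproduce the salaries table, including the NULL salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (emp_id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)",
                 [(1, 50000), (2, 60000), (3, None), (4, 40000)])

# SUM and AVG skip the NULL: AVG divides by 3 non-NULL rows, not 4.
total, avg, n_rows, non_null = conn.execute(
    "SELECT SUM(salary), AVG(salary), COUNT(*), COUNT(salary) FROM salaries"
).fetchone()

# COALESCE forces NULL to 0, so the average divides by all 4 rows.
avg_with_nulls_as_zero = conn.execute(
    "SELECT AVG(COALESCE(salary, 0)) FROM salaries").fetchone()[0]
```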
Q4. Write a query to find the department with the highest average salary. [4 Marks]
Answer:
Assumption: Table employees(emp_id, name, department, salary) exists.
Method 1: Using ORDER BY + LIMIT (MySQL / PostgreSQL):
SELECT
department,
ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC
LIMIT 1;
Method 2: Using a Subquery (works in all databases):
SELECT
department,
ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) = (
SELECT MAX(avg_sal)
FROM (
SELECT AVG(salary) AS avg_sal
FROM employees
GROUP BY department
) AS dept_avgs
);
Method 3: Using RANK() Window Function (recommended; handles ties):
SELECT department, avg_salary
FROM (
SELECT
department,
ROUND(AVG(salary), 2) AS avg_salary,
RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
FROM employees
GROUP BY department
) ranked
WHERE rnk = 1;
Why Method 3 is best: If two departments share the highest average salary, LIMIT 1 would arbitrarily drop one. RANK() returns all tied departments correctly.
Q5. Can you use columns in SELECT that are not in GROUP BY? Explain the rule and give an example. [3 Marks]
Answer: Generally, no â in standard SQL, every column in the SELECT clause must either appear in the GROUP BY clause or be wrapped inside an aggregate function (SUM, COUNT, MAX, etc.).
Why this rule exists: After grouping, each group collapses into a single row. If a non-grouped column has multiple different values within a group, SQL has no way to decide which value to display.
Invalid query (SQL Server / PostgreSQL will throw an error):
SELECT department, name, AVG(salary) -- 'name' is neither grouped nor aggregated
FROM employees
GROUP BY department;
-- ERROR: column "name" must appear in GROUP BY or aggregate function
Correct approach:
-- Option 1: Add name to GROUP BY
SELECT department, name, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, name;
-- Option 2: Use an aggregate on name
SELECT department, MAX(name) AS sample_employee, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
MySQL Exception: When sql_mode does not include ONLY_FULL_GROUP_BY (it is enabled by default since MySQL 5.7), MySQL allows non-grouped columns, but it returns an arbitrary, nondeterministic value from each group, which is considered bad practice.
Q6. What is ROLLUP? Write a query to show sales by region, by country, and grand total using ROLLUP. [4 Marks]
Answer: ROLLUP is a GROUP BY extension that generates subtotals and a grand total automatically. It creates a hierarchy of groupings moving from the most specific to the most general level.
Assumption: Table sales(sale_id, region, country, amount) exists.
PostgreSQL / SQL Server (MySQL uses GROUP BY region, country WITH ROLLUP instead of the function-style syntax):
SELECT
COALESCE(region, 'ALL REGIONS') AS region,
COALESCE(country, 'ALL COUNTRIES') AS country,
SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, country)
ORDER BY region, country;
Sample Output:
| region | country | total_sales |
|---|---|---|
| Asia | India | 500000 |
| Asia | Japan | 300000 |
| Asia | ALL COUNTRIES | 800000 |
| Europe | France | 400000 |
| Europe | Germany | 600000 |
| Europe | ALL COUNTRIES | 1000000 |
| ALL REGIONS | ALL COUNTRIES | 1800000 |
Note: ROLLUP produces subtotals for each region and a grand total at the end. `COALESCE` is used to label the NULL placeholders that ROLLUP generates for subtotal/grand total rows.
Q7. Explain the difference between CUBE and ROLLUP with examples. [4 Marks]
Answer:
| Feature | ROLLUP | CUBE |
|---|---|---|
| Subtotals generated | Hierarchical; follows left-to-right column order | All possible combinations of the specified columns |
| Number of groupings | n + 1 (for n columns) | 2ⁿ (for n columns) |
| Use case | Hierarchical data (year → month → day, region → country) | Multi-dimensional analysis (pivot-style reporting) |
Example â Table: sales(region, product, amount)
ROLLUP: hierarchical subtotals:
SELECT
region,
product,
SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, product);
Produces groupings:
- (region, product) → detail level
- (region) → subtotal per region
- () → grand total
CUBE: all combinations:
SELECT
region,
product,
SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE(region, product);
Produces groupings:
- (region, product) → detail level
- (region) → subtotal per region
- (product) → subtotal per product (the extra grouping CUBE adds)
- () → grand total
Rule of thumb: Use `ROLLUP` for drill-down reports (top-to-bottom hierarchy). Use `CUBE` when you need totals sliced every possible way, like a pivot table.
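SQLite supports neither ROLLUP nor CUBE, so a portable fallback is to emulate the groupings with UNION ALL; a sketch with invented sales data:

```python
import sqlite3

# Invented data: two regions, three detail rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Asia", "X", 100), ("Asia", "Y", 200), ("Europe", "X", 300)])

# ROLLUP(region, product) emulated as three grouping levels glued together:
# detail rows, per-region subtotals (product = NULL), and a grand total.
rollup = conn.execute("""
    SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
    UNION ALL
    SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()
```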
Q8. Write a query to find customers who placed more than 5 orders in the last 6 months. [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL:
SELECT
customer_id,
COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;
PostgreSQL:
SELECT
customer_id,
COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;
SQL Server:
SELECT
customer_id,
COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATEADD(MONTH, -6, GETDATE())
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;
Note: `WHERE` filters rows to the last 6 months before grouping (efficient; it reduces data early). `HAVING` then filters the grouped results to only customers with more than 5 orders.
Q9. Write a query to find products whose average rating is above 4.5 but with at least 100 reviews. [4 Marks]
Answer:
Assumption: Table reviews(review_id, product_id, rating) exists. Table products(product_id, product_name) exists.
SELECT
p.product_id,
p.product_name,
COUNT(r.review_id) AS total_reviews,
ROUND(AVG(r.rating), 2) AS avg_rating
FROM products p
JOIN reviews r
ON p.product_id = r.product_id
GROUP BY
p.product_id,
p.product_name
HAVING
COUNT(r.review_id) >= 100 -- at least 100 reviews
AND AVG(r.rating) > 4.5 -- average rating above 4.5
ORDER BY avg_rating DESC;
Why HAVING and not WHERE: Both conditions (`AVG` and `COUNT`) are aggregate results computed after grouping. WHERE runs before grouping and cannot reference aggregates, so HAVING is the only correct option here.
Q10. Can HAVING be used without GROUP BY? Give an example. [3 Marks]
Answer: Yes, HAVING can be used without GROUP BY. In this case, the entire table is treated as one single group, and HAVING decides whether that group satisfies the condition, returning either the aggregated result or no rows at all.
Example: check if the overall average salary in the company exceeds 50000:
SELECT AVG(salary) AS company_avg_salary
FROM employees
HAVING AVG(salary) > 50000;
- If the company-wide average salary is above 50000, the result row is returned.
- If not, no rows are returned at all.
Practical use case: return the total order count only if it exceeds 1000:
SELECT COUNT(*) AS total_orders
FROM orders
HAVING COUNT(*) > 1000;
Note: While valid, `HAVING` without `GROUP BY` is uncommon in real-world queries. (A variant like SELECT * FROM orders HAVING COUNT(*) > 1000 is rejected by strict engines, because non-aggregated columns cannot be mixed with an aggregate filter.) A `WHERE` clause or a simple `SELECT` with a conditional check is usually more readable for this purpose.
Q11. Write a query to find departments where the total salary exceeds 1 million AND the employee count is more than 20. [3 Marks]
Answer:
Assumption: Table employees(emp_id, name, department, salary) exists.
SELECT
department,
COUNT(emp_id) AS employee_count,
SUM(salary) AS total_salary
FROM employees
GROUP BY department
HAVING
SUM(salary) > 1000000 -- total salary exceeds 1 million
AND COUNT(emp_id) > 20 -- employee count more than 20
ORDER BY total_salary DESC;
Explanation: Both conditions operate on aggregated values (SUM and COUNT), so they must go in the `HAVING` clause, not in `WHERE`. `WHERE` only filters individual rows before grouping and cannot use aggregate functions.
Q12. Write a query to find the top 3 products by revenue in each category. [5 Marks]
Answer:
Assumption: Table products(product_id, product_name, category) and table order_items(order_item_id, product_id, quantity, unit_price) exist.
Method: Using DENSE_RANK() Window Function (recommended):
SELECT
category,
product_name,
total_revenue,
revenue_rank
FROM (
SELECT
p.category,
p.product_name,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
DENSE_RANK() OVER (
PARTITION BY p.category
ORDER BY SUM(oi.quantity * oi.unit_price) DESC
) AS revenue_rank
FROM products p
JOIN order_items oi
ON p.product_id = oi.product_id
GROUP BY
p.category,
p.product_name
) ranked
WHERE revenue_rank <= 3
ORDER BY
category ASC,
revenue_rank ASC;
How it works step by step:
| Step | What happens |
|---|---|
| `JOIN` | Combines products with their order line items |
| `GROUP BY category, product_name` | Calculates total revenue per product per category |
| `DENSE_RANK() OVER (PARTITION BY category ...)` | Ranks products within each category by revenue |
| `WHERE revenue_rank <= 3` | Keeps only the top 3 per category |
Why DENSE_RANK() over ROW_NUMBER(): DENSE_RANK() handles ties correctly; if two products share rank 2, both appear, and rank 3 is still assigned to the next product. ROW_NUMBER() would arbitrarily break the tie and potentially exclude a valid top-3 product.
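The tie-handling difference is easy to demonstrate; this sketch (Python sqlite3, SQLite 3.25+ for window functions, invented revenue figures) builds a tie inside the top 3:

```python
import sqlite3

# Invented revenue: B and C tie at 400; D trails at 300.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (product TEXT, total INTEGER)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)",
                 [("A", 500), ("B", 400), ("C", 400), ("D", 300)])

# ROW_NUMBER gives every row a distinct rank, so only three products pass.
top3_rownum = conn.execute(
    "SELECT product FROM ("
    "  SELECT product, ROW_NUMBER() OVER (ORDER BY total DESC, product) AS rn"
    "  FROM revenue"
    ") WHERE rn <= 3 ORDER BY product"
).fetchall()

# DENSE_RANK ranks the tie as 2, 2, so D lands at rank 3 and is kept too.
top3_dense = conn.execute(
    "SELECT product FROM ("
    "  SELECT product, DENSE_RANK() OVER (ORDER BY total DESC) AS rnk"
    "  FROM revenue"
    ") WHERE rnk <= 3 ORDER BY product"
).fetchall()
```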
Q13. Given an orders table with columns (order_id, customer_id, order_date, amount), write a query to find the month with the highest total sales. [5 Marks]
Answer:
MySQL:
SELECT
YEAR(order_date) AS order_year,
MONTH(order_date) AS order_month,
SUM(amount) AS total_sales
FROM orders
GROUP BY
YEAR(order_date),
MONTH(order_date)
ORDER BY total_sales DESC
LIMIT 1;
PostgreSQL:
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS order_month,
SUM(amount) AS total_sales
FROM orders
GROUP BY
TO_CHAR(order_date, 'YYYY-MM')
ORDER BY total_sales DESC
LIMIT 1;
SQL Server (handles ties correctly using RANK):
SELECT order_month, total_sales
FROM (
SELECT
FORMAT(order_date, 'yyyy-MM') AS order_month,
SUM(amount) AS total_sales,
RANK() OVER (
ORDER BY SUM(amount) DESC
) AS rnk
FROM orders
GROUP BY FORMAT(order_date, 'yyyy-MM')
) ranked
WHERE rnk = 1;
Why use RANK() for SQL Server: If two months share the exact same highest total sales, `TOP 1` would arbitrarily return one. `RANK()` ensures all tied months are returned correctly. For MySQL/PostgreSQL, use `LIMIT 1` for simplicity or wrap in a subquery with `MAX()` for tie-safety.
Tie-safe version for MySQL / PostgreSQL:
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS order_month,
SUM(amount) AS total_sales
FROM orders
GROUP BY TO_CHAR(order_date, 'YYYY-MM')
HAVING SUM(amount) = (
SELECT MAX(monthly_total)
FROM (
SELECT SUM(amount) AS monthly_total
FROM orders
GROUP BY TO_CHAR(order_date, 'YYYY-MM')
) AS monthly_sales
);
Day 3: Joins (INNER, LEFT, RIGHT, FULL & Self Joins)
Q1. Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN with Venn diagrams. [4 Marks]
Answer:
INNER JOIN: returns only rows where there is a match in both tables.
A          B
[    |███|    ]
Only the intersection
SELECT c.name, o.order_id
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
-- Only customers who HAVE orders
LEFT JOIN (LEFT OUTER JOIN): returns all rows from the left table and matched rows from the right table. Unmatched right-side columns return NULL.
A          B
[███|███|    ]
All of A + matching B
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
-- All customers, even those with no orders (order_id = NULL)
RIGHT JOIN (RIGHT OUTER JOIN): returns all rows from the right table and matched rows from the left table. Unmatched left-side columns return NULL.
A          B
[    |███|███]
Matching A + all of B
SELECT c.name, o.order_id
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
-- All orders, even if the customer record no longer exists
FULL OUTER JOIN: returns all rows from both tables. NULLs fill in where there is no match on either side.
A          B
[███|███|███]
Everything from both
SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
-- All customers + all orders, matched where possible
-- MySQL does NOT support FULL OUTER JOIN natively (use UNION of LEFT + RIGHT)
| Join Type | Keeps unmatched left rows? | Keeps unmatched right rows? | Keeps matched rows? |
|---|---|---|---|
| INNER JOIN | No | No | Yes |
| LEFT JOIN | Yes | No | Yes |
| RIGHT JOIN | No | Yes | Yes |
| FULL OUTER JOIN | Yes | Yes | Yes |
Q2. Write a query to find all customers who have never placed an order using LEFT JOIN. [3 Marks]
Answer:
Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.
SELECT
c.customer_id,
c.name,
c.email
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL
ORDER BY c.customer_id;
How it works:
| Step | What happens |
|---|---|
| `LEFT JOIN` | Returns all customers, with NULL in order columns if no matching order exists |
| `WHERE o.order_id IS NULL` | Keeps only customers where no order was found (the NULL rows) |
Common mistake: Using `WHERE o.order_id = NULL` instead of `IS NULL` will always return zero rows, because NULL comparisons with `=` evaluate to UNKNOWN, not TRUE. Always use `IS NULL` to check for NULLs.
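Both the correct IS NULL anti-join and the = NULL mistake can be run side by side (Python sqlite3, invented data):

```python
import sqlite3

# Invented data: Alice has one order, Bob has none.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.execute("INSERT INTO orders VALUES (10, 1)")

# Correct anti-join: keeps the row where the LEFT JOIN found no order.
no_orders = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON c.customer_id = o.customer_id "
    "WHERE o.order_id IS NULL"
).fetchall()

# The '= NULL' version silently matches nothing.
wrong = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON c.customer_id = o.customer_id "
    "WHERE o.order_id = NULL"
).fetchall()
```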
Q3. What is a CROSS JOIN? When is it useful? [3 Marks]
Answer: A CROSS JOIN produces the Cartesian product of two tables: every row from the left table is combined with every row from the right table. If Table A has m rows and Table B has n rows, the result has m × n rows.
Syntax:
SELECT
a.color,
b.size
FROM colors a
CROSS JOIN sizes b;
Example output (3 colors × 3 sizes = 9 rows):
| color | size |
|---|---|
| Red | Small |
| Red | Medium |
| Red | Large |
| Blue | Small |
| Blue | Medium |
| ... | ... |
When is CROSS JOIN useful?
-- 1. Generating all combinations (e.g., product variants)
SELECT p.product_name, c.color, s.size
FROM products p
CROSS JOIN colors c
CROSS JOIN sizes s;
-- 2. Generating a date/calendar table (PostgreSQL)
SELECT CURRENT_DATE + seq AS date_value
FROM generate_series(0, 364) AS seq;
-- 3. Seeding test data with all possible combinations
SELECT e.employee_id, t.task_type
FROM employees e
CROSS JOIN task_types t;
Warning: CROSS JOINs on large tables can be extremely expensive. A table of 1,000 rows CROSS JOINed with another 1,000-row table produces 1,000,000 rows. Always ensure this is intentional.
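A small sketch confirming the m × n blow-up (Python sqlite3, invented color and size tables):

```python
import sqlite3

# Invented lookup tables: 3 colors and 3 sizes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE colors (color TEXT)")
conn.execute("CREATE TABLE sizes (size TEXT)")
conn.executemany("INSERT INTO colors VALUES (?)",
                 [("Red",), ("Blue",), ("Green",)])
conn.executemany("INSERT INTO sizes VALUES (?)", [("S",), ("M",), ("L",)])

# CROSS JOIN pairs every color with every size: 3 x 3 = 9 rows.
combos = conn.execute(
    "SELECT color, size FROM colors CROSS JOIN sizes").fetchall()
```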
Q4. Explain the difference between ON and USING in join syntax. [2 Marks]
Answer:
| Feature | ON | USING |
|---|---|---|
| Column names | Works even if column names differ between tables | Requires the join column to have the same name in both tables |
| Flexibility | Supports complex conditions (>, <, multiple columns, expressions) | Only supports simple equality on identically named columns |
| Result columns | Join column appears twice (once per table) in SELECT * | Join column appears only once in SELECT * |
| Standard support | All SQL databases | Standard SQL; supported by MySQL and PostgreSQL, but not SQL Server |
-- Using ON (flexible â column names can differ)
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.cust_id; -- different column names
-- Using USING (concise â column name must be identical in both tables)
SELECT name, order_id
FROM customers
JOIN orders USING (customer_id); -- customer_id exists in both tables
-- ON with multiple conditions
SELECT *
FROM orders o
JOIN promotions p
ON o.customer_id = p.customer_id
AND o.order_date BETWEEN p.start_date AND p.end_date;
Rule of thumb: Use `USING` for clean, simple joins when column names match. Use `ON` for everything else; it is more explicit and universally supported.
Q5. Write a query to join three tables: customers, orders, and products, showing customer name, order date, and product name. [3 Marks]
Answer:
Assumption: Tables are:
- customers(customer_id, name, email)
- orders(order_id, customer_id, order_date)
- order_items(order_item_id, order_id, product_id, quantity)
- products(product_id, product_name, price)
SELECT
c.name AS customer_name,
o.order_date,
p.product_name
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
INNER JOIN products p
ON oi.product_id = p.product_id
ORDER BY
o.order_date DESC,
c.name ASC;
Note: In a normalized schema, `orders` and `products` are linked through an intermediate `order_items` table. Attempting to join `orders` directly to `products` without `order_items` would be incorrect, as no direct foreign key relationship exists between them.
Q6. What is a self join? Write a query to find each employee and their manager's name from an 'employees' table with a 'manager_id' column. [4 Marks]
Answer: A self join is when a table is joined to itself. This is useful when a table contains a hierarchical or relational reference to its own rows â such as an employee's manager also being an employee in the same table.
Assumption: Table employees(emp_id, name, department, salary, manager_id) where manager_id references emp_id in the same table.
-- INNER JOIN version (excludes employees with no manager, e.g., the CEO)
SELECT
e.emp_id AS employee_id,
e.name AS employee_name,
m.name AS manager_name
FROM employees e
INNER JOIN employees m
ON e.manager_id = m.emp_id
ORDER BY e.emp_id;
-- LEFT JOIN version (includes employees with no manager; shows NULL for top-level)
SELECT
e.emp_id AS employee_id,
e.name AS employee_name,
COALESCE(m.name, 'No Manager - Top Level') AS manager_name
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.emp_id
ORDER BY e.emp_id;
Key concept: The same table is aliased twice (e for the employee row, m for the manager row), so SQL treats them as two separate logical tables during the join.
Q7. Write a query to find all pairs of employees in the same department (without pairing employees with themselves or duplicating pairs). [3 Marks]
Answer:
Assumption: Table employees(emp_id, name, department) exists.
SELECT
e1.name AS employee_1,
e2.name AS employee_2,
e1.department
FROM employees e1
JOIN employees e2
ON e1.department = e2.department -- same department
AND e1.emp_id < e2.emp_id -- prevents self-pairing AND duplicate pairs
ORDER BY
e1.department,
e1.name;
Why e1.emp_id < e2.emp_id works:
| Condition | What it prevents |
|---|---|
| `e1.emp_id != e2.emp_id` | Prevents self-pairing (Alice, Alice) but still gives (Alice, Bob) AND (Bob, Alice) |
| `e1.emp_id < e2.emp_id` | Prevents self-pairing AND ensures each pair appears only once, in one direction |
Example: If Alice has emp_id=1 and Bob has emp_id=2, only the pair (1, 2) is included. The reverse pair (2, 1) is excluded because `2 < 1` is FALSE.
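Quick verification: the pairing logic can be checked end to end with an in-memory SQLite database via Python's sqlite3 module. The employee rows below are hypothetical sample data.

```python
import sqlite3

# Minimal sketch of the self-pair query; three Sales employees plus one
# HR employee who has no same-department partner.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, department TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Alice", "Sales"), (2, "Bob", "Sales"),
    (3, "Carol", "Sales"), (4, "Dan", "HR"),
])

pairs = conn.execute("""
    SELECT e1.name, e2.name, e1.department
    FROM employees e1
    JOIN employees e2
      ON e1.department = e2.department
     AND e1.emp_id < e2.emp_id
    ORDER BY e1.department, e1.name
""").fetchall()

# Three Sales pairs, no self-pairs, no reversed duplicates; Dan produces no row.
print(pairs)
```

Swapping the `<` for `!=` in the sketch doubles the result to six rows, which makes the de-duplication effect easy to see.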
Q8. What is a non-equi join? Give an example with price ranges. [3 Marks]
Answer: A non-equi join is a join that uses a comparison operator other than = in the ON clause, such as >, <, >=, <=, BETWEEN, or !=. It is used when rows should be matched based on a range or inequality condition rather than exact key matching.
Example: Categorize products into price tiers using a price_ranges table:
Assumption:
- products(product_id, product_name, price)
- price_ranges(tier_name, min_price, max_price)
-- price_ranges table:
-- | tier_name | min_price | max_price |
-- | Budget | 0 | 999 |
-- | Mid-Range | 1000 | 4999 |
-- | Premium | 5000 | 9999 |
-- | Luxury | 10000 | 999999 |
SELECT
p.product_id,
p.product_name,
p.price,
pr.tier_name AS price_tier
FROM products p
JOIN price_ranges pr
ON p.price BETWEEN pr.min_price AND pr.max_price
ORDER BY p.price;
Other common non-equi join use cases:
-- Find employees whose salary falls within a salary band
SELECT e.name, sb.band_name
FROM employees e
JOIN salary_bands sb
ON e.salary BETWEEN sb.lower_bound AND sb.upper_bound;
-- Find all orders placed during an active promotion period
SELECT o.order_id, promo.promo_name
FROM orders o
JOIN promotions promo
ON o.order_date >= promo.start_date
AND o.order_date <= promo.end_date;
Q9. Write a query using LEFT JOIN to find customers who have placed orders but never returned any product. (tables: customers, orders, returns) [5 Marks]
Answer:
Assumption: Tables are:
- customers(customer_id, name, email)
- orders(order_id, customer_id, order_date, amount)
- returns(return_id, order_id, customer_id, return_date, reason)
SELECT
c.customer_id,
c.name,
c.email,
COUNT(o.order_id) AS total_orders
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id -- only customers who have ordered
LEFT JOIN returns r
ON c.customer_id = r.customer_id -- attempt to match any return by this customer
WHERE r.return_id IS NULL -- keep only customers with NO returns
GROUP BY
c.customer_id,
c.name,
c.email
ORDER BY total_orders DESC;
Step-by-step logic:
| Step | Clause | Purpose |
|---|---|---|
| 1 | `INNER JOIN orders` | Restrict to customers who have placed at least one order |
| 2 | `LEFT JOIN returns` | Attempt to find a matching return for each customer |
| 3 | `WHERE r.return_id IS NULL` | Keep only customers with no return at all |
| 4 | `GROUP BY` | Aggregate per customer to count their orders |
Careful: join returns on customer_id, not order_id. An order-level anti-join would keep a customer whose other orders were returned, which is not what "never returned any product" means.
Why INNER JOIN for orders but LEFT JOIN for returns: We want customers who definitely ordered (`INNER JOIN`) but never returned anything (`LEFT JOIN` + `IS NULL` anti-join pattern).
Q10. Explain the difference between a LEFT JOIN with WHERE clause vs a LEFT JOIN with a condition in the ON clause. Give examples showing the different results. [5 Marks]
Answer: This is one of the most commonly misunderstood concepts in SQL joins. The placement of a filter condition (in ON vs WHERE) produces fundamentally different results with LEFT JOIN.
Setup: two tables:
-- customers: Alice (1), Bob (2), Carol (3)
-- orders: Alice has 1 order, Bob has 1 order, Carol has none
Case 1: Filter in ON clause (soft filter):
SELECT
c.name,
o.order_id
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
AND o.amount > 500; -- condition applied DURING the join
Result:
| name | order_id |
|---|---|
| Alice | 101 |
| Bob | NULL |
| Carol | NULL |
All customers are returned. Bob's order exists but its amount was ≤ 500, so it doesn't match the ON condition; Bob gets NULL for order_id but still appears in the results.
Case 2: Filter in WHERE clause (hard filter):
SELECT
c.name,
o.order_id
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.amount > 500; -- condition applied AFTER the join
Result:
| name | order_id |
|---|---|
| Alice | 101 |
Carol and Bob are completely removed. The LEFT JOIN first produces NULLs for unmatched rows, but then WHERE o.amount > 500 filters them out: NULL > 500 evaluates to UNKNOWN (not TRUE), so those rows are eliminated. This effectively converts the LEFT JOIN into an INNER JOIN.
Summary:
| | ON Condition | WHERE Condition |
|---|---|---|
| Filter applied | During join construction | After join is complete |
| Effect on unmatched rows | Still included (with NULLs) | Eliminated if condition fails |
| Result type | True LEFT JOIN preserved | Behaves like INNER JOIN |
| Use when | You want all left rows regardless | You want to strictly filter final results |
Golden rule: To keep the LEFT JOIN behavior intact while filtering the right-side table, always put right-table conditions in the `ON` clause. Putting them in `WHERE` silently turns your LEFT JOIN into an INNER JOIN.
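Both cases are easy to reproduce with an in-memory SQLite database via Python's sqlite3 module. The customers and orders below are hypothetical sample data.

```python
import sqlite3

# Minimal sketch of ON-filter vs WHERE-filter on the same LEFT JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 900), (102, 2, 300);  -- Carol: no orders
""")

# Case 1: condition in ON keeps every customer (soft filter)
on_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o
      ON c.customer_id = o.customer_id AND o.amount > 500
""").fetchall()

# Case 2: condition in WHERE drops the NULL rows (hard filter)
where_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.amount > 500
""").fetchall()

print(len(on_rows), len(where_rows))  # 3 1
```

Case 1 yields all three customers (Bob and Carol with NULL order_id); Case 2 collapses to Alice alone, exactly the INNER JOIN behavior described above.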
Q11. Write a query to find customers whose total order value in 2024 is greater than their total order value in 2023. [5 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) and customers(customer_id, name) exist.
WITH orders_2023 AS (
SELECT
customer_id,
SUM(amount) AS total_2023
FROM orders
WHERE YEAR(order_date) = 2023
GROUP BY customer_id
),
orders_2024 AS (
SELECT
customer_id,
SUM(amount) AS total_2024
FROM orders
WHERE YEAR(order_date) = 2024
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
COALESCE(o23.total_2023, 0) AS total_2023,
COALESCE(o24.total_2024, 0) AS total_2024,
(COALESCE(o24.total_2024, 0)
- COALESCE(o23.total_2023, 0)) AS growth
FROM customers c
INNER JOIN orders_2024 o24
ON c.customer_id = o24.customer_id -- must have 2024 orders
LEFT JOIN orders_2023 o23
ON c.customer_id = o23.customer_id -- may or may not have 2023 orders
WHERE COALESCE(o24.total_2024, 0)
> COALESCE(o23.total_2023, 0) -- 2024 must exceed 2023
ORDER BY growth DESC;
Why COALESCE: A customer who had zero orders in 2023 but ordered in 2024 would have NULL for `total_2023`. `COALESCE(o23.total_2023, 0)` converts that NULL to 0, allowing the comparison to work correctly and also including new customers who grew from 0 to something.
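The NULL pitfall behind this is easy to see in isolation. A quick check via Python's sqlite3 module (SQLite follows the same three-valued comparison logic as MySQL here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A comparison against NULL yields NULL (UNKNOWN), which is not TRUE, so a
# new customer with no 2023 total would silently fail the > test without
# COALESCE.
raw, coalesced = conn.execute(
    "SELECT 500 > NULL, 500 > COALESCE(NULL, 0)"
).fetchone()

print(raw, coalesced)  # None 1
```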
Q12. Write a query to find the top 5 customers by total spend, including customers who haven't ordered anything (show them with 0). [5 Marks]
Answer:
Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.
SELECT
c.customer_id,
c.name,
c.email,
COALESCE(SUM(o.amount), 0) AS total_spend,
COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.name,
c.email
ORDER BY total_spend DESC
LIMIT 5; -- use TOP 5 for SQL Server
SQL Server version:
SELECT TOP 5
c.customer_id,
c.name,
c.email,
COALESCE(SUM(o.amount), 0) AS total_spend,
COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.name,
c.email
ORDER BY total_spend DESC;
Design decisions explained:
| Choice | Reason |
|---|---|
| `LEFT JOIN` instead of `INNER JOIN` | Ensures customers with zero orders are included |
| `COALESCE(SUM(o.amount), 0)` | Converts NULL (no orders) to 0 for clean display |
| `COUNT(o.order_id)` not `COUNT(*)` | `COUNT(*)` would count 1 even for NULL rows; `COUNT(o.order_id)` correctly returns 0 |
Q13. Given 'departments' and 'employees' tables, find departments that have no employees assigned. [5 Marks]
Answer:
Assumption: Tables departments(dept_id, dept_name, location) and employees(emp_id, name, dept_id, salary) exist.
Method 1: LEFT JOIN + IS NULL (anti-join pattern, most common):
SELECT
d.dept_id,
d.dept_name,
d.location
FROM departments d
LEFT JOIN employees e
ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL
ORDER BY d.dept_name;
Method 2: NOT EXISTS (most readable and efficient for large datasets):
SELECT
d.dept_id,
d.dept_name,
d.location
FROM departments d
WHERE NOT EXISTS (
SELECT 1
FROM employees e
WHERE e.dept_id = d.dept_id
)
ORDER BY d.dept_name;
Method 3: NOT IN (simple but unsafe if dept_id can be NULL in employees):
SELECT
dept_id,
dept_name,
location
FROM departments
WHERE dept_id NOT IN (
SELECT dept_id
FROM employees
WHERE dept_id IS NOT NULL -- critical: NOT IN fails silently if subquery has NULLs
)
ORDER BY dept_name;
Method comparison:
| Method | Performance | NULL safe | Readability |
|---|---|---|---|
| LEFT JOIN + IS NULL | ✅ Fast | ✅ Yes | ✅ Common pattern |
| NOT EXISTS | ✅ Fast (stops early) | ✅ Yes | ✅ Very readable |
| NOT IN | ⚠️ Slower on large sets | ❌ Fails with NULLs | ✅ Simple |
Best practice: Prefer `NOT EXISTS` or `LEFT JOIN + IS NULL` over `NOT IN` in production queries. If the subquery in `NOT IN` returns even one NULL value, the entire query returns zero rows: a silent and dangerous bug.
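That NOT IN failure mode can be reproduced in a few lines with an in-memory SQLite database via Python's sqlite3 module. The departments and employees below are hypothetical sample data; one employee row has a NULL dept_id.

```python
import sqlite3

# Minimal sketch of the NOT IN NULL trap vs the safe NOT EXISTS pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'Sales'), (20, 'HR'), (30, 'Legal');
    INSERT INTO employees VALUES (1, 10), (2, 20), (3, NULL);  -- NULL dept_id
""")

# Unsafe: one NULL in the subquery makes every NOT IN test UNKNOWN -> 0 rows,
# even though Legal genuinely has no employees.
unsafe = conn.execute("""
    SELECT dept_name FROM departments
    WHERE dept_id NOT IN (SELECT dept_id FROM employees)
""").fetchall()

# Safe: NOT EXISTS simply finds no matching row for Legal.
safe = conn.execute("""
    SELECT dept_name FROM departments d
    WHERE NOT EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id)
""").fetchall()

print(unsafe, safe)  # [] [('Legal',)]
```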
📘 Day 4: Subqueries, CTEs & Derived Tables
Here are all the answers formatted for direct copy-paste:
Q1. Explain the difference between a correlated and non-correlated subquery with examples. [4 Marks]
Answer:
| Feature | Non-Correlated Subquery | Correlated Subquery |
|---|---|---|
| Dependency | Independent of the outer query | References a column from the outer query |
| Execution | Runs once, result reused | Runs once per row of the outer query |
| Performance | Generally faster | Generally slower on large datasets |
| Can run alone? | ✅ Yes | ❌ No, depends on outer query |
Non-Correlated Subquery (runs once, no reference to outer query):
-- Find all employees earning more than the company-wide average salary
SELECT emp_id, name, salary
FROM employees
WHERE salary > (
SELECT AVG(salary) -- runs once, returns a single value
FROM employees
);
Correlated Subquery (references outer query, runs per row):
-- Find all employees earning more than their OWN department's average salary
SELECT emp_id, name, department, salary
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.department = e1.department -- references e1 from outer query
-- re-executes for every row in e1
);
Performance tip: Correlated subqueries can be slow on large tables because they execute once per outer row. They are often rewritable as JOINs or window functions for better performance.
Q2. Write a query using a subquery to find employees who earn more than the average salary of their department. [4 Marks]
Answer:
Assumption: Table employees(emp_id, name, department, salary) exists.
Method 1: Correlated Subquery:
SELECT
emp_id,
name,
department,
salary
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.department = e1.department
)
ORDER BY department, salary DESC;
Method 2: Derived Table (often more efficient):
SELECT
e.emp_id,
e.name,
e.department,
e.salary,
ROUND(dept_avg.avg_salary, 2) AS dept_avg_salary
FROM employees e
JOIN (
SELECT
department,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department
) AS dept_avg
ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_salary
ORDER BY e.department, e.salary DESC;
Method 3: Window Function (most efficient):
SELECT emp_id, name, department, salary
FROM (
SELECT
emp_id,
name,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees
) ranked
WHERE salary > dept_avg
ORDER BY department, salary DESC;
Why Method 3 is best: The window function computes department averages in a single pass over the data. No correlated re-execution, no extra JOIN; it is the most performant approach for large datasets.
Q3. What is the difference between a scalar, row, and table subquery? [3 Marks]
Answer:
| Type | Returns | Used in | Must return |
|---|---|---|---|
| Scalar | A single value (1 row × 1 column) | SELECT, WHERE, HAVING | Exactly 1 value (error if multiple) |
| Row | A single row (1 row × multiple columns) | WHERE with row constructors | Exactly 1 row |
| Table | Multiple rows and columns | FROM clause (as derived table) | Any number of rows |
Scalar Subquery (returns one value):
-- Returns a single number used in comparison
SELECT name, salary,
(SELECT AVG(salary) FROM employees) AS company_avg
FROM employees;
Row Subquery (returns one row, compared as a unit):
-- Find the employee with the exact same department and salary as emp_id 101
SELECT name, department, salary
FROM employees
WHERE (department, salary) = (
SELECT department, salary
FROM employees
WHERE emp_id = 101
);
Table Subquery (returns a result set used as a virtual table):
-- Use multiple rows and columns as an inline table in FROM
SELECT dept_summary.department, dept_summary.avg_salary
FROM (
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
) AS dept_summary
WHERE dept_summary.avg_salary > 60000;
Q4. Write a query using EXISTS to find customers who have placed at least one order above 10000. [4 Marks]
Answer:
Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.
SELECT
c.customer_id,
c.name,
c.email
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 10000
)
ORDER BY c.name;
Why SELECT 1 inside EXISTS:
-- EXISTS only checks whether the subquery returns ANY row; it ignores the values
-- SELECT 1, SELECT *, SELECT NULL: all work identically inside EXISTS
-- SELECT 1 is a common convention signaling "we only care about existence, not data"
EXISTS vs IN for this query:
-- Using IN (alternative)
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE amount > 10000
);
-- EXISTS is preferred when:
-- 1. The subquery could return a large result set (EXISTS stops at first match)
-- 2. The subquery involves NULLs (IN behaves unexpectedly with NULLs)
-- 3. You need a correlated condition (EXISTS naturally supports correlation)
Performance note: `EXISTS` short-circuits: it stops scanning as soon as it finds the first matching row. `IN` collects all matching values first, then checks membership. For large datasets, `EXISTS` is typically faster.
Q5. What is a CTE? How does it differ from a subquery and a temporary table? When would you prefer a CTE? [4 Marks]
Answer: A Common Table Expression (CTE) is a named, temporary result set defined using the WITH clause at the beginning of a query. It exists only for the duration of that single query execution.
WITH cte_name AS (
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
)
SELECT e.name, e.salary, c.avg_salary
FROM employees e
JOIN cte_name c ON e.department = c.department;
Comparison table:
| Feature | Subquery | CTE | Temporary Table |
|---|---|---|---|
| Named & reusable | ❌ No | ✅ Yes (within query) | ✅ Yes (within session) |
| Scope | Inline only | Current query only | Entire session |
| Readability | ❌ Can nest and become complex | ✅ Clean, modular | ✅ Clear |
| Supports recursion | ❌ No | ✅ Yes | ❌ No |
| Indexes possible | ❌ No | ❌ No | ✅ Yes |
| Materialized to disk | ❌ No | ❌ Usually not | ✅ Yes |
| DDL required | ❌ No | ❌ No | ✅ Yes (CREATE TABLE) |
When to prefer a CTE:
- When the same subquery logic is needed more than once in the same query
- When building recursive queries (hierarchy traversal, number generation)
- When breaking a complex query into readable named steps
- When you want no side effects: CTEs vanish after the query, unlike temp tables
Q6. Write a query using a CTE to calculate monthly running total of sales. [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL:
WITH monthly_sales AS (
SELECT
YEAR(order_date) AS sale_year,
MONTH(order_date) AS sale_month,
SUM(amount) AS monthly_total
FROM orders
GROUP BY
YEAR(order_date),
MONTH(order_date)
)
SELECT
sale_year,
sale_month,
monthly_total,
SUM(monthly_total) OVER (
PARTITION BY sale_year
ORDER BY sale_month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM monthly_sales
ORDER BY sale_year, sale_month;
PostgreSQL:
WITH monthly_sales AS (
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS sale_month,
SUM(amount) AS monthly_total
FROM orders
GROUP BY TO_CHAR(order_date, 'YYYY-MM')
)
SELECT
sale_month,
monthly_total,
SUM(monthly_total) OVER (
ORDER BY sale_month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM monthly_sales
ORDER BY sale_month;
Sample output:
| sale_month | monthly_total | running_total |
|---|---|---|
| 2024-01 | 120000 | 120000 |
| 2024-02 | 95000 | 215000 |
| 2024-03 | 140000 | 355000 |
Why `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`: This window frame explicitly defines the running total from the very first row up to the current row. Without specifying the frame, the default behavior can vary across databases and lead to incorrect results.
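The frame clause can be sanity-checked against the sample output above using an in-memory SQLite database (SQLite 3.25+ supports window functions) via Python's sqlite3 module; the monthly totals are the hypothetical figures from the table.

```python
import sqlite3

# Minimal sketch of the running-total window over three months of sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (sale_month TEXT, monthly_total INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 120000), ('2024-02', 95000), ('2024-03', 140000);
""")

rows = conn.execute("""
    SELECT sale_month, monthly_total,
           SUM(monthly_total) OVER (
               ORDER BY sale_month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM monthly_sales
    ORDER BY sale_month
""").fetchall()

for row in rows:
    print(row)
# ('2024-01', 120000, 120000)
# ('2024-02', 95000, 215000)
# ('2024-03', 140000, 355000)
```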
Q7. What is a recursive CTE? Write one to generate numbers from 1 to 100. [4 Marks]
Answer: A recursive CTE is a CTE that references itself. It consists of two parts joined by UNION ALL:
- Anchor member: the starting/base case (runs once)
- Recursive member: references the CTE itself and runs repeatedly until the termination condition is met
Structure:
WITH RECURSIVE cte_name AS (
-- Anchor: starting point
SELECT base_value ...
UNION ALL
-- Recursive member: references cte_name itself
SELECT next_value ...
FROM cte_name
WHERE termination_condition
)
SELECT * FROM cte_name;
Generate numbers 1 to 100:
PostgreSQL / MySQL 8+ (RECURSIVE keyword required):
WITH RECURSIVE number_series AS (
-- Anchor member: start at 1
SELECT 1 AS n
UNION ALL
-- Recursive member: add 1 each iteration
SELECT n + 1
FROM number_series
WHERE n < 100 -- termination condition: stop at 100
)
SELECT n
FROM number_series
ORDER BY n;
SQL Server (recursion is implied; the RECURSIVE keyword is not used):
WITH number_series AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1
FROM number_series
WHERE n < 100
)
SELECT n FROM number_series;
Execution flow:
| Iteration | Value of n |
|---|---|
| Anchor | 1 |
| Recursion 1 | 2 |
| Recursion 2 | 3 |
| ... | ... |
| Recursion 99 | 100 |
| Termination | n < 100 is FALSE → stops |
Safety tip: Always include a `WHERE` termination condition in recursive CTEs. Without it, the query will run indefinitely. SQL Server also supports a MAXRECURSION hint (OPTION (MAXRECURSION 1000)) to cap recursion depth.
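SQLite also supports WITH RECURSIVE, so the number generator can be sanity-checked directly from Python's sqlite3 module:

```python
import sqlite3

# Minimal sketch: run the 1..100 recursive CTE on an in-memory database.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE number_series AS (
        SELECT 1 AS n          -- anchor member
        UNION ALL
        SELECT n + 1           -- recursive member
        FROM number_series
        WHERE n < 100          -- termination condition
    )
    SELECT n FROM number_series ORDER BY n
""").fetchall()

print(len(rows), rows[0][0], rows[-1][0])  # 100 1 100
```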
Q8. Use a recursive CTE to traverse an employee-manager hierarchy and show each employee's full reporting chain. [3 Marks]
Answer:
Assumption: Table employees(emp_id, name, manager_id) where manager_id is NULL for the top-level executive.
WITH RECURSIVE org_hierarchy AS (
-- Anchor: start with top-level employees (no manager)
SELECT
emp_id,
name,
manager_id,
name AS reporting_chain,
0 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: find employees reporting to current level
SELECT
e.emp_id,
e.name,
e.manager_id,
CONCAT(h.reporting_chain, ' → ', e.name) AS reporting_chain,
h.level + 1 AS level
FROM employees e
INNER JOIN org_hierarchy h
ON e.manager_id = h.emp_id
)
SELECT
emp_id,
name,
level AS hierarchy_level,
reporting_chain
FROM org_hierarchy
ORDER BY reporting_chain;
Sample output:
| emp_id | name | hierarchy_level | reporting_chain |
|---|---|---|---|
| 1 | CEO | 0 | CEO |
| 2 | VP Sales | 1 | CEO → VP Sales |
| 4 | Sales Manager | 2 | CEO → VP Sales → Sales Manager |
| 7 | Sales Rep | 3 | CEO → VP Sales → Sales Manager → Sales Rep |
Note: `CONCAT` builds the chain path by appending each employee's name to their manager's chain. The `level` column tracks depth in the hierarchy. In SQL Server, the chain column typically needs an explicit CAST to a consistent VARCHAR length in both the anchor and recursive members so the recursion type-checks.
Q9. What is a derived table? Write a query using a derived table to find customers whose average order value is in the top 10%. [5 Marks]
Answer: A derived table is an inline subquery placed in the FROM clause and given an alias. It acts as a virtual table for the duration of the query, similar to a CTE but defined inline rather than at the top with WITH.
-- Structure of a derived table
SELECT col1, col2
FROM (
SELECT ... -- this subquery IS the derived table
FROM some_table
) AS alias_name -- alias is MANDATORY for derived tables
WHERE ...;
Query: Find customers in the top 10% by average order value:
SELECT
customer_id,
avg_order_value,
percentile_rank
FROM (
-- Derived table: compute avg order value and percentile rank per customer
SELECT
customer_id,
ROUND(AVG(amount), 2) AS avg_order_value,
PERCENT_RANK() OVER (
ORDER BY AVG(amount)
) AS percentile_rank
FROM orders
GROUP BY customer_id
) AS customer_stats
WHERE percentile_rank >= 0.90 -- top 10% (90th percentile and above)
ORDER BY avg_order_value DESC;
With customer name (joining customers table):
SELECT
c.customer_id,
c.name,
cs.avg_order_value,
ROUND(cs.percentile_rank * 100, 1) AS percentile
FROM customers c
JOIN (
SELECT
customer_id,
ROUND(AVG(amount), 2) AS avg_order_value,
PERCENT_RANK() OVER (ORDER BY AVG(amount)) AS percentile_rank
FROM orders
GROUP BY customer_id
) AS cs
ON c.customer_id = cs.customer_id
WHERE cs.percentile_rank >= 0.90
ORDER BY cs.avg_order_value DESC;
Note: `PERCENT_RANK()` returns a value between 0 and 1. A value of `0.90` means the customer's average order value is higher than 90% of all customers, placing them in the top 10%.
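A minimal sketch of the derived-table version, runnable against an in-memory SQLite database via Python's sqlite3 module. The 11 customers (one order each, amounts 100..1100) are hypothetical; the averages are computed in an inner derived table first, since some engines are restrictive about aggregate arguments inside OVER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 100) for i in range(1, 12)])

top = conn.execute("""
    SELECT customer_id, avg_order_value, percentile_rank
    FROM (
        SELECT customer_id, avg_order_value,
               PERCENT_RANK() OVER (ORDER BY avg_order_value) AS percentile_rank
        FROM (
            SELECT customer_id, AVG(amount) AS avg_order_value
            FROM orders
            GROUP BY customer_id
        ) AS per_customer
    ) AS customer_stats
    WHERE percentile_rank >= 0.90
    ORDER BY avg_order_value DESC
""").fetchall()

# With 11 distinct averages, PERCENT_RANK = (rank - 1) / 10, so exactly the
# top two customers (0.9 and 1.0) clear the 0.90 cutoff.
print(top)
```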
Q10. Rewrite the above query using a CTE. Which is more readable and why? [5 Marks]
Answer:
CTE version of the same query:
WITH customer_order_stats AS (
-- Step 1: Calculate average order value per customer
SELECT
customer_id,
ROUND(AVG(amount), 2) AS avg_order_value,
PERCENT_RANK() OVER (
ORDER BY AVG(amount)
) AS percentile_rank
FROM orders
GROUP BY customer_id
),
top_10_percent AS (
-- Step 2: Filter to top 10% only
SELECT
customer_id,
avg_order_value,
ROUND(percentile_rank * 100, 1) AS percentile
FROM customer_order_stats
WHERE percentile_rank >= 0.90
)
-- Step 3: Join with customers table to get names
SELECT
c.customer_id,
c.name,
t.avg_order_value,
t.percentile
FROM customers c
JOIN top_10_percent t
ON c.customer_id = t.customer_id
ORDER BY t.avg_order_value DESC;
Why the CTE version is more readable:
| Aspect | Derived Table | CTE |
|---|---|---|
| Logic separation | All nested inline; hard to isolate each step | Each step is named and separated clearly |
| Reusability | Cannot reuse the derived table elsewhere in the query | customer_order_stats can be referenced multiple times |
| Debugging | Must extract nested subquery to test it | Each CTE block can be run independently |
| Readability | Logic flows inside-out (read innermost first) | Logic flows top-to-bottom like narrative steps |
| Naming | Anonymous inline block | Descriptive names (customer_order_stats, top_10_percent) |
Verdict: The CTE version is significantly more readable because each logical step is named and reads sequentially from top to bottom. The derived table version requires mentally unwrapping nested layers, which becomes increasingly difficult as query complexity grows. For production code and team environments, CTEs are strongly preferred.
Q11. Write a query to find the third highest salary per department using a CTE. [5 Marks]
Answer:
Assumption: Table employees(emp_id, name, department, salary) exists.
WITH salary_rankings AS (
-- Step 1: Rank salaries within each department
-- DENSE_RANK handles ties: two equal salaries both get rank 2,
-- and the next distinct salary gets rank 3
SELECT
emp_id,
name,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS salary_rank
FROM employees
),
third_highest AS (
-- Step 2: Filter to only rank 3
SELECT
emp_id,
name,
department,
salary,
salary_rank
FROM salary_rankings
WHERE salary_rank = 3
)
-- Step 3: Final output
SELECT
department,
name,
salary AS third_highest_salary
FROM third_highest
ORDER BY department, salary DESC;
Handling departments with fewer than 3 distinct salaries:
WITH salary_rankings AS (
SELECT
emp_id,
name,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS salary_rank,
COUNT(DISTINCT salary) OVER (
PARTITION BY department
) AS distinct_salary_count
FROM employees
)
SELECT
department,
name,
salary AS third_highest_salary
FROM salary_rankings
WHERE salary_rank = 3
AND distinct_salary_count >= 3 -- only departments with 3+ distinct salaries
ORDER BY department;
Why DENSE_RANK() over ROW_NUMBER(): If two employees in the same department share the second-highest salary (e.g., both earn 80000), `ROW_NUMBER()` would arbitrarily assign them ranks 2 and 3, making one appear as the "third highest" incorrectly. `DENSE_RANK()` assigns both rank 2 and correctly identifies the next distinct salary as rank 3.
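The tie behavior can be demonstrated concretely with an in-memory SQLite database via Python's sqlite3 module; the four Sales employees are hypothetical, with two tied at 80000.

```python
import sqlite3

# Minimal sketch comparing DENSE_RANK and ROW_NUMBER on tied salaries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'Sales', 90000),
        ('Bob',   'Sales', 80000),
        ('Carol', 'Sales', 80000),
        ('Dan',   'Sales', 70000);
""")

rows = conn.execute("""
    SELECT name, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
    FROM employees
""").fetchall()

# DENSE_RANK gives Bob and Carol rank 2 and Dan rank 3 (the true third
# highest distinct salary); ROW_NUMBER arbitrarily labels one tied row as 3.
for row in rows:
    print(row)
```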
Q12. Using CTEs, write a query to find customers who placed orders in every month of 2024. [5 Marks]
Answer:
Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.
WITH all_months AS (
-- Step 1: Generate all 12 months of 2024 as reference set
SELECT 1 AS month_num UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12
),
customer_active_months AS (
-- Step 2: Find distinct months each customer ordered in 2024
SELECT
customer_id,
MONTH(order_date) AS order_month
FROM orders
WHERE YEAR(order_date) = 2024
GROUP BY
customer_id,
MONTH(order_date)
),
customer_month_count AS (
-- Step 3: Count how many distinct months each customer was active
SELECT
customer_id,
COUNT(DISTINCT order_month) AS active_months
FROM customer_active_months
GROUP BY customer_id
)
-- Step 4: Keep only customers active in all 12 months
SELECT
c.customer_id,
c.name,
cm.active_months
FROM customers c
JOIN customer_month_count cm
ON c.customer_id = cm.customer_id
WHERE cm.active_months = 12 -- ordered in all 12 months
ORDER BY c.name;
Alternative: month generation with a recursive CTE (PostgreSQL):
WITH RECURSIVE all_months AS (
SELECT 1 AS month_num
UNION ALL
SELECT month_num + 1
FROM all_months
WHERE month_num < 12
),
customer_active_months AS (
SELECT
customer_id,
EXTRACT(MONTH FROM order_date) AS order_month
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024
GROUP BY customer_id, EXTRACT(MONTH FROM order_date)
),
customer_month_count AS (
SELECT
customer_id,
COUNT(DISTINCT order_month) AS active_months
FROM customer_active_months
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
cm.active_months
FROM customers c
JOIN customer_month_count cm
ON c.customer_id = cm.customer_id
WHERE cm.active_months = 12
ORDER BY c.name;
Step-by-step logic explained:
| CTE | Purpose |
|---|---|
| `all_months` | Reference list of all 12 months (defined for completeness; useful for cross-join gap detection) |
| `customer_active_months` | Deduplicates to one row per customer per month (handles multiple orders in same month) |
| `customer_month_count` | Counts distinct months each customer was active in 2024 |
| Final `WHERE active_months = 12` | Keeps only truly every-month customers |
Why GROUP BY before COUNT: A customer might place 5 orders in January. The `GROUP BY` in `customer_active_months` collapses them to a single row per customer per month, and `COUNT(DISTINCT order_month)` in the next step guards against the same double counting. Without deduplication, a customer with many orders in a single month could incorrectly appear to have covered all 12 months.
📘 Day 5: Window Functions - RANK, LAG, LEAD & Running Totals
Here are all the answers formatted for direct copy-paste:
Q1. What is the difference between an aggregate function and a window function? Give examples. [3 Marks]
Answer:
| Feature | Aggregate Function | Window Function |
|---|---|---|
| Rows returned | Collapses rows: one row per group | Preserves all rows: no collapsing |
| Requires GROUP BY | ✅ Yes (usually) | ❌ No, uses the OVER() clause instead |
| Can access individual row data alongside aggregation | ❌ No | ✅ Yes |
| Use case | Summarized totals, counts, averages | Rankings, running totals, row comparisons |
Aggregate function (collapses rows):
-- Returns ONE row per department; individual employee data is lost
SELECT
department,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
Window function (preserves all rows):
-- Returns EVERY employee row, with the department average computed alongside
SELECT
emp_id,
name,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM employees;
Sample output comparison:
Aggregate result (one row per department):
| department | avg_salary |
|---|---|
| Sales | 65000 |
| HR | 52000 |
Window function result (every employee preserved):
| name | department | salary | dept_avg_salary |
|---|---|---|---|
| Alice | Sales | 80000 | 65000 |
| Bob | Sales | 50000 | 65000 |
| Carol | HR | 52000 | 52000 |
Key insight: Window functions let you answer questions like "how does this employee's salary compare to their department average?" in a single query, something impossible with aggregate functions alone without a JOIN or subquery.
Q2. Explain the OVER clause with PARTITION BY and ORDER BY. Write a query showing employee salary alongside their department's average salary. [4 Marks]
Answer: The OVER() clause is what defines a function as a window function. It specifies the "window" of rows the function operates on for each row in the result.
Components of OVER():
| Component | Purpose | Optional? |
|---|---|---|
| `PARTITION BY` | Divides rows into independent groups (like GROUP BY for windows) | ✅ Optional |
| `ORDER BY` | Defines the order of rows within each partition | ✅ Optional (required for ranking/LAG/LEAD) |
| Frame clause (`ROWS`/`RANGE BETWEEN`) | Defines the subset of rows within each partition | ✅ Optional |
-- Anatomy of OVER()
SUM(salary) OVER (
PARTITION BY department -- reset the window for each department
ORDER BY hire_date -- order rows within each partition
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- window frame
)
Query: Employee salary alongside department average:
SELECT
e.emp_id,
e.name,
e.department,
e.salary,
ROUND(
AVG(e.salary) OVER (
PARTITION BY e.department -- compute avg separately per dept
), 2
) AS dept_avg_salary,
ROUND(
e.salary - AVG(e.salary) OVER (
PARTITION BY e.department
), 2
) AS diff_from_dept_avg,
ROUND(
AVG(e.salary) OVER () -- no PARTITION BY = entire table
, 2) AS company_avg_salary
FROM employees e
ORDER BY e.department, e.salary DESC;
Sample output:
| name | department | salary | dept_avg_salary | diff_from_dept_avg | company_avg_salary |
|---|---|---|---|---|---|
| Alice | Sales | 80000 | 65000 | +15000 | 61000 |
| Bob | Sales | 50000 | 65000 | -15000 | 61000 |
| Carol | HR | 60000 | 56000 | +4000 | 61000 |
PARTITION BY vs GROUP BY: `PARTITION BY` divides data into windows but keeps all rows visible. `GROUP BY` divides data into groups and collapses them into summary rows; individual row data is no longer accessible.
Q3. What is the difference between ROWS BETWEEN and RANGE BETWEEN in window frames? [4 Marks]
Answer: Both ROWS BETWEEN and RANGE BETWEEN define the window frame: the subset of rows within a partition that a window function considers for each row. The critical difference is how they handle ties.
| Feature | ROWS BETWEEN | RANGE BETWEEN |
|---|---|---|
| Based on | Physical row position (exact row count) | Logical value range (all rows with equal ORDER BY value) |
| Tie handling | Each tied row is treated as a separate, distinct row | All tied rows are included together in the same frame |
| Precision | More precise and predictable | Can include unexpected extra rows when ties exist |
| Performance | Generally faster | Slightly slower due to value-based comparison |
| Requires ORDER BY | Yes | Yes |
Example: Monthly sales with two months having the same total:
-- Sample data: Jan=1000, Feb=1000, Mar=2000, Apr=3000
-- ROWS BETWEEN: strictly physical rows
SELECT
month,
total_sales,
SUM(total_sales) OVER (
ORDER BY month
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW -- exactly 1 prior physical row
) AS rows_running_sum
FROM monthly_sales;
-- Jan: 1000, Feb: 2000 (Jan+Feb), Mar: 3000 (Feb+Mar), Apr: 5000 (Mar+Apr)
-- RANGE BETWEEN: includes all rows with same ORDER BY value in frame
SELECT
month,
total_sales,
SUM(total_sales) OVER (
ORDER BY total_sales
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS range_running_sum
FROM monthly_sales;
-- When total_sales=1000 appears in Jan AND Feb, RANGE includes both
-- even when calculating for Jan alone, because Feb has the same ORDER BY value
Most common window frame defaults:
-- Running total (explicit best practice)
SUM(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
-- Centered 3-row moving average
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)
-- Entire partition (no frame restriction needed)
AVG(amount) OVER (PARTITION BY department)
Best practice: Use ROWS BETWEEN for running totals and moving averages; it is predictable and unaffected by duplicate values. Use RANGE BETWEEN only when you explicitly want all rows sharing the same ORDER BY value treated as a single logical unit.
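The tie behavior can be verified with Python's built-in sqlite3 module, whose ROWS/RANGE frame semantics follow the SQL standard like MySQL 8+; the data below is the made-up Jan-Apr sample from above:

```python
import sqlite3

# Verify ROWS vs RANGE tie handling on the sample monthly_sales data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, total_sales INT)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("Jan", 1000), ("Feb", 1000), ("Mar", 2000), ("Apr", 3000)],
)

rows_sum = conn.execute("""
    SELECT month, SUM(total_sales) OVER (
        ORDER BY total_sales
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM monthly_sales
""").fetchall()

range_sum = conn.execute("""
    SELECT month, SUM(total_sales) OVER (
        ORDER BY total_sales
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM monthly_sales
""").fetchall()

print(rows_sum)   # Jan and Feb get different running sums (1000, then 2000)
print(range_sum)  # Jan and Feb both get 2000: tied rows share one frame
```

With ROWS the two tied months receive different running sums depending on physical order; with RANGE both tied months get the combined 2000, exactly the "extra rows" effect described above.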
Q4. Write a query to calculate a 7-day moving average of daily sales using window functions. [4 Marks]
Answer:
Assumption: Table daily_sales(sale_date, total_sales) exists with one row per day.
MySQL / PostgreSQL / SQL Server:
SELECT
sale_date,
total_sales AS daily_sales,
ROUND(
AVG(total_sales) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- current day + 6 prior days = 7 days
), 2
) AS moving_avg_7day,
COUNT(*) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS days_in_window -- useful to flag early rows
FROM daily_sales
ORDER BY sale_date;
With a flag for incomplete windows (first 6 days have fewer than 7 data points):
SELECT
sale_date,
total_sales,
CASE
WHEN COUNT(*) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) < 7
THEN NULL -- insufficient data, show NULL
ELSE ROUND(AVG(total_sales) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2)
END AS moving_avg_7day
FROM daily_sales
ORDER BY sale_date;
Sample output:
| sale_date | daily_sales | moving_avg_7day | days_in_window |
|---|---|---|---|
| 2024-01-01 | 5000 | NULL | 1 |
| 2024-01-02 | 6200 | NULL | 2 |
| ... | ... | NULL | ... |
| 2024-01-07 | 4800 | 5371.43 | 7 |
| 2024-01-08 | 7100 | 5628.57 | 7 |
Why ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: This frame includes the current day plus the 6 days before it, exactly 7 days total. Using RANGE BETWEEN here would treat any rows sharing the same sale_date as a single unit (and a 6 PRECEDING offset would need an INTERVAL in MySQL), so ROWS BETWEEN is the safer, more predictable choice for a one-row-per-day sliding window.
Q5. Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() with an example where they produce different results. [4 Marks]
Answer:
| Function | Handles Ties | Gap after tie? | Unique per row? |
|---|---|---|---|
| ROW_NUMBER() | Assigns unique sequential numbers; ties broken arbitrarily | N/A (always sequential) | Always unique |
| RANK() | Tied rows get the same rank; next rank skips numbers | Yes, creates gaps | Not unique on ties |
| DENSE_RANK() | Tied rows get the same rank; next rank is consecutive | No gaps | Not unique on ties |
Example: Salary ranking where 2 employees share the 2nd highest salary:
-- Sample data:
-- Alice: 90000, Bob: 75000, Carol: 75000, Dave: 60000
SELECT
name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM employees;
Result:
| name | salary | row_num | rnk | dense_rnk |
|---|---|---|---|---|
| Alice | 90000 | 1 | 1 | 1 |
| Bob | 75000 | 2 | 2 | 2 |
| Carol | 75000 | 3 | 2 | 2 |
| Dave | 60000 | 4 | 4 | 3 |
Observations:
- ROW_NUMBER(): Bob and Carol get different numbers (2 and 3) even though their salaries are equal; the order between them is arbitrary.
- RANK(): Both Bob and Carol get rank 2. Dave jumps to rank 4, skipping 3; the gap reflects the two people above him.
- DENSE_RANK(): Both Bob and Carol get rank 2. Dave gets rank 3; no gaps, consecutive ranking.
When to use each:
-- ROW_NUMBER: pagination, deduplication (need exactly 1 row per group)
-- RANK: sports leaderboards where gaps reflect true position ("tied for 2nd, no one in 3rd")
-- DENSE_RANK: finding Nth highest value without gaps (top 3 salary levels)
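A quick, runnable check of the result table above, sketched with Python's built-in sqlite3 module (these ranking functions behave the same as in MySQL 8+):

```python
import sqlite3

# Reproduce the Alice/Bob/Carol/Dave ranking comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 90000), ("Bob", 75000), ("Carol", 75000), ("Dave", 60000)],
)

result = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC),
           RANK()       OVER (ORDER BY salary DESC),
           DENSE_RANK() OVER (ORDER BY salary DESC)
    FROM employees
""").fetchall()

for name, row_num, rnk, dense_rnk in result:
    print(name, row_num, rnk, dense_rnk)
# Dave lands at row 4, RANK 4 (gap after the tie), DENSE_RANK 3 (no gap)
```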
Q6. Write a query to find the top 3 products by revenue in each category using DENSE_RANK. [4 Marks]
Answer:
Assumption: Tables products(product_id, product_name, category) and order_items(order_item_id, product_id, quantity, unit_price) exist.
WITH product_revenue AS (
-- Step 1: Calculate total revenue per product
SELECT
p.product_id,
p.product_name,
p.category,
SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM products p
JOIN order_items oi
ON p.product_id = oi.product_id
GROUP BY
p.product_id,
p.product_name,
p.category
),
ranked_products AS (
-- Step 2: Rank products within each category by revenue
SELECT
product_id,
product_name,
category,
total_revenue,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_revenue DESC
) AS revenue_rank
FROM product_revenue
)
-- Step 3: Filter top 3 per category
SELECT
category,
revenue_rank,
product_name,
total_revenue
FROM ranked_products
WHERE revenue_rank <= 3
ORDER BY
category ASC,
revenue_rank ASC;
Sample output:
| category | revenue_rank | product_name | total_revenue |
|---|---|---|---|
| Electronics | 1 | Laptop Pro | 500000 |
| Electronics | 2 | Smartphone X | 420000 |
| Electronics | 3 | Tablet Z | 310000 |
| Clothing | 1 | Winter Jacket | 180000 |
| Clothing | 2 | Running Shoes | 155000 |
| Clothing | 2 | Denim Jeans | 155000 |
Why DENSE_RANK() here: If two products tie for rank 2 (like Clothing above), DENSE_RANK() keeps both in the top 3 and correctly assigns rank 3 to the next product. ROW_NUMBER() would arbitrarily exclude one tied product. RANK() would skip rank 3 entirely and jump to rank 4.
Q7. What is NTILE? Write a query to divide customers into 4 quartiles based on their total spend. [4 Marks]
Answer: NTILE(n) is a window function that divides rows within a partition into n equal-sized buckets (as equal as possible) and assigns each row a bucket number from 1 to n. It is used for percentile grouping, quartile analysis, and tiered segmentation.
How NTILE handles uneven distribution:
- If rows don't divide evenly, earlier buckets receive one extra row
- Example: 10 rows with NTILE(3) produces buckets of 4, 3, 3
WITH customer_spend AS (
-- Step 1: Calculate total spend per customer
SELECT
customer_id,
SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
),
customer_quartiles AS (
-- Step 2: Assign quartile using NTILE(4)
SELECT
customer_id,
total_spend,
NTILE(4) OVER (
ORDER BY total_spend ASC -- ascending: quartile 1 = lowest spend
) AS quartile
FROM customer_spend
)
-- Step 3: Label quartiles and join customer details
SELECT
c.customer_id,
c.name,
cq.total_spend,
cq.quartile,
CASE cq.quartile
WHEN 1 THEN 'Q1 - Low Spenders'
WHEN 2 THEN 'Q2 - Below Average'
WHEN 3 THEN 'Q3 - Above Average'
WHEN 4 THEN 'Q4 - Top Spenders'
END AS spend_segment
FROM customers c
JOIN customer_quartiles cq
ON c.customer_id = cq.customer_id
ORDER BY cq.total_spend DESC;
Summary statistics per quartile:
-- Bonus: aggregate stats per quartile
SELECT
quartile,
COUNT(*) AS customer_count,
ROUND(MIN(total_spend), 2) AS min_spend,
ROUND(MAX(total_spend), 2) AS max_spend,
ROUND(AVG(total_spend), 2) AS avg_spend
FROM customer_quartiles
GROUP BY quartile
ORDER BY quartile;
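The uneven-bucket rule mentioned above (10 rows with NTILE(3) giving buckets of 4, 3, 3) can be confirmed with a small sketch using Python's built-in sqlite3 module, whose NTILE matches MySQL 8+ behavior:

```python
import sqlite3
from collections import Counter

# Check NTILE's bucket-size rule: earlier buckets get the extra rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 11)])

rows = conn.execute("SELECT v, NTILE(3) OVER (ORDER BY v) FROM t").fetchall()
sizes = Counter(bucket for _, bucket in rows)
print(sizes)  # bucket 1 has 4 rows; buckets 2 and 3 have 3 each
```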
Q8. Write a query to find the most recent order for each customer using ROW_NUMBER. [3 Marks]
Answer:
Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.
WITH ranked_orders AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.amount,
ROW_NUMBER() OVER (
PARTITION BY o.customer_id -- restart numbering for each customer
ORDER BY o.order_date DESC -- most recent order gets rank 1
) AS rn
FROM orders o
)
SELECT
c.customer_id,
c.name,
ro.order_id,
ro.order_date AS most_recent_order_date,
ro.amount AS most_recent_order_amount
FROM customers c
JOIN ranked_orders ro
ON c.customer_id = ro.customer_id
WHERE ro.rn = 1 -- keep only the most recent order per customer
ORDER BY ro.order_date DESC;
Why ROW_NUMBER() over MAX(order_date):
MAX(order_date) can find the latest date, but you would need an additional JOIN to retrieve the full order row (order_id, amount, etc.). ROW_NUMBER() lets you select the complete most recent order row in one clean step. Also, if two orders share the same latest date, ROW_NUMBER() picks exactly one (use RANK() instead if you want both returned in tie cases).
Q9. Write a query using LAG to calculate month-over-month growth in revenue. [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL:
WITH monthly_revenue AS (
-- Step 1: Aggregate revenue per month
SELECT
YEAR(order_date) AS yr,
MONTH(order_date) AS mth,
SUM(amount) AS monthly_revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
),
revenue_with_lag AS (
-- Step 2: Use LAG to bring in previous month's revenue
SELECT
yr,
mth,
monthly_revenue,
LAG(monthly_revenue, 1, 0) OVER (
ORDER BY yr, mth -- order chronologically across years
) AS prev_month_revenue
FROM monthly_revenue
)
-- Step 3: Calculate growth metrics
SELECT
yr AS year,
mth AS month,
monthly_revenue,
prev_month_revenue,
ROUND(monthly_revenue - prev_month_revenue, 2) AS revenue_change,
CASE
WHEN prev_month_revenue = 0 THEN NULL -- avoid division by zero for first month
ELSE ROUND(
(monthly_revenue - prev_month_revenue)
/ prev_month_revenue * 100, 2
)
END AS mom_growth_pct
FROM revenue_with_lag
ORDER BY yr, mth;
PostgreSQL version (cleaner date formatting):
WITH monthly_revenue AS (
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS month_label,
SUM(amount) AS monthly_revenue
FROM orders
GROUP BY TO_CHAR(order_date, 'YYYY-MM')
),
revenue_with_lag AS (
SELECT
month_label,
monthly_revenue,
LAG(monthly_revenue) OVER (
ORDER BY month_label
) AS prev_month_revenue
FROM monthly_revenue
)
SELECT
month_label,
monthly_revenue,
prev_month_revenue,
ROUND(
CASE
WHEN prev_month_revenue IS NULL OR prev_month_revenue = 0 THEN NULL
ELSE (monthly_revenue - prev_month_revenue)
/ prev_month_revenue * 100
END, 2
) AS mom_growth_pct
FROM revenue_with_lag
ORDER BY month_label;
Sample output:
| month | monthly_revenue | prev_month_revenue | revenue_change | mom_growth_pct |
|---|---|---|---|---|
| 2024-01 | 120000 | NULL | NULL | NULL |
| 2024-02 | 95000 | 120000 | -25000 | -20.83% |
| 2024-03 | 140000 | 95000 | +45000 | +47.37% |
Q10. What is the difference between LAG and LEAD? Give an example of each. [3 Marks]
Answer:
| Function | Direction | Accesses | Use case |
|---|---|---|---|
| LAG(col, n) | Backward (looks behind) | The row n rows before the current row | Compare with past values (MoM growth, previous price) |
| LEAD(col, n) | Forward (looks ahead) | The row n rows after the current row | Preview future values (next appointment, upcoming deadline) |
Both accept three arguments: LAG/LEAD(column, offset, default) where default is returned when no row exists at that offset.
LAG example: Compare each month's sales to the previous month:
SELECT
sale_month,
monthly_sales,
LAG(monthly_sales, 1, 0) OVER (
ORDER BY sale_month
) AS previous_month_sales
FROM monthly_sales;
-- January row: previous_month_sales = 0 (no prior row, default used)
-- February row: previous_month_sales = January's value
LEAD example: Show next month's sales target alongside the current month:
SELECT
sale_month,
monthly_sales,
LEAD(monthly_sales, 1, NULL) OVER (
ORDER BY sale_month
) AS next_month_sales
FROM monthly_sales;
-- December row: next_month_sales = NULL (no following row)
-- November row: next_month_sales = December's value
Combined LAG + LEAD in one query:
SELECT
sale_month,
LAG(monthly_sales) OVER (ORDER BY sale_month) AS prev_month,
monthly_sales AS current_month,
LEAD(monthly_sales) OVER (ORDER BY sale_month) AS next_month
FROM monthly_sales;
Q11. Write a query using FIRST_VALUE to find the first product purchased by each customer. [3 Marks]
Answer:
Assumption: Tables customers(customer_id, name), orders(order_id, customer_id, order_date), and order_items(order_item_id, order_id, product_id) and products(product_id, product_name) exist.
WITH customer_purchases AS (
SELECT
o.customer_id,
p.product_name,
o.order_date,
FIRST_VALUE(p.product_name) OVER (
PARTITION BY o.customer_id -- per customer
ORDER BY o.order_date ASC -- earliest order first
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING -- consider entire partition
) AS first_product_purchased
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
)
SELECT DISTINCT
c.customer_id,
c.name AS customer_name,
cp.first_product_purchased
FROM customers c
JOIN customer_purchases cp
ON c.customer_id = cp.customer_id
ORDER BY c.customer_id;
Alternative â Using ROW_NUMBER (more portable across databases):
WITH ranked_purchases AS (
SELECT
o.customer_id,
p.product_name,
o.order_date,
ROW_NUMBER() OVER (
PARTITION BY o.customer_id
ORDER BY o.order_date ASC
) AS rn
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
)
SELECT
c.customer_id,
c.name,
rp.product_name AS first_product_purchased,
rp.order_date AS first_order_date
FROM customers c
JOIN ranked_purchases rp
ON c.customer_id = rp.customer_id
WHERE rp.rn = 1
ORDER BY c.customer_id;
Why specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING with FIRST_VALUE: Without an explicit frame, many databases default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For FIRST_VALUE this works fine since we always look backwards, but specifying the full frame is explicit best practice and avoids subtle bugs when the window ordering has ties.
Q12. Write a query to calculate the cumulative percentage of total sales for each product (Pareto analysis). [5 Marks]
Answer: A Pareto analysis (80/20 rule) identifies the products contributing the most to total revenue. Cumulative percentage helps identify the top 20% of products driving ~80% of sales.
Assumption: Tables products(product_id, product_name, category) and order_items(order_item_id, product_id, quantity, unit_price) exist.
WITH product_revenue AS (
-- Step 1: Calculate total revenue per product
SELECT
p.product_id,
p.product_name,
p.category,
SUM(oi.quantity * oi.unit_price) AS product_revenue
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name, p.category
),
revenue_ranked AS (
-- Step 2: Rank products and compute cumulative revenue
SELECT
product_id,
product_name,
category,
product_revenue,
SUM(product_revenue) OVER () AS grand_total_revenue,
ROUND(
product_revenue * 100.0
/ SUM(product_revenue) OVER (), 2
) AS pct_of_total,
ROUND(
SUM(product_revenue) OVER (
ORDER BY product_revenue DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) * 100.0
/ SUM(product_revenue) OVER (), 2
) AS cumulative_pct,
ROW_NUMBER() OVER (
ORDER BY product_revenue DESC
) AS revenue_rank
FROM product_revenue
)
-- Step 3: Label Pareto segment
SELECT
revenue_rank,
product_name,
category,
product_revenue,
pct_of_total,
cumulative_pct,
CASE
WHEN cumulative_pct <= 80 THEN 'Top 80% Revenue - Vital Few'
ELSE 'Bottom 20% Revenue - Trivial Many'
END AS pareto_segment
FROM revenue_ranked
ORDER BY revenue_rank;
Sample output:
| rank | product_name | product_revenue | pct_of_total | cumulative_pct | pareto_segment |
|---|---|---|---|---|---|
| 1 | Laptop Pro | 500000 | 28.50% | 28.50% | Top 80% Revenue - Vital Few |
| 2 | Smartphone X | 420000 | 23.94% | 52.44% | Top 80% Revenue - Vital Few |
| 3 | Tablet Z | 310000 | 17.67% | 70.11% | Top 80% Revenue - Vital Few |
| 4 | Headphones | 180000 | 10.26% | 80.37% | Bottom 20% Revenue - Trivial Many |
| 5 | Mouse | 50000 | 2.85% | 83.22% | Bottom 20% Revenue - Trivial Many |
Business insight: In the output above, just 3 products (out of many) generate over 70% of revenue, a classic Pareto pattern. Marketing and inventory decisions should prioritize these "Vital Few" products.
Q13. Write a query to identify customers whose spend decreased for 3 consecutive months. [5 Marks]
Answer:
Assumption: Tables orders(order_id, customer_id, order_date, amount) and customers(customer_id, name) exist.
WITH monthly_spend AS (
-- Step 1: Total spend per customer per month
SELECT
customer_id,
DATE_FORMAT(order_date, '%Y-%m') AS spend_month, -- MySQL
-- TO_CHAR(order_date, 'YYYY-MM') AS spend_month, -- PostgreSQL
SUM(amount) AS monthly_spend
FROM orders
GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
),
spend_with_lag AS (
-- Step 2: Bring in previous 2 months' spend using LAG
SELECT
customer_id,
spend_month,
monthly_spend,
LAG(monthly_spend, 1) OVER (
PARTITION BY customer_id
ORDER BY spend_month
) AS spend_1_month_ago,
LAG(monthly_spend, 2) OVER (
PARTITION BY customer_id
ORDER BY spend_month
) AS spend_2_months_ago
FROM monthly_spend
),
consecutive_declines AS (
-- Step 3: Flag months where spend has declined for 3 consecutive months
SELECT
customer_id,
spend_month,
spend_2_months_ago AS month_1_spend,
spend_1_month_ago AS month_2_spend,
monthly_spend AS month_3_spend
FROM spend_with_lag
WHERE
spend_2_months_ago IS NOT NULL -- need 3 months of data
AND spend_1_month_ago IS NOT NULL
AND monthly_spend < spend_1_month_ago -- month 3 < month 2
AND spend_1_month_ago < spend_2_months_ago -- month 2 < month 1
)
-- Step 4: Join with customers for final output
SELECT DISTINCT
c.customer_id,
c.name,
cd.spend_month AS third_declining_month,
cd.month_1_spend,
cd.month_2_spend,
cd.month_3_spend,
ROUND(
(cd.month_3_spend - cd.month_1_spend)
/ cd.month_1_spend * 100, 2
) AS total_decline_pct
FROM customers c
JOIN consecutive_declines cd
ON c.customer_id = cd.customer_id
ORDER BY total_decline_pct ASC; -- worst declines first
Step-by-step logic:
| CTE | Purpose |
|---|---|
| monthly_spend | Aggregates total spend per customer per month |
| spend_with_lag | Uses LAG(1) and LAG(2) to bring the previous 2 months' spend into the current row |
| consecutive_declines | Filters rows where all three months show strict decline: M1 > M2 > M3 |
| Final SELECT | Joins customer names and calculates total % decline across the 3-month window |
Sample output:
| name | third_declining_month | month_1_spend | month_2_spend | month_3_spend | total_decline_pct |
|---|---|---|---|---|---|
| Alice | 2024-04 | 8000 | 6500 | 4200 | -47.50% |
| Bob | 2024-07 | 5000 | 4100 | 3300 | -34.00% |
Business use case: This query powers churn risk detection: customers with 3 consecutive months of declining spend are high-priority candidates for re-engagement campaigns. The total_decline_pct column helps prioritize outreach by severity of the decline.
📅 Day 6: Advanced Joins, Set Operations & CASE Logic
Here are all the answers formatted for direct copy-paste:
Q1. Explain the difference between UNION and UNION ALL. Which is faster and why? [3 Marks]
Answer:
| Feature | UNION | UNION ALL |
|---|---|---|
| Duplicate rows | Removes duplicates from final result | Keeps all rows including duplicates |
| Extra operation | Performs a DISTINCT sort/hash pass | No deduplication step |
| Performance | Slower: extra work to find and remove duplicates | Faster: no deduplication overhead |
| Result order | Not guaranteed unless ORDER BY added | Not guaranteed unless ORDER BY added |
| Use when | You need unique rows across both result sets | You know results are distinct OR duplicates are acceptable |
-- UNION: removes duplicates (customer 101 appears in both but shows once)
SELECT customer_id FROM orders_2023
UNION
SELECT customer_id FROM orders_2024;
-- UNION ALL: keeps all rows including duplicates (faster)
SELECT customer_id FROM orders_2023
UNION ALL
SELECT customer_id FROM orders_2024;
Why UNION ALL is faster:
UNION internally applies a deduplication step, similar to running SELECT DISTINCT on the combined result. This requires either sorting the entire result set or building a hash table to detect duplicates; both are memory and CPU intensive on large datasets. UNION ALL simply concatenates the two result sets with no extra processing.
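A minimal sketch of the deduplication difference using Python's built-in sqlite3 module, with a made-up customer (101) appearing in both year tables:

```python
import sqlite3

# Made-up per-year order tables with one overlapping customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_2023 (customer_id INT)")
conn.execute("CREATE TABLE orders_2024 (customer_id INT)")
conn.executemany("INSERT INTO orders_2023 VALUES (?)", [(101,), (102,)])
conn.executemany("INSERT INTO orders_2024 VALUES (?)", [(101,), (103,)])

union_rows = conn.execute(
    "SELECT customer_id FROM orders_2023 "
    "UNION SELECT customer_id FROM orders_2024"
).fetchall()
union_all_rows = conn.execute(
    "SELECT customer_id FROM orders_2023 "
    "UNION ALL SELECT customer_id FROM orders_2024"
).fetchall()

print(len(union_rows))      # 3: customer 101 deduplicated
print(len(union_all_rows))  # 4: customer 101 appears twice
```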
Best practice: Always use UNION ALL unless you specifically need duplicate elimination. If you know the two queries return mutually exclusive rows (e.g., partitioned by year), UNION ALL is the right choice.
Q2. Write a query to find customers who bought Product A but not Product B using EXCEPT. [4 Marks]
Answer:
Assumption: Tables customers(customer_id, name), orders(order_id, customer_id, order_date), order_items(order_item_id, order_id, product_id), and products(product_id, product_name) exist.
Method 1: Using EXCEPT (PostgreSQL / SQL Server; also MySQL 8.0.31+):
-- Customers who bought Product A
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Product A'
EXCEPT
-- Customers who bought Product B
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Product B';
With customer names (wrap in CTE):
WITH bought_a AS (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Product A'
),
bought_b AS (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Product B'
),
a_not_b AS (
SELECT customer_id FROM bought_a
EXCEPT
SELECT customer_id FROM bought_b
)
SELECT
c.customer_id,
c.name
FROM customers c
JOIN a_not_b ab ON c.customer_id = ab.customer_id
ORDER BY c.name;
Method 2: Using NOT EXISTS (works in all MySQL versions):
SELECT DISTINCT
c.customer_id,
c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Product A'
AND NOT EXISTS (
SELECT 1
FROM orders o2
JOIN order_items oi2 ON o2.order_id = oi2.order_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE o2.customer_id = c.customer_id
AND p2.product_name = 'Product B'
)
ORDER BY c.name;
Note: MySQL added EXCEPT support only in version 8.0.31. On earlier versions, use NOT EXISTS, NOT IN, or a LEFT JOIN ... IS NULL anti-join pattern instead.
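To confirm that the anti-join pattern really matches EXCEPT, here is a simplified sketch using Python's built-in sqlite3 module; the one-column bought_a/bought_b tables are made-up stand-ins for the full CTEs above:

```python
import sqlite3

# Made-up "bought Product A" / "bought Product B" customer sets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bought_a (customer_id INT)")
conn.execute("CREATE TABLE bought_b (customer_id INT)")
conn.executemany("INSERT INTO bought_a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO bought_b VALUES (?)", [(2,)])

except_rows = conn.execute(
    "SELECT customer_id FROM bought_a "
    "EXCEPT SELECT customer_id FROM bought_b"
).fetchall()

# LEFT JOIN ... IS NULL anti-join: keep A rows with no match in B.
anti_join_rows = conn.execute("""
    SELECT DISTINCT a.customer_id
    FROM bought_a a
    LEFT JOIN bought_b b ON a.customer_id = b.customer_id
    WHERE b.customer_id IS NULL
""").fetchall()

print(sorted(except_rows))     # customers 1 and 3
print(sorted(anti_join_rows))  # same result as EXCEPT
```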
Q3. What is INTERSECT? Write a query to find customers who made purchases in both 2023 and 2024. [4 Marks]
Answer: INTERSECT returns only the rows that appear in both result sets; it is the SQL equivalent of a set intersection. Like UNION, it automatically removes duplicates from the final output.
Set A ∩ Set B = rows present in BOTH A and B
PostgreSQL / SQL Server:
-- Customers who purchased in 2023
SELECT DISTINCT customer_id
FROM orders
WHERE YEAR(order_date) = 2023 -- use EXTRACT(YEAR FROM order_date) in PostgreSQL
INTERSECT
-- Customers who purchased in 2024
SELECT DISTINCT customer_id
FROM orders
WHERE YEAR(order_date) = 2024;
With customer details (CTE wrapper):
WITH active_both_years AS (
SELECT customer_id FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2023
INTERSECT
SELECT customer_id FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024
)
SELECT
c.customer_id,
c.name,
c.email
FROM customers c
JOIN active_both_years ab
ON c.customer_id = ab.customer_id
ORDER BY c.name;
MySQL alternative (INTERSECT requires MySQL 8.0.31+):
-- Using INNER JOIN as INTERSECT equivalent
SELECT DISTINCT
c.customer_id,
c.name
FROM customers c
WHERE c.customer_id IN (
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2023
)
AND c.customer_id IN (
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024
)
ORDER BY c.name;
Set operations comparison:
| Operation | Returns |
|---|---|
| UNION | All rows from A + B (no duplicates) |
| UNION ALL | All rows from A + B (with duplicates) |
| INTERSECT | Only rows in both A and B |
| EXCEPT / MINUS | Rows in A but not in B |
Q4. What are the rules for columns in a UNION query? [4 Marks]
Answer: For a UNION (or UNION ALL, INTERSECT, EXCEPT) query to be valid, all combined SELECT statements must follow these rules:
Rule 1: Same number of columns:
-- Invalid: different column counts
SELECT customer_id, name FROM customers
UNION
SELECT product_id FROM products; -- ERROR: different number of columns
-- Valid: same column count
SELECT customer_id, name FROM customers
UNION
SELECT product_id, product_name FROM products;
Rule 2: Compatible data types in corresponding positions:
-- Risky: mismatched types (string vs number)
SELECT customer_id, name FROM customers -- name is VARCHAR
UNION
SELECT product_id, price FROM products; -- price is DECIMAL; may fail or coerce
-- Valid: cast to a compatible type (MySQL uses CHAR in CAST)
SELECT customer_id, CAST(name AS CHAR(100)) FROM customers
UNION
SELECT product_id, CAST(price AS CHAR(100)) FROM products;
Rule 3: Column names come from the FIRST SELECT only:
-- The result column is named 'customer_id' and 'name' from the first SELECT
-- The second SELECT's column names (product_id, product_name) are ignored
SELECT customer_id, name FROM customers
UNION
SELECT product_id, product_name FROM products;
-- Result columns: customer_id, name â from first SELECT
Rule 4: ORDER BY applies to the final result, not individual SELECTs:
-- Invalid: ORDER BY inside an individual SELECT
SELECT customer_id FROM customers ORDER BY customer_id -- ERROR
UNION
SELECT product_id FROM products;
-- Valid: ORDER BY at the very end applies to the full result
SELECT customer_id AS id FROM customers
UNION
SELECT product_id AS id FROM products
ORDER BY id; -- single ORDER BY at the end
Rule 5: Use aliases from the first SELECT for ORDER BY and referencing:
SELECT customer_id AS entity_id, name AS entity_name FROM customers
UNION ALL
SELECT product_id, product_name FROM products
ORDER BY entity_name ASC; -- reference the alias from first SELECT
Q5. Explain the difference between simple CASE and searched CASE. Give one example of each. [4 Marks]
Answer:
| Feature | Simple CASE | Searched CASE |
|---|---|---|
| Syntax style | CASE column WHEN value THEN result | CASE WHEN condition THEN result |
| Comparison type | Equality only (=): compares one expression to fixed values | Any condition: supports >, <, BETWEEN, LIKE, AND/OR, subqueries |
| Flexibility | Less flexible: one expression, multiple equal values | More flexible: each WHEN is an independent Boolean expression |
| Use case | Mapping fixed codes/values to labels | Range checks, complex conditions, multi-column logic |
Simple CASE: equality comparison against a single expression:
-- Map numeric status code to human-readable label
SELECT
order_id,
amount,
CASE status_code
WHEN 1 THEN 'Pending'
WHEN 2 THEN 'Processing'
WHEN 3 THEN 'Shipped'
WHEN 4 THEN 'Delivered'
WHEN 5 THEN 'Cancelled'
ELSE 'Unknown'
END AS order_status
FROM orders;
Searched CASE: independent Boolean condition per WHEN:
-- Categorize salary into bands using range conditions
SELECT
emp_id,
name,
salary,
CASE
WHEN salary >= 100000 THEN 'Executive'
WHEN salary >= 70000
AND salary < 100000 THEN 'Senior'
WHEN salary >= 40000
AND salary < 70000 THEN 'Mid-Level'
WHEN salary < 40000 THEN 'Junior'
ELSE 'Not Classified'
END AS salary_band
FROM employees;
Key rule: SQL evaluates CASE conditions top to bottom and returns the result of the first matching WHEN. Order your conditions from most specific to most general to avoid incorrect matches.
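A tiny sketch demonstrating the top-to-bottom rule, using Python's built-in sqlite3 module; the 120000 salary literal is a made-up test value:

```python
import sqlite3

# With the general condition first, 120000 matches 'Senior' and the
# 'Executive' branch is never tested.
conn = sqlite3.connect(":memory:")

wrong = conn.execute("""
    SELECT CASE
        WHEN 120000 >= 70000  THEN 'Senior'     -- too general, matches first
        WHEN 120000 >= 100000 THEN 'Executive'  -- never reached
    END
""").fetchone()[0]

right = conn.execute("""
    SELECT CASE
        WHEN 120000 >= 100000 THEN 'Executive'  -- most specific first
        WHEN 120000 >= 70000  THEN 'Senior'
    END
""").fetchone()[0]

print(wrong, right)  # Senior Executive
```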
Q6. Write a query to categorize customers into 'Platinum', 'Gold', 'Silver', 'Bronze' based on their total spend. [4 Marks]
Answer:
Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.
WITH customer_spend AS (
-- Step 1: Calculate lifetime total spend per customer
SELECT
customer_id,
SUM(amount) AS total_spend,
COUNT(order_id) AS total_orders
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
c.email,
COALESCE(cs.total_spend, 0) AS total_spend,
COALESCE(cs.total_orders, 0) AS total_orders,
CASE
WHEN cs.total_spend >= 100000 THEN 'Platinum'
WHEN cs.total_spend >= 50000 THEN 'Gold'
WHEN cs.total_spend >= 10000 THEN 'Silver'
WHEN cs.total_spend > 0 THEN 'Bronze'
ELSE 'No Purchase'
END AS customer_tier,
CASE
WHEN cs.total_spend >= 100000 THEN 1
WHEN cs.total_spend >= 50000 THEN 2
WHEN cs.total_spend >= 10000 THEN 3
WHEN cs.total_spend > 0 THEN 4
ELSE 5
END AS tier_rank -- for sorting
FROM customers c
LEFT JOIN customer_spend cs
ON c.customer_id = cs.customer_id
ORDER BY
tier_rank ASC,
total_spend DESC;
Summary stats by tier:
-- Bonus: tier distribution report
WITH customer_spend AS (
SELECT customer_id, SUM(amount) AS total_spend
FROM orders GROUP BY customer_id
),
tiered AS (
SELECT
CASE
WHEN total_spend >= 100000 THEN 'Platinum'
WHEN total_spend >= 50000 THEN 'Gold'
WHEN total_spend >= 10000 THEN 'Silver'
ELSE 'Bronze'
END AS tier,
total_spend
FROM customer_spend
)
SELECT
tier,
COUNT(*) AS customer_count,
ROUND(SUM(total_spend), 2) AS tier_total_revenue,
ROUND(AVG(total_spend), 2) AS avg_spend_per_customer
FROM tiered
GROUP BY tier
ORDER BY MIN(total_spend) DESC;
Q7. Write a query using CASE inside an aggregate to count how many orders were 'High Value' (>10000) and 'Low Value' in each region. [4 Marks]
Answer:
Assumption: Tables orders(order_id, customer_id, region, order_date, amount) exist.
SELECT
region,
COUNT(order_id) AS total_orders,
-- Conditional aggregation: count High Value orders
SUM(
CASE WHEN amount > 10000 THEN 1 ELSE 0 END
) AS high_value_orders,
-- Conditional aggregation: count Low Value orders
SUM(
CASE WHEN amount <= 10000 THEN 1 ELSE 0 END
) AS low_value_orders,
-- Alternative using COUNT with filter
COUNT(
CASE WHEN amount > 10000 THEN 1 END
) AS high_value_count_v2,
-- Percentage breakdown
ROUND(
SUM(CASE WHEN amount > 10000 THEN 1 ELSE 0 END)
* 100.0 / COUNT(order_id), 2
) AS high_value_pct,
-- Revenue split
ROUND(
SUM(CASE WHEN amount > 10000 THEN amount ELSE 0 END), 2
) AS high_value_revenue,
ROUND(
SUM(CASE WHEN amount <= 10000 THEN amount ELSE 0 END), 2
) AS low_value_revenue
FROM orders
GROUP BY region
ORDER BY total_orders DESC;
Sample output:
| region | total_orders | high_value_orders | low_value_orders | high_value_pct | high_value_revenue |
|---|---|---|---|---|---|
| North | 850 | 210 | 640 | 24.71% | 3200000 |
| South | 620 | 95 | 525 | 15.32% | 1450000 |
| East | 440 | 180 | 260 | 40.91% | 2800000 |
Why SUM(CASE...) over COUNT(CASE...): COUNT(expr) counts non-NULL values, so COUNT(CASE WHEN ... THEN 1 END) works because unmatched rows return NULL and are not counted. SUM(CASE WHEN ... THEN 1 ELSE 0 END) is more explicit and readable. Both produce identical results; choose whichever is clearer to your team.
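The equivalence of the two counting styles can be verified with a small sketch using Python's built-in sqlite3 module (made-up orders data):

```python
import sqlite3

# Check that SUM(CASE ... 1 ELSE 0) and COUNT(CASE ... 1 END) agree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 15000), ("North", 8000), ("North", 12000), ("South", 5000)],
)

rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN amount > 10000 THEN 1 ELSE 0 END),
           COUNT(CASE WHEN amount > 10000 THEN 1 END)
    FROM orders
    GROUP BY region
""").fetchall()

print(rows)  # both conditional-count columns agree per region
```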
Q8. Can you use CASE in ORDER BY clause? Give an example of custom sorting. [3 Marks]
Answer: Yes, CASE can be used directly in the ORDER BY clause to implement custom sort logic: sorting rows in a business-defined order rather than alphabetically or numerically.
Example 1 – Sort order status by business priority (not alphabetically):
SELECT
order_id,
customer_id,
status,
amount
FROM orders
ORDER BY
CASE status
WHEN 'Critical' THEN 1 -- show most urgent first
WHEN 'Processing' THEN 2
WHEN 'Pending' THEN 3
WHEN 'Shipped' THEN 4
WHEN 'Delivered' THEN 5
WHEN 'Cancelled' THEN 6
ELSE 7 -- unknown statuses last
END ASC,
order_date ASC; -- secondary sort by date within each status
Example 2 – Sort employees with managers first, then by salary:
SELECT
emp_id,
name,
manager_id,
salary
FROM employees
ORDER BY
CASE
WHEN manager_id IS NULL THEN 0 -- top-level (no manager) first
ELSE 1
END ASC,
salary DESC; -- within each group, highest salary first
Example 3 – Dynamic sort direction based on a flag:
-- Sort ascending for 'Gold'/'Platinum' customers, descending for others
SELECT customer_id, name, tier, total_spend
FROM customer_summary
ORDER BY
CASE WHEN tier IN ('Gold', 'Platinum')
THEN total_spend END DESC, -- high spenders in premium tiers first
CASE WHEN tier NOT IN ('Gold', 'Platinum')
THEN total_spend END ASC; -- low spenders in lower tiers last
Rule: `CASE` in `ORDER BY` returns a value (number or string) that SQL uses as the sort key. The actual `CASE` expression never appears in the output; it only affects row ordering.
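The priority-sort idea from Example 1 can be demonstrated with Python's built-in sqlite3 (sample orders invented; a subset of the statuses is enough to show the effect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "Delivered"), (2, "Critical"), (3, "Pending")])

# CASE maps each status to a numeric priority that becomes the sort key.
rows = conn.execute("""
    SELECT order_id, status
    FROM orders
    ORDER BY CASE status
                 WHEN 'Critical'  THEN 1
                 WHEN 'Pending'   THEN 3
                 WHEN 'Delivered' THEN 5
                 ELSE 7
             END
""").fetchall()
print([r[1] for r in rows])  # ['Critical', 'Pending', 'Delivered']
```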
Q9. Write a query to find gaps in a sequence of order_ids (missing IDs). [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists where order_id should be a sequential integer with no gaps.
Method 1 – Using LAG to find gaps (PostgreSQL / SQL Server / MySQL 8+):
WITH ordered_ids AS (
SELECT
order_id,
LAG(order_id) OVER (ORDER BY order_id) AS prev_order_id
FROM orders
),
gaps AS (
SELECT
prev_order_id + 1 AS gap_start,
order_id - 1 AS gap_end,
order_id - prev_order_id - 1 AS missing_count
FROM ordered_ids
WHERE order_id - prev_order_id > 1 -- gap exists when difference > 1
)
SELECT
gap_start,
gap_end,
missing_count,
CONCAT(
'Missing IDs: ', gap_start,
CASE WHEN gap_start < gap_end
THEN CONCAT(' to ', gap_end)
ELSE ''
END
) AS gap_description
FROM gaps
ORDER BY gap_start;
Method 2 – Using LEAD to look forward (window function results cannot be filtered in WHERE or HAVING, so filter in an outer query):
SELECT current_id, next_id, missing_between
FROM (
    SELECT
        order_id AS current_id,
        LEAD(order_id) OVER (ORDER BY order_id) AS next_id,
        LEAD(order_id) OVER (ORDER BY order_id)
            - order_id - 1 AS missing_between
    FROM orders
) gaps
WHERE missing_between > 0;
Method 3 – Generate full sequence and LEFT JOIN (most explicit):
-- PostgreSQL: generate_series
WITH full_sequence AS (
SELECT generate_series(
(SELECT MIN(order_id) FROM orders),
(SELECT MAX(order_id) FROM orders)
) AS expected_id
)
SELECT
fs.expected_id AS missing_order_id
FROM full_sequence fs
LEFT JOIN orders o ON fs.expected_id = o.order_id
WHERE o.order_id IS NULL
ORDER BY fs.expected_id;
Sample output (Method 1):
| gap_start | gap_end | missing_count | gap_description |
|---|---|---|---|
| 4 | 4 | 1 | Missing IDs: 4 |
| 8 | 11 | 4 | Missing IDs: 8 to 11 |
| 15 | 15 | 1 | Missing IDs: 15 |
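Method 1 can be tested end to end with Python's built-in sqlite3 (window functions require SQLite 3.25+, which ships with modern Python). The order IDs below are invented so that the gaps are known in advance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [(1,), (2,), (3,), (5,), (6,), (9,)])

# A gap exists wherever consecutive IDs differ by more than 1.
gaps = conn.execute("""
    WITH ordered_ids AS (
        SELECT order_id,
               LAG(order_id) OVER (ORDER BY order_id) AS prev_id
        FROM orders
    )
    SELECT prev_id + 1 AS gap_start, order_id - 1 AS gap_end
    FROM ordered_ids
    WHERE order_id - prev_id > 1
""").fetchall()
print(gaps)  # [(4, 4), (7, 8)]
```

The first row's `prev_id` is NULL, so its comparison is never true, which is exactly the behavior we want for the start of the sequence.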
Q10. Explain how you would join two tables when the join condition involves a date range (e.g., matching a transaction date with a pricing table effective_from and effective_to). [6 Marks]
Answer: A date range join (also called a temporal join or non-equi join) matches rows based on whether a date falls within a valid date range rather than an exact key match. This is common in slowly changing dimension (SCD) scenarios: price histories, tax rate tables, salary bands, and exchange rates.
Schema setup:
-- transactions: each sale with its transaction date
-- transactions(txn_id, product_id, txn_date, quantity)
-- pricing: price valid between effective_from and effective_to
-- pricing(price_id, product_id, unit_price, effective_from, effective_to)
-- Sample pricing data:
-- product_id=1, price=100, valid 2024-01-01 to 2024-03-31
-- product_id=1, price=120, valid 2024-04-01 to 2024-12-31
Basic date range join:
SELECT
t.txn_id,
t.product_id,
t.txn_date,
t.quantity,
p.unit_price,
t.quantity * p.unit_price AS total_value
FROM transactions t
JOIN pricing p
ON t.product_id = p.product_id -- same product
AND t.txn_date >= p.effective_from -- transaction after price start
AND t.txn_date <= p.effective_to -- transaction before price end
ORDER BY t.txn_date;
Handling open-ended pricing (effective_to = NULL means "current price"):
SELECT
t.txn_id,
t.product_id,
t.txn_date,
t.quantity,
p.unit_price,
t.quantity * p.unit_price AS total_value
FROM transactions t
JOIN pricing p
ON t.product_id = p.product_id
AND t.txn_date >= p.effective_from
AND (
t.txn_date <= p.effective_to -- within closed range
OR p.effective_to IS NULL -- OR price is still active (open-ended)
)
ORDER BY t.txn_id;
Handling overlapping date ranges (defensive – pick most recent effective price):
-- If pricing table has overlapping ranges due to data quality issues,
-- use ROW_NUMBER to pick the most recently effective price
WITH ranked_prices AS (
SELECT
t.txn_id,
t.product_id,
t.txn_date,
t.quantity,
p.unit_price,
p.effective_from,
ROW_NUMBER() OVER (
PARTITION BY t.txn_id
ORDER BY p.effective_from DESC -- most recent price wins
) AS rn
FROM transactions t
JOIN pricing p
ON t.product_id = p.product_id
AND t.txn_date >= p.effective_from
AND (t.txn_date <= p.effective_to OR p.effective_to IS NULL)
)
SELECT
txn_id,
product_id,
txn_date,
quantity,
unit_price,
quantity * unit_price AS total_value
FROM ranked_prices
WHERE rn = 1 -- keep only best-match price row
ORDER BY txn_date;
Common pitfalls and solutions:
| Pitfall | Cause | Solution |
|---|---|---|
| Multiple price rows matched | Overlapping date ranges in pricing table | Use ROW_NUMBER() to pick one |
| No price row matched | Gap in pricing history | Use LEFT JOIN and flag NULL prices |
| NULL effective_to not handled | Current price has no end date | Add OR effective_to IS NULL |
| Performance on large tables | Full scan due to non-equi condition | Index on (product_id, effective_from, effective_to) |
Performance tip: Date range joins cannot use standard equality index lookups. Create a composite index on `(product_id, effective_from)`: the equality on `product_id` reduces the scan, then the range on `effective_from` narrows it further. For very large pricing tables, consider a partitioned or materialized approach.
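The basic range join, including the NULL `effective_to` case, can be exercised with Python's built-in sqlite3. The transactions and price rows below are invented to match the sample pricing data sketched above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (txn_id INTEGER, product_id INTEGER,
                               txn_date TEXT, quantity INTEGER);
    CREATE TABLE pricing (product_id INTEGER, unit_price REAL,
                          effective_from TEXT, effective_to TEXT);
    INSERT INTO transactions VALUES (1, 1, '2024-02-15', 2),
                                    (2, 1, '2024-05-01', 1);
    INSERT INTO pricing VALUES (1, 100, '2024-01-01', '2024-03-31'),
                               (1, 120, '2024-04-01', NULL);  -- open-ended
""")

# ISO-formatted date strings compare correctly as text in SQLite.
rows = conn.execute("""
    SELECT t.txn_id, p.unit_price, t.quantity * p.unit_price AS total
    FROM transactions t
    JOIN pricing p
      ON t.product_id = p.product_id
     AND t.txn_date >= p.effective_from
     AND (t.txn_date <= p.effective_to OR p.effective_to IS NULL)
    ORDER BY t.txn_id
""").fetchall()
print(rows)  # [(1, 100.0, 200.0), (2, 120.0, 120.0)]
```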
Q11. Write a query to compare current month sales vs previous month sales vs same month last year for each product category. [5 Marks]
Answer:
Assumption: Tables orders(order_id, customer_id, order_date, amount) and order_items(order_item_id, order_id, product_id, quantity, unit_price) and products(product_id, product_name, category) exist.
WITH monthly_category_sales AS (
-- Step 1: Aggregate revenue per category per year-month
SELECT
p.category,
YEAR(o.order_date) AS yr,
MONTH(o.order_date) AS mth,
SUM(oi.quantity * oi.unit_price) AS monthly_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category, YEAR(o.order_date), MONTH(o.order_date)
),
sales_with_comparisons AS (
-- Step 2: Use LAG to bring in previous month and same month last year
SELECT
category,
yr,
mth,
monthly_revenue AS current_month_sales,
LAG(monthly_revenue, 1) OVER (
PARTITION BY category
ORDER BY yr, mth
) AS prev_month_sales,
LAG(monthly_revenue, 12) OVER (
PARTITION BY category
ORDER BY yr, mth
) AS same_month_last_year_sales
FROM monthly_category_sales
)
-- Step 3: Calculate growth metrics and filter to current month
SELECT
category,
yr AS year,
mth AS month,
ROUND(current_month_sales, 2) AS current_month_sales,
ROUND(COALESCE(prev_month_sales, 0), 2) AS prev_month_sales,
ROUND(COALESCE(same_month_last_year_sales, 0), 2) AS same_month_ly_sales,
-- Month-over-month change
ROUND(
CASE
WHEN prev_month_sales IS NULL
OR prev_month_sales = 0 THEN NULL
ELSE (current_month_sales - prev_month_sales)
/ prev_month_sales * 100
END, 2
) AS mom_growth_pct,
-- Year-over-year change
ROUND(
CASE
WHEN same_month_last_year_sales IS NULL
OR same_month_last_year_sales = 0 THEN NULL
ELSE (current_month_sales - same_month_last_year_sales)
/ same_month_last_year_sales * 100
END, 2
) AS yoy_growth_pct
FROM sales_with_comparisons
ORDER BY
category ASC,
yr DESC,
mth DESC;
Sample output:
| category | year | month | current_month_sales | prev_month_sales | same_month_ly_sales | mom_growth_pct | yoy_growth_pct |
|---|---|---|---|---|---|---|---|
| Electronics | 2024 | 4 | 320000 | 295000 | 280000 | +8.47% | +14.29% |
| Clothing | 2024 | 4 | 145000 | 162000 | 130000 | -10.49% | +11.54% |
Why `LAG(monthly_revenue, 12)`: Partitioned by category and ordered by year + month chronologically, the row 12 positions back is exactly the same month of the prior year, giving a clean year-over-year comparison without any self-join. Caveat: this assumes every category has a row for every month; if a month had no sales at all, the 12-row offset no longer lands on the same month last year (a calendar/date-spine CTE fixes that).
Q12. Write a query to find customers who placed their first order in 2024. [5 Marks]
Answer:
Assumption: Tables customers(customer_id, name, email, created_at) and orders(order_id, customer_id, order_date, amount) exist.
Method 1 – Using MIN() with HAVING (simplest and most portable):
SELECT
c.customer_id,
c.name,
c.email,
MIN(o.order_date) AS first_order_date,
COUNT(o.order_id) AS total_orders_2024,
SUM(o.amount) AS total_spend_2024
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.name,
c.email
HAVING
YEAR(MIN(o.order_date)) = 2024 -- first ever order was in 2024
ORDER BY first_order_date ASC;
Method 2 – Using window function ROW_NUMBER (most explicit):
WITH ranked_orders AS (
-- Step 1: Rank each customer's orders chronologically
SELECT
customer_id,
order_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date ASC
) AS order_rank
FROM orders
),
first_orders AS (
-- Step 2: Keep only each customer's very first order
SELECT customer_id, order_date AS first_order_date
FROM ranked_orders
WHERE order_rank = 1
)
-- Step 3: Filter to those whose first order was in 2024
SELECT
c.customer_id,
c.name,
c.email,
fo.first_order_date
FROM customers c
JOIN first_orders fo
ON c.customer_id = fo.customer_id
WHERE YEAR(fo.first_order_date) = 2024
ORDER BY fo.first_order_date ASC;
Method 3 – Using NOT EXISTS (anti-join: no orders before 2024):
SELECT
c.customer_id,
c.name,
c.email,
MIN(o.order_date) AS first_order_date
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE YEAR(o.order_date) = 2024 -- has at least one 2024 order
AND NOT EXISTS (
SELECT 1
FROM orders o_prev
WHERE o_prev.customer_id = c.customer_id
AND YEAR(o_prev.order_date) < 2024 -- no orders before 2024
)
GROUP BY c.customer_id, c.name, c.email
ORDER BY first_order_date ASC;
Method comparison:
| Method | Readability | Performance | Handles edge cases |
|---|---|---|---|
| `MIN() + HAVING` | ✅ Cleanest | ✅ Single pass | ✅ Yes |
| `ROW_NUMBER()` | ✅ Explicit | ✅ Good | ✅ Yes |
| `NOT EXISTS` | ✅ Clear intent | ⚠️ Correlated scan | ✅ Best for correctness |
Best practice: Method 1 (`MIN + HAVING`) is the most concise. Method 3 (`NOT EXISTS`) is semantically the clearest: it reads as "customers who have a 2024 order and have NO order before 2024", which precisely matches the business requirement. Use Method 3 in production when data correctness is critical.
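Method 1 translates directly to Python's built-in sqlite3; since SQLite has no `YEAR()` function, `strftime('%Y', ...)` stands in for it here. The customers and dates are invented so that only one customer's first-ever order falls in 2024.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, '2023-11-05'),  # customer 101 already ordered in 2023
    (2, 101, '2024-01-10'),
    (3, 102, '2024-02-20'),  # customer 102's first ever order is in 2024
])

rows = conn.execute("""
    SELECT customer_id, MIN(order_date) AS first_order
    FROM orders
    GROUP BY customer_id
    HAVING strftime('%Y', MIN(order_date)) = '2024'
""").fetchall()
print(rows)  # [(102, '2024-02-20')]
```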
📅 Day 7: Date/Time Functions & String Manipulation
Here are all the answers formatted for direct copy-paste:
Q1. List the common date functions in SQL (DATEADD, DATEDIFF, DATEPART, EXTRACT). Give a use case for each. [4 Marks]
Answer:
DATEADD – Add or subtract a time interval from a date:
| Dialect | Function | Example |
|---|---|---|
| SQL Server | `DATEADD(unit, n, date)` | `DATEADD(DAY, 30, order_date)` |
| MySQL | `DATE_ADD(date, INTERVAL n unit)` | `DATE_ADD(order_date, INTERVAL 30 DAY)` |
| PostgreSQL | `date + INTERVAL 'n unit'` | `order_date + INTERVAL '30 days'` |
-- Use case: Find the expected delivery date (7 days after order)
-- SQL Server
SELECT order_id, order_date,
DATEADD(DAY, 7, order_date) AS expected_delivery
FROM orders;
-- MySQL
SELECT order_id, order_date,
DATE_ADD(order_date, INTERVAL 7 DAY) AS expected_delivery
FROM orders;
-- PostgreSQL
SELECT order_id, order_date,
order_date + INTERVAL '7 days' AS expected_delivery
FROM orders;
DATEDIFF – Calculate the difference between two dates:
| Dialect | Function |
|---|---|
| SQL Server | `DATEDIFF(unit, start_date, end_date)` |
| MySQL | `DATEDIFF(end_date, start_date)` – days only |
| PostgreSQL | `end_date - start_date` – returns integer days |
-- Use case: Find how many days since each order was placed
-- SQL Server
SELECT order_id,
DATEDIFF(DAY, order_date, GETDATE()) AS days_since_order
FROM orders;
-- MySQL
SELECT order_id,
DATEDIFF(CURDATE(), order_date) AS days_since_order
FROM orders;
-- PostgreSQL
SELECT order_id,
CURRENT_DATE - order_date::DATE AS days_since_order
FROM orders;
DATEPART / EXTRACT – Extract a specific component (year, month, day, hour) from a date:
| Dialect | Function |
|---|---|
| SQL Server | DATEPART(unit, date) |
| MySQL | YEAR(date), MONTH(date), DAY(date) or EXTRACT(unit FROM date) |
| PostgreSQL | EXTRACT(unit FROM date) or DATE_PART('unit', date) |
-- Use case: Group sales by year and month for trend analysis
-- SQL Server
SELECT DATEPART(YEAR, order_date) AS yr,
DATEPART(MONTH, order_date) AS mth,
SUM(amount) AS monthly_sales
FROM orders
GROUP BY DATEPART(YEAR, order_date), DATEPART(MONTH, order_date);
-- MySQL
SELECT YEAR(order_date) AS yr,
MONTH(order_date) AS mth,
SUM(amount) AS monthly_sales
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date);
-- PostgreSQL
SELECT EXTRACT(YEAR FROM order_date) AS yr,
EXTRACT(MONTH FROM order_date) AS mth,
SUM(amount) AS monthly_sales
FROM orders
GROUP BY 1, 2;
EXTRACT – Standard SQL function (ANSI compliant) to pull date parts:
-- Use case: Find orders placed in Q4 (October, November, December)
-- PostgreSQL / MySQL
SELECT order_id, order_date, amount
FROM orders
WHERE EXTRACT(MONTH FROM order_date) IN (10, 11, 12);
Q2. Write a query to find all orders placed in the last 30 days. [3 Marks]
Answer:
MySQL:
SELECT
order_id,
customer_id,
order_date,
amount
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
AND order_date < CURDATE() + INTERVAL 1 DAY -- includes today fully
ORDER BY order_date DESC;
PostgreSQL:
SELECT
order_id,
customer_id,
order_date,
amount
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
AND order_date < CURRENT_DATE + INTERVAL '1 day'
ORDER BY order_date DESC;
SQL Server:
SELECT
order_id,
customer_id,
order_date,
amount
FROM orders
WHERE order_date >= DATEADD(DAY, -30, CAST(GETDATE() AS DATE))
AND order_date < DATEADD(DAY, 1, CAST(GETDATE() AS DATE))
ORDER BY order_date DESC;
Important nuance: If `order_date` is a `DATETIME` or `TIMESTAMP` column (not just `DATE`), an upper bound of `<= CURDATE()` would miss orders placed later today, because `CURDATE()` represents midnight. The safest approach is `>= DATE_SUB(CURDATE(), INTERVAL 30 DAY)` paired with `< CURDATE() + INTERVAL 1 DAY`, which captures the full range from midnight to midnight.
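The half-open window can be demonstrated with Python's built-in sqlite3, which stores dates as ISO text. A fixed reference date replaces `CURDATE()` here so the example is reproducible; the orders are invented so that one row carries a late-in-the-day timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, '2024-04-13 18:30:00'),  # "today", well after midnight
    (2, '2024-03-20 09:00:00'),  # inside the 30-day window
    (3, '2024-02-01 12:00:00'),  # older than 30 days
])

# Half-open range [today - 30 days, tomorrow) keeps order 1 despite its time part.
# In production you would use date('now') instead of the literal date.
rows = conn.execute("""
    SELECT order_id
    FROM orders
    WHERE order_date >= date('2024-04-13', '-30 days')
      AND order_date <  date('2024-04-13', '+1 day')
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1,), (2,)]
```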
Q3. How do you calculate the age of a customer in years from their date_of_birth? [3 Marks]
Answer:
Assumption: Table customers(customer_id, name, date_of_birth) exists where date_of_birth is a DATE column.
MySQL:
SELECT
customer_id,
name,
date_of_birth,
TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age_in_years
FROM customers
ORDER BY age_in_years DESC;
PostgreSQL:
SELECT
customer_id,
name,
date_of_birth,
-- Method 1: Using AGE() function (returns interval)
AGE(CURRENT_DATE, date_of_birth) AS age_interval,
-- Method 2: Extract years from AGE() for a clean integer
EXTRACT(YEAR FROM AGE(CURRENT_DATE, date_of_birth))::INT
AS age_in_years
FROM customers
ORDER BY age_in_years DESC;
SQL Server:
SELECT
customer_id,
name,
date_of_birth,
-- Method 1: Simple DATEDIFF (may be off by 1 if birthday hasn't occurred yet this year)
DATEDIFF(YEAR, date_of_birth, GETDATE()) AS approx_age,
-- Method 2: Accurate age accounting for birthday this year
DATEDIFF(YEAR, date_of_birth, GETDATE())
- CASE
WHEN MONTH(GETDATE()) < MONTH(date_of_birth)
OR (MONTH(GETDATE()) = MONTH(date_of_birth)
AND DAY(GETDATE()) < DAY(date_of_birth))
THEN 1
ELSE 0
END AS accurate_age_in_years
FROM customers;
Why SQL Server needs the birthday correction: `DATEDIFF(YEAR, '2000-12-31', '2024-06-01')` returns 24 even though the person has not turned 24 yet (their birthday is in December). The `CASE` block subtracts 1 if the birthday has not occurred yet in the current year. MySQL's `TIMESTAMPDIFF` and PostgreSQL's `AGE()` handle this correctly by default.
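The same birthday correction is easy to express in pure Python, which makes the off-by-one behavior of a naive year subtraction concrete. This is a sketch; the function name is our own.

```python
from datetime import date

def age_in_years(dob, today):
    """Full years elapsed; subtract 1 if this year's birthday hasn't happened yet."""
    # Tuple comparison mirrors the SQL CASE: (month, day) before birthday -> True -> 1
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# Naive subtraction would say 24; the real age on 2024-06-01 is 23.
print(age_in_years(date(2000, 12, 31), date(2024, 6, 1)))  # 23
print(age_in_years(date(2000, 6, 1), date(2024, 6, 1)))    # 24 (birthday today counts)
```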
Q4. Write a query to find the first and last day of the current month. [3 Marks]
Answer:
MySQL:
SELECT
-- First day: set day component to 01
DATE_FORMAT(CURDATE(), '%Y-%m-01') AS first_day_of_month,
-- Last day: built-in function
LAST_DAY(CURDATE()) AS last_day_of_month;
-- Practical use: filter orders in current month
SELECT order_id, order_date, amount
FROM orders
WHERE order_date BETWEEN DATE_FORMAT(CURDATE(), '%Y-%m-01')
AND LAST_DAY(CURDATE());
PostgreSQL:
SELECT
-- First day: truncate to month start
DATE_TRUNC('month', CURRENT_DATE)::DATE AS first_day_of_month,
-- Last day: first day of next month minus 1 day
(DATE_TRUNC('month', CURRENT_DATE)
+ INTERVAL '1 month - 1 day')::DATE AS last_day_of_month;
-- Practical use
SELECT order_id, order_date, amount
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
AND order_date < DATE_TRUNC('month', CURRENT_DATE) + INTERVAL '1 month';
SQL Server:
SELECT
-- First day: construct from year and month parts
DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1)
AS first_day_of_month,
-- Last day: first day of next month minus 1 day
EOMONTH(GETDATE()) AS last_day_of_month;
-- Alternative using DATEADD
SELECT
DATEADD(DAY, 1 - DAY(GETDATE()), CAST(GETDATE() AS DATE))
AS first_day_of_month,
EOMONTH(GETDATE()) AS last_day_of_month;
Q5. What is the difference between CURRENT_DATE, GETDATE(), NOW(), and SYSDATE? Which database uses which? [2 Marks]
Answer:
| Function | Returns | Database |
|---|---|---|
| `CURRENT_DATE` | Date only, no time component (`2024-04-13`) | Standard SQL: PostgreSQL, MySQL (SQL Server has no `CURRENT_DATE`; use `CAST(GETDATE() AS DATE)`) |
| `GETDATE()` | Date + time – current datetime (`2024-04-13 14:32:05.123`) | SQL Server only |
| `NOW()` | Date + time – current datetime at statement start | MySQL, PostgreSQL |
| `SYSDATE()` | Date + time – current datetime at function evaluation (can differ within same query) | MySQL, Oracle |
-- PostgreSQL
SELECT CURRENT_DATE; -- 2024-04-13
SELECT NOW(); -- 2024-04-13 14:32:05.123456+05:30
-- MySQL
SELECT CURRENT_DATE; -- 2024-04-13
SELECT NOW(); -- 2024-04-13 14:32:05 (statement start time)
SELECT SYSDATE(); -- 2024-04-13 14:32:05 (exact evaluation time â can differ from NOW())
-- SQL Server
SELECT CAST(GETDATE() AS DATE); -- 2024-04-13
SELECT GETDATE(); -- 2024-04-13 14:32:05.123
SELECT SYSDATETIME(); -- higher precision: 2024-04-13 14:32:05.1234567
Key difference between NOW() and SYSDATE() in MySQL: `NOW()` returns the time the statement began executing and stays constant within a query. `SYSDATE()` returns the time it is actually called, so it can return different values if invoked multiple times within the same query. For consistency and reproducibility, `NOW()` is preferred.
Q6. Write a query to calculate weekly sales for the last 12 weeks. [4 Marks]
Answer:
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL:
SELECT
YEARWEEK(order_date, 1) AS year_week, -- ISO week
MIN(order_date) AS week_start_date,
MAX(order_date) AS week_end_date,
COUNT(order_id) AS total_orders,
ROUND(SUM(amount), 2) AS weekly_sales,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 12 WEEK)
GROUP BY YEARWEEK(order_date, 1)
ORDER BY year_week ASC;
PostgreSQL:
SELECT
DATE_TRUNC('week', order_date)::DATE AS week_start_date,
DATE_TRUNC('week', order_date)::DATE
+ INTERVAL '6 days' AS week_end_date,
COUNT(order_id) AS total_orders,
ROUND(SUM(amount)::NUMERIC, 2) AS weekly_sales,
ROUND(AVG(amount)::NUMERIC, 2) AS avg_order_value
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 weeks'
GROUP BY DATE_TRUNC('week', order_date)
ORDER BY week_start_date ASC;
SQL Server:
SELECT
DATEADD(DAY, 1 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
AS week_start_date,
DATEADD(DAY, 7 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
AS week_end_date,
COUNT(order_id) AS total_orders,
ROUND(SUM(amount), 2) AS weekly_sales,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
WHERE order_date >= DATEADD(WEEK, -12, CAST(GETDATE() AS DATE))
GROUP BY
DATEADD(DAY, 1 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE)),
DATEADD(DAY, 7 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
ORDER BY week_start_date ASC;
Q7. Write a query to find the day of the week with the highest average sales. [3 Marks]
Answer:
MySQL:
SELECT
DAYNAME(order_date) AS day_of_week,
DAYOFWEEK(order_date) AS day_number, -- 1=Sunday, 7=Saturday
COUNT(order_id) AS total_orders,
ROUND(AVG(amount), 2) AS avg_sales,
ROUND(SUM(amount), 2) AS total_sales
FROM orders
GROUP BY
DAYOFWEEK(order_date),
DAYNAME(order_date)
ORDER BY avg_sales DESC
LIMIT 1;
PostgreSQL:
SELECT
TO_CHAR(order_date, 'Day') AS day_of_week,
EXTRACT(DOW FROM order_date) AS day_number, -- 0=Sunday, 6=Saturday
COUNT(order_id) AS total_orders,
ROUND(AVG(amount)::NUMERIC, 2) AS avg_sales,
ROUND(SUM(amount)::NUMERIC, 2) AS total_sales
FROM orders
GROUP BY
EXTRACT(DOW FROM order_date),
TO_CHAR(order_date, 'Day')
ORDER BY avg_sales DESC
LIMIT 1;
SQL Server (uses TOP 1 instead of LIMIT):
SELECT TOP 1
DATENAME(WEEKDAY, order_date) AS day_of_week,
DATEPART(WEEKDAY, order_date) AS day_number,
COUNT(order_id) AS total_orders,
ROUND(AVG(amount), 2) AS avg_sales,
ROUND(SUM(amount), 2) AS total_sales
FROM orders
GROUP BY
DATEPART(WEEKDAY, order_date),
DATENAME(WEEKDAY, order_date)
ORDER BY avg_sales DESC;
Q8. How do you group sales by fiscal quarter (assuming fiscal year starts in April)? [3 Marks]
Answer: In a fiscal year starting April, the quarters are:
| Fiscal Quarter | Calendar Months |
|---|---|
| FQ1 | April, May, June |
| FQ2 | July, August, September |
| FQ3 | October, November, December |
| FQ4 | January, February, March |
MySQL:
SELECT
-- Fiscal year: shifts back 3 months so April becomes month 1 of new year
YEAR(order_date - INTERVAL 3 MONTH) AS fiscal_year,
QUARTER(order_date - INTERVAL 3 MONTH) AS fiscal_quarter,
CASE QUARTER(order_date - INTERVAL 3 MONTH)
WHEN 1 THEN 'FQ1 (Apr-Jun)'
WHEN 2 THEN 'FQ2 (Jul-Sep)'
WHEN 3 THEN 'FQ3 (Oct-Dec)'
WHEN 4 THEN 'FQ4 (Jan-Mar)'
END AS fiscal_quarter_label,
ROUND(SUM(amount), 2) AS total_sales,
COUNT(order_id) AS total_orders
FROM orders
GROUP BY
YEAR(order_date - INTERVAL 3 MONTH),
QUARTER(order_date - INTERVAL 3 MONTH)
ORDER BY fiscal_year, fiscal_quarter;
PostgreSQL:
SELECT
EXTRACT(YEAR FROM order_date - INTERVAL '3 months') AS fiscal_year,
EXTRACT(QUARTER FROM order_date - INTERVAL '3 months') AS fiscal_quarter,
CASE EXTRACT(QUARTER FROM order_date - INTERVAL '3 months')
WHEN 1 THEN 'FQ1 (Apr-Jun)'
WHEN 2 THEN 'FQ2 (Jul-Sep)'
WHEN 3 THEN 'FQ3 (Oct-Dec)'
WHEN 4 THEN 'FQ4 (Jan-Mar)'
END AS fiscal_quarter_label,
ROUND(SUM(amount)::NUMERIC, 2) AS total_sales
FROM orders
GROUP BY 1, 2
ORDER BY fiscal_year, fiscal_quarter;
SQL Server:
SELECT
YEAR(DATEADD(MONTH, -3, order_date)) AS fiscal_year,
DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
AS fiscal_quarter,
CASE DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
WHEN 1 THEN 'FQ1 (Apr-Jun)'
WHEN 2 THEN 'FQ2 (Jul-Sep)'
WHEN 3 THEN 'FQ3 (Oct-Dec)'
WHEN 4 THEN 'FQ4 (Jan-Mar)'
END AS fiscal_quarter_label,
ROUND(SUM(amount), 2) AS total_sales
FROM orders
GROUP BY
YEAR(DATEADD(MONTH, -3, order_date)),
DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
ORDER BY fiscal_year, fiscal_quarter;
Logic explained: Subtracting 3 months from each date shifts the calendar so that April (month 4) becomes January (month 1) of the shifted calendar. Standard `QUARTER()` then correctly identifies FQ1 (Apr-Jun) as Q1 in the shifted timeline.
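The 3-month shift can also be written as plain modular arithmetic, which is a quick way to verify the quarter mapping for all twelve months. The helper name is our own.

```python
def fiscal_quarter(month):
    """Fiscal quarter for an April-start fiscal year (Apr=FQ1 ... Jan-Mar=FQ4)."""
    # Shift April to position 0, wrap Jan-Mar around, then bucket into 3-month groups.
    return (month - 4) % 12 // 3 + 1

print([fiscal_quarter(m) for m in range(1, 13)])
# [4, 4, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The output matches the table above: January through March land in FQ4, April through June in FQ1, and so on.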
Q9. List the common string functions (CONCAT, SUBSTRING, LENGTH, TRIM, REPLACE, UPPER, LOWER). Give one example each. [5 Marks]
Answer:
CONCAT – Join two or more strings together:
-- Create full name from first and last name columns
SELECT
CONCAT(first_name, ' ', last_name) AS full_name
FROM customers;
-- MySQL also supports: CONCAT_WS(' ', first_name, last_name) – skips NULLs
-- PostgreSQL: first_name || ' ' || last_name
-- SQL Server: first_name + ' ' + last_name
SUBSTRING – Extract part of a string by position:
-- Extract area code from phone number '(022) 4567-8901'
SELECT
phone,
SUBSTRING(phone, 2, 3) AS area_code
-- MySQL: SUBSTRING(str, start_pos, length) – 1-indexed
-- PostgreSQL: SUBSTRING(str FROM 2 FOR 3)
-- SQL Server: SUBSTRING(str, 2, 3)
FROM customers;
LENGTH – Return the number of characters in a string:
-- Find customers with unusually long names (possible data quality issue)
SELECT
name,
LENGTH(name) AS name_length
-- SQL Server: LEN(name) – trims trailing spaces before counting
-- PostgreSQL: LENGTH(name) or CHAR_LENGTH(name)
FROM customers
WHERE LENGTH(name) > 50;
TRIM – Remove leading and trailing spaces (or specified characters):
-- Clean up messy email entries with extra whitespace
SELECT
email,
TRIM(email) AS clean_email,
LTRIM(email) AS left_trimmed, -- leading spaces only
RTRIM(email) AS right_trimmed -- trailing spaces only
FROM customers
WHERE email != TRIM(email); -- find rows that need cleaning
REPLACE – Substitute all occurrences of a substring:
-- Standardize phone format: remove hyphens from '022-4567-8901'
SELECT
phone,
REPLACE(phone, '-', '') AS clean_phone,
REPLACE(REPLACE(phone, '-', ''), ' ', '') AS fully_clean_phone
FROM customers;
UPPER – Convert string to all uppercase:
-- Normalize country codes to uppercase for consistent grouping
SELECT
UPPER(country) AS country_upper,
COUNT(*) AS customer_count
FROM customers
GROUP BY UPPER(country);
-- Prevents 'india', 'India', 'INDIA' from being counted separately
LOWER – Convert string to all lowercase:
-- Normalize email addresses for deduplication
SELECT
LOWER(TRIM(email)) AS normalized_email,
COUNT(*) AS occurrences
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1; -- find duplicate emails
Q10. Write a query to extract the domain name from an email column (e.g., 'user@gmail.com' → 'gmail.com'). [4 Marks]
Answer:
MySQL:
SELECT
email,
-- SUBSTRING_INDEX splits by delimiter and returns n-th part
SUBSTRING_INDEX(email, '@', -1) AS email_domain
-- -1 means: take everything AFTER the last '@'
FROM customers;
PostgreSQL:
SELECT
email,
-- Method 1: SPLIT_PART splits by delimiter and picks the n-th chunk
SPLIT_PART(email, '@', 2) AS email_domain,
-- Method 2: SUBSTRING with POSITION
SUBSTRING(email FROM POSITION('@' IN email) + 1) AS email_domain_v2
FROM customers;
SQL Server:
SELECT
email,
-- CHARINDEX finds the position of '@', then SUBSTRING extracts from that point
SUBSTRING(
email,
CHARINDEX('@', email) + 1, -- start after '@'
LEN(email) - CHARINDEX('@', email) -- length = remaining characters
) AS email_domain
FROM customers;
Universal approach (works across dialects):
-- Using SUBSTRING + position functions
SELECT
email,
SUBSTRING(email,
LOCATE('@', email) + 1, -- MySQL: LOCATE
LENGTH(email)
) AS email_domain
FROM customers;
With validation (skip malformed emails):
SELECT
email,
SUBSTRING_INDEX(email, '@', -1) AS email_domain
FROM customers
WHERE email LIKE '%@%.%' -- basic email format check
AND email NOT LIKE '%@%@%'; -- no double @ signs
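The extraction plus the two validation rules (must contain an '@' and a dot in the domain, no double '@') can be mirrored in a few lines of Python. This is a sketch with an invented helper name, not a substitute for real email validation.

```python
def email_domain(email):
    """Return the text after '@' if the address looks minimally valid, else None."""
    if email.count('@') != 1:          # rejects missing AND double '@' signs
        return None
    domain = email.split('@')[1]
    return domain if '.' in domain else None   # domain must contain a dot

print(email_domain('user@gmail.com'))    # gmail.com
print(email_domain('bad@@example.com'))  # None
print(email_domain('no-at-sign.com'))    # None
```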
Q11. Write a query to split a full_name column into first_name and last_name. [3 Marks]
Answer:
Assumption: Table customers(customer_id, full_name) exists. Names are in "FirstName LastName" format separated by a single space.
MySQL:
SELECT
customer_id,
full_name,
-- First name: everything before the first space
SUBSTRING_INDEX(full_name, ' ', 1) AS first_name,
-- Last name: everything after the first space
SUBSTRING_INDEX(full_name, ' ', -1) AS last_name,
-- Middle name safe version (if name has 3 parts)
CASE
WHEN LENGTH(full_name) - LENGTH(REPLACE(full_name, ' ', '')) >= 2
THEN SUBSTRING_INDEX(SUBSTRING_INDEX(full_name, ' ', 2), ' ', -1)
ELSE NULL
END AS middle_name
FROM customers;
PostgreSQL:
SELECT
customer_id,
full_name,
-- First name: text before first space
SPLIT_PART(full_name, ' ', 1) AS first_name,
-- Last name: text after first space (handles middle names too)
CASE
WHEN POSITION(' ' IN full_name) > 0
THEN SUBSTRING(
full_name FROM POSITION(' ' IN full_name) + 1
)
ELSE full_name -- single name fallback
END AS last_name
FROM customers;
SQL Server:
SELECT
customer_id,
full_name,
-- First name: up to first space
LEFT(full_name, CHARINDEX(' ', full_name + ' ') - 1)
AS first_name,
-- Last name: after first space
CASE
WHEN CHARINDEX(' ', full_name) > 0
THEN SUBSTRING(
full_name,
CHARINDEX(' ', full_name) + 1,
LEN(full_name)
)
ELSE NULL
END AS last_name
FROM customers;
Edge case handling: The `+ ' '` trick in SQL Server (`CHARINDEX(' ', full_name + ' ')`) ensures that even single-word names do not return 0 from `CHARINDEX`, preventing errors in `LEFT()`.
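The "split on the first space, keep the remainder" behavior (the PostgreSQL variant above, which preserves middle names inside last_name) is easy to check in Python, including the single-word fallback:

```python
def split_name(full_name):
    """First token, then everything after the first space; None for single-word names."""
    parts = full_name.strip().split(' ', 1)   # split on the FIRST space only
    first = parts[0]
    last = parts[1] if len(parts) > 1 else None
    return first, last

print(split_name('Asha Rani Patel'))  # ('Asha', 'Rani Patel')
print(split_name('Madonna'))          # ('Madonna', None)
```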
Q12. What is the difference between CHAR, VARCHAR, and TEXT data types? [3 Marks]
Answer:
| Feature | `CHAR(n)` | `VARCHAR(n)` | `TEXT` |
|---|---|---|---|
| Storage type | Fixed-length – always uses exactly n bytes | Variable-length – uses only as many bytes as needed plus a 1-2 byte length prefix | Variable-length – no declared max length |
| Max length | Up to 255 characters (MySQL) | Up to 65,535 bytes (MySQL) / 8000 chars (SQL Server) | Up to 65,535 bytes (MySQL) / unlimited (PostgreSQL) |
| Trailing spaces | Pads with spaces to fill n length | Stores as-is, no padding | Stores as-is |
| Performance | Slightly faster for fixed-size data (no length calculation) | Efficient for variable data | Slower for indexing; not ideal for WHERE clauses |
| Indexing | ✅ Fully indexable | ✅ Fully indexable | ⚠️ Limited or prefix-only indexing |
| Best use case | Fixed codes (country code, gender flag, status) | Names, emails, addresses (variable but bounded) | Long descriptions, comments, article body |
-- CHAR: always stores exactly 2 characters (pads 'US' as 'US')
CREATE TABLE countries (
country_code CHAR(2), -- 'US', 'IN', 'GB' â always 2 chars
country_name VARCHAR(100), -- 'United States', 'India' â variable
description TEXT -- long free-text field, unbounded
);
-- Demonstration of CHAR padding behavior
SELECT
CHAR_LENGTH('IN ') AS varchar_length, -- 3 (includes space)
CHAR_LENGTH(TRIM('IN ')) AS trimmed_length; -- 2
Storage comparison for the value 'Hello' (5 characters):
| Type | Declared as | Bytes used |
|---|---|---|
| `CHAR(10)` | Fixed | 10 bytes always (pads with 5 spaces) |
| `VARCHAR(100)` | Variable | 6 bytes (5 chars + 1 length byte) |
| `TEXT` | Variable | 6 bytes + possible off-row storage overhead |
Rule of thumb: Use `CHAR` for truly fixed-length codes (ISO country codes, gender flags, UUIDs in some systems). Use `VARCHAR` for most text columns where length varies but has a reasonable maximum. Use `TEXT` only for unbounded long-form content like descriptions or comments; avoid using it as a join key or in `WHERE` clauses that need full indexes.
Q13. Write a query to find customers whose email domain is in the top 5 most common domains. [5 Marks]
Answer:
Assumption: Table customers(customer_id, name, email) exists.
MySQL:
WITH email_domains AS (
-- Step 1: Extract domain from each customer's email
SELECT
customer_id,
name,
email,
SUBSTRING_INDEX(email, '@', -1) AS domain
FROM customers
WHERE email LIKE '%@%.%' -- basic format validation
),
domain_counts AS (
-- Step 2: Count customers per domain
SELECT
domain,
COUNT(*) AS customer_count
FROM email_domains
GROUP BY domain
),
top_5_domains AS (
-- Step 3: Rank domains and keep top 5
SELECT domain
FROM domain_counts
ORDER BY customer_count DESC
LIMIT 5
)
-- Step 4: Return customers whose domain is in top 5
SELECT
ed.customer_id,
ed.name,
ed.email,
ed.domain,
dc.customer_count AS domain_user_count
FROM email_domains ed
JOIN domain_counts dc ON ed.domain = dc.domain
WHERE ed.domain IN (SELECT domain FROM top_5_domains)
ORDER BY dc.customer_count DESC, ed.name ASC;
PostgreSQL (using DENSE_RANK for ties):
WITH email_domains AS (
SELECT
customer_id,
name,
email,
SPLIT_PART(email, '@', 2) AS domain
FROM customers
WHERE email LIKE '%@%.%'
),
domain_ranked AS (
SELECT
domain,
COUNT(*) AS customer_count,
DENSE_RANK() OVER (
ORDER BY COUNT(*) DESC
) AS domain_rank
FROM email_domains
GROUP BY domain
)
SELECT
ed.customer_id,
ed.name,
ed.email,
ed.domain,
dr.customer_count,
dr.domain_rank
FROM email_domains ed
JOIN domain_ranked dr
ON ed.domain = dr.domain
WHERE dr.domain_rank <= 5 -- handles ties correctly
ORDER BY dr.domain_rank, ed.name;
Why DENSE_RANK over LIMIT 5: If two domains share the 5th highest count (e.g., both have 100 customers), `LIMIT 5` arbitrarily drops one. `DENSE_RANK() <= 5` includes all tied domains fairly, which is the better approach for business reporting.
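The LIMIT-vs-DENSE_RANK tie behavior can be sketched outside SQL as well. A minimal Python sketch with hypothetical emails (the domains and counts are invented purely for illustration):

```python
from collections import Counter

# Hypothetical sample emails; several domains tie on their counts
emails = [
    "a@x.com", "b@x.com", "c@y.com", "d@y.com",
    "e@a.com", "f@b.com", "g@c.com",
]

# Count customers per domain (mirrors the domain_counts CTE)
counts = Counter(e.split("@", 1)[1] for e in emails)

# Dense-rank the counts: equal counts share a rank (mirrors DENSE_RANK)
distinct_counts = sorted(set(counts.values()), reverse=True)
rank = {c: i + 1 for i, c in enumerate(distinct_counts)}

top_n = 2
# LIMIT-style cut: exactly top_n domains, ties beyond the cutoff dropped
limit_style = [d for d, _ in counts.most_common(top_n)]
# DENSE_RANK-style cut: every domain whose count ranks within top_n
dense_style = [d for d, c in counts.items() if rank[c] <= top_n]

print(limit_style)  # exactly 2 domains
print(dense_style)  # all 5 domains here, because counts 2 and 1 take ranks 1 and 2
```

With this data the LIMIT-style cut keeps only two domains, while the dense-rank cut keeps every domain tied at the qualifying counts, matching the behavior described above.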
Q14. Write a query to find the average time (in days) between a customer's first and second order. [5 Marks]
Answer:
Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.
MySQL:
WITH ranked_orders AS (
-- Step 1: Rank each customer's orders chronologically
SELECT
customer_id,
order_id,
order_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date ASC,
order_id ASC -- tiebreak by order_id if same date
) AS order_rank
FROM orders
),
first_and_second AS (
-- Step 2: Pivot first and second order dates into same row
SELECT
customer_id,
MAX(CASE WHEN order_rank = 1 THEN order_date END) AS first_order_date,
MAX(CASE WHEN order_rank = 2 THEN order_date END) AS second_order_date
FROM ranked_orders
WHERE order_rank <= 2
GROUP BY customer_id
HAVING MAX(CASE WHEN order_rank = 2 THEN order_date END) IS NOT NULL
-- HAVING filters out customers with only 1 order
),
days_between AS (
-- Step 3: Calculate days between first and second order per customer
SELECT
customer_id,
first_order_date,
second_order_date,
DATEDIFF(second_order_date, first_order_date) AS days_to_second_order
FROM first_and_second
)
-- Step 4: Average across all qualifying customers
SELECT
ROUND(AVG(days_to_second_order), 2) AS avg_days_first_to_second_order,
MIN(days_to_second_order) AS min_days,
MAX(days_to_second_order) AS max_days,
COUNT(*) AS customers_with_2plus_orders
FROM days_between;
PostgreSQL:
WITH ranked_orders AS (
SELECT
customer_id,
order_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date ASC, order_id ASC
) AS order_rank
FROM orders
),
first_and_second AS (
SELECT
customer_id,
MAX(CASE WHEN order_rank = 1 THEN order_date END) AS first_order_date,
MAX(CASE WHEN order_rank = 2 THEN order_date END) AS second_order_date
FROM ranked_orders
WHERE order_rank <= 2
GROUP BY customer_id
HAVING MAX(CASE WHEN order_rank = 2 THEN order_date END) IS NOT NULL
)
SELECT
ROUND(AVG(
second_order_date::DATE - first_order_date::DATE
)::NUMERIC, 2) AS avg_days_first_to_second_order,
MIN(second_order_date - first_order_date) AS min_days,
MAX(second_order_date - first_order_date) AS max_days,
COUNT(*) AS customers_with_2plus_orders
FROM first_and_second;
With per-customer breakdown (join customers for detail view):
-- Add this to either query above to see per-customer breakdown:
SELECT
c.customer_id,
c.name,
db.first_order_date,
db.second_order_date,
db.days_to_second_order,
CASE
WHEN db.days_to_second_order <= 7 THEN 'Rapid Repeat (≤7 days)'
WHEN db.days_to_second_order <= 30 THEN 'Quick Repeat (8-30 days)'
WHEN db.days_to_second_order <= 90 THEN 'Normal Repeat (31-90 days)'
ELSE 'Slow Repeat (>90 days)'
END AS repeat_segment
FROM customers c
JOIN days_between db ON c.customer_id = db.customer_id
ORDER BY db.days_to_second_order ASC;
Step-by-step logic:
| CTE | Purpose |
|---|---|
| ranked_orders | Numbers each customer's orders 1, 2, 3... chronologically |
| first_and_second | Pivots ranks 1 and 2 into columns; excludes one-time buyers via HAVING |
| days_between | Computes the day difference per customer |
| Final SELECT | Averages across all qualifying customers with 2+ orders |
Why exclude one-time buyers with HAVING: Customers with only one order have no second order, so their `second_order_date` is NULL. Including them in `AVG()` would not affect the average (AVG ignores NULLs), but the `HAVING` clause makes the exclusion explicit and keeps the row count (`customers_with_2plus_orders`) meaningful.
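The first-to-second-order logic above can be sanity-checked in plain Python. A minimal sketch with hypothetical order dates (customer 2 is a one-time buyer and is excluded, mirroring the HAVING filter):

```python
from datetime import date
from statistics import mean

# Hypothetical orders: (customer_id, order_date)
orders = [
    (1, date(2024, 1, 1)), (1, date(2024, 1, 11)), (1, date(2024, 3, 1)),
    (2, date(2024, 2, 1)),                        # one-time buyer: excluded
    (3, date(2024, 1, 5)), (3, date(2024, 1, 6)),
]

# Group each customer's order dates chronologically (mirrors ranked_orders)
by_customer = {}
for cid, d in sorted(orders):
    by_customer.setdefault(cid, []).append(d)

# Day gap between first and second order, only for customers with 2+ orders
gaps = [
    (dates[1] - dates[0]).days
    for dates in by_customer.values()
    if len(dates) >= 2
]

print(gaps)        # per-customer gaps
print(mean(gaps))  # average days from first to second order
```

Customer 1 repeats after 10 days and customer 3 after 1 day, so the average lands between them, exactly what the SQL's final SELECT computes.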
📘 Day 8: Query Optimization, Indexes & Execution Plans
Here are all the answers formatted for direct copy-paste:
Q1. Explain the difference between a clustered and non-clustered index. Can a table have multiple clustered indexes? [4 Marks]
Answer:
| Feature | Clustered Index | Non-Clustered Index |
|---|---|---|
| Data storage | Physically reorders table rows to match index order | Stores a separate structure with pointers back to actual rows |
| Number per table | Only ONE per table | Up to 999 (SQL Server) / virtually unlimited (PostgreSQL, MySQL) |
| Lookup speed | Fastest for range scans: data is physically ordered | Requires an extra lookup step (row pointer) after finding the index entry |
| Storage overhead | No extra storage: it IS the table | Requires additional disk space for the index structure |
| Best for | Primary keys, range queries, ORDER BY columns | Columns used in WHERE, JOIN, or covering queries |
Clustered Index: the table IS the index:
-- SQL Server: Primary key automatically creates a clustered index
CREATE TABLE orders (
order_id INT PRIMARY KEY, -- clustered index by default
customer_id INT NOT NULL,
order_date DATE NOT NULL,
amount DECIMAL(10,2)
);
-- Explicitly create a clustered index (SQL Server)
CREATE CLUSTERED INDEX idx_orders_date
ON orders (order_date);
Non-Clustered Index: separate structure with row pointers:
-- Non-clustered index on customer_id for fast JOIN/WHERE lookups
CREATE NONCLUSTERED INDEX idx_orders_customer
ON orders (customer_id);
-- PostgreSQL (all indexes are non-clustered by default)
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_date ON orders (order_date);
Can a table have multiple clustered indexes?
No. A table can have only one clustered index because the clustered index defines the physical storage order of the rows themselves: you cannot physically sort a table in two different orders simultaneously.
PostgreSQL note: PostgreSQL does not have a "clustered index" in the SQL Server sense; all its indexes are non-clustered by default. The `CLUSTER` command can physically reorder a table by an index once, but it does not maintain that order as data changes.
Q2. What is a composite index? When is the order of columns in a composite index important? [4 Marks]
Answer: A composite index (also called a multi-column or compound index) is an index built on two or more columns of a table. It allows the database to efficiently filter, sort, or look up rows based on combinations of those columns.
-- Creating a composite index
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);
Column order matters critically because of the "leftmost prefix rule":
The database can only use a composite index if the query filters on columns starting from the leftmost column in the index definition. Skipping the leading column renders the index unusable.
-- Index: (customer_id, order_date, status)
-- ✅ Uses the full index: all three columns, left to right
SELECT * FROM orders
WHERE customer_id = 101
AND order_date = '2024-01-15'
AND status = 'Shipped';
-- ✅ Uses the index partially: first two columns (leftmost prefix)
SELECT * FROM orders
WHERE customer_id = 101
AND order_date BETWEEN '2024-01-01' AND '2024-03-31';
-- ✅ Uses the index: first column only
SELECT * FROM orders
WHERE customer_id = 101;
-- ❌ Cannot use the index: skips customer_id (the leading column)
SELECT * FROM orders
WHERE order_date = '2024-01-15';
-- ❌ Cannot use the index: skips the leading two columns
SELECT * FROM orders
WHERE status = 'Shipped';
Column order design rules:
| Rule | Guidance |
|---|---|
| Equality columns first | Columns filtered with = should come before range (>, <, BETWEEN) columns |
| Highest cardinality first | Columns with more unique values (e.g., customer_id) before low-cardinality ones (e.g., status) |
| Most frequently filtered first | The column used in the most WHERE clauses goes first |
| Range column last | A range filter stops the index from being used for subsequent columns |
-- Good order: equality (customer_id) then range (order_date)
CREATE INDEX idx_good ON orders (customer_id, order_date);
-- Bad order: range (order_date) first, preventing customer_id from benefiting
CREATE INDEX idx_bad ON orders (order_date, customer_id);
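The leftmost-prefix rule is easy to observe with SQLite's `EXPLAIN QUERY PLAN`; a sketch assuming an in-memory database (MySQL's and PostgreSQL's planners behave analogously for this case, though their plan output wording differs):

```python
import sqlite3

# In-memory SQLite database with a composite index, mirroring the example above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_cust_date ON orders (customer_id, order_date)")

def plan(sql):
    """Return SQLite's query-plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)  # last column holds the detail text

# Filters on the leading column: the composite index is usable
p_leading = plan("SELECT * FROM orders WHERE customer_id = 101")
print(p_leading)  # mentions USING INDEX idx_cust_date

# Skips the leading column: the index cannot be used, full scan instead
p_skipped = plan("SELECT * FROM orders WHERE order_date = '2024-01-15'")
print(p_skipped)  # a SCAN of the table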
Q3. What are the downsides of having too many indexes on a table? [3 Marks]
Answer: While indexes speed up SELECT queries, they introduce significant overhead for write operations and resource consumption. Every index on a table must be maintained whenever data changes.
1. Write performance degradation:
-- Every INSERT, UPDATE, DELETE must update ALL indexes on the table
-- A table with 10 indexes requires 10 index updates per row modification
INSERT INTO orders (order_id, customer_id, order_date, amount, status)
VALUES (9001, 42, CURDATE(), 5000, 'Pending');
-- If orders has 8 indexes, this single INSERT triggers 8 index maintenance operations
2. Increased disk storage:
-- Each non-clustered index stores a copy of indexed columns + row pointers
-- A table with 10M rows and 10 composite indexes could use
-- 3-5x more disk space than the base table alone
3. Increased memory pressure:
-- The database buffer pool caches frequently accessed index pages
-- Too many indexes compete for limited RAM, pushing out useful cached data
-- Results in more disk I/O (cache misses) for all queries
4. Query optimizer confusion:
-- With many indexes, the optimizer must evaluate more execution plan choices
-- EXPLAIN / EXPLAIN ANALYZE becomes complex
-- The optimizer may occasionally pick a suboptimal index (statistics-dependent)
-- EXPLAIN SELECT * FROM orders WHERE customer_id = 101 AND status = 'Shipped';
-- With 10 indexes, optimizer must evaluate many candidate index combinations
5. Index maintenance overhead:
-- Over time, indexes become fragmented (especially B-tree indexes on insert-heavy tables)
-- Regular REBUILD or REORGANIZE is needed
-- SQL Server:
ALTER INDEX idx_orders_customer ON orders REBUILD;
-- PostgreSQL:
REINDEX INDEX idx_orders_customer;
Summary of the cost-benefit balance:
| Operation | Effect of More Indexes |
|---|---|
| SELECT | ✅ Faster (potentially) |
| INSERT | ❌ Slower |
| UPDATE | ❌ Slower (especially on indexed columns) |
| DELETE | ❌ Slower |
| Disk usage | ⬆ Higher |
| Memory usage | ⬆ Higher |
Rule of thumb: Aim for the minimum number of indexes that satisfies your query performance requirements. Regularly audit unused indexes using `sys.dm_db_index_usage_stats` (SQL Server) or `pg_stat_user_indexes` (PostgreSQL) and drop indexes that are never used.
Q4. What is a covering index? How does it improve query performance? [4 Marks]
Answer: A covering index is an index that contains all the columns a query needs, both for filtering (WHERE/JOIN) and for the output (SELECT), so the database can satisfy the entire query from the index alone, without ever accessing the base table rows.
The problem without a covering index:
-- Query: find order amounts for a specific customer
SELECT order_date, amount
FROM orders
WHERE customer_id = 101;
-- If the index is only on (customer_id):
-- Step 1: Index seek on customer_id = 101 finds matching row pointers
-- Step 2: For EACH pointer, go back to the base table to fetch order_date and amount
-- This "row lookup" (called a Key Lookup or RID Lookup) is expensive at scale
Covering index â includes all needed columns:
-- Include the SELECT columns in the index so no table lookup is needed
-- SQL Server: use INCLUDE for non-key columns
CREATE NONCLUSTERED INDEX idx_covering_customer_orders
ON orders (customer_id) -- filter/seek column
INCLUDE (order_date, amount, status); -- additional columns needed by SELECT
-- MySQL / PostgreSQL: list all columns in the index directly
CREATE INDEX idx_covering_customer_orders
ON orders (customer_id, order_date, amount, status);
With the covering index:
SELECT order_date, amount
FROM orders
WHERE customer_id = 101;
-- Step 1: Index seek on customer_id = 101
-- Step 2: Read order_date and amount directly from index leaf pages
-- NO trip back to the base table: "Index Only Scan" or "Covering Index" in EXPLAIN
How to verify a covering index is being used:
-- PostgreSQL: look for "Index Only Scan"
EXPLAIN ANALYZE
SELECT order_date, amount
FROM orders
WHERE customer_id = 101;
-- Output: Index Only Scan using idx_covering_customer_orders on orders
-- SQL Server: look for absence of "Key Lookup" in execution plan
-- If Key Lookup appears, the index is NOT covering: add the missing columns to INCLUDE
When to use INCLUDE vs adding to the key:
-- SQL Server best practice:
-- Key columns (in index): columns used in WHERE, JOIN, ORDER BY
-- INCLUDE columns: columns only needed in SELECT, not used for filtering
CREATE NONCLUSTERED INDEX idx_orders_opt
ON orders (customer_id, order_date) -- key: used in WHERE and ORDER BY
INCLUDE (amount, status); -- included: only needed in SELECT output
Performance impact: Covering indexes can reduce query execution time by 10x-100x on large tables by eliminating the most expensive part of an index lookup: the random I/O row fetch from the base table. They are one of the most powerful single-step optimizations available for read-heavy workloads.
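SQLite makes the covering-index effect directly visible: its plan says `USING COVERING INDEX` when no base-table access is needed. A minimal sketch with a hypothetical `orders` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INT, customer_id INT, order_date TEXT, amount REAL)""")
# Index holds every column the target query touches
conn.execute("CREATE INDEX idx_cov ON orders (customer_id, order_date, amount)")

def plan(sql):
    """Return SQLite's query-plan detail text for a statement."""
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# All referenced columns live in the index: no base-table access needed
covered = plan("SELECT order_date, amount FROM orders WHERE customer_id = 101")
print(covered)  # mentions USING COVERING INDEX

# SELECT * also needs order_id, which is outside the index: plain index search
not_covered = plan("SELECT * FROM orders WHERE customer_id = 101")
print(not_covered)
```

The contrast between the two plan lines is the "Index Only Scan" vs "Key Lookup" distinction described above, in SQLite's vocabulary.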
Q5. Explain why 'SELECT *' is considered bad practice in production queries. [3 Marks]
Answer: While SELECT * is convenient for exploration and debugging, it causes several serious problems in production environments:
1. Fetches unnecessary data â increased I/O and network transfer:
-- Table has 50 columns but query only needs 3
-- SELECT * transfers all 50 columns across the network for every row
SELECT * FROM customers; -- ❌ Bad: transfers all 50 columns
SELECT customer_id, name, email -- ✅ Good: only the 3 needed columns
FROM customers;
2. Breaks application code when schema changes:
-- Application code: result[4] assumes the 5th column is 'email'
-- Someone adds a column between name and email: result[4] is now wrong
SELECT * FROM customers; -- ❌ Column positions shift on schema change
-- Explicit column list is immune to column additions/reorderings
SELECT customer_id, name, email FROM customers; -- ✅ Always gets the right columns
3. Prevents covering index usage:
-- Perfect covering index exists: (customer_id, name, email)
CREATE INDEX idx_covering ON customers (customer_id, name, email);
-- SELECT * cannot be served by the covering index
-- because it needs ALL columns: forces a full table scan or key lookup
SELECT * FROM customers WHERE customer_id = 101; -- ❌ Covering index unused
SELECT customer_id, name, email -- ✅ Covered by index
FROM customers WHERE customer_id = 101; -- Index Only Scan
4. Exposes sensitive columns accidentally:
-- Table contains: name, email, password_hash, ssn, credit_card_last4
SELECT * FROM customers; -- ❌ Exposes sensitive data to any consumer of this query
-- API response might leak password_hash or SSN
5. Increased memory consumption:
-- Query processing buffers all fetched columns in memory
-- SELECT * on a 100-column, 10M row table can exhaust buffer pool
-- Pushes other useful cached pages out of memory
Rule: Always list only the columns you actually need in production queries. Use `SELECT *` only in quick exploratory queries; `SELECT COUNT(*)` is a separate, specially optimized form.
Q6. What is the impact of using functions on indexed columns in the WHERE clause (e.g., WHERE YEAR(date_col) = 2024)? How do you rewrite it? [4 Marks]
Answer: When you wrap an indexed column inside a function in the WHERE clause, the database cannot use the index on that column. The function must be evaluated for every single row in the table, forcing a full table scan, even if a perfect index exists.
Why it breaks index usage:
-- Index exists on order_date
CREATE INDEX idx_order_date ON orders (order_date);
-- ❌ SLOW: Function applied to the indexed column means the index is NOT used
-- Database must compute YEAR() for every row to evaluate the condition
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- Execution plan: Table Scan (reads all rows)
-- ❌ Also breaks index usage:
WHERE MONTH(order_date) = 3
WHERE LOWER(email) = 'test@example.com'
WHERE DATE_FORMAT(order_date, '%Y-%m') = '2024-01'
WHERE UPPER(status) = 'ACTIVE'
WHERE LEFT(phone, 3) = '022'
Fix: rewrite the predicate to compare the raw column against a range or literal:
-- ✅ FAST: Rewrite YEAR(order_date) = 2024 as a date range
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
-- Execution plan: Index Range Scan on idx_order_date ✅
-- ✅ FAST: Rewrite MONTH(order_date) = 3 for a specific month of a specific year
WHERE order_date >= '2024-03-01'
AND order_date < '2024-04-01';
-- ✅ FAST: Rewrite LOWER(email) = '...' by storing emails pre-normalized
-- Option 1: Ensure emails are always stored lowercase (data hygiene)
WHERE email = 'test@example.com';
-- Option 2: Create a function-based index
-- (PostgreSQL / Oracle; MySQL 8.0.13+ with an extra pair of parentheses)
CREATE INDEX idx_email_lower ON customers (LOWER(email));
WHERE LOWER(email) = 'test@example.com'; -- now uses the functional index ✅
-- ✅ FAST: Rewrite DATE_FORMAT-style month filters as ranges
-- PostgreSQL
WHERE order_date >= DATE_TRUNC('month', '2024-01-01'::DATE)
AND order_date < DATE_TRUNC('month', '2024-02-01'::DATE);
Verification with EXPLAIN:
-- Before fix: shows Table Scan or Seq Scan
EXPLAIN SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- After fix: shows Index Range Scan or Index Seek
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
Golden rule: Never apply a transformation to an indexed column in a WHERE clause. Instead, move the transformation to the constant side of the comparison, or use a function-based index when the transformation is unavoidable.
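The golden rule can be demonstrated with SQLite's planner too; a sketch assuming an in-memory table, where `strftime` plays the role of MySQL's `YEAR()`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT)")
conn.execute("CREATE INDEX idx_date ON orders (order_date)")

def plan(sql):
    """Return SQLite's query-plan detail text for a statement."""
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the indexed column: the planner falls back to a scan
p_func = plan("SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")
print(p_func)

# Raw column compared against a half-open range: the index is used
p_range = plan("SELECT * FROM orders "
               "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")
print(p_range)
```

The function form scans; the half-open range form names `idx_date`, the same before/after contrast that the EXPLAIN verification above shows for MySQL.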
Q7. What is a query execution plan? How do you read it? [4 Marks]
Answer: A query execution plan (also called an explain plan) is the database engine's step-by-step strategy for executing a SQL query. It shows which operations are performed, in what order, which indexes are used, how many rows are estimated vs actually processed, and where the cost is concentrated.
How to generate an execution plan:
-- MySQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 101;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 101; -- also runs the query
-- PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 101;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE customer_id = 101;
-- SQL Server
-- In SSMS: press Ctrl+M before running (graphical plan)
-- Or use:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 101;
Key fields to read in MySQL EXPLAIN output:
| Column | What it means | Red flags |
|---|---|---|
| type | Join/access type: how rows are found | ALL = full table scan ❌ |
| key | Index actually used | NULL = no index used ❌ |
| rows | Estimated rows examined | Very high number = expensive |
| Extra | Additional info | Using filesort, Using temporary ❌ |
| possible_keys | Indexes the optimizer considered | Helps diagnose index misses |
Access type hierarchy (best to worst in MySQL):
const > eq_ref > ref > range > index > ALL
  ↑                                     ↑
fastest                  slowest (full scan)
Reading a PostgreSQL EXPLAIN ANALYZE output:
EXPLAIN ANALYZE
SELECT c.name, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name;
-- Sample output interpretation:
-- HashAggregate (cost=1842.50..1867.50 rows=500 width=40)
-- (actual time=145.2..148.3 rows=487 loops=1)
-- -> Hash Join (cost=450.00..1792.50 rows=10000)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Index Scan using idx_order_date on orders
-- Index Cond: (order_date >= '2024-01-01')
-- -> Seq Scan on customers (cost=0..35.00 rows=500)
What to look for when optimizing:
-- 🔴 Warning signs in execution plans:
-- 1. Table Scan / Seq Scan on large tables: missing index
-- 2. Key Lookup / RID Lookup: non-covering index, add INCLUDE columns
-- 3. Hash Match / Sort: missing index supporting ORDER BY or GROUP BY
-- 4. Estimated rows << Actual rows: stale statistics, run ANALYZE/UPDATE STATISTICS
-- 5. Nested Loops on large tables: consider Hash Join, check join column indexes
Fix stale statistics (affects plan accuracy):
-- MySQL
ANALYZE TABLE orders;
-- PostgreSQL
ANALYZE orders;
-- SQL Server
UPDATE STATISTICS orders;
Q8. Explain the difference between a table scan, index scan, and index seek. [4 Marks]
Answer:
| Operation | What it does | Reads | Speed | When it occurs |
|---|---|---|---|---|
| Table Scan (Full Scan) | Reads every row in the table from start to end | All rows | 🔴 Slowest | No usable index, SELECT *, very low cardinality filter |
| Index Scan | Reads all or most entries in an index sequentially | All index pages | 🟡 Medium | Index exists but filter is broad; or SELECT needs ordered results |
| Index Seek | Navigates the B-tree index directly to the exact matching rows | Only matching index pages | 🟢 Fastest | High-cardinality equality or narrow range filter on indexed column |
Table Scan reads every row:
-- No index on 'notes' column â must read every row
SELECT * FROM orders WHERE notes LIKE '%urgent%';
-- EXPLAIN output: type = ALL (MySQL), Seq Scan (PostgreSQL)
-- Also occurs when:
-- 1. The filter returns >30% of rows (optimizer prefers scan over index)
-- 2. No index exists on the filtered column
-- 3. Function applied to indexed column defeats index usage
Index Scan traverses the index but reads many entries:
-- Index on status column, but 'Active' represents 80% of rows
SELECT customer_id FROM customers WHERE status = 'Active';
-- Optimizer may choose Index Scan (reads most of the index)
-- or even revert to Table Scan if selectivity is too low
-- EXPLAIN output: type = index (MySQL), Index Scan (PostgreSQL)
Index Seek pinpoints exact rows via B-tree navigation:
-- High cardinality column with equality filter
-- Index on customer_id
SELECT order_id, amount FROM orders WHERE customer_id = 101;
-- B-tree navigation: root → branch → leaf → exact rows
-- EXPLAIN output: type = ref or const (MySQL), Index Scan with specific rows (PostgreSQL)
-- Range seek on indexed date column
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- EXPLAIN output: type = range (MySQL), Index Scan with bounds (PostgreSQL)
Visual: B-tree index seek navigation:
[Root Node]
/ \
[Branch 1] [Branch 2]
id: 1–500 id: 501–1000
| |
[Leaf Pages] [Leaf Pages]
id=101 → row id=501 → row
Seek for customer_id = 101:
Root → Branch 1 → Leaf Page → Row pointer → Fetch row
Only 3 pages read instead of thousands
Optimization goal: Always aim for Index Seek on large tables. If EXPLAIN shows Table Scan or Index Scan on a frequently run query against a large table, investigate whether an appropriate index is missing or whether the query is defeating an existing index through functions or type mismatches.
Q9. Explain how EXISTS vs IN can differ in performance for large datasets. [4 Marks]
Answer:
| Behavior | IN | EXISTS |
|---|---|---|
| Execution strategy | Collects all matching values from the subquery first, then checks membership | Runs the subquery row by row and stops at the first match (short-circuits) |
| NULL handling | Dangerous: if the subquery returns any NULL, NOT IN returns no rows | Safe: EXISTS only checks row existence, NULLs don't affect it |
| Best for | Small subquery result sets | Large subquery result sets; correlated conditions |
| Correlated subquery | ❌ Not correlated, evaluates once | ✅ Naturally correlated, can reference the outer query |
| Duplicate handling | IN deduplicates internally | EXISTS stops after the first match, duplicates irrelevant |
Performance scenarios:
Scenario 1: Small subquery (IN is fine):
-- Subquery returns only a few rows â IN is efficient
SELECT name FROM customers
WHERE customer_id IN (101, 102, 103); -- or small static list
-- Also fine: small reference table
SELECT name FROM customers
WHERE country_code IN (
SELECT code FROM active_countries -- 50 rows: negligible
);
Scenario 2: Large subquery (EXISTS wins):
-- ❌ IN: collects ALL order customer_ids (millions) into memory, then checks each customer
SELECT name FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders -- 10 million rows loaded into memory first
WHERE amount > 1000
);
-- ✅ EXISTS: for each customer, stops scanning orders as soon as the first match is found
SELECT name FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 1000 -- stops at first matching order per customer
);
Scenario 3: NULL trap with NOT IN:
-- ❌ DANGEROUS: if the subquery returns even ONE NULL, NOT IN returns zero rows
SELECT name FROM customers
WHERE customer_id NOT IN (
SELECT manager_id FROM employees -- if ANY manager_id is NULL, nothing is returned
);
-- ✅ SAFE: NOT EXISTS handles NULLs correctly
SELECT name FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM employees e
WHERE e.manager_id = c.customer_id
);
-- ✅ SAFE fix for NOT IN: explicitly exclude NULLs
WHERE customer_id NOT IN (
SELECT manager_id FROM employees
WHERE manager_id IS NOT NULL -- critical safety addition
);
Modern optimizers: Many modern databases (PostgreSQL 9.x+, SQL Server 2014+, MySQL 8.x) can internally rewrite both IN and EXISTS as semi-joins with similar efficiency. However, the NULL behavior difference is always a real risk and cannot be optimized away: NOT IN vs NOT EXISTS is always a correctness concern regardless of optimizer version.
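The NOT IN NULL trap is easy to reproduce in any SQL engine. A minimal SQLite sketch with hypothetical customers and one NULL `manager_id`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INT, name TEXT)")
conn.execute("CREATE TABLE employees (emp_id INT, manager_id INT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Cara")])
# Customer 1 is a manager; one employee row has a NULL manager_id
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(10, 1), (11, None)])

# ❌ NOT IN against a set containing NULL: zero rows, silently
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id NOT IN (SELECT manager_id FROM employees)
""").fetchall()
print(not_in)  # []

# ✅ NOT EXISTS is unaffected by the NULL
not_exists = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM employees e
                      WHERE e.manager_id = c.customer_id)
""").fetchall()
print(not_exists)  # [('Bob',), ('Cara',)]
```

`2 NOT IN (1, NULL)` evaluates to `2 <> 1 AND 2 <> NULL`, which is NULL rather than TRUE, so every non-manager is silently dropped; NOT EXISTS returns the expected two rows.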
Q10. What is query hoisting? Give an example of rewriting a subquery for better performance. [3 Marks]
Answer: Query hoisting (also called subquery unnesting or subquery flattening) is the optimization technique of moving a subquery out of its nested position into the main query, typically converting it into a JOIN, CTE, or derived table. This allows the optimizer to create a more efficient execution plan, avoid repeated subquery execution, and leverage indexes more effectively.
Problem: the correlated subquery executes once per outer row:
-- ❌ SLOW: correlated subquery runs for EVERY employee row (N executions)
SELECT
emp_id,
name,
salary,
(SELECT AVG(salary)
FROM employees e2
WHERE e2.department = e1.department) AS dept_avg -- re-executes per row
FROM employees e1;
-- On a table with 100,000 employees: subquery runs 100,000 times
Hoisted version: compute once, then JOIN:
-- ✅ FAST: hoist the subquery into a JOIN; department averages are computed once
SELECT
e.emp_id,
e.name,
e.salary,
dept.avg_salary AS dept_avg
FROM employees e
JOIN (
SELECT
department,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department -- runs ONCE, produces ~10-50 rows
) AS dept
ON e.department = dept.department;
-- Subquery runs once, its result is joined: massive improvement
Another example: IN subquery hoisted to a JOIN:
-- ❌ SLOW: IN subquery may be re-evaluated or loads a large set into memory
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE amount > 10000
);
-- ✅ FAST: hoisted to a JOIN (or EXISTS)
SELECT DISTINCT
c.customer_id,
c.name
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.amount > 10000;
CTE hoisting for repeated subquery use:
-- ❌ BAD: same subquery written three times in different parts of a query
SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM vip_customers)
AND amount > (SELECT AVG(amount) FROM orders WHERE customer_id IN (SELECT customer_id FROM vip_customers));
-- ✅ GOOD: hoist into a CTE, computed once and referenced multiple times
WITH vip_ids AS (
SELECT customer_id FROM vip_customers
),
vip_avg AS (
SELECT AVG(amount) AS avg_amount
FROM orders
WHERE customer_id IN (SELECT customer_id FROM vip_ids)
)
SELECT o.*
FROM orders o
JOIN vip_ids v ON o.customer_id = v.customer_id
CROSS JOIN vip_avg va
WHERE o.amount > va.avg_amount;
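That the hoisted form is equivalent to the correlated form can be verified mechanically; a SQLite sketch with hypothetical salaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Sales", 100.0), (2, "Sales", 200.0), (3, "HR", 300.0),
])

# Correlated form: the AVG subquery is evaluated per outer row
correlated = conn.execute("""
    SELECT emp_id, salary,
           (SELECT AVG(salary) FROM employees e2
            WHERE e2.department = e1.department) AS dept_avg
    FROM employees e1 ORDER BY emp_id
""").fetchall()

# Hoisted form: department averages computed once in a derived table, then joined
hoisted = conn.execute("""
    SELECT e.emp_id, e.salary, d.avg_salary AS dept_avg
    FROM employees e
    JOIN (SELECT department, AVG(salary) AS avg_salary
          FROM employees GROUP BY department) d
      ON e.department = d.department
    ORDER BY e.emp_id
""").fetchall()

print(correlated == hoisted)  # identical rows, cheaper plan for the hoisted form
```

Both queries return the same rows; the hoisted version simply computes the aggregate once instead of once per employee.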
Q11. How would you identify slow queries in a production database? [3 Marks]
Answer: Identifying slow queries requires a combination of database-native monitoring tools, logging configuration, and query analysis techniques.
Method 1: Enable and read the Slow Query Log (MySQL):
-- Enable slow query log (in my.cnf or dynamically)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- log queries taking > 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
-- Analyze the log using mysqldumpslow tool (command line):
-- mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
-- Shows top 10 slowest queries sorted by total time
-- Or query the Performance Schema
SELECT
DIGEST_TEXT AS query,
COUNT_STAR AS executions,
ROUND(AVG_TIMER_WAIT / 1e12, 4) AS avg_seconds,
ROUND(SUM_TIMER_WAIT / 1e12, 4) AS total_seconds,
SUM_ROWS_EXAMINED AS total_rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
Method 2: pg_stat_statements (PostgreSQL):
-- Enable extension (once, in postgresql.conf: shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slowest queries by total execution time
SELECT
query,
calls,
ROUND(total_exec_time::NUMERIC, 2) AS total_ms,
ROUND(mean_exec_time::NUMERIC, 2) AS avg_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Find queries with worst rows-examined to rows-returned ratio (inefficiency indicator)
SELECT
query,
calls,
rows,
ROUND(mean_exec_time::NUMERIC, 2) AS avg_ms
FROM pg_stat_statements
WHERE calls > 100 -- only frequently run queries
ORDER BY mean_exec_time DESC
LIMIT 20;
Method 3: SQL Server Dynamic Management Views:
-- Top 20 most expensive queries by total CPU time
SELECT TOP 20
qs.total_worker_time / qs.execution_count AS avg_cpu_time,
qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time,
qs.execution_count,
qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
SUBSTRING(
st.text,
(qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1
) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_cpu_time DESC;
Method 4: EXPLAIN ANALYZE on suspected queries:
-- Once a slow query is identified, analyze its plan
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name, SUM(o.amount)
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;
-- Look for: Seq Scan on large tables, high actual vs estimated row mismatch,
-- Hash Batches > 1 (spilling to disk), Sort operations on large sets
Production monitoring stack: In practice, slow query identification is also handled by APM tools like Datadog, New Relic, pgBadger (PostgreSQL), Percona Monitoring and Management, or cloud-native tools like AWS RDS Performance Insights and Azure Query Performance Insight, which provide visual dashboards, query fingerprinting, and trend analysis.
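When no server-side slow log is available, a thin application-side wrapper can approximate one. A minimal sketch (the threshold, table, and logging destination are illustrative, not a production pattern):

```python
import sqlite3
import time

# Illustrative threshold: log every statement slower than this many ms.
# Real systems use something like 500 ms; 0.0 here so the demo always logs.
SLOW_MS = 0.0
slow_log = []  # (elapsed_ms, sql) pairs, most recent last

class TimedConnection:
    """Wraps a sqlite3 connection and records statements exceeding SLOW_MS."""
    def __init__(self, conn):
        self._conn = conn

    def execute(self, sql, params=()):
        start = time.perf_counter()
        cursor = self._conn.execute(sql, params)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > SLOW_MS:
            slow_log.append((round(elapsed_ms, 3), sql))
        return cursor

conn = TimedConnection(sqlite3.connect(":memory:"))
conn.execute("CREATE TABLE t (x INT)")
conn.execute("INSERT INTO t VALUES (1), (2)")

# Inspect the captured "slow query log"
for elapsed_ms, sql in slow_log:
    print(f"{elapsed_ms:8.3f} ms  {sql}")
```

This mirrors what `long_query_time` and `pg_stat_statements` do server-side: capture statement text plus timing, so the worst offenders can be pulled out for EXPLAIN ANALYZE.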
Q12. You have a query that joins 5 tables and runs in 45 seconds. Walk through your approach to optimizing it. [5 Marks]
Answer: A systematic, step-by-step optimization approach:
Step 1: Capture and understand the query baseline:
-- Record current performance metrics before any changes
-- MySQL
SHOW STATUS LIKE 'Handler_read%'; -- rows examined metrics
EXPLAIN ANALYZE <your slow query>; -- execution plan + actual timings
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<your slow query>;
-- Note: Seq Scan tables, actual vs estimated row counts, total cost
Step 2: Analyze the execution plan for problem areas:
-- Red flags to look for in EXPLAIN output:
-- ❌ Table Scan / Seq Scan on large tables
-- ❌ Key Lookup / RID Lookup (non-covering index)
-- ❌ Hash Match or Sort on large row sets
-- ❌ Estimated rows << Actual rows (stale statistics)
-- ❌ Nested Loop join on large tables (should be Hash Join)
-- ❌ Early joins producing huge intermediate result sets
-- Example 5-table join to examine:
SELECT
c.name, o.order_date, oi.quantity,
p.product_name, cat.category_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
WHERE o.order_date >= '2024-01-01'
AND cat.category_name = 'Electronics';
Step 3 – Fix missing or inefficient indexes:
-- Index every JOIN key and WHERE filter column
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_order_items_order ON order_items (order_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);
CREATE INDEX idx_products_category ON products (category_id);
CREATE INDEX idx_categories_name ON categories (category_name);
-- Add covering indexes for SELECT columns to eliminate Key Lookups
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (order_id); -- SQL Server syntax
-- Re-run EXPLAIN after adding indexes to verify index seeks
Step 4 – Filter early using CTEs or derived tables:
-- ✅ Push filters as early as possible to reduce rows before expensive joins
WITH filtered_orders AS (
-- Filter BEFORE joining – drastically reduces rows in subsequent joins
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date >= '2024-01-01' -- eliminate old orders immediately
),
electronics_products AS (
-- Pre-filter products to Electronics category
SELECT p.product_id, p.product_name
FROM products p
JOIN categories cat ON p.category_id = cat.category_id
WHERE cat.category_name = 'Electronics'
)
SELECT
c.name,
fo.order_date,
oi.quantity,
ep.product_name
FROM filtered_orders fo
JOIN customers c ON fo.customer_id = c.customer_id
JOIN order_items oi ON fo.order_id = oi.order_id
JOIN electronics_products ep ON oi.product_id = ep.product_id;
Step 5 – Update statistics and check join order:
-- Stale statistics cause the optimizer to make wrong join order decisions
-- MySQL
ANALYZE TABLE orders, order_items, products, customers, categories;
-- PostgreSQL
ANALYZE orders; ANALYZE order_items; ANALYZE products;
-- SQL Server
UPDATE STATISTICS orders;
UPDATE STATISTICS order_items;
-- Or rebuild all indexes (also updates statistics):
ALTER INDEX ALL ON orders REBUILD;
Step 6 – Consider materialization for repeated sub-results:
-- If the 5-table join is run frequently with same base data,
-- materialize intermediate results
-- PostgreSQL: Materialized CTE (forces evaluation once)
WITH expensive_join AS MATERIALIZED (
SELECT ...
FROM orders JOIN order_items ... JOIN products ...
WHERE order_date >= '2024-01-01'
)
SELECT c.name, ej.order_date ...
FROM customers c JOIN expensive_join ej ON ...;
-- Or create a summary/aggregation table and refresh periodically
CREATE TABLE daily_order_summary AS
SELECT DATE(order_date) AS day, category_id, SUM(amount) AS daily_total
FROM orders JOIN order_items ... JOIN products ...
GROUP BY DATE(order_date), category_id;
Step 7 – Measure improvement and document:
| Step taken | Time before | Time after |
|---|---|---|
| Added indexes on join keys | 45s | 18s |
| Added covering index | 18s | 8s |
| Pushed filters into CTEs | 8s | 3s |
| Updated statistics | 3s | 2.1s |
Systematic principle: Never guess – always measure with EXPLAIN ANALYZE before and after each change. Change one thing at a time so you can attribute the improvement (or regression) to a specific action.
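The "measure before and after" loop can be reproduced end-to-end from Python. This is an illustrative sketch, not the MySQL/PostgreSQL workflow above: the schema is a hypothetical two-table subset of the join, and SQLite's EXPLAIN QUERY PLAN stands in for EXPLAIN ANALYZE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INT, order_date TEXT);
""")

query = """
    SELECT c.name, o.order_date
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes that step
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)  # orders has no usable index: full scan of orders
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date, customer_id)")
plan_after = plan(query)   # the new index turns the scan into an index search
```

Running this shows the plan change attributable to one specific action (the index), which is exactly the one-change-at-a-time discipline described above.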
Q13. A query uses "WHERE LOWER(email) = 'test@example.com'" on a table with 10 million rows. Explain why it is slow and how to fix it. [5 Marks]
Answer:
Why it is slow – the function defeats the index:
-- This query is slow even with an index on email:
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';
-- Root cause:
-- 1. An index on 'email' stores values in their original case ('Test@Example.COM', 'TEST@EXAMPLE.COM')
-- 2. LOWER(email) is a function applied to the column at query time
-- 3. The database CANNOT use the index on 'email' because the indexed values
-- and the computed values LOWER(email) are different
-- 4. The database must read ALL 10 million rows, apply LOWER() to each,
-- then compare – this is a full table scan = millions of function calls
-- 5. EXPLAIN shows: Seq Scan / Table Scan, rows examined = 10,000,000
Fix 1 – Enforce consistent data storage (best long-term fix):
-- Normalize emails to lowercase at INSERT/UPDATE time
-- Add a CHECK constraint or trigger to enforce lowercase storage
-- MySQL: Add constraint to enforce lowercase
ALTER TABLE customers
ADD CONSTRAINT chk_email_lowercase
CHECK (email = LOWER(email)); -- note: CHECK constraints are enforced only in MySQL 8.0.16+
-- Application-level enforcement (pseudo-code):
-- Before INSERT: email = email.toLowerCase().trim()
-- Once all emails are lowercase, query without LOWER():
SELECT customer_id, name
FROM customers
WHERE email = 'test@example.com'; -- uses index directly ✅
-- EXPLAIN: Index Seek, rows examined = 1
Fix 2 – Create a functional/expression index:
-- PostgreSQL: Create an index on LOWER(email) expression
CREATE INDEX idx_customers_email_lower
ON customers (LOWER(email));
-- Now this query uses the expression index:
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';
-- EXPLAIN: Index Scan using idx_customers_email_lower ✅
-- MySQL 8.0+: Functional index
CREATE INDEX idx_customers_email_lower
ON customers ((LOWER(email))); -- note double parentheses
-- SQL Server: Computed column + index
ALTER TABLE customers
ADD email_lower AS LOWER(email) PERSISTED; -- stored computed column
CREATE INDEX idx_customers_email_lower
ON customers (email_lower);
-- Query using computed column:
SELECT customer_id, name
FROM customers
WHERE email_lower = 'test@example.com'; -- uses index ✅
Fix 3 – Use case-insensitive collation:
-- MySQL: Set column collation to case-insensitive (ci suffix)
ALTER TABLE customers
MODIFY email VARCHAR(255)
COLLATE utf8mb4_unicode_ci; -- ci = case insensitive
-- Now standard equality is case-insensitive AND index-friendly:
SELECT customer_id, name
FROM customers
WHERE email = 'Test@Example.COM'; -- matches 'test@example.com' ✅
-- EXPLAIN: uses index on email column ✅
-- PostgreSQL: Use CITEXT extension
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE customers
ALTER COLUMN email TYPE citext; -- case-insensitive text type
-- Standard index works with case-insensitive comparison:
CREATE INDEX idx_email ON customers (email);
WHERE email = 'Test@Example.COM'; -- uses index, case-insensitive ✅
Performance comparison – 10 million row table:
| Approach | Index used | Rows examined | Approx query time |
|---|---|---|---|
| WHERE LOWER(email) = '...' (no fix) | ❌ None | 10,000,000 | ~45 seconds |
| Functional index on LOWER(email) | ✅ Expression index | 1 | ~0.001 seconds |
| Case-insensitive collation + index | ✅ Regular index | 1 | ~0.001 seconds |
| Normalized lowercase storage + index | ✅ Regular index | 1 | ~0.001 seconds |
Verification after fix:
-- PostgreSQL: confirm index is used
EXPLAIN ANALYZE
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';
-- Should show: Index Scan using idx_customers_email_lower
-- Rows Removed by Filter: 0
-- Actual rows: 1, Actual time: ~0.05ms
-- MySQL: confirm no full scan
EXPLAIN
SELECT customer_id, name FROM customers
WHERE email = 'test@example.com'; -- after normalization fix
-- type: ref   key: idx_email   rows: 1   Extra: Using index ✅
Production recommendation: Use all three fixes together for maximum robustness – normalize email storage to lowercase in the application layer (prevents dirty data), add a functional index as a safety net (handles any existing mixed-case data), and add a database constraint to prevent future case-inconsistent inserts. This is a defense-in-depth approach that eliminates the problem at every layer.
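The scan-vs-expression-index behavior described above can be reproduced in SQLite, which (like PostgreSQL) supports indexes on expressions. A minimal sketch with made-up sample emails; idx_email_lower is the expression index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_email ON customers (email)")  # plain index on raw column
conn.executemany("INSERT INTO customers (email) VALUES (?)",
                 [("Test@Example.COM",), ("other@example.com",)])

def plan(sql):
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT customer_id FROM customers WHERE LOWER(email) = 'test@example.com'"

scan_plan = plan(q)  # the plain index cannot serve LOWER(email): whole table read

# Expression index on the same function used in the predicate
conn.execute("CREATE INDEX idx_email_lower ON customers (LOWER(email))")
seek_plan = plan(q)  # now the predicate is served by idx_email_lower

row = conn.execute(q).fetchone()  # finds the mixed-case row via LOWER()
```

The same query text goes from not mentioning the expression index to using it, mirroring the EXPLAIN verification step shown for PostgreSQL and MySQL.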
Day 9: Advanced Topics – Pivoting, JSON, Stored Procedures & Transactions
Here are all the answers formatted for direct copy-paste:
Q1. Explain the PIVOT operator. Write a query to pivot monthly sales into columns (Jan, Feb, Mar...). [5 Marks]
Answer: The PIVOT operator rotates rows into columns – it takes distinct values from one column and spreads them across multiple columns, aggregating related data in the process. It transforms a "tall" (normalized) table format into a "wide" (crosstab/spreadsheet) format.
Before PIVOT – tall format (rows per month):
| year | month | total_sales |
|---|---|---|
| 2024 | Jan | 120000 |
| 2024 | Feb | 95000 |
| 2024 | Mar | 140000 |
After PIVOT – wide format (months as columns):
| year | Jan | Feb | Mar | Apr | ... |
|---|---|---|---|---|---|
| 2024 | 120000 | 95000 | 140000 | 110000 | ... |
SQL Server – Native PIVOT operator:
-- Assumption: orders(order_id, customer_id, order_date, amount)
WITH monthly_sales AS (
SELECT
YEAR(order_date) AS sale_year,
FORMAT(order_date, 'MMM') AS sale_month, -- 'Jan', 'Feb', etc.
amount
FROM orders
WHERE YEAR(order_date) = 2024
)
SELECT
sale_year,
ISNULL([Jan], 0) AS Jan,
ISNULL([Feb], 0) AS Feb,
ISNULL([Mar], 0) AS Mar,
ISNULL([Apr], 0) AS Apr,
ISNULL([May], 0) AS May,
ISNULL([Jun], 0) AS Jun,
ISNULL([Jul], 0) AS Jul,
ISNULL([Aug], 0) AS Aug,
ISNULL([Sep], 0) AS Sep,
ISNULL([Oct], 0) AS Oct,
ISNULL([Nov], 0) AS Nov,
ISNULL([Dec], 0) AS Dec
FROM monthly_sales
PIVOT (
SUM(amount) -- aggregate function
FOR sale_month IN ( -- column to pivot
[Jan],[Feb],[Mar],[Apr],[May],[Jun],
[Jul],[Aug],[Sep],[Oct],[Nov],[Dec]
)
) AS pivoted_sales
ORDER BY sale_year;
PostgreSQL – Using crosstab() from the tablefunc extension:
-- Enable extension (once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab(
-- Query 1: source data (must return row_id, category, value)
$$
SELECT
EXTRACT(YEAR FROM order_date)::INT AS sale_year,
TO_CHAR(order_date, 'Mon') AS sale_month,
SUM(amount) AS total_sales
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024
GROUP BY 1, 2
ORDER BY 1, 2
$$,
-- Query 2: distinct category values (column headers)
$$SELECT unnest(ARRAY['Jan','Feb','Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec'])$$
) AS pivoted (
sale_year INT,
"Jan" NUMERIC, "Feb" NUMERIC, "Mar" NUMERIC,
"Apr" NUMERIC, "May" NUMERIC, "Jun" NUMERIC,
"Jul" NUMERIC, "Aug" NUMERIC, "Sep" NUMERIC,
"Oct" NUMERIC, "Nov" NUMERIC, "Dec" NUMERIC
);
Q2. How would you pivot data in a database that does not support PIVOT (e.g., MySQL)? Use CASE WHEN. [5 Marks]
Answer: MySQL does not have a native PIVOT operator. The standard workaround is conditional aggregation – using CASE WHEN inside aggregate functions to selectively sum/count values for each target column.
Assumption: Table orders(order_id, customer_id, order_date, amount) exists.
MySQL – CASE WHEN pivot (manual column definition):
SELECT
YEAR(order_date) AS sale_year,
-- Each CASE extracts only the rows matching that month, then SUM aggregates them
ROUND(SUM(CASE WHEN MONTH(order_date) = 1
THEN amount ELSE 0 END), 2) AS Jan,
ROUND(SUM(CASE WHEN MONTH(order_date) = 2
THEN amount ELSE 0 END), 2) AS Feb,
ROUND(SUM(CASE WHEN MONTH(order_date) = 3
THEN amount ELSE 0 END), 2) AS Mar,
ROUND(SUM(CASE WHEN MONTH(order_date) = 4
THEN amount ELSE 0 END), 2) AS Apr,
ROUND(SUM(CASE WHEN MONTH(order_date) = 5
THEN amount ELSE 0 END), 2) AS May,
ROUND(SUM(CASE WHEN MONTH(order_date) = 6
THEN amount ELSE 0 END), 2) AS Jun,
ROUND(SUM(CASE WHEN MONTH(order_date) = 7
THEN amount ELSE 0 END), 2) AS Jul,
ROUND(SUM(CASE WHEN MONTH(order_date) = 8
THEN amount ELSE 0 END), 2) AS Aug,
ROUND(SUM(CASE WHEN MONTH(order_date) = 9
THEN amount ELSE 0 END), 2) AS Sep,
ROUND(SUM(CASE WHEN MONTH(order_date) = 10
THEN amount ELSE 0 END), 2) AS Oct,
ROUND(SUM(CASE WHEN MONTH(order_date) = 11
THEN amount ELSE 0 END), 2) AS Nov,
ROUND(SUM(CASE WHEN MONTH(order_date) = 12
THEN amount ELSE 0 END), 2) AS Dec,
-- Annual grand total column
ROUND(SUM(amount), 2) AS annual_total
FROM orders
GROUP BY YEAR(order_date)
ORDER BY sale_year;
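The conditional-aggregation pattern above can be exercised from Python against SQLite; strftime('%m', ...) stands in for MySQL's MONTH(), and the orders rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2024-01-15", 100.0), ("2024-01-20", 20.0),
    ("2024-02-05", 95.0),  ("2024-03-09", 140.0),
])

# Each SUM(CASE ...) keeps only the rows for one month; CAST drops the leading zero
row = conn.execute("""
    SELECT
        strftime('%Y', order_date) AS sale_year,
        SUM(CASE WHEN CAST(strftime('%m', order_date) AS INT) = 1
                 THEN amount ELSE 0 END) AS jan,
        SUM(CASE WHEN CAST(strftime('%m', order_date) AS INT) = 2
                 THEN amount ELSE 0 END) AS feb,
        SUM(CASE WHEN CAST(strftime('%m', order_date) AS INT) = 3
                 THEN amount ELSE 0 END) AS mar,
        SUM(amount) AS annual_total
    FROM orders
    GROUP BY strftime('%Y', order_date)
""").fetchone()
# row → ('2024', 120.0, 95.0, 140.0, 355.0)
```

Each CASE branch isolates one month's rows, exactly as the twelve ROUND(SUM(CASE ...)) columns do in the MySQL query.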
Extended – pivot by category AND month:
-- Pivot showing monthly sales broken down per product category
SELECT
p.category,
ROUND(SUM(CASE WHEN MONTH(o.order_date) = 1
THEN oi.quantity * oi.unit_price ELSE 0 END), 2) AS Jan,
ROUND(SUM(CASE WHEN MONTH(o.order_date) = 2
THEN oi.quantity * oi.unit_price ELSE 0 END), 2) AS Feb,
ROUND(SUM(CASE WHEN MONTH(o.order_date) = 3
THEN oi.quantity * oi.unit_price ELSE 0 END), 2) AS Mar,
-- ... continue for all 12 months
ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE YEAR(o.order_date) = 2024
GROUP BY p.category
ORDER BY total DESC;
Dynamic pivot using prepared statements (MySQL – when column values are unknown at query time):
-- Step 1: Generate the CASE expressions dynamically
SET @sql = NULL;
SELECT GROUP_CONCAT(
DISTINCT CONCAT(
'ROUND(SUM(CASE WHEN MONTH(order_date) = ',
MONTH(order_date),
' THEN amount ELSE 0 END), 2) AS `',
DATE_FORMAT(order_date, '%b'),
'`'
)
ORDER BY MONTH(order_date)
) INTO @sql
FROM orders
WHERE YEAR(order_date) = 2024;
-- Step 2: Build and execute the full query
SET @sql = CONCAT(
'SELECT YEAR(order_date) AS sale_year, ', @sql,
' FROM orders WHERE YEAR(order_date) = 2024 GROUP BY YEAR(order_date)'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
Why ELSE 0 matters: Without ELSE 0, the CASE returns NULL for non-matching months. SUM(NULL) returns NULL – not 0. Using ELSE 0 ensures months with no sales show 0 instead of NULL in the pivot output.
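The NULL-vs-0 difference is easy to verify. A tiny SQLite check with a single made-up February row, pivoting a January column with and without ELSE 0:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('2024-02-01', 50.0)")  # February only

no_else, with_else = conn.execute("""
    SELECT
        SUM(CASE WHEN strftime('%m', order_date) = '01' THEN amount END),
        SUM(CASE WHEN strftime('%m', order_date) = '01' THEN amount ELSE 0 END)
    FROM orders
""").fetchone()
# no_else is None (SQL NULL), with_else is 0
```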
Q3. What is UNPIVOT? Give a use case. [5 Marks]
Answer: UNPIVOT is the reverse of PIVOT – it transforms columns back into rows, converting a "wide" table with many columns into a "tall" normalized format. Each column value becomes a row, with an additional column identifying which original column the value came from.
Before UNPIVOT – wide format:
| year | Jan | Feb | Mar |
|---|---|---|---|
| 2024 | 120000 | 95000 | 140000 |
After UNPIVOT – tall normalized format:
| year | month | sales |
|---|---|---|
| 2024 | Jan | 120000 |
| 2024 | Feb | 95000 |
| 2024 | Mar | 140000 |
SQL Server – Native UNPIVOT operator:
-- Assumption: wide_sales(sale_year, Jan, Feb, Mar, Apr, May, Jun,
-- Jul, Aug, Sep, Oct, Nov, Dec)
SELECT
sale_year,
sale_month,
monthly_sales
FROM wide_sales
UNPIVOT (
monthly_sales -- new value column name
FOR sale_month IN ( -- new category column name, listing source columns
Jan, Feb, Mar, Apr, May, Jun,
Jul, Aug, Sep, Oct, Nov, Dec
)
) AS unpivoted
ORDER BY sale_year, sale_month;
MySQL / PostgreSQL – UNION ALL approach (manual unpivot):
-- Unpivot using UNION ALL (universal approach – works in all databases)
SELECT sale_year, 'Jan' AS sale_month, Jan AS monthly_sales FROM wide_sales
UNION ALL
SELECT sale_year, 'Feb', Feb FROM wide_sales
UNION ALL
SELECT sale_year, 'Mar', Mar FROM wide_sales
UNION ALL
SELECT sale_year, 'Apr', Apr FROM wide_sales
UNION ALL
SELECT sale_year, 'May', May FROM wide_sales
UNION ALL
SELECT sale_year, 'Jun', Jun FROM wide_sales
UNION ALL
SELECT sale_year, 'Jul', Jul FROM wide_sales
UNION ALL
SELECT sale_year, 'Aug', Aug FROM wide_sales
UNION ALL
SELECT sale_year, 'Sep', Sep FROM wide_sales
UNION ALL
SELECT sale_year, 'Oct', Oct FROM wide_sales
UNION ALL
SELECT sale_year, 'Nov', Nov FROM wide_sales
UNION ALL
SELECT sale_year, 'Dec', Dec FROM wide_sales
ORDER BY sale_year, sale_month;
PostgreSQL – Using unnest() for cleaner unpivot:
SELECT
sale_year,
unnest(ARRAY['Jan','Feb','Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec']) AS sale_month,
unnest(ARRAY[jan,feb,mar,apr,may,jun,
jul,aug,sep,oct,nov,dec]) AS monthly_sales
FROM wide_sales
ORDER BY sale_year;
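Because the UNION ALL approach is portable, it runs unchanged in SQLite too. A compressed three-month sketch of the wide_sales table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wide_sales (sale_year INT, jan REAL, feb REAL, mar REAL)")
conn.execute("INSERT INTO wide_sales VALUES (2024, 120000, 95000, 140000)")

# One SELECT per source column, each tagging its rows with the month label
rows = conn.execute("""
    SELECT sale_year, 'Jan' AS sale_month, jan AS monthly_sales FROM wide_sales
    UNION ALL
    SELECT sale_year, 'Feb', feb FROM wide_sales
    UNION ALL
    SELECT sale_year, 'Mar', mar FROM wide_sales
""").fetchall()
```

Each wide column becomes one tall row, matching the "After UNPIVOT" table shown earlier.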
Real-world use cases for UNPIVOT:
| Use Case | Description |
|---|---|
| Data normalization | Convert denormalized spreadsheet imports into relational rows |
| Reporting system ingestion | Transform pivot-style Excel exports back to row format for loading into a fact table |
| Survey data | Columns like q1_answer, q2_answer, q3_answer → rows per question |
| Financial data | Quarterly budget columns → normalized rows for time-series analysis |
Q4. How do you extract a value from a JSON column in SQL? Give examples in PostgreSQL and MySQL. [4 Marks]
Answer:
PostgreSQL – JSON operators:
-- Assumption: customers table has a JSON/JSONB column 'profile'
-- Sample JSON: {"name": "Alice", "age": 30, "address": {"city": "Mumbai", "zip": "400001"}}
-- -> operator: returns JSON object (keeps JSON type)
SELECT profile -> 'name' AS name_json -- returns "Alice" (quoted)
FROM customers;
-- ->> operator: returns TEXT value (most commonly used for comparisons)
SELECT profile ->> 'name' AS name_text -- returns Alice (plain text)
FROM customers;
-- Nested access: chain operators
SELECT profile -> 'address' ->> 'city' AS city -- returns Mumbai
FROM customers;
-- #> operator: access by path (array of keys)
SELECT profile #> '{address, city}' AS city_json
FROM customers;
-- #>> operator: access by path, return as text
SELECT profile #>> '{address, city}' AS city_text
FROM customers;
-- Filtering on JSON value
SELECT customer_id, profile ->> 'name' AS name
FROM customers
WHERE profile ->> 'age' = '30'; -- JSON values compared as text
-- JSONB-specific: @> containment operator (very fast with GIN index)
SELECT customer_id
FROM customers
WHERE profile @> '{"address": {"city": "Mumbai"}}';
MySQL – JSON functions:
-- JSON_EXTRACT: standard extraction using path syntax
SELECT
JSON_EXTRACT(profile, '$.name') AS name_json, -- returns "Alice" (quoted)
JSON_UNQUOTE(JSON_EXTRACT(profile, '$.name'))
AS name_text -- returns Alice (unquoted)
FROM customers;
-- Shorthand -> operator (MySQL 5.7.13+)
SELECT profile -> '$.name' AS name_json -- same as JSON_EXTRACT
FROM customers;
-- Shorthand ->> operator (MySQL 5.7.13+)
SELECT profile ->> '$.name' AS name_text -- same as JSON_UNQUOTE(JSON_EXTRACT)
FROM customers;
-- Nested access
SELECT profile ->> '$.address.city' AS city
FROM customers;
-- Array element access
SELECT profile ->> '$.tags[0]' AS first_tag -- first element of array
FROM customers;
-- Filtering
SELECT customer_id
FROM customers
WHERE JSON_EXTRACT(profile, '$.address.city') = '"Mumbai"'; -- JSON string includes quotes
-- Cleaner filtering with ->>
WHERE profile ->> '$.address.city' = 'Mumbai';
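The same path-based extraction works from Python via SQLite's json_extract, assuming a build with the JSON1 functions (true for any recent CPython). The sample profile JSON matches the one used above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, profile TEXT)")
conn.execute("""INSERT INTO customers (profile) VALUES
    ('{"name": "Alice", "age": 30, "address": {"city": "Mumbai", "zip": "400001"}}')""")

# json_extract returns string values already unquoted (like MySQL's ->> )
name, city = conn.execute("""
    SELECT json_extract(profile, '$.name'),
           json_extract(profile, '$.address.city')
    FROM customers
""").fetchone()

# Filtering on a nested JSON value
cid = conn.execute("""
    SELECT customer_id FROM customers
    WHERE json_extract(profile, '$.address.city') = 'Mumbai'
""").fetchone()[0]
```

SQLite's '$.address.city' path syntax is the same dollar-path notation MySQL uses, which makes it a convenient scratchpad for testing JSON paths.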
Q5. Write a query to extract the 'city' field from a JSON column 'address'. [3 Marks]
Answer:
Assumption: Table customers(customer_id, name, email, address) exists where address is a JSON/JSONB column.
Sample JSON value: {"street": "42 MG Road", "city": "Mumbai", "state": "Maharashtra", "zip": "400001", "country": "India"}
PostgreSQL (JSONB column – recommended type):
SELECT
customer_id,
name,
address ->> 'city' AS city,
address ->> 'state' AS state,
address ->> 'zip' AS zip_code,
address ->> 'country' AS country
FROM customers
WHERE address ->> 'city' IS NOT NULL
ORDER BY city;
-- Filter customers in a specific city
SELECT customer_id, name
FROM customers
WHERE address ->> 'city' = 'Mumbai';
-- With GIN index for fast city filtering
CREATE INDEX idx_customer_address_gin
ON customers USING GIN (address);
-- Or a more targeted expression index for city specifically
CREATE INDEX idx_customer_city
ON customers ((address ->> 'city'));
MySQL:
SELECT
customer_id,
name,
address ->> '$.city' AS city,
address ->> '$.state' AS state,
address ->> '$.zip' AS zip_code,
address ->> '$.country' AS country
FROM customers
WHERE address ->> '$.city' IS NOT NULL
ORDER BY city;
-- Filter customers in a specific city
SELECT customer_id, name
FROM customers
WHERE address ->> '$.city' = 'Mumbai';
-- MySQL: create functional index on JSON path (MySQL 8.0+)
CREATE INDEX idx_customer_city
ON customers ((CAST(address ->> '$.city' AS CHAR(100))));
SQL Server (JSON stored as NVARCHAR):
SELECT
customer_id,
name,
JSON_VALUE(address, '$.city') AS city,
JSON_VALUE(address, '$.state') AS state,
JSON_VALUE(address, '$.zip') AS zip_code,
JSON_VALUE(address, '$.country') AS country
FROM customers
WHERE JSON_VALUE(address, '$.city') IS NOT NULL
ORDER BY JSON_VALUE(address, '$.city');
-- Filter
WHERE JSON_VALUE(address, '$.city') = 'Mumbai';
Q6. What are the advantages and disadvantages of storing data as JSON in a relational database? [3 Marks]
Answer:
Advantages:
-- 1. Schema flexibility – store variable or evolving attributes without ALTER TABLE
-- Different customers can have different fields in their profile JSON
INSERT INTO customers (customer_id, name, profile) VALUES
(1, 'Alice', '{"tier": "Gold", "preferences": ["email", "sms"]}'),
(2, 'Bob', '{"tier": "Silver", "company": "TechCorp", "gst_no": "27AAPFU0939F1ZV"}');
-- No need for 50 nullable columns to handle every possible attribute
-- 2. Nested/hierarchical data – represents complex structures naturally
-- Address with multiple lines, tags as arrays, nested metadata
'{"address": {"street": "42 MG Road", "city": "Mumbai"}, "tags": ["vip","early-adopter"]}'
-- 3. Reduces table count – avoids creating separate tables for sparse attributes
-- Instead of a customer_attributes EAV table (entity-attribute-value anti-pattern)
-- store flexible attributes directly in the JSON column
-- 4. API-friendly – JSON from external APIs can be stored and queried directly
-- without upfront schema design for every API field
Disadvantages:
-- 1. No referential integrity – JSON values cannot have foreign keys
-- No constraint enforcement inside JSON fields
-- 'city': 'Mumbay' – a typo is impossible to catch at the DB level
-- 2. Poor query performance without careful indexing
-- Full table scan required unless expression indexes or GIN indexes are created
-- Complex JSON queries are harder to optimize than relational queries
SELECT * FROM customers WHERE address ->> 'city' = 'Mumbai'; -- slow without index
-- 3. No JOIN capability – cannot JOIN on a JSON field value efficiently
-- Cannot enforce relationships between JSON values and other tables
-- 4. Data duplication and inconsistency – same city name stored as
-- 'Mumbai', 'mumbai', 'MUMBAI', 'Bombay' with no normalization
-- 5. Harder to aggregate – SUM, COUNT, GROUP BY on JSON values require
-- extraction functions at every step – verbose and error-prone
SELECT address ->> 'city', COUNT(*)
FROM customers
GROUP BY address ->> 'city'; -- works but awkward at scale
-- 6. Schema documentation is implicit – no explicit column list
-- New developers cannot see what fields exist by looking at table structure
Balanced guideline:
| Use JSON for | Avoid JSON for |
|---|---|
| Truly variable/optional attributes | Core business data (names, dates, amounts) |
| External API payload storage | Data that needs JOINs or foreign keys |
| Audit logs and event payloads | Columns used in GROUP BY or aggregations frequently |
| Sparse attributes (most rows NULL) | Data requiring referential integrity |
Q7. What is the difference between a stored procedure and a user-defined function? [3 Marks]
Answer:
| Feature | Stored Procedure | User-Defined Function (UDF) |
|---|---|---|
| Return value | Can return zero, one, or multiple result sets; uses OUT parameters | Must return exactly one value (scalar) or a table (table-valued function) |
| Use in SELECT | ❌ Cannot be called inside a SELECT statement | ✅ Can be used in SELECT, WHERE, JOIN |
| DML operations | ✅ Can execute INSERT, UPDATE, DELETE, DDL | ⚠️ Scalar UDFs cannot modify data (read-only in most databases) |
| Transaction control | ✅ Can use COMMIT, ROLLBACK, SAVEPOINT | ❌ Cannot manage transactions |
| Error handling | ✅ Full TRY/CATCH or EXCEPTION handling | Limited error handling |
| Called with | CALL proc_name() or EXEC proc_name | SELECT func_name() or inline in expressions |
| Purpose | Complex business workflows, batch operations | Reusable calculations, transformations, lookups |
-- STORED PROCEDURE: executes a workflow, cannot be used inline in SELECT
CALL calculate_monthly_bonuses(2024, 'Q1');
EXEC sp_process_refund @order_id = 1001;
-- USER-DEFINED FUNCTION: returns a value, usable anywhere an expression is valid
SELECT
customer_id,
name,
get_customer_tier(customer_id) AS tier, -- scalar UDF in SELECT ✅
calculate_tax(amount, 'GST') AS tax
FROM orders
WHERE get_customer_tier(customer_id) = 'Gold'; -- UDF in WHERE ✅
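The "usable anywhere an expression is valid" property of scalar UDFs can be demonstrated with sqlite3's create_function, which registers a Python callable as a SQL scalar function. The get_tier function and its 10000 threshold are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 60000.0), (2, 500.0)])

# Hypothetical tier rule, registered as a one-argument scalar UDF named get_tier
def tier(amount):
    return "Gold" if amount >= 10000 else "Standard"

conn.create_function("get_tier", 1, tier)

rows = conn.execute("""
    SELECT customer_id, get_tier(amount)   -- UDF inline in SELECT
    FROM orders
    WHERE get_tier(amount) = 'Gold'        -- UDF in WHERE
""").fetchall()
# rows → [(1, 'Gold')]
```

A stored procedure could not be embedded in either position; that placement flexibility is exactly the SELECT/WHERE row of the comparison table.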
Q8. Write a stored procedure that takes a customer_id and returns their total order value. [5 Marks]
Answer:
MySQL:
DELIMITER $$
CREATE PROCEDURE sp_get_customer_total_order_value(
IN p_customer_id INT,
OUT p_total_value DECIMAL(15, 2),
OUT p_order_count INT,
OUT p_customer_name VARCHAR(255)
)
BEGIN
-- Error handling
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
SET p_total_value = NULL;
SET p_order_count = 0;
SET p_customer_name = NULL;
RESIGNAL;
END;
-- Validate customer exists
IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = p_customer_id) THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Customer not found';
END IF;
-- Get customer name
SELECT name INTO p_customer_name
FROM customers
WHERE customer_id = p_customer_id;
-- Calculate total order value and count
SELECT
COALESCE(SUM(amount), 0),
COUNT(order_id)
INTO
p_total_value,
p_order_count
FROM orders
WHERE customer_id = p_customer_id;
END$$
DELIMITER ;
-- Calling the procedure:
CALL sp_get_customer_total_order_value(101, @total, @count, @name);
SELECT @name AS customer_name, @total AS total_order_value, @count AS order_count;
PostgreSQL (using PROCEDURE and INOUT, or a FUNCTION):
-- PostgreSQL prefers functions for returning data
CREATE OR REPLACE FUNCTION fn_get_customer_total_order_value(
p_customer_id INT
)
RETURNS TABLE (
customer_name VARCHAR,
total_order_value NUMERIC,
order_count BIGINT,
first_order_date DATE,
last_order_date DATE,
avg_order_value NUMERIC
)
LANGUAGE plpgsql
AS $$
BEGIN
-- Validate customer exists
IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = p_customer_id) THEN
RAISE EXCEPTION 'Customer with ID % not found', p_customer_id;
END IF;
-- Return result set
RETURN QUERY
SELECT
c.name AS customer_name,
COALESCE(SUM(o.amount), 0) AS total_order_value,
COUNT(o.order_id) AS order_count,
MIN(o.order_date)::DATE AS first_order_date,
MAX(o.order_date)::DATE AS last_order_date,
ROUND(AVG(o.amount)::NUMERIC, 2) AS avg_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = p_customer_id
GROUP BY c.name;
END;
$$;
-- Calling the function:
SELECT * FROM fn_get_customer_total_order_value(101);
SQL Server:
CREATE OR ALTER PROCEDURE sp_get_customer_total_order_value
@customer_id INT,
@total_value DECIMAL(15, 2) OUTPUT,
@order_count INT OUTPUT
AS
BEGIN
SET NOCOUNT ON;
BEGIN TRY
-- Validate
IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = @customer_id)
BEGIN
RAISERROR ('Customer ID %d not found.', 16, 1, @customer_id);
RETURN;
END
-- Calculate totals
SELECT
@total_value = COALESCE(SUM(amount), 0),
@order_count = COUNT(order_id)
FROM orders
WHERE customer_id = @customer_id;
-- Return full detail as a result set
SELECT
c.name AS customer_name,
@total_value AS total_order_value,
@order_count AS order_count,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = @customer_id
GROUP BY c.name;
END TRY
BEGIN CATCH
PRINT ERROR_MESSAGE();
THROW;
END CATCH
END;
-- Calling:
DECLARE @total DECIMAL(15,2), @cnt INT;
EXEC sp_get_customer_total_order_value
@customer_id = 101,
@total_value = @total OUTPUT,
@order_count = @cnt OUTPUT;
SELECT @total AS total_value, @cnt AS order_count;
Q9. What are the pros and cons of using stored procedures for business logic? [4 Marks]
Answer:
Pros:
-- 1. PERFORMANCE: Pre-compiled and cached execution plan
-- The database compiles and optimizes the procedure once, reuses plan on every call
-- Reduces parsing and planning overhead for frequently executed logic
-- 2. NETWORK EFFICIENCY: Only procedure name + parameters sent over the wire
-- Instead of sending 200 lines of SQL, client sends:
CALL sp_process_month_end_closing(2024, 3); -- single network round trip
-- vs. sending the entire 200-line query from application code each time
-- 3. SECURITY: Granular permission control without exposing base tables
-- Grant EXECUTE on procedure, not SELECT/INSERT/UPDATE on base tables
GRANT EXECUTE ON sp_process_refund TO app_user; -- app can execute
REVOKE SELECT ON orders FROM app_user; -- but cannot query directly
-- 4. REUSABILITY: Business logic defined once, used by multiple applications
-- Java app, Python script, and reporting tool all call the same procedure
-- Changing the logic means updating ONE place â not every application
-- 5. CONSISTENCY: Enforces business rules at the database layer
-- Even if an app developer bypasses the application layer, the DB enforces rules
Cons:
-- 1. VERSION CONTROL DIFFICULTY: Stored procedures live in the database
-- Harder to track changes in Git â require special tooling (Flyway, Liquibase)
-- "What changed in this procedure last month?" is hard to answer without tooling
-- 2. PORTABILITY: Procedural SQL is NOT standard across databases
-- PL/pgSQL (PostgreSQL), T-SQL (SQL Server), PL/SQL (Oracle), MySQL differ significantly
-- Migrating database engines requires rewriting ALL procedures
-- 3. TESTING DIFFICULTY: Unit testing stored procedures is cumbersome
-- No native unit test frameworks like Jest or pytest
-- Requires database access for every test â slow, stateful, fragile
-- 4. DEBUGGING: Limited debugging tools compared to application code
-- Debugging PL/pgSQL or T-SQL is much harder than stepping through Python/Java
-- Error messages are often less descriptive
-- 5. HIDDEN BUSINESS LOGIC: Logic split between application and database
-- New developers must look in two places to understand the full flow
-- Can lead to inconsistency if some logic is in app and some in procedures
-- 6. SCALABILITY: Procedures execute on the database server
-- Heavy business logic in DB consumes CPU/memory on the most critical shared resource
-- Application servers are easier (and cheaper) to scale horizontally
Verdict – when to use stored procedures:
| Use stored procedures for | Keep in application code |
|---|---|
| Complex multi-step DML operations | Simple CRUD queries |
| Batch jobs with large data volume | Business rules needing frequent changes |
| Security-sensitive data access | Logic requiring unit testing |
| DB-level audit trails and triggers | Cross-service or multi-DB workflows |
Q10. What is a trigger in SQL? Give a practical example of when you would use one. [3 Marks]
Answer: A trigger is a stored database object that automatically executes in response to a specific event (INSERT, UPDATE, or DELETE) on a table. Triggers fire automatically â they cannot be called manually and do not require application code to invoke them.
Trigger timing:
- BEFORE – fires before the DML operation (can modify values or cancel the operation)
- AFTER – fires after the DML operation completes successfully
Practical example 1 – Audit trail (log all changes to orders):
-- MySQL: Log every order status change to an audit table
CREATE TABLE order_audit_log (
log_id INT AUTO_INCREMENT PRIMARY KEY,
order_id INT,
old_status VARCHAR(50),
new_status VARCHAR(50),
changed_by VARCHAR(100), -- populated by the trigger via USER()
changed_at DATETIME DEFAULT CURRENT_TIMESTAMP -- NOW() is not valid as a column default
);
DELIMITER $$
CREATE TRIGGER trg_orders_status_audit
AFTER UPDATE ON orders
FOR EACH ROW
BEGIN
-- Only log when status actually changes
IF OLD.status != NEW.status THEN
INSERT INTO order_audit_log (
order_id, old_status, new_status, changed_by, changed_at
) VALUES (
NEW.order_id, OLD.status, NEW.status, USER(), NOW()
);
END IF;
END$$
DELIMITER ;
Practical example 2 – Maintain denormalized summary (auto-update stock):
-- PostgreSQL: Automatically decrement inventory when an order item is inserted
CREATE OR REPLACE FUNCTION trg_fn_decrement_stock()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
UPDATE products
SET stock_quantity = stock_quantity - NEW.quantity
WHERE product_id = NEW.product_id;
-- Raise error if stock goes negative
IF (SELECT stock_quantity FROM products WHERE product_id = NEW.product_id) < 0 THEN
RAISE EXCEPTION 'Insufficient stock for product_id %', NEW.product_id;
END IF;
RETURN NEW;
END;
$$;
CREATE TRIGGER trg_order_item_stock_decrement
AFTER INSERT ON order_items
FOR EACH ROW
EXECUTE FUNCTION trg_fn_decrement_stock();
Common trigger use cases:
| Use Case | Trigger Type |
|---|---|
| Audit logging of data changes | AFTER UPDATE / DELETE |
| Auto-updating updated_at timestamp | BEFORE UPDATE |
| Enforcing complex business rules | BEFORE INSERT / UPDATE |
| Maintaining denormalized summary tables | AFTER INSERT / UPDATE / DELETE |
| Cascading soft deletes | AFTER DELETE |
Caution: Triggers execute invisibly; developers running ad-hoc SQL may not realize a trigger is firing. Overuse leads to "magic behavior" that is hard to debug. Always document triggers clearly and use them sparingly for genuinely cross-cutting concerns.
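As a runnable illustration of the same audit-trail pattern, here is a minimal sketch in Python using SQLite, which supports AFTER UPDATE triggers with a WHEN clause. Table and column names mirror the MySQL example above; the data is invented for illustration.

```python
import sqlite3

# Minimal sketch of the audit-trail trigger idea, using SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_audit_log (
    log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id   INTEGER,
    old_status TEXT,
    new_status TEXT
);
-- Fire only when the status actually changes
CREATE TRIGGER trg_orders_status_audit
AFTER UPDATE ON orders
WHEN OLD.status <> NEW.status
BEGIN
    INSERT INTO order_audit_log (order_id, old_status, new_status)
    VALUES (NEW.order_id, OLD.status, NEW.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'PENDING')")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1")  # no change -> not logged
rows = conn.execute(
    "SELECT order_id, old_status, new_status FROM order_audit_log"
).fetchall()
print(rows)  # [(1, 'PENDING', 'SHIPPED')]
```

Note the WHEN clause plays the role of the `IF OLD.status != NEW.status` check in the MySQL body: the no-op update is never logged.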
Q11. Explain ACID properties with a real-world example. [4 Marks]
Answer: ACID is a set of four properties that guarantee database transactions are processed reliably, even in the event of errors, power failures, or concurrent access.
Real-world scenario: Bank transfer of ₹10,000 from Alice's account to Bob's account.
A - Atomicity ("all or nothing"):
-- Either BOTH operations succeed, or NEITHER happens
-- There is no partial state where Alice loses money but Bob doesn't receive it
BEGIN TRANSACTION;
UPDATE accounts
SET balance = balance - 10000
WHERE account_id = 'ALICE_001'; -- Step 1: Debit Alice
-- If the system crashes HERE, Atomicity ensures Step 1 is ROLLED BACK
-- Alice does not lose ₹10,000 without Bob receiving it
UPDATE accounts
SET balance = balance + 10000
WHERE account_id = 'BOB_001'; -- Step 2: Credit Bob
COMMIT; -- Both succeed → commit
-- OR
ROLLBACK; -- If any error → both reversed, as if transaction never happened
C - Consistency ("valid state to valid state"):
-- The database must move from one valid state to another valid state
-- Business rules (constraints) must hold before AND after the transaction
-- Example constraint: no account can have negative balance
ALTER TABLE accounts ADD CONSTRAINT chk_balance CHECK (balance >= 0);
-- If Alice has only ₹5,000 and tries to send ₹10,000:
-- The CHECK constraint fires → transaction REJECTED → database stays consistent
-- Total money in system: ₹X before = ₹X after (money is neither created nor destroyed)
I - Isolation ("concurrent transactions don't interfere"):
-- Two simultaneous transactions see a consistent snapshot of data
-- Transaction T1 (Alice → Bob transfer) and T2 (Alice → Carol transfer)
-- should not interfere with each other
-- Without isolation: both T1 and T2 read Alice's balance as ₹10,000
-- Both debit ₹10,000 → Alice ends up with -₹10,000 (race condition)
-- With isolation (SERIALIZABLE level):
-- T2 waits until T1 commits before reading Alice's balance
-- T2 reads ₹0 → rejects transfer → database remains consistent
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
-- T2 now sees committed state of T1 before proceeding
D - Durability ("committed data survives failures"):
-- Once COMMIT is executed, the transaction is permanently saved
-- Even if the server crashes 1 millisecond after COMMIT, the data is not lost
COMMIT;
-- At this point:
-- ✅ Transaction written to the Write-Ahead Log (WAL) / Redo Log on disk
-- ✅ Even if RAM is wiped by a crash, the committed data can be recovered from the log
-- ✅ On server restart, the DB replays the log to restore the committed state
ACID summary table:
| Property | Guarantee | Failure it prevents |
|---|---|---|
| Atomicity | All steps succeed or none do | Partial updates leaving inconsistent state |
| Consistency | DB moves between valid states only | Constraint violations from concurrent changes |
| Isolation | Concurrent transactions don't interfere | Race conditions, dirty reads |
| Durability | Committed data survives crashes | Data loss after power failure or crash |
Q12. What are the different isolation levels? Explain each briefly. [3 Marks]
Answer: Transaction isolation levels control how much one transaction is shielded from changes made by other concurrent transactions. Higher isolation = more data consistency, but less concurrency and more locking overhead.
The four standard isolation levels (least to most strict):
Level 1 - READ UNCOMMITTED (lowest isolation):
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- Can read data that another transaction has modified but NOT yet committed
-- Problem: DIRTY READ: reading data that may be rolled back
-- Example: T1 updates price to ₹500 (not committed) → T2 reads ₹500 → T1 rolls back → T2 used wrong price
-- Use case: Real-time dashboards where approximate values are acceptable
Level 2 - READ COMMITTED (default in PostgreSQL, SQL Server, Oracle):
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Only reads data that has been COMMITTED by other transactions
-- Prevents: Dirty reads ✅
-- Problem: NON-REPEATABLE READ: same query within a transaction can return different results
-- Example: T1 reads salary=50000 → T2 commits salary=60000 → T1 reads salary again → gets 60000
-- Use case: Most OLTP applications; good balance of consistency and performance
Level 3 - REPEATABLE READ (default in MySQL InnoDB):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Guarantees that if you read a row, re-reading it in the same transaction gives the same result
-- Prevents: Dirty reads ✅, Non-repeatable reads ✅
-- Problem: PHANTOM READ: a new row inserted by another transaction may appear in a repeated range query
-- Example: T1 counts orders WHERE amount > 1000 → gets 50
-- T2 inserts a new order with amount=5000 (commits)
-- T1 re-counts → gets 51 (phantom row appeared)
-- Use case: Financial reports needing consistent row-level reads within a transaction
Level 4 - SERIALIZABLE (highest isolation):
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transactions execute as if they were run one after another (serially), not concurrently
-- Prevents: Dirty reads ✅, Non-repeatable reads ✅, Phantom reads ✅
-- Problem: LOWEST concurrency: heaviest locking, most likely to cause deadlocks and timeouts
-- Use case: Bank transfers, inventory allocation, any scenario requiring complete data integrity
Summary: what each level prevents:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| READ UNCOMMITTED | ❌ Possible | ❌ Possible | ❌ Possible |
| READ COMMITTED | ✅ Prevented | ❌ Possible | ❌ Possible |
| REPEATABLE READ | ✅ Prevented | ✅ Prevented | ❌ Possible |
| SERIALIZABLE | ✅ Prevented | ✅ Prevented | ✅ Prevented |
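To make the "no dirty reads" row concrete, here is a small Python/SQLite sketch: two connections on a WAL-mode database, where the reader never sees the writer's uncommitted change. SQLite's snapshot behavior here is only roughly comparable to READ COMMITTED; the file path and data are illustrative.

```python
import os
import sqlite3
import tempfile

# Two connections on one WAL-mode database file:
# the reader sees only committed data (no dirty reads).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit BEGIN below
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER)")
writer.execute("INSERT INTO products VALUES (1, 100)")

reader = sqlite3.connect(path, isolation_level=None)

writer.execute("BEGIN")
writer.execute("UPDATE products SET price = 500 WHERE id = 1")  # NOT yet committed

dirty = reader.execute("SELECT price FROM products WHERE id = 1").fetchone()[0]
print(dirty)  # 100 -> the uncommitted 500 is invisible (dirty read prevented)

writer.execute("COMMIT")
committed = reader.execute("SELECT price FROM products WHERE id = 1").fetchone()[0]
print(committed)  # 500 -> committed data is now visible
```

The reader's first SELECT returns the old price even while the writer's transaction is open, which is exactly the dirty-read prevention the table above describes.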
Q13. What is a deadlock? How can it be prevented? [3 Marks]
Answer: A deadlock occurs when two or more transactions are permanently blocked, each waiting for the other to release a lock that the other transaction holds. Neither can proceed; they are stuck waiting for each other in a circular dependency.
Classic deadlock scenario:
-- Transaction T1 and T2 run concurrently
-- T1 execution sequence:
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE account_id = 'A'; -- T1 locks row A
-- T1 now waits for row B (held by T2)...
-- T2 execution sequence (concurrent):
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'B'; -- T2 locks row B
UPDATE accounts SET balance = balance + 500 WHERE account_id = 'A'; -- T2 WAITS for row A (held by T1)
-- Back to T1:
UPDATE accounts SET balance = balance + 1000 WHERE account_id = 'B'; -- T1 WAITS for row B (held by T2)
-- DEADLOCK: T1 waits for T2 to release B, T2 waits for T1 to release A
-- Neither can proceed → database detects cycle → kills one transaction
Prevention strategies:
Strategy 1 - Always access resources in the same consistent order:
-- ✅ Both T1 and T2 always lock accounts in ascending account_id order
-- T1: Lock A first, then B
-- T2: Lock A first, then B (waits for T1 to release A → no deadlock possible)
-- Application-level enforcement:
-- Sort the accounts being updated before executing:
-- accounts_to_update = sorted([account_A, account_B]) -- always same order
Strategy 2 - Keep transactions short and fast:
-- ✅ Minimize the time locks are held: do all computation BEFORE the transaction
-- ❌ Bad: long computation inside transaction holds locks for seconds
BEGIN;
-- complex calculation taking 30 seconds (locks held entire time)
SELECT ... expensive computation ...
UPDATE accounts SET balance = ...;
COMMIT;
-- ✅ Good: compute first, then lock briefly
DECLARE @new_balance DECIMAL = (compute new balance in application);
BEGIN;
UPDATE accounts SET balance = @new_balance WHERE account_id = 'A'; -- lock held milliseconds
COMMIT;
Strategy 3 - Use SELECT FOR UPDATE with SKIP LOCKED (queue processing):
-- PostgreSQL / MySQL: lock only available rows, skip already-locked ones
-- Prevents multiple workers from fighting over the same rows
SELECT order_id FROM pending_orders
ORDER BY order_id
FOR UPDATE SKIP LOCKED
LIMIT 1;
Strategy 4 - Set lock timeouts to fail fast rather than wait forever:
-- MySQL: set a timeout after which the transaction gives up
SET innodb_lock_wait_timeout = 5; -- give up after 5 seconds
-- PostgreSQL:
SET lock_timeout = '5s'; -- error if lock not acquired in 5 seconds
-- SQL Server: NOWAIT or timeout
SELECT * FROM orders WITH (UPDLOCK, NOWAIT)
WHERE order_id = 1001; -- fails immediately if row is locked
Strategy 5 - Use optimistic locking for read-heavy scenarios:
-- Add a version column to detect conflicts without holding locks
ALTER TABLE accounts ADD COLUMN version INT DEFAULT 1;
-- Read with version
SELECT balance, version FROM accounts WHERE account_id = 'A';
-- Returns: balance=10000, version=5
-- Update only if version hasn't changed (no lock held during think time)
UPDATE accounts
SET balance = 9000, version = version + 1
WHERE account_id = 'A'
AND version = 5; -- fails if another transaction updated first
-- If rows_affected = 0 → conflict detected → retry in application
Database deadlock detection: All major databases automatically detect deadlocks. When a circular wait is detected, the database selects a victim transaction (usually the one that has done the least work) and rolls it back, allowing the other transaction to proceed. The rolled-back transaction must be retried by the application, so always implement retry logic for deadlock errors (MySQL error 1213, PostgreSQL error 40P01).
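That application-side retry logic can be sketched as below. This is a hedged illustration: `DeadlockError` and the `transfer` function are simulated stand-ins for a real driver's deadlock exception (e.g. MySQL error 1213 / PostgreSQL SQLSTATE 40P01), not a real driver API.

```python
import time

# Sketch of application-side deadlock retry with exponential backoff.
# DeadlockError stands in for the real driver's deadlock error.
class DeadlockError(Exception):
    pass

def run_with_retry(txn_fn, max_attempts=3, base_delay=0.01):
    """Run txn_fn, retrying if it is chosen as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up after max_attempts
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying

# Simulated transaction: deadlocks twice, then commits on the 3rd attempt
attempts = {"n": 0}
def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError("chosen as deadlock victim, rolled back")
    return "committed"

result = run_with_retry(transfer)
print(result, attempts["n"])  # committed 3
```

The key design point: the whole transaction function is retried, never just the failing statement, because the database rolled back all of the victim's work.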
Day 10: Mock Interview Paper - All Topics Combined
Q1. What is the difference between DELETE, TRUNCATE, and DROP? [2 Marks]
Answer:
| Feature | DELETE | TRUNCATE | DROP |
|---|---|---|---|
| What it removes | Specific rows (with WHERE) or all rows | All rows in the table | Entire table: structure + data + indexes |
| WHERE clause | ✅ Supported | ❌ Not supported | ❌ Not supported |
| Transaction rollback | ✅ Can be rolled back | ⚠️ Cannot be rolled back (DDL in most DBs) | ❌ Cannot be rolled back |
| Triggers fired | ✅ Yes: row-level triggers fire | ❌ No triggers fire | ❌ No triggers fire |
| Speed | 🔴 Slowest: logs each row deletion | 🟢 Very fast: deallocates data pages | 🟢 Instant: removes entire object |
| Auto-increment reset | ❌ Does not reset | ✅ Resets identity/auto-increment counter | N/A: table gone |
| Statement type | DML (Data Manipulation) | DDL (Data Definition) | DDL (Data Definition) |
-- DELETE: remove specific rows, fully logged, rollback-safe
BEGIN TRANSACTION;
DELETE FROM orders WHERE order_date < '2020-01-01';
ROLLBACK; -- rows are restored ✅
-- TRUNCATE: wipe all rows instantly, cannot be rolled back in most DBs
TRUNCATE TABLE temp_staging; -- all rows gone, structure remains ✅
-- DROP: destroy the entire table permanently
DROP TABLE temp_staging; -- table, indexes, constraints: all gone
-- PostgreSQL exception: TRUNCATE CAN be rolled back inside an explicit transaction
BEGIN;
TRUNCATE TABLE orders;
ROLLBACK; -- rows restored in PostgreSQL only ✅
Interview tip: Use DELETE when you need selective removal with rollback safety. Use TRUNCATE to quickly clear staging/temp tables. Use DROP only when the table itself is no longer needed.
Q2. What is a foreign key constraint? What happens on CASCADE DELETE? [2 Marks]
Answer: A foreign key is a column (or set of columns) in a child table that references the primary key of a parent table. It enforces referential integrity: ensuring that every value in the foreign key column has a matching row in the parent table. No orphan records can exist.
-- Parent table
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
-- Child table with foreign key referencing parent
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
order_date DATE,
amount DECIMAL(10,2),
CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id)
REFERENCES customers (customer_id)
ON DELETE CASCADE -- behavior on parent row deletion
ON UPDATE CASCADE -- behavior on parent key update
);
ON DELETE behavior options:
| Option | What happens when parent row is deleted |
|---|---|
| CASCADE | Child rows are automatically deleted with the parent |
| SET NULL | Child foreign key column is set to NULL |
| SET DEFAULT | Child foreign key is set to its default value |
| RESTRICT | Prevents deletion of parent if child rows exist (error thrown) |
| NO ACTION | Same as RESTRICT in most databases (checked at end of statement) |
-- CASCADE DELETE example:
DELETE FROM customers WHERE customer_id = 101;
-- If ON DELETE CASCADE: all orders WHERE customer_id = 101 are automatically deleted ✅
-- If ON DELETE RESTRICT: ERROR, cannot delete customer while orders exist ❌
-- CASCADE UPDATE example:
UPDATE customers SET customer_id = 999 WHERE customer_id = 101;
-- If ON UPDATE CASCADE: all child orders.customer_id updated from 101 → 999 automatically ✅
Caution: CASCADE DELETE can cause unintended mass deletions across multiple related tables if the schema has deep chains of cascading relationships. Always verify cascade chains before enabling.
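A quick way to see CASCADE in action is SQLite, which enforces foreign keys only when `PRAGMA foreign_keys = ON` is set per connection. This sketch mirrors the customers/orders example above with invented sample rows.

```python
import sqlite3

# ON DELETE CASCADE demo: deleting a parent row removes its children.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customers(customer_id) ON DELETE CASCADE,
    amount      REAL
);
INSERT INTO customers VALUES (101, 'Alice'), (102, 'Bob');
INSERT INTO orders VALUES (1, 101, 250.0), (2, 101, 90.0), (3, 102, 40.0);
""")

conn.execute("DELETE FROM customers WHERE customer_id = 101")
remaining = conn.execute("SELECT order_id FROM orders ORDER BY order_id").fetchall()
print(remaining)  # [(3,)] -> Alice's two orders were cascade-deleted with her row
```

Without the PRAGMA, SQLite would silently skip the referential check: a good reminder that FK enforcement is a database setting, not just DDL.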
Q3. What is database normalization? Explain 1NF, 2NF, and 3NF briefly. [2 Marks]
Answer: Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity by decomposing tables into smaller, well-structured ones connected by relationships. Each "Normal Form" builds on the previous.
1NF - First Normal Form (eliminate repeating groups and ensure atomicity):
-- ❌ Violates 1NF: multiple values in a single column (non-atomic)
-- customer_id | name | phone_numbers
-- 1 | Alice | '9876543210, 9123456789' -- two numbers in one column
-- ✅ 1NF compliant: each column holds a single atomic value
CREATE TABLE customer_phones (
customer_id INT,
phone_number VARCHAR(15), -- one phone per row
PRIMARY KEY (customer_id, phone_number)
);
-- Rules: no repeating columns, no multi-valued columns, each row uniquely identifiable
2NF - Second Normal Form (eliminate partial dependencies):
-- ❌ Violates 2NF: non-key columns depend on PART of a composite key
-- order_id | product_id | quantity | product_name | product_price
-- product_name and product_price depend only on product_id, NOT on (order_id, product_id)
-- This is a PARTIAL DEPENDENCY
-- ✅ 2NF compliant: split into two tables
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id) -- no non-key column partially depends on part of PK
);
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100), -- fully depends on product_id only
product_price DECIMAL(10,2)
);
-- Rule: every non-key column must depend on the WHOLE primary key
3NF - Third Normal Form (eliminate transitive dependencies):
-- ❌ Violates 3NF: non-key column depends on another non-key column (transitive dependency)
-- emp_id | name | dept_id | dept_name | dept_location
-- dept_name and dept_location depend on dept_id (non-key), not directly on emp_id
-- ✅ 3NF compliant: extract transitive dependency into its own table
CREATE TABLE employees (
emp_id INT PRIMARY KEY,
name VARCHAR(100),
dept_id INT, -- foreign key to departments
FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);
CREATE TABLE departments (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(100), -- depends only on dept_id ✅
dept_location VARCHAR(100)
);
-- Rule: no non-key column should depend on another non-key column
Q4. What is the difference between a primary key and a unique key? [2 Marks]
Answer:
| Feature | Primary Key | Unique Key |
|---|---|---|
| Purpose | Uniquely identifies each row: the main row identifier | Enforces uniqueness on a column but is not the main identifier |
| NULL values | ❌ Cannot be NULL, ever | ✅ Can be NULL (most databases allow one or more NULLs) |
| Count per table | Only ONE primary key per table | Multiple unique keys allowed per table |
| Clustered index | ✅ Automatically creates a clustered index (SQL Server) | Creates a non-clustered index by default |
| Foreign key target | ✅ Commonly referenced by foreign keys | ✅ Can also be referenced by foreign keys |
| Implicit index | ✅ Always creates an index automatically | ✅ Always creates an index automatically |
CREATE TABLE employees (
emp_id INT PRIMARY KEY, -- one PK, NOT NULL, clustered index
national_id VARCHAR(20) UNIQUE, -- unique, but can be NULL if unknown
email VARCHAR(255) UNIQUE NOT NULL, -- unique + explicitly not null
pan_number VARCHAR(10) UNIQUE, -- another unique constraint
name VARCHAR(100) NOT NULL
);
-- emp_id: main row identifier; cannot be NULL, only one per table
-- national_id, email, pan_number: business uniqueness rules; multiple allowed
-- NULL behavior with UNIQUE:
INSERT INTO employees (emp_id, name, national_id) VALUES (1, 'Alice', NULL);
INSERT INTO employees (emp_id, name, national_id) VALUES (2, 'Bob', NULL);
-- ✅ Both NULLs allowed in UNIQUE column (NULL != NULL in SQL)
-- ❌ Would fail if emp_id was NULL
Q5. What is denormalization? When is it used? [2 Marks]
Answer: Denormalization is the deliberate process of introducing controlled redundancy into a normalized database schema by merging tables, adding duplicate columns, or pre-computing aggregates. It trades data consistency for query performance.
-- Normalized (3NF) schema: requires JOIN to get order total with customer name:
SELECT c.name, o.order_id, SUM(oi.quantity * oi.unit_price) AS total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY c.name, o.order_id; -- 3-table join on every query
-- Denormalized approach: pre-store the total and customer name on orders table:
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100); -- duplicate from customers
ALTER TABLE orders ADD COLUMN order_total DECIMAL(10,2); -- pre-computed aggregate
-- Now the query needs no JOIN:
SELECT customer_name, order_id, order_total
FROM orders; -- single table read, massively faster ✅
When denormalization is appropriate:
-- 1. OLAP / Data Warehousing (read-heavy, rarely updated)
-- Star schema: fact tables surrounded by dimension tables
-- product_name stored directly in fact_sales (not joined from products)
CREATE TABLE fact_sales (
sale_id INT,
sale_date DATE,
product_id INT,
product_name VARCHAR(100), -- denormalized from products table
category_name VARCHAR(100), -- denormalized from categories table
amount DECIMAL(10,2)
);
-- 2. High-traffic reporting dashboards needing sub-second response
-- 3. Pre-aggregated summary tables refreshed on a schedule
CREATE TABLE daily_sales_summary (
summary_date DATE PRIMARY KEY,
total_orders INT,
total_revenue DECIMAL(15,2),
new_customers INT
-- Updated nightly: avoids aggregating millions of rows in real time
);
-- 4. Caching computed values that are expensive to recalculate
ALTER TABLE customers ADD COLUMN lifetime_value DECIMAL(15,2);
-- Recalculated nightly by a batch job
Trade-offs:
| Benefit | Cost |
|---|---|
| Faster reads (fewer JOINs) | Data redundancy: same data stored multiple times |
| Simpler queries | Update anomalies: changing a value requires updating multiple places |
| Better for reporting | Synchronization complexity: keeping duplicated data consistent |
Q6. Write a query to find the nth highest salary in a table using any method. [5 Marks]
Answer:
Assumption: Table employees(emp_id, name, department, salary) exists. Finding the nth highest salary where n is a variable (e.g., n=3 for 3rd highest).
Method 1 - Using DENSE_RANK() window function (recommended; handles ties):
-- Find the 3rd highest salary (change n=3 to any value)
WITH salary_ranked AS (
SELECT
emp_id,
name,
department,
salary,
DENSE_RANK() OVER (
ORDER BY salary DESC
) AS salary_rank
FROM employees
)
SELECT
emp_id,
name,
department,
salary
FROM salary_ranked
WHERE salary_rank = 3; -- change 3 to any n
Method 2 - Using LIMIT/OFFSET (MySQL / PostgreSQL; simple but tie-unsafe):
-- nth highest: skip (n-1) rows, take 1
-- For n=3: OFFSET 2
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- OFFSET = n-1
-- With full employee details using subquery:
SELECT emp_id, name, salary
FROM employees
WHERE salary = (
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2 -- 3rd highest distinct salary
);
Method 3 - Using a correlated subquery (works in all databases; no window functions needed):
-- Find salary where exactly (n-1) distinct salaries are higher than it
-- For n=3: exactly 2 distinct salaries are higher
SELECT DISTINCT salary AS third_highest_salary
FROM employees e1
WHERE 2 = ( -- change 2 to (n-1)
SELECT COUNT(DISTINCT salary)
FROM employees e2
WHERE e2.salary > e1.salary
);
-- With employee details:
SELECT emp_id, name, department, salary
FROM employees
WHERE salary = (
SELECT DISTINCT e1.salary
FROM employees e1
WHERE 2 = (
SELECT COUNT(DISTINCT e2.salary)
FROM employees e2
WHERE e2.salary > e1.salary
)
);
Method 4 - Using FETCH NEXT (SQL Server / PostgreSQL standard SQL):
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
OFFSET 2 ROWS -- skip (n-1) rows
FETCH NEXT 1 ROWS ONLY; -- fetch exactly 1 row
Method 5 - Stored procedure for dynamic n (MySQL):
DELIMITER $$
CREATE PROCEDURE sp_nth_highest_salary(IN n INT)
BEGIN
SET @query = CONCAT(
'SELECT DISTINCT salary FROM employees ',
'ORDER BY salary DESC LIMIT 1 OFFSET ', n - 1
);
PREPARE stmt FROM @query;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;
CALL sp_nth_highest_salary(3); -- find 3rd highest salary
Method comparison:
| Method | Handles ties | Works across all DBs | Dynamic n |
|---|---|---|---|
| DENSE_RANK() | ✅ Best | ✅ MySQL 8+, PG, SS | ✅ Easy |
| LIMIT/OFFSET | ⚠️ Risky | ❌ MySQL/PG only | ✅ Easy |
| Correlated subquery | ✅ Yes | ✅ All databases | ❌ Change constant |
| FETCH NEXT | ⚠️ Risky | ❌ SQL Server/PG only | ✅ Easy |
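Method 1 can be verified end to end with Python's sqlite3 module (SQLite 3.25+, bundled with Python 3.7+, supports DENSE_RANK). The sample salaries below are made up, with a deliberate tie at 80000 to show why DENSE_RANK is tie-safe.

```python
import sqlite3

# Verify the DENSE_RANK nth-highest-salary query on toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 'A', 90000), (2, 'B', 80000), (3, 'C', 80000),
     (4, 'D', 70000), (5, 'E', 60000)],
)

n = 3  # find the 3rd highest distinct salary
rows = conn.execute("""
    WITH salary_ranked AS (
        SELECT name, salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
        FROM employees
    )
    SELECT name, salary FROM salary_ranked WHERE salary_rank = ?
""", (n,)).fetchall()
print(rows)  # [('D', 70000)] -> the 80000 tie shares rank 2, so rank 3 is 70000
```

With LIMIT 1 OFFSET 2 on the raw rows (no DISTINCT), the tie would have returned 80000 instead, which is exactly the "tie-unsafe" risk flagged in the table.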
Q7. Given a 'logins' table (user_id, login_date), write a query to find users who logged in on 3 or more consecutive days. [5 Marks]
Answer:
Assumption: Table logins(login_id, user_id, login_date DATE) exists. Multiple logins per day per user are possible and should be deduplicated.
Method 1 - Using LAG() to detect consecutive day gaps:
WITH deduped_logins AS (
-- Step 1: One row per user per day (remove same-day duplicates)
SELECT DISTINCT
user_id,
login_date
FROM logins
),
login_with_gaps AS (
-- Step 2: Calculate days since previous login for each user
SELECT
user_id,
login_date,
LAG(login_date) OVER (
PARTITION BY user_id
ORDER BY login_date
) AS prev_login_date,
-- Is this login exactly 1 day after the previous? (consecutive)
CASE
WHEN login_date - LAG(login_date) OVER (
PARTITION BY user_id ORDER BY login_date
) = 1
THEN 0 -- consecutive day → same group
ELSE 1 -- gap → new group starts
END AS is_new_group
FROM deduped_logins
),
group_assigned AS (
-- Step 3: Assign a group number to each consecutive run
SELECT
user_id,
login_date,
SUM(is_new_group) OVER (
PARTITION BY user_id
ORDER BY login_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS consecutive_group
FROM login_with_gaps
),
consecutive_counts AS (
-- Step 4: Count days in each consecutive group
SELECT
user_id,
consecutive_group,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS consecutive_days
FROM group_assigned
GROUP BY user_id, consecutive_group
)
-- Step 5: Return users with at least one streak of 3+ consecutive days
SELECT DISTINCT
l.user_id,
cc.streak_start,
cc.streak_end,
cc.consecutive_days
FROM consecutive_counts cc
JOIN logins l ON cc.user_id = l.user_id
WHERE cc.consecutive_days >= 3
ORDER BY cc.consecutive_days DESC, l.user_id;
Method 2 - Using date subtraction and ROW_NUMBER (elegant island-and-gaps technique):
WITH deduped_logins AS (
SELECT DISTINCT user_id, login_date FROM logins
),
date_groups AS (
-- Subtracting ROW_NUMBER from login_date groups consecutive dates together
-- Consecutive dates always produce the same (login_date - ROW_NUMBER) value
SELECT
user_id,
login_date,
login_date - INTERVAL ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY login_date
) DAY AS date_group
-- PostgreSQL: login_date - ROW_NUMBER() OVER (...) * INTERVAL '1 day'
-- SQL Server: DATEADD(DAY, -ROW_NUMBER() OVER (...), login_date)
FROM deduped_logins
),
streaks AS (
SELECT
user_id,
date_group,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS streak_length
FROM date_groups
GROUP BY user_id, date_group
)
SELECT
user_id,
streak_start,
streak_end,
streak_length AS consecutive_days
FROM streaks
WHERE streak_length >= 3
ORDER BY streak_length DESC, user_id;
Why the island-and-gaps trick works:
User 1 logins: Jan 1, Jan 2, Jan 3, Jan 5, Jan 6
ROW_NUMBER: 1, 2, 3, 4, 5
login - RN: Dec31, Dec31, Dec31, Jan 1, Jan 1 → same date = same group
Groups formed: [Jan1, Jan2, Jan3] = 3 consecutive ✅
[Jan5, Jan6] = 2 consecutive ❌
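The walkthrough above can be reproduced in SQLite, where julianday() turns each date into a day number so ROW_NUMBER can be subtracted directly. The sample data matches the example (user 1 with a 3-day streak, plus a second user with no streak); it is invented for illustration.

```python
import sqlite3

# Island-and-gaps streak detection: julianday(login_date) - ROW_NUMBER()
# is constant across consecutive days, so it acts as a group key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, '2024-01-01'), (1, '2024-01-02'), (1, '2024-01-03'),
    (1, '2024-01-05'), (1, '2024-01-06'),
    (2, '2024-01-01'), (2, '2024-01-03'),
])

rows = conn.execute("""
    WITH deduped AS (SELECT DISTINCT user_id, login_date FROM logins),
    grouped AS (
        SELECT user_id, login_date,
               julianday(login_date) - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date
               ) AS grp
        FROM deduped
    )
    SELECT user_id, MIN(login_date) AS streak_start, COUNT(*) AS days
    FROM grouped
    GROUP BY user_id, grp
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [(1, '2024-01-01', 3)] -> only user 1 has a 3+ day streak
```

User 2's logins on Jan 1 and Jan 3 fall into two different groups (the gap shifts the difference), so no streak of length 3 is reported for them.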
Q8. Write a query to find duplicate records in a 'customers' table based on email, keeping only the earliest record. [5 Marks]
Answer:
Assumption: Table customers(customer_id, name, email, created_at, phone) exists. "Earliest" = smallest created_at timestamp. When created_at is equal, keep the smallest customer_id.
Step 1 - Identify duplicates:
-- First, see which emails have duplicates and how many
SELECT
email,
COUNT(*) AS duplicate_count,
MIN(created_at) AS earliest_record,
MAX(created_at) AS latest_record
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
Step 2 - View all duplicate rows with their rank:
WITH ranked_customers AS (
SELECT
customer_id,
name,
email,
created_at,
ROW_NUMBER() OVER (
PARTITION BY email -- group duplicates by email
ORDER BY
created_at ASC, -- keep earliest first
customer_id ASC -- tiebreak by smallest ID
) AS rn
FROM customers
)
SELECT
customer_id,
name,
email,
created_at,
rn,
CASE WHEN rn = 1 THEN 'KEEP' ELSE 'DUPLICATE → DELETE' END AS action
FROM ranked_customers
WHERE email IN ( -- only show emails with duplicates
SELECT email FROM customers
GROUP BY email HAVING COUNT(*) > 1
)
ORDER BY email, rn;
Step 3 - Delete duplicates, keep earliest:
-- Method 1: DELETE using CTE (PostgreSQL / SQL Server)
WITH ranked_customers AS (
SELECT
customer_id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC, customer_id ASC
) AS rn
FROM customers
)
DELETE FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM ranked_customers
WHERE rn > 1 -- delete all but the first (rn=1)
);
-- Method 2: DELETE using self-join (MySQL compatible)
DELETE c2
FROM customers c1
JOIN customers c2
ON c1.email = c2.email -- same email
AND (
c1.created_at < c2.created_at -- c1 is older → delete c2
OR (
c1.created_at = c2.created_at -- same date → keep lower ID
AND c1.customer_id < c2.customer_id
)
);
-- Method 3: Keep only records in the "keep" set (safe for all databases)
DELETE FROM customers
WHERE customer_id NOT IN (
SELECT MIN(customer_id) -- keep only earliest customer_id per email
FROM customers
GROUP BY email
);
-- ⚠️ Note: This keeps the smallest customer_id, not necessarily earliest created_at
-- For created_at based deduplication, use Method 1 or 2
Step 4 - Prevent future duplicates:
-- Add unique constraint after deduplication
ALTER TABLE customers
ADD CONSTRAINT uq_customers_email UNIQUE (email);
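Here is the Step 3 / Method 1 dedup run end to end on toy data in Python/SQLite (the rows are invented for illustration, including a same-day duplicate to exercise the customer_id tiebreak):

```python
import sqlite3

# ROW_NUMBER-based dedup: keep the earliest created_at per email,
# tiebreaking on the smallest customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT, created_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, 'a@x.com', '2024-01-01'),
    (2, 'a@x.com', '2024-02-01'),  # later duplicate -> deleted
    (3, 'b@x.com', '2024-01-15'),
    (4, 'a@x.com', '2024-01-01'),  # same date as id 1 -> higher id loses
])

conn.execute("""
    DELETE FROM customers WHERE customer_id IN (
        SELECT customer_id FROM (
            SELECT customer_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY created_at ASC, customer_id ASC
                   ) AS rn
            FROM customers
        ) WHERE rn > 1
    )
""")
kept = conn.execute(
    "SELECT customer_id, email FROM customers ORDER BY customer_id"
).fetchall()
print(kept)  # [(1, 'a@x.com'), (3, 'b@x.com')]
```

After the delete, adding the UNIQUE constraint from Step 4 would succeed, since each email now appears exactly once.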
Q9. Given an 'orders' table (order_id, customer_id, order_date, amount), write a query to calculate monthly retention: the percentage of customers from each signup month who made a purchase in subsequent months. [10 Marks]
Answer:
Assumption: The "signup month" is derived as each customer's first ever order month. Retention = for each cohort (month of first order), what % of those customers placed at least one order in each subsequent month.
WITH customer_first_order AS (
-- Step 1: Find each customer's cohort month (month of first order)
SELECT
customer_id,
DATE_FORMAT(MIN(order_date), '%Y-%m') AS cohort_month,
MIN(order_date) AS first_order_date
FROM orders
GROUP BY customer_id
),
cohort_sizes AS (
-- Step 2: Count total customers in each cohort
SELECT
cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size
FROM customer_first_order
GROUP BY cohort_month
),
customer_activity AS (
-- Step 3: For each customer, find every month they placed an order
SELECT DISTINCT
o.customer_id,
cfo.cohort_month,
DATE_FORMAT(o.order_date, '%Y-%m') AS activity_month
FROM orders o
JOIN customer_first_order cfo
ON o.customer_id = cfo.customer_id
),
cohort_activity AS (
-- Step 4: Calculate how many months after cohort month each activity occurred
SELECT
ca.cohort_month,
ca.activity_month,
-- Period number: 0 = cohort month itself, 1 = one month later, etc.
PERIOD_DIFF(
REPLACE(ca.activity_month, '-', ''),
REPLACE(ca.cohort_month, '-', '')
) AS months_since_cohort,
COUNT(DISTINCT ca.customer_id) AS retained_customers
FROM customer_activity ca
GROUP BY
ca.cohort_month,
ca.activity_month
),
retention_pct AS (
-- Step 5: Join with cohort sizes to compute retention percentage
SELECT
ca.cohort_month,
cs.cohort_size,
ca.activity_month,
ca.months_since_cohort,
ca.retained_customers,
ROUND(
ca.retained_customers * 100.0 / cs.cohort_size, 1
) AS retention_rate_pct
FROM cohort_activity ca
JOIN cohort_sizes cs
ON ca.cohort_month = cs.cohort_month
WHERE ca.months_since_cohort >= 0 -- exclude activity before cohort month
)
-- Step 6: Final retention table output
SELECT
cohort_month,
cohort_size,
months_since_cohort AS month_number,
activity_month,
retained_customers,
retention_rate_pct,
-- Visual bar for quick reading
CONCAT(
REPEAT('█', FLOOR(retention_rate_pct / 10)),
' ',
retention_rate_pct, '%'
) AS retention_bar
FROM retention_pct
ORDER BY cohort_month ASC, months_since_cohort ASC;
PostgreSQL version (cleaner date arithmetic):
WITH customer_first_order AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders
GROUP BY customer_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(*) AS cohort_size
FROM customer_first_order
GROUP BY cohort_month
),
customer_activity AS (
SELECT DISTINCT
o.customer_id,
cfo.cohort_month,
DATE_TRUNC('month', o.order_date)::DATE AS activity_month
FROM orders o
JOIN customer_first_order cfo ON o.customer_id = cfo.customer_id
),
retention AS (
SELECT
ca.cohort_month,
cs.cohort_size,
ca.activity_month,
-- Number of months since cohort start
EXTRACT(YEAR FROM AGE(ca.activity_month, ca.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(ca.activity_month, ca.cohort_month))
AS months_since_cohort,
COUNT(DISTINCT ca.customer_id) AS retained_customers
FROM customer_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
GROUP BY ca.cohort_month, cs.cohort_size, ca.activity_month
)
SELECT
TO_CHAR(cohort_month, 'YYYY-MM') AS cohort,
cohort_size,
months_since_cohort::INT AS month_num,
TO_CHAR(activity_month, 'YYYY-MM') AS period,
retained_customers,
ROUND(retained_customers * 100.0 / cohort_size, 1) AS retention_pct
FROM retention
WHERE months_since_cohort >= 0
ORDER BY cohort_month, months_since_cohort;
Sample output (cohort retention table):
| cohort | cohort_size | month_num | period | retained_customers | retention_pct |
|---|---|---|---|---|---|
| 2024-01 | 500 | 0 | 2024-01 | 500 | 100.0 |
| 2024-01 | 500 | 1 | 2024-02 | 325 | 65.0 |
| 2024-01 | 500 | 2 | 2024-03 | 240 | 48.0 |
| 2024-01 | 500 | 3 | 2024-04 | 195 | 39.0 |
| 2024-02 | 620 | 0 | 2024-02 | 620 | 100.0 |
| 2024-02 | 620 | 1 | 2024-03 | 410 | 66.1 |
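Each retention percentage is just retained customers divided by cohort size. A quick sanity check of the sample figures above in Python:

```python
# Verify the sample retention percentages: pct = retained * 100 / cohort_size
rows = [
    (500, 500, 100.0),  # 2024-01, month 0
    (500, 325, 65.0),   # 2024-01, month 1
    (500, 240, 48.0),   # 2024-01, month 2
    (500, 195, 39.0),   # 2024-01, month 3
    (620, 620, 100.0),  # 2024-02, month 0
    (620, 410, 66.1),   # 2024-02, month 1
]
for cohort_size, retained, expected_pct in rows:
    pct = round(retained * 100.0 / cohort_size, 1)
    assert pct == expected_pct, (cohort_size, retained, pct)
print("all sample percentages check out")
```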
Pivoted view (cohort × month matrix):
-- MySQL: pivot retention into a matrix using conditional aggregation
SELECT
cohort_month,
cohort_size,
MAX(CASE WHEN months_since_cohort = 0 THEN CONCAT(retention_rate_pct,'%') END) AS M0,
MAX(CASE WHEN months_since_cohort = 1 THEN CONCAT(retention_rate_pct,'%') END) AS M1,
MAX(CASE WHEN months_since_cohort = 2 THEN CONCAT(retention_rate_pct,'%') END) AS M2,
MAX(CASE WHEN months_since_cohort = 3 THEN CONCAT(retention_rate_pct,'%') END) AS M3,
MAX(CASE WHEN months_since_cohort = 4 THEN CONCAT(retention_rate_pct,'%') END) AS M4,
MAX(CASE WHEN months_since_cohort = 5 THEN CONCAT(retention_rate_pct,'%') END) AS M5,
MAX(CASE WHEN months_since_cohort = 6 THEN CONCAT(retention_rate_pct,'%') END) AS M6
FROM retention_pct
GROUP BY cohort_month, cohort_size
ORDER BY cohort_month;
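The conditional-aggregation pivot works because `MAX(CASE ...)` ignores the NULLs produced for non-matching months, leaving exactly one value per cohort/month cell. A minimal, self-contained demonstration using SQLite and hypothetical retention data (cohorts and percentages are made up for illustration):

```python
import sqlite3

# Tiny in-memory table mimicking the retention_pct CTE from the query above
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE retention_pct (
    cohort_month TEXT, cohort_size INT,
    months_since_cohort INT, retention_rate_pct REAL)""")
conn.executemany(
    "INSERT INTO retention_pct VALUES (?, ?, ?, ?)",
    [("2024-01", 500, 0, 100.0), ("2024-01", 500, 1, 65.0),
     ("2024-01", 500, 2, 48.0),
     ("2024-02", 620, 0, 100.0), ("2024-02", 620, 1, 66.1)])

# Same conditional-aggregation pattern, trimmed to three month columns
rows = conn.execute("""
    SELECT cohort_month, cohort_size,
           MAX(CASE WHEN months_since_cohort = 0 THEN retention_rate_pct END) AS M0,
           MAX(CASE WHEN months_since_cohort = 1 THEN retention_rate_pct END) AS M1,
           MAX(CASE WHEN months_since_cohort = 2 THEN retention_rate_pct END) AS M2
    FROM retention_pct
    GROUP BY cohort_month, cohort_size
    ORDER BY cohort_month""").fetchall()
print(rows)
# The 2024-02 cohort has no month-2 row yet, so its M2 cell is NULL (None)
```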
Q10. Design a database schema for an e-commerce platform. List at least 5 tables, their columns, and the relationships between them. [5 Marks]
Answer:
Schema overview - 8 core tables:
-- ============================================================
-- TABLE 1: CUSTOMERS
-- Stores registered user/buyer information
-- ============================================================
CREATE TABLE customers (
customer_id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) NOT NULL UNIQUE,
password_hash VARCHAR(255) NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
phone VARCHAR(20),
date_of_birth DATE,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
is_active BOOLEAN DEFAULT TRUE
);
-- ============================================================
-- TABLE 2: ADDRESSES
-- Customer shipping/billing addresses (one customer → many addresses)
-- ============================================================
CREATE TABLE addresses (
address_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
address_type ENUM('shipping','billing','both') DEFAULT 'shipping',
street_line1 VARCHAR(255) NOT NULL,
street_line2 VARCHAR(255),
city VARCHAR(100) NOT NULL,
state VARCHAR(100) NOT NULL,
postal_code VARCHAR(20) NOT NULL,
country_code CHAR(2) NOT NULL, -- ISO 3166-1 alpha-2
is_default BOOLEAN DEFAULT FALSE,
CONSTRAINT fk_addresses_customer
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
ON DELETE CASCADE
);
-- ============================================================
-- TABLE 3: CATEGORIES
-- Hierarchical product categories (self-referencing for subcategories)
-- ============================================================
CREATE TABLE categories (
category_id INT PRIMARY KEY AUTO_INCREMENT,
parent_category_id INT NULL, -- NULL = top-level category
category_name VARCHAR(100) NOT NULL,
slug VARCHAR(100) NOT NULL UNIQUE, -- URL-friendly name
description TEXT,
is_active BOOLEAN DEFAULT TRUE,
CONSTRAINT fk_categories_parent
FOREIGN KEY (parent_category_id) REFERENCES categories(category_id)
ON DELETE SET NULL
);
-- ============================================================
-- TABLE 4: PRODUCTS
-- Product catalog
-- ============================================================
CREATE TABLE products (
product_id INT PRIMARY KEY AUTO_INCREMENT,
category_id INT NOT NULL,
product_name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL UNIQUE,
description TEXT,
base_price DECIMAL(12,2) NOT NULL,
sku VARCHAR(100) NOT NULL UNIQUE, -- Stock Keeping Unit
stock_quantity INT NOT NULL DEFAULT 0,
weight_kg DECIMAL(6,3),
is_active BOOLEAN DEFAULT TRUE,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_products_category
FOREIGN KEY (category_id) REFERENCES categories(category_id),
CONSTRAINT chk_price CHECK (base_price >= 0),
CONSTRAINT chk_stock CHECK (stock_quantity >= 0)
);
-- ============================================================
-- TABLE 5: ORDERS
-- Customer order header - one row per order
-- ============================================================
CREATE TABLE orders (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
shipping_address_id INT NOT NULL,
order_date DATETIME DEFAULT CURRENT_TIMESTAMP,
status ENUM('pending','confirmed','processing',
'shipped','delivered','cancelled','refunded')
DEFAULT 'pending',
subtotal_amount DECIMAL(12,2) NOT NULL,
discount_amount DECIMAL(12,2) DEFAULT 0.00,
tax_amount DECIMAL(12,2) DEFAULT 0.00,
shipping_amount DECIMAL(12,2) DEFAULT 0.00,
total_amount DECIMAL(12,2) NOT NULL,
payment_status ENUM('unpaid','paid','refunded') DEFAULT 'unpaid',
notes TEXT,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
CONSTRAINT fk_orders_address
FOREIGN KEY (shipping_address_id) REFERENCES addresses(address_id)
);
-- ============================================================
-- TABLE 6: ORDER_ITEMS
-- Line items within each order - one row per product per order
-- ============================================================
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY AUTO_INCREMENT,
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(12,2) NOT NULL, -- price AT TIME of purchase
discount_pct DECIMAL(5,2) DEFAULT 0.00,
line_total DECIMAL(12,2) NOT NULL, -- quantity * unit_price * (1 - discount_pct / 100)
CONSTRAINT fk_order_items_order
FOREIGN KEY (order_id) REFERENCES orders(order_id)
ON DELETE CASCADE,
CONSTRAINT fk_order_items_product
FOREIGN KEY (product_id) REFERENCES products(product_id),
CONSTRAINT chk_quantity CHECK (quantity > 0)
);
-- ============================================================
-- TABLE 7: PAYMENTS
-- Payment records per order (supports partial payments / refunds)
-- ============================================================
CREATE TABLE payments (
payment_id INT PRIMARY KEY AUTO_INCREMENT,
order_id INT NOT NULL,
payment_method ENUM('credit_card','debit_card','upi',
'net_banking','wallet','cod') NOT NULL,
payment_gateway VARCHAR(50), -- 'Razorpay', 'Stripe', etc.
gateway_txn_id VARCHAR(255) UNIQUE, -- external transaction reference
amount DECIMAL(12,2) NOT NULL,
currency CHAR(3) DEFAULT 'INR',
status ENUM('initiated','success','failed','refunded') DEFAULT 'initiated',
paid_at DATETIME,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_payments_order
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
-- ============================================================
-- TABLE 8: REVIEWS
-- Product reviews by customers (only for purchased products)
-- ============================================================
CREATE TABLE reviews (
review_id INT PRIMARY KEY AUTO_INCREMENT,
product_id INT NOT NULL,
customer_id INT NOT NULL,
order_id INT NOT NULL, -- must have purchased to review
rating TINYINT NOT NULL,
title VARCHAR(255),
body TEXT,
is_verified BOOLEAN DEFAULT FALSE,
is_approved BOOLEAN DEFAULT TRUE,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_reviews_product
FOREIGN KEY (product_id) REFERENCES products(product_id),
CONSTRAINT fk_reviews_customer
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
CONSTRAINT fk_reviews_order
FOREIGN KEY (order_id) REFERENCES orders(order_id),
CONSTRAINT chk_rating CHECK (rating BETWEEN 1 AND 5),
CONSTRAINT uq_one_review_per_purchase
UNIQUE (customer_id, product_id, order_id) -- one review per purchase
);
Entity Relationship summary:
customers   (1) ──────── (many) addresses
customers   (1) ──────── (many) orders
customers   (1) ──────── (many) reviews
orders      (1) ──────── (many) order_items
orders      (1) ──────── (many) payments
orders      (many) ───── (1) addresses    [shipping address]
order_items (many) ───── (1) products
products    (many) ───── (1) categories
categories  (many) ───── (1) categories   [self-referencing parent]
reviews     (many) ───── (1) products
Key indexes for query performance:
CREATE INDEX idx_orders_customer ON orders (customer_id, order_date);
CREATE INDEX idx_orders_status ON orders (status);
CREATE INDEX idx_order_items_order ON order_items (order_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);
CREATE INDEX idx_products_category ON products (category_id);
CREATE INDEX idx_reviews_product ON reviews (product_id);
CREATE INDEX idx_payments_order ON payments (order_id);
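The foreign-key wiring above is what keeps orphan rows out of the child tables. A quick smoke test of the customers → orders relationship, using a simplified SQLite-flavoured subset of the schema (SQLite has no ENUM or AUTO_INCREMENT, and its FK enforcement must be switched on explicitly):

```python
import sqlite3

# Simplified two-table subset of the schema, adapted to SQLite syntax
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FKs are OFF by default in SQLite
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE)""")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total_amount REAL NOT NULL)""")

conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO orders (customer_id, total_amount) VALUES (1, 99.0)")

# An order pointing at a nonexistent customer is rejected by the FK constraint
try:
    conn.execute("INSERT INTO orders (customer_id, total_amount) VALUES (42, 5.0)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```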
Q11. Case Study â Insurance Company Churn Risk Analysis [10 Marks]
(a) What data would you pull, and from which tables? [3 Marks]
Answer:
Key tables and data required:
-- TABLE 1: customers / policyholders
-- Columns: customer_id, name, age, city, join_date, segment, agent_id
-- TABLE 2: policies
-- Columns: policy_id, customer_id, policy_type, start_date, end_date,
-- premium_amount, payment_frequency, status (active/lapsed/cancelled)
-- TABLE 3: premium_payments
-- Columns: payment_id, policy_id, customer_id, due_date, paid_date,
-- amount_due, amount_paid, payment_status (paid/late/missed)
-- TABLE 4: claims
-- Columns: claim_id, policy_id, customer_id, claim_date, claim_amount,
-- claim_status (approved/rejected/pending), settlement_date
-- TABLE 5: customer_interactions
-- Columns: interaction_id, customer_id, interaction_date, channel,
-- interaction_type (complaint/query/renewal_discussion), resolved
-- TABLE 6: policy_renewals
-- Columns: renewal_id, policy_id, customer_id, renewal_due_date,
-- renewal_date, renewal_status (renewed/not_renewed/pending)
-- Churn indicators to extract:
-- ✅ Missed/late premium payments      → premium_payments
-- ✅ Lapsed or cancelled policies      → policies
-- ✅ No renewal after policy expiry    → policy_renewals
-- ✅ High complaint frequency          → customer_interactions
-- ✅ Claim rejection history           → claims
-- ✅ Days since last interaction       → customer_interactions
-- ✅ Policy age and years as customer  → policies, customers
(b) Write the key SQL queries to calculate churn risk indicators. [4 Marks]
Answer:
-- ============================================================
-- INDICATOR 1: Missed payment rate (% of premiums missed in last 12 months)
-- High missed rate = high churn risk
-- ============================================================
WITH payment_health AS (
SELECT
customer_id,
COUNT(*) AS total_payments_due,
SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
AS missed_payments,
SUM(CASE WHEN payment_status = 'late' THEN 1 ELSE 0 END)
AS late_payments,
ROUND(
SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
* 100.0 / COUNT(*), 1
) AS missed_rate_pct
FROM premium_payments
WHERE due_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY customer_id
)
SELECT * FROM payment_health WHERE missed_rate_pct > 20
ORDER BY missed_rate_pct DESC;
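The `SUM(CASE WHEN ... THEN 1 ELSE 0 END) / COUNT(*)` pattern above is worth internalising, since every indicator in this answer reuses it. A runnable sketch on toy data (SQLite syntax; the 12-month date filter is dropped to keep the demo deterministic, and the customer rows are hypothetical):

```python
import sqlite3

# Toy premium_payments table: customer 1 misses half of its premiums,
# customer 2 misses none
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE premium_payments (customer_id INT, payment_status TEXT)")
conn.executemany("INSERT INTO premium_payments VALUES (?, ?)", [
    (1, "paid"), (1, "missed"), (1, "missed"), (1, "late"),   # 50% missed
    (2, "paid"), (2, "paid"), (2, "paid"), (2, "late"),       # 0% missed
])

rows = conn.execute("""
    WITH payment_health AS (
        SELECT customer_id,
               COUNT(*) AS total_payments_due,
               ROUND(SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
                     * 100.0 / COUNT(*), 1) AS missed_rate_pct
        FROM premium_payments
        GROUP BY customer_id
    )
    SELECT customer_id, missed_rate_pct FROM payment_health
    WHERE missed_rate_pct > 20
    ORDER BY missed_rate_pct DESC""").fetchall()
print(rows)  # only customer 1 crosses the 20% risk threshold
```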
-- ============================================================
-- INDICATOR 2: Days since last interaction (disengaged customers)
-- No contact in 6+ months = silent churn risk
-- ============================================================
WITH last_interaction AS (
SELECT
customer_id,
MAX(interaction_date) AS last_contact_date,
DATEDIFF(CURDATE(), MAX(interaction_date)) AS days_since_contact,
COUNT(CASE WHEN interaction_type = 'complaint'
THEN 1 END) AS complaint_count
FROM customer_interactions
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
li.last_contact_date,
li.days_since_contact,
li.complaint_count,
CASE
WHEN li.days_since_contact > 365 THEN 'Critical - No contact 1 year+'
WHEN li.days_since_contact > 180 THEN 'High - No contact 6 months+'
WHEN li.complaint_count > 3 THEN 'Medium - Multiple complaints'
ELSE 'Low Risk'
END AS engagement_risk
FROM customers c
LEFT JOIN last_interaction li ON c.customer_id = li.customer_id
WHERE li.days_since_contact > 180 OR li.complaint_count > 3
ORDER BY li.days_since_contact DESC;
-- ============================================================
-- INDICATOR 3: Renewal failure rate (policies not renewed after expiry)
-- ============================================================
WITH renewal_behavior AS (
SELECT
p.customer_id,
COUNT(pr.renewal_id) AS policies_up_for_renewal,
SUM(CASE WHEN pr.renewal_status = 'renewed'
THEN 1 ELSE 0 END) AS renewals_completed,
SUM(CASE WHEN pr.renewal_status = 'not_renewed'
THEN 1 ELSE 0 END) AS renewals_missed,
ROUND(
SUM(CASE WHEN pr.renewal_status = 'not_renewed' THEN 1 ELSE 0 END)
* 100.0 / NULLIF(COUNT(pr.renewal_id), 0), 1
) AS non_renewal_rate_pct
FROM policies p
JOIN policy_renewals pr ON p.policy_id = pr.policy_id
WHERE pr.renewal_due_date <= CURDATE() -- past renewal due dates
GROUP BY p.customer_id
)
SELECT * FROM renewal_behavior
WHERE non_renewal_rate_pct >= 50 -- missed half or more renewals
ORDER BY non_renewal_rate_pct DESC;
-- ============================================================
-- INDICATOR 4: COMPOSITE CHURN RISK SCORE (0-100)
-- Combines all signals into one actionable score per customer
-- ============================================================
WITH payment_risk AS (
SELECT customer_id,
ROUND(SUM(CASE WHEN payment_status='missed' THEN 1 ELSE 0 END)
* 100.0 / COUNT(*), 1) AS missed_pct
FROM premium_payments
WHERE due_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY customer_id
),
engagement_risk AS (
SELECT customer_id,
DATEDIFF(CURDATE(), MAX(interaction_date)) AS days_silent,
COUNT(CASE WHEN interaction_type='complaint' THEN 1 END) AS complaints
FROM customer_interactions
GROUP BY customer_id
),
renewal_risk AS (
SELECT p.customer_id,
ROUND(SUM(CASE WHEN pr.renewal_status='not_renewed' THEN 1 ELSE 0 END)
* 100.0 / NULLIF(COUNT(*),0), 1) AS non_renewal_pct
FROM policies p
JOIN policy_renewals pr ON p.policy_id = pr.policy_id
WHERE pr.renewal_due_date <= CURDATE()
GROUP BY p.customer_id
)
SELECT
c.customer_id,
c.name,
COALESCE(pr.missed_pct, 0) AS payment_risk_score,
COALESCE(er.days_silent, 0) AS days_silent,
COALESCE(er.complaints, 0) AS complaint_count,
COALESCE(rr.non_renewal_pct, 0) AS renewal_risk_score,
-- Composite score: weighted combination of all risk factors (max 100)
LEAST(100, ROUND(
COALESCE(pr.missed_pct, 0) * 0.35 + -- 35% weight: payment behaviour
LEAST(COALESCE(er.days_silent, 0) / 365 * 100, 100) * 0.25 + -- 25%: engagement
COALESCE(rr.non_renewal_pct, 0) * 0.25 + -- 25%: renewal behaviour
LEAST(COALESCE(er.complaints, 0) * 10, 100) * 0.15 -- 15%: complaints
, 1)) AS churn_risk_score,
CASE
WHEN LEAST(100, ROUND(
COALESCE(pr.missed_pct,0)*0.35 +
LEAST(COALESCE(er.days_silent,0)/365*100,100)*0.25 +
COALESCE(rr.non_renewal_pct,0)*0.25 +
LEAST(COALESCE(er.complaints,0)*10,100)*0.15, 1))
>= 70 THEN '🔴 High Risk - Immediate Action'
WHEN LEAST(100, ROUND(
COALESCE(pr.missed_pct,0)*0.35 +
LEAST(COALESCE(er.days_silent,0)/365*100,100)*0.25 +
COALESCE(rr.non_renewal_pct,0)*0.25 +
LEAST(COALESCE(er.complaints,0)*10,100)*0.15, 1))
>= 40 THEN '🟡 Medium Risk - Monitor Closely'
ELSE '🟢 Low Risk - Retain'
END AS risk_segment
FROM customers c
LEFT JOIN payment_risk pr ON c.customer_id = pr.customer_id
LEFT JOIN engagement_risk er ON c.customer_id = er.customer_id
LEFT JOIN renewal_risk rr ON c.customer_id = rr.customer_id
ORDER BY churn_risk_score DESC;
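The weighted formula buried in that SQL is easier to inspect as plain arithmetic. A sketch mirroring the same weights (35% payments, 25% engagement, 25% renewals, 15% complaints) and the same `LEAST` caps; the sample customer values are hypothetical:

```python
# Composite churn score, mirroring the SQL expression above
def churn_risk_score(missed_pct, days_silent, non_renewal_pct, complaints):
    score = (
        missed_pct * 0.35                                # payment behaviour
        + min(days_silent / 365 * 100, 100) * 0.25       # engagement (capped)
        + non_renewal_pct * 0.25                         # renewal behaviour
        + min(complaints * 10, 100) * 0.15               # complaints (capped)
    )
    return min(100, round(score, 1))

def risk_segment(score):
    if score >= 70:
        return "High Risk"
    if score >= 40:
        return "Medium Risk"
    return "Low Risk"

# A customer missing 60% of payments, silent for a full year, skipping 80%
# of renewals, with 4 complaints:
s = churn_risk_score(60, 365, 80, 4)
print(s, risk_segment(s))  # 72.0 High Risk
```

Keeping the scoring logic in one place like this also makes it easy to tune the weights before hard-coding them back into the SQL.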
(c) How would you present the findings to the business team? [3 Marks]
Answer:
1. Executive Summary Dashboard - KPI snapshot:
-- A single summary query for the leadership slide
SELECT
COUNT(*) AS total_active_customers,
SUM(CASE WHEN churn_risk_score >= 70 THEN 1 ELSE 0 END) AS high_risk_count,
SUM(CASE WHEN churn_risk_score >= 40 AND churn_risk_score < 70
THEN 1 ELSE 0 END) AS medium_risk_count,
ROUND(SUM(CASE WHEN churn_risk_score >= 70 THEN 1 ELSE 0 END)
* 100.0 / COUNT(*), 1) AS high_risk_pct,
-- Estimated revenue at risk (annual premium of high-risk customers)
SUM(CASE WHEN churn_risk_score >= 70
THEN annual_premium ELSE 0 END) AS revenue_at_risk
FROM churn_risk_final;
2. Segment the findings into actionable groups:
| Risk Tier | Count | Revenue at Risk | Recommended Action |
|---|---|---|---|
| 🔴 High Risk (70-100) | 1,240 | ₹4.2 Cr | Immediate outreach: retention call within 48 hrs |
| 🟡 Medium Risk (40-69) | 3,850 | ₹8.7 Cr | Email campaign, renewal reminder, discount offer |
| 🟢 Low Risk (0-39) | 18,200 | - | Routine engagement, upsell opportunities |
3. Root cause breakdown - for the operations team:
-- Show what's driving churn risk (which factor contributes most)
SELECT
CASE
WHEN missed_pct >= 40 THEN 'Payment Issues'
WHEN days_silent >= 180 THEN 'Disengagement'
WHEN non_renewal_pct >= 50 THEN 'Renewal Avoidance'
WHEN complaint_count >= 3 THEN 'Service Dissatisfaction'
ELSE 'Multiple Factors'
END AS primary_churn_driver,
COUNT(*) AS customers_affected,
ROUND(AVG(churn_risk_score), 1) AS avg_risk_score
FROM churn_risk_final
WHERE churn_risk_score >= 40
GROUP BY 1
ORDER BY customers_affected DESC;
4. Presentation structure for business stakeholders:
Slide 1 - The Problem
  • X% of customers are at high churn risk
  • ₹Y crore in annual premiums at stake
Slide 2 - What We Measured (the 4 risk indicators)
  • Payment missed rate, days silent, renewal failures, complaint count
Slide 3 - Who Is at Risk (segmented customer list)
  • Exportable list with customer_id, agent_id, risk score, primary driver
Slide 4 - Recommended Actions by Tier
  • High Risk: personal call within 48 hours + loyalty discount offer
  • Medium Risk: automated WhatsApp/email campaign + early renewal incentive
  • Monitor: quarterly re-score and track improvement
Slide 5 - Success Metrics to Track
  • Target: reduce high-risk count by 30% in 90 days
  • Measure: retention rate, renewal completion rate, complaint resolution time
Key principle for business presentation: Never present raw SQL results to business stakeholders. Translate query outputs into: a number of customers affected, revenue at risk in rupees/crores, one clear recommended action per segment, and one metric to measure whether the action worked. The SQL is the engine; the story is what the business team acts on.