โ† Back to Interview Preparation
๐Ÿ—ƒ๏ธ

SQL Interview Guide

Comprehensive interview questions and answers to help you prepare for technical interviews.


📅 Day 1: SQL Basics, Filtering, Sorting & Pagination

Beginner 50 Marks 90 Min



Q1. Explain the execution order of a SQL query (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT). Why is this important? [3 Marks]

Answer: SQL queries are not executed in the order they are written. The actual logical execution order is:

  1. FROM - Identifies the source table(s) and performs JOINs
  2. WHERE - Filters rows before any grouping
  3. GROUP BY - Groups the filtered rows
  4. HAVING - Filters groups after aggregation
  5. SELECT - Selects and computes the output columns
  6. ORDER BY - Sorts the final result
  7. LIMIT / TOP - Restricts the number of rows returned

Why it matters: This order explains why you cannot use a SELECT alias in a WHERE clause (WHERE runs before SELECT), but you can use it in ORDER BY (ORDER BY runs after SELECT). It also clarifies why aggregate functions like COUNT() or SUM() cannot appear in WHERE; they must go in HAVING instead.
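The alias rule is easy to demonstrate; a minimal sketch (the employees table and an illustrative annual_salary alias):

```sql
-- Works: ORDER BY runs after SELECT, so the alias is visible there
SELECT name, salary * 12 AS annual_salary
FROM employees
ORDER BY annual_salary DESC;

-- Fails in standard SQL: WHERE runs before SELECT, so the alias
-- does not exist yet when rows are filtered
-- SELECT name, salary * 12 AS annual_salary
-- FROM employees
-- WHERE annual_salary > 100000;   -- error: unknown column

-- Portable fix: repeat the expression in the WHERE clause
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE salary * 12 > 100000;
```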


Q2. Write a query to select all employees from 'employees' table where department is 'Sales' and salary > 50000. [2 Marks]

Answer:

SELECT *
FROM employees
WHERE department = 'Sales'
  AND salary > 50000;

Q3. What is the difference between DISTINCT and GROUP BY for retrieving unique values? [3 Marks]

Answer:

Feature             | DISTINCT                                       | GROUP BY
Purpose             | Removes duplicate rows from the result         | Groups rows for aggregation
Use with aggregates | Cannot directly aggregate                      | Designed for aggregate functions (SUM, COUNT, etc.)
Performance         | Generally simpler and faster for deduplication | More overhead due to grouping logic
Flexibility         | Works on one or more columns                   | Can filter groups using HAVING

Example:

-- DISTINCT: just unique departments
SELECT DISTINCT department FROM employees;

-- GROUP BY: unique departments with employee count
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department;

Rule of thumb: Use DISTINCT when you only need unique values. Use GROUP BY when you also need to perform calculations on each group.
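For pure deduplication the two forms are interchangeable and return the same rows; a small sketch against the same employees table:

```sql
-- Both queries return the same set of unique departments
SELECT DISTINCT department FROM employees;

SELECT department FROM employees GROUP BY department;
```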


Q4. Write a query to find the top 10 highest-paid employees. Use both LIMIT and TOP syntax. [3 Marks]

Answer:

Using LIMIT (MySQL, PostgreSQL, SQLite):

SELECT employee_id, name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;

Using TOP (SQL Server / MS Access):

SELECT TOP 10 employee_id, name, salary
FROM employees
ORDER BY salary DESC;

Using FETCH NEXT (Standard SQL / PostgreSQL / Oracle 12c+):

SELECT employee_id, name, salary
FROM employees
ORDER BY salary DESC
FETCH NEXT 10 ROWS ONLY;

Q5. Explain the difference between WHERE and HAVING clauses with an example. [2 Marks]

Answer:

  • WHERE filters rows before grouping: it works on individual row values and cannot use aggregate functions.
  • HAVING filters after grouping: it works on aggregated results like COUNT(), SUM(), etc.

Example:

-- WHERE filters individual rows first
-- HAVING filters the grouped results
SELECT department, COUNT(*) AS total_employees
FROM employees
WHERE salary > 30000          -- filters rows before grouping
GROUP BY department
HAVING COUNT(*) > 5;          -- filters groups after aggregation

Here, only employees with salary > 30000 are considered, and only departments with more than 5 such employees are shown.


Q6. What does the BETWEEN operator do? Write a query to find orders placed between Jan 1 and Mar 31, 2024. [2 Marks]

Answer: The BETWEEN operator filters values within an inclusive range: both the start and end values are included in the result. It works with numbers, dates, and text.

SELECT order_id, customer_id, order_date, amount
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

Note: BETWEEN '2024-01-01' AND '2024-03-31' is equivalent to order_date >= '2024-01-01' AND order_date <= '2024-03-31'. For datetime columns, use '2024-03-31 23:59:59' or < '2024-04-01' to capture the full last day.
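For DATETIME/TIMESTAMP columns, the half-open form of the same filter is the safest pattern, since it needs no assumption about the column's time precision (same orders table as above):

```sql
-- Captures every moment of March 31, whatever the time precision
SELECT order_id, customer_id, order_date, amount
FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date <  '2024-04-01';
```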


Q7. Explain the difference between IN, EXISTS, and ANY operators with examples. [4 Marks]

Answer:

IN checks whether a value matches any value in a fixed list or subquery result.

-- Finds employees in Sales or HR
SELECT name FROM employees
WHERE department IN ('Sales', 'HR');

EXISTS checks whether a subquery returns at least one row. It is often more efficient for large datasets because it stops as soon as a match is found.

-- Finds customers who have placed at least one order
SELECT name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

ANY compares a value against any value returned by a subquery using a comparison operator (=, >, <, etc.).

-- Finds employees earning more than at least one person in the Intern grade
SELECT name FROM employees
WHERE salary > ANY (
    SELECT salary FROM employees WHERE grade = 'Intern'
);

Operator | Best Use Case                       | Stops Early?
IN       | Fixed list or small subquery        | No
EXISTS   | Checking row existence (correlated) | Yes
ANY      | Comparison against a set of values  | No

Q8. What is the difference between LIKE and ILIKE? Write a query to find all customers whose names start with 'A' (case-insensitive). [3 Marks]

Answer:

  • LIKE performs case-sensitive pattern matching (standard SQL).
  • ILIKE performs case-insensitive pattern matching (PostgreSQL-specific extension).

-- PostgreSQL: using ILIKE (case-insensitive)
SELECT customer_id, name
FROM customers
WHERE name ILIKE 'A%';

-- MySQL: LIKE is case-insensitive by default for most collations
SELECT customer_id, name
FROM customers
WHERE name LIKE 'A%';

-- SQL Server: case-insensitive workaround using UPPER()
SELECT customer_id, name
FROM customers
WHERE UPPER(name) LIKE 'A%';

Note: In MySQL, case sensitivity depends on the column's collation (e.g., utf8_general_ci is case-insensitive). In PostgreSQL, always use ILIKE for case-insensitive matching.


Q9. Write a query to find all products whose name contains 'pro' but not 'professional'. [3 Marks]

Answer:

-- PostgreSQL (ILIKE is PostgreSQL-specific)
SELECT product_id, product_name
FROM products
WHERE product_name ILIKE '%pro%'           -- contains 'pro' (case-insensitive)
  AND product_name NOT ILIKE '%professional%';  -- excludes 'professional'

MySQL version (LIKE is case-insensitive by default):

SELECT product_id, product_name
FROM products
WHERE product_name LIKE '%pro%'
  AND product_name NOT LIKE '%professional%';

SQL Server version:

SELECT product_id, product_name
FROM products
WHERE LOWER(product_name) LIKE '%pro%'
  AND LOWER(product_name) NOT LIKE '%professional%';

Q10. Explain NULL handling in SQL. Why does 'WHERE column = NULL' not work? What should be used instead? [3 Marks]

Answer: In SQL, NULL represents the absence of a value; it is not zero, not an empty string, and not equal to anything, including itself. This is based on three-valued logic (TRUE, FALSE, UNKNOWN).

Why WHERE column = NULL fails:

-- This will NEVER return rows: NULL = NULL evaluates to UNKNOWN, not TRUE
SELECT * FROM employees WHERE manager_id = NULL;  -- ❌ Wrong

Correct approach: use IS NULL / IS NOT NULL:

-- Correct: find rows where manager_id is NULL
SELECT * FROM employees WHERE manager_id IS NULL;     -- ✅

-- Correct: find rows where manager_id is NOT NULL
SELECT * FROM employees WHERE manager_id IS NOT NULL; -- ✅

Other NULL-safe handling functions:

-- COALESCE returns the first non-NULL value
SELECT COALESCE(phone, 'N/A') FROM customers;

-- NULL-safe equality in MySQL
SELECT * FROM employees WHERE manager_id <=> NULL;

Q11. What is the difference between COALESCE, ISNULL, and NULLIF? [2 Marks]

Answer:

Function                  | Description                                                                   | Dialect
COALESCE(a, b, c, ...)    | Returns the first non-NULL value from a list of arguments                     | Standard SQL (all databases)
ISNULL(expr, replacement) | Returns replacement if expr is NULL; like COALESCE but takes only 2 arguments | SQL Server (MySQL's two-argument equivalent is IFNULL)
NULLIF(a, b)              | Returns NULL if a = b, otherwise returns a; used to avoid division by zero    | Standard SQL (all databases)

Examples:

-- COALESCE: returns first non-null among multiple options
SELECT COALESCE(phone, mobile, 'No Contact') FROM customers;

-- ISNULL: SQL Server specific
SELECT ISNULL(salary, 0) FROM employees;

-- NULLIF: avoid division by zero
SELECT total_sales / NULLIF(total_orders, 0) AS avg_order_value
FROM sales_summary;

Q12. Write a query to sort customers by country ascending and total_spend descending. [3 Marks]

Answer:

SELECT
    customer_id,
    name,
    country,
    total_spend
FROM customers
ORDER BY
    country     ASC,
    total_spend DESC;

Explanation: When multiple columns are specified in ORDER BY, SQL sorts by the first column first. Rows with the same country are then sorted by total_spend in descending order (highest spender first within each country).


Q13. How do you implement pagination in SQL? Write a query to fetch page 3 with 20 records per page. [4 Marks]

Answer: Pagination is implemented by combining LIMIT (number of records per page) with OFFSET (number of records to skip).

Formula: OFFSET = (page_number - 1) × records_per_page
For Page 3 with 20 records: OFFSET = (3 - 1) × 20 = 40

MySQL / PostgreSQL (LIMIT + OFFSET):

SELECT customer_id, name, email
FROM customers
ORDER BY customer_id
LIMIT 20 OFFSET 40;

SQL Server (OFFSET FETCH):

SELECT customer_id, name, email
FROM customers
ORDER BY customer_id
OFFSET 40 ROWS
FETCH NEXT 20 ROWS ONLY;

Oracle (ROWNUM, older versions):

SELECT * FROM (
    SELECT customer_id, name, ROWNUM AS rn
    FROM customers
    WHERE ROWNUM <= 60
)
WHERE rn > 40;

Important: Always include ORDER BY with pagination; without it, the order of rows is not guaranteed across pages.
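A related point worth raising in interviews: OFFSET must still read and discard every skipped row, so deep pages get progressively slower. Keyset (seek) pagination avoids that by filtering on the last key seen on the previous page; a sketch using the same customers table:

```sql
-- Fetch the page that starts after customer_id 40
-- (the last key returned on the previous page)
SELECT customer_id, name, email
FROM customers
WHERE customer_id > 40
ORDER BY customer_id
LIMIT 20;
```

The trade-off: keyset pagination cannot jump to an arbitrary page number, but it stays fast at any depth and is stable under concurrent inserts.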


Q14. What is the difference between LIMIT/OFFSET and FETCH NEXT in standard SQL? [3 Marks]

Answer:

Feature           | LIMIT / OFFSET                        | FETCH NEXT
Standard          | Not in the official SQL standard      | Part of the ISO/ANSI SQL:2008 standard
Supported by      | MySQL, PostgreSQL, SQLite             | PostgreSQL, SQL Server, Oracle 12c+, DB2
Syntax style      | Simpler and shorter                   | More verbose but portable
Requires ORDER BY | Recommended but not enforced (varies) | Mandatory in SQL Server

LIMIT / OFFSET syntax:

SELECT name FROM customers
ORDER BY customer_id
LIMIT 20 OFFSET 40;

FETCH NEXT syntax (Standard SQL):

SELECT name FROM customers
ORDER BY customer_id
OFFSET 40 ROWS
FETCH NEXT 20 ROWS ONLY;

Recommendation: Use FETCH NEXT when writing database-agnostic or enterprise SQL. Use LIMIT/OFFSET for MySQL or quick scripts.


Q15. Given a 'customers' table, write a query to find the number of customers per country, excluding countries with less than 10 customers. [5 Marks]

Answer:

SELECT
    country,
    COUNT(*) AS total_customers
FROM customers
GROUP BY country
HAVING COUNT(*) >= 10
ORDER BY total_customers DESC;

Step-by-step explanation:

Step | Clause                        | What it does
1    | FROM customers                | Reads all rows from the customers table
2    | GROUP BY country              | Groups rows by each unique country
3    | COUNT(*)                      | Counts customers in each group
4    | HAVING COUNT(*) >= 10         | Excludes countries with fewer than 10 customers
5    | ORDER BY total_customers DESC | Shows highest-count countries first

Note: WHERE cannot be used here because it runs before grouping. HAVING is the correct clause to filter aggregated results.


Q16. Write a query to find the second highest salary in the employees table without using LIMIT. [5 Marks]

Answer:

Method 1: Using a Subquery (works in all databases):

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (
    SELECT MAX(salary) FROM employees
);

Method 2: Using NOT IN:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary NOT IN (
    SELECT MAX(salary) FROM employees
);

Method 3: Using a Correlated Subquery:

SELECT DISTINCT salary AS second_highest_salary
FROM employees e1
WHERE 1 = (
    SELECT COUNT(DISTINCT salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);

Method 4: Using DENSE_RANK() Window Function (recommended for production):

SELECT salary AS second_highest_salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;

Why DENSE_RANK() is preferred: It handles duplicate salaries gracefully. If two employees share the highest salary, DENSE_RANK() still correctly identifies the true second-highest distinct value.


📅 Day 2: Aggregations, GROUP BY & HAVING

Beginner 50 Marks 90 Min



Q1. Explain the difference between COUNT(*), COUNT(column_name), and COUNT(DISTINCT column_name). [4 Marks]

Answer:

Function                    | What it counts                                    | Includes NULLs?
COUNT(*)                    | Every row in the result set, regardless of values | ✅ Yes
COUNT(column_name)          | Only rows where the specified column is NOT NULL  | ❌ No
COUNT(DISTINCT column_name) | Only unique, non-NULL values in the column        | ❌ No

Example: assume this employees table:

emp_id | name  | manager_id
1      | Alice | NULL
2      | Bob   | 101
3      | Carol | 101
4      | Dave  | NULL

SELECT
    COUNT(*)                    AS total_rows,        -- 4
    COUNT(manager_id)           AS non_null_managers, -- 2
    COUNT(DISTINCT manager_id)  AS unique_managers     -- 1
FROM employees;

Key insight: Always use COUNT(*) to count total rows. Use COUNT(column) when NULLs should be excluded. Use COUNT(DISTINCT column) when you need unique value frequency.


Q2. Write a query to calculate the total revenue, average order value, and number of orders per month. [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL:

SELECT
    YEAR(order_date)                    AS order_year,
    MONTH(order_date)                   AS order_month,
    COUNT(order_id)                     AS total_orders,
    SUM(amount)                         AS total_revenue,
    ROUND(AVG(amount), 2)               AS avg_order_value
FROM orders
GROUP BY
    YEAR(order_date),
    MONTH(order_date)
ORDER BY
    order_year ASC,
    order_month ASC;

PostgreSQL:

SELECT
    TO_CHAR(order_date, 'YYYY-MM')      AS order_month,
    COUNT(order_id)                     AS total_orders,
    SUM(amount)                         AS total_revenue,
    ROUND(AVG(amount)::NUMERIC, 2)      AS avg_order_value
FROM orders
GROUP BY
    TO_CHAR(order_date, 'YYYY-MM')
ORDER BY
    order_month ASC;

SQL Server:

SELECT
    FORMAT(order_date, 'yyyy-MM')       AS order_month,
    COUNT(order_id)                     AS total_orders,
    SUM(amount)                         AS total_revenue,
    ROUND(AVG(amount), 2)               AS avg_order_value
FROM orders
GROUP BY
    FORMAT(order_date, 'yyyy-MM')
ORDER BY
    order_month ASC;

Q3. What happens when you use an aggregate function with NULL values? Illustrate with SUM and AVG. [3 Marks]

Answer: SQL aggregate functions ignore NULL values automatically, except COUNT(*), which counts all rows including those with NULLs.

Example: assume this salaries table:

emp_id | salary
1      | 50000
2      | 60000
3      | NULL
4      | 40000

SELECT
    SUM(salary)   AS total_salary,   -- 150000  (NULL ignored, not treated as 0)
    AVG(salary)   AS avg_salary,     -- 50000   (150000 / 3, NOT divided by 4)
    COUNT(*)      AS total_rows,     -- 4
    COUNT(salary) AS non_null_count  -- 3
FROM salaries;

Critical point for AVG:
- AVG(salary) = SUM(salary) / COUNT(salary) = 150000 / 3 = 50000
- It does NOT divide by 4 (total rows), which would give 37500
- This can lead to misleading averages if NULLs represent missing data (e.g., unpaid employees)

Safe approach โ€” treat NULL as 0 using COALESCE:

SELECT AVG(COALESCE(salary, 0)) AS avg_including_nulls  -- 37500
FROM salaries;

Q4. Write a query to find the department with the highest average salary. [4 Marks]

Answer:

Assumption: Table employees(emp_id, name, department, salary) exists.

Method 1: Using ORDER BY + LIMIT (MySQL / PostgreSQL):

SELECT
    department,
    ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC
LIMIT 1;

Method 2: Using a Subquery (works in all databases):

SELECT
    department,
    ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) = (
    SELECT MAX(avg_sal)
    FROM (
        SELECT AVG(salary) AS avg_sal
        FROM employees
        GROUP BY department
    ) AS dept_avgs
);

Method 3: Using RANK() Window Function (recommended; handles ties):

SELECT department, avg_salary
FROM (
    SELECT
        department,
        ROUND(AVG(salary), 2)                       AS avg_salary,
        RANK() OVER (ORDER BY AVG(salary) DESC)     AS rnk
    FROM employees
    GROUP BY department
) ranked
WHERE rnk = 1;

Why Method 3 is best: If two departments share the highest average salary, LIMIT 1 would arbitrarily drop one. RANK() returns all tied departments correctly.


Q5. Can you use columns in SELECT that are not in GROUP BY? Explain the rule and give an example. [3 Marks]

Answer: Generally, no: in standard SQL, every column in the SELECT clause must either appear in the GROUP BY clause or be wrapped inside an aggregate function (SUM, COUNT, MAX, etc.).

Why this rule exists: After grouping, each group collapses into a single row. If a non-grouped column has multiple different values within a group, SQL has no way to decide which value to display.

โŒ Invalid query (SQL Server / PostgreSQL will throw an error):

SELECT department, name, AVG(salary)  -- 'name' is neither grouped nor aggregated
FROM employees
GROUP BY department;
-- ERROR: column "name" must appear in GROUP BY or aggregate function

✅ Correct approach:

-- Option 1: Add name to GROUP BY
SELECT department, name, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, name;

-- Option 2: Use an aggregate on name
SELECT department, MAX(name) AS sample_employee, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

MySQL Exception: MySQL with a sql_mode that does not include ONLY_FULL_GROUP_BY allows non-grouped columns, but it returns an arbitrary, unpredictable value from the group, which is considered bad practice.


Q6. What is ROLLUP? Write a query to show sales by region, by country, and grand total using ROLLUP. [4 Marks]

Answer: ROLLUP is a GROUP BY extension that generates subtotals and a grand total automatically. It creates a hierarchy of groupings moving from the most specific to the most general level.

Assumption: Table sales(sale_id, region, country, amount) exists.

MySQL / PostgreSQL / SQL Server:

SELECT
    COALESCE(region, 'ALL REGIONS')    AS region,
    COALESCE(country, 'ALL COUNTRIES') AS country,
    SUM(amount)                         AS total_sales
FROM sales
GROUP BY ROLLUP(region, country)
ORDER BY region, country;

Sample Output:

region      | country       | total_sales
Asia        | India         | 500000
Asia        | Japan         | 300000
Asia        | ALL COUNTRIES | 800000
Europe      | France        | 400000
Europe      | Germany       | 600000
Europe      | ALL COUNTRIES | 1000000
ALL REGIONS | ALL COUNTRIES | 1800000

Note: ROLLUP produces subtotals for each region and a grand total at the end. COALESCE is used to label the NULL placeholders that ROLLUP generates for subtotal/grand total rows.
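One caveat with the COALESCE labelling: a genuinely NULL region value would also be relabelled 'ALL REGIONS'. The standard GROUPING() function (available in PostgreSQL, SQL Server, Oracle, and MySQL 8.0+; not SQLite) distinguishes real NULLs from the placeholder NULLs that ROLLUP generates; a sketch against the same sales table:

```sql
-- GROUPING(col) returns 1 on subtotal rows where col was rolled up, else 0
SELECT
    CASE WHEN GROUPING(region)  = 1 THEN 'ALL REGIONS'   ELSE region  END AS region,
    CASE WHEN GROUPING(country) = 1 THEN 'ALL COUNTRIES' ELSE country END AS country,
    SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, country);
```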


Q7. Explain the difference between CUBE and ROLLUP with examples. [4 Marks]

Answer:

Feature             | ROLLUP                                                   | CUBE
Subtotals generated | Hierarchical: follows left-to-right column order         | All possible combinations of the specified columns
Number of groupings | n + 1 (for n columns)                                    | 2ⁿ (for n columns)
Use case            | Hierarchical data (year → month → day, region → country) | Multi-dimensional analysis (pivot-style reporting)

Example table: sales(region, product, amount)

ROLLUP (hierarchical subtotals):

SELECT
    region,
    product,
    SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, product);

Produces groupings:
- (region, product): detail level
- (region): subtotal per region
- (): grand total

CUBE (all combinations):

SELECT
    region,
    product,
    SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE(region, product);

Produces groupings:
- (region, product): detail level
- (region): subtotal per region
- (product): subtotal per product ← the extra grouping CUBE adds
- (): grand total

Rule of thumb: Use ROLLUP for drill-down reports (top-to-bottom hierarchy). Use CUBE when you need totals sliced every possible way, like a pivot table.


Q8. Write a query to find customers who placed more than 5 orders in the last 6 months. [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL:

SELECT
    customer_id,
    COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;

PostgreSQL:

SELECT
    customer_id,
    COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;

SQL Server:

SELECT
    customer_id,
    COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATEADD(MONTH, -6, GETDATE())
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_orders DESC;

Note: WHERE filters rows to the last 6 months before grouping (efficient, since it reduces data early). HAVING then filters the grouped results to only customers with more than 5 orders.


Q9. Write a query to find products whose average rating is above 4.5 but with at least 100 reviews. [4 Marks]

Answer:

Assumption: Table reviews(review_id, product_id, rating) exists. Table products(product_id, product_name) exists.

SELECT
    p.product_id,
    p.product_name,
    COUNT(r.review_id)          AS total_reviews,
    ROUND(AVG(r.rating), 2)     AS avg_rating
FROM products p
JOIN reviews r
    ON p.product_id = r.product_id
GROUP BY
    p.product_id,
    p.product_name
HAVING
    COUNT(r.review_id) >= 100       -- at least 100 reviews
    AND AVG(r.rating) > 4.5         -- average rating above 4.5
ORDER BY avg_rating DESC;

Why HAVING and not WHERE: Both conditions (AVG and COUNT) are aggregate results computed after grouping. WHERE runs before grouping and cannot reference aggregates, so HAVING is the only correct option here.


Q10. Can HAVING be used without GROUP BY? Give an example. [3 Marks]

Answer: Yes, HAVING can be used without GROUP BY. In this case, the entire table is treated as one single group, and HAVING decides whether that group satisfies the condition, returning either the result row or no rows at all.

Example: check whether the overall average salary in the company exceeds 50000:

SELECT AVG(salary) AS company_avg_salary
FROM employees
HAVING AVG(salary) > 50000;

  • If the company-wide average salary is above 50000, the result row is returned.
  • If not, no rows are returned at all.

Practical use case: return the total order count only if there are more than 1000 orders (note that SELECT * would be invalid here, since non-aggregated columns cannot appear once the table collapses to one group):

SELECT COUNT(*) AS total_orders
FROM orders
HAVING COUNT(*) > 1000;

Note: While valid, HAVING without GROUP BY is uncommon in real-world queries. A WHERE clause or a simple SELECT with a conditional check is usually more readable for this purpose.


Q11. Write a query to find departments where the total salary exceeds 1 million AND the employee count is more than 20. [3 Marks]

Answer:

Assumption: Table employees(emp_id, name, department, salary) exists.

SELECT
    department,
    COUNT(emp_id)       AS employee_count,
    SUM(salary)         AS total_salary
FROM employees
GROUP BY department
HAVING
    SUM(salary)  > 1000000    -- total salary exceeds 1 million
    AND COUNT(emp_id) > 20    -- employee count more than 20
ORDER BY total_salary DESC;

Explanation: Both conditions operate on aggregated values (SUM and COUNT), so they must go in the HAVING clause, not in WHERE. WHERE only filters individual rows before grouping and cannot use aggregate functions.


Q12. Write a query to find the top 3 products by revenue in each category. [5 Marks]

Answer:

Assumption: Table products(product_id, product_name, category) and table order_items(order_item_id, product_id, quantity, unit_price) exist.

Method: Using DENSE_RANK() Window Function (recommended):

SELECT
    category,
    product_name,
    total_revenue,
    revenue_rank
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(oi.quantity * oi.unit_price)            AS total_revenue,
        DENSE_RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(oi.quantity * oi.unit_price) DESC
        )                                           AS revenue_rank
    FROM products p
    JOIN order_items oi
        ON p.product_id = oi.product_id
    GROUP BY
        p.category,
        p.product_name
) ranked
WHERE revenue_rank <= 3
ORDER BY
    category ASC,
    revenue_rank ASC;

How it works step by step:

Step                                          | What happens
JOIN                                          | Combines products with their order line items
GROUP BY category, product_name               | Calculates total revenue per product per category
DENSE_RANK() OVER (PARTITION BY category ...) | Ranks products within each category by revenue
WHERE revenue_rank <= 3                       | Keeps only the top 3 per category

Why DENSE_RANK() over ROW_NUMBER(): DENSE_RANK() handles ties correctly: if two products share rank 2, both appear, and rank 3 is still assigned to the next product. ROW_NUMBER() would arbitrarily break the tie and potentially exclude a valid top-3 product.
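The behavioural difference between the three ranking functions is easiest to see on a small tie (a throwaway scores table; values are illustrative):

```sql
-- Two rows tie at 90:
--   ROW_NUMBER breaks the tie arbitrarily (1, 2, 3, 4)
--   RANK leaves a gap after the tie       (1, 2, 2, 4)
--   DENSE_RANK leaves no gap              (1, 2, 2, 3)
SELECT score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores
ORDER BY score DESC;
```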


Q13. Given an orders table with columns (order_id, customer_id, order_date, amount), write a query to find the month with the highest total sales. [5 Marks]

Answer:

MySQL:

SELECT
    YEAR(order_date)                AS order_year,
    MONTH(order_date)               AS order_month,
    SUM(amount)                     AS total_sales
FROM orders
GROUP BY
    YEAR(order_date),
    MONTH(order_date)
ORDER BY total_sales DESC
LIMIT 1;

PostgreSQL:

SELECT
    TO_CHAR(order_date, 'YYYY-MM')  AS order_month,
    SUM(amount)                     AS total_sales
FROM orders
GROUP BY
    TO_CHAR(order_date, 'YYYY-MM')
ORDER BY total_sales DESC
LIMIT 1;

SQL Server (handles ties correctly using RANK):

SELECT order_month, total_sales
FROM (
    SELECT
        FORMAT(order_date, 'yyyy-MM')           AS order_month,
        SUM(amount)                             AS total_sales,
        RANK() OVER (
            ORDER BY SUM(amount) DESC
        )                                       AS rnk
    FROM orders
    GROUP BY FORMAT(order_date, 'yyyy-MM')
) ranked
WHERE rnk = 1;

Why use RANK() for SQL Server: If two months share the exact same highest total sales, TOP 1 would arbitrarily return one. RANK() ensures all tied months are returned correctly. For MySQL/PostgreSQL, use LIMIT 1 for simplicity or wrap in a subquery with MAX() for tie-safety.

Tie-safe version for MySQL / PostgreSQL:

SELECT
    TO_CHAR(order_date, 'YYYY-MM')  AS order_month,
    SUM(amount)                     AS total_sales
FROM orders
GROUP BY TO_CHAR(order_date, 'YYYY-MM')
HAVING SUM(amount) = (
    SELECT MAX(monthly_total)
    FROM (
        SELECT SUM(amount) AS monthly_total
        FROM orders
        GROUP BY TO_CHAR(order_date, 'YYYY-MM')
    ) AS monthly_sales
);

📅 Day 3: Joins - INNER, LEFT, RIGHT, FULL & Self Joins

Intermediate 50 Marks 90 Min



Q1. Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN with Venn diagrams. [4 Marks]

Answer:

INNER JOIN: Returns only rows where there is a match in both tables.

    A        B
  [   |▓▓▓|   ]
  Only the intersection

SELECT c.name, o.order_id
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
-- Only customers who HAVE orders

LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table and matched rows from the right table. Unmatched right-side columns return NULL.

    A        B
  [▓▓▓|▓▓▓|   ]
  All of A + matching B

SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
-- All customers, even those with no orders (order_id = NULL)

RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table and matched rows from the left table. Unmatched left-side columns return NULL.

    A        B
  [   |▓▓▓|▓▓▓]
  Matching A + all of B

SELECT c.name, o.order_id
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
-- All orders, even if the customer record no longer exists

FULL OUTER JOIN: Returns all rows from both tables. NULLs fill in where there is no match on either side.

    A        B
  [▓▓▓|▓▓▓|▓▓▓]
  Everything from both

SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
-- All customers + all orders, matched where possible
-- MySQL does NOT support FULL OUTER JOIN natively (use UNION of LEFT + RIGHT)

Join Type       | Left Unmatched | Right Unmatched | Matched Rows
INNER JOIN      | ❌             | ❌              | ✅
LEFT JOIN       | ✅             | ❌              | ✅
RIGHT JOIN      | ❌             | ✅              | ✅
FULL OUTER JOIN | ✅             | ✅              | ✅

Q2. Write a query to find all customers who have never placed an order using LEFT JOIN. [3 Marks]

Answer:

Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.

SELECT
    c.customer_id,
    c.name,
    c.email
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL
ORDER BY c.customer_id;

How it works:

Step                     | What happens
LEFT JOIN                | Returns all customers, with NULL in order columns if no matching order exists
WHERE o.order_id IS NULL | Keeps only customers where no order was found (the NULL rows)

Common mistake: Using WHERE o.order_id = NULL instead of IS NULL. That version always returns zero rows because NULL comparisons with = evaluate to UNKNOWN, not TRUE. Always use IS NULL to check for NULLs.
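The same anti-join is often written with NOT EXISTS, which optimizers handle well and which, unlike NOT IN, stays correct even when the subquery can return NULLs:

```sql
SELECT c.customer_id, c.name, c.email
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
)
ORDER BY c.customer_id;
```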


Q3. What is a CROSS JOIN? When is it useful? [3 Marks]

Answer: A CROSS JOIN produces the Cartesian product of two tables: every row from the left table is combined with every row from the right table. If Table A has m rows and Table B has n rows, the result has m × n rows.

Syntax:

SELECT
    a.color,
    b.size
FROM colors a
CROSS JOIN sizes b;

Example output (3 colors × 3 sizes = 9 rows):

color | size
Red   | Small
Red   | Medium
Red   | Large
Blue  | Small
Blue  | Medium
...   | ...

When is CROSS JOIN useful?

-- 1. Generating all combinations (e.g., product variants)
SELECT p.product_name, c.color, s.size
FROM products p
CROSS JOIN colors c
CROSS JOIN sizes s;

-- 2. Generating a date/calendar table (PostgreSQL)
SELECT CURRENT_DATE + offs AS date_value
FROM generate_series(0, 364) AS offs;

-- 3. Seeding test data with all possible combinations
SELECT e.employee_id, t.task_type
FROM employees e
CROSS JOIN task_types t;

Warning: CROSS JOINs on large tables can be extremely expensive. A table of 1,000 rows CROSS JOINed with another 1,000-row table produces 1,000,000 rows. Always ensure this is intentional.


Q4. Explain the difference between ON and USING in join syntax. [2 Marks]

Answer:

Feature          | ON                                                                | USING
Column names     | Works even if column names differ between tables                  | Requires the join column to have the same name in both tables
Flexibility     | Supports complex conditions (>, <, multiple columns, expressions) | Only supports simple equality on identically named columns
Result columns   | Join column appears twice (once per table) in SELECT *            | Join column appears only once in SELECT *
Standard support | All SQL databases                                                 | Standard SQL, supported by MySQL and PostgreSQL

-- Using ON (flexible: column names can differ)
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.cust_id;  -- different column names

-- Using USING (concise: column name must be identical in both tables)
SELECT name, order_id
FROM customers
JOIN orders USING (customer_id);  -- customer_id exists in both tables

-- ON with multiple conditions
SELECT *
FROM orders o
JOIN promotions p
    ON o.customer_id = p.customer_id
    AND o.order_date BETWEEN p.start_date AND p.end_date;

Rule of thumb: Use USING for clean, simple joins when column names match. Use ON for everything else โ€” it is more explicit and universally supported.


Q5. Write a query to join three tables: customers, orders, and products, showing customer name, order date, and product name. [3 Marks]

Answer:

Assumption: Tables are:
- customers(customer_id, name, email)
- orders(order_id, customer_id, order_date)
- order_items(order_item_id, order_id, product_id, quantity)
- products(product_id, product_name, price)

SELECT
    c.name           AS customer_name,
    o.order_date,
    p.product_name
FROM customers c
INNER JOIN orders o
    ON c.customer_id = o.customer_id
INNER JOIN order_items oi
    ON o.order_id = oi.order_id
INNER JOIN products p
    ON oi.product_id = p.product_id
ORDER BY
    o.order_date DESC,
    c.name ASC;

Note: In a normalized schema, orders and products are linked through an intermediate order_items table. Attempting to join orders directly to products without order_items would be incorrect as no direct foreign key relationship exists between them.


Q6. What is a self join? Write a query to find each employee and their manager's name from an 'employees' table with a 'manager_id' column. [4 Marks]

Answer: A self join is when a table is joined to itself. This is useful when a table contains a hierarchical or relational reference to its own rows โ€” such as an employee's manager also being an employee in the same table.

Assumption: Table employees(emp_id, name, department, salary, manager_id) where manager_id references emp_id in the same table.

-- INNER JOIN version (excludes employees with no manager, e.g., the CEO)
SELECT
    e.emp_id                AS employee_id,
    e.name                  AS employee_name,
    m.name                  AS manager_name
FROM employees e
INNER JOIN employees m
    ON e.manager_id = m.emp_id
ORDER BY e.emp_id;
-- LEFT JOIN version (includes employees with no manager โ€” shows NULL for top-level)
SELECT
    e.emp_id                AS employee_id,
    e.name                  AS employee_name,
    COALESCE(m.name, 'No Manager โ€” Top Level')  AS manager_name
FROM employees e
LEFT JOIN employees m
    ON e.manager_id = m.emp_id
ORDER BY e.emp_id;

Key concept: The same table is aliased twice โ€” e for the employee row and m for the manager row โ€” so SQL treats them as two separate logical tables during the join.


Q7. Write a query to find all pairs of employees in the same department (without pairing employees with themselves or duplicating pairs). [3 Marks]

Answer:

Assumption: Table employees(emp_id, name, department) exists.

SELECT
    e1.name         AS employee_1,
    e2.name         AS employee_2,
    e1.department
FROM employees e1
JOIN employees e2
    ON  e1.department = e2.department   -- same department
    AND e1.emp_id < e2.emp_id           -- prevents self-pairing AND duplicate pairs
ORDER BY
    e1.department,
    e1.name;

Why e1.emp_id < e2.emp_id works:

| Condition | What it prevents |
|---|---|
| e1.emp_id != e2.emp_id | Prevents self-pairing (Alice, Alice) but still gives (Alice, Bob) AND (Bob, Alice) |
| e1.emp_id < e2.emp_id | Prevents self-pairing AND ensures each pair appears only once, in one direction |

Example: If Alice has emp_id=1 and Bob has emp_id=2, only the pair (1, 2) is included. The reverse pair (2, 1) is excluded because 2 < 1 is FALSE.
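A quick way to see the deduplication in action is to run the pattern against a tiny in-memory database. This sketch uses Python's sqlite3 with made-up employees; with three people in one department it returns exactly C(3,2) = 3 pairs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, department TEXT);
    INSERT INTO employees VALUES
        (1, 'Alice', 'Sales'), (2, 'Bob', 'Sales'), (3, 'Carol', 'Sales');
""")

# Same-department pairs, each appearing only once thanks to e1.emp_id < e2.emp_id
pairs = conn.execute("""
    SELECT e1.name, e2.name
    FROM employees e1
    JOIN employees e2
      ON e1.department = e2.department AND e1.emp_id < e2.emp_id
    ORDER BY e1.emp_id, e2.emp_id
""").fetchall()
print(pairs)  # [('Alice', 'Bob'), ('Alice', 'Carol'), ('Bob', 'Carol')]
```

Swapping the condition to `!=` would return 6 rows, every pair in both directions.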


Q8. What is a non-equi join? Give an example with price ranges. [3 Marks]

Answer: A non-equi join is a join that uses a comparison operator other than = in the ON clause โ€” such as >, <, >=, <=, BETWEEN, or !=. It is used when rows should be matched based on a range or inequality condition rather than exact key matching.

Example โ€” Categorize products into price tiers using a price_ranges table:

Assumption:
- products(product_id, product_name, price)
- price_ranges(tier_name, min_price, max_price)

-- price_ranges table:
-- | tier_name  | min_price | max_price |
-- | Budget     | 0         | 999       |
-- | Mid-Range  | 1000      | 4999      |
-- | Premium    | 5000      | 9999      |
-- | Luxury     | 10000     | 999999    |

SELECT
    p.product_id,
    p.product_name,
    p.price,
    pr.tier_name        AS price_tier
FROM products p
JOIN price_ranges pr
    ON p.price BETWEEN pr.min_price AND pr.max_price
ORDER BY p.price;

Other common non-equi join use cases:

-- Find employees whose salary falls within a salary band
SELECT e.name, sb.band_name
FROM employees e
JOIN salary_bands sb
    ON e.salary BETWEEN sb.lower_bound AND sb.upper_bound;

-- Find all orders placed during an active promotion period
SELECT o.order_id, promo.promo_name
FROM orders o
JOIN promotions promo
    ON o.order_date >= promo.start_date
    AND o.order_date <= promo.end_date;

Q9. Write a query using LEFT JOIN to find customers who have placed orders but never returned any product. (tables: customers, orders, returns) [5 Marks]

Answer:

Assumption: Tables are:
- customers(customer_id, name, email)
- orders(order_id, customer_id, order_date, amount)
- returns(return_id, order_id, customer_id, return_date, reason)

SELECT
    c.customer_id,
    c.name,
    c.email,
    COUNT(DISTINCT o.order_id)  AS total_orders
FROM customers c
INNER JOIN orders o
    ON c.customer_id = o.customer_id    -- only customers who have ordered
LEFT JOIN returns r
    ON o.order_id = r.order_id          -- attempt to match any returns
GROUP BY
    c.customer_id,
    c.name,
    c.email
HAVING COUNT(r.return_id) = 0           -- keep only customers with NO returns at all
ORDER BY total_orders DESC;

Step-by-step logic:

| Step | Clause | Purpose |
|---|---|---|
| 1 | INNER JOIN orders | Restrict to customers who have placed at least one order |
| 2 | LEFT JOIN returns | Attempt to find a matching return for each order |
| 3 | GROUP BY | Aggregate per customer |
| 4 | HAVING COUNT(r.return_id) = 0 | Keep only customers none of whose orders has a return |

Why HAVING instead of WHERE r.return_id IS NULL: filtering with WHERE r.return_id IS NULL before grouping would only drop the returned orders; a customer with one returned order and one unreturned order would still slip through, even though they have returned a product. Aggregating first and demanding COUNT(r.return_id) = 0 enforces "never returned anything" at the customer level. COUNT(DISTINCT o.order_id) guards against double-counting an order that has multiple return rows. The join choices are deliberate: INNER JOIN guarantees the customer definitely ordered, while the LEFT JOIN lets us detect the absence of returns.


Q10. Explain the difference between a LEFT JOIN with WHERE clause vs a LEFT JOIN with a condition in the ON clause. Give examples showing the different results. [5 Marks]

Answer: This is one of the most commonly misunderstood concepts in SQL joins. The placement of a filter condition โ€” in ON vs WHERE โ€” produces fundamentally different results with LEFT JOIN.

Setup โ€” Two tables:

-- customers: Alice (1), Bob (2), Carol (3)
-- orders: Alice has 1 order, Bob has 1 order, Carol has none

Case 1 โ€” Filter in ON clause (soft filter):

SELECT
    c.name,
    o.order_id
FROM customers c
LEFT JOIN orders o
    ON  c.customer_id = o.customer_id
    AND o.amount > 500;       -- condition applied DURING the join

Result:
| name | order_id |
|---|---|
| Alice | 101 |
| Bob | NULL |
| Carol | NULL |

All customers are returned. Bob's order exists but its amount was โ‰ค 500, so it doesn't match the ON condition โ€” Bob gets NULL for order_id but still appears in results.

Case 2 โ€” Filter in WHERE clause (hard filter):

SELECT
    c.name,
    o.order_id
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.amount > 500;         -- condition applied AFTER the join

Result:
| name | order_id |
|---|---|
| Alice | 101 |

Carol and Bob are completely removed. The LEFT JOIN first produces NULLs for unmatched rows, but then WHERE o.amount > 500 filters them out โ€” NULL > 500 evaluates to UNKNOWN (not TRUE), so those rows are eliminated. This effectively converts LEFT JOIN into INNER JOIN.

Summary:

| | ON Condition | WHERE Condition |
|---|---|---|
| Filter applied | During join construction | After join is complete |
| Effect on unmatched rows | Still included (with NULLs) | Eliminated if condition fails |
| Result type | True LEFT JOIN preserved | Behaves like INNER JOIN |
| Use when | You want all left rows regardless | You want to strictly filter final results |

Golden rule: To keep the LEFT JOIN behavior intact while filtering the right-side table, always put right-table conditions in the ON clause. Putting them in WHERE silently turns your LEFT JOIN into an INNER JOIN.
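Both cases can be reproduced in seconds with an in-memory database. This sketch uses Python's sqlite3 with illustrative data (Alice's order is 800, Bob's is 300, Carol has none):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 800), (102, 2, 300);
""")

# Case 1: filter in ON keeps all three customers (soft filter)
on_filter = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id AND o.amount > 500
    ORDER BY c.customer_id
""").fetchall()

# Case 2: filter in WHERE eliminates the NULL rows (hard filter)
where_filter = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.amount > 500
""").fetchall()

print(on_filter)     # [('Alice', 101), ('Bob', None), ('Carol', None)]
print(where_filter)  # [('Alice', 101)]
```

The WHERE version silently behaves like an INNER JOIN: Bob and Carol vanish because `NULL > 500` is UNKNOWN.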


Q11. Write a query to find customers whose total order value in 2024 is greater than their total order value in 2023. [5 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) and customers(customer_id, name) exist.

WITH orders_2023 AS (
    SELECT
        customer_id,
        SUM(amount) AS total_2023
    FROM orders
    WHERE YEAR(order_date) = 2023
    GROUP BY customer_id
),
orders_2024 AS (
    SELECT
        customer_id,
        SUM(amount) AS total_2024
    FROM orders
    WHERE YEAR(order_date) = 2024
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.name,
    COALESCE(o23.total_2023, 0)     AS total_2023,
    COALESCE(o24.total_2024, 0)     AS total_2024,
    (COALESCE(o24.total_2024, 0)
     - COALESCE(o23.total_2023, 0)) AS growth
FROM customers c
INNER JOIN orders_2024 o24
    ON c.customer_id = o24.customer_id          -- must have 2024 orders
LEFT JOIN orders_2023 o23
    ON c.customer_id = o23.customer_id          -- may or may not have 2023 orders
WHERE COALESCE(o24.total_2024, 0)
    > COALESCE(o23.total_2023, 0)               -- 2024 must exceed 2023
ORDER BY growth DESC;

Why COALESCE: A customer who had zero orders in 2023 but ordered in 2024 would have NULL for total_2023. COALESCE(o23.total_2023, 0) converts that NULL to 0, allowing the comparison to work correctly and also including new customers who grew from 0 to something.


Q12. Write a query to find the top 5 customers by total spend, including customers who haven't ordered anything (show them with 0). [5 Marks]

Answer:

Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.

SELECT
    c.customer_id,
    c.name,
    c.email,
    COALESCE(SUM(o.amount), 0)      AS total_spend,
    COUNT(o.order_id)               AS total_orders
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id,
    c.name,
    c.email
ORDER BY total_spend DESC
LIMIT 5;                            -- use TOP 5 for SQL Server

SQL Server version:

SELECT TOP 5
    c.customer_id,
    c.name,
    c.email,
    COALESCE(SUM(o.amount), 0)      AS total_spend,
    COUNT(o.order_id)               AS total_orders
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id,
    c.name,
    c.email
ORDER BY total_spend DESC;

Design decisions explained:

| Choice | Reason |
|---|---|
| LEFT JOIN instead of INNER JOIN | Ensures customers with zero orders are included |
| COALESCE(SUM(o.amount), 0) | Converts NULL (no orders) to 0 for clean display |
| COUNT(o.order_id) not COUNT(*) | COUNT(*) would count 1 even for the NULL row; COUNT(order_id) correctly returns 0 |
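The COUNT(*) vs COUNT(column) distinction is easy to demonstrate. This sqlite3 sketch (illustrative data: one customer with an order, one without) shows the all-NULL right-side row that a LEFT JOIN produces:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'NoOrders');
    INSERT INTO orders VALUES (101, 1, 500);
""")

# COUNT(*) counts the NULL-padded row; COUNT(o.order_id) ignores it
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS count_star, COUNT(o.order_id) AS count_col
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()
print(rows)  # [('Alice', 1, 1), ('NoOrders', 1, 0)]
```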

Q13. Given 'departments' and 'employees' tables, find departments that have no employees assigned. [5 Marks]

Answer:

Assumption: Tables departments(dept_id, dept_name, location) and employees(emp_id, name, dept_id, salary) exist.

Method 1 โ€” LEFT JOIN + IS NULL (Anti-Join pattern โ€” most common):

SELECT
    d.dept_id,
    d.dept_name,
    d.location
FROM departments d
LEFT JOIN employees e
    ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL
ORDER BY d.dept_name;

Method 2 โ€” NOT EXISTS (most readable and efficient for large datasets):

SELECT
    d.dept_id,
    d.dept_name,
    d.location
FROM departments d
WHERE NOT EXISTS (
    SELECT 1
    FROM employees e
    WHERE e.dept_id = d.dept_id
)
ORDER BY d.dept_name;

Method 3 โ€” NOT IN (simple but unsafe if dept_id can be NULL in employees):

SELECT
    dept_id,
    dept_name,
    location
FROM departments
WHERE dept_id NOT IN (
    SELECT dept_id
    FROM employees
    WHERE dept_id IS NOT NULL   -- critical: NOT IN fails silently if subquery has NULLs
)
ORDER BY dept_name;

Method comparison:

| Method | Performance | NULL safe | Readability |
|---|---|---|---|
| LEFT JOIN + IS NULL | ✅ Fast | ✅ Yes | ✅ Common pattern |
| NOT EXISTS | ✅ Fast (stops early) | ✅ Yes | ✅ Very readable |
| NOT IN | ⚠️ Slower on large sets | ❌ Fails with NULLs | ✅ Simple |

Best practice: Prefer NOT EXISTS or LEFT JOIN + IS NULL over NOT IN in production queries. If the subquery in NOT IN returns even one NULL value, the entire query returns zero rows โ€” a silent and dangerous bug.
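The NOT IN trap is worth seeing once with real data. This sqlite3 sketch (illustrative rows: an empty department, plus one employee whose dept_id is NULL) shows NOT IN returning zero rows while NOT EXISTS finds the empty department:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'Sales'), (20, 'HR'), (30, 'Empty');
    INSERT INTO employees VALUES (1, 10), (2, 20), (3, NULL);
""")

# NOT IN against a subquery containing a NULL: every comparison is UNKNOWN
unsafe = conn.execute("""
    SELECT dept_name FROM departments
    WHERE dept_id NOT IN (SELECT dept_id FROM employees)
""").fetchall()

# NOT EXISTS handles the NULL correctly
safe = conn.execute("""
    SELECT dept_name FROM departments d
    WHERE NOT EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id)
""").fetchall()

print(unsafe)  # [] -- the single NULL silently kills every row
print(safe)    # [('Empty',)]
```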


๐Ÿ“… Day 4: Subqueries, CTEs & Derived Tables

Intermediate 50 Marks 90 Min



Q1. Explain the difference between a correlated and non-correlated subquery with examples. [4 Marks]

Answer:

| Feature | Non-Correlated Subquery | Correlated Subquery |
|---|---|---|
| Dependency | Independent of the outer query | References a column from the outer query |
| Execution | Runs once, result reused | Runs once per row of the outer query |
| Performance | Generally faster | Generally slower on large datasets |
| Can run alone? | ✅ Yes | ❌ No — depends on outer query |

Non-Correlated Subquery โ€” runs once, no reference to outer query:

-- Find all employees earning more than the company-wide average salary
SELECT emp_id, name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary)      -- runs once, returns a single value
    FROM employees
);

Correlated Subquery โ€” references outer query, runs per row:

-- Find all employees earning more than their OWN department's average salary
SELECT emp_id, name, department, salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.department = e1.department  -- references e1 from outer query
    -- re-executes for every row in e1
);

Performance tip: Correlated subqueries can be slow on large tables because they execute once per outer row. They are often rewritable as JOINs or window functions for better performance.


Q2. Write a query using a subquery to find employees who earn more than the average salary of their department. [4 Marks]

Answer:

Assumption: Table employees(emp_id, name, department, salary) exists.

Method 1 โ€” Correlated Subquery:

SELECT
    emp_id,
    name,
    department,
    salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.department = e1.department
)
ORDER BY department, salary DESC;

Method 2 โ€” Derived Table (often more efficient):

SELECT
    e.emp_id,
    e.name,
    e.department,
    e.salary,
    ROUND(dept_avg.avg_salary, 2)   AS dept_avg_salary
FROM employees e
JOIN (
    SELECT
        department,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) AS dept_avg
    ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_salary
ORDER BY e.department, e.salary DESC;

Method 3 โ€” Using Window Function (most efficient):

SELECT emp_id, name, department, salary
FROM (
    SELECT
        emp_id,
        name,
        department,
        salary,
        AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
) ranked
WHERE salary > dept_avg
ORDER BY department, salary DESC;

Why Method 3 is best: The window function computes department averages in a single pass over the data. No correlated re-execution, no extra JOIN โ€” it is the most performant approach for large datasets.


Q3. What is the difference between a scalar, row, and table subquery? [3 Marks]

Answer:

| Type | Returns | Used in | Must return |
|---|---|---|---|
| Scalar | A single value (1 row × 1 column) | SELECT, WHERE, HAVING | Exactly 1 value (error if multiple) |
| Row | A single row (1 row × multiple columns) | WHERE with row constructors | Exactly 1 row |
| Table | Multiple rows and columns | FROM clause (as derived table) | Any number of rows |

Scalar Subquery โ€” returns one value:

-- Returns a single number used in comparison
SELECT name, salary,
       (SELECT AVG(salary) FROM employees) AS company_avg
FROM employees;

Row Subquery โ€” returns one row, compared as a unit:

-- Find employees with the exact same department and salary as emp_id 101
SELECT name, department, salary
FROM employees
WHERE (department, salary) = (
    SELECT department, salary
    FROM employees
    WHERE emp_id = 101
)
  AND emp_id <> 101;    -- exclude employee 101 from matching itself

Table Subquery โ€” returns a result set used as a virtual table:

-- Use multiple rows and columns as an inline table in FROM
SELECT dept_summary.department, dept_summary.avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) AS dept_summary
WHERE dept_summary.avg_salary > 60000;

Q4. Write a query using EXISTS to find customers who have placed at least one order above 10000. [4 Marks]

Answer:

Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.

SELECT
    c.customer_id,
    c.name,
    c.email
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 10000
)
ORDER BY c.name;

Why SELECT 1 inside EXISTS:

-- EXISTS only checks whether the subquery returns ANY row โ€” it doesn't care about values
-- SELECT 1, SELECT *, SELECT NULL โ€” all work identically inside EXISTS
-- SELECT 1 is a common convention signaling "we only care about existence, not data"

EXISTS vs IN for this query:

-- Using IN (alternative)
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE amount > 10000
);

-- EXISTS is preferred when:
-- 1. The subquery could return a large result set (EXISTS stops at first match)
-- 2. The subquery involves NULLs (IN behaves unexpectedly with NULLs)
-- 3. You need a correlated condition (EXISTS naturally supports correlation)

Performance note: EXISTS short-circuits โ€” it stops scanning as soon as it finds the first matching row. IN collects all matching values first, then checks membership. For large datasets, EXISTS is typically faster.


Q5. What is a CTE? How does it differ from a subquery and a temporary table? When would you prefer a CTE? [4 Marks]

Answer: A Common Table Expression (CTE) is a named, temporary result set defined using the WITH clause at the beginning of a query. It exists only for the duration of that single query execution.

WITH cte_name AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name, e.salary, c.avg_salary
FROM employees e
JOIN cte_name c ON e.department = c.department;

Comparison table:

| Feature | Subquery | CTE | Temporary Table |
|---|---|---|---|
| Named & reusable | ❌ No | ✅ Yes (within query) | ✅ Yes (within session) |
| Scope | Inline only | Current query only | Entire session |
| Readability | ❌ Can nest and become complex | ✅ Clean, modular | ✅ Clear |
| Supports recursion | ❌ No | ✅ Yes | ❌ No |
| Indexes possible | ❌ No | ❌ No | ✅ Yes |
| Materialized to disk | ❌ No | ❌ Usually not | ✅ Yes |
| DDL required | ❌ No | ❌ No | ✅ Yes (CREATE TABLE) |

When to prefer a CTE:
- When the same subquery logic is needed more than once in the same query
- When building recursive queries (hierarchy traversal, number generation)
- When breaking a complex query into readable named steps
- When you want no side effects โ€” CTEs vanish after the query, unlike temp tables


Q6. Write a query using a CTE to calculate monthly running total of sales. [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL:

WITH monthly_sales AS (
    SELECT
        YEAR(order_date)                        AS sale_year,
        MONTH(order_date)                       AS sale_month,
        SUM(amount)                             AS monthly_total
    FROM orders
    GROUP BY
        YEAR(order_date),
        MONTH(order_date)
)
SELECT
    sale_year,
    sale_month,
    monthly_total,
    SUM(monthly_total) OVER (
        PARTITION BY sale_year
        ORDER BY sale_month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )                                           AS running_total
FROM monthly_sales
ORDER BY sale_year, sale_month;

PostgreSQL:

WITH monthly_sales AS (
    SELECT
        TO_CHAR(order_date, 'YYYY-MM')          AS sale_month,
        SUM(amount)                             AS monthly_total
    FROM orders
    GROUP BY TO_CHAR(order_date, 'YYYY-MM')
)
SELECT
    sale_month,
    monthly_total,
    SUM(monthly_total) OVER (
        ORDER BY sale_month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )                                           AS running_total
FROM monthly_sales
ORDER BY sale_month;

Sample output:

| sale_month | monthly_total | running_total |
|---|---|---|
| 2024-01 | 120000 | 120000 |
| 2024-02 | 95000 | 215000 |
| 2024-03 | 140000 | 355000 |

Why ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: This window frame explicitly defines the running total from the very first row up to the current row. Without specifying the frame, the default behavior can vary across databases and lead to incorrect results.
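The running-total logic can be verified end to end with SQLite, which has supported window functions since version 3.25 (bundled with recent Python builds). The months and amounts below are the illustrative figures from the sample output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (sale_month TEXT, monthly_total INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 120000), ('2024-02', 95000), ('2024-03', 140000);
""")

# Running total with an explicit window frame, as in the answer above
rows = conn.execute("""
    SELECT sale_month, monthly_total,
           SUM(monthly_total) OVER (
               ORDER BY sale_month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM monthly_sales
    ORDER BY sale_month
""").fetchall()
print(rows)
# [('2024-01', 120000, 120000), ('2024-02', 95000, 215000), ('2024-03', 140000, 355000)]
```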


Q7. What is a recursive CTE? Write one to generate numbers from 1 to 100. [4 Marks]

Answer: A recursive CTE is a CTE that references itself. It consists of two parts joined by UNION ALL:
- Anchor member โ€” the starting/base case (runs once)
- Recursive member โ€” references the CTE itself and runs repeatedly until the termination condition is met

Structure:

WITH RECURSIVE cte_name AS (
    -- Anchor: starting point
    SELECT base_value ...

    UNION ALL

    -- Recursive member: references cte_name itself
    SELECT next_value ...
    FROM cte_name
    WHERE termination_condition
)
SELECT * FROM cte_name;

Generate numbers 1 to 100:

PostgreSQL / MySQL (both require the RECURSIVE keyword):

WITH RECURSIVE number_series AS (
    -- Anchor member: start at 1
    SELECT 1 AS n

    UNION ALL

    -- Recursive member: add 1 each iteration
    SELECT n + 1
    FROM number_series
    WHERE n < 100           -- termination condition: stop at 100
)
SELECT n
FROM number_series
ORDER BY n;

SQL Server (no RECURSIVE keyword; recursion is implied by the self-reference):

WITH number_series AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM number_series
    WHERE n < 100
)
SELECT n FROM number_series;

Execution flow:

| Iteration | Value of n |
|---|---|
| Anchor | 1 |
| Recursion 1 | 2 |
| Recursion 2 | 3 |
| ... | ... |
| Recursion 99 | 100 |
| Termination | n < 100 is FALSE — stops |

Safety tip: Always include a WHERE termination condition in recursive CTEs. Without it, the query will run indefinitely. SQL Server also supports MAXRECURSION hint (OPTION (MAXRECURSION 1000)) to cap recursion depth.
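SQLite also supports WITH RECURSIVE, so the series generator can be sanity-checked locally. A quick way to convince yourself the bounds are right is to check the count, endpoints, and the sum 1+2+...+100 = 5050:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Recursive CTE: anchor (1), then n+1 until the WHERE guard stops recursion
nums = [r[0] for r in conn.execute("""
    WITH RECURSIVE number_series AS (
        SELECT 1 AS n
        UNION ALL
        SELECT n + 1 FROM number_series WHERE n < 100
    )
    SELECT n FROM number_series ORDER BY n
""").fetchall()]

print(len(nums), nums[0], nums[-1], sum(nums))  # 100 1 100 5050
```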


Q8. Use a recursive CTE to traverse an employee-manager hierarchy and show each employee's full reporting chain. [3 Marks]

Answer:

Assumption: Table employees(emp_id, name, manager_id) where manager_id is NULL for the top-level executive.

WITH RECURSIVE org_hierarchy AS (
    -- Anchor: start with top-level employees (no manager)
    SELECT
        emp_id,
        name,
        manager_id,
        name                        AS reporting_chain,
        0                           AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: find employees reporting to current level
    SELECT
        e.emp_id,
        e.name,
        e.manager_id,
        CONCAT(h.reporting_chain, ' โ†’ ', e.name)   AS reporting_chain,
        h.level + 1                                 AS level
    FROM employees e
    INNER JOIN org_hierarchy h
        ON e.manager_id = h.emp_id
)
SELECT
    emp_id,
    name,
    level                           AS hierarchy_level,
    reporting_chain
FROM org_hierarchy
ORDER BY reporting_chain;

Sample output:

| emp_id | name | hierarchy_level | reporting_chain |
|---|---|---|---|
| 1 | CEO | 0 | CEO |
| 2 | VP Sales | 1 | CEO → VP Sales |
| 4 | Sales Manager | 2 | CEO → VP Sales → Sales Manager |
| 7 | Sales Rep | 3 | CEO → VP Sales → Sales Manager → Sales Rep |

Note: CONCAT builds the chain path by appending each employee's name to their manager's chain. The level column tracks depth in the hierarchy. For SQL Server, drop the RECURSIVE keyword (plain WITH) and CAST the chain column to a fixed type such as VARCHAR(1000) in both the anchor and recursive members, since SQL Server requires the two parts' column types to match exactly.


Q9. What is a derived table? Write a query using a derived table to find customers whose average order value is in the top 10%. [5 Marks]

Answer: A derived table is an inline subquery placed in the FROM clause and given an alias. It acts as a virtual table for the duration of the query โ€” similar to a CTE but defined inline rather than at the top with WITH.

-- Structure of a derived table
SELECT col1, col2
FROM (
    SELECT ...      -- this subquery IS the derived table
    FROM some_table
) AS alias_name     -- alias is MANDATORY for derived tables
WHERE ...;

Query โ€” Find customers in the top 10% by average order value:

SELECT
    customer_id,
    avg_order_value,
    percentile_rank
FROM (
    -- Derived table: compute avg order value and percentile rank per customer
    SELECT
        customer_id,
        ROUND(AVG(amount), 2)                           AS avg_order_value,
        PERCENT_RANK() OVER (
            ORDER BY AVG(amount)
        )                                               AS percentile_rank
    FROM orders
    GROUP BY customer_id
) AS customer_stats
WHERE percentile_rank >= 0.90           -- top 10% (90th percentile and above)
ORDER BY avg_order_value DESC;

With customer name (joining customers table):

SELECT
    c.customer_id,
    c.name,
    cs.avg_order_value,
    ROUND(cs.percentile_rank * 100, 1)  AS percentile
FROM customers c
JOIN (
    SELECT
        customer_id,
        ROUND(AVG(amount), 2)                           AS avg_order_value,
        PERCENT_RANK() OVER (ORDER BY AVG(amount))      AS percentile_rank
    FROM orders
    GROUP BY customer_id
) AS cs
    ON c.customer_id = cs.customer_id
WHERE cs.percentile_rank >= 0.90
ORDER BY cs.avg_order_value DESC;

Note: PERCENT_RANK() returns a value between 0 and 1. A value of 0.90 means the customer's average order value is higher than 90% of all customers โ€” placing them in the top 10%.
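The PERCENT_RANK() formula is (rank − 1) / (rows − 1), so with five customers the values land on clean quarters. This sqlite3 sketch (SQLite 3.25+ for window functions; the averages are illustrative) shows the distribution:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_avgs (customer_id INTEGER, avg_order_value REAL);
    INSERT INTO customer_avgs VALUES
        (1, 100), (2, 200), (3, 300), (4, 400), (5, 500);
""")

# percent_rank = (rank - 1) / (N - 1); the top customer of five gets exactly 1.0
rows = conn.execute("""
    SELECT customer_id,
           PERCENT_RANK() OVER (ORDER BY avg_order_value) AS pr
    FROM customer_avgs
    ORDER BY avg_order_value
""").fetchall()
print(rows)  # [(1, 0.0), (2, 0.25), (3, 0.5), (4, 0.75), (5, 1.0)]
```

With this data, a `pr >= 0.90` filter keeps only customer 5, the top 20% being a single row here.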


Q10. Rewrite the above query using a CTE. Which is more readable and why? [5 Marks]

Answer:

CTE version of the same query:

WITH customer_order_stats AS (
    -- Step 1: Calculate average order value per customer
    SELECT
        customer_id,
        ROUND(AVG(amount), 2)                           AS avg_order_value,
        PERCENT_RANK() OVER (
            ORDER BY AVG(amount)
        )                                               AS percentile_rank
    FROM orders
    GROUP BY customer_id
),
top_10_percent AS (
    -- Step 2: Filter to top 10% only
    SELECT
        customer_id,
        avg_order_value,
        ROUND(percentile_rank * 100, 1)                 AS percentile
    FROM customer_order_stats
    WHERE percentile_rank >= 0.90
)
-- Step 3: Join with customers table to get names
SELECT
    c.customer_id,
    c.name,
    t.avg_order_value,
    t.percentile
FROM customers c
JOIN top_10_percent t
    ON c.customer_id = t.customer_id
ORDER BY t.avg_order_value DESC;

Why the CTE version is more readable:

| Aspect | Derived Table | CTE |
|---|---|---|
| Logic separation | All nested inline — hard to isolate each step | Each step is named and separated clearly |
| Reusability | Cannot reuse the derived table elsewhere in the query | customer_order_stats can be referenced multiple times |
| Debugging | Must extract nested subquery to test it | Each CTE block can be run independently |
| Readability | Logic flows inside-out (read innermost first) | Logic flows top-to-bottom like narrative steps |
| Naming | Anonymous inline block | Descriptive names (customer_order_stats, top_10_percent) |

Verdict: The CTE version is significantly more readable because each logical step is named and reads sequentially from top to bottom. The derived table version requires mentally unwrapping nested layers, which becomes increasingly difficult as query complexity grows. For production code and team environments, CTEs are strongly preferred.


Q11. Write a query to find the third highest salary per department using a CTE. [5 Marks]

Answer:

Assumption: Table employees(emp_id, name, department, salary) exists.

WITH salary_rankings AS (
    -- Step 1: Rank salaries within each department
    -- DENSE_RANK handles ties: two equal salaries both get rank 2,
    -- and the next distinct salary gets rank 3
    SELECT
        emp_id,
        name,
        department,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        )                               AS salary_rank
    FROM employees
),
third_highest AS (
    -- Step 2: Filter to only rank 3
    SELECT
        emp_id,
        name,
        department,
        salary,
        salary_rank
    FROM salary_rankings
    WHERE salary_rank = 3
)
-- Step 3: Final output
SELECT
    department,
    name,
    salary                              AS third_highest_salary
FROM third_highest
ORDER BY department, salary DESC;

Handling departments with fewer than 3 distinct salaries:

WITH salary_rankings AS (
    SELECT
        emp_id,
        name,
        department,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        )                               AS salary_rank,
        COUNT(DISTINCT salary) OVER (
            PARTITION BY department
        )                               AS distinct_salary_count
    FROM employees
)
SELECT
    department,
    name,
    salary                              AS third_highest_salary
FROM salary_rankings
WHERE salary_rank = 3
  AND distinct_salary_count >= 3        -- only departments with 3+ distinct salaries
ORDER BY department;

Why DENSE_RANK() over ROW_NUMBER(): If two employees in the same department share the second-highest salary (e.g., both earn 80000), ROW_NUMBER() would arbitrarily assign them ranks 2 and 3 โ€” making one appear as the "third highest" incorrectly. DENSE_RANK() assigns both rank 2 and correctly identifies the next distinct salary as rank 3.
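The tie behavior can be checked directly. This sqlite3 sketch (SQLite 3.25+; illustrative salaries where two employees share the second-highest value) shows DENSE_RANK assigning both tied rows rank 2 and the next distinct salary rank 3, while ROW_NUMBER splits the tie arbitrarily:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 90000), ('Ben', 80000), ('Cal', 80000), ('Dee', 70000);
""")

rows = conn.execute("""
    SELECT name, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()
print(rows)
# Ann gets dr=1; Ben and Cal both get dr=2 (their row numbers 2 and 3 are
# assigned arbitrarily); Dee gets dr=3 -- the true third-highest distinct salary
```

Filtering `rn = 3` would wrongly report Ben or Cal as third highest; `dr = 3` correctly finds Dee.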


Q12. Using CTEs, write a query to find customers who placed orders in every month of 2024. [5 Marks]

Answer:

Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.

WITH all_months AS (
    -- Step 1: Reference set of all 12 months of 2024
    -- (optional here: the count check below never reads it, but it is handy
    -- if you also want to report WHICH months a customer missed via a cross join)
    SELECT 1 AS month_num UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4  UNION ALL SELECT 5  UNION ALL SELECT 6
    UNION ALL SELECT 7  UNION ALL SELECT 8  UNION ALL SELECT 9
    UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12
),
customer_active_months AS (
    -- Step 2: Find distinct months each customer ordered in 2024
    SELECT
        customer_id,
        MONTH(order_date)               AS order_month
    FROM orders
    WHERE YEAR(order_date) = 2024
    GROUP BY
        customer_id,
        MONTH(order_date)
),
customer_month_count AS (
    -- Step 3: Count how many distinct months each customer was active
    SELECT
        customer_id,
        COUNT(DISTINCT order_month)     AS active_months
    FROM customer_active_months
    GROUP BY customer_id
)
-- Step 4: Keep only customers active in all 12 months
SELECT
    c.customer_id,
    c.name,
    cm.active_months
FROM customers c
JOIN customer_month_count cm
    ON c.customer_id = cm.customer_id
WHERE cm.active_months = 12             -- ordered in all 12 months
ORDER BY c.name;

Alternative โ€” Using recursive CTE for month generation (PostgreSQL):

WITH RECURSIVE all_months AS (
    SELECT 1 AS month_num
    UNION ALL
    SELECT month_num + 1
    FROM all_months
    WHERE month_num < 12
),
customer_active_months AS (
    SELECT
        customer_id,
        EXTRACT(MONTH FROM order_date)  AS order_month
    FROM orders
    WHERE EXTRACT(YEAR FROM order_date) = 2024
    GROUP BY customer_id, EXTRACT(MONTH FROM order_date)
),
customer_month_count AS (
    SELECT
        customer_id,
        COUNT(DISTINCT order_month)     AS active_months
    FROM customer_active_months
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.name,
    cm.active_months
FROM customers c
JOIN customer_month_count cm
    ON c.customer_id = cm.customer_id
WHERE cm.active_months = 12
ORDER BY c.name;

Step-by-step logic explained:

| CTE | Purpose |
|---|---|
| all_months | Reference list of all 12 months — useful for cross-join gap detection |
| customer_active_months | Deduplicates to one row per customer per month (handles multiple orders in same month) |
| customer_month_count | Counts distinct months each customer was active in 2024 |
| Final WHERE active_months = 12 | Keeps only truly every-month customers |

Why deduplicate before counting: A customer might place 5 orders in January. The GROUP BY in customer_active_months collapses those into a single (customer, month) row, and COUNT(DISTINCT order_month) in the next CTE guards against duplicates a second time — either safeguard alone would suffice, but with neither (i.e., a plain COUNT(*) over raw order rows) January would count 5 times instead of 1, making a customer with 60 orders in a single month incorrectly appear to have covered all 12 months.


๐Ÿ“… Day 5: Window Functions โ€” RANK, LAG, LEAD & Running Totals

Intermediate 50 Marks 90 Min

Here are all the answers formatted for direct copy-paste:


Q1. What is the difference between an aggregate function and a window function? Give examples. [3 Marks]

Answer:

| Feature | Aggregate Function | Window Function |
|---|---|---|
| Rows returned | Collapses rows — one row per group | Preserves all rows — no collapsing |
| Requires GROUP BY | ✅ Yes (usually) | ❌ No — uses OVER() clause instead |
| Can access individual row data alongside aggregation | ❌ No | ✅ Yes |
| Use case | Summarized totals, counts, averages | Rankings, running totals, row comparisons |

Aggregate function โ€” collapses rows:

-- Returns ONE row per department โ€” individual employee data is lost
SELECT
    department,
    AVG(salary)     AS avg_salary
FROM employees
GROUP BY department;

Window function โ€” preserves all rows:

-- Returns EVERY employee row, with department average computed alongside
SELECT
    emp_id,
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department)  AS dept_avg_salary
FROM employees;

Sample output comparison:

Aggregate result (one row per department):
| department | avg_salary |
|---|---|
| Sales | 65000 |
| HR | 52000 |

Window function result (every employee preserved):
| name | department | salary | dept_avg_salary |
|---|---|---|---|
| Alice | Sales | 80000 | 65000 |
| Bob | Sales | 50000 | 65000 |
| Carol | HR | 52000 | 52000 |

Key insight: Window functions let you answer questions like "how does this employee's salary compare to their department average?" in a single query โ€” something impossible with aggregate functions alone without a JOIN or subquery.
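The collapse-vs-preserve behavior above can be checked end-to-end with SQLite from Python's standard library. This is a minimal sketch; the table and salary values are illustrative, not from a real schema:

```python
import sqlite3

# In-memory table mirroring the sample data above (illustrative values)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
    ('Alice', 'Sales', 80000),
    ('Bob',   'Sales', 50000),
    ('Carol', 'HR',    52000);
""")

# Aggregate: collapses to ONE row per department
agg = conn.execute(
    "SELECT department, AVG(salary) FROM employees GROUP BY department"
).fetchall()

# Window: EVERY employee row survives, with the dept average alongside
win = conn.execute("""
    SELECT name, salary,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
    ORDER BY name
""").fetchall()

print(agg)  # 2 rows - one per department
print(win)  # 3 rows - one per employee, each carrying its dept average
```

The same query text runs unchanged on MySQL 8+, PostgreSQL, and SQL Server; SQLite (3.25+) just makes it convenient to verify locally.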


Q2. Explain the OVER clause with PARTITION BY and ORDER BY. Write a query showing employee salary alongside their department's average salary. [4 Marks]

Answer: The OVER() clause is what defines a function as a window function. It specifies the "window" of rows the function operates on for each row in the result.

Components of OVER():

| Component | Purpose | Optional? |
|---|---|---|
| PARTITION BY | Divides rows into independent groups (like GROUP BY for windows) | ✅ Optional |
| ORDER BY | Defines the order of rows within each partition | ✅ Optional (required for ranking/LAG/LEAD) |
| Frame clause (ROWS/RANGE BETWEEN) | Defines the subset of rows within each partition | ✅ Optional |

-- Anatomy of OVER()
SUM(salary) OVER (
    PARTITION BY department    -- reset the window for each department
    ORDER BY hire_date         -- order rows within each partition
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- window frame
)

Query โ€” Employee salary alongside department average:

SELECT
    e.emp_id,
    e.name,
    e.department,
    e.salary,
    ROUND(
        AVG(e.salary) OVER (
            PARTITION BY e.department   -- compute avg separately per dept
        ), 2
    )                                   AS dept_avg_salary,
    ROUND(
        e.salary - AVG(e.salary) OVER (
            PARTITION BY e.department
        ), 2
    )                                   AS diff_from_dept_avg,
    ROUND(
        AVG(e.salary) OVER ()           -- no PARTITION BY = entire table
    , 2)                                AS company_avg_salary
FROM employees e
ORDER BY e.department, e.salary DESC;

Sample output:

| name | department | salary | dept_avg_salary | diff_from_dept_avg | company_avg_salary |
|---|---|---|---|---|---|
| Alice | Sales | 80000 | 65000 | +15000 | 61000 |
| Bob | Sales | 50000 | 65000 | -15000 | 61000 |
| Carol | HR | 60000 | 56000 | +4000 | 61000 |

PARTITION BY vs GROUP BY: PARTITION BY divides data into windows but keeps all rows visible. GROUP BY divides data into groups and collapses them into summary rows โ€” individual row data is no longer accessible.


Q3. What is the difference between ROWS BETWEEN and RANGE BETWEEN in window frames? [4 Marks]

Answer: Both ROWS BETWEEN and RANGE BETWEEN define the window frame โ€” the subset of rows within a partition that a window function considers for each row. The critical difference is how they handle ties.

| Feature | ROWS BETWEEN | RANGE BETWEEN |
|---|---|---|
| Based on | Physical row position — exact row count | Logical value range — all rows with equal ORDER BY value |
| Tie handling | Each tied row is treated as a separate, distinct row | All tied rows are included together in the same frame |
| Precision | More precise and predictable | Can include unexpected extra rows when ties exist |
| Performance | Generally faster | Slightly slower due to value-based comparison |
| Requires ORDER BY | ✅ Yes | ✅ Yes |

Example โ€” Monthly sales with two months having the same total:

-- Sample data: Jan=1000, Feb=1000, Mar=2000, Apr=3000

-- ROWS BETWEEN: strictly physical rows
SELECT
    month,
    total_sales,
    SUM(total_sales) OVER (
        ORDER BY month
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW  -- exactly 1 prior physical row
    ) AS rows_running_sum
FROM monthly_sales;
-- Jan: 1000, Feb: 2000 (Jan+Feb), Mar: 3000 (Feb+Mar), Apr: 5000 (Mar+Apr)

-- RANGE BETWEEN: includes all rows with same ORDER BY value in frame
SELECT
    month,
    total_sales,
    SUM(total_sales) OVER (
        ORDER BY total_sales
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS range_running_sum
FROM monthly_sales;
-- When total_sales=1000 appears in Jan AND Feb, RANGE includes both
-- even when calculating for just "Jan" โ€” because Feb has the same value

Most common window frame defaults:

-- Running total (explicit best practice)
SUM(amount) OVER (
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

-- Centered 3-row moving average
AVG(amount) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)

-- Entire partition (no frame restriction needed)
AVG(amount) OVER (PARTITION BY department)

Best practice: Use ROWS BETWEEN for running totals and moving averages โ€” it is predictable and unaffected by duplicate values. Use RANGE BETWEEN only when you explicitly want all rows sharing the same ORDER BY value treated as a single logical unit.


Q4. Write a query to calculate a 7-day moving average of daily sales using window functions. [4 Marks]

Answer:

Assumption: Table daily_sales(sale_date, total_sales) exists with one row per day.

MySQL / PostgreSQL / SQL Server:

SELECT
    sale_date,
    total_sales                                         AS daily_sales,
    ROUND(
        AVG(total_sales) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- current day + 6 prior days = 7 days
        ), 2
    )                                                   AS moving_avg_7day,
    COUNT(*) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    )                                                   AS days_in_window    -- useful to flag early rows
FROM daily_sales
ORDER BY sale_date;

With a flag for incomplete windows (first 6 days have fewer than 7 data points):

SELECT
    sale_date,
    total_sales,
    CASE
        WHEN COUNT(*) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) < 7
        THEN NULL                                       -- insufficient data, show NULL
        ELSE ROUND(AVG(total_sales) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ), 2)
    END                                                 AS moving_avg_7day
FROM daily_sales
ORDER BY sale_date;

Sample output:

| sale_date | daily_sales | moving_avg_7day | days_in_window |
|---|---|---|---|
| 2024-01-01 | 5000 | NULL | 1 |
| 2024-01-02 | 6200 | NULL | 2 |
| ... | ... | NULL | ... |
| 2024-01-07 | 4800 | 5371.43 | 7 |
| 2024-01-08 | 7100 | 5628.57 | 7 |

Why ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: This frame includes the current day plus the 6 days before it — exactly 7 days total. Using RANGE BETWEEN instead could include unintended extra rows if multiple dates share the same total_sales value, so ROWS BETWEEN is the safer choice for date-based sliding windows.
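The incomplete-window flag can be verified on a small synthetic series. A minimal sketch in SQLite (the 8-day rising series is illustrative):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, total_sales INTEGER)")
start = date(2024, 1, 1)
for i in range(8):   # 8 days: 1000, 1100, ..., 1700
    conn.execute("INSERT INTO daily_sales VALUES (?, ?)",
                 ((start + timedelta(days=i)).isoformat(), 1000 + 100 * i))

rows = conn.execute("""
    SELECT sale_date,
           CASE
               WHEN COUNT(*) OVER (ORDER BY sale_date
                       ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) < 7
               THEN NULL    -- fewer than 7 days in the frame: no average
               ELSE ROUND(AVG(total_sales) OVER (ORDER BY sale_date
                       ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2)
           END AS moving_avg_7day
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

for sale_date, avg in rows:
    print(sale_date, avg)   # first 6 days print None, day 7 onward a real average
```

Day 7 averages days 1–7 (1000..1600, giving 1300.0) and day 8 averages days 2–8 (1100..1700, giving 1400.0), confirming the frame covers exactly 7 rows.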


Q5. Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() with an example where they produce different results. [4 Marks]

Answer:

| Function | Handles Ties | Gap after tie? | Unique per row? |
|---|---|---|---|
| ROW_NUMBER() | Assigns unique sequential numbers — ties broken arbitrarily | N/A — always sequential | ✅ Always unique |
| RANK() | Tied rows get same rank, next rank skips numbers | ✅ Yes — creates gaps | ❌ Not unique on ties |
| DENSE_RANK() | Tied rows get same rank, next rank is consecutive | ❌ No gaps | ❌ Not unique on ties |

Example โ€” Salary ranking where 2 employees share the 2nd highest salary:

-- Sample data:
-- Alice: 90000, Bob: 75000, Carol: 75000, Dave: 60000

SELECT
    name,
    salary,
    ROW_NUMBER()  OVER (ORDER BY salary DESC)   AS row_num,
    RANK()        OVER (ORDER BY salary DESC)   AS rnk,
    DENSE_RANK()  OVER (ORDER BY salary DESC)   AS dense_rnk
FROM employees;

Result:

| name | salary | row_num | rnk | dense_rnk |
|---|---|---|---|---|
| Alice | 90000 | 1 | 1 | 1 |
| Bob | 75000 | 2 | 2 | 2 |
| Carol | 75000 | 3 | 2 | 2 |
| Dave | 60000 | 4 | 4 | 3 |

Observations:
- ROW_NUMBER() โ€” Bob and Carol both get different numbers (2 and 3) even though salaries are equal. The order between them is arbitrary.
- RANK() โ€” Both Bob and Carol get rank 2. Dave jumps to rank 4 (skipping 3) โ€” the gap reflects the two people above him.
- DENSE_RANK() โ€” Both Bob and Carol get rank 2. Dave gets rank 3 โ€” no gaps, consecutive ranking.

When to use each:

-- ROW_NUMBER: pagination, deduplication (need exactly 1 row per group)
-- RANK: sports leaderboards where gaps reflect true position ("tied for 2nd, no one in 3rd")
-- DENSE_RANK: finding Nth highest value without gaps (top 3 salary levels)
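The tied-salary example above runs verbatim in SQLite, which makes the three numbering schemes easy to compare side by side (the data is the illustrative four-employee set from the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, salary INTEGER);
INSERT INTO employees VALUES
    ('Alice', 90000), ('Bob', 75000), ('Carol', 75000), ('Dave', 60000);
""")

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM employees
""").fetchall()

for r in rows:
    print(r)
# Bob and Carol tie on 75000: RANK gives both 2 (Dave jumps to 4),
# DENSE_RANK gives both 2 (Dave gets 3), ROW_NUMBER splits them 2 and 3.
```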

Q6. Write a query to find the top 3 products by revenue in each category using DENSE_RANK. [4 Marks]

Answer:

Assumption: Tables products(product_id, product_name, category) and order_items(order_item_id, product_id, quantity, unit_price) exist.

WITH product_revenue AS (
    -- Step 1: Calculate total revenue per product
    SELECT
        p.product_id,
        p.product_name,
        p.category,
        SUM(oi.quantity * oi.unit_price)                AS total_revenue
    FROM products p
    JOIN order_items oi
        ON p.product_id = oi.product_id
    GROUP BY
        p.product_id,
        p.product_name,
        p.category
),
ranked_products AS (
    -- Step 2: Rank products within each category by revenue
    SELECT
        product_id,
        product_name,
        category,
        total_revenue,
        DENSE_RANK() OVER (
            PARTITION BY category
            ORDER BY total_revenue DESC
        )                                               AS revenue_rank
    FROM product_revenue
)
-- Step 3: Filter top 3 per category
SELECT
    category,
    revenue_rank,
    product_name,
    total_revenue
FROM ranked_products
WHERE revenue_rank <= 3
ORDER BY
    category        ASC,
    revenue_rank    ASC;

Sample output:

| category | revenue_rank | product_name | total_revenue |
|---|---|---|---|
| Electronics | 1 | Laptop Pro | 500000 |
| Electronics | 2 | Smartphone X | 420000 |
| Electronics | 3 | Tablet Z | 310000 |
| Clothing | 1 | Winter Jacket | 180000 |
| Clothing | 2 | Running Shoes | 155000 |
| Clothing | 2 | Denim Jeans | 155000 |

Why DENSE_RANK() here: If two products tie for rank 2 (like Clothing above), DENSE_RANK() keeps both in the top 3 and correctly assigns rank 3 to the next product. ROW_NUMBER() would arbitrarily exclude one tied product. RANK() would skip rank 3 entirely and return rank 4 next.


Q7. What is NTILE? Write a query to divide customers into 4 quartiles based on their total spend. [4 Marks]

Answer: NTILE(n) is a window function that divides rows within a partition into n equal-sized buckets (as equal as possible) and assigns each row a bucket number from 1 to n. It is used for percentile grouping, quartile analysis, and tiered segmentation.

How NTILE handles uneven distribution:
- If rows don't divide evenly, earlier buckets receive one extra row
- Example: 10 rows with NTILE(3) โ†’ buckets of 4, 3, 3
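The uneven-split rule can be confirmed directly. A minimal sketch in SQLite: 10 hypothetical customers divided with NTILE(3), then the bucket sizes counted (the spend values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_spend (customer_id INTEGER, total_spend INTEGER)")
conn.executemany("INSERT INTO customer_spend VALUES (?, ?)",
                 [(i, i * 100) for i in range(1, 11)])   # 10 customers

# NTILE must be computed in a subquery before GROUP BY can use it,
# because window functions are evaluated after grouping.
buckets = conn.execute("""
    SELECT bucket, COUNT(*) FROM (
        SELECT NTILE(3) OVER (ORDER BY total_spend) AS bucket
        FROM customer_spend
    )
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()

print(buckets)  # earlier buckets absorb the remainder rows
```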

WITH customer_spend AS (
    -- Step 1: Calculate total spend per customer
    SELECT
        customer_id,
        SUM(amount)                                     AS total_spend
    FROM orders
    GROUP BY customer_id
),
customer_quartiles AS (
    -- Step 2: Assign quartile using NTILE(4)
    SELECT
        customer_id,
        total_spend,
        NTILE(4) OVER (
            ORDER BY total_spend ASC                    -- ascending: quartile 1 = lowest spend
        )                                               AS quartile
    FROM customer_spend
)
-- Step 3: Label quartiles and join customer details
SELECT
    c.customer_id,
    c.name,
    cq.total_spend,
    cq.quartile,
    CASE cq.quartile
        WHEN 1 THEN 'Q1 โ€” Low Spenders'
        WHEN 2 THEN 'Q2 โ€” Below Average'
        WHEN 3 THEN 'Q3 โ€” Above Average'
        WHEN 4 THEN 'Q4 โ€” Top Spenders'
    END                                                 AS spend_segment
FROM customers c
JOIN customer_quartiles cq
    ON c.customer_id = cq.customer_id
ORDER BY cq.total_spend DESC;

Summary statistics per quartile:

-- Bonus: aggregate stats per quartile
SELECT
    quartile,
    COUNT(*)                        AS customer_count,
    ROUND(MIN(total_spend), 2)      AS min_spend,
    ROUND(MAX(total_spend), 2)      AS max_spend,
    ROUND(AVG(total_spend), 2)      AS avg_spend
FROM customer_quartiles
GROUP BY quartile
ORDER BY quartile;

Q8. Write a query to find the most recent order for each customer using ROW_NUMBER. [3 Marks]

Answer:

Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.

WITH ranked_orders AS (
    SELECT
        o.order_id,
        o.customer_id,
        o.order_date,
        o.amount,
        ROW_NUMBER() OVER (
            PARTITION BY o.customer_id          -- restart numbering for each customer
            ORDER BY o.order_date DESC          -- most recent order gets rank 1
        )                                       AS rn
    FROM orders o
)
SELECT
    c.customer_id,
    c.name,
    ro.order_id,
    ro.order_date                               AS most_recent_order_date,
    ro.amount                                   AS most_recent_order_amount
FROM customers c
JOIN ranked_orders ro
    ON c.customer_id = ro.customer_id
WHERE ro.rn = 1                                 -- keep only the most recent order per customer
ORDER BY ro.order_date DESC;

Why ROW_NUMBER() over MAX(order_date): MAX(order_date) can find the latest date, but you would need an additional JOIN to retrieve the full order row (order_id, amount, etc.). ROW_NUMBER() lets you select the complete most recent order row in one clean step. Also, if two orders share the same latest date, ROW_NUMBER() picks exactly one (use RANK() instead if you want both returned in tie cases).


Q9. Write a query using LAG to calculate month-over-month growth in revenue. [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL:

WITH monthly_revenue AS (
    -- Step 1: Aggregate revenue per month
    SELECT
        YEAR(order_date)                            AS yr,
        MONTH(order_date)                           AS mth,
        SUM(amount)                                 AS monthly_revenue
    FROM orders
    GROUP BY YEAR(order_date), MONTH(order_date)
),
revenue_with_lag AS (
    -- Step 2: Use LAG to bring in previous month's revenue
    SELECT
        yr,
        mth,
        monthly_revenue,
        LAG(monthly_revenue, 1, 0) OVER (
            ORDER BY yr, mth                        -- order chronologically across years
        )                                           AS prev_month_revenue
    FROM monthly_revenue
)
-- Step 3: Calculate growth metrics
SELECT
    yr                                              AS year,
    mth                                             AS month,
    monthly_revenue,
    prev_month_revenue,
    ROUND(monthly_revenue - prev_month_revenue, 2)  AS revenue_change,
    CASE
        WHEN prev_month_revenue = 0 THEN NULL       -- avoid division by zero for first month
        ELSE ROUND(
            (monthly_revenue - prev_month_revenue)
            / prev_month_revenue * 100, 2
        )
    END                                             AS mom_growth_pct
FROM revenue_with_lag
ORDER BY yr, mth;

PostgreSQL version (cleaner date formatting):

WITH monthly_revenue AS (
    SELECT
        TO_CHAR(order_date, 'YYYY-MM')              AS month_label,
        SUM(amount)                                 AS monthly_revenue
    FROM orders
    GROUP BY TO_CHAR(order_date, 'YYYY-MM')
),
revenue_with_lag AS (
    SELECT
        month_label,
        monthly_revenue,
        LAG(monthly_revenue) OVER (
            ORDER BY month_label
        )                                           AS prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month_label,
    monthly_revenue,
    prev_month_revenue,
    ROUND(
        CASE
            WHEN prev_month_revenue IS NULL OR prev_month_revenue = 0 THEN NULL
            ELSE (monthly_revenue - prev_month_revenue)
                 / prev_month_revenue * 100
        END, 2
    )                                               AS mom_growth_pct
FROM revenue_with_lag
ORDER BY month_label;

Sample output:

| month | monthly_revenue | prev_month_revenue | revenue_change | mom_growth_pct |
|---|---|---|---|---|
| 2024-01 | 120000 | NULL | NULL | NULL |
| 2024-02 | 95000 | 120000 | -25000 | -20.83% |
| 2024-03 | 140000 | 95000 | +45000 | +47.37% |

Q10. What is the difference between LAG and LEAD? Give an example of each. [3 Marks]

Answer:

| Function | Direction | Accesses | Use case |
|---|---|---|---|
| LAG(col, n) | Backward — looks behind | The row n rows before the current row | Compare with past values (MoM growth, previous price) |
| LEAD(col, n) | Forward — looks ahead | The row n rows after the current row | Preview future values (next appointment, upcoming deadline) |

Both accept three arguments: LAG/LEAD(column, offset, default) where default is returned when no row exists at that offset.

LAG example โ€” Compare each month's sales to the previous month:

SELECT
    sale_month,
    monthly_sales,
    LAG(monthly_sales, 1, 0) OVER (
        ORDER BY sale_month
    )                               AS previous_month_sales
FROM monthly_sales;
-- January row: previous_month_sales = 0 (no prior row, default used)
-- February row: previous_month_sales = January's value

LEAD example โ€” Show next month's sales target alongside current:

SELECT
    sale_month,
    monthly_sales,
    LEAD(monthly_sales, 1, NULL) OVER (
        ORDER BY sale_month
    )                               AS next_month_sales
FROM monthly_sales;
-- December row: next_month_sales = NULL (no following row)
-- November row: next_month_sales = December's value

Combined LAG + LEAD in one query:

SELECT
    sale_month,
    LAG(monthly_sales)  OVER (ORDER BY sale_month)  AS prev_month,
    monthly_sales                                    AS current_month,
    LEAD(monthly_sales) OVER (ORDER BY sale_month)  AS next_month
FROM monthly_sales;
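The three-argument form and the edge-row behavior at both ends can be checked together in SQLite (three illustrative months):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_sales (sale_month TEXT, monthly_sales INTEGER);
INSERT INTO monthly_sales VALUES
    ('2024-01', 100), ('2024-02', 150), ('2024-03', 120);
""")

rows = conn.execute("""
    SELECT sale_month,
           LAG(monthly_sales, 1, 0) OVER (ORDER BY sale_month) AS prev_sales,
           LEAD(monthly_sales)      OVER (ORDER BY sale_month) AS next_sales
    FROM monthly_sales
    ORDER BY sale_month
""").fetchall()

print(rows)
# First row: prev_sales = 0   (LAG default used - no prior row)
# Last row:  next_sales = None (LEAD with no default returns NULL)
```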

Q11. Write a query using FIRST_VALUE to find the first product purchased by each customer. [3 Marks]

Answer:

Assumption: Tables customers(customer_id, name), orders(order_id, customer_id, order_date), and order_items(order_item_id, order_id, product_id) and products(product_id, product_name) exist.

WITH customer_purchases AS (
    SELECT
        o.customer_id,
        p.product_name,
        o.order_date,
        FIRST_VALUE(p.product_name) OVER (
            PARTITION BY o.customer_id              -- per customer
            ORDER BY o.order_date ASC               -- earliest order first
            ROWS BETWEEN UNBOUNDED PRECEDING
                     AND UNBOUNDED FOLLOWING        -- consider entire partition
        )                                           AS first_product_purchased
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p     ON oi.product_id = p.product_id
)
SELECT DISTINCT
    c.customer_id,
    c.name                                          AS customer_name,
    cp.first_product_purchased
FROM customers c
JOIN customer_purchases cp
    ON c.customer_id = cp.customer_id
ORDER BY c.customer_id;

Alternative โ€” Using ROW_NUMBER (more portable across databases):

WITH ranked_purchases AS (
    SELECT
        o.customer_id,
        p.product_name,
        o.order_date,
        ROW_NUMBER() OVER (
            PARTITION BY o.customer_id
            ORDER BY o.order_date ASC
        )                                           AS rn
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p     ON oi.product_id = p.product_id
)
SELECT
    c.customer_id,
    c.name,
    rp.product_name                                 AS first_product_purchased,
    rp.order_date                                   AS first_order_date
FROM customers c
JOIN ranked_purchases rp
    ON c.customer_id = rp.customer_id
WHERE rp.rn = 1
ORDER BY c.customer_id;

Why specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING with FIRST_VALUE: Without an explicit frame, many databases default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For FIRST_VALUE this works fine since we always look backwards โ€” but specifying the full frame is explicit best practice and avoids subtle bugs when the window ordering has ties.


Q12. Write a query to calculate the cumulative percentage of total sales for each product (Pareto analysis). [5 Marks]

Answer: A Pareto analysis (80/20 rule) identifies the products contributing the most to total revenue. Cumulative percentage helps identify the top 20% of products driving ~80% of sales.

Assumption: Tables products(product_id, product_name, category) and order_items(order_item_id, product_id, quantity, unit_price) exist.

WITH product_revenue AS (
    -- Step 1: Calculate total revenue per product
    SELECT
        p.product_id,
        p.product_name,
        p.category,
        SUM(oi.quantity * oi.unit_price)                AS product_revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.product_name, p.category
),
revenue_ranked AS (
    -- Step 2: Rank products and compute cumulative revenue
    SELECT
        product_id,
        product_name,
        category,
        product_revenue,
        SUM(product_revenue) OVER ()                    AS grand_total_revenue,
        ROUND(
            product_revenue * 100.0
            / SUM(product_revenue) OVER (), 2
        )                                               AS pct_of_total,
        ROUND(
            SUM(product_revenue) OVER (
                ORDER BY product_revenue DESC
                ROWS BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW
            ) * 100.0
            / SUM(product_revenue) OVER (), 2
        )                                               AS cumulative_pct,
        ROW_NUMBER() OVER (
            ORDER BY product_revenue DESC
        )                                               AS revenue_rank
    FROM product_revenue
)
-- Step 3: Label Pareto segment
SELECT
    revenue_rank,
    product_name,
    category,
    product_revenue,
    pct_of_total,
    cumulative_pct,
    CASE
        WHEN cumulative_pct <= 80 THEN 'Top 80% Revenue โ€” Vital Few'
        ELSE 'Bottom 20% Revenue โ€” Trivial Many'
    END                                                 AS pareto_segment
FROM revenue_ranked
ORDER BY revenue_rank;

Sample output:

| rank | product_name | product_revenue | pct_of_total | cumulative_pct | pareto_segment |
|---|---|---|---|---|---|
| 1 | Laptop Pro | 500000 | 28.50% | 28.50% | Top 80% Revenue — Vital Few |
| 2 | Smartphone X | 420000 | 23.94% | 52.44% | Top 80% Revenue — Vital Few |
| 3 | Tablet Z | 310000 | 17.67% | 70.11% | Top 80% Revenue — Vital Few |
| 4 | Headphones | 180000 | 10.26% | 80.37% | Bottom 20% Revenue — Trivial Many |
| 5 | Mouse | 50000 | 2.85% | 83.22% | Bottom 20% Revenue — Trivial Many |

Business insight: In the output above, just 3 products (out of many) generate over 70% of revenue โ€” a classic Pareto pattern. Marketing and inventory decisions should prioritize these "Vital Few" products.


Q13. Write a query to identify customers whose spend decreased for 3 consecutive months. [5 Marks]

Answer:

Assumption: Tables orders(order_id, customer_id, order_date, amount) and customers(customer_id, name) exist.

WITH monthly_spend AS (
    -- Step 1: Total spend per customer per month
    SELECT
        customer_id,
        DATE_FORMAT(order_date, '%Y-%m')            AS spend_month,  -- MySQL
        -- TO_CHAR(order_date, 'YYYY-MM')           AS spend_month,  -- PostgreSQL
        SUM(amount)                                 AS monthly_spend
    FROM orders
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
),
spend_with_lag AS (
    -- Step 2: Bring in previous 2 months' spend using LAG
    SELECT
        customer_id,
        spend_month,
        monthly_spend,
        LAG(monthly_spend, 1) OVER (
            PARTITION BY customer_id
            ORDER BY spend_month
        )                                           AS spend_1_month_ago,
        LAG(monthly_spend, 2) OVER (
            PARTITION BY customer_id
            ORDER BY spend_month
        )                                           AS spend_2_months_ago
    FROM monthly_spend
),
consecutive_declines AS (
    -- Step 3: Flag months where spend has declined for 3 consecutive months
    SELECT
        customer_id,
        spend_month,
        spend_2_months_ago                          AS month_1_spend,
        spend_1_month_ago                           AS month_2_spend,
        monthly_spend                               AS month_3_spend
    FROM spend_with_lag
    WHERE
        spend_2_months_ago IS NOT NULL              -- need 3 months of data
        AND spend_1_month_ago IS NOT NULL
        AND monthly_spend   < spend_1_month_ago     -- month 3 < month 2
        AND spend_1_month_ago < spend_2_months_ago  -- month 2 < month 1
)
-- Step 4: Join with customers for final output
SELECT DISTINCT
    c.customer_id,
    c.name,
    cd.spend_month                                  AS third_declining_month,
    cd.month_1_spend,
    cd.month_2_spend,
    cd.month_3_spend,
    ROUND(
        (cd.month_3_spend - cd.month_1_spend)
        / cd.month_1_spend * 100, 2
    )                                               AS total_decline_pct
FROM customers c
JOIN consecutive_declines cd
    ON c.customer_id = cd.customer_id
ORDER BY total_decline_pct ASC;                     -- worst declines first

Step-by-step logic:

| CTE | Purpose |
|---|---|
| monthly_spend | Aggregates total spend per customer per month |
| spend_with_lag | Uses LAG(1) and LAG(2) to bring the previous 2 months' spend into the current row |
| consecutive_declines | Filters rows where all three months show strict decline: M1 > M2 > M3 |
| Final SELECT | Joins customer names and calculates total % decline across the 3-month window |

Caveat: LAG compares adjacent rows per customer, so a customer with a gap month (no orders at all) will have non-adjacent calendar months compared. If strict calendar adjacency matters, cross-join customers against a month spine first so every month has a row (possibly with zero spend).

Sample output:

| name | third_declining_month | month_1_spend | month_2_spend | month_3_spend | total_decline_pct |
|---|---|---|---|---|---|
| Alice | 2024-04 | 8000 | 6500 | 4200 | -47.50% |
| Bob | 2024-07 | 5000 | 4100 | 3300 | -34.00% |

Business use case: This query powers churn risk detection โ€” customers with 3 consecutive months of declining spend are high-priority candidates for re-engagement campaigns. The total_decline_pct column helps prioritize outreach by severity of the decline.

๐Ÿ“… Day 6: Advanced Joins, Set Operations & CASE Logic

Intermediate 50 Marks 90 Min

Here are all the answers formatted for direct copy-paste:


Q1. Explain the difference between UNION and UNION ALL. Which is faster and why? [3 Marks]

Answer:

| Feature | UNION | UNION ALL |
|---|---|---|
| Duplicate rows | Removes duplicates from final result | Keeps all rows including duplicates |
| Extra operation | Performs a DISTINCT sort/hash pass | No deduplication step |
| Performance | Slower — extra work to find and remove duplicates | Faster — no deduplication overhead |
| Result order | Not guaranteed unless ORDER BY added | Not guaranteed unless ORDER BY added |
| Use when | You need unique rows across both result sets | You know results are distinct OR duplicates are acceptable |

-- UNION: removes duplicates (customer 101 appears in both but shows once)
SELECT customer_id FROM orders_2023
UNION
SELECT customer_id FROM orders_2024;

-- UNION ALL: keeps all rows including duplicates (faster)
SELECT customer_id FROM orders_2023
UNION ALL
SELECT customer_id FROM orders_2024;

Why UNION ALL is faster:
UNION internally applies a deduplication step โ€” similar to running SELECT DISTINCT on the combined result. This requires either sorting the entire result set or building a hash table to detect duplicates. Both are memory and CPU intensive on large datasets. UNION ALL simply concatenates the two result sets with no extra processing.

Best practice: Always use UNION ALL unless you specifically need duplicate elimination. If you know the two queries return mutually exclusive rows (e.g., partitioned by year), UNION ALL is always the right choice.
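The row-count difference is trivial to observe. A minimal sketch in SQLite, where the hypothetical customer 101 ordered in both years:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_2023 (customer_id INTEGER);
CREATE TABLE orders_2024 (customer_id INTEGER);
INSERT INTO orders_2023 VALUES (101), (102);
INSERT INTO orders_2024 VALUES (101), (103);
""")

union_rows = conn.execute(
    "SELECT customer_id FROM orders_2023 "
    "UNION SELECT customer_id FROM orders_2024").fetchall()
union_all_rows = conn.execute(
    "SELECT customer_id FROM orders_2023 "
    "UNION ALL SELECT customer_id FROM orders_2024").fetchall()

print(len(union_rows), len(union_all_rows))
# UNION drops the duplicate 101; UNION ALL keeps it.
```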


Q2. Write a query to find customers who bought Product A but not Product B using EXCEPT. [4 Marks]

Answer:

Assumption: Tables customers(customer_id, name), orders(order_id, customer_id, order_date), order_items(order_item_id, order_id, product_id), and products(product_id, product_name) exist.

Method 1 โ€” Using EXCEPT (PostgreSQL / SQL Server):

-- Customers who bought Product A
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p     ON oi.product_id = p.product_id
WHERE p.product_name = 'Product A'

EXCEPT

-- Customers who bought Product B
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p     ON oi.product_id = p.product_id
WHERE p.product_name = 'Product B';

With customer names (wrap in CTE):

WITH bought_a AS (
    SELECT DISTINCT o.customer_id
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p     ON oi.product_id = p.product_id
    WHERE p.product_name = 'Product A'
),
bought_b AS (
    SELECT DISTINCT o.customer_id
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p     ON oi.product_id = p.product_id
    WHERE p.product_name = 'Product B'
),
a_not_b AS (
    SELECT customer_id FROM bought_a
    EXCEPT
    SELECT customer_id FROM bought_b
)
SELECT
    c.customer_id,
    c.name
FROM customers c
JOIN a_not_b ab ON c.customer_id = ab.customer_id
ORDER BY c.name;

Method 2 โ€” Using NOT EXISTS (MySQL compatible โ€” MySQL has no EXCEPT):

SELECT DISTINCT
    c.customer_id,
    c.name
FROM customers c
JOIN orders o       ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p     ON oi.product_id = p.product_id
WHERE p.product_name = 'Product A'
  AND NOT EXISTS (
      SELECT 1
      FROM orders o2
      JOIN order_items oi2 ON o2.order_id = oi2.order_id
      JOIN products p2     ON oi2.product_id = p2.product_id
      WHERE o2.customer_id = c.customer_id
        AND p2.product_name = 'Product B'
  )
ORDER BY c.name;

Note: MySQL added EXCEPT (and INTERSECT) only in version 8.0.31. On older versions, use NOT EXISTS, NOT IN, or a LEFT JOIN + IS NULL anti-join pattern instead.


Q3. What is INTERSECT? Write a query to find customers who made purchases in both 2023 and 2024. [4 Marks]

Answer: INTERSECT returns only the rows that appear in both result sets โ€” it is the SQL equivalent of a set intersection. Like UNION, it automatically removes duplicates from the final output.

Set A โˆฉ Set B = rows present in BOTH A and B

PostgreSQL / SQL Server:

-- Customers who purchased in 2023
SELECT DISTINCT customer_id
FROM orders
WHERE YEAR(order_date) = 2023   -- use EXTRACT(YEAR FROM order_date) in PostgreSQL

INTERSECT

-- Customers who purchased in 2024
SELECT DISTINCT customer_id
FROM orders
WHERE YEAR(order_date) = 2024;

With customer details (CTE wrapper):

WITH active_both_years AS (
    SELECT customer_id FROM orders
    WHERE EXTRACT(YEAR FROM order_date) = 2023

    INTERSECT

    SELECT customer_id FROM orders
    WHERE EXTRACT(YEAR FROM order_date) = 2024
)
SELECT
    c.customer_id,
    c.name,
    c.email
FROM customers c
JOIN active_both_years ab
    ON c.customer_id = ab.customer_id
ORDER BY c.name;

MySQL alternative (INTERSECT is available only from MySQL 8.0.31):

-- Using two IN subqueries as an INTERSECT equivalent
SELECT DISTINCT
    c.customer_id,
    c.name
FROM customers c
WHERE c.customer_id IN (
    SELECT customer_id FROM orders WHERE YEAR(order_date) = 2023
)
AND c.customer_id IN (
    SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024
)
ORDER BY c.name;

Set operations comparison:

Operation Returns
UNION All rows from A + B (no duplicates)
UNION ALL All rows from A + B (with duplicates)
INTERSECT Only rows in both A and B
EXCEPT / MINUS Rows in A but not in B
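
All four behaviors from the comparison above can be verified directly. A minimal runnable check using Python's bundled sqlite3 engine (which supports all four operators); tables a and b and their values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def q(sql):
    """Run a query and return its single result column as a list."""
    return [row[0] for row in conn.execute(sql)]

print(q("SELECT x FROM a UNION     SELECT x FROM b ORDER BY x"))  # [1, 2, 3, 4]
print(q("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"))  # [1, 2, 2, 2, 3, 3, 4]
print(q("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x"))  # [2, 3]
print(q("SELECT x FROM a EXCEPT    SELECT x FROM b ORDER BY x"))  # [1]
```

Note how the duplicate 2 in table a survives only under UNION ALL โ€” the other three operators deduplicate.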

Q4. What are the rules for columns in a UNION query? [4 Marks]

Answer: For a UNION (or UNION ALL, INTERSECT, EXCEPT) query to be valid, all combined SELECT statements must follow these rules:

Rule 1 โ€” Same number of columns:

-- โŒ Invalid: different column counts
SELECT customer_id, name FROM customers
UNION
SELECT product_id FROM products;           -- ERROR: different number of columns

-- โœ… Valid: same column count
SELECT customer_id, name FROM customers
UNION
SELECT product_id, product_name FROM products;

Rule 2 โ€” Compatible data types in corresponding positions:

-- โŒ Invalid: incompatible types (string vs integer)
SELECT customer_id, name FROM customers    -- name is VARCHAR
UNION
SELECT product_id, price FROM products;    -- price is DECIMAL โ€” may fail

-- โœ… Valid: cast to compatible type
SELECT customer_id, CAST(name AS VARCHAR(100)) FROM customers
UNION
SELECT product_id, CAST(price AS VARCHAR(100)) FROM products;

Rule 3 โ€” Column names come from the FIRST SELECT only:

-- The result column is named 'customer_id' and 'name' from the first SELECT
-- The second SELECT's column names (product_id, product_name) are ignored
SELECT customer_id, name       FROM customers
UNION
SELECT product_id, product_name FROM products;
-- Result columns: customer_id, name   โ† from first SELECT

Rule 4 โ€” ORDER BY applies to the final result, not individual SELECTs:

-- โŒ Invalid: ORDER BY inside individual SELECT
SELECT customer_id FROM customers ORDER BY customer_id   -- ERROR
UNION
SELECT product_id FROM products;

-- โœ… Valid: ORDER BY at the very end applies to the full result
SELECT customer_id AS id FROM customers
UNION
SELECT product_id  AS id FROM products
ORDER BY id;                               -- single ORDER BY at the end

Rule 5 โ€” Use aliases from the first SELECT for ORDER BY and referencing:

SELECT customer_id AS entity_id, name AS entity_name FROM customers
UNION ALL
SELECT product_id,               product_name         FROM products
ORDER BY entity_name ASC;        -- reference the alias from first SELECT

Q5. Explain the difference between simple CASE and searched CASE. Give one example of each. [4 Marks]

Answer:

Feature Simple CASE Searched CASE
Syntax style CASE column WHEN value THEN result CASE WHEN condition THEN result
Comparison type Equality only (=) โ€” compares one expression to fixed values Any condition โ€” supports >, <, BETWEEN, LIKE, AND/OR, subqueries
Flexibility Less flexible โ€” one expression, multiple equal values More flexible โ€” each WHEN is an independent Boolean expression
Use case Mapping fixed codes/values to labels Range checks, complex conditions, multi-column logic

Simple CASE โ€” equality comparison against a single expression:

-- Map numeric status code to human-readable label
SELECT
    order_id,
    amount,
    CASE status_code
        WHEN 1  THEN 'Pending'
        WHEN 2  THEN 'Processing'
        WHEN 3  THEN 'Shipped'
        WHEN 4  THEN 'Delivered'
        WHEN 5  THEN 'Cancelled'
        ELSE        'Unknown'
    END                         AS order_status
FROM orders;

Searched CASE โ€” independent Boolean condition per WHEN:

-- Categorize salary into bands using range conditions
SELECT
    emp_id,
    name,
    salary,
    CASE
        WHEN salary >= 100000               THEN 'Executive'
        WHEN salary >= 70000
             AND salary < 100000           THEN 'Senior'
        WHEN salary >= 40000
             AND salary < 70000            THEN 'Mid-Level'
        WHEN salary < 40000                THEN 'Junior'
        ELSE                                    'Not Classified'
    END                                     AS salary_band
FROM employees;

Key rule: SQL evaluates CASE conditions top to bottom and returns the result of the first matching WHEN. Order your conditions from most specific to most general to avoid incorrect matches.
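
First-match-wins is easy to demonstrate. A small sqlite3 sketch (table and salary value are illustrative) showing how a misordered searched CASE silently returns the wrong label:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES ('Asha', 120000)")

# Misordered: the general condition matches first, so 'Executive' is unreachable
wrong = conn.execute("""
    SELECT CASE
        WHEN salary >= 40000  THEN 'Mid-Level'   -- too general, listed first
        WHEN salary >= 100000 THEN 'Executive'   -- never reached
    END
    FROM employees
""").fetchone()[0]

# Correct: most specific condition first
right = conn.execute("""
    SELECT CASE
        WHEN salary >= 100000 THEN 'Executive'
        WHEN salary >= 40000  THEN 'Mid-Level'
    END
    FROM employees
""").fetchone()[0]

print(wrong, right)  # Mid-Level Executive
```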


Q6. Write a query to categorize customers into 'Platinum', 'Gold', 'Silver', 'Bronze' based on their total spend. [4 Marks]

Answer:

Assumption: Tables customers(customer_id, name, email) and orders(order_id, customer_id, order_date, amount) exist.

WITH customer_spend AS (
    -- Step 1: Calculate lifetime total spend per customer
    SELECT
        customer_id,
        SUM(amount)                         AS total_spend,
        COUNT(order_id)                     AS total_orders
    FROM orders
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.name,
    c.email,
    COALESCE(cs.total_spend, 0)             AS total_spend,
    COALESCE(cs.total_orders, 0)            AS total_orders,
    CASE
        WHEN cs.total_spend >= 100000       THEN 'Platinum'
        WHEN cs.total_spend >= 50000        THEN 'Gold'
        WHEN cs.total_spend >= 10000        THEN 'Silver'
        WHEN cs.total_spend >  0            THEN 'Bronze'
        ELSE                                     'No Purchase'
    END                                     AS customer_tier,
    CASE
        WHEN cs.total_spend >= 100000       THEN 1
        WHEN cs.total_spend >= 50000        THEN 2
        WHEN cs.total_spend >= 10000        THEN 3
        WHEN cs.total_spend >  0            THEN 4
        ELSE                                     5
    END                                     AS tier_rank          -- for sorting
FROM customers c
LEFT JOIN customer_spend cs
    ON c.customer_id = cs.customer_id
ORDER BY
    tier_rank ASC,
    total_spend DESC;

Summary stats by tier:

-- Bonus: tier distribution report
WITH customer_spend AS (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders GROUP BY customer_id
),
tiered AS (
    SELECT
        CASE
            WHEN total_spend >= 100000  THEN 'Platinum'
            WHEN total_spend >= 50000   THEN 'Gold'
            WHEN total_spend >= 10000   THEN 'Silver'
            ELSE                             'Bronze'
        END                             AS tier,
        total_spend
    FROM customer_spend
)
SELECT
    tier,
    COUNT(*)                            AS customer_count,
    ROUND(SUM(total_spend), 2)          AS tier_total_revenue,
    ROUND(AVG(total_spend), 2)          AS avg_spend_per_customer
FROM tiered
GROUP BY tier
ORDER BY MIN(total_spend) DESC;

Q7. Write a query using CASE inside an aggregate to count how many orders were 'High Value' (>10000) and 'Low Value' in each region. [4 Marks]

Answer:

Assumption: Tables orders(order_id, customer_id, region, order_date, amount) exist.

SELECT
    region,
    COUNT(order_id)                                     AS total_orders,
    -- Conditional aggregation: count High Value orders
    SUM(
        CASE WHEN amount > 10000 THEN 1 ELSE 0 END
    )                                                   AS high_value_orders,
    -- Conditional aggregation: count Low Value orders
    SUM(
        CASE WHEN amount <= 10000 THEN 1 ELSE 0 END
    )                                                   AS low_value_orders,
    -- Alternative using COUNT with filter
    COUNT(
        CASE WHEN amount > 10000 THEN 1 END
    )                                                   AS high_value_count_v2,
    -- Percentage breakdown
    ROUND(
        SUM(CASE WHEN amount > 10000 THEN 1 ELSE 0 END)
        * 100.0 / COUNT(order_id), 2
    )                                                   AS high_value_pct,
    -- Revenue split
    ROUND(
        SUM(CASE WHEN amount > 10000 THEN amount ELSE 0 END), 2
    )                                                   AS high_value_revenue,
    ROUND(
        SUM(CASE WHEN amount <= 10000 THEN amount ELSE 0 END), 2
    )                                                   AS low_value_revenue
FROM orders
GROUP BY region
ORDER BY total_orders DESC;

Sample output:

region total_orders high_value_orders low_value_orders high_value_pct high_value_revenue
North 850 210 640 24.71% 3200000
South 620 95 525 15.32% 1450000
East 440 180 260 40.91% 2800000

Why SUM(CASE...) over COUNT(CASE...): COUNT(expr) counts non-NULL values โ€” so COUNT(CASE WHEN ... THEN 1 END) works because unmatched rows return NULL (not counted). SUM(CASE WHEN ... THEN 1 ELSE 0 END) is more explicit and readable. Both produce identical results โ€” choose whichever is clearer to your team.
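
The equivalence of the two forms can be checked in any engine; a runnable sqlite3 sketch with made-up region/amount data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('North', 15000), ('North', 8000), ('North', 12000),
        ('South', 5000),  ('South', 20000);
""")

# Both conditional-count styles in one query: they must agree per region
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN amount > 10000 THEN 1 ELSE 0 END) AS via_sum,
           COUNT(CASE WHEN amount > 10000 THEN 1 END)      AS via_count
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('North', 2, 2), ('South', 1, 1)]
```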


Q8. Can you use CASE in ORDER BY clause? Give an example of custom sorting. [3 Marks]

Answer: Yes, CASE can be used directly in the ORDER BY clause to implement custom sort logic โ€” sorting rows in a business-defined order rather than alphabetically or numerically.

Example 1 โ€” Sort order status by business priority (not alphabetically):

SELECT
    order_id,
    customer_id,
    status,
    amount
FROM orders
ORDER BY
    CASE status
        WHEN 'Critical'    THEN 1    -- show most urgent first
        WHEN 'Processing'  THEN 2
        WHEN 'Pending'     THEN 3
        WHEN 'Shipped'     THEN 4
        WHEN 'Delivered'   THEN 5
        WHEN 'Cancelled'   THEN 6
        ELSE                    7    -- unknown statuses last
    END ASC,
    order_date ASC;                  -- secondary sort by date within each status

Example 2 โ€” Sort employees with managers first, then by salary:

SELECT
    emp_id,
    name,
    manager_id,
    salary
FROM employees
ORDER BY
    CASE
        WHEN manager_id IS NULL THEN 0   -- top-level (no manager) first
        ELSE 1
    END ASC,
    salary DESC;                         -- within each group, highest salary first

Example 3 โ€” Dynamic sort direction based on a flag:

-- Sort ascending for 'Gold'/'Platinum' customers, descending for others
SELECT customer_id, name, tier, total_spend
FROM customer_summary
ORDER BY
    CASE WHEN tier IN ('Gold', 'Platinum')
         THEN total_spend END DESC,      -- high spenders in premium tiers first
    CASE WHEN tier NOT IN ('Gold', 'Platinum')
         THEN total_spend END ASC;       -- low spenders in lower tiers last

Rule: CASE in ORDER BY returns a value (number or string) that SQL uses as the sort key. The actual CASE expression never appears in the output โ€” it only affects row ordering.


Q9. Write a query to find gaps in a sequence of order_ids (missing IDs). [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists where order_id should be a sequential integer with no gaps.

Method 1 โ€” Using LAG to find gaps (PostgreSQL / SQL Server / MySQL 8+):

WITH ordered_ids AS (
    SELECT
        order_id,
        LAG(order_id) OVER (ORDER BY order_id)     AS prev_order_id
    FROM orders
),
gaps AS (
    SELECT
        prev_order_id + 1                           AS gap_start,
        order_id - 1                                AS gap_end,
        order_id - prev_order_id - 1                AS missing_count
    FROM ordered_ids
    WHERE order_id - prev_order_id > 1              -- gap exists when difference > 1
)
SELECT
    gap_start,
    gap_end,
    missing_count,
    CONCAT(
        'Missing IDs: ', gap_start,
        CASE WHEN gap_start < gap_end
             THEN CONCAT(' to ', gap_end)
             ELSE ''
        END
    )                                               AS gap_description
FROM gaps
ORDER BY gap_start;

Method 2 โ€” Using LEAD to look forward (window functions cannot be filtered in WHERE or HAVING, so wrap them in a derived table):

SELECT
    current_id,
    next_id,
    missing_between
FROM (
    SELECT
        order_id                                    AS current_id,
        LEAD(order_id) OVER (ORDER BY order_id)     AS next_id,
        LEAD(order_id) OVER (ORDER BY order_id)
            - order_id - 1                          AS missing_between
    FROM orders
) gaps
WHERE missing_between > 0;

Method 3 โ€” Generate full sequence and LEFT JOIN (most explicit):

-- PostgreSQL: generate_series
WITH full_sequence AS (
    SELECT generate_series(
        (SELECT MIN(order_id) FROM orders),
        (SELECT MAX(order_id) FROM orders)
    ) AS expected_id
)
SELECT
    fs.expected_id                                  AS missing_order_id
FROM full_sequence fs
LEFT JOIN orders o ON fs.expected_id = o.order_id
WHERE o.order_id IS NULL
ORDER BY fs.expected_id;

Sample output (Method 1):

gap_start gap_end missing_count gap_description
4 4 1 Missing IDs: 4
8 11 4 Missing IDs: 8 to 11
15 15 1 Missing IDs: 15
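
The LAG approach from Method 1 can be exercised end to end in sqlite3 (window functions are available there since 3.25); the order_id values below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
    INSERT INTO orders (order_id)
    VALUES (1), (2), (3), (5), (6), (7), (12), (13), (16);
""")

# Find (gap_start, gap_end) ranges of missing IDs
gaps = conn.execute("""
    WITH ordered_ids AS (
        SELECT order_id,
               LAG(order_id) OVER (ORDER BY order_id) AS prev_order_id
        FROM orders
    )
    SELECT prev_order_id + 1 AS gap_start,
           order_id - 1      AS gap_end
    FROM ordered_ids
    WHERE order_id - prev_order_id > 1
    ORDER BY gap_start
""").fetchall()

print(gaps)  # [(4, 4), (8, 11), (14, 15)]
```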

Q10. Explain how you would join two tables when the join condition involves a date range (e.g., matching a transaction date with a pricing table effective_from and effective_to). [6 Marks]

Answer: A date range join (also called a temporal join or non-equi join) matches rows based on whether a date falls within a valid date range rather than an exact key match. This is common in slowly changing dimension (SCD) scenarios โ€” price histories, tax rate tables, salary bands, and exchange rates.

Schema setup:

-- transactions: each sale with its transaction date
-- transactions(txn_id, product_id, txn_date, quantity)

-- pricing: price valid between effective_from and effective_to
-- pricing(price_id, product_id, unit_price, effective_from, effective_to)

-- Sample pricing data:
-- product_id=1, price=100, valid 2024-01-01 to 2024-03-31
-- product_id=1, price=120, valid 2024-04-01 to 2024-12-31

Basic date range join:

SELECT
    t.txn_id,
    t.product_id,
    t.txn_date,
    t.quantity,
    p.unit_price,
    t.quantity * p.unit_price                       AS total_value
FROM transactions t
JOIN pricing p
    ON  t.product_id   = p.product_id              -- same product
    AND t.txn_date    >= p.effective_from           -- transaction after price start
    AND t.txn_date    <= p.effective_to             -- transaction before price end
ORDER BY t.txn_date;

Handling open-ended pricing (effective_to = NULL means "current price"):

SELECT
    t.txn_id,
    t.product_id,
    t.txn_date,
    t.quantity,
    p.unit_price,
    t.quantity * p.unit_price                       AS total_value
FROM transactions t
JOIN pricing p
    ON  t.product_id = p.product_id
    AND t.txn_date  >= p.effective_from
    AND (
        t.txn_date  <= p.effective_to              -- within closed range
        OR p.effective_to IS NULL                  -- OR price is still active (open-ended)
    )
ORDER BY t.txn_id;

Handling overlapping date ranges (defensive โ€” pick most recent effective price):

-- If pricing table has overlapping ranges due to data quality issues,
-- use ROW_NUMBER to pick the most recently effective price
WITH ranked_prices AS (
    SELECT
        t.txn_id,
        t.product_id,
        t.txn_date,
        t.quantity,
        p.unit_price,
        p.effective_from,
        ROW_NUMBER() OVER (
            PARTITION BY t.txn_id
            ORDER BY p.effective_from DESC          -- most recent price wins
        )                                           AS rn
    FROM transactions t
    JOIN pricing p
        ON  t.product_id = p.product_id
        AND t.txn_date  >= p.effective_from
        AND (t.txn_date <= p.effective_to OR p.effective_to IS NULL)
)
SELECT
    txn_id,
    product_id,
    txn_date,
    quantity,
    unit_price,
    quantity * unit_price                           AS total_value
FROM ranked_prices
WHERE rn = 1                                        -- keep only best-match price row
ORDER BY txn_date;

Common pitfalls and solutions:

Pitfall Cause Solution
Multiple price rows matched Overlapping date ranges in pricing table Use ROW_NUMBER() to pick one
No price row matched Gap in pricing history Use LEFT JOIN and flag NULL prices
NULL effective_to not handled Current price has no end date Add OR effective_to IS NULL
Performance on large tables Full scan due to non-equi condition Index on (product_id, effective_from, effective_to)

Performance tip: Date range joins cannot use standard equality index lookups. Create a composite index on (product_id, effective_from) โ€” the equality on product_id reduces the scan, then the range on effective_from narrows it further. For very large pricing tables, consider a partitioned or materialized approach.


Q11. Write a query to compare current month sales vs previous month sales vs same month last year for each product category. [5 Marks]

Answer:

Assumption: Tables orders(order_id, customer_id, order_date, amount) and order_items(order_item_id, order_id, product_id, quantity, unit_price) and products(product_id, product_name, category) exist.

WITH monthly_category_sales AS (
    -- Step 1: Aggregate revenue per category per year-month
    SELECT
        p.category,
        YEAR(o.order_date)                          AS yr,
        MONTH(o.order_date)                         AS mth,
        SUM(oi.quantity * oi.unit_price)            AS monthly_revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p     ON oi.product_id = p.product_id
    GROUP BY p.category, YEAR(o.order_date), MONTH(o.order_date)
),
sales_with_comparisons AS (
    -- Step 2: Use LAG to bring in previous month and same month last year
    SELECT
        category,
        yr,
        mth,
        monthly_revenue                             AS current_month_sales,
        LAG(monthly_revenue, 1) OVER (
            PARTITION BY category
            ORDER BY yr, mth
        )                                           AS prev_month_sales,
        LAG(monthly_revenue, 12) OVER (
            PARTITION BY category
            ORDER BY yr, mth
        )                                           AS same_month_last_year_sales
    FROM monthly_category_sales
)
-- Step 3: Calculate growth metrics and filter to current month
SELECT
    category,
    yr                                              AS year,
    mth                                             AS month,
    ROUND(current_month_sales, 2)                   AS current_month_sales,
    ROUND(COALESCE(prev_month_sales, 0), 2)         AS prev_month_sales,
    ROUND(COALESCE(same_month_last_year_sales, 0), 2) AS same_month_ly_sales,
    -- Month-over-month change
    ROUND(
        CASE
            WHEN prev_month_sales IS NULL
              OR prev_month_sales = 0               THEN NULL
            ELSE (current_month_sales - prev_month_sales)
                 / prev_month_sales * 100
        END, 2
    )                                               AS mom_growth_pct,
    -- Year-over-year change
    ROUND(
        CASE
            WHEN same_month_last_year_sales IS NULL
              OR same_month_last_year_sales = 0     THEN NULL
            ELSE (current_month_sales - same_month_last_year_sales)
                 / same_month_last_year_sales * 100
        END, 2
    )                                               AS yoy_growth_pct
FROM sales_with_comparisons
ORDER BY
    category        ASC,
    yr              DESC,
    mth             DESC;

Sample output:

category year month current_month_sales prev_month_sales same_month_ly_sales mom_growth_pct yoy_growth_pct
Electronics 2024 4 320000 295000 280000 +8.47% +14.29%
Clothing 2024 4 145000 162000 130000 -10.49% +11.54%

Why LAG(monthly_revenue, 12): Partitioned by category and ordered chronologically by year and month, the row 12 positions back is the same month of the prior year โ€” a clean year-over-year comparison without a self-join. Caveat: this only holds if every category has a row for every month. If some months have no sales, the fixed offset of 12 drifts; in that case, pre-generate a complete month spine (calendar table) or self-join on (yr - 1, mth) instead.


Q12. Write a query to find customers who placed their first order in 2024. [5 Marks]

Answer:

Assumption: Tables customers(customer_id, name, email, created_at) and orders(order_id, customer_id, order_date, amount) exist.

Method 1 โ€” Using MIN() with HAVING (simplest and most portable):

SELECT
    c.customer_id,
    c.name,
    c.email,
    MIN(o.order_date)                           AS first_order_date,
    COUNT(o.order_id)                           AS total_orders_2024,
    SUM(o.amount)                               AS total_spend_2024
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id,
    c.name,
    c.email
HAVING
    YEAR(MIN(o.order_date)) = 2024              -- first ever order was in 2024
ORDER BY first_order_date ASC;

Method 2 โ€” Using window function ROW_NUMBER (most explicit):

WITH ranked_orders AS (
    -- Step 1: Rank each customer's orders chronologically
    SELECT
        customer_id,
        order_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date ASC
        )                                       AS order_rank
    FROM orders
),
first_orders AS (
    -- Step 2: Keep only each customer's very first order
    SELECT customer_id, order_date              AS first_order_date
    FROM ranked_orders
    WHERE order_rank = 1
)
-- Step 3: Filter to those whose first order was in 2024
SELECT
    c.customer_id,
    c.name,
    c.email,
    fo.first_order_date
FROM customers c
JOIN first_orders fo
    ON c.customer_id = fo.customer_id
WHERE YEAR(fo.first_order_date) = 2024
ORDER BY fo.first_order_date ASC;

Method 3 โ€” Using NOT EXISTS (anti-join: no orders before 2024):

SELECT
    c.customer_id,
    c.name,
    c.email,
    MIN(o.order_date)                           AS first_order_date
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE YEAR(o.order_date) = 2024                 -- has at least one 2024 order
  AND NOT EXISTS (
      SELECT 1
      FROM orders o_prev
      WHERE o_prev.customer_id = c.customer_id
        AND YEAR(o_prev.order_date) < 2024      -- no orders before 2024
  )
GROUP BY c.customer_id, c.name, c.email
ORDER BY first_order_date ASC;

Method comparison:

Method Readability Performance Handles edge cases
MIN() + HAVING โœ… Cleanest โœ… Single pass โœ… Yes
ROW_NUMBER() โœ… Explicit โœ… Good โœ… Yes
NOT EXISTS โœ… Clear intent โš ๏ธ Correlated scan โœ… Best for correctness

Best practice: Method 1 (MIN + HAVING) is the most concise. Method 3 (NOT EXISTS) is semantically the clearest โ€” it reads as "customers who have a 2024 order and have NO order before 2024" which precisely matches the business requirement. Use Method 3 in production when data correctness is critical.
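
A quick cross-check that Method 1 and Method 3 agree on the same data โ€” a sqlite3 sketch (YEAR() becomes strftime('%Y', ...) in SQLite; the customers and dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 101, '2023-05-10'),   -- existing customer who also ordered in 2024
        (2, 101, '2024-02-01'),
        (3, 102, '2024-03-15'),   -- first-ever order in 2024
        (4, 103, '2022-07-20');   -- no 2024 orders
""")

# Method 1: MIN + HAVING
m1 = conn.execute("""
    SELECT customer_id FROM orders
    GROUP BY customer_id
    HAVING strftime('%Y', MIN(order_date)) = '2024'
""").fetchall()

# Method 3: NOT EXISTS anti-join (no orders before 2024)
m3 = conn.execute("""
    SELECT DISTINCT o.customer_id FROM orders o
    WHERE strftime('%Y', o.order_date) = '2024'
      AND NOT EXISTS (
          SELECT 1 FROM orders p
          WHERE p.customer_id = o.customer_id
            AND strftime('%Y', p.order_date) < '2024')
""").fetchall()

print(m1, m3)  # [(102,)] [(102,)]
```

Customer 101 is excluded by both: they ordered in 2024, but their first order was in 2023.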


๐Ÿ“… Day 7: Date/Time Functions & String Manipulation

Intermediate 50 Marks 90 Min



Q1. List the common date functions in SQL (DATEADD, DATEDIFF, DATEPART, EXTRACT). Give a use case for each. [4 Marks]

Answer:

DATEADD โ€” Add or subtract a time interval from a date:

Dialect Function Example
SQL Server DATEADD(unit, n, date) DATEADD(DAY, 30, order_date)
MySQL DATE_ADD(date, INTERVAL n unit) DATE_ADD(order_date, INTERVAL 30 DAY)
PostgreSQL date + INTERVAL 'n unit' order_date + INTERVAL '30 days'
-- Use case: Find the expected delivery date (7 days after order)
-- SQL Server
SELECT order_id, order_date,
       DATEADD(DAY, 7, order_date)                  AS expected_delivery
FROM orders;

-- MySQL
SELECT order_id, order_date,
       DATE_ADD(order_date, INTERVAL 7 DAY)         AS expected_delivery
FROM orders;

-- PostgreSQL
SELECT order_id, order_date,
       order_date + INTERVAL '7 days'               AS expected_delivery
FROM orders;

DATEDIFF โ€” Calculate the difference between two dates:

Dialect Function
SQL Server DATEDIFF(unit, start_date, end_date)
MySQL DATEDIFF(end_date, start_date) โ€” days only
PostgreSQL end_date - start_date โ€” returns integer days
-- Use case: Find how many days since each order was placed
-- SQL Server
SELECT order_id,
       DATEDIFF(DAY, order_date, GETDATE())         AS days_since_order
FROM orders;

-- MySQL
SELECT order_id,
       DATEDIFF(CURDATE(), order_date)              AS days_since_order
FROM orders;

-- PostgreSQL
SELECT order_id,
       CURRENT_DATE - order_date::DATE              AS days_since_order
FROM orders;

DATEPART / EXTRACT โ€” Extract a specific component (year, month, day, hour) from a date:

Dialect Function
SQL Server DATEPART(unit, date)
MySQL YEAR(date), MONTH(date), DAY(date) or EXTRACT(unit FROM date)
PostgreSQL EXTRACT(unit FROM date) or DATE_PART('unit', date)
-- Use case: Group sales by year and month for trend analysis
-- SQL Server
SELECT DATEPART(YEAR, order_date)   AS yr,
       DATEPART(MONTH, order_date)  AS mth,
       SUM(amount)                  AS monthly_sales
FROM orders
GROUP BY DATEPART(YEAR, order_date), DATEPART(MONTH, order_date);

-- MySQL
SELECT YEAR(order_date)    AS yr,
       MONTH(order_date)   AS mth,
       SUM(amount)         AS monthly_sales
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date);

-- PostgreSQL
SELECT EXTRACT(YEAR  FROM order_date) AS yr,
       EXTRACT(MONTH FROM order_date) AS mth,
       SUM(amount)                    AS monthly_sales
FROM orders
GROUP BY 1, 2;

EXTRACT โ€” Standard SQL function (ANSI compliant) to pull date parts:

-- Use case: Find orders placed in Q4 (October, November, December)
-- PostgreSQL / MySQL
SELECT order_id, order_date, amount
FROM orders
WHERE EXTRACT(MONTH FROM order_date) IN (10, 11, 12);

Q2. Write a query to find all orders placed in the last 30 days. [3 Marks]

Answer:

MySQL:

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
  AND order_date <  CURDATE() + INTERVAL 1 DAY    -- includes today fully
ORDER BY order_date DESC;

PostgreSQL:

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  AND order_date <  CURRENT_DATE + INTERVAL '1 day'
ORDER BY order_date DESC;

SQL Server:

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM orders
WHERE order_date >= DATEADD(DAY, -30, CAST(GETDATE() AS DATE))
  AND order_date <  DATEADD(DAY, 1,   CAST(GETDATE() AS DATE))
ORDER BY order_date DESC;

Important nuance: If order_date is a DATETIME or TIMESTAMP column (not just DATE), a closed upper bound such as BETWEEN ... AND CURDATE() misses orders placed today after midnight, because CURDATE() means today at 00:00:00. Also avoid bare arithmetic like CURDATE() - 30 in MySQL โ€” it performs integer arithmetic on the YYYYMMDD number, not date arithmetic. The safe pattern is the half-open range: order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY) AND order_date < CURDATE() + INTERVAL 1 DAY, which captures the full window from midnight to midnight.
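
The closed-vs-half-open upper bound pitfall is visible even in sqlite3, where datetimes are ISO-8601 strings (sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES (1, '2024-04-13 09:00:00'),
                              (2, '2024-04-13 15:30:00');
""")
today = '2024-04-13'

# Closed upper bound: '2024-04-13 09:00:00' > '2024-04-13', so BOTH rows are missed
closed = conn.execute(
    "SELECT order_id FROM orders WHERE order_date <= ?", (today,)).fetchall()

# Half-open upper bound: strictly less than tomorrow captures the whole day
half_open = conn.execute(
    "SELECT order_id FROM orders WHERE order_date < date(?, '+1 day')",
    (today,)).fetchall()

print(closed, half_open)  # [] [(1,), (2,)]
```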


Q3. How do you calculate the age of a customer in years from their date_of_birth? [3 Marks]

Answer:

Assumption: Table customers(customer_id, name, date_of_birth) exists where date_of_birth is a DATE column.

MySQL:

SELECT
    customer_id,
    name,
    date_of_birth,
    TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())   AS age_in_years
FROM customers
ORDER BY age_in_years DESC;

PostgreSQL:

SELECT
    customer_id,
    name,
    date_of_birth,
    -- Method 1: Using AGE() function (returns interval)
    AGE(CURRENT_DATE, date_of_birth)                AS age_interval,
    -- Method 2: Extract years from AGE() for a clean integer
    EXTRACT(YEAR FROM AGE(CURRENT_DATE, date_of_birth))::INT
                                                    AS age_in_years
FROM customers
ORDER BY age_in_years DESC;

SQL Server:

SELECT
    customer_id,
    name,
    date_of_birth,
    -- Method 1: Simple DATEDIFF (may be off by 1 if birthday hasn't occurred yet this year)
    DATEDIFF(YEAR, date_of_birth, GETDATE())        AS approx_age,
    -- Method 2: Accurate age accounting for birthday this year
    DATEDIFF(YEAR, date_of_birth, GETDATE())
    - CASE
        WHEN MONTH(GETDATE()) < MONTH(date_of_birth)
          OR (MONTH(GETDATE()) = MONTH(date_of_birth)
              AND DAY(GETDATE()) < DAY(date_of_birth))
        THEN 1
        ELSE 0
      END                                           AS accurate_age_in_years
FROM customers;

Why SQL Server needs the birthday correction: DATEDIFF(YEAR, '2000-12-31', '2024-06-01') returns 24 even though the person hasn't turned 24 yet (their birthday is in December). The CASE block subtracts 1 if the birthday hasn't occurred yet in the current year. MySQL's TIMESTAMPDIFF and PostgreSQL's AGE() handle this correctly by default.
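
The same birthday correction in procedural form โ€” a small Python check of the logic (dates are illustrative; the function name is ours):

```python
from datetime import date

def accurate_age(dob: date, today: date) -> int:
    """Year difference, minus 1 if the birthday hasn't occurred yet this year."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

dob = date(2000, 12, 31)
today = date(2024, 6, 1)
print(today.year - dob.year)     # 24 -- the naive DATEDIFF(YEAR)-style answer
print(accurate_age(dob, today))  # 23 -- December birthday hasn't happened yet
```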


Q4. Write a query to find the first and last day of the current month. [3 Marks]

Answer:

MySQL:

SELECT
    -- First day: set day component to 01
    DATE_FORMAT(CURDATE(), '%Y-%m-01')              AS first_day_of_month,
    -- Last day: built-in function
    LAST_DAY(CURDATE())                             AS last_day_of_month;

-- Practical use: filter orders in current month
SELECT order_id, order_date, amount
FROM orders
WHERE order_date BETWEEN DATE_FORMAT(CURDATE(), '%Y-%m-01')
                     AND LAST_DAY(CURDATE());

PostgreSQL:

SELECT
    -- First day: truncate to month start
    DATE_TRUNC('month', CURRENT_DATE)::DATE         AS first_day_of_month,
    -- Last day: first day of next month minus 1 day
    (DATE_TRUNC('month', CURRENT_DATE)
     + INTERVAL '1 month - 1 day')::DATE            AS last_day_of_month;

-- Practical use
SELECT order_id, order_date, amount
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
  AND order_date <  DATE_TRUNC('month', CURRENT_DATE) + INTERVAL '1 month';

SQL Server:

SELECT
    -- First day: construct from year and month parts
    DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1)
                                                    AS first_day_of_month,
    -- Last day: first day of next month minus 1 day
    EOMONTH(GETDATE())                              AS last_day_of_month;

-- Alternative using DATEADD
SELECT
    DATEADD(DAY, 1 - DAY(GETDATE()), CAST(GETDATE() AS DATE))
                                                    AS first_day_of_month,
    EOMONTH(GETDATE())                              AS last_day_of_month;

Q5. What is the difference between CURRENT_DATE, GETDATE(), NOW(), and SYSDATE? Which database uses which? [2 Marks]

Answer:

Function Returns Database
CURRENT_DATE Date only โ€” no time component (2024-04-13) Standard SQL โ€” PostgreSQL, MySQL (not SQL Server; use CAST(GETDATE() AS DATE))
GETDATE() Date + Time โ€” current datetime (2024-04-13 14:32:05.123) SQL Server only
NOW() Date + Time โ€” current datetime at statement start MySQL, PostgreSQL
SYSDATE() Date + Time โ€” current datetime at function evaluation (can differ within same query) MySQL; Oracle uses SYSDATE (no parentheses)
-- PostgreSQL
SELECT CURRENT_DATE;             -- 2024-04-13
SELECT NOW();                    -- 2024-04-13 14:32:05.123456+05:30

-- MySQL
SELECT CURRENT_DATE;             -- 2024-04-13
SELECT NOW();                    -- 2024-04-13 14:32:05  (statement start time)
SELECT SYSDATE();                -- 2024-04-13 14:32:05  (exact evaluation time โ€” can differ from NOW())

-- SQL Server
SELECT CAST(GETDATE() AS DATE);  -- 2024-04-13
SELECT GETDATE();                -- 2024-04-13 14:32:05.123
SELECT SYSDATETIME();            -- higher precision: 2024-04-13 14:32:05.1234567

Key difference between NOW() and SYSDATE() in MySQL: NOW() returns the time the statement began executing โ€” it stays constant within a query. SYSDATE() returns the time it is actually called โ€” it can return different values if called multiple times within the same query. For consistency and reproducibility, NOW() is preferred.


Q6. Write a query to calculate weekly sales for the last 12 weeks. [4 Marks]

Answer:

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL:

SELECT
    YEARWEEK(order_date, 1)                         AS year_week,       -- ISO week
    MIN(order_date)                                 AS week_start_date,
    MAX(order_date)                                 AS week_end_date,
    COUNT(order_id)                                 AS total_orders,
    ROUND(SUM(amount), 2)                           AS weekly_sales,
    ROUND(AVG(amount), 2)                           AS avg_order_value
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 12 WEEK)
GROUP BY YEARWEEK(order_date, 1)
ORDER BY year_week ASC;

PostgreSQL:

SELECT
    DATE_TRUNC('week', order_date)::DATE            AS week_start_date,
    DATE_TRUNC('week', order_date)::DATE
        + INTERVAL '6 days'                         AS week_end_date,
    COUNT(order_id)                                 AS total_orders,
    ROUND(SUM(amount)::NUMERIC, 2)                  AS weekly_sales,
    ROUND(AVG(amount)::NUMERIC, 2)                  AS avg_order_value
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 weeks'
GROUP BY DATE_TRUNC('week', order_date)
ORDER BY week_start_date ASC;

SQL Server:

SELECT
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
                                                    AS week_start_date,
    DATEADD(DAY, 7 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
                                                    AS week_end_date,
    COUNT(order_id)                                 AS total_orders,
    ROUND(SUM(amount), 2)                           AS weekly_sales,
    ROUND(AVG(amount), 2)                           AS avg_order_value
FROM orders
WHERE order_date >= DATEADD(WEEK, -12, CAST(GETDATE() AS DATE))
GROUP BY
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE)),
    DATEADD(DAY, 7 - DATEPART(WEEKDAY, order_date), CAST(order_date AS DATE))
ORDER BY week_start_date ASC;

Q7. Write a query to find the day of the week with the highest average sales. [3 Marks]

Answer:

MySQL:

SELECT
    DAYNAME(order_date)                             AS day_of_week,
    DAYOFWEEK(order_date)                           AS day_number,      -- 1=Sunday, 7=Saturday
    COUNT(order_id)                                 AS total_orders,
    ROUND(AVG(amount), 2)                           AS avg_sales,
    ROUND(SUM(amount), 2)                           AS total_sales
FROM orders
GROUP BY
    DAYOFWEEK(order_date),
    DAYNAME(order_date)
ORDER BY avg_sales DESC
LIMIT 1;

PostgreSQL:

SELECT
    TO_CHAR(order_date, 'Day')                      AS day_of_week,
    EXTRACT(DOW FROM order_date)                    AS day_number,      -- 0=Sunday, 6=Saturday
    COUNT(order_id)                                 AS total_orders,
    ROUND(AVG(amount)::NUMERIC, 2)                  AS avg_sales,
    ROUND(SUM(amount)::NUMERIC, 2)                  AS total_sales
FROM orders
GROUP BY
    EXTRACT(DOW FROM order_date),
    TO_CHAR(order_date, 'Day')
ORDER BY avg_sales DESC
LIMIT 1;

SQL Server (remove TOP 1 to see all days ranked):

SELECT TOP 1
    DATENAME(WEEKDAY, order_date)                   AS day_of_week,
    DATEPART(WEEKDAY, order_date)                   AS day_number,
    COUNT(order_id)                                 AS total_orders,
    ROUND(AVG(amount), 2)                           AS avg_sales,
    ROUND(SUM(amount), 2)                           AS total_sales
FROM orders
GROUP BY
    DATEPART(WEEKDAY, order_date),
    DATENAME(WEEKDAY, order_date)
ORDER BY avg_sales DESC;

Q8. How do you group sales by fiscal quarter (assuming fiscal year starts in April)? [3 Marks]

Answer: In a fiscal year starting April, the quarters are:

| Fiscal Quarter | Calendar Months |
|----------------|-----------------|
| FQ1 | April, May, June |
| FQ2 | July, August, September |
| FQ3 | October, November, December |
| FQ4 | January, February, March |

MySQL:

SELECT
    -- Fiscal year: shifts back 3 months so April becomes month 1 of new year
    YEAR(order_date - INTERVAL 3 MONTH)             AS fiscal_year,
    QUARTER(order_date - INTERVAL 3 MONTH)          AS fiscal_quarter,
    CASE QUARTER(order_date - INTERVAL 3 MONTH)
        WHEN 1 THEN 'FQ1 (Apr-Jun)'
        WHEN 2 THEN 'FQ2 (Jul-Sep)'
        WHEN 3 THEN 'FQ3 (Oct-Dec)'
        WHEN 4 THEN 'FQ4 (Jan-Mar)'
    END                                             AS fiscal_quarter_label,
    ROUND(SUM(amount), 2)                           AS total_sales,
    COUNT(order_id)                                 AS total_orders
FROM orders
GROUP BY
    YEAR(order_date - INTERVAL 3 MONTH),
    QUARTER(order_date - INTERVAL 3 MONTH)
ORDER BY fiscal_year, fiscal_quarter;

PostgreSQL:

SELECT
    EXTRACT(YEAR FROM order_date - INTERVAL '3 months')     AS fiscal_year,
    EXTRACT(QUARTER FROM order_date - INTERVAL '3 months')  AS fiscal_quarter,
    CASE EXTRACT(QUARTER FROM order_date - INTERVAL '3 months')
        WHEN 1 THEN 'FQ1 (Apr-Jun)'
        WHEN 2 THEN 'FQ2 (Jul-Sep)'
        WHEN 3 THEN 'FQ3 (Oct-Dec)'
        WHEN 4 THEN 'FQ4 (Jan-Mar)'
    END                                                     AS fiscal_quarter_label,
    ROUND(SUM(amount)::NUMERIC, 2)                          AS total_sales
FROM orders
GROUP BY 1, 2
ORDER BY fiscal_year, fiscal_quarter;

SQL Server:

SELECT
    YEAR(DATEADD(MONTH, -3, order_date))            AS fiscal_year,
    DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
                                                    AS fiscal_quarter,
    CASE DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
        WHEN 1 THEN 'FQ1 (Apr-Jun)'
        WHEN 2 THEN 'FQ2 (Jul-Sep)'
        WHEN 3 THEN 'FQ3 (Oct-Dec)'
        WHEN 4 THEN 'FQ4 (Jan-Mar)'
    END                                             AS fiscal_quarter_label,
    ROUND(SUM(amount), 2)                           AS total_sales
FROM orders
GROUP BY
    YEAR(DATEADD(MONTH, -3, order_date)),
    DATEPART(QUARTER, DATEADD(MONTH, -3, order_date))
ORDER BY fiscal_year, fiscal_quarter;

Logic explained: Subtracting 3 months from each date shifts the calendar so that April (month 4) becomes January (month 1) of the shifted calendar. Standard QUARTER() then correctly identifies FQ1=Apr-Jun as Q1 in the shifted timeline.
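A quick boundary check of the shift logic (MySQL date literals; expected values shown as comments):

```sql
-- April 2024 lands in FQ1; March 2024 lands in FQ4 of fiscal year 2023
SELECT
    QUARTER(DATE '2024-04-01' - INTERVAL 3 MONTH) AS apr_fq,         -- 1
    QUARTER(DATE '2024-03-31' - INTERVAL 3 MONTH) AS mar_fq,         -- 4
    YEAR(DATE '2024-03-31' - INTERVAL 3 MONTH)    AS mar_fiscal_yr;  -- 2023
```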


Q9. List the common string functions (CONCAT, SUBSTRING, LENGTH, TRIM, REPLACE, UPPER, LOWER). Give one example each. [5 Marks]

Answer:

CONCAT โ€” Join two or more strings together:

-- Create full name from first and last name columns
SELECT
    CONCAT(first_name, ' ', last_name)              AS full_name
FROM customers;
-- MySQL also supports: CONCAT_WS(' ', first_name, last_name) โ€” skips NULLs
-- PostgreSQL: first_name || ' ' || last_name
-- SQL Server: first_name + ' ' + last_name

SUBSTRING โ€” Extract part of a string by position:

-- Extract area code from phone number '(022) 4567-8901'
SELECT
    phone,
    SUBSTRING(phone, 2, 3)                          AS area_code
    -- MySQL: SUBSTRING(str, start_pos, length) โ€” 1-indexed
    -- PostgreSQL: SUBSTRING(str FROM 2 FOR 3)
    -- SQL Server: SUBSTRING(str, 2, 3)
FROM customers;

LENGTH โ€” Return the number of characters in a string:

-- Find customers with unusually long names (possible data quality issue)
SELECT
    name,
    LENGTH(name)                                    AS name_length
    -- SQL Server: LEN(name) โ€” trims trailing spaces before counting
    -- PostgreSQL: LENGTH(name) or CHAR_LENGTH(name)
FROM customers
WHERE LENGTH(name) > 50;

TRIM โ€” Remove leading and trailing spaces (or specified characters):

-- Clean up messy email entries with extra whitespace
SELECT
    email,
    TRIM(email)                                     AS clean_email,
    LTRIM(email)                                    AS left_trimmed,   -- leading spaces only
    RTRIM(email)                                    AS right_trimmed   -- trailing spaces only
FROM customers
WHERE email != TRIM(email);                         -- find rows that need cleaning

REPLACE โ€” Substitute all occurrences of a substring:

-- Standardize phone format: remove hyphens from '022-4567-8901'
SELECT
    phone,
    REPLACE(phone, '-', '')                         AS clean_phone,
    REPLACE(REPLACE(phone, '-', ''), ' ', '')       AS fully_clean_phone
FROM customers;

UPPER โ€” Convert string to all uppercase:

-- Normalize country codes to uppercase for consistent grouping
SELECT
    UPPER(country)                                  AS country_upper,
    COUNT(*)                                        AS customer_count
FROM customers
GROUP BY UPPER(country);
-- Prevents 'india', 'India', 'INDIA' from being counted separately

LOWER โ€” Convert string to all lowercase:

-- Normalize email addresses for deduplication
SELECT
    LOWER(TRIM(email))                              AS normalized_email,
    COUNT(*)                                        AS occurrences
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;                               -- find duplicate emails

Q10. Write a query to extract the domain name from an email column (e.g., 'user@gmail.com' โ†’ 'gmail.com'). [4 Marks]

Answer:

MySQL:

SELECT
    email,
    -- SUBSTRING_INDEX splits by delimiter and returns n-th part
    SUBSTRING_INDEX(email, '@', -1)                 AS email_domain
    -- -1 means: take everything AFTER the last '@'
FROM customers;

PostgreSQL:

SELECT
    email,
    -- Method 1: SPLIT_PART splits by delimiter and picks the n-th chunk
    SPLIT_PART(email, '@', 2)                       AS email_domain,
    -- Method 2: SUBSTRING with POSITION
    SUBSTRING(email FROM POSITION('@' IN email) + 1) AS email_domain_v2
FROM customers;

SQL Server:

SELECT
    email,
    -- CHARINDEX finds the position of '@', then SUBSTRING extracts from that point
    SUBSTRING(
        email,
        CHARINDEX('@', email) + 1,                  -- start after '@'
        LEN(email) - CHARINDEX('@', email)          -- length = remaining characters
    )                                               AS email_domain
FROM customers;

Generic pattern (adjust the position function per dialect — LOCATE in MySQL, POSITION(... IN ...) in PostgreSQL, CHARINDEX in SQL Server):

-- Using SUBSTRING + position functions
SELECT
    email,
    SUBSTRING(email,
        LOCATE('@', email) + 1,                     -- MySQL: LOCATE
        LENGTH(email)
    )                                               AS email_domain
FROM customers;

With validation (skip malformed emails):

SELECT
    email,
    SUBSTRING_INDEX(email, '@', -1)                 AS email_domain
FROM customers
WHERE email LIKE '%@%.%'                            -- basic email format check
  AND email NOT LIKE '%@%@%';                       -- no double @ signs

Q11. Write a query to split a full_name column into first_name and last_name. [3 Marks]

Answer:

Assumption: Table customers(customer_id, full_name) exists. Names are in "FirstName LastName" format separated by a single space.

MySQL:

SELECT
    customer_id,
    full_name,
    -- First name: everything before the first space
    SUBSTRING_INDEX(full_name, ' ', 1)              AS first_name,
    -- Last name: everything after the first space
    SUBSTRING_INDEX(full_name, ' ', -1)             AS last_name,
    -- Middle name safe version (if name has 3 parts)
    CASE
        WHEN LENGTH(full_name) - LENGTH(REPLACE(full_name, ' ', '')) >= 2
        THEN SUBSTRING_INDEX(SUBSTRING_INDEX(full_name, ' ', 2), ' ', -1)
        ELSE NULL
    END                                             AS middle_name
FROM customers;

PostgreSQL:

SELECT
    customer_id,
    full_name,
    -- First name: text before first space
    SPLIT_PART(full_name, ' ', 1)                   AS first_name,
    -- Last name: text after first space (handles middle names too)
    CASE
        WHEN POSITION(' ' IN full_name) > 0
        THEN SUBSTRING(
            full_name FROM POSITION(' ' IN full_name) + 1
        )
        ELSE full_name                              -- single name fallback
    END                                             AS last_name
FROM customers;

SQL Server:

SELECT
    customer_id,
    full_name,
    -- First name: up to first space
    LEFT(full_name, CHARINDEX(' ', full_name + ' ') - 1)
                                                    AS first_name,
    -- Last name: after first space
    CASE
        WHEN CHARINDEX(' ', full_name) > 0
        THEN SUBSTRING(
            full_name,
            CHARINDEX(' ', full_name) + 1,
            LEN(full_name)
        )
        ELSE NULL
    END                                             AS last_name
FROM customers;

Edge case handling: The + ' ' trick in SQL Server (CHARINDEX(' ', full_name + ' ')) ensures that even single-word names don't return 0 from CHARINDEX, preventing errors in LEFT().
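A minimal check of that trick with a single-word name (SQL Server):

```sql
-- 'Madonna' has no space; appending one makes CHARINDEX return 8,
-- so LEFT('Madonna', 7) returns the whole name instead of erroring
SELECT LEFT('Madonna', CHARINDEX(' ', 'Madonna' + ' ') - 1) AS first_name;
```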


Q12. What is the difference between CHAR, VARCHAR, and TEXT data types? [3 Marks]

Answer:

| Feature | CHAR(n) | VARCHAR(n) | TEXT |
|---------|---------|------------|------|
| Storage type | Fixed-length: always uses exactly n characters of storage | Variable-length: only the bytes needed + 1-2 byte length prefix | Variable-length, no declared max length |
| Max length | Up to 255 characters (MySQL) | Up to 65,535 bytes (MySQL) / 8,000 chars (SQL Server) | Up to 65,535 bytes (MySQL) / unlimited (PostgreSQL) |
| Trailing spaces | Pads with spaces to fill n length | Stores as-is, no padding | Stores as-is |
| Performance | Slightly faster for fixed-size data (no length calculation) | Efficient for variable data | Slower for indexing; not ideal for WHERE clauses |
| Indexing | ✅ Fully indexable | ✅ Fully indexable | ⚠️ Limited or prefix-only indexing |
| Best use case | Fixed codes (country code, gender flag, status) | Names, emails, addresses (variable but bounded) | Long descriptions, comments, article body |
-- CHAR(2): always stores exactly 2 characters; shorter values are space-padded
CREATE TABLE countries (
    country_code    CHAR(2),       -- 'US', 'IN', 'GB' โ€” always 2 chars
    country_name    VARCHAR(100),  -- 'United States', 'India' โ€” variable
    description     TEXT           -- long free-text field, unbounded
);

-- Demonstration: trailing (padding) spaces are counted until trimmed
SELECT
    CHAR_LENGTH('IN ')            AS padded_length,   -- 3 (trailing space counted)
    CHAR_LENGTH(TRIM('IN '))      AS trimmed_length;  -- 2

Storage comparison for the value 'Hello' (5 characters):

| Type | Declared as | Bytes used |
|------|-------------|------------|
| CHAR(10) | Fixed | 10 bytes always (pads with 5 trailing spaces) |
| VARCHAR(100) | Variable | 6 bytes (5 chars + 1 length byte) |
| TEXT | Variable | 5 chars + length prefix, plus possible off-row storage overhead |

Rule of thumb: Use CHAR for truly fixed-length codes (ISO country codes, gender flags, UUID in some systems). Use VARCHAR for most text columns where length varies but has a reasonable maximum. Use TEXT only for unbounded long-form content like descriptions or comments โ€” avoid using it as a join key or in WHERE clauses with full indexes.
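The prefix-only indexing limitation on TEXT can be illustrated in MySQL, where an index on a TEXT column must declare a prefix length (index name here is illustrative, reusing the countries table above):

```sql
-- MySQL: a TEXT column requires a prefix length in the index definition;
-- only the first 50 characters participate in the index
CREATE INDEX idx_desc_prefix ON countries (description(50));
```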


Q13. Write a query to find customers whose email domain is in the top 5 most common domains. [5 Marks]

Answer:

Assumption: Table customers(customer_id, name, email) exists.

MySQL:

WITH email_domains AS (
    -- Step 1: Extract domain from each customer's email
    SELECT
        customer_id,
        name,
        email,
        SUBSTRING_INDEX(email, '@', -1)             AS domain
    FROM customers
    WHERE email LIKE '%@%.%'                        -- basic format validation
),
domain_counts AS (
    -- Step 2: Count customers per domain
    SELECT
        domain,
        COUNT(*)                                    AS customer_count
    FROM email_domains
    GROUP BY domain
),
top_5_domains AS (
    -- Step 3: Rank domains and keep top 5
    SELECT domain
    FROM domain_counts
    ORDER BY customer_count DESC
    LIMIT 5
)
-- Step 4: Return customers whose domain is in top 5
SELECT
    ed.customer_id,
    ed.name,
    ed.email,
    ed.domain,
    dc.customer_count                               AS domain_user_count
FROM email_domains ed
JOIN domain_counts dc ON ed.domain = dc.domain
WHERE ed.domain IN (SELECT domain FROM top_5_domains)
ORDER BY dc.customer_count DESC, ed.name ASC;

PostgreSQL (using DENSE_RANK for ties):

WITH email_domains AS (
    SELECT
        customer_id,
        name,
        email,
        SPLIT_PART(email, '@', 2)                   AS domain
    FROM customers
    WHERE email LIKE '%@%.%'
),
domain_ranked AS (
    SELECT
        domain,
        COUNT(*)                                    AS customer_count,
        DENSE_RANK() OVER (
            ORDER BY COUNT(*) DESC
        )                                           AS domain_rank
    FROM email_domains
    GROUP BY domain
)
SELECT
    ed.customer_id,
    ed.name,
    ed.email,
    ed.domain,
    dr.customer_count,
    dr.domain_rank
FROM email_domains ed
JOIN domain_ranked dr
    ON ed.domain = dr.domain
WHERE dr.domain_rank <= 5                           -- handles ties correctly
ORDER BY dr.domain_rank, ed.name;

Why DENSE_RANK over LIMIT 5: If two domains share the 5th highest count (e.g., both have 100 customers), LIMIT 5 arbitrarily drops one. DENSE_RANK() <= 5 includes all tied domains fairly โ€” a better approach for business reporting.
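The tie behavior is easy to see with literal counts (PostgreSQL VALUES list; the numbers are made up for illustration):

```sql
-- Two domains share the 5th-highest count
SELECT domain, cnt,
       DENSE_RANK() OVER (ORDER BY cnt DESC) AS rnk
FROM (VALUES
    ('gmail.com', 500), ('yahoo.com', 400), ('outlook.com', 300),
    ('proton.me', 200), ('aol.com', 100), ('gmx.com', 100)
) AS t(domain, cnt);
-- aol.com and gmx.com both get rnk = 5, so WHERE rnk <= 5 keeps both;
-- ORDER BY cnt DESC LIMIT 5 would arbitrarily drop one of them
```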


Q14. Write a query to find the average time (in days) between a customer's first and second order. [5 Marks]

Answer:

Assumption: Tables customers(customer_id, name) and orders(order_id, customer_id, order_date, amount) exist.

MySQL:

WITH ranked_orders AS (
    -- Step 1: Rank each customer's orders chronologically
    SELECT
        customer_id,
        order_id,
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date ASC,
                     order_id   ASC                 -- tiebreak by order_id if same date
        )                                           AS order_rank
    FROM orders
),
first_and_second AS (
    -- Step 2: Pivot first and second order dates into same row
    SELECT
        customer_id,
        MAX(CASE WHEN order_rank = 1 THEN order_date END)   AS first_order_date,
        MAX(CASE WHEN order_rank = 2 THEN order_date END)   AS second_order_date
    FROM ranked_orders
    WHERE order_rank <= 2
    GROUP BY customer_id
    HAVING MAX(CASE WHEN order_rank = 2 THEN order_date END) IS NOT NULL
    -- HAVING filters out customers with only 1 order
),
days_between AS (
    -- Step 3: Calculate days between first and second order per customer
    SELECT
        customer_id,
        first_order_date,
        second_order_date,
        DATEDIFF(second_order_date, first_order_date)       AS days_to_second_order
    FROM first_and_second
)
-- Step 4: Average across all qualifying customers
SELECT
    ROUND(AVG(days_to_second_order), 2)             AS avg_days_first_to_second_order,
    MIN(days_to_second_order)                       AS min_days,
    MAX(days_to_second_order)                       AS max_days,
    COUNT(*)                                        AS customers_with_2plus_orders
FROM days_between;

PostgreSQL:

WITH ranked_orders AS (
    SELECT
        customer_id,
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date ASC, order_id ASC
        )                                           AS order_rank
    FROM orders
),
first_and_second AS (
    SELECT
        customer_id,
        MAX(CASE WHEN order_rank = 1 THEN order_date END)   AS first_order_date,
        MAX(CASE WHEN order_rank = 2 THEN order_date END)   AS second_order_date
    FROM ranked_orders
    WHERE order_rank <= 2
    GROUP BY customer_id
    HAVING MAX(CASE WHEN order_rank = 2 THEN order_date END) IS NOT NULL
)
SELECT
    ROUND(AVG(
        second_order_date::DATE - first_order_date::DATE
    )::NUMERIC, 2)                                  AS avg_days_first_to_second_order,
    MIN(second_order_date - first_order_date)       AS min_days,
    MAX(second_order_date - first_order_date)       AS max_days,
    COUNT(*)                                        AS customers_with_2plus_orders
FROM first_and_second;

With per-customer breakdown (join customers for detail view):

-- Add this final SELECT to the MySQL query above (it reuses the days_between CTE;
-- define an equivalent CTE in the PostgreSQL version first):
SELECT
    c.customer_id,
    c.name,
    db.first_order_date,
    db.second_order_date,
    db.days_to_second_order,
    CASE
        WHEN db.days_to_second_order <= 7   THEN 'Rapid Repeat (โ‰ค7 days)'
        WHEN db.days_to_second_order <= 30  THEN 'Quick Repeat (8-30 days)'
        WHEN db.days_to_second_order <= 90  THEN 'Normal Repeat (31-90 days)'
        ELSE                                     'Slow Repeat (>90 days)'
    END                                         AS repeat_segment
FROM customers c
JOIN days_between db ON c.customer_id = db.customer_id
ORDER BY db.days_to_second_order ASC;

Step-by-step logic:

| CTE | Purpose |
|-----|---------|
| ranked_orders | Numbers each customer's orders 1, 2, 3... chronologically |
| first_and_second | Pivots ranks 1 and 2 into columns; excludes one-time buyers via HAVING |
| days_between | Computes the day difference per customer |
| Final SELECT | Averages across all qualifying customers with 2+ orders |

Why exclude one-time buyers with HAVING: Customers with only one order have no second order date โ€” their second_order_date is NULL. Including them in AVG() would not affect the average (NULL is ignored by AVG) but the HAVING clause makes the exclusion explicit and the row count (customers_with_2plus_orders) meaningful.
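As an alternative sketch, LEAD() computes the same gap in a single pass over the ranked rows (PostgreSQL; same orders table assumed):

```sql
WITH gaps AS (
    SELECT
        customer_id,
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_date, order_id
        )                                           AS rn,
        LEAD(order_date) OVER (
            PARTITION BY customer_id ORDER BY order_date, order_id
        )                                           AS next_order_date
    FROM orders
)
SELECT ROUND(AVG(next_order_date - order_date)::NUMERIC, 2)
           AS avg_days_first_to_second_order
FROM gaps
WHERE rn = 1                          -- first order per customer
  AND next_order_date IS NOT NULL;    -- customer actually has a second order
```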


๐Ÿ“… Day 8: Query Optimization, Indexes & Execution Plans

Advanced 50 Marks 90 Min



Q1. Explain the difference between a clustered and non-clustered index. Can a table have multiple clustered indexes? [4 Marks]

Answer:

| Feature | Clustered Index | Non-Clustered Index |
|---------|-----------------|---------------------|
| Data storage | Physically reorders table rows to match index order | Stores a separate structure with pointers back to actual rows |
| Number per table | Only ONE per table | Up to 999 (SQL Server); bounded elsewhere (e.g., 64 secondary indexes per InnoDB table in MySQL) |
| Lookup speed | Fastest for range scans: data is physically ordered | Requires an extra lookup step (row pointer) after finding the index entry |
| Storage overhead | No extra storage: it IS the table | Requires additional disk space for the index structure |
| Best for | Primary keys, range queries, ORDER BY columns | Columns used in WHERE, JOIN, or covering queries |

Clustered Index โ€” the table IS the index:

-- SQL Server: Primary key automatically creates a clustered index
CREATE TABLE orders (
    order_id    INT         PRIMARY KEY,    -- clustered index by default
    customer_id INT         NOT NULL,
    order_date  DATE        NOT NULL,
    amount      DECIMAL(10,2)
);

-- Explicitly create a clustered index (SQL Server)
CREATE CLUSTERED INDEX idx_orders_date
ON orders (order_date);

Non-Clustered Index โ€” separate structure with row pointers:

-- Non-clustered index on customer_id for fast JOIN/WHERE lookups
CREATE NONCLUSTERED INDEX idx_orders_customer
ON orders (customer_id);

-- PostgreSQL (all indexes are non-clustered by default)
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_date     ON orders (order_date);

Can a table have multiple clustered indexes?
No. A table can have only one clustered index because the clustered index defines the physical storage order of the rows themselves โ€” you cannot physically sort a table in two different orders simultaneously.

PostgreSQL note: PostgreSQL does not have a "clustered index" in the SQL Server sense. All indexes in PostgreSQL are non-clustered by default. The CLUSTER command can physically reorder a table by an index once, but it does not maintain that order as data changes.
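The one-time reorder looks like this (PostgreSQL, assuming the idx_orders_date index created above):

```sql
-- Physically rewrite the table in idx_orders_date order; a one-time operation,
-- subsequent inserts are NOT kept in this order
CLUSTER orders USING idx_orders_date;
```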


Q2. What is a composite index? When is the order of columns in a composite index important? [4 Marks]

Answer: A composite index (also called a multi-column or compound index) is an index built on two or more columns of a table. It allows the database to efficiently filter, sort, or look up rows based on combinations of those columns.

-- Creating a composite index
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);

Column order matters critically โ€” the "leftmost prefix rule":

The database can use a composite index only if the query filters on columns starting from the leftmost column in the index definition. Skipping the leading column generally renders the index unusable (a few engines have narrow exceptions, such as the index skip scan in MySQL 8.0).

-- Index: (customer_id, order_date, status)

-- โœ… Uses the full index โ€” all three columns, left to right
SELECT * FROM orders
WHERE customer_id = 101
  AND order_date = '2024-01-15'
  AND status = 'Shipped';

-- โœ… Uses the index partially โ€” first two columns (leftmost prefix)
SELECT * FROM orders
WHERE customer_id = 101
  AND order_date BETWEEN '2024-01-01' AND '2024-03-31';

-- โœ… Uses the index โ€” first column only
SELECT * FROM orders
WHERE customer_id = 101;

-- โŒ Cannot use the index โ€” skips customer_id (the leading column)
SELECT * FROM orders
WHERE order_date = '2024-01-15';

-- โŒ Cannot use the index โ€” skips leading two columns
SELECT * FROM orders
WHERE status = 'Shipped';

Column order design rules:

| Rule | Guidance |
|------|----------|
| Equality columns first | Columns filtered with = should come before range (>, <, BETWEEN) columns |
| Highest cardinality first | Columns with more unique values (e.g., customer_id) before low-cardinality ones (e.g., status) |
| Most frequently filtered first | The column used in the most WHERE clauses goes first |
| Range column last | A range filter stops the index from being used for subsequent columns |

-- Good order: equality (customer_id) then range (order_date)
CREATE INDEX idx_good ON orders (customer_id, order_date);

-- Bad order: range (order_date) first โ€” prevents customer_id from benefiting
CREATE INDEX idx_bad  ON orders (order_date, customer_id);

Q3. What are the downsides of having too many indexes on a table? [3 Marks]

Answer: While indexes speed up SELECT queries, they introduce significant overhead for write operations and resource consumption. Every index on a table must be maintained whenever data changes.

1. Write performance degradation:

-- Every INSERT, UPDATE, DELETE must update ALL indexes on the table
-- A table with 10 indexes requires 10 index updates per row modification

INSERT INTO orders (order_id, customer_id, order_date, amount, status)
VALUES (9001, 42, CURDATE(), 5000, 'Pending');
-- If orders has 8 indexes, this single INSERT triggers 8 index maintenance operations

2. Increased disk storage:

-- Each non-clustered index stores a copy of indexed columns + row pointers
-- A table with 10M rows and 10 composite indexes could use
-- 3-5x more disk space than the base table alone

3. Increased memory pressure:

-- The database buffer pool caches frequently accessed index pages
-- Too many indexes compete for limited RAM โ€” pushing out useful cached data
-- Results in more disk I/O (cache misses) for all queries

4. Query optimizer confusion:

-- With many indexes, the optimizer must evaluate more execution plan choices
-- EXPLAIN / EXPLAIN ANALYZE becomes complex
-- The optimizer may occasionally pick a suboptimal index (statistics-dependent)
-- EXPLAIN SELECT * FROM orders WHERE customer_id = 101 AND status = 'Shipped';
-- With 10 indexes, optimizer must evaluate many candidate index combinations

5. Index maintenance overhead:

-- Over time, indexes become fragmented (especially B-tree indexes on insert-heavy tables)
-- Regular REBUILD or REORGANIZE is needed
-- SQL Server:
ALTER INDEX idx_orders_customer ON orders REBUILD;
-- PostgreSQL:
REINDEX INDEX idx_orders_customer;

Summary โ€” the cost-benefit balance:

| Operation | Effect of More Indexes |
|-----------|------------------------|
| SELECT | ✅ Faster (potentially) |
| INSERT | ❌ Slower |
| UPDATE | ❌ Slower (especially on indexed columns) |
| DELETE | ❌ Slower |
| Disk usage | ❌ Higher |
| Memory usage | ❌ Higher |

Rule of thumb: Aim for the minimum number of indexes that satisfies your query performance requirements. Regularly audit unused indexes using sys.dm_db_index_usage_stats (SQL Server) or pg_stat_user_indexes (PostgreSQL) and drop indexes that are never used.
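A hedged sketch of such an audit in PostgreSQL (idx_scan counts accumulate since the last statistics reset, so interpret zeros with care):

```sql
-- Indexes with zero scans since the last stats reset are removal candidates
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;
```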


Q4. What is a covering index? How does it improve query performance? [4 Marks]

Answer: A covering index is an index that contains all the columns a query needs โ€” both for filtering (WHERE/JOIN) and for the output (SELECT) โ€” so the database can satisfy the entire query from the index alone, without ever accessing the base table rows.

The problem without a covering index:

-- Query: find order amounts for a specific customer
SELECT order_date, amount
FROM orders
WHERE customer_id = 101;

-- If the index is only on (customer_id):
-- Step 1: Index seek on customer_id = 101 โ†’ finds row pointers
-- Step 2: For EACH pointer, go back to the base table to fetch order_date and amount
-- This "row lookup" (called a Key Lookup or RID Lookup) is expensive at scale

Covering index โ€” includes all needed columns:

-- Include the SELECT columns in the index so no table lookup is needed
-- SQL Server: use INCLUDE for non-key columns
CREATE NONCLUSTERED INDEX idx_covering_customer_orders
ON orders (customer_id)                     -- filter/seek column
INCLUDE (order_date, amount, status);       -- additional columns needed by SELECT

-- MySQL / PostgreSQL: list all columns in the index directly
CREATE INDEX idx_covering_customer_orders
ON orders (customer_id, order_date, amount, status);

With the covering index:

SELECT order_date, amount
FROM orders
WHERE customer_id = 101;
-- Step 1: Index seek on customer_id = 101
-- Step 2: Read order_date and amount directly from index leaf pages
-- NO trip back to the base table โ€” "Index Only Scan" or "Covering Index" in EXPLAIN

How to verify a covering index is being used:

-- PostgreSQL: look for "Index Only Scan"
EXPLAIN ANALYZE
SELECT order_date, amount
FROM orders
WHERE customer_id = 101;
-- Output: Index Only Scan using idx_covering_customer_orders on orders

-- SQL Server: look for absence of "Key Lookup" in execution plan
-- If Key Lookup appears โ†’ the index is NOT covering โ†’ add missing columns to INCLUDE

When to use INCLUDE vs adding to the key:

-- SQL Server best practice:
-- Key columns (in index): columns used in WHERE, JOIN, ORDER BY
-- INCLUDE columns: columns only needed in SELECT โ€” not used for filtering

CREATE NONCLUSTERED INDEX idx_orders_opt
ON orders (customer_id, order_date)         -- key: used in WHERE and ORDER BY
INCLUDE (amount, status);                   -- included: only needed in SELECT output

Performance impact: Covering indexes can reduce query execution time by 10x-100x on large tables by eliminating the most expensive operation in index lookups โ€” the random I/O row fetch from the base table. They are one of the most powerful single-step optimizations available for read-heavy workloads.


Q5. Explain why 'SELECT *' is considered bad practice in production queries. [3 Marks]

Answer: While SELECT * is convenient for exploration and debugging, it causes several serious problems in production environments:

1. Fetches unnecessary data โ€” increased I/O and network transfer:

-- Table has 50 columns but query only needs 3
-- SELECT * transfers all 50 columns across the network for every row
SELECT * FROM customers;                -- โŒ Bad: transfers all 50 columns

SELECT customer_id, name, email        -- โœ… Good: only the 3 needed columns
FROM customers;

2. Breaks application code when schema changes:

-- Application code: result[4] assumes the 5th column is 'email'
-- Someone adds a column between name and email โ†’ result[4] is now wrong
SELECT * FROM customers;               -- โŒ Column positions shift on schema change

-- Explicit column list is immune to column additions/reorderings
SELECT customer_id, name, email FROM customers;  -- โœ… Always gets the right columns

3. Prevents covering index usage:

-- Perfect covering index exists: (customer_id, name, email)
CREATE INDEX idx_covering ON customers (customer_id, name, email);

-- SELECT * cannot be served by the covering index
-- because it needs ALL columns โ€” forces a full table scan or key lookup
SELECT * FROM customers WHERE customer_id = 101;       -- โŒ Covering index unused

SELECT customer_id, name, email                        -- โœ… Covered by index
FROM customers WHERE customer_id = 101;                -- Index Only Scan

4. Exposes sensitive columns accidentally:

-- Table contains: name, email, password_hash, ssn, credit_card_last4
SELECT * FROM customers;   -- โŒ Exposes sensitive data to any consumer of this query
                           -- API response might leak password_hash or SSN

5. Increased memory consumption:

-- Query processing buffers all fetched columns in memory
-- SELECT * on a 100-column, 10M row table can exhaust buffer pool
-- Pushes other useful cached pages out of memory

Rule: Always list only the columns you actually need in production queries. Reserve SELECT * for quick exploratory queries; SELECT COUNT(*) is a separate, special case that optimizers handle without fetching column data.
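The schema-change hazard in point 2 can be reproduced concretely. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine; the table, column names, and sample data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema v1: email is the 3rd column, so result[2] happens to be email
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'asha@example.com')")
v1_row = conn.execute("SELECT * FROM customers").fetchone()
print(v1_row[2])        # asha@example.com -- positional access works, for now

# Schema v2: a migration rebuilds the table with a phone column inserted
# between name and email
conn.execute("DROP TABLE customers")
conn.execute("CREATE TABLE customers "
             "(customer_id INTEGER, name TEXT, phone TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha', '555-0100', 'asha@example.com')")

v2_row = conn.execute("SELECT * FROM customers").fetchone()
print(v2_row[2])        # 555-0100 -- silently wrong, no error raised

# An explicit column list survives the schema change unharmed
safe_row = conn.execute("SELECT customer_id, name, email FROM customers").fetchone()
print(safe_row[2])      # asha@example.com -- always the intended column
```

Note that the wrong value arrives without any error or warning, which is exactly why this class of bug tends to reach production.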


Q6. What is the impact of using functions on indexed columns in the WHERE clause (e.g., WHERE YEAR(date_col) = 2024)? How do you rewrite it? [4 Marks]

Answer: When you wrap an indexed column inside a function in the WHERE clause, the database cannot use the index on that column. The function must be evaluated for every single row in the table, forcing a full table scan โ€” even if a perfect index exists.

Why it breaks index usage:

-- Index exists on order_date
CREATE INDEX idx_order_date ON orders (order_date);

-- โŒ SLOW: Function applied to indexed column โ€” index is NOT used
-- Database must compute YEAR() for every row to evaluate the condition
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- Execution plan: Table Scan (reads all rows)

-- โŒ Also breaks index usage:
WHERE MONTH(order_date) = 3
WHERE LOWER(email) = 'test@example.com'
WHERE DATE_FORMAT(order_date, '%Y-%m') = '2024-01'
WHERE UPPER(status) = 'ACTIVE'
WHERE LEFT(phone, 3) = '022'

Fix โ€” rewrite to compare the raw column against a range or literal:

-- โœ… FAST: Rewrite YEAR(order_date) = 2024 as a date range
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date <  '2025-01-01';
-- Execution plan: Index Range Scan on idx_order_date โœ…

-- โœ… FAST: Rewrite MONTH(order_date) = 3 (for year-agnostic monthly query)
WHERE order_date >= '2024-03-01'
  AND order_date <  '2024-04-01';

-- โœ… FAST: Rewrite LOWER(email) = '...' โ€” store emails pre-normalized
-- Option 1: Ensure emails are always stored lowercase (data hygiene)
WHERE email = 'test@example.com';

-- Option 2: Create a function-based index (PostgreSQL / Oracle)
CREATE INDEX idx_email_lower ON customers (LOWER(email));
WHERE LOWER(email) = 'test@example.com';  -- now uses the functional index โœ…

-- โœ… FAST: Rewrite DATE_FORMAT grouping using TRUNC
-- PostgreSQL
WHERE order_date >= DATE_TRUNC('month', '2024-01-01'::DATE)
  AND order_date <  DATE_TRUNC('month', '2024-02-01'::DATE);

Verification with EXPLAIN:

-- Before fix: shows Table Scan or Seq Scan
EXPLAIN SELECT * FROM orders WHERE YEAR(order_date) = 2024;

-- After fix: shows Index Range Scan or Index Seek
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

Golden rule: Never apply a transformation to an indexed column in a WHERE clause. Instead, move the transformation to the constant side of the comparison, or use a function-based index when the transformation is unavoidable.
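This behavior can be observed directly with EXPLAIN QUERY PLAN in SQLite, used here as a lightweight stand-in for the EXPLAIN output above (strftime('%Y', ...) plays the role of YEAR(); the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE INDEX idx_order_date ON orders (order_date)")

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN output is the human-readable detail
    return [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function wrapped around the indexed column: the index cannot be used
scan = plan("SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")
print(scan)   # e.g. ['SCAN orders'] -- full table scan

# Sargable rewrite as a range on the raw column: index range scan
seek = plan("SELECT * FROM orders "
            "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")
print(seek)   # e.g. ['SEARCH orders USING INDEX idx_order_date (...)']
```

The exact plan wording varies by SQLite version, but the SCAN-versus-SEARCH distinction is the same story EXPLAIN tells in MySQL and PostgreSQL.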


Q7. What is a query execution plan? How do you read it? [4 Marks]

Answer: A query execution plan (also called an explain plan) is the database engine's step-by-step strategy for executing a SQL query. It shows which operations are performed, in what order, which indexes are used, how many rows are estimated vs actually processed, and where the cost is concentrated.

How to generate an execution plan:

-- MySQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 101;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 101;  -- also runs the query

-- PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 101;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE customer_id = 101;

-- SQL Server
-- In SSMS: press Ctrl+M before running (graphical plan)
-- Or use:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 101;

Key fields to read in MySQL EXPLAIN output:

| Column | What it means | Red flags |
|---|---|---|
| type | Join/access type — how rows are found | ALL = full table scan ❌ |
| key | Index actually used | NULL = no index used ❌ |
| rows | Estimated rows examined | Very high number = expensive |
| Extra | Additional info | Using filesort, Using temporary ❌ |
| possible_keys | Indexes the optimizer considered | Helps diagnose index misses |

Access type hierarchy (best to worst in MySQL):

const > eq_ref > ref > range > index > ALL
  โ†‘                                    โ†‘
fastest                             slowest (full scan)

Reading a PostgreSQL EXPLAIN ANALYZE output:

EXPLAIN ANALYZE
SELECT c.name, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name;

-- Sample output interpretation:
-- HashAggregate (cost=1842.50..1867.50 rows=500 width=40)
--   (actual time=145.2..148.3 rows=487 loops=1)
--   ->  Hash Join  (cost=450.00..1792.50 rows=10000)
--       Hash Cond: (o.customer_id = c.customer_id)
--       ->  Index Scan using idx_order_date on orders
--           Index Cond: (order_date >= '2024-01-01')
--       ->  Seq Scan on customers (cost=0..35.00 rows=500)

What to look for when optimizing:

-- ๐Ÿ”ด Warning signs in execution plans:
-- 1. Table Scan / Seq Scan on large tables โ†’ missing index
-- 2. Key Lookup / RID Lookup          โ†’ non-covering index, add INCLUDE columns
-- 3. Hash Match / Sort               โ†’ missing ORDER BY index or GROUP BY index
-- 4. Estimated rows << Actual rows   โ†’ stale statistics, run ANALYZE/UPDATE STATISTICS
-- 5. Nested Loops on large tables    โ†’ consider Hash Join, check join column indexes

Fix stale statistics (affects plan accuracy):

-- MySQL
ANALYZE TABLE orders;

-- PostgreSQL
ANALYZE orders;

-- SQL Server
UPDATE STATISTICS orders;

Q8. Explain the difference between a table scan, index scan, and index seek. [4 Marks]

Answer:

| Operation | What it does | Reads | Speed | When it occurs |
|---|---|---|---|---|
| Table Scan (Full Scan) | Reads every row in the table from start to end | All rows | 🔴 Slowest | No usable index, SELECT *, very low cardinality filter |
| Index Scan | Reads all or most entries in an index sequentially | All index pages | 🟡 Medium | Index exists but filter is broad; or SELECT needs ordered results |
| Index Seek | Navigates the B-tree index directly to the exact matching rows | Only matching index pages | 🟢 Fastest | High-cardinality equality or narrow range filter on indexed column |

Table Scan โ€” reads every row:

-- No index on 'notes' column โ€” must read every row
SELECT * FROM orders WHERE notes LIKE '%urgent%';
-- EXPLAIN output: type = ALL (MySQL), Seq Scan (PostgreSQL)

-- Also occurs when:
-- 1. The filter returns >30% of rows (optimizer prefers scan over index)
-- 2. No index exists on the filtered column
-- 3. Function applied to indexed column defeats index usage

Index Scan โ€” traverses index but reads many entries:

-- Index on status column, but 'Active' represents 80% of rows
SELECT customer_id FROM customers WHERE status = 'Active';
-- Optimizer may choose Index Scan (reads most of the index)
-- or even revert to Table Scan if selectivity is too low
-- EXPLAIN output: type = index (MySQL), Index Scan (PostgreSQL)

Index Seek โ€” pinpoints exact rows via B-tree navigation:

-- High cardinality column with equality filter
-- Index on customer_id
SELECT order_id, amount FROM orders WHERE customer_id = 101;
-- B-tree navigation: root โ†’ branch โ†’ leaf โ†’ exact rows
-- EXPLAIN output: type = ref or const (MySQL), Index Scan with specific rows (PostgreSQL)

-- Range seek on indexed date column
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- EXPLAIN output: type = range (MySQL), Index Scan with bounds (PostgreSQL)

Visual โ€” B-tree index seek navigation:

          [Root Node]
         /            \
   [Branch 1]      [Branch 2]
   id: 1โ€“500       id: 501โ€“1000
      |                  |
  [Leaf Pages]      [Leaf Pages]
  id=101 โ†’ row     id=501 โ†’ row

Seek for customer_id = 101:
Root โ†’ Branch 1 โ†’ Leaf Page โ†’ Row pointer โ†’ Fetch row
Only 3 pages read instead of thousands

Optimization goal: Always aim for Index Seek on large tables. If EXPLAIN shows Table Scan or Index Scan on a frequently run query against a large table, investigate whether an appropriate index is missing or whether the query is defeating an existing index through functions or type mismatches.
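The scan-versus-seek distinction shows up directly in SQLite's EXPLAIN QUERY PLAN output, which makes for a quick self-contained demo (table and index names are invented; the plan wording varies slightly by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, notes TEXT);
CREATE INDEX idx_customer ON orders (customer_id);
""")

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN output is the human-readable detail
    return [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Leading-wildcard LIKE on an unindexed column: full table scan
table_scan = plan("SELECT * FROM orders WHERE notes LIKE '%urgent%'")
print(table_scan)   # e.g. ['SCAN orders']

# Equality on an indexed, selective column: B-tree seek
index_seek = plan("SELECT order_id FROM orders WHERE customer_id = 101")
print(index_seek)   # e.g. ['SEARCH orders USING INDEX idx_customer (customer_id=?)']
```

SQLite calls a seek a SEARCH, MySQL reports it as type = ref/const, and PostgreSQL as an Index Scan with a tight Index Cond; the underlying B-tree navigation is the same.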


Q9. Explain how EXISTS vs IN can differ in performance for large datasets. [4 Marks]

Answer:

| Behavior | IN | EXISTS |
|---|---|---|
| Execution strategy | Collects all matching values from subquery first, then checks membership | Runs subquery row by row and stops at first match (short-circuits) |
| NULL handling | Dangerous — if subquery returns any NULL, NOT IN returns no rows | Safe — EXISTS only checks row existence, NULLs don't affect it |
| Best for | Small subquery result sets | Large subquery result sets; correlated conditions |
| Correlated subquery | ❌ Not correlated — evaluates once | ✅ Naturally correlated — can reference outer query |
| Duplicate handling | IN deduplicates internally | EXISTS stops after first match — duplicates irrelevant |

Performance scenarios:

Scenario 1 โ€” Small subquery (IN is fine):

-- Subquery returns only a few rows โ€” IN is efficient
SELECT name FROM customers
WHERE customer_id IN (101, 102, 103);    -- or small static list

-- Also fine: small reference table
SELECT name FROM customers
WHERE country_code IN (
    SELECT code FROM active_countries    -- 50 rows โ€” negligible
);

Scenario 2 โ€” Large subquery (EXISTS wins):

-- โŒ IN: collects ALL order customer_ids (millions) into memory, then checks each customer
SELECT name FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders       -- 10 million rows loaded into memory first
    WHERE amount > 1000
);

-- โœ… EXISTS: for each customer, stops scanning orders as soon as first match is found
SELECT name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000               -- stops at first matching order per customer
);

Scenario 3 โ€” NULL trap with NOT IN:

-- โŒ DANGEROUS: if subquery returns even ONE NULL, NOT IN returns zero rows
SELECT name FROM customers
WHERE customer_id NOT IN (
    SELECT manager_id FROM employees     -- if ANY manager_id is NULL โ†’ returns nothing
);

-- โœ… SAFE: NOT EXISTS handles NULLs correctly
SELECT name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM employees e
    WHERE e.manager_id = c.customer_id
);

-- โœ… SAFE fix for NOT IN: explicitly exclude NULLs
WHERE customer_id NOT IN (
    SELECT manager_id FROM employees
    WHERE manager_id IS NOT NULL        -- critical safety addition
);

Modern optimizers: Many modern databases (PostgreSQL 9.x+, SQL Server 2014+, MySQL 8.x) can internally rewrite both IN and EXISTS as semi-joins, so their performance is often equivalent. However, the NULL behavior difference cannot be optimized away: NOT IN vs NOT EXISTS is always a correctness concern regardless of optimizer version.
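The NOT IN NULL trap from Scenario 3 is easy to reproduce. A minimal sketch using sqlite3 with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
CREATE TABLE employees (emp_id INTEGER, manager_id INTEGER);
INSERT INTO employees VALUES (10, 1), (11, NULL);   -- one NULL manager_id
""")

# NOT IN against a subquery containing a NULL: every comparison against the
# NULL yields UNKNOWN, so no row can satisfy the predicate
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id NOT IN (SELECT manager_id FROM employees)
""").fetchall()
print(not_in)        # [] -- Ben is wrongly missing

# NOT EXISTS only tests row existence, so the NULL cannot poison it
not_exists = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM employees e WHERE e.manager_id = c.customer_id
    )
""").fetchall()
print(not_exists)    # [('Ben',)] -- correct
```

Customer 2 should appear in both results; NOT IN silently drops it because 2 NOT IN (1, NULL) evaluates to UNKNOWN, not TRUE.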


Q10. What is query hoisting? Give an example of rewriting a subquery for better performance. [3 Marks]

Answer: Query hoisting (also called subquery unnesting or subquery flattening) is the optimization technique of moving a subquery out of its nested position into the main query โ€” typically converting it into a JOIN, CTE, or derived table. This allows the optimizer to create a more efficient execution plan, avoid repeated subquery execution, and leverage indexes more effectively.

Problem โ€” Correlated subquery executes once per outer row:

-- โŒ SLOW: correlated subquery runs for EVERY employee row (N executions)
SELECT
    emp_id,
    name,
    salary,
    (SELECT AVG(salary)
     FROM employees e2
     WHERE e2.department = e1.department)   AS dept_avg   -- re-executes per row
FROM employees e1;
-- On a table with 100,000 employees: subquery runs 100,000 times

Hoisted version โ€” compute once, JOIN:

-- โœ… FAST: hoist the subquery into a JOIN โ€” department averages computed once
SELECT
    e.emp_id,
    e.name,
    e.salary,
    dept.avg_salary                         AS dept_avg
FROM employees e
JOIN (
    SELECT
        department,
        AVG(salary)     AS avg_salary
    FROM employees
    GROUP BY department                     -- runs ONCE, produces ~10-50 rows
) AS dept
    ON e.department = dept.department;
-- Subquery runs once โ†’ result joined โ†’ massive improvement

Another example โ€” IN subquery hoisted to JOIN:

-- โŒ SLOW: IN subquery may be re-evaluated or loads large set into memory
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE amount > 10000
);

-- โœ… FAST: hoisted to JOIN (or EXISTS)
SELECT DISTINCT
    c.customer_id,
    c.name
FROM customers c
JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.amount > 10000;

CTE hoisting for repeated subquery use:

-- โŒ BAD: same subquery written three times in different parts of a query
SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM vip_customers)
  AND amount > (SELECT AVG(amount) FROM orders WHERE customer_id IN (SELECT customer_id FROM vip_customers));

-- โœ… GOOD: hoist into a CTE โ€” computed once, referenced multiple times
WITH vip_ids AS (
    SELECT customer_id FROM vip_customers
),
vip_avg AS (
    SELECT AVG(amount) AS avg_amount
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM vip_ids)
)
SELECT o.*
FROM orders o
JOIN vip_ids v   ON o.customer_id = v.customer_id
CROSS JOIN vip_avg va
WHERE o.amount > va.avg_amount;
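A key property of hoisting is that the rewritten query must return the same result as the correlated original. The sketch below verifies that equivalence on a tiny invented employees table via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER, department TEXT, salary REAL);
INSERT INTO employees VALUES
 (1, 'Sales', 50000), (2, 'Sales', 70000), (3, 'IT', 90000);
""")

# Correlated form: the inner AVG re-runs per outer row
correlated = conn.execute("""
    SELECT emp_id, salary,
           (SELECT AVG(salary) FROM employees e2
            WHERE e2.department = e1.department) AS dept_avg
    FROM employees e1 ORDER BY emp_id
""").fetchall()

# Hoisted form: department averages computed once, then joined
hoisted = conn.execute("""
    SELECT e.emp_id, e.salary, d.avg_salary AS dept_avg
    FROM employees e
    JOIN (SELECT department, AVG(salary) AS avg_salary
          FROM employees GROUP BY department) d
      ON e.department = d.department
    ORDER BY e.emp_id
""").fetchall()

print(correlated == hoisted)   # True -- identical rows, cheaper plan
```

On three rows the difference is invisible; on 100,000 rows the hoisted form replaces 100,000 aggregate executions with one GROUP BY plus a join.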

Q11. How would you identify slow queries in a production database? [3 Marks]

Answer: Identifying slow queries requires a combination of database-native monitoring tools, logging configuration, and query analysis techniques.

Method 1 โ€” Enable and read the Slow Query Log (MySQL):

-- Enable slow query log (in my.cnf or dynamically)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;            -- log queries taking > 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- Analyze the log using mysqldumpslow tool (command line):
-- mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
-- Shows top 10 slowest queries sorted by total time

-- Or query the Performance Schema
SELECT
    DIGEST_TEXT                             AS query,
    COUNT_STAR                              AS executions,
    ROUND(AVG_TIMER_WAIT / 1e12, 4)        AS avg_seconds,
    ROUND(SUM_TIMER_WAIT / 1e12, 4)        AS total_seconds,
    SUM_ROWS_EXAMINED                       AS total_rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

Method 2 โ€” pg_stat_statements (PostgreSQL):

-- Enable extension (once, in postgresql.conf: shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slowest queries by total execution time
SELECT
    query,
    calls,
    ROUND(total_exec_time::NUMERIC, 2)      AS total_ms,
    ROUND(mean_exec_time::NUMERIC, 2)       AS avg_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Find queries with worst rows-examined to rows-returned ratio (inefficiency indicator)
SELECT
    query,
    calls,
    rows,
    ROUND(mean_exec_time::NUMERIC, 2)       AS avg_ms
FROM pg_stat_statements
WHERE calls > 100                           -- only frequently run queries
ORDER BY mean_exec_time DESC
LIMIT 20;

Method 3 โ€” SQL Server Dynamic Management Views:

-- Top 20 most expensive queries by total CPU time
SELECT TOP 20
    qs.total_worker_time / qs.execution_count   AS avg_cpu_time,
    qs.total_elapsed_time / qs.execution_count  AS avg_elapsed_time,
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(
        st.text,
        (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1
    )                                           AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_cpu_time DESC;

Method 4 โ€” EXPLAIN ANALYZE on suspected queries:

-- Once a slow query is identified, analyze its plan
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name, SUM(o.amount)
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;

-- Look for: Seq Scan on large tables, high actual vs estimated row mismatch,
-- Hash Batches > 1 (spilling to disk), Sort operations on large sets

Production monitoring stack: In practice, slow query identification is also handled by APM tools like Datadog, New Relic, pgBadger (PostgreSQL), Percona Monitoring and Management, or cloud-native tools like AWS RDS Performance Insights and Azure Query Performance Insight โ€” which provide visual dashboards, query fingerprinting, and trend analysis.
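When no engine-native slow log is available (embedded databases, test environments), the same idea can be approximated at the application layer. A hedged sketch in Python with sqlite3; the SlowQueryLogger class, threshold, and sample data are all invented for illustration:

```python
import sqlite3
import time

# Hypothetical application-level slow-query logger: wrap every execute(),
# time it, and record statements that exceed a threshold.
class SlowQueryLogger:
    def __init__(self, conn, threshold_seconds=1.0):
        self.conn = conn
        self.threshold = threshold_seconds
        self.slow = []                    # (elapsed_seconds, sql) pairs

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self.conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        if elapsed > self.threshold:
            self.slow.append((elapsed, sql))
        return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(50000)])

log = SlowQueryLogger(conn, threshold_seconds=0.0)   # 0.0 so the demo logs everything
result = log.execute("SELECT COUNT(*) FROM t WHERE n % 7 = 0")
print(result)          # [(7143,)]
print(len(log.slow))   # 1 -- one query crossed the threshold
```

This is conceptually what pg_stat_statements and the MySQL slow query log do for you, with the advantage that the database-native tools also capture query fingerprints and rows-examined counts.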


Q12. You have a query that joins 5 tables and runs in 45 seconds. Walk through your approach to optimizing it. [5 Marks]

Answer: A systematic, step-by-step optimization approach:

Step 1 โ€” Capture and understand the query baseline:

-- Record current performance metrics before any changes
-- MySQL
SHOW STATUS LIKE 'Handler_read%';        -- rows examined metrics
EXPLAIN ANALYZE <your slow query>;       -- execution plan + actual timings

-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<your slow query>;
-- Note: Seq Scan tables, actual vs estimated row counts, total cost

Step 2 โ€” Analyze the execution plan for problem areas:

-- Red flags to look for in EXPLAIN output:
-- โŒ Table Scan / Seq Scan on large tables
-- โŒ Key Lookup / RID Lookup (non-covering index)
-- โŒ Hash Match or Sort on large row sets
-- โŒ Estimated rows << Actual rows (stale statistics)
-- โŒ Nested Loop join on large tables (should be Hash Join)
-- โŒ Early joins producing huge intermediate result sets

-- Example 5-table join to examine:
SELECT
    c.name, o.order_date, oi.quantity,
    p.product_name, cat.category_name
FROM customers c
JOIN orders o       ON c.customer_id   = o.customer_id
JOIN order_items oi ON o.order_id      = oi.order_id
JOIN products p     ON oi.product_id   = p.product_id
JOIN categories cat ON p.category_id   = cat.category_id
WHERE o.order_date >= '2024-01-01'
  AND cat.category_name = 'Electronics';

Step 3 โ€” Fix missing or inefficient indexes:

-- Index every JOIN key and WHERE filter column
CREATE INDEX idx_orders_customer_date   ON orders      (customer_id, order_date);
CREATE INDEX idx_order_items_order      ON order_items (order_id);
CREATE INDEX idx_order_items_product    ON order_items (product_id);
CREATE INDEX idx_products_category      ON products    (category_id);
CREATE INDEX idx_categories_name        ON categories  (category_name);

-- Add covering indexes for SELECT columns to eliminate Key Lookups
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (order_id);                      -- SQL Server syntax

-- Re-run EXPLAIN after adding indexes to verify index seeks

Step 4 โ€” Filter early using CTEs or derived tables:

-- โœ… Push filters as early as possible to reduce rows before expensive joins
WITH filtered_orders AS (
    -- Filter BEFORE joining โ€” drastically reduces rows in subsequent joins
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'     -- eliminate old orders immediately
),
electronics_products AS (
    -- Pre-filter products to Electronics category
    SELECT p.product_id, p.product_name
    FROM products p
    JOIN categories cat ON p.category_id = cat.category_id
    WHERE cat.category_name = 'Electronics'
)
SELECT
    c.name,
    fo.order_date,
    oi.quantity,
    ep.product_name
FROM filtered_orders fo
JOIN customers c        ON fo.customer_id = c.customer_id
JOIN order_items oi     ON fo.order_id    = oi.order_id
JOIN electronics_products ep ON oi.product_id = ep.product_id;

Step 5 โ€” Update statistics and check join order:

-- Stale statistics cause the optimizer to make wrong join order decisions
-- MySQL
ANALYZE TABLE orders, order_items, products, customers, categories;

-- PostgreSQL
ANALYZE orders; ANALYZE order_items; ANALYZE products;

-- SQL Server
UPDATE STATISTICS orders;
UPDATE STATISTICS order_items;
-- Or rebuild all indexes (also updates statistics):
ALTER INDEX ALL ON orders REBUILD;

Step 6 โ€” Consider materialization for repeated sub-results:

-- If the 5-table join is run frequently with same base data,
-- materialize intermediate results

-- PostgreSQL: Materialized CTE (forces evaluation once)
WITH expensive_join AS MATERIALIZED (
    SELECT ...
    FROM orders JOIN order_items ... JOIN products ...
    WHERE order_date >= '2024-01-01'
)
SELECT c.name, ej.order_date ...
FROM customers c JOIN expensive_join ej ON ...;

-- Or create a summary/aggregation table and refresh periodically
CREATE TABLE daily_order_summary AS
SELECT DATE(order_date) AS day, category_id, SUM(amount) AS daily_total
FROM orders JOIN order_items ... JOIN products ...
GROUP BY DATE(order_date), category_id;

Step 7 โ€” Measure improvement and document:

| Step taken | Time before | Time after |
|---|---|---|
| Added indexes on join keys | 45s | 18s |
| Added covering index | 18s | 8s |
| Pushed filters into CTEs | 8s | 3s |
| Updated statistics | 3s | 2.1s |

Systematic principle: Never guess โ€” always measure with EXPLAIN ANALYZE before and after each change. Change one thing at a time so you can attribute the improvement (or regression) to a specific action.
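The measure-before-and-after discipline can be practiced on any engine. Below is a small sqlite3 sketch that times the same query with and without an index (data and names are invented; absolute timings vary by machine, so the code checks result equivalence rather than asserting a speedup):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 1000) for i in range(200000)])

def timed(sql):
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, rows

query = "SELECT * FROM orders WHERE customer_id = 101"

before_s, before_rows = timed(query)        # full table scan: 200,000 rows read
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")
after_s, after_rows = timed(query)          # index seek: ~200 rows read

# Same result either way; only the access path (and the time) changes.
# Never trust a single stopwatch reading -- confirm the plan with EXPLAIN.
print(len(before_rows), before_rows == after_rows)
```

Changing one thing at a time and re-measuring is exactly what the step table above records.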


Q13. A query uses 'WHERE LOWER(email) = 'test@example.com'' on a table with 10 million rows. Explain why it is slow and how to fix it. [5 Marks]

Answer:

Why it is slow โ€” the function defeats the index:

-- This query is slow even with an index on email:
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';

-- Root cause:
-- 1. An index on 'email' stores values in their original case ('Test@Example.COM', 'TEST@EXAMPLE.COM')
-- 2. LOWER(email) is a function applied to the column at query time
-- 3. The database CANNOT use the index on 'email' because the indexed values
--    and the computed values LOWER(email) are different
-- 4. The database must read ALL 10 million rows, apply LOWER() to each,
--    then compare โ€” this is a full table scan = millions of function calls
-- 5. EXPLAIN shows: Seq Scan / Table Scan, rows examined = 10,000,000

Fix 1 โ€” Enforce consistent data storage (best long-term fix):

-- Normalize emails to lowercase at INSERT/UPDATE time
-- Add a CHECK constraint or trigger to enforce lowercase storage

-- MySQL: Add constraint to enforce lowercase
ALTER TABLE customers
ADD CONSTRAINT chk_email_lowercase
CHECK (email = LOWER(email));

-- Application-level enforcement (pseudo-code):
-- Before INSERT: email = email.toLowerCase().trim()

-- Once all emails are lowercase, query without LOWER():
SELECT customer_id, name
FROM customers
WHERE email = 'test@example.com';     -- uses index directly โœ…
-- EXPLAIN: Index Seek, rows examined = 1

Fix 2 โ€” Create a functional/expression index:

-- PostgreSQL: Create an index on LOWER(email) expression
CREATE INDEX idx_customers_email_lower
ON customers (LOWER(email));

-- Now this query uses the expression index:
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';
-- EXPLAIN: Index Scan using idx_customers_email_lower โœ…

-- MySQL 8.0+: Functional index
CREATE INDEX idx_customers_email_lower
ON customers ((LOWER(email)));         -- note double parentheses

-- SQL Server: Computed column + index
ALTER TABLE customers
ADD email_lower AS LOWER(email) PERSISTED;     -- stored computed column

CREATE INDEX idx_customers_email_lower
ON customers (email_lower);

-- Query using computed column:
SELECT customer_id, name
FROM customers
WHERE email_lower = 'test@example.com';   -- uses index โœ…

Fix 3 โ€” Use case-insensitive collation:

-- MySQL: Set column collation to case-insensitive (ci suffix)
ALTER TABLE customers
MODIFY email VARCHAR(255)
COLLATE utf8mb4_unicode_ci;             -- ci = case insensitive

-- Now standard equality is case-insensitive AND index-friendly:
SELECT customer_id, name
FROM customers
WHERE email = 'Test@Example.COM';       -- matches 'test@example.com' โœ…
-- EXPLAIN: uses index on email column โœ…

-- PostgreSQL: Use CITEXT extension
CREATE EXTENSION IF NOT EXISTS citext;

ALTER TABLE customers
ALTER COLUMN email TYPE citext;         -- case-insensitive text type

-- Standard index works with case-insensitive comparison:
CREATE INDEX idx_email ON customers (email);
WHERE email = 'Test@Example.COM';       -- uses index, case-insensitive โœ…

Performance comparison โ€” 10 million row table:

| Approach | Index used | Rows examined | Approx query time |
|---|---|---|---|
| WHERE LOWER(email) = '...' (no fix) | ❌ None | 10,000,000 | ~45 seconds |
| Functional index on LOWER(email) | ✅ Expression index | 1 | ~0.001 seconds |
| Case-insensitive collation + index | ✅ Regular index | 1 | ~0.001 seconds |
| Normalized lowercase storage + index | ✅ Regular index | 1 | ~0.001 seconds |

Verification after fix:

-- PostgreSQL: confirm index is used
EXPLAIN ANALYZE
SELECT customer_id, name
FROM customers
WHERE LOWER(email) = 'test@example.com';
-- Should show: Index Scan using idx_customers_email_lower
-- Rows Removed by Filter: 0
-- Actual rows: 1, Actual time: ~0.05ms

-- MySQL: confirm no full scan
EXPLAIN
SELECT customer_id, name FROM customers
WHERE email = 'test@example.com';      -- after normalization fix
-- type: ref  key: idx_email  rows: 1  Extra: Using index โœ…

Production recommendation: Use all three fixes together for maximum robustness โ€” normalize email storage to lowercase in the application layer (prevents dirty data), add a functional index as a safety net (handles any existing mixed-case data), and add a database constraint to prevent future case-inconsistent inserts. This is a defense-in-depth approach that eliminates the problem at every layer.
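Fix 2 can be reproduced end to end in SQLite, which also supports expression indexes. A minimal sketch via sqlite3 (table, index name, and data are invented; plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Test@Example.COM')")

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN output is the human-readable detail
    return [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT customer_id FROM customers WHERE lower(email) = 'test@example.com'"

before_plan = plan(query)   # e.g. ['SCAN customers'] -- function defeats any plain index
conn.execute("CREATE INDEX idx_email_lower ON customers (lower(email))")
after_plan = plan(query)    # e.g. ['SEARCH customers USING INDEX idx_email_lower (<expr>=?)']

rows = conn.execute(query).fetchall()
print(rows)                 # [(1,)] -- matches despite mixed-case storage
```

The same query text goes from a scan to an index search purely by creating the expression index, which is the behavior the PostgreSQL EXPLAIN verification above demonstrates.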


๐Ÿ“… Day 9: Advanced Topics โ€” Pivoting, JSON, Stored Procedures & Transactions

Advanced 50 Marks 90 Min



Q1. Explain the PIVOT operator. Write a query to pivot monthly sales into columns (Jan, Feb, Mar...). [5 Marks]

Answer: The PIVOT operator rotates rows into columns โ€” it takes distinct values from one column and spreads them across multiple columns, aggregating related data in the process. It transforms a "tall" (normalized) table format into a "wide" (crosstab/spreadsheet) format.

Before PIVOT โ€” tall format (rows per month):

| year | month | total_sales |
|---|---|---|
| 2024 | Jan | 120000 |
| 2024 | Feb | 95000 |
| 2024 | Mar | 140000 |

After PIVOT โ€” wide format (months as columns):

| year | Jan | Feb | Mar | Apr | ... |
|---|---|---|---|---|---|
| 2024 | 120000 | 95000 | 140000 | 110000 | ... |

SQL Server โ€” Native PIVOT operator:

-- Assumption: orders(order_id, customer_id, order_date, amount)

WITH monthly_sales AS (
    SELECT
        YEAR(order_date)                            AS sale_year,
        FORMAT(order_date, 'MMM')                   AS sale_month,      -- 'Jan', 'Feb', etc.
        amount
    FROM orders
    WHERE YEAR(order_date) = 2024
)
SELECT
    sale_year,
    ISNULL([Jan], 0)    AS Jan,
    ISNULL([Feb], 0)    AS Feb,
    ISNULL([Mar], 0)    AS Mar,
    ISNULL([Apr], 0)    AS Apr,
    ISNULL([May], 0)    AS May,
    ISNULL([Jun], 0)    AS Jun,
    ISNULL([Jul], 0)    AS Jul,
    ISNULL([Aug], 0)    AS Aug,
    ISNULL([Sep], 0)    AS Sep,
    ISNULL([Oct], 0)    AS Oct,
    ISNULL([Nov], 0)    AS Nov,
    ISNULL([Dec], 0)    AS Dec
FROM monthly_sales
PIVOT (
    SUM(amount)                                     -- aggregate function
    FOR sale_month IN (                             -- column to pivot
        [Jan],[Feb],[Mar],[Apr],[May],[Jun],
        [Jul],[Aug],[Sep],[Oct],[Nov],[Dec]
    )
) AS pivoted_sales
ORDER BY sale_year;

PostgreSQL โ€” Using crosstab() from tablefunc extension:

-- Enable extension (once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    -- Query 1: source data (must return row_id, category, value)
    $$
    SELECT
        EXTRACT(YEAR FROM order_date)::INT          AS sale_year,
        TO_CHAR(order_date, 'Mon')                  AS sale_month,
        SUM(amount)                                 AS total_sales
    FROM orders
    WHERE EXTRACT(YEAR FROM order_date) = 2024
    GROUP BY 1, 2
    ORDER BY 1, 2
    $$,
    -- Query 2: distinct category values (column headers)
    $$SELECT unnest(ARRAY['Jan','Feb','Mar','Apr','May','Jun',
                          'Jul','Aug','Sep','Oct','Nov','Dec'])$$
) AS pivoted (
    sale_year   INT,
    "Jan"       NUMERIC, "Feb" NUMERIC, "Mar" NUMERIC,
    "Apr"       NUMERIC, "May" NUMERIC, "Jun" NUMERIC,
    "Jul"       NUMERIC, "Aug" NUMERIC, "Sep" NUMERIC,
    "Oct"       NUMERIC, "Nov" NUMERIC, "Dec" NUMERIC
);

Q2. How would you pivot data in a database that does not support PIVOT (e.g., MySQL)? Use CASE WHEN. [5 Marks]

Answer: MySQL does not have a native PIVOT operator. The standard workaround is conditional aggregation โ€” using CASE WHEN inside aggregate functions to selectively sum/count values for each target column.

Assumption: Table orders(order_id, customer_id, order_date, amount) exists.

MySQL โ€” CASE WHEN pivot (manual column definition):

SELECT
    YEAR(order_date)                                AS sale_year,
    -- Each CASE extracts only the rows matching that month, then SUM aggregates them
    ROUND(SUM(CASE WHEN MONTH(order_date) = 1
                   THEN amount ELSE 0 END), 2)      AS Jan,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 2
                   THEN amount ELSE 0 END), 2)      AS Feb,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 3
                   THEN amount ELSE 0 END), 2)      AS Mar,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 4
                   THEN amount ELSE 0 END), 2)      AS Apr,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 5
                   THEN amount ELSE 0 END), 2)      AS May,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 6
                   THEN amount ELSE 0 END), 2)      AS Jun,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 7
                   THEN amount ELSE 0 END), 2)      AS Jul,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 8
                   THEN amount ELSE 0 END), 2)      AS Aug,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 9
                   THEN amount ELSE 0 END), 2)      AS Sep,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 10
                   THEN amount ELSE 0 END), 2)      AS Oct,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 11
                   THEN amount ELSE 0 END), 2)      AS Nov,
    ROUND(SUM(CASE WHEN MONTH(order_date) = 12
                   THEN amount ELSE 0 END), 2)      AS `Dec`,   -- DEC is reserved in MySQL; backticks required
    -- Grand total row
    ROUND(SUM(amount), 2)                           AS annual_total
FROM orders
GROUP BY YEAR(order_date)
ORDER BY sale_year;

Extended โ€” pivot by category AND month:

-- Pivot showing monthly sales broken down per product category
SELECT
    p.category,
    ROUND(SUM(CASE WHEN MONTH(o.order_date) = 1
                   THEN oi.quantity * oi.unit_price ELSE 0 END), 2)  AS Jan,
    ROUND(SUM(CASE WHEN MONTH(o.order_date) = 2
                   THEN oi.quantity * oi.unit_price ELSE 0 END), 2)  AS Feb,
    ROUND(SUM(CASE WHEN MONTH(o.order_date) = 3
                   THEN oi.quantity * oi.unit_price ELSE 0 END), 2)  AS Mar,
    -- ... continue for all 12 months
    ROUND(SUM(oi.quantity * oi.unit_price), 2)                       AS total
FROM orders o
JOIN order_items oi ON o.order_id    = oi.order_id
JOIN products p     ON oi.product_id = p.product_id
WHERE YEAR(o.order_date) = 2024
GROUP BY p.category
ORDER BY total DESC;

Dynamic pivot using prepared statements (MySQL โ€” when column values are unknown at query time):

-- Step 1: Generate the CASE expressions dynamically
SET @sql = NULL;
SELECT GROUP_CONCAT(
    DISTINCT CONCAT(
        'ROUND(SUM(CASE WHEN MONTH(order_date) = ',
        MONTH(order_date),
        ' THEN amount ELSE 0 END), 2) AS `',
        DATE_FORMAT(order_date, '%b'),
        '`'
    )
    ORDER BY MONTH(order_date)
) INTO @sql
FROM orders
WHERE YEAR(order_date) = 2024;

-- Step 2: Build and execute the full query
SET @sql = CONCAT(
    'SELECT YEAR(order_date) AS sale_year, ', @sql,
    ' FROM orders WHERE YEAR(order_date) = 2024 GROUP BY YEAR(order_date)'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Why ELSE 0 matters: Without ELSE 0, the CASE returns NULL for non-matching months. SUM(NULL) returns NULL โ€” not 0. Using ELSE 0 ensures months with no sales show 0 instead of NULL in the pivot output.
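The point is easy to verify. A tiny SQLite check (via Python's sqlite3 module, purely as an illustration): February has no orders, so the CASE without ELSE feeds only NULLs to SUM, which returns NULL, while ELSE 0 makes the same month aggregate to 0. SQLite has no MONTH(), so strftime('%m', ...) stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-01-15", 100.0), ("2024-01-20", 50.0)])  # no February rows

no_else, with_else = conn.execute("""
    SELECT SUM(CASE WHEN strftime('%m', order_date) = '02' THEN amount END),
           SUM(CASE WHEN strftime('%m', order_date) = '02' THEN amount ELSE 0 END)
    FROM orders
""").fetchone()
print(no_else)    # None -- SUM over an all-NULL input
print(with_else)  # 0    -- NULLs replaced by 0 before summing
```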


Q3. What is UNPIVOT? Give a use case. [5 Marks]

Answer: UNPIVOT is the reverse of PIVOT โ€” it transforms columns back into rows, converting a "wide" table with many columns into a "tall" normalized format. Each column value becomes a row, with an additional column identifying which original column the value came from.

Before UNPIVOT โ€” wide format:

| year | Jan    | Feb   | Mar    |
|------|--------|-------|--------|
| 2024 | 120000 | 95000 | 140000 |

After UNPIVOT โ€” tall normalized format:

| year | month | sales  |
|------|-------|--------|
| 2024 | Jan   | 120000 |
| 2024 | Feb   | 95000  |
| 2024 | Mar   | 140000 |

SQL Server โ€” Native UNPIVOT operator:

-- Assumption: wide_sales(sale_year, Jan, Feb, Mar, Apr, May, Jun,
--                                  Jul, Aug, Sep, Oct, Nov, Dec)

SELECT
    sale_year,
    sale_month,
    monthly_sales
FROM wide_sales
UNPIVOT (
    monthly_sales           -- new value column name
    FOR sale_month IN (     -- new category column name, listing source columns
        Jan, Feb, Mar, Apr, May, Jun,
        Jul, Aug, Sep, Oct, Nov, Dec
    )
) AS unpivoted
ORDER BY sale_year, sale_month;

MySQL / PostgreSQL โ€” UNION ALL approach (manual unpivot):

-- Unpivot using UNION ALL (universal approach โ€” works in all databases)
SELECT sale_year, 'Jan' AS sale_month, Jan AS monthly_sales FROM wide_sales
UNION ALL
SELECT sale_year, 'Feb',               Feb                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Mar',               Mar                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Apr',               Apr                  FROM wide_sales
UNION ALL
SELECT sale_year, 'May',               May                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Jun',               Jun                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Jul',               Jul                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Aug',               Aug                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Sep',               Sep                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Oct',               Oct                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Nov',               Nov                  FROM wide_sales
UNION ALL
SELECT sale_year, 'Dec',               Dec                  FROM wide_sales  -- in MySQL, quote as `Dec` (reserved word)
ORDER BY sale_year, sale_month;
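Twelve near-identical SELECTs invite code generation. A sketch with Python and sqlite3, shown with three months and the wide_sales layout assumed in the examples above:

```python
import sqlite3

# Build the UNION ALL unpivot query from a list of month columns
months = ["Jan", "Feb", "Mar"]  # extend to all 12 in practice
sql = "\nUNION ALL\n".join(
    f"SELECT sale_year, '{m}' AS sale_month, {m} AS monthly_sales FROM wide_sales"
    for m in months
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wide_sales (sale_year INT, Jan REAL, Feb REAL, Mar REAL)")
conn.execute("INSERT INTO wide_sales VALUES (2024, 120000, 95000, 140000)")
rows = sorted(conn.execute(sql).fetchall())
print(rows)  # [(2024, 'Feb', 95000.0), (2024, 'Jan', 120000.0), (2024, 'Mar', 140000.0)]
```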

PostgreSQL โ€” Using unnest() for cleaner unpivot:

SELECT
    sale_year,
    unnest(ARRAY['Jan','Feb','Mar','Apr','May','Jun',
                 'Jul','Aug','Sep','Oct','Nov','Dec']) AS sale_month,
    unnest(ARRAY[jan,feb,mar,apr,may,jun,
                 jul,aug,sep,oct,nov,dec])             AS monthly_sales
FROM wide_sales
ORDER BY sale_year;

Real-world use cases for UNPIVOT:

| Use Case                   | Description                                                                          |
|----------------------------|--------------------------------------------------------------------------------------|
| Data normalization         | Convert denormalized spreadsheet imports into relational rows                        |
| Reporting system ingestion | Transform pivot-style Excel exports back to row format for loading into a fact table |
| Survey data                | Columns like q1_answer, q2_answer, q3_answer become rows per question                |
| Financial data             | Quarterly budget columns become normalized rows for time-series analysis             |

Q4. How do you extract a value from a JSON column in SQL? Give examples in PostgreSQL and MySQL. [4 Marks]

Answer:

PostgreSQL โ€” JSON operators:

-- Assumption: customers table has a JSON/JSONB column 'profile'
-- Sample JSON: {"name": "Alice", "age": 30, "address": {"city": "Mumbai", "zip": "400001"}}

-- -> operator: returns JSON object (keeps JSON type)
SELECT profile -> 'name'                    AS name_json       -- returns "Alice" (quoted)
FROM customers;

-- ->> operator: returns TEXT value (most commonly used for comparisons)
SELECT profile ->> 'name'                   AS name_text       -- returns Alice (plain text)
FROM customers;

-- Nested access: chain operators
SELECT profile -> 'address' ->> 'city'      AS city           -- returns Mumbai
FROM customers;

-- #> operator: access by path (array of keys)
SELECT profile #> '{address, city}'         AS city_json
FROM customers;

-- #>> operator: access by path, return as text
SELECT profile #>> '{address, city}'        AS city_text
FROM customers;

-- Filtering on JSON value
SELECT customer_id, profile ->> 'name' AS name
FROM customers
WHERE profile ->> 'age' = '30';            -- JSON values compared as text

-- JSONB-specific: @> containment operator (very fast with GIN index)
SELECT customer_id
FROM customers
WHERE profile @> '{"address": {"city": "Mumbai"}}';

MySQL โ€” JSON functions:

-- JSON_EXTRACT: standard extraction using path syntax
SELECT
    JSON_EXTRACT(profile, '$.name')         AS name_json,     -- returns "Alice" (quoted)
    JSON_UNQUOTE(JSON_EXTRACT(profile, '$.name'))
                                            AS name_text      -- returns Alice (unquoted)
FROM customers;

-- Shorthand -> operator (MySQL 5.7.13+)
SELECT profile -> '$.name'                  AS name_json      -- same as JSON_EXTRACT
FROM customers;

-- Shorthand ->> operator (MySQL 5.7.13+)
SELECT profile ->> '$.name'                 AS name_text      -- same as JSON_UNQUOTE(JSON_EXTRACT)
FROM customers;

-- Nested access
SELECT profile ->> '$.address.city'         AS city
FROM customers;

-- Array element access
SELECT profile ->> '$.tags[0]'              AS first_tag      -- first element of array
FROM customers;

-- Filtering
SELECT customer_id
FROM customers
WHERE JSON_EXTRACT(profile, '$.address.city') = '"Mumbai"';   -- JSON string includes quotes

-- Cleaner filtering with ->>
SELECT customer_id
FROM customers
WHERE profile ->> '$.address.city' = 'Mumbai';
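SQLite offers a close analogue that can be exercised directly from Python: json_extract() takes the same '$.path' syntax as MySQL and returns string values unquoted. This sketch assumes the hypothetical customers/profile schema from the examples above and an SQLite build with the JSON functions (standard in recent Python distributions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INT, profile TEXT)")
conn.execute("""INSERT INTO customers VALUES
    (1, '{"name": "Alice", "age": 30, "address": {"city": "Mumbai", "zip": "400001"}}')""")

# Top-level and nested extraction with JSON path syntax
name, city = conn.execute("""
    SELECT json_extract(profile, '$.name'),
           json_extract(profile, '$.address.city')
    FROM customers
""").fetchone()
print(name, city)  # Alice Mumbai
```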

Q5. Write a query to extract the 'city' field from a JSON column 'address'. [3 Marks]

Answer:

Assumption: Table customers(customer_id, name, email, address) exists where address is a JSON/JSONB column.

Sample JSON value: {"street": "42 MG Road", "city": "Mumbai", "state": "Maharashtra", "zip": "400001", "country": "India"}

PostgreSQL (JSONB column โ€” recommended type):

SELECT
    customer_id,
    name,
    address ->> 'city'                          AS city,
    address ->> 'state'                         AS state,
    address ->> 'zip'                           AS zip_code,
    address ->> 'country'                       AS country
FROM customers
WHERE address ->> 'city' IS NOT NULL
ORDER BY city;

-- Filter customers in a specific city
SELECT customer_id, name
FROM customers
WHERE address ->> 'city' = 'Mumbai';

-- With GIN index for fast city filtering
CREATE INDEX idx_customer_address_gin
ON customers USING GIN (address);

-- Or a more targeted expression index for city specifically
CREATE INDEX idx_customer_city
ON customers ((address ->> 'city'));

MySQL:

SELECT
    customer_id,
    name,
    address ->> '$.city'                        AS city,
    address ->> '$.state'                       AS state,
    address ->> '$.zip'                         AS zip_code,
    address ->> '$.country'                     AS country
FROM customers
WHERE address ->> '$.city' IS NOT NULL
ORDER BY city;

-- Filter customers in a specific city
SELECT customer_id, name
FROM customers
WHERE address ->> '$.city' = 'Mumbai';

-- MySQL: create functional index on JSON path (MySQL 8.0+)
CREATE INDEX idx_customer_city
ON customers ((CAST(address ->> '$.city' AS CHAR(100))));

SQL Server (JSON stored as NVARCHAR):

SELECT
    customer_id,
    name,
    JSON_VALUE(address, '$.city')               AS city,
    JSON_VALUE(address, '$.state')              AS state,
    JSON_VALUE(address, '$.zip')                AS zip_code,
    JSON_VALUE(address, '$.country')            AS country
FROM customers
WHERE JSON_VALUE(address, '$.city') IS NOT NULL
ORDER BY JSON_VALUE(address, '$.city');

-- Filter customers in a specific city
SELECT customer_id, name
FROM customers
WHERE JSON_VALUE(address, '$.city') = 'Mumbai';

Q6. What are the advantages and disadvantages of storing data as JSON in a relational database? [3 Marks]

Answer:

Advantages:

-- 1. Schema flexibility โ€” store variable or evolving attributes without ALTER TABLE
-- Different customers can have different fields in their profile JSON
INSERT INTO customers (customer_id, name, profile) VALUES
(1, 'Alice', '{"tier": "Gold", "preferences": ["email", "sms"]}'),
(2, 'Bob',   '{"tier": "Silver", "company": "TechCorp", "gst_no": "27AAPFU0939F1ZV"}');
-- No need for 50 nullable columns to handle every possible attribute

-- 2. Nested/hierarchical data โ€” represents complex structures naturally
-- Address with multiple lines, tags as arrays, nested metadata
'{"address": {"street": "42 MG Road", "city": "Mumbai"}, "tags": ["vip","early-adopter"]}'

-- 3. Reduces table count โ€” avoids creating separate tables for sparse attributes
-- Instead of a customer_attributes EAV table (entity-attribute-value anti-pattern)
-- store flexible attributes directly in the JSON column

-- 4. API-friendly โ€” JSON from external APIs can be stored and queried directly
-- without upfront schema design for every API field

Disadvantages:

-- 1. No referential integrity โ€” JSON values cannot have foreign keys
-- No constraint enforcement inside JSON fields
-- 'city': 'Mumbay' โ€” typo impossible to catch at DB level

-- 2. Poor query performance without careful indexing
-- Full table scan required unless expression indexes or GIN indexes are created
-- Complex JSON queries are harder to optimize than relational queries
SELECT * FROM customers WHERE address ->> 'city' = 'Mumbai';  -- slow without index

-- 3. No JOIN capability โ€” cannot JOIN on a JSON field value efficiently
-- Cannot enforce relationships between JSON values and other tables

-- 4. Data duplication and inconsistency โ€” same city name stored as
-- 'Mumbai', 'mumbai', 'MUMBAI', 'Bombay' with no normalization

-- 5. Harder to aggregate โ€” SUM, COUNT, GROUP BY on JSON values require
-- extraction functions at every step โ€” verbose and error-prone
SELECT address ->> 'city', COUNT(*)
FROM customers
GROUP BY address ->> 'city';           -- works but awkward at scale

-- 6. Schema documentation is implicit โ€” no explicit column list
-- New developers cannot see what fields exist by looking at table structure

Balanced guideline:

| Use JSON for                       | Avoid JSON for                                      |
|------------------------------------|-----------------------------------------------------|
| Truly variable/optional attributes | Core business data (names, dates, amounts)          |
| External API payload storage       | Data that needs JOINs or foreign keys               |
| Audit logs and event payloads      | Columns used frequently in GROUP BY or aggregations |
| Sparse attributes (most rows NULL) | Data requiring referential integrity                |

Q7. What is the difference between a stored procedure and a user-defined function? [3 Marks]

Answer:

| Feature             | Stored Procedure                                        | User-Defined Function (UDF)                                   |
|---------------------|---------------------------------------------------------|---------------------------------------------------------------|
| Return value        | Zero, one, or multiple result sets; uses OUT parameters | Exactly one value (scalar) or one table (table-valued function) |
| Use in SELECT       | No: cannot be called inside a SELECT statement          | Yes: usable in SELECT, WHERE, JOIN                            |
| DML operations      | Yes: can execute INSERT, UPDATE, DELETE, DDL            | Scalar UDFs cannot modify data (read-only in most databases)  |
| Transaction control | Yes: can use COMMIT, ROLLBACK, SAVEPOINT                | No: cannot manage transactions                                |
| Error handling      | Full TRY/CATCH or EXCEPTION handling                    | Limited error handling                                        |
| Called with         | CALL proc_name() or EXEC proc_name                      | SELECT func_name() or inline in expressions                   |
| Purpose             | Complex business workflows, batch operations            | Reusable calculations, transformations, lookups               |

-- STORED PROCEDURE: executes a workflow, cannot be used inline in SELECT
CALL calculate_monthly_bonuses(2024, 'Q1');
EXEC sp_process_refund @order_id = 1001;

-- USER-DEFINED FUNCTION: returns a value, usable anywhere an expression is valid
SELECT
    customer_id,
    name,
    get_customer_tier(customer_id)          AS tier,    -- scalar UDF in SELECT โœ…
    calculate_tax(amount, 'GST')            AS tax
FROM orders
WHERE get_customer_tier(customer_id) = 'Gold';          -- UDF in WHERE โœ…
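SQLite has no stored procedures, but scalar UDFs can be registered from Python and then used inline in SELECT and WHERE, which is exactly the property the comparison attributes to UDFs. Here get_customer_tier is a toy stand-in for the hypothetical SQL function named above.

```python
import sqlite3

def get_customer_tier(customer_id: int) -> str:
    # Toy tiering rule, purely for demonstration
    return "Gold" if customer_id % 2 == 1 else "Silver"

conn = sqlite3.connect(":memory:")
conn.create_function("get_customer_tier", 1, get_customer_tier)
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(101, 500.0), (102, 900.0)])

# The registered UDF works anywhere an expression is valid
rows = conn.execute("""
    SELECT customer_id, get_customer_tier(customer_id) AS tier
    FROM orders
    WHERE get_customer_tier(customer_id) = 'Gold'
""").fetchall()
print(rows)  # [(101, 'Gold')]
```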

Q8. Write a stored procedure that takes a customer_id and returns their total order value. [5 Marks]

Answer:

MySQL:

DELIMITER $$

CREATE PROCEDURE sp_get_customer_total_order_value(
    IN  p_customer_id   INT,
    OUT p_total_value   DECIMAL(15, 2),
    OUT p_order_count   INT,
    OUT p_customer_name VARCHAR(255)
)
BEGIN
    -- Error handling
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        SET p_total_value   = NULL;
        SET p_order_count   = 0;
        SET p_customer_name = NULL;
        RESIGNAL;
    END;

    -- Validate customer exists
    IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = p_customer_id) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Customer not found';
    END IF;

    -- Get customer name
    SELECT name INTO p_customer_name
    FROM customers
    WHERE customer_id = p_customer_id;

    -- Calculate total order value and count
    SELECT
        COALESCE(SUM(amount), 0),
        COUNT(order_id)
    INTO
        p_total_value,
        p_order_count
    FROM orders
    WHERE customer_id = p_customer_id;

END$$

DELIMITER ;

-- Calling the procedure:
CALL sp_get_customer_total_order_value(101, @total, @count, @name);
SELECT @name AS customer_name, @total AS total_order_value, @count AS order_count;

PostgreSQL (using PROCEDURE and INOUT, or a FUNCTION):

-- PostgreSQL prefers functions for returning data
CREATE OR REPLACE FUNCTION fn_get_customer_total_order_value(
    p_customer_id   INT
)
RETURNS TABLE (
    customer_name       VARCHAR,
    total_order_value   NUMERIC,
    order_count         BIGINT,
    first_order_date    DATE,
    last_order_date     DATE,
    avg_order_value     NUMERIC
)
LANGUAGE plpgsql
AS $$
BEGIN
    -- Validate customer exists
    IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = p_customer_id) THEN
        RAISE EXCEPTION 'Customer with ID % not found', p_customer_id;
    END IF;

    -- Return result set
    RETURN QUERY
    SELECT
        c.name                                      AS customer_name,
        COALESCE(SUM(o.amount), 0)                  AS total_order_value,
        COUNT(o.order_id)                           AS order_count,
        MIN(o.order_date)::DATE                     AS first_order_date,
        MAX(o.order_date)::DATE                     AS last_order_date,
        ROUND(AVG(o.amount)::NUMERIC, 2)            AS avg_order_value
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.customer_id = p_customer_id
    GROUP BY c.name;
END;
$$;

-- Calling the function:
SELECT * FROM fn_get_customer_total_order_value(101);

SQL Server:

CREATE OR ALTER PROCEDURE sp_get_customer_total_order_value
    @customer_id    INT,
    @total_value    DECIMAL(15, 2)  OUTPUT,
    @order_count    INT             OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRY
        -- Validate
        IF NOT EXISTS (SELECT 1 FROM customers WHERE customer_id = @customer_id)
        BEGIN
            RAISERROR ('Customer ID %d not found.', 16, 1, @customer_id);
            RETURN;
        END

        -- Calculate totals
        SELECT
            @total_value    = COALESCE(SUM(amount), 0),
            @order_count    = COUNT(order_id)
        FROM orders
        WHERE customer_id = @customer_id;

        -- Return full detail as a result set
        SELECT
            c.name                              AS customer_name,
            @total_value                        AS total_order_value,
            @order_count                        AS order_count,
            MIN(o.order_date)                   AS first_order,
            MAX(o.order_date)                   AS last_order
        FROM customers c
        LEFT JOIN orders o ON c.customer_id = o.customer_id
        WHERE c.customer_id = @customer_id
        GROUP BY c.name;

    END TRY
    BEGIN CATCH
        PRINT ERROR_MESSAGE();
        THROW;
    END CATCH
END;

-- Calling:
DECLARE @total DECIMAL(15,2), @cnt INT;
EXEC sp_get_customer_total_order_value
    @customer_id = 101,
    @total_value = @total OUTPUT,
    @order_count = @cnt  OUTPUT;
SELECT @total AS total_value, @cnt AS order_count;

Q9. What are the pros and cons of using stored procedures for business logic? [4 Marks]

Answer:

Pros:

-- 1. PERFORMANCE: Pre-compiled and cached execution plan
-- The database compiles and optimizes the procedure once, reuses plan on every call
-- Reduces parsing and planning overhead for frequently executed logic

-- 2. NETWORK EFFICIENCY: Only procedure name + parameters sent over the wire
-- Instead of sending 200 lines of SQL, client sends:
CALL sp_process_month_end_closing(2024, 3);   -- single network round trip
-- vs. sending the entire 200-line query from application code each time

-- 3. SECURITY: Granular permission control without exposing base tables
-- Grant EXECUTE on procedure, not SELECT/INSERT/UPDATE on base tables
GRANT EXECUTE ON sp_process_refund TO app_user;     -- app can execute
REVOKE SELECT ON orders FROM app_user;              -- but cannot query directly

-- 4. REUSABILITY: Business logic defined once, used by multiple applications
-- Java app, Python script, and reporting tool all call the same procedure
-- Changing the logic means updating ONE place โ€” not every application

-- 5. CONSISTENCY: Enforces business rules at the database layer
-- Even if an app developer bypasses the application layer, the DB enforces rules

Cons:

-- 1. VERSION CONTROL DIFFICULTY: Stored procedures live in the database
-- Harder to track changes in Git โ€” require special tooling (Flyway, Liquibase)
-- "What changed in this procedure last month?" is hard to answer without tooling

-- 2. PORTABILITY: Procedural SQL is NOT standard across databases
-- PL/pgSQL (PostgreSQL), T-SQL (SQL Server), PL/SQL (Oracle), MySQL differ significantly
-- Migrating database engines requires rewriting ALL procedures

-- 3. TESTING DIFFICULTY: Unit testing stored procedures is cumbersome
-- No native unit test frameworks like Jest or pytest
-- Requires database access for every test โ€” slow, stateful, fragile

-- 4. DEBUGGING: Limited debugging tools compared to application code
-- Debugging PL/pgSQL or T-SQL is much harder than stepping through Python/Java
-- Error messages are often less descriptive

-- 5. HIDDEN BUSINESS LOGIC: Logic split between application and database
-- New developers must look in two places to understand the full flow
-- Can lead to inconsistency if some logic is in app and some in procedures

-- 6. SCALABILITY: Procedures execute on the database server
-- Heavy business logic in DB consumes CPU/memory on the most critical shared resource
-- Application servers are easier (and cheaper) to scale horizontally

Verdict โ€” when to use stored procedures:

| Use stored procedures for           | Keep in application code                |
|-------------------------------------|-----------------------------------------|
| Complex multi-step DML operations   | Simple CRUD queries                     |
| Batch jobs with large data volume   | Business rules needing frequent changes |
| Security-sensitive data access      | Logic requiring unit testing            |
| DB-level audit trails and triggers  | Cross-service or multi-DB workflows     |

Q10. What is a trigger in SQL? Give a practical example of when you would use one. [3 Marks]

Answer: A trigger is a stored database object that automatically executes in response to a specific event (INSERT, UPDATE, or DELETE) on a table. Triggers fire automatically โ€” they cannot be called manually and do not require application code to invoke them.

Trigger timing:
- BEFORE โ€” fires before the DML operation (can modify values or cancel the operation)
- AFTER โ€” fires after the DML operation completes successfully

Practical example 1 โ€” Audit trail (log all changes to orders):

-- MySQL: Log every order status change to an audit table
CREATE TABLE order_audit_log (
    log_id          INT AUTO_INCREMENT PRIMARY KEY,
    order_id        INT,
    old_status      VARCHAR(50),
    new_status      VARCHAR(50),
    changed_by      VARCHAR(100)    DEFAULT (USER()),   -- expression default: requires MySQL 8.0.13+
    changed_at      DATETIME        DEFAULT NOW()
);

DELIMITER $$
CREATE TRIGGER trg_orders_status_audit
AFTER UPDATE ON orders
FOR EACH ROW
BEGIN
    -- Only log when status actually changes
    IF OLD.status != NEW.status THEN
        INSERT INTO order_audit_log (
            order_id, old_status, new_status, changed_by, changed_at
        ) VALUES (
            NEW.order_id, OLD.status, NEW.status, USER(), NOW()
        );
    END IF;
END$$
DELIMITER ;
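The same audit-trail idea can be run end-to-end in SQLite, which supports row-level triggers with slightly different syntax (a WHEN clause replaces the IF inside the body). Table names mirror the MySQL example; this is an illustration, not production code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE order_audit_log (
        order_id INT, old_status TEXT, new_status TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER trg_orders_status_audit
    AFTER UPDATE ON orders
    FOR EACH ROW WHEN OLD.status != NEW.status
    BEGIN
        INSERT INTO order_audit_log (order_id, old_status, new_status)
        VALUES (NEW.order_id, OLD.status, NEW.status);
    END;

    INSERT INTO orders VALUES (1, 'PENDING');
    UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1;  -- logged
    UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1;  -- no change, not logged
""")
log = conn.execute("SELECT order_id, old_status, new_status "
                   "FROM order_audit_log").fetchall()
print(log)  # [(1, 'PENDING', 'SHIPPED')]
```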

Practical example 2 โ€” Maintain denormalized summary (auto-update stock):

-- PostgreSQL: Automatically decrement inventory when an order item is inserted
CREATE OR REPLACE FUNCTION trg_fn_decrement_stock()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
    UPDATE products
    SET stock_quantity = stock_quantity - NEW.quantity
    WHERE product_id = NEW.product_id;

    -- Raise error if stock goes negative
    IF (SELECT stock_quantity FROM products WHERE product_id = NEW.product_id) < 0 THEN
        RAISE EXCEPTION 'Insufficient stock for product_id %', NEW.product_id;
    END IF;

    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_order_item_stock_decrement
AFTER INSERT ON order_items
FOR EACH ROW
EXECUTE FUNCTION trg_fn_decrement_stock();

Common trigger use cases:

Use Case Trigger Type
Audit logging of data changes AFTER UPDATE / DELETE
Auto-updating updated_at timestamp BEFORE UPDATE
Enforcing complex business rules BEFORE INSERT / UPDATE
Maintaining denormalized summary tables AFTER INSERT / UPDATE / DELETE
Cascading soft deletes AFTER DELETE

Caution: Triggers execute invisibly โ€” developers running ad-hoc SQL may not realize a trigger is firing. Overuse leads to "magic behavior" that is hard to debug. Always document triggers clearly and use them sparingly for genuinely cross-cutting concerns.


Q11. Explain ACID properties with a real-world example. [4 Marks]

Answer: ACID is a set of four properties that guarantee database transactions are processed reliably, even in the event of errors, power failures, or concurrent access.

Real-world scenario: Bank transfer of โ‚น10,000 from Alice's account to Bob's account.

A โ€” Atomicity ("all or nothing"):

-- Either BOTH operations succeed, or NEITHER happens
-- There is no partial state where Alice loses money but Bob doesn't receive it

BEGIN TRANSACTION;

    UPDATE accounts
    SET balance = balance - 10000
    WHERE account_id = 'ALICE_001';     -- Step 1: Debit Alice

    -- If the system crashes HERE, Atomicity ensures Step 1 is ROLLED BACK
    -- Alice does not lose โ‚น10,000 without Bob receiving it

    UPDATE accounts
    SET balance = balance + 10000
    WHERE account_id = 'BOB_001';       -- Step 2: Credit Bob

COMMIT;    -- Both succeed โ†’ commit
-- OR
ROLLBACK;  -- If any error โ†’ both reversed, as if transaction never happened
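Atomicity can be shown in miniature with Python's sqlite3 module: a simulated crash between the debit and the credit rolls the debit back, so no partial transfer survives. Account IDs follow the example above; the "crash" is just a raised exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("ALICE_001", 10000), ("BOB_001", 0)])
conn.commit()  # setup committed before the transfer begins

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 10000 "
                     "WHERE account_id = 'ALICE_001'")       # step 1: debit applied
        raise RuntimeError("simulated crash before the credit")
except RuntimeError:
    pass  # the context manager has rolled the transaction back

alice = conn.execute("SELECT balance FROM accounts "
                     "WHERE account_id = 'ALICE_001'").fetchone()[0]
print(alice)  # 10000 -- the debit never became visible
```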

C โ€” Consistency ("valid state to valid state"):

-- The database must move from one valid state to another valid state
-- Business rules (constraints) must hold before AND after the transaction

-- Example constraint: no account can have negative balance
ALTER TABLE accounts ADD CONSTRAINT chk_balance CHECK (balance >= 0);

-- If Alice has only โ‚น5,000 and tries to send โ‚น10,000:
-- The CHECK constraint fires โ†’ transaction REJECTED โ†’ database stays consistent
-- Total money in system: โ‚นX before = โ‚นX after (money is neither created nor destroyed)

I โ€” Isolation ("concurrent transactions don't interfere"):

-- Two simultaneous transactions see a consistent snapshot of data
-- Transaction T1 (Alice โ†’ Bob transfer) and T2 (Alice โ†’ Carol transfer)
-- should not interfere with each other

-- Without isolation: both T1 and T2 read Alice's balance as โ‚น10,000
-- Both debit โ‚น10,000 โ†’ Alice ends up with -โ‚น10,000 (race condition)

-- With isolation (SERIALIZABLE level):
-- T2 waits until T1 commits before reading Alice's balance
-- T2 reads โ‚น0 โ†’ rejects transfer โ†’ database remains consistent

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
    -- T2 now sees committed state of T1 before proceeding

D โ€” Durability ("committed data survives failures"):

-- Once COMMIT is executed, the transaction is permanently saved
-- Even if the server crashes 1 millisecond after COMMIT, the data is not lost

COMMIT;
-- At this point:
-- โœ… Transaction written to the Write-Ahead Log (WAL) / Redo Log on disk
-- โœ… Even if RAM is wiped by a crash, the committed data can be recovered from the log
-- โœ… On server restart, the DB replays the log to restore the committed state

ACID summary table:

| Property    | Guarantee                               | Failure it prevents                           |
|-------------|-----------------------------------------|-----------------------------------------------|
| Atomicity   | All steps succeed or none do            | Partial updates leaving inconsistent state    |
| Consistency | DB moves between valid states only      | Constraint violations from concurrent changes |
| Isolation   | Concurrent transactions don't interfere | Race conditions, dirty reads                  |
| Durability  | Committed data survives crashes         | Data loss after power failure or crash        |

Q12. What are the different isolation levels? Explain each briefly. [3 Marks]

Answer: Transaction isolation levels control how much one transaction is shielded from changes made by other concurrent transactions. Higher isolation = more data consistency, but less concurrency and more locking overhead.

The four standard isolation levels (least to most strict):

Level 1 โ€” READ UNCOMMITTED (lowest isolation):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- Can read data that another transaction has modified but NOT yet committed
-- Problem: DIRTY READ โ€” reading data that may be rolled back
-- Example: T1 updates price to โ‚น500 (not committed) โ†’ T2 reads โ‚น500 โ†’ T1 rolls back โ†’ T2 used wrong price
-- Use case: Real-time dashboards where approximate values are acceptable

Level 2 โ€” READ COMMITTED (default in PostgreSQL, SQL Server, Oracle):

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Only reads data that has been COMMITTED by other transactions
-- Prevents: Dirty reads โœ…
-- Problem: NON-REPEATABLE READ โ€” same query within a transaction can return different results
-- Example: T1 reads salary=50000 โ†’ T2 commits salary=60000 โ†’ T1 reads salary again โ†’ gets 60000
-- Use case: Most OLTP applications โ€” good balance of consistency and performance

Level 3 โ€” REPEATABLE READ (default in MySQL InnoDB):

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Guarantees that if you read a row, re-reading it in the same transaction gives the same result
-- Prevents: Dirty reads โœ…, Non-repeatable reads โœ…
-- Problem: PHANTOM READ โ€” a new row inserted by another transaction may appear in a repeated range query
-- Example: T1 counts orders WHERE amount > 1000 โ†’ gets 50
--          T2 inserts a new order with amount=5000 (commits)
--          T1 re-counts โ†’ gets 51 (phantom row appeared)
-- Use case: Financial reports needing consistent row-level reads within a transaction

Level 4 โ€” SERIALIZABLE (highest isolation):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transactions execute as if they were run one after another (serially), not concurrently
-- Prevents: Dirty reads โœ…, Non-repeatable reads โœ…, Phantom reads โœ…
-- Problem: LOWEST concurrency โ€” heaviest locking, most likely to cause deadlocks and timeouts
-- Use case: Bank transfers, inventory allocation, any scenario requiring complete data integrity

Summary โ€” what each level prevents:

| Isolation Level  | Dirty Read | Non-Repeatable Read | Phantom Read |
|------------------|------------|---------------------|--------------|
| READ UNCOMMITTED | Possible   | Possible            | Possible     |
| READ COMMITTED   | Prevented  | Possible            | Possible     |
| REPEATABLE READ  | Prevented  | Prevented           | Possible     |
| SERIALIZABLE     | Prevented  | Prevented           | Prevented    |

Q13. What is a deadlock? How can it be prevented? [3 Marks]

Answer: A deadlock occurs when two or more transactions are permanently blocked, each waiting for the other to release a lock that the other transaction holds. Neither can proceed โ€” they are stuck waiting for each other in a circular dependency.

Classic deadlock scenario:

-- Transaction T1 and T2 run concurrently

-- T1 execution sequence:
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE account_id = 'A';  -- T1 locks row A
-- T1 now waits for row B (held by T2)...

-- T2 execution sequence (concurrent):
BEGIN;
UPDATE accounts SET balance = balance - 500  WHERE account_id = 'B';  -- T2 locks row B
UPDATE accounts SET balance = balance + 500  WHERE account_id = 'A';  -- T2 WAITS for row A (held by T1)

-- Back to T1:
UPDATE accounts SET balance = balance + 1000 WHERE account_id = 'B';  -- T1 WAITS for row B (held by T2)

-- DEADLOCK: T1 waits for T2 to release B, T2 waits for T1 to release A
-- Neither can proceed โ†’ database detects cycle โ†’ kills one transaction

Prevention strategies:

Strategy 1 โ€” Always access resources in the same consistent order:

-- โœ… Both T1 and T2 always lock accounts in ascending account_id order
-- T1: Lock A first, then B
-- T2: Lock A first, then B (waits for T1 to release A โ€” no deadlock possible)

-- Application-level enforcement:
-- Sort the accounts being updated before executing:
-- accounts_to_update = sorted([account_A, account_B])  -- always same order

Strategy 2 โ€” Keep transactions short and fast:

-- โœ… Minimize the time locks are held โ€” do all computation BEFORE the transaction
-- โŒ Bad: long computation inside transaction holds locks for seconds
BEGIN;
    -- complex calculation taking 30 seconds (locks held entire time)
    SELECT ... expensive computation ...
    UPDATE accounts SET balance = ...;
COMMIT;

-- โœ… Good: compute the new balance in application code first, then lock briefly
-- (pseudocode: :new_balance is a bind parameter supplied by the application)
BEGIN;
    UPDATE accounts SET balance = :new_balance WHERE account_id = 'A';  -- lock held milliseconds
COMMIT;

Strategy 3 โ€” Use SELECT FOR UPDATE with SKIP LOCKED (queue processing):

-- PostgreSQL / MySQL: lock only available rows, skip already-locked ones
-- Prevents multiple workers from fighting over the same rows
SELECT order_id FROM pending_orders
ORDER BY order_id
LIMIT 1
FOR UPDATE SKIP LOCKED;    -- locking clause goes last in both PostgreSQL and MySQL

Strategy 4 โ€” Set lock timeouts to fail fast rather than wait forever:

-- MySQL: set a timeout after which the transaction gives up
SET innodb_lock_wait_timeout = 5;      -- give up after 5 seconds

-- PostgreSQL:
SET lock_timeout = '5s';               -- error if lock not acquired in 5 seconds

-- SQL Server: NOWAIT or timeout
SELECT * FROM orders WITH (UPDLOCK, NOWAIT)
WHERE order_id = 1001;                 -- fails immediately if row is locked

Strategy 5 โ€” Use optimistic locking for read-heavy scenarios:

-- Add a version column to detect conflicts without holding locks
ALTER TABLE accounts ADD COLUMN version INT DEFAULT 1;

-- Read with version
SELECT balance, version FROM accounts WHERE account_id = 'A';
-- Returns: balance=10000, version=5

-- Update only if version hasn't changed (no lock held during think time)
UPDATE accounts
SET balance = 9000, version = version + 1
WHERE account_id = 'A'
  AND version = 5;                     -- fails if another transaction updated first
-- If rows_affected = 0 โ†’ conflict detected โ†’ retry in application
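The optimistic-locking pattern can be exercised with sqlite3: the UPDATE carries the version the reader saw, and a rowcount of 0 signals that a concurrent writer won and the caller must re-read and retry. The schema mirrors the accounts example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INT, version INT)")
conn.execute("INSERT INTO accounts VALUES ('A', 10000, 5)")

# Writer 1: version still matches -> update succeeds
cur = conn.execute("UPDATE accounts SET balance = 9000, version = version + 1 "
                   "WHERE account_id = 'A' AND version = 5")
print(cur.rowcount)  # 1

# Writer 2: stale version -> no rows affected, application should retry
cur = conn.execute("UPDATE accounts SET balance = 8000, version = version + 1 "
                   "WHERE account_id = 'A' AND version = 5")
print(cur.rowcount)  # 0
```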

Database deadlock detection: All major databases automatically detect deadlocks. When a circular wait is detected, the database selects a victim transaction (usually the one that has done the least work) and rolls it back, allowing the other transaction to proceed. The rolled-back transaction must be retried by the application โ€” so always implement retry logic for deadlock errors (MySQL error 1213, PostgreSQL error 40P01).
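The application-side retry loop recommended above can be sketched as follows. DeadlockError and flaky_transfer are stand-ins for this illustration; real code would catch the driver's specific exception and inspect its code (MySQL 1213, PostgreSQL 40P01).

```python
import random
import time

DEADLOCK_CODES = {"1213", "40P01"}

class DeadlockError(Exception):
    def __init__(self, code):
        super().__init__(f"deadlock detected ({code})")
        self.code = code

def with_deadlock_retry(txn_fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except DeadlockError as exc:
            if exc.code not in DEADLOCK_CODES or attempt == max_attempts:
                raise
            # Back off with jitter so retrying victims don't collide again
            time.sleep(0.05 * (2 ** attempt) + random.random() * 0.05)

# Simulated victim: deadlocks twice, then succeeds on the third attempt
attempts = []
def flaky_transfer():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockError("1213")
    return "committed"

result = with_deadlock_retry(flaky_transfer)
print(result, len(attempts))  # committed 3
```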


๐Ÿ“… Day 10: Mock Interview Paper โ€” All Topics Combined

Advanced 50 Marks 90 Min

Here are all the answers formatted for direct copy-paste:


Q1. What is the difference between DELETE, TRUNCATE, and DROP? [2 Marks]

Answer:

| Feature | DELETE | TRUNCATE | DROP |
|---|---|---|---|
| What it removes | Specific rows (with WHERE) or all rows | All rows in the table | Entire table: structure, data, and indexes |
| WHERE clause | ✅ Supported | ❌ Not supported | ❌ Not supported |
| Transaction rollback | ✅ Can be rolled back | ⚠️ Cannot be rolled back (DDL in most DBs) | ❌ Cannot be rolled back |
| Triggers fired | ✅ Row-level triggers fire | ❌ No triggers fire | ❌ No triggers fire |
| Speed | 🔴 Slowest (logs each row deletion) | 🟢 Very fast (deallocates data pages) | 🟢 Instant (removes entire object) |
| Auto-increment reset | ❌ Does not reset | ✅ Resets identity/auto-increment counter | N/A (table gone) |
| Statement type | DML (Data Manipulation) | DDL (Data Definition) | DDL (Data Definition) |
-- DELETE: remove specific rows, fully logged, rollback-safe
BEGIN TRANSACTION;
DELETE FROM orders WHERE order_date < '2020-01-01';
ROLLBACK;                               -- rows are restored ✅

-- TRUNCATE: wipe all rows instantly, cannot be rolled back in most DBs
TRUNCATE TABLE temp_staging;            -- all rows gone, structure remains ✅

-- DROP: destroy the entire table permanently
DROP TABLE temp_staging;                -- table, indexes, constraints: all gone ✅

-- PostgreSQL exception: TRUNCATE CAN be rolled back inside an explicit transaction
BEGIN;
TRUNCATE TABLE orders;
ROLLBACK;                               -- rows restored in PostgreSQL only ✅

Interview tip: Use DELETE when you need selective removal with rollback safety. Use TRUNCATE to quickly clear staging/temp tables. Use DROP only when the table itself is no longer needed.
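The rollback behavior of DELETE is easy to verify with SQLite through Python's sqlite3 module (SQLite has no TRUNCATE; an unfiltered DELETE plays that role there). The table and dates are illustrative:

```python
import sqlite3

# Autocommit mode so BEGIN/ROLLBACK are controlled explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2019-05-01"), (2, "2021-03-15")])

conn.execute("BEGIN")
conn.execute("DELETE FROM orders WHERE order_date < '2020-01-01'")
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1  # row gone inside txn
conn.execute("ROLLBACK")
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 2  # rows restored
```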


Q2. What is a foreign key constraint? What happens on CASCADE DELETE? [2 Marks]

Answer: A foreign key is a column (or set of columns) in a child table that references the primary key of a parent table. It enforces referential integrity, ensuring that every value in the foreign key column has a matching row in the parent table. No orphan records can exist.

-- Parent table
CREATE TABLE customers (
    customer_id     INT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL
);

-- Child table with foreign key referencing parent
CREATE TABLE orders (
    order_id        INT PRIMARY KEY,
    customer_id     INT NOT NULL,
    order_date      DATE,
    amount          DECIMAL(10,2),
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id)
        REFERENCES customers (customer_id)
        ON DELETE CASCADE                   -- behavior on parent row deletion
        ON UPDATE CASCADE                   -- behavior on parent key update
);

ON DELETE behavior options:

| Option | What happens when the parent row is deleted |
|---|---|
| CASCADE | Child rows are automatically deleted with the parent |
| SET NULL | Child foreign key column is set to NULL |
| SET DEFAULT | Child foreign key is set to its default value |
| RESTRICT | Prevents deletion of the parent if child rows exist (error thrown) |
| NO ACTION | Same as RESTRICT in most databases (checked at end of statement) |
-- CASCADE DELETE example:
DELETE FROM customers WHERE customer_id = 101;
-- If ON DELETE CASCADE: all orders WHERE customer_id = 101 are automatically deleted ✅
-- If ON DELETE RESTRICT: ERROR - cannot delete customer while orders exist ❌

-- CASCADE UPDATE example:
UPDATE customers SET customer_id = 999 WHERE customer_id = 101;
-- If ON UPDATE CASCADE: all child orders.customer_id updated from 101 -> 999 automatically ✅

Caution: CASCADE DELETE can cause unintended mass deletions across multiple related tables if the schema has deep chains of cascading relationships. Always verify cascade chains before enabling.
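A quick way to see CASCADE in action is SQLite via sqlite3. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; the table names mirror the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (customer_id INT PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id) ON DELETE CASCADE
)""")
conn.execute("INSERT INTO customers VALUES (101, 'Alice')")
conn.executemany("INSERT INTO orders VALUES (?, 101)", [(1,), (2,)])

conn.execute("DELETE FROM customers WHERE customer_id = 101")
# CASCADE removed the child rows along with the parent:
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 0
```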


Q3. What is database normalization? Explain 1NF, 2NF, and 3NF briefly. [2 Marks]

Answer: Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity by decomposing tables into smaller, well-structured ones connected by relationships. Each "Normal Form" builds on the previous.

1NF - First Normal Form (eliminate repeating groups and ensure atomicity):

-- ❌ Violates 1NF: multiple values in a single column (non-atomic)
-- customer_id | name  | phone_numbers
-- 1           | Alice | '9876543210, 9123456789'   -- two numbers in one column

-- ✅ 1NF compliant: each column holds a single atomic value
CREATE TABLE customer_phones (
    customer_id     INT,
    phone_number    VARCHAR(15),        -- one phone per row
    PRIMARY KEY (customer_id, phone_number)
);
-- Rules: no repeating columns, no multi-valued columns, each row uniquely identifiable

2NF - Second Normal Form (eliminate partial dependencies):

-- ❌ Violates 2NF: non-key columns depend on PART of a composite key
-- order_id | product_id | quantity | product_name | product_price
-- product_name and product_price depend only on product_id, NOT on (order_id, product_id)
-- This is a PARTIAL DEPENDENCY

-- ✅ 2NF compliant: split into two tables
CREATE TABLE order_items (
    order_id        INT,
    product_id      INT,
    quantity        INT,
    PRIMARY KEY (order_id, product_id)  -- no non-key column depends on part of the PK
);
CREATE TABLE products (
    product_id      INT PRIMARY KEY,
    product_name    VARCHAR(100),       -- fully depends on product_id only
    product_price   DECIMAL(10,2)
);
-- Rule: every non-key column must depend on the WHOLE primary key

3NF - Third Normal Form (eliminate transitive dependencies):

-- ❌ Violates 3NF: non-key column depends on another non-key column (transitive dependency)
-- emp_id | name | dept_id | dept_name | dept_location
-- dept_name and dept_location depend on dept_id (non-key), not directly on emp_id

-- ✅ 3NF compliant: extract transitive dependency into its own table
-- (create departments first so the foreign key target exists)
CREATE TABLE departments (
    dept_id         INT PRIMARY KEY,
    dept_name       VARCHAR(100),       -- depends only on dept_id ✅
    dept_location   VARCHAR(100)
);
CREATE TABLE employees (
    emp_id          INT PRIMARY KEY,
    name            VARCHAR(100),
    dept_id         INT,                -- foreign key to departments
    FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);
-- Rule: no non-key column should depend on another non-key column

Q4. What is the difference between a primary key and a unique key? [2 Marks]

Answer:

| Feature | Primary Key | Unique Key |
|---|---|---|
| Purpose | Uniquely identifies each row (the main row identifier) | Enforces uniqueness on a column but is not the main identifier |
| NULL values | ❌ Cannot be NULL, ever | ✅ Can be NULL (most databases allow one or more NULLs) |
| Count per table | Only ONE primary key per table | Multiple unique keys allowed per table |
| Clustered index | ✅ Automatically creates a clustered index (SQL Server) | Creates a non-clustered index by default |
| Foreign key target | ✅ Commonly referenced by foreign keys | ✅ Can also be referenced by foreign keys |
| Implicit index | ✅ Always creates an index automatically | ✅ Always creates an index automatically |
CREATE TABLE employees (
    emp_id          INT             PRIMARY KEY,        -- one PK, NOT NULL, clustered index
    national_id     VARCHAR(20)     UNIQUE,             -- unique, but can be NULL if unknown
    email           VARCHAR(255)    UNIQUE NOT NULL,    -- unique + explicitly not null
    pan_number      VARCHAR(10)     UNIQUE,             -- another unique constraint
    name            VARCHAR(100)    NOT NULL
);

-- emp_id: main row identifier - cannot be NULL, only one per table
-- national_id, email, pan_number: business uniqueness rules - multiple allowed

-- NULL behavior with UNIQUE:
INSERT INTO employees (emp_id, name, national_id) VALUES (1, 'Alice', NULL);
INSERT INTO employees (emp_id, name, national_id) VALUES (2, 'Bob',   NULL);
-- ✅ Both NULLs allowed in UNIQUE column (NULL != NULL in SQL)
-- ❌ Would fail if emp_id was NULL

Q5. What is denormalization? When is it used? [2 Marks]

Answer: Denormalization is the deliberate process of introducing controlled redundancy into a normalized database schema by merging tables, adding duplicate columns, or pre-computing aggregates. It trades data consistency for query performance.

-- Normalized (3NF) schema โ€” requires JOIN to get order total with customer name:
SELECT c.name, o.order_id, SUM(oi.quantity * oi.unit_price) AS total
FROM customers c
JOIN orders o       ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id    = oi.order_id
GROUP BY c.name, o.order_id;           -- 3-table join on every query

-- Denormalized approach โ€” pre-store the total and customer name on orders table:
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);   -- duplicate from customers
ALTER TABLE orders ADD COLUMN order_total   DECIMAL(10,2);  -- pre-computed aggregate

-- Now the query needs no JOIN:
SELECT customer_name, order_id, order_total
FROM orders;                           -- single table read, massively faster ✅

When denormalization is appropriate:

-- 1. OLAP / Data Warehousing (read-heavy, rarely updated)
--    Star schema: fact tables surrounded by dimension tables
--    product_name stored directly in fact_sales (not joined from products)
CREATE TABLE fact_sales (
    sale_id         INT,
    sale_date       DATE,
    product_id      INT,
    product_name    VARCHAR(100),       -- denormalized from products table
    category_name   VARCHAR(100),       -- denormalized from categories table
    amount          DECIMAL(10,2)
);

-- 2. High-traffic reporting dashboards needing sub-second response
-- 3. Pre-aggregated summary tables refreshed on a schedule
CREATE TABLE daily_sales_summary (
    summary_date    DATE PRIMARY KEY,
    total_orders    INT,
    total_revenue   DECIMAL(15,2),
    new_customers   INT
    -- Updated nightly; avoids aggregating millions of rows in real time
);

-- 4. Caching computed values that are expensive to recalculate
ALTER TABLE customers ADD COLUMN lifetime_value DECIMAL(15,2);
-- Recalculated nightly by a batch job

Trade-offs:

| Benefit | Cost |
|---|---|
| Faster reads (fewer JOINs) | Data redundancy: same data stored multiple times |
| Simpler queries | Update anomalies: changing a value requires updating multiple places |
| Better for reporting | Synchronization complexity: keeping duplicated data consistent |

Q6. Write a query to find the nth highest salary in a table using any method. [5 Marks]

Answer:

Assumption: Table employees(emp_id, name, department, salary) exists. Finding the nth highest salary where n is a variable (e.g., n=3 for 3rd highest).

Method 1 - Using DENSE_RANK() window function (recommended: handles ties):

-- Find the 3rd highest salary (change n=3 to any value)
WITH salary_ranked AS (
    SELECT
        emp_id,
        name,
        department,
        salary,
        DENSE_RANK() OVER (
            ORDER BY salary DESC
        )                               AS salary_rank
    FROM employees
)
SELECT
    emp_id,
    name,
    department,
    salary
FROM salary_ranked
WHERE salary_rank = 3;                  -- change 3 to any n

Method 2 - Using LIMIT/OFFSET (MySQL / PostgreSQL; simple but tie-unsafe):

-- nth highest: skip (n-1) rows, take 1
-- For n=3: OFFSET 2
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2;                       -- OFFSET = n-1

-- With full employee details using subquery:
SELECT emp_id, name, salary
FROM employees
WHERE salary = (
    SELECT DISTINCT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 1 OFFSET 2                    -- 3rd highest distinct salary
);

Method 3 - Using a correlated subquery (works in all databases; no window functions needed):

-- Find salary where exactly (n-1) distinct salaries are higher than it
-- For n=3: exactly 2 distinct salaries are higher
SELECT DISTINCT salary AS third_highest_salary
FROM employees e1
WHERE 2 = (                             -- change 2 to (n-1)
    SELECT COUNT(DISTINCT salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);

-- With employee details:
SELECT emp_id, name, department, salary
FROM employees
WHERE salary = (
    SELECT DISTINCT e1.salary
    FROM employees e1
    WHERE 2 = (
        SELECT COUNT(DISTINCT e2.salary)
        FROM employees e2
        WHERE e2.salary > e1.salary
    )
);

Method 4 - Using FETCH NEXT (SQL Server / PostgreSQL standard SQL):

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
OFFSET 2 ROWS                           -- skip (n-1) rows
FETCH NEXT 1 ROWS ONLY;                -- fetch exactly 1 row

Method 5 - Stored procedure for dynamic n (MySQL):

DELIMITER $$
CREATE PROCEDURE sp_nth_highest_salary(IN n INT)
BEGIN
    SET @query = CONCAT(
        'SELECT DISTINCT salary FROM employees ',
        'ORDER BY salary DESC LIMIT 1 OFFSET ', n - 1
    );
    PREPARE stmt FROM @query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;

CALL sp_nth_highest_salary(3);          -- find 3rd highest salary

Method comparison:

| Method | Handles ties | Works across all DBs | Dynamic n |
|---|---|---|---|
| DENSE_RANK() | ✅ Best | ✅ MySQL 8+, PostgreSQL, SQL Server | ✅ Easy |
| LIMIT/OFFSET | ⚠️ Risky | ❌ MySQL/PostgreSQL only | ✅ Easy |
| Correlated subquery | ✅ Yes | ✅ All databases | ✅ Change constant |
| FETCH NEXT | ⚠️ Risky | ✅ SQL Server/PostgreSQL | ✅ Easy |
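Method 1 can be verified quickly in SQLite 3.25+ (which supports window functions). The sample data below is illustrative and includes a tie at the top to show why DENSE_RANK is the safe choice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # window functions need SQLite 3.25+
conn.execute("CREATE TABLE employees (emp_id INT, name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "A", 90000), (2, "B", 90000),   # tie at the top shares rank 1
    (3, "C", 80000), (4, "D", 70000), (5, "E", 60000),
])

n = 3
rows = conn.execute("""
    WITH salary_ranked AS (
        SELECT name, salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
        FROM employees
    )
    SELECT name, salary FROM salary_ranked WHERE salary_rank = ?
""", (n,)).fetchall()

# DENSE_RANK treats the 90000 tie as one rank, so the 3rd highest salary is 70000.
assert rows == [("D", 70000)]
```

A naive LIMIT 1 OFFSET 2 without DISTINCT would have returned 80000 here, which is why the tie-handling column of the comparison table matters.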

Q7. Given a 'logins' table (user_id, login_date), write a query to find users who logged in on 3 or more consecutive days. [5 Marks]

Answer:

Assumption: Table logins(login_id, user_id, login_date DATE) exists. Multiple logins per day per user are possible and should be deduplicated.

Method 1 - Using LAG() to detect consecutive day gaps:

WITH deduped_logins AS (
    -- Step 1: One row per user per day (remove same-day duplicates)
    SELECT DISTINCT
        user_id,
        login_date
    FROM logins
),
login_with_gaps AS (
    -- Step 2: Calculate days since previous login for each user
    SELECT
        user_id,
        login_date,
        LAG(login_date) OVER (
            PARTITION BY user_id
            ORDER BY login_date
        )                                           AS prev_login_date,
        -- Is this login exactly 1 day after the previous? (consecutive)
        -- Note: date - date yields integer days in PostgreSQL; in MySQL use
        -- DATEDIFF(login_date, LAG(login_date) OVER (...)) instead
        CASE
            WHEN login_date - LAG(login_date) OVER (
                PARTITION BY user_id ORDER BY login_date
            ) = 1
            THEN 0                                  -- consecutive day: same group
            ELSE 1                                  -- gap: new group starts
        END                                         AS is_new_group
    FROM deduped_logins
),
group_assigned AS (
    -- Step 3: Assign a group number to each consecutive run
    SELECT
        user_id,
        login_date,
        SUM(is_new_group) OVER (
            PARTITION BY user_id
            ORDER BY login_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        )                                           AS consecutive_group
    FROM login_with_gaps
),
consecutive_counts AS (
    -- Step 4: Count days in each consecutive group
    SELECT
        user_id,
        consecutive_group,
        MIN(login_date)                             AS streak_start,
        MAX(login_date)                             AS streak_end,
        COUNT(*)                                    AS consecutive_days
    FROM group_assigned
    GROUP BY user_id, consecutive_group
)
-- Step 5: Return users with at least one streak of 3+ consecutive days
-- (no join back to logins needed: consecutive_counts already has everything)
SELECT
    user_id,
    streak_start,
    streak_end,
    consecutive_days
FROM consecutive_counts
WHERE consecutive_days >= 3
ORDER BY consecutive_days DESC, user_id;

Method 2 - Using date subtraction and ROW_NUMBER (the elegant gaps-and-islands technique):

WITH deduped_logins AS (
    SELECT DISTINCT user_id, login_date FROM logins
),
date_groups AS (
    -- Subtracting ROW_NUMBER from login_date groups consecutive dates together
    -- Consecutive dates always produce the same (login_date - ROW_NUMBER) value
    SELECT
        user_id,
        login_date,
        login_date - INTERVAL (ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY login_date
        )) DAY                                      AS date_group
        -- PostgreSQL: login_date - ROW_NUMBER() OVER (...) * INTERVAL '1 day'
        -- SQL Server: DATEADD(DAY, -ROW_NUMBER() OVER (...), login_date)
    FROM deduped_logins
),
streaks AS (
    SELECT
        user_id,
        date_group,
        MIN(login_date)                             AS streak_start,
        MAX(login_date)                             AS streak_end,
        COUNT(*)                                    AS streak_length
    FROM date_groups
    GROUP BY user_id, date_group
)
SELECT
    user_id,
    streak_start,
    streak_end,
    streak_length                                   AS consecutive_days
FROM streaks
WHERE streak_length >= 3
ORDER BY streak_length DESC, user_id;

Why the gaps-and-islands trick works:

User 1 logins: Jan 1, Jan 2, Jan 3, Jan 5, Jan 6
ROW_NUMBER:         1,     2,     3,     4,     5
login - RN:    Dec31, Dec31, Dec31, Jan 1, Jan 1   <- same date = same group
Groups formed: [Jan1, Jan2, Jan3] = 3 consecutive ✅
               [Jan5, Jan6]       = 2 consecutive ✗
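The trick can be reproduced in SQLite via sqlite3, using julianday() to turn dates into day numbers for the subtraction. The sample data mirrors the diagram above, plus a user with no streak:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INT, login_date TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),  # 3-day streak
    (1, "2024-01-05"), (1, "2024-01-06"),                     # only 2 days
    (2, "2024-01-01"), (2, "2024-01-03"),                     # no streak
])

rows = conn.execute("""
    WITH deduped AS (SELECT DISTINCT user_id, login_date FROM logins),
    date_groups AS (
        SELECT user_id, login_date,
               -- julianday() gives a day number, so consecutive dates share a group
               julianday(login_date) - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date
               ) AS date_group
        FROM deduped
    )
    SELECT user_id, MIN(login_date), MAX(login_date), COUNT(*) AS streak
    FROM date_groups
    GROUP BY user_id, date_group
    HAVING streak >= 3
""").fetchall()

assert rows == [(1, "2024-01-01", "2024-01-03", 3)]   # only user 1's Jan 1-3 run qualifies
```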

Q8. Write a query to find duplicate records in a 'customers' table based on email, keeping only the earliest record. [5 Marks]

Answer:

Assumption: Table customers(customer_id, name, email, created_at, phone) exists. "Earliest" = smallest created_at timestamp. When created_at is equal, keep the smallest customer_id.

Step 1 - Identify duplicates:

-- First, see which emails have duplicates and how many
SELECT
    email,
    COUNT(*)                                        AS duplicate_count,
    MIN(created_at)                                 AS earliest_record,
    MAX(created_at)                                 AS latest_record
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

Step 2 - View all duplicate rows with their rank:

WITH ranked_customers AS (
    SELECT
        customer_id,
        name,
        email,
        created_at,
        ROW_NUMBER() OVER (
            PARTITION BY email              -- group duplicates by email
            ORDER BY
                created_at  ASC,           -- keep earliest first
                customer_id ASC            -- tiebreak by smallest ID
        )                                   AS rn
    FROM customers
)
SELECT
    customer_id,
    name,
    email,
    created_at,
    rn,
    CASE WHEN rn = 1 THEN 'KEEP' ELSE 'DUPLICATE - DELETE' END AS action
FROM ranked_customers
WHERE email IN (                            -- only show emails with duplicates
    SELECT email FROM customers
    GROUP BY email HAVING COUNT(*) > 1
)
ORDER BY email, rn;

Step 3 - Delete duplicates, keep earliest:

-- Method 1: DELETE using CTE (PostgreSQL / SQL Server)
WITH ranked_customers AS (
    SELECT
        customer_id,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at ASC, customer_id ASC
        )                                   AS rn
    FROM customers
)
DELETE FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM ranked_customers
    WHERE rn > 1                            -- delete all but the first (rn=1)
);

-- Method 2: DELETE using self-join (MySQL compatible)
DELETE c2
FROM customers c1
JOIN customers c2
    ON  c1.email        = c2.email          -- same email
    AND (
        c1.created_at < c2.created_at       -- c1 is older -> delete c2
        OR (
            c1.created_at = c2.created_at   -- same date -> keep lower ID
            AND c1.customer_id < c2.customer_id
        )
    );

-- Method 3: Keep only records in the "keep" set (safe for all databases)
DELETE FROM customers
WHERE customer_id NOT IN (
    SELECT MIN(customer_id)                 -- keep only earliest customer_id per email
    FROM customers
    GROUP BY email
);
-- โš ๏ธ Note: This keeps the smallest customer_id, not necessarily earliest created_at
-- For created_at based deduplication, use Method 1 or 2

Step 4 - Prevent future duplicates:

-- Add unique constraint after deduplication
ALTER TABLE customers
ADD CONSTRAINT uq_customers_email UNIQUE (email);
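The whole flow (dedupe keeping the earliest created_at) can be rehearsed in SQLite before touching production data. This sketch applies the ROW_NUMBER approach from Method 1; the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INT PRIMARY KEY, email TEXT, created_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "a@x.com", "2024-02-01"),   # later duplicate -> delete
    (2, "a@x.com", "2024-01-15"),   # earliest a@x.com -> keep
    (3, "b@x.com", "2024-03-01"),   # unique email -> keep
])

# Keep only the rn = 1 row per email (earliest created_at, ties broken by id)
conn.execute("""
    DELETE FROM customers
    WHERE customer_id NOT IN (
        SELECT customer_id FROM (
            SELECT customer_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY created_at ASC, customer_id ASC
                   ) AS rn
            FROM customers
        ) WHERE rn = 1
    )
""")
survivors = [r[0] for r in conn.execute("SELECT customer_id FROM customers ORDER BY customer_id")]
assert survivors == [2, 3]   # customer 1 (later duplicate) was removed
```

Note that the survivor for a@x.com is customer 2, not the lower-id customer 1, confirming the dedupe really keys on created_at rather than on id.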

Q9. Given an 'orders' table (order_id, customer_id, order_date, amount), write a query to calculate monthly retention: the percentage of customers from each signup month who made a purchase in subsequent months. [10 Marks]

Answer:

Assumption: The "signup month" is derived as each customer's first ever order month. Retention = for each cohort (month of first order), what % of those customers placed at least one order in each subsequent month.

WITH customer_first_order AS (
    -- Step 1: Find each customer's cohort month (month of first order)
    SELECT
        customer_id,
        DATE_FORMAT(MIN(order_date), '%Y-%m')   AS cohort_month,
        MIN(order_date)                         AS first_order_date
    FROM orders
    GROUP BY customer_id
),
cohort_sizes AS (
    -- Step 2: Count total customers in each cohort
    SELECT
        cohort_month,
        COUNT(DISTINCT customer_id)             AS cohort_size
    FROM customer_first_order
    GROUP BY cohort_month
),
customer_activity AS (
    -- Step 3: For each customer, find every month they placed an order
    SELECT DISTINCT
        o.customer_id,
        cfo.cohort_month,
        DATE_FORMAT(o.order_date, '%Y-%m')      AS activity_month
    FROM orders o
    JOIN customer_first_order cfo
        ON o.customer_id = cfo.customer_id
),
cohort_activity AS (
    -- Step 4: Calculate how many months after cohort month each activity occurred
    SELECT
        ca.cohort_month,
        ca.activity_month,
        -- Period number: 0 = cohort month itself, 1 = one month later, etc.
        PERIOD_DIFF(
            REPLACE(ca.activity_month, '-', ''),
            REPLACE(ca.cohort_month,   '-', '')
        )                                       AS months_since_cohort,
        COUNT(DISTINCT ca.customer_id)          AS retained_customers
    FROM customer_activity ca
    GROUP BY
        ca.cohort_month,
        ca.activity_month
),
retention_pct AS (
    -- Step 5: Join with cohort sizes to compute retention percentage
    SELECT
        ca.cohort_month,
        cs.cohort_size,
        ca.activity_month,
        ca.months_since_cohort,
        ca.retained_customers,
        ROUND(
            ca.retained_customers * 100.0 / cs.cohort_size, 1
        )                                       AS retention_rate_pct
    FROM cohort_activity ca
    JOIN cohort_sizes cs
        ON ca.cohort_month = cs.cohort_month
    WHERE ca.months_since_cohort >= 0           -- exclude activity before cohort month
)
-- Step 6: Final retention table output
SELECT
    cohort_month,
    cohort_size,
    months_since_cohort                         AS month_number,
    activity_month,
    retained_customers,
    retention_rate_pct,
    -- Visual bar for quick reading
    CONCAT(
        REPEAT('█', FLOOR(retention_rate_pct / 10)),
        ' ',
        retention_rate_pct, '%'
    )                                           AS retention_bar
FROM retention_pct
ORDER BY cohort_month ASC, months_since_cohort ASC;

PostgreSQL version (cleaner date arithmetic):

WITH customer_first_order AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(order_date))::DATE  AS cohort_month
    FROM orders
    GROUP BY customer_id
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM customer_first_order
    GROUP BY cohort_month
),
customer_activity AS (
    SELECT DISTINCT
        o.customer_id,
        cfo.cohort_month,
        DATE_TRUNC('month', o.order_date)::DATE     AS activity_month
    FROM orders o
    JOIN customer_first_order cfo ON o.customer_id = cfo.customer_id
),
retention AS (
    SELECT
        ca.cohort_month,
        cs.cohort_size,
        ca.activity_month,
        -- Number of months since cohort start
        EXTRACT(YEAR FROM AGE(ca.activity_month, ca.cohort_month)) * 12
        + EXTRACT(MONTH FROM AGE(ca.activity_month, ca.cohort_month))
                                                    AS months_since_cohort,
        COUNT(DISTINCT ca.customer_id)              AS retained_customers
    FROM customer_activity ca
    JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
    GROUP BY ca.cohort_month, cs.cohort_size, ca.activity_month
)
SELECT
    TO_CHAR(cohort_month, 'YYYY-MM')            AS cohort,
    cohort_size,
    months_since_cohort::INT                    AS month_num,
    TO_CHAR(activity_month, 'YYYY-MM')          AS period,
    retained_customers,
    ROUND(retained_customers * 100.0 / cohort_size, 1) AS retention_pct
FROM retention
WHERE months_since_cohort >= 0
ORDER BY cohort_month, months_since_cohort;

Sample output (cohort retention table):

| cohort | cohort_size | month_num | retained_customers | retention_pct |
|---|---|---|---|---|
| 2024-01 | 500 | 0 | 500 | 100.0% |
| 2024-01 | 500 | 1 | 325 | 65.0% |
| 2024-01 | 500 | 2 | 240 | 48.0% |
| 2024-01 | 500 | 3 | 195 | 39.0% |
| 2024-02 | 620 | 0 | 620 | 100.0% |
| 2024-02 | 620 | 1 | 410 | 66.1% |

Pivoted view (cohort × month matrix):

-- MySQL: pivot retention into a matrix using conditional aggregation
-- (run together with the retention_pct CTE defined in the query above)
SELECT
    cohort_month,
    cohort_size,
    MAX(CASE WHEN months_since_cohort = 0 THEN CONCAT(retention_rate_pct,'%') END) AS M0,
    MAX(CASE WHEN months_since_cohort = 1 THEN CONCAT(retention_rate_pct,'%') END) AS M1,
    MAX(CASE WHEN months_since_cohort = 2 THEN CONCAT(retention_rate_pct,'%') END) AS M2,
    MAX(CASE WHEN months_since_cohort = 3 THEN CONCAT(retention_rate_pct,'%') END) AS M3,
    MAX(CASE WHEN months_since_cohort = 4 THEN CONCAT(retention_rate_pct,'%') END) AS M4,
    MAX(CASE WHEN months_since_cohort = 5 THEN CONCAT(retention_rate_pct,'%') END) AS M5,
    MAX(CASE WHEN months_since_cohort = 6 THEN CONCAT(retention_rate_pct,'%') END) AS M6
FROM retention_pct
GROUP BY cohort_month, cohort_size
ORDER BY cohort_month;
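A scaled-down version of the cohort logic can be checked in SQLite, using strftime('%Y-%m', ...) in place of DATE_FORMAT / DATE_TRUNC. The sample data (two customers in the January cohort, one of whom returns in February) is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, "2024-01-05"), (2, 2, "2024-01-20"),   # Jan cohort: customers 1 and 2
    (3, 1, "2024-02-03"),                         # only customer 1 returns in Feb
])

rows = conn.execute("""
    WITH first_order AS (
        -- cohort month = month of each customer's first order
        SELECT customer_id, MIN(strftime('%Y-%m', order_date)) AS cohort_month
        FROM orders GROUP BY customer_id
    ),
    cohort_sizes AS (
        SELECT cohort_month, COUNT(*) AS cohort_size
        FROM first_order GROUP BY cohort_month
    ),
    activity AS (
        SELECT DISTINCT o.customer_id, f.cohort_month,
               strftime('%Y-%m', o.order_date) AS activity_month
        FROM orders o JOIN first_order f ON o.customer_id = f.customer_id
    )
    SELECT a.cohort_month, a.activity_month,
           COUNT(*) AS retained,
           ROUND(COUNT(*) * 100.0 / cs.cohort_size, 1) AS retention_pct
    FROM activity a
    JOIN cohort_sizes cs ON a.cohort_month = cs.cohort_month
    GROUP BY a.cohort_month, a.activity_month, cs.cohort_size
    ORDER BY a.cohort_month, a.activity_month
""").fetchall()

# Month 0 retention is 100% by construction; one of two Jan customers returns in Feb.
assert rows == [("2024-01", "2024-01", 2, 100.0), ("2024-01", "2024-02", 1, 50.0)]
```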

Q10. Design a database schema for an e-commerce platform. List at least 5 tables, their columns, and the relationships between them. [5 Marks]

Answer:

Schema overview: 8 core tables:

-- ============================================================
-- TABLE 1: CUSTOMERS
-- Stores registered user/buyer information
-- ============================================================
CREATE TABLE customers (
    customer_id         INT             PRIMARY KEY AUTO_INCREMENT,
    email               VARCHAR(255)    NOT NULL UNIQUE,
    password_hash       VARCHAR(255)    NOT NULL,
    first_name          VARCHAR(100)    NOT NULL,
    last_name           VARCHAR(100)    NOT NULL,
    phone               VARCHAR(20),
    date_of_birth       DATE,
    created_at          DATETIME        DEFAULT CURRENT_TIMESTAMP,
    updated_at          DATETIME        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    is_active           BOOLEAN         DEFAULT TRUE
);

-- ============================================================
-- TABLE 2: ADDRESSES
-- Customer shipping/billing addresses (one customer -> many addresses)
-- ============================================================
CREATE TABLE addresses (
    address_id          INT             PRIMARY KEY AUTO_INCREMENT,
    customer_id         INT             NOT NULL,
    address_type        ENUM('shipping','billing','both') DEFAULT 'shipping',
    street_line1        VARCHAR(255)    NOT NULL,
    street_line2        VARCHAR(255),
    city                VARCHAR(100)    NOT NULL,
    state               VARCHAR(100)    NOT NULL,
    postal_code         VARCHAR(20)     NOT NULL,
    country_code        CHAR(2)         NOT NULL,               -- ISO 3166-1 alpha-2
    is_default          BOOLEAN         DEFAULT FALSE,
    CONSTRAINT fk_addresses_customer
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        ON DELETE CASCADE
);

-- ============================================================
-- TABLE 3: CATEGORIES
-- Hierarchical product categories (self-referencing for subcategories)
-- ============================================================
CREATE TABLE categories (
    category_id         INT             PRIMARY KEY AUTO_INCREMENT,
    parent_category_id  INT             NULL,                   -- NULL = top-level category
    category_name       VARCHAR(100)    NOT NULL,
    slug                VARCHAR(100)    NOT NULL UNIQUE,        -- URL-friendly name
    description         TEXT,
    is_active           BOOLEAN         DEFAULT TRUE,
    CONSTRAINT fk_categories_parent
        FOREIGN KEY (parent_category_id) REFERENCES categories(category_id)
        ON DELETE SET NULL
);

-- ============================================================
-- TABLE 4: PRODUCTS
-- Product catalog
-- ============================================================
CREATE TABLE products (
    product_id          INT             PRIMARY KEY AUTO_INCREMENT,
    category_id         INT             NOT NULL,
    product_name        VARCHAR(255)    NOT NULL,
    slug                VARCHAR(255)    NOT NULL UNIQUE,
    description         TEXT,
    base_price          DECIMAL(12,2)   NOT NULL,
    sku                 VARCHAR(100)    NOT NULL UNIQUE,        -- Stock Keeping Unit
    stock_quantity      INT             NOT NULL DEFAULT 0,
    weight_kg           DECIMAL(6,3),
    is_active           BOOLEAN         DEFAULT TRUE,
    created_at          DATETIME        DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_products_category
        FOREIGN KEY (category_id) REFERENCES categories(category_id),
    CONSTRAINT chk_price CHECK (base_price >= 0),
    CONSTRAINT chk_stock CHECK (stock_quantity >= 0)
);

-- ============================================================
-- TABLE 5: ORDERS
-- Customer order header: one row per order
-- ============================================================
CREATE TABLE orders (
    order_id            INT             PRIMARY KEY AUTO_INCREMENT,
    customer_id         INT             NOT NULL,
    shipping_address_id INT             NOT NULL,
    order_date          DATETIME        DEFAULT CURRENT_TIMESTAMP,
    status              ENUM('pending','confirmed','processing',
                             'shipped','delivered','cancelled','refunded')
                                        DEFAULT 'pending',
    subtotal_amount     DECIMAL(12,2)   NOT NULL,
    discount_amount     DECIMAL(12,2)   DEFAULT 0.00,
    tax_amount          DECIMAL(12,2)   DEFAULT 0.00,
    shipping_amount     DECIMAL(12,2)   DEFAULT 0.00,
    total_amount        DECIMAL(12,2)   NOT NULL,
    payment_status      ENUM('unpaid','paid','refunded') DEFAULT 'unpaid',
    notes               TEXT,
    updated_at          DATETIME        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    CONSTRAINT fk_orders_address
        FOREIGN KEY (shipping_address_id) REFERENCES addresses(address_id)
);

-- ============================================================
-- TABLE 6: ORDER_ITEMS
-- Line items within each order: one row per product per order
-- ============================================================
CREATE TABLE order_items (
    order_item_id       INT             PRIMARY KEY AUTO_INCREMENT,
    order_id            INT             NOT NULL,
    product_id          INT             NOT NULL,
    quantity            INT             NOT NULL,
    unit_price          DECIMAL(12,2)   NOT NULL,               -- price AT TIME of purchase
    discount_pct        DECIMAL(5,2)    DEFAULT 0.00,
    line_total          DECIMAL(12,2)   NOT NULL,               -- quantity * unit_price * (1 - discount)
    CONSTRAINT fk_order_items_order
        FOREIGN KEY (order_id) REFERENCES orders(order_id)
        ON DELETE CASCADE,
    CONSTRAINT fk_order_items_product
        FOREIGN KEY (product_id) REFERENCES products(product_id),
    CONSTRAINT chk_quantity CHECK (quantity > 0)
);

-- ============================================================
-- TABLE 7: PAYMENTS
-- Payment records per order (supports partial payments / refunds)
-- ============================================================
CREATE TABLE payments (
    payment_id          INT             PRIMARY KEY AUTO_INCREMENT,
    order_id            INT             NOT NULL,
    payment_method      ENUM('credit_card','debit_card','upi',
                             'net_banking','wallet','cod') NOT NULL,
    payment_gateway     VARCHAR(50),                            -- 'Razorpay', 'Stripe', etc.
    gateway_txn_id      VARCHAR(255)    UNIQUE,                 -- external transaction reference
    amount              DECIMAL(12,2)   NOT NULL,
    currency            CHAR(3)         DEFAULT 'INR',
    status              ENUM('initiated','success','failed','refunded') DEFAULT 'initiated',
    paid_at             DATETIME,
    created_at          DATETIME        DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_payments_order
        FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

-- ============================================================
-- TABLE 8: REVIEWS
-- Product reviews by customers (only for purchased products)
-- ============================================================
CREATE TABLE reviews (
    review_id           INT             PRIMARY KEY AUTO_INCREMENT,
    product_id          INT             NOT NULL,
    customer_id         INT             NOT NULL,
    order_id            INT             NOT NULL,               -- must have purchased to review
    rating              TINYINT         NOT NULL,
    title               VARCHAR(255),
    body                TEXT,
    is_verified         BOOLEAN         DEFAULT FALSE,
    is_approved         BOOLEAN         DEFAULT TRUE,
    created_at          DATETIME        DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_reviews_product
        FOREIGN KEY (product_id) REFERENCES products(product_id),
    CONSTRAINT fk_reviews_customer
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    CONSTRAINT fk_reviews_order
        FOREIGN KEY (order_id) REFERENCES orders(order_id),
    CONSTRAINT chk_rating CHECK (rating BETWEEN 1 AND 5),
    CONSTRAINT uq_one_review_per_purchase
        UNIQUE (customer_id, product_id, order_id)             -- one review per purchase
);
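The CHECK and UNIQUE constraints above are easiest to understand by watching them reject bad rows. A minimal sketch using Python's sqlite3 as a stand-in for MySQL (the table is trimmed to just the constrained columns; `try_insert` is an illustrative helper, not part of the schema):

```python
import sqlite3

# Minimal SQLite sketch (trimmed stand-in for the MySQL reviews table)
# showing the rating CHECK and one-review-per-purchase UNIQUE constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        review_id   INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        order_id    INTEGER NOT NULL,
        rating      INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
        UNIQUE (customer_id, product_id, order_id)   -- one review per purchase
    )
""")

def try_insert(customer_id, product_id, order_id, rating):
    """Attempt an insert; return None on success or the error message."""
    try:
        conn.execute(
            "INSERT INTO reviews (customer_id, product_id, order_id, rating) "
            "VALUES (?, ?, ?, ?)",
            (customer_id, product_id, order_id, rating),
        )
        return None
    except sqlite3.IntegrityError as e:
        return str(e)

print(try_insert(1, 10, 100, 5))   # None: a valid review
print(try_insert(1, 11, 100, 6))   # CHECK violation: rating must be 1-5
print(try_insert(1, 10, 100, 4))   # UNIQUE violation: duplicate purchase review
```

MySQL 8.0.16+ enforces CHECK constraints the same way; earlier MySQL versions parse but silently ignore them.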

Entity Relationship summary:

customers   (1) ─────────── (many) addresses
customers   (1) ─────────── (many) orders
customers   (1) ─────────── (many) reviews
orders      (1) ─────────── (many) order_items
orders      (1) ─────────── (many) payments
orders      (many) ──────── (1)    addresses     [shipping address]
order_items (many) ──────── (1)    products
products    (many) ──────── (1)    categories
categories  (many) ──────── (1)    categories    [self-referencing parent]
reviews     (many) ──────── (1)    products

Key indexes for query performance:

CREATE INDEX idx_orders_customer     ON orders      (customer_id, order_date);
CREATE INDEX idx_orders_status       ON orders      (status);
CREATE INDEX idx_order_items_order   ON order_items (order_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);
CREATE INDEX idx_products_category   ON products    (category_id);
CREATE INDEX idx_reviews_product     ON reviews     (product_id);
CREATE INDEX idx_payments_order      ON payments    (order_id);
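A quick sanity check on the composite index idx_orders_customer (customer_id, order_date): because order_date is the second key column, the index can serve both the customer filter and the date sort in one pass. A SQLite sketch (using SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, with the orders table trimmed):

```python
import sqlite3

# SQLite sketch: the composite index serves WHERE customer_id = ?
# AND the ORDER BY order_date without a separate sort step.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INT, order_date TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id, order_date)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders WHERE customer_id = 42 ORDER BY order_date"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
print(plan_text)
# The plan names idx_orders_customer and shows no TEMP B-TREE sort:
# rows come out of the index already ordered by order_date.
```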

Q11. Case Study: Insurance Company Churn Risk Analysis [10 Marks]

(a) What data would you pull, and from which tables? [3 Marks]

Answer:

Key tables and data required:

-- TABLE 1: customers / policyholders
-- Columns: customer_id, name, age, city, join_date, segment, agent_id

-- TABLE 2: policies
-- Columns: policy_id, customer_id, policy_type, start_date, end_date,
--          premium_amount, payment_frequency, status (active/lapsed/cancelled)

-- TABLE 3: premium_payments
-- Columns: payment_id, policy_id, customer_id, due_date, paid_date,
--          amount_due, amount_paid, payment_status (paid/late/missed)

-- TABLE 4: claims
-- Columns: claim_id, policy_id, customer_id, claim_date, claim_amount,
--          claim_status (approved/rejected/pending), settlement_date

-- TABLE 5: customer_interactions
-- Columns: interaction_id, customer_id, interaction_date, channel,
--          interaction_type (complaint/query/renewal_discussion), resolved

-- TABLE 6: policy_renewals
-- Columns: renewal_id, policy_id, customer_id, renewal_due_date,
--          renewal_date, renewal_status (renewed/not_renewed/pending)

-- Churn indicators to extract:
-- ✅ Missed/late premium payments       → premium_payments
-- ✅ Lapsed or cancelled policies       → policies
-- ✅ No renewal after policy expiry     → policy_renewals
-- ✅ High complaint frequency           → customer_interactions
-- ✅ Claim rejection history            → claims
-- ✅ Days since last interaction        → customer_interactions
-- ✅ Policy age and years as customer   → policies, customers

(b) Write the key SQL queries to calculate churn risk indicators. [4 Marks]

Answer:

-- ============================================================
-- INDICATOR 1: Missed payment rate (% of premiums missed in last 12 months)
-- High missed rate = high churn risk
-- ============================================================
WITH payment_health AS (
    SELECT
        customer_id,
        COUNT(*)                                            AS total_payments_due,
        SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
                                                            AS missed_payments,
        SUM(CASE WHEN payment_status = 'late'   THEN 1 ELSE 0 END)
                                                            AS late_payments,
        ROUND(
            SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
            * 100.0 / COUNT(*), 1
        )                                                   AS missed_rate_pct
    FROM premium_payments
    WHERE due_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
    GROUP BY customer_id
)
SELECT * FROM payment_health WHERE missed_rate_pct > 20
ORDER BY missed_rate_pct DESC;
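The SUM(CASE WHEN ... THEN 1 ELSE 0 END) idiom above is conditional aggregation: it counts rows matching a condition within a single grouped scan. A runnable SQLite sketch of the same pattern (table trimmed to two columns, date filter omitted):

```python
import sqlite3

# SQLite sketch of conditional aggregation: count 'missed' rows and
# compute a missed-rate percentage in one GROUP BY pass.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE premium_payments (customer_id INT, payment_status TEXT)")
conn.executemany(
    "INSERT INTO premium_payments VALUES (?, ?)",
    [(1, "paid"), (1, "missed"), (1, "late"), (1, "paid"),
     (2, "paid"), (2, "paid")],
)
rows = conn.execute("""
    SELECT customer_id,
           COUNT(*)                                                   AS total_due,
           SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END) AS missed,
           ROUND(SUM(CASE WHEN payment_status = 'missed' THEN 1 ELSE 0 END)
                 * 100.0 / COUNT(*), 1)                               AS missed_rate_pct
    FROM premium_payments
    GROUP BY customer_id
""").fetchall()
print(rows)  # [(1, 4, 1, 25.0), (2, 2, 0, 0.0)]
```

The `* 100.0` multiplier forces decimal arithmetic; with `* 100` alone, integer division would truncate the percentage to 0 on most engines.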

-- ============================================================
-- INDICATOR 2: Days since last interaction (disengaged customers)
-- No contact in 6+ months = silent churn risk
-- ============================================================
WITH last_interaction AS (
    SELECT
        customer_id,
        MAX(interaction_date)                               AS last_contact_date,
        DATEDIFF(CURDATE(), MAX(interaction_date))          AS days_since_contact,
        COUNT(CASE WHEN interaction_type = 'complaint'
                   THEN 1 END)                              AS complaint_count
    FROM customer_interactions
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.name,
    li.last_contact_date,
    li.days_since_contact,
    li.complaint_count,
    CASE
        WHEN li.days_since_contact > 365 THEN 'Critical - No contact 1 year+'
        WHEN li.days_since_contact > 180 THEN 'High - No contact 6 months+'
        WHEN li.complaint_count     > 3  THEN 'Medium - Multiple complaints'
        ELSE 'Low Risk'
    END                                                     AS engagement_risk
FROM customers c
LEFT JOIN last_interaction li ON c.customer_id = li.customer_id
WHERE li.days_since_contact > 180 OR li.complaint_count > 3
ORDER BY li.days_since_contact DESC;

-- ============================================================
-- INDICATOR 3: Renewal failure rate (policies not renewed after expiry)
-- ============================================================
WITH renewal_behavior AS (
    SELECT
        p.customer_id,
        COUNT(pr.renewal_id)                                AS policies_up_for_renewal,
        SUM(CASE WHEN pr.renewal_status = 'renewed'
                 THEN 1 ELSE 0 END)                         AS renewals_completed,
        SUM(CASE WHEN pr.renewal_status = 'not_renewed'
                 THEN 1 ELSE 0 END)                         AS renewals_missed,
        ROUND(
            SUM(CASE WHEN pr.renewal_status = 'not_renewed' THEN 1 ELSE 0 END)
            * 100.0 / NULLIF(COUNT(pr.renewal_id), 0), 1
        )                                                   AS non_renewal_rate_pct
    FROM policies p
    JOIN policy_renewals pr ON p.policy_id = pr.policy_id
    WHERE pr.renewal_due_date <= CURDATE()                  -- past renewal due dates
    GROUP BY p.customer_id
)
SELECT * FROM renewal_behavior
WHERE non_renewal_rate_pct >= 50                            -- missed half or more renewals
ORDER BY non_renewal_rate_pct DESC;
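The NULLIF(COUNT(...), 0) guard above is the standard defence against division by zero: dividing by NULL yields NULL, so a customer with zero renewals due gets a NULL rate rather than an error or a misleading 0. A tiny SQLite demonstration (MySQL behaves similarly, returning NULL with a warning by default for 1/0):

```python
import sqlite3

# Sketch of the NULLIF guard: x / NULLIF(0, 0) becomes x / NULL, which is
# NULL, so the query survives groups with a zero denominator.
conn = sqlite3.connect(":memory:")
guarded = conn.execute("SELECT 3 * 100.0 / NULLIF(0, 0)").fetchone()[0]
normal  = conn.execute("SELECT 3 * 100.0 / NULLIF(6, 0)").fetchone()[0]
print(guarded)  # None: NULL propagates quietly instead of erroring
print(normal)   # 50.0
```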

-- ============================================================
-- INDICATOR 4: COMPOSITE CHURN RISK SCORE (0-100)
-- Combines all signals into one actionable score per customer
-- ============================================================
WITH payment_risk AS (
    SELECT customer_id,
           ROUND(SUM(CASE WHEN payment_status='missed' THEN 1 ELSE 0 END)
                 * 100.0 / COUNT(*), 1)     AS missed_pct
    FROM premium_payments
    WHERE due_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
    GROUP BY customer_id
),
engagement_risk AS (
    SELECT customer_id,
           DATEDIFF(CURDATE(), MAX(interaction_date)) AS days_silent,
           COUNT(CASE WHEN interaction_type='complaint' THEN 1 END) AS complaints
    FROM customer_interactions
    GROUP BY customer_id
),
renewal_risk AS (
    SELECT p.customer_id,
           ROUND(SUM(CASE WHEN pr.renewal_status='not_renewed' THEN 1 ELSE 0 END)
                 * 100.0 / NULLIF(COUNT(*),0), 1) AS non_renewal_pct
    FROM policies p
    JOIN policy_renewals pr ON p.policy_id = pr.policy_id
    WHERE pr.renewal_due_date <= CURDATE()
    GROUP BY p.customer_id
)
SELECT
    c.customer_id,
    c.name,
    COALESCE(pr.missed_pct, 0)                              AS payment_risk_score,
    COALESCE(er.days_silent, 0)                             AS days_silent,
    COALESCE(er.complaints, 0)                              AS complaint_count,
    COALESCE(rr.non_renewal_pct, 0)                         AS renewal_risk_score,
    -- Composite score: weighted combination of all risk factors (max 100)
    LEAST(100, ROUND(
        COALESCE(pr.missed_pct, 0)       * 0.35 +  -- 35% weight: payment behaviour
        LEAST(COALESCE(er.days_silent, 0) / 365 * 100, 100) * 0.25 +  -- 25%: engagement
        COALESCE(rr.non_renewal_pct, 0)  * 0.25 +  -- 25%: renewal behaviour
        LEAST(COALESCE(er.complaints, 0) * 10, 100) * 0.15  -- 15%: complaints
    , 1))                                                   AS churn_risk_score,
    CASE
        -- MySQL cannot reference the churn_risk_score alias within the same
        -- SELECT, so the scoring expression must be repeated inside the CASE
        WHEN LEAST(100, ROUND(
                COALESCE(pr.missed_pct,0)*0.35 +
                LEAST(COALESCE(er.days_silent,0)/365*100,100)*0.25 +
                COALESCE(rr.non_renewal_pct,0)*0.25 +
                LEAST(COALESCE(er.complaints,0)*10,100)*0.15, 1))
             >= 70 THEN '🔴 High Risk - Immediate Action'
        WHEN LEAST(100, ROUND(
                COALESCE(pr.missed_pct,0)*0.35 +
                LEAST(COALESCE(er.days_silent,0)/365*100,100)*0.25 +
                COALESCE(rr.non_renewal_pct,0)*0.25 +
                LEAST(COALESCE(er.complaints,0)*10,100)*0.15, 1))
             >= 40 THEN '🟡 Medium Risk - Monitor Closely'
        ELSE             '🟢 Low Risk - Retain'
    END                                                     AS risk_segment
FROM customers c
LEFT JOIN payment_risk    pr ON c.customer_id = pr.customer_id
LEFT JOIN engagement_risk er ON c.customer_id = er.customer_id
LEFT JOIN renewal_risk    rr ON c.customer_id = rr.customer_id
ORDER BY churn_risk_score DESC;
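Before burying the weights in SQL, it helps to sanity-check the formula in isolation. A hypothetical Python mirror of the composite-score expression above (the function name is illustrative, not part of the schema; the weights 0.35 + 0.25 + 0.25 + 0.15 sum to 1.0, so the score stays on a 0-100 scale):

```python
# Hypothetical mirror of the SQL composite churn score, for checking edge
# cases: all-zero inputs, fully maxed inputs, and a mixed profile.
def churn_risk_score(missed_pct, days_silent, non_renewal_pct, complaints):
    """Return a 0-100 churn risk score from the four indicator values."""
    score = (
        missed_pct * 0.35                            # 35%: payment behaviour
        + min(days_silent / 365 * 100, 100) * 0.25   # 25%: engagement
        + non_renewal_pct * 0.25                     # 25%: renewal behaviour
        + min(complaints * 10, 100) * 0.15           # 15%: complaints
    )
    return min(100.0, round(score, 1))

print(churn_risk_score(0, 0, 0, 0))         # 0.0   -> Low Risk
print(churn_risk_score(100, 730, 100, 20))  # 100.0 -> High Risk (capped)
print(churn_risk_score(50, 365, 0, 2))      # 45.5  -> Medium Risk
```

The `min(..., 100)` caps mirror the SQL LEAST() calls: without them, a customer silent for two years or with 20 complaints would blow past the intended weight of that factor.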

(c) How would you present the findings to the business team? [3 Marks]

Answer:

1. Executive Summary Dashboard (KPI snapshot):

-- A single summary query for the leadership slide
-- (assumes churn_risk_final is the part-(b) composite-score result saved as
--  a view/table, with each customer's annual_premium joined in)
SELECT
    COUNT(*)                                                AS total_active_customers,
    SUM(CASE WHEN churn_risk_score >= 70 THEN 1 ELSE 0 END) AS high_risk_count,
    -- scores carry one decimal, so BETWEEN 40 AND 69 would miss 69.1-69.9
    SUM(CASE WHEN churn_risk_score >= 40 AND churn_risk_score < 70
             THEN 1 ELSE 0 END)                             AS medium_risk_count,
    ROUND(SUM(CASE WHEN churn_risk_score >= 70 THEN 1 ELSE 0 END)
          * 100.0 / COUNT(*), 1)                            AS high_risk_pct,
    -- Estimated revenue at risk (annual premium of high-risk customers)
    SUM(CASE WHEN churn_risk_score >= 70
             THEN annual_premium ELSE 0 END)                AS revenue_at_risk
FROM churn_risk_final;

2. Segment the findings into actionable groups:

Risk Tier                 Count    Revenue at Risk   Recommended Action
🔴 High Risk (70-100)      1,240    ₹4.2 Cr           Immediate outreach: retention call within 48 hrs
🟡 Medium Risk (40-69)     3,850    ₹8.7 Cr           Email campaign, renewal reminder, discount offer
🟢 Low Risk (0-39)        18,200    -                 Routine engagement, upsell opportunities

3. Root cause breakdown (for the operations team):

-- Show what's driving churn risk (which factor contributes most)
SELECT
    CASE
        WHEN missed_pct >= 40                       THEN 'Payment Issues'
        WHEN days_silent >= 180                     THEN 'Disengagement'
        WHEN non_renewal_pct >= 50                  THEN 'Renewal Avoidance'
        WHEN complaint_count >= 3                   THEN 'Service Dissatisfaction'
        ELSE 'Multiple Factors'
    END                                             AS primary_churn_driver,
    COUNT(*)                                        AS customers_affected,
    ROUND(AVG(churn_risk_score), 1)                 AS avg_risk_score
FROM churn_risk_final
WHERE churn_risk_score >= 40
GROUP BY 1
ORDER BY customers_affected DESC;

4. Presentation structure for business stakeholders:

Slide 1 - The Problem
    → X% of customers are at high churn risk
    → ₹Y crore in annual premiums at stake

Slide 2 - What We Measured (the 4 risk indicators)
    → Payment missed rate, Days silent, Renewal failures, Complaint count

Slide 3 - Who Is at Risk (segmented customer list)
    → Exportable list with customer_id, agent_id, risk score, primary driver

Slide 4 - Recommended Actions by Tier
    → High Risk: personal call within 48 hours + loyalty discount offer
    → Medium Risk: automated WhatsApp/email campaign + early renewal incentive
    → Monitor: quarterly re-score and track improvement

Slide 5 - Success Metrics to Track
    → Target: reduce high-risk count by 30% in 90 days
    → Measure: retention rate, renewal completion rate, complaint resolution time

Key principle for business presentation: Never present raw SQL results to business stakeholders. Translate query outputs into: a number of customers affected, revenue at risk in rupees/crores, one clear recommended action per segment, and one metric to measure whether the action worked. The SQL is the engine; the story is what the business team acts on.

