Subqueries and Candidate Keys: Mastering Advanced Queries and Database Design

Welcome back to our ongoing MSSQL series! In our previous post, User-Defined Functions (UDFs): Extending SQL’s Capabilities , we learned how to encapsulate and reuse SQL logic with custom functions. Today, we’re diving into two equally crucial, yet distinct, concepts that will elevate your SQL proficiency: Subqueries for advanced data retrieval and Candidate Keys for robust database design.

Understanding Subqueries allows you to write more complex and precise queries by nesting one query within another. Meanwhile, grasping the concept of Candidate Keys is fundamental to designing a well-structured and reliable relational database.

As always, we’ll continue leveraging our familiar eCommerce database with tables for our subquery examples. For Candidate Keys, we’ll introduce a new, illustrative table.

Please follow the previous post to setup tables used in this post.

Understanding Subqueries: Queries Within Queries

A subquery (also known as an inner query or nested query) is a query embedded inside another SQL query. The outer query uses the result of the inner query to complete its operation. Subqueries are a powerful way to retrieve data that depends on the result of another query, often simplifying complex logic that might otherwise require multiple steps or more intricate joins.

Subqueries can appear in various clauses of a SQL statement, including:

SELECT clause (Scalar Subquery)
FROM clause (Derived Table)
WHERE or HAVING clause (most common)

Why Use Subqueries? (And When Not to Rely Solely on Joins)

A common question is: “Why use subqueries when I can often achieve the same result with joins?” It’s a valid point, and often, for simple relationships, joins are indeed more performant and straightforward. However, subqueries shine in specific scenarios:

Filtering based on aggregate results: When you need to filter data based on an aggregate function (e.g., AVG, SUM, COUNT) of a related table, a subquery in the WHERE or HAVING clause is often the most direct approach.
Checking existence: Using EXISTS or NOT EXISTS with a subquery is highly efficient for checking if related rows exist.
Retrieving a single value: A scalar subquery (returning one row and one column) can be used anywhere a single value is expected.
Complex logic: Sometimes, breaking down a complex problem into smaller, nested queries makes the SQL code more readable and easier to debug.
Performance considerations: While joins are generally optimized, certain types of subqueries (especially correlated subqueries) can sometimes be less performant than well-optimized joins. It’s always good practice to check execution plans.

Let’s explore examples using our eCommerce database to see subqueries in action.

Examples of Subqueries

1. Finding Transactions from Clients in a Specific City

This is a classic scenario where both subqueries and joins can be used. Let’s find all transactions made by clients living in ‘London’.

Using a Subquery: Here, the inner query first finds the client_id for all clients located in ‘London‘. The outer query then uses these IDs to select the relevant transactions.

SQL

-- Using a Subquery: Finds transactions where the client's city is 'London'.
-- The inner query selects client_id from ClientAccounts where location_city is 'London'.
-- The outer query then uses these client_ids to filter transactions.
SELECT *
FROM Transactions
WHERE cnum IN (SELECT client_id FROM ClientAccounts WHERE client_city = 'London');
GO

-- Using a Subquery: Finds transactions where the client's city is 'London'.
-- The inner query selects client_id from ClientAccounts where location_city is 'London'.
-- The outer query then uses these client_ids to filter transactions.
SELECT *
FROM Transactions
WHERE cnum IN (SELECT client_id FROM ClientAccounts WHERE client_city = 'London');
GO

Output:

Using a Join (for comparison): This achieves the same result by directly joining the Transactions and ClientAccounts tables.

SQL

-- Using a Join (for comparison): Achieves the same result with a direct join.
SELECT T.*
FROM Transactions AS T
JOIN ClientAccounts AS CA ON T.cnum = CA.client_id
WHERE CA.client_city = 'London';
GO

-- Using a Join (for comparison): Achieves the same result with a direct join.
SELECT T.*
FROM Transactions AS T
JOIN ClientAccounts AS CA ON T.cnum = CA.client_id
WHERE CA.client_city = 'London';
GO

Output:

Observation: For this simple filter, both are effective. The join might be slightly more performant in many SQL Server versions due to how the optimizer handles it.

2. Finding Transactions from Clients Assigned to Team Members in a Specific City

This involves a slightly more complex nested condition. Let’s find transactions made by clients whose assigned team members are located in ‘London’.

Using Nested Subqueries: The innermost subquery finds member_ids for team members in ‘London’. The middle subquery then finds client_ids for clients assigned to those team members. Finally, the outermost query retrieves transactions from those clients.

SQL

-- Using Nested Subqueries: Finds transactions where the assigned team member's city is 'New York'.
-- The innermost query finds member_ids for team members in 'New York'.
-- The middle query finds client_ids for clients assigned to those team members.
-- The outermost query retrieves transactions from those clients.
SELECT *
FROM Transactions
WHERE cnum IN (
    SELECT client_id
    FROM ClientAccounts
    WHERE assigned_member_id IN (
        SELECT member_id
        FROM TeamMembers
        WHERE location_city = 'New York'
    ));
GO

-- Using Nested Subqueries: Finds transactions where the assigned team member's city is 'New York'.
-- The innermost query finds member_ids for team members in 'New York'.
-- The middle query finds client_ids for clients assigned to those team members.
-- The outermost query retrieves transactions from those clients.
SELECT *
FROM Transactions
WHERE cnum IN (
    SELECT client_id
    FROM ClientAccounts
    WHERE assigned_member_id IN (
        SELECT member_id
        FROM TeamMembers
        WHERE location_city = 'New York'
    ));
GO

Using Joins (for comparison): This achieves the same result using multiple joins.

SQL

-- Using Joins (for comparison): Achieves the same result using multiple joins.
SELECT T.*
FROM Transactions AS T
JOIN ClientAccounts AS CA ON T.cnum = CA.client_id
JOIN TeamMembers AS TM ON CA.assigned_member_id = TM.member_id
WHERE TM.location_city = 'London';
GO

-- Using Joins (for comparison): Achieves the same result using multiple joins.
SELECT T.*
FROM Transactions AS T
JOIN ClientAccounts AS CA ON T.cnum = CA.client_id
JOIN TeamMembers AS TM ON CA.assigned_member_id = TM.member_id
WHERE TM.location_city = 'London';
GO

Observation: As complexity increases, nested subqueries can sometimes become less readable than well-structured joins. However, the choice often depends on personal preference and specific query optimization needs.

3. Finding Clients with an Average Transaction Amount Above a Threshold

This is a classic use case for subqueries, especially when filtering based on aggregate results. Let’s find clients who have an average transaction amount greater than 700.00.

Using a Subquery in the WHERE clause: The inner query calculates the average amt for each client and filters those whose average is above 700.00. The outer query then selects the details of these clients.

SQL

-- Using a Subquery in the WHERE clause: Identifies clients whose average transaction amount
-- across all their transactions is greater than 700.00.
SELECT client_id, client_name, client_city
FROM ClientAccounts
WHERE client_id IN (
    SELECT cnum
    FROM Transactions
    GROUP BY cnum
    HAVING AVG(amt) > 700.00
);
GO

-- Using a Subquery in the WHERE clause: Identifies clients whose average transaction amount
-- across all their transactions is greater than 700.00.
SELECT client_id, client_name, client_city
FROM ClientAccounts
WHERE client_id IN (
    SELECT cnum
    FROM Transactions
    GROUP BY cnum
    HAVING AVG(amt) > 700.00
);
GO

How it works: The subquery acts as a filter, providing a list of cnum (client IDs) that meet the average transaction value condition. The outer query then retrieves the full client details for those IDs.

4. Finding the Transaction(s) with the Maximum Amount

This is a straightforward use of a scalar subquery.

SQL

-- Using a scalar subquery: Retrieves all details for the transaction(s) that have the highest amount.
SELECT *
FROM Transactions
WHERE amt = (SELECT MAX(amt) FROM Transactions);
GO

-- Using a scalar subquery: Retrieves all details for the transaction(s) that have the highest amount.
SELECT *
FROM Transactions
WHERE amt = (SELECT MAX(amt) FROM Transactions);
GO

How it works: The inner query (SELECT MAX(amt) FROM Transactions) returns a single value (the highest transaction amount), which the outer query then uses in its WHERE clause to filter for matching transactions.

5. Finding the Second Highest Transaction Amount

This demonstrates how subqueries can be chained to find specific ranks or values.

SQL

-- Using nested scalar subqueries: Identifies the second highest transaction amount.
SELECT *
FROM Transactions
WHERE amt = (
    SELECT MAX(amt)
    FROM Transactions
    WHERE amt < (SELECT MAX(amt) FROM Transactions)
);
GO

-- Using nested scalar subqueries: Identifies the second highest transaction amount.
SELECT *
FROM Transactions
WHERE amt = (
    SELECT MAX(amt)
    FROM Transactions
    WHERE amt < (SELECT MAX(amt) FROM Transactions)
);
GO

How it works: The innermost subquery finds the absolute maximum amount. The middle subquery then finds the maximum amount that is less than the absolute maximum, effectively giving us the second highest.

6. Finding Clients Who Placed Transactions on the Date with the Highest Total Transaction Value

This example showcases a more complex scenario, combining aggregation and filtering.

SQL

-- This query combines aggregation and subqueries to find clients from the date with highest total sales.
SELECT CA.client_id, CA.client_name
FROM ClientAccounts CA
WHERE CA.client_id IN (
    SELECT T.cnum
    FROM Transactions T
    WHERE T.odate = (
        SELECT TOP 1 odate
        FROM Transactions
        GROUP BY odate
        ORDER BY SUM(amt) DESC
    )
);
GO

-- This query combines aggregation and subqueries to find clients from the date with highest total sales.
SELECT CA.client_id, CA.client_name
FROM ClientAccounts CA
WHERE CA.client_id IN (
    SELECT T.cnum
    FROM Transactions T
    WHERE T.odate = (
        SELECT TOP 1 odate
        FROM Transactions
        GROUP BY odate
        ORDER BY SUM(amt) DESC
    )
);
GO

How it works: The innermost subquery identifies the odate (order date) that had the highest total transaction amount across all transactions. The outer queries then filter to show clients who made transactions on that specific date.

Candidate Key: A Pillar of Database Design

While Primary Keys are well-known, Candidate Keys are a broader concept fundamental to relational database theory. A Candidate Key is any attribute or set of attributes that can uniquely identify a tuple (row) in a relation (table).

Properties of a Candidate Key:

Uniqueness: Each value of the key must be unique for every row in the table. No two rows can have the same value for the candidate key.
Minimality (Irreducibility): No proper subset of the candidate key attributes can uniquely identify tuples. This means you cannot remove any attribute from the candidate key and still maintain uniqueness.
Non-Nullability (for Primary Key choice): While a Candidate Key itself can sometimes contain NULL values (if it’s not chosen as the Primary Key and the UNIQUE constraint allows one NULL), the chosen Primary Key must be non-nullable.

A table can have multiple Candidate Keys. From these Candidate Keys, one is chosen to be the Primary Key for that table. All other Candidate Keys that are not chosen as the Primary Key are called Alternate Keys.

Example: `Student` Table

Let’s consider a new table, Student, to illustrate Candidate Keys. In a real-world scenario, a student_id is a common Primary Key. However, other attributes like a unique student_roll_no or a unique student_email could also uniquely identify a student.

Here’s the schema for our Student table:

SQL

CREATE TABLE Student (
    student_id INT NOT NULL PRIMARY KEY,      -- Primary Key
    student_roll_no VARCHAR(20),              -- Candidate Key (Alternate Key)
    student_email VARCHAR(100),               -- Candidate Key (Alternate Key)
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    major VARCHAR(50),
    CONSTRAINT UQ_Student_RollNo UNIQUE (student_roll_no), -- Unique constraint for Roll Number
    CONSTRAINT UQ_Student_Email UNIQUE (student_email)     -- Unique constraint for Email
);
GO

CREATE TABLE Student (
    student_id INT NOT NULL PRIMARY KEY,      -- Primary Key
    student_roll_no VARCHAR(20),              -- Candidate Key (Alternate Key)
    student_email VARCHAR(100),               -- Candidate Key (Alternate Key)
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    major VARCHAR(50),
    CONSTRAINT UQ_Student_RollNo UNIQUE (student_roll_no), -- Unique constraint for Roll Number
    CONSTRAINT UQ_Student_Email UNIQUE (student_email)     -- Unique constraint for Email
);
GO

In this table:

student_id is the Primary Key. It uniquely identifies each student and cannot be NULL.
student_roll_no is a Candidate Key (and thus an Alternate Key). It’s unique for each student.
student_email is also a Candidate Key (and thus an Alternate Key). It’s unique for each student.

Important Note on UNIQUE Constraints and NULLs: A UNIQUE constraint, which defines a Candidate Key, allows for one NULL value in the column (or one set of NULLs if it’s a composite unique key). This is because NULL is considered “unknown” and therefore not equal to any other NULL. However, if a Candidate Key is chosen as the Primary Key, it automatically becomes NOT NULL.

Let’s insert some data and observe the behavior:

SQL

-- Insert first value: All unique identifiers present
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (101, 'ROLL001', 'alice.smith@example.com', 'Alice', 'Smith', 'Computer Science');

-- Insert with one NULL for student_email (allowed)
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (102, 'ROLL002', NULL, 'Bob', 'Johnson', 'Physics');

-- Insert with one NULL for student_roll_no (allowed)
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (103, NULL, 'charlie.brown@example.com', 'Charlie', 'Brown', 'Mathematics');

-- Insert with unique values
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (104, 'ROLL004', 'diana.prince@example.com', 'Diana', 'Prince', 'History');
GO

-- Insert first value: All unique identifiers present
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (101, 'ROLL001', 'alice.smith@example.com', 'Alice', 'Smith', 'Computer Science');

-- Insert with one NULL for student_email (allowed)
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (102, 'ROLL002', NULL, 'Bob', 'Johnson', 'Physics');

-- Insert with one NULL for student_roll_no (allowed)
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (103, NULL, 'charlie.brown@example.com', 'Charlie', 'Brown', 'Mathematics');

-- Insert with unique values
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (104, 'ROLL004', 'diana.prince@example.com', 'Diana', 'Prince', 'History');
GO

Now, let’s try to insert a record that violates the UNIQUE constraint due to a second NULL:

SQL

-- This will give an error: Cannot insert duplicate key row in object 'dbo.Student'
-- with unique index 'UQ_Student_RollNo' (or 'UQ_Student_Email') where key is (NULL).
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (105, NULL, NULL, 'Eve', 'Adams', 'Biology'); -- This will fail if a NULL already exists in either student_roll_no or student_email
GO

-- This will give an error: Cannot insert duplicate key row in object 'dbo.Student'
-- with unique index 'UQ_Student_RollNo' (or 'UQ_Student_Email') where key is (NULL).
INSERT INTO Student (student_id, student_roll_no, student_email, first_name, last_name, major)
VALUES (105, NULL, NULL, 'Eve', 'Adams', 'Biology'); -- This will fail if a NULL already exists in either student_roll_no or student_email
GO

Explanation: SQL Server’s UNIQUE constraint (and thus Candidate Key definition) allows only one row to have a NULL value for that specific column. If you try to insert a second row with NULL in student_roll_no (or student_email), it will violate the uniqueness rule for NULLs.

Let’s view the current data:

SQL

-- Fetching all data from Student
SELECT * FROM Student;
GO

-- Fetching only records with a valid Roll Number
SELECT * FROM Student WHERE student_roll_no IS NOT NULL;
GO

-- Fetching only records with a valid Email
SELECT * FROM Student WHERE student_email IS NOT NULL;
GO

-- Fetching all data from Student
SELECT * FROM Student;
GO

-- Fetching only records with a valid Roll Number
SELECT * FROM Student WHERE student_roll_no IS NOT NULL;
GO

-- Fetching only records with a valid Email
SELECT * FROM Student WHERE student_email IS NOT NULL;
GO

To view the constraint details, you can use sp_help:

SQL

-- View details of the Student table, including constraints
EXEC sp_help 'Student';
GO

-- View details of the Student table, including constraints
EXEC sp_help 'Student';
GO

Conclusion

Mastering Subqueries provides you with the flexibility to construct powerful and precise data retrieval operations, allowing you to answer complex business questions by nesting queries. While joins are often preferred for performance in many scenarios, subqueries offer unique advantages for filtering based on aggregates or for breaking down intricate logic.

Simultaneously, a solid understanding of Candidate Keys is foundational to designing efficient and reliable relational databases. Recognizing all potential unique identifiers helps in selecting the most appropriate Primary Key and establishing robust data integrity through UNIQUE constraints.

Our next blog post will delve deeper into advanced querying techniques, focusing on Common Table Expressions (CTEs) and Window Functions, which further enhance your ability to manipulate and analyze data in SQL Server.

Stay tuned for more!

The Code Neuron

Subqueries and Candidate Keys: Mastering Advanced Queries and Database Design

Understanding Subqueries: Queries Within Queries

Why Use Subqueries? (And When Not to Rely Solely on Joins)

Examples of Subqueries

Candidate Key: A Pillar of Database Design

Example: `Student` Table

Conclusion

Hi there 👋
It’s nice to see you.

Sign up to receive awesome content in your inbox, as soon as they gets posted!

Comments

Leave a Reply

Subqueries and Candidate Keys: Mastering Advanced Queries and Database Design

Understanding Subqueries: Queries Within Queries

Why Use Subqueries? (And When Not to Rely Solely on Joins)

Examples of Subqueries

Candidate Key: A Pillar of Database Design

Example: Student Table

Conclusion

Hi there 👋It’s nice to see you.

Sign up to receive awesome content in your inbox, as soon as they gets posted!

Comments

Leave a Reply

Example: `Student` Table

Hi there 👋
It’s nice to see you.