SQL is probably the most widely used tool in the world of Data Science, Analysis, or Engineering. In most job postings SQL is one of the common requirements. There are a lot of tutorials and articles available as well. I decided to make a collection of SQL select queries, clauses, and common inbuilt functions. Experienced SQL users may already have all these commands memorized. But it should be helpful for new grads who are looking for jobs to help with their interviews or recent SQL users who still have to look at a cheat sheet.
I will try to explain the minimum in this article, starting with the most basic, and moving toward some more complex queries.
I used the Heart.csv dataset from Kaggle and Microsoft SQL Management Studio for this demonstration.
The most basic query is to use a * to bring everything from a table:
select * from Heart;
It gives you the whole Heart table.
Here is part of the output. It shows all the columns. It has a total of 303 rows.
In most tutorials, you will find writing the clauses and SQL expressions in upper case letters. But SQL commands are not case sensitive. So it’s your choice. You can use upper case, lower case, or proper case.
Selecting a few columns of my choice:
select Age, Sex, Chol, AHD from Heart;
Column values can be transformed in the select queries. Here, I am adding a negative sign before the Chol column (though it does not make sense, I am doing it only to show that it is possible) and rounding the Oldpeak column up to 1 decimal number:
select Age, Sex, -Chol, round(Oldpeak, 1), AHD from Heart;
It added a negative sign before every value in the Chol column and Rounded the values in the Oldpeak column up to 1 decimal point. If you notice, two transformed columns do not have a name. You can add names that are called column alias.
Here I am using column aliases to name those two transformed columns:
select Age, Sex, -Chol as negativeChol, round(Oldpeak, 1) as roundedOldpeak, AHD from Heart;
Two transformed columns have names now.
‘Where’ clause uses the condition. For example, the AHD column has two unique values: 0 and 1. If you want to get the data only where the AHD values are 1s, the ‘where’ clause can do that. Or if we need the data where Chol is greater or less than a certain value, we can use the ‘where’ clause to do that. A few examples here:
select Age, Chol, AHD from Heart where AHD = 0;
select Age, Chol, AHD from Heart where Chol > 250;
select Age, Chol, AHD, RestBP, MaxHR from Heart where RestBP > MaxHR;
I want to introduce this interesting symbol <>, that checks for non-matching. If you use <> in a condition that means it will return the non-matching value. Here is an example:
select Age, Chol, AHD from Heart where AHD <> 0;
It returned the rows where AHD is not equal to 0.
Mathmatical operations with where clause
This example will show a simple mathematical operation with the ‘where’ clause. We will return the rows where the sum of RestBP and MaxHR is greater than 300. But I also had to use cast(RestBP as int) and cast(MaxHR as int) because their data type was tinyint. So I had to cast them as int. Otherwise, it would through an error in my SQL engine. Different SQL engines may behave differently in this case.
select Age, Chol, AHD, RestBP, MaxHR, cast(RestBP as int) + cast(MaxHR as int) as Sum_BPHR from Heart where cast(RestBP as int) + cast(MaxHR as int)> 300;
Or this expression will also work:
select Age, Chol, AHD, RestBP, MaxHR, 1.0*RestBP + 1.0*MaxHR as Sum_BPHR from Heart where 1.0*RestBP + 1.0*MaxHR> 300;
Use of ‘between’ with where clause
This expression will return the specified column in the select statement if the value of Chol is between 200 and 300 and AHD is not 0:
select Age, Chol, AHD from Heart where Chol between 200 and 300 and AHD <>0;
In the same way, we can use ‘not between’ to return the rows that are not in a specified range:
select Age, Chol, AHD from Heart where Chol not between 200 and 300 and AHD <>0;
Filtering null values
Where clause is useful to filter null values. Let’s filter out the null values from the RestBP column:
select Age, Chol, AHD from Heart where RestBP is not NULL;
I am not showing the output here because this dataset does not have any null values.
IN operator with where clause
‘In’ operator can check from a list of values. For example, suppose we want the rows with some certain ages:
select Age, Chol, AHD, ChestPain from Heart where Age in (45, 23, 50, 52, 65, 60);
Or operator with where clause
In this example, we are returning the rows where ChestPain is either ‘typical’ or ‘nontypical’.
select Age, Chol, AHD, ChestPain from Heart where ChestPain = 'typical' or ChestPain = 'nontypical';
The same thing can be achieved using the ‘in’ operator as well:
select Age, Chol, AHD, ChestPain from Heart where ChestPain in ('typical', 'nontypical');
Or , and operator together
Returning the rows where age is less than 50 and ChestPain is either ‘typical’ or ‘nontypical’:
select Age, Chol, AHD, ChestPain from Heart where Age < 50 and
Not, and operators together
In this example, we will return the rows where ChestPain is not equal to ‘typical’ and AHD is equal to 1:
select Age, Chol, AHD, ChestPain from Heart where not ChestPain = 'typical'
It is useful for string data. This expression below retrieve all the rows where there is ‘typ’ in ChestPain value. Does not matter if ‘typical’ has some other part before or after.
select Age, Chol, AHD, ChestPain from Heart where ChestPain like '%typ%';
If we use “ChestPain like ‘%typical’”, it will return the rows where ChestPain value has either ‘typical’ or something + typical.
select Age, Chol, AHD, ChestPain from Heart where ChestPain like '%typical';
As you can see, we got the ChestPain values of ‘typical’ or ‘nontypical’
This expression below will return when ChestPain value is non + something because ‘non’ has ‘%’ at the end:
select Age, Chol, AHD, ChestPain from Heart where ChestPain like 'non%';
We got all the rows where the ChestPain value has a ‘non’ in the begining.
Order By Clause
Here is an example of a simple order by clause use case.
select Age, Chol, AHD, ChestPain from Heart order by Chol;
As order by Chol is specified, we got the output where Chol is arranged in descending order.
More than one variable can be used in order by clause as well:
select Age, Chol, AHD, ChestPain from Heart order by Age, Chol;
By default order by clause arranges in ascending order. But you can specify if you want ascending or descending order. Even if you use order with two variables one can be ascending and the other can be in descending order:
select Age, Chol, AHD, ChestPain from Heart order by Age DESC, Chol ASC;
Here I am calculating the probable MaxHR using the Age column:
select Age, Chol, MaxHR, ChestPain,210 - (0.5*Age) as calculated_MaxHR from Heart;
We can also find out the difference between the MaxHR and calculated_MaxHR and make a different column:
select Age, Chol, MaxHR, ChestPain,210 - (0.5*Age) as calculated_MaxHR, from Heart;
Aggregate functions are very popular. These are the aggregate function in SQL:
Here is a use of ‘avg’ to find the average age of the population in the dataset:
select avg(Age) as avg_age from Heart;
Next, I am using the count function to find out the number of samples in the dataset:
select count(Chol) as total_sample from Heart;
We could use any other column in the count function to get the number of samples here. Remember we can get the number of samples using count(Chol) because Chol column does not have any null values. If there would be null values in the Chol column we would not be able to get the total samples here. Because the count function gives you only the number for non-null values.
Finding the maximum of age and minimum of age:
select max(Age) as max_age, min(Age) as min_age from Heart;
Calculating the spread of the Age data:
select max(Age) - min(Age) as Age_spread from Heart;
We can use the distinct function to find all the distinct rows of the table. This expression will return all the distinct rows eliminating any duplicate rows if there are any:
select distinct * from Heart;
How many distinct Age values are n the Age column:
select count(distinct(Age)) as distinct_age from Heart;
We can get a distinct Age, ChestPain combination like this:
select distinct Age, ChestPain from Heart;
Here I wanted to get the distinct combination of two columns only. You can add more columns and check.
Group by clause with aggregate functions
Here we are getting the number of samples for each type of ChestPain:
select ChestPain, count(*) as numberOfSample from Heart group by ChestPain;
Another example here. Calculating the average age for each type of ChestPain:
select ChestPain, avg(Age) as Avg_Age from Heart group by ChestPain;
The average age is pretty close.
You can group using more than one variable as well. In the next example, I am grouping by ChestPain and AHD, getting average age for each subgroup:
select ChestPain, AHD, avg(Age) as Avg_Age from Heart group by ChestPain, AHD;
Look at the output. It shows the average age for each type of ChestPain and AHD value of 0 first and then we get the average age for each type of ChestPain and AHD value of 1.
Having clause is used to filter the data from the groups. Let’s see an example:
select ChestPain, count(*) from Heart group by ChestPain having count(*) > 50;
Here we get the number of samples for each type of ChestPain and then use a condition that the number of samples has to be greater than 50.
Having clause does not have to be on the groups. It can be on some other columns as well. Here we are selecting the ChestPain type if the sum of MaxHR + RestBP for that type is more than 10000
select ChestPain from Heart group by ChestPain Having sum(1.0*MaxHR + 1.0*RestBP) > 10000;
The following example shows how to use aggregate functions on several columns and then use a condition using thehaving clause.
select ChestPain, SUM(1.0*MaxHR + 1.0*RestBP) as sum_HRBP, Min(Age) as min_Age, count(*) as total_sample from Heart group by ChestPain
Where and having clause together
Having clause almost works as ‘where’ clause on groups. We can use both where and having clause together. In this case, first where clause filter out the data from the total dataset, then group by clause groups the data as specified, and at the end having clause filters out the data from the group as per the condition provided in the having clause.
select ChestPain, count(*) as Num_people from Heart where Age >= 40 group by ChestPain having count(*) > 50;
Having clause can use more than one condition:
select ChestPain, avg(RestBP) as avg_BP from Heart where Age >= 40 having avg(RestBP) > 120 and count(*) > 30;
Order by and group by together
As we saw before ChestPain is a categorical variable. Here we are calculating the average Chol for each type of ChestPain and arranging them in ascending order by average Chol:
select ChestPain, avg(Chol) as avg_Chol from Heart where Chol is not null group by ChestPain
We can concatenate two column values using concat function:
select ChestPain, Thal, CONCAT(ChestPain, ' ', Thal) as ChestPainThal from Heart;
Look at the ChestPainThal column in the picture above.
The trim function trims any space from the beginning or from the end of strings.
SELECT TRIM(' Test string ') as trimmed;
In the above example, we simply use the trim function and it trimmed the extra spaces from the beginning and the end by default. But if you want to trim some particular characters from the beginning or from the end, you can specify those characters in the trim function.
Here I am specifying that I want to trim ‘%’ and ‘.’ from the string. It will remove ‘%’ or ‘.’ if present in the beginning or at the end.
SELECT TRIM('.%' from '%%Test... string...') as trimmed;
If the column values are strings, the substring function can be used to return a part of the strings.
select ChestPain, substring(ChestPain, 2, 4) as short_cp from Heart where Age > 40;
In the substring function above we passed ChestPain, 2, and 4. That means we are asking the function to take a substring from ChestPain column, 2 represents the substring should start from the second element of the strings, and 4 means we want the length of the substring to be 4.
Converting Thal column to upper case:
select upper(Thal) from Heart;
Using case-when is almost like if-else statement.
Here is an example where we are making a categorical column out of the Chol column. We are setting different ranges to define Chol as Optimal, Borderline High, High, and Very High. This new categorical column is named Chol_Category which is mentioned in the ‘End’.
select Age, Sex, ChestPain, Chol,casewhen Chol >= 100 and Chol <= 129 then 'Optimal'when Chol >=130 and Chol <= 159 then 'Borderline High'when Chol >= 160 and Chol <= 189 then 'High'when Chol > 189 then 'Very High'End Chol_Categoryfrom Heart;
We do not need the whole result of a query all the time. Sometimes a certain portion is enough. In that case, we can use the limit clause to limit the output to a certain number of rows. This syntax will work in the most SQL engines.
select Age, Chol, ChestPain from Heart limit 10;
But I used Microsoft SQL management studio for this tutorial. This syntax did not work in the Microsoft SQL management studio. I had to use this instead:
select top 10 Age, Chol, ChestPain from Heart;
It gives you the top 10 rows or the number you mention here.
If you have a big dataset a query can return a long table too. You can organize that long table in separate pages where each page will only contain your specified number of rows.
select Age, Chol, ChestPain from Heart order by Chol offset 15 rows fetch next 15 rows only;
That was all the functions and clauses I wanted to work on today.
I did not cover the joins here because this article is getting too big. That is for a different article.
In this article, I wanted to demonstrate the examples of all the major clauses and popular functions of SQL using Microsoft SQL Management Studio. If you remember all these clauses and functions by heart, writing SQL queries will be really easy for you. I will make more tutorials on some more advanced topics of SQL in the future.
#sql #DataAnalysis #DataScience #database #RelationalDatabase