SQL: SQL Functions (Analytic)

Use analytic functions to determine values based on groups of values. For example, you can use this type of function to determine running totals, percentages, or the top result within a group.
Syntax

FIRST_VALUE ( scalar_expression ) OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
LAST_VALUE ( scalar_expression ) OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
LAG (scalar_expression [,offset] [,default]) OVER ( [ partition_by_clause ] order_by_clause )
LEAD ( scalar_expression [ ,offset ] , [ default ] ) OVER ( [ partition_by_clause ] order_by_clause )
PERCENT_RANK( ) OVER ( [ partition_by_clause ] order_by_clause )
CUME_DIST( ) OVER ( [ partition_by_clause ] order_by_clause )
PERCENTILE_DISC ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] )
PERCENTILE_CONT ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] )

Window Frame in SQL Server

Understanding windowing and aggregation function in depth requires a good understanding of window framing in SQL Server. As we aware of that window function works on a subset of rows in a partition. Framing helps us to decide the upper boundary & lower boundary for a partition where analytical function work.

Window frames can be indicated as following with ORDER BY clause.

RANGE BETWEEN start_boundary AND end_boundary

ROWS BETWEEN start_boundary AND end_boundary

It’s important to understand the following terms before we use window frames in our queries.

FIRST_VALUE

The FIRST_VALUE() is a window function that returns the first value in an ordered set of values.

The following illustrates the syntax of the FIRST_VALUE() function:

FIRST_VALUE(expression) OVER (

partition_clause

order_clause

frame_clause

)

In this syntax:

expression

The return value of the expression from the first row in a partition or result set.

The OVER clause consists of three clauses: partition_clause, order_clause, and frame_clause.

partition_clause

The partition_clause clause has the following syntax:

PARTITION BY expr1, expr2, ...

The PARTITION BY clause divides the rows of the result sets into partitions to which the FIRST_VALUE() function applies. If you skip the PARTITION BY clause, the function treats the whole result set as a single partition.

order_clause

The order_clause clause sorts the rows in partitions to which the FIRST_VALUE() function applies. The ORDER BY clause has the following syntax:

ORDER BY expr1 [ASC | DESC], expr2, ...

frame_clause

The frame_clause defines the subset (or frame) of the current partition. Check it out the window function tutorial for the detailed information of the frame clause.

Example :

SELECT first_name, last_name, salary, FIRST_VALUE (first_name) OVER (ORDER BY salary) lowest_salary FROM employees;

It Returns the first value in an ordered set of values in a specified column. It’s mandatory to use OVER clause with ORDER BY to have an ordered set. PARTITION BY is optional.

Syntax

FIRST_VALUE(ColName) OVER(PARTITION BY ColName ORDER BY ColName)

FIRST_VALUE – Without Partition BY

In the following example, we’re using FIRST_VALUE function to find the very first value in Salary column ordered ascending. While we observed that salary column was not in the order when populated employee table. Applying FIRST_VALUE with ORDER BY causes to order the rows and fetch the first value from the ordered set of values.

Example :

use tempdb;

DROP TABLE IF EXISTS Employee;

CREATE TABLE Employee

(

Id INT ,Salary INT

);

INSERT INTO Employee VALUES (1, 200), (2, 100), (3, 300), (4, 500), (5, 400);

SELECT * FROM Employee;

SELECT Id, Salary, FIRST_VALUE(Salary) OVER(ORDER BY Salary) as FirstValue

FROM Employee;

FIRST_VALUE and PARTITION BY

Running following example, we can observe that result of the query was partitioned by dept name column thus creating two logical windows of IT dept and HR dept. When we have two different windows then FIRST_VALUE function is applied to each partition to fetch first value from the ordered set using salary column in ascending order.

Example

use tempdb;

DROP TABLE IF EXISTS Employee;

CREATE TABLE Employee

(Id INT,DeptName VARCHAR(25),Salary INT);

INSERT INTO Employee VALUES (1, 'IT', 200), (2, 'IT', 100), (3, 'IT', 300), (4, 'HR', 500), (5, 'HR', 400);

SELECT * FROM Employee;

SELECT Id, DeptName, Salary, FIRST_VALUE(Salary) OVER(PARTITION BY DeptName ORDER BY Salary) as FirstValue FROM Employee ORDER BY Salary, Id;

LAST_VALUE

Returns the last value in an ordered set of values in a specified column. It’s mandatory to use OVER clause with ORDER BY to have an ordered set. PARTITION BY is optional. Besides, we need to consider window framing while using LAST_VALUE function to avoid unexpected results.

Syntax

LAST_VALUE(ColName) OVER(PARTITION BY ColNameORDER BY ColName ROW_OR_RANGE_Frame)

LAST_VALUE – Without Partition By

Following example displays the last value from an ordered set of values. Make a note of using LAST_VALUE function without window frame produces an incorrect result because when we’ve not specified any window frame then it uses default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW which only goes up to current row hence giving the incorrect row value which results in incorrect values.

Specifying the correct window frame ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING produces a correct result by taking all the rows into consideration before producing the last value.

Example :

use tempdb;

DROP TABLE IF EXISTS Employee;

CREATE TABLE Employee

( Id INT ,Salary INT);

INSERT INTO Employee VALUES (1, 200), (2, 100), (3, 300), (4, 500), (5, 400);

SELECT * FROM Employee;

--Produces wrong output because default framing giving the wrong order

SELECT Id, Salary, LAST_VALUE(Salary) OVER(ORDER BY Salary) as LastValue

FROM Employee;

--------------------------------------------

--Produces correct output when correct window frame is applied

SELECT Id, Salary, LAST_VALUE(Salary) OVER(ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as LastValue

FROM Employee;

LAST_VALUE and PARTITION BY

Here is an example using PARTITION BY with LAST_VALUE function. Similar to the above example; the result of query divided into two partitions of I.T dept and HR dept and applying LAST_VALUE function to each partition to fetch last value from an ordered set of values.

We have used ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING window frame to make sure LAST_VALUE function considers rows between first row in partition to last row in partition.

Example :

use tempdb;

DROP TABLE IF EXISTS Employee;

CREATE TABLE Employee

(Id INT,DeptName VARCHAR(25),Salary INT);

INSERT INTO Employee VALUES (1, 'IT', 200), (2, 'IT', 100), (3, 'IT', 300), (4, 'HR', 500), (5, 'HR', 400);

SELECT * FROM Employee;

SELECT Id, DeptName, Salary, LAST_VALUE(Salary) OVER(PARTITION BY DeptName ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as LastValue

FROM Employee

ORDER BY Salary, Id;

LAG Function

The LAG function is an analytic function that lets you query more than one row in a table at a time without having to join the table to itself. It returns values from a previous row in the table. To return a value from the next row, try using the LEAD function.

Syntax:

LAG ( expression [, offset [, default] ] ) OVER ( [ query_partition_clause ] order_by_clause )

Parameters or Arguments

expression

An expression that can contain other built-in functions, but can not contain any analytic functions.

offset

Optional. It is the physical offset from the current row in the table. If this parameter is omitted, the default is 1.

default

Optional. It is the value that is returned if the offset goes out of the bounds of the table. If this parameter is omitted, the default is null.

query_partition_clause

Optional. It is used to partition the results into groups based on one or more expressions.

order_by_clause

Optional. It is used to order the data within each partition.

Employees table

employee_number	last_name	first_name	salary	dept_id
12009	Sutherland	Barbara	54000	45
34974	Yates	Fred	80000	45
34987	Erickson	Neil	42000	45
45001	Parker	Sally	57500	30
75623	Gates	Steve	65000	30

Execute SQL statement:

SELECT dept_id, last_name, salary, LAG (salary,1) OVER (ORDER BY salary) AS lower_salary FROM employees;

//Output

dept_id	last_name	salary	lower_salary
45	Erickson	42000	NULL
45	Sutherland	54000	42000
30	Parker	57500	54000
30	Gates	65000	57500
45	Yates	80000	65000

LAG Using Partitions

Now let's look at a more complex example where we use a query partition clause to return the lower salary for each employee within their own department.

Execute SQL statement:

SELECT dept_id, last_name, salary, LAG (salary,1) OVER (PARTITION BY dept_id ORDER BY salary) AS lower_salary FROM employees;

//Output :

dept_id	last_name	salary	lower_salary
30	Parker	57500	NULL
30	Gates	65000	57500
45	Erickson	42000	NULL
45	Sutherland	54000	42000
45	Yates	80000	54000

LEAD function

LEAD function is an analytic function that lets you query more than one row in a table at a time without having to join the table to itself. It returns values from the next row in the table. To return a value from a previous row, try using the LAG function.

Syntax:

LEAD ( expression [, offset [, default] ] )

OVER ( [ query_partition_clause ] order_by_clause )

expression

An expression that can contain other built-in functions, but can not contain any analytic functions.

offset

Optional. It is the physical offset from the current row in the table. If this parameter is omitted, the default is 1.

default

Optional. It is the value that is returned if the offset goes out of the bounds of the table. If this parameter is omitted, the default is null.

query_partition_clause

Optional. It is used to partition the results into groups based on one or more expressions.

order_by_clause

Optional. It is used to order the data within each partition.

Employees table :

employee_number	last_name	first_name	salary	dept_id
12009	Sutherland	Barbara	54000	45
34974	Yates	Fred	80000	45
34987	Erickson	Neil	42000	45
45001	Parker	Sally	57500	30
75623	Gates	Steve	65000	30

Execute SQL statement:

SELECT dept_id, last_name, salary, LEAD (salary,1) OVER (ORDER BY salary) AS next_highest_salary FROM employees;

//Output :

dept_id	last_name	salary	next_highest_salary
45	Erickson	42000	54000
45	Sutherland	54000	57500
30	Parker	57500	65000
30	Gates	65000	80000
45	Yates	80000	NULL

Lead Using Partitions

Now let's look at a more complex example where we use a query partition clause to return the next highest salary for each employee within their own department.

Execute SQL statement:

SELECT dept_id, last_name, salary, LEAD (salary,1) OVER (PARTITION BY dept_id ORDER BY salary) AS next_highest_salary FROM employees;

//Output :

dept_id	last_name	salary	next_highest_salary
30	Parker	57500	65000
30	Gates	65000	NULL
45	Erickson	42000	54000
45	Sutherland	54000	80000
45	Yates	80000	NULL

PERCENT_RANK() :
This represents the percentage of values less than the current value in the group, excluding the highest value. Percent_Rank() for the highest value in a group will always be 1.
Syntax
PERCENT_RANK OVER ( [ partition_by_clause ] order_by_clause )
Create a EmpSalary table on database TestDB and insert some data:
CREATE DATABASE [TestDB]
--Create testable to hold some data
Use TestDB
Go
CREATE TABLE [dbo].[EmpSalary](
[Department] [nchar](10) NULL,
[EmpName] [nchar](10) NULL,
[Salary] [int] NULL
) ON [PRIMARY]
GO
--Insert some test data to apply these functions
insert into [EmpSalary] values('IT','Emp1',87)
insert into [EmpSalary] values('IT','Emp2',45)
insert into [EmpSalary] values('IT','Emp3',89)
insert into [EmpSalary] values('IT','Emp4',87)
insert into [EmpSalary] values('OPS','Emp1',19)
insert into [EmpSalary] values('OPS','Emp2',33)
insert into [EmpSalary] values('OPS','Emp3',89)
insert into [EmpSalary] values('OPS','Emp4',89)
PERCENT_RANK Explanation
The above example uses the PERCENT_RANK function to compute the salary percentile for each employee. The value returned by the PERCENT_RANK function represents the percentile rank for that employee compared to all other employees.
For PERCENT_RANK, the first value always has a percent rank of 0 and the rest of the values are divided by their percentage through the number of rows - 1. Since there are 8 rows, and the first PERCENT_RANK starts at 0, the others (8-1=7) are divided equally, so each row has a PERCENT_RANK of 1/7 = 0.1428571428571429. The values are then based on that value times the number of rows that have a value less than the group value.
For row 1, based on the above, = 0
For row 2, the value will be (1/7) * 1 (row) = 0.1428571428571429
For row 3, the value will be (1/7) * 2 (rows) = 0.2857142857142857
For rows 4 and 5 the value will be (1/7) * 3 (rows) = 0.4285714285714287

For rows 6, 7 and 8 the value will be (1/7) * 5 (rows) = 0.7142857142857143
Execute SQL statement:
SELECT Department,EmpName,Salary, PERCENT_RANK() OVER (ORDER BY [Salary]) as Percent_Rank FROM [EmpSalary] ORDER BY [Salary]
GO
//Output :

Note - The range of values returned by PERCENT_RANK is greater than 0 and less than or equal to 1. The same values always return the same cumulative distribution value.
CUME_DIST()
This gives the percentage of values less than or equal to the current value in the group. This is called the cumulative distribution.
Syntax:
CUME_DIST( [ partition_by_clause ] order_by_clause )
CUME_DIST Explanation
The above example uses the CUME_DIST function to compute the salary percentile for each employee. The value returned by the CUME_DIST function represents the percent of employees that have a salary less than or equal to all employees.
Let me explain how we can calculate the values.
We have a total of 8 records in the table and there are 5 distinct salary values, so CUME_DIST values are determined as follows
For row 1 there is 1 row with that value or lower, so CUME_DIST = 1/8 or 0.125
For row 2 there are 2 rows with that value or lower, so CUME_DIST = 2/8 or 0.25
For row 3 there are 3 rows with that value or lower, so CUME_DIST = 3/8 or 0.375
For rows 4 and 5 there are 5 rows with that value or lower, so CUME_DIST = 5/8 or 0.625
For rows 6, 7 and 8 there are 8 rows with that value or lower, so CUME_DIST = 8/8 or 1
Note - The range of values returned by CUME_DIST is greater than 0 and less than or equal to 1. The same values always return the same cumulative distribution value.
Execute SQL Statement :
SELECT Department,EmpName,Salary, CUME_DIST() OVER (ORDER BY [Salary]) as CUME_DIST FROM [EmpSalary] ORDER BY [Salary]

GO
//Output

PERCENTILE_DISC
PERCENTILE_DISC is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
This function takes as an argument any numeric datatype or any nonnumeric datatype that can be implicitly converted to a numeric datatype. The function returns the same datatype as the numeric datatype of the argument.
The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value. This expression must be constant within each aggregate group. The ORDER BY clause takes a single expression that can be of any type that can be sorted.
For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.

Create a test table on database Test and insert some data:
use Test;
CREATE TABLE test( [TestA] [int] NOT NULL, [TestB] [nchar](10) NOT NULL, [TestC] [int] NOT NULL) ON [PRIMARY]
--Insert some test data
insert into test values( 101,'A',565)
insert into test values( 101,'B',741)
insert into test values( 101,'C',369)
insert into test values( 101,'D',111)
insert into test values( 102,'E',214)
insert into test values( 102,'F',69)
insert into test values( 102,'G',697)
insert into test values( 103,'H',97)
insert into test values( 103,'I',121)
insert into test values( 103,'J',198)
insert into test values( 104,'K',648)
insert into test values( 104,'L',444)
insert into test values( 104,'M',219)
insert into test values( 104,'N',239)
insert into test values( 104,'O',239)
insert into test values( 102,'P',239)
Execute SQL statement:
select TestA,TestB,TestC,
CUME_DIST() OVER (PARTITION BY [TestA] ORDER BY [TestC]) as CD,
--PERCENTILE_DISC(0.7) WITHIN GROUP (ORDER BY [TestC]) OVER (PARTITION BY [TestA]) AS PD,
PERCENTILE_CONT(0.6) WITHIN GROUP (ORDER BY [TestC]) OVER (PARTITION BY [TestA]) AS PC

FROM test

PERCENTILE_CONT
PERCENTILE_CONT is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
This function takes as an argument any numeric datatype or any nonnumeric datatype that can be implicitly converted to a numeric datatype. The function returns the same datatype as the numeric datatype of the argument.
The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value. This expr must be constant within each aggregation group. The ORDER BY clause takes a single expression that must be a numeric or datetime value, as these are the types over which Oracle can perform interpolation.
The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
You can use the PERCENTILE_CONT function as an analytic function. You can specify only the query_partitioning_clause in its OVER clause. It returns, for each row, the value that would fall into the specified percentile among a set of values within each partition.

The MEDIAN function is a specific case of PERCENTILE_CONT where the percentile value defaults to 0.5.

PERCENTILE_CONT()
It (standing for continuous percentile) is a window function which returns a value which corresponds to the given fraction in the sort order. If required, it will interpolate between adjacent input items.
Essentially, the following process is followed to find the value to return:
Get the number of rows in the partition, denoted by N
RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN))
look up rows FRN and CRN
If (CRN = FRN = RN) then the result is (value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
The MEDIAN function is a specific case of PERCENTILE_CONT, equivalent to PERCENTILE_CONT(0.5).

CREATE TABLE book_rating (name CHAR(30), star_rating TINYINT);

INSERT INTO book_rating VALUES ('Lord of the Ladybirds', 5);
INSERT INTO book_rating VALUES ('Lord of the Ladybirds', 3);
INSERT INTO book_rating VALUES ('Lady of the Flies', 1);
INSERT INTO book_rating VALUES ('Lady of the Flies', 2);
INSERT INTO book_rating VALUES ('Lady of the Flies', 5);

SELECT name, PERCENTILE_CONT(0.6) WITHIN GROUP (ORDER BY star_rating) OVER (PARTITION BY name) AS pc FROM book_rating;
+-------------------------------------+--------------------------------+
| name | pc |
+-------------------------------------+--------------------------------+
| Lord of the Ladybirds | 4.2000000000 |
| Lord of the Ladybirds | 4.2000000000   |
| Lady of the Flies | 2.6000000000   |
| Lady of the Flies | 2.6000000000   |
| Lady of the Flies | 2.6000000000   |
+------------------------------------+---------------------------------+

Calculation Step :
RN = 0.6(2-1) = 0.6
FRN = 0
CRN = 1
result = (1-0.6)*(3) + (0.6-0)*(5) = 4.2
RN = 0.6(3-1) = 1.2
FRN = 1
CRN = 2
result = (2-1.2)*(2)+(1.2-1)*(5) = 2.6

ROW_NUMBER
The ROW_NUMBER() is a window function that assigns a sequential integer number to each row in the query’s result set.
The following illustrates the syntax of the ROW_NUMBER() function:
ROW_NUMBER() OVER (
[PARTITION BY expr1, expr2,...]
ORDER BY expr1 [ASC | DESC], expr2,...
)
In this syntax,
First, the PARTITION BY clause divides the result set returned from the FROM clause into partitions. The PARTITION BY clause is optional. If you omit it, the whole result set is treated as a single partition.
Then, the ORDER BY clause sorts the rows in each partition. Because the ROW_NUMBER() is an order sensitive function, the ORDER BY clause is required.
Finally, each row in each partition is assigned a sequential integer number called a row number. The row number is reset whenever the partition boundary is crossed.

A) Simple SQL ROW_NUMBER() example
The following statement finds the first name, last name, and salary of all employees. In addition, it uses the ROW_NUMBER() function to add sequential integer number to each row.
SELECT ROW_NUMBER() OVER (ORDER BY salary) row_num,first_name,last_name, salary FROM employees;

B) Using SQL ROW_NUMBER() for pagination
The ROW_NUMBER() function can be used for pagination. For example, if you want to display all employees on a table in an application by pages, which each page has ten records.

First, use the ROW_NUMBER() function to assign each row a sequential integer number.
Second, filter rows by requested page. For example, the first page has the rows starting from one to 9, and the second page has the rows starting from 11 to 20, and so on.
The following statement returns the records of the second page, each page has ten records.

SELECT * FROM ( SELECT ROW_NUMBER() OVER (ORDER BY salary) row_num, first_name, last_name, salary FROM employees) t WHERE row_num > 10 AND row_num <=20;

C) Using SQL ROW_NUMBER() for finding nth highest value per group
The following example shows you how to find the employees whose have the highest salary in their departments:
-- find the highest salary per department
SELECT department_name,first_name,last_name,salary
FROM (SELECT department_name,
ROW_NUMBER() OVER (PARTITION BY department_name ORDER BY salary DESC) row_num, first_name, last_name, salary
FROM employees e INNER JOIN departments d ON d.department_id = e.department_id ) t
WHERE row_num = 1;
In the subquery:
First, the PARTITION BY clause distributes the employees by departments.
Second, the ORDER BY clause sorts the employee in each department by salary in the descending order.
Third, the ROW_NUMBER() assigns each row a sequential integer number. It resets the number when the department changes.

The following shows the result set of the subquery:

SQL

Wednesday, 24 July 2019

SQL Functions (Analytic)

No comments:

Post a Comment

SQL Transactions

About Me