Analyzing the Distribution of Records
Analyzing the Distribution of Records
The First and Last Values of a Selected Field
The First Value of a Selected Field
We have already seen how to create groups (partitions) of records. One of the ways you can analyze your records is to get some statistics about each group. Transact-SQL provides a series of functions, named analytic functions that allow you to get the variations and tendencies of records within a group.
Inside of each group of records, you may want to know what value comes first based on a column of your choice. To get this information, you can call the FIRST_VALUE() function. Its syntax is:
FIRST_VALUE ( [scalar_expression ) OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] ) RETURNS Data Type of scalar_expression
The FIRST_VALUE() function takes as argument one the columns of the table or view on which data selection is made. The function is followed by a call to OVER() that requires arranging the records.
Practical Learning: Getting the First Value of a Selected Field |
USE LambdaPropertiesManagement1; GO SELECT props.Bedrooms, props.MonthlyRate FROM Listing.Properties props ORDER BY props.MonthlyRate; GO
USE LambdaPropertiesManagement1; GO SELECT FIRST_VALUE(props.MonthlyRate) OVER(ORDER BY props.MonthlyRate) [First Monthly Rate] FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType [Property Type], props.MonthlyRate Rate, props.City, FIRST_VALUE(props.MonthlyRate) OVER(ORDER BY props.PropertyType) [First Monthly Rate in Selected Property Type] FROM Listing.Properties props; GO
|
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType [Property Type],
props.MonthlyRate Rate,
props.City,
FIRST_VALUE(props.MonthlyRate) OVER(PARTITION BY props.PropertyType ORDER BY props.PropertyType) [First Monthly Rate in Selected Property Type]
FROM Listing.Properties props;
GO
|
As opposed to the first value of a group of records, you may want to get the last one. To do this, you can call the LAST_VALUE() function. Its syntax is:
LAST_VALUE ( [scalar_expression ) OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] ) RETURNS Data Type of scalar_expression
The LAST_VALUE() function takes one the columns of the table or view as argument. The function follows the same logic as FIRST_VALUE(), in reverse.
Practical Learning: Getting the First Value of a Selected Field |
SELECT FIRST_VALUE(props.MonthlyRate) OVER(ORDER BY props.MonthlyRate) [First Monthly Rate] FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType [Property Type], props.MonthlyRate Rate, props.City, FIRST_VALUE(props.MonthlyRate) OVER(ORDER BY props.PropertyType) [First Monthly Rate in Selected Property Type] FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType [Property Type],
props.MonthlyRate Rate,
props.City,
FIRST_VALUE(props.MonthlyRate) OVER(PARTITION BY props.PropertyType ORDER BY props.PropertyType) [First Monthly Rate in Selected Property Type]
FROM Listing.Properties props;
GO
The Distribution of Records |
Introduction |
A percentage is a fractional number from a 100 scale. The fractions go from 1 to 100 and each number is represented by following it with the % sign. Examples are 1%, 2%, 3%, 4%, 5%, and so on. Imagine you are in charge of managing rent for 100 houses or properties, or you are teaching a course to 100 students. Based on these examples, each house (or student) represents 1% of the total number of houses (or students). 2 houses (or students) represent 2% of the total, and so on.
A fraction is a section of a whole thing. For example, you can cut a cantaloupe in two parts or fractions. Each fraction is half the whole:
A whole can have many fractions. For our example of managing rent for 100 houses or teaching a course to 100 students, for managing purposes you can split the houses (or students) in half. Each half would be a fraction and would have 50 houses (or students).
You can number the houses (or students) in the first group from 1 to 50 and the houses (or student) in the second group from 51 to 100. In the same way, you can divide the number of houses (or students) in 4 groups. Each group would contain 25 houses (or students). You can number the first group from 1 to 25, the second group from 26 to 50, and so on. For some statistical reasons, you may want to take some action for the house or the student in the 25th position, which is the last one in the 25% range. Or imagine you want to take some action for the house or the student in the 50th position, which is the last in the 50% range. |
A percentile is a positional (position or location) value that corresponds to the percentage value of a whole. Actually, to get the percentiles, the whole must be divided in equal parts and each part can be given a name. If the whole is divided in 4 parts, each part is called a quartile:
If the whole is divided in two, the divider is called a median, and is made of two quartiles (the first quartile and the second quartile).
Transact-SQL provides a series of functions used to get percentile-related statistics. These are referred to as analytic functions.
When you have a group of values, each occupies a certain position. If the list is arranged (ordered), each value holds an incrementing position as 1, 2, 3, and so on. This position is also referred to as the rank. For statistical purposes, that rank can be treated as a weight. To give the same weight or the same importance to each value, the position, which is the value of the rank, is divided by the total number of values. This produces a fraction that can be, or is, converted to a percentage value. To let you get this information, Transact-SQL provides a function named PERCENT_RANK. Its syntax is:
PERCENT_RANK() OVER( [ partition_by_clause ] order_by_clause ) RETURNS float(53)
The PERCENT_RANK() function takes no argument. The function is followed by a call to OVER() that requires arranging the records.
Practical Learning: Getting the Percentage Distribution of Records |
USE LambdaPropertiesManagement1; GO SELECT props.PropertyNumber [Property #], props.PropertyType Type, props.MonthlyRate Rate, PERCENT_RANK() OVER(ORDER BY props.MonthlyRate) "Rank Distribution" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyNumber [Property #],
props.PropertyType Type,
props.MonthlyRate Rate,
PERCENT_RANK() OVER(PARTITION BY props.PropertyType ORDER BY props.MonthlyRate) "Rank Distribution"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyNumber [Property #], props.PropertyType Type, props.MonthlyRate Rate, FORMAT(PERCENT_RANK() OVER(ORDER BY props.MonthlyRate), N'P') "Rank Percentage" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyNumber [Property #],
props.PropertyType Type,
props.MonthlyRate Rate,
FORMAT(PERCENT_RANK() OVER(PARTITION BY props.PropertyType ORDER BY props.MonthlyRate), N'P') "Rank Distribution"
FROM Listing.Properties props;
GO
If a table is not empty, one of its obvious characteristics is that the table has values that have been created (the values exist and are not hidden) and each value occupies a specific position (the values of a table of a database are not truly inserted; each value is added to the end of the existing list, which is referred to as appending a value to the list). The fact that the values exist and are known, we say that they are spread or distributed. By definition, data or record distribution is the characteristic that the values of a table are available (to be accessed and used).
Discrete distribution is the characteristic that:
A percentile discrete distribution is a value that represents the percentage position (the percentile) of a value of a column of a table. To let you calculate the percentile discrete distribution, Transact-SQL provides a function named PERCENTILE_DISC. Its syntax is:
PERCENTILE_DISC ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] ) RETURNS Data Type of order_by_expression;
The PERCENTILE_DISC() function takes one argument and requires some conditions. The value passed to the function must be a decimal value between 0.00 and 1.00. The value represents the fraction by which the percentile will be calculated. We will see how that number influences the result. After calling the function, you must add a WITHIN GROUP clause that resembles a function. In the parentheses of that clause, you arrange the values of a column of your choice. Then, you must call OVER(). If you don't pass a parameter to it, the result is the first value of the column passed to WITHIN GROUP(). Otherwise, you can create partitions over but you must not create another arrangement of records OVER().
Practical Learning: Getting the Discrete Percentile |
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType, props.City, props.MonthlyRate, PERCENTILE_DISC(0) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER() "Mohnthly Rate of First City Selected" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_DISC(0) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.PropertyType) "Mohnthly Rate of First City Selected"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.PropertyType) "First City Within Group"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_DISC(0.50) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.PropertyType) "Mohnthly Rate of First City Selected"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.PropertyType) "Mohnthly Rate of First City Selected"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_DISC(0.850) WITHIN GROUP(ORDER BY props.City) OVER(PARTITION BY props.PropertyType) "Mohnthly Rate of First City Selected"
FROM Listing.Properties props;
GO
The Continuous Distribution of a Percentile
One of the most common aspects of the values of a column is that they are usually different. This means that:
These characteristics mean that the distribution is continuous. To let you evaluate the continuous distribution of a percentil, Transact-SQL provides the PERCENTILE_CONT() function. Its syntax is:
PERCENTILE_CONT ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] )
The PERCENTILE_CONT() function takes one argument that must be a decimal value between 0.00 and 1.00. Calling the function is followed by the WITHIN GROUP() clause. In the parentheses of that clause, you must arrange the values of a column of your choice. The column must be numeric-based. Strings are not allowed. This is followed by a call to OVER() in which you can create partitions but no ORDER BY.
Practical Learning: Getting the Percentage Distribution of Records |
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType, props.MonthlyRate, PERCENTILE_CONT(0) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER() "Continuous Percentile" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.MonthlyRate,
PERCENTILE_CONT(0) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.MonthlyRate) "Continuous Percentile"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
props.MonthlyRate,
PERCENTILE_CONT(1) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER() "Continuous Percentile"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.City,
PERCENTILE_CONT(1) WITHIN GROUP(ORDER BY props.MonthlyRate) OVER(PARTITION BY props.MonthlyRate) "Continuous Percentile"
FROM Listing.Properties props;
GO
The cumulative distribution of the values of a column is the probability that a certain value can be found at a certain position or a value less than that position. To let you can calculate it, Transact-SQL provides the CUME_DIST() function. Its syntax is:
CUME_DIST() OVER( [ partition_by_clause ] order_by_clause ) RETURNS float(53)
The CUME_DIST() function takes no argument and it is followed by a call to OVER() that requires arranging the records.
Practical Learning: Getting the Cumulative Distribution of a Series |
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType Type, props.MonthlyRate Rate, CUME_DIST() OVER(ORDER BY props.MonthlyRate) N'Cumulative Distribution' FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType Type,
props.MonthlyRate Rate,
CUME_DIST() OVER(PARTITION BY props.PropertyType ORDER BY props.MonthlyRate) N'Cumulative Distribution'
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType Type, props.MonthlyRate Rate, FORMAT(CUME_DIST() OVER(PARTITION BY props.PropertyType ORDER BY props.MonthlyRate), N'P') N'Cumulative Distribution' FROM Listing.Properties props; GO
One of the most common operations of data analysis is to perform comparisons and one of the goals of comparing records is to know how they change from one occurrence to another. To let you perform such comparisons, Transact-SQL provides the LAG() function. Its syntax is:
LAG(scalar_expression [, offset] [, default]) OVER( [ partition_by_clause ] order_by_clause ) RETURNS Data Type of scalar_expression
The LAG() function takes as argument one the columns of the table or view on which data selection is made. The function is followed by a call to OVER() that requires arranging the records.
By default, the LAG() function produces the value of the previous record passed to it. If you want, you can indicate how many records to jump back. That's the role of the offset argument.
If a record has no value, then the LAG() function would return the previous value as NULL, as shown in the last column of the following result:
If you don't want the table to display NULL for previous records that have no value, then pass the default argument.
Practical Learning: Getting the Previous Value of a Record |
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType, props.MonthlyRate, props.City, LAG(props.MonthlyRate) OVER(ORDER BY props.MonthlyRate) "Comparison With Previous Value" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.MonthlyRate,
props.City,
LAG(props.City) OVER(ORDER BY props.MonthlyRate) "Previous Monthly Rate"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType, props.MonthlyRate, props.City, LAG(props.MonthlyRate) OVER(PARTITION BY props.City ORDER BY props.MonthlyRate) "Previous Monthly Rate In Group" FROM Listing.Properties props; GO
USE LambdaPropertiesManagement1;
GO
SELECT props.PropertyType,
props.MonthlyRate,
props.City,
LAG(props.MonthlyRate, 3) OVER(PARTITION BY props.City ORDER BY props.MonthlyRate) "Back 3 Previous Monthly Rates"
FROM Listing.Properties props;
GO
USE LambdaPropertiesManagement1; GO SELECT props.PropertyType, props.MonthlyRate, props.City, LAG(props.MonthlyRate, 1, 0) OVER(PARTITION BY props.City ORDER BY props.MonthlyRate) "Previous Monthly Rates" FROM Listing.Properties props; GO
As opposed to the previous value of a record, you may want to get the next value. To let you get this information, Transact-SQL provides the LEAD() function. Its syntax is:
LEAD(scalar_expression [, offset] [, default]) OVER( [ partition_by_clause ] order_by_clause ) RETURNS Data Type of scalar_expression
The LEAD() function takes the exact same argument as the LAG() function and it follows the same logic in reverse.
Practical Learning: Getting the Next Value of a Record |
USE LambdaPropertiesManagement1; GO SELECT t.FirstName "First Name", t.LastName "Last Name", t.MaritalStatus "Status", t.NumberOfChildren "Children Count", LEAD(t.LastName) OVER(ORDER BY t.MaritalStatus) "Next Last Name" FROM Rentals.Tenants t; GO
USE LambdaPropertiesManagement1; GO SELECT t.FirstName "First Name", t.LastName "Last Name", t.MaritalStatus "Status", LEAD(t.LastName) OVER(PARTITION BY t.MaritalStatus ORDER BY t.MaritalStatus) "Next Last Name In Partition", t.NumberOfChildren "Children Count" FROM Rentals.Tenants t; GO
USE LambdaPropertiesManagement1;
GO
SELECT t.FirstName "First Name",
t.LastName "Last Name",
t.MaritalStatus "Status",
LEAD(t.LastName, 2) OVER(PARTITION BY t.MaritalStatus ORDER BY t.MaritalStatus) "Last Name In Next 2 Positions",
t.NumberOfChildren "Children Count"
FROM Rentals.Tenants t;
GO
USE LambdaPropertiesManagement1;
GO
SELECT t.FirstName "First Name",
t.LastName "Last Name",
t.MaritalStatus "Status",
LEAD(t.LastName, 2, 1234) OVER(PARTITION BY t.MaritalStatus ORDER BY t.MaritalStatus) "Last Name In Next 2 Positions",
t.NumberOfChildren "Children Count"
FROM Rentals.Tenants t;
GO
|
||
Previous | CCopyright © 2013-2022, FunctionX | Next |
|