How the MEDIAN() Function Works in MariaDB

The MEDIAN() function is a useful tool for calculating the median value of a set of numbers. The median is the middle value of a sorted list of numbers, or the average of the middle two values if the list has an even number of elements. Because it depends only on the middle of the sorted list, it is less sensitive to outliers than the average, which makes it useful for finding a typical value and for comparing distributions.

Unlike the MEDIAN() function in some other databases, MEDIAN() in MariaDB is available only as a window function. It must be called with an OVER clause, and it returns a value for every input row instead of collapsing the rows into one. This lets you calculate the median of each group of rows with PARTITION BY while keeping the individual rows in the result.
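
As a quick illustration of the definition and of the per-row behaviour, the following sketch runs MEDIAN() over four literal values. The derived table t and its values are made up for this example and are not part of any real schema.

```sql
-- The derived table t and its values are illustrative only.
SELECT
  v,
  MEDIAN(v) OVER () AS median_v  -- empty OVER (): a single partition holding all rows
FROM (
  SELECT 1 AS v
  UNION ALL SELECT 3
  UNION ALL SELECT 5
  UNION ALL SELECT 100
) AS t;
```

With an even number of values, the median is the average of the two middle ones, (3 + 5) / 2 = 4, and that same value is reported on each of the four rows. The outlier 100 does not pull the median upward the way it pulls the average, which is 27.25 here.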

Syntax

The syntax of the MEDIAN() function is as follows:

MEDIAN(median_expression) OVER ([ PARTITION BY partition_expression ])

The function takes one argument and a mandatory OVER clause:

  • median_expression: A numeric expression whose values the median is calculated from, such as a column name or an expression built from columns. The rows of each partition are ordered by this expression to find the middle value.
  • OVER: A required clause that defines the set of rows for the median calculation. MEDIAN() cannot be called without it. The parentheses may be empty, in which case the median is calculated over all rows of the result set.
    • PARTITION BY partition_expression: An optional subclause that divides the rows into groups. The partition_expression can be any valid expression, such as a column name or a function call. The function calculates one median for each group of rows that share the same value of partition_expression and returns it on every row of that group.

Unlike most window functions, MEDIAN() does not accept an ORDER BY subclause or a window frame clause (ROWS or RANGE) inside OVER. The ordering is always taken from median_expression, and the frame is always the whole partition.

The function returns a numeric value that represents the median of median_expression within the partition of the current row. It is available in MariaDB 10.3.3 and later.
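
The MariaDB documentation describes MEDIAN() as a special case of PERCENTILE_CONT() with an argument of 0.5, ordered by the MEDIAN() argument. The sketch below places both forms side by side. It assumes a scores table with subject and score columns, like the one used in Example 1 below.

```sql
-- Both computed columns should hold the same value on every row.
SELECT
  subject,
  MEDIAN(score) OVER (PARTITION BY subject) AS median_score,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score)
    OVER (PARTITION BY subject) AS percentile_50
FROM scores;
```

The longer PERCENTILE_CONT() form is mainly worth knowing when a percentile other than the 50th is needed, since only its argument has to change.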

Examples

In this section, we will show some examples of how to use the MEDIAN() function in different scenarios.

Example 1: Calculating the median of a column value

Suppose you have a table called scores that stores the scores of students, with the columns name, subject, and score. The score column is a numeric value that represents the score of the student in the subject. You want to calculate the median score of each subject to get a typical score that is not skewed by a few very high or very low results. You can use the MEDIAN() function with the PARTITION BY subclause to do so. For example, you can execute the following statement:

SELECT subject, MEDIAN(score) OVER (PARTITION BY subject) AS median_score FROM scores;

This will return the subject and the median score of each subject for each row, or an empty result set if the table is empty. For example, the result might look like this:

+---------+--------------+
| subject | median_score |
+---------+--------------+
| Math    | 75           |
| Math    | 75           |
| Math    | 75           |
| Math    | 75           |
| Math    | 75           |
| English | 80           |
| English | 80           |
| English | 80           |
| English | 80           |
| English | 80           |
| Science | 85           |
| Science | 85           |
| Science | 85           |
| Science | 85           |
| Science | 85           |
+---------+--------------+

Note that the median score is the middle value of the sorted list of scores for each subject, or the average of the middle two values if the list has an even number of elements. For example, the median score of the subject Math is 75, because the sorted list of scores for Math is 50, 60, 75, 80, and 90, and the middle value is 75.
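
Because MEDIAN() returns a value on every row, the result above repeats each median once per student. A minimal sketch of one way to get a single row per subject is to add DISTINCT. The table definition is an assumption, since the column types are not given above, and the student names are invented. Only the Math scores come from the example.

```sql
-- Hypothetical schema: the column types are assumptions.
CREATE TABLE scores (
  name    VARCHAR(50),
  subject VARCHAR(50),
  score   INT
);

-- Math scores from the example above; the student names are made up.
INSERT INTO scores (name, subject, score) VALUES
  ('Alice', 'Math', 50),
  ('Bob',   'Math', 60),
  ('Carol', 'Math', 75),
  ('Dave',  'Math', 80),
  ('Erin',  'Math', 90);

-- DISTINCT collapses the repeated per-row medians into one row per subject.
SELECT DISTINCT
  subject,
  MEDIAN(score) OVER (PARTITION BY subject) AS median_score
FROM scores;
```

For this data the query should return a single Math row with a median of 75. MariaDB may display the value with trailing decimal places.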

Example 2: Calculating the median of a column value with a sliding window

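Suppose you have a table called sales that stores the sales data of various products, with the columns product_id, date, and amount. The amount column is a numeric value that represents the sales amount of the product on the date. You want to calculate the median sales amount of each product over the current date and the previous two dates, so that short-term fluctuations are smoothed out. Because MEDIAN() in MariaDB does not accept ORDER BY or ROWS BETWEEN inside its OVER clause, the sliding window cannot be written as a window frame. One way to work around this is to number the rows of each product by date with ROW_NUMBER(), join each row to itself and to the two rows before it, and let MEDIAN() partition by the current row. The following statement is a sketch of this approach and assumes one row per product and date:

-- Number the rows of each product in date order.
WITH numbered AS (
  SELECT
    product_id,
    date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY date) AS rn
  FROM sales
)
-- Pair each row (cur) with itself and the two rows before it (win).
SELECT DISTINCT
  cur.product_id,
  cur.date,
  cur.amount,
  MEDIAN(win.amount) OVER (PARTITION BY cur.product_id, cur.rn) AS median_amount
FROM numbered AS cur
JOIN numbered AS win
  ON win.product_id = cur.product_id
  AND win.rn BETWEEN cur.rn - 2 AND cur.rn
ORDER BY cur.product_id, cur.date;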

This will return the product_id, date, amount, and the median amount of each product for the current date and the previous two dates for each row, or an empty result set if the table is empty. For example, the result might look like this:

+------------+------------+--------+---------------+
| product_id | date       | amount | median_amount |
+------------+------------+--------+---------------+
| 1          | 2024-01-01 | 100    | 100           |
| 1          | 2024-01-02 | 150    | 125           |
| 1          | 2024-01-03 | 200    | 150           |
| 1          | 2024-01-04 | 250    | 200           |
| 1          | 2024-01-05 | 300    | 250           |
| 2          | 2024-01-01 | 50     | 50            |
| 2          | 2024-01-02 | 75     | 62.5          |
| 2          | 2024-01-03 | 100    | 75            |
| 2          | 2024-01-04 | 125    | 100           |
| 2          | 2024-01-05 | 150    | 125           |
+------------+------------+--------+---------------+

Note that the median amount is the middle value of the sorted list of amounts for each product for the current date and the previous two dates, or the average of the middle two values if the list has an even number of elements. For example, the median amount of the product 1 for the date 2024-01-03 is 150, because the sorted list of amounts for product 1 for the dates 2024-01-01, 2024-01-02, and 2024-01-03 is 100, 150, and 200, and the middle value is 150.
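
To experiment with this example, a sales table holding the rows from the sample result could be set up as sketched below. The column types are assumptions, since the text above only names the columns.

```sql
-- Hypothetical schema: the column types are assumptions.
CREATE TABLE sales (
  product_id INT,
  date       DATE,
  amount     DECIMAL(10, 2)
);

-- Rows taken from the sample result above.
INSERT INTO sales (product_id, date, amount) VALUES
  (1, '2024-01-01', 100),
  (1, '2024-01-02', 150),
  (1, '2024-01-03', 200),
  (1, '2024-01-04', 250),
  (1, '2024-01-05', 300),
  (2, '2024-01-01', 50),
  (2, '2024-01-02', 75),
  (2, '2024-01-03', 100),
  (2, '2024-01-04', 125),
  (2, '2024-01-05', 150);
```

The first row of each product has no earlier rows, so its median is the amount itself. The second row has only one earlier row, so its median is the average of two amounts, for example (100 + 150) / 2 = 125 for product 1.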

Related Functions

Some other functions are related to the MEDIAN() function and can be used to perform other statistical calculations in MariaDB. Here are some of them:

  • AVG(): This function returns the average value of a set of numbers.
  • MIN(): This function returns the minimum value of a set of numbers.
  • MAX(): This function returns the maximum value of a set of numbers.
  • SUM(): This function returns the sum of a set of numbers.
  • STDDEV(): This function returns the standard deviation of a set of numbers.
  • VARIANCE(): This function returns the variance of a set of numbers.
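
These functions can also be used as window functions in MariaDB, so they can sit next to MEDIAN() in the same query. The following sketch assumes the scores table from Example 1 and reports several statistics per subject. The column aliases are arbitrary.

```sql
-- One summary row per subject; DISTINCT removes the per-row repetition.
SELECT DISTINCT
  subject,
  MEDIAN(score) OVER (PARTITION BY subject) AS median_score,
  AVG(score)    OVER (PARTITION BY subject) AS avg_score,
  MIN(score)    OVER (PARTITION BY subject) AS min_score,
  MAX(score)    OVER (PARTITION BY subject) AS max_score,
  STDDEV(score) OVER (PARTITION BY subject) AS stddev_score
FROM scores;
```

A large gap between median_score and avg_score for a subject is a hint that a few extreme scores are skewing the average.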

Conclusion

The MEDIAN() function calculates the median value of a set of numbers. In MariaDB it is a window function that returns the median of all rows, or of each partition, on every row. It is useful for finding a typical value that is not distorted by outliers and for comparing distributions. Related functions such as AVG(), MIN(), MAX(), SUM(), STDDEV(), and VARIANCE() cover the average, minimum, maximum, sum, standard deviation, and variance. Used together, these functions give you a fuller picture of your data.