Introduction to MongoDB $substrCP Operator

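$substrCP is a string aggregation operator in MongoDB that extracts a substring from a string, with the starting position and length measured in UTF-8 code points rather than bytes. A code point is the unique number that Unicode assigns to a character; in UTF-8, a single code point occupies between one and four bytes.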
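To make the difference between code points and bytes concrete, here is a small sketch in plain JavaScript for Node.js, not MongoDB itself. It relies on Array.from iterating a string by code point and on Buffer.byteLength reporting the UTF-8 encoded size. The function names and sample strings are assumptions chosen for illustration.

```javascript
// Illustrative only: count code points versus UTF-8 bytes in Node.js.
// countCodePoints and countUtf8Bytes are hypothetical helper names.
function countCodePoints(str) {
  // Array.from iterates by Unicode code point, not by UTF-16 code unit.
  return Array.from(str).length;
}

function countUtf8Bytes(str) {
  // Buffer.byteLength returns the size of the string once encoded as UTF-8.
  return Buffer.byteLength(str, "utf8");
}

// Sample strings: plain ASCII, an accented letter, two CJK characters, one emoji.
const samples = ["cafe", "caf\u00e9", "\u5bff\u53f8", "\u{1F600}"];
for (const s of samples) {
  console.log(s, countCodePoints(s), countUtf8Bytes(s));
}
```

For ASCII text the two counts are equal; they diverge as soon as a string contains characters outside the ASCII range.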

Syntax

The syntax of the $substrCP operator is as follows:

{ $substrCP: [ <string>, <startingIndex>, <length> ] }
  • <string>: The string from which to extract the substring.
  • <startingIndex>: The zero-based code point index at which the substring starts. It must resolve to a non-negative integer.
  • <length>: The number of code points to extract. It must resolve to a non-negative integer. This argument is required: $substrCP takes exactly three arguments.
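The indexing rules can be sketched outside the database with a small JavaScript function. This is a hypothetical helper that mimics the three-argument form shown above, not MongoDB's implementation; it assumes all three arguments are supplied and that the index and length are non-negative integers.

```javascript
// Hypothetical helper that mimics the three-argument form of $substrCP
// in plain JavaScript. It is a sketch, not MongoDB's implementation.
function substrCP(str, startingIndex, length) {
  if (!Number.isInteger(startingIndex) || startingIndex < 0) {
    throw new RangeError("startingIndex must be a non-negative integer");
  }
  if (!Number.isInteger(length) || length < 0) {
    throw new RangeError("length must be a non-negative integer");
  }
  // Array.from splits the string into code points, so a character such as
  // "\u00e9" (e with acute accent) occupies exactly one index position.
  return Array.from(str).slice(startingIndex, startingIndex + length).join("");
}

console.log(substrCP("John Doe", 0, 2));
console.log(substrCP("caf\u00e9 au lait", 3, 4));
```

The first call takes two code points from index 0; the second starts at the accented letter at index 3 and takes four code points, regardless of how many bytes each one needs.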

Use Cases

The $substrCP operator is commonly used in the following scenarios:

  • Extracting a part of a string, such as extracting the date and time from an email subject.
  • Extracting characters from text that contains multibyte characters, such as accented letters, CJK text, or emoji, where byte-based offsets could split a character.
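One caveat for the emoji scenario is that counting by code point is not the same as counting by visible symbol: some emoji are built from several code points. The plain JavaScript sketch below illustrates this with a thumbs-up emoji followed by a skin-tone modifier. The helper name takeCodePoints and the sample strings are assumptions for illustration only.

```javascript
// Illustrative only: emoji and code points in plain JavaScript.
const thumbsUp = "\u{1F44D}";                 // one code point
const thumbsUpMedium = "\u{1F44D}\u{1F3FD}";  // base emoji + skin-tone modifier

// Hypothetical helper: take `count` code points starting at index `start`.
function takeCodePoints(str, start, count) {
  return Array.from(str).slice(start, start + count).join("");
}

// Three emoji that are each a single code point.
const reactions = "\u{1F600}\u{1F44D}\u{1F389}";

// Index 1, length 1 selects the second emoji in the string.
console.log(takeCodePoints(reactions, 1, 1) === thumbsUp); // true

// The modified emoji renders as one symbol but spans two code points,
// so taking a single code point returns only the unmodified base emoji.
console.log(Array.from(thumbsUpMedium).length); // 2
console.log(takeCodePoints(thumbsUpMedium, 0, 1) === thumbsUp); // true
```

When the data may contain such multi-code-point sequences, the length passed to a code point based extraction has to account for every code point in the sequence.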

Examples

Here are two examples of using the $substrCP operator.

Example 1

Assume there is a collection called user that stores user information, including each user's full name in the name field. We want the first two characters of each name, which are also the first two characters of the user's first name. We can use the following aggregation pipeline:

db.user.aggregate([
  {
    $project: {
      firstName: { $substrCP: ["$name", 0, 2] }
    }
  }
])

This pipeline uses the $project stage to reshape each document so that it contains only the _id field, which $project includes by default, and the new firstName field. Inside the $project stage, the $substrCP operator extracts two characters from the name field, starting at index 0, as the value of firstName.

Assume the collection contains the following two documents:

{ "_id": 1, "name": "John Doe" }
{ "_id": 2, "name": "Jane Smith" }

Using the above aggregation pipeline, we get the following results:

{ "_id": 1, "firstName": "Jo" }
{ "_id": 2, "firstName": "Ja" }

Example 2

Assume there is a collection called product that stores product information, including the name and price of each product. We want to extract three characters from each product name, starting at the second character (index 1). We can use the following aggregation pipeline:

db.product.aggregate([
  {
    $project: {
      namePrefix: { $substrCP: ["$name", 1, 3] }
    }
  }
])

This pipeline again uses the $project stage, so each output document contains only the _id field, which is included by default, and the namePrefix field. The $substrCP operator starts at index 1, the second character of the name field, and extracts three characters as the value of namePrefix.

Assume the collection contains the following two documents:

{ "_id": 1, "name": "Apple iPhone 13", "price": 999 }
{ "_id": 2, "name": "Samsung Galaxy S21", "price": 799 }

Using the above aggregation pipeline, we get the following results:

{ "_id": 1, "namePrefix": "ppl" }
{ "_id": 2, "namePrefix": "ams" }

Conclusion

The $substrCP operator is a MongoDB string aggregation operator that extracts a substring from a string. Unlike the $substrBytes operator, which measures positions in UTF-8 bytes, $substrCP measures the starting index and length in Unicode code points, so a multibyte character is never split. This makes it the safer choice whenever the input may contain non-ASCII text.
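The contrast with byte-based extraction can be sketched in plain JavaScript for Node.js. This is an illustration of the underlying encoding issue, not of MongoDB's behavior: MongoDB's $substrBytes is documented to reject a range that starts or ends in the middle of a UTF-8 character, whereas the Node.js decoder used here substitutes a replacement character. The variable names and sample string are assumptions.

```javascript
// Illustrative only: slicing by code point versus slicing by UTF-8 byte.
const text = "\u00e9t\u00e9"; // "été": 3 code points, 5 UTF-8 bytes
const bytes = Buffer.from(text, "utf8");

// Code point based: the first 2 code points are always whole characters.
const byCodePoint = Array.from(text).slice(0, 2).join("");

// Byte based: the first 2 bytes happen to cover exactly the first character.
const firstTwoBytes = bytes.subarray(0, 2).toString("utf8");

// Byte based: the first 4 bytes end in the middle of the final character,
// so the decoded result no longer matches any prefix of the original text.
const firstFourBytes = bytes.subarray(0, 4).toString("utf8");

console.log(bytes.length);                               // 5
console.log(byCodePoint === "\u00e9t");                  // true
console.log(firstTwoBytes === "\u00e9");                 // true
console.log(text.startsWith(firstFourBytes));            // false
```

The same numeric arguments therefore select different text depending on whether they are interpreted as bytes or as code points.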