Last Updated : 15 Jul, 2025
The $substrCP operator in MongoDB is used within the aggregation framework to extract substrings from strings based on Unicode code points. Unlike substring functions that operate on bytes, $substrCP extracts substrings accurately for both ASCII and non-ASCII characters, which makes it well suited to multilingual text and text containing special characters.
The $substrCP operator is used in the aggregation pipeline to extract a substring from a given string expression. It takes a starting code point index and a code point count to determine the substring. Because it counts code points rather than bytes, each character is handled as a whole unit no matter how many bytes it occupies in UTF-8.
For example, { $substrCP: [ "geeksforgeeks", 0, 5 ] } returns "geeks": extraction starts at index 0 and takes 5 code points from there.
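The index and count semantics can be approximated outside MongoDB. The following is a minimal plain-JavaScript sketch, not MongoDB's implementation; the helper name substrCP is hypothetical and is used only to illustrate how a start index and a count select code points.

```javascript
// Hypothetical helper that approximates $substrCP semantics in plain JavaScript.
// It is an illustration only and does not call MongoDB.
function substrCP(str, start, count) {
  // Mirror the operator's requirement of non-negative integer arguments.
  if (!Number.isInteger(start) || !Number.isInteger(count) || start < 0 || count < 0) {
    throw new RangeError("start and count must be non-negative integers");
  }
  // Array.from iterates a string by Unicode code point, not by UTF-16 unit.
  return Array.from(str).slice(start, start + count).join("");
}

console.log(substrCP("geeksforgeeks", 0, 5)); // geeks
console.log(substrCP("geeksforgeeks", 5, 3)); // for
console.log(substrCP("a😀b", 1, 1));          // 😀
```

The last call shows why counting code points matters: the emoji is a single code point, so it is returned whole.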
Why Use $substrCP?
Syntax:
{ $substrCP: [ <your string expression>, <code point index>, <code point count> ] }
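Key Terms:
<your string expression>: any valid expression that resolves to a string.
<code point index>: the zero-based position at which the substring starts; it must resolve to a non-negative integer.
<code point count>: the number of code points to extract; it must also resolve to a non-negative integer.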
To understand the MongoDB $substrCP operator, we need a collection on which to run the example queries.
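The article does not list the documents stored in the collection, so here is a minimal sketch of sample data the queries could run against. Only the first articlename appears in this article's own output; the second title and both publishedon values are made-up placeholders.

```javascript
// Hypothetical sample documents for the articles collection.
// Only "Deep learning in R Programming" comes from the article itself;
// the other title and both publishedon dates are assumptions.
const sampleArticles = [
  { articlename: "Deep learning in R Programming", publishedon: "20230412" },
  { articlename: "Ünïcödé strings in MongoDB", publishedon: "20240102" }
];

// Insert only when running inside mongosh, where the global `db` exists.
// In plain Node.js this branch is skipped and the array is simply defined.
if (typeof db !== "undefined") {
  db.articles.insertMany(sampleArticles);
}
```

Storing publishedon as a string matters here, because $substrCP operates on strings.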
We have an articles collection with a publishedon field that stores publication dates as strings in YYYYMMDD format. We need to split each date into the publication year and the remaining month-and-day part.
Query:
db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      // First 4 code points: the year (YYYY)
      publicationyear: { $substrCP: [ "$publishedon", 0, 4 ] },
      // Everything after the year: month and day (MMDD)
      publicationmonthday: {
        $substrCP: [
          "$publishedon",
          4,
          { $subtract: [ { $strLenCP: "$publishedon" }, 4 ] }
        ]
      }
    }
  }
])
Output:
The result depends on the stored documents. For instance, a publishedon value of "20230412" would yield publicationyear "2023" and publicationmonthday "0412".
Explanation:
"publicationyear" extracts the first 4 characters of publishedon, representing the year.
"publicationmonthday" extracts the remaining characters, using $strLenCP and $subtract to calculate the length dynamically.
Suppose we have a collection articles in the geeksforgeeks database with documents containing an articlename field. We want to create a new field shortName with only the first 10 characters of each article's name. This is useful for displaying short previews of article titles.
Query:
db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      shortName: {
        $substrCP: ["$articlename", 0, 10]
      }
    }
  }
]);
Output:
{
  "articlename": "Deep learning in R Programming",
  "shortName": "Deep learn"
}
Explanation:
$substrCP extracts a substring starting from index 0 (the first character) and taking 10 characters from articlename.
shortName contains the first 10 characters, which can be used as a preview or snippet of the full title.
Suppose another document in the articles collection has an articlename that contains multibyte characters.
Query:
db.articles.aggregate([
  {
    $project: {
      shortName: {
        $substrCP: ["$articlename", 0, 15]
      }
    }
  }
]);
Output:
{ "shortName": "Social Media AP" }
Explanation: $substrCP counts code points rather than bytes, so characters are extracted whole even when they are multibyte, and the result never contains a partially cut character.
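The difference between counting code points and counting bytes can be seen outside MongoDB. The following is a small plain-JavaScript (Node.js) sketch, not a MongoDB query; the sample title is a made-up value. It shows that a code point slice keeps multibyte characters whole, whereas a byte slice of the same length can end in the middle of a character, which is the situation the byte-based $substrBytes operator has to guard against.

```javascript
// Hypothetical sample title mixing multibyte and ASCII characters.
const title = "寿司 sushi";

const codePoints = Array.from(title);              // one entry per code point
const utf8Bytes = new TextEncoder().encode(title); // UTF-8 byte sequence

console.log(codePoints.length); // 8
console.log(utf8Bytes.length);  // 12

// Taking the first 2 code points keeps both characters whole.
const byCodePoint = codePoints.slice(0, 2).join("");
console.log(byCodePoint); // 寿司

// Taking the first 2 bytes cuts the first 3-byte character in the middle,
// so strict UTF-8 decoding of that slice fails.
let byteSliceIsValidUtf8 = true;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(utf8Bytes.slice(0, 2));
} catch (err) {
  byteSliceIsValidUtf8 = false;
}
console.log(byteSliceIsValidUtf8); // false
```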
In MongoDB, the $substrCP operator extracts substrings accurately based on Unicode code points. It handles both single-byte and multibyte characters correctly, which makes it well suited to applications that manage diverse textual data. With $substrCP, MongoDB users can manipulate string data within the aggregation framework. Whether we are dealing with date parsing, text truncation, or multilingual support, $substrCP is the operator to reach for when string manipulation must respect Unicode characters.