Showing content from https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding below:

Manual:PAGENAMEE encoding - MediaWiki

The encoding of MediaWiki page names is a complicated topic. The MediaWiki magic words PAGENAME, PAGENAMEE, and urlencode have distinct implementations, each with its own peculiarities.

A MediaWiki page name can have a leading space but not a trailing space. The ASCII characters that are not allowed in MediaWiki page names are the three types of brackets, the sharp sign, underscore, vertical bar, and all control characters (including tabs and newlines).

# < > [ ] _ { | }

The underscore is not really disallowed, but it is treated exactly like a space in MediaWiki page names, so "A_B" and "A B" refer to the same page name: pages are created, searched, and displayed (with their title) using spaces, never underscores.

This article shall refer to these as the "not-allowed pagename characters". For clarity, we will present other ASCII 7-bit values for characters as the URL-style encoding of percent-hex-hex form known as percent-encoding.
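As a rough illustration of the rules above, the following Python sketch lists the not-allowed characters found in a candidate page name and folds underscores into spaces. The function names are hypothetical, and the check covers only the ASCII rules described here, not MediaWiki's full title validation.

```python
# Sketch only: covers the ASCII rules described in this article,
# not MediaWiki's real title parser.
NOT_ALLOWED = set("#<>[]{|}")


def find_not_allowed(title: str) -> list[str]:
    """Return the not-allowed characters present in title, in order of first appearance."""
    found = []
    for ch in title:
        is_control = ord(ch) < 0x20 or ord(ch) == 0x7F  # ASCII control characters
        if (ch in NOT_ALLOWED or is_control) and ch not in found:
            found.append(ch)
    return found


def fold_underscores(title: str) -> str:
    """Treat underscores as spaces, since page names do not distinguish them."""
    return title.replace("_", " ")


print(find_not_allowed("A|B#C"))                            # ['|', '#']
print(fold_underscores("A_B") == fold_underscores("A B"))   # True
```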

Some allowed characters returned by {{PAGENAME}} are HTML-style encoded:

This HTML/XML encoding is standard, even though the standard requires escaping the single and double quotes only in a few cases. The standard would also require encoding the less-than (<) and greater-than (>) signs, but these two characters are forbidden in MediaWiki page names because of the syntax of the MediaWiki code used to compose pages.

The same HTML encoding is also used with (see Help:Magic words#Page names for the definitions):

We will refer to the three characters ", &, ' as the "three special pagename characters".
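The HTML-style encoding of the three special pagename characters can be imitated outside the wiki with a small Python sketch. The function name is hypothetical, and the entity spellings are an assumption taken from the comparison table in this article; a given wiki may emit different (for example numeric) entities.

```python
# Assumption: entity spellings follow the comparison table in this article.
SPECIAL_ENTITIES = {'"': "&quot;", "&": "&amp;", "'": "&apos;"}


def pagename_html_encode(title: str) -> str:
    """Replace each of the three special pagename characters with an HTML entity."""
    # Working character by character avoids re-encoding the "&" of an entity
    # that was just produced.
    return "".join(SPECIAL_ENTITIES.get(ch, ch) for ch in title)


print(pagename_html_encode('Q&A "x" it\'s'))  # Q&amp;A &quot;x&quot; it&apos;s
```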

{{PAGENAMEE}} converts spaces to underscores and percent-encodes a set of characters:

The same encoding is used also with:

When preparing a pagename for embedding in the "searchpart" of a URL (see RFC 1738 and/or RFC 3986), it might have to be percent-encoded and have all its space characters converted to %20 or to a plus sign (+). We will call the result "searchpart-encoded".

This avoids the problematic coding of the three special pagename characters by encoding, for instance, ampersand (&) as %26, but the typical searchpart-encoding of space is the plus sign (or sometimes as %20).
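For readers who build such URLs in a script rather than in wikitext, the two space conventions can be seen with Python's standard `urllib.parse` module. This is only an analogue of searchpart-encoding, not MediaWiki's own implementation, and the helper name is hypothetical.

```python
from urllib.parse import quote, quote_plus


def searchpart_encode(title: str, plus_for_space: bool = True) -> str:
    """Percent-encode a title for a URL query, with spaces as '+' or '%20'."""
    if plus_for_space:
        return quote_plus(title)
    return quote(title, safe="")  # safe="" so that "/" is encoded too


print(searchpart_encode("Tom & Jerry's page"))
# Tom+%26+Jerry%27s+page
print(searchpart_encode("Tom & Jerry's page", plus_for_space=False))
# Tom%20%26%20Jerry%27s%20page
```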

If no MediaWiki string manipulation extensions are installed, {{PAGENAMEE}} might only be useful for constructing a URL back into one's own wiki, to other wikis, or to other sites whose pages use the same names with underscores. There is no standard here: the encoding presented in the table was defined by MediaWiki for its own local use. Do not assume that other sites perform the same conversion; most of them just use plain UTF-8 in their own local URLs to represent non-ASCII characters, and standard URL-encoding for the "unsafe" ASCII characters.

The {{urlencode:data|style}} function, which has used the "QUERY" style by default since MediaWiki 1.17, percent-encodes many more characters than PAGENAMEE.

It can convert any valid input string from its native UTF-8 encoding.
This function also converts the 9 characters that are forbidden in pagenames and listed at the top of this page.
It converts the "three special characters" differently than what is performed by {{PAGENAME}}, using %nn hexadecimal triplets, instead of HTML entities.
It preserves the distinction between space and underscore (a distinction lost only in MediaWiki pagenames).
The result conforms to the RFC 1738 URL encoding standard, using only letters, digits, "safe" characters, and the two characters % (followed by two hexadecimal digits) and + (to encode spaces).
This result is fully and easily reversible, but MediaWiki does not natively provide a urldecode function to do it.
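Outside the wiki, the round trip is straightforward. The sketch below uses Python's `quote_plus` and `unquote_plus` as stand-ins for a QUERY-style encoder and the missing decoder; they are not MediaWiki functions, and their output may differ from urlencode for some characters (recent Python versions leave `~` unencoded, for instance).

```python
from urllib.parse import quote_plus, unquote_plus

original = 'A_B C & "D"'

encoded = quote_plus(original)   # spaces become "+", underscores are kept
decoded = unquote_plus(encoded)  # "+" becomes a space again, %nn is decoded

print(encoded)              # A_B+C+%26+%22D%22
print(decoded == original)  # True
```

The underscore and the space stay distinct through the round trip, unlike in page names.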

It can also be used to allow the Wikisource editor to work with multilingual characters they are accustomed to rather than deal with the more opaque percent-encoded characters. When considering using urlencode to construct an external link URL, especially within a template, there are two design styles where that might be appropriate. Which one is appropriate is a matter of the trade-offs between generality and ease of use.

Note that there is no MediaWiki parser function that can successfully decode the HTML-encoding performed by PAGENAME. Likewise, there is no function to decode the special encoding performed by PAGENAMEE or found in URL paths to wiki pages. Parser functions like #ifeq or #switch work because they compare their inputs after only HTML-decoding them; they never URL-decode their parameters.
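When the encoded values are processed outside the wiki (in a bot or an export script, say), both encodings can be undone with the Python standard library. The function names below are hypothetical, and the PAGENAMEE decoder assumes the behaviour described in this article: underscores stand for spaces and other characters are percent-encoded UTF-8.

```python
import html
from urllib.parse import unquote


def decode_pagename(value: str) -> str:
    """Undo the HTML-style entity encoding seen in PAGENAME output."""
    return html.unescape(value)


def decode_pagenamee(value: str) -> str:
    """Undo a PAGENAMEE-style encoding: underscores to spaces, then %nn decoding."""
    return unquote(value.replace("_", " "))


print(decode_pagename("Q&amp;A &quot;x&quot; it&apos;s"))  # Q&A "x" it's
print(decode_pagenamee("Tom_%26_Jerry%27s_page"))          # Tom & Jerry's page
```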

Web browser URL and wiki web server HTTP interface

The URL you type or paste into your web browser's address bar is similar to, but not exactly the same as, the PAGENAMEE encoding.

Encodings compared

The following table shows the effect of the various supported encodings on the full set of printable ASCII characters (plus SPACE) and on the first two printable Unicode characters after ASCII. Tabs and other whitespace control characters are discussed more completely in the section below about whitespace, but the table shows some contextual "effects" involving possibly dropped spaces and some other characters.

Different character encodings (input characters first, then the output of each encoding):

Characters:
    09AZ- az . . ! " # $ % & ' ( ) * + , ... / .: ; < = > ? @ [ \ ] ^ _ ` { | } ~   ¡

{{PAGENAME:...}}:
    09AZ- Az . . ! &quot; $ % &amp; &apos; ( ) * + , ... / .: ; = ? @ \ ^ ` ~ ¡

{{PAGENAMEE:...}}:
    09AZ- Az ._. ! %22 $ %25 %26 %27 ( ) * %2B , ... / .: ; %3D %3F @ %5C %5E %60 ~ %C2%A1

{{urlencode:...|WIKI}}:
    09AZ- az ._. ! %22 %23 $ %25 %26 %27 ( ) * %2B , ... / .: ; %3C %3D %3E %3F @ %5B %5C %5D %5E _ %60 %7B %7C %7D ~ %C2%A0 %C2%A1

{{urlencode:...|PATH}}:
    09AZ- az .%20. %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C ... %2F .%3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ %C2%A0 %C2%A1

{{urlencode:...|QUERY}}:
    09AZ- az .+. %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C ... %2F .%3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D %7E %C2%A0 %C2%A1

{{anchorencode:...}}:
    09AZ- az ._. ! " # $ % & ' ( ) * + , ... / .: ; < = > ? @ [ \ ] ^ ` { | } ~ ¡

The behaviour of the {{anchorencode:...}} parser function depends on the $wgFragmentMode setting.

With the various encodings proposed in MediaWiki, it is notable that the only characters that are never transformed (or removed) are the 10 decimal digits, the hyphen-minus (-), and the uppercase Basic Latin letters (A-Z). An initial lowercase letter may be transformed by capitalisation on most wikis, except those that preserve the case distinction in page names.

Note also that namespace names and interwiki prefixes are not case-sensitive, and if they are recognized at the beginning of a title, they may be replaced by a completely unrelated term, possibly in another language and/or script! So be careful with everything that comes before a colon (:), as the behavior is specific to each wiki and its own local set of recognized namespace names (or synonyms) and interwiki prefixes. However, these local prefixes do not affect what urlencode and anchorencode return, which is independent of each wiki's local naming rules.

The two styles PATH and QUERY for urlencode are almost identical; their only difference is that:

Specific characters in page names

Capitalisation of page names

Lowercase letters (a-z) are preserved, except at the initial position where they may be converted to uppercase with PAGENAME and PAGENAMEE on wikis that have not disabled this capitalisation.

You can see an example of capitalisation in the table.

Colons in page names

The colon (:) is treated specially in page names when it is the first character in the trimmed[2] given name (where it will link to a description page instead of showing the content of that page when it is in one of the special namespaces like "File", "Image", or "Int"). But PAGENAME will drop this leading colon, along with spaces immediately after that colon:

Otherwise, if the non-empty text before the first colon matches a known local namespace, then this namespace and the colon will be dropped, along with spaces immediately after that colon (the dropped namespace will be trimmed[2] and returned by {{NAMESPACE:...}}):

Otherwise, if the non-empty text before the first colon matched a known interwiki prefix, then this prefix and that colon are dropped, along with spaces immediately after that colon, but an empty namespace will be returned:

Otherwise, the colons are kept, even if the text before the first colon could be a valid interwiki prefix (containing only letters without case distinction, or digits, or hyphen-minus characters and dashes, spaces or underscores; not restricted to ASCII only):

The same rules are applied by {{PAGENAMEE:...}} and {{NAMESPACEE:...}} before they encode their return value.

Colons (and their surrounding spaces as long as they are not leading or trailing spaces) are left intact by {{FULLPAGENAME:...}}.

All colons are left intact by URL-encoding. But most (not all) colons are preserved by anchor-encoding.

Full stops and slashes in page names

Note that page names are parsed from left to right into (possibly empty) segments (called "title parts") separated by slashes (/). In some cases the occurrence of segments containing only a single dot (or full stop .) or two dots (..) will cause the rest of the string to be transformed. See Help:Extension:ParserFunctions for details.

Otherwise these dots are left intact by {{urlencode:...}} and {{anchorencode:...}}, but slashes may be converted.

Also, the sequence of two successive slashes (//) may not be accepted in page names, depending on the configuration of the wiki. Usually this indicates that the name is a URL, when it is preceded by a valid URI[1] scheme (or by no URI scheme at all, in which case a default http: or https: URI scheme will be used, depending on the user's preference). A URI scheme should then contain a colon (:), but MediaWiki currently recognizes only URI schemes where the colon is final, in a restricted list; otherwise.

For example on this wiki,

"{{PAGENAME|//www.mediawiki.org/}}" gives "//www.mediawiki.org/"

On Wikimedia sites, such as MediaWiki.org, the double slashes are recognized as URIs, and most valid URIs are disallowed as page names (if a URI scheme is present, it could be recognized as a namespace if it has been configured as one; otherwise the page name will fall into the main namespace of the wiki):

So on this wiki on MediaWiki.org, the following code unexpectedly creates a direct link to the external URL, surrounded by verbatim single brackets:

[[//www.mediawiki.org/|www.mediawiki.org]] renders as [[1]]

URIs are not recognized by URL-encoding and anchor-encoding (this means that valid full URLs cannot be safely created with urlencode!).

Specific characters in anchors

Colons (:) in anchors

Anchor-encoding is a bit trickier: most colons are kept, except when they are in leading position (even though a section heading like this one could start with a colon).

So for the title of this section, you get

_Colons_(:)_in_anchors

Note that the colon is unexpectedly converted by inserting a newline before it, as if the parameter were the content of a wiki source page (causing an indented block to be rendered)! The result does not match the identifier that MediaWiki generated for this section heading.

Pipes (|) in anchors

A more critical bug/limitation is observed when the leading character is a pipe (|), because it is treated as a parameter separator of {{anchorencode:...}} (despite the fact that it takes only a single parameter with no extra option):

A common work-around uses the common utility template {{!}}, which prevents the verbatim pipe returned by this template from being interpreted as a parameter separator:

This works because the expansion of templates is delayed after the parser function name and its parameter(s) in {{anchorencode:...}} have first been parsed up to the colon, and then (needlessly) separated on pipes: the expansion of templates, which may be present within parameter names or values between the colon and the double closing brace, will occur only when these parameters will be queried by the parser function itself, but this will not change the number or order of these parameters.

The same work-around may be used if you need to pass any of the following:

Semicolons (;), asterisks (*), or sharp signs (#) in anchors

The same bug does not occur when the first non-blank character (or any further character) of a section heading is a semicolon (;), an asterisk (*), or even a sharp sign (#), so these characters are preserved along with the rest of the string:

Whitespaces in page names and anchors (section headings)
