String.prototype.toWellFormed()

String.prototype.toWellFormed() returns a new string in which every lone UTF-16 surrogate code unit has been replaced with the Unicode replacement character, U+FFFD (the � glyph). JavaScript strings are sequences of UTF-16 code units, and nothing in the language requires those units to form a well-formed sequence. A string built from user input, scraped text, or a broken decoder can therefore be ill-formed, and toWellFormed() is the built-in sanitizer for that case.

What is a well-formed UTF-16 string?

A well-formed UTF-16 string contains no lone surrogates. The surrogate range in Unicode runs from 0xD800 to 0xDFFF and is split into two halves:

  • High surrogates (0xD800–0xDBFF) are meant to be followed by a low surrogate so the pair can encode a code point above the Basic Multilingual Plane (most emoji live up there).
  • Low surrogates (0xDC00–0xDFFF) are meant to be preceded by a high surrogate.

If you slice a string in the middle of a pair, drop a code unit, or receive data from a buggy encoder, you can end up with a high surrogate that is not followed by a low one, or a low surrogate that is not preceded by a high one. That’s a lone surrogate, and it makes the string ill-formed. toWellFormed() walks the code units and swaps each lone surrogate for U+FFFD.
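The walk described above can be sketched by hand. This is an illustrative approximation for understanding the rule, not the engine's actual implementation, and the name replaceLoneSurrogates is hypothetical.

```javascript
// Hypothetical helper: a hand-rolled approximation of toWellFormed().
function replaceLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;

    if (isHigh) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Proper pair: copy both code units and skip the low half.
        out += str[i] + str[i + 1];
        i++;
      } else {
        out += "\uFFFD"; // high surrogate with no partner
      }
    } else if (isLow) {
      // A low surrogate reached here was not consumed by a pair above.
      out += "\uFFFD";
    } else {
      out += str[i];
    }
  }
  return out;
}

console.log(replaceLoneSurrogates("ab\uD800c") === "ab\uFFFDc"); // true
console.log(replaceLoneSurrogates("ab\uD83D\uDE04c") === "ab\uD83D\uDE04c"); // true
```

In real code, prefer the built-in method; the loop only makes the pairing rule explicit.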

Syntax and return value

str.toWellFormed()
  • Parameters: none.
  • Returns: a new string. Every lone surrogate in the receiver becomes U+FFFD (\uFFFD). If the receiver is already well-formed, the result has the same content. Strings are immutable, so the original is never changed.

The method does not throw on ill-formed input. That alone makes it a useful cleanup step: encodeURI and encodeURIComponent throw a URIError on the same input, but toWellFormed() just replaces the bad code units.

Examples

The canonical case from MDN: a list of strings, some well-formed and some not, run through the method:

const strings = [
  // Lone leading surrogate
  "ab\uD800",
  "ab\uD800c",
  // Lone trailing surrogate
  "\uDFFFab",
  "c\uDFFFab",
  // Well-formed
  "abc",
  "ab\uD83D\uDE04c",
];

for (const str of strings) {
  console.log(str.toWellFormed());
}
// "ab�"
// "ab�c"
// "�ab"
// "c�ab"
// "abc"
// "ab😄c"

The well-formed ab😄c keeps its content unchanged because the high and low surrogate are a proper pair. The ill-formed variants each get their stray surrogate replaced with �.

Fixing input for encodeURI

encodeURI throws URIError: URI malformed when its input contains a lone surrogate. toWellFormed() is the cheapest way to avoid that:

const illFormed = "https://example.com/search?q=\uD800";

try {
  encodeURI(illFormed);
} catch (e) {
  console.log(e); // URIError: URI malformed
}

console.log(encodeURI(illFormed.toWellFormed()));
// "https://example.com/search?q=%EF%BF%BD"

This is the strongest real-world argument for the method: anywhere you build a URL or filename from foreign or user-supplied text, sanitize first.

Pairing with TextEncoder

TextEncoder already replaces lone surrogates with U+FFFD when encoding to bytes. toWellFormed() gives you the same fix on the string itself, no Web API needed:

const s = "Hi \uD800 there";
const clean = s.toWellFormed(); // "Hi � there"

const bytes = new TextEncoder().encode(s);      // 12 bytes: EF BF BD replaces \uD800
const round = new TextDecoder().decode(bytes);  // "Hi � there"

You usually want toWellFormed() when the string itself needs to be clean (for display, comparison, or as a map key), not the bytes.

Short-circuit with isWellFormed()

The companion method isWellFormed() is a boolean check. Use it as a fast path so you only allocate a new string when the input actually has a problem:

function sanitize(input) {
  if (input.isWellFormed()) return input;
  return input.toWellFormed();
}

toWellFormed() vs. normalize()

String.prototype.normalize() applies a Unicode Normalization Form (NFC, NFD, NFKC, or NFKD) and can compose, decompose, or reorder combining characters. toWellFormed() only swaps lone surrogates and leaves everything else untouched. They solve different problems, so the two can be chained:

str.normalize("NFC").toWellFormed();

Browser support

toWellFormed() has been Baseline Newly Available since October 2023: Chrome 111+, Firefox 119+, Safari 16.4+, and Node.js 20+. For older runtimes, use the core-js polyfill or the es-shims string.prototype.towellformed package.
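Where pulling in a polyfill is not an option, a feature-detecting wrapper is one alternative. The sketch below is an assumption-laden illustration, not how core-js or es-shims implement it; both function names are hypothetical.

```javascript
// Hypothetical fallback: replaces lone surrogates with a Unicode-aware regex.
// With the `u` flag, a proper surrogate pair is read as one code point,
// so only unpaired surrogates match the D800 to DFFF range.
function replaceLoneSurrogatesRegex(str) {
  return str.replace(/[\uD800-\uDFFF]/gu, "\uFFFD");
}

// Hypothetical wrapper: prefer the native method when the runtime provides it.
function toWellFormedCompat(str) {
  return typeof String.prototype.toWellFormed === "function"
    ? str.toWellFormed()
    : replaceLoneSurrogatesRegex(str);
}

console.log(toWellFormedCompat("ab\uD800c") === "ab\uFFFDc"); // true
```

Unlike a polyfill, the wrapper leaves String.prototype alone, which avoids surprising other code on the page.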

See also