String.prototype.isWellFormed()

What isWellFormed does

String.prototype.isWellFormed() answers one question: does this string contain any lone UTF-16 surrogates? It returns true if the string is well-formed and false otherwise, and it never throws because of the string's contents.

const strings = [
  "ab\uD800",            // lone high surrogate at the end
  "ab\uD800c",           // lone high surrogate in the middle
  "\uDFFFab",            // lone low surrogate at the start
  "c\uDFFFab",           // lone low surrogate in the middle
  "abc",                 // plain ASCII
  "ab\uD83D\uDE04c",     // surrogate pair (U+D83D U+DE04) for U+1F604
];

for (const s of strings) {
  console.log(s.isWellFormed());
}
// false
// false
// false
// false
// true
// true

That last case matters. Strings that look “complicated” because they contain emoji or other non-BMP characters are still well-formed, because code points above the Basic Multilingual Plane (U+1F604 and similar) are encoded as a proper surrogate pair (U+D83D U+DE04), not as lone halves.

Syntax

isWellFormed()
Parameters: none. The method takes no arguments.

Returns: boolean. true if the string contains no lone surrogates, false otherwise.

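Throws: TypeError if this is null or undefined, or if it cannot be converted to a string (a Symbol, for example). It never throws because of lone surrogates; the whole point of the API is to let you branch on validity without a try/catch.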

Why well-formedness matters

JavaScript strings are sequences of 16-bit UTF-16 code units. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) fit in a single code unit. Code points above that, which include most emoji and the rarer CJK ideographs, are encoded as a surrogate pair: a high surrogate in the range U+D800–U+DBFF followed by a low surrogate in the range U+DC00–U+DFFF.

A string is well-formed when every high surrogate is immediately followed by a low surrogate, and every low surrogate is immediately preceded by a high surrogate. Anything else is ill-formed, also called a “lone surrogate” string.
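
The pairing rule can be sketched as a loop over code units. This is only an illustration of the definition, not a description of how any engine implements the method; isWellFormedManual is a hypothetical name introduced here.

```javascript
// Hypothetical helper: a hand-written version of the pairing rule.
function isWellFormedManual(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the low half of the valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedManual("ab\uD83D\uDE04c")); // true
console.log(isWellFormedManual("ab\uD800c"));       // false
console.log(isWellFormedManual("\uDFFFab"));        // false
```

Because it relies only on charCodeAt, a loop like this also runs on engines that predate isWellFormed().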

Lone surrogates show up in real code more often than you might expect:

  • String.fromCharCode(0xD800) produces a lone high surrogate. So does String.fromCodePoint(0xD800), because surrogate values are accepted as code points.
  • Slicing a surrogate pair in half: "\uD83D\uDE04".slice(0, 1) gives you "\uD83D".
  • Parsing JSON that contains an escaped lone surrogate: JSON.parse('"\\uD800"') returns a lone high surrogate. (Decoding invalid UTF-8 with TextDecoder is not a source: in non-fatal mode it substitutes U+FFFD for byte sequences that don’t form a valid code point.)
  • Older APIs or binary protocols that hand you raw 16-bit code units without validating that surrogates are paired.

A lone surrogate isn’t a crash on its own, but it is a problem for anything that expects valid UTF-16: encodeURI throws URIError, TextEncoder silently replaces it with U+FFFD, JSON.stringify emits it as a \uXXXX escape rather than as a raw character (standardized in ES2019), and rendering can be inconsistent.

Guarding encodeURI

encodeURI and encodeURIComponent throw a URIError when handed a lone surrogate (the message text, such as "URI malformed", varies by engine). isWellFormed() gives you a clean way to check first:

const url = "https://example.com/search?q=\uD800";

if (url.isWellFormed()) {
  console.log(encodeURI(url));
} else {
  console.warn("Refusing to encode a string with lone surrogates.");
}
// Refusing to encode a string with lone surrogates.

Without isWellFormed, the same code would need a try/catch around the encodeURI call, which is the exact pattern the new method is meant to replace.
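
For comparison, the older try/catch version of the same guard might look like the sketch below. encodeOrNull is a hypothetical helper name, not part of any API.

```javascript
// Hypothetical helper: returns the encoded URI, or null if encoding fails.
function encodeOrNull(input) {
  try {
    return encodeURI(input);
  } catch (err) {
    if (err instanceof URIError) return null; // lone surrogate in the input
    throw err; // anything else is unexpected, so rethrow
  }
}

console.log(encodeOrNull("https://example.com/search?q=\uD800")); // null
console.log(encodeOrNull("https://example.com/search?q=a b"));
// https://example.com/search?q=a%20b
```

The isWellFormed() version states the precondition up front instead of inferring it from an exception type.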

Coercion

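Like other String.prototype methods, isWellFormed coerces this to a string first, so you can call it on most non-string values. The exceptions are null and undefined, which throw a TypeError instead of being coerced:

String.prototype.isWellFormed.call(123);       // true  (123 -> "123", all ASCII)
String.prototype.isWellFormed.call(true);      // true  (true -> "true", all ASCII)
String.prototype.isWellFormed.call("\uD800");  // false (lone high surrogate)
String.prototype.isWellFormed.call(null);      // TypeError (null and undefined are rejected)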

In practice you almost always call it as a method, but the coercion behavior is useful to know when you want to validate a value without first checking that it’s already a string.

Gotchas

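  • It scans the whole string. There is no “is this one character well-formed” form. If you need to find where a lone surrogate lives, walk the string with String.prototype.codePointAt(), advancing two code units whenever the returned code point is above U+FFFF, and record the indices where the returned code point is itself a surrogate (U+D800–U+DFFF). Comparing the string’s .length to Array.from(str).length does not help: the difference counts valid surrogate pairs, not lone surrogates.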
  • Companion method: toWellFormed(). Also added in ES2024, this returns a new string with every lone surrogate replaced by U+FFFD (the Unicode replacement character ). Use isWellFormed() to reject, toWellFormed() to sanitize.
  • Not the same as normalize(). String.prototype.normalize() handles Unicode normalization forms (NFC, NFD, NFKC, NFKD). It does not check, fix, or even care about lone surrogates.
  • A regex is slower and easier to get wrong. The native method can scan the engine’s internal string representation directly, typically in a single pass. A regex like /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/ checks the same condition, but it is typically slower on long strings, depends on lookbehind support, and is notoriously fiddly to write correctly the first time.
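
The codePointAt() walk from the first gotcha can be sketched as follows. findLoneSurrogates is a hypothetical helper name, and the code is a minimal illustration rather than a tuned implementation.

```javascript
// Hypothetical helper: returns the index of every lone surrogate in str.
function findLoneSurrogates(str) {
  const indices = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    if (cp >= 0xd800 && cp <= 0xdfff) {
      // Read from the start of a code point, codePointAt only returns
      // a surrogate value when that code unit is unpaired.
      indices.push(i);
    }
    // Step over both halves of a valid pair so the low half is not misread.
    i += cp > 0xffff ? 2 : 1;
  }
  return indices;
}

console.log(findLoneSurrogates("ab\uD800c\uD83D\uDE04\uDFFF")); // [2, 6]
console.log(findLoneSurrogates("abc"));                         // []
```

If the goal is to repair the string rather than report positions, toWellFormed() replaces each of those code units with U+FFFD.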

Browser and runtime support

Available in all current engines:

  • Chrome and Edge 111+ (March 2023)
  • Firefox 119+ (October 2023)
  • Safari 16.4+ (March 2023)
  • Node.js 20.0+ (April 2023)

For older runtimes, core-js ships a well-formed-unicode-strings module, and the es-shims/string.prototype.iswellformed package on npm provides a spec-compliant polyfill.

See also

  • String.prototype.normalize(): the closest “Unicode hygiene” sibling on String.prototype. It handles normalization forms, which is a different concern from encoding validity.
  • String.prototype.codePointAt(): reads a full code point starting at an index. Use it when you need to locate a lone surrogate inside a string.
  • encodeURI(): the function that throws on ill-formed strings. The isWellFormed guard pattern in this article’s example pairs naturally with it.