Buffer.toString('utf8') appears to use wtf-8 #23280
Closed
Labels: buffer, doc, encoding, help wanted
The byte sequence 237, 166, 164 is not valid utf8, since it encodes a surrogate code point, which is not a valid unicode scalar value. So Buffer.from([237, 166, 164]).toString('utf8') should error. But instead, it returns a string, effectively implementing wtf-8 rather than utf-8.

Or does Buffer.toString simply not provide any validity guarantees at all, returning garbage strings if the buffer contains invalid input? In that case, please document this as expected behavior, since it makes the function completely useless for a bunch of use cases.

node -v: v10.11.0
uname -a: Linux aljoscha-laptop 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22 UTC 2018 x86_64 GNU/Linux

See also rust-lang/rust#54845
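
For the use cases that need a hard failure on invalid input, one possible workaround is the WHATWG TextDecoder in fatal mode, which throws rather than returning a string. This is only a minimal sketch, assuming the fatal option is available in the Node.js build in use; decodeUtf8Strict is a hypothetical helper name, not part of Node's API.

```javascript
// TextDecoder is exported from 'util' (and is also a global in newer Node.js versions).
const { TextDecoder } = require('util');

// Hypothetical helper: decode bytes as UTF-8 and throw a TypeError on any
// invalid sequence instead of returning a string.
function decodeUtf8Strict(bytes) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  return decoder.decode(bytes);
}

// 237, 166, 164 is ED A6 A4, the three-byte form of the surrogate U+D9A4.
const surrogateBytes = Buffer.from([237, 166, 164]);

try {
  decodeUtf8Strict(surrogateBytes);
  console.log('decoded without error');
} catch (err) {
  console.log('rejected as invalid UTF-8:', err.name);
}

// Well-formed input still round-trips as usual.
console.log(decodeUtf8Strict(Buffer.from('ok', 'utf8')));
```

The sketch builds a new decoder per call to keep the helper stateless; a shared decoder would also work for non-streaming use.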
edit: This also leaks into JSON.parse, which can accept garbage strings even though ECMA-404 (the JSON standard prescribed for JSON.parse as defined in ECMAScript) only allows valid utf8 input.
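
To illustrate the JSON.parse point, the sketch below checks a parsed string for unpaired surrogates, i.e. code units that cannot be encoded as valid UTF-8. hasLoneSurrogate is a hypothetical helper written for this example, not an existing API.

```javascript
// Hypothetical helper: returns true if the string contains a surrogate code
// unit that is not part of a well-formed high/low surrogate pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// JSON.parse accepts an escaped lone surrogate and returns it unchanged.
const parsed = JSON.parse('"\\ud9a4"');
console.log(hasLoneSurrogate(parsed));
console.log(hasLoneSurrogate('plain ascii'));
```

A caller that needs strictly encodable strings could run such a check on every string value after parsing and reject the document if it fails.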