Background
RFC 3986 defines a host as follows:

    host = IP-literal / IPv4address / reg-name

where:

    IP-literal  = "[" ( IPv6address / IPvFuture ) "]"
    reg-name    = *( unreserved / pct-encoded / sub-delims )
    IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
The WHATWG URL Standard says that "A valid host string must be a valid domain string, a valid IPv4-address string, or: U+005B ([), followed by a valid IPv6-address string, followed by U+005D (])."
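As a rough illustration of the bracket rule in the RFC 3986 grammar, the sketch below checks whether a string may appear between "[" and "]". The function name `is_valid_ip_literal_content` and the `_IPVFUTURE` pattern are hypothetical names introduced here, and the standard `ipaddress` module is used as a stand-in for the `IPv6address` production, so it may accept a few forms the ABNF does not (for example zone identifiers on recent Python versions).

```python
import ipaddress
import re

# IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
# Hypothetical helper pattern, transcribed from the RFC 3986 ABNF.
_IPVFUTURE = re.compile(r"[vV][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+")


def is_valid_ip_literal_content(content: str) -> bool:
    """Return True if `content` may appear between "[" and "]" in a host."""
    if _IPVFUTURE.fullmatch(content):
        return True
    try:
        # Approximation: ipaddress is used in place of the IPv6address ABNF.
        ipaddress.IPv6Address(content)
    except ValueError:
        return False
    return True


for candidate in ("::1", "2001:db8::1", "v1.fe80", "127.0.0.1", "regname"):
    print(candidate, is_valid_ip_literal_content(candidate))
```

Under these assumptions, only the first three candidates are accepted; an IPv4 address or a registered name inside brackets is rejected.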
The Bug
The following code from Lib/urllib/parse.py:196-208 retrieves the hostname from the netloc:
    @property
    def _hostinfo(self):
        netloc = self.netloc
        _, _, hostinfo = netloc.rpartition('@')
        _, have_open_br, bracketed = hostinfo.partition('[')
        if have_open_br:
            hostname, _, port = bracketed.partition(']')
            _, _, port = port.partition(':')
        else:
            hostname, _, port = hostinfo.partition(':')
        if not port:
            port = None
        return hostname, port
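This code treats whatever appears between "[" and "]" as the hostname, so it also returns IPv4 addresses and registered names (reg-name) that are wrapped in brackets. Both specifications allow only IPv6 addresses (and, in RFC 3986, IPvFuture) inside brackets, so this behavior violates both of them.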
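To make the behavior easier to follow without depending on a particular CPython version, here is a standalone copy of the partition logic quoted above, applied to a plain netloc string. The function name `hostinfo_from_netloc` is hypothetical; the body mirrors the quoted property with `self.netloc` replaced by a parameter.

```python
def hostinfo_from_netloc(netloc: str):
    """Standalone copy of the quoted partition logic (hypothetical name)."""
    # Drop any userinfo before the last "@".
    _, _, hostinfo = netloc.rpartition('@')
    # Anything after the first "[" is treated as a bracketed host.
    _, have_open_br, bracketed = hostinfo.partition('[')
    if have_open_br:
        # The text up to "]" becomes the hostname; its form is never checked.
        hostname, _, port = bracketed.partition(']')
        _, _, port = port.partition(':')
    else:
        hostname, _, port = hostinfo.partition(':')
    if not port:
        port = None
    return hostname, port


print(hostinfo_from_netloc('user@[regname]'))    # ('regname', None)
print(hostinfo_from_netloc('[127.0.0.1]:8080'))  # ('127.0.0.1', '8080')
print(hostinfo_from_netloc('[::1]:8080'))        # ('::1', '8080')
```

The first two calls show a registered name and an IPv4 address being extracted from inside brackets exactly as a valid IPv6 literal would be.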
Minimal reproducible example:

    from urllib.parse import urlsplit

    parsed_url = urlsplit('scheme://user@[regname]/Path')
    print(parsed_url.hostname)  # Prints 'regname'
Your environment
- CPython versions tested on:
- Operating system and architecture:
Linked PRs