Things to watch out for when parsing URLs in Go

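A user submitted a URL that looked perfectly normal, yet the API kept responding that the parameter did not meet the specification, so they asked us why.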

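At first glance the code looked fine: it called url.Parse from the standard library and accepted the input when parsing succeeded and the host was not empty.
Only the logs revealed the cause: the user had passed an address without a protocol, so the URL was misparsed and the correct Host could not be extracted.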
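
The original validation code is not shown here, so the following is only a minimal sketch of what such a check might look like. The function name isValidURL is hypothetical, and the sketch assumes the condition was "no parse error and a non-empty Host".

```go
package main

import (
	"fmt"
	"net/url"
)

// isValidURL is a hypothetical reconstruction of the check described above:
// the input is accepted when url.Parse succeeds and the Host is not empty.
func isValidURL(raw string) bool {
	u, err := url.Parse(raw)
	return err == nil && u.Host != ""
}

func main() {
	// With a scheme, the host is extracted and the check passes.
	fmt.Println(isValidURL("http://a.cn:88/b/c"))
	// Without a scheme, "a.cn" is taken as the scheme and Host stays empty,
	// so the check fails even though url.Parse reports no error.
	fmt.Println(isValidURL("a.cn:88/b/c"))
}
```

The second call is the case the user hit: no error is returned, but the Host is empty.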

Cause

A short program illustrates the problem

package main

import (
	"fmt"
	"net/url"
)

func main() {
	s1 := "a.cn:88/b/c?d=1&e=2#f" // address without a scheme
	s2 := "http://" + s1          // the same address with a scheme

	u, err := url.Parse(s1)
	if err != nil {
		fmt.Println("s1 parse err", err)
		return // u is nil when err is non-nil
	}
	fmt.Println("scheme:", u.Scheme, "host:", u.Host, "hostname:", u.Hostname(), "port:", u.Port(), "path:", u.Path, "rawpath:", u.RawPath, "fragment", u.Fragment)
	// scheme: a.cn host: hostname: port: path: rawpath: fragment f

	u, err = url.Parse(s2)
	if err != nil {
		fmt.Println("s2 parse err", err)
		return
	}
	fmt.Println("scheme:", u.Scheme, "host:", u.Host, "hostname:", u.Hostname(), "port:", u.Port(), "path:", u.Path, "rawpath:", u.RawPath, "fragment", u.Fragment)
	// scheme: http host: a.cn:88 hostname: a.cn port: 88 path: /b/c rawpath: fragment f

	u, err = url.ParseRequestURI(s1)
	if err != nil {
		fmt.Println("s1 parserequesturi err", err)
		return
	}
	fmt.Println("scheme:", u.Scheme, "host:", u.Host, "hostname:", u.Hostname(), "port:", u.Port(), "path:", u.Path, "rawpath:", u.RawPath, "fragment", u.Fragment)
	// scheme: a.cn host: hostname: port: path: rawpath: fragment

	u, err = url.ParseRequestURI(s2)
	if err != nil {
		fmt.Println("s2 parserequesturi err", err)
		return
	}
	fmt.Println("scheme:", u.Scheme, "host:", u.Host, "hostname:", u.Hostname(), "port:", u.Port(), "path:", u.Path, "rawpath:", u.RawPath, "fragment", u.Fragment)
	// scheme: http host: a.cn:88 hostname: a.cn port: 88 path: /b/c rawpath: fragment
}

The problem comes down to the two functions Parse and ParseRequestURI behaving differently (go 1.17.2).
Signatures and doc comments of the two functions

// Parse parses a raw url into a URL structure.
//
// The url may be relative (a path, without a host) or absolute
// (starting with a scheme). Trying to parse a hostname and path
// without a scheme is invalid but may not necessarily return an
// error, due to parsing ambiguities.
func Parse(rawURL string) (*URL, error)

// ParseRequestURI parses a raw url into a URL structure. It assumes that
// url was received in an HTTP request, so the url is interpreted
// only as an absolute URI or an absolute path.
// The string url is assumed not to have a #fragment suffix.
// (Web browsers strip #fragment before sending the URL to a web server.)
func ParseRequestURI(rawURL string) (*URL, error)

The function that extracts the scheme from the URL string

// Maybe rawURL is of the form scheme:path.
// (Scheme must be [a-zA-Z][a-zA-Z0-9+-.]*)
// If so, return scheme, path; else return "", rawURL.
func getScheme(rawURL string) (scheme, path string, err error) {
	for i := 0; i < len(rawURL); i++ {
		c := rawURL[i]
		switch {
		case 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z':
			// do nothing
		case '0' <= c && c <= '9' || c == '+' || c == '-' || c == '.':
			if i == 0 {
				return "", rawURL, nil
			}
		case c == ':':
			if i == 0 {
				return "", "", errors.New("missing protocol scheme")
			}
			return rawURL[:i], rawURL[i+1:], nil
		default:
			// we have encountered an invalid character,
			// so there is no valid scheme
			return "", rawURL, nil
		}
	}
	return "", rawURL, nil
}
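
Because getScheme is unexported, its effect can only be observed through url.Parse. The following sketch feeds a few inputs through Parse and prints the detected scheme. The helper name schemeOf and the sample inputs other than s1 and s2 are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// schemeOf returns the scheme that url.Parse detects in raw,
// or the error returned by url.Parse.
func schemeOf(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.Scheme, nil
}

func main() {
	inputs := []string{
		"http://a.cn:88/b/c", // explicit scheme
		"a.cn:88/b/c",        // everything before the first ':' is taken as the scheme
		"a.cn/b/c",           // '/' appears before any ':', so no scheme is detected
		"127.0.0.1:88/b/c",   // leading digit means no scheme; Parse then rejects the ':' in the first segment
	}
	for _, in := range inputs {
		scheme, err := schemeOf(in)
		fmt.Printf("%-20s scheme=%q err=%v\n", in, scheme, err)
	}
}
```

Note how the outcome depends on the first character and on whether a ':' appears before the first '/': the same kind of scheme-less input can yield a bogus scheme, a relative path, or an error.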

The description in the Go documentation

A URL represents a parsed URL (technically, a URI reference).

The general form represented is:

	[scheme:][//[userinfo@]host][/]path[?query][#fragment]

URLs that do not start with a slash after the scheme are interpreted as:

	scheme:opaque[?query][#fragment]

Note that the Path field is stored in decoded form: /%47%6f%2f becomes /Go/. A consequence is that it is impossible to tell which slashes in the Path were slashes in the raw URL and which were %2f. This distinction is rarely important, but when it is, the code should use RawPath, an optional field which only gets set if the default encoding is different from Path.
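
The two forms described in the documentation can be observed directly on the parsed URL. The sketch below is only an illustration; the helper name fieldsOf is hypothetical, and the third input is taken from the documentation's own /%47%6f%2f example with an assumed host.

```go
package main

import (
	"fmt"
	"net/url"
)

// fieldsOf parses raw with url.Parse and returns the Opaque, Path and
// RawPath fields of the result.
func fieldsOf(raw string) (opaque, path, rawPath string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	return u.Opaque, u.Path, u.RawPath, nil
}

func main() {
	for _, in := range []string{
		"a.cn:88/b/c?d=1&e=2#f", // scheme:opaque form, the rest lands in Opaque
		"http://a.cn:88/b/c",    // general form, the rest lands in Host and Path
		"http://a.cn/%47%6f%2f", // Path is decoded, RawPath keeps the original encoding
	} {
		opaque, path, rawPath, err := fieldsOf(in)
		fmt.Printf("%-24s opaque=%q path=%q rawpath=%q err=%v\n", in, opaque, path, rawPath, err)
	}
}
```

For the scheme-less input, the part after "a.cn:" ends up in Opaque rather than in Host or Path, which is why those fields print as empty in the program above.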

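  • According to the doc comments, Parse accepts both relative and absolute URLs, whereas ParseRequestURI accepts only an absolute URI or an absolute path
    • s1 is neither, which is why ParseRequestURI returned incorrect information for it
  • The comment on Parse already states that parsing a hostname and path without a scheme is invalid, yet doing so does not necessarily produce an error
    • getScheme splits on the first ':' character: everything before it is treated as the scheme and everything after it as the rest of the URL (see the pattern in the comment on getScheme above)
    • That is why parsing s1 yields the scheme a.cn
  • ParseRequestURI assumes the URL was taken from an HTTP request (the main difference), so it does not parse a fragment, because browsers do not send one
  • Neither function returns an error for a scheme-less host:port address such as s1 (Parse also accepts relative paths without an error), so the caller has to check the result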
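
The differences listed above can be checked with a small experiment. This is a sketch only; the helper names viaParse and viaParseRequestURI are hypothetical, and "a.cn/b/c" is an extra input chosen to show the relative-path case.

```go
package main

import (
	"fmt"
	"net/url"
)

// viaParse returns RawQuery and Fragment as produced by url.Parse.
func viaParse(raw string) (rawQuery, fragment string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.RawQuery, u.Fragment, nil
}

// viaParseRequestURI returns RawQuery and Fragment as produced by
// url.ParseRequestURI.
func viaParseRequestURI(raw string) (rawQuery, fragment string, err error) {
	u, err := url.ParseRequestURI(raw)
	if err != nil {
		return "", "", err
	}
	return u.RawQuery, u.Fragment, nil
}

func main() {
	s2 := "http://a.cn:88/b/c?d=1&e=2#f"

	// Parse splits off the fragment before parsing the rest.
	q, f, err := viaParse(s2)
	fmt.Printf("Parse:           query=%q fragment=%q err=%v\n", q, f, err)

	// ParseRequestURI does not look for a fragment, so "#f" stays in the query.
	q, f, err = viaParseRequestURI(s2)
	fmt.Printf("ParseRequestURI: query=%q fragment=%q err=%v\n", q, f, err)

	// A relative path is accepted by Parse but rejected by ParseRequestURI.
	_, _, err = viaParse("a.cn/b/c")
	fmt.Println("Parse relative:          ", err)
	_, _, err = viaParseRequestURI("a.cn/b/c")
	fmt.Println("ParseRequestURI relative:", err)
}
```

So the empty fragment printed for ParseRequestURI does not mean "#f" was discarded: it remains part of the preceding component, here the raw query.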

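Solutions

  1. Pick the function that fits the use case; use Parse when relative paths must be parsed
  2. Wrap the parsing in your own helper so that scheme-less input is handled
  3. Read the documentation carefully before using an API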
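
As an illustration of option 2, here is one possible wrapper. It is a sketch under stated assumptions, not the code used in the original service: the names hasSchemePrefix and parseWithDefaultScheme are hypothetical, the input is assumed to be a web address (so schemes without "//" such as mailto: are out of scope), and an empty Host is treated as an error.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// hasSchemePrefix reports whether raw starts with "<scheme>://", that is,
// whether the first "://" appears before any '/', '?' or '#'.
func hasSchemePrefix(raw string) bool {
	i := strings.Index(raw, "://")
	return i > 0 && !strings.ContainsAny(raw[:i], "/?#")
}

// parseWithDefaultScheme (hypothetical helper) prepends defaultScheme when
// raw has no scheme, then parses it and requires a non-empty Host.
func parseWithDefaultScheme(raw, defaultScheme string) (*url.URL, error) {
	raw = strings.TrimSpace(raw)
	switch {
	case hasSchemePrefix(raw):
		// Already has a scheme; parse as is.
	case strings.HasPrefix(raw, "//"):
		// Scheme-relative form such as "//a.cn/b".
		raw = defaultScheme + ":" + raw
	default:
		raw = defaultScheme + "://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Host == "" {
		return nil, errors.New("url has no host")
	}
	return u, nil
}

func main() {
	for _, in := range []string{"a.cn:88/b/c?d=1&e=2#f", "https://a.cn/x", "//a.cn/b", ""} {
		u, err := parseWithDefaultScheme(in, "http")
		if err != nil {
			fmt.Printf("%q: error: %v\n", in, err)
			continue
		}
		fmt.Printf("%q: scheme=%s host=%s path=%s\n", in, u.Scheme, u.Host, u.Path)
	}
}
```

Whether silently assuming a default scheme is acceptable depends on the API; rejecting scheme-less input with a clear error message is an equally reasonable design.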

Reference links