Java URL類踩坑指南

時間 2019-12-04

標籤 java url 指南欄目 Java 简体版

原文原文鏈接

背景介紹

最近再作一個RSS閱讀工具給本身用，其中一個環節是從服務器端獲取一個包含了RSS源列表的json文件，再根據這個json文件下載、解析RSS內容。核心代碼以下：html

class PresenterImpl(val context: Context, val activity: MainActivity) : IPresenter {
    private val URL_API = "https://vimerzhao.github.io/others/rssreader/RSS.json"

    override fun getRssResource(): RssSource {
        val gson = GsonBuilder().create()
        return gson.fromJson(getFromNet(URL_API), RssSource::class.java)
    }

    private fun getFromNet(url: String): String {
        val result = URL(url).readText()
        return result
    }

    ......
}

以前一直執行地很好，直到前兩天我購買了一個vimerzhao.top的域名，並將原來的域名vimerzhao.github.io重定向到了vimerzhao.top。這個工具就沒法使用了，但在瀏覽器輸入URL_API卻能獲得數據：
java

那爲何URL.readText()沒有拿到數據呢？python

不支持重定向

能夠經過下面代碼測試：linux

import java.net.*;
import java.io.*;

public class TestRedirect {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/others/rssreader/RSS.json");
            URL url2 = new URL("http://vimerzhao.top/others/rssreader/RSS.json");
            read(url1);
            System.out.println("=--------------------------------=");
            read(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

獲得結果以下：android

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>
=--------------------------------=
{"theme":"tech","author":"zhaoyu","email":"dutzhaoyu@gmail.com","version":"0.01","contents":[{"category":"綜合版塊","websites":[{"tag":"門戶網站","url":["http://geek.csdn.net/admin/news_service/rss","http://blog.jobbole.com/feed/","http://feed.cnblogs.com/blog/sitehome/rss","https://segmentfault.com/feeds","http://www.codeceo.com/article/category/pick/feed"]},{"tag":"知名社區","url":["https://stackoverflow.com/feeds","https://www.v2ex.com/index.xml"]},{"tag":"官方博客","url":["https://www.blog.google/rss/","https://blog.jetbrains.com/feed/"]},{"tag":"我的博客-行業","url":["http://feed.williamlong.info/","https://www.liaoxuefeng.com/feed/articles"]},{"tag":"我的博客-學術","url":["http://www.norvig.com/rss-feed.xml"]}]},{"category":"編程語言","websites":[{"tag":"Kotlin","url":["https://kotliner.cn/api/rss/latest"]},{"tag":"Python","url":["https://www.python.org/dev/peps/peps.rss/"]},{"tag":"Java","url":["http://www.codeceo.com/article/category/develop/java/feed"]}]},{"category":"行業動態","websites":[{"tag":"Android","url":["http://www.codeceo.com/article/category/develop/android/feed"]}]},{"category":"亂七八遭","websites":[{"tag":"Linux-綜合","url":["https://linux.cn/rss.xml","http://www.linuxidc.com/rssFeed.aspx","http://www.codeceo.com/article/tag/linux/feed"]},{"tag":"Linux-發行版","url":["https://blog.linuxmint.com/?feed=rss2","https://manjaro.github.io/feed.xml"]}]}]}

HTTP返回碼301，即發生了重定向。可在瀏覽器上這個過程太快以致於咱們看不到這個301界面的出現。這裏須要說明的是URL.readText()是Kotlin中一個擴展函數，本質仍是調用了URL類的openStream方法，部分源碼以下：nginx

.....
/**
 * Reads the entire content of this URL as a String using UTF-8 or the specified [charset].
 *
 * This method is not recommended on huge files.
 *
 * @param charset a character set to use.
 * @return a string with this URL entire content.
 */
@kotlin.internal.InlineOnly
public inline fun URL.readText(charset: Charset = Charsets.UTF_8): String = readBytes().toString(charset)

/**
 * Reads the entire content of the URL as byte array.
 *
 * This method is not recommended on huge files.
 *
 * @return a byte array with this URL entire content.
 */
public fun URL.readBytes(): ByteArray = openStream().use { it.readBytes() }

因此上面的測試代碼即說明了URL.readText()失敗的緣由。
不過URL不支持重定向是否合理？爲何不支持？還有待探究。git

不穩定的`equals`方法

首先看下equals的說明(URL (Java Platform SE 7 ))：github

Compares this URL for equality with another object.
If the given object is not a URL then this method immediately returns false.
Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
Since hosts comparison requires name resolution, this operation is a blocking operation.
Note: The defined behavior for equals is known to be inconsistent with virtual hosting in HTTP.web

接下來再看一段代碼：編程

import java.net.*;
public class TestEquals {
    public static void main(String args[]) {
        try {
            // vimerzhao的博客主頁
            URL url1 = new URL("https://vimerzhao.github.io/");
            // zhanglanqing的博客主頁
            URL url2 = new URL("https://zhanglanqing.github.io/");
            // vimerzhao博客主頁重定向後的域名
            URL url3 = new URL("http://vimerzhao.top/");
            System.out.println(url1.equals(url2));
            System.out.println(url1.equals(url3));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

根據定義輸出結果是什麼呢？運行以後是這樣：

true
false

你可能猜對了，但若是我把電腦斷網以後再次執行，結果倒是：

false
false

但其實3個域名的IP地址都是相同的，能夠ping一下：

zhaoyu@Inspiron ~/Project $ ping vimezhao.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 396.692/396.692/396.692/0.000 ms
zhaoyu@Inspiron ~/Project $ ping zhanglanqing.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
rtt min/avg/max/mdev = 396.009/396.009/396.009/0.000 ms
zhaoyu@Inspiron ~/Project $ ping vimezhao.top
ping: unknown host vimezhao.top
zhaoyu@Inspiron ~/Project $ ping vimerzhao.top
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=409 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 409.978/409.978/409.978/0.000 ms

首先看一下有網絡鏈接的狀況，vimerzhao.github.io和zhanglanqing.github.io是我和我同窗的博客，雖然內容不同可是指向相同的IP，協議、端口等都相同，因此相等了；而vimerzhao.github.io雖然和vimerzhao.top指向同一個博客，可是一個是https一個是http，協議不一樣，因此判斷爲不相等。相信這和大多數人的直覺是相背的：指向不一樣博客的URL相等了，但指向相同博客的URL卻不相等！
再分析斷網以後的結果：首先查看URL的源碼：

public boolean equals(Object obj) {
        if (!(obj instanceof URL))
            return false;
        URL u2 = (URL)obj;

        return handler.equals(this, u2);
    }

再看handler對象的源碼：

protected boolean equals(URL u1, URL u2) {
        String ref1 = u1.getRef();
        String ref2 = u2.getRef();
        return (ref1 == ref2 || (ref1 != null && ref1.equals(ref2))) &&
               sameFile(u1, u2);
    }

sameFile源碼：

protected boolean sameFile(URL u1, URL u2) {
        // Compare the protocols.
        if (!((u1.getProtocol() == u2.getProtocol()) ||
              (u1.getProtocol() != null &&
               u1.getProtocol().equalsIgnoreCase(u2.getProtocol()))))
            return false;

        // Compare the files.
        if (!(u1.getFile() == u2.getFile() ||
              (u1.getFile() != null && u1.getFile().equals(u2.getFile()))))
            return false;

        // Compare the ports.
        int port1, port2;
        port1 = (u1.getPort() != -1) ? u1.getPort() : u1.handler.getDefaultPort();
        port2 = (u2.getPort() != -1) ? u2.getPort() : u2.handler.getDefaultPort();
        if (port1 != port2)
            return false;

        // Compare the hosts.
        if (!hostsEqual(u1, u2))
            return false;// 無網絡鏈接時會觸發這一句

        return true;
    }

最後是hostsEqual的源碼：

protected boolean hostsEqual(URL u1, URL u2) {
        InetAddress a1 = getHostAddress(u1);
        InetAddress a2 = getHostAddress(u2);
        // if we have internet address for both, compare them
        if (a1 != null && a2 != null) {
            return a1.equals(a2);
        // else, if both have host names, compare them
        } else if (u1.getHost() != null && u2.getHost() != null)
            return u1.getHost().equalsIgnoreCase(u2.getHost());
         else
            return u1.getHost() == null && u2.getHost() == null;
    }

在有網絡的狀況下，a1和a2都不是null因此會觸發return a1.equals(a2)，返回true；而沒有網絡時則會觸發return u1.getHost().equalsIgnoreCase(u2.getHost());即第二個判斷，顯然url1的host（vimerzhao.github.io）和url2的host（zhanglanqing.github.io）不等，因此返回false，致使if (!hostsEqual(u1, u2))判斷爲真，return false執行。
可見，URL類的equals方法不只違反直覺還缺少一致性，在不一樣環境會有不一樣結果，十分危險！

耗時的`equals`方法

此外，equals仍是個耗時的操做，由於在有網絡的狀況下須要進行DNS解析，hashCode()同理，這裏以hashCode()爲例說明。URL類的hashCode()源碼：

public synchronized int hashCode() {
        if (hashCode != -1)
            return hashCode;

        hashCode = handler.hashCode(this);
        return hashCode;
    }

handler對象的hashCode()方法：

protected int hashCode(URL u) {
        int h = 0;

        // Generate the protocol part.
        String protocol = u.getProtocol();
        if (protocol != null)
            h += protocol.hashCode();

        // Generate the host part.
        InetAddress addr = getHostAddress(u);
        if (addr != null) {
            h += addr.hashCode();
        } else {
            String host = u.getHost();
            if (host != null)
                h += host.toLowerCase().hashCode();
        }

        // Generate the file part.
        String file = u.getFile();
        if (file != null)
            h += file.hashCode();

        // Generate the port part.
        if (u.getPort() == -1)
            h += getDefaultPort();
        else
            h += u.getPort();

        // Generate the ref part.
        String ref = u.getRef();
        if (ref != null)
            h += ref.hashCode();

        return h;
    }

其中getHostAddress()會消耗大量時間。因此，若是在基於哈希表的容器中存儲URL對象，簡直就是災難。下面這段代碼，對比了URL和URI在存儲50次時的表現：

import java.net.*;
import java.util.*;

public class TestHash {
    public static void main(String args[]) {
        HashSet<URL> list1 = new HashSet<>();
        HashSet<URI> list2 = new HashSet<>();
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URI url2 = new URI("https://zhanglanqing.github.io/");
            long cur = System.currentTimeMillis();
            int cnt = 50;
            for (int i = 0; i < cnt; i++) {
                list1.add(url1);
            }
            System.out.println(System.currentTimeMillis() - cur);
            cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                list2.add(url2);
            }
            System.out.println(System.currentTimeMillis() - cur);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

輸出爲：

271
0

因此，基於哈希表實現的容器最好不要用URL。

`TrailingSlash`的做用

所謂TrailingSlash就是域名結尾的斜槓。好比咱們在瀏覽器看到vimerzhao.top,複製後粘貼發現是http://vimerzhao.top/。首先用下面代碼測試：

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URL url2 = new URL("https://vimerzhao.github.io");
            System.out.println(url1.equals(url2));
            outputInfo(url1);
            outputInfo(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void outputInfo(URL url) {
        System.out.println("------" + url.toString() + "----------");
        System.out.println(url.getRef());
        System.out.println(url.getFile());
        System.out.println(url.getHost());
        System.out.println("----------------");
    }
}

獲得結果以下：

false
------https://vimerzhao.github.io/----------
null
/
vimerzhao.github.io
----------------
------https://vimerzhao.github.io----------
null

vimerzhao.github.io
----------------

其實，不管用前面的read()方法讀或者地址欄直接輸入url，url1和url2的內容都是相同的，可是加/表示這是一個目錄，不加表示這是一個文件，因此兩者getFile()的結果不一樣，致使equals判斷爲false。在地址欄輸入時甚至不會覺察到這個TrailingSlash，所返回的結果也同樣，但equals判斷居然爲false，真是防不勝防！
這裏還有一個問題就是：一個是文件，令一個是目錄，爲何都能獲得相同結果？
調查一番後發現：其實再請求的時候若是有/，那麼就會在這個目錄下找index.html文件；若是沒有，以vimerzhao.top/tags爲例，則會先找tags，若是找不到就會自動在後面添加一個/，再在tags目錄下找index.html文件。如圖：

這裏有一個有趣的測試，編寫兩段代碼以下：

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithSlash = new URL("http://vimerzhao.top/tags/");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

import java.net.*;
import java.io.*;

public class TestWithoutTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithoutSlash = new URL("http://vimerzhao.top/tags");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithoutSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

使用以下腳本測試：

#!/bin/sh
for i in {1..20}; do
    java TestTrailingSlash > out1
    java TestWithoutTrailingSlash > out2
done

將輸出的時間作成表格：

能夠發現，添加了/的速度更快，這是由於省去了查找是否有tags文件的操做。這也給咱們啓發：URL結尾的/最好仍是加上！

以上，本週末發現的一些坑。

參考

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。

Java URL類踩坑指南

背景介紹

不支持重定向

不穩定的equals方法

耗時的equals方法

TrailingSlash的做用

參考

不穩定的`equals`方法

耗時的`equals`方法

`TrailingSlash`的做用