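Background
I have recently been building an RSS reader for my own use. One step fetches a JSON file containing the list of RSS sources from the server, then downloads and parses the RSS content according to that file. The core code is as follows: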
    class PresenterImpl(val context: Context, val activity: MainActivity) : IPresenter {
        private val URL_API = "https://vimerzhao.github.io/others/rssreader/RSS.json"

        override fun getRssResource(): RssSource {
            val gson = GsonBuilder().create()
            return gson.fromJson(getFromNet(URL_API), RssSource::class.java)
        }

        private fun getFromNet(url: String): String {
            val result = URL(url).readText()
            return result
        }
        ......
    }
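This had always worked well until two days ago, when I bought the domain vimerzhao.top and redirected the old domain vimerzhao.github.io to it. The tool then stopped working, even though entering URL_API in a browser still returned the data.
So why did URL.readText() not get the data?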
Redirects are not followed
This can be tested with the following code:
    import java.net.*;
    import java.io.*;

    public class TestRedirect {
        public static void main(String args[]) {
            try {
                URL url1 = new URL("https://vimerzhao.github.io/others/rssreader/RSS.json");
                URL url2 = new URL("http://vimerzhao.top/others/rssreader/RSS.json");
                read(url1);
                System.out.println("=--------------------------------=");
                read(url2);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void read(URL url) {
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream()));
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    System.out.println(inputLine);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
The result is as follows:
    <html>
    <head><title>301 Moved Permanently</title></head>
    <body bgcolor="white">
    <center><h1>301 Moved Permanently</h1></center>
    <hr><center>nginx</center>
    </body>
    </html>
    =--------------------------------=
    {"theme":"tech","author":"zhaoyu","email":"dutzhaoyu@gmail.com","version":"0.01","contents":[{"category":"綜合版塊","websites":[{"tag":"門戶網站","url":["http://geek.csdn.net/admin/news_service/rss","http://blog.jobbole.com/feed/","http://feed.cnblogs.com/blog/sitehome/rss","https://segmentfault.com/feeds","http://www.codeceo.com/article/category/pick/feed"]},{"tag":"知名社區","url":["https://stackoverflow.com/feeds","https://www.v2ex.com/index.xml"]},{"tag":"官方博客","url":["https://www.blog.google/rss/","https://blog.jetbrains.com/feed/"]},{"tag":"我的博客-行業","url":["http://feed.williamlong.info/","https://www.liaoxuefeng.com/feed/articles"]},{"tag":"我的博客-學術","url":["http://www.norvig.com/rss-feed.xml"]}]},{"category":"編程語言","websites":[{"tag":"Kotlin","url":["https://kotliner.cn/api/rss/latest"]},{"tag":"Python","url":["https://www.python.org/dev/peps/peps.rss/"]},{"tag":"Java","url":["http://www.codeceo.com/article/category/develop/java/feed"]}]},{"category":"行業動態","websites":[{"tag":"Android","url":["http://www.codeceo.com/article/category/develop/android/feed"]}]},{"category":"亂七八遭","websites":[{"tag":"Linux-綜合","url":["https://linux.cn/rss.xml","http://www.linuxidc.com/rssFeed.aspx","http://www.codeceo.com/article/tag/linux/feed"]},{"tag":"Linux-發行版","url":["https://blog.linuxmint.com/?feed=rss2","https://manjaro.github.io/feed.xml"]}]}]}
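The HTTP status code is 301, which means a redirect took place. In a browser this happens so quickly that we never see the 301 page. Note that URL.readText() is a Kotlin extension function that ultimately calls the openStream method of the URL class. Part of its source is shown below: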
    .....
    /**
     * Reads the entire content of this URL as a String using UTF-8 or the specified [charset].
     *
     * This method is not recommended on huge files.
     *
     * @param charset a character set to use.
     * @return a string with this URL entire content.
     */
    @kotlin.internal.InlineOnly
    public inline fun URL.readText(charset: Charset = Charsets.UTF_8): String = readBytes().toString(charset)

    /**
     * Reads the entire content of the URL as byte array.
     *
     * This method is not recommended on huge files.
     *
     * @return a byte array with this URL entire content.
     */
    public fun URL.readBytes(): ByteArray = openStream().use { it.readBytes() }
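So the test code above explains why URL.readText() failed: the request to the old https address was answered with a 301, the redirect was not followed, and the 301 page came back instead of the JSON. Strictly speaking, HttpURLConnection does follow redirects by default, but only within the same protocol; it refuses to follow a redirect that switches protocol, and the redirect here goes from https to http. Whether that restriction is reasonable still remains to be explored.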
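One possible workaround is to follow the redirect by hand. The sketch below is only an illustration under stated assumptions: the class name RedirectSketch, the helper names and the limit of 5 hops are all made up for this example, and it needs Java 9 or later for readAllBytes(). It turns off automatic redirect handling, reads the Location header, and resolves it against the current address, so a hop from https to http is followed as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class RedirectSketch {
    // Hypothetical safety limit so that a redirect loop cannot run forever.
    private static final int MAX_REDIRECTS = 5;

    /** True for the status codes that normally carry a Location header. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location value (absolute or relative) against the current address. */
    static URI resolve(URI current, String location) {
        return current.resolve(location);
    }

    /** Reads an address as UTF-8 text, following redirects manually, even across protocols. */
    static String readFollowingRedirects(String address) throws IOException {
        URI uri = URI.create(address);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // every hop is handled below
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                try (InputStream in = conn.getInputStream()) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header: " + uri);
            }
            uri = resolve(uri, location);
        }
        throw new IOException("Too many redirects: " + address);
    }

    public static void main(String[] args) throws IOException {
        // Network access only happens when an address is passed on the command line.
        if (args.length == 1) {
            System.out.println(readFollowingRedirects(args[0]));
        }
    }
}
```

It could be run as `java RedirectSketch https://vimerzhao.github.io/others/rssreader/RSS.json`. Silently following a downgrade from https to http has security implications, so a real application may prefer to simply update the stored address instead.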
The unstable equals method
First, look at the documentation of equals (URL (Java Platform SE 7)):
Compares this URL for equality with another object.
If the given object is not a URL then this method immediately returns false.
Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
Since hosts comparison requires name resolution, this operation is a blocking operation.
Note: The defined behavior for equals is known to be inconsistent with virtual hosting in HTTP.
Next, look at this code:
    import java.net.*;

    public class TestEquals {
        public static void main(String args[]) {
            try {
                // vimerzhao's blog home page
                URL url1 = new URL("https://vimerzhao.github.io/");
                // zhanglanqing's blog home page
                URL url2 = new URL("https://zhanglanqing.github.io/");
                // the domain that vimerzhao's blog home page now redirects to
                URL url3 = new URL("http://vimerzhao.top/");
                System.out.println(url1.equals(url2));
                System.out.println(url1.equals(url3));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
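What output would the definition lead you to expect? Running it gives:
    true
    false
You may have guessed correctly, but if I disconnect the machine from the network and run it again, the result is:
    false
    false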
In fact, all three domains resolve to the same IP address, as ping shows:
    zhaoyu@Inspiron ~/Project $ ping vimezhao.github.io
    PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
    64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
    ^C
    --- sni.github.map.fastly.net ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 396.692/396.692/396.692/0.000 ms
    zhaoyu@Inspiron ~/Project $ ping zhanglanqing.github.io
    PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
    64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
    ^C
    --- sni.github.map.fastly.net ping statistics ---
    2 packets transmitted, 1 received, 50% packet loss, time 1000ms
    rtt min/avg/max/mdev = 396.009/396.009/396.009/0.000 ms
    zhaoyu@Inspiron ~/Project $ ping vimezhao.top
    ping: unknown host vimezhao.top
    zhaoyu@Inspiron ~/Project $ ping vimerzhao.top
    PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
    64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=409 ms
    ^C
    --- sni.github.map.fastly.net ping statistics ---
    2 packets transmitted, 1 received, 50% packet loss, time 1001ms
    rtt min/avg/max/mdev = 409.978/409.978/409.978/0.000 ms
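Consider first the case with a network connection. vimerzhao.github.io and zhanglanqing.github.io are my blog and a classmate's blog. Their content differs, but they resolve to the same IP and share the same protocol, port and so on, so they compare as equal. vimerzhao.github.io and vimerzhao.top point to the same blog, but one uses https and the other http; the protocols differ, so they compare as unequal. I believe this contradicts most people's intuition: URLs pointing to different blogs are equal, while URLs pointing to the same blog are not!
Now for the result without a network. First, the source of equals in the URL class: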
    public boolean equals(Object obj) {
        if (!(obj instanceof URL))
            return false;
        URL u2 = (URL)obj;

        return handler.equals(this, u2);
    }
Next, the source of the handler object (a URLStreamHandler):
    protected boolean equals(URL u1, URL u2) {
        String ref1 = u1.getRef();
        String ref2 = u2.getRef();
        return (ref1 == ref2 || (ref1 != null && ref1.equals(ref2))) &&
               sameFile(u1, u2);
    }
The source of sameFile:
    protected boolean sameFile(URL u1, URL u2) {
        // Compare the protocols.
        if (!((u1.getProtocol() == u2.getProtocol()) ||
              (u1.getProtocol() != null &&
               u1.getProtocol().equalsIgnoreCase(u2.getProtocol()))))
            return false;

        // Compare the files.
        if (!(u1.getFile() == u2.getFile() ||
              (u1.getFile() != null && u1.getFile().equals(u2.getFile()))))
            return false;

        // Compare the ports.
        int port1, port2;
        port1 = (u1.getPort() != -1) ? u1.getPort() : u1.handler.getDefaultPort();
        port2 = (u2.getPort() != -1) ? u2.getPort() : u2.handler.getDefaultPort();
        if (port1 != port2)
            return false;

        // Compare the hosts.
        if (!hostsEqual(u1, u2))
            return false; // this line is reached when there is no network connection

        return true;
    }
Finally, the source of hostsEqual:
    protected boolean hostsEqual(URL u1, URL u2) {
        InetAddress a1 = getHostAddress(u1);
        InetAddress a2 = getHostAddress(u2);
        // if we have internet address for both, compare them
        if (a1 != null && a2 != null) {
            return a1.equals(a2);
        // else, if both have host names, compare them
        } else if (u1.getHost() != null && u2.getHost() != null)
            return u1.getHost().equalsIgnoreCase(u2.getHost());
        else
            return u1.getHost() == null && u2.getHost() == null;
    }
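With a network connection, a1 and a2 are both non-null, so return a1.equals(a2) runs and returns true. Without a network, getHostAddress cannot resolve either host and returns null, so the second branch, return u1.getHost().equalsIgnoreCase(u2.getHost());, runs instead. The host of url1 (vimerzhao.github.io) and the host of url2 (zhanglanqing.github.io) obviously differ, so it returns false; the check if (!hostsEqual(u1, u2)) is then true and return false executes.
Clearly, the equals method of the URL class is not only counterintuitive but also inconsistent: it gives different results in different environments, which is very dangerous!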
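A common way around this is to compare java.net.URI objects instead, since URI.equals works on the text of the address and never touches the network. The following is a minimal sketch; the class name UriEqualsSketch and the helper sameAddress are hypothetical names chosen for this illustration.

```java
import java.net.URI;

public class UriEqualsSketch {
    /** Compares two addresses through URI.equals, which involves no DNS lookup. */
    static boolean sameAddress(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Different hosts: false, no matter which IP addresses they resolve to.
        System.out.println(sameAddress("https://vimerzhao.github.io/",
                                       "https://zhanglanqing.github.io/"));
        // Different scheme and host: false.
        System.out.println(sameAddress("https://vimerzhao.github.io/",
                                       "http://vimerzhao.top/"));
        // Scheme and host are compared without regard to case: true.
        System.out.println(sameAddress("https://vimerzhao.github.io/",
                                       "HTTPS://VIMERZHAO.GITHUB.IO/"));
    }
}
```

The result is the same online and offline. It is still a purely textual comparison, so two addresses that serve the same blog under different names remain unequal.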
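The time-consuming equals method
In addition, equals is a time-consuming operation, because with a network connection it has to perform DNS resolution. The same goes for hashCode(), which is used as the example here. The source of hashCode() in the URL class: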
    public synchronized int hashCode() {
        if (hashCode != -1)
            return hashCode;

        hashCode = handler.hashCode(this);
        return hashCode;
    }
The hashCode() method of the handler object:
    protected int hashCode(URL u) {
        int h = 0;

        // Generate the protocol part.
        String protocol = u.getProtocol();
        if (protocol != null)
            h += protocol.hashCode();

        // Generate the host part.
        InetAddress addr = getHostAddress(u);
        if (addr != null) {
            h += addr.hashCode();
        } else {
            String host = u.getHost();
            if (host != null)
                h += host.toLowerCase().hashCode();
        }

        // Generate the file part.
        String file = u.getFile();
        if (file != null)
            h += file.hashCode();

        // Generate the port part.
        if (u.getPort() == -1)
            h += getDefaultPort();
        else
            h += u.getPort();

        // Generate the ref part.
        String ref = u.getRef();
        if (ref != null)
            h += ref.hashCode();

        return h;
    }
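Here getHostAddress() takes a lot of time. Storing URL objects in a hash-based container is therefore a disaster. The following code compares URL and URI when each is added to a HashSet 50 times: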
    import java.net.*;
    import java.util.*;

    public class TestHash {
        public static void main(String args[]) {
            HashSet<URL> list1 = new HashSet<>();
            HashSet<URI> list2 = new HashSet<>();
            try {
                URL url1 = new URL("https://vimerzhao.github.io/");
                URI url2 = new URI("https://zhanglanqing.github.io/");
                long cur = System.currentTimeMillis();
                int cnt = 50;
                for (int i = 0; i < cnt; i++) {
                    list1.add(url1);
                }
                System.out.println(System.currentTimeMillis() - cur);
                cur = System.currentTimeMillis();
                for (int i = 0; i < cnt; i++) {
                    list2.add(url2);
                }
                System.out.println(System.currentTimeMillis() - cur);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
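The output (in milliseconds) is:
    271
    0
Because the same URL object is added every time and hashCode() caches its value, the 271 ms is essentially the cost of resolving the host once. So it is best not to use URL in containers implemented with hash tables.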
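One way to apply this advice is to keep addresses as URI keys and convert to URL only at the moment of downloading. The sketch below assumes a hypothetical helper uniqueFeeds in a made-up class UriSetSketch; the two feed addresses are taken from the RSS.json output shown earlier.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UriSetSketch {
    /** Deduplicates feed addresses using URI keys; hashing a URI needs no DNS lookup. */
    static Set<URI> uniqueFeeds(List<String> addresses) {
        Set<URI> feeds = new LinkedHashSet<>(); // keeps insertion order
        for (String address : addresses) {
            feeds.add(URI.create(address));
        }
        return feeds;
    }

    public static void main(String[] args) {
        List<String> addresses = Arrays.asList(
                "https://linux.cn/rss.xml",
                "https://www.v2ex.com/index.xml",
                "https://linux.cn/rss.xml"); // duplicate on purpose
        for (URI feed : uniqueFeeds(addresses)) {
            System.out.println(feed);
        }
    }
}
```

When a feed actually has to be fetched, `feed.toURL()` gives the URL object needed to open a connection, so the DNS cost is paid only where a network call happens anyway.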
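The effect of the trailing slash
The trailing slash is the slash at the end of an address, for instance right after the domain name. For example, the browser shows vimerzhao.top, but copying and pasting it yields http://vimerzhao.top/. Start with the following test: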
    import java.net.*;
    import java.io.*;

    public class TestTrailingSlash {
        public static void main(String args[]) {
            try {
                URL url1 = new URL("https://vimerzhao.github.io/");
                URL url2 = new URL("https://vimerzhao.github.io");
                System.out.println(url1.equals(url2));
                outputInfo(url1);
                outputInfo(url2);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void outputInfo(URL url) {
            System.out.println("------" + url.toString() + "----------");
            System.out.println(url.getRef());
            System.out.println(url.getFile());
            System.out.println(url.getHost());
            System.out.println("----------------");
        }
    }
The result is as follows (for url2, getFile() prints an empty line):
    false
    ------https://vimerzhao.github.io/----------
    null
    /
    vimerzhao.github.io
    ----------------
    ------https://vimerzhao.github.io----------
    null
    
    vimerzhao.github.io
    ----------------
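In fact, whether you read them with the read() method shown earlier or type them into the address bar, url1 and url2 return the same content. But a trailing / denotes a directory and its absence denotes a file, so getFile() differs between the two and equals returns false. When typing in the address bar you do not even notice the trailing slash, and the returned result is the same, yet equals returns false. It is really hard to guard against!
This raises another question: one is a file and the other a directory, so why do both give the same result? After some investigation I found that if the request ends with /, the server looks for index.html in that directory. If it does not, taking vimerzhao.top/tags as an example, the server first looks for a file named tags; if none is found, a / is appended automatically and index.html is then looked up in the tags directory, as the figure shows.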
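If the two forms of the home page address should compare as equal, one option is to normalize the empty path before comparing. This is only a sketch under assumptions: the class SlashSketch and the helper withRootSlash are hypothetical, and only the empty path is rewritten to /. A path such as /tags is deliberately left alone, because whether /tags and /tags/ are the same resource depends on the server.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SlashSketch {
    /** Returns the URI with an empty path replaced by "/"; any other URI is returned unchanged. */
    static URI withRootSlash(URI uri) {
        String path = uri.getRawPath();
        if (uri.isOpaque() || uri.getRawAuthority() == null
                || (path != null && !path.isEmpty())) {
            return uri;
        }
        try {
            // Rebuild the URI with "/" as its path, keeping query and fragment.
            return new URI(uri.getScheme(), uri.getAuthority(), "/",
                           uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI withSlash = URI.create("https://vimerzhao.github.io/");
        URI withoutSlash = URI.create("https://vimerzhao.github.io");
        System.out.println(withSlash.equals(withoutSlash));                // false
        System.out.println(withSlash.equals(withRootSlash(withoutSlash))); // true
    }
}
```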
Here is an interesting test. Write the following two programs, first with the trailing slash:
    import java.net.*;
    import java.io.*;

    public class TestTrailingSlash {
        public static void main(String args[]) {
            try {
                URL urlWithSlash = new URL("http://vimerzhao.top/tags/");
                int cnt = 5;
                long cur = System.currentTimeMillis();
                for (int i = 0; i < cnt; i++) {
                    read(urlWithSlash);
                }
                System.out.println(System.currentTimeMillis() - cur);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void read(URL url) {
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream()));
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    //System.out.println(inputLine);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
and then without it:
    import java.net.*;
    import java.io.*;

    public class TestWithoutTrailingSlash {
        public static void main(String args[]) {
            try {
                URL urlWithoutSlash = new URL("http://vimerzhao.top/tags");
                int cnt = 5;
                long cur = System.currentTimeMillis();
                for (int i = 0; i < cnt; i++) {
                    read(urlWithoutSlash);
                }
                System.out.println(System.currentTimeMillis() - cur);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void read(URL url) {
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream()));
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    //System.out.println(inputLine);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
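Test with the following script:
    #!/bin/bash
    # {1..20} is a bash brace expansion, so the script has to run under bash
    for i in {1..20}; do
        # append (>>) so that the timings of all 20 runs are kept
        java TestTrailingSlash >> out1
        java TestWithoutTrailingSlash >> out2
    done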
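Putting the recorded times into a table shows that the version with the trailing / is faster, because it skips the step of checking whether a file named tags exists. The lesson: it is best to add the / at the end of a URL!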
Those are the pitfalls I discovered this weekend.
References
- Official Google Webmaster Central Blog: To slash or not to slash
- url rewriting - When should I use a trailing slash in my URL? - Stack Overflow
- What Does a Slash at the End of a Website's URL Mean?
- Mr. Gosling - why did you make URL equals suck?!? - Invert Your Mind » Invert Your Mind
- java - URLConnection Doesn't Follow Redirect - Stack Overflow
- java - Proper way to check for URL equality - Stack Overflow
- http - How to compare two URLs in java? - Stack Overflow