Original link: https://blog.csdn.net/u012313689/article/details/53033804
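The iconv command converts the encoding of given files from one encoding to another. For example, it can convert UTF-8 text to GB18030, and the reverse works too. The JDK provides a similar tool, native2ascii. On Linux, the iconv development library includes C functions such as iconv_open, iconv_close, and iconv, which make it easy to convert character encodings in C/C++ programs. This is very useful in programs that scrape web pages, and the iconv command comes in handy when debugging such programs.
First, we need to know which character encodings are supported. The -l option lists the known coded character sets.
Format: iconv -l
Next, how to convert, as shown below:
Format: iconv -f from-encoding -t to-encoding inputfile
The invocation above prints the converted output to the screen. To write the output to a file instead, use the following form:
Format: iconv -f from-encoding -t to-encoding inputfile -o outputfile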
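As a minimal sketch of the two invocation forms above, the following round-trips a small sample through GB18030. The scratch file names are made up for illustration, the sample text is written with octal escapes so the script does not depend on the editor's encoding, and -o is assumed to behave as in the glibc iconv used in this post.

```shell
# Hypothetical scratch files; any writable directory would do.
workdir=$(mktemp -d)

# "中文" plus a newline, spelled out as UTF-8 bytes (E4 B8 AD E6 96 87).
printf '\344\270\255\346\226\207\n' > "$workdir/utf8.txt"

# Form 1: write the converted text to a file with -o.
iconv -f UTF-8 -t GB18030 "$workdir/utf8.txt" -o "$workdir/gb18030.txt"

# Form 2: print to standard output, here redirected by the shell.
iconv -f GB18030 -t UTF-8 "$workdir/gb18030.txt" > "$workdir/roundtrip.txt"

# The round trip should give back the original bytes.
if cmp -s "$workdir/utf8.txt" "$workdir/roundtrip.txt"; then
    echo "round trip ok"
fi
```

GB18030 encodes each of these two characters in two bytes, so gb18030.txt should come out at 5 bytes against 7 for the UTF-8 original.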
[root@new55 ~]# iconv -l
The following list contain all the coded character sets known. This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters. One coded character set can be
listed with several different names (aliases).
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BS_4730, CA, CN-BIG5, CN-GB,
(output omitted)
EUCJP-OPEN, EUCJP-WIN, EUCJP, EUCKR, EUCTW, FI, FR, GB, GB2312, GB13000,
GB18030, GBK, GB_1988-80, GB_198880, GEORGIAN-ACADEMY, GEORGIAN-PS,
GOST_19768-74, GOST_19768, GOST_1976874, GREEK-CCITT, GREEK, GREEK7-OLD,
GREEK7, GREEK7OLD, GREEK8, GREEKCCITT, HEBREW, HP-ROMAN8, HPROMAN8, HU,
(output omitted)
TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, TSCII, UCS-2, UCS-2BE,
UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK, UNICODE,
UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16, UTF-16BE,
UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE, UTF16LE,
UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM,
WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251,
WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256,
WINDOWS-1257, WINDOWS-1258, WINSAMI2, WS2, YU
That is far too many. I only want to know which Chinese encodings are supported.
[root@new55 ~]# iconv -l | grep GB
CN-GB//
CSGB2312//
CSISO58GB1988//
EBCDIC-CP-GB//
GB//
GB2312//
GB13000//
GB18030//
GBK//
GB_1988-80//
GB_198880//
ISO646-GB//
Notice anything odd? Each name is now shown on its own line, and two slashes have been appended to it.
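The one-name-per-line layout with trailing slashes seems to be what glibc's iconv prints when its output goes to a pipe instead of a terminal. A small sketch for searching the list regardless of which layout comes out; list_encodings is a hypothetical helper name, not part of iconv.

```shell
# Hypothetical helper: print the known encoding names that match a grep
# pattern, one per line, without the trailing "//".
list_encodings() {
    iconv -l | tr -s ' ,' '\n' | sed 's|//.*$||' | grep -i -- "$1"
}

# Names starting with GB, such as GB2312, GBK and GB18030 on a glibc system.
list_encodings '^GB'
```

Anchoring the pattern with ^ avoids unrelated hits such as EBCDIC-CP-GB from the listing above.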
[root@new55 ~]#
[root@new55 ~]# curl -s http://www.google.com.hk/ | iconv -f big5 -t gbk
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=Big5"><title>Google</title><script>window.google={kEI:"tFXZTNHKDcGTkAXpvOHhCA",kEXPI:"26637,27404",kCSI:{e:"26637,27404",ei:"tFXZTNHKDcGTkAXpvOHhCA",expi:"26637,27404"},ml:function(){},kHL:"zh-TW",time:function(){return(new Date).getTime()},log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.onerror=(a.onload=(a.onabort=function(){delete g[f]}));g[f]=a;c=c||"/gen_204?atyp=i&ct="+b+"&cad="+d+"&zx="+google.time();a.src=c;e.li=f+1},lc:[],li:0,Toolbelt:{}};
id=ghead><div id=gbar><nobr><b class="gb1">全部網頁</b> <a onclick=gbar.qs(this) href="http://www.google.com.hk/imghp?hl=zh-tw&tab=wi" class="gb1">圖片</a> <a onclick=gbar.qs(this) href="http://video.google.com.hk/?hl=zh-tw&tab=wv" class="gb1">影片</a> <a onclick=gbar.qs(this) href="http://maps.google.com.hk/maps?hl=zh-tw&tab=wl" class="gb1">地圖</a> <a onclick=gbar.qs(this) f||document.f||document.gs;google.ac.i(form,form.q,'','','',{o:1,sw:1});google.mc = [[14,{}],[64,{}],[105,{}],[22,{"m_error":"\u003Cfont color=red\u003E錯誤:\u003C/font\u003E 伺服器無法完成您的要求。 請在 30 秒後再試一次。","m_tip":"按一下以取得詳細資訊。"}],[84,{}]];google.med('init');google.History&&google.History.initialize('/')});if(google.j&&google.j.en&&google.j.xi){window.setTimeout(google.j.xi,0);google.fade=null;}</script></div><script>(function(){
(output omitted)
})();
</script>[root@new55 ~]#
[root@new55 ~]# curl -s http://codingstandards.iteye.com/ | iconv -f utf8 -t gbk
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Bash @ Linux - JavaEye技術網站</title>
<meta name="description" content="" />
<meta name="keywords" content="codingstandards Bash @ Linux" />
(output omitted)
<div class="blog_main">
<div class="blog_title">
<div class="date"><span class='year'>2010</span><span class='sep_year'>-</span><span class='month'>10</span><span class='sep_month'>-</span><span class='day'>17</span></div>
<div class="show_full_flag"><a href='?show_full=true'>全文顯示</a></div>
<h3><a href='/blog/786653'>[置頂] 我使用過的Linux命令系列總目錄</a></h3>
<strong>文章分類:<a href="http://www.iteye.com/blogs/category/os" >操做系統</a></strong>
</div>
<div class="blog_content">
我使用過的Linux命令系列總目錄
本文連接: http://codingstandards.iteye.com/blog/786653
iconv: 未知 3345 處的非法輸入序列
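The last line, where iconv reports an illegal input sequence at position 3345, shows that the conversion failed. Using the following command instead succeeds.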
[root@new55 ~]# curl -s http://codingstandards.iteye.com/ | iconv -f utf8 -t gb18030
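The output is omitted here. Interested readers can try it; the complete source of the page is displayed. This works because GBK is a subset of GB18030, which contains more characters.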
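The difference between the two target encodings can be reproduced without a network connection. This is a sketch that assumes glibc's iconv, which exits with a non-zero status when a character has no mapping in the target encoding. The sample character U+1F600 is chosen for illustration and is not taken from the page above, and try_convert is a hypothetical helper name.

```shell
# "ok " followed by U+1F600, spelled out as UTF-8 bytes (F0 9F 98 80).
# Characters outside the Basic Multilingual Plane have no GBK code.
sample=$(printf 'ok \360\237\230\200')

try_convert() {
    # Hypothetical helper: report whether $sample converts to encoding $1.
    if printf '%s\n' "$sample" | iconv -f UTF-8 -t "$1" > /dev/null 2>&1; then
        echo converted
    else
        echo failed
    fi
}

gbk_result=$(try_convert GBK)
gb18030_result=$(try_convert GB18030)
echo "GBK: $gbk_result, GB18030: $gb18030_result"
```

Checking the exit status this way is also useful in scripts, where the error message on the last line of a long page is easy to miss.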
[root@new55 ~]#
[root@new55 ~]# curl -s http://www.dreamdu.com/ | iconv -futf8 -t gbk
iconv: 未知 0 處的非法輸入序列
Something is wrong. Looking at the bytes with hexdump shows that the page starts with ef bb bf, the UTF-8 BOM, which iconv does not handle here.
[root@new55 ~]# curl -s http://www.dreamdu.com/ | hexdump -C | less
00000000 ef bb bf 3c 21 44 4f 43 54 59 50 45 20 68 74 6d |...<!DOCTYPE htm|
00000010 6c 20 50 55 42 4c 49 43 20 22 2d 2f 2f 57 33 43 |l PUBLIC "-//W3C|
00000020 2f 2f 44 54 44 20 58 48 54 4d 4c 20 31 2e 30 20 |//DTD XHTML 1.0 |
00000030 53 74 72 69 63 74 2f 2f 45 4e 22 20 22 68 74 74 |Strict//EN" "htt|
00000040 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f 54 |p://www.w3.org/T|
00000050 52 2f 78 68 74 6d 6c 31 2f 44 54 44 2f 78 68 74 |R/xhtml1/DTD/xht|
00000060 6d 6c 31 2d 73 74 72 69 63 74 2e 64 74 64 22 3e |ml1-strict.dtd">|
00000070 0d 0a 3c 68 74 6d 6c 20 78 6d 6c 6e 73 3d 22 68 |..<html xmlns="h|
:q
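When only the BOM matters, paging through the whole dump is not necessary. A minimal sketch that checks just the first three bytes of its input; has_utf8_bom is a hypothetical helper name, and head -c is assumed to be available (it is on GNU and BSD systems).

```shell
has_utf8_bom() {
    # Reads standard input; succeeds when the first three bytes are EF BB BF.
    [ "$(head -c 3 | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Sample input built with octal escapes: a BOM followed by a doctype line.
if printf '\357\273\277<!DOCTYPE html>\n' | has_utf8_bom; then
    echo "BOM present"
else
    echo "no BOM"
fi
```

The same function can sit at the end of a pipeline, for example after the curl command used above.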
So let's try removing the first three bytes. Sure enough, the conversion now goes through.
[root@new55 ~]# curl -s http://www.dreamdu.com/ | cut -b 4- | iconv -futf8 -t gbk
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
ml xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
ead>
meta http-equiv="content-type" content="text/html; charset=utf-8" />
meta http-equiv="content-language" content="zh-CN" />
link rel="stylesheet" type="text/css" href="/style.css?v=1" media="screen" />
script type="text/javascript" src="/js.js"></script>
title>夢之都 - 網站設計與開發教程</title>
head>
ody>
(output omitted)
body>
tml>
Did you spot the problem? The first few characters of every line have disappeared!
[root@new55 ~]#
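The characters vanish because cut works line by line: cut -b 4- drops the first three bytes of every line, not just of the file. One way to remove only the leading three bytes is tail -c +4, which starts output at the fourth byte of the whole stream. The sketch below contrasts the two on a made-up two-line sample. Like the cut attempt, it removes the bytes unconditionally, so it assumes the input really starts with a BOM.

```shell
# Made-up sample: UTF-8 BOM (EF BB BF) followed by two short lines.
sample_file=$(mktemp)
printf '\357\273\277<html>\n<head>\n' > "$sample_file"

# cut applies the byte range to each line separately.
cut_result=$(cut -b 4- "$sample_file")

# tail -c +4 skips three bytes once, at the start of the stream.
tail_result=$(tail -c +4 "$sample_file")

printf 'cut:\n%s\n\ntail:\n%s\n' "$cut_result" "$tail_result"
rm -f "$sample_file"
```

Applied to the pipeline above, this would read: curl -s http://www.dreamdu.com/ | tail -c +4 | iconv -f utf8 -t gbk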