LVS consists of kernel code (ip_vs) plus a user-space controller (ipvsadm).
The underlying implementation mechanism of neutron-lbaas-lvs is ipvsadm.
Download the ipvsadm client source; ipvsadm-1.26 describes the relevant configuration:
echo "1" > /proc/sys/net/ipv4/ip_forward
cat /proc/sys/net/ipv4/vs/amemthresh
/proc/sys/net/ipv4/vs/timeout_*
cat /proc/sys/net/ipv4/vs/snat_reroute
Key references:
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.what_is_an_LVS.html
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.ipvsadm.html
http://www.linuxvirtualserver.org/software/ipvs.html
http://kb.linuxvirtualserver.org/wiki/Ipvsadm#Compiling_ipvsadm
# ll /proc/net/ip_vs*
ip_vs
ip_vs_app
ip_vs_conn
ip_vs_conn_sync
ip_vs_stats
ip_vs_stats_percpu
# cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
UDP 0A212E47:22B8 rr
-> 0A212E04:1E61 Masq 10000 0 0
# ipvsadm -S
-A -u host-10-33-46-71.openstacklocal:ddi-udp-1 -s rr
-a -u host-10-33-46-71.openstacklocal:ddi-udp-1 -r host-10-33-46-4.openstacklocal:cbt -m -w 10000
# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
UDP host-10-33-46-71.openstacklo rr
-> host-10-33-46-4.openstackloc Masq 10000 0 0
ipvsadm general model: CIP <--> VIP -- DIP <--> RIP
Direct Routing (DR): direct routing mode
Director: the load-balancing layer (Load Balancer)
RS (Real Server): the machine that actually provides the service
VIP (Virtual IP address): the IP address the Director uses to serve clients
RIP (Real IP address): the IP address used by the cluster nodes (the back-end servers that actually serve requests)
DIP (Director's IP address): the address the Director uses to communicate with the RIPs
CIP (Client computer's IP address): the public IP used by the client
ipvsadm -A|E -t|u|f service-address [-s scheduler] [-p [timeout]] [-M netmask] [--pe persistence_engine] [-b sched-flags]
LVS forwarding types:
--gatewaying -g gatewaying (direct routing) (default)
--ipip -i ipip encapsulation (tunneling)
--masquerading -m masquerading (NAT)
--scheduler -s scheduler one of rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq, the default scheduler is wlc.
1. Fixed (static) scheduling methods
(1). RR: round robin
(2). WRR: weighted round robin
(3). DH: destination hashing
(4). SH: source hashing
2. Dynamic scheduling methods
(1). LC: least connection
(2). WLC: weighted least connection
(3). SED: shortest expected delay
(4). NQ: never queue
(5). LBLC: locality-based least connection
(6). LBLCR: locality-based least connection with replication
NIC configuration:
Director: one external NIC, one internal NIC
external eth0 : 10.19.172.188
external eth0:0 (VIP): 10.19.172.184
internal eth1 : 192.168.1.1
RS (real servers): internal NIC only
eth0: 192.168.1.2
      192.168.1.3
      192.168.1.4
gateway: 192.168.1.1
Script to configure the IPVS table:
VIP=192.168.34.41
RIP1=192.168.34.27
RIP2=192.168.34.26
GW=192.168.34.1
# Clear the IPVS table
ipvsadm -C
# Set up the IPVS table
ipvsadm -A -t $VIP:443 -s wlc
ipvsadm -a -t $VIP:443 -r $RIP1:443 -g -w 1
ipvsadm -a -t $VIP:443 -r $RIP2:443 -g -w 1
# Save the IPVS table to /etc/sysconfig/ipvsadm (via the init script in /etc/rc.d/init.d/)
service ipvsadm save
# Start the ipvsadm service
service ipvsadm start
# Show IPVS status
ipvsadm -l
Configure the cluster on the Director and add the real servers:
# ipvsadm -A -t 172.16.251.184:80 -s sh
# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.2 -m
# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.3 -m
# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.4 -m
One advantage LVS has over HAProxy is client transparency (i.e., no SNAT of the client address is needed).
To inspect IPVS details, look under /proc/net at: ip_vs, ip_vs_app, ip_vs_conn, ip_vs_conn_sync, ip_vs_ext_stats, ip_vs_stats
IPVS-related proc files:
/proc/net/ip_vs       : the IPVS rule table
/proc/net/ip_vs_app   : IPVS application protocols
/proc/net/ip_vs_conn  : current IPVS connections
/proc/net/ip_vs_stats : IPVS statistics
# depmod -n|grep ipvs
kernel/net/netfilter/xt_ipvs.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs.ko.xz: kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_rr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_wrr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_wlc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lblc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lblcr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_dh.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_sh.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_sed.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_nq.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_ftp.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_nat.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_pe_sip.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack_sip.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
alias ip6t_ipvs xt_ipvs
alias ipt_ipvs xt_ipvs
# modinfo ip_vs
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/kernel/net/netfilter/ipvs/ip_vs.ko.xz
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 69C7A8962537C8009A78FBC
depends: nf_conntrack,libcrc32c
intree: Y
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: C0:02:A7:AD:CC:7C:84:36:68:A1:BC:B4:97:4C:1A:30:2D:FF:EA:35
sig_hashalgo: sha256
parm: conn_tab_bits:Set connections' hash size (int)
# modinfo iptable_nat
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/kernel/net/ipv4/netfilter/iptable_nat.ko.xz
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 291B36E315928812DAB1A47
depends: ip_tables,nf_nat_ipv4
intree: Y
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: C0:02:A7:AD:CC:7C:84:36:68:A1:BC:B4:97:4C:1A:30:2D:FF:EA:35
sig_hashalgo: sha256
Source: http://www.linuxvirtualserver.org/software/ipvs.html
ipvsadm.sh:
# config: /etc/sysconfig/ipvsadm
# config: /etc/ipvsadm.rules
ipvsadm.c
int main(int argc, char **argv)
{
    int result;

    if (ipvs_init()) {
        /* try to insmod the ip_vs module if ipvs_init failed */
        if (modprobe_ipvs() || ipvs_init())
            fail(2, "Can't initialize ipvs: %s\n"
                 "Are you sure that IP Virtual Server is "
                 "built in the kernel or as module?",
                 ipvs_strerror(errno));
    }

    /* warn the user if the IPVS version is out of date */
    check_ipvs_version();

    /* list the table if there is no other argument */
    if (argc == 1) {
        list_all(FMT_NONE);
        ipvs_close();
        return 0;
    }

    /* process command line arguments */
    result = process_options(argc, argv, 0);

    ipvs_close();
    return result;
}
static int parse_options(int argc, char **argv, struct ipvs_command_entry *ce, unsigned int *options, unsigned int *format) { int c, parse; poptContext context; char *optarg=NULL; struct poptOption options_table[] = { { "add-service", 'A', POPT_ARG_NONE, NULL, 'A', NULL, NULL }, { "edit-service", 'E', POPT_ARG_NONE, NULL, 'E', NULL, NULL }, { "delete-service", 'D', POPT_ARG_NONE, NULL, 'D', NULL, NULL }, { "clear", 'C', POPT_ARG_NONE, NULL, 'C', NULL, NULL }, { "list", 'L', POPT_ARG_NONE, NULL, 'L', NULL, NULL }, { "list", 'l', POPT_ARG_NONE, NULL, 'l', NULL, NULL }, { "zero", 'Z', POPT_ARG_NONE, NULL, 'Z', NULL, NULL }, { "add-server", 'a', POPT_ARG_NONE, NULL, 'a', NULL, NULL }, { "edit-server", 'e', POPT_ARG_NONE, NULL, 'e', NULL, NULL }, { "delete-server", 'd', POPT_ARG_NONE, NULL, 'd', NULL, NULL }, { "set", '\0', POPT_ARG_NONE, NULL, TAG_SET, NULL, NULL }, { "help", 'h', POPT_ARG_NONE, NULL, 'h', NULL, NULL }, { "version", 'v', POPT_ARG_NONE, NULL, 'v', NULL, NULL }, { "restore", 'R', POPT_ARG_NONE, NULL, 'R', NULL, NULL }, { "save", 'S', POPT_ARG_NONE, NULL, 'S', NULL, NULL }, { "start-daemon", '\0', POPT_ARG_STRING, &optarg, TAG_START_DAEMON, NULL, NULL }, { "stop-daemon", '\0', POPT_ARG_STRING, &optarg, TAG_STOP_DAEMON, NULL, NULL }, { "tcp-service", 't', POPT_ARG_STRING, &optarg, 't', NULL, NULL }, { "udp-service", 'u', POPT_ARG_STRING, &optarg, 'u', NULL, NULL }, { "fwmark-service", 'f', POPT_ARG_STRING, &optarg, 'f', NULL, NULL }, { "scheduler", 's', POPT_ARG_STRING, &optarg, 's', NULL, NULL }, { "persistent", 'p', POPT_ARG_STRING|POPT_ARGFLAG_OPTIONAL, &optarg, 'p', NULL, NULL }, { "netmask", 'M', POPT_ARG_STRING, &optarg, 'M', NULL, NULL }, { "real-server", 'r', POPT_ARG_STRING, &optarg, 'r', NULL, NULL }, { "masquerading", 'm', POPT_ARG_NONE, NULL, 'm', NULL, NULL }, { "ipip", 'i', POPT_ARG_NONE, NULL, 'i', NULL, NULL }, { "gatewaying", 'g', POPT_ARG_NONE, NULL, 'g', NULL, NULL }, { "weight", 'w', POPT_ARG_STRING, &optarg, 'w', NULL, NULL }, { 
"u-threshold", 'x', POPT_ARG_STRING, &optarg, 'x', NULL, NULL }, { "l-threshold", 'y', POPT_ARG_STRING, &optarg, 'y', NULL, NULL }, { "numeric", 'n', POPT_ARG_NONE, NULL, 'n', NULL, NULL }, { "connection", 'c', POPT_ARG_NONE, NULL, 'c', NULL, NULL }, { "mcast-interface", '\0', POPT_ARG_STRING, &optarg, TAG_MCAST_INTERFACE, NULL, NULL }, { "syncid", '\0', POPT_ARG_STRING, &optarg, 'I', NULL, NULL }, { "timeout", '\0', POPT_ARG_NONE, NULL, TAG_TIMEOUT, NULL, NULL }, { "daemon", '\0', POPT_ARG_NONE, NULL, TAG_DAEMON, NULL, NULL }, { "stats", '\0', POPT_ARG_NONE, NULL, TAG_STATS, NULL, NULL }, { "rate", '\0', POPT_ARG_NONE, NULL, TAG_RATE, NULL, NULL }, { "thresholds", '\0', POPT_ARG_NONE, NULL, TAG_THRESHOLDS, NULL, NULL }, { "persistent-conn", '\0', POPT_ARG_NONE, NULL, TAG_PERSISTENTCONN, NULL, NULL }, { "nosort", '\0', POPT_ARG_NONE, NULL, TAG_NO_SORT, NULL, NULL }, { "sort", '\0', POPT_ARG_NONE, NULL, TAG_SORT, NULL, NULL }, { "exact", 'X', POPT_ARG_NONE, NULL, 'X', NULL, NULL }, { "ipv6", '6', POPT_ARG_NONE, NULL, '6', NULL, NULL }, { "ops", 'o', POPT_ARG_NONE, NULL, 'o', NULL, NULL }, { "pe", '\0', POPT_ARG_STRING, &optarg, TAG_PERSISTENCE_ENGINE, NULL, NULL }, { NULL, 0, 0, NULL, 0, NULL, NULL } }; context = poptGetContext("ipvsadm", argc, (const char **)argv, options_table, 0); if ((c = poptGetNextOpt(context)) < 0) tryhelp_exit(argv[0], -1); switch (c) { case 'A': set_command(&ce->cmd, CMD_ADD); break; case 'E': set_command(&ce->cmd, CMD_EDIT); break; case 'D': set_command(&ce->cmd, CMD_DEL); break; case 'a': set_command(&ce->cmd, CMD_ADDDEST); break; case 'e': set_command(&ce->cmd, CMD_EDITDEST); break; case 'd': set_command(&ce->cmd, CMD_DELDEST); break; case 'C': set_command(&ce->cmd, CMD_FLUSH); break; case 'L': case 'l': set_command(&ce->cmd, CMD_LIST); break; case 'Z': set_command(&ce->cmd, CMD_ZERO); break; case TAG_SET: set_command(&ce->cmd, CMD_TIMEOUT); break; case 'R': set_command(&ce->cmd, CMD_RESTORE); break; case 'S': set_command(&ce->cmd, 
CMD_SAVE); break; case TAG_START_DAEMON: set_command(&ce->cmd, CMD_STARTDAEMON); if (!strcmp(optarg, "master")) ce->daemon.state = IP_VS_STATE_MASTER; else if (!strcmp(optarg, "backup")) ce->daemon.state = IP_VS_STATE_BACKUP; else fail(2, "illegal start-daemon parameter specified"); break; case TAG_STOP_DAEMON: set_command(&ce->cmd, CMD_STOPDAEMON); if (!strcmp(optarg, "master")) ce->daemon.state = IP_VS_STATE_MASTER; else if (!strcmp(optarg, "backup")) ce->daemon.state = IP_VS_STATE_BACKUP; else fail(2, "illegal start_daemon specified"); break; case 'h': usage_exit(argv[0], 0); break; case 'v': version_exit(0); break; default: tryhelp_exit(argv[0], -1); } while ((c=poptGetNextOpt(context)) >= 0){ switch (c) { case 't': case 'u': set_option(options, OPT_SERVICE); ce->svc.protocol = (c=='t' ? IPPROTO_TCP : IPPROTO_UDP); parse = parse_service(optarg, &ce->svc); if (!(parse & SERVICE_ADDR)) fail(2, "illegal virtual server " "address[:port] specified"); break; case 'f': set_option(options, OPT_SERVICE); /* * Set protocol to a sane values, even * though it is not used */ ce->svc.af = AF_INET; ce->svc.protocol = IPPROTO_TCP; ce->svc.fwmark = parse_fwmark(optarg); break; case 's': set_option(options, OPT_SCHEDULER); strncpy(ce->svc.sched_name, optarg, IP_VS_SCHEDNAME_MAXLEN); break; case 'p': set_option(options, OPT_PERSISTENT); ce->svc.flags |= IP_VS_SVC_F_PERSISTENT; ce->svc.timeout = parse_timeout(optarg, 1, MAX_TIMEOUT); break; case 'M': set_option(options, OPT_NETMASK); if (ce->svc.af != AF_INET6) { parse = parse_netmask(optarg, &ce->svc.netmask); if (parse != 1) fail(2, "illegal virtual server " "persistent mask specified"); } else { ce->svc.netmask = atoi(optarg); if ((ce->svc.netmask < 1) || (ce->svc.netmask > 128)) fail(2, "illegal ipv6 netmask specified"); } break; case 'r': set_option(options, OPT_SERVER); ipvs_service_t t_dest = ce->svc; parse = parse_service(optarg, &t_dest); ce->dest.af = t_dest.af; ce->dest.addr = t_dest.addr; ce->dest.port = t_dest.port; 
if (!(parse & SERVICE_ADDR)) fail(2, "illegal real server " "address[:port] specified"); /* copy vport to dport if not specified */ if (parse == 1) ce->dest.port = ce->svc.port; break; case 'i': set_option(options, OPT_FORWARD); ce->dest.conn_flags = IP_VS_CONN_F_TUNNEL; break; case 'g': set_option(options, OPT_FORWARD); ce->dest.conn_flags = IP_VS_CONN_F_DROUTE; break; case 'm': set_option(options, OPT_FORWARD); ce->dest.conn_flags = IP_VS_CONN_F_MASQ; break; case 'w': set_option(options, OPT_WEIGHT); if ((ce->dest.weight = string_to_number(optarg, 0, 65535)) == -1) fail(2, "illegal weight specified"); break; case 'x': set_option(options, OPT_UTHRESHOLD); if ((ce->dest.u_threshold = string_to_number(optarg, 0, INT_MAX)) == -1) fail(2, "illegal u_threshold specified"); break; case 'y': set_option(options, OPT_LTHRESHOLD); if ((ce->dest.l_threshold = string_to_number(optarg, 0, INT_MAX)) == -1) fail(2, "illegal l_threshold specified"); break; case 'c': set_option(options, OPT_CONNECTION); break; case 'n': set_option(options, OPT_NUMERIC); *format |= FMT_NUMERIC; break; case TAG_MCAST_INTERFACE: set_option(options, OPT_MCAST); strncpy(ce->daemon.mcast_ifn, optarg, IP_VS_IFNAME_MAXLEN); break; case 'I': set_option(options, OPT_SYNCID); if ((ce->daemon.syncid = string_to_number(optarg, 0, 255)) == -1) fail(2, "illegal syncid specified"); break; case TAG_TIMEOUT: set_option(options, OPT_TIMEOUT); break; case TAG_DAEMON: set_option(options, OPT_DAEMON); break; case TAG_STATS: set_option(options, OPT_STATS); *format |= FMT_STATS; break; case TAG_RATE: set_option(options, OPT_RATE); *format |= FMT_RATE; break; case TAG_THRESHOLDS: set_option(options, OPT_THRESHOLDS); *format |= FMT_THRESHOLDS; break; case TAG_PERSISTENTCONN: set_option(options, OPT_PERSISTENTCONN); *format |= FMT_PERSISTENTCONN; break; case TAG_NO_SORT: set_option(options, OPT_NOSORT ); *format |= FMT_NOSORT; break; case TAG_SORT: /* Sort is the default, this is a no-op for compatibility */ break; case 
'X': set_option(options, OPT_EXACT); *format |= FMT_EXACT; break; case '6': if (ce->svc.fwmark) { ce->svc.af = AF_INET6; ce->svc.netmask = 128; } else { fail(2, "-6 used before -f\n"); } break; case 'o': set_option(options, OPT_ONEPACKET); ce->svc.flags |= IP_VS_SVC_F_ONEPACKET; break; case TAG_PERSISTENCE_ENGINE: set_option(options, OPT_PERSISTENCE_ENGINE); strncpy(ce->svc.pe_name, optarg, IP_VS_PENAME_MAXLEN); break; default: fail(2, "invalid option `%s'", poptBadOption(context, POPT_BADOPTION_NOALIAS)); } } if (c < -1) { /* an error occurred during option processing */ fprintf(stderr, "%s: %s\n", poptBadOption(context, POPT_BADOPTION_NOALIAS), poptStrerror(c)); poptFreeContext(context); return -1; } if (ce->cmd == CMD_TIMEOUT) { char *optarg1, *optarg2; if ((optarg=(char *)poptGetArg(context)) && (optarg1=(char *)poptGetArg(context)) && (optarg2=(char *)poptGetArg(context))) { ce->timeout.tcp_timeout = parse_timeout(optarg, 0, MAX_TIMEOUT); ce->timeout.tcp_fin_timeout = parse_timeout(optarg1, 0, MAX_TIMEOUT); ce->timeout.udp_timeout = parse_timeout(optarg2, 0, MAX_TIMEOUT); } else fail(2, "--set option requires 3 timeout values"); } if ((optarg=(char *)poptGetArg(context))) fail(2, "unexpected argument %s", optarg); poptFreeContext(context); return 0; }
Search the kernel source for where IP_VS_CONN_F_MASQ is handled.
http://kb.linuxvirtualserver.org/wiki/Ipvsadm#Compiling_ipvsadm
My main reference was the http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/ series, which is very detailed, along with yfydz's blog; many of the code comments below are reproduced directly from his posts. Let me state up front that what I wrote myself is mainly the analysis of the code's overall structure and flow.
I have been reading the IPVS code for quite a while now. The handling of ordinary applications is relatively simple.
For multi-connection protocols such as FTP (currently the only one IPVS supports), a lot of code goes into handling these affinity connections, using persistent connections and templates. A persistent connection means that, for a multi-connection service, all connections must be directed to the same server; the persistent flag is configured when the service is added with ipvsadm. A template means that for a multi-connection service a template connection is created and used as the template for the related connections (the related connections being those from the same source that iptables has marked; iptables can mark related connections by port number, which simplifies IPVS's processing). The related connections are then directed to the dest that the template points to, i.e. all related connections go to the same dest.
According to the LVS project site, LVS supports three load-balancing modes: NAT, tunneling (TUN) and direct routing (DR). NAT is the general-purpose mode, in which all traffic in both directions must pass through the balancer. The other two are half-connection schemes: requests pass through the balancer, while server replies are routed straight back to the client. The difference between those two is that TUN encapsulates in IP and is therefore routable, while DR rewrites the MAC address and therefore requires the same network segment.
[Main data structures]
This structure describes an IP protocol supported by IPVS. IPVS supports four IP-layer protocols: TCP, UDP, AH and ESP.
struct ip_vs_protocol {
    struct ip_vs_protocol *next;    /* next item in the list */
    char *name;                     /* protocol name: "TCP", "UDP", ... */
    __u16 protocol;                 /* protocol number */
    int dont_defrag;                /* do not defragment */
    atomic_t appcnt;                /* application counter: number of multi-connection apps on this protocol */
    int *timeout_table;             /* per-state timeout table */
    void (*init)(struct ip_vs_protocol *pp);    /* protocol initialization */
    void (*exit)(struct ip_vs_protocol *pp);    /* protocol release */
    /* connection scheduling */
    int (*conn_schedule)(struct sk_buff *skb, struct ip_vs_protocol *pp,
                         int *verdict, struct ip_vs_conn **cpp);
    /* look up the IPVS connection for the in direction */
    struct ip_vs_conn *
    (*conn_in_get)(const struct sk_buff *skb, struct ip_vs_protocol *pp,
                   const struct iphdr *iph, unsigned int proto_off, int inverse);
    /* look up the IPVS connection for the out direction */
    struct ip_vs_conn *
    (*conn_out_get)(const struct sk_buff *skb, struct ip_vs_protocol *pp,
                    const struct iphdr *iph, unsigned int proto_off, int inverse);
    /* source NAT */
    int (*snat_handler)(struct sk_buff **pskb, struct ip_vs_protocol *pp,
                        struct ip_vs_conn *cp);
    /* destination NAT */
    int (*dnat_handler)(struct sk_buff **pskb, struct ip_vs_protocol *pp,
                        struct ip_vs_conn *cp);
    /* protocol checksum computation */
    int (*csum_check)(struct sk_buff *skb, struct ip_vs_protocol *pp);
    /* name of the current protocol state: "LISTEN", "ESTABLISH", ... */
    const char *(*state_name)(int state);
    /* protocol state transition */
    int (*state_transition)(struct ip_vs_conn *cp, int direction,
                            const struct sk_buff *skb, struct ip_vs_protocol *pp);
    int (*register_app)(struct ip_vs_app *inc);     /* register an application */
    void (*unregister_app)(struct ip_vs_app *inc);  /* unregister an application */
    int (*app_conn_bind)(struct ip_vs_conn *cp);
    /* packet dump */
    void (*debug_packet)(struct ip_vs_protocol *pp, const struct sk_buff *skb,
                         int offset, const char *msg);
    /* adjust timeouts */
    void (*timeout_change)(struct ip_vs_protocol *pp, int flags);
    /* set the protocol timeout for a given state */
    int (*set_state_timeout)(struct ip_vs_protocol *pp, char *sname, int to);
};
This structure describes an IPVS connection; it is similar to the connections defined by netfilter.
struct ip_vs_conn {
    struct list_head c_list;        /* hash list */
    __u32 caddr;                    /* client address */
    __u32 vaddr;                    /* the server's externally visible virtual address */
    __u32 daddr;                    /* the server's real address */
    __u16 cport;                    /* client port */
    __u16 vport;                    /* the server's external virtual port */
    __u16 dport;                    /* the server's real port */
    __u16 protocol;                 /* protocol type */
    atomic_t refcnt;                /* connection reference count */
    struct timer_list timer;        /* timer */
    volatile unsigned long timeout; /* timeout value */
    spinlock_t lock;                /* lock for state transitions */
    volatile __u16 flags;           /* status flags */
    volatile __u16 state;           /* state info */
    struct ip_vs_conn *control;     /* master connection, e.g. for FTP */
    atomic_t n_control;             /* number of child connections */
    struct ip_vs_dest *dest;        /* the real server */
    atomic_t in_pkts;               /* incoming packet counter */
    /* packet transmission */
    int (*packet_xmit)(struct sk_buff *skb, struct ip_vs_conn *cp,
                       struct ip_vs_protocol *pp);
    struct ip_vs_app *app;          /* IPVS application */
    void *app_data;                 /* application private data */
    struct ip_vs_seq in_seq;        /* sequence numbers of incoming data */
    struct ip_vs_seq out_seq;       /* sequence numbers of outgoing data */
};
This structure describes the externally visible virtual server.
struct ip_vs_service {
    struct list_head s_list;        /* list hashed by protocol, address and port */
    struct list_head f_list;        /* list hashed by nfmark */
    atomic_t refcnt;                /* reference count */
    atomic_t usecnt;                /* use count */
    __u16 protocol;                 /* protocol */
    __u32 addr;                     /* virtual server address */
    __u16 port;                     /* virtual server port */
    __u32 fwmark;                   /* the skb's nfmark */
    unsigned flags;                 /* status flags */
    unsigned timeout;               /* timeout */
    __u32 netmask;                  /* network mask */
    struct list_head destinations;  /* list of real server addresses */
    __u32 num_dests;                /* number of real servers */
    struct ip_vs_stats stats;       /* service statistics */
    struct ip_vs_app *inc;          /* application */
    struct ip_vs_scheduler *scheduler;  /* scheduler pointer */
    rwlock_t sched_lock;            /* scheduler lock */
    void *sched_data;               /* scheduler private data */
};
This structure describes one concrete real server.
struct ip_vs_dest {
    struct list_head n_list;        /* for the dests in the service */
    struct list_head d_list;        /* for table with all the dests */
    __u32 addr;                     /* server address */
    __u16 port;                     /* server port */
    volatile unsigned flags;        /* dest status flags, may change */
    atomic_t conn_flags;            /* connection flags */
    atomic_t weight;                /* server weight */
    atomic_t refcnt;                /* reference count */
    struct ip_vs_stats stats;       /* statistics */
    atomic_t activeconns;           /* active connections */
    atomic_t inactconns;            /* inactive connections */
    atomic_t persistconns;          /* persistent (long-lived) connections */
    __u32 u_threshold;              /* upper connection threshold */
    __u32 l_threshold;              /* lower connection threshold */
    /* for destination cache */
    spinlock_t dst_lock;            /* lock of dst_cache */
    struct dst_entry *dst_cache;    /* destination cache entry */
    u32 dst_rtos;
    struct ip_vs_service *svc;      /* service it belongs to */
    __u16 protocol;                 /* which protocol (TCP/UDP) */
    __u32 vaddr;                    /* virtual IP address */
    __u16 vport;                    /* virtual port number */
    __u32 vfwmark;                  /* firewall mark of service */
};
This structure describes an IPVS scheduling algorithm; the current schedulers include rr, wrr, lc, wlc, lblc, lblcr, dh and sh.
struct ip_vs_scheduler {
    struct list_head n_list;        /* d-linked list head */
    char *name;                     /* scheduler name */
    atomic_t refcnt;                /* reference counter */
    struct module *module;          /* THIS_MODULE/NULL */
    /* scheduler initializing service */
    int (*init_service)(struct ip_vs_service *svc);
    /* scheduling service finish */
    int (*done_service)(struct ip_vs_service *svc);
    /* scheduler updating service */
    int (*update_service)(struct ip_vs_service *svc);
    /* selecting a server from the given service */
    struct ip_vs_dest *(*schedule)(struct ip_vs_service *svc,
                                   const struct sk_buff *skb);
};
IPVS applications target multi-connection protocols; currently only FTP is supported. Because ip_vs_app.c was carried over from 2.2, before the kernel had NAT of its own, it effectively implements application-protocol NAT itself, including rewriting payload content and adjusting TCP sequence and acknowledgment numbers. All of that is now done by netfilter, so IPVS no longer needs to care about it and could restrict itself to connection scheduling. The application support is also not very well modularized: when handling connection ports it still explicitly checks for FTPPORT, which means other multi-connection protocols are not supported. Like netfilter, it should register one helper per multi-connection protocol, invoked automatically, instead of testing ports inline.
struct ip_vs_app {
    struct list_head a_list;        /* to link into the application list */
    int type;                       /* IP_VS_APP_TYPE_xxx */
    char *name;                     /* application module name */
    __u16 protocol;                 /* protocol: TCP, UDP */
    struct module *module;          /* THIS_MODULE/NULL */
    struct list_head incs_list;     /* list of concrete instances of the app */
    /* members for application incarnations */
    struct list_head p_list;        /* links the app into the table of its protocol (TCP, UDP, ...) */
    struct ip_vs_app *app;          /* its real application */
    __u16 port;                     /* port number in net order */
    atomic_t usecnt;                /* usage counter */
    /* output hook: return false if can't linearize. diff set for TCP. */
    int (*pkt_out)(struct ip_vs_app *, struct ip_vs_conn *,
                   struct sk_buff **, int *diff);
    /* input hook: return false if can't linearize. diff set for TCP. */
    int (*pkt_in)(struct ip_vs_app *, struct ip_vs_conn *,
                  struct sk_buff **, int *diff);
    /* ip_vs_app initializer */
    int (*init_conn)(struct ip_vs_app *, struct ip_vs_conn *);
    /* ip_vs_app finish */
    int (*done_conn)(struct ip_vs_app *, struct ip_vs_conn *);
    /* not used now */
    int (*bind_conn)(struct ip_vs_app *, struct ip_vs_conn *,
                     struct ip_vs_protocol *);
    void (*unbind_conn)(struct ip_vs_app *, struct ip_vs_conn *);
    int *timeout_table;
    int *timeouts;
    int timeouts_size;
    int (*conn_schedule)(struct sk_buff *skb, struct ip_vs_app *app,
                         int *verdict, struct ip_vs_conn **cpp);
    struct ip_vs_conn *
    (*conn_in_get)(const struct sk_buff *skb, struct ip_vs_app *app,
                   const struct iphdr *iph, unsigned int proto_off, int inverse);
    struct ip_vs_conn *
    (*conn_out_get)(const struct sk_buff *skb, struct ip_vs_app *app,
                    const struct iphdr *iph, unsigned int proto_off, int inverse);
    int (*state_transition)(struct ip_vs_conn *cp, int direction,
                            const struct sk_buff *skb, struct ip_vs_app *app);
    void (*timeout_change)(struct ip_vs_app *app, int flags);
};
The user-space structures carry the information that ipvsadm passes to the kernel after parsing user input; the information is all quite direct, with no control data of its own. The relationship between ipvsadm and ipvs is the same as that between iptables and netfilter.
Virtual service information from user space:
struct ip_vs_service_user {
    /* virtual service addresses */
    u_int16_t protocol;
    u_int32_t addr;         /* virtual ip address */
    u_int16_t port;
    u_int32_t fwmark;       /* firwall mark of service */
    /* virtual service options */
    char sched_name[IP_VS_SCHEDNAME_MAXLEN];
    unsigned flags;         /* virtual service flags */
    unsigned timeout;       /* persistent timeout in sec */
    u_int32_t netmask;      /* persistent netmask */
};
Real server information from user space:
struct ip_vs_dest_user {
    /* destination server address */
    u_int32_t addr;
    u_int16_t port;
    /* real server options */
    unsigned conn_flags;    /* connection flags */
    int weight;             /* destination weight */
    /* thresholds for active connections */
    u_int32_t u_threshold;  /* upper threshold */
    u_int32_t l_threshold;  /* lower threshold */
};
Statistics information for user space:
struct ip_vs_stats_user {
    __u32 conns;            /* connections scheduled */
    __u32 inpkts;           /* incoming packets */
    __u32 outpkts;          /* outgoing packets */
    __u64 inbytes;          /* incoming bytes */
    __u64 outbytes;         /* outgoing bytes */
    __u32 cps;              /* current connection rate */
    __u32 inpps;            /* current in packet rate */
    __u32 outpps;           /* current out packet rate */
    __u32 inbps;            /* current in byte rate */
    __u32 outbps;           /* current out byte rate */
};
Info-query structure for user space:
struct ip_vs_getinfo {
    unsigned int version;       /* version number */
    unsigned int size;          /* size of connection hash table */
    unsigned int num_services;  /* number of virtual services */
};
Service rule entry for user space:
struct ip_vs_service_entry {
    /* which service: user fills in these */
    u_int16_t protocol;
    u_int32_t addr;         /* virtual address */
    u_int16_t port;
    u_int32_t fwmark;       /* firwall mark of service */
    /* service options */
    char sched_name[IP_VS_SCHEDNAME_MAXLEN];
    unsigned flags;         /* virtual service flags */
    unsigned timeout;       /* persistent timeout */
    u_int32_t netmask;      /* persistent netmask */
    unsigned int num_dests; /* number of real servers */
    struct ip_vs_stats_user stats;  /* statistics */
};
Real server entry for user space:
struct ip_vs_dest_entry {
    u_int32_t addr;         /* destination address */
    u_int16_t port;
    unsigned conn_flags;    /* connection flags */
    int weight;             /* destination weight */
    u_int32_t u_threshold;  /* upper threshold */
    u_int32_t l_threshold;  /* lower threshold */
    u_int32_t activeconns;  /* active connections */
    u_int32_t inactconns;   /* inactive connections */
    u_int32_t persistconns; /* persistent connections */
    struct ip_vs_stats_user stats;  /* statistics */
};
Structure for fetching a service's real servers from user space:
struct ip_vs_get_dests {
    /* which service: user fills in these */
    u_int16_t protocol;
    u_int32_t addr;         /* virtual address */
    u_int16_t port;
    u_int32_t fwmark;       /* firwall mark of service */
    unsigned int num_dests; /* number of real servers */
    struct ip_vs_dest_entry entrytable[0];  /* the real servers */
};
Structure for fetching the virtual services from user space:
struct ip_vs_get_services {
    unsigned int num_services;  /* number of virtual services */
    struct ip_vs_service_entry entrytable[0];   /* service table */
};
Timeout structure for user space:
struct ip_vs_timeout_user {
    int tcp_timeout;
    int tcp_fin_timeout;
    int udp_timeout;
};
Structure describing the IPVS kernel sync daemon for user space:
struct ip_vs_daemon_user {
    int state;              /* sync daemon state (master/backup) */
    char mcast_ifn[IP_VS_IFNAME_MAXLEN];    /* multicast interface name */
    int syncid;             /* SyncID we belong to */
};
[/Main data structures]
static int __init ip_vs_init(void)
{
    int ret;

    /* initialize the ipvs control interface: set/get sockopt operations */
    ret = ip_vs_control_init();
    if (ret < 0) {
        IP_VS_ERR("can't setup control.\n");
        goto cleanup_nothing;
    }

    /* protocol initialization */
    ip_vs_protocol_init();

    /* application-layer helper initialization */
    ret = ip_vs_app_init();
    if (ret < 0) {
        IP_VS_ERR("can't setup application helper.\n");
        goto cleanup_protocol;
    }

    /* main data structure initialization */
    ret = ip_vs_conn_init();
    if (ret < 0) {
        IP_VS_ERR("can't setup connection table.\n");
        goto cleanup_app;
    }

    /* Below, the individual processing points are hooked into the netfilter
     * framework; see the hook implementations further down. For background
     * on hook points, see the ip_conntrack implementation. */
    ret = nf_register_hook(&ip_vs_in_ops);
    if (ret < 0) {
        IP_VS_ERR("can't register in hook.\n");
        goto cleanup_conn;
    }
    ret = nf_register_hook(&ip_vs_out_ops);
    if (ret < 0) {
        IP_VS_ERR("can't register out hook.\n");
        goto cleanup_inops;
    }
    ret = nf_register_hook(&ip_vs_post_routing_ops);
    if (ret < 0) {
        IP_VS_ERR("can't register post_routing hook.\n");
        goto cleanup_outops;
    }
    ret = nf_register_hook(&ip_vs_forward_icmp_ops);
    if (ret < 0) {
        IP_VS_ERR("can't register forward_icmp hook.\n");
        goto cleanup_postroutingops;
    }

    IP_VS_INFO("ipvs loaded.\n");
    return ret;
    ......
}
Control interface initialization:
int ip_vs_control_init(void)
{
    int ret;
    int idx;

    /* Register ipvs's sockopt control so that user space can talk to ipvs
     * via setsockopt; see the control interface implementation below. */
    ret = nf_register_sockopt(&ip_vs_sockopts);
    if (ret) {
        IP_VS_ERR("cannot register sockopt.\n");
        return ret;
    }

    /* create the read-only /proc/net/ip_vs and /proc/net/ip_vs_stats entries;
     * see the control interface implementation below */
    proc_net_fops_create("ip_vs", 0, &ip_vs_info_fops);
    proc_net_fops_create("ip_vs_stats", 0, &ip_vs_stats_fops);

    /* create the read/write control parameters under /proc/sys/net/ipv4/vs */
    sysctl_header = register_sysctl_table(vs_root_table, 0);

    /* Initialize the doubly linked lists.
     * svc_table is the hash table for looking up struct ip_vs_service by
     * protocol, address, port, etc.
     * svc_fwm_table is the hash table for looking up struct ip_vs_service
     * by the packet's nfmark. */
    for (idx = 0; idx < IP_VS_SVC_TAB_SIZE; idx++) {
        INIT_LIST_HEAD(&ip_vs_svc_table[idx]);
        INIT_LIST_HEAD(&ip_vs_svc_fwm_table[idx]);
    }
    /* rtable is the hash list of struct ip_vs_dest destinations */
    for (idx = 0; idx < IP_VS_RTAB_SIZE; idx++) {
        INIT_LIST_HEAD(&ip_vs_rtable[idx]);
    }

    /* ipvs statistics */
    memset(&ip_vs_stats, 0, sizeof(ip_vs_stats));
    spin_lock_init(&ip_vs_stats.lock);  /* statistics lock */

    /* set up an estimator over the current statistics; it can be used to
     * compute server performance parameters */
    ip_vs_new_estimator(&ip_vs_stats);

    /* hook up a periodic job that adjusts system parameters according to
     * the current system load */
    schedule_delayed_work(&defense_work, DEFENSE_TIMER_PERIOD);

    return 0;
}
Protocol initialization; see the per-protocol implementations below:
int ip_vs_protocol_init(void)
{
    /* register the protocols ipvs can balance; currently TCP/UDP/AH/ESP */
    char protocols[64];
#define REGISTER_PROTOCOL(p)                    \
    do {                                        \
        register_ip_vs_protocol(p);             \
        strcat(protocols, ", ");                \
        strcat(protocols, (p)->name);           \
    } while (0)

    /* characters 0 and 1 are reserved for ", " */
    protocols[0] = '\0';
    protocols[2] = '\0';
#ifdef CONFIG_IP_VS_PROTO_TCP
    REGISTER_PROTOCOL(&ip_vs_protocol_tcp);
#endif
#ifdef CONFIG_IP_VS_PROTO_UDP
    REGISTER_PROTOCOL(&ip_vs_protocol_udp);
#endif
#ifdef CONFIG_IP_VS_PROTO_AH
    REGISTER_PROTOCOL(&ip_vs_protocol_ah);
#endif
#ifdef CONFIG_IP_VS_PROTO_ESP
    REGISTER_PROTOCOL(&ip_vs_protocol_esp);
#endif
    IP_VS_INFO("Registered protocols (%s)\n", &protocols[2]);

    return 0;
}
#define IP_VS_PROTO_TAB_SIZE 32
static int register_ip_vs_protocol(struct ip_vs_protocol *pp)
{
    /* #define IP_VS_PROTO_HASH(proto) ((proto) & (IP_VS_PROTO_TAB_SIZE-1)) */
    unsigned hash = IP_VS_PROTO_HASH(pp->protocol);     /* compute a hash value */

    pp->next = ip_vs_proto_table[hash];
    ip_vs_proto_table[hash] = pp;

    if (pp->init != NULL)
        pp->init(pp);

    return 0;
}
Application-layer helper initialization:
int ip_vs_app_init(void)
{
    /* create a /proc/net/ip_vs_app entry */
    proc_net_fops_create("ip_vs_app", 0, &ip_vs_app_fops);
    return 0;
}
Main data structure initialization:
int ip_vs_conn_init(void)
{
    int idx;

    /* the ipvs connection hash table */
    static struct list_head *ip_vs_conn_tab;
    ip_vs_conn_tab = vmalloc(IP_VS_CONN_TAB_SIZE * sizeof(struct list_head));
    if (!ip_vs_conn_tab)
        return -ENOMEM;

    /* the ipvs connection slab cache */
    ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
                                          sizeof(struct ip_vs_conn), 0,
                                          SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (!ip_vs_conn_cachep) {
        vfree(ip_vs_conn_tab);
        return -ENOMEM;
    }

    /* initialize the hash list heads */
    for (idx = 0; idx < IP_VS_CONN_TAB_SIZE; idx++) {
        INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
    }

    /* initialize the read/write locks */
    for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++) {
        rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
    }

    /* create the /proc/net/ip_vs_conn entry */
    proc_net_fops_create("ip_vs_conn", 0, &ip_vs_conn_fops);

    /* initialize the random value */
    get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));

    return 0;
}
2013-05-07 20:44:15
NAT mode is the most commonly used IPVS mode. Compared with the TUN and DR modes, NAT is easier to deploy: only the default gateway of the real servers needs to be changed.
IPVS is implemented on top of Netfilter. It registers four Netfilter hook functions; the two relevant to NAT mode are ip_vs_in and ip_vs_out. The former handles client -> server packets, the latter server -> client packets.
The processing flow of ip_vs_in is very clear:
pre-checks -> look up or create the ip_vs_conn object -> update statistics -> update the state of the ip_vs_conn object -> modify the sk_buff and forward the packet -> IPVS state synchronization -> done
The pre-checks are simple, mainly:
Since a packet has arrived, there must be a corresponding ip_vs_conn object. First the current ip_vs_conn table is searched by <source address, source port, destination address, destination port>. If nothing is found, this is a new connection, so the scheduler is run to pick a suitable real server and a new ip_vs_conn object is created. The scheduling process is not expanded on here.
First look at the ip_vs_stats structure; the purpose of each member is easy to see:
[code listing omitted]
This is a structure dedicated to statistics. Every ip_vs_service object and every ip_vs_dest object contains one, and there is also a global variable ip_vs_stats.
The functions ip_vs_in_stats and ip_vs_out_stats account for packet traffic in the two directions; ip_vs_conn_stats counts new connections.
[code listing omitted]
conns, inpkts, outpkts, inbytes and outbytes are easy to maintain: a simple increment. But rates such as cps are more involved; they are computed with a kernel timer. Each ip_vs_stats object has a corresponding ip_vs_estimator structure:
[code listing omitted]
All ip_vs_estimator structures are linked into one list, which can be traversed through the global variable est_list. The timer interval is 2 seconds, and its trigger function is:
[code listing omitted]
Taking the cps statistic as an example, the computation is simple: cps = (rate + cps)/2, where the unit is 2^10 (fixed point).
In TCP packet processing, each ip_vs_conn object corresponds to one TCP connection, so there must also be a state transition process to guide the TCP connection through normal setup and teardown. These state transitions are rather involved; later content combines IN and OUT to walk through the TCP connection state transitions.
Packet forwarding in NAT mode is done by the ip_vs_nat_xmit function. (I am not familiar with the sk_buff manipulation involved, so it is skipped here.)
First it is decided whether this ip_vs_conn object needs master/backup synchronization: the current IPVS instance must be the MASTER, and the state of this ip_vs_conn object must be ESTABLISHED. Moreover, even when these conditions hold, a sync is not performed for every forwarded packet, but only once every 50 packets.
The synchronization itself is performed by the function ip_vs_sync_conn:
[code listing omitted]
The ip_vs_out function handles server -> client packets. Compared with ip_vs_in it is much simpler, and it is not described further here.
IPVS supports four protocols: TCP, UDP, AH and ESP. Since the TCP protocol logic is relatively complex, IPVS also does the most protocol-specific work for TCP.
IPVS's TCP-specific processing mainly shows up in TCP state tracking, which relies on a state transition matrix:
[code listing omitted]
The INPUT and OUTPUT tables are the ones relevant to NAT mode; their meaning is fairly easy to understand:
The state transition matrix can also be represented as a state transition diagram: