neutron-lbaas-lvs-ipvsadm+ip_vs

LVS is the kernel code (ip_vs) plus a user-space controller (ipvsadm).

The underlying implementation of neutron-lbaas-lvs is ipvsadm.

Download the ipvsadm client source; the ipvsadm-1.26 tree describes the relevant configuration knobs:

echo "1" > /proc/sys/net/ipv4/ip_forward

cat /proc/sys/net/ipv4/vs/amemthresh

/proc/sys/net/ipv4/vs/timeout_*

cat /proc/sys/net/ipv4/vs/snat_reroute

 

Key references:

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.what_is_an_LVS.html

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.ipvsadm.html

http://ja.ssi.bg/

http://www.tldp.org/LDP/nag2/

http://www.linuxvirtualserver.org/software/ipvs.html

http://kb.linuxvirtualserver.org/wiki/Ipvsadm#Compiling_ipvsadm

 

# ll /proc/net/ip_vs*

ip_vs
ip_vs_app
ip_vs_conn
ip_vs_conn_sync
ip_vs_stats
ip_vs_stats_percpu

 

# cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
UDP 0A212E47:22B8 rr
-> 0A212E04:1E61 Masq 10000 0 0

 

 

# ipvsadm -S
-A -u host-10-33-46-71.openstacklocal:ddi-udp-1 -s rr
-a -u host-10-33-46-71.openstacklocal:ddi-udp-1 -r host-10-33-46-4.openstacklocal:cbt -m -w 10000

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
UDP host-10-33-46-71.openstacklo rr
-> host-10-33-46-4.openstackloc Masq 10000 0 0

 

General ipvsadm model: CIP <--> VIP -- DIP <--> RIP

Direct Routing: packets are routed directly to the real server

Director: the load-balancing layer (Load Balancer)

RS (Real Server): the machine that actually provides the service

VIP, Virtual IP (VIP) address: the IP address the Director exposes to clients for the service

RIP, Real IP (RIP) address: the IP address used by a cluster node (a back-end server that actually provides the service)

DIP, Director's IP (DIP) address: the address the Director uses to talk to the RIPs

CIP, Client computer's IP (CIP) address: the public IP used by the client

ipvsadm -A|E -t|u|f service-address [-s scheduler] [-p [timeout]] [-M netmask] [--pe persistence_engine] [-b sched-flags]

LVS forwarding types:

 --gatewaying   -g                   gatewaying (direct routing) (default)

  --ipip         -i                   ipip encapsulation (tunneling)

  --masquerading -m                   masquerading (NAT)

 --scheduler    -s scheduler         one of rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq, the default scheduler is wlc.

1. Fixed (static) scheduling methods

(1) RR     round robin

(2) WRR    weighted round robin

(3) DH     destination-address hashing

(4) SH     source-address hashing

2. Dynamic scheduling methods

(1) LC     least connection

(2) WLC    weighted least connection

(3) SED    shortest expected delay

(4) NQ     never queue

(5) LBLC   locality-based least connection

(6) LBLCR  locality-based least connection with replication

Example application, typical configuration

NIC configuration:

Director: one internal NIC, one external NIC

   external eth0 : 10.19.172.188

   external eth0:0 (vip): 10.19.172.184

   internal eth1 : 192.168.1.1

RS (real servers), internal NIC only

   eth0: 192.168.1.2

          192.168.1.3

          192.168.1.4

          gateway: 192.168.1.1

Script to configure the IPVS table:
VIP=192.168.34.41
RIP1=192.168.34.27
RIP2=192.168.34.26
GW=192.168.34.1
# flush the IPVS table
ipvsadm -C
# set up the IPVS table
ipvsadm -A -t $VIP:443 -s wlc
ipvsadm -a -t $VIP:443 -r $RIP1:443 -g -w 1
ipvsadm -a -t $VIP:443 -r $RIP2:443 -g -w 1
# save the IPVS table to /etc/sysconfig/ipvsadm (via the init script in /etc/rc.d/init.d/)
service ipvsadm save
# start the ipvsadm service
service ipvsadm start
# show IPVS status
ipvsadm -l

 

Configuring the cluster on the Director and adding the RSes:

# ipvsadm -A -t 172.16.251.184:80 -s sh

# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.2 -m

# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.3 -m

# ipvsadm -a -t 172.16.251.184:80 -r 192.168.1.4 -m  

One advantage LVS has over HAProxy is client transparency: the real servers see the client's source address, so no SNAT is required.

To inspect IPVS details, look at ip_vs, ip_vs_app, ip_vs_conn, ip_vs_conn_sync, ip_vs_ext_stats and ip_vs_stats under /proc/net.

ipvs-related proc files:

/proc/net/ip_vs : the IPVS rule table

/proc/net/ip_vs_app : IPVS application protocols

/proc/net/ip_vs_conn : current IPVS connections

/proc/net/ip_vs_stats : IPVS statistics

# depmod -n|grep ipvs
kernel/net/netfilter/xt_ipvs.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs.ko.xz: kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_rr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_wrr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_wlc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lblc.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_lblcr.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_dh.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_sh.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_sed.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_nq.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_ftp.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_nat.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
kernel/net/netfilter/ipvs/ip_vs_pe_sip.ko.xz: kernel/net/netfilter/ipvs/ip_vs.ko.xz kernel/net/netfilter/nf_conntrack_sip.ko.xz kernel/net/netfilter/nf_conntrack.ko.xz kernel/lib/libcrc32c.ko.xz
alias ip6t_ipvs xt_ipvs
alias ipt_ipvs xt_ipvs

# modinfo ip_vs
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/kernel/net/netfilter/ipvs/ip_vs.ko.xz
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 69C7A8962537C8009A78FBC
depends: nf_conntrack,libcrc32c
intree: Y
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: C0:02:A7:AD:CC:7C:84:36:68:A1:BC:B4:97:4C:1A:30:2D:FF:EA:35
sig_hashalgo: sha256
parm: conn_tab_bits:Set connections' hash size (int)

# modinfo iptable_nat
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/kernel/net/ipv4/netfilter/iptable_nat.ko.xz
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 291B36E315928812DAB1A47
depends: ip_tables,nf_nat_ipv4
intree: Y
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: C0:02:A7:AD:CC:7C:84:36:68:A1:BC:B4:97:4C:1A:30:2D:FF:EA:35
sig_hashalgo: sha256

Source: http://www.linuxvirtualserver.org/software/ipvs.html

ipvsadm.sh:

# config: /etc/sysconfig/ipvsadm

# config: /etc/ipvsadm.rules

ipvsadm.c

int main(int argc, char **argv)
{
    int result;

    if (ipvs_init()) {
        /* try to insmod the ip_vs module if ipvs_init failed */
        if (modprobe_ipvs() || ipvs_init())
            fail(2, "Can't initialize ipvs: %s\n"
                "Are you sure that IP Virtual Server is "
                "built in the kernel or as module?",
                 ipvs_strerror(errno));
    }

    /* warn the user if the IPVS version is out of date */
    check_ipvs_version();

    /* list the table if there is no other argument */
    if (argc == 1){
        list_all(FMT_NONE);
        ipvs_close();
        return 0;
    }

    /* process command line arguments */
    result = process_options(argc, argv, 0);

    ipvs_close();
    return result;
}
static int
parse_options(int argc, char **argv, struct ipvs_command_entry *ce,
          unsigned int *options, unsigned int *format)
{
    int c, parse;
    poptContext context;
    char *optarg=NULL;
    struct poptOption options_table[] = {
        { "add-service", 'A', POPT_ARG_NONE, NULL, 'A', NULL, NULL },
        { "edit-service", 'E', POPT_ARG_NONE, NULL, 'E', NULL, NULL },
        { "delete-service", 'D', POPT_ARG_NONE, NULL, 'D', NULL, NULL },
        { "clear", 'C', POPT_ARG_NONE, NULL, 'C', NULL, NULL },
        { "list", 'L', POPT_ARG_NONE, NULL, 'L', NULL, NULL },
        { "list", 'l', POPT_ARG_NONE, NULL, 'l', NULL, NULL },
        { "zero", 'Z', POPT_ARG_NONE, NULL, 'Z', NULL, NULL },
        { "add-server", 'a', POPT_ARG_NONE, NULL, 'a', NULL, NULL },
        { "edit-server", 'e', POPT_ARG_NONE, NULL, 'e', NULL, NULL },
        { "delete-server", 'd', POPT_ARG_NONE, NULL, 'd', NULL, NULL },
        { "set", '\0', POPT_ARG_NONE, NULL, TAG_SET, NULL, NULL },
        { "help", 'h', POPT_ARG_NONE, NULL, 'h', NULL, NULL },
        { "version", 'v', POPT_ARG_NONE, NULL, 'v', NULL, NULL },
        { "restore", 'R', POPT_ARG_NONE, NULL, 'R', NULL, NULL },
        { "save", 'S', POPT_ARG_NONE, NULL, 'S', NULL, NULL },
        { "start-daemon", '\0', POPT_ARG_STRING, &optarg,
          TAG_START_DAEMON, NULL, NULL },
        { "stop-daemon", '\0', POPT_ARG_STRING, &optarg,
          TAG_STOP_DAEMON, NULL, NULL },
        { "tcp-service", 't', POPT_ARG_STRING, &optarg, 't',
          NULL, NULL },
        { "udp-service", 'u', POPT_ARG_STRING, &optarg, 'u',
          NULL, NULL },
        { "fwmark-service", 'f', POPT_ARG_STRING, &optarg, 'f',
          NULL, NULL },
        { "scheduler", 's', POPT_ARG_STRING, &optarg, 's', NULL, NULL },
        { "persistent", 'p', POPT_ARG_STRING|POPT_ARGFLAG_OPTIONAL,
         &optarg, 'p', NULL, NULL },
        { "netmask", 'M', POPT_ARG_STRING, &optarg, 'M', NULL, NULL },
        { "real-server", 'r', POPT_ARG_STRING, &optarg, 'r',
          NULL, NULL },
        { "masquerading", 'm', POPT_ARG_NONE, NULL, 'm', NULL, NULL },
        { "ipip", 'i', POPT_ARG_NONE, NULL, 'i', NULL, NULL },
        { "gatewaying", 'g', POPT_ARG_NONE, NULL, 'g', NULL, NULL },
        { "weight", 'w', POPT_ARG_STRING, &optarg, 'w', NULL, NULL },
        { "u-threshold", 'x', POPT_ARG_STRING, &optarg, 'x',
          NULL, NULL },
        { "l-threshold", 'y', POPT_ARG_STRING, &optarg, 'y',
          NULL, NULL },
        { "numeric", 'n', POPT_ARG_NONE, NULL, 'n', NULL, NULL },
        { "connection", 'c', POPT_ARG_NONE, NULL, 'c', NULL, NULL },
        { "mcast-interface", '\0', POPT_ARG_STRING, &optarg,
          TAG_MCAST_INTERFACE, NULL, NULL },
        { "syncid", '\0', POPT_ARG_STRING, &optarg, 'I', NULL, NULL },
        { "timeout", '\0', POPT_ARG_NONE, NULL, TAG_TIMEOUT,
          NULL, NULL },
        { "daemon", '\0', POPT_ARG_NONE, NULL, TAG_DAEMON, NULL, NULL },
        { "stats", '\0', POPT_ARG_NONE, NULL, TAG_STATS, NULL, NULL },
        { "rate", '\0', POPT_ARG_NONE, NULL, TAG_RATE, NULL, NULL },
        { "thresholds", '\0', POPT_ARG_NONE, NULL,
           TAG_THRESHOLDS, NULL, NULL },
        { "persistent-conn", '\0', POPT_ARG_NONE, NULL,
          TAG_PERSISTENTCONN, NULL, NULL },
        { "nosort", '\0', POPT_ARG_NONE, NULL,
           TAG_NO_SORT, NULL, NULL },
        { "sort", '\0', POPT_ARG_NONE, NULL, TAG_SORT, NULL, NULL },
        { "exact", 'X', POPT_ARG_NONE, NULL, 'X', NULL, NULL },
        { "ipv6", '6', POPT_ARG_NONE, NULL, '6', NULL, NULL },
        { "ops", 'o', POPT_ARG_NONE, NULL, 'o', NULL, NULL },
        { "pe", '\0', POPT_ARG_STRING, &optarg, TAG_PERSISTENCE_ENGINE,
          NULL, NULL },
        { NULL, 0, 0, NULL, 0, NULL, NULL }
    };

    context = poptGetContext("ipvsadm", argc, (const char **)argv,
                 options_table, 0);

    if ((c = poptGetNextOpt(context)) < 0)
        tryhelp_exit(argv[0], -1);

    switch (c) {
    case 'A':
        set_command(&ce->cmd, CMD_ADD);
        break;
    case 'E':
        set_command(&ce->cmd, CMD_EDIT);
        break;
    case 'D':
        set_command(&ce->cmd, CMD_DEL);
        break;
    case 'a':
        set_command(&ce->cmd, CMD_ADDDEST);
        break;
    case 'e':
        set_command(&ce->cmd, CMD_EDITDEST);
        break;
    case 'd':
        set_command(&ce->cmd, CMD_DELDEST);
        break;
    case 'C':
        set_command(&ce->cmd, CMD_FLUSH);
        break;
    case 'L':
    case 'l':
        set_command(&ce->cmd, CMD_LIST);
        break;
    case 'Z':
        set_command(&ce->cmd, CMD_ZERO);
        break;
    case TAG_SET:
        set_command(&ce->cmd, CMD_TIMEOUT);
        break;
    case 'R':
        set_command(&ce->cmd, CMD_RESTORE);
        break;
    case 'S':
        set_command(&ce->cmd, CMD_SAVE);
        break;
    case TAG_START_DAEMON:
        set_command(&ce->cmd, CMD_STARTDAEMON);
        if (!strcmp(optarg, "master"))
            ce->daemon.state = IP_VS_STATE_MASTER;
        else if (!strcmp(optarg, "backup"))
            ce->daemon.state = IP_VS_STATE_BACKUP;
        else fail(2, "illegal start-daemon parameter specified");
        break;
    case TAG_STOP_DAEMON:
        set_command(&ce->cmd, CMD_STOPDAEMON);
        if (!strcmp(optarg, "master"))
            ce->daemon.state = IP_VS_STATE_MASTER;
        else if (!strcmp(optarg, "backup"))
            ce->daemon.state = IP_VS_STATE_BACKUP;
        else fail(2, "illegal start_daemon specified");
        break;
    case 'h':
        usage_exit(argv[0], 0);
        break;
    case 'v':
        version_exit(0);
        break;
    default:
        tryhelp_exit(argv[0], -1);
    }

    while ((c=poptGetNextOpt(context)) >= 0){
        switch (c) {
        case 't':
        case 'u':
            set_option(options, OPT_SERVICE);
            ce->svc.protocol =
                (c=='t' ? IPPROTO_TCP : IPPROTO_UDP);
            parse = parse_service(optarg, &ce->svc);
            if (!(parse & SERVICE_ADDR))
                fail(2, "illegal virtual server "
                     "address[:port] specified");
            break;
        case 'f':
            set_option(options, OPT_SERVICE);
            /*
             * Set protocol to a sane values, even
             * though it is not used
             */
            ce->svc.af = AF_INET;
            ce->svc.protocol = IPPROTO_TCP;
            ce->svc.fwmark = parse_fwmark(optarg);
            break;
        case 's':
            set_option(options, OPT_SCHEDULER);
            strncpy(ce->svc.sched_name,
                optarg, IP_VS_SCHEDNAME_MAXLEN);
            break;
        case 'p':
            set_option(options, OPT_PERSISTENT);
            ce->svc.flags |= IP_VS_SVC_F_PERSISTENT;
            ce->svc.timeout =
                parse_timeout(optarg, 1, MAX_TIMEOUT);
            break;
        case 'M':
            set_option(options, OPT_NETMASK);
            if (ce->svc.af != AF_INET6) {
                parse = parse_netmask(optarg, &ce->svc.netmask);
                if (parse != 1)
                    fail(2, "illegal virtual server "
                         "persistent mask specified");
            } else {
                ce->svc.netmask = atoi(optarg);
                if ((ce->svc.netmask < 1) || (ce->svc.netmask > 128))
                    fail(2, "illegal ipv6 netmask specified");
            }
            break;
        case 'r':
            set_option(options, OPT_SERVER);
            ipvs_service_t t_dest = ce->svc;
            parse = parse_service(optarg, &t_dest);
            ce->dest.af = t_dest.af;
            ce->dest.addr = t_dest.addr;
            ce->dest.port = t_dest.port;
            if (!(parse & SERVICE_ADDR))
                fail(2, "illegal real server "
                     "address[:port] specified");
            /* copy vport to dport if not specified */
            if (parse == 1)
                ce->dest.port = ce->svc.port;
            break;
        case 'i':
            set_option(options, OPT_FORWARD);
            ce->dest.conn_flags = IP_VS_CONN_F_TUNNEL;
            break;
        case 'g':
            set_option(options, OPT_FORWARD);
            ce->dest.conn_flags = IP_VS_CONN_F_DROUTE;
            break;
        case 'm':
            set_option(options, OPT_FORWARD);
            ce->dest.conn_flags = IP_VS_CONN_F_MASQ;
            break;
        case 'w':
            set_option(options, OPT_WEIGHT);
            if ((ce->dest.weight =
                 string_to_number(optarg, 0, 65535)) == -1)
                fail(2, "illegal weight specified");
            break;
        case 'x':
            set_option(options, OPT_UTHRESHOLD);
            if ((ce->dest.u_threshold =
                 string_to_number(optarg, 0, INT_MAX)) == -1)
                fail(2, "illegal u_threshold specified");
            break;
        case 'y':
            set_option(options, OPT_LTHRESHOLD);
            if ((ce->dest.l_threshold =
                 string_to_number(optarg, 0, INT_MAX)) == -1)
                fail(2, "illegal l_threshold specified");
            break;
        case 'c':
            set_option(options, OPT_CONNECTION);
            break;
        case 'n':
            set_option(options, OPT_NUMERIC);
            *format |= FMT_NUMERIC;
            break;
        case TAG_MCAST_INTERFACE:
            set_option(options, OPT_MCAST);
            strncpy(ce->daemon.mcast_ifn,
                optarg, IP_VS_IFNAME_MAXLEN);
            break;
        case 'I':
            set_option(options, OPT_SYNCID);
            if ((ce->daemon.syncid =
                 string_to_number(optarg, 0, 255)) == -1)
                fail(2, "illegal syncid specified");
            break;
        case TAG_TIMEOUT:
            set_option(options, OPT_TIMEOUT);
            break;
        case TAG_DAEMON:
            set_option(options, OPT_DAEMON);
            break;
        case TAG_STATS:
            set_option(options, OPT_STATS);
            *format |= FMT_STATS;
            break;
        case TAG_RATE:
            set_option(options, OPT_RATE);
            *format |= FMT_RATE;
            break;
        case TAG_THRESHOLDS:
            set_option(options, OPT_THRESHOLDS);
            *format |= FMT_THRESHOLDS;
            break;
        case TAG_PERSISTENTCONN:
            set_option(options, OPT_PERSISTENTCONN);
            *format |= FMT_PERSISTENTCONN;
            break;
        case TAG_NO_SORT:
            set_option(options, OPT_NOSORT);
            *format |= FMT_NOSORT;
            break;
        case TAG_SORT:
            /* Sort is the default, this is a no-op for compatibility */
            break;
        case 'X':
            set_option(options, OPT_EXACT);
            *format |= FMT_EXACT;
            break;
        case '6':
            if (ce->svc.fwmark) {
                ce->svc.af = AF_INET6;
                ce->svc.netmask = 128;
            } else {
                fail(2, "-6 used before -f\n");
            }
            break;
        case 'o':
            set_option(options, OPT_ONEPACKET);
            ce->svc.flags |= IP_VS_SVC_F_ONEPACKET;
            break;
        case TAG_PERSISTENCE_ENGINE:
            set_option(options, OPT_PERSISTENCE_ENGINE);
            strncpy(ce->svc.pe_name, optarg, IP_VS_PENAME_MAXLEN);
            break;
        default:
            fail(2, "invalid option `%s'",
                 poptBadOption(context, POPT_BADOPTION_NOALIAS));
        }
    }

    if (c < -1) {
        /* an error occurred during option processing */
        fprintf(stderr, "%s: %s\n",
            poptBadOption(context, POPT_BADOPTION_NOALIAS),
            poptStrerror(c));
        poptFreeContext(context);
        return -1;
    }

    if (ce->cmd == CMD_TIMEOUT) {
        char *optarg1, *optarg2;

        if ((optarg=(char *)poptGetArg(context))
            && (optarg1=(char *)poptGetArg(context))
            && (optarg2=(char *)poptGetArg(context))) {
            ce->timeout.tcp_timeout =
                parse_timeout(optarg, 0, MAX_TIMEOUT);
            ce->timeout.tcp_fin_timeout =
                parse_timeout(optarg1, 0, MAX_TIMEOUT);
            ce->timeout.udp_timeout =
                parse_timeout(optarg2, 0, MAX_TIMEOUT);
        } else
            fail(2, "--set option requires 3 timeout values");
    }

    if ((optarg=(char *)poptGetArg(context)))
        fail(2, "unexpected argument %s", optarg);

    poptFreeContext(context);

    return 0;
}
Search the kernel code for where IP_VS_CONN_F_MASQ is handled.

http://kb.linuxvirtualserver.org/wiki/Ipvsadm#Compiling_ipvsadm

IPVS source code analysis: overview and initialization

 
The main reference is the series at http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/, which is very detailed, together with yfydz's blog; many of the code comments below are reproduced directly from his posts. What I wrote myself is mainly the analysis of the overall structure and flow of the code.
 
  
I have spent quite a while reading the IPVS code. The handling of ordinary applications is relatively simple.
For multi-connection protocols such as FTP (currently the only one IPVS supports), IPVS devotes a fair amount of code to handling these affinity connections, using persistent connections and templates. A persistent connection means that for a multi-connection service all connections must be directed to the same server; the persistent flag is set when the service is added with ipvsadm. A template means that for a multi-connection service a template connection is created and used as the template for the other connections (here "other connections" are those from the same source that iptables has marked; iptables can mark related connections by port number, which simplifies IPVS's handling). The other connections then follow the dest recorded in the template, so all related connections end up on the same dest.
 
  
 
  
According to the LVS project site, LVS supports three load-balancing modes: NAT, tunneling (TUN) and direct routing (DR).
NAT is the general-purpose mode: all traffic in both directions must pass through the balancer. The other two are half-connection modes: requests pass through the balancer, but the servers' replies are routed directly back to the client.
The difference between those two is that in TUN mode the packet is IP-encapsulated and therefore routable, while DR works by rewriting the MAC address, so the Director and servers must be on the same network segment.
[Key data structures]
This structure describes the IP protocols supported by IPVS. IPVS supports four IP-layer protocols: TCP, UDP, AH and ESP.
struct ip_vs_protocol {
        //next entry in the protocol list
        struct ip_vs_protocol   *next;
        //protocol name: "TCP", "UDP", ...
        char                    *name;
        //protocol number
        __u16                   protocol;
        //do not defragment
        int                     dont_defrag;
        //application counter: the number of multi-connection applications on this protocol
        atomic_t                appcnt;
        //timeout table for the protocol states
        int                     *timeout_table;

        void (*init)(struct ip_vs_protocol *pp);  //protocol initialization
        void (*exit)(struct ip_vs_protocol *pp); //protocol teardown
        int (*conn_schedule)(struct sk_buff *skb, struct ip_vs_protocol *pp, int *verdict, struct ip_vs_conn **cpp); //connection scheduling
        //look up the IPVS connection for the "in" direction
        struct ip_vs_conn * (*conn_in_get)(const struct sk_buff *skb, struct ip_vs_protocol *pp,
                       const struct iphdr *iph, unsigned int proto_off, int inverse);
        //look up the IPVS connection for the "out" direction
        struct ip_vs_conn * (*conn_out_get)(const struct sk_buff *skb, struct ip_vs_protocol *pp,
                        const struct iphdr *iph, unsigned int proto_off, int inverse);
        //source NAT
        int (*snat_handler)(struct sk_buff **pskb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp);
        //destination NAT
        int (*dnat_handler)(struct sk_buff **pskb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp);
        //protocol checksum computation
        int (*csum_check)(struct sk_buff *skb, struct ip_vs_protocol *pp);
        //name of the current protocol state, e.g. "LISTEN", "ESTABLISH"
        const char *(*state_name)(int state);
        //protocol state transition
        int (*state_transition)(struct ip_vs_conn *cp, int direction, const struct sk_buff *skb, struct ip_vs_protocol *pp);
        //register an application
        int (*register_app)(struct ip_vs_app *inc);
        //unregister an application
        void (*unregister_app)(struct ip_vs_app *inc);

        int (*app_conn_bind)(struct ip_vs_conn *cp);
        //packet debug printing
        void (*debug_packet)(struct ip_vs_protocol *pp, const struct sk_buff *skb, int offset, const char *msg);
        //timeout adjustment
        void (*timeout_change)(struct ip_vs_protocol *pp, int flags);
        //set the protocol timeout for a given state
        int (*set_state_timeout)(struct ip_vs_protocol *pp, char *sname, int to);
};
This structure describes an IPVS connection. IPVS connections are similar to the connections netfilter defines.
struct ip_vs_conn {
        struct list_head         c_list;            //hash list
        __u32                   caddr;          //client address
        __u32                   vaddr;          //virtual address the server exposes
        __u32                   daddr;          //real server address
        __u16                   cport;          //client port
        __u16                   vport;         //virtual port the server exposes
        __u16                   dport;         //real server port
        __u16                   protocol;      //protocol type

        atomic_t                refcnt;        //connection reference count
        struct timer_list       timer;           //timer
        volatile unsigned long  timeout;       //timeout

        spinlock_t              lock;           //lock for state transitions
        volatile __u16          flags;          /* status flags */
        volatile __u16          state;          /* state info */

        struct ip_vs_conn       *control;       //master connection, e.g. FTP control
        atomic_t                n_control;      //number of child connections
        struct ip_vs_dest       *dest;          //the real server
        atomic_t                in_pkts;        //incoming packet counter

        int (*packet_xmit)(struct sk_buff *skb, struct ip_vs_conn *cp, struct ip_vs_protocol *pp); //packet transmit

        struct ip_vs_app        *app;           //IPVS application
        void                    *app_data;      //application private data
        struct ip_vs_seq        in_seq;         //sequence numbers of incoming data
        struct ip_vs_seq        out_seq;       //sequence numbers of outgoing data
};
This structure describes a virtual service that IPVS exposes.
struct ip_vs_service {
        struct list_head        s_list;        //hash list keyed by protocol, address and port
        struct list_head        f_list;        //hash list keyed by nfmark
        atomic_t                refcnt;      //reference count
        atomic_t                usecnt;      //use count

        __u16                   protocol;   //protocol
        __u32                   addr;       //virtual server address
        __u16                   port;       //virtual server port
        __u32                   fwmark;    //the nfmark from the skb
        unsigned                flags;       //status flags
        unsigned                timeout;     //timeout
        __u32                   netmask;    //network mask

        struct list_head        destinations;  //list of real server addresses
        __u32                  num_dests;  //number of real servers
        struct ip_vs_stats      stats;        //service statistics
        struct ip_vs_app        *inc;         //application

        struct ip_vs_scheduler  *scheduler;    //scheduler pointer
        rwlock_t                sched_lock;    //scheduler lock
        void                    *sched_data;   //scheduler private data
};
This structure describes a specific real server.
struct ip_vs_dest {
        struct list_head        n_list;   /* for the dests in the service */
        struct list_head        d_list;   /* for table with all the dests */

        __u32                   addr;           //server address
        __u16                   port;           //server port
        volatile unsigned        flags;          //destination flags, volatile
        atomic_t                conn_flags;     //connection flags
        atomic_t                weight;         //server weight

        atomic_t                refcnt;         //reference count
        struct ip_vs_stats      stats;          //statistics

        atomic_t                activeconns;    //active connections
        atomic_t                inactconns;     //inactive connections
        atomic_t                persistconns;   //persistent (resident) connections
        __u32                   u_threshold;   //upper connection threshold
        __u32                   l_threshold;    //lower connection threshold

        /* for destination cache */
        spinlock_t              dst_lock;       /* lock of dst_cache */
        struct dst_entry        *dst_cache;     /* destination cache entry */
        u32                     dst_rtos;
        struct ip_vs_service    *svc;           /* service it belongs to */
        __u16                   protocol;       /* which protocol (TCP/UDP) */
        __u32                   vaddr;          /* virtual IP address */
        __u16                   vport;          /* virtual port number */
        __u32                   vfwmark;        /* firewall mark of service */
};
This structure describes an IPVS scheduler; the current scheduling methods include rr, wrr, lc, wlc, lblc, lblcr, dh, sh, etc.
struct ip_vs_scheduler {
        struct list_head        n_list;         /* d-linked list head */
        char                    *name;          /* scheduler name */
        atomic_t                refcnt;         /* reference counter */
        struct module           *module;        /* THIS_MODULE/NULL */

        /* scheduler initializing service */
        int (*init_service)(struct ip_vs_service *svc);
        /* scheduling service finish */
        int (*done_service)(struct ip_vs_service *svc);
        /* scheduler updating service */
        int (*update_service)(struct ip_vs_service *svc);

        /* selecting a server from the given service */
        struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc, const struct sk_buff *skb);
};

IPVS applications target multi-connection protocols; currently only FTP is supported.
Because ip_vs_app.c dates back to 2.2, before the kernel had its own NAT, it effectively implements application-protocol NAT itself, including rewriting payload contents and
adjusting TCP sequence/acknowledgement numbers. All of that is now handled by netfilter, so IPVS no longer needs to care and can deal with connection scheduling only.
The application support in IPVS is not very modular: when handling connection ports it still explicitly checks for FTPPORT, which means other multi-connection protocols are not supported.
It should work the way netfilter does, with one helper per multi-connection protocol invoked automatically, instead of checking ports in the code.

struct ip_vs_app
{
        struct list_head        a_list;           //links this entry into the application list
        int                     type;           /* IP_VS_APP_TYPE_xxx */
        char                    *name;          /* application module name */
        __u16                   protocol;      //protocol: TCP, UDP
        struct module           *module;        /* THIS_MODULE/NULL */
        struct list_head        incs_list;        //list of concrete instances of this application

        /* members for application incarnations */
        struct list_head        p_list;         //links this structure into the application table of the corresponding protocol (TCP, UDP...)
        struct ip_vs_app        *app;           /* its real application */
        __u16                   port;           /* port number in net order */
        atomic_t                usecnt;         /* usage counter */

        /* output hook: return false if can't linearize. diff set for TCP.  */
        int (*pkt_out)(struct ip_vs_app *, struct ip_vs_conn *, struct sk_buff **, int *diff);
        /* input hook: return false if can't linearize. diff set for TCP. */
        int (*pkt_in)(struct ip_vs_app *, struct ip_vs_conn *, struct sk_buff **, int *diff);
        /* ip_vs_app initializer */
        int (*init_conn)(struct ip_vs_app *, struct ip_vs_conn *);
        /* ip_vs_app finish */
        int (*done_conn)(struct ip_vs_app *, struct ip_vs_conn *);
        /* not used now */
        int (*bind_conn)(struct ip_vs_app *, struct ip_vs_conn *, struct ip_vs_protocol *);
        void (*unbind_conn)(struct ip_vs_app *, struct ip_vs_conn *);

        int *                   timeout_table;
        int *                   timeouts;
        int                     timeouts_size;

        int (*conn_schedule)(struct sk_buff *skb, struct ip_vs_app *app, int *verdict, struct ip_vs_conn **cpp);
        struct ip_vs_conn *
        (*conn_in_get)(const struct sk_buff *skb, struct ip_vs_app *app, const struct iphdr *iph, unsigned int proto_off, int inverse);
        struct ip_vs_conn *
        (*conn_out_get)(const struct sk_buff *skb, struct ip_vs_app *app, const struct iphdr *iph, unsigned int proto_off, int inverse);
        int (*state_transition)(struct ip_vs_conn *cp, int direction, const struct sk_buff *skb, struct ip_vs_app *app);
        void (*timeout_change)(struct ip_vs_app *app, int flags);
};
The user-space structures carry the information ipvsadm passes to the kernel after parsing user input; the data is plain, with no control information.
ipvsadm is to ipvs what iptables is to netfilter.

User-space virtual service information
struct ip_vs_service_user {
        /* virtual service addresses */
        u_int16_t               protocol;
        u_int32_t               addr;           /* virtual ip address */
        u_int16_t               port;
        u_int32_t               fwmark;         /* firewall mark of service */

        /* virtual service options */
        char                    sched_name[IP_VS_SCHEDNAME_MAXLEN];
        unsigned                flags;          /* virtual service flags */
        unsigned                timeout;        /* persistent timeout in sec */
        u_int32_t               netmask;        /* persistent netmask */
};
User-space real server information
struct ip_vs_dest_user {
        /* destination server address */
        u_int32_t               addr;
        u_int16_t               port;
        /* real server options */
        unsigned                conn_flags;     /* connection flags */
        int                     weight;         /* destination weight */
        /* thresholds for active connections */
        u_int32_t               u_threshold;    /* upper threshold */
        u_int32_t               l_threshold;    /* lower threshold */
};
User-space statistics
struct ip_vs_stats_user
{
        __u32                   conns;          /* connections scheduled */
        __u32                   inpkts;         /* incoming packets */
        __u32                   outpkts;        /* outgoing packets */
        __u64                   inbytes;        /* incoming bytes */
        __u64                   outbytes;       /* outgoing bytes */
        __u32                   cps;            /* current connection rate */
        __u32                   inpps;          /* current in packet rate */
        __u32                   outpps;         /* current out packet rate */
        __u32                   inbps;          /* current in byte rate */
        __u32                   outbps;         /* current out byte rate */
};
Userspace structure for fetching general information
struct ip_vs_getinfo {
        /* version number */
        unsigned int            version;
        /* size of connection hash table */
        unsigned int            size;
        /* number of virtual services */
        unsigned int            num_services;
};
Userspace service rule entry
struct ip_vs_service_entry {
        /* which service: user fills in these */
        u_int16_t               protocol;
        u_int32_t               addr;           /* virtual address */
        u_int16_t               port;
        u_int32_t               fwmark;         /* firewall mark of service */

        /* service options */
        char                    sched_name[IP_VS_SCHEDNAME_MAXLEN];
        unsigned                flags;          /* virtual service flags */
        unsigned                timeout;        /* persistent timeout */
        u_int32_t               netmask;        /* persistent netmask */

        /* number of real servers */
        unsigned int            num_dests;
        /* statistics */
        struct ip_vs_stats_user stats;
};
Userspace real server entry
struct ip_vs_dest_entry {
        u_int32_t               addr;           /* destination address */
        u_int16_t               port;
        unsigned                conn_flags;     /* connection flags */
        int                     weight;         /* destination weight */

        u_int32_t               u_threshold;    /* upper threshold */
        u_int32_t               l_threshold;    /* lower threshold */

        u_int32_t               activeconns;    /* active connections */
        u_int32_t               inactconns;     /* inactive connections */
        u_int32_t               persistconns;   /* persistent connections */

        /* statistics */
        struct ip_vs_stats_user stats;
};
Userspace structure for fetching the real server list
struct ip_vs_get_dests {
        /* which service: user fills in these */
        u_int16_t               protocol;
        u_int32_t               addr;           /* virtual address */
        u_int16_t               port;
        u_int32_t               fwmark;         /* firewall mark of service */

        /* number of real servers */
        unsigned int            num_dests;

        /* the real servers */
        struct ip_vs_dest_entry entrytable[0];
};
Userspace structure for fetching the virtual service list
struct ip_vs_get_services {
        /* number of virtual services */
        unsigned int            num_services;
        /* service table */
        struct ip_vs_service_entry entrytable[0];
};
Userspace structure for fetching timeout values
struct ip_vs_timeout_user {
        int                     tcp_timeout;
        int                     tcp_fin_timeout;
        int                     udp_timeout;
};
Userspace structure for the IPVS kernel sync daemon
struct ip_vs_daemon_user {
        /* sync daemon state (master/backup) */
        int                     state;
        /* multicast interface name */
        char                    mcast_ifn[IP_VS_IFNAME_MAXLEN];
        /* SyncID we belong to */
        int                     syncid;
};
[/Main data structures]
static int __init ip_vs_init(void)
{
      int ret;

      // initialize the IPVS control interface (set/get sockopt operations)
      ret = ip_vs_control_init();
      if (ret < 0) {
            IP_VS_ERR("can't setup control.\n");
            goto cleanup_nothing;
      }
      // protocol initialization
      ip_vs_protocol_init();
      // application helper interface initialization
      ret = ip_vs_app_init();
      if (ret < 0) {
            IP_VS_ERR("can't setup application helper.\n");
            goto cleanup_protocol;
      }
      // main data structure (connection table) initialization
      ret = ip_vs_conn_init();
      if (ret < 0) {
            IP_VS_ERR("can't setup connection table.\n");
            goto cleanup_app;
      }
      // register each processing point as a hook in the netfilter framework; see the hook implementations below
      // for background on hook points, see the ip_conntrack implementation
      ret = nf_register_hook(&ip_vs_in_ops);
      if (ret < 0) {
            IP_VS_ERR("can't register in hook.\n");
            goto cleanup_conn;
      }
      ret = nf_register_hook(&ip_vs_out_ops);
      if (ret < 0) {
            IP_VS_ERR("can't register out hook.\n");
            goto cleanup_inops;
      }
      ret = nf_register_hook(&ip_vs_post_routing_ops);
      if (ret < 0) {
            IP_VS_ERR("can't register post_routing hook.\n");
            goto cleanup_outops;
      }
      ret = nf_register_hook(&ip_vs_forward_icmp_ops);
      if (ret < 0) {
            IP_VS_ERR("can't register forward_icmp hook.\n");
            goto cleanup_postroutingops;
      }

      IP_VS_INFO("ipvs loaded.\n");
      return ret;
      ......
}
Control interface initialization
int ip_vs_control_init(void)
{
      int ret;
      int idx;
      // register the IPVS sockopt handlers so userspace can talk to IPVS via setsockopt(); see the control interface implementation below
      ret = nf_register_sockopt(&ip_vs_sockopts);
      if (ret) {
           IP_VS_ERR("cannot register sockopt.\n");
           return ret;
      }
      // create the read-only /proc/net/ip_vs and /proc/net/ip_vs_stats entries
      // see the control interface implementation below
      proc_net_fops_create("ip_vs", 0, &ip_vs_info_fops);
      proc_net_fops_create("ip_vs_stats",0, &ip_vs_stats_fops);

      // create the readable/writable control parameters under /proc/sys/net/ipv4/vs
      sysctl_header = register_sysctl_table(vs_root_table, 0);

      // initialize the doubly linked hash lists:
      // svc_table hashes struct ip_vs_service entries by protocol, address, port, etc.
      // svc_fwm_table hashes struct ip_vs_service entries by the packet's nfmark
      for(idx = 0; idx < IP_VS_SVC_TAB_SIZE; idx++)  {
            INIT_LIST_HEAD(&ip_vs_svc_table[idx]);
            INIT_LIST_HEAD(&ip_vs_svc_fwm_table[idx]);
      }
      // rtable is the hash list of struct ip_vs_dest destinations
      for(idx = 0; idx < IP_VS_RTAB_SIZE; idx++)  {
            INIT_LIST_HEAD(&ip_vs_rtable[idx]);
      }
      // IPVS statistics
      memset(&ip_vs_stats, 0, sizeof(ip_vs_stats));
      spin_lock_init(&ip_vs_stats.lock);              // statistics lock
      // attach an estimator to the current statistics; it can be used to compute server performance figures
      ip_vs_new_estimator(&ip_vs_stats);
      // schedule a periodic job that adjusts system parameters according to the current load
      schedule_delayed_work(&defense_work, DEFENSE_TIMER_PERIOD);
      return 0;
}
// protocol initialization; see the protocol implementations below for details
int ip_vs_protocol_init(void)
{
      // hook up every protocol IPVS can load-balance; currently TCP/UDP/AH/ESP
      char protocols[64];
#define REGISTER_PROTOCOL(p)                    \
      do {                                    \
            register_ip_vs_protocol(p);     \
            strcat(protocols, ", ");        \
            strcat(protocols, (p)->name);   \
      } while (0)

      // bytes 0 and 1 are reserved for the ", " separator
      protocols[0] = '\0';
      protocols[2] = '\0';
#ifdef CONFIG_IP_VS_PROTO_TCP
      REGISTER_PROTOCOL(&ip_vs_protocol_tcp);
#endif
#ifdef CONFIG_IP_VS_PROTO_UDP
      REGISTER_PROTOCOL(&ip_vs_protocol_udp);
#endif
#ifdef CONFIG_IP_VS_PROTO_AH
      REGISTER_PROTOCOL(&ip_vs_protocol_ah);
#endif
#ifdef CONFIG_IP_VS_PROTO_ESP
      REGISTER_PROTOCOL(&ip_vs_protocol_esp);
#endif
      IP_VS_INFO("Registered protocols (%s)\n", &protocols[2]);
      return 0;
}
#define IP_VS_PROTO_TAB_SIZE  32
static int register_ip_vs_protocol(struct ip_vs_protocol *pp)
{
        //#define IP_VS_PROTO_HASH(proto)         ((proto) & (IP_VS_PROTO_TAB_SIZE-1))
        unsigned hash = IP_VS_PROTO_HASH(pp->protocol); // compute a hash value

        pp->next = ip_vs_proto_table[hash];
        ip_vs_proto_table[hash] = pp;

        if (pp->init != NULL)
                pp->init(pp);
        return 0;
}
Application helper interface initialization
int ip_vs_app_init(void)
{
        // create a /proc/net/ip_vs_app entry
        proc_net_fops_create("ip_vs_app", 0, &ip_vs_app_fops);
        return 0;
}
Main data structure initialization
int ip_vs_conn_init(void)
{
      int idx;
      // IPVS connection hash table: static struct list_head *ip_vs_conn_tab;
      ip_vs_conn_tab = vmalloc(IP_VS_CONN_TAB_SIZE*sizeof(struct list_head));
      if (!ip_vs_conn_tab)
            return -ENOMEM;

      // IPVS connection slab cache
      ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn", sizeof(struct ip_vs_conn), 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
      if (!ip_vs_conn_cachep) {
            vfree(ip_vs_conn_tab);
            return -ENOMEM;
      }
      // initialize the hash list heads
      for (idx = 0; idx < IP_VS_CONN_TAB_SIZE; idx++) {
            INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
      }
      // initialize the read/write locks
      for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++)  {
            rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
      }
      // create the /proc/net/ip_vs_conn entry
      proc_net_fops_create("ip_vs_conn", 0, &ip_vs_conn_fops);
      // initialize the random seed used by the connection hash
      get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
      return 0;
}
 

Category: LINUX

2013-05-07 20:44:15

NAT mode is the most commonly used IPVS mode. Compared with TUN and DR modes, NAT mode is easier to deploy: only the default gateway of the real servers needs to be changed.

IPVS is implemented on top of Netfilter. It registers four Netfilter hook functions, of which two are relevant to NAT mode: ip_vs_in and ip_vs_out. The former handles client -> server packets; the latter handles server -> client packets.

1. The ip_vs_in hook function

The processing flow of ip_vs_in is quite clear:

pre-checks -> look up or create the ip_vs_conn object -> update statistics -> update the ip_vs_conn object's state -> modify the sk_buff and forward the packet -> IPVS state synchronization -> done

1.1. Pre-checks

The pre-check step is simple, mainly:

  • make sure the packet type is PACKET_HOST
  • make sure the packet was not sent from a loopback device
  • make sure the packet's protocol is one IPVS supports; currently TCP, UDP, AH and ESP
1.2. Finding or creating the ip_vs_conn object

Since a packet has arrived, there must be a corresponding ip_vs_conn object. The current ip_vs_conn objects are first searched by the packet's <source address, source port, destination address, destination port>. If none is found, this is a new connection: the scheduler runs to pick a suitable real server, and a new ip_vs_conn object is created. The scheduling process itself is not expanded here.

1.3. Updating statistics

First look at the ip_vs_stats structure; the role of each member is easy to see:

/*
 *    IPVS statistics object
 */
struct ip_vs_stats
{
    __u32 conns; /* connections scheduled */
    __u32 inpkts; /* incoming packets */
    __u32 outpkts; /* outgoing packets */
    __u64 inbytes; /* incoming bytes */
    __u64 outbytes; /* outgoing bytes */

    __u32            cps;        /* current connection rate */
    __u32            inpps;        /* current in packet rate */
    __u32            outpps;        /* current out packet rate */
    __u32            inbps;        /* current in byte rate */
    __u32            outbps;        /* current out byte rate */

    spinlock_t lock; /* spin lock */
};

This structure is dedicated to statistics. Every ip_vs_service object and every ip_vs_dest object embeds one, and there is additionally a global ip_vs_stats variable.

The functions ip_vs_in_stats and ip_vs_out_stats count packet traffic in the two directions; ip_vs_conn_stats counts new connections.

static inline void
ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
{
    struct ip_vs_dest *dest = cp->dest;
    if (dest && (dest->flags & IP_VS_DEST_F_AVAILABLE)) {
        spin_lock(&dest->stats.lock);
        dest->stats.inpkts++;
        dest->stats.inbytes += skb->len;
        spin_unlock(&dest->stats.lock);

        spin_lock(&dest->svc->stats.lock);
        dest->svc->stats.inpkts++;
        dest->svc->stats.inbytes += skb->len;
        spin_unlock(&dest->svc->stats.lock);

        spin_lock(&ip_vs_stats.lock);
        ip_vs_stats.inpkts++;
        ip_vs_stats.inbytes += skb->len;
        spin_unlock(&ip_vs_stats.lock);
    }
}

conns, inpkts, outpkts, inbytes and outbytes are easy to maintain: each is a simple increment. The rates (cps and friends) are more involved; they are computed by a kernel timer. Every ip_vs_stats object has a matching ip_vs_estimator structure:

struct ip_vs_estimator
{
    struct ip_vs_estimator    *next;
    struct ip_vs_stats    *stats;

    u32            last_conns;
    u32            last_inpkts;
    u32            last_outpkts;
    u64            last_inbytes;
    u64            last_outbytes;

    u32            cps;
    u32            inpps;
    u32            outpps;
    u32            inbps;
    u32            outbps;
};

All ip_vs_estimator structures are linked into a list traversable through the global variable est_list. The timer fires every 2 seconds; its handler is:

static void estimation_timer(unsigned long arg)
{
    struct ip_vs_estimator *e;
    struct ip_vs_stats *s;
    u32 n_conns;
    u32 n_inpkts, n_outpkts;
    u64 n_inbytes, n_outbytes;
    u32 rate;

    read_lock(&est_lock);
    for (e = est_list; e; e = e->next) {
        s = e->stats;

        spin_lock(&s->lock);
        n_conns = s->conns;
        n_inpkts = s->inpkts;
        n_outpkts = s->outpkts;
        n_inbytes = s->inbytes;
        n_outbytes = s->outbytes;

        /* scaled by 2^10, but divided 2 seconds */
        rate = (n_conns - e->last_conns)<<9;
        e->last_conns = n_conns;
        e->cps += ((long)rate - (long)e->cps)>>2;
        s->cps = (e->cps+0x1FF)>>10;

        rate = (n_inpkts - e->last_inpkts)<<9;
        e->last_inpkts = n_inpkts;
        e->inpps += ((long)rate - (long)e->inpps)>>2;
        s->inpps = (e->inpps+0x1FF)>>10;

        rate = (n_outpkts - e->last_outpkts)<<9;
        e->last_outpkts = n_outpkts;
        e->outpps += ((long)rate - (long)e->outpps)>>2;
        s->outpps = (e->outpps+0x1FF)>>10;

        rate = (n_inbytes - e->last_inbytes)<<4;
        e->last_inbytes = n_inbytes;
        e->inbps += ((long)rate - (long)e->inbps)>>2;
        s->inbps = (e->inbps+0xF)>>5;

        rate = (n_outbytes - e->last_outbytes)<<4;
        e->last_outbytes = n_outbytes;
        e->outbps += ((long)rate - (long)e->outbps)>>2;
        s->outbps = (e->outbps+0xF)>>5;
        spin_unlock(&s->lock);
    }
    read_unlock(&est_lock);
    mod_timer(&est_timer, jiffies + 2*HZ);
}

Taking cps as an example: the computation is an exponentially weighted moving average, cps += (rate - cps)/4, kept in fixed point scaled by 2^10 (the delta is shifted by 9 rather than 10 because the sampling interval is 2 seconds).

1.4. Updating the ip_vs_conn object's state

During TCP packet processing, each ip_vs_conn object corresponds to one TCP connection, so a state transition process is needed to steer the connection through normal setup and teardown. These transitions are fairly involved; a later section combines the IN and OUT directions to look at the TCP state machine.

1.5. Modifying the sk_buff and forwarding the packet

In NAT mode, packet forwarding is done by the ip_vs_nat_xmit function. Its sk_buff manipulation is not covered here.

1.6. IPVS state synchronization

First, decide whether this ip_vs_conn object needs master/backup synchronization: the local IPVS must be the MASTER, and the ip_vs_conn object must be in the ESTABLISHED state. Even when these conditions hold, a sync is not sent for every forwarded packet, but only once every 50 packets.

The synchronization itself is done by the function ip_vs_sync_conn:

/*
 * Add an ip_vs_conn information into the current sync_buff.
 * Called by ip_vs_in.
 */
void ip_vs_sync_conn(struct ip_vs_conn *cp)
{
    struct ip_vs_sync_mesg *m;
    struct ip_vs_sync_conn *s;
    int len;

    spin_lock(&curr_sb_lock);
    if (!curr_sb) {
        if (!(curr_sb=ip_vs_sync_buff_create())) {
            spin_unlock(&curr_sb_lock);
            IP_VS_ERR("ip_vs_sync_buff_create failed.\n");
            return;
        }
    }

    len = (cp->flags & IP_VS_CONN_F_SEQ_MASK) ? FULL_CONN_SIZE :
        SIMPLE_CONN_SIZE;
    m = curr_sb->mesg;
    s = (struct ip_vs_sync_conn *)curr_sb->head;

    /* copy members */
    s->protocol = cp->protocol;
    s->cport = cp->cport;
    s->vport = cp->vport;
    s->dport = cp->dport;
    s->caddr = cp->caddr;
    s->vaddr = cp->vaddr;
    s->daddr = cp->daddr;
    s->flags = htons(cp->flags & ~IP_VS_CONN_F_HASHED);
    s->state = htons(cp->state);
    if (cp->flags & IP_VS_CONN_F_SEQ_MASK) {
        struct ip_vs_sync_conn_options *opt =
            (struct ip_vs_sync_conn_options *)&s[1];
        memcpy(opt, &cp->in_seq, sizeof(*opt));
    }

    m->nr_conns++;
    m->size += len;
    curr_sb->head += len;

    /* check if there is a space for next one */
    if (curr_sb->head+FULL_CONN_SIZE > curr_sb->end) {
        sb_queue_tail(curr_sb);
        curr_sb = NULL;
    }
    spin_unlock(&curr_sb_lock);

    /* synchronize its controller if it has */
    if (cp->control)
        ip_vs_sync_conn(cp->control);
}

 
The sync data for each ip_vs_conn object is copied into curr_sb, which is then queued onto the ip_vs_sync_queue list. The master sync thread's main function, sync_master_loop, dequeues sync buffers from ip_vs_sync_queue and sends them to the backup machine. On the backup, the sync thread's main function sync_backup_loop reads the sync data from the network, and ip_vs_process_message rebuilds it into ip_vs_conn objects and stores them.
1.7. The ip_vs_out hook function

ip_vs_out handles server -> client packets. It is much simpler than ip_vs_in and is not described here.

2. The TCP state transition process

IPVS supports four protocols: TCP, UDP, AH and ESP. Since TCP's logic is the most complex, IPVS does the most special-casing for it.

IPVS's TCP handling mainly shows up in TCP state maintenance, which relies on a state transition matrix:

/*
 *    Timeout table[state]
 */
static int tcp_timeouts[IP_VS_TCP_S_LAST+1] = {
    [IP_VS_TCP_S_NONE]        =    2*HZ,
    [IP_VS_TCP_S_ESTABLISHED]    =    15*60*HZ,
    [IP_VS_TCP_S_SYN_SENT]        =    2*60*HZ,
    [IP_VS_TCP_S_SYN_RECV]        =    1*60*HZ,
    [IP_VS_TCP_S_FIN_WAIT]        =    2*60*HZ,
    [IP_VS_TCP_S_TIME_WAIT]        =    2*60*HZ,
    [IP_VS_TCP_S_CLOSE]        =    10*HZ,
    [IP_VS_TCP_S_CLOSE_WAIT]    =    60*HZ,
    [IP_VS_TCP_S_LAST_ACK]        =    30*HZ,
    [IP_VS_TCP_S_LISTEN]        =    2*60*HZ,
    [IP_VS_TCP_S_SYNACK]        =    120*HZ,
    [IP_VS_TCP_S_LAST]        =    2*HZ,
};

static char * tcp_state_name_table[IP_VS_TCP_S_LAST+1] = {
    [IP_VS_TCP_S_NONE]        =    "NONE",
    [IP_VS_TCP_S_ESTABLISHED]    =    "ESTABLISHED",
    [IP_VS_TCP_S_SYN_SENT]        =    "SYN_SENT",
    [IP_VS_TCP_S_SYN_RECV]        =    "SYN_RECV",
    [IP_VS_TCP_S_FIN_WAIT]        =    "FIN_WAIT",
    [IP_VS_TCP_S_TIME_WAIT]        =    "TIME_WAIT",
    [IP_VS_TCP_S_CLOSE]        =    "CLOSE",
    [IP_VS_TCP_S_CLOSE_WAIT]    =    "CLOSE_WAIT",
    [IP_VS_TCP_S_LAST_ACK]        =    "LAST_ACK",
    [IP_VS_TCP_S_LISTEN]        =    "LISTEN",
    [IP_VS_TCP_S_SYNACK]        =    "SYNACK",
    [IP_VS_TCP_S_LAST]        =    "BUG!",
};

#define sNO IP_VS_TCP_S_NONE
#define sES IP_VS_TCP_S_ESTABLISHED
#define sSS IP_VS_TCP_S_SYN_SENT
#define sSR IP_VS_TCP_S_SYN_RECV
#define sFW IP_VS_TCP_S_FIN_WAIT
#define sTW IP_VS_TCP_S_TIME_WAIT
#define sCL IP_VS_TCP_S_CLOSE
#define sCW IP_VS_TCP_S_CLOSE_WAIT
#define sLA IP_VS_TCP_S_LAST_ACK
#define sLI IP_VS_TCP_S_LISTEN
#define sSA IP_VS_TCP_S_SYNACK

struct tcp_states_t {
    int next_state[IP_VS_TCP_S_LAST];
};

static struct tcp_states_t tcp_states [] = {
/*    INPUT */
/* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA    */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }},

/*    OUTPUT */
/* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA    */
/*syn*/ {{sSS, sES, sSS, sSR, sSS, sSS, sSS, sSS, sSS, sLI, sSR }},
/*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }},
/*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }},
/*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }},

/*    INPUT-ONLY */
/* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA    */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }},
};

Only the INPUT and OUTPUT tables matter for NAT mode, and their meaning is easy to grasp:

  1. sNO, sES and the rest are TCP states; tcp_state_name_table gives their names, and tcp_timeouts gives how long each state is kept. This lifetime bounds the life of the ip_vs_conn object: if the connection sees no input or output within that time, the ip_vs_conn object is destroyed automatically, implemented by setting the object's timeout.
  2. When a connection is in a given state, find that state's column in the tcp_states matrix, then use the current event (INPUT means input from the client, OUTPUT means output from the real server) to read off the next state.

The state transition matrix can also be drawn as a state transition diagram (figure not reproduced here).
