Linux Kernel Network Protocol Stack Architecture: End-to-End Flow Analysis

https://download.csdn.net/download/wuhuacai/10157233

https://blog.csdn.net/zxorange321/article/details/75676063

Linux Kernel Protocol Stack Analysis

Contents

1      Introduction
2      TCP Protocol
2.1       Layering
2.2       TCP/IP Layering
2.3       Internet Addresses
2.4       Encapsulation
2.5       Demultiplexing
3      Packet Formats
3.1       ethhdr
3.2       iphdr
3.3       udphdr
4      Data Structures
4.1       Layered Structure of the Kernel Protocol Stack
4.2       msghdr
4.3       iovec
4.4       file
4.5       file_operations
4.6       socket
4.7       sock
4.8       sock_common
4.9       inet_sock
4.10     udp_sock
4.11     proto_ops
4.12     proto
4.13     net_proto_family
4.14     softnet_data
4.15     sk_buff
4.16     sk_buff_head
4.17     net_device
4.18     inet_protosw
4.19     inetsw_array
4.20     sock_type
4.21     IPPROTO
4.22     net_protocol
4.23     packet_type
4.24     rtable
4.25     dst_entry
4.26     napi_struct
5      Data Structure Class Diagram
6      Protocol Stack Registration Flow
6.1       Kernel Boot Flow
6.2       Protocol Stack Initialization Flow
7      Protocol Stack Receive Flow
7.1       Driver Receive Flow
7.2       Application-Layer Receive Flow
8      Protocol Stack Transmit Flow
9      Summary
10        References

 


 

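1      Introduction

This document is based on linux-2.6.32. It is meant for readers who already have some background in network protocols and want to understand how a network packet travels through the protocol stack: roughly which steps a packet goes through between being received by the NIC and being delivered to the application layer, and what each step does. The figures are attached at the end.

Prerequisites: C basics, C callback functions, UML modeling basics, C++ object-oriented encapsulation, and TCP/IP or general networking basics.

2      TCP Protocol

This chapter is excerpted from Chapter 1 of [TCP/IP Illustrated, Volume 1].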

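2.1 Layering

Networking protocols are normally developed in layers, with each layer responsible for a different facet of the communication. A protocol suite, such as TCP/IP, is a combination of protocols at different layers. TCP/IP is normally considered a four-layer system, as shown in Figure 1-1. Each layer has a different responsibility:

1) The link layer, sometimes called the data-link layer or network interface layer, normally includes the device driver in the operating system and the corresponding network interface card in the computer. Together they handle all the details of physically interfacing with the cable (or whatever other transmission medium is used).

2) The network layer, sometimes called the internet layer, handles the movement of packets around the network, for example packet routing. In the TCP/IP protocol suite, the network-layer protocols are IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).

3) The transport layer provides end-to-end communication between applications on two hosts. The TCP/IP protocol suite has two very different transport protocols: TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). TCP provides highly reliable data communication between two hosts. Its work includes dividing the data passed to it by the application into appropriately sized chunks for the network layer below, acknowledging received packets, and setting timeouts to make sure the other end acknowledges the packets that were sent. Because the transport layer provides this reliable end-to-end communication, the application layer can ignore all these details. UDP, on the other hand, provides a much simpler service to the application layer. It just sends packets called datagrams from one host to another, with no guarantee that a datagram reaches the other end. Any required reliability must be added by the application layer.

The two transport-layer protocols serve different purposes in different applications, as will be seen later.

4) The application layer handles the details of the particular application. Almost every TCP/IP implementation provides these common applications:

• Telnet: remote login.

• FTP: File Transfer Protocol.

• SMTP: Simple Mail Transfer Protocol.

• SNMP: Simple Network Management Protocol.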

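2.2 TCP/IP Layering

The TCP/IP protocol suite contains many protocols. Figure 1-4 shows the additional protocols that the book discusses.

TCP and UDP are the two best-known transport-layer protocols, and both use IP as the network-layer protocol.

Although TCP uses the unreliable service of IP, it provides a reliable transport-layer service. Chapters 17 to 22 of the book cover the internal operation of TCP in detail. The book then looks at some TCP applications, such as Telnet and Rlogin in Chapter 26, FTP in Chapter 27, and SMTP in Chapter 28. These applications are normally user processes.

UDP sends and receives datagrams for applications. A datagram is a unit of information (for example, a certain number of bytes specified by the sender) that travels from the sender to the receiver. Unlike TCP, however, UDP is unreliable: it does not guarantee that a datagram ever reaches its final destination. Chapter 11 of the book covers UDP, and Chapter 14 (DNS: the Domain Name System), Chapter 15 (TFTP: the Trivial File Transfer Protocol), and Chapter 16 (BOOTP: the Bootstrap Protocol) cover applications that use UDP. SNMP also uses UDP, but because it deals with many other protocols, the book defers it to Chapter 25.

IP is the main protocol at the network layer and is used by both TCP and UDP. Every piece of TCP and UDP data is carried across the internet through the IP layer at the end systems and at every intermediate router. Figure 1-4 also shows an application accessing IP directly. This is rare but possible (some older routing protocols were implemented this way, and a new transport-layer protocol could be as well). Chapter 3 of the book covers IP, deferring some details to later chapters where they are more relevant. Chapters 9 and 10 describe how IP performs routing.

ICMP is an adjunct to IP. The IP layer uses it to exchange error messages and other vital information with other hosts or routers.

Chapter 6 covers ICMP in detail. Although ICMP is used mainly by IP, an application can also access it. Two popular diagnostic tools, Ping and Traceroute (Chapters 7 and 8), both use ICMP.

IGMP is the Internet Group Management Protocol. It is used to multicast a UDP datagram to multiple hosts. Chapter 12 describes the general properties of broadcasting (sending a UDP datagram to every host on a specified network) and multicasting, and Chapter 13 describes IGMP itself.

ARP (Address Resolution Protocol) and RARP (Reverse Address Resolution Protocol) are specialized protocols used by certain types of network interfaces (such as Ethernet and token ring) to convert between the addresses used by the IP layer and those used by the network interface layer. Chapters 4 and 5 examine them respectively.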

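2.3 Internet Addresses

Every interface on an internet must have a unique Internet address (also called an IP address). An IP address is 32 bits long. Instead of a flat address space such as 1, 2, 3, and so on, IP addresses have a structure. Figure 1-5 shows the five classes of Internet address formats.

These 32-bit addresses are normally written as four decimal numbers, one for each byte of the address. This is called dotted-decimal notation. For example, the book author's system has a class B address, written as 140.252.13.33.

The easiest way to tell the address classes apart is to look at the first decimal number. Figure 1-6 lists the range of each class, with the first decimal number shown in bold.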
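The two rules above (dotted-decimal notation and classification by the first decimal number) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names ipv4_class and ipv4_to_dotted are hypothetical, addresses are taken in host byte order, and the ranges are the classic classful ones (A: 0 to 127, B: 128 to 191, C: 192 to 223, D: 224 to 239, E: 240 to 255).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Classify a host-order IPv4 address by its first decimal number
 * (hypothetical helper, classful addressing only). */
char ipv4_class(uint32_t addr)
{
    uint8_t first = (uint8_t)(addr >> 24);

    if (first < 128) return 'A';   /* leading bit  0    */
    if (first < 192) return 'B';   /* leading bits 10   */
    if (first < 224) return 'C';   /* leading bits 110  */
    if (first < 240) return 'D';   /* leading bits 1110, multicast */
    return 'E';                    /* leading bits 1111, reserved  */
}

/* Write the dotted-decimal form of addr into buf (needs 16 bytes). */
void ipv4_to_dotted(uint32_t addr, char *buf, size_t len)
{
    assert(buf != NULL && len >= 16);
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)((addr >> 24) & 0xff),
             (unsigned)((addr >> 16) & 0xff),
             (unsigned)((addr >> 8) & 0xff),
             (unsigned)(addr & 0xff));
}
```

With the address from the text, (140 << 24) | (252 << 16) | (13 << 8) | 33, the first helper reports class B and the second produces the string 140.252.13.33.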

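It is worth repeating that a multihomed host has multiple IP addresses, one per interface.

Because every interface on an internet must have a unique IP address, some authority must allocate IP addresses to the networks connected to the Internet. That authority is the Internet Network Information Center, called the InterNIC. The InterNIC assigns only network IDs; assigning host IDs is up to the system administrator.

Registration services for the Internet (IP addresses and DNS domain names) used to be handled by the NIC, at nic.ddn.mil. On April 1, 1993, the InterNIC was created. The NIC now handles registration requests only for the Defense Data Network; all other Internet users go through the InterNIC registration services, at rs.internic.net.

The InterNIC actually has three parts: registration services (rs.internic.net), directory and database services (ds.internic.net), and information services (is.internic.net). See Exercise 1.8 for more information on the InterNIC.

There are three types of IP addresses: unicast (destined for a single host), broadcast (destined for all hosts on a given network), and multicast (destined for all hosts in the same group). Chapters 12 and 13 cover broadcasting and multicasting in more detail.

Section 3.4 of the book introduces subnetting after describing IP routing. Figure 3-9 shows the special-case IP addresses: host IDs and network IDs of all zero bits or all one bits.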

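2.4 Encapsulation

When an application sends data using TCP, the data is passed down the protocol stack, through each layer in turn, until it is sent onto the network as a stream of bits. Each layer adds information to the data it receives by prepending a header (and sometimes appending a trailer), as shown in Figure 1-7. The unit of data that TCP passes to IP is called a TCP segment. The unit of data that IP passes to the network interface layer is called an IP datagram. The stream of bits that flows across the Ethernet is called a frame. The numbers under the frame header and trailer in Figure 1-7 are the typical sizes of those headers in bytes.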
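The header-prepending process described above can be sketched with a buffer that reserves headroom and lets each layer push its header in front of the data. This is a simplified illustration, not kernel code: struct pkt, pkt_init, pkt_push and encapsulate are hypothetical names, and the header sizes are the minimum ones without options (Ethernet 14 bytes, IP 20 bytes, TCP 20 bytes); the Ethernet trailer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimum header sizes in bytes (no IP or TCP options). */
enum { ETH_HDR_LEN = 14, IP_HDR_LEN = 20, TCP_HDR_LEN = 20,
       HEADROOM = ETH_HDR_LEN + IP_HDR_LEN + TCP_HDR_LEN };

/* Hypothetical packet buffer: payload sits after reserved headroom. */
struct pkt {
    unsigned char buf[1600];
    unsigned char *data;   /* start of the outermost header so far */
    size_t len;            /* bytes from data to end of payload    */
};

/* Copy the application payload behind the reserved headroom. */
void pkt_init(struct pkt *p, const void *payload, size_t n)
{
    assert(n + HEADROOM <= sizeof p->buf);
    p->data = p->buf + HEADROOM;
    memcpy(p->data, payload, n);
    p->len = n;
}

/* Prepend hlen zeroed header bytes and return a pointer to them. */
unsigned char *pkt_push(struct pkt *p, size_t hlen)
{
    assert((size_t)(p->data - p->buf) >= hlen);
    p->data -= hlen;
    p->len += hlen;
    memset(p->data, 0, hlen);
    return p->data;
}

/* Walk down the stack: TCP segment, IP datagram, Ethernet frame. */
size_t encapsulate(struct pkt *p, const void *payload, size_t n)
{
    pkt_init(p, payload, n);
    pkt_push(p, TCP_HDR_LEN);   /* transport layer */
    pkt_push(p, IP_HDR_LEN);    /* network layer   */
    pkt_push(p, ETH_HDR_LEN);   /* link layer      */
    return p->len;
}
```

Reserving the headroom up front means no layer has to copy the payload when it adds its header, which is the same idea the kernel's socket buffers rely on.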

 

 

 

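2.5 Demultiplexing

When the destination host receives an Ethernet frame, the data starts its way up the protocol stack, and each layer removes the header that the corresponding protocol added. Each protocol box looks at the protocol identifier in its header to determine which upper-layer protocol receives the data. This process is called demultiplexing, and Figure 1-8 shows how it takes place. [TCP/IP Illustrated, Volume 1]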
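The two demultiplexing steps can be sketched as a pair of switch statements: first on the Ethernet type field, then on the IP protocol field. This is an illustrative sketch working on raw bytes; the enum and the function demux are hypothetical names, while the numeric values are the standard ones (Ethernet types 0x0800 IP, 0x0806 ARP, 0x8035 RARP; IP protocols 1 ICMP, 2 IGMP, 6 TCP, 17 UDP).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the protocol that finally receives the data. */
enum upper { UP_UNKNOWN, UP_ARP, UP_RARP, UP_ICMP, UP_IGMP, UP_TCP, UP_UDP };

/* Decide which protocol an Ethernet frame is destined for. */
enum upper demux(const uint8_t *frame, size_t len)
{
    uint16_t type;

    assert(frame != NULL);
    if (len < 14)                       /* shorter than an Ethernet header */
        return UP_UNKNOWN;

    /* Step 1: the 16-bit type field follows the two 6-byte addresses. */
    type = (uint16_t)((frame[12] << 8) | frame[13]);
    switch (type) {
    case 0x0806: return UP_ARP;
    case 0x8035: return UP_RARP;
    case 0x0800: break;                 /* IP: keep going up the stack */
    default:     return UP_UNKNOWN;
    }

    if (len < 14 + 20)                  /* no room for a minimal IP header */
        return UP_UNKNOWN;

    /* Step 2: the protocol field is byte 9 of the IP header. */
    switch (frame[14 + 9]) {
    case 1:  return UP_ICMP;
    case 2:  return UP_IGMP;
    case 6:  return UP_TCP;
    case 17: return UP_UDP;
    default: return UP_UNKNOWN;
    }
}
```

A third step, not shown here, would use the TCP or UDP destination port to pick the receiving application.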

3      Packet Formats

3.1 ethhdr

Describes the Ethernet header.

/*
 *  This is an Ethernet frame header.
 */

struct ethhdr {
    unsigned char h_dest[ETH_ALEN];   /* destination eth addr */
    unsigned char h_source[ETH_ALEN]; /* source ether addr    */
    __be16        h_proto;            /* packet type ID field */
} __attribute__((packed));

 

 

 

/*
 *  These are the defined Ethernet Protocol ID's.
 */

 

#define ETH_P_LOOP   0x0060    /* Ethernet Loopback packet */

#define ETH_P_PUP 0x0200     /* Xerox PUP packet      */

#define ETH_P_PUPAT  0x0201    /* Xerox PUP Addr Trans packet  */

#define ETH_P_IP  0x0800     /* Internet Protocol packet */

#define ETH_P_X25 0x0805     /* CCITT X.25        */

#define ETH_P_ARP 0x0806     /* Address Resolution packet    */

#define ETH_P_BPQ 0x08FF     /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */

#define ETH_P_IEEEPUP    0x0a00    /* Xerox IEEE802.3 PUP packet */

#define ETH_P_IEEEPUPAT  0x0a01    /* Xerox IEEE802.3 PUP Addr Trans packet */

#define ETH_P_DEC       0x6000         /* DEC Assigned proto           */

#define ETH_P_DNA_DL    0x6001         /* DEC DNA Dump/Load            */

#define ETH_P_DNA_RC    0x6002         /* DEC DNA Remote Console       */

#define ETH_P_DNA_RT    0x6003         /* DEC DNA Routing              */

#define ETH_P_LAT       0x6004         /* DEC LAT                      */

#define ETH_P_DIAG      0x6005         /* DEC Diagnostics              */

#define ETH_P_CUST      0x6006         /* DEC Customer use             */

#define ETH_P_SCA       0x6007         /* DEC Systems Comms Arch       */

#define ETH_P_TEB 0x6558     /* Trans Ether Bridging     */

#define ETH_P_RARP      0x8035      /* Reverse Addr Res packet  */

#define ETH_P_ATALK  0x809B    /* Appletalk DDP     */

#define ETH_P_AARP   0x80F3    /* Appletalk AARP    */

#define ETH_P_8021Q  0x8100         /* 802.1Q VLAN Extended Header  */

#define ETH_P_IPX 0x8137     /* IPX over DIX          */

#define ETH_P_IPV6   0x86DD    /* IPv6 over bluebook       */

#define ETH_P_PAUSE  0x8808    /* IEEE Pause frames. See 802.3 31B */

#define ETH_P_SLOW   0x8809    /* Slow Protocol. See 802.3ad 43B */

#define ETH_P_WCCP   0x883E    /* Web-cache coordination protocol
                   * defined in draft-wilson-wrec-wccp-v2-00.txt */

#define ETH_P_PPP_DISC   0x8863    /* PPPoE discovery messages     */

#define ETH_P_PPP_SES    0x8864    /* PPPoE session messages   */

#define ETH_P_MPLS_UC    0x8847    /* MPLS Unicast traffic     */

#define ETH_P_MPLS_MC    0x8848    /* MPLS Multicast traffic   */

#define ETH_P_ATMMPOA    0x884c    /* MultiProtocol Over ATM   */

#define ETH_P_ATMFATE    0x8884    /* Frame-based ATM Transport

                   * over Ethernet

                   */

#define ETH_P_PAE 0x888E     /* Port Access Entity (IEEE 802.1X) */

#define ETH_P_AOE 0x88A2     /* ATA over Ethernet     */

#define ETH_P_TIPC   0x88CA    /* TIPC           */

#define ETH_P_1588   0x88F7    /* IEEE 1588 Timesync */

#define ETH_P_FCOE   0x8906    /* Fibre Channel over Ethernet  */

#define ETH_P_TDLS   0x890D    /* TDLS */

#define ETH_P_FIP 0x8914     /* FCoE Initialization Protocol */

#define ETH_P_EDSA   0xDADA    /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */

#define ETH_P_AF_IUCV   0xFBFB     /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/

 

 

3.2 iphdr

Describes the IP header.

struct iphdr {

#if defined(__LITTLE_ENDIAN_BITFIELD)

    __u8   ihl:4,

       version:4;

#elif defined (__BIG_ENDIAN_BITFIELD)

    __u8   version:4,

        ihl:4;

#else

#error "Please fix <asm/byteorder.h>"

#endif

    __u8   tos;

    __be16 tot_len;

    __be16 id;

    __be16 frag_off;

    __u8   ttl;

    __u8   protocol;

    __sum16    check;

    __be32 saddr;

    __be32 daddr;

    /* The options start here. */

};
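Two details of this header often cause confusion: ihl counts 32-bit words rather than bytes, and check is the 16-bit one's-complement checksum of the header itself. The sketch below works on raw header bytes, so it does not depend on the bitfield order shown above. The function names are hypothetical, and the checksum routine follows the standard Internet checksum algorithm (RFC 1071), not the kernel's optimized ip_fast_csum().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The version lives in the high nibble of the first header byte. */
unsigned ip_version(const uint8_t *hdr)
{
    return hdr[0] >> 4;
}

/* ihl is the low nibble and counts 32-bit words, so multiply by 4. */
size_t ip_header_len(const uint8_t *hdr)
{
    return (size_t)(hdr[0] & 0x0f) * 4;
}

/* One's-complement sum of 16-bit big-endian words, then inverted.
 * Over a header whose check field is already filled in, a valid
 * header yields 0. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(hdr != NULL && len % 2 == 0);
    for (i = 0; i < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

To compute the value to store in check, zero bytes 10 and 11 of the header first and run ip_checksum() over ip_header_len() bytes.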

 

 

3.3 udphdr

Describes the UDP header.

struct udphdr {

    __be16 source;

    __be16 dest;

    __be16 len;

    __sum16    check;

};
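All four fields are 16 bits wide and travel in network byte order (hence the __be16 type), so they must be converted before use. The following user-space sketch shows one way to read them; struct udp_fields and udp_parse are hypothetical names, and ntohs() comes from <arpa/inet.h> on POSIX systems.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-order copy of the four UDP header fields (hypothetical type). */
struct udp_fields {
    uint16_t source;
    uint16_t dest;
    uint16_t len;     /* header plus payload, in bytes */
    uint16_t check;
};

/* Parse the 8-byte UDP header at seg; returns 0 on success, -1 if the
 * segment is truncated or the length field is inconsistent. */
int udp_parse(const uint8_t *seg, size_t n, struct udp_fields *out)
{
    uint16_t raw[4];

    assert(seg != NULL && out != NULL);
    if (n < 8)
        return -1;
    memcpy(raw, seg, sizeof raw);       /* avoid unaligned access */
    out->source = ntohs(raw[0]);
    out->dest   = ntohs(raw[1]);
    out->len    = ntohs(raw[2]);
    out->check  = ntohs(raw[3]);
    if (out->len < 8 || out->len > n)
        return -1;
    return 0;
}
```

The payload length is len minus the 8-byte header.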

 

 

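4      Data Structures

The kernel protocol stack involves many intricately related data structures. This chapter simply pastes the source of the structures involved. In the original document, source and comments are set in 10-point type and highlighted, and important members and methods are marked in bold 11-point type.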

 

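4.1 Layered Structure of the Kernel Protocol Stack

Figure 4-1 Layered structure of the kernel protocol stack

Physical device hardware: the actual physical devices. Corresponds to the physical layer.

Device agnostic interface: the device-independent layer. Corresponds to the link layer.

Network protocols: the network layer. Corresponds to the IP layer and the transport layer.

Protocol agnostic interface: the protocol-independent layer. Adapts the system call layer and hides protocol details.

System call interface: the system call layer. Provides system calls to the application layer and hides the details of socket operations.

BSD socket: the BSD socket layer. Provides a uniform interface for socket operations; closely tied to the socket structure.

Inet socket: the inet socket layer. Provides a uniform interface for calling IP-layer protocols; closely tied to the sock structure.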

4.2 msghdr

Describes the format of a message passed down from the application layer. It carries important information such as user-space addresses and message flags.

/*
 *  As we do 4.4BSD message passing we use a 4.4BSD message passing
 *  system, not 4.3. Thus msg_accrights(len) are now missing. They
 *  belong in an obscure libc emulation or the bin.
 */

struct msghdr {
    void   *          msg_name;       /* Socket name           */
    int               msg_namelen;    /* Length of name        */
    struct iovec *    msg_iov;        /* Data blocks           */
    __kernel_size_t   msg_iovlen;     /* Number of blocks      */
    void   *          msg_control;    /* Per protocol magic (eg BSD file descriptor passing) */
    __kernel_size_t   msg_controllen; /* Length of cmsg list   */
    unsigned          msg_flags;
};

4.3 iovec

Describes one user-space buffer: its start address and its length.

/*
 *  Berkeley style UIO structures   -   Alan Cox 1994.
 *
 *     This program is free software; you can redistribute it and/or
 *     modify it under the terms of the GNU General Public License
 *     as published by the Free Software Foundation; either version
 *     2 of the License, or (at your option) any later version.
 */

struct iovec {
    void __user *iov_base;   /* BSD uses caddr_t (1003.1g requires void *) */
    __kernel_size_t iov_len; /* Must be size_t (1003.1g) */
};
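From user space, msghdr and iovec are usually met through sendmsg(): the application describes several separate buffers with an iovec array, and the kernel gathers them into one message. The sketch below uses the user-space versions of both structures from <sys/socket.h> and <sys/uio.h>, whose layout differs slightly from the kernel definitions above. The function name send_two_parts is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send a header buffer and a body buffer as one message without
 * copying them into a single contiguous buffer first. */
ssize_t send_two_parts(int fd, const void *hdr, size_t hlen,
                       const void *body, size_t blen)
{
    struct iovec iov[2];
    struct msghdr msg;

    assert(hdr != NULL && body != NULL);

    iov[0].iov_base = (void *)hdr;    /* first data block  */
    iov[0].iov_len  = hlen;
    iov[1].iov_base = (void *)body;   /* second data block */
    iov[1].iov_len  = blen;

    memset(&msg, 0, sizeof msg);      /* no name, no control data */
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;               /* number of blocks */

    return sendmsg(fd, &msg, 0);
}
```

On a connected datagram socket, for example one end of a socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) pair, the peer receives both parts as a single datagram.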

 

4.4 file

Structure describing an open file. Each file descriptor refers to one struct file.

struct file {
    /*
     * fu_list becomes invalid after file_free is called and queued via
     * fu_rcuhead for RCU freeing
     */
    union {
        struct list_head  fu_list;
        struct rcu_head   fu_rcuhead;
    } f_u;
    struct path       f_path;
#define f_dentry  f_path.dentry
#define f_vfsmnt  f_path.mnt
    const struct file_operations   *f_op;
    spinlock_t        f_lock;  /* f_ep_links, f_flags, no IRQ */
    atomic_long_t     f_count;
    unsigned int      f_flags;
    fmode_t           f_mode;
    loff_t            f_pos;
    struct fown_struct   f_owner;
    const struct cred    *f_cred;
    struct file_ra_state f_ra;

    u64               f_version;
#ifdef CONFIG_SECURITY
    void              *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void              *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head  f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space *f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
    unsigned long     f_mnt_write_state;
#endif
};

4.5 file_operations

Structure holding the file operations, including read(), write(), open(), ioctl(), and so on.

/*
 * NOTE:
 * read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl
 * can be called without the big kernel lock held in all filesystems.
 */
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **);
};

 

4.6 socket

The BSD socket structure exposed toward the application layer. It is protocol-independent, and its main role is to give the application layer a uniform set of socket operations. (BSD: Berkeley Software Distribution.)

/**
 *  struct socket - general BSD socket
 *  @state: socket state (%SS_CONNECTED, etc)
 *  @type: socket type (%SOCK_STREAM, etc)
 *  @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc)
 *  @ops: protocol specific socket operations
 *  @fasync_list: Asynchronous wake up list
 *  @file: File back pointer for gc
 *  @sk: internal networking protocol agnostic socket representation
 *  @wait: wait queue for several uses
 */
struct socket {
    socket_state      state;

    kmemcheck_bitfield_begin(type);
    short             type;
    kmemcheck_bitfield_end(type);

    unsigned long     flags;
    /*
     * Please keep fasync_list & wait fields in the same cache line
     */
    struct fasync_struct *fasync_list;
    wait_queue_head_t wait;

    struct file       *file;
    struct sock       *sk;
    const struct proto_ops *ops;
};

 

typedef enum {

    SS_FREE = 0,         /* not allocated     */

    SS_UNCONNECTED,         /* unconnected to any socket    */

    SS_CONNECTING,          /* in process of connecting */

    SS_CONNECTED,       /* connected to socket      */

    SS_DISCONNECTING     /* in process of disconnecting  */

} socket_state;

4.7 sock

The network-layer sock (think of it as a C++ base class). It defines protocol-independent operations and is the uniform structure of the network layer. The transport layer builds inet_sock on top of it (think of that as a C++ derived class).

/**
  * struct sock - network layer representation of sockets
  * @__sk_common: shared layout with inet_timewait_sock
  * @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
  * @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
  * @sk_lock: synchronizer
  * @sk_rcvbuf: size of receive buffer in bytes
  * @sk_sleep: sock wait queue
  * @sk_dst_cache: destination cache
  * @sk_dst_lock: destination cache lock
  * @sk_policy: flow policy
  * @sk_rmem_alloc: receive queue bytes committed
  * @sk_receive_queue: incoming packets
  * @sk_wmem_alloc: transmit queue bytes committed
  * @sk_write_queue: Packet sending queue
  * @sk_async_wait_queue: DMA copied packets
  * @sk_omem_alloc: "o" is "option" or "other"
  * @sk_wmem_queued: persistent queue size
  * @sk_forward_alloc: space allocated forward
  * @sk_allocation: allocation mode
  * @sk_sndbuf: size of send buffer in bytes
  * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *        %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  * @sk_no_check: %SO_NO_CHECK setting, whether or not checkup packets
  * @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
  * @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
  * @sk_gso_max_size: Maximum GSO segment size to build
  * @sk_lingertime: %SO_LINGER l_linger setting
  * @sk_backlog: always used with the per-socket spinlock held
  * @sk_callback_lock: used with the callbacks in the end of this struct
  * @sk_error_queue: rarely used
  * @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
  *          IPV6_ADDRFORM for instance)
  * @sk_err: last error
  * @sk_err_soft: errors that don't cause failure but are the cause of a
  *           persistent failure not just 'timed out'
  * @sk_drops: raw/udp drops counter
  * @sk_ack_backlog: current listen backlog
  * @sk_max_ack_backlog: listen backlog set in listen()
  * @sk_priority: %SO_PRIORITY setting
  * @sk_type: socket type (%SOCK_STREAM, etc)
  * @sk_protocol: which protocol this socket belongs in this network family
  * @sk_peercred: %SO_PEERCRED setting
  * @sk_rcvlowat: %SO_RCVLOWAT setting
  * @sk_rcvtimeo: %SO_RCVTIMEO setting
  * @sk_sndtimeo: %SO_SNDTIMEO setting
  * @sk_filter: socket filtering instructions
  * @sk_protinfo: private area, net family specific, when not using slab
  * @sk_timer: sock cleanup timer
  * @sk_stamp: time stamp of last packet received
  * @sk_socket: Identd and reporting IO signals
  * @sk_user_data: RPC layer private data
  * @sk_sndmsg_page: cached page for sendmsg
  * @sk_sndmsg_off: cached offset for sendmsg
  * @sk_send_head: front of stuff to transmit
  * @sk_security: used by security modules
  * @sk_mark: generic packet mark
  * @sk_write_pending: a write to stream socket waits to start
  * @sk_state_change: callback to indicate change in the state of the sock
  * @sk_data_ready: callback to indicate there is data to be processed
  * @sk_write_space: callback to indicate there is bf sending space available
  * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
  * @sk_backlog_rcv: callback to process the backlog
  * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 */

struct sock {
    /*
     * Now struct inet_timewait_sock also uses sock_common, so please just
     * don't add nothing before this first member (__sk_common) --acme
     */
    struct sock_common   __sk_common;
#define sk_node          __sk_common.skc_node
#define sk_nulls_node    __sk_common.skc_nulls_node
#define sk_refcnt        __sk_common.skc_refcnt

#define sk_copy_start    __sk_common.skc_hash
#define sk_hash          __sk_common.skc_hash
#define sk_family        __sk_common.skc_family
#define sk_state         __sk_common.skc_state
#define sk_reuse         __sk_common.skc_reuse
#define sk_bound_dev_if  __sk_common.skc_bound_dev_if
#define sk_bind_node     __sk_common.skc_bind_node
#define sk_prot          __sk_common.skc_prot
#define sk_net           __sk_common.skc_net
    kmemcheck_bitfield_begin(flags);
    unsigned int      sk_shutdown  : 2,
                      sk_no_check  : 2,
                      sk_userlocks : 4,
                      sk_protocol  : 8,
                      sk_type      : 16;
    kmemcheck_bitfield_end(flags);
    int               sk_rcvbuf;
    socket_lock_t     sk_lock;
    /*
     * The backlog queue is special, it is always used with
     * the per-socket spinlock held and requires low latency
     * access. Therefore we special case it's implementation.
     */
    struct {
        struct sk_buff *head;
        struct sk_buff *tail;
    } sk_backlog;
    wait_queue_head_t *sk_sleep;
    struct dst_entry  *sk_dst_cache;
#ifdef CONFIG_XFRM
    struct xfrm_policy *sk_policy[2];
#endif
    rwlock_t          sk_dst_lock;
    atomic_t          sk_rmem_alloc;
    atomic_t          sk_wmem_alloc;
    atomic_t          sk_omem_alloc;
    int               sk_sndbuf;
    struct sk_buff_head  sk_receive_queue;
    struct sk_buff_head  sk_write_queue;
#ifdef CONFIG_NET_DMA
    struct sk_buff_head  sk_async_wait_queue;
#endif
    int               sk_wmem_queued;
    int               sk_forward_alloc;
    gfp_t             sk_allocation;
    int               sk_route_caps;
    int               sk_gso_type;
    unsigned int      sk_gso_max_size;
    int               sk_rcvlowat;
    unsigned long     sk_flags;
    unsigned long     sk_lingertime;
    struct sk_buff_head  sk_error_queue;
    struct proto      *sk_prot_creator;
    rwlock_t          sk_callback_lock;
    int               sk_err,
                      sk_err_soft;
    atomic_t          sk_drops;
    unsigned short    sk_ack_backlog;
    unsigned short    sk_max_ack_backlog;
    __u32             sk_priority;
    struct ucred      sk_peercred;
    long              sk_rcvtimeo;
    long              sk_sndtimeo;
    struct sk_filter  *sk_filter;
    void              *sk_protinfo;
    struct timer_list sk_timer;
    ktime_t           sk_stamp;
    struct socket     *sk_socket;
    void              *sk_user_data;
    struct page       *sk_sndmsg_page;
    struct sk_buff    *sk_send_head;
    __u32             sk_sndmsg_off;
    int               sk_write_pending;
#ifdef CONFIG_SECURITY
    void              *sk_security;
#endif
    __u32             sk_mark;
    u32               sk_classid;
    void              (*sk_state_change)(struct sock *sk);
    void              (*sk_data_ready)(struct sock *sk, int bytes);
    void              (*sk_write_space)(struct sock *sk);
    void              (*sk_error_report)(struct sock *sk);
    int               (*sk_backlog_rcv)(struct sock *sk,
                                        struct sk_buff *skb);
    void              (*sk_destruct)(struct sock *sk);
};

 

4.8 sock_common

The minimal network-layer representation of a socket.

/**
 *  struct sock_common - minimal network layer representation of sockets
 *  @skc_node: main hash linkage for various protocol lookup tables
 *  @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
 *  @skc_refcnt: reference count
 *  @skc_hash: hash value used with various protocol lookup tables
 *  @skc_family: network address family
 *  @skc_state: Connection state
 *  @skc_reuse: %SO_REUSEADDR setting
 *  @skc_bound_dev_if: bound device index if != 0
 *  @skc_bind_node: bind hash linkage for various protocol lookup tables
 *  @skc_prot: protocol handlers inside a network family
 *  @skc_net: reference to the network namespace of this socket
 *
 *  This is the minimal network layer representation of sockets, the header
 *  for struct sock and struct inet_timewait_sock.
 */
struct sock_common {
    /*
     * first fields are not copied in sock_copy()
     */
    union {
        struct hlist_node skc_node;
        struct hlist_nulls_node skc_nulls_node;
    };
    atomic_t          skc_refcnt;

    unsigned int      skc_hash;
    unsigned short    skc_family;
    volatile unsigned char   skc_state;
    unsigned char     skc_reuse;
    int               skc_bound_dev_if;
    struct hlist_node skc_bind_node;
    struct proto      *skc_prot;
#ifdef CONFIG_NET_NS
    struct net        *skc_net;
#endif
};

 

 

4.9 inet_sock

inet_sock is an extension of sock: the common transport-layer structure that represents the inet protocol family on top of the network layer.

/** struct inet_sock - representation of INET sockets
 *
 * @sk - ancestor class
 * @pinet6 - pointer to IPv6 control block
 * @daddr - Foreign IPv4 addr
 * @rcv_saddr - Bound local IPv4 addr
 * @dport - Destination port
 * @num - Local port
 * @saddr - Sending source
 * @uc_ttl - Unicast TTL
 * @sport - Source port
 * @id - ID counter for DF pkts
 * @tos - TOS
 * @mc_ttl - Multicasting TTL
 * @is_icsk - is this an inet_connection_sock?
 * @mc_index - Multicast device index
 * @mc_list - Group array
 * @cork - info to build ip hdr on each ip frag while socket is corked
 */
struct inet_sock {
    /* sk and pinet6 has to be the first two members of inet_sock */
    struct sock       sk;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
    struct ipv6_pinfo *pinet6;
#endif
    /* Socket demultiplex comparisons on incoming packets. */
    __be32            daddr;
    __be32            rcv_saddr;
    __be16            dport;
    __u16             num;
    __be32            saddr;
    __s16             uc_ttl;
    __u16             cmsg_flags;
    struct ip_options *opt;
    __be16            sport;
    __u16             id;
    __u8              tos;
    __u8              mc_ttl;
    __u8              pmtudisc;
    __u8              recverr:1,
                      is_icsk:1,
                      freebind:1,
                      hdrincl:1,
                      mc_loop:1,
                      transparent:1,
                      mc_all:1;
    int               mc_index;
    __be32            mc_addr;
    struct ip_mc_socklist *mc_list;
    struct {
        unsigned int      flags;
        unsigned int      fragsize;
        struct ip_options *opt;
        struct dst_entry  *dst;
        int               length; /* Total length of all frames */
        __be32            addr;
        struct flowi      fl;
    } cork;
};

 

 

4.10 udp_sock

The sock structure specific to the UDP transport protocol; it extends the transport-layer inet_sock.

struct udp_sock {
    /* inet_sock has to be the first member */
    struct inet_sock inet;
    int          pending;    /* Any pending frames ? */
    unsigned int corkflag;   /* Cork is required */
    __u16        encap_type; /* Is this an Encapsulation socket? */
    /*
     * Following member retains the information to create a UDP header
     * when the socket is uncorked.
     */
    __u16        len;        /* total length of pending frames */
    /*
     * Fields specific to UDP-Lite.
     */
    __u16        pcslen;
    __u16        pcrlen;
/* indicator bits used by pcflag: */
#define UDPLITE_BIT      0x1   /* set by udplite proto init function */
#define UDPLITE_SEND_CC  0x2   /* set via udplite setsockopt         */
#define UDPLITE_RECV_CC  0x4   /* set via udplite setsockopt         */
    __u8         pcflag;     /* marks socket as UDP-Lite if > 0    */
    __u8         unused[3];
    /*
     * For encapsulation sockets.
     */
    int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};
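The "base class / derived class" relationship between sock, inet_sock and udp_sock rests on one C rule: a pointer to a structure, suitably converted, points to its first member. Because each structure embeds its parent as the first member, a struct sock pointer can be cast to the more specific type, which is what kernel helpers such as inet_sk() and udp_sk() do. The sketch below reproduces the pattern with deliberately tiny, hypothetical mini_* structures; the field selection is an assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for sock, inet_sock and udp_sock. */
struct mini_sock {
    int sk_family;
    int sk_rcvbuf;
};

struct mini_inet_sock {
    struct mini_sock sk;        /* must stay the first member */
    uint16_t num;               /* local port */
};

struct mini_udp_sock {
    struct mini_inet_sock inet; /* must stay the first member */
    uint16_t len;               /* total length of pending frames */
};

/* "Downcast" from the base type, in the style of inet_sk(). */
struct mini_inet_sock *mini_inet_sk(struct mini_sock *sk)
{
    assert(offsetof(struct mini_inet_sock, sk) == 0);
    return (struct mini_inet_sock *)sk;
}

/* "Downcast" two levels, in the style of udp_sk(). */
struct mini_udp_sock *mini_udp_sk(struct mini_sock *sk)
{
    assert(offsetof(struct mini_udp_sock, inet) == 0);
    return (struct mini_udp_sock *)sk;
}
```

The cast is only valid when the object really was allocated as the derived type, which is why the kernel comments insist that the embedded parent stays the first member.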

 

4.11 proto_ops

The interface from the BSD socket layer to the inet_sock layer; mainly used to operate on the socket structure.

struct proto_ops {
    int           family;
    struct module *owner;
    int    (*release)   (struct socket *sock);
    int    (*bind)      (struct socket *sock,
                         struct sockaddr *myaddr,
                         int sockaddr_len);
    int    (*connect)   (struct socket *sock,
                         struct sockaddr *vaddr,
                         int sockaddr_len, int flags);
    int    (*socketpair)(struct socket *sock1,
                         struct socket *sock2);
    int    (*accept)    (struct socket *sock,
                         struct socket *newsock, int flags);
    int    (*getname)   (struct socket *sock,
                         struct sockaddr *addr,
                         int *sockaddr_len, int peer);
    unsigned int (*poll) (struct file *file, struct socket *sock,
                         struct poll_table_struct *wait);
    int    (*ioctl)     (struct socket *sock, unsigned int cmd,
                         unsigned long arg);
    int    (*compat_ioctl) (struct socket *sock, unsigned int cmd,
                         unsigned long arg);
    int    (*listen)    (struct socket *sock, int len);
    int    (*shutdown)  (struct socket *sock, int flags);
    int    (*setsockopt)(struct socket *sock, int level,
                         int optname, char __user *optval, unsigned int optlen);
    int    (*getsockopt)(struct socket *sock, int level,
                         int optname, char __user *optval, int __user *optlen);
    int    (*compat_setsockopt)(struct socket *sock, int level,
                         int optname, char __user *optval, unsigned int optlen);
    int    (*compat_getsockopt)(struct socket *sock, int level,
                         int optname, char __user *optval, int __user *optlen);
    int    (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
                         struct msghdr *m, size_t total_len);
    int    (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
                         struct msghdr *m, size_t total_len,
                         int flags);
    int    (*mmap)      (struct file *file, struct socket *sock,
                         struct vm_area_struct *vma);
    ssize_t (*sendpage) (struct socket *sock, struct page *page,
                         int offset, size_t size, int flags);
    ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
                         struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

4.12 proto

The uniform interface for operations from the inet_sock layer to the transport layer; mainly used to operate on the sock structure.

/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 * transport -> network interface is defined by struct inet_proto
 */
struct proto {
    void       (*close)(struct sock *sk,
                    long timeout);
    int        (*connect)(struct sock *sk,
                    struct sockaddr *uaddr,
                    int addr_len);
    int        (*disconnect)(struct sock *sk, int flags);

    struct sock * (*accept) (struct sock *sk, int flags, int *err);

    int        (*ioctl)(struct sock *sk, int cmd,
                    unsigned long arg);
    int        (*init)(struct sock *sk);
    void       (*destroy)(struct sock *sk);
    void       (*shutdown)(struct sock *sk, int how);
    int        (*setsockopt)(struct sock *sk, int level,
                    int optname, char __user *optval,
                    unsigned int optlen);
    int        (*getsockopt)(struct sock *sk, int level,
                    int optname, char __user *optval,
                    int __user *option);
#ifdef CONFIG_COMPAT
    int        (*compat_setsockopt)(struct sock *sk,
                    int level,
                    int optname, char __user *optval,
                    unsigned int optlen);
    int        (*compat_getsockopt)(struct sock *sk,
                    int level,
                    int optname, char __user *optval,
                    int __user *option);
#endif
    int        (*sendmsg)(struct kiocb *iocb, struct sock *sk,
                    struct msghdr *msg, size_t len);
    int        (*recvmsg)(struct kiocb *iocb, struct sock *sk,
                    struct msghdr *msg,
                    size_t len, int noblock, int flags,
                    int *addr_len);
    int        (*sendpage)(struct sock *sk, struct page *page,
                    int offset, size_t size, int flags);
    int        (*bind)(struct sock *sk,
                    struct sockaddr *uaddr, int addr_len);

    int        (*backlog_rcv) (struct sock *sk,
                    struct sk_buff *skb);

    /* Keeping track of sk's, looking them up, and port selection methods. */
    void       (*hash)(struct sock *sk);
    void       (*unhash)(struct sock *sk);
    int        (*get_port)(struct sock *sk, unsigned short snum);

    /* Keeping track of sockets in use */
#ifdef CONFIG_PROC_FS
    unsigned int      inuse_idx;
#endif

    /* Memory pressure */
    void       (*enter_memory_pressure)(struct sock *sk);
    atomic_t          *memory_allocated;  /* Current allocated memory. */
    struct percpu_counter *sockets_allocated; /* Current number of sockets. */
    /*
     * Pressure flag: try to collapse.
     * Technical note: it is used by multiple contexts non atomically.
     * All the __sk_mem_schedule() is of this nature: accounting
     * is strict, actions are advisory and have some latency.
     */
    int        *memory_pressure;
    int        *sysctl_mem;
    int        *sysctl_wmem;
    int        *sysctl_rmem;
    int        max_header;

    struct kmem_cache *slab;
    unsigned int      obj_size;
    int               slab_flags;

    struct percpu_counter *orphan_count;

    struct request_sock_ops  *rsk_prot;
    struct timewait_sock_ops *twsk_prot;

    union {
        struct inet_hashinfo *hashinfo;
        struct udp_table     *udp_table;
        struct raw_hashinfo  *raw_hash;
    } h;

    struct module     *owner;

    char              name[32];

    struct list_head  node;
#ifdef SOCK_REFCNT_DEBUG
    atomic_t          socks;
#endif
};

 

 

4.13  net_proto_family

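Identifies and registers a protocol family. Common protocol families include IPv4 and IPv6.

Protocol family: a set of protocols that together provide a particular function.

struct net_proto_family {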

    int    family;

    int    (*create)(struct net*net,struct socket *sock,

                int protocol,int kern);

    struct module *owner;

};

 

The kernel declares a large number of protocol families, but not every declared family is necessarily supported.

/* Supported address families. */

#define AF_UNSPEC 0

#define AF_UNIX      1   /* Unix domain sockets      */

#define AF_LOCAL  1   /* POSIX name for AF_UNIX   */

#define AF_INET      2   /* Internet IP Protocol */

#define AF_AX25      3   /* Amateur Radio AX.25      */

#define AF_IPX       4   /* Novell IPX        */

#define AF_APPLETALK 5   /* AppleTalk DDP     */

#define AF_NETROM 6   /* Amateur Radio NET/ROM    */

#define AF_BRIDGE 7   /* Multiprotocol bridge */

#define AF_ATMPVC 8   /* ATM PVCs          */

#define AF_X25       9   /* Reserved for X.25 project    */

#define AF_INET6  10  /* IP version 6          */

#define AF_ROSE      11  /* Amateur Radio X.25 PLP   */

#define AF_DECnet 12  /* Reserved for DECnet project  */

#define AF_NETBEUI   13  /* Reserved for 802.2LLC project*/

#define AF_SECURITY  14  /* Security callback pseudo AF */

#define AF_KEY       15      /* PF_KEY key management API */

#define AF_NETLINK   16

#define AF_ROUTE  AF_NETLINK /* Alias to emulate 4.4BSD */

#define AF_PACKET 17  /* Packet family     */

#define AF_ASH       18  /* Ash            */

#define AF_ECONET 19  /* Acorn Econet          */

#define AF_ATMSVC 20  /* ATM SVCs          */

#define AF_RDS       21  /* RDS sockets           */

#define AF_SNA       22  /* Linux SNA Project (nutters!) */

#define AF_IRDA      23  /* IRDA sockets          */

#define AF_PPPOX  24  /* PPPoX sockets     */

#define AF_WANPIPE   25  /* Wanpipe API Sockets */

#define AF_LLC       26  /* Linux LLC         */

#define AF_CAN       29  /* Controller Area Network      */

#define AF_TIPC      30  /* TIPC sockets          */

#define AF_BLUETOOTH 31  /* Bluetooth sockets     */

#define AF_IUCV      32  /* IUCV sockets          */

#define AF_RXRPC  33  /* RxRPC sockets     */

#define AF_ISDN      34  /* mISDN sockets     */

#define AF_PHONET 35  /* Phonet sockets    */

#define AF_IEEE802154    36  /* IEEE802154 sockets       */

#define AF_MAX       37  /* For now.. */

 

static const struct net_proto_family *net_families[NPROTO];
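The way a family number selects an entry in net_families[] can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: every demo_* name is hypothetical, the create callback is reduced to a single int argument, and the return codes are arbitrary. In the kernel, registration is done by sock_register().

```c
#include <stddef.h>

#define DEMO_NPROTO  37            /* hypothetical table size, mirrors AF_MAX above */
#define DEMO_AF_INET 2             /* same value as AF_INET above */

struct demo_family {               /* simplified stand-in for net_proto_family */
    int family;
    int (*create)(int protocol);
};

static const struct demo_family *demo_families[DEMO_NPROTO];

/* Register a family; fails if the slot is out of range or already taken. */
static int demo_sock_register(const struct demo_family *ops)
{
    if (ops->family < 0 || ops->family >= DEMO_NPROTO)
        return -1;
    if (demo_families[ops->family] != NULL)
        return -2;
    demo_families[ops->family] = ops;
    return 0;
}

/* Dispatch a socket-creation request to whichever family is registered. */
static int demo_sock_create(int family, int protocol)
{
    if (family < 0 || family >= DEMO_NPROTO || demo_families[family] == NULL)
        return -3;                 /* family declared but not supported */
    return demo_families[family]->create(protocol);
}

/* Toy create callback: returns a recognizable value instead of a socket. */
static int demo_inet_create(int protocol)
{
    return 1000 + protocol;
}

static const struct demo_family demo_inet_ops = {
    .family = DEMO_AF_INET,
    .create = demo_inet_create,
};
```

Calling demo_sock_create() before demo_sock_register(&demo_inet_ops) fails, which matches the remark above that a declared family is not necessarily a supported one.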

 

4.14  softnet_data

The kernel allocates one softnet_data instance for each CPU.

Each CPU therefore has its own queue for receiving packets.

/*

 * Incoming packets are placed on per-cpu queues so that

 * no locking is needed.

 */

struct softnet_data {

    struct Qdisc      *output_queue;

    struct list_head  poll_list;

    struct sk_buff      *completion_queue;

 

    /* Elements below can be accessed between CPUs for RPS */

    struct call_single_data  csd ____cacheline_aligned_in_smp;

    unsigned int            input_queue_head;

    struct sk_buff_head  input_pkt_queue;

    struct napi_struct   backlog;

};
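The per-CPU receive queue idea can be sketched as an array indexed by CPU number, where each CPU touches only its own slot. This is a simplified model under stated assumptions: the demo_* names, the CPU count, and the tiny backlog limit are hypothetical, and packets are only counted rather than stored.

```c
#define DEMO_NR_CPUS     4   /* hypothetical CPU count */
#define DEMO_MAX_BACKLOG 3   /* hypothetical per-CPU queue limit */

struct demo_softnet {
    unsigned int qlen;       /* packets waiting in this CPU's input queue */
    unsigned int dropped;    /* packets dropped because the queue was full */
};

/* One slot per CPU: a CPU only ever updates its own entry. */
static struct demo_softnet demo_softnet_data[DEMO_NR_CPUS];

/* Queue one received packet on the given CPU; 0 on success, -1 on drop. */
static int demo_netif_rx(unsigned int cpu)
{
    struct demo_softnet *sd;

    if (cpu >= DEMO_NR_CPUS)
        return -1;
    sd = &demo_softnet_data[cpu];
    if (sd->qlen >= DEMO_MAX_BACKLOG) {
        sd->dropped++;       /* queue full: the packet is discarded */
        return -1;
    }
    sd->qlen++;
    return 0;
}
```

Because a full queue on one CPU does not affect the others, a burst handled by CPU 0 leaves the queue of CPU 1 untouched.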

 

 

4.15  sk_buff

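Describes the attributes of a frame: the socket that owns it, its arrival time, the device it arrived on, the header positions for each layer, the next-hop route entry, the frame length, the checksum, and so on.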

/**

 *  struct sk_buff - socket buffer

 *  @next:Next buffer in list

 *  @prev:Previous buffer in list

 *  @sk:Socket we are owned by

 *  @tstamp:Time we arrived

 *  @dev:Device we arrived on/are leaving by

 *  @transport_header:Transport layer header

 *  @network_header:Network layer header

 *  @mac_header:Link layer header

 *  @_skb_dst:destination entry

 *  @sp:the security path, used for xfrm

 *  @cb:Control buffer. Free for use by every layer. Put private vars here

 *  @len:Length of actual data

 *  @data_len:Data length

 *  @mac_len:Length of link layer header

 *  @hdr_len:writable header length of cloned skb

 *  @csum:Checksum (must include start/offset pair)

 *  @csum_start:Offset from skb->head where checksumming should start

 *  @csum_offset:Offset from csum_start where checksum should be stored

 *  @local_df:allow local fragmentation

 *  @cloned:Head may be cloned (check refcnt to be sure)

 *  @nohdr:Payload reference only, must not modify header

 *  @pkt_type:Packet class

 *  @fclone:skbuff clone status

 *  @ip_summed:Driver fed us an IP checksum

 *  @priority:Packet queueing priority

 *  @users:User count - see {datagram,tcp}.c

 *  @protocol:Packet protocol from driver

 *  @truesize:Buffer size

 *  @head:Head of buffer

 *  @data:Data head pointer

 *  @tail:Tail pointer

 *  @end:End pointer

 *  @destructor:Destruct function

 *  @mark:Generic packet mark

 *  @nfct:Associated connection, if any

 *  @ipvs_property:skbuff is owned by ipvs

 *  @peeked: this packet has been seen already, so stats have been

 *     done for it, don't do them again

 *  @nf_trace:netfilter packet trace flag

 *  @nfctinfo:Relationship of this skb to the connection

 *  @nfct_reasm:netfilter conntrack re-assembly pointer

 *  @nf_bridge:Saved data about a bridged frame - see br_netfilter.c

 *  @iif:ifindex of device we arrived on

 *  @queue_mapping:Queue mapping for multiqueue devices

 *  @tc_index:Traffic control index

 *  @tc_verd:traffic control verdict

 *  @ndisc_nodetype:router type (from link layer)

 *  @dma_cookie: a cookie to one of several possible DMA operations

 *     done by skb DMA functions

 *  @secmark:security marking

 *  @vlan_tci:vlan tag control information

 */

struct sk_buff {

    /* These two members must be first. */

    struct sk_buff      *next;

    struct sk_buff      *prev;

 

    struct sock      *sk;

    ktime_t           tstamp;

    struct net_device*dev;

 

    unsigned long     _skb_dst;

#ifdef CONFIG_XFRM

    struct sec_path   *sp;

#endif

    /*

     * This is the control buffer. It is free to use for every

     * layer. Please put your private variables there. If you

     * want to keep them across layers you have to do a skb_clone()

     * first. This is owned by whoever has the skb queued ATM.

     */

    char          cb[48];

 

    unsigned int      len,

              data_len;

    __u16         mac_len,

              hdr_len;

    union {

       __wsum     csum;

       struct {

           __u16  csum_start;

           __u16  csum_offset;

       };

    };

    __u32         priority;

    kmemcheck_bitfield_begin(flags1);

    __u8          local_df:1,

              cloned:1,

              ip_summed:2,

              nohdr:1,

              nfctinfo:3;

    __u8          pkt_type:3,

               fclone:2,

              ipvs_property:1,

              peeked:1,

              nf_trace:1;

    __be16        protocol:16;

    kmemcheck_bitfield_end(flags1);

 

    void          (*destructor)(struct sk_buff*skb);

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)

    struct nf_conntrack *nfct;

    struct sk_buff      *nfct_reasm;

#endif

#ifdef CONFIG_BRIDGE_NETFILTER

    struct nf_bridge_info   *nf_bridge;

#endif

 

    int        iif;

#ifdef CONFIG_NET_SCHED

    __u16         tc_index; /* traffic control index */

#ifdef CONFIG_NET_CLS_ACT

    __u16         tc_verd;  /* traffic control verdict */

#endif

#endif

 

    kmemcheck_bitfield_begin(flags2);

    __u16         queue_mapping:16;

#ifdef CONFIG_IPV6_NDISC_NODETYPE

    __u8          ndisc_nodetype:2,

              deliver_no_wcard:1;

#else

    __u8          deliver_no_wcard:1;

#endif

#ifndef __GENKSYMS__

    __u8          ooo_okay:1;

#endif

    kmemcheck_bitfield_end(flags2);

 

    /* 0/13 bit hole */

 

#ifdef CONFIG_NET_DMA

    dma_cookie_t      dma_cookie;

#endif

#ifdef CONFIG_NETWORK_SECMARK

    __u32         secmark;

#endif

    union {

       __u32      mark;

       __u32      dropcount;

    };

 

    __u16         vlan_tci;

#ifndef __GENKSYMS__

    __u16         rxhash;

#endif

    sk_buff_data_t       transport_header;

    sk_buff_data_t       network_header;

    sk_buff_data_t       mac_header;

    /* These elements must be at the end, see alloc_skb() for details.  */

    sk_buff_data_t       tail;

    sk_buff_data_t       end;

    unsigned char     *head,

              *data;

    unsigned int      truesize;

    atomic_t      users;

};
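How the head, data, tail and end fields move as headers are added and removed can be sketched in user space. This is a minimal model, assuming plain pointers for all four fields (the kernel may store tail and end as offsets via sk_buff_data_t); the demo_* names are hypothetical stand-ins for the kernel helpers skb_reserve(), skb_put(), skb_push() and skb_pull().

```c
#include <stddef.h>

struct demo_skb {                 /* simplified stand-in, not the kernel struct */
    unsigned char *head, *data, *tail, *end;
    unsigned int len;             /* bytes between data and tail */
};

static void demo_skb_init(struct demo_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Leave headroom for headers that lower layers will prepend later. */
static void demo_skb_reserve(struct demo_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
static unsigned char *demo_skb_put(struct demo_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;              /* not enough tailroom */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend an n-byte header in the headroom (transmit path). */
static unsigned char *demo_skb_push(struct demo_skb *skb, unsigned int n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;              /* not enough headroom */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip an n-byte header from the front (receive path). */
static unsigned char *demo_skb_pull(struct demo_skb *skb, unsigned int n)
{
    if (skb->len < n)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

On transmit, a UDP sender would reserve 42 bytes (14 Ethernet + 20 IP + 8 UDP), put the payload, and let each layer push its own header; on receive, each layer pulls its header before handing the buffer upward. The payload itself is never copied.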

4.16  sk_buff_head

The packet queue structure: the head of a doubly linked list of sk_buff.

struct sk_buff_head {

    /* These two members must be first. */

    struct sk_buff    *next;

    struct sk_buff    *prev;

 

    __u32      qlen;

    spinlock_t lock;

};
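The reason next and prev must come first in both structures is that the queue head then acts as a sentinel node in a circular list. A minimal sketch of that idea follows, assuming a shared demo_links member placed first in both the packet and the queue (a hypothetical layout chosen to keep the pointer conversions strictly valid C). Locking, which the kernel does through the lock field, is omitted.

```c
#include <stddef.h>

struct demo_links {                  /* the two list pointers, shared by both types */
    struct demo_links *next, *prev;
};

struct demo_pkt {
    struct demo_links links;         /* must be first */
    int id;
};

struct demo_queue {
    struct demo_links links;         /* must be first: the head is the sentinel */
    unsigned int qlen;
};

static void demo_queue_init(struct demo_queue *q)
{
    q->links.next = q->links.prev = &q->links;   /* empty: points at itself */
    q->qlen = 0;
}

/* Append a packet at the tail of the queue. */
static void demo_queue_tail(struct demo_queue *q, struct demo_pkt *p)
{
    struct demo_links *last = q->links.prev;

    p->links.next = &q->links;
    p->links.prev = last;
    last->next = &p->links;
    q->links.prev = &p->links;
    q->qlen++;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
static struct demo_pkt *demo_dequeue(struct demo_queue *q)
{
    struct demo_links *first = q->links.next;

    if (first == &q->links)
        return NULL;
    q->links.next = first->next;
    first->next->prev = &q->links;
    first->next = first->prev = NULL;
    q->qlen--;
    return (struct demo_pkt *)first; /* valid: links is the first member */
}
```

Packets come out in the order they were queued, and qlen tracks the queue length without walking the list.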

4.17  net_device

This large structure describes all the attributes and data of a network device.

/*

 *  The DEVICE structure.

 *  Actually, this whole structure is a big mistake. It mixes I/O

 *  data with strictly "high-level" data, and it has to know about

 *  almost every data structure used in the INET module.

 *

 *  FIXME: cleanup struct net_device such that network protocol info

 *  moves out.

 */

 

struct net_device

{

 

    /*

     * This is the first field of the "visible" part of this structure

     * (i.e. as seen by users in the "Space.c" file).  It is the name

     * of the interface.

     */

    char          name[IFNAMSIZ];

    /* device name hash chain */

    struct hlist_node name_hlist;

    /* snmp alias */

    char          *ifalias;

 

    /*

     *  I/O specific fields

     *  FIXME: Merge these and struct ifmap into one

     */

    unsigned long     mem_end;   /* shared mem end */

    unsigned long     mem_start; /* shared mem start  */

    unsigned long     base_addr; /* device I/O address    */

    unsigned int      irq;       /* device IRQ number */

 

    /*

     *  Some hardware also needs these fields, but they are not

     *  part of the usual set specified in Space.c.

     */

 

    unsigned char     if_port;   /* Selectable AUI,TP,..*/

    unsigned char     dma;       /* DMA channel       */

 

    unsigned long     state;

 

    struct list_head  dev_list;

    struct list_head  napi_list;

 

    /* Net device features */

    unsigned long     features;

#define NETIF_F_SG       1   /* Scatter/gather IO. */

#define NETIF_F_IP_CSUM     2  /* Can checksum TCP/UDP over IPv4. */

#define NETIF_F_NO_CSUM     4  /* Does not require checksum. F.e. loopack. */

#define NETIF_F_HW_CSUM     8  /* Can checksum all the packets. */

#define NETIF_F_IPV6_CSUM   16 /* Can checksum TCP/UDP over IPV6 */

#define NETIF_F_HIGHDMA     32 /* Can DMA to high memory. */

#define NETIF_F_FRAGLIST 64  /* Scatter/gather IO. */

#define NETIF_F_HW_VLAN_TX  128/* Transmit VLAN hw acceleration */

#define NETIF_F_HW_VLAN_RX  256/* Receive VLAN hw acceleration */

#define NETIF_F_HW_VLAN_FILTER  512/* Receive filtering on VLAN */

#define NETIF_F_VLAN_CHALLENGED 1024  /* Device cannot handle VLAN packets */

#define NETIF_F_GSO      2048  /* Enable software GSO. */

#define NETIF_F_LLTX     4096  /* LockLess TX - deprecated. Please */

                  /* do not use LLTX in new drivers */

#define NETIF_F_NETNS_LOCAL 8192  /* Does not change network namespaces */

#define NETIF_F_GRO      16384 /* Generic receive offload */

#define NETIF_F_LRO      32768 /* large receive offload */

 

/* the GSO_MASK reserves bits 16 through 23 */

#define NETIF_F_FCOE_CRC    (1 << 24) /* FCoE CRC32 */

#define NETIF_F_SCTP_CSUM   (1 << 25) /* SCTP checksum offload */

#define NETIF_F_FCOE_MTU    (1 << 26) /* Supports max FCoE MTU, 2158 bytes */

#define NETIF_F_NTUPLE      (1 << 27) /* N-tuple filters supported */

#define NETIF_F_RXHASH      (1 << 28) /* Receive hashing offload */

#define NETIF_F_RXCSUM      (1 << 29) /* Receive checksumming offload */

 

    /* Segmentation offload features */

#define NETIF_F_GSO_SHIFT   16

#define NETIF_F_GSO_MASK 0x00ff0000

#define NETIF_F_TSO      (SKB_GSO_TCPV4<< NETIF_F_GSO_SHIFT)

#define NETIF_F_UFO      (SKB_GSO_UDP<< NETIF_F_GSO_SHIFT)

#define NETIF_F_GSO_ROBUST  (SKB_GSO_DODGY<< NETIF_F_GSO_SHIFT)

#define NETIF_F_TSO_ECN     (SKB_GSO_TCP_ECN<< NETIF_F_GSO_SHIFT)

#define NETIF_F_TSO6     (SKB_GSO_TCPV6<< NETIF_F_GSO_SHIFT)

#define NETIF_F_FSO      (SKB_GSO_FCOE<< NETIF_F_GSO_SHIFT)

#define NETIF_F_ALL_TSO (NETIF_F_TSO| NETIF_F_TSO6 | NETIF_F_TSO_ECN)

 

    /* List of features with software fallbacks. */

#define NETIF_F_GSO_SOFTWARE    (NETIF_F_TSO| NETIF_F_TSO_ECN | \

               NETIF_F_TSO6 | NETIF_F_UFO)

 

 

#define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM| NETIF_F_HW_CSUM)

#define NETIF_F_V4_CSUM     (NETIF_F_GEN_CSUM| NETIF_F_IP_CSUM)

#define NETIF_F_V6_CSUM     (NETIF_F_GEN_CSUM| NETIF_F_IPV6_CSUM)

#define NETIF_F_ALL_CSUM (NETIF_F_V4_CSUM| NETIF_F_V6_CSUM)

 

    /*

     * If one device supports one of these features, then enable them

     * for all in netdev_increment_features.

     */

#define NETIF_F_ONE_FOR_ALL (NETIF_F_GSO_SOFTWARE| NETIF_F_GSO_ROBUST | \

               NETIF_F_SG | NETIF_F_HIGHDMA |    \

               NETIF_F_FRAGLIST)

 

    /* Interface index. Unique device identifier  */

    int        ifindex;

    int        iflink;

 

    struct net_device_stats  stats;

 

#ifdef CONFIG_WIRELESS_EXT

    /* List of functions to handle Wireless Extensions (instead of ioctl).

     * See <net/iw_handler.h> for details. Jean II */

    const struct iw_handler_def *   wireless_handlers;

    /* Instance data managed by the core of Wireless Extensions. */

    struct iw_public_data*  wireless_data;

#endif

    /* Management operations */

    const struct net_device_ops *netdev_ops;

    const struct ethtool_ops *ethtool_ops;

 

    /* Hardware header description */

    const struct header_ops *header_ops;

 

    unsigned int      flags; /* interface flags (a la BSD)   */

    unsigned short       gflags;

    unsigned short       priv_flags; /* Like 'flags' but invisible to userspace. */

    unsigned short       padded;    /* How much padding added by alloc_netdev() */

 

    unsigned char     operstate; /* RFC2863 operstate */

    unsigned char     link_mode; /* mapping policy to operstate */

 

    unsigned      mtu;  /* interface MTU value      */

    unsigned short       type;  /* interface hardware type  */

    unsigned short       hard_header_len; /* hardware hdr length   */

 

    /* extra head- and tailroom the hardware may need, but not in all cases

     * can this be guaranteed, especially tailroom. Some cases also use

     * LL_MAX_HEADER instead to allocate the skb.

     */

    unsigned short       needed_headroom;

    unsigned short       needed_tailroom;

 

    struct net_device *master; /* Pointer to master device of a group,

                    * which this device is member of.

                    */

 

    /* Interface address info. */

    unsigned char     perm_addr[MAX_ADDR_LEN]; /* permanent hw address */

    unsigned char     addr_assign_type; /* hw address assignment type */

    unsigned char     addr_len;  /* hardware address length  */

    unsigned short    dev_id;    /* for shared network cards */

 

    struct netdev_hw_addr_list  uc;/* Secondary unicast

                        mac addresses */

    int        uc_promisc;

    spinlock_t    addr_list_lock;

    struct dev_addr_list*mc_list; /* Multicast mac addresses  */

    int        mc_count; /* Number of installed mcasts   */

    unsigned int      promiscuity;

    unsigned int      allmulti;

 

 

    /* Protocol specific pointers */

   

#ifdef CONFIG_NET_DSA

    void          *dsa_ptr; /* dsa specific data */

#endif

    void          *atalk_ptr;  /* AppleTalk link    */

    void          *ip_ptr;  /* IPv4 specific data    */

    void                   *dn_ptr;       /* DECnet specific data */

    void                   *ip6_ptr;      /* IPv6 specific data */

    void          *ec_ptr;  /* Econet specific data  */

    void          *ax25_ptr;/* AX.25 specific data

                        also used by openvswitch */

    struct wireless_dev *ieee80211_ptr;  /* IEEE 802.11 specific data,

                        assign before registering */

 

/*

 * Cache line mostly used on receive path (including eth_type_trans())

 */

    unsigned long     last_rx;   /* Time of last Rx   */

    /* Interface address info used in eth_type_trans() */

    unsigned char     *dev_addr;/* hw address,(before bcast

                        because most packets are

                        unicast) */

 

    struct netdev_hw_addr_list  dev_addrs;/* list of device

                           hw addresses */

 

    unsigned char     broadcast[MAX_ADDR_LEN];/* hw bcast add   */

 

    struct netdev_queue  rx_queue;

 

    struct netdev_queue *_tx ____cacheline_aligned_in_smp;

 

    /* Number of TX queues allocated at alloc_netdev_mq() time  */

    unsigned int      num_tx_queues;

 

    /* Number of TX queues currently active in device  */

    unsigned int      real_num_tx_queues;

 

    /* root qdisc from userspace point of view */

    struct Qdisc      *qdisc;

 

    unsigned long     tx_queue_len; /* Max frames per queue allowed */

    spinlock_t    tx_global_lock;

/*

 * One part is mostly used on xmit path (device)

 */

    /* These may be needed for future network-power-down code. */

 

    /*

     * trans_start here is expensive for high speed devices on SMP,

     * please use netdev_queue->trans_start instead.

     */

    unsigned long     trans_start;  /* Time (in jiffies) of last Tx */

 

    int        watchdog_timeo; /* used by dev_watchdog() */

    struct timer_list watchdog_timer;

 

    /* Number of references to this device */

    atomic_t      refcnt ____cacheline_aligned_in_smp;

 

    /* delayed register/unregister */

    struct list_head  todo_list;

    /* device index hash chain */

    struct hlist_node index_hlist;

 

    struct net_device *link_watch_next;

 

    /* register/unregister state machine */

    enum { NETREG_UNINITIALIZED=0,

           NETREG_REGISTERED,    /* completed register_netdevice */

           NETREG_UNREGISTERING, /* called unregister_netdevice */

           NETREG_UNREGISTERED,  /* completed unregister todo */

           NETREG_RELEASED,      /* called free_netdev */

           NETREG_DUMMY,     /* dummy device for NAPI poll */

    } reg_state;

 

    /* Called from unregister, can be used to call free_netdev */

    void (*destructor)(struct net_device*dev);

 

#ifdef CONFIG_NETPOLL

    struct netpoll_info *npinfo;

#endif

 

#ifdef CONFIG_NET_NS

    /* Network namespace this network device is inside */

    struct net    *nd_net;

#endif

 

    /* mid-layer private */

    void          *ml_priv;

 

    /* bridge stuff */

    struct net_bridge_port  *br_port;

    /* macvlan */

    struct macvlan_port *macvlan_port;

    /* GARP */

    struct garp_port  *garp_port;

 

    /* class/net/name entry */

    struct device     dev;

    /* space for optional statistics and wireless sysfs groups */

    const struct attribute_group *sysfs_groups[3];

 

    /* rtnetlink link ops */

    const struct rtnl_link_ops *rtnl_link_ops;

 

    /* VLAN feature mask */

    unsigned long vlan_features;

 

    /* for setting kernel sock attribute on TCP connection setup */

#define GSO_MAX_SIZE     65536

    unsigned int      gso_max_size;

 

#ifdef CONFIG_DCB

    /* Data Center Bridging netlink ops */

    const struct dcbnl_rtnl_ops *dcbnl_ops;

#endif

 

#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)

    /* max exchange id for FCoE LRO by ddp */

    unsigned int      fcoe_ddp_xid;

#endif

};
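The features field is a bitmask, so capability checks are simple AND tests against the NETIF_F_* values listed in the structure. A small sketch follows, using the numeric values shown above under hypothetical DEMO_F_* names; the helper function is an illustration, not a kernel API.

```c
/* Same bit values as the NETIF_F_* definitions listed above. */
#define DEMO_F_SG        1UL
#define DEMO_F_IP_CSUM   2UL
#define DEMO_F_NO_CSUM   4UL
#define DEMO_F_HW_CSUM   8UL
#define DEMO_F_GEN_CSUM  (DEMO_F_NO_CSUM | DEMO_F_HW_CSUM)
#define DEMO_F_V4_CSUM   (DEMO_F_GEN_CSUM | DEMO_F_IP_CSUM)

/*
 * Nonzero if software does not have to compute a TCP/UDP-over-IPv4
 * checksum for this device: any one of the three bits is enough.
 */
static int demo_can_skip_v4_csum(unsigned long features)
{
    return (features & DEMO_F_V4_CSUM) != 0;
}
```

A device advertising only scatter/gather (DEMO_F_SG) would still need the checksum computed in software, while one advertising DEMO_F_HW_CSUM would not.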

 

 

4.18  inet_protosw

Registers the socket-layer operation interfaces with the IP layer.

/* This is used to register socket interfaces for IP protocols.  */

struct inet_protosw {

    struct list_head list;

 

    /* These two fields form the lookup key.  */

    unsigned short    type;     /* This is the 2nd argument to socket(2). */

    unsigned short    protocol; /* This is the L4 protocol number.  */

 

    struct proto *prot;

    const struct proto_ops *ops;

 

    char             no_check;  /* checksum on rcv/xmit/none? */

    unsigned char flags;     /* See INET_PROTOSW_* below.  */

};

 

4.19  inetsw_array

The operation interfaces that the socket layer uses to call into the IP layer are all registered in this array.

/* Upon startup we insert all the elements in inetsw_array[] into

 * the linked list inetsw.

 */

static struct inet_protosw inetsw_array[] =

{

    {

       .type =      SOCK_STREAM,

       .protocol=  IPPROTO_TCP,

       .prot =      &tcp_prot,

       .ops =        &inet_stream_ops,

       .no_check=  0,

       .flags =     INET_PROTOSW_PERMANENT |

                 INET_PROTOSW_ICSK,

    },

 

    {

       .type =      SOCK_DGRAM,

       .protocol =  IPPROTO_UDP,

       .prot =       &udp_prot,

       .ops =        &inet_dgram_ops,

       .no_check=  UDP_CSUM_DEFAULT,

       .flags =     INET_PROTOSW_PERMANENT,

       },

 

       {

       .type =      SOCK_DGRAM,

       .protocol=  IPPROTO_ICMP,

       .prot =      &ping_prot,

       .ops =        &inet_dgram_ops,

       .no_check=  UDP_CSUM_DEFAULT,

       .flags =     INET_PROTOSW_REUSE,

       },

 

       {

           .type=      SOCK_RAW,

           .protocol=  IPPROTO_IP, /* wild card */

           .prot=      &raw_prot,

           .ops =        &inet_sockraw_ops,

           .no_check=  UDP_CSUM_DEFAULT,

           .flags=     INET_PROTOSW_REUSE,

       }

};
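The (type, protocol) lookup key and the IPPROTO_IP wild card can be sketched as a table search. This is a simplified model of the matching rules, assuming the table order shown above; the demo_* names are hypothetical and the prot pointers are replaced by strings.

```c
#include <stddef.h>

enum { DEMO_SOCK_STREAM = 1, DEMO_SOCK_DGRAM = 2, DEMO_SOCK_RAW = 3 };
enum { DEMO_IPPROTO_IP = 0, DEMO_IPPROTO_ICMP = 1,
       DEMO_IPPROTO_TCP = 6, DEMO_IPPROTO_UDP = 17 };

struct demo_protosw {             /* simplified stand-in for inet_protosw */
    int type;
    int protocol;
    const char *prot_name;
};

static const struct demo_protosw demo_inetsw[] = {
    { DEMO_SOCK_STREAM, DEMO_IPPROTO_TCP,  "tcp_prot"  },
    { DEMO_SOCK_DGRAM,  DEMO_IPPROTO_UDP,  "udp_prot"  },
    { DEMO_SOCK_DGRAM,  DEMO_IPPROTO_ICMP, "ping_prot" },
    { DEMO_SOCK_RAW,    DEMO_IPPROTO_IP,   "raw_prot"  },
};

/* Find the entry for a socket(family, type, protocol) request, or NULL. */
static const struct demo_protosw *demo_lookup(int type, int protocol)
{
    size_t i;

    for (i = 0; i < sizeof(demo_inetsw) / sizeof(demo_inetsw[0]); i++) {
        const struct demo_protosw *p = &demo_inetsw[i];

        if (p->type != type)
            continue;
        if (protocol == p->protocol)         /* exact match */
            return p;
        if (protocol == DEMO_IPPROTO_IP)     /* caller passed 0: first entry of this type */
            return p;
        if (p->protocol == DEMO_IPPROTO_IP)  /* table entry is a wild card */
            return p;
    }
    return NULL;
}
```

With these rules a SOCK_DGRAM request with protocol 0 resolves to UDP, and a SOCK_RAW request matches raw_prot for any protocol number because that entry is the wild card.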

4.20  sock_type

Socket types.

/**

 * enum sock_type - Socket types

 * @SOCK_STREAM: stream (connection) socket

 * @SOCK_DGRAM: datagram (conn.less) socket

 * @SOCK_RAW: raw socket

 * @SOCK_RDM: reliably-delivered message

 * @SOCK_SEQPACKET: sequential packet socket

 * @SOCK_DCCP: Datagram Congestion Control Protocol socket

 * @SOCK_PACKET: linux specific way of getting packets at the dev level.

 *       For writing rarp and other similar things on the user level.

 *

 * When adding some new socket type please

 * grep ARCH_HAS_SOCKET_TYPE include/asm-* /socket.h, at least MIPS

 * overrides this enum for binary compat reasons.

 */

enum sock_type {

    SOCK_STREAM   =1,

    SOCK_DGRAM = 2,

    SOCK_RAW   =3,

    SOCK_RDM   =4,

    SOCK_SEQPACKET    =5,

    SOCK_DCCP  =6,

    SOCK_PACKET   =10,

};

4.21  IPPROTO

Transport-layer protocol type IDs.

/* Standard well-defined IP protocols. */

enum {

  IPPROTO_IP = 0,     /* Dummy protocol for TCP       */

  IPPROTO_ICMP =1,     /* Internet Control Message Protocol   */

  IPPROTO_IGMP =2,     /* Internet Group Management Protocol  */

  IPPROTO_IPIP =4,     /* IPIP tunnels (older KA9Q tunnels use 94) */

  IPPROTO_TCP = 6,       /* Transmission Control Protocol   */

  IPPROTO_EGP = 8,       /* Exterior Gateway Protocol       */

  IPPROTO_PUP = 12,      /* PUP protocol             */

  IPPROTO_UDP = 17,      /* User Datagram Protocol       */

  IPPROTO_IDP = 22,      /* XNS IDP protocol         */

  IPPROTO_DCCP =33,    /* Datagram Congestion Control Protocol */

  IPPROTO_RSVP =46,    /* RSVP protocol         */

  IPPROTO_GRE = 47,      /* Cisco GRE tunnels (rfc 1701,1702)   */

 

  IPPROTO_IPV6 = 41,     /* IPv6-in-IPv4 tunnelling      */

 

  IPPROTO_ESP = 50,      /* Encapsulation Security Payload protocol */

  IPPROTO_AH = 51,       /* Authentication Header protocol       */

  IPPROTO_BEETPH = 94,   /* IP option pseudo header for BEET */

  IPPROTO_PIM = 103,     /* Protocol Independent Multicast  */

 

  IPPROTO_COMP = 108,    /* Compression Header protocol */

  IPPROTO_SCTP = 132,    /* Stream Control Transport Protocol   */

  IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828)          */

 

  IPPROTO_RAW  =255,   /* Raw IP packets        */

  IPPROTO_MAX

};

 

/* The inetsw table contains everything that inet_create needs to

 * build a new socket.

 */

static struct list_head inetsw[SOCK_MAX];
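From user space, the family, socket type and protocol ID described in the preceding sections are exactly the three arguments of socket(2). A short sketch using the standard POSIX calls follows; the demo_* function names are hypothetical wrappers added for illustration.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * AF_INET selects the protocol family, SOCK_DGRAM the socket type,
 * and IPPROTO_UDP the transport protocol.
 * Returns a file descriptor, or -1 on failure.
 */
static int demo_open_udp_socket(void)
{
    return socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
}

/* Ask the kernel which socket type it recorded for this descriptor. */
static int demo_socket_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return -1;
    return type;
}
```

The caller is responsible for closing the descriptor with close() when it is no longer needed.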

 

 

4.22  net_protocol

Used by transport-layer protocols to register their packet-receive interfaces with the IP layer.

/* This is used to register protocols. */

struct net_protocol {

    int        (*handler)(struct sk_buff *skb);

    void          (*err_handler)(struct sk_buff*skb, u32 info);

    int        (*gso_send_check)(struct sk_buff*skb);

    struct sk_buff          *(*gso_segment)(struct sk_buff*skb,

                         int features);

    struct sk_buff         **(*gro_receive)(struct sk_buff**head,

                         struct sk_buff*skb);

    int        (*gro_complete)(struct sk_buff*skb);

    unsigned int      no_policy:1,

              netns_ok:1;

};

 

Example: the interface that UDP registers with the IP layer.

static const struct net_protocol udp_protocol = {

    .handler = udp_rcv,

    .err_handler=    udp_err,

    .gso_send_check= udp4_ufo_send_check,

    .gso_segment= udp4_ufo_fragment,

    .no_policy =  1,

    .netns_ok =   1,

};

 

The IP layer's packet-receive interfaces are all registered in this array.

extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];
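How the IP layer uses this array to hand a packet to the right transport protocol can be sketched as an index by protocol number. This is a simplified model: the demo_* names are hypothetical, the handler takes a raw payload instead of an sk_buff, and the table size of 256 is an assumption. In the kernel, registration is done by inet_add_protocol().

```c
#include <stddef.h>

#define DEMO_MAX_INET_PROTOS 256   /* assumed table size, one slot per protocol number */

struct demo_net_protocol {         /* simplified stand-in for net_protocol */
    int (*handler)(const unsigned char *payload, size_t len);
};

static const struct demo_net_protocol *demo_inet_protos[DEMO_MAX_INET_PROTOS];

/* Register a transport protocol; fails if the slot is already occupied. */
static int demo_add_protocol(const struct demo_net_protocol *prot, unsigned char num)
{
    unsigned int hash = num & (DEMO_MAX_INET_PROTOS - 1);

    if (demo_inet_protos[hash] != NULL)
        return -1;
    demo_inet_protos[hash] = prot;
    return 0;
}

/* Deliver an IP payload to the handler chosen by the IP header's protocol field. */
static int demo_ip_local_deliver(unsigned char proto_num,
                                 const unsigned char *payload, size_t len)
{
    const struct demo_net_protocol *prot =
        demo_inet_protos[proto_num & (DEMO_MAX_INET_PROTOS - 1)];

    if (prot == NULL)
        return -1;                 /* nobody registered for this protocol */
    return prot->handler(payload, len);
}

/* Toy UDP receive handler: reports how many bytes it was given. */
static int demo_udp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload;
    return (int)len;
}

static const struct demo_net_protocol demo_udp_protocol = {
    .handler = demo_udp_rcv,
};
```

After demo_add_protocol(&demo_udp_protocol, 17), packets whose protocol field is 17 reach demo_udp_rcv(), while packets for unregistered protocols are rejected.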

4.23  packet_type

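Describes how one class of Ethernet packets is handled: it holds the Ethernet frame type, the packet-processing function, and so on.

struct packet_type {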

    __be16        type;  /* This is really htons(ether_type). */

    struct net_device *dev; /* NULL is wildcarded here       */

    int        (*func) (struct sk_buff*,

                   struct net_device*,

                   struct packet_type*,

                   struct net_device*);

    struct sk_buff      *(*gso_segment)(struct sk_buff*skb,

                     int features);

    int        (*gso_send_check)(struct sk_buff*skb);

    struct sk_buff      **(*gro_receive)(struct sk_buff**head,

                         struct sk_buff*skb);

    int        (*gro_complete)(struct sk_buff*skb);

    void          *af_packet_priv;

    struct list_head  list;

};

 

The packet-processing interface that the IP protocol registers with the link layer.

/*

 *  IP protocol layer initialiser

 */

static struct packet_type ip_packet_type = {

    .type = cpu_to_be16(ETH_P_IP),

    .func = ip_rcv,

    .gso_send_check= inet_gso_send_check,

    .gso_segment= inet_gso_segment,

    .gro_receive= inet_gro_receive,

    .gro_complete= inet_gro_complete,

};

/*

 *  These are the defined Ethernet Protocol ID's.

 */

 

#define ETH_P_LOOP   0x0060    /* Ethernet Loopback packet */

#define ETH_P_PUP 0x0200     /* Xerox PUP packet      */

#define ETH_P_PUPAT  0x0201    /* Xerox PUP Addr Trans packet  */

#define ETH_P_IP  0x0800    /* Internet Protocol packet */

#define ETH_P_X25 0x0805     /* CCITT X.25        */

#define ETH_P_ARP 0x0806     /* Address Resolution packet    */

#define ETH_P_BPQ  0x08FF    /* G8BPQ AX.25 Ethernet Packet  [ NOT AN OFFICIALLY REGISTERED ID ] */

#define ETH_P_IEEEPUP    0x0a00    /* Xerox IEEE802.3 PUP packet */

#define ETH_P_IEEEPUPAT  0x0a01    /* Xerox IEEE802.3 PUP Addr Trans packet */

#define ETH_P_DEC       0x6000         /* DEC Assigned proto           */

#define ETH_P_DNA_DL    0x6001         /* DEC DNA Dump/Load            */

#define ETH_P_DNA_RC    0x6002         /* DEC DNA Remote Console       */

#define ETH_P_DNA_RT    0x6003         /* DEC DNA Routing              */

#define ETH_P_LAT       0x6004         /* DEC LAT                      */

#define ETH_P_DIAG      0x6005         /* DEC Diagnostics              */

#define ETH_P_CUST      0x6006         /* DEC Customer use             */

#define ETH_P_SCA       0x6007         /* DEC Systems Comms Arch       */

#define ETH_P_TEB 0x6558     /* Trans Ether Bridging     */

#define ETH_P_RARP      0x8035      /* Reverse Addr Res packet  */

#define ETH_P_ATALK  0x809B    /* Appletalk DDP     */

#define ETH_P_AARP   0x80F3    /* Appletalk AARP    */

#define ETH_P_8021Q  0x8100         /* 802.1Q VLAN Extended Header  */

#define ETH_P_IPX 0x8137     /* IPX over DIX          */

#define ETH_P_IPV6   0x86DD    /* IPv6 over bluebook       */

#define ETH_P_PAUSE  0x8808    /* IEEE Pause frames. See 802.3 31B */

#define ETH_P_SLOW   0x8809    /* Slow Protocol. See 802.3ad 43B */

#define ETH_P_WCCP   0x883E    /* Web-cache coordination protocol

                   * defined in draft-wilson-wrec-wccp-v2-00.txt*/

#define ETH_P_PPP_DISC   0x8863    /* PPPoE discovery messages     */

#define ETH_P_PPP_SES    0x8864    /* PPPoE session messages   */

#define ETH_P_MPLS_UC    0x8847    /* MPLS Unicast traffic     */

#define ETH_P_MPLS_MC    0x8848    /* MPLS Multicast traffic   */

#define ETH_P_ATMMPOA    0x884c    /* MultiProtocol Over ATM   */

#define ETH_P_ATMFATE    0x8884    /* Frame-based ATM Transport

                   * over Ethernet

                   */

#define ETH_P_PAE 0x888E     /* Port Access Entity (IEEE 802.1X) */

#define ETH_P_AOE 0x88A2     /* ATA over Ethernet     */

#define ETH_P_TIPC   0x88CA    /* TIPC           */

#define ETH_P_1588   0x88F7    /* IEEE 1588 Timesync */

#define ETH_P_FCOE   0x8906    /* Fibre Channel over Ethernet  */

#define ETH_P_TDLS   0x890D    /* TDLS */

#define ETH_P_FIP 0x8914     /* FCoE Initialization Protocol */

#define ETH_P_EDSA   0xDADA    /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */

#define ETH_P_AF_IUCV   0xFBFB     /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/


 

The handler functions that the network layer registers with the link layer are collected in this array.

static struct list_head ptype_base[PTYPE_HASH_SIZE];
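Dispatch through a hashed handler table such as ptype_base can be sketched as follows. This is a simplified model: the demo_* names are hypothetical, each bucket is a plain singly linked list, the table size of 16 is an assumption, and the EtherType is kept in host byte order for readability.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PTYPE_HASH_SIZE 16    /* assumed bucket count */
#define DEMO_PTYPE_HASH_MASK (DEMO_PTYPE_HASH_SIZE - 1)

struct demo_packet_type {          /* simplified stand-in for packet_type */
    uint16_t type;                 /* EtherType, host byte order in this sketch */
    int (*func)(const unsigned char *frame, size_t len);
    struct demo_packet_type *next; /* next handler in the same bucket */
};

static struct demo_packet_type *demo_ptype_base[DEMO_PTYPE_HASH_SIZE];

/* Register a handler: hash the EtherType and push onto that bucket. */
static void demo_dev_add_pack(struct demo_packet_type *pt)
{
    unsigned int hash = pt->type & DEMO_PTYPE_HASH_MASK;

    pt->next = demo_ptype_base[hash];
    demo_ptype_base[hash] = pt;
}

/* Find the handler for a received frame's EtherType and call it. */
static int demo_deliver(uint16_t type, const unsigned char *frame, size_t len)
{
    struct demo_packet_type *pt;

    for (pt = demo_ptype_base[type & DEMO_PTYPE_HASH_MASK]; pt; pt = pt->next)
        if (pt->type == type)      /* buckets can hold several types */
            return pt->func(frame, len);
    return -1;                     /* no handler: the frame is dropped */
}

/* Toy handlers that return a recognizable value. */
static int demo_ip_rcv(const unsigned char *f, size_t n)  { (void)f; (void)n; return 4; }
static int demo_arp_rcv(const unsigned char *f, size_t n) { (void)f; (void)n; return 6; }

static struct demo_packet_type demo_ip_packet_type  = { 0x0800, demo_ip_rcv,  NULL };
static struct demo_packet_type demo_arp_packet_type = { 0x0806, demo_arp_rcv, NULL };
```

ETH_P_IP (0x0800) and ETH_P_PUP (0x0200) fall into the same bucket in this sketch, which is why the lookup compares the full type and does not trust the bucket index alone.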

4.24  rtable

The routing table structure: it describes the complete form of one route entry.

struct rtable {

    union

    {

       struct dst_entry  dst;

    } u;

 

    /* Cache lookup keys */

    struct flowi      fl;

 

    struct in_device  *idev;

   

    int        rt_genid;

    unsigned      rt_flags;

    __u16         rt_type;

 

    __be32        rt_dst;   /* Path destination  */

    __be32        rt_src;   /* Path source       */

    int        rt_iif;

 

    /* Info on neighbour */

    __be32        rt_gateway;

 

    /* Miscellaneous cached information*/

    __be32        rt_spec_dst;/* RFC1122 specific destination */

    struct inet_peer  *peer;/* long-living peerinfo */

};

 

4.25  rt_hash_bucket

The route cache.

/*

 * Route cache.

 */

 

/* The locking scheme is rather straight forward:

 *

 * 1) Read-Copy Update protects the buckets of the central route hash.

 * 2) Only writers remove entries, and they hold the lock

 *   as they look at rtable reference counts.

 * 3) Only readers acquire references to rtable entries,

 *   they do so with atomic increments and with the

 *   lock held.

 */

 

struct rt_hash_bucket {

    struct rtable *chain;

};

 

4.26  dst_entry

Describes where a packet goes next: whether it stays on this host or is sent on, its next hop, and other key routing information.

/* Each dst_entry has reference count and sits in some parent list(s).

 * When it is removed from parent list, it is "freed" (dst_free).

 * After this it enters dead state (dst->obsolete > 0) and if its refcnt

 * is zero, it can be destroyed immediately, otherwise it is added

 * to gc list and garbage collector periodically checks the refcnt.

 */

struct dst_entry

{

    struct rcu_head      rcu_head;

    struct dst_entry  *child;

    struct net_device      *dev;

    short         error;

    short         obsolete;

    int        flags;

#define DST_HOST     1

#define DST_NOXFRM       2

#define DST_NOPOLICY     4

#define DST_NOHASH       8

    unsigned long     expires;

 

    unsigned short       header_len;  /* more space at head required */

    unsigned short       trailer_len; /* space to reserve at tail */

 

    unsigned int      rate_tokens;

    unsigned long     rate_last; /* rate limiting for ICMP */

 

    struct dst_entry  *path;

 

    struct neighbour  *neighbour;

    struct hh_cache     *hh;

#ifdef CONFIG_XFRM

    struct xfrm_state *xfrm;

#else

    void          *__pad1;

#endif

    int        (*input)(struct sk_buff*);

    int        (*output)(struct sk_buff*);

 

    struct dst_ops          *ops;

 

/* This Red Hat kABI workaround will shift tclassid 32 bit, while we

 * still keep the original size of dst_entry and assures alignment

 * (see further down).

 */

#ifdef __GENKSYMS__

    u32        metrics[RTAX_MAX_ORIG];

#else

    u32        metrics[RTAX_MAX];

#endif

 

#ifdef CONFIG_NET_CLS_ROUTE

    __u32         tclassid;

#else

    __u32         __pad2;

#endif

 

 

    /*

     * Align __refcnt to a 64 bytes alignment

     * (L1_CACHE_SIZE would be too much)

     */

/* Red Hat kABI workaround to assure aligning __refcnt, while

 * consuming 32 bit of padding for our metrics expansion above.

 * On 32bit archs not padding remains.

 */

#ifdef __GENKSYMS__

#ifdef CONFIG_64BIT

    long          __pad_to_align_refcnt[2];

#else

    long          __pad_to_align_refcnt[1];

#endif

#else  /* __GENKSYMS__ */

#ifdef CONFIG_64BIT

    u32        __pad_hole_in_struct;

    long          __pad_to_align_refcnt[1];

#endif

#endif /*__GENKSYMS__ */

 

    /*

     * __refcnt wants to be on a different cacheline from

     * input/output/ops or performance tanks badly

     */

    atomic_t      __refcnt; /* client references */

    int        __use;

    unsigned long     lastuse;

    union {

       struct dst_entry*next;

       struct rtable   *rt_next;

       struct rt6_info  *rt6_next;

       struct dn_route *dn_next;

    };

};
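
The comment that precedes struct dst_entry describes a lifecycle: once an entry is removed from its parent list it becomes dead (obsolete > 0); if its reference count is zero it is destroyed at once, otherwise it waits on a GC list until the count drops. The following is a minimal user-space sketch of that idea only, not kernel code. The names toy_dst, toy_dst_hold, toy_dst_release, toy_dst_free and toy_dst_gc are hypothetical, and a plain int stands in for atomic_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct dst_entry. */
struct toy_dst {
    int refcnt;             /* client references (__refcnt in the kernel) */
    short obsolete;         /* > 0 once the entry is dead */
    bool destroyed;         /* set when the entry would be released */
    struct toy_dst *next;   /* link used while waiting on the GC list */
};

/* Dead entries that still have users. */
static struct toy_dst *toy_gc_list;

/* Take a client reference. */
static void toy_dst_hold(struct toy_dst *d)
{
    d->refcnt++;
}

/* Drop a client reference; dropping more than was taken is a bug. */
static void toy_dst_release(struct toy_dst *d)
{
    assert(d->refcnt > 0);
    d->refcnt--;
}

/* Called once the entry has been removed from its parent list. */
static void toy_dst_free(struct toy_dst *d)
{
    d->obsolete = 1;                 /* enter the dead state */
    if (d->refcnt == 0) {
        d->destroyed = true;         /* no users: destroy immediately */
    } else {
        d->next = toy_gc_list;       /* still referenced: defer to the GC */
        toy_gc_list = d;
    }
}

/* One periodic GC pass: destroy every dead entry whose refcnt is zero.
 * Returns the number of entries destroyed in this pass. */
static int toy_dst_gc(void)
{
    int destroyed = 0;
    struct toy_dst **pp = &toy_gc_list;

    while (*pp != NULL) {
        struct toy_dst *d = *pp;

        if (d->refcnt == 0) {
            *pp = d->next;           /* unlink from the GC list */
            d->next = NULL;
            d->destroyed = true;
            destroyed++;
        } else {
            pp = &d->next;           /* still in use: retry on a later pass */
        }
    }
    return destroyed;
}
```

An entry that is still held when toy_dst_free() runs survives the first GC pass and is only destroyed by a pass that runs after the last toy_dst_release().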

 

4.27  napi_struct

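The structure used for NAPI scheduling.

NAPI: NAPI is a technique used in Linux to improve network processing efficiency. Its core idea is to avoid taking an interrupt for every packet: an interrupt first wakes up the receive service routine, which then uses the poll method to fetch packets by polling. NAPI is well suited to handling high-rate streams of short packets.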

/*
 * Structure for NAPI scheduling similar to tasklet but with weighting
 */
struct napi_struct {
    /* The poll_list must only be managed by the entity which
     * changes the state of the NAPI_STATE_SCHED bit.  This means
     * whoever atomically sets that bit can add this napi_struct
     * to the per-cpu poll_list, and whoever clears that bit
     * can remove from the list right before clearing the bit.
     */
    struct list_head      poll_list;

    unsigned long         state;
    int                   weight;
    int                   (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
    spinlock_t            poll_lock;
    int                   poll_owner;
#endif

    unsigned int          gro_count;

    struct net_device     *dev;
    struct list_head      dev_list;
    struct sk_buff        *gro_list;
    struct sk_buff        *skb;
};
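
To make the roles of weight, poll and the NAPI_STATE_SCHED bit more concrete, here is a small user-space simulation. It is a sketch under simplifying assumptions, not the kernel implementation: toy_napi, toy_napi_schedule, toy_poll and toy_net_rx_action are hypothetical names, a bool replaces the state bit, and a counter replaces the receive ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct napi_struct. */
struct toy_napi {
    bool scheduled;   /* stands in for the NAPI_STATE_SCHED bit */
    int weight;       /* maximum packets a single poll call may process */
    int pending;      /* packets waiting in the simulated RX ring */
    int (*poll)(struct toy_napi *, int);
};

/* Interrupt side: mark the instance as scheduled.
 * Returns true only for the caller that actually set the flag. */
static bool toy_napi_schedule(struct toy_napi *n)
{
    if (n->scheduled)
        return false;
    n->scheduled = true;
    return true;
}

/* Driver poll callback: process at most `budget` packets.
 * If the ring is drained before the budget is used up, leave polled mode. */
static int toy_poll(struct toy_napi *n, int budget)
{
    assert(budget > 0);

    int work = n->pending < budget ? n->pending : budget;

    n->pending -= work;
    if (work < budget)
        n->scheduled = false;
    return work;
}

/* Softirq side: keep polling while the instance is scheduled and the
 * overall budget is not exhausted. Returns the packets processed. */
static int toy_net_rx_action(struct toy_napi *n, int total_budget)
{
    int done = 0;

    while (n->scheduled && total_budget > 0) {
        int quota = n->weight < total_budget ? n->weight : total_budget;
        int work = n->poll(n, quota);

        done += work;
        total_budget -= work;
    }
    return done;
}
```

With a weight of 16 and 40 pending packets, the loop calls the poll callback three times (16, 16, 8); the last call uses less than its quota, so the instance drops out of polled mode and waits for the next interrupt.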

5       Data Structure Class Diagram

Figure 2  Data structures

6       Protocol Stack Registration Flow

6.1 Kernel Boot Flow

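After the kernel finishes decompressing itself, it enters the kernel boot phase. This phase begins in arch/mips/kernel/head.S, which performs the earliest architecture-specific initialization, such as preparing the uninitialized data area (BSS) and the registers. (On x86, the equivalent code also sets up the interrupt descriptor table (IDT), the global descriptor table (GDT), and the page tables; MIPS has no IDT or GDT.) The file defines the kernel entry function kernel_entry(), which is architecture-specific assembly code. It first sets up the kernel stack in preparation for creating the first process in the system, then uses a loop to zero the uninitialized data segment of the kernel image, and finally jumps to start_kernel(), which initializes the hardware-related code and completes the setup of the Linux core environment.

start_kernel() is defined in init/main.c, and this is where the real kernel initialization begins. start_kernel() calls a series of initialization functions that set up the various parts of the kernel itself, such as interrupts, memory management, process management, signals, and the file system, with the goal of establishing a basically complete Linux core environment.

The main functions called from start_kernel() and their call relationships are as follows: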

 

 

start_kernel
    setup_arch
        cpu_probe
        prom_init
        cpu_report
        arch_mem_init
        resource_init
    sched_init
    init_IRQ
    proc_root_init
    mm_init
    console_init
    rest_init
        kernel_init
            do_basic_setup
                init_tmpfs
                driver_init
                do_initcalls
            init_post
        cpu_idle
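
The last step, do_initcalls, is what eventually runs the networking initialization functions: each subsystem registers an init function at a given level, and the calls are made level by level. The sketch below imitates that ordering in user space. It is an illustration only: the toy_ names and the table are hypothetical, and the level numbers (1 for core, 4 for subsys, 5 for fs) and the level assigned to each function are assumptions based on the usual initcall macros rather than something stated in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_LEVEL 7

typedef int (*toy_initcall_t)(void);

/* Hypothetical registration record: an init function and its level. */
struct toy_initcall {
    int level;
    toy_initcall_t fn;
};

/* Records the order in which the init functions ran. */
static char toy_log[128];

static int toy_sock_init(void)    { strcat(toy_log, "sock_init;");    return 0; }
static int toy_net_dev_init(void) { strcat(toy_log, "net_dev_init;"); return 0; }
static int toy_inet_init(void)    { strcat(toy_log, "inet_init;");    return 0; }

/* Deliberately listed out of order: the level decides when each one runs. */
static const struct toy_initcall toy_initcalls[] = {
    { 5, toy_inet_init },      /* assumed fs level */
    { 1, toy_sock_init },      /* assumed core level */
    { 4, toy_net_dev_init },   /* assumed subsys level */
};

/* Run every registered init function, lowest level first.
 * Returns the number of functions called. */
static int toy_do_initcalls(void)
{
    const size_t count = sizeof(toy_initcalls) / sizeof(toy_initcalls[0]);
    int ran = 0;

    toy_log[0] = '\0';
    for (int level = 0; level <= TOY_MAX_LEVEL; level++) {
        for (size_t i = 0; i < count; i++) {
            if (toy_initcalls[i].level != level)
                continue;
            assert(toy_initcalls[i].fn != NULL);
            toy_initcalls[i].fn();
            ran++;
        }
    }
    return ran;
}
```

The point of the level ordering is that the socket layer exists before the device layer is set up, and both exist before the INET protocol family registers itself.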

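6.2 Protocol Stack Initialization Flow

sock_init:  initializes the sk_buff SLAB cache and registers the socket file system.

net_inuse_init:  allocates a per-CPU buffer.

proto_init:  creates the protocols file under /proc/net and registers the associated file operations.

net_dev_init:  sets up the netdevice-related data structures under /proc and /sys, and registers the network transmit and receive softirqs (NET_TX_SOFTIRQ and NET_RX_SOFTIRQ).

It also initializes a packet receive queue (softnet_data) for each CPU together with the receive callback, registers the loopback operations, and registers the default network device operations. (Driver layer)

inet_init:  registers the socket creation method of the INET protocol family, and registers the basic receive handlers for TCP, UDP, ICMP, and IGMP. It also creates the proc files for the IPv4 protocol family.

This function is the main registration function of the protocol stack:

1.        rc = proto_register(&udp_prot, 1); registers the UDP protocol at the INET layer and allocates a slab cache for it.

2.        (void)sock_register(&inet_family_ops); registers the operation set of the INET protocol family (mainly the creation routine for INET sockets) in the array static const struct net_proto_family *net_families[NPROTO];. (INET socket layer)

3.        inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0 registers the operation set of the transport-layer UDP protocol in the array extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];. (Network layer)

4.        static struct list_head inetsw[SOCK_MAX];    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)    INIT_LIST_HEAD(r);   initializes the socket-type array. This is an array of lists: each element is a list that links together the protocols and operation sets using the same socket type.

5.        for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)

a)        inet_register_protosw(q);

registers each protocol's operation set with the sock layer. (BSD socket layer and INET socket layer)

6.        arp_init(); enables ARP support.

7.        ip_init(); enables IP support.

8.        udp_init(); enables UDP support.

9.        dev_add_pack(&ip_packet_type); registers the operation set of the IP protocol in ptype_base[PTYPE_HASH_SIZE];. (Protocol-independent layer)

10.    System call layer: the system call interfaces provided in socket.c.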
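
Steps 3 and 9 above fill two lookup tables that the receive path later uses for demultiplexing: first by link-layer protocol type, then by IP protocol number. The following user-space sketch shows that two-level dispatch. It is a simplified illustration with hypothetical toy_ names; the table sizes, the hash function, and the packet representation are assumptions and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_IP     0x0800
#define TOY_IPPROTO_UDP  17
#define TOY_MAX_PROTOS   256
#define TOY_PTYPE_HASH   16

/* Hypothetical packet: only the two fields used for demultiplexing. */
struct toy_pkt {
    uint16_t ethertype;
    uint8_t ip_proto;
    int delivered_to_udp;
};

/* Stand-ins for struct net_protocol and struct packet_type. */
struct toy_net_protocol {
    int (*handler)(struct toy_pkt *);
};

struct toy_packet_type {
    uint16_t type;
    int (*func)(struct toy_pkt *);
    struct toy_packet_type *next;
};

static const struct toy_net_protocol *toy_inet_protos[TOY_MAX_PROTOS];
static struct toy_packet_type *toy_ptype_base[TOY_PTYPE_HASH];

/* Register a transport protocol; fails if the slot is already taken. */
static int toy_inet_add_protocol(const struct toy_net_protocol *p, uint8_t num)
{
    if (toy_inet_protos[num] != NULL)
        return -1;
    toy_inet_protos[num] = p;
    return 0;
}

/* Register a link-layer protocol handler in its hash bucket. */
static void toy_dev_add_pack(struct toy_packet_type *pt)
{
    assert(pt != NULL);

    unsigned int h = pt->type & (TOY_PTYPE_HASH - 1);

    pt->next = toy_ptype_base[h];
    toy_ptype_base[h] = pt;
}

/* Transport layer: UDP receive handler. */
static int toy_udp_rcv(struct toy_pkt *p)
{
    p->delivered_to_udp = 1;
    return 0;
}

/* Network layer: dispatch on the IP protocol number. */
static int toy_ip_rcv(struct toy_pkt *p)
{
    const struct toy_net_protocol *proto = toy_inet_protos[p->ip_proto];

    return proto != NULL ? proto->handler(p) : -1;
}

/* Protocol-independent layer: dispatch on the link-layer type. */
static int toy_netif_receive(struct toy_pkt *p)
{
    unsigned int h = p->ethertype & (TOY_PTYPE_HASH - 1);

    for (struct toy_packet_type *pt = toy_ptype_base[h]; pt != NULL; pt = pt->next)
        if (pt->type == p->ethertype)
            return pt->func(p);
    return -1;
}

static const struct toy_net_protocol toy_udp_protocol = { toy_udp_rcv };
static struct toy_packet_type toy_ip_packet_type = { TOY_ETH_P_IP, toy_ip_rcv, NULL };

/* Registration, loosely mirroring steps 3 and 9.
 * Returns -1 without side effects if it has already run. */
static int toy_register_stack(void)
{
    if (toy_inet_add_protocol(&toy_udp_protocol, TOY_IPPROTO_UDP) < 0)
        return -1;
    toy_dev_add_pack(&toy_ip_packet_type);
    return 0;
}
```

After toy_register_stack() has run, a packet with type 0x0800 and protocol 17 travels through toy_netif_receive(), toy_ip_rcv() and toy_udp_rcv() in turn, while a packet whose type or protocol has no registered handler is rejected with -1.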

 

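7      Socket Creation Flow

         This chapter describes the socket creation flow and how the parameters are passed along: what happens when fd = socket(family, type, protocol); is called, and how the resulting data structures are organized in memory afterwards.

Figure 3  Socket creation flow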
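
How the type and protocol arguments of socket() select an entry among the registered protocols can be sketched as a table lookup. The code below is a user-space approximation of that matching logic under stated assumptions: the toy_ names, the constant values, and the three-entry table are hypothetical and only loosely modeled on inetsw_array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants; real values are platform-specific. */
enum { TOY_SOCK_STREAM = 1, TOY_SOCK_DGRAM = 2, TOY_SOCK_RAW = 3, TOY_SOCK_MAX = 4 };
enum { TOY_IPPROTO_IP = 0, TOY_IPPROTO_TCP = 6, TOY_IPPROTO_UDP = 17 };

/* Stand-in for struct inet_protosw: one (type, protocol) pairing. */
struct toy_protosw {
    int type;
    int protocol;          /* TOY_IPPROTO_IP acts as a wildcard entry */
    const char *prot_name;
};

static const struct toy_protosw toy_inetsw_array[] = {
    { TOY_SOCK_STREAM, TOY_IPPROTO_TCP, "tcp" },
    { TOY_SOCK_DGRAM,  TOY_IPPROTO_UDP, "udp" },
    { TOY_SOCK_RAW,    TOY_IPPROTO_IP,  "raw" },
};

/* Find the entry for (type, *protocol).
 * A protocol of 0 from the caller means "default for this type" and is
 * rewritten to the matched entry's protocol. Returns NULL if none fits. */
static const struct toy_protosw *toy_inet_lookup(int type, int *protocol)
{
    const size_t count = sizeof(toy_inetsw_array) / sizeof(toy_inetsw_array[0]);

    assert(protocol != NULL);
    if (type <= 0 || type >= TOY_SOCK_MAX)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        const struct toy_protosw *e = &toy_inetsw_array[i];

        if (e->type != type)
            continue;
        if (*protocol == e->protocol)
            return e;                     /* exact match */
        if (*protocol == TOY_IPPROTO_IP) {
            *protocol = e->protocol;      /* caller asked for the default */
            return e;
        }
        if (e->protocol == TOY_IPPROTO_IP)
            return e;                     /* wildcard entry, e.g. raw sockets */
    }
    return NULL;
}
```

Under these assumptions, a datagram socket requested with protocol 0 resolves to the UDP entry, while asking for a datagram socket with the TCP protocol number finds no entry and the creation would fail.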

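8      Protocol Stack Receive Flow

Using the UDP protocol as an example.

Figure 4  Send and receive flow

8.1 Driver Receive Flow

Using the UDP protocol as an example.

Figure 5  Kernel receive flow

8.2 Application-Layer Receive Flow

Using the UDP protocol as an example.

Figure 6  Application-layer receive flow

9      Protocol Stack Send Flow

Using the UDP protocol as an example.

Figure 7  UDP send flow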

 

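10            Summary

This document gives only a rough analysis of the protocol stack flow, and many of the design ideas involved could not be conveyed here. For a deeper understanding, first consult the protocol stack articles by the CSDN blogger yming0221 at http://blog.csdn.net/column/details/linux-kernel-net.html, or read the Linux kernel protocol stack source code directly.

11            References

1. TCP/IP Illustrated, Volume 1

2. Blog: http://blog.csdn.net/column/details/linux-kernel-net.html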

 

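Figure 1  Initialization flow

Figure 2  Layered data structures

Figure 3  Socket creation flow

Figure 4  Send and receive flow

Figure 5  Detailed kernel receive flow (interrupt-driven receive)

Figure 6  Application-layer receive flow

Figure 7  UDP send flow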

 

