Linux Kernel Protocol Stack Analysis
Contents
1 Overview
2 TCP Protocol
2.1 Layering
2.2 TCP/IP Layering
2.3 Internet Addresses
2.4 Encapsulation
2.5 Demultiplexing
3 Packet Formats
3.1 ethhdr
3.2 iphdr
3.3 udphdr
4 Data Structures
4.1 Kernel Protocol Stack Layering
4.2 msghdr
4.3 iovec
4.4 file
4.5 file_operations
4.6 socket
4.7 sock
4.8 sock_common
4.9 inet_sock
4.10 udp_sock
4.11 proto_ops
4.12 proto
4.13 net_proto_family
4.14 softnet_data
4.15 sk_buff
4.16 sk_buff_head
4.17 net_device
4.18 inet_protosw
4.19 inetsw_array
4.20 sock_type
4.21 IPPROTO
4.22 net_protocol
4.23 packet_type
4.24 rtable
4.25 dst_entry
4.26 napi_struct
5 Data Structure Class Diagram
6 Protocol Stack Registration Flow
6.1 Kernel Boot Flow
6.2 Protocol Stack Initialization Flow
7 Protocol Stack Receive Flow
7.1 Driver Receive Flow
7.2 Application-Layer Receive Flow
8 Protocol Stack Transmit Flow
9 Summary
10 References
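1 Overview
This document is based on linux-2.6.32. It is intended to help readers who already have some grounding in network protocols understand how a network packet travels through the protocol stack: roughly which steps it passes through between reception by the NIC and delivery to the application layer, and what each step does. Figures are attached at the end.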
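2 TCP Protocol
This chapter is excerpted from Chapter 1 of [TCP/IP Illustrated, Volume 1].
2.1 Layering
Networking protocols are normally developed in layers, with each layer responsible for a different facet of the communication. A protocol suite, such as TCP/IP, is the combination of several protocols at different layers. TCP/IP is normally considered a four-layer system, as shown in Figure 1-1. Each layer has a different responsibility:
In the TCP/IP protocol suite, the network layer protocols include IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).
3) The transport layer provides end-to-end communication for the applications on two hosts. The TCP/IP protocol suite has two very different transport protocols: TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). TCP provides a highly reliable flow of data between two hosts. Its work includes dividing the data passed to it by the application into appropriately sized chunks for the network layer below, acknowledging received packets, and setting timeouts to make sure the other end acknowledges the packets that were sent. Because the transport layer provides this reliable end-to-end communication, the application layer can ignore all of these details. UDP, on the other hand, provides a much simpler service to the application layer. It just sends packets of data called datagrams from one host to another, with no guarantee that a datagram reaches the other end. Any required reliability must be added by the application layer.
4) The application layer handles the details of the particular application. Almost every TCP/IP implementation provides the following common applications: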
• Telnet: remote login.
• FTP: File Transfer Protocol.
• SMTP: Simple Mail Transfer Protocol.
• SNMP: Simple Network Management Protocol.
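2.2 TCP/IP Layering
There are many protocols in the TCP/IP protocol suite. Figure 1-4 shows the additional protocols discussed in the book.
TCP and UDP are the two best-known transport layer protocols, and both use IP as the network layer protocol.
Although TCP uses the unreliable service of IP, it provides a reliable transport layer service. Chapters 17 through 22 of the book cover the internal operation of TCP in detail. The book then looks at some TCP applications: Telnet and Rlogin in Chapter 26, FTP in Chapter 27, and SMTP in Chapter 28. These applications are normally user processes.
UDP sends and receives datagrams for applications. A datagram is a unit of information (for example, a certain number of bytes specified by the sender) that travels from the sender to the receiver. Unlike TCP, however, UDP is unreliable: there is no guarantee that a datagram reaches its final destination intact. Chapter 11 covers UDP, and Chapters 14 (DNS, the Domain Name System), 15 (TFTP, the Trivial File Transfer Protocol), and 16 (BOOTP, the Bootstrap Protocol) cover applications that use UDP. SNMP also uses UDP, but because it deals with many of the other protocols, its discussion is left to Chapter 25.
IP is the main protocol at the network layer and is used by both TCP and UDP. Every piece of TCP and UDP data transferred across an internet passes through the IP layer at both end systems and at every intermediate router. Figure 1-4 also shows an application accessing IP directly. This is rare but possible (some older routing protocols were implemented this way, and a new transport layer protocol could also take this approach). Chapter 3 covers IP, with some details deferred to later chapters where they are more relevant. Chapters 9 and 10 cover how IP performs routing.
ICMP is an adjunct to IP. The IP layer uses it to exchange error messages and other vital information with other hosts or routers.
Chapter 6 covers ICMP in detail. Although ICMP is used primarily by IP, an application can also access it. Two popular diagnostic tools, Ping and Traceroute (Chapters 7 and 8), both use ICMP.
IGMP is the Internet Group Management Protocol. It is used to multicast a UDP datagram to multiple hosts. Chapter 12 describes the general properties of broadcasting (sending a UDP datagram to every host on a specified network) and multicasting, and Chapter 13 describes IGMP itself.
ARP (Address Resolution Protocol) and RARP (Reverse Address Resolution Protocol) are specialized protocols used with certain network interfaces (such as Ethernet and token ring) to convert between the addresses used by the IP layer and those used by the network interface layer. Chapters 4 and 5 examine these two protocols.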
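2.3 Internet Addresses
Every interface on an internet must have a unique Internet address (also called an IP address). An IP address is 32 bits long. Instead of using a flat address space such as 1, 2, 3, and so on, IP addresses have a structure. Figure 1-5 shows the five classes of Internet address formats.
These 32-bit addresses are normally written as four decimal numbers, one for each byte of the address. This is called dotted-decimal notation. For example, the book author's system has a class B address, written as 140.252.13.33.
The easiest way to tell the classes apart is to look at the first decimal number. Figure 1-6 lists the start and end of each class's range, with the first decimal number shown in bold.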
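It is worth repeating that a multihomed host has multiple IP addresses, one per interface.
Because every interface on an internet must have a unique IP address, there must be a central authority that allocates IP addresses to networks connecting to the Internet. That authority is the Internet Network Information Center, called the InterNIC. The InterNIC assigns only network IDs. Assigning host IDs is up to the system administrator.
Registration services for the Internet (IP addresses and DNS domain names) used to be handled by the NIC, at nic.ddn.mil. On April 1, 1993, the InterNIC was created. The NIC now handles registration requests only for the Defense Data Network; all other Internet registration requests are handled by the InterNIC, at rs.internic.net.
The InterNIC actually has three parts: registration services (rs.internic.net), directory and database services (ds.internic.net), and information services (is.internic.net). See Exercise 1.8 for more information on the InterNIC.
There are three types of IP addresses: unicast (destined for a single host), broadcast (destined for all hosts on a given network), and multicast (destined for all hosts in the same group). Chapters 12 and 13 cover broadcasting and multicasting in more detail.
Section 3.4 extends the description of IP addresses to subnetting, after IP routing has been introduced. Figure 3-9 shows the special-case IP addresses: host IDs and network IDs of all zero bits or all one bits.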
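2.4 Encapsulation
When an application sends data using TCP, the data is sent down the protocol stack, passing through each layer in turn, until it is sent as a stream of bits across the network. Each layer adds header information (and sometimes trailer information) to the data it receives; Figure 1-7 shows this process. The unit of data that TCP passes to IP is called a TCP segment. The unit of data that IP passes to the network interface layer is called an IP datagram. The stream of bits that flows across the Ethernet is called a frame. The numbers beneath the frame header and trailer in Figure 1-7 are the typical sizes, in bytes, of the Ethernet frame headers.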
2.5 Demultiplexing
... the upper-layer protocol. This process is called demultiplexing; Figure 1-8 shows how it takes place. [TCP/IP Illustrated, Volume 1]
3 Packet Formats
3.1 ethhdr
/*
 * This is an Ethernet frame header.
 */
struct ethhdr {
    unsigned char h_dest[ETH_ALEN];   /* destination eth addr */
    unsigned char h_source[ETH_ALEN]; /* source ether addr */
    __be16 h_proto;                   /* packet type ID field */
} __attribute__((packed));
/*
 * These are the defined Ethernet Protocol ID's.
 */
#define ETH_P_LOOP 0x0060 /* Ethernet Loopback packet */
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
#define ETH_P_BPQ 0x08FF /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_IEEEPUP 0x0a00 /* Xerox IEEE802.3 PUP packet */
#define ETH_P_IEEEPUPAT 0x0a01 /* Xerox IEEE802.3 PUP Addr Trans packet */
#define ETH_P_DEC 0x6000 /* DEC Assigned proto */
#define ETH_P_DNA_DL 0x6001 /* DEC DNA Dump/Load */
#define ETH_P_DNA_RC 0x6002 /* DEC DNA Remote Console */
#define ETH_P_DNA_RT 0x6003 /* DEC DNA Routing */
#define ETH_P_LAT 0x6004 /* DEC LAT */
#define ETH_P_DIAG 0x6005 /* DEC Diagnostics */
#define ETH_P_CUST 0x6006 /* DEC Customer use */
#define ETH_P_SCA 0x6007 /* DEC Systems Comms Arch */
#define ETH_P_TEB 0x6558 /* Trans Ether Bridging */
#define ETH_P_RARP 0x8035 /* Reverse Addr Res packet */
#define ETH_P_ATALK 0x809B /* Appletalk DDP */
#define ETH_P_AARP 0x80F3 /* Appletalk AARP */
#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
#define ETH_P_IPX 0x8137 /* IPX over DIX */
#define ETH_P_IPV6 0x86DD /* IPv6 over bluebook */
#define ETH_P_PAUSE 0x8808 /* IEEE Pause frames. See 802.3 31B */
#define ETH_P_SLOW 0x8809 /* Slow Protocol. See 802.3ad 43B */
#define ETH_P_WCCP 0x883E /* Web-cache coordination protocol
                           * defined in draft-wilson-wrec-wccp-v2-00.txt */
#define ETH_P_PPP_DISC 0x8863 /* PPPoE discovery messages */
#define ETH_P_PPP_SES 0x8864 /* PPPoE session messages */
#define ETH_P_MPLS_UC 0x8847 /* MPLS Unicast traffic */
#define ETH_P_MPLS_MC 0x8848 /* MPLS Multicast traffic */
#define ETH_P_ATMMPOA 0x884c /* MultiProtocol Over ATM */
#define ETH_P_ATMFATE 0x8884 /* Frame-based ATM Transport
                              * over Ethernet
                              */
#define ETH_P_PAE 0x888E /* Port Access Entity (IEEE 802.1X) */
#define ETH_P_AOE 0x88A2 /* ATA over Ethernet */
#define ETH_P_TIPC 0x88CA /* TIPC */
#define ETH_P_1588 0x88F7 /* IEEE 1588 Timesync */
#define ETH_P_FCOE 0x8906 /* Fibre Channel over Ethernet */
#define ETH_P_TDLS 0x890D /* TDLS */
#define ETH_P_FIP 0x8914 /* FCoE Initialization Protocol */
3.2 iphdr
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u8 ihl:4,
         version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
    __u8 version:4,
         ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
    __u8 tos;
    __be16 tot_len;
    __be16 id;
    __be16 frag_off;
    __u8 ttl;
    __u8 protocol;
    __sum16 check;
    __be32 saddr;
    __be32 daddr;
    /* The options start here. */
};
3.3 udphdr
struct udphdr {
    __be16 source;
    __be16 dest;
    __be16 len;
    __sum16 check;
};
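4 Data Structures
4.1 Kernel Protocol Stack Layering
Figure 4-1 Layered structure of the kernel protocol stack
Physical device hardware: the actual physical devices. Corresponds to the physical layer.
Device agnostic interface: the device-independent layer. Corresponds to the link layer.
Network protocols: the network protocol layer. Corresponds to the IP layer and the transport layer.
Protocol agnostic interface: the protocol-independent layer. It adapts to the system call layer and hides the details of individual protocols.
System call interface: the system call layer. It provides system calls to the application layer and hides the details of socket operations.
BSD socket: the BSD socket layer. It provides a uniform interface for socket operations and is closely tied to struct socket.
Inet socket: the inet socket layer. It is the uniform interface for calling IP-layer protocols and is closely tied to struct sock.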
4.2 msghdr
/*
 * As we do 4.4BSD message passing we use a 4.4BSD message passing
 * system, not 4.3. Thus msg_accrights(len) are now missing. They
 * belong in an obscure libc emulation or the bin.
 */
struct msghdr {
    void *msg_name;                 /* Socket name */
    int msg_namelen;                /* Length of name */
    struct iovec *msg_iov;          /* Data blocks */
    __kernel_size_t msg_iovlen;     /* Number of blocks */
    void *msg_control;              /* Per protocol magic (eg BSD file descriptor passing) */
    __kernel_size_t msg_controllen; /* Length of cmsg list */
    unsigned msg_flags;
};
4.3 iovec
/*
 * Berkeley style UIO structures - Alan Cox 1994.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
struct iovec {
    void __user *iov_base;   /* BSD uses caddr_t (1003.1g requires void *) */
    __kernel_size_t iov_len; /* Must be size_t (1003.1g) */
};
4.4 file
struct file {
/*
 * fu_list becomes invalid after file_free is called and queued via
 * fu_rcuhead for RCU freeing
 */
union {
struct list_head fu_list;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt
const struct file_operations *f_op;
spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
loff_t f_pos;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
void *f_security;
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
unsigned long f_mnt_write_state;
};
4.5 file_operations
The structure that holds file operations, including read(), write(), open(), ioctl(), and so on.
/*
 * read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl
 * can be called without the big kernel lock held in all filesystems.
 */
struct file_operations {
    struct module *owner;
    loff_t (*llseek)(struct file *, loff_t, int);
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read)(struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write)(struct kiocb *, const struct iovec *, unsigned long, loff_t);
    int (*readdir)(struct file *, void *, filldir_t);
    unsigned int (*poll)(struct file *, struct poll_table_struct *);
    int (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long);
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
    int (*mmap)(struct file *, struct vm_area_struct *);
    int (*open)(struct inode *, struct file *);
    int (*flush)(struct file *, fl_owner_t id);
    int (*release)(struct inode *, struct file *);
    int (*fsync)(struct file *, struct dentry *, int datasync);
    int (*aio_fsync)(struct kiocb *, int datasync);
    int (*fasync)(int, struct file *, int);
    int (*lock)(struct file *, int, struct file_lock *);
    ssize_t (*sendpage)(struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock)(struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **);
};
4.6 socket
The BSD socket structure exposed to the application layer. It is protocol independent, and its main role is to give applications a uniform set of socket operations. (BSD: Berkeley Software Distribution)
/**
 * struct socket - general BSD socket
 * @state: socket state (%SS_CONNECTED, etc)
 * @type: socket type (%SOCK_STREAM, etc)
 * @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc)
 * @ops: protocol specific socket operations
 * @fasync_list: Asynchronous wake up list
 * @file: File back pointer for gc
 * @sk: internal networking protocol agnostic socket representation
 * @wait: wait queue for several uses
 */
struct socket {
    socket_state state;
    short type;
    unsigned long flags;
    /*
     * Please keep fasync_list & wait fields in the same cache line
     */
    struct fasync_struct *fasync_list;
    wait_queue_head_t wait;
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};
typedef enum {
SS_FREE = 0, /* not allocated */
SS_UNCONNECTED, /* unconnected to any socket */
SS_CONNECTING, /* in process of connecting */
SS_CONNECTED, /* connected to socket */
SS_DISCONNECTING /* in process of disconnecting */
} socket_state;
4.7 sock
/**
 * struct sock - network layer representation of sockets
 * @__sk_common: shared layout with inet_timewait_sock
 * @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
 * @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
 * @sk_lock: synchronizer
 * @sk_rcvbuf: size of receive buffer in bytes
 * @sk_sleep: sock wait queue
 * @sk_dst_cache: destination cache
 * @sk_dst_lock: destination cache lock
 * @sk_policy: flow policy
 * @sk_rmem_alloc: receive queue bytes committed
 * @sk_receive_queue: incoming packets
 * @sk_wmem_alloc: transmit queue bytes committed
 * @sk_write_queue: Packet sending queue
 * @sk_async_wait_queue: DMA copied packets
 * @sk_omem_alloc: "o" is "option" or "other"
 * @sk_wmem_queued: persistent queue size
 * @sk_forward_alloc: space allocated forward
 * @sk_allocation: allocation mode
 * @sk_sndbuf: size of send buffer in bytes
 * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
 *            %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
 * @sk_no_check: %SO_NO_CHECK setting, whether or not checkup packets
 * @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
 * @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
 * @sk_gso_max_size: Maximum GSO segment size to build
 * @sk_lingertime: %SO_LINGER l_linger setting
 * @sk_backlog: always used with the per-socket spinlock held
 * @sk_callback_lock: used with the callbacks in the end of this struct
 * @sk_error_queue: rarely used
 * @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
 *                   IPV6_ADDRFORM for instance)
 * @sk_err: last error
 * @sk_err_soft: errors that don't cause failure but are the cause of a
 *               persistent failure not just 'timed out'
 * @sk_drops: raw/udp drops counter
 * @sk_ack_backlog: current listen backlog
 * @sk_max_ack_backlog: listen backlog set in listen()
 * @sk_priority: %SO_PRIORITY setting
 * @sk_type: socket type (%SOCK_STREAM, etc)
 * @sk_protocol: which protocol this socket belongs in this network family
 * @sk_peercred: %SO_PEERCRED setting
 * @sk_rcvlowat: %SO_RCVLOWAT setting
 * @sk_rcvtimeo: %SO_RCVTIMEO setting
 * @sk_sndtimeo: %SO_SNDTIMEO setting
 * @sk_filter: socket filtering instructions
 * @sk_protinfo: private area, net family specific, when not using slab
 * @sk_timer: sock cleanup timer
 * @sk_stamp: time stamp of last packet received
 * @sk_socket: Identd and reporting IO signals
 * @sk_user_data: RPC layer private data
 * @sk_sndmsg_page: cached page for sendmsg
 * @sk_sndmsg_off: cached offset for sendmsg
 * @sk_send_head: front of stuff to transmit
 * @sk_security: used by security modules
 * @sk_mark: generic packet mark
 * @sk_write_pending: a write to stream socket waits to start
 * @sk_state_change: callback to indicate change in the state of the sock
 * @sk_data_ready: callback to indicate there is data to be processed
 * @sk_write_space: callback to indicate there is bf sending space available
 * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
 * @sk_backlog_rcv: callback to process the backlog
 * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 */
struct sock {
/*
 * Now struct inet_timewait_sock also uses sock_common, so please just
 * don't add nothing before this first member (__sk_common) --acme
 */
struct sock_common __sk_common;
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_copy_start __sk_common.skc_hash
#define sk_hash __sk_common.skc_hash
#define sk_family __sk_common.skc_family
#define sk_state __sk_common.skc_state
#define sk_reuse __sk_common.skc_reuse
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
#define sk_bind_node __sk_common.skc_bind_node
#define sk_prot __sk_common.skc_prot
#define sk_net __sk_common.skc_net
unsigned int sk_shutdown : 2,
sk_no_check :2,
sk_userlocks :4,
sk_protocol :8,
sk_type :16;
int sk_rcvbuf;
socket_lock_t sk_lock;
/*
 * The backlog queue is special, it is always used with
 * the per-socket spinlock held and requires low latency
 * access. Therefore we special case it's implementation.
 */
struct {
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
rwlock_t sk_dst_lock;
atomic_t sk_rmem_alloc;
atomic_t sk_wmem_alloc;
atomic_t sk_omem_alloc;
int sk_sndbuf;
struct sk_buff_head sk_receive_queue;
struct sk_buff_head sk_write_queue;
struct sk_buff_head sk_async_wait_queue;
int sk_wmem_queued;
int sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
int sk_gso_type;
unsigned int sk_gso_max_size;
int sk_rcvlowat;
unsigned long sk_flags;
unsigned long sk_lingertime;
struct sk_buff_head sk_error_queue;
struct proto *sk_prot_creator;
rwlock_t sk_callback_lock;
int sk_err,
    sk_err_soft;
atomic_t sk_drops;
unsigned short sk_ack_backlog;
unsigned short sk_max_ack_backlog;
__u32 sk_priority;
struct ucred sk_peercred;
long sk_rcvtimeo;
long sk_sndtimeo;
struct sk_filter *sk_filter;
void *sk_protinfo;
struct timer_list sk_timer;
ktime_t sk_stamp;
struct socket *sk_socket;
void *sk_user_data;
struct page *sk_sndmsg_page;
struct sk_buff *sk_send_head;
__u32 sk_sndmsg_off;
int sk_write_pending;
void *sk_security;
__u32 sk_mark;
u32 sk_classid;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk, int bytes);
void (*sk_write_space)(struct sock *sk);
void (*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
                      struct sk_buff *skb);
void (*sk_destruct)(struct sock *sk);
};
4.8 sock_common
/**
 * struct sock_common - minimal network layer representation of sockets
 * @skc_node: main hash linkage for various protocol lookup tables
 * @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
 * @skc_refcnt: reference count
 * @skc_hash: hash value used with various protocol lookup tables
 * @skc_family: network address family
 * @skc_state: Connection state
 * @skc_reuse: %SO_REUSEADDR setting
 * @skc_bound_dev_if: bound device index if != 0
 * @skc_bind_node: bind hash linkage for various protocol lookup tables
 * @skc_prot: protocol handlers inside a network family
 * @skc_net: reference to the network namespace of this socket
 *
 * This is the minimal network layer representation of sockets, the header
 * for struct sock and struct inet_timewait_sock.
 */
struct sock_common {
    /*
     * first fields are not copied in sock_copy()
     */
    union {
        struct hlist_node skc_node;
        struct hlist_nulls_node skc_nulls_node;
    };
    atomic_t skc_refcnt;
    unsigned int skc_hash;
    unsigned short skc_family;
    volatile unsigned char skc_state;
    unsigned char skc_reuse;
    int skc_bound_dev_if;
    struct hlist_node skc_bind_node;
    struct proto *skc_prot;
    struct net *skc_net;
};
4.9 inet_sock
/** struct inet_sock - representation of INET sockets
 * @sk - ancestor class
 * @pinet6 - pointer to IPv6 control block
 * @daddr - Foreign IPv4 addr
 * @rcv_saddr - Bound local IPv4 addr
* @dport - Destination port
* @num - Local port
* @saddr - Sending source
* @uc_ttl - Unicast TTL
* @sport - Source port
* @id - ID counter for DF pkts
* @tos - TOS
* @mc_ttl - Multicasting TTL
 * @is_icsk - is this an inet_connection_sock?
 * @mc_index - Multicast device index
 * @mc_list - Group array
 * @cork - info to build ip hdr on each ip frag while socket is corked
 */
struct inet_sock {
/* sk and pinet6 has to be the first two members of inet_sock */
struct sock sk;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct ipv6_pinfo *pinet6;
#endif
/* Socket demultiplex comparisons on incoming packets. */
__be32 daddr;
__be32 rcv_saddr;
__be16 dport;
__u16 num;
__be32 saddr;
__s16 uc_ttl;
__u16 cmsg_flags;
struct ip_options *opt;
__be16 sport;
__u16 id;
__u8 tos;
__u8 mc_ttl;
__u8 pmtudisc;
__u8 recverr:1,
     is_icsk:1,
     freebind:1,
     hdrincl:1,
     mc_loop:1,
     transparent:1,
     mc_all:1;
int mc_index;
__be32 mc_addr;
struct ip_mc_socklist *mc_list;
struct {
unsigned int flags;
unsigned int fragsize;
struct ip_options *opt;
struct dst_entry *dst;
int length; /* Total length of all frames */
__be32 addr;
struct flowi fl;
} cork;
};
4.10 udp_sock
struct udp_sock {
    /* inet_sock has to be the first member */
    struct inet_sock inet;
    int pending;           /* Any pending frames ? */
    unsigned int corkflag; /* Cork is required */
    __u16 encap_type;      /* Is this an Encapsulation socket? */
    /*
     * Following member retains the information to create a UDP header
     * when the socket is uncorked.
     */
    __u16 len;             /* total length of pending frames */
    /*
     * Fields specific to UDP-Lite.
     */
    __u16 pcslen;
    __u16 pcrlen;
/* indicator bits used by pcflag: */
#define UDPLITE_BIT 0x1     /* set by udplite proto init function */
#define UDPLITE_SEND_CC 0x2 /* set via udplite setsockopt */
#define UDPLITE_RECV_CC 0x4 /* set via udplite setsockopt */
    __u8 pcflag;           /* marks socket as UDP-Lite if > 0 */
    __u8 unused[3];
    /*
     * For encapsulation sockets.
     */
    int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};
4.11 proto_ops
The interface from the BSD socket layer to the inet_sock layer, used mainly to operate on struct socket.
struct proto_ops {
    int family;
    struct module *owner;
    int (*release)(struct socket *sock);
    int (*bind)(struct socket *sock,
                struct sockaddr *myaddr,
                int sockaddr_len);
    int (*connect)(struct socket *sock,
                   struct sockaddr *vaddr,
                   int sockaddr_len, int flags);
    int (*socketpair)(struct socket *sock1,
                      struct socket *sock2);
    int (*accept)(struct socket *sock,
                  struct socket *newsock, int flags);
    int (*getname)(struct socket *sock,
                   struct sockaddr *addr,
                   int *sockaddr_len, int peer);
    unsigned int (*poll)(struct file *file, struct socket *sock,
                         struct poll_table_struct *wait);
    int (*ioctl)(struct socket *sock, unsigned int cmd,
                 unsigned long arg);
    int (*compat_ioctl)(struct socket *sock, unsigned int cmd,
                        unsigned long arg);
    int (*listen)(struct socket *sock, int len);
    int (*shutdown)(struct socket *sock, int flags);
    int (*setsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, unsigned int optlen);
    int (*getsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, int __user *optlen);
    int (*compat_setsockopt)(struct socket *sock, int level,
                             int optname, char __user *optval, unsigned int optlen);
    int (*compat_getsockopt)(struct socket *sock, int level,
                             int optname, char __user *optval, int __user *optlen);
    int (*sendmsg)(struct kiocb *iocb, struct socket *sock,
                   struct msghdr *m, size_t total_len);
    int (*recvmsg)(struct kiocb *iocb, struct socket *sock,
                   struct msghdr *m, size_t total_len,
                   int flags);
    int (*mmap)(struct file *file, struct socket *sock,
                struct vm_area_struct *vma);
    ssize_t (*sendpage)(struct socket *sock, struct page *page,
                        int offset, size_t size, int flags);
    ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
                           struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};
4.12 proto
The uniform interface from the inet_sock layer to the transport layer, used mainly to operate on struct sock.
/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 * transport -> network interface is defined by struct inet_proto
 */
struct proto {
void (*close)(struct sock*sk,
long timeout);
int (*connect)(struct sock*sk,
struct sockaddr*uaddr,
int addr_len);
int (*disconnect)(struct sock*sk,int flags);
struct sock * (*accept)(struct sock*sk,int flags,int*err);
int (*ioctl)(struct sock *sk, int cmd,
             unsigned long arg);
int (*init)(struct sock*sk);
void (*destroy)(struct sock*sk);
void (*shutdown)(struct sock*sk,int how);
int (*setsockopt)(struct sock *sk, int level,
                  int optname, char __user *optval,
                  unsigned int optlen);
int (*getsockopt)(struct sock *sk, int level,
                  int optname, char __user *optval,
                  int __user *option);
int (*compat_setsockopt)(struct sock *sk,
                         int level,
                         int optname, char __user *optval,
                         unsigned int optlen);
int (*compat_getsockopt)(struct sock *sk,
                         int level,
                         int optname, char __user *optval,
                         int __user *option);
int (*sendmsg)(struct kiocb*iocb,struct sock *sk,
struct msghdr*msg, size_t len);
int (*recvmsg)(struct kiocb*iocb,struct sock *sk,
struct msghdr*msg,
size_t len,int noblock,int flags,
int *addr_len);
int (*sendpage)(struct sock*sk,struct page *page,
int offset, size_t size,int flags);
int (*bind)(struct sock*sk,
struct sockaddr*uaddr,int addr_len);
int (*backlog_rcv)(struct sock*sk,
struct sk_buff*skb);
/* Keeping track of sk's, looking them up, and port selection methods. */
void (*hash)(struct sock *sk);
void (*unhash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);
/* Keeping track of sockets in use */
unsigned int inuse_idx;
/* Memory pressure */
void (*enter_memory_pressure)(struct sock*sk);
atomic_t *memory_allocated; /* Current allocated memory. */
struct percpu_counter *sockets_allocated; /* Current number of sockets. */
/*
 * Pressure flag: try to collapse.
 * Technical note: it is used by multiple contexts non atomically.
 * All the __sk_mem_schedule() is of this nature: accounting
 * is strict, actions are advisory and have some latency.
 */
int *memory_pressure;
int *sysctl_mem;
int *sysctl_wmem;
int *sysctl_rmem;
int max_header;
struct kmem_cache *slab;
unsigned int obj_size;
int slab_flags;
struct percpu_counter *orphan_count;
struct request_sock_ops *rsk_prot;
struct timewait_sock_ops *twsk_prot;
union {
struct inet_hashinfo *hashinfo;
struct udp_table *udp_table;
struct raw_hashinfo *raw_hash;
} h;
struct module *owner;
char name[32];
struct list_head node;
atomic_t socks;
};
4.13 net_proto_family
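Used to identify and register a protocol family; common protocol families are IPv4 and IPv6.
Protocol family: a set of protocols that together accomplish some specific function.
struct net_proto_family {
    int family;
    int (*create)(struct net *net, struct socket *sock,
                  int protocol, int kern);
    struct module *owner;
};
/* Supported address families. */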
#define AF_UNSPEC 0
#define AF_UNIX 1 /* Unix domain sockets */
#define AF_LOCAL 1 /* POSIX name for AF_UNIX */
#define AF_INET 2 /* Internet IP Protocol */
#define AF_AX25 3 /* Amateur Radio AX.25 */
#define AF_IPX 4 /* Novell IPX */
#define AF_APPLETALK 5 /* AppleTalk DDP */
#define AF_NETROM 6 /* Amateur Radio NET/ROM */
#define AF_BRIDGE 7 /* Multiprotocol bridge */
#define AF_ATMPVC 8 /* ATM PVCs */
#define AF_X25 9 /* Reserved for X.25 project */
#define AF_INET6 10 /* IP version 6 */
#define AF_ROSE 11 /* Amateur Radio X.25 PLP */
#define AF_DECnet 12 /* Reserved for DECnet project */
#define AF_NETBEUI 13 /* Reserved for 802.2LLC project */
#define AF_SECURITY 14 /* Security callback pseudo AF */
#define AF_KEY 15 /* PF_KEY key management API */
#define AF_NETLINK 16
#define AF_ROUTE AF_NETLINK /* Alias to emulate 4.4BSD */
#define AF_PACKET 17 /* Packet family */
#define AF_ASH 18 /* Ash */
#define AF_ECONET 19 /* Acorn Econet */
#define AF_ATMSVC 20 /* ATM SVCs */
#define AF_RDS 21 /* RDS sockets */
#define AF_SNA 22 /* Linux SNA Project (nutters!) */
#define AF_IRDA 23 /* IRDA sockets */
#define AF_PPPOX 24 /* PPPoX sockets */
#define AF_WANPIPE 25 /* Wanpipe API Sockets */
#define AF_LLC 26 /* Linux LLC */
#define AF_CAN 29 /* Controller Area Network */
#define AF_TIPC 30 /* TIPC sockets */
#define AF_BLUETOOTH 31 /* Bluetooth sockets */
#define AF_IUCV 32 /* IUCV sockets */
#define AF_RXRPC 33 /* RxRPC sockets */
#define AF_ISDN 34 /* mISDN sockets */
#define AF_PHONET 35 /* Phonet sockets */
#define AF_IEEE802154 36 /* IEEE802154 sockets */
#define AF_MAX 37 /* For now.. */
static const struct net_proto_family *net_families[NPROTO];
4.14 softnet_data
/*
 * Incoming packets are placed on per-cpu queues so that
 * no locking is needed.
 */
struct softnet_data {
    struct Qdisc *output_queue;
    struct list_head poll_list;
    struct sk_buff *completion_queue;
    /* Elements below can be accessed between CPUs for RPS */
    struct call_single_data csd ____cacheline_aligned_in_smp;
    unsigned int input_queue_head;
    struct sk_buff_head input_pkt_queue;
    struct napi_struct backlog;
};
4.15 sk_buff
/**
 * struct sk_buff - socket buffer
 * @next: Next buffer in list
 * @prev: Previous buffer in list
 * @sk: Socket we are owned by
 * @tstamp: Time we arrived
 * @dev: Device we arrived on/are leaving by
 * @transport_header: Transport layer header
 * @network_header: Network layer header
 * @mac_header: Link layer header
 * @_skb_dst: destination entry
 * @sp: the security path, used for xfrm
 * @cb: Control buffer. Free for use by every layer. Put private vars here
 * @len: Length of actual data
 * @data_len: Data length
 * @mac_len: Length of link layer header
 * @hdr_len: writable header length of cloned skb
 * @csum: Checksum (must include start/offset pair)
 * @csum_start: Offset from skb->head where checksumming should start
 * @csum_offset: Offset from csum_start where checksum should be stored
 * @local_df: allow local fragmentation
 * @cloned: Head may be cloned (check refcnt to be sure)
 * @nohdr: Payload reference only, must not modify header
 * @pkt_type: Packet class
 * @fclone: skbuff clone status
 * @ip_summed: Driver fed us an IP checksum
 * @priority: Packet queueing priority
 * @users: User count - see {datagram,tcp}.c
 * @protocol: Packet protocol from driver
 * @truesize: Buffer size
 * @head: Head of buffer
 * @data: Data head pointer
 * @tail: Tail pointer
 * @end: End pointer
 * @destructor: Destruct function
 * @mark: Generic packet mark
 * @nfct: Associated connection, if any
 * @ipvs_property: skbuff is owned by ipvs
 * @peeked: this packet has been seen already, so stats have been
 *          done for it, don't do them again
 * @nf_trace: netfilter packet trace flag
 * @nfctinfo: Relationship of this skb to the connection
 * @nfct_reasm: netfilter conntrack re-assembly pointer
 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 * @iif: ifindex of device we arrived on
 * @queue_mapping: Queue mapping for multiqueue devices
 * @tc_index: Traffic control index
 * @tc_verd: traffic control verdict
 * @ndisc_nodetype: router type (from link layer)
 * @dma_cookie: a cookie to one of several possible DMA operations
 *              done by skb DMA functions
 * @secmark: security marking
 * @vlan_tci: vlan tag control information
 */
struct sk_buff {
/* These two members must be first. */
struct sk_buff *next;
struct sk_buff *prev;
struct sock *sk;
ktime_t tstamp;
struct net_device*dev;
unsigned long _skb_dst;
struct sec_path *sp;
/*
 * This is the control buffer. It is free to use for every
 * layer. Please put your private variables there. If you
 * want to keep them across layers you have to do a skb_clone()
 * first. This is owned by whoever has the skb queued ATM.
 */
char cb[48];
unsigned int len,
             data_len;
__u16 mac_len,
      hdr_len;
union {
__wsum csum;
struct {
__u16 csum_start;
__u16 csum_offset;
};
};
__u32 priority;
__u8 local_df:1,
     cloned:1,
     ip_summed:2,
     nohdr:1,
     nfctinfo:3;
__u8 pkt_type:3,
     fclone:2,
     ipvs_property:1,
     peeked:1,
     nf_trace:1;
__be16 protocol:16;
void (*destructor)(struct sk_buff*skb);
struct nf_conntrack *nfct;
struct sk_buff *nfct_reasm;
struct nf_bridge_info *nf_bridge;
int iif;
__u16 tc_index; /* traffic control index */
__u16 tc_verd; /* traffic control verdict */
__u16 queue_mapping:16;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2,
     deliver_no_wcard:1;
#else
__u8 deliver_no_wcard:1;
#endif
#ifndef __GENKSYMS__
__u8 ooo_okay:1;
#endif
/* 0/13 bit hole */
dma_cookie_t dma_cookie;
__u32 secmark;
union {
__u32 mark;
__u32 dropcount;
};
__u16 vlan_tci;
#ifndef __GENKSYMS__
__u16 rxhash;
#endif
sk_buff_data_t transport_header;
sk_buff_data_t network_header;
sk_buff_data_t mac_header;
/* These elements must be at the end, see alloc_skb() for details. */
sk_buff_data_t tail;
sk_buff_data_t end;
unsigned char *head,
              *data;
unsigned int truesize;
atomic_t users;
};
4.16 sk_buff_head
struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff *next;
    struct sk_buff *prev;
    __u32 qlen;
    spinlock_t lock;
};
4.17 net_device
/*
 * The DEVICE structure.
 * Actually, this whole structure is a big mistake. It mixes I/O
 * data with strictly "high-level" data, and it has to know about
 * almost every data structure used in the INET module.
 *
 * FIXME: cleanup struct net_device such that network protocol info
 * moves out.
 */
struct net_device {
/*
 * This is the first field of the "visible" part of this structure
 * (i.e. as seen by users in the "Space.c" file). It is the name
 * the interface.
 */
char name[IFNAMSIZ];
/* device name hash chain */
struct hlist_node name_hlist;
/* snmp alias */
char *ifalias;
/*
 * I/O specific fields
 * FIXME: Merge these and struct ifmap into one
 */
unsigned long mem_end; /* shared mem end */
unsigned long mem_start; /* shared mem start */
unsigned long base_addr; /* device I/O address */
unsigned int irq; /* device IRQ number */
/*
 * Some hardware also needs these fields, but they are not
 * part of the usual set specified in Space.c.
 */
unsigned char if_port; /* Selectable AUI,TP,..*/
unsigned char dma; /* DMA channel */
unsigned long state;
struct list_head dev_list;
struct list_head napi_list;
/* Net device features */
unsigned long features;
#define NETIF_F_SG 1 /* Scatter/gather IO. */
#define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */
#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopback. */
#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
#define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */
#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
/* do not use LLTX in new drivers */
#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
#define NETIF_F_GRO 16384 /* Generic receive offload */
#define NETIF_F_LRO 32768 /* large receive offload */
/* the GSO_MASK reserves bits 16 through 23 */
#define NETIF_F_FCOE_CRC (1 << 24) /* FCoE CRC32 */
#define NETIF_F_SCTP_CSUM (1 << 25) /* SCTP checksum offload */
#define NETIF_F_FCOE_MTU (1 << 26) /* Supports max FCoE MTU, 2158 bytes */
#define NETIF_F_NTUPLE (1 << 27) /* N-tuple filters supported */
#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK 0x00ff0000
/* List of features with software fallbacks. */
/*
 * If one device supports one of these features, then enable them
 * for all in netdev_increment_features.
 */
/* Interface index. Unique device identifier */
int ifindex;
int iflink;
struct net_device_stats stats;
/* List of functions to handle Wireless Extensions (instead of ioctl).
 * See <net/iw_handler.h> for details. Jean II */
const struct iw_handler_def *wireless_handlers;
/* Instance data managed by the core of Wireless Extensions. */
struct iw_public_data *wireless_data;
/* Management operations */
const struct net_device_ops *netdev_ops;
const struct ethtool_ops *ethtool_ops;
/* Hardware header description */
const struct header_ops *header_ops;
unsigned int flags; /* interface flags (a la BSD) */
unsigned short gflags;
unsigned short priv_flags; /* Like 'flags' but invisible to userspace. */
unsigned short padded; /* How much padding added by alloc_netdev() */
unsigned char operstate; /* RFC2863 operstate */
unsigned char link_mode; /* mapping policy to operstate */
unsigned mtu; /* interface MTU value */
unsigned short type; /* interface hardware type */
unsigned short hard_header_len; /* hardware hdr length */
/* extra head- and tailroom the hardware may need, but not in all cases
 * can this be guaranteed, especially tailroom. Some cases also use
 * LL_MAX_HEADER instead to allocate the skb.
 */
unsigned short needed_headroom;
unsigned short needed_tailroom;
struct net_device *master; /* Pointer to master device of a group,
 * which this device is member of.
 */
/* Interface address info. */
unsigned char perm_addr[MAX_ADDR_LEN]; /* permanent hw address */
unsigned char addr_assign_type; /* hw address assignment type */
unsigned char addr_len; /* hardware address length */
unsigned short dev_id; /* for shared network cards */
struct netdev_hw_addr_list uc; /* Secondary unicast mac addresses */
int uc_promisc;
spinlock_t addr_list_lock;
struct dev_addr_list *mc_list; /* Multicast mac addresses */
int mc_count; /* Number of installed mcasts */
unsigned int promiscuity;
unsigned int allmulti;
/* Protocol specific pointers */
void *dsa_ptr; /* dsa specific data */
void *atalk_ptr; /* AppleTalk link */
void *ip_ptr; /* IPv4 specific data */
void *dn_ptr; /* DECnet specific data */
void *ip6_ptr; /* IPv6 specific data */
void *ec_ptr; /* Econet specific data */
void *ax25_ptr; /* AX.25 specific data, also used by openvswitch */
struct wireless_dev *ieee80211_ptr; /* IEEE 802.11 specific data, assign before registering */
/*
 * Cache line mostly used on receive path (including eth_type_trans())
 */
unsigned long last_rx; /* Time of last Rx */
/* Interface address info used in eth_type_trans() */
unsigned char *dev_addr; /* hw address, (before bcast because most packets are unicast) */
struct netdev_hw_addr_list dev_addrs; /* list of device hw addresses */
unsigned char broadcast[MAX_ADDR_LEN];/* hw bcast add */
struct netdev_queue rx_queue;
struct netdev_queue *_tx ____cacheline_aligned_in_smp;
/* Number of TX queues allocated at alloc_netdev_mq() time */
unsigned int num_tx_queues;
/* Number of TX queues currently active in device */
unsigned int real_num_tx_queues;
/* root qdisc from userspace point of view */
struct Qdisc *qdisc;
unsigned long tx_queue_len; /* Max frames per queue allowed */
spinlock_t tx_global_lock;
/*
 * One part is mostly used on xmit path (device)
 */
/* These may be needed for future network-power-down code. */
/*
 * trans_start here is expensive for high speed devices on SMP,
 * please use netdev_queue->trans_start instead.
 */
unsigned long trans_start; /* Time (in jiffies) of last Tx */
int watchdog_timeo; /* used by dev_watchdog() */
struct timer_list watchdog_timer;
/* Number of references to this device */
atomic_t refcnt ____cacheline_aligned_in_smp;
/* delayed register/unregister */
struct list_head todo_list;
/* device index hash chain */
struct hlist_node index_hlist;
struct net_device *link_watch_next;
/* register/unregister state machine */
enum { NETREG_UNINITIALIZED = 0,
NETREG_REGISTERED, /* completed register_netdevice */
NETREG_UNREGISTERING, /* called unregister_netdevice */
NETREG_UNREGISTERED, /* completed unregister todo */
NETREG_RELEASED, /* called free_netdev */
NETREG_DUMMY, /* dummy device for NAPI poll */
} reg_state;
/* Called from unregister, can be used to call free_netdev */
void (*destructor)(struct net_device *dev);
struct netpoll_info *npinfo;
/* Network namespace this network device is inside */
struct net *nd_net;
/* mid-layer private */
void *ml_priv;
/* bridge stuff */
struct net_bridge_port *br_port;
/* macvlan */
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port *garp_port;
/* class/net/name entry */
struct device dev;
/* space for optional statistics and wireless sysfs groups */
const struct attribute_group *sysfs_groups[3];
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
/* VLAN feature mask */
unsigned long vlan_features;
/* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE 65536
unsigned int gso_max_size;
/* Data Center Bridging netlink ops */
const struct dcbnl_rtnl_ops *dcbnl_ops;
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
/* max exchange id for FCoE LRO by ddp */
unsigned int fcoe_ddp_xid;
#endif
};
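The features field of net_device is a plain bit mask, with bits 16 through 23 reserved for segmentation offload types. A minimal sketch of how such a mask can be tested is shown below. The macro values are copied from the listing above; toy_has_feature and toy_gso_bits are hypothetical helper names, not kernel functions.

```c
#include <assert.h>

/* Values copied from the net_device feature listing. */
#define NETIF_F_SG        1
#define NETIF_F_IP_CSUM   2
#define NETIF_F_GSO       2048
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK  0x00ff0000

/* Returns 1 when every bit of flag is set in features, otherwise 0. */
static int toy_has_feature(unsigned long features, unsigned long flag)
{
    return (features & flag) == flag;
}

/* Extracts the segmentation offload bits (16..23) as a small integer. */
static unsigned long toy_gso_bits(unsigned long features)
{
    return (features & NETIF_F_GSO_MASK) >> NETIF_F_GSO_SHIFT;
}
```

A driver would typically OR the flags it supports into features at probe time, and the stack would test them before deciding whether to checksum or segment in software.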
4.18 inet_protosw
/* This is used to register socket interfaces for IP protocols. */
struct inet_protosw {
    struct list_head list;

    /* These two fields form the lookup key. */
    unsigned short type; /* This is the 2nd argument to socket(2). */
    unsigned short protocol; /* This is the L4 protocol number. */

    struct proto *prot;
    const struct proto_ops *ops;

    char no_check; /* checksum on rcv/xmit/none? */
    unsigned char flags; /* See INET_PROTOSW_* below. */
};
4.19 inetsw_array
/* Upon startup we insert all the elements in inetsw_array[] into
 * the linked list inetsw.
 */
static struct inet_protosw inetsw_array[] =
{
    {
        .type = SOCK_STREAM,
        .protocol = IPPROTO_TCP,
        .prot = &tcp_prot,
        .ops = &inet_stream_ops,
        .no_check = 0,
    },
    {
        .type = SOCK_DGRAM,
        .protocol = IPPROTO_UDP,
        .prot = &udp_prot,
        .ops = &inet_dgram_ops,
        .no_check = UDP_CSUM_DEFAULT,
    },
    {
        .type = SOCK_DGRAM,
        .protocol = IPPROTO_ICMP,
        .prot = &ping_prot,
        .ops = &inet_dgram_ops,
        .no_check = UDP_CSUM_DEFAULT,
    },
    {
        .type = SOCK_RAW,
        .protocol = IPPROTO_IP, /* wild card */
        .prot = &raw_prot,
        .ops = &inet_sockraw_ops,
        .no_check = UDP_CSUM_DEFAULT,
    }
};
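The (type, protocol) pair in each inetsw_array entry is the key used when a socket is created. The sketch below is loosely modeled on that lookup and makes several simplifying assumptions: it scans a flat array instead of the per-type linked lists, all toy_* and TOY_* names are hypothetical, and only the matching rule is shown (exact match, protocol 0 passed by the caller, or a wildcard table entry).

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_SOCK_STREAM = 1, TOY_SOCK_DGRAM = 2, TOY_SOCK_RAW = 3 };
enum { TOY_IPPROTO_IP = 0, TOY_IPPROTO_ICMP = 1,
       TOY_IPPROTO_TCP = 6, TOY_IPPROTO_UDP = 17 };

struct toy_protosw {
    int type;          /* 2nd argument to socket() */
    int protocol;      /* L4 protocol number, 0 acts as a wildcard */
    const char *name;  /* stands in for the prot/ops pointers */
};

static const struct toy_protosw toy_inetsw[] = {
    { TOY_SOCK_STREAM, TOY_IPPROTO_TCP,  "tcp"  },
    { TOY_SOCK_DGRAM,  TOY_IPPROTO_UDP,  "udp"  },
    { TOY_SOCK_DGRAM,  TOY_IPPROTO_ICMP, "ping" },
    { TOY_SOCK_RAW,    TOY_IPPROTO_IP,   "raw"  },
};

/* Returns the first entry matching (type, protocol), or NULL if none does. */
static const struct toy_protosw *toy_lookup(int type, int protocol)
{
    size_t i;

    for (i = 0; i < sizeof(toy_inetsw) / sizeof(toy_inetsw[0]); i++) {
        const struct toy_protosw *p = &toy_inetsw[i];

        if (p->type != type)
            continue;
        if (p->protocol == protocol)
            return p;               /* exact match */
        if (protocol == TOY_IPPROTO_IP)
            return p;               /* caller passed 0: first entry of this type */
        if (p->protocol == TOY_IPPROTO_IP)
            return p;               /* table entry is a wildcard */
    }
    return NULL;
}
```

With this rule, socket(AF_INET, SOCK_DGRAM, 0) resolves to the UDP entry, while a raw socket accepts any protocol number because its table entry is the wildcard.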
4.20 sock_type
/**
 * enum sock_type - Socket types
 * @SOCK_STREAM: stream (connection) socket
 * @SOCK_DGRAM: datagram (conn.less) socket
 * @SOCK_RAW: raw socket
 * @SOCK_RDM: reliably-delivered message
 * @SOCK_SEQPACKET: sequential packet socket
 * @SOCK_DCCP: Datagram Congestion Control Protocol socket
 * @SOCK_PACKET: linux specific way of getting packets at the dev level.
 * For writing rarp and other similar things on the user level.
 *
 * When adding some new socket type please
 * grep ARCH_HAS_SOCKET_TYPE include/asm-* /socket.h, at least MIPS
 * overrides this enum for binary compat reasons.
 */
enum sock_type {
    SOCK_STREAM = 1,
    SOCK_DGRAM = 2,
    SOCK_RAW = 3,
    SOCK_RDM = 4,
    SOCK_SEQPACKET = 5,
    SOCK_DCCP = 6,
    SOCK_PACKET = 10,
};
4.21 IPPROTO
/* Standard well-defined IP protocols. */
enum {
    IPPROTO_IP = 0, /* Dummy protocol for TCP */
    IPPROTO_ICMP = 1, /* Internet Control Message Protocol */
    IPPROTO_IGMP = 2, /* Internet Group Management Protocol */
    IPPROTO_IPIP = 4, /* IPIP tunnels (older KA9Q tunnels use 94) */
    IPPROTO_TCP = 6, /* Transmission Control Protocol */
    IPPROTO_EGP = 8, /* Exterior Gateway Protocol */
    IPPROTO_PUP = 12, /* PUP protocol */
    IPPROTO_UDP = 17, /* User Datagram Protocol */
    IPPROTO_IDP = 22, /* XNS IDP protocol */
    IPPROTO_DCCP = 33, /* Datagram Congestion Control Protocol */
    IPPROTO_RSVP = 46, /* RSVP protocol */
    IPPROTO_GRE = 47, /* Cisco GRE tunnels (rfc 1701,1702) */
    IPPROTO_IPV6 = 41, /* IPv6-in-IPv4 tunnelling */
    IPPROTO_ESP = 50, /* Encapsulation Security Payload protocol */
    IPPROTO_AH = 51, /* Authentication Header protocol */
    IPPROTO_BEETPH = 94, /* IP option pseudo header for BEET */
    IPPROTO_PIM = 103, /* Protocol Independent Multicast */
    IPPROTO_COMP = 108, /* Compression Header protocol */
    IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */
    IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */
    IPPROTO_RAW = 255, /* Raw IP packets */
};
/* The inetsw table contains everything that inet_create needs to
 * build a new socket.
 */
static struct list_head inetsw[SOCK_MAX];
4.22 net_protocol
/* This is used to register protocols. */
struct net_protocol {
    int (*handler)(struct sk_buff *skb);
    void (*err_handler)(struct sk_buff *skb, u32 info);
    int (*gso_send_check)(struct sk_buff *skb);
    struct sk_buff *(*gso_segment)(struct sk_buff *skb,
                                   int features);
    struct sk_buff **(*gro_receive)(struct sk_buff **head,
                                    struct sk_buff *skb);
    int (*gro_complete)(struct sk_buff *skb);
    unsigned int no_policy:1,
                 netns_ok:1;
};

static const struct net_protocol udp_protocol = {
    .handler = udp_rcv,
    .err_handler = udp_err,
    .gso_send_check = udp4_ufo_send_check,
    .gso_segment = udp4_ufo_fragment,
    .no_policy = 1,
    .netns_ok = 1,
};

extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];
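inet_protos is the table through which the IP layer hands a packet to its transport protocol: the protocol number in the IP header selects a slot, and the slot's handler is called. The sketch below models only that dispatch. All toy_* names are hypothetical, the table size of 256 is an assumption for illustration, and the handler signature is simplified to a byte buffer instead of an sk_buff.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INET_PROTOS 256   /* assumed table size for the sketch */

struct toy_net_protocol {
    int (*handler)(const unsigned char *payload, size_t len);
};

static const struct toy_net_protocol *toy_inet_protos[TOY_MAX_INET_PROTOS];

/* Register a transport protocol; fails if the slot is already taken. */
static int toy_add_protocol(const struct toy_net_protocol *prot,
                            unsigned char protocol)
{
    unsigned int hash = protocol & (TOY_MAX_INET_PROTOS - 1);

    if (toy_inet_protos[hash] != NULL)
        return -1;
    toy_inet_protos[hash] = prot;
    return 0;
}

/* Deliver a payload to the handler registered for the protocol number. */
static int toy_deliver(unsigned char protocol,
                       const unsigned char *payload, size_t len)
{
    const struct toy_net_protocol *prot =
        toy_inet_protos[protocol & (TOY_MAX_INET_PROTOS - 1)];

    if (prot == NULL)
        return -1;   /* no handler registered for this protocol */
    return prot->handler(payload, len);
}

/* Hypothetical UDP receive handler: just reports the payload length. */
static int toy_udp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload;
    return (int)len;
}

static const struct toy_net_protocol toy_udp_protocol = {
    .handler = toy_udp_rcv,
};
```

Registering toy_udp_protocol under number 17 and then delivering a payload with that number reaches toy_udp_rcv, which mirrors the relationship between udp_protocol, IPPROTO_UDP and udp_rcv in the listing.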
4.23 packet_type
struct packet_type {
    __be16 type; /* This is really htons(ether_type). */
    struct net_device *dev; /* NULL is wildcarded here */
    int (*func)(struct sk_buff *,
                struct net_device *,
                struct packet_type *,
                struct net_device *);
    struct sk_buff *(*gso_segment)(struct sk_buff *skb,
                                   int features);
    int (*gso_send_check)(struct sk_buff *skb);
    struct sk_buff **(*gro_receive)(struct sk_buff **head,
                                    struct sk_buff *skb);
    int (*gro_complete)(struct sk_buff *skb);
    void *af_packet_priv;
    struct list_head list;
};

/*
 * IP protocol layer initialiser
 */
static struct packet_type ip_packet_type = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
    .gso_send_check = inet_gso_send_check,
    .gso_segment = inet_gso_segment,
    .gro_receive = inet_gro_receive,
    .gro_complete = inet_gro_complete,
};

/*
 * These are the defined Ethernet Protocol ID's.
 */
#define ETH_P_LOOP 0x0060 /* Ethernet Loopback packet */
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
#define ETH_P_BPQ 0x08FF /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_IEEEPUP 0x0a00 /* Xerox IEEE 802.3 PUP packet */
#define ETH_P_IEEEPUPAT 0x0a01 /* Xerox IEEE 802.3 PUP Addr Trans packet */
#define ETH_P_DEC 0x6000 /* DEC Assigned proto */
#define ETH_P_DNA_DL 0x6001 /* DEC DNA Dump/Load */
#define ETH_P_DNA_RC 0x6002 /* DEC DNA Remote Console */
#define ETH_P_DNA_RT 0x6003 /* DEC DNA Routing */
#define ETH_P_LAT 0x6004 /* DEC LAT */
#define ETH_P_DIAG 0x6005 /* DEC Diagnostics */
#define ETH_P_CUST 0x6006 /* DEC Customer use */
#define ETH_P_SCA 0x6007 /* DEC Systems Comms Arch */
#define ETH_P_TEB 0x6558 /* Trans Ether Bridging */
#define ETH_P_RARP 0x8035 /* Reverse Addr Res packet */
#define ETH_P_ATALK 0x809B /* Appletalk DDP */
#define ETH_P_AARP 0x80F3 /* Appletalk AARP */
#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
#define ETH_P_IPX 0x8137 /* IPX over DIX */
#define ETH_P_IPV6 0x86DD /* IPv6 over bluebook */
#define ETH_P_PAUSE 0x8808 /* IEEE Pause frames. See 802.3 31B */
#define ETH_P_SLOW 0x8809 /* Slow Protocol. See 802.3ad 43B */
#define ETH_P_WCCP 0x883E /* Web-cache coordination protocol
* defined in draft-wilson-wrec-wccp-v2-00.txt*/
#define ETH_P_PPP_DISC 0x8863 /* PPPoE discovery messages */
#define ETH_P_PPP_SES 0x8864 /* PPPoE session messages */
#define ETH_P_MPLS_UC 0x8847 /* MPLS Unicast traffic */
#define ETH_P_MPLS_MC 0x8848 /* MPLS Multicast traffic */
#define ETH_P_ATMMPOA 0x884c /* MultiProtocol Over ATM */
#define ETH_P_ATMFATE 0x8884 /* Frame-based ATM Transport
 * over Ethernet
 */
#define ETH_P_PAE 0x888E /* Port Access Entity (IEEE 802.1X) */
#define ETH_P_AOE 0x88A2 /* ATA over Ethernet */
#define ETH_P_TIPC 0x88CA /* TIPC */
#define ETH_P_1588 0x88F7 /* IEEE 1588 Timesync */
#define ETH_P_FCOE 0x8906 /* Fibre Channel over Ethernet */
#define ETH_P_TDLS 0x890D /* TDLS */
#define ETH_P_FIP 0x8914 /* FCoE Initialization Protocol */
static struct list_head ptype_base[PTYPE_HASH_SIZE];
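ptype_base is a small hash table keyed by the EtherType: the receive path hashes the frame's protocol field, walks the matching bucket, and calls the func of each packet_type whose type matches. The user-space sketch below illustrates that lookup. The toy_* names are hypothetical, a singly linked chain replaces list_head, the table size of 16 is an assumption of the sketch, and the handler takes a raw byte buffer rather than an sk_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>   /* htons, ntohs */

#define TOY_ETH_P_IP        0x0800
#define TOY_ETH_P_ARP       0x0806
#define TOY_PTYPE_HASH_SIZE 16     /* assumed bucket count */
#define TOY_PTYPE_HASH_MASK (TOY_PTYPE_HASH_SIZE - 1)

struct toy_packet_type {
    unsigned short type;   /* EtherType in network byte order */
    int (*func)(const unsigned char *frame, size_t len);
    struct toy_packet_type *next;   /* simple chain instead of list_head */
};

static struct toy_packet_type *toy_ptype_base[TOY_PTYPE_HASH_SIZE];

/* Register a layer-3 handler in the bucket chosen by its EtherType. */
static void toy_dev_add_pack(struct toy_packet_type *pt)
{
    unsigned int hash = ntohs(pt->type) & TOY_PTYPE_HASH_MASK;

    pt->next = toy_ptype_base[hash];
    toy_ptype_base[hash] = pt;
}

/* Dispatch a frame to the handler registered for its EtherType. */
static int toy_receive(unsigned short type_be,
                       const unsigned char *frame, size_t len)
{
    unsigned int hash = ntohs(type_be) & TOY_PTYPE_HASH_MASK;
    struct toy_packet_type *pt;

    for (pt = toy_ptype_base[hash]; pt != NULL; pt = pt->next)
        if (pt->type == type_be)
            return pt->func(frame, len);
    return -1;   /* no handler: the frame would be dropped */
}

/* Hypothetical IP receive handler; returns 4 to identify itself. */
static int toy_ip_rcv(const unsigned char *frame, size_t len)
{
    (void)frame;
    (void)len;
    return 4;
}

static struct toy_packet_type toy_ip_packet_type = { .func = toy_ip_rcv };

/* Counterpart of dev_add_pack(&ip_packet_type) in the sketch. */
static void toy_ip_init(void)
{
    toy_ip_packet_type.type = htons(TOY_ETH_P_IP);
    toy_dev_add_pack(&toy_ip_packet_type);
}
```

After toy_ip_init(), a frame whose EtherType is 0x0800 reaches toy_ip_rcv, while an ARP frame finds no handler until an ARP packet_type is registered as well.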
4.24 rtable
struct rtable {
    union {
        struct dst_entry dst;
    } u;

    /* Cache lookup keys */
    struct flowi fl;
    struct in_device *idev;
    int rt_genid;
    unsigned rt_flags;
    __u16 rt_type;
    __be32 rt_dst; /* Path destination */
    __be32 rt_src; /* Path source */
    int rt_iif;

    /* Info on neighbour */
    __be32 rt_gateway;

    /* Miscellaneous cached information */
    __be32 rt_spec_dst; /* RFC1122 specific destination */
    struct inet_peer *peer; /* long-living peer info */
};
4.25 rt_hash_bucket
/*
 * Route cache.
 */

/* The locking scheme is rather straight forward:
 *
 * 1) Read-Copy Update protects the buckets of the central route hash.
 * 2) Only writers remove entries, and they hold the lock
 *    as they look at rtable reference counts.
 * 3) Only readers acquire references to rtable entries,
 *    they do so with atomic increments and with the
 *    lock held.
 */
struct rt_hash_bucket {
    struct rtable *chain;
};
4.26 dst_entry
/* Each dst_entry has reference count and sits in some parent list(s).
 * When it is removed from parent list, it is "freed" (dst_free).
 * After this it enters dead state (dst->obsolete > 0) and if its refcnt
 * is zero, it can be destroyed immediately, otherwise it is added
 * to gc list and garbage collector periodically checks the refcnt.
 */
struct dst_entry {
struct rcu_head rcu_head;
struct dst_entry *child;
struct net_device *dev;
short error;
short obsolete;
int flags;
#define DST_HOST 1
#define DST_NOXFRM 2
#define DST_NOPOLICY 4
#define DST_NOHASH 8
unsigned long expires;
unsigned short header_len; /* more space at head required */
unsigned short trailer_len; /* space to reserve at tail */
unsigned int rate_tokens;
unsigned long rate_last; /* rate limiting for ICMP */
struct dst_entry *path;
struct neighbour *neighbour;
struct hh_cache *hh;
struct xfrm_state *xfrm;
void *__pad1;
int (*input)(struct sk_buff*);
int (*output)(struct sk_buff*);
struct dst_ops *ops;
/* This Red Hat kABI workaround will shift tclassid 32 bit, while we
 * still keep the original size of dst_entry and assures alignment
 * (see further down).
 */
#ifdef __GENKSYMS__
u32 metrics[RTAX_MAX_ORIG];
u32 metrics[RTAX_MAX];
__u32 tclassid;
__u32 __pad2;
/*
 * Align __refcnt to a 64 bytes alignment
 * (L1_CACHE_SIZE would be too much)
 */
/* Red Hat kABI workaround to assure aligning __refcnt, while
 * consuming 32 bit of padding for our metrics expansion above.
 * On 32bit archs no padding remains.
 */
#ifdef __GENKSYMS__
#ifdef CONFIG_64BIT
long __pad_to_align_refcnt[2];
long __pad_to_align_refcnt[1];
#else /* __GENKSYMS__ */
#ifdef CONFIG_64BIT
u32 __pad_hole_in_struct;
long __pad_to_align_refcnt[1];
#endif /* __GENKSYMS__ */
/*
 * __refcnt wants to be on a different cache line from
 * input/output/ops or performance tanks badly
 */
atomic_t __refcnt; /* client references */
int __use;
unsigned long lastuse;
union {
    struct dst_entry *next;
    struct rtable *rt_next;
    struct rt6_info *rt6_next;
    struct dn_route *dn_next;
};
};
4.27 napi_struct
NAPI: NAPI is a technique used in Linux to make network receive processing more efficient. Its core idea is to stop taking one interrupt per packet: an interrupt is used only to wake up the receive service routine, which then fetches the packets by polling. NAPI is best suited to high-rate streams of short packets.
/*
 * Structure for NAPI scheduling similar to tasklet but with weighting
 */
struct napi_struct {
    /* The poll_list must only be managed by the entity which
     * changes the state of the NAPI_STATE_SCHED bit. This means
     * whoever atomically sets that bit can add this napi_struct
     * to the per-cpu poll_list, and whoever clears that bit
     * can remove from the list right before clearing the bit.
     */
    struct list_head poll_list;

    unsigned long state;
    int weight;
    int (*poll)(struct napi_struct *, int);
    spinlock_t poll_lock;
    int poll_owner;
    unsigned int gro_count;

    struct net_device *dev;
    struct list_head dev_list;
    struct sk_buff *gro_list;
    struct sk_buff *skb;
};
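The interplay between the interrupt, the weight field and the poll callback can be illustrated with a small simulation. This is a sketch under stated assumptions, not driver code: toy_napi and the toy_* functions are hypothetical names, the receive ring is reduced to a counter, and the scheduled flag stands in for the NAPI_STATE_SCHED bit.

```c
#include <assert.h>

struct toy_napi {
    int weight;       /* maximum packets handled per poll call */
    int scheduled;    /* stands in for NAPI_STATE_SCHED */
    int rx_pending;   /* packets waiting in the simulated RX ring */
    int (*poll)(struct toy_napi *, int budget);
};

/* Poll callback: handle up to budget packets and report how many were done. */
static int toy_poll(struct toy_napi *n, int budget)
{
    int work = 0;

    while (work < budget && n->rx_pending > 0) {
        n->rx_pending--;
        work++;
    }
    if (work < budget)
        n->scheduled = 0;   /* ring drained: stop polling, interrupts come back */
    return work;
}

/* Interrupt handler: only schedules polling, it does not touch the packets. */
static void toy_irq(struct toy_napi *n, int arrived)
{
    n->rx_pending += arrived;
    n->scheduled = 1;
}

/* Softirq loop: keeps calling poll while scheduled; returns the call count. */
static int toy_net_rx_action(struct toy_napi *n)
{
    int calls = 0;

    while (n->scheduled) {
        n->poll(n, n->weight);
        calls++;
    }
    return calls;
}
```

With a weight of 64, a burst of 150 packets is drained in three poll calls after a single interrupt, which is the saving NAPI aims for compared with taking 150 interrupts.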
5 Data structure class diagram
Figure 2 Data structures
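6 Protocol stack registration flow
Once the kernel has finished decompressing itself, kernel startup begins. It starts in arch/mips/kernel/head.S, which performs the low-level initialization such as clearing the BSS data area and setting up registers (on x86 the equivalent code also sets up the IDT, the GDT and the page tables). This file defines the kernel entry function kernel_entry(), which is architecture-specific assembly code. It first initializes the kernel stack in preparation for creating the first process in the system, then uses a loop to zero the uninitialized data segment of the kernel image, and finally jumps to start_kernel(), which initializes the hardware-related code and completes the setup of the Linux core environment.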
start_kernel
    setup_arch
        cpu_probe
        prom_init
        cpu_report
        arch_mem_init
        resource_init
    sched_init
    init_IRQ
    proc_root_init
    mm_init
    console_init
    rest_init
        kernel_init
            do_basic_setup
                init_tmpfs
                driver_init
                do_initcalls
            init_post
        cpu_idle
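sock_init: initializes the sk_buff SLAB cache and registers the socket file system.
net_inuse_init: allocates a per-CPU cache for each CPU.
proto_init: creates the protocols file under /proc/net and registers its file operations.
net_dev_init: sets up the netdevice-related data structures under /proc/sys and registers the network transmit and receive softirq handlers. It also initializes a packet receive queue (softnet_data) and its receive callback for each CPU, and registers the loopback operations and the default network device operations. (Driver layer)
inet_init: registers the socket creation method of the INET protocol family and the basic receive handlers for TCP, UDP, ICMP and IGMP, and creates the proc files for the IPv4 protocol family. Its main steps are:
1. rc = proto_register(&udp_prot, 1); registers the UDP protocol with the INET layer and allocates a slab cache for it.
2. (void)sock_register(&inet_family_ops); registers the operation set of the INET protocol family (mainly the creation of INET sockets) in static const struct net_proto_family *net_families[NPROTO]. (INET socket layer)
3. inet_add_protocol(&udp_protocol, IPPROTO_UDP); registers the operation set of the transport-layer UDP protocol in extern const struct net_protocol *inet_protos[MAX_INET_PROTOS]. (Network layer)
4. for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r) INIT_LIST_HEAD(r); initializes static struct list_head inetsw[SOCK_MAX], the socket-type array. It is an array of lists: each element is a list that links the protocols and operation sets sharing one socket type.
5. for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) inet_register_protosw(q); registers each protocol's operation set with the sock layer. (BSD socket layer and INET socket layer)
6. arp_init(); enables ARP support.
7. ip_init(); enables IP support.
8. udp_init(); enables UDP support.
9. dev_add_pack(&ip_packet_type); registers the operation set of the IP protocol in ptype_base[PTYPE_HASH_SIZE]. (Protocol-independent layer)
10. System call layer: the system call interfaces provided in socket.c.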
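Step 2 above, registering a protocol family so that the system call layer can find its create method, can be sketched as a table indexed by family number. The code below is an illustration only: the toy_* names are hypothetical, the table size of 40 is arbitrary, and the create callback returns an encoded integer instead of building a socket.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPROTO  40   /* arbitrary table size for the sketch */
#define TOY_PF_INET 2

struct toy_net_proto_family {
    int family;
    int (*create)(int type, int protocol);
};

static const struct toy_net_proto_family *toy_net_families[TOY_NPROTO];

/* Register a protocol family; fails on a bad number or an occupied slot. */
static int toy_sock_register(const struct toy_net_proto_family *ops)
{
    if (ops->family < 0 || ops->family >= TOY_NPROTO)
        return -1;
    if (toy_net_families[ops->family] != NULL)
        return -2;
    toy_net_families[ops->family] = ops;
    return 0;
}

/* System call layer: look up the family and delegate to its create method. */
static int toy_socket(int family, int type, int protocol)
{
    if (family < 0 || family >= TOY_NPROTO || toy_net_families[family] == NULL)
        return -1;   /* address family not supported */
    return toy_net_families[family]->create(type, protocol);
}

/* Hypothetical INET create method; encodes its arguments for inspection. */
static int toy_inet_create(int type, int protocol)
{
    return type * 1000 + protocol;
}

static const struct toy_net_proto_family toy_inet_family_ops = {
    .family = TOY_PF_INET,
    .create = toy_inet_create,
};
```

The same pattern repeats at each layer of the list above: a table is filled at initialization time, and the data path later indexes or searches it to find the next layer's handler.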
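7 Socket creation flow
This chapter describes how a socket is created and how the parameters are passed down, starting from fd = socket(family, type, protocol);, and how the resulting data structures are organized in memory once the socket exists.
Figure 3 Socket creation flow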
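From user space, the three parameters map directly onto the standard socket(2) call. The sketch below creates a UDP socket and reads back its type with getsockopt(SO_TYPE); open_udp_socket and socket_type_of are hypothetical helper names, and the example assumes a POSIX system where creating an AF_INET socket is permitted.

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* family = AF_INET, type = SOCK_DGRAM, protocol = IPPROTO_UDP */
static int open_udp_socket(void)
{
    return socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
}

/* Ask the kernel which socket type is behind the descriptor, -1 on error. */
static int socket_type_of(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}
```

The family selects the entry in net_families, and the type and protocol are then matched against the inetsw lists described in section 4.19.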
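8 Protocol stack receive flow
Figure: send and receive flow
Figure: kernel receive flow
Figure: application-layer receive flow
9 Protocol stack send flow
Figure: UDP send flow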
10 Summary
11 References
1. TCP/IP Illustrated, Volume 1
2. Blog posts
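Figure 1 Initialization flow
Figure 2 Layered data structures
Figure 3 Socket creation flow
Figure 4 Send and receive flow
Figure 5 Detailed kernel receive flow (interrupt-driven receive)
Figure 6 Application-layer receive flow
Figure 7 UDP send flow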