Linux File System Change Monitoring Technology and Notifier Technology

Date: 2019-11-19


Contents

1. Why monitor the file system
2. hotplug
3. udev
4. fanotify (fscking all notification system)
5. inotify
6. code example

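1. Why monitor the file system

In day-to-day work, people often need to know what has changed in certain files or directories, for example:

1. Being notified when a configuration file changes
2. Tracking changes to critical system files
3. Monitoring the overall usage of a disk partition
4. Cleaning up automatically after a system crash
5. Triggering a backup process automatically
6. Sending a notification when a file upload to a server completes
7. Anti-virus software needs to monitor file changes on disk in real time and scan the file contents

The usual notification mechanism is file polling. Polling only suits files that change frequently (because every poll, every x seconds, is then likely to find new I/O). In all other cases it is very inefficient, and it can miss certain kinds of change, for example one that leaves the file's modification time untouched. Data-integrity systems such as Tripwire track file changes on a time schedule, but a schedule is of no help when changes must be detected in real time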

Relevant Link:

http://www.jiangmiao.org/blog/2179.html
http://www.infoq.com/cn/articles/inotify-linux-file-system-event-monitoring

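2. hotplug

Hotplug is a mechanism by which the kernel notifies user-space applications of events concerning hot-pluggable devices; desktop systems use it to manage devices effectively. Whenever a device is added to or removed from the system, a "hotplug event" is generated, which means the kernel invokes the user-space program /sbin/hotplug. This program is typically a very small bash script that merely passes execution on to a series of other programs located in the /etc/hotplug.d/ directory tree. On most Linux distributions, the script looks like this:

DIR="/etc/hotplug.d"
for I in "${DIR}/$1/"*.hotplug "${DIR}/"default/*.hotplug ; do
    if [ -f $I ]; then
        test -x $I && $I $1 ;
    fi
done
exit 1

The script searches for every program with a .hotplug suffix that might be interested in the event and invokes it, passing along a number of environment variables that the kernel has set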
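As a rough illustration of what one of these handler programs sees, the sketch below formats a one-line summary from the environment. It assumes the kernel has exported `ACTION` and `DEVPATH` and that the subsystem name arrives as the first argument (the `$1` in the script above); the function name `describe_hotplug_event` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Formats a summary of the hotplug event described by the environment.
 * 'subsystem' is what the script passes as $1. Returns the number of
 * characters snprintf() would have written. */
static int describe_hotplug_event(const char *subsystem, char *out, size_t n)
{
    const char *action  = getenv("ACTION");   /* e.g. "add" or "remove" */
    const char *devpath = getenv("DEVPATH");  /* path below /sys */

    return snprintf(out, n, "%s: %s %s",
                    subsystem ? subsystem : "(unknown)",
                    action    ? action    : "(no ACTION)",
                    devpath   ? devpath   : "(no DEVPATH)");
}
```

A real handler would call this from `main()` with `argv[1]` and then act on the event, for example by loading a driver or writing a log entry.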

Relevant Link:

http://linux-hotplug.sourceforge.net/?selected=overview
http://oss.org.cn/kernel-book/ldd3/ch14s07.html

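3. udev

udev is the device manager for the Linux 2.6 kernel series. Its main job is to manage the device nodes under /dev. It also takes over the roles of devfs and hotplug, which means it handles /dev and all user-space actions when hardware is added or removed, including firmware loading
On a traditional Linux system, the device nodes under /dev are a static set of files, whereas udev dynamically provides nodes only for the devices actually present in the system. devfs offered similar functionality, but udev has the following advantages

1. udev supports persistent device naming that does not depend on the order in which devices are plugged in. The default udev configuration provides persistent names for storage devices. A device can be identified by
    1) its vid (vendor)
    2) its pid (device)
    3) its device name (model) and similar attributes
    4) or the corresponding attributes of its parent device
2. udev runs entirely in user space, unlike devfs, which runs in kernel space. As a result, udev moves naming policy out of the kernel and can run arbitrary programs to derive a device's name from its attributes before the node is created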

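0x1: How it runs

udev is a generic kernel device manager. It runs as a daemon on Linux and listens for the uevents that the kernel sends (over a netlink socket) when a new device is initialized or a device is removed from the system
The system ships a set of rules that are matched against the discoverable device events and exported attribute values. A matching rule may name and create the device node and run configuration programs to set the device up. udev rules can match on properties such as the kernel subsystem, the kernel device name, the physical location of the device, or its serial number. Rules can also ask external programs for information to name the device, or assign a custom name that always stays the same regardless of when the system discovers the device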
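To make the netlink part concrete, here is a minimal sketch of how a user-space program can subscribe to kernel uevents and pick a field out of a received message. It is not udev's own code: the helper names `open_uevent_socket` and `uevent_get` are hypothetical, and the parser assumes the usual kernel message layout of NUL-separated `KEY=value` strings after an `action@devpath` header.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Opens a netlink socket subscribed to kernel uevents (multicast
 * group 1). Returns the descriptor, or -1 on error. */
static int open_uevent_socket(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

    if (fd == -1)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;                 /* kernel uevent broadcast group */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Looks up 'key' in a uevent message of 'len' bytes. The message must
 * end with a NUL byte. Returns the value, or NULL if the key is absent. */
static const char *uevent_get(const char *msg, size_t len, const char *key)
{
    size_t klen = strlen(key);
    const char *p = msg, *end = msg + len;

    while (p < end) {
        size_t n = strnlen(p, (size_t)(end - p));
        if (n > klen && p[klen] == '=' && memcmp(p, key, klen) == 0)
            return p + klen + 1;
        p += n + 1;                     /* skip the field and its NUL */
    }
    return NULL;
}
```

A monitor would `recv()` from the socket into a buffer, append a terminating NUL, and then call `uevent_get(buf, len, "ACTION")` or `uevent_get(buf, len, "DEVPATH")` before matching rules against the values.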

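0x2: System architecture

The udev system can be divided into three parts

1. The libudev library: used to obtain information about devices
2. The udevd daemon: runs in user space and manages the virtual /dev
3. The udevadm administration command: used to diagnose problems

The system receives the messages that the kernel sends over the netlink socket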

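0x3: Rule format

1. BUS (bus), KERNEL (kernel name, e.g. sd*), ID (device id, e.g. the bus id), PLACE
2. SYSFS{filename} or ATTR{filename}
3. PROGRAM (invoke an external program), RESULT (match the result returned by PROGRAM), NAME
4. SYMLINK (symlink rule)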

Relevant Link:

http://zh.wikipedia.org/wiki/Udev
https://www.ibm.com/developerworks/cn/linux/l-cn-udev/
https://wiki.archlinux.org/index.php/Udev_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)
https://www.suse.com/zh-cn/documentation/sles11/singlehtml/book_sle_admin/cha.udev.html

4. fanotify(fscking all notification system)

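Fanotify (fscking all notification and file access system) is a notifier, that is, a mechanism that generates notifications about file system changes. It was designed as the next-generation file system notification mechanism to succeed inotify

0x1: fanotify features: file system event notification

The most basic job of a notifier is to tell the monitoring program when something in the file system changes. Historically, Linux first provided this service through dnotify, which inotify later replaced. Fanotify also provides notification, for the following events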

1. FAN_ACCESS: File was accessed
2. FAN_MODIFY: File was modified
3. FAN_CLOSE_WRITE: Writable file closed
4. FAN_CLOSE_NOWRITE: Unwritable file closed
5. FAN_OPEN: File was opened
6. FAN_OPEN_PERM: File open in perm check
7. FAN_ACCESS_PERM: File accessed in perm check
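Each event arrives as a bit mask in which several of the flags above may be set at once. The following sketch turns such a mask into a readable string; the function name `fan_mask_to_str` is hypothetical, and only the seven flags listed above are decoded.

```c
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>

/* Writes the names of the fanotify event bits set in 'mask' into 'out',
 * separated by '|'. The output is truncated if it does not fit. */
static void fan_mask_to_str(unsigned long long mask, char *out, size_t n)
{
    static const struct { unsigned long long bit; const char *name; } tbl[] = {
        { FAN_ACCESS,        "FAN_ACCESS" },
        { FAN_MODIFY,        "FAN_MODIFY" },
        { FAN_CLOSE_WRITE,   "FAN_CLOSE_WRITE" },
        { FAN_CLOSE_NOWRITE, "FAN_CLOSE_NOWRITE" },
        { FAN_OPEN,          "FAN_OPEN" },
        { FAN_OPEN_PERM,     "FAN_OPEN_PERM" },
        { FAN_ACCESS_PERM,   "FAN_ACCESS_PERM" },
    };

    if (n == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof tbl / sizeof tbl[0]; i++) {
        if (!(mask & tbl[i].bit))
            continue;
        size_t used = strlen(out);
        snprintf(out + used, n - used, "%s%s", used ? "|" : "", tbl[i].name);
    }
}
```

A listener could call this on the `mask` field of each `struct fanotify_event_metadata` it reads, for instance when logging.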

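0x2: fanotify features: whole-file-system monitoring

Inotify uses a data structure called the watch descriptor to represent a monitored file or directory. Every file system object (file or directory) to be monitored needs its own wd

Fanotify has three basic modes

1. directed: like inotify, this mode works directly on the inode of the monitored object and can watch only one object at a time, so monitoring a large number of targets is just as cumbersome
2. per-mount: works on a mount point. For example, if disk /dev/sda2 is mounted on /home, every file system change under /home can be monitored. This can be regarded as another kind of global mode
3. global: monitors the entire file system and notifies the listener of any change. Anti-virus software works in this mode
/*
Note that
fanotify still cannot do sub-tree monitoring. It goes one step further than inotify, however, in that it can monitor the direct children of a directory. For example, it can monitor /home and its direct children, so files such as /home/foo1 and /home/foo2 are covered, but /home/pics/foo1 is not, because /home/pics/foo1 is not a direct child of /home
*/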

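0x3: fanotify features: access decisions

An access decision means that when a file is accessed, the monitoring program not only receives the event notification but can also decide whether to allow the operation. Anti-virus software needs this: when you try to open a file that contains a virus, fanotify sends a notification to the anti-virus software acting as listener, which must then both determine whether the file about to be opened is infected and block the unsafe operation

When an app opens a file that is being monitored by an AV program, the open triggers a fanotify notification. Before the VFS lets open return, fanotify asks the AV program. If the AV program allows the access, the app's open call succeeds; otherwise it fails. This prevents applications from opening infected files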
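The decision itself is delivered by writing a `struct fanotify_response` back to the fanotify descriptor. The following is a minimal sketch of that step, assuming the listener has already read an event and decided on a verdict; the function name `answer_perm_event` is hypothetical.

```c
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Answers a permission event on 'fan_fd'. Returns 1 if a response was
 * written, 0 if 'ev' is not a permission event, -1 on write error.
 * The caller remains responsible for closing ev->fd. */
static int answer_perm_event(int fan_fd,
                             const struct fanotify_event_metadata *ev,
                             int allow)
{
    struct fanotify_response resp;

    if (!(ev->mask & (FAN_OPEN_PERM | FAN_ACCESS_PERM)))
        return 0;

    resp.fd = ev->fd;                           /* identifies the request */
    resp.response = allow ? FAN_ALLOW : FAN_DENY;
    if (write(fan_fd, &resp, sizeof resp) != (ssize_t)sizeof resp)
        return -1;
    return 1;
}
```

Until the response is written, the application's `open()` stays blocked, so a listener that requests permission events has to answer every one of them promptly.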

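0x4: fanotify features: listener groups

Fanotify allows several listeners to monitor the same file system object at once. For example, anti-virus software V and desktop search software S may both monitor the directory /myDocument. When the file /myDocument/test is opened, fanotify notifies both V and S, in the order determined by the listener group policy
Take hierarchical storage managers (HSM) as an example: the file system may hold only a stub file, while the real content lives on a lower storage tier. When the stub is opened, fanotify should notify the HSM first so that it can pull the real content into the stub, and only then notify the anti-virus software so that it scans the real content. Otherwise the anti-virus software might scan only the stub, and the HSM could subsequently bring in a virus
Fanotify divides all listeners into three groups, listed here from highest to lowest priority

1. FAN_CLASS_PRE_CONTENT: listeners initialized with FAN_CLASS_PRE_CONTENT have the highest priority and are notified first. This class is meant for applications such as HSMs that need control of the file before other applications use its content
2. FAN_CLASS_CONTENT: notified next. This class suits software such as anti-virus scanners that need to inspect the file content
3. FAN_CLASS_NOTIF: notified last. This class is for pure notification software that does not need to access the file content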

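0x5: fanotify features: listener PID

A process that monitors files with inotify also triggers notifications when it operates on the monitored files itself. This can cause problems, such as endless recursion on self-generated events

// Watch the file /home/lm/loop
inotify_add_watch(fd, "/home/lm/loop", IN_MODIFY | IN_OPEN | IN_CREATE | IN_DELETE);
for (;;) {
    readInotifyEvent();
    if (event->mask & IN_OPEN)
        check_what_changed(event);   // check what was modified
}

void check_what_changed(event)
{
    fd = open(event->name, O_RDWR);  // triggers another inotify notification
    read(fd, buf, 128);
    ...
}
// To check whether the file content changed, check_what_changed() has to open the file, and that open also triggers an inotify notification, so the code ends up in an infinite loop

Fanotify includes the PID of the process that triggered the event in each notification, so the problem above is easy to solve:

1. In check_what_changed, look at the PID that caused the notification. If it is the monitoring program itself, ignore the notification and do not open the file again, which breaks the infinite loop
2. In fact, a fanotify notification also carries an open fd for the monitored file system object. The application can operate on the file directly through this fd without causing new notifications, which avoids the recursive events altogether. This is one of the improvements of fanotify over inotify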
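The PID check from point 1 can be sketched as below. The code walks a buffer of fanotify events with the `FAN_EVENT_OK` and `FAN_EVENT_NEXT` macros from `<sys/fanotify.h>`, skips events whose `pid` matches the listener, and closes the descriptor that comes with each event. The function name `count_foreign_events` is hypothetical, and counting stands in for whatever processing a real monitor would do.

```c
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Walks 'len' bytes of fanotify events in 'buf' and returns how many
 * were triggered by a process other than 'self'. Every event fd is
 * closed, because each event hands the listener a new descriptor. */
static int count_foreign_events(void *buf, ssize_t len, pid_t self)
{
    struct fanotify_event_metadata *ev = buf;
    int foreign = 0;

    while (FAN_EVENT_OK(ev, len)) {
        if (ev->pid != self)
            foreign++;          /* a real monitor would use ev->fd here */
        if (ev->fd >= 0)
            close(ev->fd);
        ev = FAN_EVENT_NEXT(ev, len);
    }
    return foreign;
}
```

In a listener, `buf` and `len` would come from a `read()` on the fanotify descriptor and `self` from `getpid()`.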

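0x6: fanotify features: decision cache

Anti-virus software has to scan every file that is about to be accessed, which hurts the user experience considerably. If a file is used frequently and has not been modified, it is best scanned only on first access and never again afterwards. This works like a cache: scanned files enter the cache, and when the same file is accessed again and is found in the cache, its content need not be rescanned.
Fanotify supports such a cache, called ignore marks. The principle is simple: once an ignore mark is set on a file system object, subsequent accesses to it no longer trigger the access-control code, so the access is always allowed.
Anti-virus software can use the feature as follows: when an application opens file A for the first time, fanotify notifies the AV software, which scans the content. If the AV software finds no virus, it allows the access and at the same time sets an ignore mark on the file.

When file A is accessed again, fanotify finds the ignore mark in the cache, so it no longer asks the AV software for an access decision and allows the request directly

When the file content is modified, fanotify clears the ignore mark automatically. The number of marks is limited by default, but the limit can be lifted with an init flag (FAN_UNLIMITED_MARKS) passed to fanotify_init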

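0x7: Drawbacks of fanotify

1. Fanotify currently supports far fewer file system event types than inotify
In particular, fanotify does not support move events, which rules it out for applications such as desktop search or real-time remote file system synchronization. When a file is moved from one directory to another, or renamed, fanotify generates no notification. As a result, some applications that use inotify cannot be migrated to fanotify

2. Like inotify, fanotify currently cannot do sub-tree monitoring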

Relevant Link:

https://www.ibm.com/developerworks/cn/linux/l-cn-fanotify/
http://www.lanedo.com/filesystem-monitoring-linux-kernel/
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-access-control.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-mount.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example.c

5. inotify

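inotify is a Linux kernel subsystem that extends the file system so that it can monitor file system activity and report changes to applications. It replaced dnotify, the older kernel module with similar functionality
inotify is mainly used for

1. Desktop search software such as Beagle, which can re-index only the files that changed instead of inefficiently scanning the whole file system every few minutes. Compared with actively polling the file system, having the operating system report changes lets software like Beagle update its index within a second of a file being modified
2. Refreshing directory views
3. Reloading configuration files
4. Tracking changes
5. Backup
6. Synchronization, uploads, and many other automated workflows

inotify has many advantages over dnotify, the older module it replaced. With dnotify, a program had to open a file descriptor for every monitored directory, so a process could easily approach the system's file descriptor limit, which became a bottleneck. The descriptors held by dnotify also kept resources busy, so removable devices could not be unmounted. dnotify's coarse granularity was another weakness: it could only report changes at directory level, so programmers had to write extra code to track finer-grained file system events. inotify uses fewer file descriptors and works with the select() and poll() interfaces, which is preferable to the signal mechanism used by dnotify. This also makes inotify easier to integrate with existing libraries built on select() or poll(), such as GLib

0x1: inotify event types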

1. IN_ACCESS: File was accessed (e.g., read(2), execve(2)).
2. IN_ATTRIB: Metadata changed—for example, 
    1) permissions (e.g.,chmod(2))
    2) timestamps (e.g., utimensat(2))
    3) extended attributes (setxattr(2))
    4) link count (since Linux 2.6.25; e.g., for the target of link(2) and for unlink(2))
    5) user/group ID (e.g., chown(2)).
3. IN_CLOSE_WRITE: File opened for writing was closed.
4. IN_CLOSE_NOWRITE: File or directory not opened for writing was closed.
5. IN_CREATE: File/directory created in watched directory, e.g. by
    1) open(2) O_CREAT
    2) mkdir(2)
    3) link(2)
    4) symlink(2)
    5) bind(2) on a UNIX domain socket 
6. IN_DELETE: File/directory deleted from watched directory.
7. IN_DELETE_SELF: 
Watched file/directory was itself deleted.  (This event also occurs if an object is moved to another filesystem, since mv(1) in effect copies the file to the other filesystem and then deletes it from the original filesystem.)  In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.
8. IN_MODIFY: File was modified (e.g., write(2), truncate(2)).
9. IN_MOVE_SELF: Watched file/directory was itself moved.
10. IN_MOVED_FROM: Generated for the directory containing the old filename when a file is renamed.
11. IN_MOVED_TO: Generated for the directory containing the new filename when a file is renamed.
12. IN_OPEN: File or directory was opened.
//IN_ALL_EVENTS: macro is defined as a bit mask of all of the above events.
13. IN_MOVE: Equates to IN_MOVED_FROM | IN_MOVED_TO.
14. IN_CLOSE: Equates to IN_CLOSE_WRITE | IN_CLOSE_NOWRITE.
15. IN_DONT_FOLLOW: Don't dereference pathname if it is a symbolic link.
16. IN_EXCL_UNLINK: events are not generated for children after they have been unlinked from the watched directory.
17. IN_MASK_ADD: If a watch instance already exists for the filesystem object corresponding to pathname, add (OR) the events in mask to the watch mask (instead of replacing the mask)
18. IN_ONESHOT: Monitor the filesystem object corresponding to pathname for one event, then remove from watch list.
19. IN_ONLYDIR: Only watch pathname if it is a directory
20. IN_IGNORED: Watch was removed explicitly (inotify_rm_watch(2)) or automatically (file was deleted, or filesystem was unmounted)
21. IN_ISDIR: Subject of this event is a directory.
22. IN_Q_OVERFLOW: Event queue overflowed
23. IN_UNMOUNT: Filesystem containing watched object was unmounted.  In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.

0x2: Examples

1. Suppose an application is watching the directory dir and the file dir/myfile for all events.  The examples below show some events that will be generated for these two objects.
fd = open("dir/myfile", O_RDWR);
    Generates IN_OPEN events for both dir and dir/myfile.
read(fd, buf, count);
    Generates IN_ACCESS events for both dir and dir/myfile.
write(fd, buf, count);
    Generates IN_MODIFY events for both dir and dir/myfile.
fchmod(fd, mode);
    Generates IN_ATTRIB events for both dir and dir/myfile.
close(fd);
    Generates IN_CLOSE_WRITE events for both dir and dir/myfile.


2. Suppose an application is watching the directories dir1 and dir2, and the file dir1/myfile.  The following examples show some events that may be generated.
link("dir1/myfile", "dir2/new");
    Generates an IN_ATTRIB event for myfile and an IN_CREATE event for dir2.
rename("dir1/myfile", "dir2/myfile");
    Generates an IN_MOVED_FROM event for dir1, an IN_MOVED_TO event for dir2, and an IN_MOVE_SELF event for myfile. The IN_MOVED_FROM and IN_MOVED_TO events will have the same cookie value.

3. Suppose that dir1/xx and dir2/yy are (the only) links to the same file, and an application is watching dir1, dir2, dir1/xx, and dir2/yy.  Executing the following calls in the order given below will generate the following events:
unlink("dir2/yy");
    Generates an IN_ATTRIB event for xx (because its link count changes) and an IN_DELETE event for dir2.
unlink("dir1/xx");
    Generates IN_ATTRIB, IN_DELETE_SELF, and IN_IGNORED events for xx, and an IN_DELETE event for dir1.

4. Suppose an application is watching the directory dir and (the empty) directory dir/subdir.  The following examples show some events that may be generated.
mkdir("dir/new", mode);
    Generates an IN_CREATE | IN_ISDIR event for dir.
rmdir("dir/subdir");
    Generates IN_DELETE_SELF and IN_IGNORED events for subdir, and an IN_DELETE | IN_ISDIR event for dir.

0x3: Configuration via /proc interfaces

The following interfaces can be used to limit the amount of kernel memory consumed by inotify:

1. /proc/sys/fs/inotify/max_queued_events
The value in this file is used when an application calls inotify_init(2) to set an upper limit on the number of events that can be queued to the corresponding inotify instance.
Events in excess of this limit are dropped, but an IN_Q_OVERFLOW event is always generated.

2. /proc/sys/fs/inotify/max_user_instances
This specifies an upper limit on the number of inotify instances that can be created per real user ID.

3. /proc/sys/fs/inotify/max_user_watches
This specifies an upper limit on the number of watches that can be created per real user ID.
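//Note in particular that inotify's monitoring capacity is limited. For inotify, every monitored directory consumes one watch, and both Linux and Windows cap the maximum number of watches because each one consumes memory. In theory, therefore, inotify cannot guarantee 100% coverage of all directories unless file system change monitoring is done in kernel mode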
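A program can check these limits before it starts adding watches. The sketch below reads one of the /proc files listed above as a number; the function name `read_limit` is hypothetical.

```c
#include <stdio.h>

/* Reads a single integer from a /proc tunable such as
 * /proc/sys/fs/inotify/max_user_watches.
 * Returns the value, or -1 if the file cannot be read or parsed. */
static long read_limit(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Comparing `read_limit("/proc/sys/fs/inotify/max_user_watches")` with the number of directories to be monitored tells the program in advance whether `inotify_add_watch()` is likely to fail with ENOSPC.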

0x4: Limitations and caveats

1. The inotify API provides no information about the user or process that triggered the inotify event. In particular, there is no easy way for a process that is monitoring events via inotify to distinguish events that it triggers itself from those that are triggered by other processes.

2. Inotify reports only events that a user-space program triggers through the filesystem API. As a result, it does not catch remote events that occur on network filesystems. (Applications must fall back to polling the filesystem to catch such events.) Furthermore, various pseudo-filesystems such as /proc, /sys, and /dev/pts are not monitorable with inotify.

3. The inotify API does not report file accesses and modifications that may occur because of mmap(2), msync(2), and munmap(2).

4. The inotify API identifies affected files by filename. However, by the time an application processes an inotify event, the filename may already have been deleted or renamed. This is a technical difficulty that any host-based file change monitor runs into; one solution to consider is blocking the delete operation

5. The inotify API identifies events via watch descriptors. It is the application's responsibility to cache a mapping (if one is needed) between watch descriptors and pathnames. Be aware that directory renamings may affect multiple cached pathnames.

6. Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount of time for large directory trees.

7. If monitoring an entire directory subtree, and a new subdirectory is created in that tree or an existing directory is renamed into that tree, be aware that by the time you create a watch for the new subdirectory, new files (and subdirectories) may already exist inside the subdirectory. Therefore, you may want to scan the contents of the subdirectory immediately after adding the watch (and, if desired, recursively add watches for any subdirectories that it contains).

8. Note that the event queue can overflow. In this case, events are lost. Robust applications should handle the possibility of lost events gracefully. For example, it may be necessary to rebuild part or all of the application cache. (One simple, but possibly expensive, approach is to close the inotify file descriptor, empty the cache, create a new inotify file descriptor, and then re-create watches and cache entries for the objects to be monitored.)

0x5: Kernel implementation

In the kernel, each inotify instance corresponds to one inotify_device structure
/source/fs/notify/inotify/inotify_user.c

struct inotify_device 
{
    /* 
    wait queue for i/o 
    wq is the wait queue; processes blocked in read() sleep on it
    */
    wait_queue_head_t       wq;    
    
    struct mutex            ev_mutex;       /* protects event queue */
    struct mutex            up_mutex;       /* synchronizes watch updates */

    /* 
    list of queued events 
    events is the list of events that have occurred on this inotify instance; every event monitored by the instance is appended to this list when it occurs
    */
    struct list_head        events;         
    
    /* 
    user who opened this dev 
    user describes the user who created this inotify instance
    */
    struct user_struct      *user;          
    struct inotify_handle   *ih;            /* inotify handle */
    struct fasync_struct    *fa;            /* async notification */

    /* 
    reference count 
    count is the reference count
    */
    atomic_t                count;   
    
    /* 
    size of the queue (bytes) 
    queue_size is the size in bytes of this instance's event queue
    */
    unsigned int            queue_size; 
    
    /* 
    number of pending events 
    event_count is the number of events on the events list
    */
    unsigned int            event_count;    

    /* 
    maximum number of events 
    max_events is the maximum number of events allowed
    */
    unsigned int            max_events;     
};

每個 watch 對應一個 inotify_watch 結構
/source/linux/include/linux/inotify.h

struct inotify_watch 
{
    struct list_head        h_list; /* entry in inotify_handle's list */
    struct list_head        i_list; /* entry in inode's list */
    atomic_t                count;  /* reference count */
    struct inotify_handle   *ih;    /* associated inotify handle */
    struct inode            *inode; /* associated inode */
    __s32                   wd;     /* watch descriptor */
    __u32                   mask;   /* event mask for this watch */
};

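The inotify_device structure is created when user space calls inotify_init() and released when the file descriptor returned by inotify_init() is closed
Directories and files alike are represented in the kernel by an inode structure, and the inotify system adds two fields to it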

struct inode 
{   
    ...
    #ifdef CONFIG_INOTIFY
        /* 
        watches on this inode 
        inotify_watches is the list of watches on the monitored target. Each time user space calls inotify_add_watch(), the kernel creates an inotify_watch structure for the new watch and inserts it into the inotify_watches list of the target's inode
        */
        struct list_head    inotify_watches;    
        
        /* 
        protects the watches list 
        inotify_mutex synchronizes access to the inotify_watches list
        */
        struct mutex        inotify_mutex;    
    #endif
    ...
}

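As for the architecture of inotify: file change monitoring needs support from both the kernel and the user-space application. The Linux kernel supports change notification natively at the file system layer, that is, inotify notification calls are inserted inline into the code paths of all file system operations
When one of the monitored events occurs in the file system, the file system code explicitly calls fsnotify_* to report the event to the inotify system, where * stands for the event name. The current implementation includes

1. fsnotify_move: a file was moved from one directory to another
2. fsnotify_nameremove: a file was removed from a directory
3. fsnotify_inoderemove: the watched object itself was deleted
4. fsnotify_create: a new file was created
5. fsnotify_mkdir: a new directory was created
6. fsnotify_access: a file was read
7. fsnotify_modify: a file was written
8. fsnotify_open: a file was opened
9. fsnotify_close: a file was closed
10. fsnotify_xattr: a file's extended attributes were modified
11. fsnotify_change: a file or its metadata was modified
12. inotify_unmount_inodes: the exception to the naming scheme; it is called when a file system is unmounted, to report the umount event to the inotify system

All of the functions above end up calling inotify_inode_queue_event (inotify_unmount_inodes calls inotify_dev_queue_event directly)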
/source/fs/notify/inotify/inotify.c

/**
 * inotify_inode_queue_event - queue an event to all watches on this inode
 * @inode: inode event is originating from
 * @mask: event mask describing this event
 * @cookie: cookie for synchronization, or zero
 * @name: filename, if any
 * @n_inode: inode associated with name
 */
void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie, const char *name, struct inode *n_inode)
{
    struct inotify_watch *watch, *next;

    //Check whether the inode is being watched, by testing whether its inotify_watches list is empty
    if (!inotify_inode_watched(inode))
        return;

    mutex_lock(&inode->inotify_mutex);
    //Walk the inotify_watches list on this inode to see whether any watch is interested in the current file operation event
    list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) 
    {
        u32 watch_mask = watch->mask;
        if (watch_mask & mask) 
        {
            struct inotify_handle *ih= watch->ih;
            mutex_lock(&ih->mutex);
            if (watch_mask & IN_ONESHOT)
                remove_watch_no_event(watch, ih);
            ih->in_ops->handle_event(watch, watch->wd, mask, cookie, name, n_inode);
            mutex_unlock(&ih->mutex);
        }
    }
    mutex_unlock(&inode->inotify_mutex);
}
EXPORT_SYMBOL_GPL(inotify_inode_queue_event);

inotify delivers event notifications through a chain of groups, and all watches are attached to a group
/source/include/linux/fsnotify_backend.h

/*
 * A group is a "thing" that wants to receive notification about filesystem
 * events.  The mask holds the subset of event types this group cares about.
 * refcnt on a group is up to the implementor and at any moment if it goes 0
 * everything will be cleaned up.
 */
struct fsnotify_group 
{
    /*
     * global list of all groups receiving events from fsnotify.
     * anchored by fsnotify_groups and protected by either fsnotify_grp_mutex
     * or fsnotify_grp_srcu depending on write vs read.
     */
    struct list_head group_list;

    /*
     * Defines all of the event types in which this group is interested.
     * This mask is a bitwise OR of the FS_* events from above.  Each time
     * this mask changes for a group (if it changes) the correct functions
     * must be called to update the global structures which indicate global
     * interest in event types.
     */
    __u32 mask;

    /*
     * How the refcnt is used is up to each group.  When the refcnt hits 0
     * fsnotify will clean up all of the resources associated with this group.
     * As an example, the dnotify group will always have a refcnt=1 and that
     * will never change.  Inotify, on the other hand, has a group per
     * inotify_init() and the refcnt will hit 0 only when that fd has been
     * closed.
     */
    atomic_t refcnt;        /* things with interest in this group */
    unsigned int group_num;        /* simply prevents accidental group collision */

    /* 
    how this group handles things 
    this is the member to focus on
    */
    const struct fsnotify_ops *ops;    

    /* needed to send notification to userspace */
    struct mutex notification_mutex;    /* protect the notification_list */
    struct list_head notification_list;    /* list of event_holder this group needs to send to userspace */
    wait_queue_head_t notification_waitq;    /* read() on the notification file blocks on this waitq */
    unsigned int q_len;            /* events on the queue */
    unsigned int max_events;        /* maximum events allowed on the list */

    /* stores all fastapth entries assoc with this group so they can be cleaned on unregister */
    spinlock_t mark_lock;        /* protect mark_entries list */
    atomic_t num_marks;        /* 1 for each mark entry and 1 for not being
                     * past the point of no return when freeing
                     * a group */
    struct list_head mark_entries;    /* all inode mark entries for this group */

    /* prevents double list_del of group_list.  protected by global fsnotify_grp_mutex */
    bool on_group_list;

    /* groups can define private fields here or use the void *private */
    union {
        void *private;
#ifdef CONFIG_INOTIFY_USER
        struct inotify_group_private_data {
            spinlock_t    idr_lock;
            struct idr      idr;
            u32             last_wd;
            struct fasync_struct    *fa;    /* async notification */
            struct user_struct      *user;
        } inotify_data;
#endif
    };
};

The member to focus on is const struct fsnotify_ops *ops;

/*
 * Each group much define these ops.  The fsnotify infrastructure will call
 * these operations for each relevant group.
 *
 * should_send_event - given a group, inode, and mask this function determines
 *        if the group is interested in this event.
 * handle_event - main call for a group to handle an fs event
 * free_group_priv - called when a group refcnt hits 0 to clean up the private union
 * freeing-mark - this means that a mark has been flagged to die when everything
 *        finishes using it.  The function is supplied with what must be a
 *        valid group and inode to use to clean up.
 */
struct fsnotify_ops 
{
    bool (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u32 mask);
    int (*handle_event)(struct fsnotify_group *group, struct fsnotify_event *event);
    void (*free_group_priv)(struct fsnotify_group *group);
    void (*freeing_mark)(struct fsnotify_mark_entry *entry, struct fsnotify_group *group);
    void (*free_event_priv)(struct fsnotify_event_private_data *priv);
};

0x6: Kernel-side implementation of IN_CLOSE_WRITE event monitoring

/source/fs/open.c

/*
 * Careful here! We test whether the file pointer is NULL before
 * releasing the fd. This ensures that one clone task can't release
 * an fd while another clone is opening it.
 */
SYSCALL_DEFINE1(close, unsigned int, fd)
{
    struct file * filp;
    struct files_struct *files = current->files;
    struct fdtable *fdt;
    int retval;

    spin_lock(&files->file_lock);
    /*
    Get the pointer to the struct fdtable
    \linux-2.6.32.63\include\linux\fdtable.h
    #define files_fdtable(files) (rcu_dereference((files)->fdt))
    */
    fdt = files_fdtable(files);
    if (fd >= fdt->max_fds)
    {
        goto out_unlock;
    } 
    //Look up the struct file for the descriptor being closed
    filp = fdt->fd[fd];
    if (!filp)
    {
        goto out_unlock;
    } 
    /*
    Set the corresponding entry in fd_array[] to NULL
    */
    rcu_assign_pointer(fdt->fd[fd], NULL);
    FD_CLR(fd, fdt->close_on_exec); 
    /*
    Call __put_unused_fd to release the fd, so that the next open of a new file can reuse it
    static void __put_unused_fd(struct files_struct *files, unsigned int fd)
    {
        struct fdtable *fdt = files_fdtable(files);
        __FD_CLR(fd, fdt->open_fds);
        if (fd < files->next_fd)
        {
            files->next_fd = fd;
        } 
    }
    */
    __put_unused_fd(files, fd);
    spin_unlock(&files->file_lock);
    retval = filp_close(filp, files);

    /* can't restart close syscall because file table entry was cleared */
    if (unlikely(retval == -ERESTARTSYS || retval == -ERESTARTNOINTR || retval == -ERESTARTNOHAND || retval == -ERESTART_RESTARTBLOCK))
    {
        retval = -EINTR;
    } 

    return retval;

out_unlock:
    spin_unlock(&files->file_lock);
    return -EBADF;
}
EXPORT_SYMBOL(sys_close);

retval = filp_close(filp, files);

/*
 * "id" is the POSIX thread ID. We use the
 * files pointer for this..
 */
int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;

    if (!file_count(filp)) 
    {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }

    if (filp->f_op && filp->f_op->flush)
    {
        retval = filp->f_op->flush(filp, id);
    } 

    dnotify_flush(filp, id);
    locks_remove_posix(filp, id);
    fput(filp);
    return retval;
}
EXPORT_SYMBOL(filp_close);

fput(filp);
/source/fs/file_table.c

void fput(struct file *file)
{
    if (atomic_long_dec_and_test(&file->f_count))
        __fput(file);
}
EXPORT_SYMBOL(fput);

/* __fput is called from task context when aio completion releases the last
 * last use of a struct file *.  Do not use otherwise.
 */
void __fput(struct file *file)
{
    struct dentry *dentry = file->f_path.dentry;
    struct vfsmount *mnt = file->f_path.mnt;
    struct inode *inode = dentry->d_inode;

    might_sleep();

    //inotify notification point in the kernel
    fsnotify_close(file);
    /*
     * The function eventpoll_release() should be the first called
     * in the file cleanup chain.
     */
    eventpoll_release(file);
    locks_remove_flock(file);

    if (unlikely(file->f_flags & FASYNC)) {
        if (file->f_op && file->f_op->fasync)
            file->f_op->fasync(-1, file, 0);
    }
    if (file->f_op && file->f_op->release)
        file->f_op->release(inode, file);

    //LSM hook point
    security_file_free(file);

    ima_file_free(file);
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
        cdev_put(inode->i_cdev);
    fops_put(file->f_op);
    put_pid(file->f_owner.pid);
    file_kill(file);
    if (file->f_mode & FMODE_WRITE)
        drop_file_write_access(file);
    file->f_path.dentry = NULL;
    file->f_path.mnt = NULL;
    file_free(file);
    dput(dentry);
    mntput(mnt);
}

fsnotify_close(file);
\linux-2.6.32.63\include\linux\fsnotify.h

/*
 * fsnotify_close - file was closed
 */
static inline void fsnotify_close(struct file *file)
{
    struct dentry *dentry = file->f_path.dentry;
    struct inode *inode = dentry->d_inode;
    fmode_t mode = file->f_mode;
    //Choose the close event type according to whether the file was opened for writing
    __u32 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;

    if (S_ISDIR(inode->i_mode))
        mask |= FS_IN_ISDIR;

    inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

    fsnotify_parent(dentry, mask);
    fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}
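The effect of the mask selection in fsnotify_close can be observed from user space. The sketch below watches a file for close events, opens and closes it with the given flags, and returns which IN_CLOSE_* bits were reported. The function name `close_event_for` is hypothetical, and the file is assumed to exist and to be untouched by other processes during the call.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Watches 'path' for close events, opens it with 'open_flags', closes it
 * again and returns the IN_CLOSE_* bits that inotify reported
 * (0 if nothing was reported or setup failed). */
static uint32_t close_event_for(const char *path, int open_flags)
{
    char buf[4096]
        __attribute__ ((aligned(__alignof__(struct inotify_event))));
    uint32_t seen = 0;
    ssize_t len;

    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd == -1)
        return 0;

    if (inotify_add_watch(ifd, path, IN_CLOSE) != -1) {
        int fd = open(path, open_flags);
        if (fd != -1)
            close(fd);          /* the kernel queues the close event here */

        while ((len = read(ifd, buf, sizeof buf)) > 0) {
            for (char *p = buf; p < buf + len; ) {
                const struct inotify_event *ev =
                    (const struct inotify_event *)p;
                seen |= ev->mask & IN_CLOSE;
                p += sizeof *ev + ev->len;
            }
        }
    }
    close(ifd);
    return seen;
}
```

Opening with `O_RDONLY` corresponds to the FS_CLOSE_NOWRITE branch above and is reported as `IN_CLOSE_NOWRITE`, while `O_WRONLY` or `O_RDWR` sets FMODE_WRITE and is reported as `IN_CLOSE_WRITE`.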

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/l-inotifynew/

6. code example

#include <errno.h>
       #include <poll.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/inotify.h>
       #include <unistd.h>

       /* Read all available inotify events from the file descriptor 'fd'.
          wd is the table of watch descriptors for the directories in argv.
          argc is the length of wd and argv.
          argv is the list of watched directories.
          Entry 0 of wd and argv is unused. */

       static void
       handle_events(int fd, int *wd, int argc, char* argv[])
       {
           /* Some systems cannot read integer variables if they are not
              properly aligned. On other systems, incorrect alignment may
              decrease performance. Hence, the buffer used for reading from
              the inotify file descriptor should have the same alignment as
              struct inotify_event. */

           char buf[4096]
               __attribute__ ((aligned(__alignof__(struct inotify_event))));
           const struct inotify_event *event;
           int i;
           ssize_t len;
           char *ptr;

           /* Loop while events can be read from inotify file descriptor. */

           for (;;) {

               /* Read some events. */

               len = read(fd, buf, sizeof buf);
               if (len == -1 && errno != EAGAIN) {
                   perror("read");
                   exit(EXIT_FAILURE);
               }

               /* If the nonblocking read() found no events to read, then
                  it returns -1 with errno set to EAGAIN. In that case,
                  we exit the loop. */

               if (len <= 0)
                   break;

               /* Loop over all events in the buffer */

               for (ptr = buf; ptr < buf + len;
                       ptr += sizeof(struct inotify_event) + event->len) {

                   event = (const struct inotify_event *) ptr;

                   /* Print event type */

                   if (event->mask & IN_OPEN)
                       printf("IN_OPEN: ");
                   if (event->mask & IN_CLOSE_NOWRITE)
                       printf("IN_CLOSE_NOWRITE: ");
                   if (event->mask & IN_CLOSE_WRITE)
                       printf("IN_CLOSE_WRITE: ");

                   /* Print the name of the watched directory */

                   for (i = 1; i < argc; ++i) {
                       if (wd[i] == event->wd) {
                           printf("%s/", argv[i]);
                           break;
                       }
                   }

                   /* Print the name of the file */

                   if (event->len)
                       printf("%s", event->name);

                   /* Print type of filesystem object */

                   if (event->mask & IN_ISDIR)
                       printf(" [directory]\n");
                   else
                       printf(" [file]\n");
               }
           }
       }

       int
       main(int argc, char* argv[])
       {
           char buf;
           int fd, i, poll_num;
           int *wd;
           nfds_t nfds;
           struct pollfd fds[2];

           if (argc < 2) {
               printf("Usage: %s PATH [PATH ...]\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           printf("Press ENTER key to terminate.\n");

           /* Create the file descriptor for accessing the inotify API */

           fd = inotify_init1(IN_NONBLOCK);
           if (fd == -1) {
               perror("inotify_init1");
               exit(EXIT_FAILURE);
           }

           /* Allocate memory for watch descriptors */

           wd = calloc(argc, sizeof(int));
           if (wd == NULL) {
               perror("calloc");
               exit(EXIT_FAILURE);
           }

           /* Mark directories for events
              - file was opened
              - file was closed */

           for (i = 1; i < argc; i++) {
               wd[i] = inotify_add_watch(fd, argv[i],
                                         IN_OPEN | IN_CLOSE);
               if (wd[i] == -1) {
                   fprintf(stderr, "Cannot watch '%s'\n", argv[i]);
                   perror("inotify_add_watch");
                   exit(EXIT_FAILURE);
               }
           }

           /* Prepare for polling */

           nfds = 2;

           /* Console input */

           fds[0].fd = STDIN_FILENO;
           fds[0].events = POLLIN;

           /* Inotify input */

           fds[1].fd = fd;
           fds[1].events = POLLIN;

           /* Wait for events and/or terminal input */

           printf("Listening for events.\n");
           while (1) {
               poll_num = poll(fds, nfds, -1);
               if (poll_num == -1) {
                   if (errno == EINTR)
                       continue;
                   perror("poll");
                   exit(EXIT_FAILURE);
               }

               if (poll_num > 0) {

                   if (fds[0].revents & POLLIN) {

                       /* Console input is available. Empty stdin and quit */

                       while (read(STDIN_FILENO, &buf, 1) > 0 && buf != '\n')
                           continue;
                       break;
                   }

                   if (fds[1].revents & POLLIN) {

                       /* Inotify events are available */

                       handle_events(fd, wd, argc, argv);
                   }
               }
           }

           printf("Listening for events stopped.\n");

           /* Close inotify file descriptor */

           close(fd);

           free(wd);
           exit(EXIT_SUCCESS);
       }

Relevant Link:

http://linux.die.net/man/7/inotify
http://man7.org/linux/man-pages/man7/inotify.7.html
http://www.ibm.com/developerworks/cn/linux/l-inotifynew/
