Linux supports multiple filesystem types through a concept called the virtual filesystem.
The key idea behind the virtual filesystem is to put a wide range of information in the kernel to represent many different types of filesystems.
The Virtual Filesystem, also known as the Virtual Filesystem Switch (VFS), is a kernel software layer that provides a common interface to several kinds of filesystems.
It is an abstraction layer between application programs and filesystem implementations.
Filesystems supported by VFS
- Disk-based filesystems (ext2, vfat, NTFS, USB flash, JFS, etc.)
- Network filesystems (NFS, AFS, CIFS, NCP, etc.)
- Special filesystems (/proc, /sys, etc.)
Everything is a file.
For FAT (File Allocation Table) filesystems, the files corresponding to directories exist only as objects in kernel memory.
The kernel does not hardcode file operations. Instead, it uses a pointer for each operation; the pointer is made to point to the proper function for the particular filesystem being accessed.
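The pointer-per-operation idea can be illustrated with a small user-space sketch. This is not kernel code: the toy_file_ops table, the two toy "filesystems", and toy_read() are hypothetical names invented for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space analogue of a per-filesystem operations table. */
struct toy_file_ops {
    /* Copy up to len bytes into buf; return the number of bytes copied. */
    size_t (*read)(char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops; /* chosen when the file is "opened" */
};

static size_t copy_text(char *buf, size_t len, const char *text)
{
    size_t n = strlen(text);

    if (n > len)
        n = len;
    memcpy(buf, text, n);
    return n;
}

/* Two made-up filesystems, each with its own read implementation. */
static size_t diskfs_read(char *buf, size_t len) { return copy_text(buf, len, "disk"); }
static size_t netfs_read(char *buf, size_t len)  { return copy_text(buf, len, "net"); }

static const struct toy_file_ops diskfs_ops = { .read = diskfs_read };
static const struct toy_file_ops netfs_ops  = { .read = netfs_read };

/* Generic layer: it never names a filesystem, it only follows the pointer. */
size_t toy_read(const struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(buf, len);
}
```

A caller that holds a toy_file whose ops field points to diskfs_ops ends up in diskfs_read(), while one pointing to netfs_ops ends up in netfs_read(); toy_read() itself is identical in both cases.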
The common file model consists of the following object types:
- The superblock object
- The inode object
- The file object
- The dentry object
VFS
- A common interface
- A disk cache to speed up access to files
Caches
Hardware cache: a fast static RAM
Memory cache: a software mechanism to bypass the Kernel Memory Allocator
Disk cache: a software mechanism that speeds up access to data by letting the kernel keep in RAM some information that is normally stored on disk
extern struct list_head super_blocks;
extern spinlock_t sb_lock;
struct super_block {
….
void *s_fs_info; /* Filesystem private info */
};
Each disk-based filesystem has to access and update its allocation bitmaps in order to allocate and release disk blocks. For efficiency, the VFS keeps a copy of this filesystem-specific information in memory, pointed to by the s_fs_info field of the superblock. However, this creates a synchronization problem between the VFS superblock in memory and the actual superblock on disk. If the system stops abruptly before the in-memory copy is written back, the result may be the familiar problem of a corrupted filesystem. Linux minimizes this problem by periodically writing dirty data back to disk.
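The write-back policy described above can be sketched with a toy model. The structures and functions below (toy_sb, toy_disk, toy_alloc_block, toy_sync) are hypothetical and exist only for this illustration; they are not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory copy of on-disk allocation state. */
struct toy_sb {
    unsigned free_blocks; /* cached value, may be newer than the disk's */
    int dirty;            /* set when the cached value differs from disk */
};

/* Hypothetical stand-in for the on-disk superblock. */
struct toy_disk {
    unsigned free_blocks;
};

/* Allocate one block: only the in-memory copy is updated. */
int toy_alloc_block(struct toy_sb *sb)
{
    assert(sb != NULL);
    if (sb->free_blocks == 0)
        return -1;
    sb->free_blocks--;
    sb->dirty = 1;
    return 0;
}

/* Periodic write-back: returns 1 if something was written, 0 otherwise. */
int toy_sync(struct toy_sb *sb, struct toy_disk *disk)
{
    assert(sb != NULL && disk != NULL);
    if (!sb->dirty)
        return 0;
    disk->free_blocks = sb->free_blocks;
    sb->dirty = 0;
    return 1;
}
```

Between a call to toy_alloc_block() and the next toy_sync(), the two copies disagree; that window is where an abrupt stop would leave the on-disk state stale.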
Superblock operations
- alloc_inode(sb)
- destroy_inode(inode)
- read_inode(inode): disk -> inode object
- dirty_inode(inode)
- write_inode(inode, flag): inode object -> disk
- put_inode(inode)
- drop_inode(inode)
- delete_inode(inode)
- put_super(super): releases the superblock object passed as parameter (when a filesystem is unmounted)
- write_super(super): updates a filesystem superblock
- sync_fs(sb, wait): used by journaling filesystems
- statfs(super, buf)
- remount_fs(super, flags, data)
- clear_inode(inode)
- umount_begin(super)
- show_options(seq_file, vfsmount)
The inode object is unique to the file and remains the same as long as the file exists.
struct inode {
struct hlist_node i_hash; /* linked into inode_hashtable */
struct list_head i_list; /* backing dev IO list */
struct list_head i_sb_list; /* a per-filesystem doubly linked list */
struct list_head i_dentry;
….
};
Each inode object appears in one of the following lists:
1. The list of valid unused inodes
2. The list of in-use inodes
3. The list of dirty inodes
Inode operations
struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int);
int (*check_acl)(struct inode *, int);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
long (*fallocate)(struct inode *inode, int mode, loff_t offset,
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
};
A file object describes how a process interacts with an open file.
struct file {
…
};
The most important field is f_pos, the file pointer that records the current file offset. It is stored in the file object rather than in the inode object because several processes may access the same file concurrently, each at its own offset.
The f_count field is a reference counter that counts the number of processes using the file object. Lightweight processes created with the CLONE_FILES flag share the open file table, and therefore use the same file objects. The reference counter is also incremented when dup() is called.
- Multithreaded programs must therefore take synchronization into account.
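The per-file-object offset can be observed from user space. The following is a minimal sketch, assuming a POSIX system with a writable /tmp; two_opens_demo() and the temporary file name are made up for this illustration. It opens the same file twice, which yields two file objects, reads through the first descriptor, and reports both offsets.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open one file twice, read n bytes (n <= 10) through the first descriptor,
 * and store the offsets seen through each descriptor.
 * Returns 0 on success, -1 on error.
 */
int two_opens_demo(size_t n, off_t *pos1, off_t *pos2)
{
    char path[] = "/tmp/vfs_demo_XXXXXX";
    char buf[16];
    int rc = -1;

    assert(pos1 != NULL && pos2 != NULL);
    if (n > 10)
        return -1;

    int fd1 = mkstemp(path);          /* first file object (read/write) */
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);   /* second, independent file object */
    unlink(path);                     /* file disappears once both are closed */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    if (write(fd1, "0123456789", 10) == 10 &&
        lseek(fd1, 0, SEEK_SET) == 0 &&
        read(fd1, buf, n) == (ssize_t)n) {
        *pos1 = lseek(fd1, 0, SEEK_CUR);  /* advanced by the read */
        *pos2 = lseek(fd2, 0, SEEK_CUR);  /* untouched: separate offset */
        rc = 0;
    }

    close(fd1);
    close(fd2);
    return rc;
}
```

After reading 4 bytes through the first descriptor, its offset is 4 while the second descriptor's offset is still 0, because each open() created its own file object for the same inode.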
A dentry object represents a directory entry in kernel memory. It has no corresponding image on disk.
Four possible states of a dentry object:
1. Free
2. Unused
3. In use
4. Negative (The inode associated with the dentry object does not exist.)
Dentry cache: maximize the efficiency in handling dentries
struct task_struct {
/* filesystem information */
struct fs_struct *fs;
/* open file information */
struct files_struct *files;
};
struct fs_struct {
int users;
rwlock_t lock;
int umask;
int in_exec;
struct path root, pwd;
};
struct files_struct {
struct fdtable fdtab;
….
struct file * fd_array[NR_OPEN_DEFAULT];
};
struct fdtable {
unsigned int max_fds;
struct file ** fd; /* current fd array */
fd_set *close_on_exec;
fd_set *open_fds;
struct rcu_head rcu;
struct fdtable *next;
};
For every file with an entry in fd array, the array index is the file descriptor.
Note that two elements of the array may point to the same file object.
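The case of two descriptors pointing to the same file object can be reproduced with dup(). This is a minimal sketch assuming a POSIX system with a writable /tmp; dup_demo() and the temporary file name are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Duplicate a descriptor, move the offset through the original, and report
 * the offset seen through the duplicate.
 * Returns 0 on success, -1 on error.
 */
int dup_demo(off_t seek_to, int *fds_differ, off_t *pos_via_dup)
{
    char path[] = "/tmp/vfs_dup_XXXXXX";
    int rc = -1;

    assert(fds_differ != NULL && pos_via_dup != NULL);

    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    unlink(path);

    int fd2 = dup(fd1);               /* new array index, same file object */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    if (lseek(fd1, seek_to, SEEK_SET) == seek_to) {
        *fds_differ = (fd1 != fd2);
        *pos_via_dup = lseek(fd2, 0, SEEK_CUR); /* shared offset */
        rc = 0;
    }

    close(fd1);
    close(fd2);
    return rc;
}
```

The two descriptors are different array indexes, yet a seek through one is visible through the other, which is what sharing a single file object implies.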
Special filesystems are not bound to physical block devices. However, the kernel assigns to each mounted special filesystem a fictitious block device with 0 as the major number.
This helps the kernel to handle special filesystems and the regular ones in a uniform way.
The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration.
Each registered filesystem is represented by a file_system_type object.
struct file_system_type {
const char *name;
int fs_flags;
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
};
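Registration can be pictured as maintaining a singly linked list keyed by name, using a next pointer like the one in the structure above. The sketch below is a user-space toy, not the kernel's implementation: toy_fs_type, toy_register(), toy_unregister(), and toy_lookup() are hypothetical names, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for a filesystem type descriptor. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems; /* head of the registry */

/* Return the link that holds `name`, or the terminating NULL link. */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the registry; fail if the name is already registered. */
int toy_register(struct toy_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    if (fs->next != NULL)
        return -1;                    /* already linked somewhere */
    struct toy_fs_type **p = toy_find(fs->name);
    if (*p != NULL)
        return -1;                    /* name already taken */
    *p = fs;
    return 0;
}

/* Unlink fs from the registry; fail if it was never registered. */
int toy_unregister(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}

struct toy_fs_type *toy_lookup(const char *name)
{
    return *toy_find(name);
}
```

Mounting by type name would then amount to a toy_lookup() followed by a call through a function pointer stored in the descriptor, which this sketch omits.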
The root directory of a filesystem
The root directory of a process
The system’s root filesystem
Usually, the root directory of a process is the same as the root directory of the system’s root filesystem.
The /proc virtual filesystem is a child of the system’s root filesystem, and thus a sub tree of the tree rooted at the system’s root filesystem.
In Linux 2.6, every process might have its own tree of mounted filesystems, the so-called namespace of the process.
Most processes share the same namespace, which is the tree rooted at the system’s root filesystem. However, a process may get a new namespace if it is created by a clone() call with the CLONE_NEWNS flag set.
On Linux, a filesystem can be mounted multiple times.
root@localhost :/home/James# mount /dev/sdb2 ./mnt1
root@localhost :/home/James# ls mnt1
lost+found
root@localhost :/home/James# mount /dev/sdb2 ./mnt2
root@localhost :/home/James# ls mnt2
lost+found
root@localhost :/home/James# touch ./mnt1/test
root@localhost :/home/James# ls mnt1
lost+found test
root@localhost :/home/James# ls mnt2
lost+found test
On Linux, mounting a filesystem hides whatever was previously mounted at the same mount point, like a stack.
root@localhost :/home/James# mount /dev/sdb2 ./mnt1/
root@localhost :/home/James# ls mnt1
lost+found test
root@localhost :/home/James# mount /dev/sdb1 ./mnt1/
root@localhost :/home/James# ls mnt1/
00001.vcf 00004.vcf BaiduMap download LGCameraPro_6.2android_zol.apk recording19190.3gpp u-center.apk
00002.vcf 126681_863ed540-6d14-47c1-b5ee-0e4290ed1a3e.apk bluetooth GPS???(GPS Test).apk LOST.DIR recording71086.3gpp Wildlife.mp4
00003.vcf Android DCIM kugou moboplayer_1.apk songs Wildlife.wmv
root@localhost :/home/James# umount mnt1/
root@localhost :/home/James# ls mnt1/
lost+found test
root@localhost :/home/James# umount mnt1/
root@localhost :/home/James# ls mnt1/
root@localhost :/home/James#
All information related to filesystem mounting is stored in a mounted filesystem descriptor of type vfsmount.
mount() -> sys_mount() -> do_mount() -> do_kern_mount()
1. The kernel mounts the special rootfs filesystem, which simply provides an empty directory that serves as the initial mount point.
2. The kernel mounts the real root filesystem over the empty directory.
The rootfs filesystem allows the kernel to easily change the real root filesystem.
Pathname lookup: how to derive an inode from the corresponding file pathname.
Besides parsing the pathname, the lookup procedure must take several Unix and VFS features into account:
1. The access rights
2. Symbolic links
3. Circular references must be identified in order to break out of an infinite loop
4. A filename may be the mount point of a mounted filesystem. This situation must be detected and the lookup operation must continue into the new filesystem.
5. Pathname lookup must be performed inside the namespace of the calling process.
int path_lookup(const char *name, unsigned int flags,
struct nameidata *nd);
enum { MAX_NESTED_LINKS = 8 };
struct nameidata {
struct path path;
struct qstr last;
struct path root;
unsigned int flags;
int last_type;
unsigned depth;
char *saved_names[MAX_NESTED_LINKS + 1];
/* Intent data */
union {
struct open_intent open;
} intent;
};
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
};
The core of the pathname lookup operation: link_path_walk
- Standard pathname lookup
- Lookup of a directory
- Lookup with symbolic links
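The first step of any such walk is breaking the pathname into components. The sketch below shows only that step, in user space: it skips repeated slashes and classifies each component as a normal name, ".", or "..". The names toy_component, toy_last_type, and toy_split_path() are hypothetical; permission checks, symbolic links, and mount points are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a component, for illustration only. */
enum toy_last_type { TOY_LAST_NORM, TOY_LAST_DOT, TOY_LAST_DOTDOT };

struct toy_component {
    const char *name;        /* points into the caller's path string */
    size_t len;              /* length of the component, no terminator */
    enum toy_last_type type;
};

/*
 * Split path into at most max components.
 * Returns the number of components, or -1 if there are more than max.
 */
int toy_split_path(const char *path, struct toy_component *out, int max)
{
    int n = 0;

    assert(path != NULL && out != NULL);
    for (;;) {
        while (*path == '/')         /* skip one or more slashes */
            path++;
        if (*path == '\0')
            break;

        const char *start = path;
        while (*path != '\0' && *path != '/')
            path++;
        size_t len = (size_t)(path - start);

        if (n == max)
            return -1;
        out[n].name = start;
        out[n].len = len;
        if (len == 1 && start[0] == '.')
            out[n].type = TOY_LAST_DOT;
        else if (len == 2 && start[0] == '.' && start[1] == '.')
            out[n].type = TOY_LAST_DOTDOT;
        else
            out[n].type = TOY_LAST_NORM;
        n++;
    }
    return n;
}
```

A real lookup would resolve each component against the dentry of the previous one instead of merely recording it, which is where the considerations listed earlier come into play.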
Manipulate VFS data structures to implement VFS system calls.
Example: open(), read(), write(), close()
Corresponding system service routine: sys_xxx()
Concurrent access
Synchronization problem
The POSIX standard requires a file-locking mechanism based on the fcntl() system call.
For more details about how to use file locks, refer to "Beginning Linux Programming".
This kind of lock is known as an advisory lock, because it only works if other processes cooperate by checking for the lock before accessing the file. In this respect it is similar to a semaphore.
POSIX: advisory locks
System V: introduced mandatory locks
Linux: supports both advisory and mandatory locks
System calls: flock(), fcntl()
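The advisory nature of fcntl() locks can be demonstrated with a short user-space sketch, assuming a POSIX system with a writable /tmp and no mandatory locking in effect. The function advisory_lock_demo() and the temporary file name are made up for this illustration. The parent takes a write lock; a child process can see the lock with F_GETLK, yet a plain write() from the child still succeeds because nothing forces it to check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 0 on success and fills in:
 *   *lock_visible: 1 if the child saw the parent's lock via F_GETLK
 *   *write_ok:     1 if the child's write() succeeded despite the lock
 */
int advisory_lock_demo(int *lock_visible, int *write_ok)
{
    char path[] = "/tmp/vfs_lock_XXXXXX";

    assert(lock_visible != NULL && write_ok != NULL);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    /* Parent: write-lock the whole file (l_len == 0 means "to EOF"). */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: fcntl() locks are per process and not inherited. */
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 0, .l_len = 0 };
        int bits = 0;

        if (fcntl(fd, F_GETLK, &probe) == 0 && probe.l_type != F_UNLCK)
            bits |= 1;               /* a conflicting lock exists */
        if (write(fd, "x", 1) == 1)
            bits |= 2;               /* the write went through anyway */
        _exit(bits);
    }

    int status = 0;
    int rc = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        *lock_visible = (WEXITSTATUS(status) & 1) != 0;
        *write_ok = (WEXITSTATUS(status) & 2) != 0;
        rc = 0;
    }
    close(fd);
    return rc;
}
```

A cooperating process would call fcntl() with F_SETLK or F_SETLKW before writing and back off if the request fails; the sketch deliberately skips that step to show what happens without cooperation.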
File-locking data structure in the kernel
struct file_lock {
….
};