The Virtual Filesystem (Linux)

Date: 2019-11-18

The Virtual Filesystem

Linux supports multiple filesystem types through an abstraction called the virtual filesystem.

The key idea behind the virtual filesystem is to put a wide range of information in the kernel to represent many different types of filesystems.

The Role of VFS

The Virtual Filesystem, also known as the Virtual Filesystem Switch (VFS), is a kernel software layer that provides a common interface to several kinds of filesystems.

It is an abstraction layer between application programs and filesystem implementations.

Filesystems supported by VFS

- Disk-based filesystems (ext2, vfat, NTFS, USB flash, JFS, etc.)

- Network filesystems (NFS, AFS, CIFS, NCP, etc.)

- Special filesystems (/proc, /sys, etc.)

The Common File Model

Everything is a file.

FAT (File Allocation Table) filesystems do not store directories as files, so the Linux implementation has to construct the files corresponding to directories on the fly. These files exist only as objects in kernel memory.

The kernel does not hardcode file operations. Instead, it uses a pointer for each operation; the pointer is made to point to the proper function for the particular filesystem being accessed.

The common file model consists of the following object types:

- The superblock object

- The inode object

- The file object

- The dentry object

VFS

- A common interface

- A disk cache to speed up access to files

Caches

Hardware cache: a fast static RAM

Memory cache: a software mechanism to bypass the Kernel Memory Allocator

Disk cache: a software mechanism that speeds up access to data by letting the kernel keep in RAM some information that is normally stored on disk

VFS Data Structures

Superblock objects

    extern struct list_head super_blocks;
    extern spinlock_t sb_lock;

    struct super_block {
        ...
        void *s_fs_info; /* Filesystem private info */
    };

Each disk-based filesystem has to access and update its allocation bitmaps in order to allocate and release disk blocks. For efficiency, this information is duplicated in memory and pointed to by the s_fs_info field of the superblock. However, this creates a synchronization problem between the VFS superblock in memory and the actual superblock on disk: if the system stops before the in-memory copy is written back, the result is the familiar corrupted filesystem. Linux minimizes this problem by periodically writing dirty data (including dirty superblocks) back to disk.

Superblock operations

- alloc_inode(sb)

- destroy_inode(inode)

- read_inode(inode): disk -> inode object

- dirty_inode(inode)

- write_inode(inode, flag): inode object -> disk

- put_inode(inode)

- drop_inode(inode)

- delete_inode(inode)

- put_super(super): release the superblock object passed as parameter (unmount a filesystem)

- write_super(super): update a filesystem superblock

- sync_fs(sb, wait): used by journaling filesystems

- statfs(super, buf)

- remount_fs(super, flags, data)

- clear_inode(inode)

- umount_begin(super)

- show_options(seq_file, vfsmount)

inode object

The inode object is unique to the file and remains the same as long as the file exists.

    struct inode {
        struct hlist_node i_hash;   /* linked into inode_hashtable */
        struct list_head i_list;    /* backing dev IO list */
        struct list_head i_sb_list; /* a per-filesystem doubly linked list */
        struct list_head i_dentry;
        ...
    };

Each inode object appears in one of the following lists:

1. The list of valid unused inodes

2. The list of in-use inodes

3. The list of dirty inodes

Inode operations

    struct inode_operations {
        int (*create) (struct inode *, struct dentry *, int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *, struct dentry *, struct nameidata *);
        int (*link) (struct dentry *, struct inode *, struct dentry *);
        int (*unlink) (struct inode *, struct dentry *);
        int (*symlink) (struct inode *, struct dentry *, const char *);
        int (*mkdir) (struct inode *, struct dentry *, int);
        int (*rmdir) (struct inode *, struct dentry *);
        int (*mknod) (struct inode *, struct dentry *, int, dev_t);
        int (*rename) (struct inode *, struct dentry *,
                       struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *, int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int);
        int (*check_acl) (struct inode *, int);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range) (struct inode *, loff_t, loff_t);
        long (*fallocate) (struct inode *inode, int mode, loff_t offset,
                           loff_t len);
        int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start,
                       u64 len);
    };

File Objects

A file object describes how a process interacts with an open file.

    struct file {
        ...
    };

The most important field is f_pos, the file pointer that holds the current file offset. It is stored in the file object rather than in the inode object because several processes may access the same file concurrently, each at its own offset.

The f_count field is a reference counter: it counts the number of processes that are using the file object. Lightweight processes created with the CLONE_FILES flag share the open file table and therefore use the same file objects. The reference counter is also incremented when dup() is called.

As a consequence, multithreaded programs that share file objects must synchronize their accesses.

Dentry Objects

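A dentry object is the in-memory representation of a directory entry, that is, of one component of a pathname. Dentry objects have no corresponding image on disk; the kernel builds them as directory entries are read.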

Four possible states of a dentry object:

1. Free

2. Unused

3. In use

4. Negative (The inode associated with the dentry object does not exist.)

Dentry cache: maximize the efficiency in handling dentries

Files associated with a process

    struct task_struct {
        ...
        /* filesystem information */
        struct fs_struct *fs;
        /* open file information */
        struct files_struct *files;
        ...
    };

    struct fs_struct {
        int users;
        rwlock_t lock;
        int umask;
        int in_exec;
        struct path root, pwd;
    };

    struct files_struct {
        struct fdtable fdtab;
        ...
        struct file * fd_array[NR_OPEN_DEFAULT];
    };

    struct fdtable {
        unsigned int max_fds;
        struct file ** fd; /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        struct rcu_head rcu;
        struct fdtable *next;
    };

For every file with an entry in the fd array, the array index is the file descriptor.

Note that two elements of the array may point to the same file object.

Filesystem Types

Special filesystems

Special filesystems are not bound to physical block devices. However, the kernel assigns to each mounted special filesystem a fictitious block device with 0 as the major number.

This helps the kernel to handle special filesystems and the regular ones in a uniform way.

Filesystem Type Registration

The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration.

Each registered filesystem is represented by a file_system_type object:

    struct file_system_type {
        const char *name;
        int fs_flags;
        int (*get_sb) (struct file_system_type *, int,
                       const char *, void *, struct vfsmount *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct list_head fs_supers;
    };

Filesystem Handling

Three notions of "root" need to be distinguished:

- The root directory of a filesystem

- The root directory of a process

- The system’s root filesystem

Usually, the root directory of a process is the same as the root directory of the system’s root filesystem.

The /proc virtual filesystem is a child of the system’s root filesystem, and thus a subtree of the tree rooted at the system’s root filesystem.

Namespaces

In Linux 2.6, every process may have its own tree of mounted filesystems, the so-called namespace of the process.

Most processes share the same namespace, which is the tree rooted at the system’s root filesystem. However, a process gets a new namespace if it is created by a clone() call with the CLONE_NEWNS flag set.

Filesystem Mounting

On Linux, the same filesystem can be mounted multiple times:

root@localhost :/home/James# mount /dev/sdb2 ./mnt1

root@localhost :/home/James# ls mnt1

lost+found

root@localhost :/home/James# mount /dev/sdb2 ./mnt2

root@localhost :/home/James# ls mnt2

lost+found

root@localhost :/home/James# touch ./mnt1/test

root@localhost :/home/James# ls mnt1

lost+found test

root@localhost :/home/James# ls mnt2

lost+found test

Mounting a filesystem on a mount point hides whatever was mounted there before, like a stack:

root@localhost :/home/James# mount /dev/sdb2 ./mnt1/

root@localhost :/home/James# ls mnt1

lost+found test

root@localhost :/home/James# mount /dev/sdb1 ./mnt1/

root@localhost :/home/James# ls mnt1/

00001.vcf 00004.vcf BaiduMap download LGCameraPro_6.2android_zol.apk recording19190.3gpp u-center.apk

00002.vcf 126681_863ed540-6d14-47c1-b5ee-0e4290ed1a3e.apk bluetooth GPS???(GPS Test).apk LOST.DIR recording71086.3gpp Wildlife.mp4

00003.vcf Android DCIM kugou moboplayer_1.apk songs Wildlife.wmv

root@localhost :/home/James# umount mnt1/

root@localhost :/home/James# ls mnt1/

lost+found test

root@localhost :/home/James# umount mnt1/

root@localhost :/home/James# ls mnt1/

root@localhost :/home/James#

All information related to filesystem mounting is stored in a mounted filesystem descriptor of type vfsmount.

Mounting a Generic Filesystem

mount() -> sys_mount() -> do_mount() -> do_kern_mount()

Mounting the Root Filesystem

1. The kernel mounts the special rootfs filesystem, which simply provides an empty directory that serves as the initial mount point.

2. The kernel mounts the real root filesystem over the empty directory.

The rootfs filesystem allows the kernel to easily change the real root filesystem.

Pathname Lookup

Pathname lookup: how to derive an inode from the corresponding file pathname.

Besides parsing the pathname itself, the lookup procedure must take several Unix and VFS features into account:

1. The access rights of each directory traversed must be checked.

2. A component may be a symbolic link, which must be expanded.

3. Symbolic links may form circular references; these must be detected so that the lookup breaks out of what would otherwise be an infinite loop.

4. A component may be the mount point of a mounted filesystem. This situation must be detected and the lookup operation must continue into the new filesystem.

5. Pathname lookup must be performed inside the namespace of the calling process.

    int path_lookup(const char *name, unsigned int flags,
                    struct nameidata *nd);

    enum { MAX_NESTED_LINKS = 8 };

    struct nameidata {
        struct path path;
        struct qstr last;
        struct path root;
        unsigned int flags;
        int last_type;
        unsigned depth;
        char *saved_names[MAX_NESTED_LINKS + 1];

        /* Intent data */
        union {
            struct open_intent open;
        } intent;
    };

    struct path {
        struct vfsmount *mnt;
        struct dentry *dentry;
    };

The core of the pathname lookup operation: link_path_walk

- Standard pathname lookup

- Lookup of a directory

- Lookup with symbolic links

Implementation of VFS System Calls

VFS system calls are implemented by manipulating the VFS data structures.

Example: open(), read(), write(), close()

Corresponding system service routine: sys_xxx()

File Locking

Concurrent access

Synchronization problem

The POSIX standard requires a file-locking mechanism based on the fcntl() system call.

For more details about how to use file locks, refer to "Beginning Linux Programming".

Locks of this kind are known as advisory locks, because they have no effect unless the other processes cooperate by checking for the existence of a lock before accessing the file. In this respect they are similar to semaphores.

POSIX: advisory locks

System V: introduced mandatory locks

Linux: supports both advisory and mandatory locks (mandatory locking support was removed in Linux 5.15)

System calls: flock(), fcntl()

File-locking data structure in Kernel

    struct file_lock {
        ...
    };