In Linux user space, we often need to call system calls. Let's take Linux version 2.6.37 as an example to track the implementation of the read system call. System call implementations may vary between versions of Linux.
In some applications, we can see the following definition:
scssCopy code #define real_read(fd, buf, count ) (syscall(SYS_read, (fd), (buf), (count)))
Actually, what is actually called is the system function syscall(SYS_read), that is, the sys_read() function. In the Linux version 2.6.37, this function is implemented through several macro definitions.
Linux system call (SCI, system call interface) is actually a process of multi-channel aggregation and decomposition. The aggregation point is the 0x80 interrupt entry point (X86 system structure). That is to say, all system calls are aggregated from user space to the 0x80 interrupt point, and the specific system call number is saved at the same time. When the 0x80 interrupt handler is running, different system calls will be processed separately according to the system call number, that is, different kernel functions will be called for processing.
There are two ways to cause system calls:
(1) int $0×80, this is the only way to cause a system call in old Linux kernel versions.
(2) sysenter assembly instructions
In the Linux kernel, we can use the following macro definitions to make system calls.
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { struct file *file; ssize_t ret = -EBADF; int fput_needed; file = fget_light(fd, &fput_needed); if (file) { loff_t pos = file_pos_read(file); ret = vfs_read(file, buf, count, &pos); file_pos_write(file, pos); fput_light(file, fput_needed); } return ret; }
The macro definition of SYSCALL_DEFINE3 is as follows:
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
## means that the characters in the macro are directly replaced,
If name = read, then __NR_##name is replaced with __NR_read in the macro. NR##name is the system call number, ## refers to two macro expansions. That is, replace "name" with the actual system call name, and then expand __NR.... If name == ioctl, it is __NR_ioctl.
#ifdef CONFIG_FTRACE_SYSCALLS #define SYSCALL_DEFINEx(x, sname, ...) \ static const char *types_##sname[] = { \ __SC_STR_TDECL##x(__VA_ARGS__) \ }; \ static const char *args_##sname[] = { \ __SC_STR_ADECL##x(__VA_ARGS__) \ }; \ SYSCALL_METADATA(sname, x); \ __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) #else #define SYSCALL_DEFINEx(x, sname, ...) \ __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) #endif
Regardless of whether the CONFIG_FTRACE_SYSCALLS macro is defined or not, the following macro definition will eventually be executed:
__SYSCALL_DEFINEx(x, sname, VA_ARGS)
#ifdef CONFIG_HAVE_SYSCALL_WRAPPERS #define SYSCALL_DEFINE(name) static inline long SYSC_##name #define __SYSCALL_DEFINEx(x, name, ...) \ asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__)); \ static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__)); \ asmlinkage long SyS##name(__SC_LONG##x(__VA_ARGS__)) \ { \ __SC_TEST##x(__VA_ARGS__); \ return (long) SYSC##name(__SC_CAST##x(__VA_ARGS__)); \ } \ SYSCALL_ALIAS(sys##name, SyS##name); \ static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__)) #else /* CONFIG_HAVE_SYSCALL_WRAPPERS */ #define SYSCALL_DEFINE(name) asmlinkage long sys_##name #define __SYSCALL_DEFINEx(x, name, ...) \ asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__)) #endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */
The following types of macro definitions will eventually be called:
asmlinkage long sys##name(__SC_DECL##x(VA_ARGS))
That is the sys_read() system function we mentioned earlier.
asmlinkage tells the compiler to extract only the function's arguments from the stack. All system calls require this qualifier! This is similar to the macro definition mentioned in our previous article quagga.
That is, the following code in the macro definition:
struct file *file; ssize_t ret = -EBADF; int fput_needed; file = fget_light(fd, &fput_needed); if (file) { loff_t pos = file_pos_read(file); ret = vfs_read(file, buf, count, &pos); file_pos_write(file, pos); fput_light(file, fput_needed); } return ret;
Code analysis:
if (file->f_op->read)
ret = file->f_op->read(file, buf, count, pos);
At this point, the processing of the virtual file system layer is completed, and control is handed over to the ext2 file system layer.
The above is the detailed content of Syscall system call Linux kernel tracing. For more information, please follow other related articles on the PHP Chinese website!