Linux System Calls: The User-Kernel Interface


Understand Linux system calls through interactive visualizations. Learn how user programs communicate with the kernel, protection rings, and syscall performance.


The Gateway to the Kernel

Every time your program needs to interact with hardware, read a file, or create a process, it must ask the kernel for help. But user programs can't directly access kernel memory or execute privileged instructions - that would be chaos! Instead, they use system calls - the carefully controlled gateway between user space and kernel space.

Think of system calls as a restaurant's service window. Customers (user programs) can't walk into the kitchen (kernel space) - that would be unsafe and chaotic. Instead, they place orders through the window (system calls), and the kitchen staff (kernel) fulfills those requests with proper safety checks and resource management.

Let's explore this fascinating boundary where user programs meet the almighty kernel.

System Call Execution Flow

[Interactive visualization: how a system call transitions between user and kernel space. Protection rings shown: Ring 0: Kernel, Ring 1: Device Drivers, Ring 2: Device Drivers, Ring 3: User Applications.]

  1. User Application (Ring 3) - Program calls library function
  2. Glibc Wrapper (Ring 3) - Prepare syscall number and arguments
  3. SYSCALL Instruction (Ring 3) - Trigger transition to kernel mode
  4. Kernel Entry (Ring 0) - Save user context, switch to kernel stack
  5. System Call Handler (Ring 0) - Dispatch to appropriate kernel function
  6. Kernel Operation (Ring 0) - Perform requested operation
  7. Prepare Return (Ring 0) - Set return value and error codes
  8. Return to User (Ring 3) - Restore user context, switch back

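Mode Switch Overhead (approximate)

  - Save Registers: ~50 cycles
  - TLB Flush: ~100 cycles (only when kernel entry switches page tables and flushes, for example with kernel page-table isolation on CPUs without PCID; a plain syscall does not flush the TLB)
  - Total Overhead: ~200-1000 cycles, depending on the CPU and enabled mitigations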

Understanding System Calls

What Are System Calls?

System calls are the fundamental interface between applications and the Linux kernel:

// User program wants to write to the screen
printf("Hello");          // Library function (buffers output in user space)

// When the buffer is flushed, printf ends up calling the write() syscall
write(1, "Hello", 5);     // System call

// At assembly level (x86_64):
//   mov $1, %rax      ; System call number (write)
//   mov $1, %rdi      ; File descriptor (stdout)
//   mov $msg, %rsi    ; Buffer address
//   mov $5, %rdx      ; Number of bytes
//   syscall           ; Invoke system call
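To make the relationship between a wrapper and the underlying system call concrete, here is a minimal sketch that issues write by number through the generic syscall() function. The helper name write_via_syscall is made up for this illustration.

```c
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write a C string to fd by invoking the write system call by number,
   bypassing the write() wrapper. Returns the number of bytes written,
   or -1 with errno set. */
long write_via_syscall(int fd, const char *msg)
{
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Calling write_via_syscall(1, "Hello\n") should behave like the write() call above, because both end up at the same kernel entry.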

The System Call Table

The kernel maintains a table of all system calls:

// System call table (simplified, x86_64 numbering)
void *sys_call_table[] = {
    [0]  = sys_read,
    [1]  = sys_write,
    [2]  = sys_open,
    [3]  = sys_close,
    [4]  = sys_stat,
    [5]  = sys_fstat,
    // ...
    [57] = sys_fork,
    [59] = sys_execve,
    [60] = sys_exit,
    [62] = sys_kill,
    // ... several hundred entries in total
};

# View all syscalls (shell)
ausyscall --dump
man syscalls

CPU Protection Rings

Ring Levels

x86 processors provide 4 privilege levels:

Ring 0 (Kernel Mode):
  - Full hardware access
  - All instructions available
  - Can access all memory
  - Linux kernel runs here

Ring 1 & 2 (Unused in Linux):
  - Originally for device drivers
  - Linux doesn't use these

Ring 3 (User Mode):
  - Restricted instruction set
  - Cannot access kernel memory
  - Cannot directly access hardware
  - All user applications run here
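On x86, the current privilege level is visible in the low two bits of the code segment selector, so a user program can check which ring it runs in. The sketch below assumes x86_64 and GCC-style inline assembly; the function name is hypothetical, and on other architectures it simply reports -1.

```c
/* Return the current privilege level (0-3), read from the low two bits of
   the %cs selector, or -1 on architectures this sketch does not cover. */
int current_privilege_level(void)
{
#if defined(__x86_64__)
    unsigned long cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    return (int)(cs & 3);
#else
    return -1;
#endif
}
```

An ordinary process on x86_64 Linux should see 3 here, matching the Ring 3 description above.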

Privilege Transitions

// User space (Ring 3)
int fd = open("/tmp/file", O_RDONLY);

// SYSCALL instruction triggers transition

// Kernel space (Ring 0)
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    // Kernel can now:
    // - Access hardware
    // - Modify page tables
    // - Access all memory
    // - Execute privileged instructions
}

// Return to user space (Ring 3)

Making System Calls

Direct System Call

#include <unistd.h>
#include <sys/syscall.h>

// Using the syscall() function
long result = syscall(SYS_getpid);

// Using inline assembly (x86_64)
long my_getpid(void)
{
    long pid;
    asm volatile(
        "syscall"
        : "=a" (pid)               // Return value comes back in %rax
        : "a" (39L)                // SYS_getpid = 39 goes in %rax
        : "rcx", "r11", "memory"   // SYSCALL overwrites %rcx and %r11
    );
    return pid;
}

System Call Conventions (x86_64)

// Register usage for system calls:
// %rax - System call number
// %rdi - First argument
// %rsi - Second argument
// %rdx - Third argument
// %r10 - Fourth argument
// %r8  - Fifth argument
// %r9  - Sixth argument
// %rax - Return value (a value in -4095..-1 means -errno)
// %rcx, %r11 - Overwritten by the SYSCALL instruction itself

// Example: write(fd, buf, count)
// %rax = 1 (SYS_write)
// %rdi = fd
// %rsi = buf
// %rdx = count
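The register convention can be exercised directly. The following is a minimal sketch of a three-argument raw system call, assuming x86_64 and GCC-style inline assembly; raw_syscall3 is a hypothetical name, and the non-x86_64 branch is only a fallback through libc so the sketch still builds elsewhere.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke a three-argument system call. Returns the kernel's raw result:
   a non-negative value on success or -errno on failure. */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                           /* result comes back in %rax */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)  /* %rax, %rdi, %rsi, %rdx */
        : "rcx", "r11", "memory");            /* SYSCALL overwrites %rcx, %r11 */
    return ret;
#else
    /* Fallback: go through libc and convert its (-1, errno) convention
       back to the raw -errno form. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

Note the difference from the libc wrappers: the raw interface reports failure as a negative errno value in the return register, and it is the wrapper that turns this into -1 plus errno.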

Common System Calls

File Operations

// Opening files
int fd  = open("/path/to/file", O_RDONLY);
int out = openat(AT_FDCWD, "file", O_WRONLY | O_CREAT, 0644);

// Reading and writing
ssize_t nread    = read(fd, buffer, sizeof(buffer));
ssize_t nwritten = write(out, data, length);
ssize_t npread   = pread(fd, buffer, count, offset);  // Reads at an explicit offset; file position is unchanged

// File information
struct stat st;
fstat(fd, &st);
lstat("/path/to/link", &st);  // Don't follow symlinks

// Directory operations (libc functions built on the getdents64 syscall)
DIR *dir = opendir("/path");
struct dirent *entry = readdir(dir);

Process Management

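// Process creation
pid_t pid = fork();
// glibc clone() wrapper takes: entry function, top of child stack, flags, argument
pid_t tid = clone(child_fn, stack_top, CLONE_VM | CLONE_FILES | SIGCHLD, arg);

// Process execution
execve("/bin/ls", argv, envp);
execl("/bin/sh", "sh", "-c", "command", (char *)NULL);  // libc wrapper around execve

// Process termination
exit(0);    // libc function: runs atexit handlers, flushes stdio, then exits
_exit(1);   // Immediate exit, no cleanup

// Process synchronization
pid_t done = wait(&status);
pid_t polled = waitpid(child_pid, &status, WNOHANG);

// Signals
kill(pid, SIGTERM);
raise(SIGINT);                    // libc function: signals the calling thread
sigaction(SIGCHLD, &act, NULL);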
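The calls above are usually combined into a create, terminate, reap cycle. Here is a small sketch of that cycle using fork(), _exit() and waitpid(); the function name child_exit_status is invented for the example.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code`. Returns the exit status
   the parent observes, or -1 if fork/waitpid fails or the child did not
   exit normally. */
int child_exit_status(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);   /* child: skip atexit handlers and stdio flushing */

    int status;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);      /* reap the child */
    } while (w == -1 && errno == EINTR);   /* retry if a signal interrupted us */

    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child uses _exit() rather than exit() so that stdio buffers inherited from the parent are not flushed a second time.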

Memory Management

// Memory mapping
void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
munmap(addr, length);

// Memory protection
mprotect(addr, length, PROT_READ | PROT_EXEC);

// Heap management
void *old_brk = sbrk(0);
brk(new_address);

// Memory locking
mlock(addr, length);
mlockall(MCL_CURRENT | MCL_FUTURE);

// Memory advice
madvise(addr, length, MADV_SEQUENTIAL);
madvise(addr, length, MADV_WILLNEED);

Network Operations

// Socket creation
int sock = socket(AF_INET, SOCK_STREAM, 0);

// Connection
connect(sock, (struct sockaddr *)&addr, sizeof(addr));
bind(sock, (struct sockaddr *)&addr, sizeof(addr));
listen(sock, backlog);
int client = accept(sock, (struct sockaddr *)&client_addr, &addr_len);

// Data transfer
send(sock, buffer, length, 0);
recv(sock, buffer, sizeof(buffer), 0);
sendto(sock, buffer, length, 0, (struct sockaddr *)&dest_addr, addr_len);
recvfrom(sock, buffer, sizeof(buffer), 0, (struct sockaddr *)&src_addr, &addr_len);

// Multiplexing
select(nfds, &readfds, &writefds, &exceptfds, &timeout);
poll(fds, nfds, timeout);
epoll_wait(epfd, events, maxevents, timeout);
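To show data transfer and multiplexing together without needing a network, this sketch uses socketpair() (not covered above, but a standard call that returns two already-connected sockets), then send(), poll() and recv(). The function name socket_echo and the one-second timeout are choices made for the illustration.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `len` bytes over a connected socket pair, wait with poll() until the
   peer is readable, then receive into `out`. Returns bytes received or -1. */
ssize_t socket_echo(const void *msg, size_t len, void *out, size_t outcap)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    ssize_t n = -1;
    if (send(sv[0], msg, len, 0) == (ssize_t)len) {
        struct pollfd pfd = { .fd = sv[1], .events = POLLIN };
        /* Wait up to 1000 ms for data to arrive on the other end. */
        if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN))
            n = recv(sv[1], out, outcap, 0);
    }
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

With a TCP socket the pattern is the same, with socket(), connect() or accept() taking the place of socketpair().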

System Call Implementation

Kernel Entry Point

// arch/x86/entry/entry_64.S (heavily simplified; based on older kernel sources,
// current kernels use different macros and save the full register set)
ENTRY(entry_SYSCALL_64)
    // Save user registers
    swapgs                                       // Switch to kernel GS
    movq    %rsp, PER_CPU_VAR(rsp_scratch)
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    // Save user context
    pushq   $__USER_DS                           // User stack segment
    pushq   PER_CPU_VAR(rsp_scratch)             // User RSP
    pushq   %r11                                 // User RFLAGS (saved in %r11 by SYSCALL)
    pushq   $__USER_CS                           // User code segment
    pushq   %rcx                                 // User RIP (saved in %rcx by SYSCALL)

    // Call system call handler
    movq    %rax, %rdi                           // Syscall number as first arg
    movq    %rsp, %rsi                           // pt_regs as second arg
    call    do_syscall_64

    // Return to user space
    USERGS_SYSRET64
END(entry_SYSCALL_64)

System Call Dispatcher

// arch/x86/entry/common.c (simplified)
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
    // Enable interrupts
    local_irq_enable();

    // Check syscall number
    if (likely(nr < NR_syscalls)) {
        // Call the system call
        regs->ax = sys_call_table[nr](regs);
    } else {
        // Invalid syscall
        regs->ax = -ENOSYS;
    }

    // Check for pending signals, etc.
    syscall_return_slowpath(regs);
}
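The dispatch pattern itself (bounds check, table lookup, -ENOSYS for unknown numbers) can be reproduced in ordinary user-space C. The sketch below is only an analogy: the handler type, the two handlers and the table are hypothetical and have nothing to do with the real kernel table.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type and handlers, for illustration only. */
typedef long (*handler_fn)(long arg);

static long handle_double(long arg) { return arg * 2; }
static long handle_negate(long arg) { return -arg; }

/* Sparse table indexed by call number, in the style of sys_call_table. */
static const handler_fn call_table[] = {
    [0] = handle_double,
    [3] = handle_negate,   /* entries 1 and 2 are left NULL */
};
#define NR_CALLS (sizeof(call_table) / sizeof(call_table[0]))

/* Bounds-check the number, then dispatch; unknown numbers yield -ENOSYS. */
long dispatch(unsigned long nr, long arg)
{
    if (nr < NR_CALLS && call_table[nr] != NULL)
        return call_table[nr](arg);
    return -ENOSYS;
}
```

The bounds check matters for the same reason it does in the kernel: the call number comes from an untrusted caller and must never index past the end of the table.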

Tracing System Calls

Using strace

# Basic tracing
strace ls
strace -p 1234                 # Attach to process

# Detailed output
strace -v ls                   # Verbose (unabbreviated structures)
strace -x ls                   # Print non-ASCII strings in hex (-xx for all strings)
strace -y ls                   # Print paths for file descriptors

# Filter syscalls
strace -e trace=open,openat,read,write ls
strace -e trace=file ls        # File operations
strace -e trace=process ls     # Process operations
strace -e trace=network ls     # Network operations

# Timing information
strace -T ls                   # Time spent in each syscall
strace -t ls                   # Wall-clock timestamp
strace -r ls                   # Relative timestamp

# Statistics
strace -c ls                   # Count syscalls
strace -c -S calls ls          # Sort the summary by call count

# Output to file
strace -o trace.log ls

Using perf

# Trace all syscalls
perf trace ls

# Trace specific syscalls
perf trace -e open,read,write ls

# System-wide tracing
perf trace -a

# With stack traces
perf trace --call-graph dwarf ls

Using bpftrace

# Count syscalls by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace a specific syscall (modern libc opens files through openat)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s\n", comm, str(args->filename)); }'

# Syscall latency
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
                 @ns = hist(nsecs - @start[tid]);
                 delete(@start[tid]);
             }'

Performance Optimization

vDSO (Virtual Dynamic Shared Object)

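The kernel maps a small shared library, the vDSO, into every process so that a few frequently used calls can usually complete without entering the kernel:

// These calls are normally served from the vDSO in user space
// (they fall back to a real syscall when the fast path is unavailable):
gettimeofday(&tv, NULL);              // Uses vDSO
clock_gettime(CLOCK_REALTIME, &ts);   // Uses vDSO
getcpu(&cpu, &node);                  // Uses vDSO

# View vDSO mapping
cat /proc/self/maps | grep vdso

# linux-vdso.so.1 is not a file on disk; ldd lists it with an address but no path
ldd /bin/ls | grep vdso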

Batching System Calls

// Instead of multiple write() calls
write(fd, buf1, len1);
write(fd, buf2, len2);
write(fd, buf3, len3);

// Use writev() for vectored I/O
struct iovec iov[3];
iov[0].iov_base = buf1; iov[0].iov_len = len1;
iov[1].iov_base = buf2; iov[1].iov_len = len2;
iov[2].iov_base = buf3; iov[2].iov_len = len3;
writev(fd, iov, 3);  // Single syscall

// io_uring for async batch I/O (shown with the liburing helper library)
struct io_uring ring;
io_uring_queue_init(32, &ring, 0);
// Queue multiple operations, then submit them together
io_uring_submit(&ring);
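A compact, complete version of the writev() pattern might look like this. The helper name write_three is hypothetical, and like write(), writev() may write fewer bytes than requested, which a production caller would need to handle.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write three strings with a single writev() call.
   Returns the total number of bytes written, or -1 with errno set. */
ssize_t write_three(int fd, const char *a, const char *b, const char *c)
{
    struct iovec iov[3] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
        { .iov_base = (void *)c, .iov_len = strlen(c) },
    };
    return writev(fd, iov, 3);   /* one kernel entry instead of three */
}
```

The buffers are gathered by the kernel in order, so the reader sees them as one contiguous stream.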

Reducing System Calls

// Buffer I/O in userspace
setvbuf(file, buffer, _IOFBF, 4096);

// Use mmap instead of read/write
void *data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Access data directly: no read() calls, although page faults still enter the kernel
process_data(data);
munmap(data, file_size);

// Cache frequently used data
// Caveat: the cached value is stale in a child after fork(), so reset it there
static __thread pid_t cached_pid = 0;

pid_t fast_getpid(void)
{
    if (!cached_pid)
        cached_pid = getpid();
    return cached_pid;
}
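User-space buffering is easy to sketch by hand, which also makes the saving measurable. The structure and function names below (wbuf, wbuf_putc, wbuf_flush) and the 256-byte capacity are invented for the example; the syscalls field counts how many write() calls were actually issued.

```c
#include <unistd.h>

#define WBUF_CAP 256

struct wbuf {
    int fd;
    size_t used;
    unsigned long syscalls;   /* number of write() calls issued so far */
    char data[WBUF_CAP];
};

/* Write out everything that is buffered. Returns 0 on success, -1 on error. */
int wbuf_flush(struct wbuf *w)
{
    size_t off = 0;
    while (off < w->used) {
        ssize_t n = write(w->fd, w->data + off, w->used - off);
        w->syscalls++;
        if (n == -1)
            return -1;
        off += (size_t)n;     /* continue after a short write */
    }
    w->used = 0;
    return 0;
}

/* Append one byte, flushing only when the buffer is full. Returns 0 or -1. */
int wbuf_putc(struct wbuf *w, char c)
{
    if (w->used == WBUF_CAP && wbuf_flush(w) == -1)
        return -1;
    w->data[w->used++] = c;
    return 0;
}
```

Writing 1000 single bytes through this buffer followed by one final flush should need only a handful of write() calls instead of 1000, which is the same idea stdio applies behind setvbuf().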

Security Considerations

Seccomp (Secure Computing)

// Restrict available syscalls
#include <stddef.h>           // offsetof
#include <sys/prctl.h>        // prctl
#include <sys/syscall.h>      // __NR_* numbers
#include <linux/seccomp.h>
#include <linux/filter.h>

// Note: a production filter should also check seccomp_data.arch first
struct sock_filter filter[] = {
    // Allow only read, write, exit
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),   // libc exit() uses exit_group instead
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

// An unprivileged process must set no_new_privs before installing a filter
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

Syscall Auditing

# Enable syscall auditing
auditctl -a always,exit -F arch=b64 -S open -S openat

# View audit logs
ausearch -sc open

# Monitor specific user
auditctl -a always,exit -F arch=b64 -F uid=1000 -S execve

Common Pitfalls

EINTR Handling

// Wrong - doesn't handle EINTR
ssize_t n = read(fd, buf, count);

// Correct - retry on EINTR
ssize_t n;
do {
    n = read(fd, buf, count);
} while (n == -1 && errno == EINTR);
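In practice the EINTR retry is often combined with handling short reads, since read() may also return fewer bytes than requested. A minimal sketch of such a helper follows; the name read_full is a common convention, not a standard function.

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly `count` bytes unless EOF or an error occurs first.
   Retries on EINTR and continues after short reads. Returns the number of
   bytes read (less than count only at EOF), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: try again */
            return -1;
        }
        if (n == 0)
            break;                 /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

This suits blocking descriptors. For non-blocking ones, EAGAIN would need separate treatment.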

Error Checking

// Always check return values
int fd = open(file, O_RDONLY);
if (fd == -1) {
    perror("open");
    return -1;
}

// Check specific errors
void *addr = mmap(...);
if (addr == MAP_FAILED) {
    if (errno == ENOMEM) {
        // Handle out of memory
    } else if (errno == EACCES) {
        // Handle permission denied
    }
}
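One detail worth showing in full is that errno should be read immediately after the failing call, before any other library call can overwrite it. The helper below is a small sketch with an invented name, probe_open.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open `path` read-only. Returns 0 on success (closing the file
   again), otherwise the errno value reported by open(). */
int probe_open(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;   /* capture errno right away */
    close(fd);
    return 0;
}
```

A caller can then compare the result against specific codes such as ENOENT or EACCES and react to each case differently.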

Best Practices

  1. Minimize syscalls - They're expensive
  2. Batch operations - Use vectored I/O
  3. Buffer in userspace - Reduce I/O syscalls
  4. Use vDSO - For time and CPU info
  5. Handle EINTR - Syscalls can be interrupted
  6. Check errors - Always check return values
  7. Trace sparingly - strace adds overhead
  8. Consider io_uring - For high-performance I/O

Conclusion

System calls are the fundamental boundary between user applications and the kernel. They provide controlled access to hardware and kernel services while maintaining security and stability. The interactive visualizations showed how every syscall involves a privilege transition, saving and restoring of user state, and careful validation.

Understanding system calls helps you write more efficient programs, debug complex issues, and appreciate the elegant design of the Linux kernel. Remember: every file you open, every byte you read, and every process you create goes through this carefully orchestrated dance between user and kernel space.
