Linux: mmap() System Call — Part II

This is a two-part series on the mmap() system call. Part I presents the internal workings of private and shared mappings, both anonymous and file-backed. Part II goes into detail about related system calls and additional flags used with mmap().

msync()

When a shared file-backed mapping is created with the MAP_SHARED flag, the kernel decides when modified pages are written back to disk and gives no guarantee about the timing. The msync() system call tells the kernel to write the in-memory copy of the pages to disk. This is useful when changes must be durable, so that they are not lost if the system crashes.

The msync() system call is as:

#include <sys/mman.h>

int msync(void *addr, size_t length, int flags);

The addr argument is the starting address of the memory region to be synchronized and must be page aligned. The length argument is the size of the region, starting from addr, and is rounded up to the next multiple of the system page size. The call returns 0 on success and -1 on error, with errno set to indicate the cause.

The flags argument should specify exactly one of MS_SYNC or MS_ASYNC, and may additionally be ORed with MS_INVALIDATE. MS_SYNC tells the kernel to perform a synchronous write-back: the call blocks until all the requested pages are written to disk. MS_ASYNC tells the kernel to perform an asynchronous write-back: the call returns immediately and the kernel writes the pages out later. In both cases the changes are already in the page cache, so a read() done by another process reflects them.

So, essentially, MS_SYNC waits until the changes are on disk, while MS_ASYNC leaves them in the page cache, which is flushed later by the kernel’s writeback threads (the pdflush threads in kernels before 2.6.32).

The MS_INVALIDATE flag invalidates the cached pages of the specified memory region. All pages of the region that are inconsistent with the disk copy are marked as invalid. The next time the region is accessed, those pages are brought in from disk again, so changes that other processes have written to the file become visible.
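
To make the MS_SYNC case concrete, here is a minimal sketch. The helper name write_durably() and the one-page file size are assumptions made for this illustration; only open(), ftruncate(), mmap(), msync() and munmap() are real calls.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): copies msg into the first page
 * of the file at path through a shared mapping, then flushes that page with
 * msync(MS_SYNC). Returns 0 on success and -1 on failure. */
int write_durably(const char *path, const char *msg)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = strlen(msg);
    if (page <= 0 || len > (size_t)page)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* The file must be at least as large as the region we touch. */
    if (ftruncate(fd, page) == -1) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, msg, len);

    /* Block until the dirty page has been written back to the file. */
    int rc = msync(map, (size_t)page, MS_SYNC);

    munmap(map, (size_t)page);
    return rc;
}
```

Calling write_durably("/tmp/demo.bin", "hello") and then reading the file with read() should return the same bytes. Replacing MS_SYNC with MS_ASYNC would make the call return without waiting for the disk write.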

More details about msync() can be seen here.

mremap()

Using the mremap() system call, we can move or resize an existing mapping. This helps in situations where you want to move a region of memory along with the underlying data. Without such a facility, you would need to create a new mapping, copy the data from the old location, and then destroy the old mapping.

glibc’s realloc() library function uses mremap() under the hood to resize large allocations that were obtained with mmap().

Note: This system call is Linux-specific and is not part of POSIX. glibc exposes it only when _GNU_SOURCE is defined.

The mremap() system call is as:

#define _GNU_SOURCE
#include <sys/mman.h>

void *mremap(void *old_address, size_t old_size, size_t new_size, int flags, ...);

This system call returns the starting address of the remapped memory region, or MAP_FAILED on error. The old_address argument is the starting address of the existing mapping and must be page aligned; usually it is the address returned by an earlier mmap() call. The old_size argument is the size of the existing mapping and new_size is the desired size. Both are rounded up to the next multiple of the page size, if needed.

The flags argument determines whether the kernel may relocate the existing mapping within the process’s address space. It can be 0, which does not allow the mapping to be moved (if the mapping cannot be resized in place, the call fails with ENOMEM), or a combination of:

  • MREMAP_MAYMOVE - allows the kernel to move the mapping to a new address if it cannot be resized in place
  • MREMAP_FIXED - the system call takes an additional, page-aligned new_address argument that gives the starting address of the moved mapping. MREMAP_FIXED can only be used together with MREMAP_MAYMOVE

Another thing to note: once a relocation happens, pointers into the old mapping are no longer valid. So, for a mapping that might be relocated during its lifetime, it is good to refer to locations inside it by their offset from the start of the mapping rather than by absolute pointers.
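
The offset advice can be sketched as follows. The struct region type and the region_* helpers are hypothetical names invented for this example. Callers keep offsets, and base is refreshed after every mremap().

```c
#define _GNU_SOURCE /* exposes mremap() and MREMAP_MAYMOVE in glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical growable region: callers store offsets, never raw pointers,
 * because base may change whenever the mapping is resized. */
struct region {
    char  *base; /* current start of the mapping */
    size_t size; /* current size in bytes */
};

/* Create an anonymous private mapping of the given size. */
int region_init(struct region *r, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->size = size;
    return 0;
}

/* Resize the mapping; MREMAP_MAYMOVE lets the kernel relocate it. */
int region_resize(struct region *r, size_t new_size)
{
    void *p = mremap(r->base, r->size, new_size, MREMAP_MAYMOVE);
    if (p == MAP_FAILED)
        return -1; /* on failure the old mapping is left as it was */
    r->base = p;
    r->size = new_size;
    return 0;
}

/* Destroy the mapping. */
void region_free(struct region *r)
{
    munmap(r->base, r->size);
    r->base = NULL;
    r->size = 0;
}
```

After region_resize(), data written at r.base + off before the call is found again at the new r.base + off, whether or not the kernel moved the mapping.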

More details about mremap() can be seen here.

mprotect()

The mprotect() system call changes the protection on a memory region.

The mprotect() system call is as:

#include <sys/mman.h>

int mprotect(void *addr, size_t length, int prot);

This call returns 0 on success and -1 on error. The addr argument must be page aligned, that is, a multiple of the page size. The length is rounded up to the next multiple of the page size.

The prot argument is either PROT_NONE or a bitwise OR of PROT_READ, PROT_WRITE and PROT_EXEC. These have the same meaning as seen in Part I. Accessing a memory region in a way its protection does not allow generates a SIGSEGV signal, which by default terminates the process and produces a core dump (if core dumps are enabled).

More details about mprotect() can be seen here.

mlock()/munlock()

The mlock() system call locks pages in memory so that they are not paged out, and accessing them never triggers a page fault that has to go to disk. This is needed for applications that work under latency constraints, and is also useful when you know that some pages will be accessed most of the time. It improves performance by never hitting the disk for the locked pages. The munlock() system call removes the lock created by a call to mlock().

The mlock()/munlock() system call is as:

#include <sys/mman.h>

int mlock(const void *addr, size_t length);
int munlock(const void *addr, size_t length);

The addr argument is the starting address of the memory region you want to lock, and the region continues for length bytes. On Linux, addr need not be page aligned: it is rounded down to the nearest page boundary. The locked region ends at the next page boundary at or above addr + length.

After a successful call, the specified memory region is guaranteed to be locked by the kernel and to stay in memory. The mlock() system call fails if there is not enough memory to service the request, or if the request would exceed the caller’s lock limit.

The munlock() system call unlocks the pages locked by a call to mlock(). The addr and length arguments have the same meaning as in the mlock() system call. An important note here: calling munlock() does not immediately remove the pages from memory. They simply become eligible to be paged out when the system is under memory pressure. Also, the locks are removed when the process that called mlock() terminates, or when the pages are unmapped with munmap().

No limits are enforced for a privileged process, that is, one with the CAP_IPC_LOCK capability. So, a privileged process can lock any number of pages, provided there is enough memory to accommodate the request. An unprivileged process is constrained by its RLIMIT_MEMLOCK resource limit. The default value of RLIMIT_MEMLOCK depends on the kernel version and system configuration; on my machine it is 8 pages (32768 bytes, with a page size of 4096).

Memory locks are not inherited by the child during a fork(), nor are they preserved across an exec(). Multiple processes can share a page that is locked by one of them. The page stays locked only as long as at least one process holds a lock on it, so when the only process that locked the page terminates or unlocks it, the lock guarantee is lost.
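
The limit check and the lock/unlock pair can be sketched like this. Both helper names, memlock_soft_limit() and with_locked_page(), are hypothetical and exist only for this illustration.

```c
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS and RLIMIT_MEMLOCK in glibc */
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: stores the soft RLIMIT_MEMLOCK value (in bytes) in
 * *out. Returns 0 on success and -1 on failure. */
int memlock_soft_limit(rlim_t *out)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: maps one page, locks it, touches it while it is
 * locked, then unlocks and unmaps it. Returns 0 on success and -1 if any
 * step fails (for example when RLIMIT_MEMLOCK is too small). */
int with_locked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (mlock(p, len) == -1) {
        munmap(p, len);
        return -1;
    }

    /* While locked, the page stays resident and is not swapped out. */
    memset(p, 0x5a, len);
    int ok = (p[len - 1] == 0x5a);

    munlock(p, len); /* munmap() alone would also drop the lock */
    munmap(p, len);
    return ok ? 0 : -1;
}
```

An unprivileged process can compare the requested size against memlock_soft_limit() before calling mlock(), instead of discovering the limit through a failed call.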

More details about mlock()/munlock() can be seen here.

mlockall()/munlockall()

The mlockall() system call locks all the pages in the calling process’s address space: those mapped at present, those mapped in the future, or both, depending on the flags passed to it. The munlockall() call unlocks all pages currently locked in the calling process’s address space and cancels the effect of an earlier mlockall() call with MCL_FUTURE. It does not prevent the same process from locking pages again with later lock calls.

The mlockall()/munlockall() system call is as:

#include <sys/mman.h>

int mlockall(int flags);
int munlockall(void);

Both system calls return 0 on success and -1 on error. The flags value passed to mlockall() is one or a combination of:

  • MCL_CURRENT - locks all the pages currently mapped into the calling process’s address space. Pages mapped after the call are not locked
  • MCL_FUTURE - locks all pages that are mapped into the calling process’s address space in the future

Combining both flags locks all the present and future pages of the calling process. The same constraints and lock lifetime apply as seen earlier for mlock().
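
A minimal sketch of combining the two flags. The wrapper names lock_all_memory() and unlock_all_memory() are assumptions for this example. In an unprivileged process the lock is likely to fail because of RLIMIT_MEMLOCK, so the wrapper returns the errno value instead of hiding it.

```c
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical wrapper: tries to lock every current and future page of the
 * calling process. Returns 0 on success, or the errno value on failure. */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return errno;
    return 0;
}

/* Hypothetical wrapper: drops all locks held by the calling process.
 * Returns 0 on success, or the errno value on failure. */
int unlock_all_memory(void)
{
    if (munlockall() == -1)
        return errno;
    return 0;
}
```

A latency-sensitive program would typically call lock_all_memory() once at startup and treat a non-zero result (such as ENOMEM or EPERM) as a configuration problem to report.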

More details about mlockall()/munlockall() can be seen here.

madvise()

The madvise() system call lets the kernel know about a process’s usage pattern for the pages mapped into its address space. The assumption is that the kernel will take this into account when performing operations on those pages. Note that this is just a hint, and it is up to the kernel whether to follow it.

The madvise() system call is as:

#include <sys/mman.h>

int madvise(void *addr, size_t length, int advice);

Again, addr is the starting address and must be page aligned, and length is effectively rounded up to the next multiple of the page size. The advice argument is one of the following:

  • MADV_NORMAL - the default setting for all pages. The kernel performs a moderate amount of read-ahead and read-behind
  • MADV_RANDOM - the pages will be accessed in random order, so the kernel need not perform any read-ahead or read-behind
  • MADV_SEQUENTIAL - the pages will be accessed sequentially, so the kernel can read ahead aggressively
  • MADV_WILLNEED - the process expects to access the specified pages soon, so the kernel can start reading them into memory
  • MADV_DONTNEED - the process does not need the pages anymore, so the kernel can free the resources associated with them
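
As an illustration of MADV_DONTNEED, the sketch below relies on Linux-specific behaviour: for a private anonymous mapping, pages dropped this way are replaced by zero-filled pages on the next access. The helper name first_byte_after_dontneed() is made up for this example.

```c
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS and madvise() in glibc */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: fills a private anonymous page with 0xAB, tells the
 * kernel the page is no longer needed, and returns the value of its first
 * byte afterwards. Returns -1 if any call fails. */
int first_byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);

    /* The kernel may now free the page; the mapping itself stays valid. */
    if (madvise(p, len, MADV_DONTNEED) == -1) {
        munmap(p, len);
        return -1;
    }

    /* Touching the page again faults in a fresh page. */
    int value = *(volatile unsigned char *)p;

    munmap(p, len);
    return value;
}
```

On Linux this is expected to return 0 rather than 0xAB, which shows that the advice can discard data. For a file-backed shared mapping, the contents would instead be reloaded from the file.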

More details about madvise() can be seen here.

mmap() Flags Revisited

To complete this series, let us look at the additional flags that can be passed to the mmap() system call. The mmap() system call, as seen in Part I, is as:

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

The following are the flags that can be additionally passed to the above system call:

  • MAP_FIXED - This tells the kernel that the addr passed to the system call is not a hint and must be used as is. Hence, addr must be page aligned. Any existing mapping that overlaps the requested range is replaced
  • MAP_LOCKED - This tells the kernel to preload the mapping into memory and lock it there. This saves a system call, since you would otherwise call mmap() followed by mlock() on the returned address
  • MAP_POPULATE - This tells the kernel to prefault the mapping. For a file mapping, the kernel performs a read-ahead of the file from disk into memory. Later accesses are faster, since the pages are already in memory and no page faults occur to read the contents from disk
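
The effect of MAP_POPULATE can be observed with mincore(), which reports which pages of a range are resident. This is a sketch under assumptions: resident_after_mmap() is a hypothetical helper, it uses an anonymous mapping instead of a file to stay self-contained, and it handles at most 64 pages.

```c
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS, MAP_POPULATE and mincore() in glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps npages anonymous pages with extra_flags ORed
 * into the mmap() flags and returns how many of them mincore() reports as
 * resident, or -1 on failure. */
int resident_after_mmap(int extra_flags, size_t npages)
{
    unsigned char vec[64]; /* one status byte per page */
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0 || npages > sizeof vec)
        return -1;
    size_t len = npages * (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int count = -1;
    if (mincore(p, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1; /* lowest bit set: page is resident */
    }

    munmap(p, len);
    return count;
}
```

With MAP_POPULATE as extra_flags, all pages are expected to be resident right after mmap() returns. With 0, untouched pages are typically not resident until they are first accessed. Passing MAP_LOCKED instead is subject to the same RLIMIT_MEMLOCK constraint as mlock().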

That’s it.
