Jun 30, 2020

Linux: mmap() System Call — Part I

This is a two-part series on the mmap() system call. Part I presents the internal workings of private and shared mappings, both anonymous and file backed. Part II will go into details about other related system calls and additional flags used with mmap().

The mmap() system call is used to create memory mappings in the calling process’s virtual address space. A memory mapping is a set of Page Table Entries (PTEs) describing a virtual address range specific to a process. A memory mapping has a start address, a length and permissions associated with it. The start address, together with an offset, is used to read from or write to the mapped range. The permissions tell whether a process can read, write or execute the contents of the mapping. A call to munmap() destroys the memory mappings created with mmap().

A memory mapping can be file/device backed or anonymous. Each of these can in turn be either private (copy-on-write) or shared. Creating a new memory mapping results in new PTEs and thereby a new virtual address allocation. Physical memory is allocated lazily, when that virtual memory is first accessed: the access generates a page fault, which is serviced by the page fault handler. If you are interested in knowing how Linux’s virtual memory works, you can refer to one of my earlier posts [4].

Below are the prototypes of the mmap() and munmap() system calls, used to create and destroy a memory mapping:

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);

The addr argument indicates the starting address of the new memory mapping. This address is used as a hint by the kernel, which cannot guarantee that exact address due to memory alignment restrictions or other constraints. If you need the mapping to absolutely start at the given address, you can make use of special flags (such as MAP_FIXED), but the caller then becomes responsible for satisfying the alignment requirements. The argument can be passed as NULL, in which case the kernel decides on the starting address of the mapping, which is returned to the user from the system call.

The length argument takes the length of the requested memory mapping, in bytes. Although the caller need not pass in a multiple of the page size, the kernel creates mappings in multiples of the page size, so the length is rounded up to the next page boundary.

The prot argument takes a set of flags that act as protection bits on the mapping. A combination can be used by ORing the flags together. The possible flags are: PROT_READ, PROT_WRITE, PROT_EXEC and PROT_NONE. PROT_EXEC is used when mapping an executable/shared library into a process’s address space, and PROT_NONE means the mapping is not accessible at all. One use of the PROT_NONE flag is to create a mapping that acts as a fence between other mappings created by the process.

The flags argument takes a set of flags that let you control the mapping. The most important flags are MAP_PRIVATE and MAP_SHARED. They are used to create a private or a shared mapping respectively (we shall see them in detail below). Another commonly used flag is MAP_ANONYMOUS, which is used to create an anonymous mapping.

The fd argument is the file descriptor of the file that is the source of the mapping. The file descriptor can be closed after the call to mmap() without invalidating the mapping.

The offset argument is the position in the backing file at which the mapping starts. The offset needs to be a multiple of the page size. The offset and the length together determine the region of the backing file that is mapped.

On success, the mmap() system call returns the starting address of the memory mapping and on failure, the MAP_FAILED constant is returned. It is better to check for the constant than for its absolute value, since the value might change across platforms. So, it is good practice to do the following:

s_addr = mmap(...);
if (s_addr == MAP_FAILED) {
    printf("mmap failed...\n");
    exit(1);
}

The mappings created using mmap() are returned to the OS by calling the munmap() system call and passing in the starting address of the memory mapping along with the length of the mapping. At a high level this can be done as follows:

s_addr = mmap(...);
if (s_addr == MAP_FAILED) {
    printf("mmap failed...\n");
    exit(1);
}

....

ret = munmap(s_addr, len);
if (ret == -1) {
    printf("munmap failed...\n");
    exit(1);
}

Anonymous Memory Mappings

An anonymous memory mapping is not backed by any underlying source. An anonymous mapping may be private or shared. Private anonymous mappings are primarily used to allocate new memory for the calling process. malloc() internally uses mmap() to request large blocks of memory from the OS. It then manages these blocks to better service allocation/deallocation requests from the process. A shared anonymous mapping can be used for very fast Inter-Process Communication (IPC). We will look at each one of them below.

Private Anonymous Mapping

A private anonymous mapping can be created with the following call:

unsigned short buf_size = sizeof(int);
std::cout << "Allocating " << buf_size << " byte(s)" << std::endl;
int *s_addr = (int *)mmap(NULL,
                    buf_size,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS,
                    -1,
                    0);
if (s_addr == MAP_FAILED) {
    std::cerr << "Error: mmap() failed. " << strerror(errno) << std::endl;
    return errno;
}

This creates new PTEs in the page table that describe the mapping just created. The necessary flags, VM_MAYREAD and VM_MAYWRITE, are set and no physical allocation is done yet. This can be seen in the illustration below:

Fig 1: Private Anonymous Mapping - Create

When the first read is done after this point, a copy-on-write zero page is allocated (if not already present) and the PTE is made to point to this zero page in memory. This can be seen with the below code:

std::cout << "Starting address of the buffer, of size " << buf_size
                << ": " << s_addr << std::endl;
std::cout << "Value at address(" << s_addr << "): " << *s_addr << std::endl;

// [Output]
// Starting address of the buffer, of size 4: 0x11372d000
// Value at address(0x11372d000): 0

When the read is requested, the page fault handler kicks in and updates the PTE to point to the copy-on-write zero page. This can be seen below:

Fig 2: Private Anonymous Mapping - Read

Next, we write to the same address:

std::cout << "Writing to address(" << s_addr << "): " << s_addr << std::endl;
*s_addr = 42;
std::cout << "New value at address(" << s_addr << "): " << *s_addr << std::endl;

// [output]
// Writing to address(0x11372d000): 0x11372d000
// New value at address(0x11372d000): 42

Now, when servicing a write to that mapping (like the code above), the page fault handler kicks in due to the copy-on-write protection on the zero page. A new physical page is then allocated and the PTE is adjusted to point to the new physical page. The write is then done on the new page. This can be visualized as:

Fig 3: Private Anonymous Mapping - Write

Now, let us look at what happens when we fork() a child and reuse the virtual address. Note that we have requested a MAP_PRIVATE mapping using mmap(). We will make use of the below code block:

pid_t pid = fork();
if (pid == -1) {
    std::cerr << "Error: fork() failed. " << strerror(errno) << std::endl;
    return errno;
} else if (pid == 0) {
    std::cout << "Child process executing with PID: " << getpid() << std::endl;
    std::cout << "Accessing " << s_addr << " from child (" << getpid() << ")"
                << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
    std::cout << "Child Process modifying value at " << s_addr << std::endl;
    *s_addr = 100;
    std::cout << "New value at " << s_addr << "for PID: " << *s_addr
                << std::endl;
    return 0;
} else {
    std::cout << "Forked a child with PID: " << pid << std::endl;
    std::cout << "Parent process executing with PID: " << getpid() << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
    std::cout << "Waiting for child..." << std::endl;
    int child_ret;
    pid_t w_ret = waitpid(pid, &child_ret, 0);
    if (w_ret == -1) {
        std::cerr << "Error: waitpid() failed. " << strerror(errno)
                    << std::endl;
        return errno;
    }
    std::cout << "Child's(" << pid << ") return status: " << child_ret
                << std::endl;
    std::cout << "Accessing " << s_addr << "from parent(" << getpid()
                <<")" << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
}

// [output]
// Forked a child with PID: 88381
// Parent process executing with PID: 88380
// Value at 0x11372d000: 42
// Waiting for child...
// Child process executing with PID: 88381
// Accessing 0x11372d000 from child (88381)
// Value at 0x11372d000: 42
// Child Process modifying value at 0x11372d000
// New value at 0x11372d000for PID: 100
// Child's(88381) return status: 0
// Accessing 0x11372d000from parent(88380)
// Value at 0x11372d000: 42

After the fork(), the child shares the parent’s physical page, which is now under copy-on-write protection. When the child does a read, it gets the earlier value, shared between the parent and the child. This can be seen in the code above and in the illustration below:

Fig 4: Private Anonymous Mapping after fork() - read

When the child writes to the same virtual address, the page fault handler allocates a new physical page and the child’s PTE is modified to point to this new page. The write is then done on the new page. The parent’s and the child’s pages now differ from each other. This is seen in the code above and illustrated below:

Fig 5: Private Anonymous Mapping after fork() - write

Shared Anonymous Mapping

A shared anonymous mapping can be created with the following call:

unsigned short buf_size = sizeof(int);
std::cout << "Allocating " << buf_size << " byte(s)" << std::endl;
int *s_addr = (int *)mmap(NULL,
                    buf_size,
                    PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS,
                    -1,
                    0);
if (s_addr == MAP_FAILED) {
    std::cerr << "Error: mmap() failed. " << strerror(errno) << std::endl;
    return errno;
}

Since this is a shared mapping, when a fork() happens, the parent and the child can see each other’s updates, and this can be used as a facility to enable Inter Process Communication (IPC). As before, the first access faults in a zero-filled page. When the write happens, the physical page is a shared one that is reference counted rather than copied. The count starts at 1 and, when the parent forks a child, the child increases the count to 2. This can be seen below:

Fig 6: Shared Anonymous Mapping

Now the output for the same code seen earlier would be:

pid_t pid = fork();
if (pid == -1) {
    std::cerr << "Error: fork() failed. " << strerror(errno) << std::endl;
    return errno;
} else if (pid == 0) {
    std::cout << "Child process executing with PID: " << getpid() << std::endl;
    std::cout << "Accessing " << s_addr << " from child (" << getpid() << ")"
                << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
    std::cout << "Child Process modifying value at " << s_addr << std::endl;
    *s_addr = 100;
    std::cout << "New value at " << s_addr << "for PID: " << *s_addr
                << std::endl;
    return 0;
} else {
    std::cout << "Forked a child with PID: " << pid << std::endl;
    std::cout << "Parent process executing with PID: " << getpid() << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
    std::cout << "Waiting for child..." << std::endl;
    int child_ret;
    pid_t w_ret = waitpid(pid, &child_ret, 0);
    if (w_ret == -1) {
        std::cerr << "Error: waitpid() failed. " << strerror(errno)
                    << std::endl;
        return errno;
    }
    std::cout << "Child's(" << pid << ") return status: " << child_ret
                << std::endl;
    std::cout << "Accessing " << s_addr << "from parent(" << getpid()
                <<")" << std::endl;
    std::cout << "Value at " << s_addr << ": " << *s_addr << std::endl;
}

// [output]
// Forked a child with PID: 88381
// Parent process executing with PID: 88380
// Value at 0x11372d000: 42
// Waiting for child...
// Child process executing with PID: 88381
// Accessing 0x11372d000 from child (88381)
// Value at 0x11372d000: 42
// Child Process modifying value at 0x11372d000
// New value at 0x11372d000for PID: 100
// Child's(88381) return status: 0
// Accessing 0x11372d000from parent(88380)
// Value at 0x11372d000: 100

Note, you can also use the device file /dev/zero to create an anonymous mapping as such:

fd = open("/dev/zero", O_RDWR);
...
s_addr = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
...

As you can see, this involves more typing than passing MAP_ANONYMOUS, but it achieves the same result through a file descriptor for /dev/zero.

File Backed Memory Mappings

A file backed mapping is backed by an underlying file, which acts as the source of the mapping. A file backed mapping may also be private or shared. A file backed mapping maps the contents of the file onto the address space of the calling process. The process can then read and write the file contents directly in memory, without needing read()/write(). To map a file using mmap(), you pass in the file descriptor along with the offset from the start of the file. If the offset is 0, the mapping starts from the beginning of the file. The length argument denotes the total number of bytes to map, starting from the offset. The region mapped is thus determined by the pair [offset, length]. A pair of [0, size of the file] maps the entire file onto the process’s address space.

The way the file backed memory mapping works is illustrated below (assuming a 32-bit architecture):

Fig 7: File Backed Memory Mapping

Note that memory mappings can also be backed by real/virtual devices. We will not be looking at them in this blog post, but if you are interested, Google might be your friend.

As we shall see below, both private and shared file backed mappings have their uses. We will look at each of them in turn.

Private File Backed Mapping

A private file backed mapping is similar to a private anonymous mapping — the only difference being that the mapping is now backed by a file. A private file backed mapping can be created with the following call:

int fd = open("some_file", O_RDWR);
...
struct stat st;  // "struct" is required: stat is also the name of a function
fstat(fd, &st);
...
char *s_addr = (char *)mmap(NULL,
                    st.st_size,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE,
                    fd,
                    0);
if (s_addr == MAP_FAILED) {
    std::cerr << "Error: mmap() failed. " << strerror(errno) << std::endl;
    return errno;
}

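The above is a straightforward call, but the thing to note is that the flags used for the open() call must be compatible with the protection flags (PROT_READ, PROT_WRITE and PROT_EXEC) used for the call to mmap(). For example, the file must always be opened for reading, and a shared mapping with PROT_WRITE needs the file to be opened with O_RDWR.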

The internal mechanics are similar to what we saw with anonymous mappings. On the first read, the access faults, a physical page is allocated and the requested region of the file is read into that page. The PTE is then modified to point to this physical page. This is illustrated below:

Fig 8: File Backed Private Memory Mapping - Read

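For a write, the process gets its own copy of the page (copy-on-write) and the write is carried out on that copy. The change is visible only to this process and is never written to the underlying file. It remains valid while the process is alive and the mapping exists, and is destroyed when the process dies.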

So, why would we be using a private file backed mapping? The kernel uses them when mapping the initialized data segment of an executable or a shared library. The mappings are private so that the changes made by the process are not carried over to the executable/shared library on disk. The same kind of private mapping is used by debugger tools to work on a copy of the underlying source and make sure the changes are not carried over to it.

Shared File Backed Mapping

A shared file backed mapping can be created with the following call:

int fd = open("some_file", O_RDWR);
...
struct stat st;  // "struct" is required: stat is also the name of a function
fstat(fd, &st);
...
char *s_addr = (char *)mmap(NULL,
                    st.st_size,
                    PROT_READ | PROT_WRITE,
                    MAP_SHARED,
                    fd,
                    0);
if (s_addr == MAP_FAILED) {
    std::cerr << "Error: mmap() failed. " << strerror(errno) << std::endl;
    return errno;
}

Since this is a shared mapping, when a process writes to a location in the mapping, all other processes mapping the same region of the file can see the update. This is because all the processes that map the same region of the file share the same physical pages. The write-back is handled by the kernel, which makes sure the updates are carried out to the underlying file too. You can also make use of the msync() system call to force the changes to be flushed to the underlying file. This will be covered in Part II of this blog post.

Shared file mappings are mainly used for Memory Mapped I/O and Inter Process Communication (IPC). The code to share a file backed memory mapping is similar to the one we saw above. When using the file for Memory Mapped I/O, you map the region that you want to read/write, perform your operations as memory accesses and let the kernel manage the write-back. For IPC, multiple processes map the same region of the file to communicate with each other. This is illustrated below:

Fig 9: File Backed Memory Mapping - IPC

Boundary Cases for File Backed Mapping

As explained earlier, the mmap() system call allocates mappings in multiples of the page size, rounding up the length argument passed to the system call. This gives rise to a few boundary cases that you need to keep in mind.

Let us look at the case where the size of the mapping exceeds the size of the backed file. This can be illustrated as below:

Fig 10: File Backed Memory Mapping - Exceeding File Size

As we can see from above, the call creates a mapping of 8192 bytes (two 4K pages), while the underlying file is only 1000 bytes in size. Hence, the 1000 bytes from the file are brought into the mapping. Since this is less than a full page (4K here), the rest of the first page, up to byte 4095, is filled with zeros. Although this region is shared with other processes mapping the file, updates to it are not carried over to the backing file. The region beyond the first page (from byte 4096 onwards) is part of the mapping but has no file data behind it, and accessing it results in the SIGBUS signal being delivered to the process. Such a mapping might seem pointless, but note that the size of the backing file can be changed, for example by making write() calls, after the call to mmap(). The new file data then becomes accessible through the existing memory mapping.

The other case is where the backing file is larger than the requested mapping, but mmap() is called with a length that is not a multiple of the page size. This is illustrated below:

Fig 11: File Backed Memory Mapping - Not Multiple of Page Size

Here, we have requested a mapping of length 1000 bytes. The kernel rounds the request up to the next multiple of the page size, 4K in this case. A full page of 4096 bytes (bytes 0 to 4095 of the file) is brought in from the file into the physical page representing the mapping. I/O can be performed directly on the page and the kernel will manage the write-back to the backing file. Any access beyond this page falls outside the mapping and will typically result in SIGSEGV being delivered to the process, which then dumps core.

Notes on File Backed Mapping

Using a file backed mapping for I/O may sometimes be beneficial, because it does away with the user space buffer involved in read()/write() calls. With read()/write(), there are two buffers involved in transferring the data: the kernel buffer and the user space buffer. Memory Mapped I/O eliminates the need for a user space buffer, since the page that contains the region of the file can be read from and written to directly once it gets into the page cache (the kernel buffer). This optimization might or might not be what you want, and you need to keep in mind the work the kernel has to do to perform Memory Mapped I/O: setting up PTEs, faulting, physical allocation, data transfer to the kernel buffer etc.

If multiple processes access the same regions of a file, Memory Mapped I/O will probably help, since it eliminates the identical user space buffers that each process would otherwise maintain, and this can reduce the memory pressure on the system. But this needs to be profiled and tested against your use case to see if Memory Mapped I/O actually helps.

Viewing Memory Mappings

You can use various command line utilities, such as pmap, to print out the virtual memory areas of a program. You can also read /proc/<PID>/maps to check out the various memory maps of the process.

vagrant@ubuntu-bionic:~/Projects/Process$ cat /proc/7272/maps
558680331000-558680332000 r-xp 00000000 08:01 263150                     /home/vagrant/Projects/Process/a.out
558680531000-558680532000 r--p 00000000 08:01 263150                     /home/vagrant/Projects/Process/a.out
558680532000-558680533000 rw-p 00001000 08:01 263150                     /home/vagrant/Projects/Process/a.out
55868136f000-558681390000 rw-p 00000000 00:00 0                          [heap]
7f09fd60f000-7f09fd7f6000 r-xp 00000000 08:01 2083                       /lib/x86_64-linux-gnu/libc-2.27.so
7f09fd7f6000-7f09fd9f6000 ---p 001e7000 08:01 2083                       /lib/x86_64-linux-gnu/libc-2.27.so
7f09fd9f6000-7f09fd9fa000 r--p 001e7000 08:01 2083                       /lib/x86_64-linux-gnu/libc-2.27.so
7f09fd9fa000-7f09fd9fc000 rw-p 001eb000 08:01 2083                       /lib/x86_64-linux-gnu/libc-2.27.so
7f09fd9fc000-7f09fda00000 rw-p 00000000 00:00 0
7f09fda00000-7f09fda27000 r-xp 00000000 08:01 2079                       /lib/x86_64-linux-gnu/ld-2.27.so
7f09fdc1e000-7f09fdc20000 rw-p 00000000 00:00 0
7f09fdc27000-7f09fdc28000 r--p 00027000 08:01 2079                       /lib/x86_64-linux-gnu/ld-2.27.so
7f09fdc28000-7f09fdc29000 rw-p 00028000 08:01 2079                       /lib/x86_64-linux-gnu/ld-2.27.so
7f09fdc29000-7f09fdc2a000 rw-p 00000000 00:00 0
7ffda246f000-7ffda2490000 rw-p 00000000 00:00 0                          [stack]
7ffda257f000-7ffda2582000 r--p 00000000 00:00 0                          [vvar]
7ffda2582000-7ffda2584000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

vagrant@ubuntu-bionic:~/Projects/Process$ pmap 7272
7272:   ./a.out
0000558680331000      4K r-x-- a.out
0000558680531000      4K r---- a.out
0000558680532000      4K rw--- a.out
000055868136f000    132K rw---   [ anon ]
00007f09fd60f000   1948K r-x-- libc-2.27.so
00007f09fd7f6000   2048K ----- libc-2.27.so
00007f09fd9f6000     16K r---- libc-2.27.so
00007f09fd9fa000      8K rw--- libc-2.27.so
00007f09fd9fc000     16K rw---   [ anon ]
00007f09fda00000    156K r-x-- ld-2.27.so
00007f09fdc1e000      8K rw---   [ anon ]
00007f09fdc27000      4K r---- ld-2.27.so
00007f09fdc28000      4K rw--- ld-2.27.so
00007f09fdc29000      4K rw---   [ anon ]
00007ffda246f000    132K rw---   [ stack ]
00007ffda257f000     12K r----   [ anon ]
00007ffda2582000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total             4512K

In Part II, we shall see how to perform an immediate write back using the msync() system call, remapping a mapped region using mremap(), additional flags passed to mmap() and other related system calls like mlock(), madvise() and mprotect().

That’s it for Part I.

[1] http://man7.org/tlpi/
[2] https://landley.net/writing/memory-faq.txt
[3] https://man7.org/linux/man-pages/man2/mmap.2.html
[4] https://www.diskodev.com/posts/linux-virtual-memory/
