As you know, the very popular container technology is based on Linux Namespace and Cgroups implementations. What is Linux namespace then? How is Linux namespace used in containers? In this post, some experiments will be conducted to introduce the use of Linux namespace in containers.

namespace

Table of Contents

Linux namespace

There’s a definition from Linux manual introducing Linux namespace:

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.

Linux namespace is a feature of the kernel that isolates a number of system resources, e.g. PID(Process ID), User ID, Netwok, etc…

There are currently 6 different types of namespace implemented in Linux:

Namespace type Syscall arg Kernel version
Mount Namespace CLONE_NEWNS 2.4.19
UTS Namespace CLONE_NEWUTS 2.6.19
IPC Namespace CLONE_NEWIPC 2.6.19
PID Namespace CLONE_NEWPID 2.6.24
Network Namespace CLONE_NEWNET 2.6.29
User Namespace CLONE_NEWUSER 3.8

The namespace API uses the following 3 main system calls:

  • clone(): Create new process. Determine which types of namespace are created based on system call parameters, and their children are included in these namespaces.
  • unshare(): Removes a process from a namespace.
  • setns(): Add the process to a namespace.

User Namespace

User namespace for isolating UID and GID, the key point is that the unprivileged user outside the namespace can be mapped to the root-user inside the new namespace by creating a UID namespace with UID mappings.

Starting with Linux kernel 3.8, user namespace can be created by non-root users and this user can be mapped as root within the container and have root privileges within the container.

Let’s take an example, on my Ubuntu I want to create a user namespace and use non-root user to run the process:

$ id
uid=1000(lushenle) gid=1000(lushenle) groups=1000(lushenle)

to the root user in the container:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,

		UidMappings: []syscall.SysProcIDMap{
			{
				ContainerID: 0,
				HostID:      syscall.Getuid(),
				Size:        1,
			},
		},

		GidMappings: []syscall.SysProcIDMap{
			{
				ContainerID: 0,
				HostID:      syscall.Getgid(),
				Size:        1,
			},
		},
	}

	cmd.Stdout = os.Stdout
	cmd.Stdin = os.Stdin
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

UidMappings implements the user mapping between host and container, it will map the user id in the host to the user id in the container, the size parameter indicates it’s a contiguous range mapping. If size is 10 and containerID is 0, HostID is 1000, it means 1000-1010 will be mapped to 0-10.

GidMappings is the same mechanism as UidMappings, it represents Group id.

If UidMappings and GidMappings are not specified, then it will be mapped to nobody user and nogroup group.

A brief explanation of the code that, exec.Command("sh") this is used to specify the initial command within the new process from which the fork is derived, and is executed by default using sh. The CLONE_NEWUSER identifier to create a UTS namespace. Golang wraps a call to the clone() function for us, and this code executes into a sh runtime environment.

Network Namespace

Network namespace is used to isolate network devices, IP address ports, and other network stacks.

Network namespace allows each container to have its own independent (virtual) network device, and applications within the container can be bound to their own ports. There are no port conflicts between namespaces, once the network-bridge is created in the host, it is easy to communicate between containers, and different containers can use the same port.

Let’s take an example, view the network devices of the host:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether b0:5c:da:a9:3c:36 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.18.127.149/24 brd 10.18.127.255 scope global dynamic noprefixroute eno1
       valid_lft 656027sec preferred_lft 656027sec
    inet6 fe80::22de:8bb9:a987:6ce5/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: wlo1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ec:5c:68:00:22:f3 brd ff:ff:ff:ff:ff:ff
    altname wlp2s0

to the root user in the container:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNET,
	}

	cmd.Stdout = os.Stdout
	cmd.Stdin = os.Stdin
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

It is not difficult to find that there are lo, eno1, wlo1 network devices on the host and only lo network devices in the container. This is enough to show that the network between Network namespace and the host is isolated.

PID Namespace

PID namespace is used for the off-process pid. The same process in different pid namespace can use different pid. This way you can understand that the pid of the process running in the foreground inside the container is always 1, and outside the container, you can find the same process with a different pid, which is what the pid namespace does.

Let’s take an example, creat a pid namespace using golang:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID,
	}

	cmd.Stdout = os.Stdout
	cmd.Stdin = os.Stdin
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

Run and check its pid:

# go run pid.go
# echo $$
1

On the host, the pid is:

Open new terminal, check the pid of this process use the command pstree -pl.

host_pid

We can see that this process has a pid of 1 in the container and a pid of 90918 in the host. This means that the pid 90918 is mapped to namespace and the pid is 1.

UTS Namespace

UTS namespace is used to isolate the nodename and hostname system identifiers. Within the UTS namespace, each namespace is allowed to have its own hostname.

The following example uses golang to demonstrate the UTS namespace:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

Execute the go run uts.go command and output the current pid:

# go run uts.go
# echo $$
92301

Verify that the parent and child processes are not in the same uts namespace:

Open a new terminal and use the command pstree -pl to get the pid of the parent process, the pid of the parent process obtained here is 92296.

# readlink /proc/92296/ns/uts
uts:[4026531838]
# readlink /proc/92301/ns/uts
uts:[4026532724]

We can see that they are indeed not in the same UTS namespace. UTS namespace isolate the hostname, so modifying the hostname in this environment will not affect the host, verify it.

# hostname -b uts.idocker.io
# hostname
uts.idocker.io

Start a new shell on the host and check the hostname:

# hostname
lushenle-HP-348-G7

As we expected, the hostname of the host was not affected by the changes inside the container.

IPC Namespace

IPC namespace is used to isolate the System V IPC and POSIX message queues. Each IPC namespace has its own System V IPC and POSIX message queue.

Let’s demonstrate the capabilities provided by IPC namespace with an example:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWIPC,
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

First, open a new terminal on the host and look at the existing ipc message queues:

# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

There are currently no IPC queues, so create a message queue:

# ipcmk -Q
Message queue id: 1

# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x2dfff2c8 1          root       644        0            0

Above we have created an IPC message queue with an ID of 0. Next, run the golang code and check ipc message queues using a new terminal:

# go run ipc.go
# ipcs	-q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

IPC message queues are not visible in the IPC namespace.

The above experiment shows that the message queues created on the host are not visible in the new namespace, which means that the IPC namespace has been created successfully and the IPC has been isolated.

Mount Namespace

Mount namespace is used to isolate the mount view seen by each process. The file system hierarchy is seen differently in different namespace processes. Calling mount and umount in mount namespace only affects the filesystem in the current namespace, it has no effect on the global filesystem.

Mount namespace is the first namespace type implemented by the Linux kernel, so its system call argument is NEWNS (short for New Namespace).

When you see this, you might think of chroot(). It is possible to program the root of a subdirectory to touch. Surprisingly, mount namespace not only does this, but does it in a more flexible and secure way.

Okay, let’s go through the example to illustrate the capabilities provided by mount namespace.

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS,
	}

	cmd.Stdout = os.Stdout
	cmd.Stdin = os.Stdin
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

Run the above code and look at the contents of the files under /proc use command ls /proc.

/proc is a pseudo-file system that provides additional mechanisms to send information to processes via the kernel and kernel modules.

At this point, you will find that it seems to be the same as the host’s. Next, we mount /proc to our namespace and look at the contents of the /proc directory again. Very different from before, there are a lot fewer documents that.

# mount -t proc proc /proc
# ls /proc
...

The process with a pid of 1 in the mount namespace is sh. This means that the current mount in the mount namespace is isolated from the external space. The mount operation does not affect the host, which is the same feature used by docker volume.

Summary

In this post, introduces each of the 6 different namespaces in Linux and gives a brief hands-on look at them with golang code. The important role of namespace for resource isolation startup, and how to run the application into the namespace we specified. Application containerization is based on Linux namespace, which shows the importance of namespace technology.

Reference