
Checkpoint/restore tool v1.0

After years of work, version 1.0 of the checkpoint/restore tool is available. This is a mostly user-space-based tool that is able to capture the state of a set of processes to persistent storage and restore it at some future time, possibly on a different system. See this 2013 Kernel Summit article for details on the current state of this functionality.


Checkpoint/restore tool v1.0

Posted Nov 25, 2013 19:14 UTC (Mon) by arekm (subscriber, #4846) [Link]

So which kernel provides all necessary interfaces for criu tool to work?

criu check could actually say which kernel option is missing, since that is not obvious:

$ uname -a
Linux t400 3.12.1 #73 SMP PREEMPT Wed Nov 20 22:46:34 CET 2013 x86_64 Intel(R)_Core(TM)2_Duo_CPU_____T9400__@_2.53GHz PLD Linux
$ sudo criu check
/proc/<pid>/map_files directory is missing.
(00.001273) Error (sk-unix.c:353): Can't stat socket 0x31e00f(./tmp/ksocket-arekm/plasma-desktophK7986.slave-socket): No such file or directory
/proc/sys/kernel/ns_last_pid sysctl is missing.
System call kcmp is not supported
prctl: PR_GET_TID_ADDRESS is not supported
/proc/sys/kernel/sem_next_id sysctl is missing.
(00.002998) Warn (cr-check.c:514): Dirty tracking is OFF. Memory snapshot will not work.
/proc/<pid>/timers file is missing.

Checkpoint/restore tool v1.0

Posted Nov 25, 2013 19:31 UTC (Mon) by rmini (subscriber, #4991) [Link]

It looks like your kernel may not be built with the CONFIG_CHECKPOINT_RESTORE option, which enables the missing APIs.

Checkpoint/restore tool v1.0

Posted Nov 25, 2013 19:42 UTC (Mon) by arekm (subscriber, #4846) [Link]

It's also hidden under CONFIG_EXPERT option which I had disabled. Thanks.
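For anyone hitting the same wall, here is a minimal sketch of a check for the two options named in this exchange. The function name check_cr_config is made up for illustration, and it assumes you have an uncompressed copy of the running kernel's config; where that lives varies by distribution (often /boot/config-$(uname -r), or /proc/config.gz, which needs decompressing first).

```shell
# Report whether the options mentioned in this thread are set to "y"
# in a plain-text kernel config file.
# check_cr_config is a hypothetical helper name, not part of criu.
check_cr_config() {
    config_file=$1
    for opt in CONFIG_EXPERT CONFIG_CHECKPOINT_RESTORE; do
        # -x matches the whole line, so "# CONFIG_FOO is not set" never matches
        if grep -qx "${opt}=y" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
        fi
    done
}
```

Something like check_cr_config /boot/config-$(uname -r) would then print one line per option, assuming your distribution installs the config at that path.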

Checkpoint/restore tool v1.0

Posted Nov 25, 2013 20:19 UTC (Mon) by arekm (subscriber, #4846) [Link]

On 3.12.1, it seems there is still much work to do before this is usable.

dumping mc:
$ sudo criu dump -t 6089
(00.002344) Error (sk-unix.c:353): Can't stat socket 0x66bf(./tmp/ksocket-arekm/kmailoy5661.slave-socket): No such file or directory
(00.002694) Error (sk-unix.c:353): Can't stat socket 0x45f6(./tmp/ksocket-arekm/plasma-desktopKS5342.slave-socket): No such file or directory
(00.009186) Error (tty.c:203): tty: Can't obtain ptmx index: Inappropriate ioctl for device
(00.009231) Error (cr-dump.c:1491): Dump files (pid: 6089) failed with -1
(00.009661) Error (cr-dump.c:1811): Dumping FAILED.

restoring bash:
$ sudo criu restore
(00.004443) 6216: Error (tty.c:178): tty: Found slave peer index 4 without correspond master peer
(00.004621) Error (cr-restore.c:1062): 6216 exited, status=1
(00.004636) Error (cr-restore.c:1597): Restoring FAILED

dumping && restoring "sleep 1000 < /dev/null > /dev/null &"
$ sudo criu dump -t 6280 --shell-job
(00.002147) Error (sk-unix.c:353): Can't stat socket 0x45f6(./tmp/ksocket-arekm/plasma-desktopKS5342.slave-socket): No such file or directory
$ sudo criu restore
(00.018195) 6280: Error (tty.c:178): tty: Found slave peer index 4 without correspond master peer
(00.018406) Error (cr-restore.c:1062): 6280 exited, status=1
(00.018430) Error (cr-restore.c:1597): Restoring FAILED.

Checkpoint/restore tool v1.0

Posted Nov 25, 2013 21:06 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> $ sudo criu restore
> (00.004443) 6216: Error (tty.c:178): tty: Found slave peer index 4 without correspond master peer

You're missing `--shell-job` here (from my reading of the docs[1]). Also probably needed for `mc` dumping as well.

For the socket issues, try `--ext-unix-sk`[2].

[1]http://criu.org/Advanced_usage#Shell_jobs_C.2FR
[2]http://criu.org/Advanced_usage#External_UNIX_sockets
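To make the suggestion concrete, here is a small dry-run sketch that only prints a matching dump/restore pair with both options passed on each side. The helper name criu_shell_job_cmds and the images directory argument are made up for illustration; -t, -D, --shell-job and --ext-unix-sk are existing criu options, but whether they clear the errors shown above is untested.

```shell
# Print (do not run) a dump/restore pair that passes the same options
# on both sides. criu_shell_job_cmds is a hypothetical helper name;
# images_dir is whatever directory you choose for the image files.
criu_shell_job_cmds() {
    pid=$1
    images_dir=$2
    echo "criu dump -t $pid -D $images_dir --shell-job --ext-unix-sk"
    echo "criu restore -D $images_dir --shell-job --ext-unix-sk"
}
```

Printing rather than executing keeps the sketch safe to paste; the point is simply that a process dumped as a shell job has to be restored as one too.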

Checkpoint/restore tool v1.0

Posted Nov 25, 2013 19:38 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

On Fedora 20, all I get is a warning about a tun interface, but "OK" after that (with and without the --ms options).

Checkpoint/restore tool v1.0

Posted Nov 26, 2013 11:42 UTC (Tue) by Anssi (subscriber, #52242) [Link]

This reminds me a bit of the old CryoPID project: http://cryopid.berlios.de/

I guess this new solution is considerably more robust :)

Checkpoint/restore tool v1.0

Posted Nov 27, 2013 22:03 UTC (Wed) by gvy (guest, #11981) [Link]

Glad to hear ;-)

Checkpoint/restore tool v1.0

Posted Nov 27, 2013 23:04 UTC (Wed) by kolyshkin (guest, #34342) [Link]

I would like to point out that version 1.0 also signifies an important milestone. Since all the stuff needed for CRIU to work is already in the vanilla kernel, CRIU is no longer a "mostly user-space-based tool"; it is just a tool.

Also, it seems like many people do not quite grasp why CRIU fails to checkpoint something (usually "an external resource"). So here's the article that tries to explain it http://criu.org/What_cannot_be_checkpointed

Hibernation

Posted Nov 28, 2013 2:49 UTC (Thu) by bojan (subscriber, #14302) [Link]

Maybe I'm misunderstanding what this is, but it looks like a better way to do hibernation. Instead of mucking around with restoring the original kernel, one just boots a new one and restores the apps (which in the end is the whole point of hibernation).

Hibernation

Posted Nov 28, 2013 3:36 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

I actually spent a while trying something like this a few years ago. The answer is basically "yes, but". You can handle it for most apps, but some have pushed context into the kernel and assume that it's still there. For instance, if an application has set an output rate on the audio device, that would be lost. You'd need a mechanism for telling the application to restore its state manually. That's a violation of current expectations.

The audio case could be fixed by special casing Pulseaudio, but you have similar issues with some other hardware. Fixing it would be a pretty significant amount of effort.

Hibernation

Posted Nov 28, 2013 3:40 UTC (Thu) by raven667 (subscriber, #5198) [Link]

That's one way you could use this; another would be to do migrations of live applications, or virtual machines, between individual hosts in a cluster.

Hibernation

Posted Nov 28, 2013 5:47 UTC (Thu) by dlang (guest, #313) [Link]

the difference is that you don't have to do all applications, checkpoint/restore can freeze and restore individual applications.

you may want to move some apps to a different machine, or just free up the resources that some app is using without losing the work that it's done.

Also, hibernation doesn't allow you to change kernels in the stop/start process, and it keeps kernel state.

they may sound similar, but since hibernation and checkpoint/restore have different scope to what they deal with, the resulting uses are very different.


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds