Postcopy
‘Postcopy’ migration is a way to deal with migrations that refuse to converge (or take too long to converge). Its advantage is that there is an upper bound on the amount of migration traffic and on the time the migration takes; its disadvantage is that during the postcopy phase a failure of either side causes the guest to be lost.
In postcopy the destination CPUs are started before all the memory has been transferred, and accesses to pages that are yet to be transferred cause a fault that’s translated by QEMU into a request to the source QEMU.
Postcopy can be combined with precopy (i.e. normal migration) so that if precopy doesn’t finish in a given time the switch is made to postcopy.
Enabling postcopy
To enable postcopy, issue this command on the monitor (both source and destination) prior to the start of migration:
migrate_set_capability postcopy-ram on
The normal commands are then used to start a migration, which is still started in precopy mode. Issuing:
migrate_start_postcopy
will now cause the transition from precopy to postcopy. It can be issued immediately after migration is started or any time later on. Issuing it after the end of a migration is harmless.
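The same sequence can also be driven over QMP instead of the human monitor. The following is a minimal sketch that only builds the JSON messages, assuming the QMP commands migrate-set-capabilities, migrate and migrate-start-postcopy. The helper names and the destination URI are illustrative placeholders, and delivering the messages to each QEMU is left out.

```python
import json

def set_capability(name, enabled=True):
    """Build a QMP migrate-set-capabilities message for one capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {"capabilities": [{"capability": name, "state": enabled}]},
    }

def postcopy_sequence(uri):
    """Return (destination_messages, source_messages) in the order they are issued."""
    enable = set_capability("postcopy-ram")
    destination = [enable]                      # capability must be set on both sides
    source = [
        enable,
        {"execute": "migrate", "arguments": {"uri": uri}},   # starts in precopy mode
        {"execute": "migrate-start-postcopy"},               # switch to postcopy
    ]
    return destination, source

# Placeholder URI, for illustration only.
dst_msgs, src_msgs = postcopy_sequence("tcp:dest.example:4444")
for msg in src_msgs:
    print(json.dumps(msg))
```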
Blocktime is a postcopy live migration metric that shows how long a vCPU spent in interruptible sleep due to page faults. The metric is calculated both for all vCPUs as an overlapped value and separately for each vCPU. These values are calculated on the destination side. To enable postcopy blocktime calculation, enter the following command on the destination monitor:
migrate_set_capability postcopy-blocktime on
Postcopy blocktime can be retrieved with the query-migrate QMP command. The postcopy-blocktime value in the reply shows the overlapped blocking time for all vCPUs; postcopy-vcpu-blocktime shows a list of blocking times, one per vCPU.
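As an illustration, a management script might pick these two fields out of the query-migrate reply as sketched below. The function name and the sample reply are hypothetical, and the numbers are made up.

```python
def summarize_blocktime(info):
    """Summarize blocktime fields from the 'return' dict of query-migrate.

    The fields are treated as optional here, so a reply without them
    yields None / an empty list instead of raising.
    """
    total = info.get("postcopy-blocktime")
    per_vcpu = info.get("postcopy-vcpu-blocktime", [])
    # Index of the vCPU that was blocked the longest, if any data is present.
    worst = max(range(len(per_vcpu)), key=per_vcpu.__getitem__) if per_vcpu else None
    return {"overlapped": total, "per_vcpu": per_vcpu, "worst_vcpu": worst}

# Hypothetical reply fragment; values are invented for illustration.
sample = {
    "status": "completed",
    "postcopy-blocktime": 12,
    "postcopy-vcpu-blocktime": [30, 5, 41, 0],
}
summary = summarize_blocktime(sample)
print(summary)
```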
Note
During the postcopy phase, the bandwidth limits set using
migrate_set_parameter are ignored (to avoid delaying requested pages that
the destination is waiting for).
Postcopy internals
State machine
Postcopy moves through a series of states (see postcopy_state) from ADVISE->DISCARD->LISTEN->RUNNING->END
Advise
Set at the start of migration if postcopy is enabled, even if it hasn’t had the start command; here the destination checks that its OS has the support needed for postcopy, and performs setup to ensure the RAM mappings are suitable for later postcopy. The destination will fail early in migration at this point if the required OS support is not present. (Triggered by reception of POSTCOPY_ADVISE command)
Discard
Entered on receipt of the first ‘discard’ command; prior to the first Discard being performed, hugepages are switched off (using madvise) to ensure that no new huge pages are created during the postcopy phase, and to cause any huge pages that have discards on them to be broken.
Listen
The first command in the package, POSTCOPY_LISTEN, switches the destination state to Listen, and starts a new thread (the ‘listen thread’) which takes over the job of receiving pages off the migration stream, while the main thread carries on processing the blob. With this thread able to process page reception, the destination now ‘sensitises’ the RAM to detect any access to missing pages (on Linux using the ‘userfault’ system).
Running
POSTCOPY_RUN causes the destination to synchronise all state and start the CPUs and IO devices running. The main thread now finishes processing the migration package and now carries on as it would for normal precopy migration (although it can’t do the cleanup it would do as it finishes a normal migration).
End
The listen thread can now quit and perform the cleanup of migration state; the migration is now complete.
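The state sequence described above can be modelled as a small table of allowed transitions. This is a sketch of the linear order given in the text only; the class and function names are illustrative and are not QEMU's own code, which may permit transitions not shown here.

```python
from enum import Enum

class PostcopyState(Enum):
    ADVISE = 1
    DISCARD = 2
    LISTEN = 3
    RUNNING = 4
    END = 5

# Only the linear ADVISE->DISCARD->LISTEN->RUNNING->END order from the text.
NEXT_STATE = {
    PostcopyState.ADVISE: PostcopyState.DISCARD,
    PostcopyState.DISCARD: PostcopyState.LISTEN,
    PostcopyState.LISTEN: PostcopyState.RUNNING,
    PostcopyState.RUNNING: PostcopyState.END,
}

def advance(current, target):
    """Return target if it is the successor of current, else raise ValueError."""
    if NEXT_STATE.get(current) is not target:
        raise ValueError(f"unexpected transition {current.name} -> {target.name}")
    return target

state = PostcopyState.ADVISE
for step in (PostcopyState.DISCARD, PostcopyState.LISTEN,
             PostcopyState.RUNNING, PostcopyState.END):
    state = advance(state, step)
print(state.name)
```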
Device transfer
Loading device data may cause the device emulation to access guest RAM, which may trigger faults that have to be resolved by the source. The migration stream therefore has to be able to respond with page data during the device load, so the device data has to be read from the stream completely before the device load begins, freeing the stream up. This is achieved by ‘packaging’ the device data into a blob that’s read in one go.
Source behaviour
Until postcopy is entered the migration stream is identical to normal precopy, except for the addition of a ‘postcopy advise’ command at the beginning, to tell the destination that postcopy might happen. When postcopy starts the source sends the page discard data and then forms the ‘package’ containing:
Command: ‘postcopy listen’
The device state
A series of sections, identical to the precopy stream’s device state stream, containing everything except postcopiable devices (i.e. RAM)
Command: ‘postcopy run’
The ‘package’ is sent as the data part of a Command: CMD_PACKAGED, and the
contents are formatted in the same way as the main migration stream.
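The ordering inside the package can be pictured with the sketch below. It uses plain labels rather than the real wire format, and the function names and device section names are hypothetical.

```python
def build_package(device_sections):
    """Return the package contents in stream order (labels only, not wire format)."""
    package = [("command", "postcopy listen")]              # always first
    package += [("section", name) for name in device_sections]
    package.append(("command", "postcopy run"))             # always last
    return package

def wrap(package):
    """The package travels as the data part of a single CMD_PACKAGED command."""
    return ("command", "CMD_PACKAGED", package)

# Hypothetical device section names.
pkg = build_package(["device-0", "device-1", "device-2"])
outer = wrap(pkg)
for entry in pkg:
    print(entry)
```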
During postcopy the source scans the list of dirty pages and sends them to the destination without being requested (in much the same way as precopy). However, when a page request is received from the destination, the dirty page scanning restarts from the requested location. This causes requested pages to be sent quickly, and also causes the pages directly after the requested page to be sent quickly, in the hope that they are likely to be used by the destination soon.
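The effect of restarting the scan at a requested page can be seen in a toy simulation. This is a simplified model under stated assumptions, not QEMU's implementation: pages are plain indices, one page is sent per step, and the `requests` mapping (step number to requested page) is invented for the example.

```python
def send_order(num_pages, dirty, requests):
    """Return the order in which dirty pages are sent.

    dirty:    iterable of dirty page indices
    requests: dict mapping 'pages sent so far' -> page requested at that moment
    """
    remaining = {p for p in dirty if 0 <= p < num_pages}
    cursor = 0
    order = []
    while remaining:
        # A page request moves the scan cursor to the requested location.
        requested = requests.get(len(order))
        if requested is not None and requested in remaining:
            cursor = requested
        # Scan forward (wrapping around) to the next page still needing sending.
        while cursor not in remaining:
            cursor = (cursor + 1) % num_pages
        order.append(cursor)
        remaining.discard(cursor)
    return order

# After two pages have been sent, the destination requests page 6.
order = send_order(8, range(8), {2: 6})
print(order)
```

In this run page 6 is sent as soon as it is requested, followed by page 7, before the scan wraps around to the pages it skipped.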
Destination behaviour
Initially the destination looks the same as precopy, with a single thread reading the migration stream; the ‘postcopy advise’ and ‘discard’ commands are processed to change the way RAM is managed, but don’t affect the stream processing.
------------------------------------------------------------------------------
                        1      2   3     4 5                      6   7
main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
thread                             |       |
                                   |     (page request)
                                   |        \___
                                   v            \
listen thread:                     --- page -- page -- page -- page -- page --
                                   a   b        c
------------------------------------------------------------------------------
- On receipt of CMD_PACKAGED (1): all the data associated with the package (the ( … ) section in the diagram) is read into memory, and the main thread recurses into qemu_loadvm_state_main to process the contents of the package (2), which contains commands (3, 6) and devices (4…).
- On receipt of ‘postcopy listen’ (3), i.e. the first command in the package, a new thread (a) is started that takes over servicing the migration stream, while the main thread carries on loading the package. The listen thread loads normal background page data (b), but if a fault happens during a device load (5), the returned page (c) is loaded by the listen thread, allowing the main thread’s device load to carry on.
- The last thing in the CMD_PACKAGED is a ‘RUN’ command (6), letting the destination CPUs start running. At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour and is no longer used by migration, while the listen thread carries on servicing page data until the end of migration.
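The division of labour between the two threads can be sketched as follows. This is a loose analogy rather than QEMU code: the `PageStore` class, the device names and the in-memory `stream` list are all hypothetical, and the page request sent back to the source is omitted. A blocking read stands in for a fault on a missing page.

```python
import threading

class PageStore:
    """Stand-in for destination RAM: reads block until the page has arrived."""

    def __init__(self):
        self._pages = {}
        self._cond = threading.Condition()

    def place(self, index, data):
        # Called by the listen thread for every page read off the stream.
        with self._cond:
            self._pages[index] = data
            self._cond.notify_all()

    def read(self, index, timeout=5.0):
        # Called by the main thread; waits like a faulting access would.
        with self._cond:
            if not self._cond.wait_for(lambda: index in self._pages, timeout):
                raise TimeoutError(f"page {index} never arrived")
            return self._pages[index]

def listen_thread(stream, store):
    for index, data in stream:
        store.place(index, data)

def load_devices(devices, store):
    # Each hypothetical device touches one guest page while loading.
    return {name: store.read(page) for name, page in devices}

store = PageStore()
stream = [(n, f"page-{n}") for n in range(8)]
listener = threading.Thread(target=listen_thread, args=(stream, store))
listener.start()
loaded = load_devices([("dev-a", 5), ("dev-b", 2)], store)
listener.join()
print(loaded)
```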
Source side page bitmap
The ‘migration bitmap’ in postcopy is basically the same as in precopy: each bit indicates that a page is ‘dirty’, i.e. needs sending. During the precopy phase this is updated as the CPU dirties pages; during postcopy, however, the CPUs are stopped and nothing should dirty anything any more. Instead, dirty bits are cleared when the relevant pages are sent during postcopy.
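A one-bit-per-page bitmap of this kind can be sketched as below. The class is illustrative only and does not reflect QEMU's data structures; it shows the dirty count only ever decreasing once pages are being sent and nothing sets new bits.

```python
class DirtyBitmap:
    """One bit per page; a set bit means the page still needs sending."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.bits = bytearray((num_pages + 7) // 8)

    def set_dirty(self, page):
        self.bits[page // 8] |= 1 << (page % 8)

    def clear_dirty(self, page):
        # Mask to 8 bits so the value stays valid for a bytearray element.
        self.bits[page // 8] &= ~(1 << (page % 8)) & 0xFF

    def is_dirty(self, page):
        return bool(self.bits[page // 8] & (1 << (page % 8)))

    def count(self):
        return sum(bin(b).count("1") for b in self.bits)

bitmap = DirtyBitmap(16)
for page in (1, 4, 9):          # left dirty at the end of the precopy phase
    bitmap.set_dirty(page)
for page in (4, 9):             # sent during postcopy: bits are cleared
    bitmap.clear_dirty(page)
print(bitmap.count(), bitmap.is_dirty(1))
```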
Postcopy features
Postcopy recovery
Compared to precopy, postcopy is special in its error handling. When an error happens (in this case, mostly network errors), QEMU cannot easily fail a migration because VM data resides in both the source and destination QEMU instances. Instead, when an issue happens, QEMU on both sides goes into a paused state, and a recovery phase is needed to continue the paused postcopy migration.
The recovery phase normally contains a few steps:
When a network issue occurs, both QEMUs go into the POSTCOPY_PAUSED migration state.
When the network is recovered (or a new network is provided), the admin can set up the new channel for migration using the QMP command ‘migrate-recover’ on the destination node, preparing for a resume.
On the source host, the admin can continue the interrupted postcopy migration using the QMP command ‘migrate’ with the resume=true flag set. The source QEMU will go into the POSTCOPY_RECOVER_SETUP state, trying to re-establish the channels.
When both sides of QEMU successfully reconnect using a new or fixed-up channel, they go into the POSTCOPY_RECOVER state. A handshake procedure is then needed to properly synchronize the VM states between the two QEMUs to continue the postcopy migration. For example, pages may have been sent during the window when the network was interrupted; the handshake guarantees that pages lost in flight are resent.
After a proper handshake synchronization, QEMU continues the postcopy migration on both sides and goes back to the POSTCOPY_ACTIVE state.
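The two admin actions in the steps above might look as follows when expressed as QMP messages. This is a sketch that only builds the JSON. The URIs are placeholders, the function name is illustrative, and the status strings are assumed to be the lower-case forms of the state names used in the text.

```python
import json

def recovery_messages(listen_uri, connect_uri):
    """Build the destination and source QMP messages for a postcopy resume."""
    destination = {"execute": "migrate-recover",
                   "arguments": {"uri": listen_uri}}
    source = {"execute": "migrate",
              "arguments": {"uri": connect_uri, "resume": True}}
    return destination, source

# Status values the source is expected to pass through during a recovery
# (assumed spellings of the states named in the steps above).
SOURCE_STATES = ["postcopy-paused", "postcopy-recover-setup",
                 "postcopy-recover", "postcopy-active"]

# Placeholder addresses, for illustration only.
dst_msg, src_msg = recovery_messages("tcp:0:4445", "tcp:dest.example:4445")
print(json.dumps(dst_msg))   # issue on the destination first
print(json.dumps(src_msg))   # then issue on the source
```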
During a paused postcopy migration, the VM can logically still continue running, and it is not impacted by accesses to pages that were already migrated to the destination VM before the interruption happened. However, if any of the missing pages is accessed on the destination VM, the accessing VM thread is halted waiting for the page to be migrated, which means it can be halted until the recovery is complete.
The impact of accessing missing pages depends on the configuration of the guest. For example, with async page fault enabled, the guest can logically schedule out the threads accessing missing pages proactively.
Postcopy with hugepages
Postcopy now works with hugetlbfs backed memory:
a) The Linux kernel on the destination must support userfault on hugepages.
b) The huge-page configuration on the source and destination VMs must be identical; i.e. RAMBlocks on both sides must use the same page size.
c) Note that -mem-path /dev/hugepages will fall back to allocating normal RAM if it doesn’t have enough hugepages, triggering (b) to fail. Using -mem-prealloc enforces the allocation using hugepages.
d) Care should be taken with the size of hugepage used; postcopy with 2MB hugepages works well, however 1GB hugepages are likely to be problematic since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link, and until the full page is transferred the destination thread is blocked.
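The ~1 second figure follows from simple arithmetic, sketched below. The calculation assumes the link runs at its full nominal rate and ignores protocol overhead, so real transfers will be somewhat slower.

```python
def transfer_seconds(page_bytes, link_bits_per_second):
    """Idealised time to push one page across the link (no protocol overhead)."""
    return page_bytes * 8 / link_bits_per_second

LINK_10G = 10e9                      # 10 Gbps
small = transfer_seconds(2 * 1024**2, LINK_10G)   # one 2MB hugepage
large = transfer_seconds(1024**3, LINK_10G)       # one 1GB hugepage

print(f"2MB hugepage: {small * 1e3:.2f} ms")
print(f"1GB hugepage: {large:.2f} s")
```

A 2MB page takes under 2 milliseconds, while a 1GB page takes close to 0.9 seconds, during which the faulting destination thread stays blocked.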
Postcopy preemption mode
Postcopy preempt is a capability introduced in the QEMU 8.0 release. It allows urgent pages (those explicitly requested by the destination QEMU because of a page fault) to be sent in a separate preempt channel, rather than queued in the background migration channel. Anyone who cares about page fault latency during a postcopy migration should enable this feature. It is not enabled by default.
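A minimal sketch of turning the feature on over QMP is shown below. It assumes the capability is named postcopy-preempt, that it is set alongside postcopy-ram, and that it is applied on both source and destination before migration starts; the helper name is illustrative.

```python
import json

def preempt_capabilities_message():
    """Build one QMP message enabling postcopy together with preempt mode."""
    wanted = ["postcopy-ram", "postcopy-preempt"]   # assumed capability names
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": c, "state": True} for c in wanted]
        },
    }

preempt_msg = preempt_capabilities_message()
print(json.dumps(preempt_msg))
```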