[Data Liberation] Re-entrant WP_Stream_Importer #2004
Merged
Adds re-entrancy semantics to the importer API to enable pausing and resuming data imports:
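As a rough illustration of the pause-and-resume idea, here is a minimal, self-contained sketch. Toy_Importer, its create() factory, and next_step() are hypothetical stand-ins invented for this example, not the real WP_Stream_Importer API; only the get_reentrancy_cursor() name mirrors the method described in this PR. The cursor is kept in a variable here because persisting it is left to the caller.

```php
<?php
// Hypothetical toy importer for illustration only; NOT the real WP_Stream_Importer.
final class Toy_Importer {
    private array $entities;
    private int $next_index;
    public array $imported = array();

    private function __construct( array $entities, int $next_index ) {
        $this->entities   = $entities;
        $this->next_index = $next_index;
    }

    // Starts from the beginning, or from a previously exported cursor.
    public static function create( array $entities, ?string $cursor = null ): self {
        $next_index = null === $cursor ? 0 : (int) $cursor;
        return new self( $entities, $next_index );
    }

    // Imports a single entity; returns false once nothing is left to import.
    public function next_step(): bool {
        if ( $this->next_index >= count( $this->entities ) ) {
            return false;
        }
        $this->imported[] = $this->entities[ $this->next_index ];
        ++$this->next_index;
        return true;
    }

    // Exports the current position as a string the caller can store anywhere.
    public function get_reentrancy_cursor(): string {
        return (string) $this->next_index;
    }
}

$entities = array( 'post-1', 'post-2', 'post-3', 'post-4', 'post-5' );

// First run: pretend the time budget only allows two steps.
$importer = Toy_Importer::create( $entities );
for ( $i = 0; $i < 2 && $importer->next_step(); $i++ ) {
    // Each iteration imports one entity.
}
$cursor = $importer->get_reentrancy_cursor();

// Second run, e.g. in a later request: a new instance resumes at the cursor.
$resumed = Toy_Importer::create( $entities, $cursor );
while ( $resumed->next_step() ) {
    // Imports the remaining entities.
}
```

The first instance never needs to survive between runs. The cursor string alone is enough to continue, which is what makes recovering from a crash or a timeout possible.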
Motivation
Most WordPress importers fail because they assume a happy path: we have enough memory, we have enough time, all the assets will be available, and so on.
In Data Liberation, I want to assume the worst possible path through thorny quicksand in full sun with venomous wasps stinging us. We'll run out of memory after the first post, all the assets will be 40GB large, and half of them won't be possible to download.
Pausing, resuming, and recovering from errors should be a basic primitive of the system. The first step to supporting that is the ability to suspend the import operation and restart it from the same spot later on. And that's exactly what this PR adds.
Re-entrancy interface
This PR doesn't store any information in the database yet. It merely adds the plumbing for pausing and resuming the WP_Stream_Importer instance.
WP_Byte_Stream re-entrancy
The WP_Byte_Stream interface directly exposes the tell(): int and seek($offset) methods. There's no need for anything fancier than that – we're only interested in an offset in the stream. It seems to work well for simple byte streams.
My only worry is that we may need to revisit this interface later on to support fetching fixed-size chunks from large files using byte ranges.
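To make the tell()/seek() contract concrete, here is a minimal sketch using an in-memory stream. Toy_Byte_Stream and its read() method are hypothetical names made up for this example and are not the real WP_Byte_Stream implementation; only the tell(): int and seek($offset) signatures follow the description above.

```php
<?php
// Hypothetical in-memory stand-in for illustration; NOT the real WP_Byte_Stream.
final class Toy_Byte_Stream {
    private string $bytes;
    private int $offset = 0;

    public function __construct( string $bytes ) {
        $this->bytes = $bytes;
    }

    // Reads up to $length bytes and advances the offset by what was read.
    public function read( int $length ): string {
        $chunk         = (string) substr( $this->bytes, $this->offset, $length );
        $this->offset += strlen( $chunk );
        return $chunk;
    }

    // Reports the current offset; this integer is the entire resumable state.
    public function tell(): int {
        return $this->offset;
    }

    // Moves to an offset, typically one returned earlier by tell().
    public function seek( int $offset ): void {
        if ( $offset < 0 || $offset > strlen( $this->bytes ) ) {
            throw new OutOfRangeException( 'Offset is outside of the stream.' );
        }
        $this->offset = $offset;
    }
}

$first        = new Toy_Byte_Stream( 'abcdefghij' );
$head         = $first->read( 4 );  // 'abcd'
$saved_offset = $first->tell();     // 4

// Later: a fresh stream over the same bytes picks up where the first one stopped.
$second = new Toy_Byte_Stream( 'abcdefghij' );
$second->seek( $saved_offset );
$tail = $second->read( 100 );       // 'efghij'
```

A single integer is easy to store and compare, which is why nothing fancier is needed for simple streams.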
WP_XML_Processor re-entrancy
WP_XML_Processor supports exporting state via:
- The get_reentrancy_cursor() method
- create($xml, $options, $cursor=null)
- get_token_byte_offset_in_the_input_stream()
No method in the XML processor API will ever accept the cursor or the byte offset as a way of moving to another location in the document. You can only create a new XML processor at $cursor.
This is a deliberate measure to avoid cursor-based seek()-ing. We already have named bookmarks for that.
Usage:
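A minimal sketch of the pattern, using a toy stand-in rather than the real WP_XML_Processor. Toy_Tag_Scanner, next_tag(), get_tag(), and the JSON cursor format are all assumptions made for this example; the real cursor format is not specified here. What it does mirror is the shape described above: the cursor is exported with get_reentrancy_cursor(), accepted only by the create() factory, and there is no seek().

```php
<?php
// Hypothetical toy scanner for illustration only; NOT the real WP_XML_Processor.
final class Toy_Tag_Scanner {
    private string $xml;
    private int $offset;
    private ?string $tag = null;

    // Private constructor: create() is the only way to start at a given position.
    private function __construct( string $xml, int $offset ) {
        $this->xml    = $xml;
        $this->offset = $offset;
    }

    public static function create( string $xml, ?string $cursor = null ): self {
        $offset = 0;
        if ( null !== $cursor ) {
            $state = json_decode( $cursor, true );
            if ( ! is_array( $state ) || ! isset( $state['offset'] ) || ! is_int( $state['offset'] ) || $state['offset'] < 0 ) {
                throw new InvalidArgumentException( 'Malformed cursor.' );
            }
            $offset = $state['offset'];
        }
        return new self( $xml, $offset );
    }

    // Advances to the next opening tag; returns false when none is left.
    public function next_tag(): bool {
        if ( 1 !== preg_match( '/<([a-z]+)>/', $this->xml, $m, PREG_OFFSET_CAPTURE, $this->offset ) ) {
            return false;
        }
        $this->tag    = $m[1][0];
        $this->offset = $m[0][1] + strlen( $m[0][0] );
        return true;
    }

    public function get_tag(): ?string {
        return $this->tag;
    }

    // Exports the position as an opaque string. There is deliberately no seek().
    public function get_reentrancy_cursor(): string {
        return (string) json_encode( array( 'offset' => $this->offset ) );
    }
}

$xml = '<post><title>Hello</title><content>World</content></post>';

$scanner = Toy_Tag_Scanner::create( $xml );
$scanner->next_tag(); // Now at <post>.
$scanner->next_tag(); // Now at <title>.
$cursor = $scanner->get_reentrancy_cursor();

// Later: resume by creating a brand new scanner at $cursor.
$resumed = Toy_Tag_Scanner::create( $xml, $cursor );
$resumed->next_tag();
$next_tag = $resumed->get_tag(); // 'content'
```

Because an existing instance can never jump to a cursor, its position only ever moves forward, and any in-document navigation stays with bookmarks.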
WP_WXR_Reader re-entrancy
The WP_WXR_Reader class uses the same get_reentrancy_cursor() interface as WP_XML_Processor.
WP_Stream_Importer re-entrancy
The WP_Stream_Importer class uses the same get_reentrancy_cursor() interface as WP_XML_Processor. See the example at the top of this description.
Testing instructions
TBD. We don't have a good way of running PHPUnit in the WordPress context yet. @zaerl is working on running the import in the CLI; we may need to wait for that before adding tests to this PR and shipping it.