ParserCache needs to provide multi-bucket support and the ability to tie the buckets together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, a data-parsoid JSON blob, and a data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs, so the API needs to be able to fetch individual blobs.
Currently RESTBase stores all blobs together in a JSON document { "html": "lalala", "data-parsoid": "trulala" }, fetches the whole thing on read, and returns only the requested portion. We did previously store the parts of the page bundle in separate tables, but eventually simplified this with no visible performance impact. The performance considerations are tied to the backend implementation (Cassandra in RESTBase's case), so they might not hold true for the MW ParserCache backend. However, I propose not to optimize prematurely and to start with storing the whole page bundle. We can revisit this later if we find the overhead of deserializing data-parsoid to be significant.
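To make the proposal concrete, here is a minimal sketch of "store the whole bundle, return only the requested part". It is written in Python for brevity rather than MediaWiki's PHP, and `BundleStore`, its method names, and the in-memory dict backend are all hypothetical; none of this is an actual RESTBase or ParserCache API.

```python
import json


class BundleStore:
    """Hypothetical store that keeps each page bundle as one JSON document."""

    def __init__(self):
        # Stand-in for the real backend; maps (page, revid) to a JSON string.
        self._blobs = {}

    def save(self, page, revid, bundle):
        # The whole bundle is serialized and written as a single value.
        self._blobs[(page, revid)] = json.dumps(bundle)

    def fetch(self, page, revid, part):
        """Deserialize the whole document, then return only the requested part."""
        raw = self._blobs.get((page, revid))
        if raw is None:
            return None
        return json.loads(raw).get(part)


store = BundleStore()
store.save("Main_Page", 42, {"html": "lalala", "data-parsoid": "trulala"})
html = store.fetch("Main_Page", 42, "html")
```

The cost being discussed is visible in `fetch`: a client that only wants the HTML still pays for deserializing data-parsoid.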
ParserCache (or whatever code component it is) needs to support the stashing functionality for editing clients, providing "storage semantics" (instead of caching semantics, where cached content can get evicted arbitrarily as far as clients are concerned) so that the presence of stashed content is guaranteed within session / time windows. RESTBase provides this.
I would like to separate these concerns: have ParserCache concentrate on caching, and introduce a separate component for stashing at a later stage. The requirements for these two components are drastically different. ParserCache is a 2-level cache that deduplicates by used options, and it has different expiry semantics and a different access key. ParserStash (name TBD) is a simple key-value store with TTL expiry.
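As a rough illustration of how small the stash could be, here is a sketch of a key-value store with TTL expiry, in Python rather than PHP. The class layout, the `set`/`get` method names, and the injectable clock are assumptions made for the example, not a proposed interface.

```python
import time


class ParserStash:
    """Hypothetical key-value stash with per-entry TTL expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised in tests
        self._entries = {}

    def set(self, key, value):
        # Record the absolute expiry time alongside the value.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value


# Usage with a fake clock so the TTL window is explicit.
now = [0.0]
stash = ParserStash(ttl_seconds=60, clock=lambda: now[0])
stash.set("render-id", {"html": "lalala"})
```

Unlike a cache, nothing here evicts an entry before its TTL elapses, which is the "storage semantics" editing clients rely on.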
In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution.
This can probably be achieved initially by introducing an entirely separate ParserCache service (ParsoidCache?) and using the appropriate one in the appropriate places. Once we have tighter integration, we can create a wrapper class that routes calls to the appropriate parser cache.
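The wrapper could look roughly like the following sketch (Python for illustration only). `DictCache` is a stand-in for the two real caches, and `ParserCacheRouter`, the `"legacy"` / `"parsoid"` labels, and the method names are all hypothetical.

```python
class DictCache:
    """Stand-in for a real parser cache; just a dict behind get/save."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def save(self, key, value):
        self._data[key] = value


class ParserCacheRouter:
    """Hypothetical wrapper that routes each call to the cache for a given parser."""

    def __init__(self, caches):
        # Maps a parser label to the cache instance that serves it.
        self._caches = caches

    def get(self, parser, key):
        return self._caches[parser].get(key)

    def save(self, parser, key, value):
        self._caches[parser].save(key, value)


router = ParserCacheRouter({"legacy": DictCache(), "parsoid": DictCache()})
router.save("parsoid", "Main_Page", "parsoid output")
router.save("legacy", "Main_Page", "legacy output")
```

Callers name the parser they want and never touch the underlying caches directly, so the two can later be merged or replaced behind the router.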
The ParserOutput object also extends a base class, CacheTime, which contains a bunch of ParserCache-specific expiry code. If this is appropriate for the new MPC implementation, we can include it in the base class we'd like to factor out of ParserOutput; if it is not, then we should keep it out of the base class of ParserOutput and include it (maybe as a trait) in the LegacyParserOutput used by the legacy parsercache and legacy parser.
I've been thinking of extracting the CacheTime interface (plus a few more methods) into a CacheableParserOutput interface and making ParserCache work with any instance of CacheableParserOutput. Then we could either make Parsoid's PageBundle implement it, or create a wrapper that implements the interface. This is not decided yet; I will keep the ticket updated with the latest developments.
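A sketch of the wrapper option, in Python for illustration. The method names on `CacheableParserOutput` are only loosely modeled on the idea of cache time and expiry and are assumptions, as are the `PageBundle` fields, `CacheablePageBundle`, and `is_expired`; the real interface is still undecided.

```python
from dataclasses import dataclass
from typing import Protocol


class CacheableParserOutput(Protocol):
    """Hypothetical interface: the minimum a cache needs to know about an output."""

    def get_cache_time(self) -> float: ...

    def get_cache_expiry(self) -> int: ...


@dataclass
class PageBundle:
    """Stand-in for Parsoid's page bundle; knows nothing about caching."""

    html: str
    data_parsoid: dict
    data_mw: dict


@dataclass
class CacheablePageBundle:
    """Wrapper that adds the caching metadata without changing PageBundle."""

    bundle: PageBundle
    cache_time: float  # when the output was generated (seconds)
    cache_expiry: int  # how long it stays valid (seconds)

    def get_cache_time(self) -> float:
        return self.cache_time

    def get_cache_expiry(self) -> int:
        return self.cache_expiry


def is_expired(output: CacheableParserOutput, now: float) -> bool:
    # The cache only depends on the interface, not on the concrete output type.
    return now >= output.get_cache_time() + output.get_cache_expiry()


wrapped = CacheablePageBundle(
    PageBundle("lalala", {}, {}), cache_time=100.0, cache_expiry=60
)
```

The trade-off between the two options is mainly where the caching concern lives: implementing the interface directly puts it inside PageBundle, while the wrapper keeps PageBundle unaware of the cache.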