Content translation/Technical Architecture

This document tries to capture the architectural understanding for the Content Translation (CX) project. It evolves as the team goes deeper into each component.

  1. The translation aids that we want to provide come mainly from third-party service providers, or are otherwise external to the CX software itself.
  2. Depending on the capability of the external system, varying amounts of delay can be expected. This emphasises the need for server-side optimizations around the service provider APIs (a rough sketch follows this list), such as:
     - caching of results from them
     - proper scheduling to better utilize capacities and to operate within the limits of API usage
  3. We are providing a free-to-edit interface, so we should not block any edits (translation, in the context of CX) if the user decides not to use translation aids. Incrementally, we can prepare all translation aids and provide them in context.
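
As a rough illustration of those two optimizations on the Node.js side, the sketch below caches provider results and drains a call queue at a fixed rate; callProvider() and the cache shape are hypothetical and not part of CX:

```javascript
// Illustrative only: cache results from service provider APIs and space out
// outgoing calls so we operate within API usage limits.
// callProvider() is a hypothetical wrapper around the external provider API.
var crypto = require( 'crypto' );

var cache = {}; // stand-in for Varnish or another shared cache
var queue = [];
var interval = 200; // at most ~5 provider calls per second (arbitrary)

function cacheKey( provider, from, to, text ) {
	return provider + ':' + from + ':' + to + ':' +
		crypto.createHash( 'md5' ).update( text ).digest( 'hex' );
}

function getTranslation( provider, from, to, text, callback ) {
	var key = cacheKey( provider, from, to, text );
	if ( cache[ key ] ) {
		return callback( null, cache[ key ] ); // serve the cached result
	}
	// Schedule the provider call instead of firing it immediately
	queue.push( function () {
		callProvider( provider, from, to, text, function ( err, result ) {
			if ( !err ) {
				cache[ key ] = result;
			}
			callback( err, result );
		} );
	} );
}

// Drain the queue at a fixed rate
setInterval( function () {
	var job = queue.shift();
	if ( job ) {
		job();
	}
}, interval );
```
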
  1. We can provide translation aids for the content in increments as and when we receive them from service providers. We do not need to wait for all service providers to return data before proceeding.
  2. For large articles, we will have to do some kind of paging mechanism to provide this data in batches.
  3. This means that the client and server should communicate at regular intervals, or as and when data becomes available at the server. We considered traditional HTTP pull methods (Ajax) and push communication (WebSockets); a sketch of the pull variant follows.
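
For the pull (Ajax) variant, a possible client-side polling loop is sketched below; the /translationaids route, its parameters, and renderAid() are illustrative assumptions rather than the actual CX API:

```javascript
// Illustrative client-side polling (the pull/Ajax variant) for translation
// aids, requested a batch of segments at a time. The /translationaids route,
// its parameters, and renderAid() are hypothetical.
function fetchAids( sectionId, offset ) {
	$.getJSON( '/translationaids', { section: sectionId, offset: offset, limit: 20 } )
		.done( function ( data ) {
			// Render whatever aids the server has prepared so far
			data.aids.forEach( renderAid );
			if ( !data.complete ) {
				// Poll again later for aids that are still being prepared
				setTimeout( function () {
					fetchAids( sectionId, offset + data.aids.length );
				}, 2000 );
			}
		} );
}
```
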
Client-side technologies:

  1. jQuery
  2. contenteditable with an optional rich text editor
  3. LESS, Grid

Server-side technologies:

  1. Node.js with Express
  2. Varnish Cache

A Node.js built-in cluster module-based instance management system. Currently the code is borrowed from Parsoid. We have a cluster that forks Express server instances depending on the number of processors available in the system. It also forks new processes to replace instances that were killed or died on their own.

This approach uses the Node.js built-in cluster module. A possibly better alternative is the cluster node module from the socket.io developers: http://learnboost.github.io/cluster/
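
A minimal sketch of this cluster-based setup, assuming Express is used in the workers; this illustrates the idea rather than reproducing the Parsoid-derived code itself:

```javascript
// Fork one Express worker per CPU and respawn workers that die,
// mirroring the instance management described above.
var cluster = require( 'cluster' ),
	os = require( 'os' );

if ( cluster.isMaster ) {
	var i, workers = os.cpus().length;
	for ( i = 0; i < workers; i++ ) {
		cluster.fork();
	}
	// Replace any worker that was killed or crashed
	cluster.on( 'exit', function ( worker ) {
		console.log( 'Worker ' + worker.process.pid + ' died, forking a new one' );
		cluster.fork();
	} );
} else {
	var express = require( 'express' ),
		app = express();
	app.get( '/', function ( req, res ) {
		res.send( 'CX server worker ' + process.pid );
	} );
	app.listen( 8000 );
}
```
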

To be designed

To be designed. https://wikitech.wikimedia.org/wiki/Parsoid gives some idea

The content itself might be sensitive (private wikis) and thus should not always be shared through translation memory. We also need to restrict access to service providers, especially if they are paid ones.

See Content_translation/Segmentation

See Content_translation/Workflow

See Publishing

We need to capture translators' improvements to the translation suggestions (for all translation aids). If we are allowing free-text translation, the extent to which we can do this is limited, as explained in the segment alignment section of this document. But for the draft/final version of the translation, we need to capture the following in our CX system, both to continuously improve CX and potentially to provide this data back to service providers (a sketch of a possible feedback record follows the list below):

  1. Dictionary corrections/improvements - whether the user used a CX suggestion or supplied a new meaning; also, out of n suggestions, which one was used.
  2. Link localization - TBD: can we give this back to Wikidata?
  3. MT - edited translations.
  4. Translation memory - whether suggestions were used, which affects the confidence score.
  5. Segmentation corrections.
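
One possible shape for such a feedback record is sketched below; the field names are illustrative only and not taken from any existing CX schema:

```javascript
// Hypothetical example of the kind of correction data CX could capture
// and feed back to service providers.
var feedback = {
	sourceLanguage: 'en',
	targetLanguage: 'es',
	segmentId: 'cxsegment42',
	aid: 'dictionary',                // dictionary | link | mt | tm | segmentation
	suggestions: [ 'casa', 'hogar' ], // what CX offered
	chosen: 'hogar',                  // which suggestion the translator used
	finalText: 'hogar',               // what ended up in the published draft
	edited: false                     // whether the suggestion was modified
};
```
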

One potential outcome of this is parallel corpora for many language pairs, which are quite useful for any kind of morphological analysis.

We allow free text editing for translation. We can give word meanings and translation templates. But since we allow users to summarize, combine multiple sentences, or choose not to translate something, we will not always be able to learn the translation.

We need to align the source and target translations - for example, translatedSentence24 is the translation of originalSentence58.

This context is completely different from the structured translation we do with the Translate extension. The following types of edits are possible:

  1. Translator combines multiple segments into a single segment - for example, when summarising
  2. Translator splits a single segment into multiple segments to construct simpler sentences
  3. Translator leaves multiple segments untranslated
  4. Translator re-orders segments within a paragraph, or even across multiple paragraphs
  5. Translator paraphrases content from multiple segments into another set of segments

TODO: replace the word alignment diagram opposite with a suitable segment alignment diagram

The CX UI provides visual synchronization for segments wherever possible. From the UI perspective, word and sentence alignment help user orientation, but they are not critical components that will break the experience if they are not 100% perfect.

Alignment of words and sentences is useful to provide additional context to the user. When getting information on a word from the translation, it is good to have it highlighted in the source, so that it becomes easier to see the context in which the word is used. This is not illustrated in the prototype, but it is something that Google Translate does, for example.

Pau thinks it is totally fine if the alignment does not work in all cases. For example, we may want to support it when the translation is done by modifying the initial template, but it is harder if the user starts from scratch, so we do not do it in the latter case.

The following best-try approach is proposed:

  1. We are considering annotating the source article with identifiers for segments. When we use machine translation for the template, we can try copying these annotations to the template. The feasibility of doing so depends on several things: (a) the ability of MT providers to take HTML and return the translation without losing these annotations; (b) mapping the link ids by using the href as the key on our server side; (c) if no MT backend is available, we give the source article itself as the translation template - no issue there.
  2. If the user edits the translation template, it is possible that the segment markup gets destroyed. This can happen when two sentences are combined, or even just because the first word is deleted. But wherever the annotations remain, it should be easy for us to do a visual sync-up.
  3. If we don't get enough matching annotations (ids), we ignore the alignment; see the sketch after this list.
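
A hedged sketch of the "enough matching annotations" check from point 3; the data-segmentid attribute and the 50% threshold are assumptions for illustration, not CX specifics:

```javascript
// Illustrative: decide whether segment annotations survived well enough
// in the translation column to drive visual sync-up with the source.
function alignmentUsable( $source, $translation ) {
	var matched = 0,
		sourceIds = $source.find( '[data-segmentid]' ).map( function () {
			return $( this ).data( 'segmentid' );
		} ).get();

	sourceIds.forEach( function ( id ) {
		if ( $translation.find( '[data-segmentid="' + id + '"]' ).length ) {
			matched++;
		}
	} );
	// If too few annotations remain (threshold is arbitrary here), ignore alignment
	return sourceIds.length > 0 && ( matched / sourceIds.length ) >= 0.5;
}
```
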

In later iterations, it is possible to improve this approach in multiple ways:

  1. Consider incorporating sentence alignment using minimal linguistic data for simple language pairs if possible. Example: English-Welsh (?)
  2. Consider preserving contenteditable nodes in a translation template so that we have segment ids to match (follow the VE team's developments in this area)

An approach to be evaluated: instead of making the whole translation column contenteditable, mark each segment as contenteditable when a translation template is inserted. This will prevent the destruction of segments.
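
A minimal sketch of that idea; the selectors and the data-segmentid attribute are illustrative assumptions, not the actual CX markup:

```javascript
// Illustrative: when a translation template is inserted, make each segment
// editable on its own instead of the whole column, so the segment wrappers
// cannot be destroyed by edits. The selectors are hypothetical.
$( '#translation-column' ).removeAttr( 'contenteditable' );
$( '#translation-column [data-segmentid]' ).attr( 'contenteditable', 'true' );
```
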

See Caching

This should be done with the help of Wikidata. The results should also be cached.
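
A minimal sketch of such a Wikidata-backed link lookup, assuming the wbgetentities API and the Node.js request module; this is not CX code and omits error handling for missing pages:

```javascript
// Illustrative: find the target-language title of a linked article through
// Wikidata sitelinks (wbgetentities). Results would be cached as noted above.
var request = require( 'request' );

function adaptLink( title, sourceWiki, targetWiki, callback ) {
	request( {
		url: 'https://www.wikidata.org/w/api.php',
		qs: {
			action: 'wbgetentities',
			sites: sourceWiki, // e.g. 'enwiki'
			titles: title,
			props: 'sitelinks',
			format: 'json'
		},
		json: true
	}, function ( err, res, body ) {
		if ( err ) {
			return callback( err );
		}
		var id = Object.keys( body.entities )[ 0 ];
		var target = ( body.entities[ id ].sitelinks || {} )[ targetWiki ]; // e.g. 'eswiki'
		callback( null, target && target.title );
	} );
}
```
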

