Uploading corpora

You can upload two types of files to use as corpora in Globalese: XML-based files and delimited files (CSV/TSV).

You can upload up to 20 files at a time, and no file can be larger than 600 MB.
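If you prepare uploads with a script, it can help to check a batch against these limits before opening the upload dialog. The following is a minimal sketch, not part of Globalese: the function and constant names are hypothetical, and it assumes that "600 MB" means 600 × 1024 × 1024 bytes, which the limits above do not spell out.

```python
from pathlib import Path

MAX_FILES_PER_UPLOAD = 20
# Assumption: "600 MB" is interpreted as 600 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 600 * 1024 * 1024


def check_upload_batch(sizes_by_name):
    """Return a list of problems found in a planned upload batch.

    sizes_by_name maps a file name to its size in bytes.
    An empty list means the batch is within both limits.
    """
    problems = []
    if len(sizes_by_name) > MAX_FILES_PER_UPLOAD:
        problems.append(
            f"{len(sizes_by_name)} files selected; "
            f"the limit is {MAX_FILES_PER_UPLOAD} per upload"
        )
    for name, size in sizes_by_name.items():
        if size > MAX_FILE_SIZE_BYTES:
            problems.append(f"{name} is larger than 600 MB")
    return problems


def sizes_from_paths(paths):
    """Read the on-disk size of each path (helper for real files)."""
    return {str(p): Path(p).stat().st_size for p in paths}
```

For real files, `check_upload_batch(sizes_from_paths(paths))` returns an empty list when the batch can be uploaded in one go.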

Accepted file formats

XML-based files

The source and target languages are automatically detected from the information in the files themselves.

When preparing XML-based files for use as corpora in Globalese, let a CAT tool generate them where possible; otherwise, follow the TMX 1.4b specification. There is no need to remove formatting or placeholder tags from standard TMX files.
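If no CAT tool is available, a small script can produce a TMX file by hand. The sketch below builds a minimal document with the header attributes that TMX 1.4b marks as required and one `tuv` per language in each translation unit. The function name and the header values (`example-script`, `unknown`, and so on) are placeholders chosen for illustration, and the sketch has not been checked against Globalese's own importer.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a minimal TMX tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX 1.4b; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "1.0",
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

Calling `build_tmx([("Hello", "Hallo")], "en", "de").write("sample.tmx", encoding="utf-8", xml_declaration=True)` writes the file to disk. The language codes in the `xml:lang` attributes are the information from which the languages can be detected.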

The following XML-based file formats are accepted:

  • .mqxliff

  • .mxliff

  • .sdlxliff

  • .tbx

  • .tmx

  • .txlf

  • .xliff / .xlf

  • .xlz

Delimited text files

Delimited files must be bilingual text files where the source and target segments are on the same line, separated by a tab character (.bi, .tsv), a semicolon (.csv) or a comma (.csv).

Since there is no way to automatically detect the languages, you must specify them before uploading the files. The source language is the language of the first column of the uploaded files, and the target language is the language of the second column.
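As an illustration of the expected layout, the sketch below turns segment pairs into tab-separated lines, source first and target second. The function names are hypothetical. Collapsing tabs and line breaks inside a segment into single spaces is an assumption made here so that every pair stays on one line with exactly one delimiter; it is not a documented Globalese requirement.

```python
def normalize_segment(text):
    """Collapse tabs, line breaks, and repeated spaces into single spaces."""
    return " ".join(text.split())


def to_tsv_lines(pairs):
    """Turn (source, target) pairs into 'source<TAB>target' lines."""
    return [
        f"{normalize_segment(source)}\t{normalize_segment(target)}"
        for source, target in pairs
    ]


def write_tsv(pairs, path):
    """Write the pairs to a UTF-8 text file, one pair per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in to_tsv_lines(pairs):
            handle.write(line + "\n")
```

With this layout, the first column holds the source language and the second column the target language, matching the order in which you specify them at upload time.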

The following delimited file formats are accepted:

  • .bi

  • .csv (using comma or semicolon as delimiter)

  • .tsv
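The two format lists can be encoded as a small lookup that tells a script whether a batch needs the languages specified manually. This is a sketch with hypothetical names; matching extensions case-insensitively is an assumption, not documented behavior.

```python
from pathlib import Path

# Extensions taken from the two lists of accepted formats.
XML_BASED = {
    ".mqxliff", ".mxliff", ".sdlxliff", ".tbx", ".tmx",
    ".txlf", ".xliff", ".xlf", ".xlz",
}
DELIMITED = {".bi", ".csv", ".tsv"}


def classify(filename):
    """Return 'xml', 'delimited', or None for an unsupported extension."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(filename).suffix.lower()
    if suffix in XML_BASED:
        return "xml"
    if suffix in DELIMITED:
        return "delimited"
    return None


def languages_required(filenames):
    """True if any file is delimited, so languages cannot be auto-detected."""
    return any(classify(name) == "delimited" for name in filenames)
```

For example, `languages_required(["memory.tmx", "pairs.csv"])` is true because the CSV file carries no language information of its own.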

Uploading new corpora

  1. Go to Corpora.

  2. Click the Upload button.

  3. Select at least one group to assign the uploaded file(s) to.

  4. Optionally specify any metadata.

  5. Optionally, mark corpora in CSV, TBX, or TSV format as keyword lists by selecting the "Keyword list" checkbox.

  6. Select at least one file to upload. You can also drag and drop files into the dropzone.

  7. If one or more of the selected files are in a delimited format (i.e. not XML-based), you also need to specify the source and target languages.

  8. Click the Upload button.

Uploading a new version of an existing corpus

If you want to update the contents of a corpus in Globalese, you have two options: either re-import it if it originates from a CAT tool, or upload the changed corpus manually:

  1. Go to the corpus you want to update.

  2. Click the Update button.

  3. The languages and file format will be pre-selected.

  4. Browse to the new file, or drop it into the modal window.

  5. Click the Upload button.

Each time you update an existing corpus, the version number of the corpus will automatically be incremented. You can see the version history of a corpus on the appropriately named Versions tab.

Updating an existing corpus does not affect the engines that were previously trained on the corpus. You will need to retrain the engines for the additions, deletions and changes to take effect.