Add kind that demonstrates how to modify the upstream graph in a transform #438
firefoxci-taskcluster / dataset-news-crawl-news_2008-ru
succeeded
Feb 12, 2024 in 2m 33s
FirefoxCI (pull_request)
Fetch news-crawl dataset for {src_locale}
[taskcluster 2024-02-12 19:59:00.284Z] Task ID: b0QR88XDSNu6oI1RIyv5bw
[taskcluster 2024-02-12 19:59:00.284Z] Worker ID: 4930825421926680236
[taskcluster 2024-02-12 19:59:00.284Z] Worker Group: us-west1
[taskcluster 2024-02-12 19:59:00.284Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-02-12 19:59:00.284Z] Worker Pool: translations-1/b-linux-large-gcp
[taskcluster 2024-02-12 19:59:00.284Z] Worker Version: 38.0.5
[taskcluster 2024-02-12 19:59:00.284Z] Public IP: 34.105.105.76
[taskcluster 2024-02-12 19:59:00.284Z] Hostname: translations-1-b-linux-large-gcp-bkxkz5kwtxmim1se1fb46w
[taskcluster 2024-02-12 19:59:00.284Z] using cache "translations-level-1-checkouts-v3-58974d7dcf0417b3fe53-POZi__wrQou1jDZzMTKMiQ" -> /builds/worker/checkouts
[taskcluster 2024-02-12 19:59:05.545Z] Downloading artifact "public/image.tar.zst" from task ID: POZi__wrQou1jDZzMTKMiQ.
[taskcluster 2024-02-12 19:59:10.545Z] Download Progress: 81.65%
[taskcluster 2024-02-12 19:59:11.738Z] Downloaded artifact successfully.
[taskcluster 2024-02-12 19:59:11.738Z] Downloaded 775.919 mb
[taskcluster 2024-02-12 19:59:11.739Z] Decompressing downloaded image
[taskcluster 2024-02-12 19:59:17.058Z] Loading docker image from downloaded archive.
[taskcluster 2024-02-12 19:59:50.775Z] Image 'public/image.tar.zst' from task 'POZi__wrQou1jDZzMTKMiQ' loaded. Using image ID sha256:20930bbb7441964357bb7a066c02e9343c8320da085971c7900be610d4f412cd.
[taskcluster 2024-02-12 19:59:51.074Z] === Task Starting ===
[setup 2024-02-12T19:59:59.564Z] run-task started in /builds/worker
[setup 2024-02-12T19:59:59.564Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c $VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst
[setup 2024-02-12T19:59:59.564Z] Python version: 3.10.12
[cache 2024-02-12T19:59:59.566Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2024-02-12T19:59:59.566Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2024-02-12T19:59:59.566Z] volume /builds/worker/checkouts is a cache
[setup 2024-02-12T19:59:59.566Z] running as worker:worker
[vcs 2024-02-12T19:59:59.566Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-02-12T19:59:59.568Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/builds/worker/checkouts/vcs']
[vcs 2024-02-12T19:59:59.570Z] Cloning into '/builds/worker/checkouts/vcs'...
[vcs 2024-02-12T20:00:00.296Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/bhearsum/firefox-translations-training', 'graph-mod']
[vcs 2024-02-12T20:00:00.722Z] From https://github.com/bhearsum/firefox-translations-training
[vcs 2024-02-12T20:00:00.722Z] * branch graph-mod -> FETCH_HEAD
[vcs 2024-02-12T20:00:00.724Z] executing ['git', 'checkout', '-f', '-B', 'graph-mod', '0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826']
[vcs 2024-02-12T20:00:00.778Z] Switched to a new branch 'graph-mod'
[vcs 2024-02-12T20:00:00.778Z] executing ['git', 'submodule', 'init']
[vcs 2024-02-12T20:00:00.797Z] Submodule '3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path '3rd_party/browsermt-marian-dev'
[vcs 2024-02-12T20:00:00.797Z] Submodule 'extract-lex' (https://github.com/marian-nmt/extract-lex) registered for path '3rd_party/extract-lex'
[vcs 2024-02-12T20:00:00.797Z] Submodule 'fast_align' (https://github.com/clab/fast_align) registered for path '3rd_party/fast_align'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/kenlm' (https://github.com/kpu/kenlm) registered for path '3rd_party/kenlm'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/marian-dev' (https://github.com/marian-nmt/marian-dev) registered for path '3rd_party/marian-dev'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/preprocess' (https://github.com/kpu/preprocess.git) registered for path '3rd_party/preprocess'
[vcs 2024-02-12T20:00:00.799Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-02-12T20:00:00.818Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/browsermt-marian-dev'...
[vcs 2024-02-12T20:00:02.186Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/extract-lex'...
[vcs 2024-02-12T20:00:02.719Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/fast_align'...
[vcs 2024-02-12T20:00:03.095Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/kenlm'...
[vcs 2024-02-12T20:00:03.922Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/marian-dev'...
[vcs 2024-02-12T20:00:05.990Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/preprocess'...
[vcs 2024-02-12T20:00:06.500Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-02-12T20:00:06.509Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-02-12T20:00:06.519Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-02-12T20:00:06.541Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-02-12T20:00:06.640Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-02-12T20:00:06.657Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-02-12T20:00:06.657Z] cleaning git checkout...
[vcs 2024-02-12T20:00:06.657Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-02-12T20:00:06.660Z] removing []
[vcs 2024-02-12T20:00:06.660Z] successfully cleaned git checkout!
[vcs 2024-02-12T20:00:06.662Z] TinderboxPrint:<a href='https://github.com/bhearsum/firefox-translations-training/commit/0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826' title='Built from firefox-translations-training commit 0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826'>0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826</a>
[task 2024-02-12T20:00:06.662Z] executing ['bash', '-c', '$VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst']
[task 2024-02-12T20:00:06.664Z] + set -euo pipefail
[task 2024-02-12T20:00:06.664Z] + dataset=news-crawl_news.2008
[task 2024-02-12T20:00:06.664Z] + lang=ru
[task 2024-02-12T20:00:06.664Z] + max_sent=10000
[task 2024-02-12T20:00:06.664Z] + output_path=/builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-12T20:00:06.664Z] + coef=0.1
[task 2024-02-12T20:00:06.664Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-12T20:00:06.664Z] + ARTIFACT_EXT=zst
[task 2024-02-12T20:00:06.664Z] + echo '###### Downloading monolingual data for language ru dataset news-crawl_news.2008'
[task 2024-02-12T20:00:06.664Z] ###### Downloading monolingual data for language ru dataset news-crawl_news.2008
[task 2024-02-12T20:00:06.664Z] ++ dirname /builds/worker/checkouts/vcs/pipeline/data/download-mono.sh
[task 2024-02-12T20:00:06.665Z] + cd /builds/worker/checkouts/vcs/pipeline/data
[task 2024-02-12T20:00:06.665Z] ++ dirname /builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-12T20:00:06.666Z] + tmp=/builds/worker/artifacts/original
[task 2024-02-12T20:00:06.666Z] + mkdir -p /builds/worker/artifacts/original
[task 2024-02-12T20:00:06.667Z] + echo '### Downloading dataset'
[task 2024-02-12T20:00:06.667Z] ### Downloading dataset
[task 2024-02-12T20:00:06.667Z] + original_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-12T20:00:06.667Z] + name=news.2008
[task 2024-02-12T20:00:06.667Z] + type=news-crawl
[task 2024-02-12T20:00:06.667Z] + test -s /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:06.667Z] + bash importers/mono/news-crawl.sh ru /builds/worker/artifacts/original/news-crawl_news.2008.original.ru news.2008
[task 2024-02-12T20:00:06.668Z] + set -euo pipefail
[task 2024-02-12T20:00:06.668Z] + lang=ru
[task 2024-02-12T20:00:06.668Z] + output_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-12T20:00:06.668Z] + dataset=news.2008
[task 2024-02-12T20:00:06.668Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-12T20:00:06.668Z] + ARTIFACT_EXT=zst
[task 2024-02-12T20:00:06.668Z] + WGET=wget
[task 2024-02-12T20:00:06.668Z] + echo '###### Downloading WMT newscrawl monolingual data'
[task 2024-02-12T20:00:06.668Z] ###### Downloading WMT newscrawl monolingual data
[task 2024-02-12T20:00:06.668Z] + wget -O - http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:06.668Z] + gunzip
[task 2024-02-12T20:00:06.668Z] + zstdmt -c
[task 2024-02-12T20:00:06.670Z] --2024-02-12 20:00:06-- http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:06.706Z] Resolving data.statmt.org (data.statmt.org)... 129.215.32.28
[task 2024-02-12T20:00:06.845Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:80... connected.
[task 2024-02-12T20:00:06.983Z] HTTP request sent, awaiting response... 301 Moved Permanently
[task 2024-02-12T20:00:06.983Z] Location: https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz [following]
[task 2024-02-12T20:00:06.983Z] --2024-02-12 20:00:06-- https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:07.123Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:443... connected.
[task 2024-02-12T20:00:07.552Z] HTTP request sent, awaiting response... 200 OK
[task 2024-02-12T20:00:07.552Z] Length: 2312968 (2.2M) [application/x-gzip]
[task 2024-02-12T20:00:07.552Z] Saving to: ‘STDOUT’
[task 2024-02-12T20:00:07.552Z]
[task 2024-02-12T20:00:07.827Z] 0K .......... .......... .......... .......... .......... 2% 181K 12s
[task 2024-02-12T20:00:07.965Z] 50K .......... .......... .......... .......... .......... 4% 363K 9s
[task 2024-02-12T20:00:07.965Z] 100K .......... .......... .......... .......... .......... 6% 163M 6s
[task 2024-02-12T20:00:08.104Z] 150K .......... .......... .......... .......... .......... 8% 362K 6s
[task 2024-02-12T20:00:08.104Z] 200K .......... .......... .......... .......... .......... 11% 175M 4s
[task 2024-02-12T20:00:08.104Z] 250K .......... .......... .......... .......... .......... 13% 77.7M 4s
[task 2024-02-12T20:00:08.109Z] 300K .......... .......... .......... .......... .......... 15% 10.0M 3s
[task 2024-02-12T20:00:08.241Z] 350K .......... .......... .......... .......... .......... 17% 379K 3s
[task 2024-02-12T20:00:08.242Z] 400K .......... .......... .......... .......... .......... 19% 44.0M 3s
[task 2024-02-12T20:00:08.242Z] 450K .......... .......... .......... .......... .......... 22% 200M 2s
[task 2024-02-12T20:00:08.243Z] 500K .......... .......... .......... .......... .......... 24% 220M 2s
[task 2024-02-12T20:00:08.248Z] 550K .......... .......... .......... .......... .......... 26% 9.21M 2s
[task 2024-02-12T20:00:08.248Z] 600K .......... .......... .......... .......... .......... 28% 284M 2s
[task 2024-02-12T20:00:08.379Z] 650K .......... .......... .......... .......... .......... 30% 383K 2s
[task 2024-02-12T20:00:08.379Z] 700K .......... .......... .......... .......... .......... 33% 107M 2s
[task 2024-02-12T20:00:08.380Z] 750K .......... .......... .......... .......... .......... 35% 59.0M 2s
[task 2024-02-12T20:00:08.385Z] 800K .......... .......... .......... .......... .......... 37% 10.3M 1s
[task 2024-02-12T20:00:08.385Z] 850K .......... .......... .......... .......... .......... 39% 199M 1s
[task 2024-02-12T20:00:08.385Z] 900K .......... .......... .......... .......... .......... 42% 261M 1s
[task 2024-02-12T20:00:08.385Z] 950K .......... .......... .......... .......... .......... 44% 124M 1s
[task 2024-02-12T20:00:08.386Z] 1000K .......... .......... .......... .......... .......... 46% 288M 1s
[task 2024-02-12T20:00:08.391Z] 1050K .......... .......... .......... .......... .......... 48% 9.13M 1s
[task 2024-02-12T20:00:08.391Z] 1100K .......... .......... .......... .......... .......... 50% 260M 1s
[task 2024-02-12T20:00:08.391Z] 1150K .......... .......... .......... .......... .......... 53% 287M 1s
[task 2024-02-12T20:00:08.391Z] 1200K .......... .......... .......... .......... .......... 55% 229M 1s
[task 2024-02-12T20:00:08.392Z] 1250K .......... .......... .......... .......... .......... 57% 298M 1s
[task 2024-02-12T20:00:08.397Z] 1300K .......... .......... .......... .......... .......... 59% 9.20M 1s
[task 2024-02-12T20:00:08.516Z] 1350K .......... .......... .......... .......... .......... 61% 419K 1s
[task 2024-02-12T20:00:08.517Z] 1400K .......... .......... .......... .......... .......... 64% 155M 1s
[task 2024-02-12T20:00:08.517Z] 1450K .......... .......... .......... .......... .......... 66% 132M 0s
[task 2024-02-12T20:00:08.518Z] 1500K .......... .......... .......... .......... .......... 68% 75.2M 0s
[task 2024-02-12T20:00:08.523Z] 1550K .......... .......... .......... .......... .......... 70% 146M 0s
[task 2024-02-12T20:00:08.523Z] 1600K .......... .......... .......... .......... .......... 73% 9.87M 0s
[task 2024-02-12T20:00:08.523Z] 1650K .......... .......... .......... .......... .......... 75% 207M 0s
[task 2024-02-12T20:00:08.523Z] 1700K .......... .......... .......... .......... .......... 77% 182M 0s
[task 2024-02-12T20:00:08.524Z] 1750K .......... .......... .......... .......... .......... 79% 290M 0s
[task 2024-02-12T20:00:08.524Z] 1800K .......... .......... .......... .......... .......... 81% 272M 0s
[task 2024-02-12T20:00:08.529Z] 1850K .......... .......... .......... .......... .......... 84% 9.22M 0s
[task 2024-02-12T20:00:08.529Z] 1900K .......... .......... .......... .......... .......... 86% 345M 0s
[task 2024-02-12T20:00:08.529Z] 1950K .......... .......... .......... .......... .......... 88% 230M 0s
[task 2024-02-12T20:00:08.530Z] 2000K .......... .......... .......... .......... .......... 90% 296M 0s
[task 2024-02-12T20:00:08.530Z] 2050K .......... .......... .......... .......... .......... 92% 341M 0s
[task 2024-02-12T20:00:08.535Z] 2100K .......... .......... .......... .......... .......... 95% 9.21M 0s
[task 2024-02-12T20:00:08.535Z] 2150K .......... .......... .......... .......... .......... 97% 255M 0s
[task 2024-02-12T20:00:08.535Z] 2200K .......... .......... .......... .......... .......... 99% 200M 0s
[task 2024-02-12T20:00:08.536Z] 2250K ........ 100% 286M=1.0s
[task 2024-02-12T20:00:08.536Z]
[task 2024-02-12T20:00:08.536Z] 2024-02-12 20:00:08 (2.24 MB/s) - written to stdout [2312968/2312968]
[task 2024-02-12T20:00:08.536Z]
[task 2024-02-12T20:00:08.598Z] + echo '###### Done: Downloading WMT newscrawl monolingual data'
[task 2024-02-12T20:00:08.598Z] ###### Done: Downloading WMT newscrawl monolingual data
[task 2024-02-12T20:00:08.598Z] + echo '### Sampling dataset'
[task 2024-02-12T20:00:08.598Z] ### Sampling dataset
[task 2024-02-12T20:00:08.598Z] + set +o pipefail
[task 2024-02-12T20:00:08.599Z] + zstdmt -dc /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:08.599Z] + perl -ne 'print if(split(/\s/, $_) < 100)'
[task 2024-02-12T20:00:08.599Z] + head -n 10000
[task 2024-02-12T20:00:08.599Z] ++ bc -l
[task 2024-02-12T20:00:08.599Z] + zstdmt
[task 2024-02-12T20:00:08.600Z] + shuf -n 11000
[task 2024-02-12T20:00:08.662Z] + set -o pipefail
[task 2024-02-12T20:00:08.662Z] + rm -rf /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:08.663Z] + echo '###### Done: Downloading monolingual data'
[task 2024-02-12T20:00:08.663Z] ###### Done: Downloading monolingual data
[taskcluster 2024-02-12 20:00:09.009Z] === Task Finished ===
[taskcluster 2024-02-12 20:00:09.200Z] Successful task run with exit code: 0 completed in 68.917 seconds