Add kind that demonstrates how to modify the upstream graph in a transform #438

Closed
wants to merge 2 commits into from

Add hacky kind that can modify upstream parts of the graph

0d686a1
firefoxci-taskcluster / dataset-news-crawl-news_2008-ru succeeded Feb 12, 2024 in 2m 33s

FirefoxCI (pull_request)

Fetch news-crawl dataset for {src_locale}

[taskcluster 2024-02-12 19:59:00.284Z] Task ID: b0QR88XDSNu6oI1RIyv5bw
[taskcluster 2024-02-12 19:59:00.284Z] Worker ID: 4930825421926680236
[taskcluster 2024-02-12 19:59:00.284Z] Worker Group: us-west1
[taskcluster 2024-02-12 19:59:00.284Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-02-12 19:59:00.284Z] Worker Pool: translations-1/b-linux-large-gcp
[taskcluster 2024-02-12 19:59:00.284Z] Worker Version: 38.0.5
[taskcluster 2024-02-12 19:59:00.284Z] Public IP: 34.105.105.76
[taskcluster 2024-02-12 19:59:00.284Z] Hostname: translations-1-b-linux-large-gcp-bkxkz5kwtxmim1se1fb46w
[taskcluster 2024-02-12 19:59:00.284Z] using cache "translations-level-1-checkouts-v3-58974d7dcf0417b3fe53-POZi__wrQou1jDZzMTKMiQ" -> /builds/worker/checkouts

[taskcluster 2024-02-12 19:59:05.545Z] Downloading artifact "public/image.tar.zst" from task ID: POZi__wrQou1jDZzMTKMiQ.
[taskcluster 2024-02-12 19:59:10.545Z] Download Progress: 81.65%
[taskcluster 2024-02-12 19:59:11.738Z] Downloaded artifact successfully.
[taskcluster 2024-02-12 19:59:11.738Z] Downloaded 775.919 mb
[taskcluster 2024-02-12 19:59:11.739Z] Decompressing downloaded image
[taskcluster 2024-02-12 19:59:17.058Z] Loading docker image from downloaded archive.
[taskcluster 2024-02-12 19:59:50.775Z] Image 'public/image.tar.zst' from task 'POZi__wrQou1jDZzMTKMiQ' loaded.  Using image ID sha256:20930bbb7441964357bb7a066c02e9343c8320da085971c7900be610d4f412cd.
[taskcluster 2024-02-12 19:59:51.074Z] === Task Starting ===
[setup 2024-02-12T19:59:59.564Z] run-task started in /builds/worker
[setup 2024-02-12T19:59:59.564Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c $VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst
[setup 2024-02-12T19:59:59.564Z] Python version: 3.10.12
[cache 2024-02-12T19:59:59.566Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2024-02-12T19:59:59.566Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2024-02-12T19:59:59.566Z] volume /builds/worker/checkouts is a cache
[setup 2024-02-12T19:59:59.566Z] running as worker:worker
[vcs 2024-02-12T19:59:59.566Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-02-12T19:59:59.568Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/builds/worker/checkouts/vcs']
[vcs 2024-02-12T19:59:59.570Z] Cloning into '/builds/worker/checkouts/vcs'...
[vcs 2024-02-12T20:00:00.296Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/bhearsum/firefox-translations-training', 'graph-mod']
[vcs 2024-02-12T20:00:00.722Z] From https://github.com/bhearsum/firefox-translations-training
[vcs 2024-02-12T20:00:00.722Z]  * branch            graph-mod  -> FETCH_HEAD
[vcs 2024-02-12T20:00:00.724Z] executing ['git', 'checkout', '-f', '-B', 'graph-mod', '0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826']
[vcs 2024-02-12T20:00:00.778Z] Switched to a new branch 'graph-mod'
[vcs 2024-02-12T20:00:00.778Z] executing ['git', 'submodule', 'init']
[vcs 2024-02-12T20:00:00.797Z] Submodule '3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path '3rd_party/browsermt-marian-dev'
[vcs 2024-02-12T20:00:00.797Z] Submodule 'extract-lex' (https://github.com/marian-nmt/extract-lex) registered for path '3rd_party/extract-lex'
[vcs 2024-02-12T20:00:00.797Z] Submodule 'fast_align' (https://github.com/clab/fast_align) registered for path '3rd_party/fast_align'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/kenlm' (https://github.com/kpu/kenlm) registered for path '3rd_party/kenlm'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/marian-dev' (https://github.com/marian-nmt/marian-dev) registered for path '3rd_party/marian-dev'
[vcs 2024-02-12T20:00:00.798Z] Submodule '3rd_party/preprocess' (https://github.com/kpu/preprocess.git) registered for path '3rd_party/preprocess'
[vcs 2024-02-12T20:00:00.799Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-02-12T20:00:00.818Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/browsermt-marian-dev'...
[vcs 2024-02-12T20:00:02.186Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/extract-lex'...
[vcs 2024-02-12T20:00:02.719Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/fast_align'...
[vcs 2024-02-12T20:00:03.095Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/kenlm'...
[vcs 2024-02-12T20:00:03.922Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/marian-dev'...
[vcs 2024-02-12T20:00:05.990Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/preprocess'...
[vcs 2024-02-12T20:00:06.500Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-02-12T20:00:06.509Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-02-12T20:00:06.519Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-02-12T20:00:06.541Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-02-12T20:00:06.640Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-02-12T20:00:06.657Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-02-12T20:00:06.657Z] cleaning git checkout...
[vcs 2024-02-12T20:00:06.657Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-02-12T20:00:06.660Z] removing []
[vcs 2024-02-12T20:00:06.660Z] successfully cleaned git checkout!
[vcs 2024-02-12T20:00:06.662Z] TinderboxPrint:<a href='https://github.com/bhearsum/firefox-translations-training/commit/0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826' title='Built from firefox-translations-training commit 0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826'>0d686a1b21f9e8395ead4d22f4e2e0a8bcf5b826</a>
[task 2024-02-12T20:00:06.662Z] executing ['bash', '-c', '$VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst']
[task 2024-02-12T20:00:06.664Z] + set -euo pipefail
[task 2024-02-12T20:00:06.664Z] + dataset=news-crawl_news.2008
[task 2024-02-12T20:00:06.664Z] + lang=ru
[task 2024-02-12T20:00:06.664Z] + max_sent=10000
[task 2024-02-12T20:00:06.664Z] + output_path=/builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-12T20:00:06.664Z] + coef=0.1
[task 2024-02-12T20:00:06.664Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-12T20:00:06.664Z] + ARTIFACT_EXT=zst
[task 2024-02-12T20:00:06.664Z] + echo '###### Downloading monolingual data for language ru dataset news-crawl_news.2008'
[task 2024-02-12T20:00:06.664Z] ###### Downloading monolingual data for language ru dataset news-crawl_news.2008
[task 2024-02-12T20:00:06.664Z] ++ dirname /builds/worker/checkouts/vcs/pipeline/data/download-mono.sh
[task 2024-02-12T20:00:06.665Z] + cd /builds/worker/checkouts/vcs/pipeline/data
[task 2024-02-12T20:00:06.665Z] ++ dirname /builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-12T20:00:06.666Z] + tmp=/builds/worker/artifacts/original
[task 2024-02-12T20:00:06.666Z] + mkdir -p /builds/worker/artifacts/original
[task 2024-02-12T20:00:06.667Z] + echo '### Downloading dataset'
[task 2024-02-12T20:00:06.667Z] ### Downloading dataset
[task 2024-02-12T20:00:06.667Z] + original_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-12T20:00:06.667Z] + name=news.2008
[task 2024-02-12T20:00:06.667Z] + type=news-crawl
[task 2024-02-12T20:00:06.667Z] + test -s /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:06.667Z] + bash importers/mono/news-crawl.sh ru /builds/worker/artifacts/original/news-crawl_news.2008.original.ru news.2008
[task 2024-02-12T20:00:06.668Z] + set -euo pipefail
[task 2024-02-12T20:00:06.668Z] + lang=ru
[task 2024-02-12T20:00:06.668Z] + output_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-12T20:00:06.668Z] + dataset=news.2008
[task 2024-02-12T20:00:06.668Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-12T20:00:06.668Z] + ARTIFACT_EXT=zst
[task 2024-02-12T20:00:06.668Z] + WGET=wget
[task 2024-02-12T20:00:06.668Z] + echo '###### Downloading WMT newscrawl monolingual data'
[task 2024-02-12T20:00:06.668Z] ###### Downloading WMT newscrawl monolingual data
[task 2024-02-12T20:00:06.668Z] + wget -O - http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:06.668Z] + gunzip
[task 2024-02-12T20:00:06.668Z] + zstdmt -c
[task 2024-02-12T20:00:06.670Z] --2024-02-12 20:00:06--  http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:06.706Z] Resolving data.statmt.org (data.statmt.org)... 129.215.32.28
[task 2024-02-12T20:00:06.845Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:80... connected.
[task 2024-02-12T20:00:06.983Z] HTTP request sent, awaiting response... 301 Moved Permanently
[task 2024-02-12T20:00:06.983Z] Location: https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz [following]
[task 2024-02-12T20:00:06.983Z] --2024-02-12 20:00:06--  https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-12T20:00:07.123Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:443... connected.
[task 2024-02-12T20:00:07.552Z] HTTP request sent, awaiting response... 200 OK
[task 2024-02-12T20:00:07.552Z] Length: 2312968 (2.2M) [application/x-gzip]
[task 2024-02-12T20:00:07.552Z] Saving to: ‘STDOUT’
[task 2024-02-12T20:00:07.552Z] 
[task 2024-02-12T20:00:07.827Z]      0K .......... .......... .......... .......... ..........  2%  181K 12s
[task 2024-02-12T20:00:07.965Z]     50K .......... .......... .......... .......... ..........  4%  363K 9s
[task 2024-02-12T20:00:07.965Z]    100K .......... .......... .......... .......... ..........  6%  163M 6s
[task 2024-02-12T20:00:08.104Z]    150K .......... .......... .......... .......... ..........  8%  362K 6s
[task 2024-02-12T20:00:08.104Z]    200K .......... .......... .......... .......... .......... 11%  175M 4s
[task 2024-02-12T20:00:08.104Z]    250K .......... .......... .......... .......... .......... 13% 77.7M 4s
[task 2024-02-12T20:00:08.109Z]    300K .......... .......... .......... .......... .......... 15% 10.0M 3s
[task 2024-02-12T20:00:08.241Z]    350K .......... .......... .......... .......... .......... 17%  379K 3s
[task 2024-02-12T20:00:08.242Z]    400K .......... .......... .......... .......... .......... 19% 44.0M 3s
[task 2024-02-12T20:00:08.242Z]    450K .......... .......... .......... .......... .......... 22%  200M 2s
[task 2024-02-12T20:00:08.243Z]    500K .......... .......... .......... .......... .......... 24%  220M 2s
[task 2024-02-12T20:00:08.248Z]    550K .......... .......... .......... .......... .......... 26% 9.21M 2s
[task 2024-02-12T20:00:08.248Z]    600K .......... .......... .......... .......... .......... 28%  284M 2s
[task 2024-02-12T20:00:08.379Z]    650K .......... .......... .......... .......... .......... 30%  383K 2s
[task 2024-02-12T20:00:08.379Z]    700K .......... .......... .......... .......... .......... 33%  107M 2s
[task 2024-02-12T20:00:08.380Z]    750K .......... .......... .......... .......... .......... 35% 59.0M 2s
[task 2024-02-12T20:00:08.385Z]    800K .......... .......... .......... .......... .......... 37% 10.3M 1s
[task 2024-02-12T20:00:08.385Z]    850K .......... .......... .......... .......... .......... 39%  199M 1s
[task 2024-02-12T20:00:08.385Z]    900K .......... .......... .......... .......... .......... 42%  261M 1s
[task 2024-02-12T20:00:08.385Z]    950K .......... .......... .......... .......... .......... 44%  124M 1s
[task 2024-02-12T20:00:08.386Z]   1000K .......... .......... .......... .......... .......... 46%  288M 1s
[task 2024-02-12T20:00:08.391Z]   1050K .......... .......... .......... .......... .......... 48% 9.13M 1s
[task 2024-02-12T20:00:08.391Z]   1100K .......... .......... .......... .......... .......... 50%  260M 1s
[task 2024-02-12T20:00:08.391Z]   1150K .......... .......... .......... .......... .......... 53%  287M 1s
[task 2024-02-12T20:00:08.391Z]   1200K .......... .......... .......... .......... .......... 55%  229M 1s
[task 2024-02-12T20:00:08.392Z]   1250K .......... .......... .......... .......... .......... 57%  298M 1s
[task 2024-02-12T20:00:08.397Z]   1300K .......... .......... .......... .......... .......... 59% 9.20M 1s
[task 2024-02-12T20:00:08.516Z]   1350K .......... .......... .......... .......... .......... 61%  419K 1s
[task 2024-02-12T20:00:08.517Z]   1400K .......... .......... .......... .......... .......... 64%  155M 1s
[task 2024-02-12T20:00:08.517Z]   1450K .......... .......... .......... .......... .......... 66%  132M 0s
[task 2024-02-12T20:00:08.518Z]   1500K .......... .......... .......... .......... .......... 68% 75.2M 0s
[task 2024-02-12T20:00:08.523Z]   1550K .......... .......... .......... .......... .......... 70%  146M 0s
[task 2024-02-12T20:00:08.523Z]   1600K .......... .......... .......... .......... .......... 73% 9.87M 0s
[task 2024-02-12T20:00:08.523Z]   1650K .......... .......... .......... .......... .......... 75%  207M 0s
[task 2024-02-12T20:00:08.523Z]   1700K .......... .......... .......... .......... .......... 77%  182M 0s
[task 2024-02-12T20:00:08.524Z]   1750K .......... .......... .......... .......... .......... 79%  290M 0s
[task 2024-02-12T20:00:08.524Z]   1800K .......... .......... .......... .......... .......... 81%  272M 0s
[task 2024-02-12T20:00:08.529Z]   1850K .......... .......... .......... .......... .......... 84% 9.22M 0s
[task 2024-02-12T20:00:08.529Z]   1900K .......... .......... .......... .......... .......... 86%  345M 0s
[task 2024-02-12T20:00:08.529Z]   1950K .......... .......... .......... .......... .......... 88%  230M 0s
[task 2024-02-12T20:00:08.530Z]   2000K .......... .......... .......... .......... .......... 90%  296M 0s
[task 2024-02-12T20:00:08.530Z]   2050K .......... .......... .......... .......... .......... 92%  341M 0s
[task 2024-02-12T20:00:08.535Z]   2100K .......... .......... .......... .......... .......... 95% 9.21M 0s
[task 2024-02-12T20:00:08.535Z]   2150K .......... .......... .......... .......... .......... 97%  255M 0s
[task 2024-02-12T20:00:08.535Z]   2200K .......... .......... .......... .......... .......... 99%  200M 0s
[task 2024-02-12T20:00:08.536Z]   2250K ........                                              100%  286M=1.0s
[task 2024-02-12T20:00:08.536Z] 
[task 2024-02-12T20:00:08.536Z] 2024-02-12 20:00:08 (2.24 MB/s) - written to stdout [2312968/2312968]
[task 2024-02-12T20:00:08.536Z] 
[task 2024-02-12T20:00:08.598Z] + echo '###### Done: Downloading WMT newscrawl monolingual data'
[task 2024-02-12T20:00:08.598Z] ###### Done: Downloading WMT newscrawl monolingual data
[task 2024-02-12T20:00:08.598Z] + echo '### Sampling dataset'
[task 2024-02-12T20:00:08.598Z] ### Sampling dataset
[task 2024-02-12T20:00:08.598Z] + set +o pipefail
[task 2024-02-12T20:00:08.599Z] + zstdmt -dc /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:08.599Z] + perl -ne 'print if(split(/\s/, $_) < 100)'
[task 2024-02-12T20:00:08.599Z] + head -n 10000
[task 2024-02-12T20:00:08.599Z] ++ bc -l
[task 2024-02-12T20:00:08.599Z] + zstdmt
[task 2024-02-12T20:00:08.600Z] + shuf -n 11000
[task 2024-02-12T20:00:08.662Z] + set -o pipefail
[task 2024-02-12T20:00:08.662Z] + rm -rf /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-12T20:00:08.663Z] + echo '###### Done: Downloading monolingual data'
[task 2024-02-12T20:00:08.663Z] ###### Done: Downloading monolingual data
[taskcluster 2024-02-12 20:00:09.009Z] === Task Finished ===
[taskcluster 2024-02-12 20:00:09.200Z] Successful task run with exit code: 0 completed in 68.917 seconds
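
For readers skimming the log, the "Sampling dataset" step can be approximated in Python as below. This is only a sketch inferred from the trace (`coef=0.1`, `max_sent=10000`, `shuf -n 11000`, the `perl` filter on lines with fewer than 100 fields, `head -n 10000`), not the script itself. The function name `sample_sentences` and its parameters are hypothetical, and the order of the pipeline stages is an assumption, since `set -x` output does not reliably show pipe order. Compression with `zstdmt` is left out.

```python
import random


def sample_sentences(lines, max_sent, coef=0.1, max_tokens=100, seed=None):
    """Approximate the logged shuf | perl | head pipeline (hypothetical helper).

    Assumed stage order: oversample randomly, drop long lines, truncate.
    """
    rng = random.Random(seed)
    lines = list(lines)

    # The trace shows 10000 + 10000 * 0.1 = 11000 passed to `shuf -n`.
    oversample = round(max_sent + max_sent * coef)

    # `shuf -n N` emits at most N lines in random order.
    picked = rng.sample(lines, min(oversample, len(lines)))

    # The perl filter keeps lines with fewer than 100 whitespace-separated
    # fields; str.split() is a close but not byte-identical stand-in.
    short = [line for line in picked if len(line.split()) < max_tokens]

    # `head -n max_sent` keeps the first max_sent survivors.
    return short[:max_sent]
```

Oversampling by `coef` before filtering leaves some headroom, so that dropping long lines is less likely to leave fewer than `max_sent` sentences.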