[00:38:53] RECOVERY - Puppet errors on integration-cumin is OK: OK: Less than 1.00% above the threshold [0.0]
[02:05:37] hey there, here with the VOISE music platform, VOISE offers a free decentralized market for artists to promote and sell 100% of their work on their own terms
[02:06:13] just release Alpha 0.1, has a ways to go still but it's currently the best time to get some VOISE before it goes big in the next few months time
[02:06:21] sitting at it's base right now
[02:06:36] livecoin.net
[02:06:44] BTC to VOISE if you buy some
[02:07:12] VOISE is here to stay, only going to be much larger then what it is now in the next few years
[02:08:19] even Janet Jackson knows about it already, she put 1 free track up on their Alpha platform
[02:08:39] ain't shit right now, but everything starts from somewhere and that's where you want to be in
[02:08:41] at the bottom
[02:10:43] if you think it will only die, that it's that bad of an idea, that it doesn't have a market then I'd say don't buy it
[02:10:49] but it has a market
[02:11:40] but if you think it's going to be around in the next few years, I'd say buy a few and stash them away, forget about them for a few years
[02:12:01] that way you don't make any stupid moves in the mean time
[02:12:41] my dad told me I was an idiot for buying dogecoin
[02:13:00] I told him it was going to be around in the next few years
[02:14:04] that's why I bought it, I didn't think it would die
[02:15:07] just remember you have to go BTC to VOISE as there is no volume on the other pairs
[02:15:25] c01nwarr10r[HD], hi
[02:15:59] how goes it
[02:16:06] not so bad
[02:16:17] but stop evangelising whatever you're evangelising, thanks
[02:17:10] I think it's evangelizing with a zed
[02:17:14] isn't it
[02:17:25] only if you're american
[02:17:34] Canadian
[02:17:43] even worse
[02:17:44] Zed not Zee
[02:18:08] so you don't like the investment I offer to you
[02:18:14] after all my hard work
[02:18:18] my research
[02:18:29] correct, please stop going channel-to-channel on freenode talking about it
[02:19:02] well, each to their own
[02:19:16] maybe you should stop telling people to stop
[02:19:23] how about that
[02:19:30] and buy some VOISE
[02:19:41] support a great idea
[03:58:25] Yippee, build fixed!
[03:58:26] Project selenium-MultimediaViewer » firefox,mediawiki,Linux,BrowserTests build #555: FIXED in 2 min 25 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=mediawiki,PLATFORM=Linux,label=BrowserTests/555/
[04:16:47] Yippee, build fixed!
[04:16:47] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #555: FIXED in 20 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/555/
[05:25:41] Release-Engineering-Team (Kanban), Release Pipeline, Patch-For-Review: Define new Jenkins pipeline for container build phase - https://phabricator.wikimedia.org/T175297#3702508 (Joe) As a suggestion: I would host your own registry under the CI project in labs for testing/managing local build you migh...
[06:13:47] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[06:38:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[06:57:19] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[07:13:49] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[07:35:05] Beta-Cluster-Infrastructure, ORES, Scoring-platform-team: ORESFetchScoreJob: RuntimeException No model available for [goodfaith] - https://phabricator.wikimedia.org/T178792#3702691 (hashar)
[07:56:19] (CR) Hashar: [C: 2] Add tox-jessie tests for mediawiki/tools/cookiecutter-library [integration/config] - https://gerrit.wikimedia.org/r/385817 (https://phabricator.wikimedia.org/T178727) (owner: MarcoAurelio)
[07:58:42] (Merged) jenkins-bot: Add tox-jessie tests for mediawiki/tools/cookiecutter-library [integration/config] - https://gerrit.wikimedia.org/r/385817 (https://phabricator.wikimedia.org/T178727) (owner: MarcoAurelio)
[08:48:50] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[08:50:06] (PS2) Hashar: Castor for Docker jobs [integration/config] - https://gerrit.wikimedia.org/r/385390
[08:50:21] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[08:55:42] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[09:02:02] PROBLEM - Puppet errors on deployment-trending01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:02:25] Continuous-Integration-Config, User-MarcoAurelio: Configure CI to run tox jobs for mediawiki/tools/cookiecutter-library - https://phabricator.wikimedia.org/T178727#3702785 (MarcoAurelio) Open>Resolved Done: tox-jessie jobs ran at https://gerrit.wikimedia.org/r/#/c/385948/ and if my interpretation...
[09:26:07] (PS3) Hashar: Castor for Docker jobs [integration/config] - https://gerrit.wikimedia.org/r/385390
[09:26:09] (PS1) Hashar: docker: add python3 to wmfreleng/tox image [integration/config] - https://gerrit.wikimedia.org/r/385950
[09:26:17] docker push wmfreleng/tox:v2017.10.23.09.05 && docker push wmfreleng/tox:latest - https://gerrit.wikimedia.org/r/385950
[09:26:21] !log docker push wmfreleng/tox:v2017.10.23.09.05 && docker push wmfreleng/tox:latest - https://gerrit.wikimedia.org/r/385950
[09:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:26:34] (CR) Hashar: "docker push wmfreleng/tox:v2017.10.23.09.05 && docker push wmfreleng/tox:latest" [integration/config] - https://gerrit.wikimedia.org/r/385950 (owner: Hashar)
[09:26:46] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[09:29:22] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[09:29:34] !log fab docker_pull_image:wmfreleng/tox
[09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:44:09] Release-Engineering-Team (Kanban), Readers-Web-Backlog, RelatedArticles, Browser-Tests, and 4 others: Automated browser tests cannot create pages on the Beta Cluster as anonymous user in RelatedArticles tests - https://phabricator.wikimedia.org/T176315#3702865 (hashar) ``` name=wmf-config/Initial...
[09:56:22] (PS2) Hashar: docker: add python3 to wmfreleng/tox image [integration/config] - https://gerrit.wikimedia.org/r/385950
[09:57:14] (PS4) Hashar: Castor for Docker jobs [integration/config] - https://gerrit.wikimedia.org/r/385390
[09:57:46] (CR) Hashar: "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385950 (owner: Hashar)
[09:57:53] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[10:21:38] (PS5) Hashar: Castor for Docker jobs [integration/config] - https://gerrit.wikimedia.org/r/385390
[10:27:41] (CR) Hashar: [C: -2] "While trying to copy the cache from the Docker host instance to castor, the generated command is:" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[10:46:41] Browser-Tests-Infrastructure, Release-Engineering-Team (Kanban), Developer-Relations (Oct-Dec 2017), User-zeljkofilipin: WebdriverIO tech talk - https://phabricator.wikimedia.org/T171852#3702961 (zeljkofilipin)
[12:36:10] (CR) Hashar: [C: 2] docker: add python3 to wmfreleng/tox image [integration/config] - https://gerrit.wikimedia.org/r/385950 (owner: Hashar)
[12:37:17] (Merged) jenkins-bot: docker: add python3 to wmfreleng/tox image [integration/config] - https://gerrit.wikimedia.org/r/385950 (owner: Hashar)
[14:08:27] (PS1) Mholloway: Revert "Provide Android SDK location as an argument to non-periodic test scripts" [integration/config] - https://gerrit.wikimedia.org/r/385975
[14:09:00] Release-Engineering-Team (Kanban), Readers-Web-Backlog, RelatedArticles, Browser-Tests, and 4 others: Automated browser tests cannot create pages on the Beta Cluster as anonymous user in RelatedArticles tests - https://phabricator.wikimedia.org/T176315#3703299 (Jdlrobson) The title may be a littl...
[14:17:22] (CR) Hashar: [C: 2] Revert "Provide Android SDK location as an argument to non-periodic test scripts" [integration/config] - https://gerrit.wikimedia.org/r/385975 (owner: Mholloway)
[14:19:39] (Merged) jenkins-bot: Revert "Provide Android SDK location as an argument to non-periodic test scripts" [integration/config] - https://gerrit.wikimedia.org/r/385975 (owner: Mholloway)
[14:58:23] (PS1) Gehel: Publish full maven site [integration/config] - https://gerrit.wikimedia.org/r/385984
[15:01:55] (PS2) Gehel: Publish full maven site [integration/config] - https://gerrit.wikimedia.org/r/385984
[15:02:57] (CR) jerkins-bot: [V: -1] Publish full maven site [integration/config] - https://gerrit.wikimedia.org/r/385984 (owner: Gehel)
[15:10:09] PROBLEM - Puppet errors on deployment-cache-upload04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:21:51] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:34:04] PROBLEM - Free space - all mounts on deployment-kafka01 is CRITICAL: CRITICAL: deployment-prep.deployment-kafka01.diskspace.root.byte_percentfree (<100.00%)
[15:34:45] (PS6) Hashar: Castor for Docker jobs [integration/config] - https://gerrit.wikimedia.org/r/385390
[15:34:48] (PS1) Hashar: docker: fix pip cache permissions [integration/config] - https://gerrit.wikimedia.org/r/385993
[15:35:31] (CR) Hashar: "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385993 (owner: Hashar)
[15:41:27] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[15:41:48] (PS1) Giuseppe Lavagetto: Convert ci-jessie to use docker-pkg [integration/config] - https://gerrit.wikimedia.org/r/385996
[15:49:53] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[16:00:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[16:01:22] (CR) Hashar: [C: -2] "check experimental" [integration/config] - https://gerrit.wikimedia.org/r/385390 (owner: Hashar)
[16:03:38] (PS3) Gehel: Publish full maven site [integration/config] - https://gerrit.wikimedia.org/r/385984
[16:07:38] (CR) Hashar: [C: -2] "I think it is mostly fine to be done with follow up change: https://gerrit.wikimedia.org/r/#/c/385390/" [integration/config] - https://gerrit.wikimedia.org/r/385993 (owner: Hashar)
[16:10:49] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[16:19:54] Beta-Cluster-Infrastructure, Multimedia: Reimage deployment-tmh01 with Debian Jessie - https://phabricator.wikimedia.org/T174477#3563286 (MarkTraceur) While this is related to TMH, nobody on our team has any idea or expertise related to this task - maybe opsen should take a look?
[16:36:48] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [10.0]
[16:40:23] Browser-Tests-Infrastructure, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Wikidata, and 3 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3703832 (zeljkofilipin) Videos are here: https://integration.wikimedia.org/ci/vi...
[16:45:08] Release-Engineering-Team (Kanban), Wikimedia-Blog-Content: [Technical Debt Series]What is Technical Debt - https://phabricator.wikimedia.org/T175181#3703851 (Jrbranaa) Due to have it up for review 20171023.
[16:46:56] Browser-Tests-Infrastructure, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Wikidata, and 3 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3703857 (zeljkofilipin)
[16:47:15] MediaWiki-Releasing, Release-Engineering-Team (Kanban), MW-1.29-release-notes, Patch-For-Review: Include release extensions/skins/vendor as submodules of core - https://phabricator.wikimedia.org/T137564#3703859 (demon) p:High>Lowest
[16:49:04] Release-Engineering-Team (Kanban): Identify Orphaned components/code - https://phabricator.wikimedia.org/T173349#3703862 (Jrbranaa) @Aklapper at this point the scope is what's listed at https://www.mediawiki.org/wiki/Developers/Maintainers. However this is the time to expand (or contract) as necessary. Orp...
[16:49:24] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.31.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T174360#3703863 (demon) Open>Resolved
[16:49:51] no_justification hi, could you add me to https://gerrit-review.googlesource.com/#/admin/groups/819ed1064786ed5c11fc9a1fe617b0103fd18d03 please? :)
[16:50:11] you told me to remind you on monday :). (also i gave correct link, excluded uuid- from the url)
[16:56:10] any thoughts on what this is?
[16:56:10] 16:55:33 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default', 'fetch', '--refresh-config'] on deployment-parsoid09.deployment-prep.eqiad.wmflabs returned [70]: http://deployment-tin.deployment-prep.eqiad.wmflabs/parsoid/deploy/.git
[16:57:22] Release-Engineering-Team (Watching / External), Electron-PDFs, Operations, Proton, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3703879 (greg)
[16:58:51] Release-Engineering-Team (Next), Release Pipeline: Define pipeline failure developer feedback - https://phabricator.wikimedia.org/T177868#3703887 (greg)
[16:59:31] Release-Engineering-Team (Backlog), Jenkins: Allow users to view build history in jenkins - https://phabricator.wikimedia.org/T177827#3703890 (greg)
[16:59:44] Continuous-Integration-Infrastructure, Release-Engineering-Team (Backlog), Operations, Jenkins: Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#3703892 (greg)
[17:00:10] Continuous-Integration-Infrastructure (shipyard), Release-Engineering-Team (Kanban), Release Pipeline: On CI, upgrade docker-ce from 17.06.2 to 17.09.0 - https://phabricator.wikimedia.org/T177499#3703896 (greg)
[17:00:59] Gerrit, Release-Engineering-Team (Backlog): Update gerrit to 2.15 - https://phabricator.wikimedia.org/T177201#3703899 (greg)
[17:01:18] Gerrit, Release-Engineering-Team (Backlog), Operations: Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3703903 (greg)
[17:01:31] Release-Engineering-Team (Watching / External), Wikidata, Patch-For-Review, User-Addshore: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948#3703907 (greg)
[17:06:32] Continuous-Integration-Config, Release-Engineering-Team (Backlog), Composer: Composer failed in Selenium job but job didn't stop - https://phabricator.wikimedia.org/T177047#3703933 (greg)
[17:06:52] Beta-Cluster-Infrastructure, Release-Engineering-Team (Watching / External), MediaWiki-Parser, Parsing-Team, Readers-Web-Backlog (Tracking): Templates rendering as links on beta cluster - https://phabricator.wikimedia.org/T173576#3703935 (greg)
[17:16:08] Gerrit, Release-Engineering-Team (Backlog), Operations: Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3703944 (Dzahn) p:Lowest>Low Moving up from Lowest to Low, we are getting closer since the latest systemd improvements on cobalt :)
[17:22:20] Release-Engineering-Team (Watching / External), Deployments, Operations, HHVM, Performance-Team (Radar): Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3703958 (demon) We've been running the reusable TC in beta for over 2 m...
[17:22:31] (CR) Legoktm: "Can we add a branch filter to this? REL1_30 e.g. started failing due to https://gerrit.wikimedia.org/r/#/c/386020/1, I suspect older branc" [integration/jenkins] - https://gerrit.wikimedia.org/r/385405 (https://phabricator.wikimedia.org/T166759) (owner: Ori.livneh)
[17:22:43] no_justification are you in https://gerrit-review.googlesource.com/#/admin/groups/9792c6b6b5bd8d96fc472bdde5d4e3e651c63248 ? which would explain why you can't reach https://gerrit-review.googlesource.com/#/admin/groups/819ed1064786ed5c11fc9a1fe617b0103fd18d03 :)
[17:23:08] The latter URL works, the former does not
[17:23:30] ah ok. Hmm.
[17:23:53] could you add me to the former one please? Need to get luca to move me to its-phabricator after.
[17:25:22] ah thanks
[17:25:28] Done in the latter yeah
[17:25:31] Former won't load for me :p
[17:25:49] thanks :)
[17:26:27] no_justification sorry for ping, just letting you know im going to merge https://gerrit-review.googlesource.com/#/c/plugins/its-phabricator/+/133790/ (breaking change for us) but will improve setup for new users + you can revoke tokens now improving security :)
[17:26:48] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[17:35:37] it's merged now with upgrade doc heh
[17:36:04] stable-2.14 branch created.
[17:36:56] Gerrit, Patch-For-Review: Replace using certificates with tokens when using its-phabricator - https://phabricator.wikimedia.org/T178385#3704020 (Paladox) Change has been merged now.
[17:37:14] Gerrit, Patch-For-Review: Replace using certificates with tokens when using its-phabricator - https://phabricator.wikimedia.org/T178385#3704021 (Paladox)
[17:37:19] Gerrit, Release-Engineering-Team (Backlog), Patch-For-Review: Update gerrit to 2.14.5.1 - https://phabricator.wikimedia.org/T156120#3704022 (Paladox)
[17:56:08] Here's my demo on custom themes in polygerrit http://recordit.co/TtCzphuIlo :).
[17:57:15] Continuous-Integration-Infrastructure (shipyard): Document minimum required version of docker to build CI images - https://phabricator.wikimedia.org/T178821#3704114 (Legoktm)
[18:07:15] im going to merge https://gerrit-review.googlesource.com/#/c/plugins/its-phabricator/+/98613/ (not breaking change) but adds a new feature + some cleanup.
[18:09:07] it's been tested so doesn't break anything.
[18:12:50] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[18:35:00] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[18:39:04] PROBLEM - Puppet errors on deployment-aqs02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:39:22] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:44:49] PROBLEM - Puppet errors on deployment-aqs03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:48:40] poor aqs, this is my fault
[19:24:08] no_justification i have preparation patches for the switch from certificates to tokens https://gerrit.wikimedia.org/r/#/c/384901/ :)
[19:24:23] https://gerrit.wikimedia.org/r/#/c/384902/
[19:27:40] Continuous-Integration-Config, Goal, I18n, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), and 2 others: Configure banana checker for i18n files to run on all MediaWiki extensions and skins - https://phabricator.wikimedia.org/T94547#3704428 (Umherirrender)
[20:01:27] (PS1) Hashar: castor in a container [integration/config] - https://gerrit.wikimedia.org/r/386053
[20:08:10] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Wikipedia' not found on 'https://en.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 327 bytes in 0.016 second response time
[20:08:32] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 327 bytes in 0.019 second response time
[20:10:30] it's down
[20:11:48] "502 Bad Gateway"
[20:13:57] !log Puppet run still failing on Beta cluster varnish: "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::prometheus::varnish_exporter"
[20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:14:13] ^ There is the problem.
[20:15:00] https://github.com/wikimedia/puppet/commit/aa887fa1a3502834ff8ee4a0b75c408c172b1536
[20:15:04] Krinkle ^^
[20:15:13] it's been renamed
[20:15:17] to profile::prometheus::varnish_exporter
[20:15:40] Yeah, but there is 0 matches for "role::prometheus::varnish_exporter" in puppet.git
[20:15:43] so why would that be an error
[20:15:47] Maybe a local patch is still using it
[20:16:18] but what about if it's included in horizon and not in puppet.git?
[20:17:09] I don't think we use horizon overrides for beta cluster
[20:17:20] Those are puppetised afaik, but maybe not
[20:21:05] yeah, not being used in horizon.
[20:22:16] ok
[20:23:25] https://tools.wmflabs.org/openstack-browser/puppetclass/role::prometheus::varnish_exporter
[20:23:28] Krinkle ^^
[20:23:34] deployment-cache-text04.deployment-prep.eqiad.wmflabs is using it
[20:25:26] paladox: Yeah, but that's presumably indirectly. The only direct adds are " role::cache::text" and "role::beta::availability_collector"
[20:25:35] oh
[20:25:38] It's showing it in the os-browser because puppet hasn't run there recently because of this error
[20:25:45] ah i see
[20:27:05] Hm.. might be applied globally from elsewhere
[20:27:07] * Krinkle tries something
[20:28:39] !log Edit horizon "Other classes" config for deployment-prep/deployment-cache-text04. Rename role::prometheus::varnish_exporter to profile::prometheus::varnish_exporter
[20:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:29:16] !log Previous edit failed. Horizon saved the field as blank. Presumably because the class is unknown in the current version of puppet manifests it has. Strange that it normalises in this way.
[20:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:29:38] !log Puppet still failing, now with: "Error 400 on SERVER: Could not find data item cache::fe_transient_gb in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/cache/text.pp:12 on node deployment-cache-text04.deployment-prep.eqiad.wmflabs"
[20:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:30:09] no_justification: greg-g: OK. Took an initial stab just because I noticed the issue and reported it 3 days ago as well. but letting someone else try, don't know it well enough :)
[20:30:30] Probably local patches doing things no longer compatible with latest version of operations/puppet, but in a way that didn't cause merge conflicts.
[20:31:47] Krinkle i guess set cache::fe_transient_gb to 5?
[20:31:56] https://github.com/wikimedia/puppet/blob/88df633b4305ceb79f1db137ec77f8f07f0755c8/hieradata/role/common/cache/text.yaml#L79
[20:32:42] Beta-Cluster-Infrastructure: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Ryasmeen)
[20:36:08] Krinkle or set it to 0
[20:36:12] which used to be the default
[20:36:17] before https://github.com/wikimedia/puppet/commit/48c8680d443c2a609b262ad682a3be85d30a4fa3
[20:36:34] paladox: The logic is not the problem. What you see in puppet.git is working fine
[20:36:39] otherwise wikipedia.org would be down right now
[20:36:41] which it isn't.
[20:36:58] oh, but profile doesn't set the default any more
[20:37:30] for hiera
[20:37:31] so if someone set it for prod but did not for labs, it would break in labs :).
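The failure described above comes down to a lookup that used to carry a default and no longer does. The following is a toy shell model of that behaviour, not Puppet or Hiera code: the `lookup` function, the `REALM` variable and the production value 5 are hypothetical stand-ins chosen for illustration.

```shell
# Toy model of a Hiera-style lookup (NOT real Puppet/Hiera code).
# Usage: lookup KEY [DEFAULT]
# Prints the value of KEY for the current realm. When the key is missing,
# it prints DEFAULT if one was passed and fails otherwise.
lookup() {
    key=$1
    # Hypothetical data: the key is only defined for the production realm.
    case "${REALM:-labs}:$key" in
        production:cache::fe_transient_gb) echo 5; return 0 ;;
    esac
    if [ "$#" -ge 2 ]; then
        echo "$2"    # old behaviour: fall back to the supplied default
        return 0
    fi
    echo "Could not find data item $key and no default supplied" >&2
    return 1
}
```

With `REALM=labs`, `lookup cache::fe_transient_gb 0` prints the fallback value, while `lookup cache::fe_transient_gb` fails. That is the pattern the log describes: the key is set for production but not for labs, so dropping the default only breaks labs.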
[20:41:02] Project selenium-Echo » chrome,beta,Linux,BrowserTests build #556: FAILURE in 1.8 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/556/
[20:41:03] Project selenium-Echo » firefox,beta,Linux,BrowserTests build #556: FAILURE in 1.5 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/556/
[20:41:35] failing because it can't reach beta.
[20:43:42] $fe_transient_gb = hiera('cache::fe_transient_gb', 0) is now $fe_transient_gb = hiera('cache::fe_transient_gb'), (no defaults now)
[20:44:18] cache::fe_transient_gb is set in hieradata/role/common/cache/canary.yaml and hieradata/role/common/cache/text.yaml but won't be applied to labs.
[20:57:08] RECOVERY - Free space - all mounts on integration-slave-jessie-android is OK: OK: integration.integration-slave-jessie-android.diskspace._mnt.byte_percentfree (No valid datapoints found)
[20:59:15] hello! https://integration.wikimedia.org/ci/computer/integration-slave-jessie-android/ is marked offline, and consequently the android app's periodic tests aren't running. it looks like it went offline because of an accumulation of screenshots in the /tmp folder. i've cleared the screenshots but i'm not sure how to bring the instance back online. do i have to disconnect first, along the lines of the instructions at
[20:59:15] https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Troubleshooting#Jenkins_executioner_lock?
[21:00:48] (i guess hashar isn't in this channel, i'll ping him in -mobile)
[21:01:39] hasharDinner ^^
[21:01:53] mdholloway: but hey that is horrible! :)
[21:01:59] ah, forgot to check up in the 'voiced' section again :[
[21:02:12] try: hash :]
[21:02:38] true, that would have been a good technique as well!
[21:04:12] mdholloway: I think we went with setting up TMPDIR and making sure the job delete it after it has completed
[21:04:37] maybe before launch one can:
[21:04:45] TMPDIR=$WORKSPACE/tmp
[21:04:49] rm -fR $TMPDIR
[21:04:51] export TMPDIR
[21:04:55] hasharDinner: seems they're not being deleted. i had to clear them once before, too, on august 24
[21:04:59] ./whateverbuildcommand
[21:05:42] mdholloway: and while you are around, I don't think android-26 work after all :(
[21:06:06] oh no, what's the trouble?
[21:06:13] not sure maybe it was just me
[21:06:22] I touched that job like a week or so ago
[21:06:36] and I think I had to put it back to android-25
[21:07:03] ah, i hadn't looked didn't know it was back on 25
[21:07:14] hmm no
[21:07:14] https://integration.wikimedia.org/ci/job/apps-android-wikipedia-periodic-test/1996/consoleFull
[21:07:18] is android-26
[21:07:19] :)
[21:07:47] qemu-system-i386: goldfish_battery_read: Bad offset 000000000000002c
[21:07:48] bah
[21:07:52] hasharDinner: Who can address beta varnish being down?
[21:08:19] Krinkle: #wikimedia-traffic I guess? You can check the systemctl output for the various varnish
[21:08:22] systemctl --failed
[21:08:35] most probably a vcl refuses to compile
[21:08:42] puppet is failing.
[21:08:47] Shouldn't the topic be updated to say that beta cluster is down?
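A minimal sketch of the TMPDIR idea suggested at 21:04, written as a POSIX shell function. The function name `run_with_private_tmpdir` is made up for illustration. The `mkdir -p` and the cleanup after the build are additions that the snippet in the log does not include. `WORKSPACE` is assumed to be set by Jenkins and falls back to the current directory here.

```shell
# Run a command with TMPDIR pointing at a fresh directory under the
# workspace, then delete that directory whether or not the command
# succeeded. The function body is a subshell, so the exported TMPDIR
# does not leak into the caller's environment.
run_with_private_tmpdir() (
    TMPDIR="${WORKSPACE:-$PWD}/tmp"
    rm -rf "$TMPDIR"                # drop leftovers from an earlier run
    mkdir -p "$TMPDIR" || exit 1    # tools such as mktemp need it to exist
    export TMPDIR
    status=0
    "$@" || status=$?               # remember the build's exit status
    rm -rf "$TMPDIR"                # clean up screenshots and other temp files
    exit "$status"
)
```

A job step could then call `run_with_private_tmpdir ./whateverbuildcommand`. Temporary files only land in the workspace if the build tooling honours `TMPDIR`, which is an assumption here.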
[21:08:49] ● varnish-frontend.service loaded failed failed varnish-frontend (Varnish HTTP Accelerator)
[21:08:49] ● varnish.service loaded failed failed varnish (Varnish HTTP Accelerator)
[21:09:00] Error 400 on SERVER: Could not find data item cache::fe_transient_gb in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/cache/text.pp:12 on node deployment-cache-text04.deployment-prep.eqiad.wmflabs"
[21:09:15] https://github.com/wikimedia/puppet/commit/48c8680d443c2a609b262ad682a3be85d30a4fa3
[21:09:21] cache::fe_transient_gb: 0
[21:09:24] should fix that
[21:10:17] Incompatible VMOD netmapper
[21:10:18] File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_netmapper.so
[21:10:18] ('/etc/varnish/wikimedia_text-frontend.vcl' Line 6 Pos 8)
[21:10:18] import netmapper;
[21:10:44] oh, wasn't varnish updated to 5?
[21:11:30] Not related.
[21:13:24] Krinkle: so yeah #wikimedia-traffic
[21:13:44] the errors are in the journal: sudo journalctl -u varnish
[21:14:23] Incompatible VMOD vslp
[21:14:23] File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_vslp.so
[21:15:12] andrewbogott: ^
[21:15:13] and puppet got broken 14 days ago or so
[21:15:16] so that surely does not help
[21:17:25] paladox: thx https://gerrit.wikimedia.org/r/386077 beta: set cache::fe_transient_gb: 0
[21:17:25] .
[21:17:59] you're welcome :)
[21:19:28] Error 400 on SERVER: Could not find class role::prometheus::varnish_exporter for deployment-cache-text04.deployment-prep.eqiad.wmflabs on node deployment-cache-text04.deployment-prep.eqiad.wmflabs
[21:19:28] bah
[21:19:54] that should be fixed
[21:19:58] earlier by Krinkle
[21:20:09] which got us to the new error that your patch should fix :)
[21:20:55] well it is still included somehow
[21:20:55] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:20:57] got renamed to a profile
[21:21:15] hasharDinner: Krinkle tried to change it but then reverted his change, I'm trying to sort that piece out now
[21:21:25] ohh
[21:21:33] Gerrit, Release-Engineering-Team (Backlog), Zuul: Update zuul to upstream master - https://phabricator.wikimedia.org/T158243#3704850 (Paladox) I think the 2.14 update will be done after this is complete, we only need to backport https://github.com/openstack-infra/zuul/commit/c7370284742f2ed5c136d0e5f...
[21:21:37] andrewbogott: I guess that is set via Horizon and the host regex
[21:21:54] hasharDinner: not according to Krinkle
[21:21:57] but anyway, I'll try to sort it
[21:21:57] Beta-Cluster-Infrastructure: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (greg) Investigation happening in -releng irc channel.
[21:23:32] this is strange, someone set up a deployment-cache-text regex [21:23:38] but then applied changes to the host instead of to the prefix [21:23:42] so it's all jumbled [21:23:55] I'm going to move things to the prefix [21:27:59] andrewbogott: well the instance has role::prometheus::varnish_exporter [21:28:14] which can be removed since that role no more exist [21:28:44] it's replaced with a profile [21:28:55] https://github.com/wikimedia/puppet/commit/aa887fa1a3502834ff8ee4a0b75c408c172b1536 [21:35:00] andrewbogott: that might have fixed it [21:35:05] next error is cache::be_transient_gb [21:35:08] which I fixed [21:35:12] yeah it is running :) [21:35:18] well at least the catalog compiled [21:35:21] I think there's more to do still... [21:35:32] the new profile takes an arg which the role didn't, I don't know what should be passed in there [21:35:51] and letsencrypt fails [21:36:29] couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/ [21:36:32] couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/XXXX [21:36:33] :) [21:36:45] I've moved everything to the profile now, so if you make changes please make them there [21:36:52] so letsencrypt fails [21:36:53] nginx fails [21:37:00] and thus varnish manifests can be updated [21:37:04] and varnish does not start :( [21:37:08] OK, probably the prometheus bit is ok now [21:41:15] I don't have any particular opinion/knowledge about LE so I'm going to leave this to those who care (or know something) [21:41:28] also varnish got upgraded [21:41:31] to varnish 5 [21:41:49] but the libvmod are nota vailable or not compatible [21:41:51] bha [21:43:26] andrewbogott: I am afraid I am gonna give up it is almost midnight there [21:43:27] https://github.com/wikimedia/puppet/commit/d70eb56fc44caf6c8778fff1a344d2346571a2e1 [21:43:39] I have no clue why varnish got upgraded [21:43:44] most probably to test it out [21:44:08] but right now the situation is that puppet fails generating letsencrypt 
because it has to reach http://beta.wmflabs.org/ [21:44:15] which is served by varnish but varnish is down [21:44:24] see https://github.com/wikimedia/puppet/commit/d70eb56fc44caf6c8778fff1a344d2346571a2e1 [21:44:49] i think this may be fixed by setting profile::cache::base::varnish_version to 5 [21:44:50] paladox: yup [21:44:57] since 4 installs libvmod-vslp [21:45:09] then the VCL are probably out of date [21:45:27] it might be easier to just spawn a new instance [21:45:31] is vcl installed by libvmod-vslp? [21:45:32] yeh [21:45:35] but really I have been awake for too long [21:46:03] ok [21:47:05] Krinkle: so basically it is broken [21:47:34] and at this hour I have no idea how to fix. There are too many moving bits :\ [21:48:08] greg-g: ^^ [21:48:28] brb (kids) [21:58:02] Project selenium-PageTriage » chrome,beta,Linux,BrowserTests build #553: FAILURE in 1.4 sec: https://integration.wikimedia.org/ci/job/selenium-PageTriage/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/553/ [21:58:02] Project selenium-PageTriage » firefox,beta,Linux,BrowserTests build #553: FAILURE in 1.8 sec: https://integration.wikimedia.org/ci/job/selenium-PageTriage/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/553/ [22:00:53] RECOVERY - Puppet errors on deployment-ms-be04 is OK: OK: Less than 1.00% above the threshold [0.0] [22:02:45] andrewbogott: were you able to get anything? maybe worth a quick summary on the task to get help from others? https://phabricator.wikimedia.org/T178841 [22:05:03] Beta-Cluster-Infrastructure: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Andrew) I've only looked at deployment-cache-text04. Puppet had been broken there for ages due to several puppet changes. Horizon has a prefix defined for deployment-cache-text* but there were also some p... 
[22:05:09] updated [22:08:31] Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704963 (greg) Adding #operations to ask for assistance with diagnosing/resolving this. [22:08:40] andrewbogott: thanks [22:13:45] Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (hashar) Since varnish is down, Puppet fails to trigger the letsencrypt certificate renewal since that attempts to reach http://beta.wmflabs.org/ :/ [22:15:58] huh [22:16:03] * paladox tries to get someone to merge https://gerrit-review.googlesource.com/#/c/plugins/its-base/+/108215/ :). [22:17:02] oh, you guys let puppet break again? [22:17:24] andrewbogott: is cloud services firewalling of port 80 ? [22:17:35] It hasn't been running on beta for 2 weeks. Not sure what exactly made it fail today [22:18:11] hasharDinner: I don't understand the question [22:18:37] andrewbogott: no worries I found my mistake :) [22:18:40] ok :) [22:19:22] Krinkle, what, puppet hasn't been running? 
[22:19:32] Krenair: failed to compile as of 13 days ago [22:19:55] I mean, it's trying, but the puppetmaster itself is failing to produce [22:20:02] Project selenium-CentralAuth » firefox,beta,Linux,BrowserTests build #558: FAILURE in 1.7 sec: https://integration.wikimedia.org/ci/job/selenium-CentralAuth/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/558/ [22:20:07] so the theoretically unaffected clients don't get anything either [22:20:10] (non-varnish servers) [22:20:23] not 100% sure about that though [22:20:42] hm [22:20:43] Incompatible VMOD vslp [22:20:43] File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_vslp.so [22:20:43] VMOD version 3.2 [22:20:43] varnishd version 6.0 [22:20:48] andrewbogott: I got nginx listening on port 80 on deployment-cache-upload04 and that should serve http://beta.wmflabs.org/ but I get a connection refused [22:20:56] krenair@deployment-cache-text04:~$ apt-cache search libvmod_vslp [22:20:56] krenair@deployment-cache-text04:~$ dpkg -S /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_vslp.so [22:20:56] libvmod-vslp: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_vslp.so [22:20:56] krenair@deployment-cache-text04:~$ apt-cache policy libvmod-vslp [22:20:57] libvmod-vslp: [22:20:58] Installed: 0.1-1wm [22:21:00] Candidate: 0.1-1wm [22:21:02] Version table: [22:21:04] *** 0.1-1wm 0 [22:21:06] 1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/experimental amd64 Packages [22:21:10] 1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages [22:21:12] 100 /var/lib/dpkg/status [22:21:14] krenair@deployment-cache-text04:~$ [22:21:50] hasharDinner: it's quite likely that that host is running a puppetized firewall. But of course also all VMs are fully firewalled by default unless the security policy explicitly opens a port. 
[22:22:09] krenair@deployment-cache-text04:~$ sudo lsof -i :80 [22:22:09] krenair@deployment-cache-text04:~$ [22:23:05] tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 23726/nginx -g daem [22:23:29] andrewbogott: yeah I checked both. iptables is disabled and the instance has ALLOW 80:80/tcp from 0.0.0.0/0 [22:23:44] * andrewbogott out for now [22:24:13] I don't see that on netstat hasharDinner [22:25:43] -> -traffic [22:30:38] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704999 (hashar) Puppet fails with: Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server. /usr/local/sbin/acme_tiny.py --account-ke... [22:33:13] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47325 bytes in 0.820 second response time [22:33:33] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 35393 bytes in 0.877 second response time [22:33:50] Domain not configured [22:37:02] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705021 (hashar) @Krenair / @BBlack are looking into it. They both know about Letsencrypt/Varnish. [22:38:03] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705026 (Paladox) If it uses ferm, it will not take notice of security group settings. I found that out with jenkins-slave-01 and other instances. try this sudo iptables -A INPUT -p tc... 
[22:44:10] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: Connection refused [22:44:32] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: Connection refused [22:44:56] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Dzahn) > sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT please don't. it will just conflict / be reverted by ferm or ferm service will be stopped leading to more manual thi... [22:49:13] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47329 bytes in 0.781 second response time [22:49:33] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 35375 bytes in 1.055 second response time [22:51:19] PROBLEM - Puppet errors on deployment-mediawiki05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:51:47] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [22:57:50] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:03:59] Beta-Cluster-Infrastructure, Operations, Puppet, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#3705080 (hashar) [23:04:03] Beta-Cluster-Infrastructure, Goal, Patch-For-Review, Puppet: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#3705076 (hashar) Open>Resolved a:yuvipanda It has been nicely cleaned up by @yuvipanda There are still some classes left but not much we can imple... [23:06:32] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705082 (Krenair) Between the three of us it's been brought back up. 
[23:07:49] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [23:25:40] Beta-Cluster-Infrastructure, Operations, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705126 (Krenair) HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster: ```profile::cache::base::varnish_v... [23:26:20] RECOVERY - Puppet errors on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:43:08] (Draft3) MacFan4000: Rm numerous extensions [integration/config] - https://gerrit.wikimedia.org/r/386070