[00:31:16] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[02:15:11] Well that’s an interesting bug in phabricator...
[02:15:47] Apple broke one press of the search button on iPhones with iOS 11.3
[02:16:00] I have to hold and wait for it to stay open
[02:47:32] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<20.00%)
[03:38:38] Project mediawiki-core-code-coverage-php7 build #189: FAILURE in 38 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-php7/189/
[04:22:52] [9e5298b37de5869408924779] [no req] Wikimedia\Rdbms\DBQueryError from line 1418 of /srv/jenkins-workspace/workspace/mediawiki-core-code-coverage-php7/src/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
[04:22:52] 03:38:37 Query: SELECT page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model FROM unittest_page WHERE page_namespace = '1' AND page_title = 'Not_Main_Page' LIMIT 1
[04:22:52] 03:38:37 Function: WikiPage::pageData
[04:22:52] 03:38:37 Error: 1 no such table: unittest_page
[04:28:19] Project mediawiki-core-code-coverage build #3428: FAILURE in 1 hr 28 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/3428/
[04:55:15] PROBLEM - App Server Main HTTP Response on deployment-mediawiki06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:00:13] RECOVERY - App Server Main HTTP Response on deployment-mediawiki06 is OK: HTTP OK: HTTP/1.1 200 OK - 47311 bytes in 7.716 second response time
[05:48:00] Project mwext-phpunit-coverage-publish build #3035: FAILURE in 3.7 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3035/
[05:49:59] Yippee, build fixed!
[05:50:00] Project mwext-phpunit-coverage-publish build #3036: FIXED in 1 min 59 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3036/
[07:12:34] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:40:05] !log removed mediawiki-deployment07 from deployment-prep (T191578)
[07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[07:40:08] T191578: deployment-mediawiki07: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - https://phabricator.wikimedia.org/T191578
[07:40:40] PROBLEM - Host deployment-mediawiki07 is DOWN: CRITICAL - Host Unreachable (10.68.17.40)
[07:42:19] Beta-Cluster-Infrastructure: deployment-mediawiki07: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - https://phabricator.wikimedia.org/T191578#4111117 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff This instance wasn't active, it was only created to test/fix the compatibility of the mediawiki ap...
[07:52:54] Project mwext-phpunit-coverage-publish build #3042: FAILURE in 3.9 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3042/
[07:52:56] Project mwext-phpunit-coverage-publish build #3043: STILL FAILING in 0.91 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3043/
[07:53:04] Project mwext-phpunit-coverage-publish build #3044: STILL FAILING in 3.3 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3044/
[07:54:23] Yippee, build fixed!
[07:54:23] Project mwext-phpunit-coverage-publish build #3045: FIXED in 1 min 12 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3045/
[08:06:25] Beta-Cluster-Infrastructure, Operations, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4111142 (fgiunchedi) a:fgiunchedi>None Indeed, I'm removing myself as assignee since I...
[08:08:39] (PS2) Thiemo Kreuz (WMDE): Make unused global variables sniff much more robust [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323
[08:08:48] (CR) Thiemo Kreuz (WMDE): Make unused global variables sniff much more robust (2 comments) [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[08:16:21] Release-Engineering-Team (Kanban), MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-zeljkofilipin, Wikimedia-log-errors (Jenkins Failure): Warning: Task "stylelint:src" failed due to postcss-less@1.1.4 - https://phabricator.wikimedia.org/T190269#4068266 (Um...
[08:58:08] Project beta-mediawiki-config-update-eqiad build #10693: FAILURE in 2.1 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/10693/
[09:02:11] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[09:15:04] Release-Engineering-Team (Kanban), Patch-For-Review, Release, Train Deployments: 1.31.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T183967#4111315 (mmodell) Open>Resolved
[09:28:16] Yippee, build fixed!
[09:28:16] Project beta-mediawiki-config-update-eqiad build #10694: FIXED in 5.3 sec: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/10694/
[09:32:17] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:41:57] Release-Engineering-Team (Kanban), Wikimedia-Hackathon-2018, JavaScript, User-zeljkofilipin: Write Selenium tests in JavaScript/Node.js workshop - https://phabricator.wikimedia.org/T190046#4111385 (zeljkofilipin)
[09:41:59] Release-Engineering-Team (Kanban), Wikimedia-Hackathon-2018, JavaScript, User-zeljkofilipin: Pair on writing Selenium tests in JavaScript/Node.js - https://phabricator.wikimedia.org/T190687#4111386 (zeljkofilipin)
[10:42:21] (PS6) Thiemo Kreuz (WMDE): Add checks for invalid annotations [tools/codesniffer] - https://gerrit.wikimedia.org/r/420159 (https://phabricator.wikimedia.org/T182057) (owner: MaxSem)
[10:42:23] (PS1) Thiemo Kreuz (WMDE): Untangle and fix FunctionAnnotations sniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/424554 (https://phabricator.wikimedia.org/T182057)
[10:47:39] (CR) jerkins-bot: [V: -1] Untangle and fix FunctionAnnotations sniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/424554 (https://phabricator.wikimedia.org/T182057) (owner: Thiemo Kreuz (WMDE))
[10:47:43] (CR) jerkins-bot: [V: -1] Add checks for invalid annotations [tools/codesniffer] - https://gerrit.wikimedia.org/r/420159 (https://phabricator.wikimedia.org/T182057) (owner: MaxSem)
[10:48:23] (CR) Thiemo Kreuz (WMDE): [C: -1] Add checks for invalid annotations (4 comments) [tools/codesniffer] - https://gerrit.wikimedia.org/r/420159 (https://phabricator.wikimedia.org/T182057) (owner: MaxSem)
[10:56:16] Continuous-Integration-Infrastructure, MediaWiki-extensions-Newsletter: Unit tests for Newsletter extension failing in Wikimedia CI (ApiNewsletterSubscribeTest) - https://phabricator.wikimedia.org/T191284#4111543 (Umherirrender) Open>Resolved a:Umherirrender Fixed with https://gerrit.wikimedi...
[10:57:03] (PS1) Thiemo Kreuz (WMDE): Add test for 'You must use "/**" style comments for a function' [tools/codesniffer] - https://gerrit.wikimedia.org/r/424561
[10:59:24] Release-Engineering-Team (Someday), User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#4111550 (zeljkofilipin) a:zeljkofilipin
[10:59:42] Release-Engineering-Team (Kanban), User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#3716147 (zeljkofilipin)
[11:00:34] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#3716147 (zeljkofilipin)
[11:01:38] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#4111561 (zeljkofilipin) p:Low>High
[11:02:10] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#3716147 (zeljkofilipin)
[11:04:56] Beta-Cluster-Infrastructure, Release-Engineering-Team, Puppet: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111574 (EddieGP) That check is a joke. It's there because we don't want long-lived cherry-picks on the puppetm...
[11:21:37] Beta-Cluster-Infrastructure, Release-Engineering-Team, Puppet: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111635 (MarcoAurelio) Maybe reset and repull everything or there are changes that were directly done on the pu...
[11:38:41] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'https://en.m.wikipedia.beta.wmflabs.org:443/wiki/Main_Page?debug=true' - 1976 bytes in 0.030 second response time
[11:38:41] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 1342 bytes in 0.002 second response time
[11:39:35] Beta-Cluster-Infrastructure, Release-Engineering-Team, Puppet: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111688 (EddieGP) The check asks "Is the average number of cherry-picks on the puppet master in the last 48h gr...
[11:41:02] !log upgrading deployment-prep to wikidiff2 1.6.0 (T190717)
[11:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[11:41:05] T190717: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717
[11:42:44] (CR) Umherirrender: [C: 1] Make unused global variables sniff much more robust (2 comments) [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[11:43:45] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 36037 bytes in 6.456 second response time
[11:43:45] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 47357 bytes in 3.849 second response time
[11:46:48] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:56:44] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 47351 bytes in 3.980 second response time
[11:57:16] (CR) Thiemo Kreuz (WMDE): Make unused global variables sniff much more robust (1 comment) [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[12:02:15] PROBLEM - App Server Main HTTP Response on deployment-mediawiki06 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:10] RECOVERY - App Server Main HTTP Response on deployment-mediawiki06 is OK: HTTP OK: HTTP/1.1 200 OK - 47357 bytes in 5.576 second response time
[12:13:06] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, Patch-For-Review, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#4111762 (zeljkofilipin)
[12:18:26] (PS3) Umherirrender: Make unused global variables sniff much more robust [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[12:18:29] (CR) Umherirrender: [C: 2] Make unused global variables sniff much more robust [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[12:19:35] (Merged) jenkins-bot: Make unused global variables sniff much more robust [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[12:20:01] (CR) jenkins-bot: Make unused global variables sniff much more robust [tools/codesniffer] - https://gerrit.wikimedia.org/r/424323 (owner: Thiemo Kreuz (WMDE))
[12:20:23] (PS2) Umherirrender: Add test for 'You must use "/**" style comments for a function' [tools/codesniffer] - https://gerrit.wikimedia.org/r/424561 (owner: Thiemo Kreuz (WMDE))
[12:20:26] (CR) Umherirrender: [C: 2] Add test for 'You must use "/**" style comments for a function' [tools/codesniffer] - https://gerrit.wikimedia.org/r/424561 (owner: Thiemo Kreuz (WMDE))
[12:21:13] (Merged) jenkins-bot: Add test for 'You must use "/**" style comments for a function' [tools/codesniffer] - https://gerrit.wikimedia.org/r/424561 (owner: Thiemo Kreuz (WMDE))
[12:21:57] (CR) jenkins-bot: Add test for 'You must use "/**" style comments for a function' [tools/codesniffer] - https://gerrit.wikimedia.org/r/424561 (owner: Thiemo Kreuz (WMDE))
[13:00:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:01:45] Gerrit, Phabricator, Release-Engineering-Team (Someday): Consider disabling differential - https://phabricator.wikimedia.org/T191182#4111915 (Aklapper) declined>Open I explained in T191182#4103647 why I think this is wrong. So I am reopening this task. If Differential has no future in Wikimed...
[13:08:20] Gerrit, Phabricator, Release-Engineering-Team (Someday): Consider disabling differential - https://phabricator.wikimedia.org/T191182#4111937 (Dereckson) I've received a pull request for a tool forge tool on GitHub. Honestly, I forgot this tool source code were there and I suspect it was there because...
[13:08:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:22:27] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, Patch-For-Review, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#4111974 (zeljkofilipin)
[13:24:14] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, Patch-For-Review, User-zeljkofilipin: Run Cucumber+Selenium+Node.js in CI - https://phabricator.wikimedia.org/T179190#3716147 (zeljkofilipin)
[13:43:11] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112048 (Tgr)
[13:43:49] meh
[13:44:00] im gonna revert the thing in core that is breaking CI for WMDE
[13:54:39] (PS1) Zfilipin: WIP Run Cucumber+Selenium+Node.js in CI [integration/config] - https://gerrit.wikimedia.org/r/424592 (https://phabricator.wikimedia.org/T179190)
[13:57:06] (PS2) Zfilipin: WIP Run Cucumber+Selenium+Node.js in CI [integration/config] - https://gerrit.wikimedia.org/r/424592 (https://phabricator.wikimedia.org/T179190)
[14:06:40] (PS3) Zfilipin: Run `npm run selenium` instead of `grunt webdriver:test` [integration/config] - https://gerrit.wikimedia.org/r/424592 (https://phabricator.wikimedia.org/T179190)
[14:17:12] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112127 (CCicalese_WMF)
[14:22:46] Project mwext-phpunit-coverage-publish build #3052: FAILURE in 1 min 34 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3052/
[14:25:12] Continuous-Integration-Infrastructure, MediaWiki-Codesniffer, Test-Coverage: Post-merge build failed for mediawiki/tools/codesniffer - https://phabricator.wikimedia.org/T191637#4112153 (Umherirrender)
[14:46:21] Release-Engineering-Team (Kanban), MediaWiki-Core-Tests, JavaScript, Patch-For-Review, User-zeljkofilipin: Run Selenium Cucumber tests in CI - https://phabricator.wikimedia.org/T179190#4112261 (zeljkofilipin)
[14:47:33] Yippee, build fixed!
[14:47:34] Project mwext-phpunit-coverage-publish build #3053: FIXED in 2 min 14 sec: https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-publish/3053/
[14:53:16] PROBLEM - Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76)
[15:08:28] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112324 (Tgr) https://github.com/wikimedia/mediawiki-extensions-OATHAuth/blob/mast...
[15:08:28] sigh
[15:10:50] Hmm?
[15:12:43] does shinken use puppetdb?
[15:13:01] if so, what about removing deployment-puppetdb01 from puppet?
[15:13:16] ie puppet clean cert deployment-puppetdb01
[15:14:34] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112342 (Tgr)
[15:14:57] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#3689364 (Tgr) WikiEditor already sets `$wgDefaultUserOptions['usebetatoolbar'] = 1`.
[15:15:46] Nothing uses puppetdb afaik, it's been down for ... weeks? months?
[15:16:05] im wondering how does shinken add it?
[15:16:39] Umm some API to openstack? It's done for all instances automatically I think.
[15:17:04] oh
[15:18:36] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112348 (Tgr) WikiEditor is a requirement for CodeEditor. Is that something the in...
[15:19:08] That's just an uneducated guess. But I don't think those have to be (can be) manually added/removed from shinken
[15:19:35] I don't think they generate it for all hosts
[15:19:43] as their webui does not show it for all
[15:20:01] (my project is not listed anyways :))
[15:20:05] i use icinga2
[15:20:11] At least for all in deployment-prep it does?
[15:20:19] most i think
[15:20:38] Of course the ones without errors aren't shown ;-)
[15:21:57] heh
[15:30:47] Looks like upstream may be about to redesign the header of polygerrit heh
[15:37:25] Project mediawiki-core-code-coverage-php7 build #190: STILL FAILING in 37 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-php7/190/
[15:48:41] PROBLEM - Puppet errors on integration-slave-docker-1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:06:00] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4112450 (CCicalese_WMF)
[16:27:44] Project mediawiki-core-code-coverage build #3429: STILL FAILING in 1 hr 27 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/3429/
[16:56:17] Continuous-Integration-Infrastructure, Release-Engineering-Team, Operations, Ops-Access-Requests: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4112621 (herron) p:Triage>Normal
[18:07:24] PROBLEM - Puppet errors on deployment-puppetmaster02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:10:53] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[18:12:26] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:25:52] RECOVERY - Puppet errors on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[18:34:35] 11:29:23 Who here understands how scap3 works in labs?
[18:34:35] 11:29:46 I'm trying to deploy from deployment-tin to deployment-maps03, but deployment-maps03 seems to be refusing the SSH key from keyholder
[18:34:35] 11:29:53 I ran scap deploy -v in /srv/deployment/tilerator/deploy
[18:34:35] 11:30:23 (I also had to manually add the SSH key for deployment-maps03 to /etc/ssh/ssh_known_hosts, that file claims to be puppetized but that puppetization seems to be broken)
[18:35:59] PROBLEM - Puppet errors on deployment-etcd-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:36:37] hrm, the ssh_known_hosts thing was flapping in labs at some point due to...something with puppetdb but I never dug into that much further. deployment-maps03 refusing ssh key is weird. Let me catch up.
[18:37:24] RECOVERY - Puppet errors on deployment-puppetmaster02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:38:48] Thanks
[18:39:03] Also those puppet errors are probably me messing with the hiera config trying to get scap to work
[18:39:40] I tried: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-maps03.deployment-prep.eqiad.wmflabs
[18:39:48] and that let me in (after accepting the host key)
[18:39:55] ha
[18:39:57] so scap should be able to get in
[18:40:01] Let me try again then
[18:40:08] k I'll deployment stalk
[18:40:39] scap still fails :(
[18:40:40] huh
[18:41:07] ah I think I see the issue
[18:41:11] > Agent admitted failure to sign using the key
[18:41:13] https://usercontent.irccloud-cdn.com/file/RtMwV8K6/Screenshot%20from%202018-04-06%2011-40-44.png
[18:41:22] Yeah what does that even mean?
[18:41:23] that's a keyholder message you must not be in the group allowed to use that key
[18:41:29] lemme check that
[18:42:08] it's ssh saying that ssh-agent is refusing to sign the key, and since keyholder is proxying the agent, it's probably keyholder rejecting you
[18:42:17] Hah, you're right, I'm not in the deploy-service group
[18:42:58] And the deploy-service key only allows that group according to the hiera config
[18:43:14] I could just change the hiera config to allow the project-deployment-prep group for that key?
[18:43:40] yep
[18:43:44] Every other keyholder key in deployment-prep trusts that group but deploy-service
[18:43:51] OK I'll do that then
[18:43:56] yeah, on beta I think that makes sense
[18:45:12] OK, rerunning puppet to apply that now
[18:46:30] YAY it's working
[18:46:32] Thanks thcipriani
[18:46:41] yw, glad I could help :)
[18:52:26] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[18:54:29] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:54:47] thcipriani: So, next problem :/
[18:54:52] I've now gotten it to deploy
[18:55:05] But it's not writing the config files to /etc/kartotherian/config.yaml and /etc/tilerator/config.yaml
[18:55:27] Even though config-files.yaml tells it to, and the deploy user has write access to /etc/kartotherian
[18:56:24] * thcipriani looks
[18:57:01] Oh wait
[18:57:04] That might be due to https://gerrit.wikimedia.org/r/#/c/423939/
[18:57:08] I'll git pull and redeploy
[18:59:39] Yup that was it
[18:59:41] oh interesting
[19:07:33] Release-Engineering-Team (Next): When "scap pull" does a (slow) CDB rebuild, it should tell me that that's what it's doing - https://phabricator.wikimedia.org/T162207#4112958 (demon)
[19:13:01] Release-Engineering-Team (Next): When "scap pull" does a (slow) CDB rebuild, it should tell me that that's what it's doing - https://phabricator.wikimedia.org/T162207#4112966 (demon) Open>Resolved
[19:36:01] Beta-Cluster-Infrastructure, Puppet: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4113032 (EddieGP) @Joe, as the creator of this instance, do you know (a) whether it's still needed and (b) if yes, what value should be set here?
[19:51:38] Beta-Cluster-Infrastructure, Operations, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#4113102 (EddieGP) Also (labeled at "UNKNOWN" in openstack browser, but logging in there and looking at /etc/os-release) these are still trusty: - deployment-urldownloader...
[19:54:03] Release-Engineering-Team (Watching / External), Operations, Parsing-Team, HHVM, Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113126 (Pchelolo)
[19:55:57] Continuous-Integration-Infrastructure, Release-Engineering-Team, Operations, Ops-Access-Requests: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4113132 (hashar) @RobH indeed we are looking at adding Tyler to the `contint-roots`group. That grant root acces...
[19:57:15] Continuous-Integration-Infrastructure, MediaWiki-Codesniffer, Test-Coverage: Post-merge build failed for mediawiki/tools/codesniffer - https://phabricator.wikimedia.org/T191637#4113133 (hashar)
[20:00:02] Release-Engineering-Team (Kanban), MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-zeljkofilipin, Wikimedia-log-errors (Jenkins Failure): Warning: Task "stylelint:src" failed due to postcss-less@1.1.4 - https://phabricator.wikimedia.org/T190269#4113140 (ha...
[20:05:48] Beta-Cluster-Infrastructure: deployment-secureredirexperiment puppet error - https://phabricator.wikimedia.org/T191663#4113160 (EddieGP)
[20:15:22] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4113191 (Bawolff) >>! In T178349#4110374, @MaxSem wrote: > LocalisationUpdate also...
[20:39:19] Release-Engineering-Team (Watching / External), Operations, Parsing-Team, HHVM, Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113237 (ssastry) Looks like this one can be closed as well since the subtasks as well as parent tasks are resolved. Anythi...
[20:40:13] Release-Engineering-Team (Watching / External), Operations, Parsing-Team, HHVM, Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113250 (ssastry) Open>Resolved a:ssastry Feel free to re-open / create a new ticket with anything else left to...
[20:58:34] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4113289 (MaxSem) I've poked around a few third-party wikis with LU installed - non...
[21:01:22] MediaWiki-Releasing, MediaWiki-Installer, MW-1.31-release: Expand the set of bundled extensions to achieve a default MediaWiki experience that's comparable to Wikimedia sites - https://phabricator.wikimedia.org/T178349#4113290 (Bawolff) For reference, because it wasn't immediately obvious how to find...
[21:08:22] PROBLEM - Puppet errors on deployment-mx02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:15:56] Krinkle: question. Do browsers cache images you send via data: uris with base64?
[21:16:11] Like, do they have to decode it each time?
[21:16:25] Thought you might know
[21:50:59] !log beta: Cherry-picking https://gerrit.wikimedia.org/r/c/424707/ , test for T173887
[21:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:51:02] T173887: Wikimedia.org portal broken in Beta Cluster (Domain unavailable) - https://phabricator.wikimedia.org/T173887
[22:02:23] PROBLEM - SSH on integration-slave-docker-1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:05:51] no_justification i guess we want the stick figure as the default image?
[22:05:54] i can upload it to
[22:06:01] All-Avatars
[22:07:16] RECOVERY - SSH on integration-slave-docker-1015 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[22:28:09] no_justification https://gerrit.wikimedia.org/r/c/424715/ :)
[22:40:15] MediaWiki-Codesniffer, PHP 7.0 support: Report PREG_REPLACE_EVAL as deprecated in codesniffer - https://phabricator.wikimedia.org/T191676#4113561 (Umherirrender)
[22:48:22] MediaWiki-Releasing, Release-Engineering-Team, MW-1.31-release: Upgrade patches for tarball releases don't apply cleanly to tarball installation - https://phabricator.wikimedia.org/T73379#748261 (Legoktm) I think the ideal long term fix is to generate the patch files by diffing the two tarballs inste...
[23:22:47] no_justification: (RE: base64 decode parse) Good question. I don't actually know. I do know that more generally the decoding of embedded images has to happen synchronously and has been identified by various perf engineers in the industry as an anti-pattern that especially hurts mobile.
[23:23:11] So I do have it on the roadmap to cut back a bit on our use of data URI embedding.
[23:23:41] Because we're now at a point where the showing of the HTML text on a page has only 1 blocking sub resource (the stylesheet). Which is good, except that it's huge, and it contains many images.
[23:24:34] Back in 2011, this choice made a lot of sense, it was considered an optimisation because there was very significant overhead with additional requests and connections. So embedding overall "felt" better than it being a request later on, even if the image isn't used on all pages above the fold.
[23:25:12] But now, I think we should move towards using @embed only for critical images above the fold. And use regular url references otherwise which will lazy load it.
[23:25:51] There is no lazy parsing of data uris, even though the browser should be able to defer that in theory if the selector doesn't match currently. Browsers already do that for regular urls. They're only downloaded once relevant.
[23:25:59] More at https://phabricator.wikimedia.org/T127328 :)
[23:33:13] Interesting! My use case actually was for avatars in Gerrit. Rather than exposing a new endpoint to serve them I considered embedding them in data: uris.
[23:33:51] Not gonna do that after all for a couple of reasons, but was curious how browsers would react to getting sent the same base64 data multiple times in one response
[23:34:06] polygerrit caches things no_justification though
[23:34:15] so it should not be hitting the url
[23:34:23] unless someone refreshes
[23:35:41] The response is cached, but the data uris would still be in that response
[23:35:58] And the browser would have to decode them
[23:36:01] yep
[23:36:09] ( i was meaning if we hit the url) :)
[23:36:35] But like we discussed, serving over apache directly is best. We can cache aggressively and skip writing a new endpoint
[23:37:06] yeh
[23:43:58] no_justification: Oh, that suggests <img> and not CSS, that'd be slow I imagine, given it'd be blockingly part of the HTML.
[23:44:42] Even if it was cached, it'd be fairly expensive to download. Avatars are the kind of thing that should probably render after the critical text and all.
[23:45:15] no_justification: sounds good. (RE: apache, caching)
[23:45:45] Presumably on a cookieless domain that shares the same IP and HTTPS certificate to allow use of a single multiplexed connection?
[23:49:24] Krinkle <img> is not returned by gerrit. the url is returned though
[23:49:32] in polygerrit is fast in my testing
[23:49:45] https://gerrit.git.wmflabs.org/r/c/3/?polygerrit=1
[23:50:43] paladox: The use of <img> vs a custom element is insignificant.
[23:50:54] The problem, if it was using a data uri, is that it is embedded in the HTML.
[23:50:59] oh
[23:51:15] It seems polygerrit is worse, it uses inline style="" which is even slower because it cannot be pre-fetched by the look-ahead parser in browsers.
[23:51:24] Whereas <img> would be pre-parsed and starting to decode much earlier.
[23:51:33] Krinkle gr-avatar
[23:51:43] Yes, custom element with inline style="", as I said.
[23:51:45] https://github.com/GerritCodeReview/gerrit/blob/master/polygerrit-ui/app/elements/shared/gr-avatar/gr-avatar.html
[23:51:47] https://github.com/GerritCodeReview/gerrit/blob/master/polygerrit-ui/app/elements/shared/gr-avatar/gr-avatar.js
[23:52:18] this.style.backgroundImage = 'url("' + url + '")';
[23:53:45] Krinkle: so you suggest a different domain?
[23:54:51] paladox: That code is not actually used. That code only applies when it changes. The initial render comes from the server, although it might re-use that, but afaik it doesn't.
[23:55:15] no_justification: Depends. Exposing it on a sub-path of the main application domain is also fine.
[23:55:26] no_justification: I just assumed that would be messy and/or insecure.
[23:55:34] Krinkle this is what it returns server side
[23:55:35] https://github.com/GerritCodeReview/plugins_avatars-external/blob/master/src/main/java/com/googlesource/gerrit/plugins/avatars/external/ExternalUrlAvatarProvider.java#L96
[23:55:38] Not messy. Insecure possibly
[23:55:53] Since the Gerrit cookie path is / not /r/
[23:56:28] Why am I thinking about phabricator in my mind?
[23:56:32] Right, Gerrit.
[23:56:45] Could easily add a gerrit.wmfusercontent.org
[23:56:54] Yeah, I was thinking the same.
[23:56:58] And serve from a separate vhost
[23:57:04] no_justification we could serve polygerrit on there too
[23:57:09] as a cdn
[23:57:28] that's what upstream do
[23:57:34] Not really necessary?
[23:57:42] Agreed, there would not be a performance benefit here.
[23:57:42] nope
[23:57:50] The cookies are insignificant with H2 compression.
[23:57:55] Just the security aspect.
[23:58:14] Putting it on a different domain doesn't make it a CDN. upstream has an actually separate CDN.
[23:58:27] oh
[23:58:28] We have our own infra, and all traffic is equally cacheable based on response headers.
[23:58:33] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<40.00%)
[23:58:56] If you change an apache response in our infra from private to public, tada, it's a CDN.
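Editor's note on the data: URI thread above (23:33 to 23:36): the point that a cached response still carries every inlined copy of an avatar can be made concrete with a rough size comparison. This is a minimal sketch, not anything Gerrit or PolyGerrit actually does. The helper names, the stand-in avatar bytes and the `/avatars/example.png` path are all hypothetical, and the sketch only counts bytes in the response. It says nothing about how a browser decodes or caches the result, which the log leaves open.

```python
import base64


def to_data_uri(payload: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def embedded_cost(payload: bytes, occurrences: int) -> int:
    """Characters added to a response that inlines the payload each time."""
    return len(to_data_uri(payload)) * occurrences


def referenced_cost(url: str, occurrences: int) -> int:
    """Characters added to a response that only repeats a URL reference."""
    return len(url) * occurrences


# Stand-in data, not a real image: 3072 arbitrary bytes.
avatar = bytes(range(256)) * 12
# Hypothetical path, used only for the comparison.
avatar_url = "/avatars/example.png"

if __name__ == "__main__":
    print("inlined 10 times:   ", embedded_cost(avatar, 10))
    print("referenced 10 times:", referenced_cost(avatar_url, 10))
```

Base64 turns every 3 input bytes into 4 output characters, so each inlined copy is about a third larger than the image itself, and that cost repeats per occurrence. A URL reference costs only its own length per occurrence, and the image bytes are fetched once and cached separately.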