[00:01:04] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[00:56:04] RECOVERY - Puppet errors on deployment-mediawiki-jhuneidi is OK: OK: Less than 1.00% above the threshold [2.0]
[01:01:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[01:16:01] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[02:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[04:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[04:45:44] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<10.00%)
[05:48:53] Phabricator: Merge the Phabricator Priority values "Low" and "Lowest" - https://phabricator.wikimedia.org/T228759 (Urbanecm) This priority has its sense IMO. Redefining it as "can sit here for any time", or "not urgent at all" is a better thing to do IMO, while low priority is for tasks that are more importa...
[05:49:48] Phabricator: Merge the Phabricator Priority values "Low" and "Lowest" - https://phabricator.wikimedia.org/T228759 (Urbanecm) p:Low→Lowest To give a concrete idea, this is lowest IMO. Can sit here for years, if needed.
[06:11:04] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[06:45:11] Yippee, build fixed!
[06:45:12] Project mwcore-phpunit-coverage-master build #142: FIXED in 3 hr 45 min: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/142/
[06:50:48] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:16:44] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Simetrical) Again, there are two concerns...
[08:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[08:25:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:47:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:47:57] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Daimona) >>! In T225730#5453421, @Simetri...
[09:10:39] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Simetrical) For the overloading issue, is...
[10:11:04] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[10:46:39] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201908), WMDE-Analytics-Engineering, and 2 others: Make "analytics/wmde/toolkit-analyzer-build" use git lfs - https://phabricator.wikimedia.org/T230015 (hashar) Open→Resolved a:Ladsgroup The reposi...
[10:48:22] PROBLEM - Free space - all mounts on deployment-mediawiki-07 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki-07.diskspace.root.byte_percentfree (<100.00%)
[10:53:40] !log deployment-mediawiki-07: nuke /mnt/mediawiki , last touched in April 2018 and using space on /
[10:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[11:03:24] RECOVERY - Free space - all mounts on deployment-mediawiki-07 is OK: OK: All targets OK
[12:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[13:21:56] (PS1) Hashar: zuul: expand DonationInterface jobs [integration/config] - https://gerrit.wikimedia.org/r/533518
[13:23:38] (CR) jerkins-bot: [V: -1] zuul: expand DonationInterface jobs [integration/config] - https://gerrit.wikimedia.org/r/533518 (owner: Hashar)
[13:28:23] (PS2) Hashar: zuul: expand DonationInterface jobs [integration/config] - https://gerrit.wikimedia.org/r/533518
[14:07:04] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Ladsgroup) >>! In T225730#5452274, @Daimo...
[14:10:30] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Daimona) >>! In T225730#5454020, @Ladsgro...
[14:11:04] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[14:11:23] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:14:40] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Ladsgroup) >>! In T225730#5454035, @Daimo...
[14:16:11] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 48229 bytes in 1.026 second response time
[14:17:04] Continuous-Integration-Config, Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO, MediaWiki-Core-Testing, and 5 others: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 (Daimona) >>! In T225730#5454066, @Ladsgro...
[14:48:30] (PS1) Hashar: zuul: pipelines for fundraising branches [integration/config] - https://gerrit.wikimedia.org/r/533532
[15:18:13] (PS1) Hashar: zuul: drop skip-if for fundraising/REL branch [integration/config] - https://gerrit.wikimedia.org/r/533536
[15:18:35] moare madness ^^^ :-\
[15:33:21] (CR) Jforrester: "Very neat. Want to deploy it?" [integration/config] - https://gerrit.wikimedia.org/r/533536 (owner: Hashar)
[15:40:53] !log Updated "Prod Error" task form to have Title at the top and Trace below Description.
[15:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:00:07] (PS1) Jforrester: layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546
[16:01:52] (CR) jerkins-bot: [V: -1] layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546 (owner: Jforrester)
[16:03:15] (PS2) Jforrester: layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546
[16:08:04] (CR) jerkins-bot: [V: -1] layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546 (owner: Jforrester)
[16:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[16:22:36] (CR) Hashar: "Gotta dig into it a bit more later on but yeah that would be easier to understand :]" (3 comments) [integration/config] - https://gerrit.wikimedia.org/r/518088 (owner: Jforrester)
[16:24:05] James_F: so eventually I might have a found a way to slightly simplify the zuul jobs filters
[16:24:16] (CR) Jforrester: layout: Replace negative branch matches of quibble jobs with positive ones (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/518088 (owner: Jforrester)
[16:24:21] and I am tempted to just have a test and gate pipeline for each of the branches (master / wmf / REL1_31 etc)
[16:24:33] so maybe this way we can just drop most of those branch / skip-if filters
[16:24:55] Yeah, that's be lovely.
[16:25:11] and sorry about the lack of review on your patch to simplify the branches
[16:25:33] but now that I have dig into the problem to simplify the fundraisng related filters, I can actually have a look at your change hehe
[16:27:07] :-)
[16:44:15] and I am gone. Will dig into that mess again on monday and hopefully have some small patches that are easy to review/apply etc ;-]
[16:44:21] * hashar waves
[16:56:39] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell) p:Triage→Normal
[16:59:34] Phabricator: Allow others than admins to edit forms - https://phabricator.wikimedia.org/T181031 (mmodell) Unfortunately, yes this requires upstream changes. I actually spent quite a bit of time looking into this and I haven't been able to figure out which part of the phabricator code is enforcing this or how...
[17:00:53] (PS3) Jforrester: layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546
[17:12:43] (CR) jerkins-bot: [V: -1] layout: Use positive branch matches for quibble, and explicitly list matched jobs [integration/config] - https://gerrit.wikimedia.org/r/533546 (owner: Jforrester)
[17:55:24] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-jhuneidi is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 353 bytes in 0.047 second response time
[18:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[18:25:07] jrbranaa: Did you want https://gerrit.wikimedia.org/r/c/integration/config/+/532832 deployed?
[18:25:54] james_f yes, please :-)
[18:26:09] (PS2) Jforrester: added Popups and MobileFrontend extensions to codehealth pipeline [integration/config] - https://gerrit.wikimedia.org/r/532832 (owner: Jrbranaa)
[18:26:09] (CR) Jforrester: [C: +2] added Popups and MobileFrontend extensions to codehealth pipeline [integration/config] - https://gerrit.wikimedia.org/r/532832 (owner: Jrbranaa)
[18:26:53] ^ thanks!
[18:27:48] (Merged) jenkins-bot: added Popups and MobileFrontend extensions to codehealth pipeline [integration/config] - https://gerrit.wikimedia.org/r/532832 (owner: Jrbranaa)
[18:28:08] !log Zuul: Move Popups and MobileFrontend to codehealth
[18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:37:11] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (Reedy) >One thing that could improve the situation would be supporting pre-generated reset codes, however, that isn't currently supported in phabricator. The fact phab doesn't have this seems a massive ov...
[18:38:20] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell) @reedy: yeah I'm inclined to look into adding this feature to Phabricator though I am not sure how it should be implemented.
[18:38:56] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (Reedy) Generate N codes, save them in a database column somewhere? :)
[18:41:38] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell) >>! In T231667#5454888, @Reedy wrote: > Generate N codes, save them in a database column somewhere? :) Yeah, I guess that covers the core of it. Also need a UI for the user to access and save th...
[18:43:54] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (Reedy) Re-generation and other UI-niceness is definitely a nice to have... But having them generated at enrollment, and then useable on request for a TOTP would seemingly be a MVP Which is basically the s...
[18:44:02] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell) So the question is whether this is worth the effort to implement. I'd imagine it would eventually pay for itself in time saved dealing with 2factor resets. And that is not even to mention the imp...
[18:45:25] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (Reedy) >>! In T231667#5454898, @mmodell wrote: > So the question is whether this is worth the effort to implement. I'd imagine it would eventually pay for itself in time saved dealing with 2factor resets....
[18:58:53] Continuous-Integration-Config: trigger-mediawiki-pipeline-dev always fails - https://phabricator.wikimedia.org/T231679 (Zoranzoki21)
[19:16:56] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (epriestley) See also some discussion in T187256.
[19:34:12] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201908), Release, Train Deployments, User-zeljkofilipin: 1.34.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T220745 (Jdforrester-WMF)
[19:53:23] Beta-Cluster-Infrastructure: Request access to deployment-prep and beta-cluster logstash - https://phabricator.wikimedia.org/T231684 (D3r1ck01)
[20:07:55] Continuous-Integration-Config, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (201908), serviceops, and 2 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[20:11:04] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[20:13:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:19:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:00:44] Continuous-Integration-Config, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (201908), Performance-Team, and 3 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Krinkle)
[21:42:40] Release-Engineering-Team (Local Dev), Release-Engineering-Team-TODO, Developer Productivity, Release Pipeline, and 2 others: Define a .pipeline/blubber.yaml for mediawiki/core - https://phabricator.wikimedia.org/T218360 (hashar)
[21:42:42] Continuous-Integration-Config: trigger-mediawiki-pipeline-dev always fails - https://phabricator.wikimedia.org/T231679 (hashar)
[21:44:39] Continuous-Integration-Config: trigger-mediawiki-pipeline-dev always fails - https://phabricator.wikimedia.org/T231679 (hashar) See T218360, but in short the job is an experiment. Though we might want to trigger it on demand / manually instead of on every single change merged.
[21:52:36] (PS3) D3r1ck01: Archive ViewFiles extension in Integration Config [integration/config] - https://gerrit.wikimedia.org/r/533326 (https://phabricator.wikimedia.org/T228367)
[22:11:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[23:22:00] Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell)
[23:22:02] Phabricator (Upstream), Upstream: Add another way to add two factor auth - https://phabricator.wikimedia.org/T187256 (mmodell)
[23:22:06] Phabricator (Upstream), Upstream: Phabricator multi-factor auth (2fa) should provide printable recovery codes - https://phabricator.wikimedia.org/T182624 (mmodell)
[23:22:22] Phabricator: Phabricator: Simplify the multifactor auth reset procedure - https://phabricator.wikimedia.org/T231667 (mmodell)
[23:46:47] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<20.00%)