[00:37:09] SRE, Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (Quiddity) Yes, I'm willing to become an owner for the list, but I request a 2nd (and ideally 3rd) owner join to avoid SPOF. > (maybe all new subscribers should be moderated?) It's a low traffic list,...
[01:22:24] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:34:04] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 59.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:07:56] (CR) Huji: "Question: would you need me to schedule this for a deployment window and be present myself? Or is it straightforward enough that you could" [mediawiki-config] - https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: Huji)
[04:16:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:46] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:45] SRE, MediaWiki-General, serviceops, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Izno) The change was noticed at https://en.wikipedia.org/wiki/WP:VPT#Aut...
[04:52:16] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:55:56] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:57:22] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:59:20] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:02:01] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:02:38] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:15:11] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:16:06] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:16:46] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:18:42] so yeah wikitech-static is down
[05:21:09] I have created https://phabricator.wikimedia.org/T295266
[05:26:46] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:33:14] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:33:50] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:38:56] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:39:28] hello hello 👋
[05:39:42] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:39:43] victorops isn't showing me the alert. nice
[05:40:01] what's the alert?
[05:40:16] https://portal.victorops.com/ui/wikimedia/incident/1627/details
[05:40:27] > PROBLEM: Icinga on alert1001.wikimedia.org is CRITICAL (check email for details)
[05:40:39] and yeah if wikitech-static is down that explains it
[05:40:44] yeah
[05:40:53] email says: check_icinga@wikitech-static.wikimedia.org found Icinga CRITICAL on alert1001.wikimedia.org Issues of attempt 1 of 3: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
[05:40:57] (and same for 2 of 3, 3 of 3)
[05:41:05] it looks like it is flapping
[05:43:06] rackspace networking issue, maybe? I can't quite work out how it can't reach icinga but it can still fire the page, but I guess if it's just intermittent
[05:43:40] yeah, I was reading the docs to see how to troubleshoot this further
[05:44:28] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:44:53] but yeah, it looks like rackspace, I cannot reach it from my laptop in Spain and from a vm in London either
[05:45:09] hmm, the rackspace username and password in pwstore aren't working for me
[05:45:25] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:46:47] rzl: :-/
[05:47:35] ohh they work on "Rackspace Customer Portal" but not on "Cloud Office Control Panel"
[05:47:54] (https://wikitech.wikimedia.org/wiki/Wikitech-static says I wanted "Cloud Control Panel" but I don't see that anywhere)
[05:48:37] yeah, this is also from 2015 https://wikitech.wikimedia.org/wiki/Rackspace_Cloud so I guess it's changed a lot
[05:49:30] heh I guess so
[05:49:41] okay I got through to the wikitech-static-ord management page though
[05:49:53] 0% packet loss from rackspace's POV
[05:50:16] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:50:37] a little tempted to try rebooting it, but I don't actually have any reason to think that'd fix anything specific :)
[05:50:47] I cannot reach wikitech-static from here still
[05:51:00] hm, same
[05:51:30] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:52:01] rzl: Rebooting it will not make things any worse
[05:52:11] I can ping it but I can't actually load any pages from it
[05:52:17] yeah, true! here goes
[05:52:34] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Fri 14 Jan 2022 02:46:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:52:51] rzl: There is no way for you to log-in into the CLI there, no?
[05:53:00] !log rebooted wikitech-static via rackspace web UI - T295266
[05:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:03] T295266: wikitech-static down - https://phabricator.wikimedia.org/T295266
[05:53:36] sorry too late :/ there's an "emergency console" option, I guess I could have gone in via that first
[05:53:46] will give it a shot, if it doesn't come back up cleanly
[05:54:29] ping is back
[05:55:01] works over http for me now, too
[05:55:03] and wikitech-static is back for me
[05:55:08] \o/
[05:56:03] rzl: does this need update? https://wikitech.wikimedia.org/wiki/Rackspace_Cloud ?
[05:57:08] yeah, that and the wikitech-static page both
[05:57:36] on my list -- right now trying to ssh into the machine and see if logs shed any light
[05:57:38] yeah, the accessing part is important those instructions do not work
[05:57:47] ok, let me update the task
[05:59:11] wmf-update-known-hosts-production doesn't get me a host key though, and I can't find the fingerprint on wikitech either, except for https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/wikitech-static-jessie.wikimedia.org which doesn't match
[05:59:33] just gonna proceed with caution for now and worry about that later
[06:00:18] and the emergency console? maybe you can do it via there
[06:01:11] hm yeah, that works
[06:07:46] oom-killer invoked at 05:51:12? suspicious
[06:08:16] check_icinga failures started well before that, but
[06:09:08] also a check_icinga segfault at 04:58
[06:09:43] honestly not sure if syslog on this host always looks quite this haunted
[06:10:48] MWException "Missing text field in import" from WikiImporter.php:974, at 04:20
[06:11:12] I'd be copy-pasting these, except I can't copy from this web console :/
[06:14:46] maybe just take a screenshot
[06:14:50] and we can paste it on the task
[06:15:43] the first alert I see from wikitech is from 04:52 utc
[06:15:59] maybe OOM killed apache?
[06:16:17] ah no that's well before the OOM you mentioned
[06:16:28] I realized I was being a little slow :) the right play is to just use the virtual console to check the ssh fingerprint, then connect more normally
[06:16:45] so I'm in properly, and I'll just copy a bunch of syslog into a phab paste and link it on that task
[06:16:50] riiight
[06:16:53] will update the fingerprint page too
[06:16:59] +1
[06:18:18] heh I don't have rights to edit protected pages on wikitech
[06:18:31] let me check if I do
[06:18:36] which page?
[06:20:16] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/wikitech-static-jessie.wikimedia.org is where I'm looking but I guess it's an old hostname -- should probably just create https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/wikitech-static.wikimedia.org and fully protect it
[06:20:44] yep, I cannot edit it either
[06:20:56] mkay, AI for somebody tomorrow then
[06:21:05] let me paste that on the task too
[06:21:10] cheers
[06:26:20] ha, the oom killed killed mysqld
[06:26:23] *killer killed
[06:26:36] so, that'll definitely do it -- but it was after we'd gotten paged already, not our root cause
[06:26:54] well, obviously not our root cause, something else was running us out of ram -- but not our trigger either
[06:27:16] that's the last syslog entry before the reboot
[06:27:24] posting it all on task now
[06:29:00] thanks
[06:29:10] yeah, mysqld is usually the first oom-killer victim :(
[06:30:09] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): wikitech-static down - https://phabricator.wikimedia.org/T295266 (RLazarus)
[06:31:02] I'm going to make a couple of those other wikitech edits quickly and then get to bed, I have a meeting with data persistence team in a little over 8 hours and you know how they are ;)
[06:32:32] oh, resolving the VO incident too
[06:36:43] https://wikitech.wikimedia.org/w/index.php?title=Wikitech-static&diff=1931762&oldid=1923700
[06:38:49] actually better yet, https://wikitech.wikimedia.org/w/index.php?title=Wikitech-static&type=revision&diff=1931763&oldid=1923700
[06:39:51] I'm going to just make https://wikitech.wikimedia.org/wiki/Rackspace_Cloud into a redirect there, too
[06:42:12] {{done}}
[06:44:10] marostegui: wikitech-static still seems to be okay for now, I'm going to call it a night, unless you can think of anything?
[06:45:08] rzl: it works for me too
[06:45:13] you can go to sleep!!
[06:45:19] thanks!
[06:45:43] thanks for being around <3 talk to you later!
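Note on the fingerprint check discussed above (read the host key on the virtual console, then compare it with what ssh shows before trusting the connection): a minimal sketch of how the SHA256 fingerprint is derived from a public key line, assuming the key is in the usual OpenSSH "type base64-blob comment" format. The function names are hypothetical and this is not the tooling used in the channel; it only illustrates the comparison.

```python
import base64
import hashlib


def ssh_sha256_fingerprint(pubkey_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line.

    Expects "<key-type> <base64-blob> [comment]", as found in
    /etc/ssh/ssh_host_*_key.pub files.
    """
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded with the '=' padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def fingerprints_match(console_pubkey_line: str, offered_fingerprint: str) -> bool:
    """Compare the key read on the console with the fingerprint ssh offers."""
    return ssh_sha256_fingerprint(console_pubkey_line) == offered_fingerprint.strip()
```

On the host itself, `ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub` prints the same kind of fingerprint, so the console output can also be compared by eye with the prompt ssh shows on first connect.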
[06:49:24] <_joe_> looks like I just missed all the fun by not looking at IRC, sorry
[07:13:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:16] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:30] SRE, ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (ayounsi) Resolved→Open I set an arbitrary high value, I'll leave it to DCops to find a proper threshold. `Phase, AA:L-L/N, Active Power` was alerting as well, set it to 2000 too.
[07:18:39] (PS1) Elukey: kserve-inference: improve labels and network policy rules [deployment-charts] - https://gerrit.wikimedia.org/r/737319 (https://phabricator.wikimedia.org/T289834)
[07:18:41] (PS1) Elukey: helmfile.d: add namespace to kserve's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/737320 (https://phabricator.wikimedia.org/T289834)
[07:23:54] (CR) Elukey: sslcert::trusted_ca: check if the bundle .pem is defined (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737095 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[07:25:50] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:30] (PS4) Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: Eric Gardner)
[07:39:41] (PS7) JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560)
[07:39:43] (PS9) JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560)
[07:39:45] (PS2) JMeybohm: Update copyright [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/737203
[07:41:22] (CR) Elukey: "LGTM, I think that we can remove all the old references of 0.15.x just to avoid confusion (never really released etc..)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[07:42:10] (CR) Elukey: [C: -1] "Some work on the define that deploys the bundle is needed, going to wait on this :)" [puppet] - https://gerrit.wikimedia.org/r/737065 (owner: Elukey)
[07:43:22] (CR) Urbanecm: [C: +1] create 2022 namespace for wikimaniawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: Bodhisattwa)
[07:51:17] (PS2) Elukey: kserve-inference: improve labels and network policy rules [deployment-charts] - https://gerrit.wikimedia.org/r/737319 (https://phabricator.wikimedia.org/T289834)
[07:51:19] (PS2) Elukey: helmfile.d: add namespace to kserve's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/737320 (https://phabricator.wikimedia.org/T289834)
[07:53:17] (PS13) JMeybohm: Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[07:54:31] (CR) Giuseppe Lavagetto: [C: -1] "Overall LGTM; a couple minor comments." [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[07:54:35] (CR) Elukey: [C: +1] Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[07:56:49] (CR) Jayprakash12345: create 2022 namespace for wikimaniawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737082 (https://phabricator.wikimedia.org/T295267) (owner: Bodhisattwa)
[08:03:31] (CR) JMeybohm: [V: +2 C: +2] Add Jetstack's cert-manager (v1.5.4) images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[08:04:04] (PS1) Muehlenhoff: Add drmrs to Hiera list of datacentres [puppet] - https://gerrit.wikimedia.org/r/737328
[08:05:56] (PS1) JMeybohm: Add cfss-issuer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560)
[08:07:00] (PS2) JMeybohm: Add cfssl-issuer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560)
[08:08:54] (PS3) Elukey: kserve-inference: improve labels and network policy rules [deployment-charts] - https://gerrit.wikimedia.org/r/737319 (https://phabricator.wikimedia.org/T289834)
[08:08:56] (PS3) Elukey: helmfile.d: add namespace to kserve's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/737320 (https://phabricator.wikimedia.org/T289834)
[08:10:35] (PS9) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[08:10:37] (PS8) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[08:10:39] (PS6) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[08:10:41] (PS1) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[08:12:20] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32191/console" [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[08:19:45] (CR) Elukey: [C: +2] kserve-inference: improve labels and network policy rules [deployment-charts] - https://gerrit.wikimedia.org/r/737319 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[08:24:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[08:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:11] (PS1) Ayounsi: Add drmrs to prefix-lists and confederation [homer/public] - https://gerrit.wikimedia.org/r/737331 (https://phabricator.wikimedia.org/T283050)
[08:25:12] (PS1) Ayounsi: drmrs: add bgp support to mr1 [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050)
[08:25:57] (CR) jerkins-bot: [V: -1] drmrs: add bgp support to mr1 [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[08:26:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[08:27:45] (PS2) Ayounsi: Add drmrs to prefix-lists and confederation [homer/public] - https://gerrit.wikimedia.org/r/737331 (https://phabricator.wikimedia.org/T283050)
[08:27:47] (PS2) Ayounsi: drmrs: add bgp support to mr1 [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050)
[08:28:23] (CR) jerkins-bot: [V: -1] drmrs: add bgp support to mr1 [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[08:28:41] (PS3) Ayounsi: drmrs: add bgp support to mr1 [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050)
[08:30:36] (CR) Ayounsi: "Split through other CRs: I808fe75150e5a87b35218c51d6b4fbb2ec380855 829a981a" [homer/public] - https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[08:31:50] (Abandoned) Ayounsi: drmrs initial prep [homer/public] - https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[08:31:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[08:39:50] (PS1) Elukey: helmfile.d: add basic egress GlobalNetworkPolicies for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/737333 (https://phabricator.wikimedia.org/T289834)
[08:40:13] (CR) Elukey: [C: +2] helmfile.d: add namespace to kserve's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/737320 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[08:41:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:42:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:48:56] (CR) Elukey: [C: +2] helmfile.d: add basic egress GlobalNetworkPolicies for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/737333 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[08:51:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[08:57:14] these were probably due to knative --^
[08:58:27] context https://phabricator.wikimedia.org/T288549
[09:01:58] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/736201 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:02:11] (CR) Jbond: [C: +1] ceph::auth: Add codfw1dev-compute client key [puppet] - https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:03:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:47] (PS1) JMeybohm: profile::docker::builder: Add building cert-manager images [puppet] - https://gerrit.wikimedia.org/r/737335 (https://phabricator.wikimedia.org/T294560)
[09:05:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:09] (CR) Jbond: [C: +1] ceph::auth::load_all: allow generating the keyring path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:05:36] (PS1) Ema: varnish: install varnishxcache like other mtail programs [puppet] - https://gerrit.wikimedia.org/r/737336 (https://phabricator.wikimedia.org/T293879)
[09:06:39] (PS3) JMeybohm: Add cfssl-issuer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560)
[09:06:58] SRE, MediaWiki-General, serviceops, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (TheDJ) https://en.wikipedia.org/wiki/WP:VPT#Automatically_renamed_users...
[09:07:04] (CR) David Caro: ceph::auth: add deploy profile and classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736201 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:07:34] (CR) Jbond: [C: +1] ceph::auth: skip keys with no keydata (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:09:25] (CR) JMeybohm: [V: +2 C: +2] Rename everything to cfssl-issuer, ensure e2e completed [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736807 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:11:30] (CR) Jbond: sslcert::trusted_ca: check if the bundle .pem is defined (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737095 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:11:47] (CR) JMeybohm: [C: +2] Import chart cert-manager v1.5.4 [deployment-charts] - https://gerrit.wikimedia.org/r/737167 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:16:11] (Merged) jenkins-bot: Import chart cert-manager v1.5.4 [deployment-charts] - https://gerrit.wikimedia.org/r/737167 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:16:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet
[09:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:20] (PS10) JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560)
[09:17:22] (PS3) JMeybohm: Update copyright [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/737203
[09:17:32] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:54] (CR) JMeybohm: Add simple-cfssl image for development and e2e tests (1 comment) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:19:20] (CR) Jbond: etcd: Use cfssl for peer-to-peer communication (4 comments) [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[09:19:53] (CR) Jbond: [C: +1] "lgtm" [homer/public] - https://gerrit.wikimedia.org/r/737331 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[09:21:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet
[09:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[09:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:15] (CR) Jbond: drmrs: add bgp support to mr1 (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[09:22:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:42] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms
[09:23:47] (CR) Jbond: [C: +1] Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:25:28] (CR) JMeybohm: Implement CFSSL API signer (1 comment) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:27:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[09:28:26] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:30:16] (CR) Elukey: [V: +1 C: -1] "Need to refactor the truststore base bits, -1 for the moment :)" [puppet] - https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:31:00] (CR) ArielGlenn: [C: +1] "Looks great, thanks!" [puppet] - https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[09:32:14] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:32:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[09:33:09] (CR) ArielGlenn: snapshot: replace the word cron everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[09:36:26] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[09:43:59] (CR) Vgutierrez: [V: +1 C: +2] Revert "Revert "prometheus::ops: Add haproxy-tls@cache_upload config"" [puppet] - https://gerrit.wikimedia.org/r/737092 (owner: Vgutierrez)
[09:46:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[09:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:16] !log all core routers: add drmrs to prefix lists + confed
[09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:49] (CR) Majavah: etcd: Use cfssl for peer-to-peer communication (3 comments) [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[10:03:37] SRE, MediaWiki-General, serviceops, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) >>! In T219279#7488442, @TheDJ wrote: > https://en.wikipedia.org/wi...
[10:08:09] jouncebot: nowandnext [10:08:09] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [10:08:09] In 1 hour(s) and 51 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1200) [10:16:56] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737336 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:17:34] !log Deployed patch for T294693 [10:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:40] !log depool cp4026 to be reimaged as a haproxy-tls test node - T290005 [10:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:43] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:28:43] (03CR) 10Vgutierrez: [C: 03+2] site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:29:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) 05Open→03In progress [10:29:22] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (10ema) This is a recurring problem, see for example T273599 T222072 T295253. T222075 has ideas on how to tackle the issue. I've tried to acc... 
[10:29:28] (03PS7) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [10:29:36] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737336 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:35:13] (03CR) 10Ayounsi: [C: 03+2] Add drmrs to prefix-lists and confederation [homer/public] - 10https://gerrit.wikimedia.org/r/737331 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [10:35:47] (03Merged) 10jenkins-bot: Add drmrs to prefix-lists and confederation [homer/public] - 10https://gerrit.wikimedia.org/r/737331 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [10:39:57] (03CR) 10Ema: [C: 03+2] varnish: install varnishxcache like other mtail programs [puppet] - 10https://gerrit.wikimedia.org/r/737336 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:40:51] (03PS1) 10Vgutierrez: hieradata: Enable UDS for varnish-fe@cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/737343 (https://phabricator.wikimedia.org/T290005) [10:43:36] (03CR) 10Ayounsi: drmrs: add bgp support to mr1 (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [10:45:11] (03PS2) 10Vgutierrez: hieradata: Enable UDS for varnish-fe@cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/737343 (https://phabricator.wikimedia.org/T290005) [10:47:01] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Enable UDS for varnish-fe@cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/737343 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:49:18] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4026.ulsfo.wmnet with OS buster [10:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator 
with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4026.ulsfo.wmnet with OS buster [10:49:38] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:49:38] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:49] (03CR) 10MMandere: [C: 03+1] "LGTM, though not sure we are ready to have drmrs enlisted as last time we had numerous icinga alerts. Adding Brandon as well to help clari" [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [10:53:53] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:53:53] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:57] (03PS14) 10David Caro: ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 (https://phabricator.wikimedia.org/T293752) [11:00:59] (03PS7) 10David Caro: ceph::auth: Add codfw1dev-compute client key [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) [11:01:01] (03CR) 10David Caro: ceph::auth: Add codfw1dev-compute client key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:01:03] (03PS3) 10David Caro: ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) [11:01:05] (03CR) 10David Caro: ceph::auth::load_all: allow generating the keyring path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:01:07] (03PS4) 10David Caro: ceph::auth: skip keys with no keydata [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) [11:01:09] (03CR) 10David Caro: ceph::auth: skip keys with no keydata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:01:11] (03PS1) 10David Caro: Add codfw cloudvirts depoly profile [puppet] - 10https://gerrit.wikimedia.org/r/737345 [11:01:24] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS buster [11:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:32] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6002.drmrs.wmnet with OS buster [11:06:14] (03PS1) 10Arturo Borrero Gonzalez: secret: add openstack networktests sshkeys placeholders [labs/private] - 10https://gerrit.wikimedia.org/r/737346 (https://phabricator.wikimedia.org/T294955) [11:06:47] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] secret: add openstack networktests sshkeys placeholders [labs/private] - 10https://gerrit.wikimedia.org/r/737346 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [11:08:39] (03PS9) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:14:21] (03PS10) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:17:11] (03PS11) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:20:03] (03PS12) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:20:44] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (10Majavah) I cleaned up old jobs from `/srv/jenkins-workspace/puppet-compiler`. The cleanup jobs `delete-old-output-files.service` and `delet... 
[11:21:00] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737200 (https://phabricator.wikimedia.org/T281986) (owner: 10Majavah) [11:22:27] (03PS13) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:24:56] (03PS14) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:25:28] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [11:26:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: 10Amire80) [11:31:32] Hallo. Is this the right place to be for backport deployment? [11:32:21] yes! [11:32:26] jouncebot: next [11:32:26] In 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1200) [11:32:37] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4026.ulsfo.wmnet with OS buster [11:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:48] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4026.ulsfo.wmnet with OS buster e... [11:34:17] Lucas_WMDE, thanks. I saw the +1 from you. 
I'm still not entirely sure where these names are shown in the actual interface, so I'm not sure how to test ;) [11:35:19] https://www.wikidata.org/wiki/Q42?uselang=kea has Kabuverdianu with uppercase K in the language selector and as the first row in the “in more languages” table [11:37:52] Oh, OK. I guess it's good enough for testing. It's a rather horrible hack, though: ota is not supposed to be a valid interface language. But OK, it works for today. [11:40:23] (03CR) 10Jelto: gitlab: accept backup file argument (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [11:41:38] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6002.drmrs.wmnet with OS buster [11:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:47] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6002.drmrs.wmnet with OS buster executed with errors: - gan... [11:43:21] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:53] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:46:01] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:46:01] (03PS15) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:46:37] (03PS4) 10Alexandros Kosiaris: elasticsearch::cirrus: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735946 (https://phabricator.wikimedia.org/T275752) [11:46:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/735946 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [11:46:56] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] elasticsearch::cirrus: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735946 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [11:47:39] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:47:47] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:49:11] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.61 ms [11:49:37] (03PS16) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:50:10] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [11:51:50] (03PS17) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:52:20] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [11:55:41] (03PS18) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:56:14] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [11:59:28] (03PS19) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 
10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [11:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions logpager recentchanges recentchangeslinked watchlist from s5 codfw T263127', diff saved to https://phabricator.wikimedia.org/P17707 and previous config saved to /var/cache/conftool/dbconfig/20211108-115945-marostegui.json [11:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:49] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [11:59:57] (03PS1) 10Giuseppe Lavagetto: profile::puppet_compiler: fix systemd timer intervals [puppet] - 10https://gerrit.wikimedia.org/r/737350 [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1200). [12:00:05] aharoni and seddon: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:09] o/ [12:00:11] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [12:00:26] I can deploy today [12:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust weights for s5 codfw replicas after removing special groups from them T263127', diff saved to https://phabricator.wikimedia.org/P17708 and previous config saved to /var/cache/conftool/dbconfig/20211108-120203-marostegui.json [12:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:52] (03CR) 10Majavah: [C: 03+1] profile::puppet_compiler: fix systemd timer intervals [puppet] - 10https://gerrit.wikimedia.org/r/737350 (owner: 10Giuseppe Lavagetto) [12:03:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I’ll test this on mwdebug to ensure that removing the two language codes from $wgExtraLanguageNames doesn’t confuse Wikibase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: 10Amire80) [12:03:44] (03CR) 10Jelto: [C: 03+1] "lgtm apart from the SOURCE_VERSION tag" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [12:05:12] (03PS20) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [12:05:47] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [12:05:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "No, I don’t think we can merge this one. 
sjd and sje are explicitly declared as term language codes in Wikibase (so removing them from $wg" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: 10Amire80) [12:07:21] Lucas_WMDE if they are in Names.php, doesn't it make them automatically available for everything? [12:07:27] (03PS21) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [12:07:27] I don’t think so [12:07:32] but I’ll check on mwdebug [12:07:38] I’ll just manually edit the file instead of merging the config change first [12:07:53] looks like we have exactly one lexeme in each language so far, https://www.wikidata.org/wiki/Lexeme:L229838 and https://www.wikidata.org/wiki/Lexeme:L230180 [12:07:56] (03PS22) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [12:09:51] (03CR) 10Ayounsi: [C: 03+2] drmrs: add bgp support to mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/737332 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [12:10:53] hm, maybe it’s okay after all [12:11:50] https://meta.wikimedia.org/?uselang=sjd has translations in some cyrillic script and a cyrillic-script language name in ULS that looks like some kind of Sami to me [12:11:57] and metawiki doesn’t have the custom wmgExtraLanguageNames at all [12:12:05] so it looks like you’re right and Names.php is enough? [12:12:23] (I was also able to edit https://www.wikidata.org/wiki/Lexeme:L123 with a sjd lemma on mwdebug1001) [12:13:28] Is it all deployed? 
[12:13:51] I mean, on mwdebug1001 [12:14:11] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:10] not yet [12:15:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s try it out (I’ll do some more testing with the full change `scap pull`ed to mwdebug1001)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: 10Amire80) [12:16:01] I had only commented out the sjd and sje lines on mwdebug1001 [12:16:06] (and only in the wikidatawiki block, too) [12:16:24] (03Merged) 10jenkins-bot: Update autonyms in wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) (owner: 10Amire80) [12:16:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The work you've done here seems mostly very good and relatively simple - given the heavy lifting is thankfully done by the cfssl library." [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [12:16:59] aharoni: the full change should be on mwdebug1001 now [12:17:12] testing [12:17:36] tested, lgtm [12:17:44] https://www.wikidata.org/w/api.php?action=query&format=json&meta=wbcontentlanguages&formatversion=2&wbclcontext=term-lexicographical&wbclprop=code%7Cautonym%7Cname lgtm too [12:19:02] I thought MediaWiki had a different list for languages that are actually supported as interface languages, and languages that it knows the names of [12:19:12] but it looks like Names.php is the list of interface languages [12:19:18] and the extra names come from CLDR? [12:19:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:36] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:699692|Update autonyms in wmgExtraLanguageNames (T284870)]] (duration: 00m 56s) [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:39] T284870: Harmonize language codes and autonyms between language-data and Wikimedia wmgExtraLanguageNames - https://phabricator.wikimedia.org/T284870 [12:21:20] ok, I think we’re done with that config change [12:21:24] Seddon: are you there? [12:21:56] (03PS1) 10Ssingh: dnsdist: allow setting additional custom HTTP response headers [puppet] - 10https://gerrit.wikimedia.org/r/737364 [12:23:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "looks sensible to me (though I’m just trusting you that the single pipe is the correct “or” syntax, I don’t know that part); two improveme" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: 10Eric Gardner) [12:24:16] (03PS1) 10Vgutierrez: cache:haproxy: Sort puppet dependencies [puppet] - 10https://gerrit.wikimedia.org/r/737365 (https://phabricator.wikimedia.org/T290005) [12:26:14] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32206/console" [puppet] - 10https://gerrit.wikimedia.org/r/737365 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:26:43] (03CR) 10Ssingh: "PPC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/32205/doh1001.wikimedia.org/index.html." 
[puppet] - 10https://gerrit.wikimedia.org/r/737364 (owner: 10Ssingh) [12:27:14] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache:haproxy: Sort puppet dependencies [puppet] - 10https://gerrit.wikimedia.org/r/737365 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:28:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4026.ulsfo.wmnet with OS buster [12:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:26] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4026.ulsfo.wmnet with OS buster [12:31:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:31:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32204/" [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [12:32:01] (03PS1) 10Matthias Mullie: Explicitly disable references support on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737370 (https://phabricator.wikimedia.org/T230315) [12:33:03] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:38:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:48] !log UTC morning backport+config window done [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:05] (03PS1) 10Matthias Mullie: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 [12:55:36] (03CR) 10Urbanecm: [C: 03+1] Specify the default language of beta cluster votewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: 10Huji) [12:55:39] (03CR) 10Urbanecm: [C: 03+2] Specify the default language of beta cluster votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: 10Huji) [12:56:35] (03Merged) 10jenkins-bot: Specify the default language of beta cluster votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737181 (https://phabricator.wikimedia.org/T295242) (owner: 10Huji) [13:00:02] (03CR) 10Awight: "I can merge this during a deployment window, I would say it's received plenty of review for what it is." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735769 (owner: 10Awight) [13:01:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] (03PS2) 10Matthias Mullie: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 [13:02:18] (03CR) 10jerkins-bot: [V: 04-1] Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:03:11] (03CR) 10Matthias Mullie: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:05:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:34] (03CR) 10JMeybohm: Add cfssl-issuer docker image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:06:54] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Systemd time spec incompatibility: On(UnitIn)activeSec not compatable with 'Daily' - https://phabricator.wikimedia.org/T295284 (10jbond) p:05Triage→03Medium [13:07:07] (03PS2) 10Jbond: profile::puppet_compiler: fix systemd timer intervals [puppet] - 10https://gerrit.wikimedia.org/r/737350 (https://phabricator.wikimedia.org/T295284) (owner: 10Giuseppe Lavagetto) [13:07:39] (03CR) 10Jbond: [C: 03+2] "Thanks seems like there are a couple more occurrences of this. 
will merged this and have created a task to follow up on the others" [puppet] - 10https://gerrit.wikimedia.org/r/737350 (https://phabricator.wikimedia.org/T295284) (owner: 10Giuseppe Lavagetto) [13:10:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [13:12:42] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:14:32] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:18:22] (03PS1) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:18:43] (03PS4) 10Alexandros Kosiaris: cloudelastic: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735947 (https://phabricator.wikimedia.org/T275752) [13:18:50] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/735947 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [13:20:13] (03PS2) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:20:17] 10SRE, 10Infrastructure-Foundations, 10observability, 10puppet-compiler, and 2 others: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10jbond) [13:21:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [13:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:54] (03PS3) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:22:05] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:22:51] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (10jbond) after[[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/737350 | joe's patch ]] the timers are now working again. I have also... 
[13:25:47] (03PS4) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:26:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:04] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:26:18] (03CR) 10jerkins-bot: [V: 04-1] varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:27:40] (03PS5) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:29:36] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:29:54] (03CR) 10Jbond: dnsdist: allow setting additional custom HTTP response headers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/737364 (owner: 10Ssingh) [13:32:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4026.ulsfo.wmnet with OS buster [13:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:30] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4026.ulsfo.wmnet with OS buster c... 
[13:33:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:34:20] (03PS6) 10Kormat: mariadb: Set important db host monitoring to critical. [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) [13:34:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:19] (03PS6) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [13:38:39] Lucas_WMDE: I'm really sorry. Sleeping pattern is totally messed up and slept through the window. [13:39:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:29] Seddon: no problem, we can try again later [13:43:37] (it didn’t look like it was very urgent iirc) [13:44:21] Seddon: just an FYI, I find sleeping in a bed much better than through a window [13:44:38] ... ... ... 
[13:44:41] Funny :p [13:46:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [13:47:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: correct some problems [puppet] - 10https://gerrit.wikimedia.org/r/737392 (https://phabricator.wikimedia.org/T294955) [13:49:15] xD [13:49:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: networktests: correct some problems [puppet] - 10https://gerrit.wikimedia.org/r/737392 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [13:50:21] (03CR) 10Jbond: "fyi jayme this is i think good to g, however i would prefer to add a few more unit tests before merging. however it that becomes a blocke" [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [13:51:13] (03PS5) 10Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735712 [13:51:27] (03PS5) 10Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735713 [13:52:49] (03PS7) 10Kormat: mariadb: Set important db host monitoring to critical. [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) [13:53:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:54:33] (03CR) 10Kormat: "Ok, this should be ready for review now." 
[puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [13:55:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [13:57:32] (03PS1) 10Ayounsi: Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) [13:57:36] (03CR) 10Kormat: "Hey, thanks for this CR, and the suggestion! I've turned it into something more comprehensive for our environment at https://gerrit.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/735689 (https://phabricator.wikimedia.org/T233684) (owner: 10Dzahn) [14:00:25] (03CR) 10jerkins-bot: [V: 04-1] Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [14:11:03] (03CR) 10Vgutierrez: varnish: use systemd template files for varnishmtail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [14:11:23] (03PS6) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:12:43] (03CR) 10David Caro: [C: 03+2] ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:12:53] (03CR) 10David Caro: [C: 03+2] ceph::auth: Add codfw1dev-compute client key [puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:13:04] (03CR) 10David Caro: [C: 03+2] ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:13:16] (03CR) 10David Caro: [C: 03+2] ceph::auth: skip keys with no keydata [puppet] - 10https://gerrit.wikimedia.org/r/737063 
(https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:13:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32209/console" [puppet] - 10https://gerrit.wikimedia.org/r/737335 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:14:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32211/console" [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [14:15:18] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32210/console" [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:16:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32214/console" [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [14:17:55] (03CR) 10David Caro: [V: 03+1 C: 03+2] ceph::auth: skip keys with no keydata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:18:17] (03CR) 10David Caro: [V: 03+1 C: 03+2] ceph::auth: skip keys with no keydata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:19:04] (03CR) 10Elukey: [V: 03+1 C: 03+1] "LGTM (the puppet diff is missing, afaics, changes to manage-production-images.sh but it may be a corner case or a PEBCAK on my side)." [puppet] - 10https://gerrit.wikimedia.org/r/737335 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:21:43] (03CR) 10Btullis: [C: 03+1] "Looks good to me. We are going to restart the analytics-matomo instance with new settings as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:22:09] (03CR) 10Majavah: mariadb: Set important db host monitoring to critical. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [14:22:30] (03PS8) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) [14:22:32] (03PS11) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [14:22:34] (03PS4) 10JMeybohm: Update copyright [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/737203 [14:22:50] (03CR) 10JMeybohm: Implement CFSSL API signer (034 comments) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:23:20] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32216/console" [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:27:49] (03PS12) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [14:27:51] (03PS5) 10JMeybohm: Update copyright [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/737203 [14:33:46] (03PS1) 10Jgiannelos: maps: Force jq to generate single line output [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) [14:34:04] (03CR) 10Kormat: mariadb: Set important db host monitoring to critical. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [14:34:13] (03CR) 10Ssingh: dnsdist: allow setting additional custom HTTP response headers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/737364 (owner: 10Ssingh) [14:34:26] (03CR) 10jerkins-bot: [V: 04-1] maps: Force jq to generate single line output [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:35:36] (03PS2) 10Jgiannelos: maps: Force jq to generate single line output [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) [14:36:07] (03CR) 10jerkins-bot: [V: 04-1] maps: Force jq to generate single line output [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:36:14] (03Abandoned) 10Elukey: sslcert::trusted_ca: check if the bundle .pem is defined [puppet] - 10https://gerrit.wikimedia.org/r/737095 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:36:19] (03Abandoned) 10Elukey: profile::kafka::mirror: add settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:38:11] 10SRE, 10Community-Tech, 10serviceops, 10wikidiff2, and 2 others: Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10NRodriguez) 05Open→03Resolved a:03NRodriguez 🪄🧞‍♀️ Thanks everyone for your great work on this cross-departmental coordination and deployment! 
[14:39:32] (03PS3) 10Jgiannelos: maps: Force jq to generate single line output [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) [14:39:47] (03CR) 10Ssingh: [C: 03+2] bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [14:44:38] (03CR) 10Jgiannelos: "Some context around this patch. JQ by default pretty-prints each JSON output in multiple lines. This causes xargs to fail with:" [puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:44:47] (03PS1) 10Elukey: kserve: add missing egress policies for the controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/737399 (https://phabricator.wikimedia.org/T289834) [14:44:49] (03PS1) 10Elukey: knative-serving: add basic egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737400 (https://phabricator.wikimedia.org/T289834) [14:46:00] (03PS7) 10Ema: varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) [14:46:20] (03CR) 10Ema: varnish: use systemd template files for varnishmtail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [14:48:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/737397 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:50:43] (03PS2) 10David Caro: Add codfw cloudvirts depoly profile [puppet] - 10https://gerrit.wikimedia.org/r/737345 [14:50:45] (03PS1) 10David Caro: ceph:mon/osd: remove admin class [puppet] - 10https://gerrit.wikimedia.org/r/737401 (https://phabricator.wikimedia.org/T293752) [14:51:14] (03CR) 10jerkins-bot: [V: 04-1] Add codfw cloudvirts depoly profile [puppet] - 10https://gerrit.wikimedia.org/r/737345 (owner: 10David Caro) [14:53:11] (03CR) 10jerkins-bot: [V: 04-1] ceph:mon/osd: remove admin class [puppet] - 10https://gerrit.wikimedia.org/r/737401 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:54:31] (03PS3) 10David Caro: Add codfw cloudvirts depoly profile [puppet] - 10https://gerrit.wikimedia.org/r/737345 (https://phabricator.wikimedia.org/T293752) [14:56:21] (03CR) 10David Caro: [V: 03+1] "PCC looks ok, the main class is added, but as it's disabled nothing else goes in and nothing is really done: https://puppet-compiler.wmfla" [puppet] - 10https://gerrit.wikimedia.org/r/737345 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:56:34] (03PS2) 10David Caro: ceph:mon/osd: remove admin class [puppet] - 10https://gerrit.wikimedia.org/r/737401 (https://phabricator.wikimedia.org/T293752) [14:57:22] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) 05Open→03Resolved a:03Gehel The WDQS Flink based Streaming Updater is now in production, let's close this ticket. 
[14:58:15] (03PS1) 10Elukey: profile::base::certificates: add sslcert::trusted_ca options [puppet] - 10https://gerrit.wikimedia.org/r/737403 (https://phabricator.wikimedia.org/T291905) [14:58:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, don't see anything related to the Bullseye migration (which happened since the patch was written) which would cause any issue " [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [14:58:51] (03PS1) 10Ssingh: wikidough: customize anycast-healthchecker logging [puppet] - 10https://gerrit.wikimedia.org/r/737404 [15:00:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32221/console" [puppet] - 10https://gerrit.wikimedia.org/r/737403 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:01:51] (03CR) 10Jbond: dnsdist: allow setting additional custom HTTP response headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737364 (owner: 10Ssingh) [15:02:58] !log merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/737385 with puppet disabled on A:cp T293879 [15:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:01] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [15:03:32] (03CR) 10Elukey: "Valentin: o/ I see that you are using a similar thing for ATS, lemme know if this draft/change is something that you like (in case we coul" [puppet] - 10https://gerrit.wikimedia.org/r/737403 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:03:56] (03CR) 10Ema: [C: 03+2] varnish: use systemd template files for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737385 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [15:03:59] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/32224/doh1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/737404 
(owner: 10Ssingh) [15:04:01] (03PS2) 10Elukey: kserve: add missing egress policies for the controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/737399 (https://phabricator.wikimedia.org/T289834) [15:04:03] (03PS2) 10Elukey: knative-serving: add basic egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737400 (https://phabricator.wikimedia.org/T289834) [15:05:46] (03PS2) 10Alexandros Kosiaris: relforge: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735973 (https://phabricator.wikimedia.org/T275752) [15:05:56] (03CR) 10Ssingh: [C: 03+2] wikidough: customize anycast-healthchecker logging [puppet] - 10https://gerrit.wikimedia.org/r/737404 (owner: 10Ssingh) [15:05:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] "🎆🎆🎆" [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [15:06:46] (03PS4) 10Jsn.sherman: Enable TheWikipediaLibrary on meta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) [15:06:56] (03CR) 10Btullis: [C: 03+2] analytics:refinery:job:refine_sanitize: Avoid monitor false alerts [puppet] - 10https://gerrit.wikimedia.org/r/737110 (owner: 10Mforns) [15:07:55] (03PS5) 10Jsn.sherman: Enable TheWikipediaLibrary on meta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) [15:09:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:39] (03PS1) 10Vgutierrez: prometheus: Avoid filename collitions between haproxy jobs [puppet] - 10https://gerrit.wikimedia.org/r/737406 (https://phabricator.wikimedia.org/T290005) [15:12:35] !log A:cp re-enable puppet after testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/737385 on cp4021 T293879 [15:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:39] 
T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [15:12:41] (03CR) 10JMeybohm: [C: 03+2] profile::docker::builder: Add building cert-manager images [puppet] - 10https://gerrit.wikimedia.org/r/737335 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [15:13:36] (03CR) 10Jsn.sherman: "Based on feedback I got on the releng channel in IRC, I updated this to enable on meta and testwiki rather than enabling on all projects s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [15:17:07] (03CR) 10Ayounsi: [C: 03+1] bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [15:18:58] !log asw1-b13-drmrs: "delete forwarding-options dhcp-relay forward-only" to fix dhcp+installer issues in this rack. [15:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:27] (03PS3) 10Elukey: kserve: add missing egress policies for the controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/737399 (https://phabricator.wikimedia.org/T289834) [15:22:29] (03PS3) 10Elukey: knative-serving: add basic egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737400 (https://phabricator.wikimedia.org/T289834) [15:26:23] (03PS1) 10Jbond: P:trafficserver::backend: use ca provided by P:base::certificates [puppet] - 10https://gerrit.wikimedia.org/r/737408 (https://phabricator.wikimedia.org/T291905) [15:26:53] (03PS4) 10David Caro: Add codfw cloudvirts ceph::auth::deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/737345 (https://phabricator.wikimedia.org/T293752) [15:26:55] (03PS1) 10David Caro: libvirt|ceph: small refactor and remove keyrings [puppet] - 10https://gerrit.wikimedia.org/r/737410 (https://phabricator.wikimedia.org/T293752) [15:27:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS 
(DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32225/console" [puppet] - 10https://gerrit.wikimedia.org/r/737408 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [15:28:55] (03PS6) 10Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) [15:29:04] (03CR) 10Jforrester: "This is now good to deploy whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [15:29:43] (03CR) 10Elukey: [C: 03+2] kserve: add missing egress policies for the controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/737399 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [15:29:45] (03Abandoned) 10Jforrester: build: Suppress phan failure [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726964 (owner: 10Jforrester) [15:30:05] 10SRE, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) a:05ayounsi→03Papaul According to @akosiaris this is due to a failed hard drive, and it might not come back up from a reboot. @Papaul when you're back, let's replace FPC7 wi... [15:31:05] (03PS1) 10JMeybohm: Fix cert-manager build image name [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737412 (https://phabricator.wikimedia.org/T294560) [15:31:22] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix cert-manager build image name [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737412 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [15:34:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:33] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32226/console" [puppet] - 10https://gerrit.wikimedia.org/r/737410 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [15:37:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] relforge: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735973 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [15:37:23] (03PS1) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/737413 [15:39:02] (03PS2) 10Ayounsi: Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) [15:39:33] (03CR) 10jerkins-bot: [V: 04-1] Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [15:40:16] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Tol) (I was the one at the Village Pump.) It looks like the renames didn... 
[15:40:31] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32227/console" [puppet] - 10https://gerrit.wikimedia.org/r/737406 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:42:39] (03PS1) 10Jdlrobson: Instrument mobile talk page clicks [skins/MinervaNeue] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737084 (https://phabricator.wikimedia.org/T294738) [15:42:52] (03CR) 10JMeybohm: cfssl::config: support per profile auth keys (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [15:43:09] (03PS3) 10Ayounsi: Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) [15:43:16] (03PS4) 10Elukey: knative-serving: add basic egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737400 (https://phabricator.wikimedia.org/T289834) [15:43:18] (03PS1) 10Elukey: kserve: move labels from StatefulSet to its pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/737414 (https://phabricator.wikimedia.org/T289834) [15:43:32] (03CR) 10Ema: [V: 03+1 C: 03+1] prometheus: Avoid filename collitions between haproxy jobs [puppet] - 10https://gerrit.wikimedia.org/r/737406 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:43:34] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:51] looking ^ [15:45:19] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Avoid filename collitions between haproxy jobs [puppet] - 10https://gerrit.wikimedia.org/r/737406 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:46:17] PROBLEM - Check systemd state on cp2041 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:15] (03PS1) 10Ppchelko: PageUpdater: apply tags even if RC suppressed. [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737085 (https://phabricator.wikimedia.org/T291967) [15:47:47] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:10] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:39] so apparently there's a possible race in the varnishmtail removal part, nothing bad for actual production metrics given that we start the new instance just fine [15:48:54] apologies for the icinga spam [15:50:30] RECOVERY - Check systemd state on cp2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:39] !log create RIPE RPKI ROA for 2a02:ec80:600::/48 and 2a02:ec80:500::/48 [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:10] (03CR) 10Dzahn: "but remove from WMF when adding to ops, not both at the same time or the consistency check will mail about it" [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [15:51:23] !log remove ROA for 185.15.58.0/23 [15:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:22] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:38] (03PS4) 10Ayounsi: Advertise drmrs from esams [homer/public] - 10https://gerrit.wikimedia.org/r/737395 (https://phabricator.wikimedia.org/T283050) [15:56:19] (03CR) 10Ottomata: [C: 03+2] Remove unused bigtop hive and oozie database creation code [puppet] - 
10https://gerrit.wikimedia.org/r/736034 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [15:56:31] (03PS1) 10Ema: varnish: remove ensure-absent for varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/737417 (https://phabricator.wikimedia.org/T293879) [15:56:54] (03CR) 10Elukey: [C: 03+2] kserve: move labels from StatefulSet to its pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/737414 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [15:59:29] (03PS1) 10Arturo Borrero Gonzalez: cloud: networktests: add missing ROUTING_SOURCE_IP envvar [puppet] - 10https://gerrit.wikimedia.org/r/737418 (https://phabricator.wikimedia.org/T294955) [16:00:06] (03PS3) 10Jbond: cfssl::config: support per profile auth keys [puppet] - 10https://gerrit.wikimedia.org/r/737036 [16:00:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: networktests: add missing ROUTING_SOURCE_IP envvar [puppet] - 10https://gerrit.wikimedia.org/r/737418 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [16:00:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[16:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:29] (03CR) 10Elukey: [C: 03+2] knative-serving: add basic egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/737400 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [16:06:18] !log pool cp4026 using haproxy as the TLS termination layer - T290005 [16:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:21] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:06:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[16:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:32] (03PS1) 10Elukey: knative-serving: add a name to the controller's network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/737421 (https://phabricator.wikimedia.org/T289834) [16:11:37] (03PS1) 10Ahmon Dancy: CommonSettings.php: Don't set MW_DEBUG_LOCAL for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/737422 [16:11:58] (03CR) 10Ahmon Dancy: [C: 03+2] CommonSettings.php: Don't set MW_DEBUG_LOCAL for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/737422 (owner: 10Ahmon Dancy) [16:12:41] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MPhamWMF) [16:12:48] (03Merged) 10jenkins-bot: CommonSettings.php: Don't set MW_DEBUG_LOCAL for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/737422 (owner: 10Ahmon Dancy) [16:13:33] !log depool cp4026 - T290005 [16:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:36] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:14:16] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:15:26] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:16] (03CR) 10Elukey: [C: 03+2] knative-serving: add a name to the controller's network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/737421 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [16:18:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:44] (03PS1) 10Ema: varnish: add varnish::logging::mtail [puppet] - 10https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) [16:22:13] (03PS2) 10Ema: varnish: add varnish::logging::mtail [puppet] - 10https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) [16:23:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:55] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737425 (https://phabricator.wikimedia.org/T128546) [16:28:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10wiki_willy) [16:29:14] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10RKemper) [16:29:16] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10Gehel) Let's make sure the kernel version is pinned somewhere in our puppet code! Then we can wait to see if the problem is re... [16:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1630). 
[16:30:18] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10RKemper) [16:30:27] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737425 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:31:07] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737425 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:32:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:33:10] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10RKemper) >>! In T294961#7489817, @Gehel wrote: > Let's make sure the kernel version is pinned somewhere in our puppet code! Th... 
[16:33:41] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:737425| Bumping portals to master (T128546)]] (duration: 00m 56s) [16:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:44] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:34:38] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:737425| Bumping portals to master (T128546)]] (duration: 00m 56s) [16:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:47] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MPhamWMF) [16:37:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:12] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10RKemper) Note from Search triage meeting: This ticket is for the debian upgrade specifically, so we can upgrade the OS itself (re-imaging the fleet) before we upgrad... [16:39:24] (03CR) 10Andrew Bogott: [C: 03+2] start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [16:40:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 58.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:40:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:23] (03Merged) 10jenkins-bot: start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [16:42:42] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 88.36 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:48:09] (03CR) 10Ottomata: statistics::product_analytics: Update contact group for monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [16:49:10] (03PS1) 10Elukey: knative-serving: move labels to pod templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/737428 (https://phabricator.wikimedia.org/T289834) [16:51:06] (03PS2) 10Elukey: knative-serving: move labels to pod templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/737428 (https://phabricator.wikimedia.org/T289834) [16:53:14] (03CR) 10Jbond: cfssl::config: support per profile auth keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [16:57:25] (03CR) 10Elukey: [C: 03+2] knative-serving: move labels to pod templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/737428 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [16:58:33] elukey: did netpols work out ok after all? [16:59:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:03] (03CR) 10BBlack: [C: 04-1] "Please hold on this. 
We tried this once already, but with most of the host-based services in drmrs missing, this creates a bunch of icing" [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [17:01:38] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: Drop python 2 redis client [puppet] - 10https://gerrit.wikimedia.org/r/737173 (https://phabricator.wikimedia.org/T295235) (owner: 10Majavah) [17:01:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:03:37] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10MatthewVernon) Hi, >>! In T294380#7483005, @fkaelin wrote: >> 1. Are you OK with using the `S3` protocol (rather than the Swift protocol)? > Yes that would work well with the largish file l... [17:06:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:07:37] (03Abandoned) 10Dzahn: mariadb/icinga: page people if sanitarium master goes down [puppet] - 10https://gerrit.wikimedia.org/r/735689 (https://phabricator.wikimedia.org/T233684) (owner: 10Dzahn) [17:08:10] 10SRE-swift-storage, 10Commons, 10Internet-Archive, 10MediaWiki-API, and 3 others: Large PDF upload issue - https://phabricator.wikimedia.org/T254459 (10Legoktm) 05Open→03Resolved Please re-open if it is. 
[17:19:18] (03PS2) 10Jdlrobson: Instrument mobile talk page clicks [skins/MinervaNeue] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737084 (https://phabricator.wikimedia.org/T294738) [17:19:49] (03PS1) 10Elukey: helmfile.d: move Docker registry's IPs to ml-serve.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/737432 (https://phabricator.wikimedia.org/T289834) [17:21:41] akosiaris: o/ sorry just seen the ping, yes it seems working! I am extending the approach (for the moment) to the docker registry, since knative's controller needs to reach it to convert image tags to digests (I have added a dedicated network policy to allow only the controller pod to fetch from the registry) [17:21:50] I am currently fixing a pebcak, after that it should work [17:22:02] (I am concentrating on knative's egress for the moment) [17:22:41] cool, good to know [17:23:44] (03PS1) 10Vgutierrez: cache:haproxy: Bump maxconn limits [puppet] - 10https://gerrit.wikimedia.org/r/737433 (https://phabricator.wikimedia.org/T290005) [17:25:17] (03CR) 10Elukey: [C: 03+2] helmfile.d: move Docker registry's IPs to ml-serve.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/737432 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) [17:27:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:22] (03PS1) 10Vgutierrez: cache:haproxy: Bump ulimit -n for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/737434 (https://phabricator.wikimedia.org/T290005) [17:30:33] (03PS1) 10David Caro: r:wmcs::openstack::codfw1dev::virt: delete unused role [puppet] - 10https://gerrit.wikimedia.org/r/737437 [17:31:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [17:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32229/console" [puppet] - 10https://gerrit.wikimedia.org/r/737433 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:33:25] (03CR) 10Dzahn: [C: 03+2] Add https://ferdinando.me to the Italian planet [puppet] - 10https://gerrit.wikimedia.org/r/737185 (owner: 10Amire80) [17:34:30] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache:haproxy: Bump maxconn limits [puppet] - 10https://gerrit.wikimedia.org/r/737433 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:35:19] mutante: is ok if I merge your CR as well? 
[17:35:34] vgutierrez: yes please, I was still loading my SSH keys to do that :) [17:35:38] ack [17:35:41] ty [17:35:52] done [17:36:16] (03CR) 10Vgutierrez: [C: 03+2] cache:haproxy: Bump ulimit -n for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/737434 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:36:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:37:46] (03CR) 10Dzahn: snapshot: replace the word cron everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [17:37:55] (03CR) 10Dzahn: [C: 03+2] A more focused feed for lu.is for the Wikimedia Planet [puppet] - 10https://gerrit.wikimedia.org/r/737186 (owner: 10Amire80) [17:39:02] !log pool cp4026 - T290005 [17:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:05] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [17:42:39] (03PS1) 10Cwhite: logstash: rename knative-serving:activator error field [puppet] - 10https://gerrit.wikimedia.org/r/737440 (https://phabricator.wikimedia.org/T288549) [17:45:04] 10SRE, 10SRE-swift-storage, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (10Legoktm) >>! In T228292#5344679, @fgiunchedi wrote: > The errors... [17:47:16] (03CR) 10Cwhite: [C: 03+2] logstash: rename knative-serving:activator error field [puppet] - 10https://gerrit.wikimedia.org/r/737440 (https://phabricator.wikimedia.org/T288549) (owner: 10Cwhite) [17:48:18] (03CR) 10Elukey: "Thanks a lot!" 
[puppet] - 10https://gerrit.wikimedia.org/r/737440 (https://phabricator.wikimedia.org/T288549) (owner: 10Cwhite) [17:48:51] (03CR) 10Klausman: [C: 03+1] logstash: rename knative-serving:activator error field [puppet] - 10https://gerrit.wikimedia.org/r/737440 (https://phabricator.wikimedia.org/T288549) (owner: 10Cwhite) [17:51:53] !log depool cp4026 - T290005 [17:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:56] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [17:52:08] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) The import that caused everything to fall over last time completed. I'm not sure that's enough to declare this f... [17:53:33] (03CR) 10Andrew Bogott: [C: 03+2] base: amend notify_maintainers to decode ldap member and email ldap responses [puppet] - 10https://gerrit.wikimedia.org/r/734693 (owner: 10Cwhite) [17:53:44] (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [17:57:30] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: prepare pdns-recursor for the 4.5.5 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [17:58:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10wiki_willy) a:03Jclark-ctr [17:58:05] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: add support for enabling EDNS padding [puppet] - 10https://gerrit.wikimedia.org/r/736776 (https://phabricator.wikimedia.org/T274431) (owner: 10Ssingh) [17:59:16] (03CR) 10Andrew Bogott: [C: 03+2] "thank you for the cleanup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/737199 (owner: 10Majavah) [18:00:05] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1800). [18:08:23] (03CR) 10Dzahn: [C: 03+1] "respectfully removing myself because this seems between DBA and WMCS" [puppet] - 10https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481) (owner: 10Zabe) [18:13:50] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10wiki_willy) a:03Jclark-ctr [18:20:15] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Legoktm) [18:20:47] 10SRE, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), 10Patch-For-Review, 10Sustainability: Jobrunner timeouts on cross-DC file uploads because of HTTP/2 - https://phabricator.wikimedia.org/T275752 (10Legoktm) 05Open→03Resolved I think all the technical work is done, please re-open if I mi... 
[18:28:59] (03PS1) 10Ahmon Dancy: CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) [18:30:34] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) (owner: 10Ahmon Dancy) [18:31:59] (03PS2) 10Ahmon Dancy: CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) [18:32:55] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) (owner: 10Ahmon Dancy) [18:34:14] (03PS3) 10Ahmon Dancy: CommonSettings.php: Only write to /tmp/mw-cache-* if running as www-data user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) [18:40:56] (03PS1) 10Elukey: profile::kafka::broker: move to profile::base::certificates for pki [puppet] - 10https://gerrit.wikimedia.org/r/737470 (https://phabricator.wikimedia.org/T291905) [18:44:04] (03PS2) 10Elukey: profile::kafka::broker: move to profile::base::certificates for pki [puppet] - 10https://gerrit.wikimedia.org/r/737470 (https://phabricator.wikimedia.org/T291905) [18:47:11] (03PS3) 10Elukey: profile::kafka::broker: move to profile::base::certificates for pki [puppet] - 10https://gerrit.wikimedia.org/r/737470 (https://phabricator.wikimedia.org/T291905) [18:48:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32234/console" [puppet] - 10https://gerrit.wikimedia.org/r/737470 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) 
[18:54:30] (03PS1) 10Vgutierrez: varnish: Check remote.ip for local tls terminator detection [puppet] - 10https://gerrit.wikimedia.org/r/737474 (https://phabricator.wikimedia.org/T290005) [19:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T1900). [19:00:05] MatmaRex, JSherman, dontpanic, seddon, Jdlrobson, and Pchelolo: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:50] hello [19:00:55] A full window! I can deploy in ~10 minutes or so (unless someone beats me) [19:01:16] hey [19:01:35] MatmaRex: is your backport independent on the config? [19:01:37] no prob, pls ping when it's my turn :) [19:01:59] (I'm going to +2 the backports from mobile, and in 10 mins I'll be at my laptop to deploy them) [19:02:08] urbanecm: yes [19:02:14] Goof [19:02:17] *good [19:02:29] here [19:02:32] (03CR) 10Urbanecm: [C: 03+2] ArticleTargetSaver: ve.init may be undefined [extensions/VisualEditor] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737077 (https://phabricator.wikimedia.org/T294981) (owner: 10Bartosz Dziewoński) [19:03:08] Jdlrobson: hey! +2'ing your backports then :). [19:03:22] (03CR) 10Urbanecm: [C: 03+2] WikidataPageBanner should disable table of contents using public functions [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737075 (https://phabricator.wikimedia.org/T295003) (owner: 10Jdlrobson) [19:03:25] (03CR) 10Urbanecm: [C: 03+2] Instrument mobile talk page clicks [skins/MinervaNeue] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737084 (https://phabricator.wikimedia.org/T294738) (owner: 10Jdlrobson) [19:03:51] urbanecm: thx [19:04:05] Hello Pchelolo, do you plan on self servicing? 
[19:04:21] (and if so should i +2 the backport now to give CI some time) [19:05:02] here [19:06:00] (03PS1) 10Jgiannelos: Install wait-for-it debian package [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737478 [19:06:04] Hey JSherman, nice to see you here. I see secreview and perfreview checked in the task. Was there any other review requested prior production deployment? [19:07:13] Hello Seddon, just checking you're around for the window :). [19:08:09] Hi, Not that I'm aware of. [19:09:16] Sounds good then. [19:11:19] (03PS1) 10Jgiannelos: tegola-vector-tiles: Wait for DB before pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/737479 (https://phabricator.wikimedia.org/T295290) [19:12:54] (03PS2) 10Urbanecm: Make reply tool available as opt-out on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736588 (https://phabricator.wikimedia.org/T294591) (owner: 10Bartosz Dziewoński) [19:13:02] (03CR) 10Urbanecm: [C: 03+2] Make reply tool available as opt-out on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736588 (https://phabricator.wikimedia.org/T294591) (owner: 10Bartosz Dziewoński) [19:13:42] (03PS2) 10Urbanecm: kswiki: Adding wordmark and tagline files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737054 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:13:49] @urbanecm I am! [19:13:52] (03CR) 10Urbanecm: [C: 03+2] kswiki: Adding wordmark and tagline files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737054 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:14:00] excellent Seddon! 
I'll ping you when it's your turn then :) [19:14:15] dontpanic: I'll sync the static patch right away, as there's not much to test [19:14:27] okay, ty [19:14:51] (03Merged) 10jenkins-bot: Make reply tool available as opt-out on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736588 (https://phabricator.wikimedia.org/T294591) (owner: 10Bartosz Dziewoński) [19:15:07] MatmaRex: your patch is available at mwdebug1001, please test :). [19:15:18] looking [19:15:45] (03PS1) 10Jgiannelos: tile-pregeneration: Wait for envoy to get ready [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737481 (https://phabricator.wikimedia.org/T295290) [19:15:54] (03Merged) 10jenkins-bot: kswiki: Adding wordmark and tagline files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737054 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:16:18] urbanecm: yeah, looks good [19:16:38] Okay, syncing. [19:17:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:40] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) Here's the plan to migrate codfw Kubernetes to `helm3` as well: * Announce maintenance some days ahead on ops list * Downtime Kubernetes services in codfw ` is this needed? is t... [19:18:50] (03PS2) 10Ssingh: dnsdist: allow setting additional custom HTTP response headers [puppet] - 10https://gerrit.wikimedia.org/r/737364 [19:20:33] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Ladsgroup) I use this opportunity to write down something I have been pitching to basically everyone. Use [[https://en.wikipedia.org/wiki/PID_controller|PID contro... 
[19:20:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:10] (03PS3) 10Ssingh: dnsdist: allow setting additional custom HTTP response headers [puppet] - 10https://gerrit.wikimedia.org/r/737364 [19:21:18] (03PS2) 10Jgiannelos: tile-pregeneration: Wait for envoy to get ready [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737481 (https://phabricator.wikimedia.org/T295290) [19:21:43] (03PS2) 10Urbanecm: kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:21:48] (03CR) 10Urbanecm: [C: 03+2] kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:22:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bf70a8bd3337bc24cc23b1f257f7eb99ec2607b8: Make reply tool available as opt-out on dewiki (T294591) (duration: 00m 56s) [19:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:03] T294591: Config change: Deploy Reply Tool as opt-out preference at de.wiki - https://phabricator.wikimedia.org/T294591 [19:22:36] dontpanic: sorry, just saw that...you'll need a followup [19:22:36] (03CR) 10jerkins-bot: [V: 04-1] kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:22:42] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737054 adds Wikipedia [19:22:47] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737061 uses wikipedia [19:22:57] oh [19:23:00] sorry for that [19:23:02] can you rename the files to lowercase please? 
[19:23:17] sure [19:23:21] thx [19:24:01] Seddon: any opinion on Lucas's comments on the patch? [19:24:10] (03Merged) 10jenkins-bot: ArticleTargetSaver: ve.init may be undefined [extensions/VisualEditor] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737077 (https://phabricator.wikimedia.org/T294981) (owner: 10Bartosz Dziewoński) [19:24:57] MatmaRex: your backport is at mwdebug1001 now, can you test please? [19:24:59] @urbanecm the piping is indeed the correct syntax here but the comments both seem valid. Happy to punt to the next window [19:25:19] Seddon: feel free to amend the patch now, and I'll deploy it today :-) [19:25:45] (03PS5) 10Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: 10Eric Gardner) [19:25:46] urbanecm: i can't really test it, i couldn't reproduce the issue in production in the first place [19:25:56] (03Merged) 10jenkins-bot: WikidataPageBanner should disable table of contents using public functions [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737075 (https://phabricator.wikimedia.org/T295003) (owner: 10Jdlrobson) [19:25:56] urbanecm: the patch is small and harmless though [19:25:58] (03Merged) 10jenkins-bot: Instrument mobile talk page clicks [skins/MinervaNeue] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737084 (https://phabricator.wikimedia.org/T294738) (owner: 10Jdlrobson) [19:26:03] ... famous last words ;) [19:26:03] MatmaRex: okay, just sending it out then :) [19:26:11] (03CR) 10Ssingh: dnsdist: allow setting additional custom HTTP response headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737364 (owner: 10Ssingh) [19:26:19] perryprog: it's JS, harmless from appserver standpoint :) [19:26:23] phew [19:26:23] oh, if Seddon is here, apparently he had ran into it [19:27:05] MatmaRex: oh is this one I commented on earlier? 
[19:27:10] Discussion tools? [19:27:17] yeah [19:27:22] https://phabricator.wikimedia.org/T294981 [19:27:24] * urbanecm aborts deploying for now, waiting for Seddon to test [19:28:04] also, i think i know why it wasn't visible in Logstash, i'll make a bug and a patch about it later [19:28:24] even better :) [19:28:57] testing now [19:29:02] thanks [19:29:35] @MatmaRex & @urbanecm can confirm patch worked! [19:29:42] excellent! syncing for real then :) [19:30:05] thanks [19:30:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:00] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.ArticleTargetSaver.js: 9d7cde492a69dc4e18403f3dbacd2de27c3f05e0: ArticleTargetSaver: ve.init may be undefined (T294981) (duration: 00m 55s) [19:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:03] T294981: "TypeError: Cannot read properties of undefined (reading 'editingSessionId')" while posting using the new topic tool as logged out user - https://phabricator.wikimedia.org/T294981 [19:31:09] MatmaRex: and live! anything else i can do for you today? [19:31:19] that's all, thanks urbanecm [19:31:23] no problem [19:31:56] Jdlrobson: your patches should be at mwdebug1001, can you test 'em please? [19:32:13] (both of them) [19:33:44] Seddon: let me know if you want to do the amends in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/726993 now, do them in a followup or reschedule [19:33:48] all of those are fine by me [19:33:51] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) Hi, That sounds good to me. Thank you! Please let me know what the next steps are. 
[19:34:06] (03PS1) 10Tks4Fish: kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) [19:34:14] (03PS3) 10Jgiannelos: tile-pregeneration: Wait for envoy to get ready [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737481 (https://phabricator.wikimedia.org/T295290) [19:34:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:35] dontpanic: you might know that, but gerrit shows https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737483/ adds some images, but it doesn't show any removal [19:34:39] urbanecm: on it [19:35:05] thanks Jdlrobson [19:35:05] oh [19:35:17] (03PS4) 10Jgiannelos: tile-pregeneration: Wait for envoy to get ready [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737481 (https://phabricator.wikimedia.org/T295290) [19:35:29] dontpanic: your original patch that added them got merged (/me noticed it too late) [19:35:38] (03PS1) 10Ebernhardson: query_service: Generalize prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) [19:35:40] (03PS1) 10Ebernhardson: query_service: Remove prometheus resource cleanups [puppet] - 10https://gerrit.wikimedia.org/r/737485 (https://phabricator.wikimedia.org/T280008) [19:36:05] (03PS6) 10Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: 10Eric Gardner) [19:36:33] urbanecm: Wikivoyage one is good to sync [19:36:38] looking at the Minerva one still [19:36:41] wikivoyage? 
[19:36:55] wikidatapagebanner [19:37:19] urbanecm: other one is good [19:37:24] @urbanecm made the change [19:37:26] urbanecm: basically you can sync both my patches now :) [19:37:29] you mean the WikidataPageBanner one? [19:37:39] ok :) [19:37:53] (03PS5) 10Jgiannelos: tile-pregeneration: Wait for envoy to get ready [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/737481 (https://phabricator.wikimedia.org/T295290) [19:39:32] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/WikidataPageBanner/includes/WikidataPageBanner.php: 2c74457b26e0d288a371fd76bcb91b697554f9fd: WikidataPageBanner should disable table of contents using public functions (T295003) (duration: 00m 55s) [19:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:35] T295003: [1.38.0-wmf.7] WikidataPageBanner: Disable mechanism of table of contents is no longer working - https://phabricator.wikimedia.org/T295003 [19:41:10] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/skins/MinervaNeue/includes/Skins/SkinMinerva.php: 8375e38ee4d57b4ff3d30be96473b927ac8e4ef0: Instrument mobile talk page clicks (T294738) (duration: 00m 54s) [19:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:13] T294738: Define and instrument bounce rate on talk pages - https://phabricator.wikimedia.org/T294738 [19:41:18] urbanecm: looking into it, got called here, sorry [19:41:28] dontpanic: np, take your time :) [19:41:52] (03CR) 10Urbanecm: [C: 03+2] Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: 10Eric Gardner) [19:42:16] (03Abandoned) 10Tks4Fish: kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:42:17] !log urbanecm@deploy1002 Synchronized 
php-1.38.0-wmf.7/skins/MinervaNeue/resources/: 8375e38ee4d57b4ff3d30be96473b927ac8e4ef0: Instrument mobile talk page clicks (T294738) (duration: 00m 54s) [19:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:23] Jdlrobson: both should be live now [19:42:26] anything else from you? [19:42:41] urbanecm: nope i'll just confirm the fix in production [19:42:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:42:45] (03Merged) 10jenkins-bot: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: 10Eric Gardner) [19:42:50] sounds good to me -- ping if any changes are needed [19:42:52] thanks a bunch for managing a very busy backport window! [19:43:14] happy to help :) [19:43:27] Seddon: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/726993 is now at mwdebug1001, can you test please? [19:44:06] (03PS6) 10Urbanecm: Enable TheWikipediaLibrary on meta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:44:50] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:46:00] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation CREATE USER failed for admin@localhost on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:46:10] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:21] !log upload pdns-recursor 4.5.7-1wm1 to apt.wm.o (buster) [19:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:15] urbanecm: looks good! [19:47:21] thanks, syncing [19:47:29] (03CR) 10Urbanecm: [C: 03+2] Enable TheWikipediaLibrary on meta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:48:03] ottomata: see icinga ^ [19:48:28] thanks yeah, responding in #wikimedia-analytics [19:48:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1ca184b142f1f14b4c1537f6503e0ef893155453: Add a new "all assessments" option to MediaSearch assessments dropdown (T285349) (duration: 00m 55s) [19:48:57] Seddon: should be live now. Anything else i can do for you? [19:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:58] T285349: [M] Modify MediaSearch filter language to be consistent and add "all assessments" option - https://phabricator.wikimedia.org/T285349 [19:49:09] urbanecm: all good! [19:49:13] great : [19:49:14] :) [19:49:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:31] (03PS4) 10Ssingh: dnsrecursor: prepare pdns-recursor for the 4.5.7 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 [19:49:33] (03Merged) 10jenkins-bot: Enable TheWikipediaLibrary on meta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:50:11] (03CR) 10Ssingh: "(Updated commit message and rebased; no code change otherwise.)" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [19:50:17] JSherman: hey, can you please test your patch at mwdebug1001? [19:50:25] (03CR) 10Gehel: "WARNING: is this going to restart the service? If it does, we need to be careful on how we deploy this." [puppet] - 10https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [19:50:28] (not sure if you went through a backport before -- happy to answer any questions if there are any) [19:51:08] (03Restored) 10Tks4Fish: kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:51:11] I have not. How do I access mwdebug1001? 
[19:51:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32239/console" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [19:51:28] JSherman: so, the gist of what you need to do is at https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage [19:51:58] basically, you need to install an extension called WikimediaDebug, which will let you access the debug server [19:52:00] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: prepare pdns-recursor for the 4.5.7 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [19:52:01] !log an-coord1002: drop user 'admin'@'localhost'; start slave; to fix broken replication - T284150 [19:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:05] T284150: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 [19:52:20] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:52:41] you don't need any of the checkboxes selected, just selecting right debug server (mwdebug1001) and switching the toggle to on should work [19:52:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:11] I applied the patch at the debug server only to let you ensure it behaves as you would expect, avoiding breaking stuff for real users with full deployment [19:54:11] (03PS2) 10Urbanecm: kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:54:30] JSherman: let me know if the explanations make sense to you, or if you run into any issues. [19:54:59] (03CR) 10Andrew Bogott: [C: 03+2] maintain-views.yaml: remove dropped ep_* tables [puppet] - 10https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481) (owner: 10Zabe) [19:54:59] ah, I see, yes, this only comes into play for accounts with the configured global editcount, which we are starting with a very high value (50000) I actually don't have an account that meets the criteria, so I'd need to pull in someone else to actually make it do anything currently [19:55:08] (03PS2) 10Andrew Bogott: maintain-views.yaml: remove dropped ep_* tables [puppet] - 10https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481) (owner: 10Zabe) [19:55:20] We did testing in labs with a much lower threshold so that we could trigger it easily [19:55:57] JSherman: got it. So, are you saying the extension doesn't do anything at all for users with <50k edits globally? 
[19:56:28] correct [19:56:43] got it [19:56:48] (03CR) 10Razzi: [C: 03+1] maintain-views.yaml: remove dropped ep_* tables [puppet] - 10https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481) (owner: 10Zabe) [19:56:52] in that case, I'm going to deploy it w/o a test [19:56:59] knowing the beta tests were done [19:57:13] :+1 [19:57:37] (03CR) 10Urbanecm: [C: 03+2] kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:58:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e66bd53b54ee29423affcba768b7a3cf1a81714a: Enable TheWikipediaLibrary on meta & testwiki (T288070) (duration: 00m 55s) [19:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:16] T288070: Deploy The Wikipedia Library Echo notification with 50,000 edit count threshold - https://phabricator.wikimedia.org/T288070 [19:58:24] JSherman: just out of curiosity, as I've 170k edits globally, what should i see now? 
:-) [19:58:30] (your patch is live now) [19:58:32] (03Merged) 10jenkins-bot: kswiki: Renaming logo files to lowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737483 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:58:54] (03PS3) 10Urbanecm: kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:59:02] (03CR) 10Urbanecm: kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:59:05] (03CR) 10Urbanecm: [C: 03+2] kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [19:59:09] when you make your next edit, you should get an echo notification telling you about the wikipedia library [19:59:31] got it, thanks for explaining JSherman [20:00:08] no problem; thanks for rolling through what is clearly a very busy window! [20:00:14] (03Merged) 10jenkins-bot: kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [20:00:35] happy to help :). It's almost done now (just a single patch remaining) [20:00:37] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:01:00] dontpanic: your patches are at mwdebug1001 (all of SVG upload, SVG rename and the IS.php one) [20:01:02] can you test? 
[20:01:34] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:01:54] (03PS1) 10Andrew Bogott: dbproxy1018: depool s1 primary server for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737487 [20:02:21] yep [20:02:28] (03CR) 10jerkins-bot: [V: 04-1] dbproxy1018: depool s1 primary server for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737487 (owner: 10Andrew Bogott) [20:02:36] dontpanic: let me know how it goes :) [20:02:42] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:02:52] looks great :D [20:02:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:03] dontpanic: does that mean it works? 
:) [20:03:10] yes :) [20:03:17] (03PS2) 10Andrew Bogott: dbproxy1018: depool s1 primary server for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737487 (https://phabricator.wikimedia.org/T216481) [20:03:34] in that case, {{syncing}} :) [20:03:43] thanks a ton, and sorry for the mess :( [20:04:04] (03CR) 10Andrew Bogott: [C: 03+2] dbproxy1018: depool s1 primary server for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737487 (https://phabricator.wikimedia.org/T216481) (owner: 10Andrew Bogott) [20:04:31] np, it happens :)) [20:05:39] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: 5f7864f: 54e7f74: kswiki: Adding wordmark and tagline files (T294093) (duration: 00m 54s) [20:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:43] T294093: Requesting wordmark change for ks.wikipedia.org - https://phabricator.wikimedia.org/T294093 [20:06:02] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Legoktm) >>! In T251305#7490456, @Jelto wrote: > Some things which are not clear to me: > * Is this the right approach to use the `sre.switchdc.services` cookbook? Kind of, do note tha... [20:06:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:06:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c09793f5a918d280df444ade10e28eca136f7508: kswiki: Adding wordmark and tagline to IS.php (T294093) (duration: 00m 55s) [20:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:05] dontpanic: should be live! 
[20:08:17] thanks a ton :D [20:09:25] np [20:11:28] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:08] (03Abandoned) 10Ebernhardson: query_service: Remove prometheus resource cleanups [puppet] - 10https://gerrit.wikimedia.org/r/737485 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [20:24:43] (03PS2) 10Ebernhardson: query_service: Generalize prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) [20:32:40] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:34:22] (03PS1) 10RLazarus: admin: Add dmartin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/737493 (https://phabricator.wikimedia.org/T295264) [20:35:58] (03CR) 10RLazarus: [C: 03+2] admin: Add dmartin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/737493 (https://phabricator.wikimedia.org/T295264) (owner: 10RLazarus) [20:38:59] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for David Martin - https://phabricator.wikimedia.org/T295264 (10RLazarus) 05Open→03Resolved a:03RLazarus Done: ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf | grep dmartin member: uid=dmartin,ou=people,dc=wikimedia,dc=org ` Welco... 
[20:45:46] PROBLEM - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 11335 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops [21:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T2100) [21:09:16] (03PS1) 10Ladsgroup: [WIP] mediawiki-cache-warmup: Add support for POST requests [puppet] - 10https://gerrit.wikimedia.org/r/737498 (https://phabricator.wikimedia.org/T290989) [21:25:40] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11387 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [21:27:57] PROBLEM - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 11169 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops [21:36:27] (03PS1) 10Andrew Bogott: dbproxy1018: depool s2-s8 primary servers for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737502 (https://phabricator.wikimedia.org/T216481) [21:36:55] (03CR) 10Krinkle: [C: 03+1] "I vaguely recall there being a circular dependency here in that wgConf is used at runtime in cross-wiki contexts (e.g. 
JobQueue, WikiMap, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737209 (owner: 10Awight) [21:39:45] (03CR) 10Andrew Bogott: [C: 03+2] dbproxy1018: depool s2-s8 primary servers for maintain_views run [puppet] - 10https://gerrit.wikimedia.org/r/737502 (https://phabricator.wikimedia.org/T216481) (owner: 10Andrew Bogott) [21:40:49] (03CR) 10Krinkle: Extract reused dblists code into function (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737210 (owner: 10Awight) [21:40:59] (03PS1) 10EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [21:41:46] (03CR) 10jerkins-bot: [V: 04-1] Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [21:43:06] (03PS6) 10Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735712 [21:43:27] (03PS6) 10Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735713 [21:44:44] (03PS1) 10Andrew Bogott: Revert "dbproxy1018: depool s1-s8 primary servers for maintain_views run" [puppet] - 10https://gerrit.wikimedia.org/r/737505 [21:47:09] (03CR) 10Andrew Bogott: [C: 03+2] Revert "dbproxy1018: depool s1-s8 primary servers for maintain_views run" [puppet] - 10https://gerrit.wikimedia.org/r/737505 (owner: 10Andrew Bogott) [21:49:28] (03CR) 10Krinkle: [C: 04-1] "I don't think this should be enforced at this level. 
There are lots of things written at runtime in different components whether wmf-confi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737468 (https://phabricator.wikimedia.org/T295310) (owner: 10Ahmon Dancy) [21:51:55] (03PS2) 10Andrew Bogott: Swap emacs with emacs-nox [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 (owner: 10Majavah) [21:52:53] (03CR) 10Andrew Bogott: [C: 03+2] Swap emacs with emacs-nox [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 (owner: 10Majavah) [21:58:41] (03PS2) 10EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [22:00:05] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T2200) [22:00:47] (03PS3) 10EllenR: Set up beta test environment for QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [22:25:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:25:40] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:32:28] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/737509 (owner: 10Awight) [22:32:43] 10SRE, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech-static down - https://phabricator.wikimedia.org/T295266 (10colewhite) p:05Triage→03Medium [22:33:18] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: Character encoding issues on daily-image-l - https://phabricator.wikimedia.org/T295096 (10colewhite) p:05Triage→03Medium [22:33:51] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10colewhite) p:05Triage→03Medium [22:34:17] 10SRE, 10Traffic, 10serviceops, 10Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (10colewhite) p:05Triage→03Medium [22:34:40] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10procurement: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10colewhite) p:05Triage→03Medium [22:35:21] 10SRE, 10SRE Observability (FY2021/2022-Q2): Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10colewhite) p:05Triage→03Medium [22:35:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10colewhite) p:05Triage→03Medium [22:36:09] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10colewhite) p:05Triage→03Medium [22:38:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q2), 10Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10colewhite) p:05Triage→03Medium [23:00:16] (03CR) 10Andrew Bogott: [C: 03+1] 
O:puppetmaster::puppetdb: rename role to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [23:01:14] jouncebot: noe [23:01:26] jouncebot: now [23:01:26] For the next 0 hour(s) and 58 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211108T2200) [23:01:33] (03CR) 10Andrew Bogott: [C: 03+1] "Moritz, want me to go ahead and merge this when I have some time to keep an eye on it?" [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [23:02:55] jouncebot: next [23:02:55] In 0 hour(s) and 57 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211109T0000) [23:03:07] Ahh... okay [23:03:26] jouncebot: help [23:03:26] **** JounceBot Help **** [23:03:27] JounceBot is a deployment helper bot for the Wikimedia movement. [23:03:27] Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot [23:03:27] Available commands: [23:03:27] HELP Print all commands known to the server. [23:03:27] NEXT Get the next deployment event(s if they happen at the same time). [23:03:27] NOW Get the current deployment event(s) or the time until the next. [23:03:28] NOWANDNEXT Get the current and next deployment event(s). [23:03:28] REFRESH Refresh my knowledge about deployments. 
[23:26:13] (03PS1) 10Urbanecm: [beta] votewiki: ensure no group has securepoll-view-voter-pii [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737515 [23:32:50] (03CR) 10Urbanecm: [C: 03+2] [beta] votewiki: ensure no group has securepoll-view-voter-pii [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737515 (owner: 10Urbanecm) [23:33:45] (03Merged) 10jenkins-bot: [beta] votewiki: ensure no group has securepoll-view-voter-pii [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737515 (owner: 10Urbanecm) [23:36:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:12] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [23:50:12] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27