[00:00:07] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T0000).
[00:01:56] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:18] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2043.codfw.wmnet ` The log...
[00:18:28] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:20] RECOVERY - Host elastic2043 is UP: PING WARNING - Packet loss = 75%, RTA = 32.53 ms
[00:24:15] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:22] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2043.codfw.wmnet'] ` Of which those **FAILED**: ` ['elastic2043.codfw.wmnet'] `
[00:31:28] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2043.codfw.wmnet ` The log...
[00:50:27] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2043.codfw.wmnet with reason: REIMAGE
[00:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:38] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2043.codfw.wmnet with reason: REIMAGE
[00:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:35] (CR) Legoktm: Add configuration for running on kubernetes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: Giuseppe Lavagetto)
[01:04:55] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2043.codfw.wmnet'] ` and were **ALL** successful.
[01:05:57] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:28] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Legoktm) I checked all the channels in the table, the only outstanding things I see are: * -dcops is not publicly logged. I will ask there if that's intentional or not. * -analytics doesn't ha...
[01:42:13] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:05] SRE, Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (LanmeiCN) No,these two emails are from two administrators
[02:02:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:27] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[02:18:52] SRE, ops-codfw, DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) ` In reference to your Hewlett Packard Enterprise Support Case Number 5357298848, the following Customer Self Repair Par...
[02:26:47] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:41:36] SRE, Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (Shizhao) There is already zhwiki-l, why do we need to create this mailing list?
[03:02:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:27] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:17] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:10] ACKNOWLEDGEMENT - MegaRAID on db1175 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T287137 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:35:13] SRE, ops-eqiad: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (ops-monitoring-bot)
[04:02:06] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:13] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:26:57] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:59] (PS2) ArielGlenn: dumps: Drop absented cron in kiwix [puppet] - https://gerrit.wikimedia.org/r/705898 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[04:46:58] (CR) ArielGlenn: [C: +2] dumps: Drop absented cron in kiwix [puppet] - https://gerrit.wikimedia.org/r/705898
(https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[05:02:09] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:25] ops-eqiad, DBA: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (Marostegui) p:Triage→Medium Can we get a new disk for this host?
[05:14:19] (CR) Giuseppe Lavagetto: [C: +1] "Patch lgtm; I would probably also uncomment the other stanzas." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705343 (owner: Filippo Giunchedi)
[05:19:43] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:22:08] SRE, Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (LanmeiCN) zhwiki-l post update is not active
[05:23:31] (CR) Giuseppe Lavagetto: [C: -1] "There is a lot of leftovers in the manifest, if we're removing TLS completely."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[05:26:33] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:02] (CR) Giuseppe Lavagetto: [C: +1] Revert "thanos-swift envoy listener: rewrite HTTP host header" [puppet] - https://gerrit.wikimedia.org/r/705480 (owner: DCausse)
[05:31:20] !log T281327 [Elastic] Unbanned `elastic2043.codfw.wmnet` from all 3 cirrus/elasticsearch clusters; node is back in the fleet
[05:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:28] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327
[05:31:59] (CR) Giuseppe Lavagetto: [C: -1] "Hiera should only be looked up in profiles." [puppet] - https://gerrit.wikimedia.org/r/699427 (https://phabricator.wikimedia.org/T73480) (owner: Hashar)
[05:39:01] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[05:40:55] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[05:41:00] !log [WDQS] Restarted `wdqs-blazegraph` on `wdqs1013`
[05:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:59] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:01:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:11:47] (Juniper alarm active) firing:
Juniper alarm active - https://alerts.wikimedia.org
[06:20:31] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:20:36] !log [WDQS] Pooled `wdqs1006` (was still depooled following data-transfer cookbook runs from several hours ago)
[06:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:10] (PS7) KartikMistry: Add stream configuration for ContentTranslation events [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982)
[06:25:57] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:37] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (elukey) Voiced wikibugs, and also added several +o to analytics/ml people. I thought I already did it, is there something more to do other than `/mode #channel +o nick` ?
[06:34:52] (PS3) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[06:36:17] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[06:36:44] (PS3) Ryan Kemper: elasticsearch: further refactor rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221)
[06:37:36] (PS4) Ryan Kemper: elasticsearch: further refactor rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221)
[06:39:39] (PS4) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[06:40:07] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[06:40:21] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Operator873) >>! In T283230#7229244, @elukey wrote: > Voiced wikibugs, and also added several +o to analytics/ml people. I thought I already did it, is there something more to do other than `/m...
[06:41:49] (CR) Ryan Kemper: [C: +2] elasticsearch: further refactor rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[06:42:07] (CR) Ryan Kemper: [C: +2] elasticsearch: further refactor rolling-operation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[06:42:09] (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch: further refactor rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[06:43:03] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Majavah) >>! In T283230#7229244, @elukey wrote: > is there something more to do other than `/mode #channel +o nick` ? Changing modes with `/mode` does not persist when someone leaves the chann...
[06:47:47] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:48:50] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (elukey) Perfect thanks for the tip, fixed!
[06:49:15] (PS5) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[06:50:39] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[06:51:00] SRE, wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Volans) >>! In T283230#7229142, @Legoktm wrote: > Also, I can't see the -sre-foundations ACL. I don't think there's any reason for it to be private, asking in channel. Set PUBACL ON, thanks fo...
[06:52:50] (PS6) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[06:53:59] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.094e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:54:14] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[06:59:41] (PS7) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[06:59:43] (CR) Majavah: "Is there a reason not to just use prometheus's integrated openstack discovery, which takes care of this automatically? Config for that loo" [puppet] - https://gerrit.wikimedia.org/r/706047 (owner: Bstorm)
[07:00:22] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852 (owner: Effie Mouzeli)
[07:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170 (s2, s7), pool db1105 (s2) and db1098 (s7) into dump T286888', diff saved to https://phabricator.wikimedia.org/P16844 and previous config saved to /var/cache/conftool/dbconfig/20210722-070114-marostegui.json
[07:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:23] T286888: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888
[07:02:19] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:00] SRE, ops-eqiad, DBA, DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) This host was pooled for dumps which is
not moved to codfw, so it can potentially cause issues if dumps were about to start. I have depooled it and placed others in s2 and s7 to serv...
[07:05:22] (PS1) Ryan Kemper: elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276
[07:06:43] (PS2) Ryan Kemper: elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221)
[07:07:40] (CR) Volans: elastic: pull out execute_on_clusters (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[07:08:21] (CR) Ryan Kemper: [C: +1] Revert "thanos-swift envoy listener: rewrite HTTP host header" [puppet] - https://gerrit.wikimedia.org/r/705480 (owner: DCausse)
[07:10:18] (CR) jerkins-bot: [V: -1] elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[07:12:21] (PS3) Ryan Kemper: elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221)
[07:12:59] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:15:19] (CR) jerkins-bot: [V: -1] elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[07:24:23] (PS4) Muehlenhoff: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890
[07:24:47] !log hashar@deploy1002 Started deploy [integration/docroot@b3e39b0]: build: Updating mediawiki/mediawiki-codesniffer to 37.0.0
[07:24:52] (CR) jerkins-bot: [V: -1] Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff)
[07:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:56] !log hashar@deploy1002 Finished deploy [integration/docroot@b3e39b0]: build: Updating mediawiki/mediawiki-codesniffer to 37.0.0 (duration: 00m 09s)
[07:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:06] (CR) Muehlenhoff: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff)
[07:25:19] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:49] (PS5) Muehlenhoff: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890
[07:28:18] (CR) jerkins-bot: [V: -1] Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff)
[07:33:45] (PS1) Giuseppe Lavagetto: helmfile: allow performing a rolling restart [deployment-charts] - https://gerrit.wikimedia.org/r/706293
[07:35:11] (PS8) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[07:39:37] this is great effie --^
[07:40:40] (CR) JMeybohm: [C: +1] "Great!"
[deployment-charts] - https://gerrit.wikimedia.org/r/706293 (owner: Giuseppe Lavagetto)
[07:42:19] elukey: yeah
[08:00:49] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:39] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:26] (PS1) Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206)
[08:08:39] (PS2) Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206)
[08:10:42] (PS3) Filippo Giunchedi: hieradata: add o11y services to service::catalog [puppet] - https://gerrit.wikimedia.org/r/705343
[08:11:20] (CR) Filippo Giunchedi: [C: +2] "Thanks everyone!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/705343 (owner: Filippo Giunchedi)
[08:11:37] (PS4) Filippo Giunchedi: hieradata: add o11y services to service::catalog [puppet] - https://gerrit.wikimedia.org/r/705343
[08:24:27] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[08:25:09] (PS1) Filippo Giunchedi: hieradata: fix o11y services public hostnames in monitoring [puppet] - https://gerrit.wikimedia.org/r/706316
[08:25:10] that's me ^ fix incoming
[08:25:47] (CR) Filippo Giunchedi: [C: +2] hieradata: fix o11y services public hostnames in monitoring [puppet] - https://gerrit.wikimedia.org/r/706316 (owner: Filippo Giunchedi)
[08:26:29] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:11] the planet alert seems flappy, known/being looked at ?
[08:29:03] I did mention it few days ago, it's possible it was converted from cron to systemd timer so it could have been like that all the time, we're noticing it just now, but I'm not sure if anyone checked it or was just a theory
[08:30:55] ah that makes sense, thanks volans
[08:31:12] nevertheless it's annoying
[08:31:42] I just stubbed my toe once again on the fact that alert1001 doesn't have alert2001 in its icinga config :( looking forward to ripping all of this out though
[08:34:23] (PS1) Filippo Giunchedi: hieradata: no 'monitoring' section for alertmanager [puppet] - https://gerrit.wikimedia.org/r/706324
[08:34:37] !log cr2-codfw> request chassis fpc slot 0 offline - T287110
[08:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:45] T287110: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110
[08:35:10] (CR) Filippo Giunchedi: [C: +2] hieradata: no 'monitoring' section for alertmanager [puppet] - https://gerrit.wikimedia.org/r/706324 (owner: Filippo Giunchedi)
[08:39:18] (PS1) Filippo Giunchedi: hieradata: enable all non-lvs o11y services [puppet] - https://gerrit.wikimedia.org/r/706325
[08:41:08] (CR) Filippo Giunchedi: [C: +2] hieradata: enable all non-lvs o11y services [puppet] - https://gerrit.wikimedia.org/r/706325 (owner: Filippo Giunchedi)
[08:45:02] (CR) Hashar: "Compiler output looks about right: https://puppet-compiler.wmflabs.org/compiler1003/30295/" [puppet] - https://gerrit.wikimedia.org/r/706042 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[08:45:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[08:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:47] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[08:47:50] (CR) Hashar: "Puppet compiler
https://puppet-compiler.wmflabs.org/compiler1001/30296/ , it is not THAT helpful since it does a diff against origin/produ" [puppet] - https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[08:48:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[08:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:08] (CR) Elukey: Add support for knative serving (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[08:54:50] (CR) Kormat: "Change looks good, but let's wait until next week before merging, just to confirm the weekly run of the systemd timer succeeds." [puppet] - https://gerrit.wikimedia.org/r/705901 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[08:56:28] (CR) Elukey: Add support for knative serving (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[09:02:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:42] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:02:45] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2021-10-13 08:01:48 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/
[09:02:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:02:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:03:11] yo
[09:03:20] * volans in meeting, bug ping if I'm needed
[09:03:21] yo-2
[09:03:28] ohhi
[09:03:41] codfw-eqiad transport
[09:04:02] that is a little odd right
[09:04:09] I know what's going on
[09:04:32] XioNoX: didn't you warn us about the transport being used a lot a couple of days ago?
[09:04:41] I recall that you and Valentin were talking about it
[09:04:46] elukey: this, and the linecard failed yesterday
[09:04:47] yes
[09:04:47] RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2021-10-13 08:01:48 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[09:04:57] ah lovely I missed the linecard failure
[09:05:00] putting more strain on the other link
[09:05:15] see the big jump on https://librenms.wikimedia.org/graphs/to=1626944400/id=8284/type=port_bits/from=1626858000/
[09:05:56] we should depool eqiad to help a bit
[09:06:22] (PS9) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[09:06:42] I am not doing any backup in that direction currently that I could stop
[09:06:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:06:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:07:17] I acked the alerts
[09:07:43] XioNoX: so you are suggesting to dns-depool eqiad to reduce the load on the 10g transport link?
[09:07:48] XioNoX: was looking there, yeah makes sense - we already mentioned the combined usage was approaching 10G.
[09:07:49] elukey: exactly
[09:08:05] https://gerrit.wikimedia.org/r/c/operations/dns/+/703562/
[09:08:08] hmm yeah makes sense.
[09:08:29] why do I get logged out of gerrit every day now?
[09:08:45] (Restored) Ayounsi: Re-depool eqiad [dns] - https://gerrit.wikimedia.org/r/703562 (owner: Legoktm)
[09:08:54] (PS2) Ayounsi: Re-depool eqiad [dns] - https://gerrit.wikimedia.org/r/703562 (owner: Legoktm)
[09:09:09] (CR) Effie Mouzeli: [C: +1] Re-depool eqiad [dns] - https://gerrit.wikimedia.org/r/703562 (owner: Legoktm)
[09:09:14] Funny it's happening now - very early in US - and didn't happen yesterday during prime time hours.
[09:09:36] (CR) Elukey: [C: +1] Re-depool eqiad [dns] - https://gerrit.wikimedia.org/r/703562 (owner: Legoktm)
[09:09:46] (CR) Kormat: pontoon: initialize $_role on bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705661 (owner: Filippo Giunchedi)
[09:11:00] (CR) Ayounsi: [C: +2] Re-depool eqiad [dns] - https://gerrit.wikimedia.org/r/703562 (owner: Legoktm)
[09:11:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:11:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:11:47] !log depool eqiad to reduce load on one codfw-eqiad link - T287110
[09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] T287110: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110
[09:12:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:12:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org
[09:13:52] what would be the cause it happens in this direction, but not the other- geolocation of clients? cloud being eqiad only? T274234?
[09:13:52] T274234: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234
[09:14:02] so we should see the transport link usage decrease and transit/peering usage increase
[09:14:40] cool :)
[09:14:45] (PS15) Elukey: Add support for knative serving [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194)
[09:14:47] (PS8) Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[09:14:52] XioNoX, topranks thanks!
[09:15:35] jynus: I believe it's the different switch models and placement of servers, let me double check.
[09:15:57] (CR) jerkins-bot: [V: -1] WIP - Add kubeflow's kfserving chart [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[09:17:43] (CR) Kormat: pontoon: initialize user bare repo on bootstrap (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[09:20:42] jynus: Without rabbit holing too much on it the main difference is as follows.
[09:20:50] backup1003: Connected to asw-b2-eqiad, which directly connects to CRs over 10G links.
[09:20:50] backup2003: Connected to asw-c4-codfw, which is uplinked to asw-c2-codfw and asw-c7-codfw, each of which have 40G uplinks to CRs.
[09:20:58] (CR) Arturo Borrero Gonzalez: tools prometheus: allow scraping by ip address (2 comments) [puppet] - https://gerrit.wikimedia.org/r/706047 (owner: Bstorm)
[09:21:32] The faster line-rate on the 40G links means they are idle a higher proportion of the time when traffic arrives in to them, and thus less buffering/holding packets waiting for link to be available needs to be done.
[09:22:04] That means we don't have the same pressure on available buffers, and discards / tail drops when they are full and packets arrive that can't be transmitted right away.
[09:22:15] sorry, I think you are telling me about T274234
[09:22:15] T274234: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234
[09:22:42] I am, sry I must have got mixed up.
[09:22:43] I was asking about the differences of bandwidth when eqiad is mw primary
[09:22:55] vs when codfw is primary (now)
[09:22:59] (CR) Kormat: "What does the workflow look like for this?" [puppet] - https://gerrit.wikimedia.org/r/705663 (owner: Filippo Giunchedi)
[09:23:19] basically, the alert we just had
[09:23:55] (PS6) Jbond: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff)
[09:24:02] I would expect, naively, the bandwidth of the link to just get inverted
[09:25:26] but it seems it is up to 3 times larger according to: https://librenms.wikimedia.org/graphs/to=1626945600/id=8284/type=port_bits/from=1624267200/
[09:25:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:08] (PS9) Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[09:26:17] SRE, Data-Persistence-Backup, Infrastructure-Foundations, bacula, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Thanks @jcrespo. Yes this makes perfect sense. Due...
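Editorial aside on the 09:20:42 to 09:22:04 explanation (10G vs 40G uplinks): the point about line rate and buffering can be made concrete with a little arithmetic. The sketch below is illustrative only; the 1500-byte frame size and the 8 Gbit/s offered load are assumed values chosen for the example, not measurements from the links discussed, and the function names are hypothetical.

```python
# Illustrative only: FRAME_BYTES and OFFERED_GBPS are assumed values,
# not measurements from the links discussed in the log.

def serialization_delay_us(frame_bytes: int, link_gbps: float) -> float:
    """Time needed to clock one frame onto the wire, in microseconds."""
    bits = frame_bytes * 8
    return bits / (link_gbps * 1e9) * 1e6


def utilisation(offered_gbps: float, link_gbps: float) -> float:
    """Fraction of time the link is busy for a steady offered load."""
    return offered_gbps / link_gbps


FRAME_BYTES = 1500   # assumed: an MTU-sized Ethernet payload
OFFERED_GBPS = 8.0   # assumed: hypothetical steady offered load

if __name__ == "__main__":
    for link_gbps in (10.0, 40.0):
        delay = serialization_delay_us(FRAME_BYTES, link_gbps)
        busy = utilisation(OFFERED_GBPS, link_gbps)
        print(f"{link_gbps:.0f}G: {delay:.2f} us per frame, "
              f"busy {busy:.0%}, idle {1 - busy:.0%}")
```

Under these assumptions the same load keeps a 10G link busy four times as much of the time as a 40G link, and each frame occupies the 10G wire four times longer, which is why a frame arriving at the slower link is more likely to find it busy and sit in a buffer. Real traffic is bursty, so this steady-state view understates the buffer pressure on the slower link.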
[09:26:24] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10fgiunchedi) >>! In T276442#7228056, @jcrespo wrote: > The next step on productionization of workers is to setup the account for access to mw content on swift.... [09:28:08] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:28:13] (03CR) 10Kormat: [C: 03+1] pontoon: add instructions [puppet] - 10https://gerrit.wikimedia.org/r/705665 (owner: 10Filippo Giunchedi) [09:28:36] (03CR) 10Kormat: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/705663 (owner: 10Filippo Giunchedi) [09:29:32] (03CR) 10Kormat: [C: 03+1] pontoon: run puppet twice at enroll [puppet] - 10https://gerrit.wikimedia.org/r/705666 (owner: 10Filippo Giunchedi) [09:29:53] (03CR) 10Kormat: [C: 03+1] pontoon: always link hiera directory [puppet] - 10https://gerrit.wikimedia.org/r/705667 (owner: 10Filippo Giunchedi) [09:30:40] (03CR) 10Kormat: pontoon: create puppet client dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705664 (owner: 10Filippo Giunchedi) [09:36:05] (03PS1) 10Kormat: pc201[1-4]: Add to mariadb::parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/706335 (https://phabricator.wikimedia.org/T284825) [09:38:10] marostegui: nice quick CR for you ^ :) [09:38:45] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:39:23] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [09:40:28] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC ok https://puppet-compiler.wmflabs.org/compiler1002/30297/mw1356.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/705852 (owner: 10Effie Mouzeli) [09:40:42] (03CR) 10Marostegui: [C: 03+1] pc201[1-4]: Add to mariadb::parsercache role [puppet] - 
10https://gerrit.wikimedia.org/r/706335 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [09:41:11] (03CR) 10Kormat: [C: 03+2] pc201[1-4]: Add to mariadb::parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/706335 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [09:58:06] (03PS1) 10Jbond: amce_chief: add gitlab2001 to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/706339 [09:58:26] jbond: s/amce/acme/ <3 [09:58:33] (03PS1) 10Effie Mouzeli: add mwdebug service to LVS 3 [puppet] - 10https://gerrit.wikimedia.org/r/706340 (https://phabricator.wikimedia.org/T283056) [09:58:49] vgutierrez: thanks :) [09:59:08] (03PS2) 10Jbond: acme_chief: add gitlab2001 to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/706339 [09:59:36] jelto: FYI ^^^ this should let puppet get a bit further on gitlab2001 [10:00:06] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1000).
[10:00:40] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: add gitlab2001 to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/706339 (owner: 10Jbond) [10:01:01] (03PS5) 10Dzahn: conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) [10:01:13] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:31] (03CR) 10Dzahn: [C: 03+1] acme_chief: add gitlab2001 to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/706339 (owner: 10Jbond) [10:01:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [10:01:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [10:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:02] jbond: great thanks! [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:42] np [10:05:04] (03CR) 10Vgutierrez: [C: 03+1] add mwdebug service to LVS 3 [puppet] - 10https://gerrit.wikimedia.org/r/706340 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [10:05:36] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repositories'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) [10:05:38] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseClientRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706342 (https://phabricator.wikimedia.org/T257260) [10:05:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM before wmf.16 is safely rolled out to all wikis and won’t be rolled back again." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [10:08:11] (03PS1) 10David Caro: ceph: Added CephClusterController tests and a couple fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/706343 [10:09:30] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1422.eqiad.wmnet [10:09:32] 10SRE, 10Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (10Aklapper) 05Open→03Stalled @LanmeiCN: What does that mean exactly? Could you elaborate why exactly you cannot post on the existing mailing list? [10:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [10:12:36] (03CR) 10Dzahn: [C: 03+2] conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [10:13:49] jynus: sry just getting back to this. I'm relatively new and not sure of all the factors which affect traffic over those transport links. But a big factor is where requests come in to us on the Internet. Ashburn is likely to attract more inbound ISP traffic than Dallas no matter what, and thus when we switch over to Dallas, more request traffic needs to be sent from eqiad to codfw than was sent from codfw to eqiad before. [10:14:08] XioNoX: I am sort of guessing / making this answer up - please tell me I'm completely wrong here if I am!!
[10:15:21] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:19:21] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1421.eqiad.wmnet ` The log can be found in `/var/log/wmf-... [10:19:30] !log mw1421, mw1422 - converting from app to API server for balance in row A [10:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:26] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1422.eqiad.wmnet ` The log can be found in `/var/log/wmf-... [10:26:07] (03PS7) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/704890 [10:26:11] topranks: also we have services active in eqiad only, especially analytics I'd say, but also wmcs and other stuff [10:26:47] so they used to be local, and now have to go to codfw with the switchover [10:26:58] ok yep thanks. I'm sure there are many factors. But hopefully wasn't completely off the mark with my answer. [10:27:03] That makes sense. [10:30:01] topranks: yep you're right, also esams goes through eqiad to reach codfw [10:32:03] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) First testing results: * The old scores are working (likely due to being pre-generated with old setu... [10:32:31] Ok yeah that's a massive factor also. Thanks for that - really helps my mental picture of the network. 
[10:35:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1421.eqiad.wmnet with reason: REIMAGE [10:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1421.eqiad.wmnet with reason: REIMAGE [10:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:21] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:38:28] (03PS1) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [10:38:30] (03PS1) 10Effie Mouzeli: add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706356 (https://phabricator.wikimedia.org/T283056) [10:39:06] (03CR) 10Effie Mouzeli: [C: 03+2] add mwdebug service to LVS 3 [puppet] - 10https://gerrit.wikimedia.org/r/706340 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [10:40:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1422.eqiad.wmnet with reason: REIMAGE [10:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:49] pybal alerts may fire [10:41:21] thanks for the heads-up [10:42:28] !log restart pybal on lvs2010 and lvs1016 [10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1422.eqiad.wmnet with reason: REIMAGE [10:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:01] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:43:08] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.59:4444]) https://wikitech.wikimedia.org/wiki/PyBal [10:43:36] :) [10:43:52] good to know that the check still works after my bugfix /o\ [10:45:17] !log jiji@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mwdebug [10:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:58] !log restart pybal on lvs2009 and lvs1015 [10:46:02] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 66 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [10:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:36] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2005.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:46:50] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.59:4444]) https://wikitech.wikimedia.org/wiki/PyBal [10:47:53] ignore the mwdebug_4444 healthcheck [10:48:12] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:51:22] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [10:52:04] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - mwdebug_4444: Servers kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:52:08] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:57:26] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1421.eqiad.wmnet'] ` and were **ALL** successful. [10:59:40] !log mw1421, mw1422 - puppetmaster - cleaning certs, reimaged hosts [10:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:00] (03PS1) 10Jgiannelos: tegola-vector-tiles: Increase staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/706372 [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: How many deployers does it take to do EU Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1100). [11:00:05] zabe: A patch you scheduled for EU Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] o/ [11:00:12] o/ [11:00:33] is anyone signed up for the training? [11:00:50] yes I did [11:01:34] alright, cool! [11:01:48] is anyone around to run the training?
otherwise I can try my best :D [11:02:01] but I don’t think I have the slide deck that someone else had prepared [11:02:40] (03CR) 10Jgiannelos: "For testing purposed we are going to mirror some production traffic to tegola staging as a kartotherian tile source to see how it behaves " [deployment-charts] - 10https://gerrit.wikimedia.org/r/706372 (owner: 10Jgiannelos) [11:02:46] apergos is usually around on Thursdays :) [11:03:32] eh, I revoked puppet certs of my reimaged hosts but should not have. used to be needed but not anymore..duh [11:03:54] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2009.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:07:34] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:09:02] tsepoThoabala: if you send me the google hangout link, I can join the call and try to say useful things until someone else shows up [11:11:01] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1422.eqiad.wmnet'] ` and were **ALL** successful. 
[11:11:28] (03PS3) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [11:13:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:15:18] (03PS2) 10Lucas Werkmeister (WMDE): Avoid using WikiPage::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705682 (owner: 10Zabe) [11:15:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid using WikiPage::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705682 (owner: 10Zabe) [11:16:17] (03Merged) 10jenkins-bot: Avoid using WikiPage::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705682 (owner: 10Zabe) [11:17:45] zabe: the first change is on mwdebug2001 [11:18:44] I’m not sure how to test it tbh… is getting the robots.txt with x-wikimedia-debug enough? [11:18:50] I feel like there might be a lot of caching [11:20:00] hm, x-cache-status response header is “pass”, so maybe it’s not cached [11:20:13] please accept my apologies for not being here, I've had some network connectivity issues that have just thrown me for a loop [11:20:33] effie: should I not touch confctl/pybal for now? 
waiting a bit [11:20:37] in the meantime this week is 2 hours of collab followed by 5 hours of offsite every day so it's pushed everything else off the calendar [11:20:47] oh wow [11:20:57] and I meant to be here today because we have someone to be trained too :-( [11:20:59] mutante: no please wait for me to finish up [11:21:00] I’m in the call now, but if you can, feel free to take over [11:21:14] The last time I did a patch for robots.php, I made a mistake and immediately got fatals when using mwdebug [11:21:27] being very honest, I should cook food so I do not starve during the next 7 hours... :-/ [11:21:36] zabe: alright, then I think that’s good enough for me [11:21:41] effie: yep, ack [11:21:47] apergos: ok :) [11:22:00] gah I was so excited to do this too.... sorry once again [11:23:12] Lucas_WMDE: mwdebug looks good to me [11:23:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized w/robots.php: Config: [[gerrit:705682|Avoid using WikiPage::factory()]] (duration: 01m 06s) [11:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Nikerabbit) Automatic merging is not working: https://gerrit.wikimedia.org/r/c/operations/software/mailman-templat...
[11:24:53] (03PS3) 10Lucas Werkmeister (WMDE): Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 (owner: 10Zabe) [11:24:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 (owner: 10Zabe) [11:25:48] (03Merged) 10jenkins-bot: Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 (owner: 10Zabe) [11:27:04] zabe: alright, second change is also on mwdebug2001 [11:27:09] I’ll test it as well [11:28:13] (03PS2) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [11:29:18] (03PS3) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [11:29:51] I’m guessing touch.php powers https://en.wikipedia.org/static/apple-touch/wikipedia.png [11:30:06] (03PS1) 10Jelto: fix puma and sidekiq exporter listen address [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/706396 (https://phabricator.wikimedia.org/T275170) [11:30:35] (03CR) 10Hnowlan: [C: 03+1] tegola-vector-tiles: Increase staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/706372 (owner: 10Jgiannelos) [11:31:15] yes [11:31:37] stuff looks to me [11:32:09] alright, then I’ll sync [11:34:01] !log lucaswerkmeister-wmde@deploy1002 Synchronized w/favicon.php: Config: [[gerrit:705690|Avoid using MWHttpRequest::factory()]] (1/2) (duration: 01m 04s) [11:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized w/touch.php: Config: [[gerrit:705690|Avoid using MWHttpRequest::factory()]] (2/2) (duration: 01m 04s) [11:35:49] !log removing maps2010 from old maps cassandra cluster [11:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:55] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:27] (03CR) 10Jelto: "Exporters for rails and sidekiq still listen on localhost only. This change changes the listen address (hopefully). For the puma/rails exp" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/706396 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [11:36:44] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Increase staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/706372 (owner: 10Jgiannelos) [11:36:49] !log EU backport+config window done [11:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:53] 10SRE, 10Traffic: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) >>! In T286713#7215152, @Vgutierrez wrote: > we should reduce the threshold, 3 weeks should be better for a LE acme-chief manage... [11:39:23] (03Merged) 10jenkins-bot: tegola-vector-tiles: Increase staging replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/706372 (owner: 10Jgiannelos) [11:39:30] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/706398 [11:40:44] 10SRE, 10Traffic: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) Same with the https://phabricator.wikimedia.org cert, it is still a DigiCert cert for me. So this is about adjusting the monitor... 
[11:41:16] Lucas_WMDE: thanks for your help :) [11:41:30] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Marostegui) [11:42:11] (03PS1) 10Jgiannelos: tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 [11:42:30] zabe: no problem :) [11:43:07] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10RhinosF1) Jenkins bot never voted on it. It should have the right to. [11:44:59] (03CR) 10Jgiannelos: "Debugging all SQL queries performed by tegola is a bit noisy. At the moment we know that serving tiles is working OK so no need for the de" [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 (owner: 10Jgiannelos) [11:50:52] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:52:10] 10SRE, 10Wikimedia-Mailing-lists: Make customized Mailman3 templates translatable - https://phabricator.wikimedia.org/T282018 (10hashar) >>! In T282018#7069803, @gerritbot wrote: > Change 685938 **merged** by jenkins-bot: > %%%[integration/config@master] Zuul: [operations/software/mailman-templates] Add CI of... [11:53:32] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10hashar) >>! In T282018#7069803, @gerritbot wrote: > Change 685938 **merged** by jenkins-bot: > %%%[integration/con... 
[11:54:27] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/706398 (owner: 10Jgiannelos) [11:57:12] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/706398 (owner: 10Jgiannelos) [12:01:27] (03PS1) 10Dzahn: icinga/planet: use letsencrypt check command for https cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/706410 (https://phabricator.wikimedia.org/T286713) [12:01:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1421-1422].eqiad.wmnet with reason: new host [12:01:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1421-1422].eqiad.wmnet with reason: new host [12:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:03] !log cleaning rest of auto-approve logs of ruwiki [12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:19:25] (03PS1) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [12:19:55] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:20:09] (03PS1) 10Elukey: grafana: add the codfw k8s-ml_serve datasource [puppet] - 10https://gerrit.wikimedia.org/r/706425 [12:20:53] (03PS2) 10Elukey: grafana: add the codfw k8s-mlserve datasource [puppet] - 10https://gerrit.wikimedia.org/r/706425 
[12:23:00] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:23:04] (03CR) 10Elukey: [C: 03+2] grafana: add the codfw k8s-mlserve datasource [puppet] - 10https://gerrit.wikimedia.org/r/706425 (owner: 10Elukey) [12:25:16] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10tstarling) Things that are broken in safe mode should be put in a separate task. I think in the long term, i... [12:33:07] (03PS2) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [12:33:38] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:34:15] (03CR) 10Vgutierrez: add mwdebug service to LVS 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [12:35:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` pc1014.eqiad.wmnet ` The log can be found in `/var/log/wm... [12:37:30] (03CR) 10Muehlenhoff: "Looks fine. But won't we need some kind of logrotation or trimming of the audit logs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [12:38:09] (03CR) 10Ottomata: [C: 03+1] Create an aqs-roots group, analogous to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [12:40:06] !log cleaning flaggedrevs auto-approve logs in dewiki [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:30] (03PS1) 10Kormat: pc201[1-4]: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/706436 (https://phabricator.wikimedia.org/T284825) [12:47:28] (03CR) 10Kormat: [C: 03+2] pc201[1-4]: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/706436 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [12:48:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1014.eqiad.wmnet with reason: REIMAGE [12:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:39] 10SRE, 10Traffic, 10Sustainability (Incident Followup): False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (10Vgutierrez) [12:50:09] 10SRE, 10Traffic, 10Sustainability (Incident Followup): LVS can't handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10Vgutierrez) [12:50:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1014.eqiad.wmnet with reason: REIMAGE [12:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:49] (03PS3) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [12:56:19] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:56:58] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 
10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Inductiveload) Adding a "." to the title of https://en.wikisource.org/wiki/Abide_with_Me_(Illustrated_Victor... [12:59:11] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10cmooney) Thanks @Jclark-ctr I've set up port 39 on scs-c1-eqiad now, standard port config same as the other Juniper gear. But I get nothing back on the console when I try to... [12:59:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1014.eqiad.wmnet'] ` and were **ALL** successful. [12:59:26] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/706390 (owner: 10L10n-bot) [12:59:34] (03CR) 10Hashar: "I have manually added this change back in CI with:" [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/706390 (owner: 10L10n-bot) [12:59:46] (03CR) 10Ssingh: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [13:01:06] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September), 10Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10hashar) The job fails due to: ` + "help": "Aide pour la liste de diffusion ${listname}\n\nC’... 
[13:01:23] it is train time [13:01:58] (03PS1) 10Hashar: group2 wikis to 1.37.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706450 [13:02:00] (03CR) 10Hashar: [C: 03+2] group2 wikis to 1.37.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706450 (owner: 10Hashar) [13:02:06] (03PS4) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [13:02:35] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [13:02:53] (03Merged) 10jenkins-bot: group2 wikis to 1.37.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706450 (owner: 10Hashar) [13:04:06] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.37.0-wmf.15 [13:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Kormat) [13:05:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Kormat) @RobH : i took the liberty to reimage pc1014 as the network connection is now working. I've also set it to 'staged' in netbox, but i'll leave this task fo... 
[13:07:16] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: Not alerting due to fresh production wikiversions: 845 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [13:10:00] (03PS5) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [13:10:29] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [13:13:17] (03PS6) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [13:14:43] (03CR) 10jerkins-bot: [V: 04-1] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [13:19:11] (03PS7) 10Muehlenhoff: Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) [13:20:32] (03PS2) 10Effie Mouzeli: add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706356 (https://phabricator.wikimedia.org/T283056) [13:20:55] (03CR) 10jerkins-bot: [V: 04-1] add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706356 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [13:23:53] (03CR) 10Jbond: [C: 03+1] "LGTM, completly optional comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [13:23:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [13:24:44] (03PS1) 10Kormat: pc101[1-4]: Add to parsercache role and sections. 
[puppet] - 10https://gerrit.wikimedia.org/r/706475 (https://phabricator.wikimedia.org/T284825) [13:25:33] marostegui: another nice short one for you ^ :) [13:25:50] (03CR) 10Ottomata: [C: 03+1] Add 5 minutes offset to gobblin webrequest timer [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [13:26:33] (03PS1) 10David Caro: global: ran black and isort [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706476 [13:26:35] (03PS1) 10David Caro: global: ran flake8 on the code [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706477 [13:26:37] (03PS1) 10David Caro: global: added .gitreview file [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706478 [13:27:22] (03CR) 10Marostegui: [C: 03+1] pc101[1-4]: Add to parsercache role and sections. [puppet] - 10https://gerrit.wikimedia.org/r/706475 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:27:44] (03CR) 10Kormat: [C: 03+2] pc101[1-4]: Add to parsercache role and sections. [puppet] - 10https://gerrit.wikimedia.org/r/706475 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [13:31:19] (03PS1) 10Marostegui: Revert "wmnet: Switch m3-master" [dns] - 10https://gerrit.wikimedia.org/r/705996 [13:31:28] (03CR) 10Marostegui: [C: 04-2] "Wait for the network maintenance to be over." [dns] - 10https://gerrit.wikimedia.org/r/705996 (owner: 10Marostegui) [13:31:32] (03PS2) 10Marostegui: Revert "wmnet: Switch m3-master" [dns] - 10https://gerrit.wikimedia.org/r/705996 [13:34:40] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) >>! In T257066#7229715, @Inductiveload wrote: > Adding a cache-busting "." to the title of https://en... 
[13:37:42] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [13:42:23] (03PS3) 10Ssingh: auditd: initial commit for the auditd module. [puppet] - 10https://gerrit.wikimedia.org/r/705696 [13:43:50] (03PS1) 10Jelto: site/conftool: add mw1439,mw1440,mw1441,mw1442 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) [13:47:06] (03CR) 10Ssingh: "> Patch Set 2: Code-Review+1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [13:47:40] !log kormat@cumin1001 START - Cookbook sre.dns.netbox [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:20] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10Ottomata) Approved! [13:49:01] (03PS4) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [13:49:21] (03CR) 10Jelto: "this change creates four new canary api servers mw1439 to mw1442 as replacements for Id9596cca8dad791cfbb2bb4abfd306ee8d2cb02b. 
Could you " [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:49:25] (03CR) 10jerkins-bot: [V: 04-1] add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [13:50:00] (03PS5) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [13:52:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] (03PS6) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [13:53:56] (03CR) 10jerkins-bot: [V: 04-1] add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [13:54:18] (03PS7) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [13:54:37] PROBLEM - cassandra CQL 10.192.48.166:9042 on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:58:59] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10Ottomata) Approved! [14:01:49] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/30299/" [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [14:05:13] ^^ kartotherian seems to be back on codfw, should we worry about that cassandra alert? 
[14:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [14:12:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [14:16:35] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [14:17:30] (03PS1) 10Urbanecm: Growth: Add mentor dashboard related config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706491 (https://phabricator.wikimedia.org/T278920) [14:17:31] jouncebot: now [14:17:31] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [14:17:33] jouncebot: next [14:17:33] In 0 hour(s) and 42 minute(s): Switch buffer re-partition - Eqiad Row C(network change) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1500) [14:17:50] (03CR) 10Urbanecm: [C: 03+2] Growth: Add mentor dashboard related config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706491 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [14:18:28] (03PS8) 10Effie Mouzeli: add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) [14:18:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) 05Open→03Resolved [14:18:42] (03Merged) 10jenkins-bot: Growth: Add mentor dashboard related config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706491 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [14:18:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) All good, task resolved. 
[14:19:41] (03CR) 10Vgutierrez: [C: 03+1] add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [14:20:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0208fc2b71863c91c3e767373d4bea1a2eaf178d: Growth: Add mentor dashboard related config (T278920) (duration: 00m 55s) [14:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:10] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920 [14:20:59] (03CR) 10Effie Mouzeli: [C: 03+2] add mwdebug service to LVS 4 [puppet] - 10https://gerrit.wikimedia.org/r/706355 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [14:22:27] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:22:36] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[14:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:54] (03PS1) 10Ottomata: Ensure remaining camus jobs are absent [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) [14:23:44] joal https://gerrit.wikimedia.org/r/c/operations/puppet/+/706492 [14:23:44] :) [14:23:52] maybe will get one thing done today besides emails and planning meeting :) [14:24:20] (03CR) 10Zabe: [C: 04-1] Ensure remaining camus jobs are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:24:34] (03PS2) 10Ottomata: Ensure remaining camus jobs are absent [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) [14:24:48] (03CR) 10Ottomata: Ensure remaining camus jobs are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:25:08] !log restarting pybal in lvs2010 and lvs1016 [14:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:09] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:27:13] !log installing libwebp security updates on stretch [14:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:07] (03CR) 10Joal: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:28:14] sorry ottomata I'm late :) [14:29:10] !log restarting pybal in lvs2009 and lvs1015 [14:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:41] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:30:47] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:31:34] (03CR) 10Ottomata: [C: 03+2] Ensure remaining camus jobs are absent [puppet] - 10https://gerrit.wikimedia.org/r/706492 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:33:12] 10SRE, 10ops-codfw: decommission procyon - https://phabricator.wikimedia.org/T287114 (10Papaul) [14:33:36] 10SRE, 10ops-codfw: decommission procyon - https://phabricator.wikimedia.org/T287114 (10Papaul) 05Open→03Resolved complete [14:34:11] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 845 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:35:48] (03PS1) 10David Caro: am: Add team tags matcher file support [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 [14:35:57] (03Abandoned) 10Effie Mouzeli: add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706356 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [14:35:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:43] !log depool cp108[3-6].eqiad.wmnet - T286065 [14:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:51] T286065: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 [14:38:54] RECOVERY - mediawiki-installation DSH group on mw2384 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:39:21] (03PS1) 10Effie Mouzeli: add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706502 (https://phabricator.wikimedia.org/T283056) [14:40:04] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:31] (03PS2) 10David Caro: am: Add 
team tags matcher file support [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 [14:40:37] !log mmandere@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on cp[1083-1086].eqiad.wmnet with reason: Eqiad row C maintenance [14:40:40] !log mmandere@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp[1083-1086].eqiad.wmnet with reason: Eqiad row C maintenance [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:59] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 4 host(s) and their services with reason: Eqiad row C maintenance ` cp[108... [14:43:02] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Vgutierrez) [14:44:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706502 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [14:45:08] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Inductiveload) [14:45:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:35] (03CR) 10Effie Mouzeli: [C: 03+2] add mwdebug service to LVS 5 [puppet] - 10https://gerrit.wikimedia.org/r/706502 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [14:47:44] !log depool lvs1015 - T286065 [14:47:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:54] T286065: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 [14:49:35] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10herron) [14:50:09] !log mmandere@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs1015.eqiad.wmnet with reason: Eqiad row C maintenance [14:50:10] !log mmandere@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs1015.eqiad.wmnet with reason: Eqiad row C maintenance [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:28] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 1 host(s) and their services with reason: Eqiad row C maintenance ` lvs101... 
[14:50:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:19] ^^ that's expected [14:53:23] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:55:48] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Vgutierrez) [14:58:33] (03CR) 10Giuseppe Lavagetto: profile::trafficserver: include mwdebug.discovery.wmnet in X-Wikimedia-Debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) (owner: 10Effie Mouzeli) [14:58:40] (03CR) 10Hashar: "> Hiera should only be looked up in profiles." [puppet] - 10https://gerrit.wikimedia.org/r/699427 (https://phabricator.wikimedia.org/T73480) (owner: 10Hashar) [14:58:49] !log disabled puppet temporarily for Row C switch maintenance [14:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:02] (03PS1) 10Effie Mouzeli: Add mwdebug discovery record [dns] - 10https://gerrit.wikimedia.org/r/706506 (https://phabricator.wikimedia.org/T283056) [15:00:05] topranks and XioNox: It is that lovely time of the day again! You are hereby commanded to deploy Switch buffer re-partition - Eqiad Row C(network change). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1500). 
[15:00:14] (03CR) 10Vgutierrez: [C: 03+1] "looking good from a pure ATS perspective, no problem with what joe suggested though, feel free to submit a new PS" [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) (owner: 10Effie Mouzeli) [15:01:26] (03PS2) 10Effie Mouzeli: profile::trafficserver: include mwdebug.discovery.wmnet in X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) [15:04:12] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10MoritzMuehlenhoff) [15:04:57] 10ops-eqiad, 10DBA: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (10wiki_willy) a:03Jclark-ctr [15:07:09] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Switch m3-master" [dns] - 10https://gerrit.wikimedia.org/r/705996 (owner: 10Marostegui) [15:07:58] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10aborrero) [15:08:57] (03PS3) 10David Caro: am: Add team tags matcher file support [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) [15:10:23] PROBLEM - Check systemd state on wtp1042 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:30] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10herron) [15:11:20] !log pool cp108[3-6].eqiad.wmnet - T286065 [15:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] T286065: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 [15:11:45] !log re-enabled puppet after row C 
switch maintenance completed [15:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:18] (03PS1) 10Kormat: pc101[1-4]: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/706507 (https://phabricator.wikimedia.org/T284825) [15:14:35] !log shutdown db2097 for hw servicing T287072 [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:43] T287072: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 [15:14:49] (03CR) 10Kormat: [C: 03+2] pc101[1-4]: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/706507 (https://phabricator.wikimedia.org/T284825) (owner: 10Kormat) [15:14:55] !log pool lvs1015 - T286065 [15:14:58] (03PS1) 10Elukey: profile::prometheus::ops: add collection of ml-etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/706508 [15:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:29] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10jcrespo) Host should be down now @Papaul [15:15:56] godog: if you have a minute <3 --^ [15:16:27] (03PS1) 10Filippo Giunchedi: thanos: add rule to module/profile [puppet] - 10https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) [15:16:27] elukey: sure [15:16:29] (03PS1) 10Filippo Giunchedi: hieradata: configure thanos rule hosts [puppet] - 10https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) [15:16:31] (03PS1) 10Filippo Giunchedi: role: activate thanos::rule profile on thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142) [15:16:33] (03PS1) 10Filippo Giunchedi: prometheus: pull metrics from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) 
[15:16:35] (03PS1) 10Filippo Giunchedi: thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) [15:17:01] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 77, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:15] (03CR) 10jerkins-bot: [V: 04-1] hieradata: configure thanos rule hosts [puppet] - 10https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [15:17:24] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Legoktm) >>! In T282022#7229672, @hashar wrote: >>>! In T282018#7069803, @gerritbot wrote: >> Change 685938 **merg... [15:18:10] what the hell jenkins, I ran CI locally [15:18:18] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::prometheus::ops: add collection of ml-etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/706508 (owner: 10Elukey) [15:20:46] thank yooouuu [15:20:55] (03CR) 10Elukey: [C: 03+2] profile::prometheus::ops: add collection of ml-etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/706508 (owner: 10Elukey) [15:20:57] (03PS2) 10Effie Mouzeli: Add mwdebug discovery record [dns] - 10https://gerrit.wikimedia.org/r/706506 (https://phabricator.wikimedia.org/T283056) [15:22:17] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10aborrero) [15:22:39] !log installing dnspython bugfix updates from Buster 10.10 point release [15:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:14] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [15:23:36] (03CR) 10Effie Mouzeli: [C: 03+2] 
Add mwdebug discovery record [dns] - 10https://gerrit.wikimedia.org/r/706506 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [15:23:51] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [15:23:55] (03PS3) 10Effie Mouzeli: Add mwdebug discovery record [dns] - 10https://gerrit.wikimedia.org/r/706506 (https://phabricator.wikimedia.org/T283056) [15:24:38] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10jcrespo) Looking good: ` $ free -g total used free shared buff/cache available Mem:... [15:24:58] (03PS2) 10Filippo Giunchedi: hieradata: configure thanos rule hosts [puppet] - 10https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) [15:25:00] (03PS2) 10Filippo Giunchedi: role: activate thanos::rule profile on thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142) [15:25:02] (03PS2) 10Filippo Giunchedi: prometheus: pull metrics from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) [15:25:04] (03PS2) 10Filippo Giunchedi: thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) [15:25:30] (03CR) 10Ayounsi: "Can it leverage the work I/F is doing to document service owners? 
Otherwise the risk is to end up with multiple definitions/mappings acros" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [15:27:37] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) switch shipped out today tracking information below Tracking Number: 1ZA19A021295420730 [15:30:43] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10jcrespo) I doubled confirmed all dimms "Good, In use". Thank you, @Papaul for the quick response! ` PROC 1 DIMM 3 Good, In Use... [15:30:52] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10Papaul) 05Open→03Resolved Return DIMM information {F34560100} [15:31:20] (03PS6) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [15:33:21] (03CR) 10Ahmon Dancy: "typo. Otherwise looks ok." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [15:35:05] (03CR) 10Hnowlan: [C: 03+2] maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:35:27] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [15:35:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::trafficserver: include mwdebug.discovery.wmnet in X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) (owner: 10Effie Mouzeli) [15:39:34] (03CR) 10Effie Mouzeli: [C: 03+2] profile::trafficserver: include mwdebug.discovery.wmnet in X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) (owner: 10Effie Mouzeli) [15:42:30] (03PS1) 10Ottomata: Remove camus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/706541 (https://phabricator.wikimedia.org/T271232) [15:42:39] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 6525 MB (2% inode=74%): /tmp 6525 MB (2% inode=74%): /var/tmp 6525 MB (2% inode=74%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops [15:44:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2091', diff saved to https://phabricator.wikimedia.org/P16859 and previous config saved to /var/cache/conftool/dbconfig/20210722-154408-marostegui.json [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:34] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30300/console" [puppet] - 10https://gerrit.wikimedia.org/r/706541 (https://phabricator.wikimedia.org/T271232) (owner: 
10Ottomata) [15:44:40] hnowlan_: should I merge your patch too ? [15:45:03] "Hnowlan: maps: make maps2010 a buster replica of maps2009 (17ea527057)" [15:45:04] PROBLEM - Host ms-be1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:04] PROBLEM - Host ms-be1042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:15] PROBLEM - Host ms-fe1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:17] !log Stop db2091 for onsite maintenance [15:45:21] PROBLEM - Host an-coord1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:29] PROBLEM - Host an-worker1091.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:31] PROBLEM - Host backup1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:37] PROBLEM - Host an-worker1110.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:37] PROBLEM - Host an-worker1109.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:43] PROBLEM - Host clouddb1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:47] PROBLEM - Host an-worker1133.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:53] PROBLEM - Host wtp1041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:54] PROBLEM - Host analytics1075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:57] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:46:17] PROBLEM - Host dumpsdata1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:28] ^ looks like all C7 [15:46:37] PROBLEM - Host wtp1040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:37] PROBLEM - Host wtp1042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:40] just mgmt gladly :) [15:46:52] oh so it is [15:46:53] PROBLEM - Host kafka-main1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:01] PROBLEM - Host ms-be1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:02] in that case I'll go make myself another coffee [15:47:04] I 
mean mgmt interfaces [15:47:34] topranks: that's not a knock-on from the eqiad C work, is it? [15:47:51] objection, leading the witness [15:47:57] lol [15:48:00] fine, fine [15:48:02] ahahaha [15:48:03] PROBLEM - Host cp1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:07] PROBLEM - Host cp1086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:39] PROBLEM - Host elastic1052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:39] PROBLEM - Host elastic1051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:39] PROBLEM - Host lvs1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:55] * topranks hmm.... looking into mgmt [15:48:58] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Remove camus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/706541 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:48:59] PROBLEM - Host mc-gp1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:49:04] boxes are still up gladly :) [15:49:13] okay well I can't think of a not-leading-the-witness version of this joke that doesn't come off SUPER passive-aggressive, so kormat wins this round [15:49:41] PROBLEM - Host db2091.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:49:49] what a world... a dba winning something [15:49:50] rzl: 🎉 [15:50:04] effie: you puppet-merging? [15:50:34] ottomata: I am waiting for hnowlan_ because there is a patch of his too there [15:50:42] k i have one in now that should be a no-op [15:50:49] PROBLEM - Host dbprov1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:56] well, you are #3 in the queue now ottomata [15:50:58] haha [15:51:13] welp if you happen to merge it i would not mind! 
:) [15:51:13] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) [15:51:23] ge-0/0/24 up down Core: msw-c7-eqiad:50 {#1546} [15:51:30] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) 05Open→03Resolved Overall istio looks a lot like the other envoy-based ingress I evaluated, being just quite a bit more complicated because istio can do much, much more... [15:51:31] ottomata: ok I will merge it [15:51:37] another switch bites the dust XioNoX? [15:51:37] did someone ping DCops? [15:51:54] vgutierrez: looks like the mgmt switch of that rack [15:52:07] bad month for being a network switch at WMF [15:52:19] for sure [15:53:00] ottomata: go ahead [15:53:20] done ty [15:53:24] one day I'll get there before you Arzhel :) [15:54:16] topranks: step 1) sabotage his net conn. step 2) be first [15:55:37] RECOVERY - Host db2091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [15:55:55] haha, and I noticed because I clicked on the wrong IRC chan [15:58:37] 10ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (10ayounsi) p:05Triage→03High [15:59:53] topranks: imagine it went down 55min ago, would you be scratching your head very hard? [16:00:05] jbond42 and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1600). [16:00:31] (03CR) 10David Caro: "> Patch Set 3:" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [16:00:53] effie: thank you! apologies [16:00:53] XioNoX: panicking more like :) [16:00:59] no patches, I declare the puppet window closed [16:01:29] rzl: the puppetshow? 
[16:03:02] yesss [16:06:27] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:54] 10ops-codfw, 10DBA: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (10Marostegui) [16:08:12] 10ops-codfw, 10DBA: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (10Marostegui) p:05Triage→03Medium [16:10:11] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:27] 10ops-codfw, 10DBA: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (10Papaul) [16:10:31] 10ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (10wiki_willy) Reached out to John, who's heading over to the cage right now, to check it out. Thanks, Willy [16:10:39] (03CR) 10David Caro: "> Patch Set 3:" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [16:13:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 5%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16860 and previous config saved to /var/cache/conftool/dbconfig/20210722-161333-root.json [16:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:57] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:41] (03PS1) 10Cathal Mooney: Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065. 
[homer/public] - 10https://gerrit.wikimedia.org/r/706568 (https://phabricator.wikimedia.org/T284592) [16:20:05] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@fb4bc10]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) [16:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:25] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@fb4bc10]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) (duration: 00m 20s) [16:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:37] RECOVERY - Host an-worker1133.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:21:51] RECOVERY - Host dumpsdata1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [16:22:21] RECOVERY - Host kafka-main1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [16:24:50] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring [16:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:10] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring (duration: 00m 20s) [16:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:19] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2010.codfw.wmnet with reason: REIMAGE [16:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2010.codfw.wmnet with reason: REIMAGE [16:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 10%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16861 and previous config saved to /var/cache/conftool/dbconfig/20210722-162838-root.json [16:28:45] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:40] (03CR) 10Andrew Bogott: [C: 03+2] toolforge grid master: run disable_tool.py every 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [16:43:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 15%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16862 and previous config saved to /var/cache/conftool/dbconfig/20210722-164342-root.json [16:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:15] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/ops [16:49:35] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (10Legoktm) Currently lists.wm.o uses its own independent user database, so I think we'd need to take the LDAP uid, get the associated email address, and the... [16:52:04] (03PS1) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [16:53:01] (03CR) 10Ayounsi: [C: 03+1] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T2 [homer/public] - 10https://gerrit.wikimedia.org/r/706568 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:53:26] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Tested on gitlab-ansible-test, seems ok. Will deploy." 
[gitlab-ansible] - 10https://gerrit.wikimedia.org/r/706396 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [16:53:32] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [16:56:23] 10SRE, 10ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (10cmooney) @Jclark-ctr you can go ahead and use one of the devices from T259758, and mark that as swapped out. We're not going to apply any config to the new ones so it just slots into place and we're good to go. [16:56:26] !log gitlab1001: running ansible to deploy [[gerrit:706396]] (T275170) [16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:35] T275170: Define monitoring for gitlab - https://phabricator.wikimedia.org/T275170 [16:56:35] (03PS2) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [16:56:57] (03CR) 10Cathal Mooney: [C: 03+2] Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T2 [homer/public] - 10https://gerrit.wikimedia.org/r/706568 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:57:29] (03Merged) 10jenkins-bot: Adding flag for asw2-c-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually in T286065. 
[homer/public] - 10https://gerrit.wikimedia.org/r/706568 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 25%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16863 and previous config saved to /var/cache/conftool/dbconfig/20210722-165846-root.json [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1700). [17:01:30] (03CR) 10Andrew Bogott: [C: 03+2] toolforge cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) (owner: 10Andrew Bogott) [17:01:40] (03CR) 10Andrew Bogott: [C: 03+2] nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [17:02:23] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) All went very well with the change, this time I ran rapid ping from the CR to see if any packet loss was observed, and did detect some loss,... 
[17:02:39] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) 05Open→03Resolved [17:02:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [17:03:28] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@67a0db1]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:49] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@67a0db1]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) (duration: 00m 21s) [17:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:12] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring [17:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:23] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring (duration: 00m 12s) [17:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:02] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/ops [17:13:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 50%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16864 and previous config saved to /var/cache/conftool/dbconfig/20210722-171349-root.json [17:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:56] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@b173b4f]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) [17:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:16] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@b173b4f]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) (duration: 00m 20s) [17:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:54] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:06] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: Rollback maps2007 mirroring (duration: 00m 12s) [17:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:28] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:34] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 75%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16865 and previous config saved to /var/cache/conftool/dbconfig/20210722-172853-root.json [17:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:20] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@bbb7ba8]: 
Preparing maps2007 to mirror traffic to the Tegola service (no-op) [17:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:41] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@bbb7ba8]: Preparing maps2007 to mirror traffic to the Tegola service (no-op) (duration: 00m 20s) [17:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:04] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@b414857]: Mirror 10% of maps2007 traffic to the Tegola service [17:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:24] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@b414857]: Mirror 10% of maps2007 traffic to the Tegola service (duration: 00m 20s) [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (10MoritzMuehlenhoff) >>! In T286906#7230602, @Legoktm wrote: > Also to clarify, when we mean "logout", we actually mean "disable account" right? No, the c... [17:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2091 (re)pooling @ 100%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P16866 and previous config saved to /var/cache/conftool/dbconfig/20210722-174357-root.json [17:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning backport window deploy. Your patch may or may not be deployed at the sole discretion of the deployer. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:03:59] (03PS1) 10Btullis: Add a DNS dicovery service for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706641 (https://phabricator.wikimedia.org/T273642) [18:05:25] (03CR) 10jerkins-bot: [V: 04-1] Add a DNS dicovery service for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706641 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [18:05:33] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: config values do not need double quotes [puppet] - 10https://gerrit.wikimedia.org/r/706042 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [18:10:14] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [18:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:34] !log testing dc switchover warmup script in eqiad [18:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [18:13:14] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [18:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:15:26] legoktm: the alert is you, right? No need to worry about deploying now. [18:16:40] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:17:13] yes [18:18:02] great, thanks [18:18:02] (03PS1) 10Urbanecm: Add digital.ub.umu.se to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706649 (https://phabricator.wikimedia.org/T287204) [18:18:15] (03CR) 10Urbanecm: [C: 03+2] Add digital.ub.umu.se to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706649 (https://phabricator.wikimedia.org/T287204) (owner: 10Urbanecm) [18:19:01] (03Merged) 10jenkins-bot: Add digital.ub.umu.se to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706649 (https://phabricator.wikimedia.org/T287204) (owner: 10Urbanecm) [18:20:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f765832fa2bcfb9e43516e4962254854c3a3b39a: Add digital.ub.umu.se to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T287204) (duration: 00m 55s) [18:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:42] T287204: Add digital.ub.umu.se to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T287204 [18:22:38] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:46] doing another run now [18:23:32] (03PS1) 10Urbanecm: Enable the visual editor on the 2021 namespace on Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706655 (https://phabricator.wikimedia.org/T287197) [18:23:41] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) 
[18:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:11] (03PS3) 10Urbanecm: Enable the visual editor on the 2021 namespace on Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706002 (https://phabricator.wikimedia.org/T287197) (owner: 10Bodhisattwa) [18:24:38] (03Abandoned) 10Urbanecm: Enable the visual editor on the 2021 namespace on Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706655 (https://phabricator.wikimedia.org/T287197) (owner: 10Urbanecm) [18:24:53] (03CR) 10Urbanecm: [C: 03+2] Enable the visual editor on the 2021 namespace on Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706002 (https://phabricator.wikimedia.org/T287197) (owner: 10Bodhisattwa) [18:25:36] PROBLEM - DNS on kafka-main1003.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.131 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:25:43] (03Merged) 10jenkins-bot: Enable the visual editor on the 2021 namespace on Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706002 (https://phabricator.wikimedia.org/T287197) (owner: 10Bodhisattwa) [18:27:38] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:41] (03CR) 10Herron: "nice, looking forward to this! 
some thoughts inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [18:31:33] (03CR) 10Herron: "Would be good to see a PCC run when ready and one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [18:32:32] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6a909301d93045ad6752ded08fa5ed7c2972f855: Enable the visual editor on the 2021 namespace on Wikimania wiki (T287197) (duration: 00m 55s) [18:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:41] T287197: Enable the visual editor on the 2021 namespace on Wikimania wiki - https://phabricator.wikimedia.org/T287197 [18:35:55] (03PS1) 10Urbanecm: hewikisource: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706660 (https://phabricator.wikimedia.org/T286500) [18:35:59] (03PS1) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [18:37:07] (03Abandoned) 10Btullis: Add a DNS dicovery service for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706641 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [18:37:59] (03PS2) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [18:38:11] !log otto@deploy1002 Started deploy [analytics/refinery@1ef4fe1]: bin/gobbin wrapper now avoids launching if job is already running - T271232 [18:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:19] (03PS6) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) 
[18:38:19] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [18:38:35] (03PS9) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [18:40:31] (03PS1) 10Urbanecm: enwikisource: Create upload-shared user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706662 (https://phabricator.wikimedia.org/T285130) [18:40:41] (03PS2) 10Urbanecm: enwikisource: Create upload-shared user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706662 (https://phabricator.wikimedia.org/T285130) [18:40:44] (03CR) 10Urbanecm: [C: 03+2] enwikisource: Create upload-shared user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706662 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:41:15] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [18:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:30] !log otto@deploy1002 Finished deploy [analytics/refinery@1ef4fe1]: bin/gobbin wrapper now avoids launching if job is already running - T271232 (duration: 03m 18s) [18:41:34] (03Merged) 10jenkins-bot: enwikisource: Create upload-shared user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706662 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:16] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [18:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:32] PROBLEM - DNS on an-worker1133.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.159 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:44:24] (03PS1) 10Urbanecm: enwikisource: Fix upload-shared user group creation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/706670 (https://phabricator.wikimedia.org/T285130) [18:44:27] (03CR) 10Urbanecm: [C: 03+2] enwikisource: Fix upload-shared user group creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706670 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:45:18] (03Merged) 10jenkins-bot: enwikisource: Fix upload-shared user group creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706670 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:46:42] (03PS1) 10Urbanecm: enwikisource: Actually allow admins to remove upload-shared... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706671 (https://phabricator.wikimedia.org/T285130) [18:46:43] (03CR) 10Urbanecm: [C: 03+2] enwikisource: Actually allow admins to remove upload-shared... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706671 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:46:48] this took more commits than it's supposed to... [18:47:39] (03Merged) 10jenkins-bot: enwikisource: Actually allow admins to remove upload-shared... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/706671 (https://phabricator.wikimedia.org/T285130) (owner: 10Urbanecm) [18:48:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 599c2209c332fb0ebf3079bfb44558eb67ae5657: enwikisource: Create upload-shared user group (T285130) (duration: 00m 56s) [18:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:15] T285130: create "upload shared" group at English Wikisource - https://phabricator.wikimedia.org/T285130 [18:48:28] (03PS2) 10Urbanecm: hewikisource: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706660 (https://phabricator.wikimedia.org/T286500) [18:48:36] (03CR) 10Urbanecm: [C: 03+2] hewikisource: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706660 (https://phabricator.wikimedia.org/T286500) (owner: 10Urbanecm) [18:48:40] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:50] (03Merged) 10jenkins-bot: hewikisource: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706660 (https://phabricator.wikimedia.org/T286500) (owner: 10Urbanecm) [18:50:57] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 26c23dee57cc105c6ff98f4403618cfab536e089: hewikisource: Add namespace aliases (T286500) (duration: 00m 55s) [18:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:30] T286500: namespace aliases for he.wikisource - https://phabricator.wikimedia.org/T286500 [18:53:05] !log [urbanecm@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki=hewikisource --fix # T286500 [18:53:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:06] !log otto@deploy1002 Started deploy [analytics/refinery@3115f9e]: Set gobblin job.lock.dir after all - T271232 [18:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:14] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [18:56:47] !log Start server-side upload for 1 video file (T286665) [18:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:55] T286665: Server side upload for Sturm - https://phabricator.wikimedia.org/T286665 [18:58:15] !log Start server-side upload for 1 video file (T286489) [18:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:24] T286489: Server side upload for Gpkp - https://phabricator.wikimedia.org/T286489 [18:59:28] !log otto@deploy1002 Finished deploy [analytics/refinery@3115f9e]: Set gobblin job.lock.dir after all - T271232 (duration: 03m 22s) [18:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:52] (03CR) 10Dzahn: [C: 03+1] "thank you, looks good to me. also checked the regex for remaining "insetup". it matches the ones not checked on https://phabricator.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [19:00:05] hashar and dancy: That opportune time is upon us again. Time for a MediaWiki train - American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T1900). 
[19:00:07] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator disabled on new cluster maps hosts https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:07] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator disabled on new cluster maps hosts https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:54] !log Start server-side upload for 1 video file (T287061) [19:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:02] T287061: Server side upload for CalendulaAsteraceae - https://phabricator.wikimedia.org/T287061 [19:02:26] PROBLEM - DNS on dumpsdata1003.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.5.182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:03:09] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Jgreen) [19:04:23] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw142[1-2].eqiad.wmnet [19:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:26] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw142[1-2].eqiad.wmnet [19:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:48] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [19:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:35] !log mw1421, mw1422 - scap pull, re-pool as new API servers after reimaging, previously appservers [19:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: 
name=mw142[1-2].eqiad.wmnet [19:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:05] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:09:37] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [19:10:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:13:41] (03CR) 10Herron: [C: 03+1] logstash: add gitlab ECS transformations [puppet] - 10https://gerrit.wikimedia.org/r/705019 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [19:14:49] (03PS1) 10Ottomata: Finalize several EventLogging -> Event Platfom migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706689 (https://phabricator.wikimedia.org/T282855) [19:16:22] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [19:16:54] (03CR) 10Bstorm: tools prometheus: allow scraping by ip address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [19:17:10] (03PS2) 10Ottomata: Finalize several EventLogging -> Event Platfom migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706689 (https://phabricator.wikimedia.org/T282855) [19:20:42] (03PS2) 10Bstorm: tools prometheus: allow scraping by ip address [puppet] - 10https://gerrit.wikimedia.org/r/706047 [19:22:01] (03CR) 10Bstorm: tools prometheus: allow scraping by ip address (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [19:22:06] hashar: dancy o/ are you currently doing the mw train? can I deploy a no-op config change patch? [19:22:30] ottomata: I have pushed it to the rest of wikis around 13:00 UTC earlier today ;D [19:22:50] ok! proceeding with my patch then, thank you! 
[19:23:01] so yeah it is all open and we are not doing any train activity tonight (european biased) [19:23:08] \o/ [19:23:08] (03CR) 10Ottomata: [C: 03+2] Finalize several EventLogging -> Event Platfom migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706689 (https://phabricator.wikimedia.org/T282855) (owner: 10Ottomata) [19:24:05] (03Merged) 10jenkins-bot: Finalize several EventLogging -> Event Platfom migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706689 (https://phabricator.wikimedia.org/T282855) (owner: 10Ottomata) [19:26:12] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Finalize several EventLogging -> Event Platfom migrations - T282855 T238138 T282562 T271168 (duration: 00m 55s) [19:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:26] T282855: LandingPageImpression Event Platform Migration - https://phabricator.wikimedia.org/T282855 [19:26:27] T282562: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 [19:26:27] T271168: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 [19:26:28] T238138: VirtualPageView Event Platform Migration - https://phabricator.wikimedia.org/T238138 [19:26:43] (03PS1) 10Effie Mouzeli: Add k8s-experimental to the list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706701 (https://phabricator.wikimedia.org/T286491) [19:27:33] (03CR) 10Andrew Bogott: [C: 03+1] tools prometheus: allow scraping by ip address [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [19:29:14] (03CR) 10Bstorm: [C: 03+2] tools prometheus: allow scraping by ip address [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [19:29:14] RECOVERY - mediawiki-installation DSH group on mw1421 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:58:35] (03PS1) 10Ssingh: wikidough: add motd script 
to indicate logging of root commands [puppet] - 10https://gerrit.wikimedia.org/r/706722 [20:09:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:11:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:28] RECOVERY - Host ms-be1036.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 3.90 ms [20:23:30] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms [20:23:30] RECOVERY - Host ms-be1042.mgmt is UP: PING WARNING - Packet loss = 75%, RTA = 829.17 ms [20:23:31] PROBLEM - ps1-c7-eqiad-infeed-load-tower-A-phase-X on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:32] RECOVERY - Host ms-fe1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [20:23:42] RECOVERY - Host an-worker1091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [20:23:54] RECOVERY - Host backup1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.34 ms [20:23:54] RECOVERY - Host an-coord1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [20:23:58] RECOVERY - Host an-worker1109.mgmt is UP: PING OK - Packet loss = 0%, RTA = 16.86 ms [20:23:58] RECOVERY - Host an-worker1110.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.65 ms [20:24:03] RECOVERY - Host wtp1041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.44 ms [20:24:03] RECOVERY - Host analytics1075.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [20:24:03] RECOVERY - Host clouddb1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.31 ms [20:24:26] RECOVERY - 
ps1-c7-eqiad-infeed-load-tower-A-phase-X on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-A-phase-X 385 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:24:29] (03PS1) 10Ebernhardson: [aptrepo] Add migraphx and dependencies from rocm [puppet] - 10https://gerrit.wikimedia.org/r/706740 [20:24:31] (03PS1) 10Ebernhardson: [amd_rocm] Install migraphx to amd_rocm instances [puppet] - 10https://gerrit.wikimedia.org/r/706741 [20:25:16] RECOVERY - Host ms-be1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [20:25:16] RECOVERY - Host wtp1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [20:25:16] RECOVERY - Host wtp1042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [20:26:46] RECOVERY - Host cp1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.61 ms [20:26:50] RECOVERY - Host cp1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.49 ms [20:26:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Jclark-ctr) [20:27:00] RECOVERY - Host elastic1052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [20:27:00] RECOVERY - Host elastic1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [20:27:14] RECOVERY - Host lvs1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [20:27:14] RECOVERY - Host mc-gp1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [20:27:37] ^^ jclark-ctr just connected replacement switch for msw-c7-eqiad, all looks good at first glance. 
[20:29:00] RECOVERY - Host dbprov1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.54 ms [20:37:01] (03CR) 10Jbond: [C: 03+1] "lol" [puppet] - 10https://gerrit.wikimedia.org/r/706722 (owner: 10Ssingh) [20:37:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Jclark-ctr) completed rack c7 T287180 [20:41:26] (03CR) 10Legoktm: "Wouldn't it make more sense to ship this in the auditd module so any host with that enabled will also have the correct motd?" [puppet] - 10https://gerrit.wikimedia.org/r/706722 (owner: 10Ssingh) [20:43:42] df [20:45:19] legoktm: ^ short answer is we are making sure we don't extend this to other hosts without further discussion on the implications of doing this [20:45:28] and hence the Wikidough-specific ASCII art :) [20:45:38] (will reply on the ticket as well) [20:45:48] then why not make the auditd role wikidough specific? [20:45:55] s/role/class/ [20:45:59] or module* [20:46:24] 10SRE, 10ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (10cmooney) Recovery looks good, all Icinga alerts have cleared: ` cmooney@msw1-eqiad> show interfaces descriptions | match msw-c7 ge-0/0/24 up up Core: msw-c7-eqiad:50 {#1546} {master:0} cmooney@msw1-eqiad> show eth... [20:46:33] 10SRE, 10ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (10cmooney) 05Open→03Resolved [20:50:19] legoktm: I think the short answer is that while the auditd module is generic, the usage is Wikidough specific. the reason I keep it generic is so that in case someone outside of WMF also uses it, but I am curious what specific things you have in mind :) [20:50:40] shorter even :P [20:52:13] mostly I don't think it should be possible to enable auditd without the motd technically (assuming we care about having a motd!) 
[20:52:31] you could have the audit class take a motd => parameter so the motd could still be wikidough specific [20:55:00] my personal reasons for caring about a motd are that this is the first time we are doing this, so we should let the people know [20:55:11] (03Abandoned) 10Ebernhardson: [aptrepo] Add migraphx and dependencies from rocm [puppet] - 10https://gerrit.wikimedia.org/r/706740 (owner: 10Ebernhardson) [20:55:21] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10RLazarus) [20:55:23] (03Abandoned) 10Ebernhardson: [amd_rocm] Install migraphx to amd_rocm instances [puppet] - 10https://gerrit.wikimedia.org/r/706741 (owner: 10Ebernhardson) [20:56:30] I agree :) [20:57:26] (03PS1) 10RLazarus: admin: Create jabrams, add to restricted, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/706755 (https://phabricator.wikimedia.org/T286927) [20:57:52] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) [21:02:53] (03PS1) 10RLazarus: admin: Add toberto with no-ssh membership in analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/706756 (https://phabricator.wikimedia.org/T286746) [21:03:33] RECOVERY - DNS on dumpsdata1003.mgmt is OK: DNS OK: 0.028 seconds response time. 
dumpsdata1003.mgmt.eqiad.wmnet returns 10.65.5.182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:16:59] (03CR) 10Ssingh: [C: 03+2] admin: Add toberto with no-ssh membership in analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/706756 (https://phabricator.wikimedia.org/T286746) (owner: 10RLazarus) [21:17:28] (03CR) 10Ssingh: [C: 03+2] admin: Create jabrams, add to restricted, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/706755 (https://phabricator.wikimedia.org/T286927) (owner: 10RLazarus) [21:21:32] (03PS1) 10Andrew Bogott: disable_tool: run every 5 minutes rather than every 10 [puppet] - 10https://gerrit.wikimedia.org/r/706768 (https://phabricator.wikimedia.org/T170355) [21:21:34] (03PS1) 10Andrew Bogott: disable-tool: add a job to the sge-cron host that archives databases [puppet] - 10https://gerrit.wikimedia.org/r/706769 (https://phabricator.wikimedia.org/T285332) [21:22:04] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7229688, @tstarling wrote: > Things that are broken in safe mode should be put in a... 
[21:23:36] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: run every 5 minutes rather than every 10 [puppet] - 10https://gerrit.wikimedia.org/r/706768 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [21:23:46] (03CR) 10Andrew Bogott: [C: 03+2] disable-tool: add a job to the sge-cron host that archives databases [puppet] - 10https://gerrit.wikimedia.org/r/706769 (https://phabricator.wikimedia.org/T285332) (owner: 10Andrew Bogott) [21:24:57] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Skierpage) [21:25:00] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10Patch-For-Review: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) Added to wmf group: ` rzl@mwmaint2002:~$ ldapsearch -x cn=wmf | grep toberto member: uid=toberto,ou=people,dc=... [21:25:32] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Skierpage) I created T287212 while Legoktm was already fixing it 😄 [21:28:01] RECOVERY - DNS on kafka-main1003.mgmt is OK: DNS OK: 0.019 seconds response time. kafka-main1003.mgmt.eqiad.wmnet returns 10.65.3.131 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:28:01] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10Patch-For-Review: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) 05Open→03Resolved a:03RLazarus @toberto Give it 30 minutes for the Puppet change to roll out everywhere,... 
[21:30:12] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10RLazarus) ` rzl@krb1001:~$ sudo manage_principals.py create jabrams --email_address=jabram... [21:34:31] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10RLazarus) 05Open→03Resolved @JAbrams You should be all set! Give it 30 minutes for the... [21:44:57] RECOVERY - DNS on an-worker1133.mgmt is OK: DNS OK: 0.017 seconds response time. an-worker1133.mgmt.eqiad.wmnet returns 10.65.0.159 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:05:42] (03PS1) 10Jdlrobson: Make sure enable responsive mode UI reflects actual preference value [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/706003 (https://phabricator.wikimedia.org/T285402) [22:09:11] (03PS1) 10Bstorm: cloud dns: alias all attached floating_ips, not just server ones [puppet] - 10https://gerrit.wikimedia.org/r/706814 (https://phabricator.wikimedia.org/T287107) [22:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [22:23:27] (03CR) 10Bstorm: "My apologies for the black formatting addition. It doesn't seem to confuse the diff that much, though, since I changed so much of this." [puppet] - 10https://gerrit.wikimedia.org/r/706814 (https://phabricator.wikimedia.org/T287107) (owner: 10Bstorm) [22:33:42] (03CR) 10Bstorm: "I see this script isn't smart enough to also correctly label the instance if I do this." [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [22:37:53] (03CR) 10Andrew Bogott: "Assuming we can trust Neutron to give us a reliable list, this seems great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/706814 (https://phabricator.wikimedia.org/T287107) (owner: 10Bstorm) [22:39:04] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/706814 (https://phabricator.wikimedia.org/T287107) (owner: 10Bstorm) [22:50:42] (03PS1) 10Bstorm: tools prometheus: try directly using openstack [puppet] - 10https://gerrit.wikimedia.org/r/706821 [22:51:19] (03CR) 10jerkins-bot: [V: 04-1] tools prometheus: try directly using openstack [puppet] - 10https://gerrit.wikimedia.org/r/706821 (owner: 10Bstorm) [22:52:45] (03PS2) 10Bstorm: tools prometheus: try directly using openstack [puppet] - 10https://gerrit.wikimedia.org/r/706821 [22:56:52] (03CR) 10Bstorm: "PCC looks good. https://puppet-compiler.wmflabs.org/compiler1003/30302/tools-prometheus-03.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/706821 (owner: 10Bstorm) [22:57:01] (03CR) 10Bstorm: [C: 03+2] tools prometheus: try directly using openstack [puppet] - 10https://gerrit.wikimedia.org/r/706821 (owner: 10Bstorm) [23:00:05] brennen and thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210722T2300). [23:00:06] Jdlrobson: A patch you scheduled for US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:30] * thcipriani waves [23:01:01] Jdlrobson: around? [23:01:17] o/ [23:01:21] thcipriani: yup [23:01:29] and set up for testing [23:01:33] <3 [23:01:45] xSavitar_ will be deploying today :) [23:03:44] xSavitar_: nice! [23:04:00] thcipriani: while I have you... do you know anything about preference database migrations? 
[23:04:02] o/ [23:04:05] e.g. renaming a preference [23:05:43] (03CR) 10D3r1ck01: [C: 03+2] "Backport" [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/706003 (https://phabricator.wikimedia.org/T285402) (owner: 10Jdlrobson) [23:05:55] Jdlrobson: I do not know about migrations, usually you just schedule those with data persistence (is where I would start, unsure) [23:06:43] Okay no worries, it's pretty minor. I'm pretty sure the preference in question is seldom used. [23:09:33] (03CR) 10Bstorm: [C: 03+2] "Since I did my testing in my home dir directly on cloudservices1003, I think it should be safe to merge and revert if I'm wrong. Going ahe" [puppet] - 10https://gerrit.wikimedia.org/r/706814 (https://phabricator.wikimedia.org/T287107) (owner: 10Bstorm) [23:14:17] oh jenkins...why [23:14:23] o_o [23:15:15] rapidly clicking buttons on the zuul page to pass the time [23:16:03] I have noticed a minor bug where if you have the same patch in 2 different queues on the zuul page and you click one it will expand both since they both have the same id: magic! [23:16:29] * thcipriani has stared at the zuul page too long [23:17:03] (03CR) 10Neil P. Quinn-WMF: "I'm pretty sure we should be using the settings in patchset 3 (intended for client-submitted analytics streams) rather than the one in the" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [23:20:19] (03CR) 10Neil P. 
Quinn-WMF: [C: 04-1] "One more thing:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [23:25:07] (03Merged) 10jenkins-bot: Make sure enable responsive mode UI reflects actual preference value [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/706003 (https://phabricator.wikimedia.org/T285402) (owner: 10Jdlrobson) [23:25:30] \o/ [23:27:12] wahoo [23:27:18] ok I'm ready with my xdebug [23:27:41] Jdlrobson: getting the patch to mwdebug1002, and it's there now [23:27:57] Jdlrobson: go on :) [23:28:24] The Wikipedia database is temporarily in read-only mode for the following reason: [23:28:29] Is that a function of xdebug? [23:28:56] Can I test changes that touch the database? [23:29:26] oh, I think that's due to the debug machine we used [23:29:33] let's try one in codfw [23:29:38] we'll see :) [23:30:20] Jdlrobson: test on mwdebug2002 [23:30:29] sorry about the prev message. [23:32:15] it works! xSavitar_ [23:32:24] feel free to sync [23:32:35] okay Jdlrobson, thank you very much [23:32:42] patch going live now [23:35:00] !log derick@deploy1002 Synchronized php-1.37.0-wmf.15/includes/preferences/DefaultPreferencesFactory.php: Backport: [[gerrit:706003|Make sure enable responsive mode UI reflects actual preference value (T285402)]] (duration: 00m 56s) [23:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:10] T285402: "Enable responsive MonoBook design" should be a (renamed) core skin preference - https://phabricator.wikimedia.org/T285402 [23:35:19] Jdlrobson: it's live now [23:35:49] hurrah! [23:36:50] thanks xSavitar_ [23:36:59] \o/ [23:37:02] Jdlrobson: you're welcome! anytime! [23:37:06] was this your first deploy? [23:37:18] thanks brennen and thcipriani for the assistance. Really appreciate it :) [23:37:27] Jdlrobson: second deploy [23:37:30] anytime! [23:37:32] two more deploys than me lol [23:37:41] Jdlrobson: what!? 
[23:37:45] Jdlrobson: I won't be that proud, no way. [23:37:52] haha bone spurs or something... ;) [23:37:54] Jdlrobson: https://wikitech.wikimedia.org/wiki/Deployments/Training [23:38:06] ^ whenever you're ready we're around :) [23:38:45] Jdlrobson: this is all new to me, trying to make progress slowly all thanks to brennen and thcipriani. If you join the training, you'll like it for sure. [23:39:02] After each deploy, we try to drink water (esp me) to make sure heart rate is in check :D [23:39:18] hahaha [23:39:31] it gets easier every time I can promise that [23:39:46] * xSavitar_ agrees with ^^ [23:39:49] well. It gets easier every time most of the time :P [23:40:23] I like that second statement since "most of the time" !== "all the time" :D [23:40:52] honesty in advertising [23:41:08] * xSavitar_ goes to sleep now. thanks everyone, good night! [23:41:33] thanks xSavitar_ ! [23:57:59] (03CR) 10Bstorm: [C: 03+2] Drop kubeadm 1.17 remains [puppet] - 10https://gerrit.wikimedia.org/r/705969 (owner: 10Majavah) [23:59:08] :)