[00:43:24] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:44:10] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:43:42] PROBLEM - MD RAID on elastic1039 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:43:43] ACKNOWLEDGEMENT - MD RAID on elastic1039 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T285643 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:43:47] SRE, ops-eqiad: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (ops-monitoring-bot)
[02:45:46] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:32] PROBLEM - Device not healthy -SMART- on elastic1039 is CRITICAL: cluster=elasticsearch device=sda instance=elastic1039 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[03:35:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:58] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:41:46] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:26] RECOVERY - Device not healthy -SMART- on elastic1039 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[04:06:46] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:14] marostegui: re: https://phabricator.wikimedia.org/T284644 it would be great if it can be first deployed in testwiki and then wikishared DB.
[05:39:01] kart_: sure, I can do that, but not till next week (we have the DC switchover this week)
[05:40:36] (PS1) KartikMistry: Deploy ContentTranslation out of Beta feature in 9 WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/701702 (https://phabricator.wikimedia.org/T284641)
[05:47:58] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:15] !log remove BGP to AS13768 in AMS-IX
[06:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:55] SRE, ops-eqiad, Discovery: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (elukey)
[06:48:50] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:01:03] (CR) Ayounsi: "Overall lgtm, some comments." (5 comments) [homer/public] - https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: Cathal Mooney)
[07:16:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:31:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:33:20] (CR) Jelto: [C: +2] DHCP and site: add gitlab2001.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/701565 (https://phabricator.wikimedia.org/T285456) (owner: Jelto)
[07:36:17] (CR) Muehlenhoff: [C: +2] Remove obsolete Tor Puppet classes [puppet] - https://gerrit.wikimedia.org/r/701537 (owner: Muehlenhoff)
[07:36:58] jelto: I'll puppet-merge your gitlab2001 change along, ok?
[07:37:43] moritzm: yes thank you :)
[07:38:00] (PS6) Jcrespo: dbbackups: Migrate db1171:s2 to db1139, reimage as buster and set s7&s8 [puppet] - https://gerrit.wikimedia.org/r/700473 (https://phabricator.wikimedia.org/T280979)
[07:38:16] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:49] jelto: done :-)
[07:43:10] moritzm: thank you!
[07:45:03] (PS2) Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806)
[07:45:54] (CR) Ayounsi: "Thanks, I did one of the tests, but will need help for the other ones." (6 comments) [alerts] - https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: Ayounsi)
[07:46:17] !log hashar@deploy1002 Started deploy [integration/docroot@cf677eb]: integration: Change agents dashboard link from Nagf to Grafana
[07:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:25] !log hashar@deploy1002 Finished deploy [integration/docroot@cf677eb]: integration: Change agents dashboard link from Nagf to Grafana (duration: 00m 08s)
[07:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:51] (CR) jerkins-bot: [V: -1] Move RPKI alerts to Prometheus/AM [alerts] - https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: Ayounsi)
[07:51:45] (CR) Jcrespo: [C: +2] dbbackups: Migrate db1171:s2 to db1139, reimage as buster and set s7&s8 [puppet] - https://gerrit.wikimedia.org/r/700473 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[07:52:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[07:52:51] (CR) Giuseppe Lavagetto: [C: +1] systemd::timer::job: add parameter to define denendency using after [puppet] - https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431) (owner: Jelto)
[07:53:36] (CR) David Caro: [C: +2] "This should be good yep" [puppet] - https://gerrit.wikimedia.org/r/701581 (owner: Andrew Bogott)
[07:55:36] (PS3) Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806)
[07:55:44] That's an ACK ^
[07:57:30] !log repool wdqs1005
[07:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:33] !log jelto@cumin1001:~$ sudo cumin install* 'run-puppet-agent' # update DHCP entry for gitlab2001 on install[1003,2003,3001,4001,5001].wikimedia.org
[07:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:36] !log depool and restart blazegraph on wdqs1012
[07:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:48] (PS3) David Caro: cinderutils.ensure: don't add mountpoint if we exec [puppet] - https://gerrit.wikimedia.org/r/701559
[08:06:32] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701559 (owner: David Caro)
[08:07:22] (CR) Muehlenhoff: [C: +2] Enable profile::nginx for acmechief [puppet] - https://gerrit.wikimedia.org/r/698509 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:08:14] (CR) Filippo Giunchedi: [C: +1] "LGTM, though how this will work for decom'd hosts that stop generating logs altogether?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701576 (https://phabricator.wikimedia.org/T285371) (owner: Herron)
[08:09:36] (CR) Elukey: "There seem to be a trend of adding a lot of hiera values to the Analytics' namespaces since we are extending kerberos to a more general us" [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[08:09:59] Hello lovely people, I'm about to deploy several config changes. Nothing major.
[08:12:44] (CR) Ladsgroup: [C: +2] "Deploying" [mediawiki-config] - https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: Itamar Givon)
[08:12:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[08:13:30] (Merged) jenkins-bot: Set Wikidata's main sandbox item [mediawiki-config] - https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: Itamar Givon)
[08:14:34] (PS3) Jcrespo: dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979)
[08:14:50] (CR) Filippo Giunchedi: switchdc: add a few new excluded services. (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701484 (owner: Giuseppe Lavagetto)
[08:15:03] (PS2) Jcrespo: dbbackups: Remove s5 (stretch) from backup sources [puppet] - https://gerrit.wikimedia.org/r/700725 (https://phabricator.wikimedia.org/T283235)
[08:16:07] (CR) Jcrespo: [C: +2] dbbackups: Remove s5 (stretch) from backup sources [puppet] - https://gerrit.wikimedia.org/r/700725 (https://phabricator.wikimedia.org/T283235) (owner: Jcrespo)
[08:19:01] !log stop and remove db1145:s5 db2099:s5 T283235
[08:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:07] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[08:19:41] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:698518|Set Wikidata's main sandbox item (T219215)]], Part I (duration: 00m 57s)
[08:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:46] T219215: Don’t suggest edits to real items in API sandbox - https://phabricator.wikimedia.org/T219215
[08:21:36] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:698518|Set Wikidata's main sandbox item (T219215)]], Part II (duration: 00m 56s)
[08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:52] (CR) Giuseppe Lavagetto: [C: -1] Checkout portals for multiversion & webserver img (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/701001 (https://phabricator.wikimedia.org/T285325) (owner: Jeena Huneidi)
[08:21:54] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266) (owner: Herron)
[08:21:55] I'm getting 08:19:39 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1384.eqiad.wmnet returned [4]: NOT restarting php7.2-fpm: free opcache 455 MB
[08:22:02] only on mw1384
[08:22:22] (PS1) Kosta Harlan: Donor campaign: fix signup page styling [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701711 (https://phabricator.wikimedia.org/T284740)
[08:23:17] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: REIMAGE
[08:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:51] (PS3) Ladsgroup: Remove special configurations for Dagbani in Wikibase code [mediawiki-config] - https://gerrit.wikimedia.org/r/699540 (https://phabricator.wikimedia.org/T283168) (owner: Mbch331)
[08:24:55] (CR) Ladsgroup: [C: +2] Remove special configurations for Dagbani in Wikibase code [mediawiki-config] - https://gerrit.wikimedia.org/r/699540 (https://phabricator.wikimedia.org/T283168) (owner: Mbch331)
[08:25:25] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: REIMAGE
[08:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:54] (Merged) jenkins-bot: Remove special configurations for Dagbani in Wikibase code [mediawiki-config] - https://gerrit.wikimedia.org/r/699540 (https://phabricator.wikimedia.org/T283168) (owner: Mbch331)
[08:27:11] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:699540|Remove special configurations for Dagbani in Wikibase code (T283168)]] (duration: 00m 56s)
[08:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:16] T283168: remove special configurations for Dagbani in Wikibase code - https://phabricator.wikimedia.org/T283168
[08:27:29] (PS1) Kosta Harlan: GrowthExperiments: Set up campaign pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800)
[08:27:58] (PS4) Ladsgroup: Remove idGeneratorRateLimiting from production config [mediawiki-config] - https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: Dat Nguyen)
[08:28:02] (PS2) Kosta Harlan: GrowthExperiments: Set up campaign pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800)
[08:29:01] (CR) Ladsgroup: [C: +2] "Wikibase.php should be sync'ed first." [mediawiki-config] - https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: Dat Nguyen)
[08:29:59] (Merged) jenkins-bot: Remove idGeneratorRateLimiting from production config [mediawiki-config] - https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: Dat Nguyen)
[08:30:41] (CR) Muehlenhoff: [C: +2] Switch acmechief-test1001 to nginx-light [puppet] - https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:31:44] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:698751|Remove idGeneratorRateLimiting from production config (T274157)]], Part I (duration: 00m 58s)
[08:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:49] T274157: Remove WikibaseRepo idGeneratorRateLimiting option - https://phabricator.wikimedia.org/T274157
[08:32:40] (CR) David Caro: [C: +2] "The pcc results are as expected, there's still errors on the buster bastions (there's a patch up for those)." [puppet] - https://gerrit.wikimedia.org/r/701559 (owner: David Caro)
[08:33:22] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:698751|Remove idGeneratorRateLimiting from production config (T274157)]], Part II (duration: 00m 55s)
[08:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:49] (PS3) Kosta Harlan: GrowthExperiments: Set up campaign pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800)
[08:34:39] (PS4) Kosta Harlan: GrowthExperiments: Update campaign pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800)
[08:37:53] !Log phab1001 - removing 2fa for my own account
[08:40:24] !log drain kubestage2002 for docker restart(s)
[08:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:27] !log rolling upgrade of ATS on ulsfo - T285535
[08:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:37] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Update campaign pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800) (owner: Kosta Harlan)
[08:41:09] (CR) Kormat: "(adding self in case this becomes active at some point in the future)" [cookbooks] - https://gerrit.wikimedia.org/r/701475 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[08:41:20] (PS3) Muehlenhoff: Switch to nginx-light on all acmechief servers [puppet] - https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456)
[08:41:40] mutante it looks like your !Log about 2fa wasn't processed - maybe it's case sensitive for lowercase !log ?
[08:41:41] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:41:51] (PS4) Jcrespo: dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979)
[08:41:53] (PS1) Jcrespo: dbbackups: Migrate s2 backup taking on eqiad from db1171 to db1139 [puppet] - https://gerrit.wikimedia.org/r/701874 (https://phabricator.wikimedia.org/T280979)
[08:42:24] (PS2) Jcrespo: dbbackups: Migrate s2 backup taking on eqiad from db1171 to db1139 [puppet] - https://gerrit.wikimedia.org/r/701874 (https://phabricator.wikimedia.org/T280979)
[08:43:19] (PS1) Ladsgroup: Set idGeneratorInErrorPingLimiter to 9 for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/701875 (https://phabricator.wikimedia.org/T284538)
[08:44:27] SRE, GitLab, serviceops, vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Dzahn) @MoritzMuehlenhoff No worries, 100% with you here. The only reason to do it like that was that I didn't have access to look it up and was planning to use it as an example...
[08:45:28] (CR) Ladsgroup: [C: +2] "Syncing IS.php first" [mediawiki-config] - https://gerrit.wikimedia.org/r/701875 (https://phabricator.wikimedia.org/T284538) (owner: Ladsgroup)
[08:46:28] (Merged) jenkins-bot: Set idGeneratorInErrorPingLimiter to 9 for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/701875 (https://phabricator.wikimedia.org/T284538) (owner: Ladsgroup)
[08:48:24] !log phab1001 - removing 2fa for my own account
[08:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:28] DannyS712: thanks, yep
[08:49:31] the mw config patch looks okay on mwdebug, moving forward
[08:50:42] (PS1) David Caro: puppet.refresh_certs: don't fail if resources changed [software/spicerack] - https://gerrit.wikimedia.org/r/701876
[08:51:43] (PS4) Muehlenhoff: Switch to nginx-light on all acmechief servers [puppet] - https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456)
[08:51:44] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:701875|Set idGeneratorInErrorPingLimiter to 9 for Wikidata (T284538)]], Part I (duration: 00m 56s)
[08:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:49] T284538: effectively lower Item ID rate limit for bad Item creation requests - https://phabricator.wikimedia.org/T284538
[08:52:38] (CR) Jcrespo: [C: +2] dbbackups: Migrate s2 backup taking on eqiad from db1171 to db1139 [puppet] - https://gerrit.wikimedia.org/r/701874 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[08:53:31] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:701875|Set idGeneratorInErrorPingLimiter to 9 for Wikidata (T284538)]], Part II (duration: 00m 57s)
[08:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:11] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:54:38] (CR) Muehlenhoff: "Working fine on acmechief-test1001" [puppet] - https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:54:42] (CR) Jbond: [C: +2] "LGTM will merge thanks" [puppet] - https://gerrit.wikimedia.org/r/701632 (owner: Paladox)
[08:56:18] (CR) jerkins-bot: [V: -1] puppet.refresh_certs: don't fail if resources changed [software/spicerack] - https://gerrit.wikimedia.org/r/701876 (owner: David Caro)
[08:56:34] !log rolling upgrade of ATS on codfw - T285535
[08:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:05] I'm done
[08:58:23] ok, I’ll wait a bit for the dust to settle and then probably start deploying the series at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/701500
[08:59:59] (PS5) Jcrespo: dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979)
[09:01:43] (CR) Muehlenhoff: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[09:03:27] (PS3) Muehlenhoff: dumps::distribution::server: Switch to -full flavour [puppet] - https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454)
[09:06:22] SRE, DC-Ops, Infrastructure-Foundations, netops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (ayounsi) Added Singtel and part of Telxius. Reached out to Zayo, Telia, Leaseweb, Relined, CyrusOne. @wiki_willy could...
[09:11:06] (PS2) Lucas Werkmeister (WMDE): Stop setting Wikibase client changesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701500 (https://phabricator.wikimedia.org/T257260)
[09:11:19] (CR) Lucas Werkmeister (WMDE): [C: +2] Stop setting Wikibase client changesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701500 (https://phabricator.wikimedia.org/T257260) (owner: Lucas Werkmeister (WMDE))
[09:12:01] (Merged) jenkins-bot: Stop setting Wikibase client changesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701500 (https://phabricator.wikimedia.org/T257260) (owner: Lucas Werkmeister (WMDE))
[09:13:38] (CR) Muehlenhoff: [C: +2] wdqs: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[09:14:20] (PS2) David Caro: puppet.refresh_certs: don't fail if resources changed [software/spicerack] - https://gerrit.wikimedia.org/r/701876
[09:14:53] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:701500|Stop setting Wikibase client changesDatabase (T257260)]] (duration: 00m 55s)
[09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:58] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260
[09:15:00] switching to new hardware made my Phabricator 2fa fail.. just said "invalid", thought it must be the clock but wasn't. disabling and enabling again and I am back
[09:15:30] fpm restart still failing on mw1384, as noted by Amir1 above (IIRC)
[09:23:30] (PS4) Jelto: add job to weekly rebuild production-images [puppet] - https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431)
[09:27:38] !log rolling upgrade of ATS on eqsin - T285535
[09:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:20] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) @ssingh Should we keep this ticket open and make a new one "get a free IP in eqsin"?
[09:31:01] (CR) JMeybohm: [C: +2] scaffold: The metrics-config is only needed if statsd is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/700319 (owner: JMeybohm)
[09:33:20] (Merged) jenkins-bot: scaffold: The metrics-config is only needed if statsd is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/700319 (owner: JMeybohm)
[09:34:43] (CR) JMeybohm: [V: +1 C: +2] docker_registry_ha: Enable local nginx cache by default [puppet] - https://gerrit.wikimedia.org/r/696403 (https://phabricator.wikimedia.org/T256762) (owner: JMeybohm)
[09:36:47] nothing broke after that last Wikibase config change, proceeding with the next one
[09:37:07] (PS2) Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientChangesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701501 (https://phabricator.wikimedia.org/T257260)
[09:37:32] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove $wmgWikibaseClientChangesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701501 (https://phabricator.wikimedia.org/T257260) (owner: Lucas Werkmeister (WMDE))
[09:37:34] (CR) Filippo Giunchedi: "LGTM overall" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) (owner: Ema)
[09:38:15] (Merged) jenkins-bot: Remove $wmgWikibaseClientChangesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701501 (https://phabricator.wikimedia.org/T257260) (owner: Lucas Werkmeister (WMDE))
[09:38:31] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (ayounsi) @MoritzMuehlenhoff is it possible to decom bast5001 ? or at least move it to the private vlan? @ssingh otherwise you can use 103.102.166.5 so we don't block you
[09:39:32] !log lucaswerkmeister-wmde@deploy1002 sync-file aborted: Config: [[gerrit:701502|Stop setting Wikibase repo foreignRepositories (T257260)]] (1/2, prod) (duration: 00m 10s)
[09:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:37] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260
[09:39:54] !log ^ wrong gerrit change used for message, sorry
[09:39:54] (PS1) Muehlenhoff: mirrors: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456)
[09:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:25] (CR) jerkins-bot: [V: -1] mirrors: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[09:40:50] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh5002.wikimedia.org
[09:40:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh5002.wikimedia.org
[09:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:701501|Remove $wmgWikibaseClientChangesDatabase (T257260)]] (1/2, prod) (duration: 00m 57s)
[09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:16] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) @ayounsi I don't know how we could use that IP for a VM. When running the cookbook to create a VM it simply fails to get a free IP and there is no option or expectation...
[09:42:22] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (MoritzMuehlenhoff) I think we can simply decom bast5001 for now (it can still be reinstalled under a new name later), but let's hear if @BBlack has some objection.
[09:42:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:701501|Remove $wmgWikibaseClientChangesDatabase (T257260)]] (2/2, beta) (duration: 00m 56s)
[09:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:03] (PS2) Muehlenhoff: mirrors: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456)
[09:43:11] SRE, MW-on-K8s, serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (Joe)
[09:43:50] SRE, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) a:Joe→None
[09:45:36] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (ayounsi) >>! In T284246#7180178, @Dzahn wrote: > @ayounsi I don't know how we could use that IP for a VM. When running the cookbook to create a VM it simply fails to get a free...
[09:45:45] (CR) Joal: "I like the strategy, lets discuss the job names :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) (owner: Ottomata)
[09:46:11] SRE, MW-on-K8s, serviceops, User-jijiki: Enable TLS termination on the mwdebug deployment. fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (Joe) Open→Resolved Boldly resolving, I think this was done.
[09:46:17] SRE, MW-on-K8s, serviceops, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[09:46:46] (PS5) Jelto: profiles::docker::builder use after in production-images-weekly-rebuild job [puppet] - https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431)
[09:47:00] SRE, MW-on-K8s, serviceops, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) a:jijiki
[09:47:22] SRE, MW-on-K8s, serviceops, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) @jijiki I think this task is resolved as well, correct?
[09:47:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[09:48:24] (PS5) Dzahn: add chart for miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538)
[09:49:12] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) Ah, gotcha! Yes, thank you for moving this forward.
[09:49:16] (PS1) Phuedx: vector: Enable language switcher treatment A/B test on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/701881 (https://phabricator.wikimedia.org/T269093)
[09:49:53] (PS4) Hnowlan: maps: make maps2007 a buster replica of maps2009 [puppet] - https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582)
[09:50:39] (CR) Hnowlan: [C: +2] maps: make maps2007 a buster replica of maps2009 [puppet] - https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) (owner: Hnowlan)
[09:50:55] SRE, MW-on-K8s, serviceops, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (jijiki) Open→Resolved
[09:50:58] SRE, MW-on-K8s, serviceops, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[09:52:16] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) First tests with the staging version of the mwdebug deployment, and I get the following non-encouraging timings (in ms, approximated from multiple runs): | page | k8s staging | mwd...
[09:52:27] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30001/console" [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [09:52:58] !log rolling upgrade of ATS on esams - T285535 [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:08] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) @joe appservers are running onhost memcached, which can be a factor for this specific test: https://phabricator.wikimedia.org/T263958#6510350 [09:57:46] (03PS2) 10Phuedx: vector: Enable language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701881 (https://phabricator.wikimedia.org/T269093) [10:01:47] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30002/console" [puppet] - 10https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [10:01:54] (03CR) 10JMeybohm: [C: 03+1] profiles::docker::builder use after in production-images-weekly-rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [10:04:36] (03PS1) 10Muehlenhoff: Switch sodium to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/701883 (https://phabricator.wikimedia.org/T164456) [10:08:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701883 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:11:12] !log installing remaining libxml2 security updates [10:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:50] !log rolling upgrade of ATS on eqiad - T285535 [10:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:13:05] (03PS6) 10Dzahn: add chart for miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) [10:13:07] (03PS3) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [10:13:33] (03CR) 10jerkins-bot: [V: 04-1] add chart for miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:13:35] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:16:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/30003/sodium.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:17:50] (03CR) 10Dzahn: "noop on sodium" [puppet] - 10https://gerrit.wikimedia.org/r/701880 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:21:59] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/30004/sodium.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/701883 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:22:05] (03CR) 10Dzahn: [V: 03+1 C: 03+2] Switch sodium to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/701883 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:22:14] (03CR) 10Jbond: "done a first pass some comments and questions inline" (039 comments) [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [10:22:48] !log sodium (mirrors.wikimedia.org) - switching to nginx light variant T164456 [10:22:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:54] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [10:22:55] 10SRE, 10Infrastructure-Foundations, 10SRE-tools: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (10fgiunchedi) [10:23:43] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2007.codfw.wmnet with reason: REIMAGE [10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:48] !log sodium - restarted nginx [10:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:52] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10ayounsi) https://github.com/andrenth/drib seems like a tool made exactly for that purpose. [10:25:57] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2007.codfw.wmnet with reason: REIMAGE [10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:31] 10ops-eqiad: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10fgiunchedi) [10:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1030). [10:32:27] 10ops-eqiad: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10fgiunchedi) [10:33:01] (03CR) 10Dzahn: [C: 03+1] Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:36:18] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, rules make sense based on existing filter."
[homer/public] - 10https://gerrit.wikimedia.org/r/701085 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [10:36:54] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701884 (https://phabricator.wikimedia.org/T128546) [10:39:55] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701884 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:40:35] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701884 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:43:06] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:701884| Bumping portals to master (T128546)]] (duration: 00m 57s) [10:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:44:03] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:701884| Bumping portals to master (T128546)]] (duration: 00m 56s) [10:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:12] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/30006/registry1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:46:20] (03CR) 10Dzahn: [C: 03+1] systemd::timer::job: add parameter to define denendency using after [puppet] - 10https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [10:54:01] jouncebot: next [10:54:02] In 0 hour(s) and 5 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1100) [10:55:01] (03PS2) 10Urbanecm: 
Donor campaign: fix signup page styling [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701711 (https://phabricator.wikimedia.org/T284740) (owner: 10Kosta Harlan) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1100). [11:00:04] Thiemo_WMDE, kart_, kostajh, and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] hi [11:00:10] * kart_ is here [11:00:37] o/ [11:00:41] Thiemo_WMDE is here [11:01:11] i can deploy today [11:01:20] Great. [11:01:42] Thiemo_WMDE: I briefly looked at your change, and it looks generally sensible to me, but I’m worried that it hasn’t been merged or even reviewed on master yet :/ [11:02:20] (03PS1) 10Hnowlan: postgresql: ensure that pg_basebackup can access variables for resync [puppet] - 10https://gerrit.wikimedia.org/r/701888 [11:02:28] Thiemo_WMDE: per policy , only merged patches are eligible for backporting [11:02:43] Oh dear. 
[11:02:44] if Lucas_WMDE (or someone else) merges, happy to do it, but unfortunately not otherwise [11:03:02] (03CR) 10Urbanecm: [C: 03+2] Donor campaign: fix signup page styling [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701711 (https://phabricator.wikimedia.org/T284740) (owner: 10Kosta Harlan) [11:03:20] (03PS2) 10Urbanecm: Deploy ContentTranslation out of Beta feature in 9 WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701702 (https://phabricator.wikimedia.org/T284641) (owner: 10KartikMistry) [11:03:24] (03CR) 10Urbanecm: [C: 03+2] Deploy ContentTranslation out of Beta feature in 9 WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701702 (https://phabricator.wikimedia.org/T284641) (owner: 10KartikMistry) [11:03:31] I think I would be comfortable to deploy it, but only if I could test the fix on mwdebug myself (i.e. you’d have to tell me what the bug is and what it’s supposed to look like) [11:03:40] in the meantime let’s do the other deployments first [11:03:46] kostajh: is your config patch backport dependant? [11:04:02] urbanecm: it is not [11:04:12] great, i'll ping you when it's ready then [11:04:20] (03Merged) 10jenkins-bot: Deploy ContentTranslation out of Beta feature in 9 WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701702 (https://phabricator.wikimedia.org/T284641) (owner: 10KartikMistry) [11:04:50] I'm sitting here with tabs open where I can reproduce the bug on the actual enwiki and every other one running wmf.11. The patch will fix it. [11:05:02] kart_: your patch is at mwdebug1001, please test. [11:07:21] Thiemo_WMDE: note that "soon", next wmf version will start overriding wmf.11 anyway (so the merge, and likely backport to wmf.12, would have to happen very soon anyway). [11:07:30] o/ Sorry I'm late [11:07:43] hello phuedx :-). You're at the end, so you didn't miss anything ye [11:07:45] urbanecm: testing.. 
[11:07:59] (03CR) 10WMDE-Fisch: [C: 03+1] Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) (owner: 10Thiemo Kreuz (WMDE)) [11:08:24] https://gerrit.wikimedia.org/r/701644%20is%20not%20even%20meant%20to%20be%20merged%20to%20master.%20It's%20just%20a%20hotfix.%20The%20proper%20fix%20for%20master%20is%20https://gerrit.wikimedia.org/r/701423.%20Merging%20both%20is%20possible%20but%20unnecessary.%20We%20would%20effectively%20need%20to%20undo%20the%20hotfix%20anyway%20before%20we%20ca [11:08:24] n%20merge%20anything%20else. [11:08:32] ??? [11:08:51] * urbanecm is confused [11:08:51] uh? [11:09:06] Stupid Chrome. [11:09:06] the URL 404s, but...it's probably not meant to be an URL [11:09:08] https://gerrit.wikimedia.org/r/701644 is not even meant to be merged to master. It's just a hotfix. The proper fix for master is https://gerrit.wikimedia.org/r/701423. Merging both is possible but unnecessary. We would effectively need to undo the hotfix anyway before we can merge anything else. [11:09:41] well, Fisch just +2ed the hotfix on master… [11:09:53] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 607369224 and 0 seconds Hnowlan awaiting resync - not pooled. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:09:55] +1ed because of what I just explained. [11:10:21] +1ed on wmf.11 but +2ed on master [11:10:35] if it’s not supposed to be on master then -2 it before Zuul merges it… [11:10:36] Yeah sorry, we should have done this earlier. [11:10:39] to be *merged on master [11:10:49] Doesn't matter. I will undo it later on master. [11:11:00] I still don’t even know what the bug is, https://phabricator.wikimedia.org/T285571 has practically no info [11:11:02] It's not bad if it's on master. 
[11:11:08] tbh i'd be more comfortable deploying an unreviewed revert (as that brings production to a reviewed state) than an unreviewed hotfix (as that brings production to an unreviewed state) [11:11:21] urbanecm: good to go! [11:11:29] but since WMDE-Fisch did review it now, that resolves the concern, at least for me [11:11:43] thanks kart_, syncing [11:12:25] The bug is if you do crazy things with the template editor in VisualEditor, at some point it doesn't let you add parameters back that have been removed before. Not so easy to reproduce. [11:12:58] Lucas_WMDE: since you're a deployer and commented on this issue, given the review, OK to backport the hotfix from you? [11:13:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ade641b39bae8f2abb5d318299b033bfd8a7cb7a: Deploy ContentTranslation out of Beta feature in 9 WPs (T284641) (duration: 00m 56s) [11:13:04] kart_: live [11:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] T284641: Deploy Content Translation tool out of Beta feature in 9 Wikis - https://phabricator.wikimedia.org/T284641 [11:13:14] (03PS5) 10Urbanecm: GrowthExperiments: Update campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800) (owner: 10Kosta Harlan) [11:13:18] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Update campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800) (owner: 10Kosta Harlan) [11:13:19] tbh I’m now so confused that I don’t want to deploy it myself [11:13:26] regardless of what code review happened on master [11:14:19] and hearing that it’s difficult to reproduce hasn’t made me happier either :/ [11:15:01] (03Merged) 10jenkins-bot: GrowthExperiments: Update campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701873 (https://phabricator.wikimedia.org/T284800) (owner: 10Kosta Harlan) [11:15:14] urbanecm: 
Awesome. Thanks a lot! [11:15:33] should've looked to scap logs earlier, but... [11:15:50] it says ` /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1384.eqiad.wmnet returned [4]: NOT restarting php7.2-fpm: free opcache 437 MB` as a warning [11:16:04] i never saw this kind of message, any idea what it means? [11:16:11] Amir1 and I got the same error earlier today [11:16:19] or, the same *message, I should say [11:16:21] idk if it’s an error [11:16:34] it says `11:13:00 1 hosts had failures restarting php-fpm` right below it [11:16:58] it's not in red font though, which scap does for critical stuff [11:17:33] Lucas_WMDE: do you know what it means though? [11:17:40] no :/ [11:17:48] urbanecm: let me see [11:18:01] !log lucaswerkmeister-wmde@mw1384:~$ scap pull # did not print any errors [11:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:27] i'm waiting with more deployments until effie or someone can clarify the meaning. [11:18:27] I think that mw1384 is a host that was intentionally left depooled in a poor state [11:18:28] tried a scap pull, it said “checking if php-fpm restart needed” but nothing after that [11:18:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from buster master maps1009 [11:18:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from buster master maps1009 [11:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:04] it does say something about “cannot delete non-empty directory” for wmf.1, which I had on mwdebug a few days ago [11:19:05] 1384 is depooled (for some reason) yea [11:19:32] so whatever errors scap says about it, not a reason to worry, is that right? 
[11:19:35] I read it on the look, but that shouldn't be an issue anyway [11:19:36] no [11:19:48] nothing to worry about for sure [11:20:01] okay, continuing then. Thanks effie (and all) [11:20:16] we are around anyway [11:20:37] !log lucaswerkmeister-wmde@mw1384:~$ sudo -u mwdeploy sh -c 'rm /srv/mediawiki/php-1.37.0-wmf.1/cache/l10n/l10n_cache-*.cdb && rmdir /srv/mediawiki/php-1.37.0-wmf.1/cache/l10n && rmdir /srv/mediawiki/php-1.37.0-wmf.1/cache && rmdir /srv/mediawiki/php-1.37.0-wmf.1' # per comments in T157030 and similar tasks [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:41] T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030 [11:20:47] I cleaned up those old l10n files in wmf.1, idk if it’s related [11:20:50] I'm asking about messages I don't recognize as a matter of habit 🙂 [11:20:59] kostajh: your patch is at mwdebug1001, please have a look [11:21:01] (the config one) [11:21:02] but I’m pretty sure they weren’t needed anymore [11:21:04] urbanecm: thx, looking [11:21:09] (03PS6) 10Jcrespo: dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - 10https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979) [11:21:11] (03PS1) 10Jcrespo: dbbackups: Temporarily change backup schedules to fit better dc switch [puppet] - 10https://gerrit.wikimedia.org/r/701891 (https://phabricator.wikimedia.org/T284897) [11:21:23] (03PS2) 10Jcrespo: dbbackups: Temporarily change backup schedules to fit better dc switch [puppet] - 10https://gerrit.wikimedia.org/r/701891 (https://phabricator.wikimedia.org/T284897) [11:22:38] urbanecm: lgtm [11:22:43] (03Merged) 10jenkins-bot: Donor campaign: fix signup page styling [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701711 (https://phabricator.wikimedia.org/T284740) (owner: 10Kosta Harlan) [11:22:45] thanks, syncing
[11:22:55] PROBLEM - pg_up reduced availability on alert1001 is CRITICAL: 0.7778 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:07] just in time, jenkins 🙂 [11:23:44] (03PS1) 10Jbond: C:puppetmaster::puppetdb::database: move local account to hiera [puppet] - 10https://gerrit.wikimedia.org/r/701892 (https://phabricator.wikimedia.org/T285666) [11:23:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9495d18521c3639cbd783a6341fe5348e31b0103: GrowthExperiments: Update campaign pattern (T284800) (duration: 00m 56s) [11:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:00] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [11:24:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30007/console" [puppet] - 10https://gerrit.wikimedia.org/r/701892 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [11:24:50] kostajh: your backport is at mwdebug1001 too, please test [11:25:08] urbanecm: looking [11:25:38] "Confused". Not sure what to do with that. What's the point of having the owner of a bugfix around so they can confirm a bugfix did exactly what it's supposed to do if we stop believing anything these devs say the moment the issue is a bit more confusing? Fixing it to not be confusing any more is the point of this fix. [11:27:08] urbanecm: looks good [11:27:13] thanks kostajh, syncing [11:28:12] Thiemo_WMDE: okay, but should I believe the dev who said the fix is not meant to be merged on master, or the other dev who merged it on master anyways? do you see how that could be confusing? 
[11:28:35] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/modules/signup/campaign.less: cd16aa2b51fb74e628c4ad26ac6b469bc04ab370: Donor campaign: fix signup page styling (T284740) (duration: 00m 56s) [11:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:40] T284740: Donors to newcomers: design an enhanced account creation landing page - https://phabricator.wikimedia.org/T284740 [11:28:58] kostajh: synced [11:29:04] urbanecm: thanks! [11:29:10] (03PS3) 10Urbanecm: vector: Enable language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701881 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [11:29:10] We can +2 it on master just to prove a point, but we need to revert that anyway before we can apply a better fix. [11:29:14] (03CR) 10Urbanecm: [C: 03+2] vector: Enable language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701881 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [11:29:27] I can confirm the issue and the fix working. The fix should not need to be merged on master because continued refactoring will be done there that will avoid the issue. [11:29:49] But I merged it to confirm the point it's working for the issue. [11:29:57] (03Merged) 10jenkins-bot: vector: Enable language switcher treatment A/B test on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701881 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [11:30:04] And does not break anything. [11:30:14] phuedx: your patch is at mwdebug1001, please test. [11:30:22] urbanecm: Testing now [11:30:42] Also it was this chat here requesting it to be merged on master first. And now that we did that it causes confusion??? [11:30:57] urbanecm: LGTM.
Thanks [11:31:45] (03CR) 10Zabe: [C: 04-1] "there are mistakes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [11:32:14] 10SRE, 10Traffic, 10Security: Backport ATS 8.1.2 security fixes to our in-house ATS 8.0.8 - https://phabricator.wikimedia.org/T285535 (10ema) [11:32:22] 10SRE, 10Traffic, 10Security: Backport ATS 8.1.2 security fixes to our in-house ATS 8.0.8 - https://phabricator.wikimedia.org/T285535 (10ema) [11:32:54] Thiemo_WMDE: please let's move discussion about this patch elsewhere (ideally, to the task). The inconsistent information about the patch definitely caused confusion and uncertainty about "this being the correct approach", and it also caused some questions. Please reschedule the patch to a different window. Thanks for your understanding. [11:33:04] thanks phuedx, syncing [11:33:20] OMG, seriously? [11:34:01] Nobody is "discussing" this patch. [11:34:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e4a088fbcf90a8c232c616ba5df9ad01cb5449e8: vector: Enable language switcher treatment A/B test on fawiki (T269093) (duration: 00m 55s) [11:34:20] It's a hotfix for a very specific issue, confirmed by multiple people. [11:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:22] T269093: Deploy new language switching location to test wikis and begin A/B test pt 1 - https://phabricator.wikimedia.org/T269093 [11:34:51] It's best practice to merge on master first and it makes total sense. - I think the only confusing part here was the miscommunication around the fact, "That the hotfix should not be needed on master, because it will be fixed by future changes there anyways." [11:35:09] It doesn't matter if it's the "best possible" fix or what. Since when is this a requirement for a backport? [11:35:30] Thiemo_WMDE: yes, seriously.
The policy is clear that the backport team (in this case, represented by me) MUST be comfortable with the patch going out. I'm not comfortable with deploying this patch, and I ask you to ask to reschedule it. [11:35:39] the iron rule for backports is that the backport deployers have to be comfortable with the deployment [11:35:43] arturo: I'm going to push https://gerrit.wikimedia.org/r/c/operations/homer/public/+/701085 shortly [11:35:54] I am. What's the issue? [11:36:17] Thiemo_WMDE: with all due respect, you're not a backport team member. I and Lucas_WMDE are. [11:36:33] I misunderstood the sentence, sorry. [11:37:10] Adam Wight is a deployer, according to https://wmde-access.toolforge.org/ – they would probably be a good person to deploy this (having authored the original change that caused the bug, if I understand it correctly) [11:37:20] The thing is: I tried to give you as much information as I could, and exactly that confused you??? [11:37:37] AdamW is not here for multiple weeks. [11:38:27] !log push "Port cloud-in4 to Capirca" to cr1/2-codfw [11:38:29] What if I had not done that, not said anything, just waiting for possible questions Would it be merged then? [11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:51] Thiemo_WMDE: continuing this discussion here is not productive [11:39:01] Thiemo_WMDE: Martin's not comfortable deploying the patch as is. I suggest you leave it for the next window, and in the meantime could add more documentation to the task, patch commit message, etc. [11:39:37] So this is not about information but about feelings??? 
[11:40:08] (03PS2) 10Amire80: [WIP] Update autonyms for kea, ota, sjd in wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) [11:40:24] The triple question marks are not helping [11:40:44] !log push "Port cloud-in4 to Capirca" to cr1/2-eqiad [11:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:35] (03PS3) 10Amire80: Update autonyms for kea, ota, sjd in wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870) [11:42:05] is anyone still deploying? otherwise I’d continue with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/701502/1 before the window closes [11:42:19] Lucas_WMDE: I'm done [11:42:24] ok thanks [11:42:34] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting Wikibase repo foreignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701502 (https://phabricator.wikimedia.org/T257260) [11:42:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting Wikibase repo foreignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701502 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:43:05] (03CR) 10Ayounsi: [C: 03+2] Port cloud-in4 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701085 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:43:25] (03Merged) 10jenkins-bot: Stop setting Wikibase repo foreignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701502 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:43:42] testing on mwdebug1001… [11:43:56] (03Merged) 10jenkins-bot: Port cloud-in4 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701085 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:44:29] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Temporarily change backup schedules to fit better dc switch 
[puppet] - 10https://gerrit.wikimedia.org/r/701891 (https://phabricator.wikimedia.org/T284897) (owner: 10Jcrespo) [11:49:23] everything still working on Commons afaict, syncing [11:50:46] (03PS1) 10Jcrespo: Revert "dbbackups: Temporarily change backup schedules to fit better dc switch" [puppet] - 10https://gerrit.wikimedia.org/r/701717 [11:50:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:701502|Stop setting Wikibase repo foreignRepositories (T257260)]] (duration: 00m 55s) [11:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:55] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [11:51:30] (mw1384 still logging warnings, for the record) [11:51:42] (03CR) 10Jcrespo: [C: 04-1] "Waiting until UTC 19h to revert." [puppet] - 10https://gerrit.wikimedia.org/r/701717 (owner: 10Jcrespo) [11:53:05] by the way, is the train happening this week? the server switch announcements claim it’s not running, but there are still train windows in the deployment calendar, and the train blockers task hasn’t been closed either [11:54:28] ah, wait, the wikitech-l email said the train *will* run (https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/MR4YNIEIWVJG2LCCTSMBMEASUZUNNPZE/) – so I guess we just reused some previous copy for the massmessage without updating it [11:54:39] Lucas_WMDE: yes, train is going to run. 
[11:54:44] ok [11:54:45] (just as other deployments) [12:00:16] !log EU backport+config window done [12:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:39] (I’ll continue with more Wikibase config cleanups later but since it’s the end of the hour I figured I should log that ^^) [12:01:12] (03PS1) 10Ladsgroup: media: Handle lack of 'metadata' key from getSizeAndMetadata gracefully [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701718 (https://phabricator.wikimedia.org/T285490) [12:01:37] (03CR) 10Ladsgroup: [C: 03+2] media: Handle lack of 'metadata' key from getSizeAndMetadata gracefully [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701718 (https://phabricator.wikimedia.org/T285490) (owner: 10Ladsgroup) [12:01:46] quickly deploying this [12:03:01] (03CR) 10Thiemo Kreuz (WMDE): Hotfix for broken "Extract show all to placeholder class" (034 comments) [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) (owner: 10Thiemo Kreuz (WMDE)) [12:11:31] (03PS1) 10Filippo Giunchedi: grafana: select swift cluster in dashboard [puppet] - 10https://gerrit.wikimedia.org/r/701895 [12:12:13] (03PS7) 10Dzahn: add chart for miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) [12:13:53] (03PS2) 10Filippo Giunchedi: grafana: select swift cluster in dashboard [puppet] - 10https://gerrit.wikimedia.org/r/701895 [12:14:54] (03PS4) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [12:15:36] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:15:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] 
systemd::timer::job: add parameter to define denendency using after [puppet] - 10https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [12:16:07] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: select swift cluster in dashboard [puppet] - 10https://gerrit.wikimedia.org/r/701895 (owner: 10Filippo Giunchedi) [12:16:50] (03CR) 10Jelto: [V: 03+1 C: 03+2] profiles::docker::builder use after in production-images-weekly-rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [12:19:59] (03Merged) 10jenkins-bot: media: Handle lack of 'metadata' key from getSizeAndMetadata gracefully [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701718 (https://phabricator.wikimedia.org/T285490) (owner: 10Ladsgroup) [12:20:04] (03PS5) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [12:20:33] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:24:29] !log repool wdqs1012 [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:37] PROBLEM - cassandra service on maps2007 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:37] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:03] PROBLEM - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:29:57] 
!log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.11/includes/media/MediaHandler.php: Backport: [[gerrit:701718|media: Handle lack of 'metadata' key from getSizeAndMetadata gracefully (T285490)]] (duration: 00m 56s) [12:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:05] T285490: InvalidArgumentException: Media handler BmpHandler returned NULL for metadata, should be array - https://phabricator.wikimedia.org/T285490 [12:33:31] (03PS6) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [12:34:03] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:36:24] (03PS1) 10Kosta Harlan: Make it possible to force opt-in/opt-out to Growth features during account creation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701719 (https://phabricator.wikimedia.org/T284119) [12:37:15] 10SRE, 10docker-pkg, 10serviceops, 10Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Jelto) a:03Jelto [12:41:52] (03PS2) 10Dzahn: site/conftool/DHCP: remove old eqiad appservers in A5, mostly canaries [puppet] - 10https://gerrit.wikimedia.org/r/679527 (https://phabricator.wikimedia.org/T280203) [12:43:16] (03PS2) 10Jbond: C:puppetmaster::puppetdb::database: move local account to hiera [puppet] - 10https://gerrit.wikimedia.org/r/701892 (https://phabricator.wikimedia.org/T285666) [12:44:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30008/console" [puppet] - 10https://gerrit.wikimedia.org/r/701892 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [12:50:31] 10SRE, 
10docker-pkg, 10serviceops, 10Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Jelto) I merged and rolled out the [additional parameter for intra-service dependencies in systemd::timer::job](https://gerrit.wikimedia.org/r/701525) and [the chang... [12:52:14] (03CR) 10Andrew Bogott: cinderutils.ensure: don't add mountpoint if we exec (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701559 (owner: 10David Caro) [12:59:32] !log installing intel-microcode security updates on buster [12:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:11] doing another no-op Wikibase config cleanup since there’s nothing on the deployment calendar right now [13:03:20] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseRepoForeignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701503 (https://phabricator.wikimedia.org/T257260) [13:03:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseRepoForeignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [13:04:19] (03Merged) 10jenkins-bot: Remove $wmgWikibaseRepoForeignRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [13:05:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:06:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:701503|Remove 
$wmgWikibaseRepoForeignRepositories (T257260)]] (1/2, prod) (duration: 00m 57s) [13:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:16] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [13:07:18] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:701503|Remove $wmgWikibaseRepoForeignRepositories (T257260)]] (2/2, beta) (duration: 00m 57s) [13:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] (03PS2) 10MSantos: maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 [13:08:21] (03CR) 10MSantos: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [13:09:43] alright, I’m done for now [13:11:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:14:33] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:27] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 912 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:17:29] RECOVERY - pg_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:18:23] RECOVERY - cassandra service on maps2007 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:36:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:38:24] ^ λοοκινγ [13:38:41] ^looking [13:40:53] there are some appservers low on idle workers, but not that much [13:41:05] and they don't seem to fail health checks for the moment [13:41:51] https://grafana.wikimedia.org/goto/1VQj7ck7z [13:42:50] I was looking at the traffic pattern a bit [13:42:51] apcu frag in one of them is around 50%, not that high [13:43:50] effie: let me know if you spot something, didn't find much so far :( [13:46:12] yeah looking [13:48:13] ah yes I started seeing mw hosts failing 
healt checks on lvs1016's pybal [13:48:17] *health [13:48:31] same ones with 0 idle workers that makes sense [13:51:27] ok I could work something to restart the servers with 0 idle workers [13:51:59] but we need to take a look at fridays/thursday's deployments and see if something is causing this [13:52:26] it is only app servers, and I see really long POST request times, although this could be a side effect [13:52:40] it happened also during the weekend, all the appservers were restarted [13:52:55] not sure if it is the exact same but so far it seems os [13:52:56] *so [13:53:16] every time a group of 4/5 appservers caused the latency regressions, restart fixed them [13:53:38] https://phabricator.wikimedia.org/T285634 [13:53:44] yeah I skimmed the backlog and task this morning [13:53:57] the apcu's fragmentation was higher though [13:54:09] I dug a bit but didn't find anything that could be the cause, and even apcu itself getting to 90% wasnt causing problems before [13:54:33] the memory allocation could be a bug in lib.php but I can look at it later [13:55:27] I did setup the switchdc tmux session on cumin1001. Please join with `sudo -i tmux attach -rt switchdc` [13:55:59] legoktm: what about the open o11y change (exclude thanos stuff)? [13:56:05] <_joe_> use decent sized terminals [13:56:18] <_joe_> jayme: I'd do that by hand tbh, at this point [13:56:34] <_joe_> elukey: do you think we should halt the switchover? 
[13:56:34] either by hand or we can skip it for this switchover [13:56:59] <_joe_> switching the services over to codfw can cause higher latency [13:58:04] _joe_ it may be the safest thing, but I don't have a solid understanding about what is causing this [13:58:09] RECOVERY - cassandra CQL 10.192.32.46:9042 on maps2007 is OK: TCP OK - 0.033 second response time on 10.192.32.46 port 9042 https://phabricator.wikimedia.org/T93886 [13:58:32] I thought that it was a one-off issue solvable with a complete roll restart of appservers, but it looks like it is returning [13:58:49] <_joe_> elukey: no I mean the *immediate* issue [13:59:12] _joe_: since this is for the other services [13:59:19] and we are going to restart a few servers [13:59:30] I do not think we should pause the switchover [13:59:43] <_joe_> yeah let's first solve the problem by restarting those few servers though [13:59:52] <_joe_> we have a 1-hour window for the switch [14:00:01] ok I will do a full restart then [14:00:04] appservers only [14:00:04] legoktm and jayme: Your horoscope predicts another unfortunate Datacenter Switchover: Services deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1400). [14:00:06] <_joe_> jayme and legoktm can help, I have a meeting now [14:00:14] <_joe_> why a full restart though? is it needed? 
[14:00:17] effie: let's not roll restart them all [14:00:21] jouncebot: hey, a little optimism please :( [14:00:28] only the appservers [14:00:35] let's start with the busy ones [14:00:40] <_joe_> ^^ [14:00:52] https://grafana.wikimedia.org/goto/1VQj7ck7z [14:00:56] effie: --^ [14:01:02] the reason I am calling for a full one, is not to have other servers get busy [14:01:05] mid-switch [14:01:23] but at the same time we'll not know if it is again a widespread issue or not, this is why I am saying it [14:01:45] if we restart all the appservers now, we will know that for the next hour, of the switchover [14:01:45] we won't have this on our minds [14:02:04] <_joe_> I disagree [14:02:06] if we didn't have the switchover, I would go for restarting the busy ones [14:02:16] _joe_: why? [14:02:32] we having to monitor for the next busy server [14:02:39] while we are doing this ? [14:02:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetmaster::puppetdb::database: move local account to hiera [puppet] - 10https://gerrit.wikimedia.org/r/701892 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [14:02:43] <_joe_> effie: because it's not really needed, and the switchover lasts 5 minutes [14:03:14] also if it happens again we can isolate one appserver and investigate [14:03:20] restarting the others [14:03:40] (03PS2) 10Hnowlan: postgresql: ensure that pg_basebackup can access variables for resync [puppet] - 10https://gerrit.wikimedia.org/r/701888 [14:04:14] I will restart the busy ones for now, and we will see how it goes [14:07:26] ack. Let us know when you're done [14:07:42] here to help, if anything needs another pair of hands [14:07:43] !log restarting busy php-fpm app servers [14:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:55] one note - during the weekend I left mw1384 aside for further debugging (depooled = no) [14:07:56] legoktm: should we do thanos before or after "all services"? 
[14:08:03] running [14:08:14] after I think [14:08:15] it may be useful to do the same with one of the current busy ones, otherwise we cannot investigate [14:08:26] how difficult will it be to do manually? [14:08:41] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10Legoktm) > and also `filerepo_file_foreign_description` This is something that @ladsgroup and @... [14:09:02] AIUI it's two confctl commands as they are active-active [14:09:48] so `confctl --object-type discovery select 'dnsdisc=thanos-query,name=eqiad' set/pooled=false` [14:10:42] + equivalent for thanos-swift ofc [14:10:54] yep, looks right and https://wikitech.wikimedia.org/wiki/Thanos#Pool_/_depool_a_site agrees [14:10:57] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:31] marostegui: o/ could I ask you to update the channel topic? I'll be on clinic duty this week [14:11:47] sure [14:11:57] jayme: or do both together if you like, select 'dnsdisc=thanos-.*,name=eqiad' [14:12:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:12:39] rzl: thanks. Looks like it's still two commands then as we should restart thanos-sidecar :) [14:12:41] jayme legoktm [14:12:42] go [14:13:09] latency is back to normal [14:13:15] can we wait 10 minutes ? [14:13:20] marostegui: ty! 
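(Annotation on the confctl exchange above.) The manual Thanos depool discussed at 14:09 to 14:11 amounts to one `confctl` call per discovery record. A minimal sketch of how those two command lines could be generated, assuming only the command shape quoted in the log; the helper name `depool_commands` is hypothetical, and the script only prints strings, it does not run confctl:

```python
def depool_commands(services, site):
    """Build one confctl command line per discovery record (strings only)."""
    commands = []
    for service in services:
        # Selector shape copied from the command quoted at 14:09:48.
        selector = f"dnsdisc={service},name={site}"
        commands.append(
            f"confctl --object-type discovery select '{selector}' set/pooled=false"
        )
    return commands


if __name__ == "__main__":
    # thanos-query and thanos-swift are the two records named in the log.
    for command in depool_commands(["thanos-query", "thanos-swift"], "eqiad"):
        print(command)
```

As noted at 14:11:57, a single selector such as 'dnsdisc=thanos-.*,name=eqiad' would cover both records in one call; the loop form just keeps each record explicit.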
[14:13:23] IIUC we have a window of 1 hour [14:13:25] sure [14:13:38] the graphs just settled, let's wait a couple of datapoints [14:13:51] jayme: yeah, not sure how to tell if "might be needed" is relevant but we can find out :D [14:14:37] eheh, yeah. Hopefully "might be needed" == "does not hurt" [14:14:47] I fear that we will have other php processes running out of free workers [14:14:54] I'm here too if needed re: thanos, I think we're fine even not restarting the sidecar, it'll eventually do the right thing and reconnect [14:15:04] if we assume that this is going to happen again for sure [14:15:07] godog: oh perfect, was about to ask you [14:15:35] rzl: o/ [14:17:06] so yeah I'm +1 on not restarting the sidecar, I'll fix the documentation to be less vague [14:20:25] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/699254 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:20:29] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ecs migration config for sampled webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/699254 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:20:34] effie: can you log the list of the ones that you restarted? 
(or the cumin command, so we keep track) [14:20:48] sure [14:20:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/695563 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:21:31] !log restarted mw[1322,1329,1333,1350,1351,1352,1353,1354,1366,1367,1368,1370,1372,1373] [14:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:31] it's been about ~10min now [14:24:01] I think that we are stable-ish, a couple of busy workers but nothing horrible, +1 from my side [14:24:34] (03CR) 10David Caro: cinderutils.ensure: don't add mountpoint if we exec (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701559 (owner: 10David Caro) [14:24:45] normally we're at like 275ms, it's ~330ms right now [14:24:49] (03PS1) 10Jbond: P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) [14:24:59] effie: looks fine to you as well? [14:25:25] (03CR) 10jerkins-bot: [V: 04-1] P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [14:26:21] legoktm: I think it is within acceptable limits for now [14:26:54] ok :D [14:27:03] jayme: let's do it? 
[14:27:15] legoktm: ack [14:27:32] (03PS2) 10Jbond: P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) [14:27:49] please highlight me if you see me doing anything stupid :) [14:28:12] "eqiad codfw" lgtm [14:28:39] we'll get a good chance to review the list of services when you reduce TTLs [14:28:44] I was checking that too, lgtm as well :D [14:28:49] (03CR) 10jerkins-bot: [V: 04-1] P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [14:29:20] eheh, yeah. I'll wait a couple of secs before every command to give the chance to yell [14:29:23] !log jayme@cumin1001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [14:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:22] "Blame Joe." [14:30:30] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10herron) [14:30:36] <_joe_> :P [14:30:36] * jayme blaming Joe [14:30:43] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10herron) [14:30:45] (03PS5) 10Ema: varnish: add counters for Varnish SLI [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) [14:31:25] lol [14:31:34] nice change from blaming vo lans [14:32:15] I can actually blame both of them asynchronously [14:32:37] frickin showoff multitaskers [14:32:59] there, see, now you're mad at me and you forgot about Joe completely [14:33:10] (03CR) 10Ema: varnish: add counters for Varnish SLI (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [14:33:32] I wish it put timestamps on the messages 
optionally (cookbook) so that we could have them for things like this [14:34:47] tail -f /var/log/spicerack/sre/switchdc/services.log [14:34:50] that one has timestamps [14:35:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [14:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:26] apergos: optionally you can `tail -f /var/log/spicerack/sre/switchdc/services.log` [14:35:38] you'll lose the interactive menu output but I think that's all [14:35:41] ah nice lego [14:35:47] oh oops beaten [14:36:16] and you can't see commands before they're typed :p [14:36:44] !log jayme@cumin1001 START - Cookbook sre.switchdc.services.01-switch-dc [14:36:44] !log jayme@cumin1001 Switching services kartotherian, proton, wdqs-internal, wikifeeds, zotero, recommendation-api, swift-ro, linkrecommendation, mobileapps, citoid, eventgate-analytics, push-notifications, eventstreams-internal, mathoid, similar-users, schema, apertium, restbase-async, shellbox, termbox, wdqs, ores, eventgate-analytics-external, swift, helm-charts, restbase, cxserver, search, sessionstore, eventstreams, api-gate [14:36:44] ore, eventgate-logging-external: eqiad => codfw [14:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:07] what's wrong with kartotherian? [14:37:18] "failed to check $NAME" just means "hasn't updated yet" [14:37:21] it recovered? 
[14:37:22] oh [14:37:30] I thought we corrected all of those after the last one but I must have missed the services cookbook [14:37:41] REFUSED is different, not sure what's happening here [14:37:41] * legoktm adds to notes [14:37:51] !log jayme@cumin1001 END (FAIL) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=99) [14:37:53] legoktm: if you file a task for that last one, assign it my way [14:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:04] <_joe_> ok let's pause for a sec [14:38:05] let's see about chartmuseum though [14:38:12] (03PS3) 10Jbond: P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) [14:38:18] <_joe_> let's see what the situation is on etcd [14:38:36] that might be because chartmuseum is 'special' - hmpf [14:38:44] not having a service IP [14:38:46] that is [14:38:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30011/console" [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [14:38:54] <_joe_> yes [14:38:58] <_joe_> we should [14:39:04] <_joe_> Ve excluded it [14:39:07] <_joe_> btw [14:39:12] <_joe_> the actual switch happened AFAICT [14:39:29] <_joe_> confctl --object-type=discovery select 'name=eqiad' get | grep -F '"pooled": true' | jq .tags seems ok [14:39:46] is that because the cookbook expects everything to have service IPs? [14:39:54] <_joe_> so we just failed the verify step because we didn't exclude chartmuseum [14:39:56] <_joe_> yes [14:40:11] yes [14:40:29] do we rerun with --exclude_services then, to get the verify step on everything else? [14:40:33] <_joe_> no [14:40:42] I think it verified everything else before failing on this one? 
[14:40:46] <_joe_> I think we should be ok [14:40:55] <_joe_> let's look at traffic patterns [14:40:55] Are there outstanding issues with kartotherian? [14:41:00] legoktm: only if this was the last one in the list, it would have stopped I think [14:41:02] <_joe_> hnowlan:no [14:41:05] cool [14:41:16] <_joe_> rzl: was the second to last [14:41:21] ack [14:41:41] (sorry, I'm on a cramped laptop screen, slower to tab around to check stuff than I'd like) [14:42:36] <_joe_> https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 looks legit for instance [14:42:47] <_joe_> (and also https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=thanos&var-site=eqiad&var-service=sessionstore&var-prometheus=k8s&var-container_name=kask-production) [14:43:40] <_joe_> ok, let's see if latency for mediawiki or restbase increases [14:43:55] <_joe_> and how much, if it's not too bad, we can finish the switch [14:44:02] eventgates look fine as well [14:44:11] fwiw I see "restbase, cxserver, search, sessionstore, eventstreams, api-gate" after helm-charts in the service list [14:44:17] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10herron) a:03NRodriguez Hi @NRodriguez there are a couple steps to check off in order to move forward on this request. When you have a moment could you please... * Review and s... [14:44:27] so we haven't verified any of those six, should check them all by hand if we're not rerunning [14:44:40] is there any harm in rerunning? 
[14:44:42] <_joe_> rzl: yes, I checked everything is switched [14:44:50] _joe_: okay, thanks [14:44:51] ok [14:44:54] <_joe_> legoktm: no, ofc not [14:45:02] <_joe_> just skip the reduce ttl one [14:45:20] <_joe_> but I'm trying to say you should look more at the current impact of having switched everything that was switched [14:45:51] gotcha [14:46:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb::database: merge puppetmaster::puppetdb::database into this profile [puppet] - 10https://gerrit.wikimedia.org/r/701926 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [14:46:30] let's re-run 01-switch-dc then after checking latency for sake of consistency [14:46:46] <_joe_> for instance, https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=57&orgId=1&from=now-1h&to=now [14:47:12] <_joe_> as expected, little to no request take less than 50 ms [14:47:23] <_joe_> as most requests will make at least 1 query to sessionstore [14:47:32] (03PS1) 10DCausse: [wdqs] switch updater reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/701927 [14:47:35] <_joe_> which is now 30 ms of RTT away [14:48:24] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=6&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=sessionstore this is the latency as seen by envoy on the appservers [14:48:57] (correction: that full list of services after helm-charts would have been restbase, cxserver, search, sessionstore, eventstreams, api-gateway, eventgate-main, echostore, eventgate-logging-external -- legoktm, for later can you also note that long SAL lines are getting truncated, and we should do some work at the spicerack level to split them automatically) [14:49:09] (agree let's focus on the latency effects now) [14:49:19] <_joe_> so, the impact seems limited enough to me [14:49:35] already noticed that :) keeping track at 
https://etherpad.wikimedia.org/p/2021-switchdc-prep [14:49:38] cheers [14:49:39] <_joe_> you can re-run the cookbook excluding chartmuseum [14:49:57] <_joe_> and ofc don't do the reduce ttl again [14:50:10] ^ sgtm [14:50:13] jayme: ^^ [14:50:28] wdqs will complain about lag but it's just a broken metric that depends on a particular kafka topic (https://gerrit.wikimedia.org/r/701927 should fix this) [14:50:31] gehel: ^ [14:50:38] (legoktm: we should also add everything joe just said and linked https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Services :)) [14:50:43] yep :D [14:51:02] dcausse: is it the metric/monitoring that's broken or actual functionality? [14:51:15] legoktm: the metric, the updates are still ok [14:51:21] ok [14:51:30] a note on timing, in 9 minutes the next window for the next piece of the switchover starts. [14:51:45] yep, fine to run over -- depooling eqiad from traffic will be quick [14:51:46] q: should thanos-swift.discovery.wmnet have switched? [14:51:55] dcausse: not yet [14:52:00] thanks [14:52:13] * jayme re-running switchdc [14:52:25] jayme: --excude helm-charts -- eqiad codfw [14:52:29] dcausse: I'll hold on merging that patch, since only the metric is broken. Let's not get in the way of the DC switch more than we need to. [14:52:30] is it "helm-charts" or "chartmuseum"? 
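(Annotation on the truncated SAL lines noted at 14:48:57.) One way the long "Switching services ..." message could be split automatically is to pack service names into several messages that each stay under a length limit. This is only a sketch: the function name `split_service_list` and the default limit of 400 characters are assumptions for illustration, not spicerack's actual API or the real IRC/SAL limit.

```python
def split_service_list(prefix, services, suffix, limit=400):
    """Pack service names into messages of at most `limit` characters each.

    A single name that is longer than the limit on its own is still emitted
    (alone), since it cannot be split meaningfully.
    """
    messages = []
    current = []
    for name in services:
        candidate = prefix + ", ".join(current + [name]) + suffix
        if current and len(candidate) > limit:
            # Adding this name would overflow: flush what we have so far.
            messages.append(prefix + ", ".join(current) + suffix)
            current = [name]
        else:
            current.append(name)
    if current:
        messages.append(prefix + ", ".join(current) + suffix)
    return messages


if __name__ == "__main__":
    # A few of the service names from the log, with a small limit to force a split.
    names = ["kartotherian", "proton", "wdqs-internal", "wikifeeds", "zotero",
             "swift-ro", "mobileapps", "citoid", "eventgate-analytics"]
    for message in split_service_list("Switching services ", names,
                                      ": eqiad => codfw", limit=80):
        print(message)
```

Repeating the prefix and the "eqiad => codfw" suffix on every message keeps each SAL entry readable on its own.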
[14:52:34] this time the "--" is loadbearing [14:52:44] hrhr [14:52:46] (excLude obviously, my typo but you got it) [14:52:53] legoktm: helm-charts is the dnsdisc name [14:53:00] so that's what the cookbook takes [14:53:00] nice one rzl, thanks [14:53:05] ok [14:53:22] args look good to me [14:53:26] +1 [14:53:29] !log jayme@cumin1001 START - Cookbook sre.switchdc.services.01-switch-dc [14:53:29] !log jayme@cumin1001 Switching services swift, proton, mathoid, restbase, swift-ro, eventstreams, search, shellbox, eventgate-analytics-external, wdqs-internal, kartotherian, api-gateway, termbox, mobileapps, similar-users, wikifeeds, apertium, restbase-async, eventgate-main, eventgate-logging-external, ores, sessionstore, linkrecommendation, echostore, push-notifications, citoid, zotero, eventgate-analytics, wdqs, eventstreams-i [14:53:29] , schema, cxserver: eqiad => codfw [14:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:47] (03PS1) 10Andrew Bogott: toolforge: add a profile for installing the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701928 (https://phabricator.wikimedia.org/T170355) [14:54:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [14:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:06] woot [14:54:28] nice! when we run 02-restore-ttl we'll drop the --exclude flag, since we want to restore helm-charts also [14:54:34] ack [14:54:50] but wait until everyone's satisfied with the services a/p first [14:54:50] there is restbase-async special case [14:55:26] <_joe_> jayme: yes, we can either switch that to eqiad, or leave everything in codfw for now [14:55:36] <_joe_> last time we decided to test how codfw would behave [14:56:05] yeah that's a question for legoktm ^ not sure if it's been discussed [14:56:12] yeah. 
Just wanted to make the point that we should decide now to take benefit of the reduced ttl [14:56:14] (03PS1) 10Jbond: P:puppetdb::database: add $role variable back as $db_role [puppet] - 10https://gerrit.wikimedia.org/r/701930 [14:56:30] <_joe_> I would leave it in codfw again [14:56:52] +1 on removing special cases [14:56:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30012/console" [puppet] - 10https://gerrit.wikimedia.org/r/701930 (owner: 10Jbond) [14:57:08] which I think leaving in codfw means [14:57:49] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Wikidata, and 2 others: mediawiki::maintenance::wikidata should not run crons for testwikidatawiki when used on labs / a testwikidatawiki doesnt exist - https://phabricator.wikimedia.org/T173357 (10Addshore) 05Open→03Declined Just... [14:58:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb::database: add $role variable back as $db_role [puppet] - 10https://gerrit.wikimedia.org/r/701930 (owner: 10Jbond) [14:58:37] +1 to that. otoh if we want to make that seperation the default, we should add a special case cookbook after 01-switch-dc [14:58:45] yeah [14:59:00] I'll add a note to either remove the special instructions or implement it in code [14:59:01] or even inside of that...whatever [14:59:14] terminal size [14:59:32] legoktm: +1 to doing one or the other as a followup -- have we settled on a plan for today? [14:59:36] Can we have a window to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/701927 ? 
The bad reporting WDQS lag is affecting bot edits on wikidata [14:59:55] gehel: do it [14:59:59] (03PS2) 10Gehel: [wdqs] switch updater reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/701927 (owner: 10DCausse) [15:00:04] legoktm and sukhe: (Dis)respected human, time to deploy Datacenter Switchover: Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1500). Please do the needful. [15:00:05] we are all here but not in the meeting, right? [15:00:06] legoktm: thanks! doing it! [15:00:13] here [15:00:18] <_joe_> gehel: that definitely needs to be rethought though [15:00:29] sukhe: it'll probably be a bit longer until we're done with services [15:00:38] <_joe_> we need to be able to switch eventgate in a second in an emergency [15:00:44] legoktm: noted! [15:00:52] the new updater does not suffer from the same issue, so the fix is probably to migrate, not to fix the current one [15:00:53] I added a TODO to automate the wdqs monitoring [15:01:28] (03CR) 10Gehel: [C: 03+2] [wdqs] switch updater reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/701927 (owner: 10DCausse) [15:02:06] jayme, rzl: I think for today we should just leave restbase-async as is, pooled in codfw [15:02:19] all agreed [15:02:51] can whoever is connected with that small terminal, please resize [15:04:07] ^ the shared tmux session takes the size of the *smallest* connected terminal [15:04:32] one question - I am checking ores eqiad logs and I see change propagation entries (from the pre-cache settings for Ores I assume) [15:05:25] I'd expect traffic to slowly decrease over time but lemme know if there is any obscure tribal knowledge that I don't know [15:06:09] legoktm: I would assume we're happy to switch thanos? [15:06:14] oh, no [15:06:24] we did not change ttl there anyways [15:06:45] so lets continue with restoring TTL? 
[15:06:48] elukey: looking at https://grafana.wikimedia.org/d/vAN_bQemz/ores-advanced-metrics?orgId=1&refresh=1m I can see the switchover, but it looks like beforehand codfw was getting some traffic [15:06:52] jayme: +1 [15:07:15] restoring ttl sgtm [15:07:38] legoktm: yeah I see only pre-caching entries from changeprop on eqiad nodes, but I expected them to stop as well [15:07:51] !log restarting wdqs-updater on all wdqs hosts for new configuration [15:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:11] * jayme restoring TTL [15:08:25] !log jayme@cumin1001 START - Cookbook sre.switchdc.services.02-restore-ttl [15:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:53] elukey: so you're saying changeprop is still hitting ORES in eqiad? [15:09:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [15:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:27] legoktm: yes this is what I am seeing [15:10:51] not sure if it is meant to do it [15:10:55] (03CR) 10Cwhite: "Per https://phabricator.wikimedia.org/T285318, there appears to be no extra transition steps needed." [puppet] - 10https://gerrit.wikimedia.org/r/701617 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:10:55] (since it is pre-cache) [15:11:14] can we check if codfw was getting those same requests pre-switchover? 
[15:12:40] yep I can follow up, it is not a big issue, I was just checking if people knew why [15:12:47] <_joe_> sorry I'm in a meeting, but I would suggest the issue might be long-lasting connections in envoy from changeprop [15:13:14] <_joe_> so if you do a restart of changeprop in eqiad, that should fix it [15:13:27] changeprop in eqiad has stopped processing revision-create events from queues so they're probably long-running events [15:14:19] <_joe_> please ping me if you need me to step away from the meeting and follow the chat [15:14:50] (03PS1) 10Jbond: O:puppetmaster::puppetdb: rename role to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) [15:15:42] hnowlan: what are "long-running events"? [15:16:18] and seemingly no other issues besides ORES/changeprop (maybe)? [15:16:25] gehel: is everything ok with WDQS now? [15:16:30] <_joe_> hnowlan: the point is that changeprop in eqiad should re-resolve ores.discovery.wmnet to codfw [15:16:31] (03PS2) 10Jbond: O:puppetmaster::puppetdb: rename role to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) [15:16:45] legoktm: sorry for the lack of update. Yes, everything is fine! [15:17:03] _joe_: aha. in that case you'd be right about the restart [15:17:11] I think we've seen similar before [15:17:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30015/console" [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:17:23] (03CR) 10Jbond: O:puppetmaster::puppetdb: rename role to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:17:31] should we roll-restart changeprop then? [15:17:35] to verify? 
[15:17:37] the long term fix for WDQS update lag monitoring is going to be T244590 [15:17:38] T244590: [Epic] Rework the WDQS updater as an event driven application - https://phabricator.wikimedia.org/T244590 [15:17:38] jayme: yep [15:17:55] legoktm: things that arrived to queues late I'm guessing, changeprop itself reports 0 revision-create events being processed in eqiad atm though so I'm not sure [15:17:55] hnowlan: did you volunteer? :) [15:18:08] legoktm, hnowlan I am tcpdumping right now and I see traffic from kubernetes codfw pods [15:18:20] so it was probably not clear from logs [15:18:26] all good :) [15:18:43] <_joe_> elukey: traffic from codfw pods to what? [15:18:49] ores eqiad nodes [15:18:59] <_joe_> ok, that makes sense [15:19:03] I think it is the pre-cache stuff, that warms up both clusters [15:19:10] <_joe_> because pre-caching is both clusters yes [15:19:23] cool [15:20:40] so all happy not IIUC? [15:20:44] *now [15:21:10] I think so...just thanos-* left? [15:21:15] yeah [15:22:00] yeah, we could reduce its TTL first like we did with the other services, or we could just go ahead and depool [15:22:13] not sure if godog has a preference, docs seem to indicate depooling is fine [15:22:20] and it's a five-minute wait for us either way :D [15:24:58] (03PS1) 10Jbond: O:puppetmaster::puppetdb: add RO postgres user [puppet] - 10https://gerrit.wikimedia.org/r/701933 (https://phabricator.wikimedia.org/T285666) [15:25:35] Is etherpad supposed to be down [15:25:59] Seems very very slow [15:26:07] (03PS2) 10Jbond: O:puppetmaster::puppetdb: add RO postgres user [puppet] - 10https://gerrit.wikimedia.org/r/701933 (https://phabricator.wikimedia.org/T285666) [15:26:16] I would assume we don't ultimately need to reduce TTL in thanos case [15:26:17] no, but it's working for me [15:26:22] RhinosF1: ^ [15:26:26] RhinosF1: works fine for me too [15:26:29] etherpad's working for me too, nothing's expected there [15:26:29] !log cmjohnson@cumin1001 START - Cookbook 
sre.dns.netbox [15:26:36] it hiccups sometimes though, give it another try [15:26:36] I think we should be fine without reducing TTL [15:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:49] Think it woke back up [15:26:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30017/console" [puppet] - 10https://gerrit.wikimedia.org/r/701933 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:27:10] legoktm, jayme: skipping the TTL change sounds good to me, just wanted to check [15:27:26] Yeah it's stopped being a pain now [15:27:27] I think we're ready to switch thanos then, any objections godog ? [15:27:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetmaster::puppetdb: add RO postgres user [puppet] - 10https://gerrit.wikimedia.org/r/701933 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [15:28:37] rzl: in a meeting but +1 [15:28:49] ack [15:29:49] lgtm [15:30:02] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-.*,name=eqiad [15:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:58] while we wait out the five minutes, want to prepare the traffic patch? [15:31:02] if it's not already [15:31:05] sukhe already did that [15:31:09] oh perfect [15:31:16] ok, so just thanos-* to switchover? [15:31:20] erp, wrong paste [15:31:21] https://gerrit.wikimedia.org/r/c/operations/dns/+/701610/ [15:31:46] legoktm: you want me to cumin run authdns-update as well? [15:32:05] does the cookbook normally do that? 
[15:32:17] after the traffic switch I mean [15:32:18] at least as per the instructions, that needs to be run manually [15:32:30] jayme: oh, sukhe was going to do the traffic part [15:32:44] no worries, feel free to do it jayme :) [15:33:22] it doesn't need cumin though, running it on any single server is sufficient -- check me sukhe? [15:33:30] rzl: correct [15:33:34] I would not insist...but as I'm the RW-guy in tmux :D [15:34:32] ah, docs do say any not all [15:35:10] I didn't think depooling eqiad / running authdns-update was so exciting that we'd all be watching on the tmux :p [15:35:55] 5m is up, how are we doing? [15:37:06] https://grafana-rw.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=now-30m&to=now-1m&var-site=codfw&var-prometheus=thanos&var-cluster=thanos looks reasonable [15:40:15] codfw has a bunch more server errors, but that seems normal [15:40:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [15:40:53] ok, are we ready to move on to the traffic switch or should we wait a bit more? [15:42:05] no need to wait afaik [15:42:22] (03CR) 10Vgutierrez: "tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/701073 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [15:42:31] +1 [15:42:49] ok to merge and run dns-update? [15:42:55] sukhe: go for it [15:43:01] (03CR) 10Ssingh: [C: 03+2] admin_state: depool eqiad for datacenter switchover (June 2021) [dns] - 10https://gerrit.wikimedia.org/r/701610 (https://phabricator.wikimedia.org/T281515) (owner: 10Ssingh) [15:43:59] sukhe: legoktm: I think legoktm is right about authdns-update being pretty boring.
So please do so after merging [15:44:16] and I'll close the tmux lounge for today [15:44:46] !log Traffic: depool eqiad from user traffic [15:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:10] legoktm: done [OK - authdns-update successful on all nodes!] [15:45:14] :D [15:45:23] IIRC there will be some alert about eqiad having a sudden drop in traffic [15:45:41] :D [15:45:43] ok! [15:45:51] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Cmjohnson) a:05Cmjohnson→03RobH the idrac has been updated and netbox. This should also fix the port spamming Arzhel. @RobH can you do the install? [15:46:36] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Cmjohnson) [15:46:51] (03CR) 10Ahmon Dancy: [C: 03+1] [WMF] fork gitiles to prevent loading fonts from 3rd party [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/700932 (https://phabricator.wikimedia.org/T240264) (owner: 10Hashar) [15:46:59] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&var-site=codfw&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=now-30m&to=now [15:47:15] also watching https://grafana.wikimedia.org/d/000000093/varnish-traffic?orgId=1&from=now-30m&to=now [15:48:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10Cmjohnson) @Jgreen the idrac is set up, the password is set to the temporary DM if you do not remember. The production port is not set up. Which VLAN do you want this to go in? fundraising? 
[15:49:27] jouncebot: next [15:49:27] In 1 hour(s) and 10 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1700) [15:49:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Cmjohnson) @ayounsi I should be able to rack them on Friday but not cable them. We are out the week of 5 July and I am on vacation as of this moment the following week. Maybe @... [15:49:40] cool, we ought to be done by the end of this window but no rush [15:52:55] got the alert as well and things seem to be OK so far [15:53:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 48.38 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:53:29] when eqiad is depooled, does all of its traffic end up at codfw or will some of it go elsewhere like esams/ulsfo? 
[15:53:41] * legoktm acks [15:54:03] legoktm: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/geo-maps is the preference order for each location [15:54:13] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 48.38 le 60 Legoktm eqiad is depooled for DC switchover https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:54:21] rzl beat me to it [15:54:58] legoktm: from a quick ctrl-F, every occurrence of `[eqiad,` is `[eqiad, codfw, ` so the answer is it all goes to codfw [15:55:15] but not because it has to be that way, just because that turns out to be the best answer [15:55:30] oooh [15:55:32] there probably are some places that are, network-topology-wise, closer to ulsfo than they are to codfw, while still being closest to eqiad [15:55:55] legoktm: the other thing to note is that esams is special because it is 'too big to fail' ;) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master [15:56:00] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/geo-maps-esams-offline [15:56:13] in other cases we do split them -- e.g. if we depool ulsfo, some drains to codfw and some to eqsin [15:56:44] that map pushes most esams traffic to either eqiad or ulsfo, while also pushing users that are normally eqiad-destined further west towards codfw/ulsfo [15:56:55] because otherwise we can overheat eqiad's egress [15:57:23] that problem goes away in a drmrs universe, right? [15:57:23] drmrs will help solve this [15:57:26] I think so! 
[15:57:27] haha [15:57:29] cdanis: we might be good now with the addition of a Telia link, but we never tried it [15:57:33] aha [15:57:38] yeah and in general [15:57:44] we should probably make more use of codfw than we do [15:57:52] and yeah drmrs will solve that [15:59:06] since we're just waiting on eqiad to depool, unless there's any objections, I'm going to officially say today's part of the switchover is over [16:00:18] (03PS1) 10Jbond: (WIP): add ro user [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) [16:00:47] thanks everyone who helped out, especially jayme and sukhe who pressed all the buttons [16:01:13] thanks to you legoktm for coordinating it so nicely! my part was pretty trivial [16:01:21] thanks all! I'm going back to vacation until tomorrow [16:01:23] I'll send out a short writeup of how today went and the issues we found in a bit, plus file phab tasks [16:01:33] <3 [16:01:34] great work everybody and especially legoktm [16:01:48] rzl: enjoy the vacation, even though it seems short haha [16:01:49] thanks all and happy hunting rzl :) [16:02:00] (03CR) 10jerkins-bot: [V: 04-1] (WIP): add ro user [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [16:02:07] <3 [16:02:15] and tomorrow is the big MediaWiki switchover, I'm going to do another live test later today, I'll drop a note in various channels when I start if people want to watch/participate [16:02:35] great! [16:05:28] legoktm: when was the last DC switchover? 
[16:05:41] (just so i can mentally think about anything that might have cropped up) [16:05:59] addshore: Sept 1, 2020: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_2020_switch [16:06:46] (03PS1) 10Jgiannelos: Add cronjob for tegola tiles pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/701938 [16:07:25] (03PS1) 10Jbond: puppetmater:puppetdb: fix puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/701939 [16:08:13] (03CR) 10Jbond: [C: 03+2] puppetmater:puppetdb: fix puppetdb_ro user [puppet] - 10https://gerrit.wikimedia.org/r/701939 (owner: 10Jbond) [16:15:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:37] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [16:24:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:34:46] (03CR) 10Bstorm: [C: 03+1] "I love the tests in puppet! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/701063 (owner: 10David Caro) [16:35:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:14] (03CR) 10Bstorm: [C: 04-1] toolforge: Add buster specific packages/setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700186 (owner: 10David Caro) [16:45:33] (03PS2) 10Jbond: puppetdb::app: Use seperate user for the read databse [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) [16:46:37] (03CR) 10Bstorm: "Changing to 0 because when that's resolved you can consider it an automatic +1 from me?
I just don't know a good way to show that across t" [puppet] - 10https://gerrit.wikimedia.org/r/700186 (owner: 10David Caro) [16:48:01] (03CR) 10jerkins-bot: [V: 04-1] puppetdb::app: Use seperate user for the read databse [puppet] - 10https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [17:00:05] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1700). [17:03:04] (03PS7) 10Jcrespo: dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - 10https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979) [17:03:06] (03PS1) 10Jcrespo: dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps [puppet] - 10https://gerrit.wikimedia.org/r/701948 (https://phabricator.wikimedia.org/T284897) [17:04:17] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications and remove db1171 future reimages [puppet] - 10https://gerrit.wikimedia.org/r/700474 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [17:05:14] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps [puppet] - 10https://gerrit.wikimedia.org/r/701948 (https://phabricator.wikimedia.org/T284897) (owner: 10Jcrespo) [17:06:27] (03PS1) 10Jcrespo: Revert "dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps" [puppet] - 10https://gerrit.wikimedia.org/r/701721 [17:06:42] (03CR) 10Jcrespo: [C: 04-1] "Waiting for 19 UTC." 
[puppet] - 10https://gerrit.wikimedia.org/r/701721 (owner: 10Jcrespo) [17:13:33] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:29] (03CR) 10Jeena Huneidi: "It looks like this should have been happening when we checked out the original mediawiki-config repo anyway, but was set to disable submod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701001 (https://phabricator.wikimedia.org/T285325) (owner: 10Jeena Huneidi) [17:17:58] (03Abandoned) 10Jeena Huneidi: Checkout portals for multiversion & webserver img [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701001 (https://phabricator.wikimedia.org/T285325) (owner: 10Jeena Huneidi) [17:23:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you please cable mc1039, 1051-1054. I could not find mc1039 so please update netbo... [17:26:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @robh please assign this to @Jclark-ctr to finish cabling and updating netbox for mw1448-1457 [17:26:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr [17:27:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @Jclark-ctr please cable mw1448-mw1458, and update netbox and task with switch port info. 
[17:29:28] (03PS1) 10Ebernhardson: Prepare Cirrus more_like for dc switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701951 [17:32:20] (03PS2) 10Ebernhardson: Prepare Cirrus more_like for dc switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701951 [17:44:17] (03CR) 10Arlolra: "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [17:50:04] (03PS4) 10Bartosz Dziewoński: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [17:50:16] (03PS5) 10Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) [17:52:45] jouncebot: now [17:52:45] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [17:52:47] jouncebot: next [17:52:47] In 0 hour(s) and 7 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1800) [17:53:25] (03CR) 10Urbanecm: [C: 03+2] Make it possible to force opt-in/opt-out to Growth features during account creation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701719 (https://phabricator.wikimedia.org/T284119) (owner: 10Kosta Harlan) [17:53:44] (03CR) 10Bartosz Dziewoński: [C: 03+1] Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) (owner: 10Thiemo Kreuz (WMDE)) [17:53:50] kostajh: fyi i merged the Growth patch a bit ahead of time, to give jenkins time to process it [17:54:04] MatmaRex: if you'll be around, happy to do the same for your backport. [17:54:31] yeah, i'm here. 
that's fine [17:54:40] thanks [17:54:45] (03CR) 10Urbanecm: [C: 03+2] Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) (owner: 10Thiemo Kreuz (WMDE)) [17:54:48] done [17:56:47] (03PS2) 10Urbanecm: Growth: Enable community configuration at all Growth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701579 (https://phabricator.wikimedia.org/T285423) [17:57:37] urbanecm: thanks [17:58:23] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable community configuration at all Growth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701579 (https://phabricator.wikimedia.org/T285423) (owner: 10Urbanecm) [17:58:45] (03CR) 10Legoktm: [C: 03+1] "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [17:58:47] (merged to be able to test at debug host) [17:59:12] (03Merged) 10jenkins-bot: Growth: Enable community configuration at all Growth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701579 (https://phabricator.wikimedia.org/T285423) (owner: 10Urbanecm) [18:00:05] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T1800). [18:00:05] MatmaRex, kostajh, Urbanecm, and ebernhardson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] \o [18:00:15] i can deploy today! 
[18:01:56] \o [18:03:53] (03PS5) 10Urbanecm: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [18:04:08] (03CR) 10Urbanecm: [C: 03+2] Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [18:04:18] MatmaRex: this one is untestable, right? [18:04:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1043c931b4b41d13410e91336f87b122b5447959: Growth: Enable community configuration at all Growth wikis (T285423) (duration: 00m 56s) [18:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:47] yeah, nothing to test [18:04:48] T285423: Deploy Special:EditGrowthConfig to all wikis - https://phabricator.wikimedia.org/T285423 [18:04:54] (03Merged) 10jenkins-bot: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [18:04:58] MatmaRex: ok, i'll pull it together with the following patch then [18:05:04] (03PS6) 10Urbanecm: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) (owner: 10Bartosz Dziewoński) [18:05:07] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) (owner: 10Bartosz Dziewoński) [18:05:52] (03Merged) 10jenkins-bot: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) (owner: 10Bartosz Dziewoński) [18:06:19] MatmaRex: both of your config patches are at mwdebug1001, please have a look [18:07:09] looking [18:07:52] seems fine [18:08:14] great, syncing [18:10:14] !log 
urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5b59184804b2ee0697ad20102eca0646aec4b105: Remove redundant wgDiscussionToolsEnable overrides (duration: 00m 56s) [18:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:35] syncing the other patch... [18:11:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4ae0fdd3ad19105fc36bd1eb7102dea9c4a5178d: Enable DiscussionTools topicsubscription as beta feature on partner wikis (T274280) (duration: 00m 57s) [18:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:18] T274280: Deploy config change to make Topic Subscriptions available as beta feature - https://phabricator.wikimedia.org/T274280 [18:11:19] (03CR) 10Subramanya Sastry: [C: 03+1] "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:12:01] MatmaRex: live [18:12:29] ebernhardson: I'd prefer if you deploy your patch; is it ok if I ping you when I'm done? [18:13:05] thanks [18:13:10] np [18:13:24] I'm now waiting for CI to do the backports [18:13:27] shouldn't take long...
[18:13:51] urbanecm: sure [18:14:03] great [18:15:32] (03Merged) 10jenkins-bot: Make it possible to force opt-in/opt-out to Growth features during account creation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701719 (https://phabricator.wikimedia.org/T284119) (owner: 10Kosta Harlan) [18:16:49] kostajh: your patch is at mwdebug1001, please have a look [18:16:58] (and i'll have a look too, as i wrote it) [18:18:58] (03Merged) 10jenkins-bot: Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) (owner: 10Thiemo Kreuz (WMDE)) [18:20:07] kostajh: fyi everything seems to work on my end (tested at cswiki with geEnabled=0 and enwiki with geEnabled=1) [18:20:14] waiting for you to confirm [18:21:33] MatmaRex: your backport is at mwdebug1002, please have a look [18:22:01] 1002? [18:22:17] yes [18:22:46] we've got variety today ;) looking [18:23:19] looks good [18:23:24] thanks, syncing [18:24:51] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/VisualEditor/: 794a46c861dbf5ac05ec824d7591e507c1eefd16: Hotfix for broken "Extract show all to placeholder class" (T284636; T285571) (duration: 00m 57s) [18:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:00] T284636: Extract new parameter (placeholder) "show all" implementation - https://phabricator.wikimedia.org/T284636 [18:25:00] T285571: Broken parameter search widget in VisualEditor template editor - https://phabricator.wikimedia.org/T285571 [18:25:07] kostajh: are you looking at the patch? 🙂 [18:27:29] (thanks) [18:27:46] any time MatmaRex [18:27:54] urbanecm: sorry, I had to switch machines and forgot IRC :( [18:28:05] no worries :) [18:28:14] lmk if it's good to sync.
[18:29:32] urbanecm: you can sync it [18:29:35] thanks [18:29:41] doing, thanks kostajh [18:30:23] syncing it homepagehooks first, then everything else [18:31:39] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/includes/HomepageHooks.php: ecf1d6c47c6cc30d84161e023373c6a2c7287be8: Make it possible to force opt-in/opt-out to Growth features during account creation (T284119; T284800; 1/3) (duration: 00m 58s) [18:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:48] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [18:31:48] T284119: Growth: options for disabling/enabling Growth features for groups of users - https://phabricator.wikimedia.org/T284119 [18:32:59] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/includes/HelpPanelHooks.php: ecf1d6c47c6cc30d84161e023373c6a2c7287be8: Make it possible to force opt-in/opt-out to Growth features during account creation (T284119; T284800; 2/3) (duration: 00m 55s) [18:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/includes/WelcomeSurveyHooks.php: ecf1d6c47c6cc30d84161e023373c6a2c7287be8: Make it possible to force opt-in/opt-out to Growth features during account creation (T284119; T284800; 3/3) (duration: 00m 55s) [18:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:16] kostajh: should be live [18:34:26] urbanecm: thank you! 
[18:34:30] any time :) [18:34:38] ebernhardson: the floor is yours [18:35:46] urbanecm: thanks [18:35:59] (03PS3) 10Ebernhardson: Prepare Cirrus more_like for dc switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701951 [18:36:07] (03CR) 10Ebernhardson: [C: 03+2] Prepare Cirrus more_like for dc switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701951 (owner: 10Ebernhardson) [18:36:50] (03Merged) 10jenkins-bot: Prepare Cirrus more_like for dc switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701951 (owner: 10Ebernhardson) [18:40:34] !log ebernhardson@deploy1002 Synchronized wmf-config/: T281515: Prepare Cirrus more_like for dc switchover (duration: 01m 02s) [18:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:41] T281515: June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 [18:40:55] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Majavah) As someone who works regularly with deployment-prep I've found... [19:09:15] urbanecm: I assume you're done, is that right? [19:09:42] that is correct Krinkle -- unsure whether ebernhardson is (the last deployer) [19:12:59] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10jeena) A multi-version image with the portals directory has been published to the...
[19:15:33] (03CR) 10Krinkle: [C: 03+2] purgeParserCache.php: Implement --tag for purging one server only [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701408 (https://phabricator.wikimedia.org/T282761) (owner: 10Krinkle) [19:17:12] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10Legoktm) So...is there a reason the portals need to be deployed with MediaWiki? T... [19:20:10] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (10wiki_willy) Hi @ayounsi - I reached out to Telxius for the kmz file. For Lumen, I think the new customer success manage... [19:26:17] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for fgoodwin - https://phabricator.wikimedia.org/T285580 (10herron) p:05Triage→03Medium a:05MNadrofsky→03FGoodwin Hi @FGoodwin could you please coordinate obtaining a comment of approval on this task from your manager? Once that's done we'll be r... [19:26:31] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for fgoodwin - https://phabricator.wikimedia.org/T285580 (10herron) [19:29:06] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for TChin - https://phabricator.wikimedia.org/T285326 (10herron) a:03tchin Hi @tchin could you please coordinate obtaining a comment of approval on this task from your manager? Once that's done we'll be ready to move on to creating a patchset for acces... [19:35:30] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) I posted an update on today's switchover to wikitech-l: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/XI57Z6T... 
[19:35:48] (03PS3) 10CDanis: statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) [19:36:01] (03Merged) 10jenkins-bot: purgeParserCache.php: Implement --tag for purging one server only [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701408 (https://phabricator.wikimedia.org/T282761) (owner: 10Krinkle) [19:36:32] 10SRE, 10Dumps-Generation: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10herron) Hey @ArielGlenn, Since this has been idling in the access request queue for some time I'm going to untag #sre-access-requests for the... [19:41:40] * Krinkle testing on mwdebug1002 [19:45:04] herron: yeah sorry about that task, that was the right solution, to just remove you folks. if there's something not right I can fix it after all [19:45:05] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.11/includes/objectcache/SqlBagOStuff.php: T282761 - I618bc1e8ca3008 (duration: 00m 59s) [19:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [19:46:12] apergos: cool no worries, just triaging/cleaning the queue for the week. if there is anything please do re-add, happy to help! 
[19:46:30] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.11/includes/libs/objectcache/: T282761 - I618bc1e8ca3008 (duration: 00m 56s) [19:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:31] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.11/maintenance/: I618bc1e8ca3008 (duration: 00m 56s) [19:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:15] (03PS4) 10CDanis: statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) [19:56:00] (03CR) 10jerkins-bot: [V: 04-1] statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [19:56:35] (03PS5) 10CDanis: statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) [19:59:12] (03CR) 10CDanis: "Thanks very much for the helpful comments!" (039 comments) [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [20:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T2000). [20:07:35] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10jeena) T238747 for hosting portals as a service hasn't be completed due to not ha... 
[20:17:29] PROBLEM - MegaRAID on db1129 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:17:30] ACKNOWLEDGEMENT - MegaRAID on db1129 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T285715 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:17:35] 10SRE, 10ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10ops-monitoring-bot) [20:23:58] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Marostegui) p:05Triage→03Medium Can we get a rma for this failed disk? Thanks [20:28:32] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10CDanis) [20:35:18] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10wiki_willy) a:03Cmjohnson [20:37:39] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10wiki_willy) Hi @Cmjohnson - just a heads up, there's only a couple more months before the warranty expires on this host. 
Thanks, Willy [20:38:39] (03PS1) 10Bstorm: tools-clush: remove paws from clush and add the rest of the k8s setup [puppet] - 10https://gerrit.wikimedia.org/r/701975 (https://phabricator.wikimedia.org/T280299) [20:44:25] 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10wiki_willy) [20:44:44] 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10wiki_willy) [20:47:28] 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [20:47:48] 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T2100). [21:10:49] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Doing): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10thcipriani) [21:26:34] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10Joe) >>! In T285325#7182067, @Legoktm wrote: > So...is there a reason the portals... [21:38:28] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Legoktm) Unfortunately, my patch to just ignore x2 didn't really work. spicerack gets the list of core_dbs by querying `A:core-db and A:db-role-mast... 
[21:55:01] !log krinkle@mwmaint1002: purgeParserCache.php --tag pc2, ref T282761 [21:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:10] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [21:57:45] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (10faidon) For Lumen we have a very long ongoing thread for our eqiad-esams relocation, so let's make sure we wrap this one... [22:09:13] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Legoktm) I live hacked this onto cumin1001 for now: ` diff --git a/spicerack/mysql_legacy.py b/spicerack/mysql_legacy.py index a69cc74..be423e9 100... [22:09:24] !log live-hacked spicerack on cumin1001 to ignore x2, see https://phabricator.wikimedia.org/T285519#7182377 [22:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:15] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:09] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Krinkle) >>! In T285519#7178376, @Legoktm wrote: > Ack, thanks for all the input. For next week we'll just ignore x2, it'll stay RW in both DCs thro... [22:21:35] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:22:18] legoktm: how does pc get excluded today? 
it surprised me that it was a new mechanism [22:22:45] maybe x2 is wrongly returned by "db-core" [22:22:53] not sure if that would simplify things or not [22:22:58] just thinking out loud :) [22:23:09] if that makes sense, seems easier to change now that it is still unused than later [22:23:55] or maybe it's better to keep in db-core given same authentication and use case (MW), but then again, that seems true for pc as well. [22:24:38] Krinkle: yeah, there's a separate A:db-parsercache [22:24:49] aha [22:24:56] what are the implications of that grouping? [22:25:07] https://gerrit.wikimedia.org/g/operations/puppet/+/1f13f3f9a7bb0f57157a6ff9fc0214fa27625720/modules/profile/templates/cumin/aliases.yaml.erb#74 [22:25:13] it's just inferred from the puppet role used [22:25:19] might be worth asking dbas if that's up for change / makes sense to them [22:25:35] I imagine it'll have a couple of implications for conftool and prometheus metrics [22:25:49] but this is pretty much the best time to do it (or right after switch over) [22:26:48] I spent a while grepping and following code in spicerack/switchdc cookbooks to verify that removing x2 wouldn't have any harmful implications in that context [22:27:05] but I don't know what being "core" means outside of that [22:31:20] !log starting DC switchover live test, which will "switch" us from codfw -> eqiad [22:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:38] args=['--live-test', 'codfw', 'eqiad'] [22:32:29] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [22:32:31] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [22:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:56] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [22:33:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:11] 10SRE, 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10Papaul) [22:37:29] 10SRE, 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10Papaul) a:05Papaul→03wiki_willy After making a decision on the line cards, you can resolve the task . Thanks [22:38:22] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [22:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:59] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [22:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:00] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [22:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:09] > Warmup completed in 0:00:18.139640 │············· [22:40:17] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [22:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:37] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [22:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:15] checked mwmaint2002, it successfully stopped the systemd timers and units [22:41:41] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [22:41:41] !log legoktm@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2021-06-28 22:41:41.222740 [22:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:52] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [22:41:55] !log 
legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [22:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:26] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [22:42:31] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [22:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:41] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [22:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:47] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [22:42:49] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [22:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:56] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [22:42:59] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [22:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:03] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [22:43:04] !log legoktm@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2021-06-28 22:43:04.512602 [22:43:04] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [22:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:14] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:25] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [22:43:28] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [22:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:34] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [22:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:54] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [22:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:10] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters [22:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:04] 10SRE, 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10wiki_willy) a:05wiki_willy→03Papaul Thanks Papaul. You can just move the line cards on the accounting spreadsheet to the top section called "Variance Between Netbox and Asset Tag List (Items Tracked but... 
[22:48:21] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0) [22:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:31] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [22:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:54] (03CR) 10Cwhite: [C: 03+1] logstash: add logstash200[123] to v7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [22:50:18] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [22:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:02] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [22:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:16] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [22:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:46] all done, I think [22:56:23] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) I did a successful run through of the live-test mode just now, where we "switch" from codfw -> eqiad. The only issue I ran into is T285519#718237... [22:57:07] 10SRE, 10ops-codfw, 10DC-Ops: Netbox Accounting Errors - https://phabricator.wikimedia.org/T285718 (10wiki_willy) 05Open→03Resolved Actually, I'm going to keep it in the same place on the spreadsheet, then just mark WMFNA, which should fix it. Thanks, Willy [23:00:05] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210628T2300). [23:00:05] arlolra: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:14] I can deploy today [23:00:20] ty [23:01:16] (03PS4) 10Urbanecm: Enable Parsoid inspired media structure on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [23:01:51] (03CR) 10Urbanecm: [C: 03+2] Enable Parsoid inspired media structure on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [23:02:36] arlolra: is it right this config value will only be used in wmf.12? [23:02:38] (03Merged) 10jenkins-bot: Enable Parsoid inspired media structure on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [23:02:44] (I'm asking if there's anything to do to test it) [23:02:53] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [23:03:05] yeah, there's nothing really to test [23:03:12] before the train tomorrow [23:03:25] okay. I'll just sync it then. [23:03:31] thanks [23:05:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5ec855d14b31a9392274c2bfe2e21e2ad44986bc: Enable Parsoid inspired media structure on test wikis (T51097) (duration: 00m 59s) [23:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:47] T51097: Use figure and figcaption HTML5 elements when possible - https://phabricator.wikimedia.org/T51097 [23:05:50] arlolra: should be live. Anything else I can help with? [23:06:12] nope, that's it from me. 
thanks [23:06:17] any time :) [23:07:05] !log Evening B&C window done [23:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:05] PROBLEM - Host ping3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:25] RECOVERY - Host ping3001 is UP: PING OK - Packet loss = 0%, RTA = 107.19 ms [23:48:33] PROBLEM - Host ping3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:48:51] RECOVERY - Host ping3001 is UP: PING OK - Packet loss = 0%, RTA = 107.32 ms [23:50:29] (03PS1) 10Tim Starling: Include SQL queries in the debug log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 [23:51:42] (03CR) 10jerkins-bot: [V: 04-1] Include SQL queries in the debug log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [23:55:12] (03CR) 10Tim Starling: "I'll test this by patching a couple of live files on mwdebug1001." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995 (owner: 10Tim Starling) [23:58:19] (03PS2) 10Tim Starling: Include SQL queries in the debug log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701995