[00:00:05] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T0000).
[00:00:05] ebernhardson and mewoph: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:17] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:01:13] i can ship mine
[00:04:22] i can ship the other as well if mowoph is around
[00:04:42] (PS2) Ebernhardson: Revert "Move cirrus traffic to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/744857 (https://phabricator.wikimedia.org/T296897)
[00:05:19] i'm around, but still waiting for CI
[00:06:26] (CR) Ebernhardson: [C: +2] "backport window" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744896 (https://phabricator.wikimedia.org/T297250) (owner: MewOphaswongse)
[00:06:30] mewoph: it takes awhile :) lets start it now
[00:06:48] (CR) Ebernhardson: [C: +2] Revert "Move cirrus traffic to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/744857 (https://phabricator.wikimedia.org/T296897) (owner: Ebernhardson)
[00:06:53] ebernhardson: thanks!
[00:07:39] (Merged) jenkins-bot: Revert "Move cirrus traffic to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/744857 (https://phabricator.wikimedia.org/T296897) (owner: Ebernhardson)
[00:09:07] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:09:33] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T296897 Move cirrus traffic back to eqiad (duration: 01m 08s)
[00:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:39] T296897: Eqiad Geosearch API queries return errors on Commons - https://phabricator.wikimedia.org/T296897
[00:13:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:50] (CR) Herron: [C: +1] "LGTM overall! Minor comment inline" [puppet] - https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:19:06] (CR) Herron: [C: +1] prometheus: bump logging level for blackbox-exporter [puppet] - https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:19:55] (PS1) Krinkle: mediawiki.base: Add missing toString param to Messags#escaped() [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744804 (https://phabricator.wikimedia.org/T292489)
[00:20:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:44] SRE, Observability-Logging, Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (colewhite) >>! In T297239#7554905, @herron wrote: > Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster u...
[00:31:19] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:32:57] (PS2) Bartosz Dziewoński: mediawiki.base: Add missing toString param to Message#escaped() [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744804 (https://phabricator.wikimedia.org/T292489) (owner: Krinkle)
[00:33:00] (Merged) jenkins-bot: Add an image: Only validate caption if the recommendation is accepted [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744896 (https://phabricator.wikimedia.org/T297250) (owner: MewOphaswongse)
[00:34:21] PROBLEM - dump of s2 in eqiad on alert1001 is CRITICAL: dump for s2 at eqiad taken more than 8 days ago: Most recent backup 2021-11-30 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:38:57] ebernhardson: 744896 is finally merged :) i'm ready to test
[00:39:16] mewoph: finally! ok pulling to mwdebug host
[00:41:15] mewoph: i've rebased wmf.12 and pulled to mwdebug1002, should be testable
[00:41:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:33] ebernhardson: lgtm!
[00:45:43] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[00:46:32] (CR) Cwhite: [C: -1] "Change as-is looks alright, but will likely break puppet on beta-logs." [puppet] - https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: Herron)
[00:47:11] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[00:48:00] (CR) Cwhite: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559) (owner: Dzahn)
[00:49:20] mewoph: ok, shipping
[00:49:51] ebernhardson: thank you!
[00:50:33] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:51:09] !log ebernhardson@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/AddImageSubmissionHandler.php: backport window for 744896 (duration: 01m 05s)
[00:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:33] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:41] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:02:17] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:04:33] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:07:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:13] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:13:43] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:49] SRE, Observability-Logging, Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (herron) >! In T297239#7554905, @herron wrote: > If I'm understanding correctly this would be ingesting/filtering the medaiwiki log topics from kafka-l...
[01:24:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:26:49] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:27:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:41] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:01] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:42:29] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:47:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:27] PROBLEM - SSH on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:51:21] PROBLEM - dump of s3 in eqiad on alert1001 is CRITICAL: dump for s3 at eqiad taken more than 8 days ago: Most recent backup 2021-11-30 01:18:08 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:53:33] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:53:43] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:37] statograph_post is failing because of graphite
[02:09:18] !log powercycle graphite1004 via mgmt
[02:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:45] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1634 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[02:11:49] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:12:07] PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:12:33] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[02:12:33] PROBLEM - carbon-local-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:13:11] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 94.44% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:13:23] PROBLEM - Mediawiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:13:47] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:51] PROBLEM - Mediawiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:14:23] RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:14:49] RECOVERY - carbon-local-relay service on graphite1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:18:15] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:47:11] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:47:24] SRE, Graphite, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (Legoktm) p:Triage→Unbreak!
[02:53:27] RECOVERY - Mediawiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:53:47] PROBLEM - dump of m2 in codfw on alert1001 is CRITICAL: dump for m2 at codfw taken more than 8 days ago: Most recent backup 2021-11-30 02:40:58 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:53:55] RECOVERY - Mediawiki CirrusSearch update rate - eqiad on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:58:15] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:00:01] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Legoktm) @Vgutierrez cp5006 dropped off monitoring at exactly midnight, and ssh for it has been flapping - currently I can't get in....
[03:02:43] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:09:23] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:15:28] (PS1) Herron: graphite1004: set profile::monitoring::is_critical: true [puppet] - https://gerrit.wikimedia.org/r/744906 (https://phabricator.wikimedia.org/T297265)
[03:25:01] (CR) Legoktm: [C: +1] graphite1004: set profile::monitoring::is_critical: true [puppet] - https://gerrit.wikimedia.org/r/744906 (https://phabricator.wikimedia.org/T297265) (owner: Herron)
[03:25:30] (CR) RLazarus: [C: +1] graphite1004: set profile::monitoring::is_critical: true [puppet] - https://gerrit.wikimedia.org/r/744906 (https://phabricator.wikimedia.org/T297265) (owner: Herron)
[03:25:43] (CR) Herron: [C: +2] graphite1004: set profile::monitoring::is_critical: true [puppet] - https://gerrit.wikimedia.org/r/744906 (https://phabricator.wikimedia.org/T297265) (owner: Herron)
[03:31:01] SRE, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Andrew) I don't much care about having to click through the partman step but imaging still fails. Now it stalls on 'Attempt to run 'cookbooks.s...
[03:33:57] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:34:30] (PS1) Andrew Bogott: site.pp: fix a very important typo re: cloudvirt1028 [puppet] - https://gerrit.wikimedia.org/r/744909 (https://phabricator.wikimedia.org/T296906)
[03:36:30] (CR) Andrew Bogott: [C: +2] site.pp: fix a very important typo re: cloudvirt1028 [puppet] - https://gerrit.wikimedia.org/r/744909 (https://phabricator.wikimedia.org/T296906) (owner: Andrew Bogott)
[03:37:17] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:37:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
[03:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:27] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:41:23] PROBLEM - dump of x1 in eqiad on alert1001 is CRITICAL: dump for x1 at eqiad taken more than 8 days ago: Most recent backup 2021-11-30 03:19:34 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:42:23] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:18:04] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:18:14] RECOVERY - Check for large files in client bucket on mwmaint1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[04:42:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:53] SRE, Fundraising-Backlog, Thank-You-Page, Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (Tsevener) This fix is in new release candidate Testflight 6.8.2 (1868).
[04:43:27] SRE, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Andrew) Open→Resolved fixed!
[04:46:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:23] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:13] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:50:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1028.eqiad.wmnet with OS buster
[04:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:19] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:17:59] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-12-08 04:01:29 (564 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:39:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:40:55] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:52:03] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:52:49] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:07:49] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:14:29] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:19:35] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 6.845e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:29:18] (CR) Razzi: "Added a question for Luca, this looks fine to me personally but I'd like him to weigh in :)" [puppet] - https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: Btullis)
[06:45:21] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:52:58] (PS5) Razzi: turnilo: add monitoring for node application [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729)
[06:53:32] (CR) Razzi: "Whoops, this stayed open for way long! Should be g2g now and still good" [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[08:18:19] SRE, Graphite, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (fgiunchedi) Thank you folks for investigating this! I am taking a look too and so far have failed to find anything of note
[08:19:28] SRE, Traffic-Icebox, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[08:22:24] SRE, Traffic-Icebox, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema) Stalled→Resolved
[08:22:28] SRE, Traffic, Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[08:22:36] SRE, Traffic-Icebox, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[08:32:39] SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[08:32:51] SRE, Traffic-Icebox, HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (ema) Open→Resolved a:ema Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Ti...
[08:33:03] SRE, Traffic-Icebox, HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (Majavah) Anything left to do here, now that all backends are using TLS?
[08:34:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS buster
[08:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:30] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS buster
[08:39:04] (CR) Filippo Giunchedi: "LGTM! see inline" [puppet] - https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559) (owner: Dzahn)
[08:42:39] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/744845 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite)
[08:44:05] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.11e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:44:30] (PS5) Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946)
[08:44:32] (PS4) Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946)
[08:44:34] (PS9) Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946)
[08:44:36] (PS9) Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946)
[08:44:38] (PS10) Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[08:44:43] (CR) Filippo Giunchedi: prometheus: support for blackbox configuration fragments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[08:52:17] SRE, observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (fgiunchedi) >>! In T212231#7553880, @Dzahn wrote: > Even though T210993 is open? Thanks! I am uploading a change to delete them. Yes, IIRC nothing is using those. To be honest at this point anything using...
[08:57:02] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 60 probes of 724 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:57:20] (PS5) Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305
[09:00:34] (CR) Ayounsi: [C: +1] "LGTM with 1 suggestion." [homer/public] - https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) (owner: Cathal Mooney)
[09:00:56] (PS2) WQuarshie: example-node-api chart creating chart for the exampl-node-api Bug:T288134 [deployment-charts] - https://gerrit.wikimedia.org/r/743483
[09:02:06] (CR) jerkins-bot: [V: -1] example-node-api chart creating chart for the exampl-node-api Bug:T288134 [deployment-charts] - https://gerrit.wikimedia.org/r/743483 (owner: WQuarshie)
[09:02:06] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 724 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:05:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS buster
[09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:53] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS buster completed: - ganeti2013 (**PASS**) - Downtimed on Icinga...
[09:12:10] (PS1) JMeybohm: RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog [deployment-charts] - https://gerrit.wikimedia.org/r/745196 (https://phabricator.wikimedia.org/T287130)
[09:14:27] jouncebot: nowandnext
[09:14:28] No deployments scheduled for the next 2 hour(s) and 45 minute(s)
[09:14:28] In 2 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T1200)
[09:15:06] (PS3) Majavah: Remove UserMerge rights from labswiki (wikitech) [mediawiki-config] - https://gerrit.wikimedia.org/r/743659
[09:15:15] (CR) Majavah: [C: +2] Remove UserMerge rights from labswiki (wikitech) [mediawiki-config] - https://gerrit.wikimedia.org/r/743659 (owner: Majavah)
[09:16:01] (Merged) jenkins-bot: Remove UserMerge rights from labswiki (wikitech) [mediawiki-config] - https://gerrit.wikimedia.org/r/743659 (owner: Majavah)
[09:18:22] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743659|Remove UserMerge rights from labswiki (wikitech)]] (duration: 01m 07s)
[09:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:03] * majavah done
[09:19:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:22:44] (PS1) Filippo Giunchedi: syslog: add netconsole::server [puppet] - https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265)
[09:23:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2014.codfw.wmnet with OS buster
[09:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:08] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS buster
[09:23:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:55] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32890/console" [puppet] - https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265) (owner: Filippo Giunchedi)
[09:26:59] (CR) Jelto: [V: +1 C: +2] gitlab_runner: create module for runner config and enable metrics [puppet] - https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[09:29:01] !log restarting blazegraph on wdqs1006 (jvm stuck for 24h)
[09:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:29] (PS1) Muehlenhoff: Make ganeti2027 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/745198 (https://phabricator.wikimedia.org/T294139)
[09:40:10] (PS1) Majavah: rabbitmq: Add support for listening on TLS [puppet] - https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268)
[09:40:22] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: Majavah)
[09:41:40] (PS1) Jelto: gitlab_runner: enable protected runners [puppet] - https://gerrit.wikimedia.org/r/745201 (https://phabricator.wikimedia.org/T295481)
[09:41:45] I'd like to ask for review of a Varnish config patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/742148 -- any suggested reviewers I should add?
[09:42:45] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:43:55] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:44:39] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 7.972e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:45:05] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:46:09] awight: I'd ask the traffic team (#wikimedia-traffic)
[09:46:45] majavah: thanks!
[09:46:58] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32892/console" [puppet] - 10https://gerrit.wikimedia.org/r/745201 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:49:21] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 88.79 ms [09:49:50] (03CR) 10Awight: Maps are invariant to revid parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742148 (https://phabricator.wikimedia.org/T296512) (owner: 10Awight) [09:50:35] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.63 ms [09:53:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: enable protected runners [puppet] - 10https://gerrit.wikimedia.org/r/745201 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:54:25] (03PS1) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [09:58:08] !log remove all users from obsolete "shell" and "clouadmin" groups on labtestwiki (labtestwikitech.wikimedia.org) [09:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:25] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: gitlab-runner2001, cloudcephmon1002, cloudcephmon1001, cloudcephmon1003, wcqs1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [10:00:53] (03PS1) 10Jelto: gitlab_runner: quote parameters in gitlab-runner config [puppet] - 10https://gerrit.wikimedia.org/r/745203 (https://phabricator.wikimedia.org/T295481) [10:01:15] RECOVERY - Check systemd state on gitlab-runner1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2014.codfw.wmnet with OS buster 
[10:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:47] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS buster completed: - ganeti2014 (**PASS**) - Downtimed on Icinga... [10:05:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32894/console" [puppet] - 10https://gerrit.wikimedia.org/r/745203 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:08:17] (03PS2) 10Filippo Giunchedi: syslog: add netconsole::server [puppet] - 10https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265) [10:08:19] (03PS1) 10Filippo Giunchedi: netconsole: refactor targets lookup [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) [10:08:21] (03PS1) 10Filippo Giunchedi: graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) [10:09:22] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32895/console" [puppet] - 10https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:10:19] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32896/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [10:10:36] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32897/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 
(https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [10:11:18] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: quote parameters in gitlab-runner config [puppet] - 10https://gerrit.wikimedia.org/r/745203 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:13:14] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32898/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [10:13:26] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32899/console" [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:14:50] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Service-deployment-requests: New Service Request tegola-vector-tiles - https://phabricator.wikimedia.org/T274390 (10akosiaris) 05Open→03Resolved a:03akosiaris tegola has been deployed for some time now, so I am resolving this. Fe... 
[10:15:05] (03PS2) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [10:16:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [10:16:45] !log ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=testwiki --custom-groups=steward --force "Dom walden" [10:16:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32900/console" [puppet] - 10https://gerrit.wikimedia.org/r/745202 (owner: 10JMeybohm) [10:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:31] (03PS1) 10Jelto: gitlab_runner: add --config parameter to register command [puppet] - 10https://gerrit.wikimedia.org/r/745207 (https://phabricator.wikimedia.org/T295481) [10:18:04] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10akosiaris) This has been deployed for some time so I moved it to the Done column, but I see 2 remaining unchecked items in the Checklist section of... 
[10:18:21] (03PS1) 10JMeybohm: Add imagecatalog user to main and ml [labs/private] - 10https://gerrit.wikimedia.org/r/745208 (https://phabricator.wikimedia.org/T287130) [10:18:41] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add imagecatalog user to main and ml [labs/private] - 10https://gerrit.wikimedia.org/r/745208 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm) [10:21:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32902/console" [puppet] - 10https://gerrit.wikimedia.org/r/745207 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:21:13] (03CR) 10MVernon: [C: 03+1] "LGTM thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:21:40] (03CR) 10MVernon: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:22:30] (03CR) 10MVernon: "This seems sound to me, but I think a review from someone with more puppet knowledge than me would be sensible." 
[puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [10:22:49] !log cp3051: depool to enable single backend experiment T288106 [10:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:53] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [10:23:35] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add --config parameter to register command [puppet] - 10https://gerrit.wikimedia.org/r/745207 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:23:55] !log cp3051: stop ats-be and clear its cache T288106 [10:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:02] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] syslog: add netconsole::server [puppet] - 10https://gerrit.wikimedia.org/r/745197 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:25:33] (03CR) 10Ema: [C: 03+2] cache: enable single backend experiment on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/743910 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [10:25:53] PROBLEM - Disk space on gitlab-runner1001 is CRITICAL: DISK CRITICAL - /run/docker/netns/3cd24ebca5bb is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner1001&var-datasource=eqiad+prometheus/ops [10:28:11] PROBLEM - Disk space on gitlab-runner2001 is CRITICAL: DISK CRITICAL - /run/docker/netns/09187dbdcb54 is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner2001&var-datasource=codfw+prometheus/ops [10:29:04] (03CR) 10MVernon: [C: 03+2] swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) 
(owner: 10MVernon) [10:29:16] ^ thats me configuring gitlab-runners, should disappear when I'm finished [10:35:12] !log cp3051: repool w/ single backend experiment enabled T288106 [10:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:17] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [10:36:41] (03PS3) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [10:39:29] (03PS3) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) [10:40:42] (03PS4) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) [10:41:05] (03PS1) 10Jelto: gitlab_runner: fix config.toml syntax [puppet] - 10https://gerrit.wikimedia.org/r/745210 (https://phabricator.wikimedia.org/T295481) [10:42:29] (03CR) 10Cathal Mooney: [C: 03+2] Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) (owner: 10Cathal Mooney) [10:42:41] !log depool cp5006, the host is down T290005#7555417 [10:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:43:29] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32903/console" [puppet] - 10https://gerrit.wikimedia.org/r/745210 (https://phabricator.wikimedia.org/T295481) (owner: 
10Jelto) [10:44:20] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: fix config.toml syntax [puppet] - 10https://gerrit.wikimedia.org/r/745210 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:48:55] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:49:09] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:52:35] (03CR) 10Cathal Mooney: [C: 03+1] "Yep. It is what it is I guess. At least with automation it's not as big an issue as it might have been in previous years. Looks good." [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi) [10:53:47] PROBLEM - traffic_server backend process restarted on cp3051 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3051&var-layer=backend [10:55:15] (03CR) 10Cathal Mooney: [C: 03+2] Added option to disable Capirca ACL generation completely. (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/742942 (owner: 10Cathal Mooney) [10:55:55] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Added option to disable Capirca ACL generation completely. 
(032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/742942 (owner: 10Cathal Mooney) [10:58:10] (03PS3) 10Phuedx: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) [11:02:41] (03PS2) 10Phuedx: Update .mailmap to de-duplicate my email addresses [puppet] - 10https://gerrit.wikimedia.org/r/648239 [11:04:06] (03PS1) 10Filippo Giunchedi: prometheus: remove broken blackbox check for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/745213 (https://phabricator.wikimedia.org/T291946) [11:10:27] (03PS4) 10Phuedx: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [11:28:55] (03CR) 10David Caro: [C: 03+1] prometheus: remove broken blackbox check for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/745213 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:30:47] 10SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10fgiunchedi) >>! In T296289#7553346, @MatthewVernon wrote: > OK, I know what the problem is (at least at one level). Our swift front-ends use a bit of middleware wmf.rewrite wh... 
[11:47:01] (03PS4) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [11:47:45] (03CR) 10Ema: [C: 03+1] netconsole: refactor targets lookup [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [11:49:55] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:50:05] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:55:45] (03PS5) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [11:55:47] (03PS2) 10Majavah: rabbitmq: Add support for listening on TLS [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) [11:56:02] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [11:57:42] (03PS6) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [11:58:51] (03PS7) 10JMeybohm: WIP: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T1200). [12:00:05] No Gerrit patches in the queue for this window AFAICS. 
[12:00:13] \o/ [12:00:22] indeed, nothing to do [12:00:25] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32908/console" [puppet] - 10https://gerrit.wikimedia.org/r/745202 (owner: 10JMeybohm) [12:03:07] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.152e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:04:54] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Lucas_Werkmeister_WMDE) I can confirm that Lucas_WMDE on Libera Chat is my account, and as far as I’m aware it’s also appropriately password-protected and cloaked. [12:05:22] (03PS8) 10JMeybohm: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 (https://phabricator.wikimedia.org/T287130) [12:07:58] (03CR) 10JMeybohm: imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [12:13:10] (03PS8) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [12:14:05] (03CR) 10Hnowlan: cassandra: load grants files upon change (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [12:14:14] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the reviews." 
[puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [12:14:16] (03PS9) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [12:34:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10hnowlan) [12:38:08] (03PS1) 10Cathal Mooney: Updated iBGP policy to process local and remote routes differently. [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) [12:44:01] (03CR) 10Ayounsi: netconsole: refactor targets lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [12:59:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:44] (03PS1) 104nn1l2: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) [13:00:45] !log powercycle cp5006 T290005 [13:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:01:33] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:49] (03CR) 10Ayounsi: [C: 03+1] "Coherent with the task, change LGTM overall." [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:02:31] (03CR) 10Ayounsi: [C: 03+1] Updated iBGP policy to process local and remote routes differently. 
(031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:03:01] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 247.07 ms [13:03:05] PROBLEM - purged service on cp5006 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:03:31] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:03:41] PROBLEM - Webrequests Varnishkafka log producer on cp5006 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:04:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [13:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:55] PROBLEM - Check systemd state on cp5006 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:13] RECOVERY - purged service on cp5006 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:05:47] RECOVERY - Webrequests Varnishkafka log producer on cp5006 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:06:59] RECOVERY - Check systemd state on cp5006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:05] (03CR) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff) [13:24:38] (03CR) 10Btullis: Pmacct add sflow listener (035 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [13:30:58] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [13:33:44] (03PS1) 10Kormat: wmfdb/log: Update docstring style. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745225 [13:34:12] (03CR) 10Kormat: [V: 03+2 C: 03+2] wmfdb/log: Update docstring style. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745225 (owner: 10Kormat) [13:34:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:57] (03CR) 10Zabe: Remove redundant project namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [13:39:32] (03PS1) 10Kormat: setup.cfg: Tell mypy to ignore wmfdb/test/unit [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745247 [13:40:13] (03CR) 10Kormat: [V: 03+2 C: 03+2] setup.cfg: Tell mypy to ignore wmfdb/test/unit [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745247 (owner: 10Kormat) [13:40:27] (03PS2) 104nn1l2: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) [13:40:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [13:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:23] (03CR) 104nn1l2: Remove redundant project namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [13:41:44] (03CR) 10Btullis: [C: 03+1] "I'm happy with this." 
[puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [13:45:31] (03PS1) 10Kormat: wmfdb/section: Add class for handling of sections. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745249 [13:46:11] PROBLEM - Check systemd state on gitlab-runner1001 is CRITICAL: CRITICAL - degraded: The following units failed: gitlab-runner.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) > profile::beta::motd I have seen a dedicated MOTD in beta cluster, maybe it's not showing up because of its stand-alone puppetmaster and somethi... [13:54:54] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [13:56:33] !log drain primary/secondary instance off ganeti2015 T296622 [13:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:38] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [13:57:30] (03PS2) 10Esanders: Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) [13:57:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage [13:57:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage [13:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:53] (03PS3) 10Esanders: Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) [14:01:50] !log installing nss regression updates for stretch [14:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:08:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:17:01] !log drain primary/secondary instance off ganeti2020 T296622 [14:17:03] (03PS1) 10Jgiannelos: Install python kafka lib on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/745255 [14:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:06] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [14:18:28] (03PS2) 10Jgiannelos: Install python client for kafka on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/745255 (https://phabricator.wikimedia.org/T289771) [14:25:34] (03CR) 10Jgiannelos: "This is needed so we can be able to temporarily reset the offsets for tile pregeneration. 
At first I thought that kafkacat would be enough" [puppet] - 10https://gerrit.wikimedia.org/r/745255 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:28:46] (03CR) 10Majavah: "recheck" [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester) [14:37:11] (03CR) 10Hnowlan: [C: 03+2] Install python client for kafka on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/745255 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:53:39] (03CR) 10Krinkle: [C: 03+2] mediawiki.base: Add missing toString param to Message#escaped() [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744804 (https://phabricator.wikimedia.org/T292489) (owner: 10Krinkle) [14:56:41] majavah: once CI passes on that CodeMiror backport, I can roll that out at the same time [14:56:50] selenium flaky? [14:57:29] Krinkle: flaky or broken, this is the second recheck although it passed on master [14:57:47] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:53] feel free to deploy, although it's not my patch (just trying to do what I can to unblock the train) [14:59:23] that codemirror backport has now been sitting 30min in the Zuul queue :// [15:03:51] still failing [15:04:06] !log removing rest of wikiuser@localhost (T296537) [15:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:11] T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537 [15:05:31] (03PS1) 10Majavah: Empty change to test CI [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745265 [15:06:01] Error in "CodeMirror bracket match default.disables highlighting" [15:06:01] 23:59:50 Node is either not visible or not an HTMLElement [15:10:35] Seems to be unrelated to the patch. 
Also saw this error in the first failing main test build of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/744781. [15:12:50] it indeed seems unrelated, but it has failed 3 times in a row by now [15:19:51] (03Abandoned) 10Awight: Maps are invariant to revid parameter [puppet] - 10https://gerrit.wikimedia.org/r/742148 (https://phabricator.wikimedia.org/T296512) (owner: 10Awight) [15:20:49] (03Merged) 10jenkins-bot: mediawiki.base: Add missing toString param to Message#escaped() [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744804 (https://phabricator.wikimedia.org/T292489) (owner: 10Krinkle) [15:21:29] (03CR) 10jerkins-bot: [V: 04-1] Empty change to test CI [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745265 (owner: 10Majavah) [15:26:37] I discovered https://phabricator.wikimedia.org/P17638, which this failure seems to be at least a month old. [15:27:01] * ...which shows this... [15:27:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:53] Krinkle: still deploying? 
[15:39:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:40:20] (03PS1) 10MVernon: swift: update README for rewrite middleware [puppet] - 10https://gerrit.wikimedia.org/r/745271 [15:41:01] (03PS2) 10MVernon: swift: update README for rewrite middleware [puppet] - 10https://gerrit.wikimedia.org/r/745271 (https://phabricator.wikimedia.org/T296289) [15:42:17] 10SRE-swift-storage, 10Patch-For-Review: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10MatthewVernon) a:03MatthewVernon >>! In T296289#7555901, @fgiunchedi wrote: > AFAIK rewrite in puppet is now the canonical place for this work, i.e. S... [15:46:23] (03CR) 10JMeybohm: [C: 03+1] "this LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [15:46:56] majavah: arr. yes/no. let me just do this patch then [15:48:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) @Ladsgroup see that "classes" key in Hiera deployment-prep/common.yaml. Does it include "all classes listed in Hiera" ? 
[15:49:27] !log krinkle@deploy1002 Synchronized php-1.38.0-wmf.12/resources/src/mediawiki.base/: Ie9fa768c0dc1 (duration: 01m 06s) [15:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:31] and done :) [15:49:34] (03PS1) 10Jelto: gitlab_runner: rollback usage of dedicated module for runner config [puppet] - 10https://gerrit.wikimedia.org/r/745273 (https://phabricator.wikimedia.org/T295481) [15:50:05] (03PS1) 10Jgiannelos: maps: Install lib to handle compressed kafka messages [puppet] - 10https://gerrit.wikimedia.org/r/745274 (https://phabricator.wikimedia.org/T289771) [15:50:25] 10SRE, 10Traffic-Icebox, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) wow, what an epic task this was with many subtasks. quite the journey. congrats and thanks to all [15:51:00] (03CR) 10Jgiannelos: "This is related to the same issue we had here: https://gerrit.wikimedia.org/r/c/operations/software/tegola/+/730481" [puppet] - 10https://gerrit.wikimedia.org/r/745274 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [15:52:43] (03PS6) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [15:52:45] (03PS8) 10JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [15:55:31] (03PS7) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [15:55:33] (03PS9) 10JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [15:55:35] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32909/console" [puppet] - 10https://gerrit.wikimedia.org/r/745273 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:59:07] (03CR) 10Hashar: [C: 03+1] "iirc that was some oddity when we have scap provisioning repositories (such as on deployment-deploy01) and git::clone defines used to ship" [puppet] - 10https://gerrit.wikimedia.org/r/744839 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [15:59:25] (03PS10) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [16:00:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: rollback usage of dedicated module for runner config [puppet] - 10https://gerrit.wikimedia.org/r/745273 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:04:48] (03PS2) 10Cathal Mooney: Updated iBGP policy to process local and remote routes differently. [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) [16:06:44] (03CR) 10RhinosF1: "See inline - as far as I can see you've removed some non-redundant ones." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:11:46] (03PS2) 10Filippo Giunchedi: netconsole: refactor targets lookup [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) [16:11:48] (03PS2) 10Filippo Giunchedi: graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) [16:12:59] RECOVERY - Check systemd state on gitlab-runner1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32910/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [16:13:29] 10SRE, 10Graphite, 10Patch-For-Review, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) For the record, for testing purposes I've manually enabled netconsole on graphite1004 and pointed it to centrallog1001. Once the patch series above are merged the sa... 
[16:14:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32911/console" [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [16:14:56] (03CR) 10RhinosF1: [C: 04-1] Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:16:29] (03PS3) 10Filippo Giunchedi: netconsole: refactor targets lookup [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) [16:16:31] (03PS3) 10Filippo Giunchedi: graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) [16:18:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:19:15] (03CR) 10Filippo Giunchedi: "Default to broadcast MAC as per discussion with Arzhel" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [16:19:45] (03PS3) 104nn1l2: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) [16:21:28] (03CR) 10RhinosF1: [C: 03+1] Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:22:39] (03CR) 10RhinosF1: [C: 03+1] Remove redundant project namespace aliases (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:22:43] (03CR) 104nn1l2: Remove redundant project 
namespace aliases (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:23:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'm adding Aaron for his knowledge/information and in case we're missing something. Good to merge" [puppet] - 10https://gerrit.wikimedia.org/r/745271 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [16:23:38] (03PS4) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [16:23:53] (03CR) 10JMeybohm: calico: Allow to configure the IPAM module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [16:23:57] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) >>! In T280881#7555760, @akosiaris wrote: > @bd808, any news on those? It was not clear to me that these were tasks that the service reque... [16:27:28] (03CR) 10Majavah: [V: 03+2 C: 03+2] "Force merging. The unrelated flaky(?) test is failing on an empty wmf.12 patch too but passes on master." 
[extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester) [16:31:44] (03CR) 104nn1l2: Remove redundant project namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:33:08] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/CodeMirror/resources/modules/ve-cm/ve.ui.CodeMirror.init.less: Backport: [[gerrit:744803|Fix invalid reference to core resources/ directory (T296639)]] (duration: 01m 06s) [16:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:14] T296639: Less_Exception_Parser: File `resources/lib/ooui/wikimedia-ui-base.less` not found. in ve.ui.CodeMirror.init.less - https://phabricator.wikimedia.org/T296639 [16:33:21] * majavah done [16:34:00] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) [16:34:11] (03PS4) 104nn1l2: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) [16:35:04] (03CR) 104nn1l2: Remove redundant project namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [16:35:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:39:38] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) @lmata @fgiunchedi @colewhite and I discussed this at todays o11y team meeting. To summarize, we have at least two actionable options on the... [16:40:36] (03CR) 10Hnowlan: [C: 03+2] maps: Install lib to handle compressed kafka messages [puppet] - 10https://gerrit.wikimedia.org/r/745274 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [16:42:54] (03Abandoned) 10Majavah: Empty change to test CI [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745265 (owner: 10Majavah) [16:43:07] (03CR) 10MVernon: [C: 03+2] swift: update README for rewrite middleware [puppet] - 10https://gerrit.wikimedia.org/r/745271 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [16:44:22] 10SRE-swift-storage, 10Patch-For-Review: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10MatthewVernon) 05Open→03Resolved [16:45:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) >>! In T272559#7556556, @Dzahn wrote: > @Ladsgroup see that "classes" key in Hiera deployment-prep/common.yaml. Does it include "all classes liste... 
[16:46:39] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/745284 (https://phabricator.wikimedia.org/T288621) [16:47:23] (03CR) 10Cwhite: [C: 03+2] opensearch_dashboards: allow up to 64mb restore payload [puppet] - 10https://gerrit.wikimedia.org/r/744845 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [16:47:43] (03PS5) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [16:49:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:49:31] (03PS6) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [16:57:26] (03PS1) 10Clare Ming: Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) [16:58:25] (03CR) 10jerkins-bot: [V: 04-1] Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [17:00:56] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Legoktm) I thought I had replied earlier, for now the plan is to test POSTing large files to Shellbox, identify what layers it fails at and fix those. A basic test wou... 
[17:01:07] (03CR) 10Herron: [C: 03+1] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/745284 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [17:02:27] (03PS1) 10Ladsgroup: deployment-prep: Remove motd class from hiera [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) [17:05:46] (03PS2) 10Clare Ming: Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) [17:06:07] (03PS1) 10Ladsgroup: xdummy: Remove xdummy [puppet] - 10https://gerrit.wikimedia.org/r/745287 (https://phabricator.wikimedia.org/T133183) [17:11:05] (03CR) 10Ladsgroup: "The link to the commit doesn't work: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/ffa9923c4263418c85e198a590a2e4" [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [17:17:24] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10Legoktm) @fgiunchedi two other things that it would be good to have your input on: 1. How critical is it that graphite stays up? If it goes down again, sho... [17:28:18] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [17:28:25] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:16] (03CR) 10Cathal Mooney: [C: 03+2] Updated iBGP policy to process local and remote routes differently. 
(031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [17:29:27] (03PS12) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [17:29:39] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Updated iBGP policy to process local and remote routes differently. [homer/public] - 10https://gerrit.wikimedia.org/r/745218 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [17:31:03] (03CR) 10jerkins-bot: [V: 04-1] WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [17:34:20] (03PS13) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [17:35:38] (03PS14) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [17:50:48] (03CR) 10Herron: [C: 03+1] profile: add exim4 blackhole configuration [puppet] - 10https://gerrit.wikimedia.org/r/743207 (https://phabricator.wikimedia.org/T296373) (owner: 10Filippo Giunchedi) [17:51:28] (03CR) 10Herron: [C: 03+1] prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [17:54:20] (03PS15) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [17:56:22] (03PS16) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [18:01:38] (03PS17) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [18:03:39] (03PS18) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [18:06:58] (03PS19) 10Eigyan: wmf-config: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [18:10:49] (03CR) 10Eigyan: wmf-config: Deploy GDI survey to cawiki and fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [18:15:50] (03CR) 10Jgiannelos: tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos) [18:19:45] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10brennen) [18:23:57] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:00] (03PS3) 10Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) [18:27:01] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10herron) I see the running kernel is now the only version installed, looks good. Is there anything else to do before re-enabling puppet on mx20... 
[18:27:48] (03PS4) 10Jsn.sherman: Enable TheWikipediaLibrary on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) [18:31:05] (03CR) 10Jsn.sherman: "Marking all as resolved and dropping -1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [18:42:47] (03PS1) 10Jgiannelos: maps: Add kafka helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) [18:55:53] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.69% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [18:59:30] (03PS1) 10Majavah: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745301 [19:00:04] dancy and brennen: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T1900) [19:00:04] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T1900). [19:00:04] No Gerrit patches in the queue for this window AFAICS. [19:00:21] no scheduled patches apparently, I'll use this opportunity to update the iw cache [19:00:33] Triage meeting was moved to tomorrow [19:01:20] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm_WMF (GovWiki): Investigate and restore foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10Urbanecm_WMF) a:05Urbanecm_WMF→03RLazarus Hello @RLazarus, I discussed this internally, and the conclusion was the hard r... 
[19:01:21] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:42] (03CR) 10Majavah: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745301 (owner: 10Majavah) [19:02:38] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745301 (owner: 10Majavah) [19:05:24] !log taavi@deploy1002 Synchronized wmf-config/interwiki.php: Config: [[gerrit:745301|Update interwiki cache]] (duration: 01m 06s) [19:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:57] !log utc evening deploys done [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:21] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10RLazarus) 05Open→03Resolved Perfect! Agree there's no need for a test in that case, we can call this finished. Thanks for following up. [19:06:35] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10RLazarus) [19:07:36] (03CR) 10RLazarus: "For anyone history-diving: in the linked bug we decided not to restore the test, so please mentally remove "Temporarily" from the commit m" [puppet] - 10https://gerrit.wikimedia.org/r/742543 (https://phabricator.wikimedia.org/T296687) (owner: 10RLazarus) [19:08:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [19:12:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:19:13] (03PS5) 10Jsn.sherman: Enable TheWikipediaLibrary on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) [19:21:43] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:28] (03CR) 10Andrew Bogott: [C: 03+1] openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:27:40] (03CR) 10Jsn.sherman: "Hi Essex! cawiki config looks good to me, but we should clean up fawiki IMO." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:37:41] (03PS1) 10Ssingh: durum: show a message about the limitations of the DoH protocol [puppet] - 10https://gerrit.wikimedia.org/r/745307 [19:39:34] (03PS1) 10Jgiannelos: tegola-vector-tiles: Reduced parallelism in pregeneration workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 [19:39:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32912/console" [puppet] - 10https://gerrit.wikimedia.org/r/745307 (owner: 10Ssingh) [19:41:01] (03CR) 10Jgiannelos: "Pregeneration logs show some connection errors because of lack of PG connections. Lets reduce for now the parallelism of the tile pregener" [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 (owner: 10Jgiannelos) [19:43:22] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: show a message about the limitations of the DoH protocol [puppet] - 10https://gerrit.wikimedia.org/r/745307 (owner: 10Ssingh) [19:46:59] (03PS20) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [19:49:13] (03CR) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:50:12] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10jhathaway) thanks for the reminder @herron I will do that now [19:51:33] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos) [19:51:55] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Reduced parallelism in pregeneration workers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 (owner: 10Jgiannelos) [19:53:41] (03PS1) 10Ssingh: test_dns: update tests for durum [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/745309 [19:54:20] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos) [19:54:36] is the deployment window still open? [19:54:43] (03CR) 10Ssingh: [C: 03+2] test_dns: update tests for durum [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/745309 (owner: 10Ssingh) [19:55:26] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Reduced parallelism in pregeneration workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 (owner: 10Jgiannelos) [19:55:34] cjming: it is scheduled to close in 5 minutes [19:57:01] ah no worries then - I can sign up for the next one [19:58:07] (03Merged) 10jenkins-bot: tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos) [20:00:04] dancy and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T2000). 
[20:01:19] 10SRE, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [20:02:22] 10SRE, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [20:02:36] 10SRE, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [20:03:12] (03PS2) 10Jgiannelos: tegola-vector-tiles: Reduced parallelism in pregeneration workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 [20:03:32] (03CR) 10Jgiannelos: [V: 03+2 C: 03+2] tegola-vector-tiles: Reduced parallelism in pregeneration workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 (owner: 10Jgiannelos) [20:05:46] (03CR) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [20:06:50] o/ [20:06:58] Are we ready to do this thing? [20:07:10] (03PS21) 10Eigyan: wmf-config: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:07:32] (03Merged) 10jenkins-bot: tegola-vector-tiles: Reduced parallelism in pregeneration workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/745308 (owner: 10Jgiannelos) [20:11:51] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[20:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:00] o/ [20:12:06] seems like it [20:12:26] (03PS22) 10Eigyan: wmf-config: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:12:32] There seems to be pending activity on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/742763/ [20:12:34] yeah, that. [20:13:46] Not +2'd yet so I'm moving forward [20:14:17] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745313 [20:14:19] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745313 (owner: 10Ahmon Dancy) [20:14:23] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745314 [20:15:01] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745314 (owner: 10Jgiannelos) [20:15:05] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745313 (owner: 10Ahmon Dancy) [20:16:20] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [20:16:25] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12 refs T293953 [20:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:29] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [20:17:30] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.12 refs T293953 (duration: 01m 05s) [20:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:53] The deed is done [20:18:07] * brennen watches logs [20:18:14] let the errors begin [20:18:19] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745314 (owner: 10Jgiannelos) [20:18:25] .12 i/c/GlobalVarConfig:59 GlobalVarConfig::get: undefined option: 'UseAjax' [20:18:33] that'll be a rollback, i expect. [20:18:46] I'm doing it now. [20:19:05] (tho only 5 so far.) [20:19:16] (but... seems like a Bad Thing.) 
[20:19:27] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745315 [20:19:29] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745315 (owner: 10Ahmon Dancy) [20:20:11] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745315 (owner: 10Ahmon Dancy) [20:21:27] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293953 [20:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:31] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [20:21:52] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745314 (owner: 10Jgiannelos) [20:22:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:31] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293953 (duration: 01m 04s) [20:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:51] 10SRE, 10Analytics, 10Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10razzi) a:05razzi→03None I haven't been working on this for months; putting it up for grabs. Definitely still worth doing. [20:26:23] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[20:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:15] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [20:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:56] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:00] !log enable exim on mx2001 [20:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:45] fix for the UseAjax issue: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/3D/+/745317 [20:36:12] +2'd [20:36:23] dancy, brennen: ok for me to backport that now? [20:36:56] (03PS1) 10Majavah: Remove use of $wgUseAjax [extensions/3D] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745238 [20:37:00] (03CR) 10Eigyan: wmf-config: Deploy GDI survey to cawiki and fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [20:38:10] majavah: thanks, and should be clear to backport but i'll wait for dancy since he's driving the train. [20:38:55] (doubt he's up to anything on deploy box but...) [20:41:42] Oops, how did we not grep for that? 
[20:42:01] majavah: You have the conn
[20:42:07] probably searched for wgUseAjax
[20:42:50] (CR) Majavah: [C: +2] Remove use of $wgUseAjax [extensions/3D] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745238 (owner: Majavah)
[20:43:27] (PS1) Ladsgroup: Major fixes to maintenance/pruneRevData.php [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745239 (https://phabricator.wikimedia.org/T290769)
[20:43:33] zabe: Probably.
[20:44:02] * James_F We should also drop it from mediawiki/vagrant's categorytree role.
[20:45:40] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:47:03] (Merged) jenkins-bot: Remove use of $wgUseAjax [extensions/3D] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745238 (owner: Majavah)
[20:48:32] verified the fix via mwdebug and testwiki, syncing
[20:48:44] it's also in https://gerrit.wikimedia.org/g/mediawiki/extensions/OnlineStatus/+/a82c1b1fce5c1fdee9501e1b184be3c597e61b4a/OnlineStatus.body.php if that's used at all
[20:48:50] (as well as vagrant)
[20:49:52] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/3D/src/Hooks.php: Backport: [[gerrit:745238|Remove use of $wgUseAjax]] (duration: 01m 07s)
[20:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:16] Thanks majavah for fixing this.
Let me know once you're done so I move ahead to a quick deploy
[20:50:17] dancy: I'm done deploying
[20:50:21] jouncebot: nowandnext
[20:50:22] For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T2000)
[20:50:22] In 0 hour(s) and 9 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T2100)
[20:50:32] majavah: thx
[20:51:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:56] dancy: Hmm, do RL module misses go into the logs? If so https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#Removal_of_jquery.jStorage_imminent;_the_default_WatchlistNotice_gadget_is_affected might make group1 noisy. :-(
[20:52:30] sounds great
[20:52:44] * James_F sighs.
[20:53:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:43] (PS1) Ssingh: durum: set cache-control headers [puppet] - https://gerrit.wikimedia.org/r/745323
[20:55:58] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32913/console" [puppet] - https://gerrit.wikimedia.org/r/745323 (owner: Ssingh)
[21:00:04] dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T2000).
[21:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211208T2100).
[21:01:44] I'll roll forward to group1 when https://phabricator.wikimedia.org/T297318 is cleared.
[21:02:45] I can deploy it
[21:02:56] the patch looks sensible and small enough
[21:03:05] thx
[21:03:11] (PS1) Ladsgroup: Make sure 'enable-toc' key is set [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745240 (https://phabricator.wikimedia.org/T297318)
[21:03:23] (CR) Ladsgroup: [C: +2] Make sure 'enable-toc' key is set [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745240 (https://phabricator.wikimedia.org/T297318) (owner: Ladsgroup)
[21:03:34] Much wikilove for the rapid responses.
[21:04:04] amir1: are you going to +2 the patch on master too?
[21:04:16] I did I think
[21:04:25] yup
[21:09:26] SRE, Foundational Technology Requests, Traffic, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (DAbad) 2021-12-08 Tech Steering Committee - seems like a small amount of effort - need by December 17th
[21:12:45] SRE, SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (ssingh) Mostly curious: is there anything that needs to be done for this as part of clinic duty?
[21:13:06] (CR) Ssingh: [V: +1 C: +2] durum: set cache-control headers [puppet] - https://gerrit.wikimedia.org/r/745323 (owner: Ssingh)
[21:13:31] I keep reading "drum set cache-control headers"
[21:14:02] dancy: ha! this is named after https://en.wikipedia.org/wiki/Durum :)
[21:14:44] dancy: https://twitter.com/chromakode/status/1404506174825791491
[21:14:55] haha
[21:15:29] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:23:14] Is HHVM still a thing?
[21:23:17] SRE, foundation.wikimedia.org, serviceops, User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (Urbanecm_WMF)
[21:26:10] (Merged) jenkins-bot: Make sure 'enable-toc' key is set [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745240 (https://phabricator.wikimedia.org/T297318) (owner: Ladsgroup)
[21:29:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:28] dancy: Not in production but there can be some documentations, some stuff named like that but not actually being hhvm
[21:30:42] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikidataPageBanner/includes/WikidataPageBanner.php: Backport: [[gerrit:745240|Make sure 'enable-toc' key is set (T297318)]] (duration: 01m 05s)
[21:30:44] our infra is full of surprises
[21:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:30:47] T297318: PHP Notice: Undefined index: enable-toc - https://phabricator.wikimedia.org/T297318
[21:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:56] the patch has been deployed
[21:32:50] (CR) Andrew Bogott: [C: +2] dynamicproxy: Validate route project [puppet] - https://gerrit.wikimedia.org/r/742267 (https://phabricator.wikimedia.org/T129800) (owner: Majavah)
[21:38:54] ok..
let's try again
[21:39:46] (PS1) Ahmon Dancy: group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - https://gerrit.wikimedia.org/r/745331
[21:39:48] (CR) Ahmon Dancy: [C: +2] group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - https://gerrit.wikimedia.org/r/745331 (owner: Ahmon Dancy)
[21:40:36] (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - https://gerrit.wikimedia.org/r/745331 (owner: Ahmon Dancy)
[21:41:52] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12 refs T293953
[21:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:56] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953
[21:42:56] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.12 refs T293953 (duration: 01m 04s)
[21:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:50] SRE, Scap, Release-Engineering-Team (Radar): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (dancy) Is this still a problem?
[21:51:38] lot quieter this go.
[21:51:46] nod
[21:52:19] Is the Internet still online?
[21:54:00] Sadly.
[21:54:00] I think it was a mistake, but it's too late to put the Internet back in the box now.
[21:54:14] It's just a fad
[21:57:05] putting it into blockchain would definitely fix it
[21:57:26] I mean, yes, in the sense that everything would stop running.
[22:00:57] SRE, Scap: Decide on /var/lib vs /home as locations of homedir for l10nupdate - https://phabricator.wikimedia.org/T163288 (dancy) Open→Declined
[22:12:42] I'm deploying now
[22:14:00] brennen dancy ^
[22:14:10] :+:!
[22:14:17] 👍🏾 there we go
[22:14:18] ubn security issue (T297322)
[22:14:41] ack, thx
[22:16:26] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.12/includes/actions/: T297322 (duration: 01m 05s)
[22:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:10] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.9/includes/actions/: T297322 (duration: 01m 05s)
[22:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:35] SRE, ops-eqdfw, DC-Ops: eqdfw:pdus - https://phabricator.wikimedia.org/T295921 (Papaul)
[22:43:55] SRE, ops-eqdfw, DC-Ops: eqdfw:pdus - https://phabricator.wikimedia.org/T295921 (Papaul) Open→Resolved This is complete
[22:45:04] @amir1 thanks for deploying the WikidataPageBanner patch
[22:45:12] I only just saw the email
[22:45:23] yw :)
[22:53:48] (PS1) Jdlrobson: Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470)
[22:55:11] (PS1) Jdlrobson: Set higher specificity [extensions/MobileFrontend] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745241 (https://phabricator.wikimedia.org/T171726)
[22:58:05] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) After putting in the new PDU's we still have the same problem.
[23:55:18] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Urbanecm) +1 from me, @Zabe should be eligible as one of the current +2 holders.
[23:56:51] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 439, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:57:15] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.69% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[23:58:11] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.039 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[23:59:38] ^ looks like we lost graphite1004 again, cc legoktm, herron
[23:59:53] ffs
[23:59:53] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state