[00:01:22] (03CR) 10Tim Starling: Unprovision the "swift" dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: 10Tim Starling) [00:03:45] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:05:45] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:51] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:18:39] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:35] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:17] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:13] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:25] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.383 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:39:25] RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [00:45:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1012:9101 has fallen behind applying updates 
from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:47:29] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:29] RECOVERY - puppet last run on wdqs1012 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:30:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:45:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:58:59] 
PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:33] (03PS11) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [06:35:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:20] (03PS12) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [06:43:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:25] (03PS13) 10Jameel Kaisar: Serve an HTTP response for measurement 
domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [06:51:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:17] (03CR) 10Marostegui: "Amir, how do you feel about this?" [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: 10Marostegui) [06:52:55] (03PS14) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [06:56:58] (03PS1) 10Kosta Harlan: LevelingUpManager: Handle links/link-recommendation collision [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900684 (https://phabricator.wikimedia.org/T332309) [07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:04:17] (03PS15) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:05:14] 10SRE, 10MediaWiki-Shell, 10WMF-General-or-Unknown, 10Security, 10Sustainability (Incident Followup): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (10Joe) 05Open→03Declined There is no point in working on firejail profiles given we've introduced shellbox in... 
[07:10:15] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:35] (03PS16) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:22:04] (03PS17) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:23:41] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:53] (03PS18) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:39:32] (03PS19) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:41:01] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:17] (03PS1) 10KartikMistry: Update cxserver to 2023-03-17-133444-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) [07:48:19] (03PS20) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [07:58:59] (03PS1) 10Giuseppe Lavagetto: sre: add alerting for poolcounter [alerts] - 10https://gerrit.wikimedia.org/r/900962 (https://phabricator.wikimedia.org/T83729) [07:59:40] (03PS21) 
10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [08:00:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 138915 [08:01:27] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:37] 10SRE, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Joe) I think there is a larger topic of moving etcd to use the new PKI certs. There has been some work in that direction but I think t... [08:05:51] (03PS22) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [08:06:26] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: deploy prometheus alerts to all instances [alerts] - 10https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:06:31] (03PS2) 10Filippo Giunchedi: o11y: deploy prometheus alerts to all instances [alerts] - 10https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182) [08:12:09] (03PS23) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [08:15:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138915 [08:18:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'show' for AS: 138915 [08:18:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'show' for AS: 138915 [08:20:13] (03PS1) 10Muehlenhoff: buster 
updates [puppet] - 10https://gerrit.wikimedia.org/r/901116 [08:26:41] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/901116 (owner: 10Muehlenhoff) [08:35:09] 10SRE, 10Infrastructure-Foundations: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:36:30] (03PS2) 10Filippo Giunchedi: traffic: remove EdgeTrafficDrop [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) [08:37:26] (03CR) 10Filippo Giunchedi: traffic: remove EdgeTrafficDrop (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:37:49] (03PS1) 10Muehlenhoff: Make krb2002 a KDC [puppet] - 10https://gerrit.wikimedia.org/r/901117 (https://phabricator.wikimedia.org/T331695) [08:41:33] (03PS1) 10Elukey: profile::cache::purge: move purged to a new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) [08:42:41] (03CR) 10Muehlenhoff: [C: 03+2] Make krb2002 a KDC [puppet] - 10https://gerrit.wikimedia.org/r/901117 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [08:43:20] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40216/console" [puppet] - 10https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [08:43:41] (03CR) 10Elukey: profile::cache::purge: move purged to a new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [08:53:38] (03PS24) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [08:56:56] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail 
handling - https://phabricator.wikimedia.org/T331138 (10jcrespo) @KOfori FYI, this (and T330942) is the latest of the usual weekly mediawiki file workflow bug (read above), as you inquired about it recently. [08:59:57] (03PS25) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [09:01:45] (03PS1) 10Filippo Giunchedi: benthos: notify service on env change [puppet] - 10https://gerrit.wikimedia.org/r/901122 [09:08:25] PROBLEM - Check systemd state on krb2002 is CRITICAL: CRITICAL - degraded: The following units failed: krb5-admin-server.service,krb5-kdc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:19] 10SRE, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans) Btw, slightly related, I made an experiment to generate requestctl objects starting from the selected filters in the [[ https://wikitech.wikimedia.o... [09:21:50] !log Repooling parse2004 - T332119 [09:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:56] T332119: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 [09:22:24] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (10Clement_Goubert) Thanks ! 
[09:43:38] (03PS1) 10David Caro: kubernetes: set NO_HOME for bulidservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 [09:48:31] (03CR) 10Clément Goubert: [C: 03+2] profile::mediawiki::deployment::server: Don't pass HELM_* vars to train presync [puppet] - 10https://gerrit.wikimedia.org/r/900731 (https://phabricator.wikimedia.org/T331479) (owner: 10Ahmon Dancy) [09:52:47] (03CR) 10Elukey: [C: 03+1] benthos: notify service on env change [puppet] - 10https://gerrit.wikimedia.org/r/901122 (owner: 10Filippo Giunchedi) [09:54:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2552 [09:54:04] (03CR) 10Btullis: "Looking good, thanks. One suggestion about the use of an existing defined type for setting sysfs parameters. Also I don't think we will us" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [09:54:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2552 [09:54:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58655 [09:55:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58655 [09:55:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 141082 [09:55:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141082 [09:56:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12956 [09:56:20] (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: notify service on env change [puppet] - 10https://gerrit.wikimedia.org/r/901122 (owner: 10Filippo Giunchedi) [09:56:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12956 [09:56:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with 
action 'email' for AS: 36692 [09:57:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36692 [09:59:35] (03CR) 10Btullis: "You'll need to bump the version number on the chart, but other than that it's a +1 from me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1000) [10:02:43] (03CR) 10Muehlenhoff: remove role::webserver_misc_apps from sre module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900464 (owner: 10Dzahn) [10:03:14] (03CR) 10David Caro: [V: 03+1] "Tested in toolsbeta" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [10:06:38] (03CR) 10David Caro: "Wait, not this one yet" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [10:10:57] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8705666, @Eevans wrote: >>>! In T310980#8705537, @elukey wrote: >> okok this is the part that I wasn't unclear about - we'd just deploy cqlsh in another way, lik... 
[10:19:09] (03PS2) 10David Caro: kubernetes: set NO_HOME for bulidservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 [10:19:25] (03CR) 10David Caro: [V: 03+1] "Now it's tested :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [10:19:50] (03CR) 10CI reject: [V: 04-1] kubernetes: set NO_HOME for bulidservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [10:20:07] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:39] RECOVERY - mediawiki-installation DSH group on parse2004 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:23:42] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10Volans) [10:26:54] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 (10Volans) Removing I/F as all the proposed solutions falls into the Traffic realm. 
[10:31:52] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (10Volans) [10:34:42] (03CR) 10Vgutierrez: [C: 03+1] sre: add alerting for poolcounter [alerts] - 10https://gerrit.wikimedia.org/r/900962 (https://phabricator.wikimedia.org/T83729) (owner: 10Giuseppe Lavagetto) [10:35:10] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (10Volans) [10:42:27] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Volans) [10:44:05] 10SRE-Sprint-Week-Sustainability-March2023, 10PoolCounter, 10serviceops, 10Patch-For-Review, and 2 others: Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (10Joe) [10:45:06] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup), and 2 others: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (10Joe) a:03Joe [10:46:04] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682 (10Vgutierrez) [10:47:44] (03PS1) 10JMeybohm: Move to demjson3 and install jsonnet-lint 0.19.1 [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901138 [10:48:42] (03PS1) 10JMeybohm: Add .gitreview [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901139 [10:51:25] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - 
https://phabricator.wikimedia.org/T330682 (10Vgutierrez) [10:52:22] (03PS1) 10Filippo Giunchedi: sre: deploy zk alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901140 (https://phabricator.wikimedia.org/T309182) [10:54:56] (03CR) 10Krinkle: [C: 03+2] rdbms: Add db_log_category=performance to TransactionProfiler [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/898725 (owner: 10Krinkle) [10:54:59] (03CR) 10Krinkle: [C: 03+2] rdbms: Add missing QUERY_CHANGE_ flag to internal "USE" query [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900667 (https://phabricator.wikimedia.org/T332228) (owner: 10Jforrester) [11:01:05] (03PS1) 10Giuseppe Lavagetto: sre: add redis memory full alert [alerts] - 10https://gerrit.wikimedia.org/r/901141 (https://phabricator.wikimedia.org/T110169) [11:02:47] (03PS2) 10Filippo Giunchedi: sre: deploy zk alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901140 (https://phabricator.wikimedia.org/T309182) [11:03:06] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Joe) [11:06:15] (03PS1) 10Filippo Giunchedi: sre: deploy kafka alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901142 (https://phabricator.wikimedia.org/T309182) [11:06:59] (03PS1) 10Kosta Harlan: PostEdit: Increment the edit-count-for-task-type count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900685 (https://phabricator.wikimedia.org/T332319) [11:07:18] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Continuous-Integration-Config, 10Regression, 10Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (10Joe) 05Open→03Invalid a:03Joe We've dismissed... 
[11:07:32] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068 (10Joe) [11:07:33] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:47] (03PS1) 10Kosta Harlan: TryNewTask: Set an array fallback if TryNewTaskOptOuts is null [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 [11:08:08] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: deploy zk alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901140 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:08:11] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: deploy kafka alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901142 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:08:18] (03PS3) 10Filippo Giunchedi: sre: deploy zk alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901140 (https://phabricator.wikimedia.org/T309182) [11:09:06] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10Joe) a:03Joe [11:11:03] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T317813) [11:11:26] (03PS2) 10Filippo Giunchedi: sre: deploy kafka alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/901142 (https://phabricator.wikimedia.org/T309182) [11:12:15] (03Merged) 10jenkins-bot: rdbms: Add db_log_category=performance to TransactionProfiler [core] (wmf/1.40.0-wmf.27) - 
10https://gerrit.wikimedia.org/r/898725 (owner: 10Krinkle) [11:12:20] (03Merged) 10jenkins-bot: rdbms: Add missing QUERY_CHANGE_ flag to internal "USE" query [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900667 (https://phabricator.wikimedia.org/T332228) (owner: 10Jforrester) [11:17:21] TheresNoTime: noticing a lock held on deploy2002 [11:17:39] $ l /var/lock/scap-global-lock [11:17:39] -rw-rw-rw- 1 samtar wikidev 0 Mar 20 10:57 /var/lock/scap-global-lock [11:17:41] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Joe) 05Open→03Stalled p:05Medium→03Low I fail to see how this task is related to incident followu... [11:18:09] (03PS1) 10Filippo Giunchedi: kafka: broker replica lag alert moved to AM [puppet] - 10https://gerrit.wikimedia.org/r/901167 (https://phabricator.wikimedia.org/T309010) [11:18:55] Krinkle: I didn't add that.. [11:19:27] *I didn't explicitly add that - not been on `deploy2002` today until just now [11:20:14] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Joe) Additionally: strict mode is on by default since mariadb 10.2, this has nothing to do with productio... [11:20:20] ok, let's see what scap does. Maybe this is no longer used by Scap and it was left over somehow [11:20:31] huh, Scap just goes ahead [11:20:56] ok, so I guess this was done on deploy1002, replicated, and then mtime bumped by me just now trying to create it with `touch`. [11:21:04] okay, phew.. [11:21:29] although the one there is not yours, that one is by root:root and Mar 6. 
[11:21:36] so I guess you did somehow create the file [11:21:39] :shrug: [11:23:02] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (10Vgutierrez) [11:24:45] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Epic, 10MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), and 3 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Joe) [11:25:46] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10akosiaris) [11:26:06] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10jcrespo) >>! In T108255#8709276, @Joe wrote: > I fail to see how this task is related to incident followu... 
[11:26:22] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (10Vgutierrez) [11:27:54] (03PS1) 10Muehlenhoff: Add systemd override to allow KDC to write to it's log file [puppet] - 10https://gerrit.wikimedia.org/r/901170 (https://phabricator.wikimedia.org/T331695) [11:28:18] (03CR) 10JMeybohm: spark-operator: enable spark operator mutation webhook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [11:31:49] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10Joe) [11:32:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901170 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [11:35:31] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... 
- https://phabricator.wikimedia.org/T172497 [11:35:51] !log krinkle@deploy2002 Synchronized php-1.40.0-wmf.27/includes/libs/rdbms/: (no justification provided) (duration: 15m 28s) [11:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:08] (03PS2) 10Vgutierrez: P:cache::varnish::frontend: Add parameter to enable requestctl on hits [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [11:36:13] Krinkle: What, the scap lock doesn't actually lock scap? [11:36:25] This seems like an issue [11:36:47] claime: well, I'm inclined to think it does but under a different file. [11:37:20] but yeah. the fact that the global one doesn't seem to work is also an issue [11:37:27] given that's for example what we use to lock the eqiad one right now [11:37:33] maybe worth running `scap lock` and seeing if that works..? [11:38:08] my deploy is done [11:38:13] -rw-rw-rw- 1 samtar wikidev 0 Mar 20 10:57 scap-global-lock [11:38:18] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... - https://phabricator.wikimedia.org/T172497 [11:38:22] it's still there :) [11:38:28] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10observability, 10Epic, 10Sustainability (Incident Followup): Database alerting - https://phabricator.wikimedia.org/T172492 (10Joe) [11:38:35] Given that's what I added in the switchdc documentation, I'd reaaaally like it to work lol [11:38:42] claime: I mentioned it in -releng. Once US wakes up I hope someone can look into that. 
[11:38:51] Krinkle: Fantastic, tahnks [11:38:53] (03PS3) 10Vgutierrez: P:cache::varnish::frontend: Add parameter to enable requestctl on hits [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [11:38:53] thanks* [11:39:19] (03CR) 10Vgutierrez: "fixed merge conflicts and updated styling on VCL files" [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [11:39:51] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/901170 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [11:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:43:14] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10Volans) The mail dashboard has already a quick display of the queues, I've a... [11:44:57] (03CR) 10Vgutierrez: [C: 03+1] P:cache::varnish::frontend: Add parameter to enable requestctl on hits [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [11:45:19] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (10MatthewVernon) I'm not sure how much this will actually help from the swift side (as oppos... 
[11:48:04] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10Volans) I've spoken with the people involved, and the original request has been me... [11:50:58] 10SRE-Sprint-Week-Sustainability-March2023, 10DNS, 10Traffic, 10Sustainability (Incident Followup): Automate DNS depools such that manual commits are not required - https://phabricator.wikimedia.org/T303219 (10Vgutierrez) [11:51:05] (03CR) 10Majavah: "Likely the real solution for your problem is to unset the toolforge: tool label for buildservice based tools." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [11:53:16] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10Volans) [11:54:09] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Vgutierrez) [11:54:15] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10SRE Observability, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171 (10Volans) [11:56:43] (03CR) 10Btullis: spark-operator: enable spark operator mutation webhook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [11:56:54] (03CR) 10Muehlenhoff: [C: 03+2] Add systemd override to allow KDC to write to it's log file [puppet] - 10https://gerrit.wikimedia.org/r/901170 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [11:59:20] (03CR) 10Effie Mouzeli: [C: 
03+1] thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [11:59:49] 10Puppet, 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10Volans) Although the principle still stands, I... [12:03:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:03:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:01] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... - https://phabricator.wikimedia.org/T172497 [12:04:18] (03PS1) 10Muehlenhoff: Fix override to pass full directory [puppet] - 10https://gerrit.wikimedia.org/r/901178 (https://phabricator.wikimedia.org/T331695) [12:06:18] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10jcrespo) > So I think that the original concern has been almo... 
[12:06:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:14] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Volans) 05Open... [12:07:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:16] 10SRE, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10Volans) [12:07:26] (03CR) 10Muehlenhoff: [C: 03+2] Fix override to pass full directory [puppet] - 10https://gerrit.wikimedia.org/r/901178 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [12:15:24] 10SRE, 10Observability-Logging, 10Release-Engineering-Team, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (10TheresNoTime) >>! In T332273#8702656, @colewhite wrote: > The `Media... 
[12:17:18] (03PS2) 10Giuseppe Lavagetto: sre: add alerting for poolcounter [alerts] - 10https://gerrit.wikimedia.org/r/900962 (https://phabricator.wikimedia.org/T83729) [12:18:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre: add alerting for poolcounter [alerts] - 10https://gerrit.wikimedia.org/r/900962 (https://phabricator.wikimedia.org/T83729) (owner: 10Giuseppe Lavagetto) [12:19:48] (03Merged) 10jenkins-bot: sre: add alerting for poolcounter [alerts] - 10https://gerrit.wikimedia.org/r/900962 (https://phabricator.wikimedia.org/T83729) (owner: 10Giuseppe Lavagetto) [12:21:04] 10Puppet, 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): A puppet run should not start if a box is under abnormal load. - https://phabricator.wikimedia.org/T84183 (10Volans) 05Open→03Invalid Resolving as invalid because is not very well d... [12:27:13] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10Joe) [12:30:11] 10SRE-Sprint-Week-Sustainability-March2023, 10conftool, 10serviceops-radar, 10Sustainability (Incident Followup): Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058 (10Joe) a:03Joe [12:31:46] (03CR) 10Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" (031 comment) [extensions/OAuth] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) (owner: 10Jforrester) [12:41:28] (03PS6) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [12:43:50] (03PS1) 10Kamila Součková: tests: Increase test timeout for tests run in Docker [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901194 [12:43:52] (03CR) 10Welcome, new contributor!: "Thank you for making your 
first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901194 (owner: 10Kamila Součková) [12:45:31] (03PS16) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [12:46:26] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:51:46] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:52:13] (03PS17) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [12:52:35] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:53:06] (03PS1) 10Btullis: Omit Python 3.7 in the analytics cluster after bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901196 (https://phabricator.wikimedia.org/T329363) [12:53:11] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:54:13] (03PS18) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [12:54:53] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901196 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [12:55:07] (03PS19) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 
(https://phabricator.wikimedia.org/T330151) [12:56:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) @Andrew can you let me know if these need dual 10g connection? [12:56:43] (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [12:57:18] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [12:58:15] (03PS1) 10Muehlenhoff: cuminunpriv: No need for component/spicerack on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901199 (https://phabricator.wikimedia.org/T331700) [12:58:38] (03CR) 10Hnowlan: [C: 03+2] tests: Increase test timeout for tests run in Docker [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901194 (owner: 10Kamila Součková) [12:58:49] (03CR) 10Hnowlan: [C: 03+1] tests: Increase test timeout for tests run in Docker [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901194 (owner: 10Kamila Součková) [12:58:52] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/901199 (https://phabricator.wikimedia.org/T331700) (owner: 10Muehlenhoff) [12:59:25] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1300). 
[13:00:05] koi, Aca, Kizule, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] Confirming my presence. I'd say I'm quite worried about the number of patches in the window, but I think we will sort it out. [13:00:18] hi [13:00:22] hi [13:00:25] the config ones go quickly, in theory :) [13:00:26] yeah, lots of patches in the window [13:00:27] Let's hope that Zuul will be nice to us as well. [13:00:38] unfortunately I can’t really deploy unless something’s super urgent [13:00:48] * kostajh offers a tribute to Zuul [13:01:24] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time Brian_King following up on this now https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:01:24] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 332 bytes in 1.054 second response time Brian_King following up on this now https://wikitech.wikimedia.org/wiki/Wikidata_query_servic [13:01:24] k [13:01:24] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.053 second response time Brian_King following up on this now https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ [13:01:37] do we have a deployer? [13:01:46] I don't think so. 
[13:01:53] I'm able to self-service deploy my patches, but not sure if I have the time to focus on everyone elses... if no one else is around, I can do it, though. [13:02:03] (03CR) 10Muehlenhoff: [C: 03+2] cuminunpriv: No need for component/spicerack on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901199 (https://phabricator.wikimedia.org/T331700) (owner: 10Muehlenhoff) [13:02:34] I can deploy :) [13:02:42] kostajh: did you want to self-serve yours first? [13:03:03] (or last, either way..) [13:03:04] TheresNoTime: maybe start the config ones first? what do you think? [13:03:16] last it is! :D [13:03:28] TheresNoTime: I guess I can start +2'ing the backports I have now, if that's ok [13:03:32] koi: we'll start with yours [13:04:12] kostajh: I don't know how well `scap backport` handles that, maybe worth waiting until I'm part way through? [13:04:47] (03PS1) 10Kosta Harlan: changeprop-jobqueue: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/901202 (https://phabricator.wikimedia.org/T331616) [13:04:49] ok i'll wait [13:04:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900537 (https://phabricator.wikimedia.org/T326012) (owner: 10Stang) [13:04:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900690 (https://phabricator.wikimedia.org/T332351) (owner: 10Stang) [13:04:53] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:05:22] koi: going to do those two together, and then the wordmark ones together afterwards [13:05:29] TheresNoTime: My patch for FlaggedRevs (900696) will require running maintenance scripts. Is it fine for you to do? 
[13:05:29] got it [13:05:37] Kizule: sure :) [13:05:49] It won't be long, but just wanted to make sure. [13:05:54] (03CR) 10Kosta Harlan: "Gergo, do we still need this one?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/681669 (owner: 10Gergő Tisza) [13:06:01] if you start the `scap backport` before the gate-and-submit is finished, it works without issue AFAIK [13:06:15] if the gate-and-submit finishes “too soon” you might need to deploy the old-fashioned way, I’m not sure [13:06:16] (03Merged) 10jenkins-bot: bewiki: Remove group "autoeditor", "reviewer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900537 (https://phabricator.wikimedia.org/T326012) (owner: 10Stang) [13:06:19] (03Merged) 10jenkins-bot: slwiki: Create Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900690 (https://phabricator.wikimedia.org/T332351) (owner: 10Stang) [13:06:24] ack. I'll just wait :) [13:06:24] (regarding early +2ing) [13:06:28] ok ^^ [13:06:42] !log samtar@deploy2002 Started scap: Backport for [[gerrit:900537|bewiki: Remove group "autoeditor", "reviewer" (T326012)]], [[gerrit:900690|slwiki: Create Draft namespace (T332351)]] [13:06:49] T332351: Request for the Draft namespace on the Slovene (sl) Wikipedia - https://phabricator.wikimedia.org/T332351 [13:06:49] T326012: Correct user groups for bewiki - https://phabricator.wikimedia.org/T326012 [13:08:04] TheresNoTime, I'm not sure but it might needs to run NamespaceDupes.php for the slwiki patch [13:08:08] looks like someone is deploying already? 
[13:08:13] !log samtar@deploy2002 stang and samtar: Backport for [[gerrit:900537|bewiki: Remove group "autoeditor", "reviewer" (T326012)]], [[gerrit:900690|slwiki: Create Draft namespace (T332351)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:08:31] koi: okay, looking [13:08:40] (can you test ^) [13:08:52] looking [13:10:04] (03CR) 10Hnowlan: [C: 04-1] changeprop-jobqueue: Bump version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901202 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [13:11:30] (03CR) 10Kosta Harlan: changeprop-jobqueue: Bump version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901202 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [13:11:34] (03Abandoned) 10Kosta Harlan: changeprop-jobqueue: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/901202 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [13:12:25] TheresNoTime, confirmed both patch works well from my side [13:12:29] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudcephosd1025: power supply temperature critical - https://phabricator.wikimedia.org/T332406 (10Jclark-ctr) a:03Jclark-ctr [13:12:31] syncing [13:14:47] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:14:50] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:15:23] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:15:43] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudcephosd1025: power supply temperature critical - https://phabricator.wikimedia.org/T332406 (10Jclark-ctr) 05Open→03Resolved Reseated power supply cleared fault on psu [13:17:08] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:17:26] koi: I'll dry 
run NamespaceDupes after this is done to check [13:17:28] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:18:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host cuminunpriv1001.eqiad.wmnet with OS bullseye [13:18:05] 10SRE, 10Infrastructure-Foundations: Migrate cuminunpriv1001 to Bullseye - https://phabricator.wikimedia.org/T331700 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host cuminunpriv1001.eqiad.wmnet with OS bullseye [13:18:18] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900537|bewiki: Remove group "autoeditor", "reviewer" (T326012)]], [[gerrit:900690|slwiki: Create Draft namespace (T332351)]] (duration: 11m 36s) [13:18:24] T332351: Request for the Draft namespace on the Slovene (sl) Wikipedia - https://phabricator.wikimedia.org/T332351 [13:18:25] T326012: Correct user groups for bewiki - https://phabricator.wikimedia.org/T326012 [13:18:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:07] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10Jclark-ctr) [13:19:45] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10Jclark-ctr) 05Open→03Resolved Removed from rack, ran offline script [13:20:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 5d 23h 37m 7s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [13:20:44] koi: errm, either I'm running 
things incorrectly, or there's a lot of problems.. https://phabricator.wikimedia.org/P45894 [13:21:12] (this went on and on until I ctrl+C'd it) [13:21:59] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:22:05] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:22:19] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:22:54] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 (10Jclark-ctr) Ran offline script, Removed from rack [13:23:15] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:23:16] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 (10Jclark-ctr) [13:23:18] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40219/console" [puppet] - 10https://gerrit.wikimedia.org/r/901196 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:23:27] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 (10Jclark-ctr) 05Open→03Resolved [13:23:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:01] koi: would 
you like me to move on to the wordmark patches? [13:24:01] TheresNoTime: sorry but I'm not sure what it means... I do make "Draft" as an alias of Osnutek [13:24:16] yeah, please move on [13:24:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900689 (https://phabricator.wikimedia.org/T326067) (owner: 10Stang) [13:24:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900742 (https://phabricator.wikimedia.org/T332439) (owner: 10Stang) [13:24:59] !log awight@deploy2002 Started deploy [kartotherian/deploy@906be32] (codfw): Update kartotherian to a6e9843 [13:25:21] (03Merged) 10jenkins-bot: kuwiktionary: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900689 (https://phabricator.wikimedia.org/T326067) (owner: 10Stang) [13:26:18] (03PS2) 10Samtar: trwikivoyage: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900742 (https://phabricator.wikimedia.org/T332439) (owner: 10Stang) [13:26:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cuminunpriv1001.eqiad.wmnet with reason: host reimage [13:26:38] !log awight@deploy2002 Finished deploy [kartotherian/deploy@906be32] (codfw): Update kartotherian to a6e9843 (duration: 01m 39s) [13:27:20] (03CR) 10TrainBranchBot: "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900742 (https://phabricator.wikimedia.org/T332439) (owner: 10Stang) [13:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:28:04] (03Merged) 
10jenkins-bot: trwikivoyage: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900742 (https://phabricator.wikimedia.org/T332439) (owner: 10Stang) [13:28:14] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:28:18] !log samtar@deploy2002 Started scap: Backport for [[gerrit:900689|kuwiktionary: Add wordmark (T326067)]], [[gerrit:900742|trwikivoyage: Update wordmark (T332439)]] [13:28:24] T332439: Change Turkish Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T332439 [13:28:25] T326067: Change ku.wiktionary logo name for mobile - https://phabricator.wikimedia.org/T326067 [13:29:04] !log awight@deploy2002 Started deploy [kartotherian/deploy@906be32] (eqiad): Update kartotherian to a6e9843 [13:29:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cuminunpriv1001.eqiad.wmnet with reason: host reimage [13:29:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/901196 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:29:41] !log samtar@deploy2002 stang and samtar: Backport for [[gerrit:900689|kuwiktionary: Add wordmark (T326067)]], [[gerrit:900742|trwikivoyage: Update wordmark (T332439)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:29:52] koi: please test ^ [13:29:55] looking [13:30:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 5d 23h 37m 7s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [13:30:34] !log awight@deploy2002 Finished deploy [kartotherian/deploy@906be32] (eqiad): Update kartotherian to a6e9843 (duration: 01m 30s) [13:31:15] TheresNoTime, tested on vector-2022 and LGTM [13:31:21] syncing 
[13:31:43] (03PS6) 10Samtar: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) (owner: 10Acamicamacaraca) [13:32:17] TheresNoTime: (I'll be back in ~10 minutes) [13:32:23] ack [13:34:05] Kizule: ref. your maintenance scripts, could you add the commands to T331762? [13:34:06] T331762: Remove FlaggedRevs for ptwikisource - https://phabricator.wikimedia.org/T331762 [13:34:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs3005.esams.wmnet with reason: rebooting for kernel updates [13:34:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs3005.esams.wmnet with reason: rebooting for kernel updates [13:34:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10Jclark-ctr) 05Open→03Resolved [13:35:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs2008.codfw.wmnet with reason: rebooting for kernel updates [13:35:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs2008.codfw.wmnet with reason: rebooting for kernel updates [13:35:43] Aca: you're up next fyi [13:35:56] nice, okie [13:36:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:05] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900689|kuwiktionary: Add wordmark (T326067)]], [[gerrit:900742|trwikivoyage: Update wordmark (T332439)]] (duration: 08m 46s) [13:37:11] T332439: Change Turkish Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T332439 [13:37:11] T326067: Change ku.wiktionary logo name for mobile - https://phabricator.wikimedia.org/T326067 [13:37:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) (owner: 10Acamicamacaraca) [13:37:18] expected, BGP alerts in esams and codfw, lvs reboots [13:37:59] (03Merged) 10jenkins-bot: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) (owner: 10Acamicamacaraca) [13:38:10] !log samtar@deploy2002 Started scap: Backport for [[gerrit:900675|SITENAME change of Serbo-Croatian Wikipedia (T332468)]] [13:38:15] T332468: SITENAME change of Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T332468 [13:38:49] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:06] (03PS5) 10Samtar: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [13:39:13] (03PS5) 10Samtar: Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [13:39:45] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:48] !log samtar@deploy2002 aleksandar and samtar: Backport for [[gerrit:900675|SITENAME change of Serbo-Croatian Wikipedia (T332468)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:39:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:54] testing it rn [13:39:54] Aca: live on mwdebug, can you test? 
[13:39:56] :) [13:41:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host cuminunpriv1001.eqiad.wmnet with OS bullseye [13:41:34] 10SRE, 10Infrastructure-Foundations: Migrate cuminunpriv1001 to Bullseye - https://phabricator.wikimedia.org/T331700 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host cuminunpriv1001.eqiad.wmnet with OS bullseye completed: - cuminunpriv1001 (**PASS**) - Downtimed... [13:41:52] sitename is updated accordingly, seems good to me [13:42:01] syncing [13:43:10] (03CR) 10Samtar: [C: 03+2] "merge for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [13:43:55] (03Merged) 10jenkins-bot: Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [13:44:37] (03PS6) 10Samtar: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [13:47:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:35] Kizule: you're up next, around? 
[13:47:37] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900675|SITENAME change of Serbo-Croatian Wikipedia (T332468)]] (duration: 09m 26s) [13:47:42] T332468: SITENAME change of Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T332468 [13:47:44] Aca: live :) [13:47:45] TheresNoTime: Yep [13:48:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [13:48:18] TheresNoTime: Thanks for the deployment! :) [13:48:41] TheresNoTime: I've closed tasks of deployed patches, wanted to help you. :) [13:48:51] (03Merged) 10jenkins-bot: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [13:49:05] !log samtar@deploy2002 Started scap: Backport for [[gerrit:776200|Remove meaningless restriction level "none"]], [[gerrit:900696|Remove FlaggedRevs from ptwikisource (T331762)]] [13:49:10] T331762: Remove FlaggedRevs for ptwikisource - https://phabricator.wikimedia.org/T331762 [13:49:48] Kizule: many thanks :) did you see my comment about which maintenance scripts you need running? Are we just removing the empty user group (`EmptyUserGroup`)? [13:50:19] EmptyUserGroup for emptying editor group, moveUserGroup for moving autoreviewer to autopatrol and reviewer to patrol. [13:50:31] !log samtar@deploy2002 thiemowmde and samtar and zoranzoki21: Backport for [[gerrit:776200|Remove meaningless restriction level "none"]], [[gerrit:900696|Remove FlaggedRevs from ptwikisource (T331762)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:50:47] (03CR) 10Herron: [C: 03+1] "Thanks! 
Please see minor commit msg nit" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901138 (owner: 10JMeybohm) [13:51:01] Kizule: "migrateUserGroup" then? [13:51:13] (and that change is live on mwdebug, can you test?) [13:51:21] (03PS26) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [13:51:39] TheresNoTime: Yeah, migrateUserGroup. I don't know why I keep thinking that script is called moveUserGroup.. [13:51:45] However, testing on mwdebug now.. [13:52:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:56] TheresNoTime: Looks good, you can deploy. [13:53:02] syncing :) [13:53:07] Yeah, sync. [13:54:14] (oh that was just me announcing what I'm doing :p not correcting you!) [13:54:31] I know, I wanted to correct myself. 
:D [13:55:08] (03PS20) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [13:55:42] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [13:56:51] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 477, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:08] (03CR) 10Jameel Kaisar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:57:29] TheresNoTime: Please check T332351. [13:57:29] T332351: Request for the Draft namespace on the Slovene (sl) Wikipedia - https://phabricator.wikimedia.org/T332351 [13:57:39] Kizule: and for the avoidance of doubt, https://phabricator.wikimedia.org/P45895 are the commands you're expecting me to run? [13:57:47] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:56] TheresNoTime: About my task, yeah. 
[13:58:03] ack [13:58:49] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:776200|Remove meaningless restriction level "none"]], [[gerrit:900696|Remove FlaggedRevs from ptwikisource (T331762)]] (duration: 09m 44s) [13:58:54] T331762: Remove FlaggedRevs for ptwikisource - https://phabricator.wikimedia.org/T331762 [13:58:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:03] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/emptyUserGroup.php --wiki ptwikisource editor` T331762 [14:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:22] TheresNoTime: You should run namespaceDupes.php with --fix for T332351 [14:00:40] (will do) [14:01:14] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'autoreviewer' 'autopatrol'` ("nothing to do") T331762 [14:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'reviewer' 'patrol'` T331762 [14:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:49] TheresNoTime: I've moved my patches to the UTC late window [14:03:01] thanks for managing the config patches during this window! 
[14:03:40] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/namespaceDupes.php --wiki slwiki --fix` T332351 [14:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:44] T332351: Request for the Draft namespace on the Slovene (sl) Wikipedia - https://phabricator.wikimedia.org/T332351 [14:03:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:27] TheresNoTime: I'm checking now why it shows that there are no users in autoreviewer group. [14:04:36] Kizule: ack [14:04:46] kostajh: (also ack) [14:05:09] TheresNoTime: I think it needs to be run as autoreview to autopatrol. [14:05:38] mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'autoreview' 'autopatrol' [14:05:55] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'autoreview' 'autopatrol'` T331762 [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:00] T331762: Remove FlaggedRevs for ptwikisource - https://phabricator.wikimedia.org/T331762 [14:06:15] (41 changed) [14:06:18] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:06:42] TheresNoTime: Looks good in database, thanks! [14:06:58] But not on wiki.. https://pt.wikisource.org/wiki/Especial:Privil%C3%A9gios/Liejo_Gruz [14:07:41] Oh, sorry for trouble. It has to be autopatrolled. [14:07:46] I see `Member of: autopatrol`? [14:07:55] ah [14:08:02] mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'autopatrol' 'autopatrolled' [14:08:07] I just checked srwiki's database. 
[14:08:07] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:08:19] | 323597 | autopatrolled | NULL | [14:08:38] vs [14:08:39] | 35965 | autopatrol | NULL | [14:08:53] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:08:56] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/migrateUserGroup.php --wiki ptwikisource 'autopatrol' 'autopatrolled'` T331762 [14:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:02] (41 changed again) [14:09:10] Now it looks fine! :) [14:09:27] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [14:09:59] (03PS21) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [14:10:02] so koi, Kizule — everything looks okay for you both? [14:10:25] ThersNoTime: Yeah, for me ptwikisource part looks fine. [14:10:26] I rechecked the db and LGTM :) [14:10:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs1018.eqiad.wmnet with reason: rebooting for kernel updates [14:10:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs1018.eqiad.wmnet with reason: rebooting for kernel updates [14:10:57] !log close UTC afternoon backport window [14:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:12] *phew*, that was a busy one.. [14:11:37] Apologies that overran kostajh [14:15:08] Yeah [14:16:16] Thank you TheresNoTime for patience and everything, I appreciate it. :) [14:16:24] no problem! 
:D [14:17:05] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:17:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:17:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:18:55] 10SRE-swift-storage, 10Commons, 10Patch-For-Review, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Hi, This seems to work OK for me now. Thanks for fixing it. [14:20:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40221/console" [puppet] - 10https://gerrit.wikimedia.org/r/901167 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [14:20:45] (03PS2) 10Samtar: InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521) [14:21:15] jouncebot: nowandnext [14:21:15] No deployments scheduled for the next 1 hour(s) and 8 minute(s) [14:21:15] In 1 hour(s) and 8 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1530) [14:22:04] 10SRE-swift-storage, 10Commons, 10Patch-For-Review, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [14:22:48] (03CR) 10Elukey: [V: 03+1 C: 03+1] kafka: broker replica lag alert moved to AM [puppet] - 10https://gerrit.wikimedia.org/r/901167 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [14:23:54] (03PS1) 10Bking: rdf-streaming-updater: use correct resource name 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/901218 (https://phabricator.wikimedia.org/T328675) [14:24:59] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:25:45] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:27:07] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [14:27:19] (03PS1) 10Samtar: InitialiseSettings-labs: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901219 (https://phabricator.wikimedia.org/T332521) [14:29:10] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: use correct resource name [deployment-charts] - 10https://gerrit.wikimedia.org/r/901218 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:29:13] (03CR) 10Samtar: [C: 03+2] "[prod noop] Per I20db42475909cb17ee45396f1dc58289f7f5e295's +1, beta cluster only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901219 (https://phabricator.wikimedia.org/T332521) (owner: 10Samtar) [14:29:30] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: use correct resource name [deployment-charts] - 10https://gerrit.wikimedia.org/r/901218 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:29:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: T326564 [14:29:49] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [14:30:07] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901219 (https://phabricator.wikimedia.org/T332521) (owner: 10Samtar) [14:30:10] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: T326564 [14:30:21] (03PS1) 10Volans: es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) [14:30:39] MatmaRex: your maintenance script has finally finished! [14:30:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:58] ^ expected [14:31:56] PROBLEM - Host es2029 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:32:07] uh? [14:32:13] (03PS2) 10David Martin: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 [14:32:50] is that expected? [14:33:06] I don't think so [14:33:11] Incident acked [14:33:15] thanks [14:33:17] there seems to be nothing in the sal for es2029 [14:33:20] I'd say no: https://logstash.wikimedia.org/goto/83874c6c6b848b8236a12c8f470be6f8 [14:33:26] many codfw db errors [14:33:28] (03CR) 10Volans: "How can this be tested to make sure the aggregation is done correctly?" [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [14:33:32] RECOVERY - Host es2029 #page is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [14:33:32] marostegui: ^ [14:34:11] should I depool until it catches up? 
[14:34:31] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:34] it is now overloaded [14:34:36] (03Merged) 10jenkins-bot: rdf-streaming-updater: use correct resource name [deployment-charts] - 10https://gerrit.wikimedia.org/r/901218 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:34:38] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Patch-For-Review, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10Volans) p:05Triage→03Medium [14:34:58] jynus: +1 for depool, but you know dbs better than I do [14:35:08] jynus: I am checking [14:35:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:35:34] my guess is it lost networking and now backlog overloaded connections [14:35:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:35:42] marostegui: ack, oncall standing by [14:35:42] + it will be lagged [14:36:32] uptime is 3mins, so it did reboot [14:36:43] then it wasn't network [14:36:47] dbctl seems to be broken [14:37:07] volans _joe_ ^ [14:37:10] most likely a crash [14:37:12] can you help? [14:37:12] PROBLEM - mysqld processes #page on es2029 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:37:33] marostegui: looking [14:37:36] what's the problem? 
[14:37:37] (03CR) 10Filippo Giunchedi: [C: 03+2] kafka: broker replica lag alert moved to AM [puppet] - 10https://gerrit.wikimedia.org/r/901167 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [14:37:38] Ah no, I know what it is [14:37:41] let me switchover the master [14:37:45] ack [14:38:10] letting manuel do it, it just now needs switchover and depool [14:38:13] anyways, the host is up [14:38:19] it is a standalone one [14:38:25] Acked the secondary page [14:38:25] I will depool it to double check [14:38:33] but I will switch the master so I can depool it [14:38:46] (03CR) 10Btullis: [V: 03+1 C: 03+2] Omit Python 3.7 in the analytics cluster after bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901196 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [14:38:56] meanwhile I will check for hw logs [14:39:06] nothing obvious in dmesg [14:39:08] RECOVERY - mysqld processes #page on es2029 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:39:08] can someone create a task too? 
[14:39:48] yup, I can create a task [14:39:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2029 and promote es2027 to es3 master', diff saved to https://phabricator.wikimedia.org/P45896 and previous config saved to /var/cache/conftool/dbconfig/20230320-143951-root.json [14:39:57] jhathaway: thanks, assign it to me please [14:40:04] nod [14:40:25] thanks, can I leave it to you jhathaway because I haven't eaten yet and was about to [14:40:38] yup, enjoy [14:40:41] (03PS1) 10Marostegui: es2029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/901225 [14:40:58] thanks [14:41:27] (03CR) 10Marostegui: [C: 03+2] es2029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/901225 (owner: 10Marostegui) [14:42:24] (03Abandoned) 10Dzahn: remove role::webserver_misc_apps from sre module [puppet] - 10https://gerrit.wikimedia.org/r/900464 (owner: 10Dzahn) [14:42:29] marostegui: so all good with dbctl? was just that was not allowing to depool the "master" although is a standalone? [14:42:33] Power supply redundancy is lost [14:42:34] (03PS1) 10Bking: rdf-streaming-update: use correct data type [deployment-charts] - 10https://gerrit.wikimedia.org/r/901226 (https://phabricator.wikimedia.org/T328675) [14:42:39] volans: correct [14:42:42] 03/20/2023 14:26:43 [14:42:49] that's a few minutes ago, right? [14:43:19] jynus: yes [14:43:31] either power was lost on both supplies, or when lost in one it didn't switch properly [14:43:32] (03CR) 10DCausse: [C: 03+1] rdf-streaming-update: use correct data type [deployment-charts] - 10https://gerrit.wikimedia.org/r/901226 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:43:32] interesting [14:44:02] what's the task #, so I can paste it? 
[14:44:03] (03CR) 10Bking: [C: 03+2] rdf-streaming-update: use correct data type [deployment-charts] - 10https://gerrit.wikimedia.org/r/901226 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:44:23] jynus: https://phabricator.wikimedia.org/T332603 [14:44:27] no detail yet [14:44:28] thanks [14:44:35] I will add those [14:46:31] (03PS1) 10Hokwelum: Remove dumpsdata1005 from the list of spare servers [puppet] - 10https://gerrit.wikimedia.org/r/901227 [14:46:37] The iDRAC firmware was rebooted with the following reason: ac. [14:46:56] no, that's old [14:47:48] actually it is the same as the recent one [14:47:55] loss of power led to CPU shutdown [14:48:33] I posted the output of ipmi-sel on the task [14:48:44] (03CR) 10ArielGlenn: [C: 03+2] Remove dumpsdata1005 from the list of spare servers [puppet] - 10https://gerrit.wikimedia.org/r/901227 (owner: 10Hokwelum) [14:49:04] sort of the same info I edited in :-D [14:49:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:49:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:49:34] indeed [14:49:34] 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Marostegui) [14:49:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2552 [14:50:44] 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Marostegui) @Papaul any thoughts on ` Message = The iDRAC firmware was rebooted with the following reason: ac. Message Arg 1 = ac`? 
` I've never seen that message before [14:51:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2552 [14:53:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [14:53:28] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye [14:53:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1013.eqiad.wmnet with OS bullseye [14:53:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with... [14:54:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) Just one should do it, we're already in the process of converting the other cloudvirts to single nic as well. 
[14:56:12] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:56:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [14:56:20] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:56:22] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:56:25] 10SRE-swift-storage, 10Thumbor, 10Platform Team Workboards (Platform Engineering Reliability): Thumbor 404s on an auth failure to Swift - https://phabricator.wikimedia.org/T332210 (10TheDJ) @MatthewVernon FYI, I think I was able to find back this incident in logstash via the thumbor logger: https://logstash.... [14:56:30] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:01:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) I am not able to do the initial installs, fe1013 and 1014 fail immediately, maybe there is a dhcp error and thanos-fe doesn't get a lease [15:02:24] 10SRE, 10Observability-Logging, 10Release-Engineering-Team, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (10colewhite) 05Open→03Resolved a:03colewhite I went ahead and re... 
[15:02:35] 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Marostegui) Meanwhile I am doing a data consistency check [15:06:21] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [15:06:43] (03CR) 10CI reject: [V: 04-1] Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [15:11:12] taavi: yay! (i guess…) ;) [15:11:14] 10SRE-Sprint-Week-Sustainability-March2023, 10Maps (Kartotherian), 10Sustainability (Incident Followup), 10Technical-Debt: Kartotherian configuration should be deployable to all production envs at once - https://phabricator.wikimedia.org/T328406 (10Volans) Removing #sre-onfire as it seems to me a very spec... [15:12:05] (03PS11) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [15:12:25] taavi: we can start a longer one now. enwiki is waiting. i'll need to schedule some config changes first though [15:12:36] (03CR) 10Elukey: "Switched from goodfaith to articletopic, so we'll have something to present to the Search team." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [15:12:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:13:04] 10SRE-OnFire, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10Volans) [15:17:52] 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Papaul) @Marostegui nobody is working or has worked in that rack this morning. Taking a look now [15:17:58] (03CR) 10Jameel Kaisar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [15:18:22] (03PS1) 10Giuseppe Lavagetto: graphite::alerts: add alert on mediawiki account creation failures [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) [15:21:00] (03PS1) 10Muehlenhoff: cuminunpriv: Remove old buster code [puppet] - 10https://gerrit.wikimedia.org/r/901234 (https://phabricator.wikimedia.org/T331700) [15:21:48] (03PS1) 10David Caro: toolforge: add k8s bastion with toolforge config [puppet] - 10https://gerrit.wikimedia.org/r/901235 [15:22:29] (03CR) 10CI reject: [V: 04-1] toolforge: add k8s bastion with toolforge config [puppet] - 10https://gerrit.wikimedia.org/r/901235 (owner: 10David Caro) [15:22:34] 10SRE-OnFire, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Joe) Removing the sustainability tag as it doesn't seem like there is any related actionable here. 
@Clement_Goubert if... [15:22:49] 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10jcrespo) If it helps, that log entry happened also at: * 2022-08-03 17:18:42 * 2022-04-11 15:52:45 * 2020-09-09 09:41:56 [15:22:57] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10Joe) [15:23:56] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10Volans) p:05Triage→03Medium [15:24:38] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10Joe) I guess this task is surely in the "serviceops" area, but probably @Eevans has the most experience being one of the o... [15:25:31] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Engineering-Planning, 10Event-Platform Value Stream, and 2 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10Volans) [15:27:48] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Volans) Is there anything left to do here related to #sre-onfire or #wikimedia-incident-actionable ? 
[15:27:51] (03PS1) 10Vgutierrez: haproxy: Allow specifying maxconn per backend [puppet] - 10https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) [15:30:03] PROBLEM - pybal on lvs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:30:05] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1530). nyaa~ [15:30:12] (03CR) 10Muehlenhoff: [C: 03+2] cuminunpriv: Remove old buster code [puppet] - 10https://gerrit.wikimedia.org/r/901234 (https://phabricator.wikimedia.org/T331700) (owner: 10Muehlenhoff) [15:30:18] lvs1018 downtime expired [15:30:21] er 2008 [15:30:22] pooling [15:31:05] 10SRE, 10Infrastructure-Foundations: Migrate cuminunpriv1001 to Bullseye - https://phabricator.wikimedia.org/T331700 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff cuminunpriv1001 has been updated and all tests went fine. [15:31:09] (03CR) 10Cwhite: [C: 03+1] "If the metric name and label names are ok with folks, this change LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [15:31:09] RECOVERY - pybal on lvs2008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:31:10] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:33:49] (03CR) 10Majavah: [C: 04-1] "Let's add the config file to the same profile which installs the cli package?" 
[puppet] - 10https://gerrit.wikimedia.org/r/901235 (owner: 10David Caro) [15:34:28] (03CR) 10Vgutierrez: "looking good:" [puppet] - 10https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [15:34:53] (03PS1) 10Elukey: install_server: update netboot config for kafka-main nodes [puppet] - 10https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) [15:35:27] (03CR) 10CI reject: [V: 04-1] install_server: update netboot config for kafka-main nodes [puppet] - 10https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) (owner: 10Elukey) [15:37:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney We can move any servers racked from U11 up [15:38:15] (03CR) 10Hashar: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [15:38:25] (03PS2) 10Elukey: install_server: update netboot config for kafka-main nodes [puppet] - 10https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) [15:39:19] (03CR) 10Elukey: profile::cache::purge: move purged to a new CA bundle (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [15:39:22] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10jcrespo) I believe this was only open waiting for the incident writing, which happened long time ago, but Releng or Dzah... 
[15:40:04] (CR) EoghanGaffney: [V: +1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [15:41:50] (CR) Hashar: [C: +1] Adds php and apache logs for doc machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [15:42:20] (PS1) Cwhite: profile: remove SQLPlatform::isWriteQuery high log volume mitigation [puppet] - https://gerrit.wikimedia.org/r/900716 (https://phabricator.wikimedia.org/T332228) [15:43:06] (CR) Hashar: [C: +1] Add doc host apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [15:43:40] (PS1) Bking: rdf-streaming-updater: use correct release and app [deployment-charts] - https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) [15:44:28] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (Volans) [15:46:29] (CR) Cwhite: [C: +1] es_exporter: add NEL metrics by country (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: Volans) [15:47:08] (CR) CI reject: [V: -1] rdf-streaming-updater: use correct release and app [deployment-charts] - https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) (owner: Bking) [15:51:46] (CR) Hashar: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [15:52:22] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/901233 
(https://phabricator.wikimedia.org/T146090) (owner: Giuseppe Lavagetto) [15:52:31] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:52:36] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w... [15:53:15] (CR) EoghanGaffney: [V: +1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [15:54:18] (CR) Volans: [C: +1] "LGTM, do you have a PCC to check the generated with/without the maxconn set?" [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [15:54:57] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) (owner: Elukey) [15:55:16] (PS2) JMeybohm: Move to demjson3 and install jsonnet-lint 0.19.1 [grafana-grizzly] - https://gerrit.wikimedia.org/r/901138 (https://phabricator.wikimedia.org/T331659) [15:55:18] (PS2) JMeybohm: Add .gitreview [grafana-grizzly] - https://gerrit.wikimedia.org/r/901139 (https://phabricator.wikimedia.org/T331659) [15:57:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:57:19] (CR) Elukey: "Hugh: o/ fine to roll it out anytime, or do you prefer to be present when I do it?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey) [15:59:18] (CR) Hnowlan: [C: +1] services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey) [15:59:55] (CR) David Caro: toolforge: add k8s bastion with toolforge config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:00:56] (PS2) David Caro: toolforge: add k8s bastion with toolforge config [puppet] - https://gerrit.wikimedia.org/r/901235 [16:01:29] (CR) CI reject: [V: -1] toolforge: add k8s bastion with toolforge config [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:02:42] (CR) David Caro: [V: +1] kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro) [16:02:57] (CR) Majavah: [C: -1] toolforge: add k8s bastion with toolforge config (3 comments) [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:04:32] (CR) Majavah: kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro) [16:05:19] (PS3) David Caro: kubernetes: set NO_HOME for bulidservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 [16:10:48] (CR) David Caro: kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro) [16:10:53] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:11:51] (CR) David Caro: kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro) [16:17:37] SRE, Commons, 
MediaWiki-File-management, StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) >>! In T266155#8707579, @doctaxon wrote: > @TheDJ thanks for your comment. These... [16:20:06] (PS3) David Caro: toolforge: add k8s bastion with toolforge config [puppet] - https://gerrit.wikimedia.org/r/901235 [16:20:08] (CR) David Caro: toolforge: add k8s bastion with toolforge config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:20:39] (CR) CI reject: [V: -1] toolforge: add k8s bastion with toolforge config [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:20:41] (CR) David Caro: toolforge: add k8s bastion with toolforge config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro) [16:21:24] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:22:13] (PS4) David Caro: toolforge: add k8s bastion with toolforge config [puppet] - https://gerrit.wikimedia.org/r/901235 [16:22:19] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:23:44] SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) [16:32:48] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:33:14] MatmaRex: do you want me to start it on enwiki? 
[16:36:14] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:36:34] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:37:33] taavi: i need to change a config for it to work [16:37:40] i'll do it today, i don't have the patch ready [16:43:39] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:43:50] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:44:22] taavi: do you know if it would be okay to run it in parallel on two or more sets of wikis at once? or how would i go about finding out if it's okay? [16:45:11] MatmaRex: usually it's fine to run scripts on one wiki per section at a time [16:48:24] (CR) Giuseppe Lavagetto: [C: +1] haproxy: Allow specifying maxconn per backend [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [16:49:21] (PS1) Hnowlan: changeprop: allow setting strategy [deployment-charts] - https://gerrit.wikimedia.org/r/901246 [16:49:23] hmm, okay. i'll think about it [16:50:43] the remaining wikis are all of group2, so i think i'd have to create dblists that are like group2&s1, group2&s2, group2&s3, etc., and run foreachwiki on each of those? 
[16:57:06] (PS1) Samtar: InitialiseSettings-labs: Add `Phonos` channel to `debug` [mediawiki-config] - https://gerrit.wikimedia.org/r/901248 (https://phabricator.wikimedia.org/T332521) [16:58:02] foreachwikiindblist can evaluate those kinds of expressions on the fly [16:58:43] (PS2) Samtar: InitialiseSettings-labs: Add `Phonos` channel to `debug` [mediawiki-config] - https://gerrit.wikimedia.org/r/901248 (https://phabricator.wikimedia.org/T325464) [17:00:01] (CR) Samtar: [C: +2] "[noop prod] beta deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/901248 (https://phabricator.wikimedia.org/T325464) (owner: Samtar) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1700) [17:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T1700). [17:00:47] (Merged) jenkins-bot: InitialiseSettings-labs: Add `Phonos` channel to `debug` [mediawiki-config] - https://gerrit.wikimedia.org/r/901248 (https://phabricator.wikimedia.org/T325464) (owner: Samtar) [17:02:04] (PS1) Jdlrobson: Add languages to Minerva HTML [skins/MinervaNeue] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/901275 (https://phabricator.wikimedia.org/T331905) [17:11:33] taavi: oh cool, i need to look at that. 
thanks [17:11:35] SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Jclark-ctr) [17:12:25] SRE, ops-eqiad, DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (Jclark-ctr) [17:12:28] please ignore BGP alerts in eqiad, codfw, esams [17:13:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:14:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet with reason: reboot for kernel update [17:14:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet with reason: reboot for kernel update [17:14:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs3006.esams.wmnet with reason: reboot for kernel update [17:14:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs3006.esams.wmnet with reason: reboot for kernel update [17:14:48] SRE, ops-eqiad, DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (Jclark-ctr) All temp sensors installed. 
Next step is setup the msw's in racks e5-e8 f5-f8 and configures ports on scs in Rack f8 [17:16:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:33] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:03] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:11] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:30] (Abandoned) Jdlrobson: Make messages about editing site code more prominent [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900467 (https://phabricator.wikimedia.org/T311891) (owner: Jdlrobson) [17:21:09] SRE-Sprint-Week-Sustainability-March2023, Release-Engineering-Team, Scap, Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (Joe) Re-tagging to the team responsible for scap. [17:22:52] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Scap, serviceops, Sustainability (Incident Followup): Add etcdmirror status check to scap - https://phabricator.wikimedia.org/T317403 (Joe) [17:26:08] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, serviceops, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Joe) >What I don't understand is why the python etcd lib client would fail on connection to only one... 
[17:26:36] !log disable puppet on rdb*, netbox*, ores*, registry* [17:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:43] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, serviceops, Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (Joe) [17:38:16] (PS1) DCausse: dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/901253 (https://phabricator.wikimedia.org/T328675) [17:39:20] (CR) JMeybohm: [V: +2 C: +2] Add .gitreview [grafana-grizzly] - https://gerrit.wikimedia.org/r/901139 (https://phabricator.wikimedia.org/T331659) (owner: JMeybohm) [17:39:25] (CR) JMeybohm: [V: +2 C: +2] Move to demjson3 and install jsonnet-lint 0.19.1 [grafana-grizzly] - https://gerrit.wikimedia.org/r/901138 (https://phabricator.wikimedia.org/T331659) (owner: JMeybohm) [17:43:03] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:14] (CR) Dzahn: [C: +2] site: add miscweb1003 to miscweb role [puppet] - https://gerrit.wikimedia.org/r/900739 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [17:46:14] ACKNOWLEDGEMENT - Check health of redis instance on 6378 on rdb1011 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6378 alexandros kosiaris pass rollover https://wikitech.wikimedia.org/wiki/Redis [17:47:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:37] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:17] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 477, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:39] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:47] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:13] SRE-Sprint-Week-Sustainability-March2023, Maps (Kartotherian), Sustainability (Incident Followup), Technical-Debt: Kartotherian configuration should be deployable to all production envs at once or should prevent this - https://phabricator.wikimedia.org/T328406 (awight) [17:54:13] PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:44] (PS2) Gergő Tisza: Job queue configuration for DeleteLinkRecommendationJob [deployment-charts] - https://gerrit.wikimedia.org/r/681669 [17:55:00] ^ expected, downtiming [17:55:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs1019.eqiad.wmnet with reason: reboot for kernel update [17:55:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs1019.eqiad.wmnet with reason: reboot for kernel update [17:55:59] lvs1019 is depooled, so nothing to worry. 
the downtime expired as the host didn't come back up after reboot [17:56:05] just as an fyi [17:56:48] (PS1) Alexandros Kosiaris: codfw ORES: Switch to rdb2009 [puppet] - https://gerrit.wikimedia.org/r/901254 [17:57:06] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on miscweb1003.eqiad.wmnet with reason: maintenance [17:57:18] (CR) Alexandros Kosiaris: [V: +2 C: +2] "In a rush, self merging this to stop the bleeding" [puppet] - https://gerrit.wikimedia.org/r/901254 (owner: Alexandros Kosiaris) [17:57:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on miscweb1003.eqiad.wmnet with reason: maintenance [17:58:17] RECOVERY - Host lvs1019 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:58:44] (CR) Gergő Tisza: Job queue configuration for DeleteLinkRecommendationJob (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/681669 (owner: Gergő Tisza) [17:58:53] (Abandoned) Gergő Tisza: Job queue configuration for DeleteLinkRecommendationJob [deployment-charts] - https://gerrit.wikimedia.org/r/681669 (owner: Gergő Tisza) [17:59:29] !log when applying apache role for the first time on new hosts we still have the same old conflict: miscweb1003 - manual "a2dismod mpm_event" to be able to let puppet enable mod PHP (T196968) [17:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:34] T196968: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 [18:01:28] (CR) Kosta Harlan: [C: -2] "Should wait for go-ahead from Elena; tentatively planning on Wednesday" [mediawiki-config] - https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T317813) (owner: Kosta Harlan) [18:03:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:37] (PS1) Dzahn: miscweb: add httpd::mpm directory to httpd class [puppet] - https://gerrit.wikimedia.org/r/901255 [18:04:50] (PS2) Dzahn: miscweb: add httpd::mpm directory to httpd class [puppet] - https://gerrit.wikimedia.org/r/901255 (https://phabricator.wikimedia.org/T331896) [18:04:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1019.eqiad.wmnet [18:04:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1019.eqiad.wmnet [18:05:22] !log miscweb1003 - syntax error in httpd config due to "Unknown Authn provider: ldap" - comes from static-rt vhost (T331896) [18:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:27] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896 [18:05:32] no more BGP alerts expected now [18:08:40] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/901255/40222/" [puppet] - https://gerrit.wikimedia.org/r/901255 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [18:11:05] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync [18:11:18] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [18:11:19] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [18:11:39] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [18:11:40] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [18:11:57] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [18:13:38] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:37] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [18:15:50] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [18:15:51] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [18:16:08] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [18:16:09] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [18:16:24] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [18:17:42] (CR) Hnowlan: [C: +2] tests: Increase test timeout for tests run in Docker [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/901194 (owner: Kamila Součková) [18:18:16] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [18:18:31] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [18:18:32] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [18:19:29] akosiaris: just fyi I hit an issue in codfw where I couldn't rollout to codfw due to cpu requests being too high and the final pod couldn't be created. For the short term, reduce replicas to 29, apply and then increase. 
For the longer term I filed this earlier https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/901246/ [18:19:35] on jobqueue that is [18:22:00] (Merged) jenkins-bot: tests: Increase test timeout for tests run in Docker [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/901194 (owner: Kamila Součková) [18:22:02] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:14] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:23:54] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:28:52] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [18:28:53] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [18:30:19] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [18:30:24] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [18:30:25] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [18:30:42] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [18:30:43] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [18:31:00] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [18:32:29] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [18:32:40] !log akosiaris@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/changeprop-jobqueue: sync [18:32:41] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [18:32:50] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [18:34:28] (PS1) Dzahn: miscweb: remove custom source line from httpd::mpm [puppet] - https://gerrit.wikimedia.org/r/901259 (https://phabricator.wikimedia.org/T331896) [18:35:12] (PS2) Dzahn: miscweb: remove custom source line from httpd::mpm [puppet] - https://gerrit.wikimedia.org/r/901259 (https://phabricator.wikimedia.org/T331896) [18:40:28] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:56] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:42:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams_sync.service,netbox_report_coherence_run.service,netbox_report_puppetdb_virtual_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:23] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3aaecb7]: safely quote spark args in skein script [18:42:36] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3aaecb7]: safely quote spark args in skein script (duration: 00m 13s) [18:44:22] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:44:33] (PS1) BCornwall: apigw: Upper-case Grizzly tag [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 
[18:45:17] !log re-enable puppet on rdb*, netbox*, ores*, registry* [18:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:54] (CR) Dzahn: [C: +2] miscweb: remove custom source line from httpd::mpm [puppet] - https://gerrit.wikimedia.org/r/901259 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [18:47:42] !log emergency rollover of redis password complete [18:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:18] (CR) Bking: [C: +2] dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/901253 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [18:48:43] (CR) Bking: [C: +1] dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/901253 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [18:48:53] !log akosiaris@deploy2002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 28s) [18:48:59] (CR) Bking: [C: +2] dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/901253 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [18:50:42] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:53:38] (Merged) jenkins-bot: dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/901253 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [18:54:46] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! 
In T327919#8710573, @Papaul wrote: > @cmooney We can move any servers racked from U11 up... [18:55:14] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:57:30] (CR) Herron: [C: +1] "Nice catch LGTM! Please see commit message nit inline" [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 (owner: BCornwall) [18:57:39] (PS2) BCornwall: apigw: Upper-case Grizzly tag [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 [18:58:13] (PS1) Dzahn: miscweb: use php7.4 if on bullseye [puppet] - https://gerrit.wikimedia.org/r/901263 (https://phabricator.wikimedia.org/T331896) [18:58:28] (CR) CI reject: [V: -1] miscweb: use php7.4 if on bullseye [puppet] - https://gerrit.wikimedia.org/r/901263 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [18:58:43] (PS2) Dzahn: miscweb: use php7.4 if on bullseye [puppet] - https://gerrit.wikimedia.org/r/901263 (https://phabricator.wikimedia.org/T331896) [19:00:54] (CR) JHathaway: [C: +1] "Looks good, just one question." 
[puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan) [19:01:32] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:01:40] (PS3) BCornwall: apigw: Upper-case Grizzly tag [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 (https://phabricator.wikimedia.org/T332629) [19:03:44] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:07:15] (CR) BCornwall: [C: +2] apigw: Upper-case Grizzly tag [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 (https://phabricator.wikimedia.org/T332629) (owner: BCornwall) [19:07:31] (CR) BCornwall: [V: +2 C: +2] apigw: Upper-case Grizzly tag [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 (https://phabricator.wikimedia.org/T332629) (owner: BCornwall) [19:07:54] (CR) BCornwall: [V: +2 C: +2] apigw: Upper-case Grizzly tag (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/901261 (https://phabricator.wikimedia.org/T332629) (owner: BCornwall) [19:10:38] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:20] (CR) Dzahn: [C: +2] miscweb: use php7.4 if on bullseye [puppet] - https://gerrit.wikimedia.org/r/901263 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [19:11:56] (CR) Vipz: [C: +1] "Per consensus on the project." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/901276 (https://phabricator.wikimedia.org/T332614) (owner: Acamicamacaraca) [19:13:19] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@b16917e]: fix templating in SimpleSkeinOperator [19:13:32] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@b16917e]: fix templating in SimpleSkeinOperator (duration: 00m 13s) [19:25:00] (CR) Cwhite: [C: +2] profile: remove SQLPlatform::isWriteQuery high log volume mitigation [puppet] - https://gerrit.wikimedia.org/r/900716 (https://phabricator.wikimedia.org/T332228) (owner: Cwhite) [19:26:40] (CR) Herron: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: Tim Starling) [19:29:50] (PS1) Dzahn: miscweb: use http::mod_conf to add authnz_ldap module [puppet] - https://gerrit.wikimedia.org/r/901287 (https://phabricator.wikimedia.org/T331896) [19:30:07] (CR) CI reject: [V: -1] miscweb: use http::mod_conf to add authnz_ldap module [puppet] - https://gerrit.wikimedia.org/r/901287 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [19:30:57] (PS2) Dzahn: miscweb: use http::mod_conf to add authnz_ldap module [puppet] - https://gerrit.wikimedia.org/r/901287 (https://phabricator.wikimedia.org/T331896) [19:33:41] (CR) Dzahn: [C: +2] miscweb: use http::mod_conf to add authnz_ldap module [puppet] - https://gerrit.wikimedia.org/r/901287 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn) [19:37:29] (PS1) Dzahn: site: add miscweb role to miscweb2003 [puppet] - https://gerrit.wikimedia.org/r/901288 (https://phabricator.wikimedia.org/T331896) [19:47:31] deployment.eqiad.wmnet being an alias for deploy2002.codfw.wmnet is messing with my head :p [19:48:58] !log miscweb1003 - manually edit /srv/deployment/iegreview/iegreview-cache/.config and replace tin.eqiad.wmnet with deployment.eqiad.wmnet (which is an alias for 
deploy2002.codfw.wmnet) T257317 T332623 T331896 [19:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:07] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896 [19:49:07] (03CR) 10BCornwall: [V: 03+1 C: 03+1] onboard home dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896426 (https://phabricator.wikimedia.org/T331656) (owner: 10Herron) [19:49:07] T257317: scap deploy --init on deployment server fails on first puppet run - https://phabricator.wikimedia.org/T257317 [19:49:07] T332623: scap deploy fails for iegreview - https://phabricator.wikimedia.org/T332623 [19:49:11] (03CR) 10Herron: [C: 03+1] "LGTM, and I think it was a good call to plan on approaching this carefully during sprint week. FWIW I can say the kafka-logging upgrades " [puppet] - 10https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) (owner: 10Elukey) [19:50:08] (03PS1) 10Samtar: wgAbuseFilterConditionLimit: Set default condition limit to 2000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901289 (https://phabricator.wikimedia.org/T309609) [19:50:48] (03CR) 10Dzahn: [C: 03+2] site: add miscweb role to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901288 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [19:57:25] (03PS1) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [19:57:40] (03PS2) 10Herron: onboard home dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896426 (https://phabricator.wikimedia.org/T331656) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T2000). 
[20:00:05] kostajh, MatmaRex, Aca, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] Confirming my presence, and sorry for overcrowding backport windows with my patch. [20:00:19] Hi [20:00:23] hello [20:00:27] hello [20:01:14] I'll be around in 5, kostajh did you want to self-deploy yours? [20:01:25] *first [20:01:55] Ok I’ll get started. [20:03:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 (owner: 10Kosta Harlan) [20:03:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900685 (https://phabricator.wikimedia.org/T332319) (owner: 10Kosta Harlan) [20:03:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900684 (https://phabricator.wikimedia.org/T332309) (owner: 10Kosta Harlan) [20:03:51] (03PS3) 10Herron: onboard home dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896426 (https://phabricator.wikimedia.org/T331656) [20:06:48] (03CR) 10Herron: [V: 03+2 C: 03+2] "thanks for the review!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896426 (https://phabricator.wikimedia.org/T331656) (owner: 10Herron) [20:08:16] (03PS1) 10Ahmon Dancy: Fix scap::dsh::group -> scap::dsh::groups (hiera) comments [puppet] - 10https://gerrit.wikimedia.org/r/901292 [20:14:04] (03CR) 10Dzahn: [C: 03+2] Fix scap::dsh::group -> scap::dsh::groups (hiera) comments [puppet] - 10https://gerrit.wikimedia.org/r/901292 (owner: 10Ahmon Dancy) [20:14:33] that post-merge `quibble-vendor-mysql-php74-selenium-docker` looks like its been having some trouble [20:15:24] (03CR) 10Muehlenhoff: [C: 03+1] install_server: update netboot config for kafka-main nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) (owner: 10Elukey) [20:15:30] (03CR) 10Kosta Harlan: [C: 03+2] "reapply +2 due to flaky Selenium test" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 (owner: 10Kosta Harlan) [20:15:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10Htriedman) Hi @MatthewVernon! We're currently running into some weird errors with Aranya's permissions, specifically regarding access to Turnilo a... [20:18:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10MatthewVernon) I think the right thing would be to open a new ticket; but I note it's SRE Sprint Week, so I'm not sure whether clinic duty tasks w... 
[20:20:09] (03PS2) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [20:21:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10Dzahn) Probably makes sense to reach out to SREs in analytics. [20:22:21] (03CR) 10CI reject: [V: 04-1] TryNewTask: Set an array fallback if TryNewTaskOptOuts is null [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 (owner: 10Kosta Harlan) [20:22:32] (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [20:22:53] (03CR) 10Kosta Harlan: [C: 03+2] TryNewTask: Set an array fallback if TryNewTaskOptOuts is null [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 (owner: 10Kosta Harlan) [20:23:19] TheresNoTime: yep. sorry, looks like it will be another ~15-18 minutes :\ [20:23:48] (03PS3) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [20:24:38] kostajh: ack, okay [20:25:02] (03PS1) 10Dzahn: add webserver-misc-sites and point it to miscweb1003/2003 [dns] - 10https://gerrit.wikimedia.org/r/901296 (https://phabricator.wikimedia.org/T331896) [20:25:37] (03CR) 10Dzahn: [C: 03+2] add webserver-misc-sites and point it to miscweb1003/2003 [dns] - 10https://gerrit.wikimedia.org/r/901296 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [20:26:51] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10Eevans) >>! 
In T320398#8710536, @Joe wrote: > I guess this task is surely in the "serviceops" area, but probably @Eevans h... [20:27:30] (03PS1) 10BCornwall: home_w_wiki_status: editable=false, Grizzly tag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901297 (https://phabricator.wikimedia.org/T331656) [20:28:37] (03PS1) 10Bking: elastic: [WIP] Add node-banning cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/901298 (https://phabricator.wikimedia.org/T331303) [20:28:52] (03PS1) 10Ahmon Dancy: scap.pp: Update comment about /etc/profile.d/mediawiki.sh [puppet] - 10https://gerrit.wikimedia.org/r/901299 [20:29:06] (03Merged) 10jenkins-bot: PostEdit: Increment the edit-count-for-task-type count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900685 (https://phabricator.wikimedia.org/T332319) (owner: 10Kosta Harlan) [20:29:09] (03Merged) 10jenkins-bot: LevelingUpManager: Handle links/link-recommendation collision [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900684 (https://phabricator.wikimedia.org/T332309) (owner: 10Kosta Harlan) [20:30:10] (03CR) 10Herron: Import mail dashboard into static Grizzly template (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [20:30:12] (03PS2) 10BCornwall: home_w_wiki_status: editable=false, Grizzly tag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901297 (https://phabricator.wikimedia.org/T331656) [20:30:52] (03PS3) 10BCornwall: home_w_wiki_status: editable=false, Grizzly tag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901297 (https://phabricator.wikimedia.org/T331656) [20:32:31] (03PS1) 10Dzahn: miscweb: switch 15.wikipedia.org to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901300 (https://phabricator.wikimedia.org/T331896) [20:32:46] (03CR) 10Herron: [C: 03+1] "thank you!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901297 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [20:33:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney Please see first batch proposal. We can move all those servers next week. @aborrero ca... [20:34:32] (03PS4) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [20:34:54] (03CR) 10BCornwall: [V: 03+2 C: 03+2] home_w_wiki_status: editable=false, Grizzly tag [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901297 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [20:35:31] (03PS5) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [20:35:36] (03CR) 10BCornwall: Import mail dashboard into static Grizzly template (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [20:37:17] (03CR) 10Herron: [C: 03+1] Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [20:38:17] (03PS6) 10BCornwall: Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) [20:38:21] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Import mail dashboard into static Grizzly template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901290 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [20:39:01] (03PS1) 10Cwhite: logstash: sample high-volume rdbms lib logging [puppet] - 
10https://gerrit.wikimedia.org/r/900718 (https://phabricator.wikimedia.org/T332228) [20:40:26] (03PS2) 10Cwhite: logstash: sample high-volume rdbms lib logging [puppet] - 10https://gerrit.wikimedia.org/r/900718 (https://phabricator.wikimedia.org/T332228) [20:46:05] (03Merged) 10jenkins-bot: TryNewTask: Set an array fallback if TryNewTaskOptOuts is null [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901146 (owner: 10Kosta Harlan) [20:46:07] Aca, Jdlrobson, MatmaRex — can any of your patches be rescheduled to another window? [20:46:08] (03PS1) 10BCornwall: Import Host Overview dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) [20:46:38] * TheresNoTime can stay on to deploy outside the window, depending on other deploys [20:46:53] TheresNoTime: I see `The following are unexpected commits pulled from origin for /srv/mediawiki-staging` with your name next to the commits [20:46:56] is that expected? [20:47:13] kostajh: yes, should be -labs config files [20:47:19] yep [20:47:25] TheresNoTime: so, ok to proceed? [20:47:30] yes :) [20:47:45] i'd prefer to have my arwiki config changes out today, since they were announced to the wiki's community [20:47:46] ty [20:47:50] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:901146|TryNewTask: Set an array fallback if TryNewTaskOptOuts is null]], [[gerrit:900685|PostEdit: Increment the edit-count-for-task-type count (T332319)]], [[gerrit:900684|LevelingUpManager: Handle links/link-recommendation collision (T332309)]] [20:47:57] T332319: Leveling up: Off-by-one error for edit-count-for-task-type in impression event - https://phabricator.wikimedia.org/T332319 [20:47:57] T332309: Uncaught TypeError: can't access property "difficulty", this.taskType is undefined - https://phabricator.wikimedia.org/T332309 [20:48:18] MatmaRex: no problem, if you're okay hanging on? 
[20:48:23] sure [20:49:13] (03PS4) 10Samtar: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [20:49:24] (03PS4) 10Samtar: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [20:49:27] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:901146|TryNewTask: Set an array fallback if TryNewTaskOptOuts is null]], [[gerrit:900685|PostEdit: Increment the edit-count-for-task-type count (T332319)]], [[gerrit:900684|LevelingUpManager: Handle links/link-recommendation collision (T332309)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmn [20:49:27] et [20:50:27] verifying the changes... [20:51:01] (be back in 5) [20:52:13] hmm, I informed the community about the changes, but I can reschedule it if Jdlrobson would like his patch to be deployed first. [20:53:47] syncing my changes [20:55:12] (back) [20:56:10] Aca: if you're okay hanging on as well, we can still get it deployed today :) I'm just mindful of people's time [20:56:45] no worries, I'm free :) [20:58:18] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:901146|TryNewTask: Set an array fallback if TryNewTaskOptOuts is null]], [[gerrit:900685|PostEdit: Increment the edit-count-for-task-type count (T332319)]], [[gerrit:900684|LevelingUpManager: Handle links/link-recommendation collision (T332309)]] (duration: 10m 28s) [20:58:25] T332319: Leveling up: Off-by-one error for edit-count-for-task-type in impression event - https://phabricator.wikimedia.org/T332319 [20:58:25] T332309: Uncaught TypeError: can't access property "difficulty", this.taskType is undefined - https://phabricator.wikimedia.org/T332309 [20:59:14] kostajh: okay for me to deploy? 
[20:59:15] TheresNoTime: I'm done. [20:59:19] :) [20:59:22] thanks for your patience, everyone [20:59:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [20:59:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230320T2100). [21:00:14] (03Merged) 10jenkins-bot: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [21:00:17] (03Merged) 10jenkins-bot: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [21:00:32] !log samtar@deploy2002 Started scap: Backport for [[gerrit:898845|Enable new Vector (2022) "Add topic" button at arwiki (T331313)]], [[gerrit:898846|Enable DiscussionTools usability improvements at arwiki (T329407)]] [21:00:37] !log extending UTC late backport window [21:00:40] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [21:00:40] T331313: [Config Change] Enable Vector (2022) "Add topic" button at partner wikis - https://phabricator.wikimedia.org/T331313 [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:40] (03CR) 10Daimona Eaytoy: [C: 03+1] wgAbuseFilterConditionLimit: Set default 
condition limit to 2000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901289 (https://phabricator.wikimedia.org/T309609) (owner: 10Samtar) [21:02:09] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:898845|Enable new Vector (2022) "Add topic" button at arwiki (T331313)]], [[gerrit:898846|Enable DiscussionTools usability improvements at arwiki (T329407)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:02:14] MatmaRex: live on mwdebug, can you test? [21:02:21] looking [21:03:24] TheresNoTime: seems good [21:03:30] syncing [21:03:58] (03PS3) 10Samtar: Rename project and project talk namespace for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901276 (https://phabricator.wikimedia.org/T332614) (owner: 10Acamicamacaraca) [21:07:28] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [21:07:40] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [21:09:07] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:898845|Enable new Vector (2022) "Add topic" button at arwiki (T331313)]], [[gerrit:898846|Enable DiscussionTools usability improvements at arwiki (T329407)]] (duration: 08m 34s) [21:09:10] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@1302ca2]: ensure swift_upload delete_after is an integer [21:09:13] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [21:09:14] T331313: [Config Change] Enable Vector (2022) "Add topic" button at partner wikis - https://phabricator.wikimedia.org/T331313 [21:09:24] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@1302ca2]: ensure swift_upload delete_after is an integer (duration: 00m 13s) [21:09:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901276 
(https://phabricator.wikimedia.org/T332614) (owner: 10Acamicamacaraca) [21:10:13] (03Merged) 10jenkins-bot: Rename project and project talk namespace for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901276 (https://phabricator.wikimedia.org/T332614) (owner: 10Acamicamacaraca) [21:10:27] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901276|Rename project and project talk namespace for shwiki (T332614)]] [21:10:33] T332614: Rename project and project talk namespace for the Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T332614 [21:10:36] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:11:54] !log samtar@deploy2002 samtar and aleksandar: Backport for [[gerrit:901276|Rename project and project talk namespace for shwiki (T332614)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:12:09] on it! [21:12:14] :) [21:14:46] (thanks TheresNoTime) [21:14:53] np! [21:16:09] (03CR) 10AikoChou: services: add the first lift wing stream to change-prop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [21:16:29] Namespace names updated accordingly. Page titles are updating slowly, which is kinda odd, but I guess that's expected [21:17:12] will sync [21:18:45] Jdlrobson: are you around for your patch after this one? [21:22:50] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901276|Rename project and project talk namespace for shwiki (T332614)]] (duration: 12m 22s) [21:22:56] T332614: Rename project and project talk namespace for the Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T332614 [21:23:12] Aca: live — can you check again and see if its behaving a little better? 
[21:23:26] 10SRE: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10RhinosF1) [21:23:29] Thanks! Yeah, okie! [21:24:40] (03PS1) 10AikoChou: ml-services: new revert-risk multilingual model and image [deployment-charts] - 10https://gerrit.wikimedia.org/r/901308 (https://phabricator.wikimedia.org/T332392) [21:24:47] 10SRE, 10ops-eqiad: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10Papaul) p:05Triage→03Medium [21:25:41] !log closing UTC late backport window, extended [21:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] (03PS1) 10Ahmon Dancy: devtools common.yaml: Set profile::mediawiki::scap_client::is_master to false [puppet] - 10https://gerrit.wikimedia.org/r/901309 [21:30:03] Welp, namespace names are updated in the lists and page info, which is fine, but page names are still behaving a little bit strange. Like, when you edit a page, it will be displayed as "Wikipedija:Potpis", but if you view it right now, it is displayed as "Wikipedia:Potpis". Perhaps it will take some time until it is updated accordingly. [21:30:20] okay, let me take a look [21:30:47] Yeah, try to reproduce it [21:33:15] Aca: purging the page seems to resolve it [21:33:40] yep, tried it now [21:34:22] TheresNoTime: im around [21:34:31] it's not ideal but it can wait until tomorrow though [21:34:38] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/namespaceDupes.php --wiki shwiki --fix` T332614 [21:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:43] T332614: Rename project and project talk namespace for the Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T332614 [21:34:52] welp, not a concern, then. Thanks for deployment! [21:34:54] Jdlrobson: I'll deploy now [21:35:04] are you sure? [21:35:17] yeah :) [21:35:21] ok let's do it! 
[21:35:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901275 (https://phabricator.wikimedia.org/T331905) (owner: 10Jdlrobson) [21:38:32] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:40:10] (03PS1) 10Ahmon Dancy: scap2: Remove unused wmflabs_master stuff [puppet] - 10https://gerrit.wikimedia.org/r/901310 [21:46:25] (03Abandoned) 10Ahmon Dancy: scap2: Remove unused wmflabs_master stuff [puppet] - 10https://gerrit.wikimedia.org/r/901310 (owner: 10Ahmon Dancy) [21:50:01] (03Merged) 10jenkins-bot: Add languages to Minerva HTML [skins/MinervaNeue] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901275 (https://phabricator.wikimedia.org/T331905) (owner: 10Jdlrobson) [21:50:16] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901275|Add languages to Minerva HTML (T331905)]] [21:50:23] T331905: Make languages available to index crawlers in BODY of mobile version of article pages - https://phabricator.wikimedia.org/T331905 [21:52:13] !log samtar@deploy2002 jdlrobson and samtar: Backport for [[gerrit:901275|Add languages to Minerva HTML (T331905)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:52:26] Jdlrobson: live on mwdebug for testing [21:52:45] .. [21:53:04] almost done [21:54:07] LGTM! [21:54:13] thanks TheresNoTime for squeezing this in [21:54:16] syncing [21:54:25] no problem :) [21:55:51] nice [22:00:01] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901275|Add languages to Minerva HTML (T331905)]] (duration: 09m 45s) [22:00:12] and live [22:00:14] T331905: Make languages available to index crawlers in BODY of mobile version of article pages - https://phabricator.wikimedia.org/T331905 [22:03:09] TheresNoTime: LGTM in production. 
[22:03:12] thanks again [22:03:30] ^^ [22:10:39] (03CR) 10Dzahn: [C: 03+2] miscweb: switch 15.wikipedia.org to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901300 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [22:16:30] (03CR) 10Dzahn: [C: 03+2] scap.pp: Update comment about /etc/profile.d/mediawiki.sh [puppet] - 10https://gerrit.wikimedia.org/r/901299 (owner: 10Ahmon Dancy) [22:42:08] (03PS1) 10Cwhite: logstash: add tag on json parsing log field [puppet] - 10https://gerrit.wikimedia.org/r/900719 (https://phabricator.wikimedia.org/T234565) [22:44:21] (03CR) 10CI reject: [V: 04-1] logstash: add tag on json parsing log field [puppet] - 10https://gerrit.wikimedia.org/r/900719 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:47:36] (03PS2) 10Cwhite: logstash: add tag on json parsing log field [puppet] - 10https://gerrit.wikimedia.org/r/900719 (https://phabricator.wikimedia.org/T234565) [22:59:28] (03CR) 10Cwhite: [C: 03+2] logstash: add tag on json parsing log field [puppet] - 10https://gerrit.wikimedia.org/r/900719 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:08:00] (03PS1) 10Dzahn: miscweb: switch annual and bienvenida microsites to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901318 (https://phabricator.wikimedia.org/T331896) [23:08:16] PROBLEM - MegaRAID on db1154 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:08:17] ACKNOWLEDGEMENT - MegaRAID on db1154 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T332649 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:08:21] 10SRE, 10ops-eqiad: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (10ops-monitoring-bot) [23:09:05] (03PS1) 10Dzahn: miscweb: switch tendril and dbtree microsites to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901319 
(https://phabricator.wikimedia.org/T331896) [23:10:53] (03PS1) 10Dzahn: miscweb: switch security.wm.org microsite to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901320 (https://phabricator.wikimedia.org/T331896) [23:12:21] (03PS1) 10Dzahn: miscweb: switch sitemaps, transparency and tr-archives to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901321 (https://phabricator.wikimedia.org/T331896) [23:28:57] (03PS3) 10EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) [23:31:54] (03CR) 10Tim Starling: [C: 03+2] Unprovision the "swift" dashboard [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: 10Tim Starling) [23:36:34] (03CR) 10EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [23:57:22] (03PS1) 10Tim Starling: Temporarily disable xenon/excimer for switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165)