[00:00:02] FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0000)
[00:00:49] (CR) CDanis: [C:+1] add discovery records for gerrit as CNAMEs to public names [dns] - https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[00:02:10] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338235 (Papaul)
[00:03:21] (CR) Dzahn: [C:+2] add discovery records for gerrit as CNAMEs to public names [dns] - https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[00:03:47] !log dzahn@dns1004 START - running authdns-update
[00:04:27] RESOLVED: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:04:37] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338243 (Papaul) @cmooney thanks for the feedback we can clarify this tomorrow during the meeting and have all ready and run it by @ayounsi when he is back.
[00:04:44] !log dzahn@dns1004 END - running authdns-update
[00:05:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[00:07:51] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338254 (Papaul)
[00:08:32] FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:27] FIRING: [5x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:46] !log cdanis@dns1004 START - running authdns-update
[00:10:21] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338258 (Papaul)
[00:10:28] !log cdanis@dns1004 END - running authdns-update
[00:11:39] (CR) Xcollazo: [C:+1] Add the python3-pymysql package to the analytics::refinery profile [puppet] - https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[00:13:32] FIRING: [3x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:14:27] RESOLVED: [3x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:17:15] jouncebot: nowandnext
[00:17:16] For the next 0 hour(s) and 42 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0000)
[00:17:16] In 2 hour(s) and 42 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0300)
[00:21:19] (PS2) Zabe: Using Hadoop for MostTranscludedPages on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199522 (https://phabricator.wikimedia.org/T309738)
[00:21:42] (CR) Zabe: [C:+2] Using Hadoop for MostTranscludedPages on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199522 (https://phabricator.wikimedia.org/T309738) (owner: Zabe)
[00:22:31] (Merged) jenkins-bot: Using Hadoop for MostTranscludedPages on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199522 (https://phabricator.wikimedia.org/T309738) (owner: Zabe)
[00:23:00] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]]
[00:23:03] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:23:04] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738
[00:25:04] !log zabe@deploy2002 zabe: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:26:39] !log zabe@deploy2002 zabe: Continuing with sync
[00:26:58] (PS1) Dzahn: tcpproxy: add basic logging config [puppet] - https://gerrit.wikimedia.org/r/1201311 (https://phabricator.wikimedia.org/T408532)
[00:27:56] (PS2) Dzahn: tcpproxy: add basic logging config [puppet] - https://gerrit.wikimedia.org/r/1201311 (https://phabricator.wikimedia.org/T408532)
[00:28:03] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:28:28] (CR) Dzahn: "This is what is currently in the config on tcp-proxy1001 but not puppetized yet." [puppet] - https://gerrit.wikimedia.org/r/1201311 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn)
[00:30:17] FIRING: ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:32:05] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]] (duration: 09m 05s)
[00:32:08] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738
[00:32:25] (PS1) Dzahn: site: apply tcpproxy role on all VMs created for it [puppet] - https://gerrit.wikimedia.org/r/1201312 (https://phabricator.wikimedia.org/T408532)
[00:35:17] RESOLVED: ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:03] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201314
[00:38:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201314 (owner: TrainBranchBot)
[00:42:34] (PS1) Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259)
[00:43:03] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:44:44] (CR) CI reject: [V:-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[00:51:27] FIRING: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:53:31] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201314 (owner: TrainBranchBot)
[00:56:27] RESOLVED: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:56:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[00:58:03] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:03:03] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201319
[01:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201319 (owner: TrainBranchBot)
[01:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:18:22] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135 (Xaosflux) NEW
[01:19:34] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11338446 (Xaosflux)
[01:20:31] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11338449 (Xaosflux) Other index have also been wrong, including showing ZERO tickets when there are actually tickets, making the folder not a...
[01:33:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:58:42] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Mail: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11338499 (DamianZaremba)
[01:59:25] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Mail: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11338500 (DamianZaremba) Tagging SRE as not sure which team is responsible.
[02:08:21] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.1 [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201323 (https://phabricator.wikimedia.org/T408271)
[02:08:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.1 [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201323 (https://phabricator.wikimedia.org/T408271) (owner: TrainBranchBot)
[02:09:40] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201319 (owner: TrainBranchBot)
[02:22:15] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.1 [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201323 (https://phabricator.wikimedia.org/T408271) (owner: TrainBranchBot)
[02:33:03] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:35:15] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11338539 (Xaosflux) Got 1 outbound email that was tested, it had a 30 hour delay.; will start a new timer
[02:46:17] FIRING: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:51:17] FIRING: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:18] (PS1) Bking: Revert "wdqs: allowlist new endpoints" [puppet] - https://gerrit.wikimedia.org/r/1201326
[02:53:24] (CR) Bking: [V:+2 C:+2] Revert "wdqs: allowlist new endpoints" [puppet] - https://gerrit.wikimedia.org/r/1201326 (owner: Bking)
[02:56:17] RESOLVED: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:58:03] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0300)
[03:03:11] !log bking@cumin2002 restart wdqs-blazegraph.service in CODFW to apply 1201326 T409132
[03:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:14] T409132: WDQS CODFW high load/lag incident - https://phabricator.wikimedia.org/T409132
[03:21:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:26:17] FIRING: [10x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:31:17] FIRING: [14x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:17] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0400)
[04:06:04] SRE, Traffic, FY2025-26 WE3.3 Engaging core audiences, Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11338576 (Jdrewniak) Thanks for those graphs @SToyofuku-WMF , l...
[04:08:52] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:29:49] SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11338621 (JJMC89)
[04:35:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:37:45] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[04:40:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:41:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:43:45] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11338629 (Aafi) Outbound emails from wm-deoband sent yesterday & days earlier, have been reported to be received today. Tested with 2025110410006642, received within a minute or so.
[04:44:13] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11338634 (jhathaway) @Xaosflux the outbound queue has now been cleared of all backscatter bounce emails, so delivery times should be back to normal.
[04:46:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:21] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11338645 (jhathaway) After some analysis today, I think the cause of the bounces were as follows: # Spammers set thei...
[04:55:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:57:45] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[04:59:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0500)
[05:00:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:02:31] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.23 (duration: 02m 28s)
[05:03:52] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:04:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:05:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:08:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:16:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:21:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:24:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:29:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:33:13] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11338659 (1F616EMO) >>! In T408632#11338645, @jhathaway wrote: >Spammers set their Return-Path to info@wikimedia.org,...
[05:33:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:34:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:35:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:45:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:48:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:49:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:50:07] (PS1) DLynch: DiscussionTools: turn on automatic topic subscriptions for all editors [mediawiki-config] - https://gerrit.wikimedia.org/r/1201338 (https://phabricator.wikimedia.org/T290778)
[06:12:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[06:13:21] ops-eqiad, SRE, DBA, DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11338707 (Marostegui) Open→Resolved RAID back to normal and host back green in icinga. It all worked fine!
[06:14:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[06:14:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84684 and previous config saved to /var/cache/conftool/dbconfig/20251104-061449-marostegui.json
[06:14:52] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[06:17:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84685 and previous config saved to /var/cache/conftool/dbconfig/20251104-061745-marostegui.json
[06:31:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:32:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P84686 and previous config saved to /var/cache/conftool/dbconfig/20251104-063253-marostegui.json
[06:39:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:40:52] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11338730 (Geagea) permissions-en shows 4, when there are actually only 3
[06:43:18] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11338731 (Geagea) Thanks, I think that the delay of mails from VRT solved. I've got my notifications at once.
[06:44:17] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11338732 (Geagea) Thanks, I think that the delay of mails from VRT solved. I've got my notifications at once. Also received feedback from costumers.
[06:48:03] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:48:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P84687 and previous config saved to /var/cache/conftool/dbconfig/20251104-064803-marostegui.json [06:48:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:52:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:54:05] (03CR) 10Filippo Giunchedi: [C:03+1] neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack [puppet] - 10https://gerrit.wikimedia.org/r/1200016 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [06:54:17] (03CR) 10Filippo Giunchedi: [C:03+1] nova: enable nrpe2nodexp wrapper on check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1200018 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [06:56:17] RESOLVED: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:57:09] (03PS1) 10DCausse: Revert^3 "cirrus: enable completion search with 
defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201340 [06:57:39] (03PS2) 10DCausse: Revert^3 "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201340 [06:57:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201340 (owner: 10DCausse) [06:58:03] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:59:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0700). 
[07:03:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84688 and previous config saved to /var/cache/conftool/dbconfig/20251104-070311-marostegui.json [07:03:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:03:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:03:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:03:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T407997)', diff saved to https://phabricator.wikimedia.org/P84689 and previous config saved to /var/cache/conftool/dbconfig/20251104-070356-marostegui.json [07:06:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T407997)', diff saved to https://phabricator.wikimedia.org/P84690 and previous config saved to /var/cache/conftool/dbconfig/20251104-070653-marostegui.json [07:22:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P84691 and previous config saved to /var/cache/conftool/dbconfig/20251104-072201-marostegui.json [07:27:13] (03PS1) 10Marostegui: db2176: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201391 (https://phabricator.wikimedia.org/T407463) [07:27:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:28:12] (03CR) 10Marostegui: [C:03+2] db2176: 
Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201391 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [07:28:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: Maintenance [07:28:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2176 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84692 and previous config saved to /var/cache/conftool/dbconfig/20251104-072854-marostegui.json [07:31:33] (03CR) 10Kosta Harlan: [C:03+1] Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T409079) (owner: 10STran) [07:37:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84693 and previous config saved to /var/cache/conftool/dbconfig/20251104-073707-root.json [07:47:00] !log ozge@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [07:48:59] !log ozge@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [07:52:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84694 and previous config saved to /var/cache/conftool/dbconfig/20251104-075213-root.json [07:52:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [07:52:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T407997)', diff saved to https://phabricator.wikimedia.org/P84695 and previous config saved to /var/cache/conftool/dbconfig/20251104-075239-marostegui.json [07:52:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:53:43] (03PS1) 10Ozge: feat: updates 
addalink base url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201541 [07:55:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T407997)', diff saved to https://phabricator.wikimedia.org/P84696 and previous config saved to /var/cache/conftool/dbconfig/20251104-075510-marostegui.json [07:55:30] (03PS1) 10Marostegui: db1178: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201542 [07:56:05] (03CR) 10Marostegui: [C:03+2] db1178: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201542 (owner: 10Marostegui) [07:57:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1178.eqiad.wmnet with reason: Maintenance [07:57:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1178 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84697 and previous config saved to /var/cache/conftool/dbconfig/20251104-075718-marostegui.json [07:57:28] (03CR) 10Ozge: [C:03+2] feat: updates addalink base url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201541 (owner: 10Ozge) [07:59:22] (03Merged) 10jenkins-bot: feat: updates addalink base url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201541 (owner: 10Ozge) [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T0800). [08:00:05] Tchanders and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:08] o/ [08:00:18] I'll get started on mine [08:00:38] o/ [08:00:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T409079) (owner: 10STran) [08:00:58] !log ozge@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [08:01:36] (03Merged) 10jenkins-bot: Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T409079) (owner: 10STran) [08:02:14] !log ozge@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [08:02:21] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1200083|Deploy temporary accounts to enwiki (T409079)]] [08:02:24] T409079: Deploy Temporary accounts to English Wikipedia - https://phabricator.wikimedia.org/T409079 [08:02:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:32] !log tchanders@deploy2002 tchanders, stran: Backport for [[gerrit:1200083|Deploy temporary accounts to enwiki (T409079)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:05:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84698 and previous config saved to /var/cache/conftool/dbconfig/20251104-080522-root.json [08:05:57] testing... 
[08:07:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84699 and previous config saved to /var/cache/conftool/dbconfig/20251104-080719-root.json [08:08:38] !log ozge@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [08:10:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P84700 and previous config saved to /var/cache/conftool/dbconfig/20251104-081017-marostegui.json [08:10:25] !log tchanders@deploy2002 tchanders, stran: Continuing with sync [08:10:55] !log ozge@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [08:14:44] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200083|Deploy temporary accounts to enwiki (T409079)]] (duration: 12m 22s) [08:14:47] T409079: Deploy Temporary accounts to English Wikipedia - https://phabricator.wikimedia.org/T409079 [08:15:32] Mine is done. Over to you dcausse! [08:16:43] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:53] Tchanders: thanks! 
[08:17:16] oh I feel like I should welcome enwiki to TA gang /joke [08:17:37] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:18:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201340 (owner: 10DCausse) [08:18:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:23] (03Merged) 10jenkins-bot: Revert^3 "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201340 (owner: 10DCausse) [08:19:43] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1201340|Revert^3 "cirrus: enable completion search with defaultsort A/B test"]] [08:20:08] (03CR) 10Muehlenhoff: [C:03+2] Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [08:20:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84701 and previous config saved to /var/cache/conftool/dbconfig/20251104-082031-root.json [08:20:43] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:48] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1201340|Revert^3 "cirrus: enable completion search with defaultsort A/B test"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:22:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84702 and previous config saved to /var/cache/conftool/dbconfig/20251104-082226-root.json [08:24:51] !log dcausse@deploy2002 dcausse: Continuing with sync [08:25:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P84703 and previous config saved to /var/cache/conftool/dbconfig/20251104-082525-marostegui.json [08:25:39] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:29:04] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201340|Revert^3 "cirrus: enable completion search with defaultsort A/B test"]] (duration: 09m 20s) [08:29:54] !log UTC morning backport window done [08:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:23] (03PS1) 10Filippo Giunchedi: pontoon: fix unknown nova image fetching [puppet] - 10https://gerrit.wikimedia.org/r/1201544 [08:32:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:07] (03CR) 10Elukey: [C:03+2] "I got the approval from Chris Albon on Slack, I think he didn't have time to follow up here, so I am going to proceed and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1201187 (https://phabricator.wikimedia.org/T408579) (owner: 10Dzahn) [08:34:14] (03CR) 10Filippo Giunchedi: [C:03+1] "Not production, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1201544 (owner: 10Filippo Giunchedi) [08:34:23] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix unknown nova image fetching [puppet] - 10https://gerrit.wikimedia.org/r/1201544 (owner: 10Filippo Giunchedi) [08:35:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11338856 (10elukey) I got in touch with @calbon on slack and got the approval from him, I think he didn't have time to follow up so I am go... [08:35:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84704 and previous config saved to /var/cache/conftool/dbconfig/20251104-083538-root.json [08:37:13] (03PS18) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [08:40:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11338867 (10elukey) ` elukey@krb1002:~$ sudo manage_principals.py get dpogorzelski get_principal: Principal does not exist while retrieving... 
[08:40:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T407997)', diff saved to https://phabricator.wikimedia.org/P84705 and previous config saved to /var/cache/conftool/dbconfig/20251104-084032-marostegui.json [08:40:36] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:40:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [08:40:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T407997)', diff saved to https://phabricator.wikimedia.org/P84706 and previous config saved to /var/cache/conftool/dbconfig/20251104-084056-marostegui.json [08:42:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T407997)', diff saved to https://phabricator.wikimedia.org/P84707 and previous config saved to /var/cache/conftool/dbconfig/20251104-084327-marostegui.json [08:43:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:44:43] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:47:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:49:16] (03PS19) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 
10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [08:49:33] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:19] elukey@cumin2002 provision (PID 3537380) is awaiting input [08:50:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84708 and previous config saved to /var/cache/conftool/dbconfig/20251104-085043-root.json [08:51:57] (03CR) 10Muehlenhoff: [C:03+2] Remove code to install hp-health [puppet] - 10https://gerrit.wikimedia.org/r/1201030 (owner: 10Muehlenhoff) [08:53:41] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:53:56] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:54:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:54:32] !log installing squid security updates [08:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:28] (03CR) 10Majavah: [C:03+2] P:openstack::designate: Remove check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/1200306 (owner: 10Majavah) [08:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [08:56:56] (03CR) 10Majavah: [C:03+2] toolforge::toolviews: Output proper Prometheus metrics [puppet] - 
10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [08:57:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P84709 and previous config saved to /var/cache/conftool/dbconfig/20251104-085834-marostegui.json [08:58:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [09:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:11:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:11:54] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 (owner: 10Majavah) [09:12:14] (03CR) 
10Majavah: [C:03+2] P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 (owner: 10Majavah) [09:13:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P84710 and previous config saved to /var/cache/conftool/dbconfig/20251104-091342-marostegui.json [09:16:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:16:50] (03PS1) 10Ozge: feat: addalink add both urls as env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201548 [09:19:11] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Restore original X-Original-URI value [puppet] - 10https://gerrit.wikimedia.org/r/1201549 (https://phabricator.wikimedia.org/T409008) [09:19:26] (03CR) 10Majavah: "Tested on Toolsbeta." 
[puppet] - 10https://gerrit.wikimedia.org/r/1201549 (https://phabricator.wikimedia.org/T409008) (owner: 10Majavah) [09:19:26] (03CR) 10Ozge: [C:03+2] feat: addalink add both urls as env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201548 (owner: 10Ozge) [09:21:00] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Restore original X-Original-URI value [puppet] - 10https://gerrit.wikimedia.org/r/1201549 (https://phabricator.wikimedia.org/T409008) (owner: 10Majavah) [09:21:15] (03Merged) 10jenkins-bot: feat: addalink add both urls as env [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201548 (owner: 10Ozge) [09:21:41] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Restore original X-Original-URI value [puppet] - 10https://gerrit.wikimedia.org/r/1201549 (https://phabricator.wikimedia.org/T409008) (owner: 10Majavah) [09:23:52] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:26:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:28:29] !log ozge@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:28:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T407997)', diff saved to https://phabricator.wikimedia.org/P84711 and previous config saved to /var/cache/conftool/dbconfig/20251104-092850-marostegui.json [09:28:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [09:29:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [09:29:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84712 and previous config saved to /var/cache/conftool/dbconfig/20251104-092913-marostegui.json [09:29:54] !log ozge@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:30:02] (03PS1) 10Filippo Giunchedi: pontoon: new stack demo [puppet] - 10https://gerrit.wikimedia.org/r/1201550 [09:30:02] (03PS1) 10Filippo Giunchedi: pontoon: add rolegroup bootstrap to demo [puppet] - 10https://gerrit.wikimedia.org/r/1201551 [09:30:02] (03PS1) 10Filippo Giunchedi: pontoon: improve README and instructions [puppet] - 10https://gerrit.wikimedia.org/r/1201552 [09:31:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:31:48] (03PS2) 10Filippo Giunchedi: pontoon: improve README and instructions [puppet] - 10https://gerrit.wikimedia.org/r/1201552 [09:31:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84713 and previous config saved to /var/cache/conftool/dbconfig/20251104-093148-marostegui.json [09:31:59] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "Not production, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1201552 (owner: 10Filippo Giunchedi) [09:33:52] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:45] (03CR) 10Vgutierrez: [C:03+1] ncmonitor: Add MarkMonitor API key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [09:38:52] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:41:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:41:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:43:52] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:46:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P84714 and previous config saved to /var/cache/conftool/dbconfig/20251104-094658-marostegui.json [09:48:01] (03PS1) 10MVernon: aptrepo: add conftool-trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201557 (https://phabricator.wikimedia.org/T407513) [09:51:11] (03CR) 10Muehlenhoff: aptrepo: add conftool-trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201557 (https://phabricator.wikimedia.org/T407513) (owner: 10MVernon) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:52] (03CR) 10Btullis: [V:03+1 C:03+2] Add the python3-pymysql package to the analytics::refinery profile [puppet] - 10https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [09:55:30] (03PS2) 10MVernon: aptrepo: add conftool-trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201557 (https://phabricator.wikimedia.org/T407513) [09:55:47] (03CR) 10MVernon: "oops, thanks, well spotted." 
[puppet] - 10https://gerrit.wikimedia.org/r/1201557 (https://phabricator.wikimedia.org/T407513) (owner: 10MVernon) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:01:47] !log btullis@deploy2002 Started deploy [analytics/hdfs-tools/deploy@bb26b34]: Deploying after updating targets [10:01:59] !log btullis@deploy2002 Finished deploy [analytics/hdfs-tools/deploy@bb26b34]: Deploying after updating targets (duration: 00m 24s) [10:02:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P84715 and previous config saved to /var/cache/conftool/dbconfig/20251104-100206-marostegui.json [10:03:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. (At some it would be good if we could standardise on a common prefix for the artefact imports, something like cibuild-conftool" [puppet] - 10https://gerrit.wikimedia.org/r/1201557 (https://phabricator.wikimedia.org/T407513) (owner: 10MVernon) [10:03:35] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11339074 (10Aklapper) @Shisma: Hi, can you please answer on this ticket? Otherwise it will be declined. Thanks. 
[10:03:46] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11339075 (10Aklapper) a:03Shisma [10:05:31] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11339079 (10Xaosflux) Prior test cleared in 90mins Current tests, clearing in normal time ~2mins [10:05:55] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11339085 (10Xaosflux) p:05High→03Medium [10:08:14] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11339090 (10Xaosflux) Perhaps the junk queue should not be allowed to send agent notifications? [10:16:02] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11339110 (10ItamarWMDE) The scripts being discussed are not the PHP maintenance scripts, but the bash scripts currently invoked by airflow: - https://gerrit.wikimedia.org/r/plugins/gitiles/... 
[10:17:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84716 and previous config saved to /var/cache/conftool/dbconfig/20251104-101713-marostegui.json [10:17:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:17:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [10:18:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [10:18:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T407997)', diff saved to https://phabricator.wikimedia.org/P84717 and previous config saved to /var/cache/conftool/dbconfig/20251104-101845-marostegui.json [10:21:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T407997)', diff saved to https://phabricator.wikimedia.org/P84718 and previous config saved to /var/cache/conftool/dbconfig/20251104-102121-marostegui.json [10:23:24] (03PS1) 10Btullis: Switch an-launcher1002 to the insetup role prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) [10:24:56] (03PS2) 10Btullis: Switch an-launcher1002 to the insetup role prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) [10:25:01] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [10:26:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7535/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) (owner: 
10Btullis) [10:34:45] (03CR) 10Clément Goubert: [C:03+2] restgateway: update spec-json-wikimedia to use www prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198406 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [10:36:29] (03Merged) 10jenkins-bot: restgateway: update spec-json-wikimedia to use www prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198406 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [10:36:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P84719 and previous config saved to /var/cache/conftool/dbconfig/20251104-103629-marostegui.json [10:36:45] FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:39:15] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:39:18] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:40:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1200365 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [10:40:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:40:26] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:40:31] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:40:39] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:41:04] (03PS3) 10Btullis: Switch an-launcher1002 to the insetup role prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) [10:41:45] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group1 50% [puppet] - 
10https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) [10:42:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7536/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) (owner: 10Btullis) [10:42:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [10:45:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:30] (03CR) 10Marostegui: [C:03+1] site.pp, es2031.yaml: Decommission es2031 [puppet] - 10https://gerrit.wikimedia.org/r/1201073 (https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [10:47:47] (03CR) 10Marostegui: [C:03+1] site.pp, es2030.yaml: Decommission es2030 [puppet] - 10https://gerrit.wikimedia.org/r/1201072 (https://phabricator.wikimedia.org/T408409) (owner: 10Federico Ceratto) [10:47:56] (03CR) 10Marostegui: [C:03+1] site.pp, es2029.yaml: Decommission es2029 [puppet] - 10https://gerrit.wikimedia.org/r/1201071 (https://phabricator.wikimedia.org/T408408) (owner: 10Federico Ceratto) [10:48:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11339222 (10Krd) If I checked correctly, nobody is subscribed to the Junk queue, so no notifications for that should hav... 
[10:49:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Remove wmgULSPosition for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [10:51:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P84720 and previous config saved to /var/cache/conftool/dbconfig/20251104-105136-marostegui.json [10:52:17] (03PS1) 10Marostegui: db2216: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201571 (https://phabricator.wikimedia.org/T407463) [10:52:32] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:53:04] (03CR) 10Marostegui: [C:03+2] db2216: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201571 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [10:53:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2216.codfw.wmnet with reason: Maintenance [10:53:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2216 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84721 and previous config saved to /var/cache/conftool/dbconfig/20251104-105339-marostegui.json [10:54:02] !log uploaded openjdk-8 8u472-ga-1~deb11u1 to apt.wikimedia.org (forward port of latest Java 8 security updates) [10:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:56:53] (03PS1) 10Marostegui: db1192: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201573 [10:57:20] (03PS1) 10JMeybohm: 
P:conftool::hiddenparma: enable ipblock and ipblock_source policies [puppet] - 10https://gerrit.wikimedia.org/r/1201574 (https://phabricator.wikimedia.org/T402014) [10:57:33] (03CR) 10Marostegui: [C:03+2] db1192: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1201573 (owner: 10Marostegui) [10:58:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1192.eqiad.wmnet with reason: Maintenance [10:58:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1192 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84722 and previous config saved to /var/cache/conftool/dbconfig/20251104-105851-marostegui.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1100) [11:01:13] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2029.yaml: Decommission es2029 [puppet] - 10https://gerrit.wikimedia.org/r/1201071 (https://phabricator.wikimedia.org/T408408) (owner: 10Federico Ceratto) [11:01:16] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2030.yaml: Decommission es2030 [puppet] - 10https://gerrit.wikimedia.org/r/1201072 (https://phabricator.wikimedia.org/T408409) (owner: 10Federico Ceratto) [11:01:19] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2031.yaml: Decommission es2031 [puppet] - 10https://gerrit.wikimedia.org/r/1201073 (https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [11:01:49] (03PS1) 10Stevemunene: superset: upgrade the memcached container image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201575 (https://phabricator.wikimedia.org/T409151) [11:01:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84723 and previous config saved to /var/cache/conftool/dbconfig/20251104-110152-root.json [11:06:19] (03PS2) 10Stevemunene: superset: upgrade the memcached 
container image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201575 (https://phabricator.wikimedia.org/T409151) [11:06:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T407997)', diff saved to https://phabricator.wikimedia.org/P84724 and previous config saved to /var/cache/conftool/dbconfig/20251104-110643-marostegui.json [11:06:47] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:06:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84725 and previous config saved to /var/cache/conftool/dbconfig/20251104-110655-root.json [11:07:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [11:08:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [11:11:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:13:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:14:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84726 and previous config saved to /var/cache/conftool/dbconfig/20251104-111401-marostegui.json [11:14:08] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84727 and previous config saved to /var/cache/conftool/dbconfig/20251104-111658-root.json 
[11:18:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84728 and previous config saved to /var/cache/conftool/dbconfig/20251104-111814-marostegui.json [11:22:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: 10', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251104-112201-root.json [11:22:34] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11339319 (10elukey) I opened a Supermicro ticket to explain the problem, we'll see if they have suggestions. [11:25:55] (03PS1) 10Aqu: Analytics: Ops-Week fix project_namespace_map generation [puppet] - 10https://gerrit.wikimedia.org/r/1201579 [11:26:48] (03PS2) 10Aqu: Analytics: Ops-Week fix project_namespace_map generation [puppet] - 10https://gerrit.wikimedia.org/r/1201579 [11:28:52] (03PS12) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [11:31:22] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es2029 - https://phabricator.wikimedia.org/T408408#11339345 (10FCeratto-WMF) [11:31:47] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es2030 - https://phabricator.wikimedia.org/T408409#11339348 (10FCeratto-WMF) [11:32:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84730 and previous config saved to /var/cache/conftool/dbconfig/20251104-113205-root.json [11:32:14] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es2031 - https://phabricator.wikimedia.org/T408410#11339351 (10FCeratto-WMF) [11:32:29] (03PS3) 10Daniel Kinzler: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) [11:33:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P84731 and previous config saved to /var/cache/conftool/dbconfig/20251104-113322-marostegui.json [11:33:49] !log Upgrading and restarting CI Jenkins | T404856 [11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:23] (03CR) 10Stevemunene: [C:03+1] Switch an-launcher1002 to the insetup role prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) (owner: 10Btullis) [11:37:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84732 and previous config saved to /var/cache/conftool/dbconfig/20251104-113711-root.json [11:38:48] !log installing Java 8 security updates on Bullseye [11:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:45] (03CR) 10Btullis: "I don't think that this is necessary. 
The problem was a missing python package that was fixed in https://gerrit.wikimedia.org/r/c/operatio" [puppet] - 10https://gerrit.wikimedia.org/r/1201579 (owner: 10Aqu) [11:45:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162 (10Jhancock.wm) 03NEW [11:46:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11339413 (10Ladsgroup) [11:47:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84733 and previous config saved to /var/cache/conftool/dbconfig/20251104-114712-root.json [11:47:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11339414 (10Jhancock.wm) a:05Jhancock.wm→03Andrew @Andrew please fill out the racking details in this task and make any updates that are needed for the pres... 
[11:48:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P84734 and previous config saved to /var/cache/conftool/dbconfig/20251104-114830-marostegui.json [11:52:11] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: remove decommed mwdebug hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200332 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [11:52:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84735 and previous config saved to /var/cache/conftool/dbconfig/20251104-115217-root.json [11:54:15] (03PS1) 10Ladsgroup: Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) [11:56:13] (03PS2) 10Ladsgroup: Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) [11:57:10] !log temporary disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199247 (T408060) [11:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:13] T408060: Distinguish request classes based on user-agent declaration - https://phabricator.wikimedia.org/T408060 [11:58:42] (03Abandoned) 10Esanders: Enable DiscussionTools auto subscriptions for all interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076737 (https://phabricator.wikimedia.org/T290778) (owner: 10Esanders) [12:00:56] !log upgrade lsw1-d3-eqiad to SR-Linux v24.10.3 [12:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:19] (03CR) 10Fabfur: [C:03+2] P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [12:03:39] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84736 and previous config saved to /var/cache/conftool/dbconfig/20251104-120338-marostegui.json [12:03:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:03:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:04:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T407997)', diff saved to https://phabricator.wikimedia.org/P84737 and previous config saved to /var/cache/conftool/dbconfig/20251104-120401-marostegui.json [12:08:12] !log re-enable puppet on A:cp (T408060) [12:08:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T407997)', diff saved to https://phabricator.wikimedia.org/P84739 and previous config saved to /var/cache/conftool/dbconfig/20251104-120812-marostegui.json [12:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:15] T408060: Distinguish request classes based on user-agent declaration - https://phabricator.wikimedia.org/T408060 [12:10:26] 14SRE-Sprint-Week-Sustainability-March2023, 06serviceops-radar: Adopt SLIs / SLOs for sessionstore - https://phabricator.wikimedia.org/T256629#11339510 (10LSobanski) Removing the #wikimedia-incident-actionable tag. If this is still a solution to a possible incident root cause please add the tag back and consi... [12:11:39] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence, 07Wikimedia-Slow-DB-Query: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#11339519 (10LSobanski) Removing the #wikimedia-incident-actionable tag. If this is still a solution to a possible inciden... 
[12:12:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:14:10] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 10Cassandra, 06Data-Persistence: Document best-practice for hinted-handoff - https://phabricator.wikimedia.org/T315517#11339528 (10LSobanski) Removing the #wikimedia-incident-actionable tag. If this is still a solution to a possible incident root c... [12:17:41] FIRING: [2x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:22:12] 14SRE-Sprint-Week-Sustainability-March2023, 06collaboration-services, 10Phabricator, 06serviceops-radar, and 2 others: Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879#11339580 (10LSobanski) p:05Low→03Medium [12:22:41] FIRING: [22x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:23:06] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS: openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683#11339583 (10LSobanski) Removing the #wikimedia-incident-actionable tag. If this is still a solution to a possible incident root cause plea... 
[12:23:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P84740 and previous config saved to /var/cache/conftool/dbconfig/20251104-122320-marostegui.json [12:25:13] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for blake - https://phabricator.wikimedia.org/T409166 (10Blake) 03NEW [12:27:30] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11339617 (10Jclark-ctr) make sure to upload tsr report when submitting tickets it will help speed up turnaround time with dell ` Work Order: SR218119927 Denial Notes Thank you for... [12:27:35] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06serviceops-radar: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059#11339619 (10LSobanski) Removing the #wikimedia-incident-actionable tag. I... [12:27:41] FIRING: [38x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:28:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11339623 (10Jclark-ctr) Make sure to upload tsr reports when submitting tickets ` Work Order: SR218125316 Denial Notes Thank you for submitting the request. We would require more details in order to proc... 
[12:28:59] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1237 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1201592 (https://phabricator.wikimedia.org/T409167) [12:29:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2191 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1201593 (https://phabricator.wikimedia.org/T409168) [12:29:40] (03PS1) 10Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1201594 (https://phabricator.wikimedia.org/T409168) [12:32:41] FIRING: [60x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:33:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11339668 (10Jclark-ctr) Eta for delivery Arriving On Nov 7, 2025 [12:33:44] (03PS1) 10Jcrespo: mariadb: Remove grants for dbprov1003 & dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/1201595 (https://phabricator.wikimedia.org/T403166) [12:37:41] FIRING: [76x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:38:06] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11339690 (10calbon) Approved [12:38:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P84741 and previous config saved to /var/cache/conftool/dbconfig/20251104-123827-marostegui.json [12:42:41] FIRING: [90x] ConfdResourceFailed: confd resource 
_etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:44:50] (03CR) 10Effie Mouzeli: [C:03+2] api-gateway: remove mwdebug* hosts from networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200331 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [12:45:32] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly locally in my WM Phab checkout." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1199469 (owner: 10Pppery) [12:45:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1237 with weight 0 T409167', diff saved to https://phabricator.wikimedia.org/P84742 and previous config saved to /var/cache/conftool/dbconfig/20251104-124556-marostegui.json [12:46:00] T409167: Switchover x1 master (db1220 -> db1237) - https://phabricator.wikimedia.org/T409167 [12:46:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Primary switchover x1 T409167 [12:46:37] (03Merged) 10jenkins-bot: api-gateway: remove mwdebug* hosts from networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200331 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [12:46:43] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1237 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1201592 (https://phabricator.wikimedia.org/T409167) (owner: 10Gerrit maintenance bot) [12:47:41] FIRING: [111x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:47:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-worker1203 
- https://phabricator.wikimedia.org/T408359#11339724 (10Jclark-ctr) @BTullis when you get a chance this week can you assist with this one? [12:47:47] !log Starting x1 eqiad failover from db1220 to db1237 - T409167 [12:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:57] (03PS1) 10Blake: admin: Adding blake to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) [12:48:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1237 to x1 primary T409167', diff saved to https://phabricator.wikimedia.org/P84743 and previous config saved to /var/cache/conftool/dbconfig/20251104-124803-marostegui.json [12:48:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1220 T409167', diff saved to https://phabricator.wikimedia.org/P84744 and previous config saved to /var/cache/conftool/dbconfig/20251104-124836-marostegui.json [12:50:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P84745 and previous config saved to /var/cache/conftool/dbconfig/20251104-125005-root.json [12:50:21] (03CR) 10Btullis: [C:03+1] "Looks good. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201575 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene) [12:52:18] (03PS1) 10Slyngshede: WIP: Assign managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1201597 [12:52:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:53:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T407997)', diff saved to https://phabricator.wikimedia.org/P84746 and previous config saved to /var/cache/conftool/dbconfig/20251104-125335-marostegui.json [12:53:39] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:53:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: Maintenance [12:54:00] (03CR) 10CI reject: [V:04-1] WIP: Assign managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1201597 (owner: 10Slyngshede) [12:54:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84747 and previous config saved to /var/cache/conftool/dbconfig/20251104-125359-marostegui.json [12:54:27] (03CR) 10Btullis: [V:03+1 C:03+2] Switch an-launcher1002 to the insetup role prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1201564 (https://phabricator.wikimedia.org/T353786) (owner: 10Btullis) [12:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[12:57:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84748 and previous config saved to /var/cache/conftool/dbconfig/20251104-125745-marostegui.json [12:58:55] (03CR) 10Marostegui: [C:03+1] "Checked the IPs" [puppet] - 10https://gerrit.wikimedia.org/r/1201595 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [12:59:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:16] !log downgrade lsw1-d3-eqiad to SR-Linux v24.10.1 [12:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1300) [13:02:39] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11339827 (10MoritzMuehlenhoff) We'll keep maps-test2001 around as a separate staging system (single PG master, no replicas, a separate role::maps::staging will be used (https://gerrit... 
[13:05:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P84749 and previous config saved to /var/cache/conftool/dbconfig/20251104-130512-root.json [13:06:50] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585#11339883 (10Jclark-ctr) →14Duplicate dup:03T400877 [13:06:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11339880 (10Jclark-ctr) [13:07:40] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11339885 (10Jclark-ctr) [13:07:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1033 - https://phabricator.wikimedia.org/T409089#11339888 (10Jclark-ctr) →14Duplicate dup:03T409030 [13:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:24] (03PS1) 10Daniel Kinzler: api-gateway: improve metrics mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) [13:10:10] (03CR) 10Mark Bergsma: [C:03+1] "Approved for addition to the ops group" [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [13:12:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P84750 and previous config saved to 
/var/cache/conftool/dbconfig/20251104-131254-marostegui.json [13:13:05] (03PS2) 10Slyngshede: WIP: Assign managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1201597 [13:15:11] (03CR) 10CI reject: [V:04-1] WIP: Assign managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1201597 (owner: 10Slyngshede) [13:19:11] PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:19:11] PROBLEM - Confd vcl based reload on cp2041 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:19:54] 10ops-codfw, 06DC-Ops: Alert for device ps1-c5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409174#11339932 (10phaultfinder) [13:20:09] PROBLEM - Confd vcl based reload on cp2031 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2029 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2030 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2028 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:09] PROBLEM - Confd vcl based reload on cp2032 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [13:20:11] PROBLEM - Confd vcl based reload on cp2042 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:11] PROBLEM - Confd vcl based reload on cp2027 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:12] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:12] PROBLEM - Confd vcl based reload on cp2034 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:12] PROBLEM - Confd vcl based reload on cp2036 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:13] PROBLEM - Confd vcl based reload on cp2038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:13] PROBLEM - Confd vcl based reload on cp2040 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P84752 and previous config saved to /var/cache/conftool/dbconfig/20251104-132019-root.json [13:20:26] fabfur: is this you? ^^^ [13:21:09] RECOVERY - Confd vcl based reload on cp2031 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:09] RECOVERY - Confd vcl based reload on cp2039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:09] RECOVERY - Confd vcl based reload on cp2029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:09] RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [13:21:11] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:11] RECOVERY - Confd vcl based reload on cp2027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:11] RECOVERY - Confd vcl based reload on cp2041 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:21:11] RECOVERY - Confd vcl based reload on cp2033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:09] RECOVERY - Confd vcl based reload on cp2030 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:09] RECOVERY - Confd vcl based reload on cp2028 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:09] RECOVERY - Confd vcl based reload on cp2032 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:11] RECOVERY - Confd vcl based reload on cp2034 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:11] RECOVERY - Confd vcl based reload on cp2036 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:11] RECOVERY - Confd vcl based reload on cp2042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:11] RECOVERY - Confd vcl based reload on cp2038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:22:11] RECOVERY - Confd vcl based reload on cp2040 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [13:24:19] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ops-limited for blake - https://phabricator.wikimedia.org/T409166#11339945 (10hnowlan) L3 signed, NDA applies. Key verified OOB. Last step is approval by @mark (or @Kappakayala maybe?) [13:25:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to ops-limited for blake - https://phabricator.wikimedia.org/T409166#11339961 (10taavi) [13:26:30] volans: may have been a bad rule I added I think? [13:26:47] ack [13:27:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P84753 and previous config saved to /var/cache/conftool/dbconfig/20251104-132804-marostegui.json [13:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P84754 and previous config saved to /var/cache/conftool/dbconfig/20251104-133526-root.json [13:37:28] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178 (10cmooney) 03NEW p:05Triage→03Medium [13:39:40] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11340040 (10cmooney) [13:41:55] !log installing tiff security 
updates [13:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84755 and previous config saved to /var/cache/conftool/dbconfig/20251104-134314-marostegui.json [13:43:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:43:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2201.codfw.wmnet with reason: Maintenance [13:45:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2211.codfw.wmnet with reason: Maintenance [13:45:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T407997)', diff saved to https://phabricator.wikimedia.org/P84756 and previous config saved to /var/cache/conftool/dbconfig/20251104-134545-marostegui.json [13:46:11] (03CR) 10Xcollazo: "More context at slack thread: https://wikimedia.slack.com/archives/C055QGPTC69/p1762203472836359" [puppet] - 10https://gerrit.wikimedia.org/r/1201579 (owner: 10Aqu) [13:49:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T407997)', diff saved to https://phabricator.wikimedia.org/P84757 and previous config saved to /var/cache/conftool/dbconfig/20251104-134955-marostegui.json [13:49:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:53:07] !log downgrade lsw1-c3-eqiad to SR-Linux v24.7.2 [13:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:43] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:56:43] PROBLEM - Host lsw1-c3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:57:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS trixie [13:58:13] PROBLEM - Host lsw1-c3-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-c3-eqiad (10.64.128.20) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:58:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/3 (Core: lsw1-c3-eqiad:ethernet-1/56 {#B00372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1400). [14:00:04] sefehpisikler and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:15] o/ [14:00:21] meow [14:00:31] afk, maybe back later [14:00:33] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:02:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:54] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [14:04:17] (03PS3) 10Muehlenhoff: Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) [14:05:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P84758 and previous config saved to /var/cache/conftool/dbconfig/20251104-140503-marostegui.json [14:06:46] RECOVERY - Host lsw1-c3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [14:07:37] (03CR) 10Stevemunene: [C:03+2] superset: upgrade the memcached container image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201575 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene) [14:08:28] RECOVERY - Host lsw1-c3-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [14:08:29] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11340160 (10Jclark-ctr) This might be the missing free link https://netbox.wikimedia.org/circuits/circuit-terminations/157/trace/ Found this from another ticket >>! In T405499#112... 
[14:08:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-c3-eqiad (10.64.128.20) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:08:45] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [14:08:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/3 (Core: lsw1-c3-eqiad:ethernet-1/56 {#B00372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:08:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:55] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:05] (03CR) 10Ssingh: [C:03+1] site: apply tcpproxy role on all VMs created for it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201312 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [14:09:40] (03Merged) 10jenkins-bot: superset: upgrade the memcached container image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201575 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene) [14:15:31] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [14:16:50] (03PS1) 10Muehlenhoff: Uninstall intel-microcode on VMs [puppet] - 
10https://gerrit.wikimedia.org/r/1201687 [14:18:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [14:20:08] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11340183 (10elukey) 05Stalled→03Open Mark +1ed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199810 so we can proceed! I think that there are some follow... [14:20:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P84759 and previous config saved to /var/cache/conftool/dbconfig/20251104-142010-marostegui.json [14:20:23] (03CR) 10Elukey: [C:03+1] Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:25:17] (03CR) 10Muehlenhoff: [C:03+2] Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:26:23] o/ [14:26:30] is anyone doing the backport+config window? 
[14:26:32] otherwise I can deploy [14:27:35] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [14:28:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201687 (owner: 10Muehlenhoff) [14:28:29] PROBLEM - SSH on stat1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:28:32] (03CR) 10Klausman: topic: add dpogorzelski to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [14:29:19] RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:29:27] (03CR) 10Majavah: topic: add dpogorzelski to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [14:29:32] (03PS1) 10Aklapper: Update funneling to invalid https://wikimediafoundation.org/zh/ [puppet] - 10https://gerrit.wikimedia.org/r/1201689 (https://phabricator.wikimedia.org/T407579) [14:29:41] alright [14:29:49] sefehpisikler: are you still here? :) [14:29:52] sorry for the delay [14:29:54] Lucas_WMDE, thank you. [14:29:54] yes, i am! [14:29:56] that's okay! [14:29:59] ok [14:30:06] (03PS1) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) [14:30:07] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:30:09] honestly we may as well deploy both together [14:30:14] looks low-risk [14:30:18] yippeeeeee [14:30:28] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:30:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [14:30:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [14:30:42] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:30:59] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:31:13] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:31:37] (03Merged) 10jenkins-bot: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [14:31:44] (03Merged) 10jenkins-bot: Remove wmgULSPosition for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [14:31:57] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[14:32:05] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1198390|azwiktionary: use new wordmark and tagline (T408147)]], [[gerrit:1199751|Remove wmgULSPosition for special wikis (T400067)]] [14:32:22] T408147: set a new wordmark and a new tagline for azwiktionary - https://phabricator.wikimedia.org/T408147 [14:32:25] T400067: Clean up LPL-owned settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400067 [14:34:21] !log lucaswerkmeister-wmde@deploy2002 ekrem, lucaswerkmeister-wmde, abi: Backport for [[gerrit:1198390|azwiktionary: use new wordmark and tagline (T408147)]], [[gerrit:1199751|Remove wmgULSPosition for special wikis (T400067)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:34:59] sefehpisikler, abijeet: please test! [14:35:10] Lucas_WMDE, on it [14:35:13] umm, i'm sorry, it's my first time here, how do i? [14:35:17] no problem [14:35:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T407997)', diff saved to https://phabricator.wikimedia.org/P84760 and previous config saved to /var/cache/conftool/dbconfig/20251104-143519-marostegui.json [14:35:24] https://wikitech.wikimedia.org/wiki/WikimediaDebug [14:35:29] there’s a Firefox/Chrome extension linked there [14:35:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:35:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: Maintenance [14:35:39] yeah i have it installed but idk how to use it [14:35:42] install that and enable it as shown in the screenshot [14:35:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T407997)', diff saved to https://phabricator.wikimedia.org/P84761 and previous config saved to 
/var/cache/conftool/dbconfig/20251104-143546-marostegui.json [14:35:55] and then, when you go to az.wiktionary.org (and maybe force-reload), you should be able to see the change [14:35:59] (03PS3) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) [14:36:35] (you shouldn’t need any of the checkboxes, just the big on/off switch) [14:36:44] ah yes it's working [14:36:46] thanks much! [14:36:48] (and you can leave the selected server at “k8s-mwdebug” which should be the default) [14:36:49] yay! [14:36:58] alright [14:37:06] yeah those servers kinda made me confused lol [14:37:21] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS trixie [14:37:25] !log lucaswerkmeister-wmde@deploy2002 ekrem, lucaswerkmeister-wmde, abi: Continuing with sync [14:37:27] ok, then we can continue :) [14:37:32] yes, thanks again! [14:37:36] Lucas_WMDE, looks OK on my end too. [14:37:55] oops, sorry, I continued too soon there [14:38:09] I somehow misremembered your “on it” message as a confirmation 😅 [14:38:11] anyway, good! 
[14:38:28] (03PS1) 10Elukey: memcached: upgrade to Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201693 [14:38:52] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T407997)', diff saved to https://phabricator.wikimedia.org/P84762 and previous config saved to /var/cache/conftool/dbconfig/20251104-143958-marostegui.json [14:41:38] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198390|azwiktionary: use new wordmark and tagline (T408147)]], [[gerrit:1199751|Remove wmgULSPosition for special wikis (T400067)]] (duration: 09m 33s) [14:41:55] T408147: set a new wordmark and a new tagline for azwiktionary - https://phabricator.wikimedia.org/T408147 [14:42:00] T400067: Clean up LPL-owned settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400067 [14:42:26] !log UTC afternoon backport+config window done [14:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:41] sefehpisikler: should be live for everyone now :) [14:42:53] (the WikimediaDebug extension will automatically turn itself off anyway but you can also do it manually if you like) [14:43:14] lemme check real quick [14:44:04] Lucas_WMDE, works as expected. [14:44:08] \o/ [14:44:18] sorry, i guess there was something going on with my network [14:44:32] it's live on a normal window, but it isn't in incognito [14:44:36] maybe there's a caching issue or something [14:45:21] hmm [14:45:27] did you try force-reloading in the private window? 
[14:45:39] actually, yeah, no, you’re right [14:45:42] since the SVGs were modified [14:45:44] and not just new files used [14:45:46] I need to purge them [14:45:54] oh, got it [14:47:59] !log lucaswerkmeister-wmde@deploy2002 $ printf 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-%s-az.svg\n' tagline wordmark | mwscript-k8s --comment='T408147' --attach -- purgeList enwiki [14:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:05] T408147: set a new wordmark and a new tagline for azwiktionary - https://phabricator.wikimedia.org/T408147 [14:48:16] now it should be better [14:48:19] (03CR) 10Stevemunene: [C:03+1] "Looks good, Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201693 (owner: 10Elukey) [14:48:21] let me check [14:48:56] yay, the file is working!! [14:49:01] the size is the same as the old one tho [14:49:04] (i guess) [14:49:17] \o/ [14:49:37] lol [14:49:42] sorry for the inconvenience [14:49:47] no problem! [14:49:51] thank u! [14:49:53] it’s my fault for forgetting the maintenance script ^^ [14:49:54] thank you! 
[14:50:01] ehehe, happens :D [14:50:44] i also forgot to run the script that updates the logos.php file after modifying the svg a few times in the gerrit commit lol [14:52:12] (03CR) 10Elukey: "Tested with::" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201693 (owner: 10Elukey) [14:55:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P84763 and previous config saved to /var/cache/conftool/dbconfig/20251104-145506-marostegui.json [14:57:44] (03PS2) 10Elukey: admin: add dpogorzelski to ops [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [14:57:48] 06SRE: offline rackspace wikitech-static, online aws wikitech-static - https://phabricator.wikimedia.org/T408704#11340323 (10akosiaris) >>! In T408704#11326432, @LSobanski wrote: > cc @akosiaris Thanks @RobH, given the WMCS team re-org, this has inevitably been delayed. Can we push this forward a few months? 
[14:58:13] (03CR) 10Elukey: admin: add dpogorzelski to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [14:58:22] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [14:58:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [14:58:52] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:04] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS trixie [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1500) [15:00:57] (03PS1) 10Scott French: deployment_server: fully migrate mw-(api-int|jobrunner) to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1201695 (https://phabricator.wikimedia.org/T405955) [15:00:58] (03PS1) 10Scott French: mw-(api-int|jobrunner): shift capacity back from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201696 (https://phabricator.wikimedia.org/T405955) [15:01:00] (03PS1) 10Scott French: mw-(api-ext|web): serve 10% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201697 (https://phabricator.wikimedia.org/T405955) [15:01:06] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [15:03:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2192.codfw.wmnet with reason: Maintenance [15:04:29] (03PS3) 10Slyngshede: WIP: Assign managers [software/bitu] - 
10https://gerrit.wikimedia.org/r/1201597 [15:06:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [15:06:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T407997)', diff saved to https://phabricator.wikimedia.org/P84764 and previous config saved to /var/cache/conftool/dbconfig/20251104-150623-marostegui.json [15:06:34] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:08:52] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:31] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11340353 (10jhathaway) >>! In T408632#11339222, @Krd wrote: > If I checked correctly, nobody is subscribed to the Junk q... [15:12:24] (03PS1) 10Vgutierrez: haproxy: Add axios to ua_library_default [puppet] - 10https://gerrit.wikimedia.org/r/1201699 [15:12:36] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201699 (owner: 10Vgutierrez) [15:13:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb2002-dev.wikimedia.org with OS trixie [15:13:11] Lucas_WMDE, i'm sorry, i have no intention of pressuring you, i'm just curious; how long do you think it's gonna take til when the size thing is fixed? 
[15:13:18] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:13:23] (03CR) 10CDanis: [C:03+1] haproxy: Add axios to ua_library_default [puppet] - 10https://gerrit.wikimedia.org/r/1201699 (owner: 10Vgutierrez) [15:13:48] sorry, which size thing? [15:13:52] I thought everything was okay now? [15:14:25] oh, i guess i mentioned it above [15:14:44] I wasn’t sure if that was a problem or not ^^ [15:14:56] oh, i'm sorry for not clarifying it [15:15:10] I don’t see the issue when browsing through azwiktionary in a private window [15:15:13] yeah it's not that bad but it could be better if it's possible [15:15:14] is it stretched or something? [15:15:16] or just too small? [15:15:26] it's too big, stretched i guess [15:15:40] hm, maybe you can post a screenshot on the task [15:15:50] like for example, if you check it via the devtools, the wordmark should be 85x22 but it's 102x27 i guess [15:15:57] alright, on phabricator or gerrit? 
[15:16:02] (03CR) 10Vgutierrez: [C:03+2] haproxy: Add axios to ua_library_default [puppet] - 10https://gerrit.wikimedia.org/r/1201699 (owner: 10Vgutierrez) [15:16:06] Phabricator, since you can upload images there [15:16:21] oh okay, i thought like using some image upload service like imgbb [15:17:43] oh, I think I see what you mean… (now that I realized I need to be testing the mobile site ^^) [15:17:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P84765 and previous config saved to /var/cache/conftool/dbconfig/20251104-151744-root.json [15:17:52] after purging https://az.wiktionary.org/wiki/Ana_s%C9%99hif%C9%99 and then force-reloading it, the logo got smaller [15:17:57] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:18:14] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:18:36] ah yes, it's now smaller for me too [15:18:41] thank you so much! [15:18:49] i guess i can leave now? [15:18:54] I… guess? [15:18:59] if other pages also look okay for you [15:19:08] let me see real quick [15:19:22] I think I now get the small logo even when going to Special:Random repeatedly [15:19:40] though purging the main page shouldn’t have affected those o_O [15:19:58] yeah it was probably that those pages had smaller logo but the main page didn't :D [15:20:04] anyways, it's small for every page now [15:20:09] thanks again, and have a nice day! 
[15:20:28] \o/ [15:20:33] oh, already quit ^^ [15:21:50] (03CR) 10Effie Mouzeli: [C:03+1] deployment_server: fully migrate mw-(api-int|jobrunner) to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1201695 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:21:56] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-int|jobrunner): shift capacity back from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201696 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:22:34] (03PS1) 10Marostegui: db1166: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1201702 [15:23:11] (03CR) 10Marostegui: [C:03+2] db1166: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1201702 (owner: 10Marostegui) [15:23:13] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11340429 (10elukey) @MatthewVernon I am not able to reproduce on ms-be1088, at this point you can probably finish your maintenance to... 
[15:23:50] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): serve 10% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201697 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:24:27] 07sre-alert-triage, 10SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300#11340432 (10elukey) 05Open→03Resolved a:03elukey [15:24:39] 07sre-alert-triage, 10SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance thanos-be1005) - https://phabricator.wikimedia.org/T383301#11340435 (10elukey) 05Open→03Resolved a:03elukey [15:24:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T407997)', diff saved to https://phabricator.wikimedia.org/P84766 and previous config saved to /var/cache/conftool/dbconfig/20251104-152440-marostegui.json [15:24:44] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:26:30] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#11340457 (10elukey) 05Open→03Resolved a:03elukey I think we can close this task, we have established tha... [15:27:45] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11340469 (10elukey) 05Open→03Resolved a:03elukey The fix will go out in the next Spicerack release, closing! 
[15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1530) [15:30:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [15:30:48] (03CR) 10Btullis: [C:03+1] "Looks good to me, too. Thanks." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201693 (owner: 10Elukey) [15:31:11] (03CR) 10Elukey: [V:03+2 C:03+2] memcached: upgrade to Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201693 (owner: 10Elukey) [15:31:34] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice-archive: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11340484 (10CKoerner_WMF) Could I request assistance in verifying that we've properly addressed the user-agent policy requirements for Diff's o... [15:32:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P84767 and previous config saved to /var/cache/conftool/dbconfig/20251104-153249-root.json [15:33:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [15:33:52] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11340493 (10RobH) [15:36:57] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:37:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is 
going to break - https://phabricator.wikimedia.org/T408632#11340504 (10Krd) Please provide in private who that is and how you found the information. [15:39:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:31] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1088.eqiad.wmnet with OS trixie [15:39:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P84768 and previous config saved to /var/cache/conftool/dbconfig/20251104-153948-marostegui.json [15:39:56] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409192 (10phaultfinder) 03NEW [15:41:19] (03CR) 10Elukey: [C:03+2] admin: add dpogorzelski to ops [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [15:42:52] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11340528 (10elukey) 05Open→03Resolved a:03elukey Change merged, I think we can close! [15:43:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11340531 (10Dzahn) 05In progress→03Resolved Thanks for merging and moving this forward @elukey [15:43:18] (03PS1) 10Elukey: golang: fix golang1.24 warning while running docker-pkg [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1201714 [15:44:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11340538 (10jhathaway) >>! In T408632#11340504, @Krd wrote: > Please provide in private who that is and how you found th... 
[15:46:39] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:47:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P84770 and previous config saved to /var/cache/conftool/dbconfig/20251104-154755-root.json [15:49:06] !log upgrade lsw1-c3-eqiad and lsw1-d3-eqiad to SR-Linux v24.10.4 [15:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:07] PROBLEM - Host lsw1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:52:11] PROBLEM - Host lsw1-d3-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:52:45] PROBLEM - Host lsw1-c3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:53:03] RECOVERY - Host lsw1-c3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [15:53:25] RECOVERY - Host lsw1-d3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [15:54:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11340612 (10Krd) Please stand by an hour or two. [15:54:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P84771 and previous config saved to /var/cache/conftool/dbconfig/20251104-155455-marostegui.json [15:56:46] (03CR) 10Btullis: [C:03+1] "This looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:56:59] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:57:13] RECOVERY - Host lsw1-d3-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [15:57:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11340635 (10Dzahn) Thanks for clarifying. All good! [15:58:46] (03CR) 10BCornwall: ncmonitor: Add MarkMonitor API key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [15:59:38] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11340643 (10Lina_Farid_WMDE) Hi @Urbanecm_WMF, @Urbanecm and @JMeybohm, Could you let us know what additional information you need to make a decision on this issue... [15:59:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:04] (03PS3) 10BCornwall: ncmonitor: Add MarkMonitor API key [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) [16:00:05] jelto, arnoldokoth, and mutante: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1600). [16:00:10] (03CR) 10BCornwall: ncmonitor: Add MarkMonitor API key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [16:01:35] (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1200381 (owner: 10Effie Mouzeli) [16:02:56] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Deploy [16:03:04] !log brennen@deploy2002 Started deploy [phabricator/deployment@e9011f3]: deploy phab2002 for T409193 [16:03:07] T409193: Deploy Phab/Phorge 2025-11-04 - https://phabricator.wikimedia.org/T409193 [16:03:36] !log brennen@deploy2002 Finished deploy [phabricator/deployment@e9011f3]: deploy phab2002 for T409193 (duration: 00m 31s) [16:03:53] !log brennen@deploy2002 Started deploy [phabricator/deployment@e9011f3]: deploy phab1004 for T409193 [16:04:11] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Deploy [16:04:18] (03CR) 10Scott French: [C:03+1] "Thanks, Janis!" [puppet] - 10https://gerrit.wikimedia.org/r/1201574 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [16:06:23] !log brennen@deploy2002 Finished deploy [phabricator/deployment@e9011f3]: deploy phab1004 for T409193 (duration: 02m 29s) [16:08:04] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [16:10:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T407997)', diff saved to https://phabricator.wikimedia.org/P84772 and previous config saved to /var/cache/conftool/dbconfig/20251104-161003-marostegui.json [16:10:07] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:10:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [16:10:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84773 and previous config 
saved to /var/cache/conftool/dbconfig/20251104-161027-marostegui.json [16:13:52] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:55] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [16:18:28] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [16:19:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:19:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:20:05] (03CR) 10Ebernhardson: [C:03+1] "dumps ran successfully and look reasonable, good to ship." [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [16:20:25] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11340767 (10Milimetric) 05Open→03Resolved [16:20:56] (03CR) 10BCornwall: [C:03+1] wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1201594 (https://phabricator.wikimedia.org/T409168) (owner: 10Gerrit maintenance bot) [16:27:39] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [16:27:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84774 and previous config saved to /var/cache/conftool/dbconfig/20251104-162754-marostegui.json [16:27:58] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:28:50] (03Abandoned) 10Muehlenhoff: Enable tile invalidation for the new maps 
nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188345 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:29:12] (03CR) 10Slyngshede: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1201066 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [16:29:24] (03Abandoned) 10Muehlenhoff: data.yaml: record LDAP access for dpogorzelski [puppet] - 10https://gerrit.wikimedia.org/r/1197606 (owner: 10Slyngshede) [16:29:46] (03PS3) 10Muehlenhoff: nginx: Remove prometheus.lua [puppet] - 10https://gerrit.wikimedia.org/r/1036672 [16:35:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036672 (owner: 10Muehlenhoff) [16:37:01] (03PS1) 10Bvibber: Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1201722 (https://phabricator.wikimedia.org/T409123) [16:37:48] (03PS1) 10Bvibber: Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201723 (https://phabricator.wikimedia.org/T409123) [16:43:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P84775 and previous config saved to /var/cache/conftool/dbconfig/20251104-164304-marostegui.json [16:44:53] (03CR) 10Dzahn: [C:03+2] "we can always adjust later.. 
matching the test setup for now so it does not get reverted when we re-enable puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1201311 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [16:45:11] (03PS1) 10BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1201725 [16:46:38] (03CR) 10Ryan Kemper: [C:03+2] dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [16:47:21] (03CR) 10Ssingh: [C:03+1] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1201725 (owner: 10BCornwall) [16:52:56] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:53:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201723 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [16:53:33] (03CR) 10Dzahn: [C:03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1201699 (owner: 10Vgutierrez) [16:53:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1201722 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [16:53:59] (03CR) 10Dzahn: [C:03+1] "will do this after fixing firewall and logging" [puppet] - 10https://gerrit.wikimedia.org/r/1201312 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [16:54:30] 
(03CR) 10Eric Gardner: [C:03+1] Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201723 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [16:54:35] (03CR) 10Eric Gardner: [C:03+1] Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1201722 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [16:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:56] (03CR) 10BCornwall: [V:03+2 C:03+2] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1201725 (owner: 10BCornwall) [16:58:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P84776 and previous config saved to /var/cache/conftool/dbconfig/20251104-165812-marostegui.json [16:59:56] (03PS36) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:00:05] jhathaway and moritzm: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:35] (03PS1) 10Fabfur: P:cache:haproxy fix typo in ACL [puppet] - 10https://gerrit.wikimedia.org/r/1201728 [17:01:43] (03PS2) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) [17:01:54] (03CR) 10Scott French: [C:03+1] P:cache:haproxy fix typo in ACL [puppet] - 10https://gerrit.wikimedia.org/r/1201728 (owner: 10Fabfur) [17:02:04] (03CR) 10Giuseppe Lavagetto: [C:03+1] P:cache:haproxy fix typo in ACL [puppet] - 10https://gerrit.wikimedia.org/r/1201728 (owner: 10Fabfur) [17:02:16] (03CR) 10BCornwall: [C:04-2] "Thank you for the patch! I'm going to verify with the owners of the site that this was intentional before moving forward. Hold tight until" [puppet] - 10https://gerrit.wikimedia.org/r/1201689 (https://phabricator.wikimedia.org/T407579) (owner: 10Aklapper) [17:02:53] (03CR) 10Fabfur: [C:03+2] P:cache:haproxy fix typo in ACL [puppet] - 10https://gerrit.wikimedia.org/r/1201728 (owner: 10Fabfur) [17:04:01] (03CR) 10CI reject: [V:04-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:04:13] (03PS1) 10Pmiazga: Make x-ratelimit response header configurable. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [17:04:26] (03PS1) 10RLazarus: {api,rest}-gateway: Update to Envoy 1.32.12 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201730 (https://phabricator.wikimedia.org/T405808) [17:04:28] (03PS1) 10RLazarus: mw-*: Update to Envoy 1.32.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201731 (https://phabricator.wikimedia.org/T405808) [17:06:17] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:07:17] (03PS1) 10RLazarus: mw-videoscaler: Update to Envoy 1.32.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201732 (https://phabricator.wikimedia.org/T405808) [17:07:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:07] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409192#11341086 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Adjusted limits as discussed in dcops meeting [17:09:39] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1200397 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:10:13] (03CR) 10Fabfur: [C:03+1] "thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/1196544 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:11:39] (03PS1) 10Stevemunene: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) [17:12:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:13:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T407997)', diff saved to https://phabricator.wikimedia.org/P84777 and previous config saved to /var/cache/conftool/dbconfig/20251104-171320-marostegui.json [17:13:23] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:13:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:13:27] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11341129 (10dcausse) >>! In T393966#11325354, @RKemper wrote: > @dcausse In this updated version of the SLI we... 
[17:13:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84778 and previous config saved to /var/cache/conftool/dbconfig/20251104-171333-marostegui.json [17:15:27] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database thwikimedia (T409201) [17:15:29] T409201: [wikireplicas] Create views for new wiki thwikimedia - https://phabricator.wikimedia.org/T409201 [17:15:32] (03PS3) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) [17:16:30] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409174#11341159 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm balanced power [17:17:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11341173 (10Jhancock.wm) fyi Dell is fighting me on this cause the idrac doesn't show a failure. so any extra evidence you got to throw at them would be appreciated. I've alread... 
[17:17:41] RESOLVED: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:17:45] (03CR) 10CI reject: [V:04-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:18:52] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:20] (03PS1) 10Pmiazga: api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [17:19:27] (03PS1) 10David Caro: toolforge: scrape redis metrics [puppet] - 10https://gerrit.wikimedia.org/r/1201737 [17:19:47] (03PS2) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [17:20:41] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341199 (10hnowlan) Hi Virginie, your account appears to already be a member of `analytics-privatedata-users` which should grant you Superset access. This access was added in T407605. 
[17:20:58] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341201 (10hnowlan) 05Open→03In progress [17:21:32] (03PS2) 10Pmiazga: api-gateway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [17:21:44] (03PS37) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:22:30] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341211 (10taavi) >>! In T408920#11341199, @hnowlan wrote: > Hi Virginie, your account appears to already be a member of `analytics-privatedata-users` which should grant you Superset access. T... [17:22:36] (03PS3) 10Pmiazga: api-gateway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [17:23:27] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database mswikiquote (T404703) [17:23:30] T404703: [wikireplicas] Create views for new wiki mswikiquote - https://phabricator.wikimedia.org/T404703 [17:23:38] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database mswikiquote (T404703) [17:23:58] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database tokwiki (T404703) [17:24:07] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tokwiki (T404703) [17:24:23] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11341235 (10jhathaway) Is this still occurring? 
[17:24:31] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database tokwiki (T404566) [17:24:33] T404566: Prepare and check storage layer for tokwiki - https://phabricator.wikimedia.org/T404566 [17:24:34] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341238 (10Dzahn) This request is for `ldap/wmde , ldap/nda`, not for analytics-privatedata-users. These LDAP groups are standard for WMDE staff and give permissions like +2 in certain WMDE r... [17:24:40] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tokwiki (T404566) [17:25:45] (03CR) 10Majavah: [C:03+1] "+1 on the condition of ensuring this gets cleaned up from metricsinfra to avoid the confusing situation of having it scraped in both place" [puppet] - 10https://gerrit.wikimedia.org/r/1201737 (owner: 10David Caro) [17:25:52] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11341245 (10Krd) I have unsubscribed the mentioned user. This appears to be the only one, and I will monitor this from n... [17:26:32] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database tokwiki (T404570) [17:26:34] T404570: [wikireplicas] Create views for new wiki tokwiki - https://phabricator.wikimedia.org/T404570 [17:26:41] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tokwiki (T404570) [17:27:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11341256 (10jhathaway) >>! In T408632#11341245, @Krd wrote: > I have unsubscribed the mentioned user. This appears to be... 
[17:28:07] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11341259 (10Dzahn) Seems like this one can be closed as resolved. [17:29:36] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11341265 (10Krd) I just heard from one user that password recovery still doesn't work for them. [17:30:10] (03CR) 10David Caro: [C:03+2] toolforge: scrape redis metrics [puppet] - 10https://gerrit.wikimedia.org/r/1201737 (owner: 10David Caro) [17:31:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84779 and previous config saved to /var/cache/conftool/dbconfig/20251104-173100-marostegui.json [17:31:41] (03PS38) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:34:51] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341289 (10hnowlan) Thanks for the clarification. Virginie, I've added you to the wmf and nda ldap groups. [17:36:09] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409203 (10phaultfinder) 03NEW [17:38:53] (03PS1) 10Krinkle: robots.php: Clean up unused site, lang, and x-subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122) [17:43:35] (03PS10) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [17:45:19] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [17:46:06] (03CR) 10Bking: [C:04-1] "Sorry for not being more clear when I linked this...this is the nginx config file for docker-registry. We need to add headers in a similar" [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [17:46:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P84780 and previous config saved to /var/cache/conftool/dbconfig/20251104-174608-marostegui.json [17:48:39] (03CR) 10Bking: [C:04-1] "The query_service (wdqs) nginx config is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/mo" [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [17:49:44] (03PS39) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:54:15] (03PS1) 10Bking: wdqs: re-apply allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/1201742 (https://phabricator.wikimedia.org/T409132) [17:56:17] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1200397 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:56:25] (03CR) 10Scott French: [C:03+2] haproxy: add known-client DSL fixture in tests [puppet] - 10https://gerrit.wikimedia.org/r/1200397 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:57:18] (03PS1) 10Andrew Bogott: cloudweb2002-dev: update idp hacks for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201743 [17:59:11] (03PS2) 10Andrew Bogott: cloudweb2002-dev: update idp hacks for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201743 [17:59:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201743 (owner: 10Andrew Bogott) [18:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1800). [18:00:05] (03PS3) 10Andrew Bogott: cloudweb2002-dev: update package overrides for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201743 [18:00:10] o/ [18:00:20] (03PS1) 10BCornwall: ncmonitor: Change timer to run daily [puppet] - 10https://gerrit.wikimedia.org/r/1201744 [18:01:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P84781 and previous config saved to /var/cache/conftool/dbconfig/20251104-180116-marostegui.json [18:01:42] (03PS1) 10Dzahn: tcpproxy: greatly reduce connection timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1201745 (https://phabricator.wikimedia.org/T408532) [18:04:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201743 (owner: 10Andrew Bogott) [18:04:43] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201747 (https://phabricator.wikimedia.org/T408271) [18:04:46] (03CR) 10TrainBranchBot: [C:03+2] 
"Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201747 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [18:05:32] wait, is the train rolling group0? [18:05:38] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201747 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [18:05:42] testwikis [18:05:59] Train presync failed last night due to patch issues [18:06:03] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409203#11341443 (10Jclark-ctr) a:03Jclark-ctr [18:06:06] !log jhuneidi@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.1 refs T408271 [18:06:08] so, why is that happening during the infra window? [18:06:09] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [18:06:18] (03CR) 10Andrew Bogott: [C:03+2] cloudweb2002-dev: update package overrides for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201743 (owner: 10Andrew Bogott) [18:06:59] swfrench-wmf: No specific reason. Jeena happened to retry to the operation now. If it's disruptive we can ask her to cancel. [18:07:42] sorry for the interruption swfrench-wmf [18:07:52] The actual deployment won't happen for about 35 minutes while images are built [18:07:56] (03CR) 10Daniel Kinzler: [C:04-1] api-gateway: rename symbols used in restgw ratelimiter (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [18:08:28] definitely disruptive, as I need to be able to use scap to deploy for the work scheduled during this window [18:08:40] but I'm not sure it makes sense to cancel at this point? [18:08:54] It's reasonable to cancel. 
[18:08:55] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:09:00] I will cancel it now [18:09:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:09:06] !log jhuneidi@deploy2002 sync-world aborted: testwikis to 1.46.0-wmf.1 refs T408271 (duration: 03m 00s) [18:09:30] My apologies [18:09:53] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7538/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201744 (owner: 10BCornwall) [18:10:20] jeena: dancy: no worries, it happens! so, what's the state of using scap? does the mediawiki-config patch also need reverted? [18:10:38] or actually, I guess don't _need_ to trigger image builds [18:11:13] oh, but scap is going to rebuild the l10n files anyway even if I tell it not to build [18:11:19] (03CR) 10Daniel Kinzler: [C:04-1] api-gateway: Make x-ratelimit response header configurable. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [18:11:35] this is what I meant about it being unclear whether it actually makes sense to cancel at this stage [18:11:36] The wikiversions.json change will need to be rolled back if you don't want it involved. 
[18:11:56] one way of doing that is to run `scap train` and select the [18:12:00] (03PS1) 10Vgutierrez: varnish: Avoid calling detect_browser more than once [puppet] - 10https://gerrit.wikimedia.org/r/1201748 [18:12:02] `start` station [18:12:12] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database thwikimedia (T409201) [18:12:15] T409201: [wikireplicas] Create views for new wiki thwikimedia - https://phabricator.wikimedia.org/T409201 [18:13:35] (03CR) 10CDanis: [C:03+1] varnish: Avoid calling detect_browser more than once [puppet] - 10https://gerrit.wikimedia.org/r/1201748 (owner: 10Vgutierrez) [18:13:42] (03CR) 10Vgutierrez: [C:03+1] tcpproxy: greatly reduce connection timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1201745 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [18:14:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201748 (owner: 10Vgutierrez) [18:15:12] swfrench-wmf, jeena: I'll do that [18:15:26] (03PS1) 10TrainBranchBot: all to next [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201749 (https://phabricator.wikimedia.org/T408271) [18:15:28] dancy: jeena: apologies, but I've not really futzed around with `scap train` before. does selecting `start` actually put the state of /srv/mediawiki-staging back the way it wasy? [18:15:28] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201749 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [18:15:37] *was [18:15:44] (03Abandoned) 10Ahmon Dancy: all to next [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201749 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [18:16:18] swfrench-wmf: That's what is supposed to happen but I just ran into a bug that I need to fix. 
[18:16:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84782 and previous config saved to /var/cache/conftool/dbconfig/20251104-181623-marostegui.json [18:16:28] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:16:36] I realize it shifts one layer of intent, but wasn't sure how that translates to other things [18:16:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance [18:16:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84783 and previous config saved to /var/cache/conftool/dbconfig/20251104-181648-marostegui.json [18:17:12] (03CR) 10Bking: [C:03+2] wdqs: re-apply allowlist changes [puppet] - 10https://gerrit.wikimedia.org/r/1201742 (https://phabricator.wikimedia.org/T409132) (owner: 10Bking) [18:17:29] (03CR) 10Bking: [C:03+2] "self-merging, as the previous patch was vetted and this is identical." [puppet] - 10https://gerrit.wikimedia.org/r/1201742 (https://phabricator.wikimedia.org/T409132) (owner: 10Bking) [18:17:49] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11341493 (10Xaosflux) I'm seeing at least off-by-one errors on multiple queues, stewards queue right now is off by 2 [18:17:54] (03PS1) 10Ahmon Dancy: All wikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201750 [18:17:57] dancy: ack, thanks! 
sorry cancelling ended up being a bit messier than expected [18:18:17] (03CR) 10Ahmon Dancy: [C:03+2] All wikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201750 (owner: 10Ahmon Dancy) [18:19:21] (03Merged) 10jenkins-bot: All wikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201750 (owner: 10Ahmon Dancy) [18:19:43] swfrench-wmf: You should be good to go [18:20:30] dancy: amazing, thank you! [18:20:45] (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy for both text & upload" [puppet] - 10https://gerrit.wikimedia.org/r/1201748 (owner: 10Vgutierrez) [18:21:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb2002-dev.wikimedia.org with OS trixie [18:21:12] (03CR) 10Scott French: [C:03+2] deployment_server: fully migrate mw-(api-int|jobrunner) to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1201695 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:21:35] (03CR) 10Vgutierrez: [V:03+2 C:03+2] varnish: Avoid calling detect_browser more than once [puppet] - 10https://gerrit.wikimedia.org/r/1201748 (owner: 10Vgutierrez) [18:22:46] swfrench-wmf: please go ahead and merge mine if it came up in your puppet-merge [18:23:08] hmm nope.. I got it [18:23:11] Scott French: deployment_server: fully migrate mw-(api-int|jobrunner) to 8.3 (9bbc386578) [18:23:11] Brian King: wdqs: re-apply allowlist changes (b56c5c1b97) [18:23:11] vgutierrez: was just about to ask [18:23:18] inflatador: is ok to merge that one? [18:23:22] swfrench-wmf: can I proceed with yours too? [18:23:30] vgutierrez yes, please do [18:23:31] I just confirmed with inflatador in -sre [18:23:37] thx <3, proceeding [18:23:55] new idea: puppet-merge is actually a google meet bot [18:24:08] thank you! [18:24:16] cdanis: IRC bot? 
[18:24:31] vgutierrez: no, you need to dial in and read the shorthashes of the changes you wish to merge [18:24:38] hell no [18:24:41] (done) [18:24:47] no more synchronization issues! [18:24:56] thanks, vgutierrez! [18:25:07] cdanis: new curse, when you have multiple merges, you are now inexplicably and forever linked to that person and they have to press the button for any of your future changes, and vice versa [18:25:10] sidesteps the IRC channel proliferation issue too [18:27:34] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11341569 (10Geagea) still occurring. permissions-en shows 2 and it's only one permissions-commons shows 174 but it's 176 [18:27:42] (03CR) 10Ssingh: [C:03+1] ncmonitor: Change timer to run daily [puppet] - 10https://gerrit.wikimedia.org/r/1201744 (owner: 10BCornwall) [18:28:25] cdanis: I'm sorry, I didn't get that. Please speak clearly into the microphone. Please say the number of hosts to continue. [18:28:50] rzl: ONE. ONE. TWO. [18:28:55] lol [18:29:09] cdanis: You have selected "you," referring to me. That is incorrect. The correct answer is you. Goodbye. [18:29:15] lol [18:29:17] x) [18:31:07] new record for number of multiple patches in puppet-merge? 
[18:32:31] !log swfrench@deploy2002 Started scap sync-world: Fully migrate mw-(api-int|jobrunner) to 8.3 - T405955 [18:32:34] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:33:45] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:36:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84784 and previous config saved to /var/cache/conftool/dbconfig/20251104-183619-marostegui.json [18:36:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:36:45] dancy: so, scap is just kinda sitting here on `sync-masters` for the past ~ 3 minutes. does that sound like a potential side effect of the earlier cancellation? [18:37:47] swfrench-wmf: Probably. /srv/mediawiki-staging/php-1.46.0-wmf.1 was cloned, so that'll need to be rsyncd [18:38:06] ah, that'd do it. cleared just now after 4m36s :) [18:40:21] !log swfrench@deploy2002 Finished scap sync-world: Fully migrate mw-(api-int|jobrunner) to 8.3 - T405955 (duration: 07m 49s) [18:40:24] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:42:49] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): shift capacity back from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201696 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:44:44] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): shift capacity back from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201696 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:47:09] PROBLEM - Confd vcl based reload on cp2039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:47:33] PROBLEM - Confd vcl based reload on cp1104 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:48:20] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:48:27] uh oh [18:48:38] swfrench-wmf: Lemme know when I can deploy a fixed scap on the deploy servers. [18:48:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:49:27] dancy: I'm done using scap for now, but have a number of manual helmfile changes to apply. which is to say, feel free to deploy scap, but don't test it yet :) [18:49:35] OK! [18:49:39] !log dancy@deploy2002 Installing scap version "4.222.0" for 2 host(s) [18:49:41] sukhe@cp2039:~$ cat /var/run/reload-vcl-state [18:49:41] 0 [18:49:59] we have seen this race condition before [18:50:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11341689 (10VRiley-WMF) Spoke to @cmooney about this ticket. This no longer has to be moved... [18:50:21] Does that happen when a new requestctl rule is published or something? 
[18:50:37] no, during VCL reload, so most likely coming from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/5a19e10bb719c4f8de748ed8dbeba17bea54b9a7 [18:50:54] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:50:54] though that's not the cause of it but, as in not specifically [18:50:55] let's try [18:51:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:51:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11341696 (10VRiley-WMF) Spoke to @cmooney about this ticket. This no longer has to be moved... [18:51:25] claime: I mean, requestctl rule would be VCL reload too but I think we have only seen it in cases of commits like above [18:51:26] !log dancy@deploy2002 Installation of scap version "4.222.0" completed for 2 hosts [18:51:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P84785 and previous config saved to /var/cache/conftool/dbconfig/20251104-185126-marostegui.json [18:51:44] sukhe: I'm asking because we had this alert pop after a requestctl commit earlier today [18:51:48] so my bad for saying "no" but I meant that the requestctl thing was unlikely to have caused it I think [18:51:51] ah [18:51:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:52:06] ok, that changes what we have seen in the past I think, where this was tied to a puppet run [18:52:12] around 1322 UTC [18:52:14] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:52:23] I literally just did a requestctl commit .. thats right [18:52:48] swfrench-wmf: Done. Please let Jeena know when she can retry `scap stage-train` [18:53:38] dancy: ack, will do. 
I'm running a bit behind after the delay, but hopefully shouldn't stray into the train window too much [18:53:39] I have a second one pending.. but not touching it right now [18:53:55] mutante: go for it please [18:54:05] because that should help (last time we did a NOOP change on requestctl) [18:54:15] here in case things go south :] [18:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via next at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:54:16] ok, doing! [18:54:28] committted to CDN [18:54:34] um ... I've not touched that ... [18:54:36] looking [18:54:46] swfrench-wmf: the fpm sat should level out right? [18:54:53] Oh it's web [18:54:56] so basically what happens here is that there are two separate things that touch the reload [18:54:59] Letting you check it out [18:55:01] claime: I've not touched -web yet, yeah =/ [18:55:08] and if one is already running and the other one kicks in, this happens [18:55:09] RECOVERY - Confd vcl based reload on cp2039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:55:17] ah nice ok [18:55:18] Unless you have something else to do and want me to swfrench-wmf [18:55:18] claime: I'm in the middle of some precarious changes on -int and -jobrunner [18:55:20] ack [18:55:24] I'll go take a gander [18:55:29] claime: could you take a look and I'll join in you in sec? [18:55:32] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:55:33] RECOVERY - Confd vcl based reload on cp1104 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:55:38] swfrench-wmf: yeah, on it [18:55:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:55:45] mutante: all good :) [18:55:50] sukhe: :) thanks! [18:56:44] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341707 (10Dzahn) Thanks as well! That should resolve the ticket. [18:56:56] swfrench-wmf: Hmm sharp rps increase that is not reflected on main [18:57:11] PROBLEM - Confd vcl based reload on cp2027 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:57:14] 06SRE, 10LDAP-Access-Requests: Grant Access to WMDE LDAP groups for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341708 (10Dzahn) [18:57:55] claime: thanks! alright, probably a workload of some sort that also happens to have managed to get cookie-enrolled [18:58:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11341710 (10cmooney) >>! In T405628#11341689, @VRiley-WMF wrote: > Spoke to @cmooney about... [18:58:08] swfrench-wmf: that's mw-web though? [18:58:13] 06SRE, 10LDAP-Access-Requests: Grant Access to WMDE LDAP groups for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11341711 (10Dzahn) @Virginie.caplet (all the) things should work now that also work for other WMDE staff. [18:58:15] So that should be client traffic [18:59:11] claime: right exactly, but it could very well be a client that's ... um ... 
a bit less well behaved than the typical client that gets enrolled [18:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via next at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:59:39] I am *not* helped by firefox deciding to lag tf out [18:59:49] (03CR) 10Dzahn: [C:03+2] tcpproxy: greatly reduce connection timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1201745 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:00:05] jeena and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T1900). [19:00:27] jeena: dduvall: do not proceed with the train [19:00:47] we're still trying to clean up from the delay earlier [19:01:07] I'll need another 10-15m to get things in a stable state [19:03:39] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:03:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:04:01] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:04:22] swfrench-wmf: 🫡 [19:04:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:05:27] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [19:05:45] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [19:06:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [19:06:33] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [19:06:35] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P84788 and previous config saved to /var/cache/conftool/dbconfig/20251104-190634-marostegui.json [19:06:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [19:07:03] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [19:08:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [19:08:13] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [19:09:36] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [19:09:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [19:09:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1220 (T403362)', diff saved to https://phabricator.wikimedia.org/P84789 and previous config saved to /var/cache/conftool/dbconfig/20251104-190946-ladsgroup.json [19:09:48] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [19:09:50] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [19:10:59] jeena: dduvall: you should be able to retry `stage-train` now [19:11:17] okay thanks! 
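Annotation on the reload-vcl alerts above: at 18:54:56 and 18:55:08 sukhe describes the failure as two separate things triggering the reload, with one starting while the other is still running. A common way to make such overlapping triggers safe is a non-blocking exclusive file lock, so the second trigger backs off instead of colliding. The sketch below is only an illustration of that general technique under stated assumptions; it is not how reload-vcl or confd is implemented, and `try_reload`, `reload_fn`, and `LOCK_PATH` are hypothetical names.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path, for illustration only; not a path used by reload-vcl.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-reload.lock")


def try_reload(reload_fn, lock_path=LOCK_PATH):
    """Run reload_fn only if no other reload currently holds the lock.

    Returns True if the reload ran, False if another one was in progress.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail immediately if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        try:
            reload_fn()
            return True
        finally:
            # Release explicitly; closing the file would also drop the lock.
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller that gets `False` could retry after a short delay or simply skip, since the reload already in progress may pick up the newest configuration anyway; which of those is right depends on the system and is not stated in the log.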
[19:11:49] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201766 (https://phabricator.wikimedia.org/T408271) [19:11:52] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201766 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:12:46] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201766 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:13:17] !log jhuneidi@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.1 refs T408271 [19:13:20] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [19:15:45] (03CR) 10CDobbins: [C:03+2] ncmonitor: Add MarkMonitor API key [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [19:18:04] (03CR) 10Dzahn: [C:03+1] "nice" [puppet] - 10https://gerrit.wikimedia.org/r/1201744 (owner: 10BCornwall) [19:19:09] (03PS4) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) [19:20:45] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:10] (03CR) 10CI reject: [V:04-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:21:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:21:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84790 and previous config saved to /var/cache/conftool/dbconfig/20251104-192142-marostegui.json [19:21:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:21:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [19:24:54] !log import ncmonitor 3.0.0 into bookworm-wikimedia [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:27:09] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncmonitor1001.eqiad.wmnet [19:27:45] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:29:17] (03PS5) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) [19:29:51] (03CR) 10CI reject: [V:04-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:31:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:33:12] (03PS6) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 
(https://phabricator.wikimedia.org/T365259) [19:33:42] (03CR) 10CI reject: [V:04-1] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:34:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [19:34:44] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11341804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2009.codfw.wmnet with OS trixie [19:36:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [19:38:37] (03PS1) 10Eevans: cassandra: add new grants to data-gateway role [puppet] - 10https://gerrit.wikimedia.org/r/1201768 (https://phabricator.wikimedia.org/T401021) [19:39:38] (03PS2) 10Eevans: cassandra: add new grants to data-gateway role [puppet] - 10https://gerrit.wikimedia.org/r/1201768 (https://phabricator.wikimedia.org/T401021) [19:40:35] swfrench-wmf: a helm upgrade failed during scap presync with an error about 'context deadline exceeded', do you know if anything changed recently? [19:40:53] (03CR) 10Eevans: [C:03+2] cassandra: add new grants to data-gateway role [puppet] - 10https://gerrit.wikimedia.org/r/1201768 (https://phabricator.wikimedia.org/T401021) (owner: 10Eevans) [19:41:09] (03PS7) 10Dzahn: gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) [19:41:17] jeena: hmmm ... that's odd. let me take a look at the logs. 
[19:41:46] (03PS2) 10Stevemunene: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) [19:42:04] jeena: would you happen to know what release failed? [19:42:47] mw-web-main-codfw, mw-api-ext-main-codfw, mw-api-ext-main-eqiad, mw-api-int-main-eqiad, and mw-web-main-eqiad [19:42:59] Hmm [19:43:13] scanning through logs [19:43:52] Looking at scap logs, it feels like that's because of the saturation causing logstash errors maybe? [19:43:53] mw-web.eqiad.main-67bb8997cc-9cwml 6/6 Running 0 8m [19:44:16] docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-11-04-143224-publish-81 [19:44:20] yeah that got rolled back [19:45:50] yeah I think it rolled all of them back for the same error [19:46:42] STDERR: Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: context deadline exceeded [19:47:21] claime: so, I'm wondering if this might be an artifact of doing a full-image build while we're ~ 50% migrated [19:47:33] hm, if it *is* due to saturation, one option is to just pass a longer --timeout through to helmfile [19:47:36] swfrench-wmf: hmm, takes too long to pull? 
[19:47:45] oh and/or due to that, yeah [19:47:47] basically, the working set of "stuff pulled" is really large, yeah [19:48:15] I'd propose we temporarily bump the helmfile timeouts from 10m to 15m [19:48:20] yeah makes sense [19:48:22] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ncmonitor1001.eqiad.wmnet [19:48:23] that's what I do for the -next releases [19:48:32] yeah go ahead [19:48:33] i.e., when they're super early and the cache is always cold [19:48:38] cool, prepping a patch [19:48:50] This is an eventful end to my shift x) [19:49:09] (I am also being a complete idiot and working on non-pages) [19:49:12] (don't do it kids) [19:49:31] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [19:49:42] claime: yeah how dare you maintain an attitude of professionalism and responsibility, knock it off [19:50:03] rzl: T_T [19:51:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11341840 (10jhathaway) @Krd we are still receiving bounces for that user as their email rate is still too high. Do they... 
[19:51:47] PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [19:51:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [19:52:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T407997)', diff saved to https://phabricator.wikimedia.org/P84791 and previous config saved to /var/cache/conftool/dbconfig/20251104-195203-marostegui.json [19:52:07] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:52:31] (03PS1) 10BCornwall: ncmonitor: Add Gerrit API URL configuration [puppet] - 10https://gerrit.wikimedia.org/r/1201771 [19:52:40] (03PS3) 10Stevemunene: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) [19:53:03] RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [19:53:48] (03CR) 10Dzahn: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1201316/7542/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:54:29] (03PS40) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:54:31] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7543/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201771 (owner: 10BCornwall) [19:54:56] (03CR) 10BCornwall: ncmonitor: Add Gerrit API URL configuration [puppet] - 10https://gerrit.wikimedia.org/r/1201771 (owner: 10BCornwall) [19:55:05] (03CR) 10Dzahn: [V:03+1 C:03+1] "this adds the gerrit host names (e.g. 
gerrit1003) as allowed destinations; in addition to the service names (e.g. gerrit.wikimedia.org) bu" [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:55:19] (03PS1) 10Cwhite: prometheus: ops: split targets into directories by source [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) [19:55:23] (03PS1) 10Scott French: mw-*: temporarily bump timeout on large services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201772 (https://phabricator.wikimedia.org/T405955) [19:55:43] rzl: could I ask you to review https://gerrit.wikimedia.org/r/1201772 when you have a moment? [19:55:49] yep looking [19:55:59] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: add discovery name as allowed destination range IPs for ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:56:17] (03CR) 10CDobbins: [C:03+2] ncmonitor: Add Gerrit API URL configuration [puppet] - 10https://gerrit.wikimedia.org/r/1201771 (owner: 10BCornwall) [19:56:35] oh right, not the flag --timeout clearly, I forgot we have it in values :) [19:56:44] er, in helmfile [19:57:09] (03CR) 10RLazarus: [C:03+1] mw-*: temporarily bump timeout on large services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201772 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:57:12] ship it [19:57:14] !log import ncmonitor 3.0.0 into bookworm-wikimedia [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:20] yeah, we've got it in the yaml that drives the tool that makes the yaml from other yaml [19:57:23] * swfrench-wmf cries [19:57:26] ...whoops, wrong window to up+enter :) [19:57:31] 🤣 [19:57:44] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201772 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:57:46] (03CR) 10Scott French: [C:03+2] mw-*: temporarily bump timeout on large services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201772 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:58:45] swfrench-wmf: from other yaml and gotmpl [19:58:49] (03PS41) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:58:52] It's important to detail the misery completely [19:58:59] :D [19:59:43] (03Merged) 10jenkins-bot: mw-*: temporarily bump timeout on large services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201772 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:00:13] (03PS1) 10Dzahn: Revert "gerrit: add discovery name as allowed destination range IPs for ssh" [puppet] - 10https://gerrit.wikimedia.org/r/1201775 [20:00:15] (03PS42) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [20:00:30] jeena: ^ this is now live on deploy2002, so you should be good to retry. thanks for your patience! [20:00:45] thank you, I will try it now! [20:01:37] !log jhuneidi@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.1 refs T408271 [20:01:40] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [20:02:15] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11341888 (10jhathaway) >>! In T408967#11341265, @Krd wrote: > I just heard from one user that password recovery still doesn't work for them. the outbound queue remains empty, perhaps it wen... 
[20:04:06] (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Change timer to run daily [puppet] - 10https://gerrit.wikimedia.org/r/1201744 (owner: 10BCornwall) [20:06:19] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11341899 (10jhathaway) I'm not sure how to remedy this issue. I see we switched to StaticDB in T355979, perhaps we need to rebuild the StaticDB... [20:07:19] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1201776 [20:07:22] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1201777 [20:08:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T407997)', diff saved to https://phabricator.wikimedia.org/P84792 and previous config saved to /var/cache/conftool/dbconfig/20251104-200857-marostegui.json [20:09:00] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1201776 (owner: 10Ncmonitor) [20:09:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:10:12] (03PS1) 10Eevans: data-gateway (staging): deploy version v1.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201778 (https://phabricator.wikimedia.org/T401021) [20:12:11] RECOVERY - Confd vcl based reload on cp2027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [20:12:30] sukhe: ^ recovery right after I deployed again :P [20:12:48] (03CR) 10Eevans: [C:03+2] data-gateway (staging): deploy version v1.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201778 (https://phabricator.wikimedia.org/T401021) (owner: 10Eevans) [20:13:45] !log jhuneidi@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.1 refs T408271 (duration: 12m 07s) [20:13:46] mutante: ah ok. well this makes it the third time it has happened today [20:13:48] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [20:13:52] so probably worth a look. I will flag it for the team [20:13:54] thanks [20:14:01] swfrench-wmf: Success [20:14:01] `Finished sync-prod-k8s (duration: 03m 43s)` <- well, I guess we cheated, since the image is now cached everywhere from the attempt that timed out :) [20:14:07] hehe [20:14:15] sukhe: +1 :) [20:14:16] jeena: thanks for the heads-up! yay :) [20:14:18] well now i get to deploy to group0 I guess [20:14:23] (03Merged) 10jenkins-bot: data-gateway (staging): deploy version v1.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201778 (https://phabricator.wikimedia.org/T401021) (owner: 10Eevans) [20:14:24] Thanks for your help! 
[20:14:32] * swfrench-wmf thumbs up [20:15:16] (03PS1) 10Scott French: mw-web: increase capacity of next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) [20:15:17] (03PS43) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [20:16:53] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [20:17:13] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201780 (https://phabricator.wikimedia.org/T408271) [20:17:13] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [20:17:15] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201780 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [20:18:04] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201780 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [20:18:55] (03CR) 10RLazarus: [C:03+1] mw-web: increase capacity of next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:20:13] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409203#11341957 (10Jclark-ctr) →14Duplicate dup:03T409192 [20:20:14] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409192#11341955 (10Jclark-ctr) [20:24:04] (03CR) 10Dzahn: [C:03+2] Revert "gerrit: add discovery name as allowed destination range IPs for ssh" [puppet] - 10https://gerrit.wikimedia.org/r/1201775 (owner: 10Dzahn) [20:24:05] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P84793 and previous config saved to /var/cache/conftool/dbconfig/20251104-202405-marostegui.json [20:26:34] jeena: when you're done with group0, would it be alright if I sneak in some quick capacity changes for mw-web before the backport window starts? [20:27:01] that would be fine with me [20:27:11] awesome, thank you! [20:27:27] yw! [20:28:45] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:28:59] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.1 refs T408271 [20:29:02] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [20:30:34] swfrench-wmf: you should be able to do your deploys now [20:30:45] jeena: great, thank you [20:30:54] 👍 [20:34:46] (03PS2) 10Scott French: mw-web: increase capacity of next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) [20:35:05] (03CR) 10Scott French: "Thanks, Reuven!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:36:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11342002 (10Jhancock.wm) [20:37:04] (03CR) 10Scott French: [C:03+2] mw-web: increase capacity of next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:38:51] (03Merged) 10jenkins-bot: mw-web: increase capacity of next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201779 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:39:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P84794 and previous config saved to /var/cache/conftool/dbconfig/20251104-203912-marostegui.json [20:41:02] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [20:41:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [20:43:31] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [20:43:44] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [20:45:04] (03PS5) 10Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) [20:45:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [20:45:50] (03PS1) 10Dzahn: gerrit: add firewall rule to allow CDN caching servers to gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201787 
(https://phabricator.wikimedia.org/T365259) [20:46:14] (03CR) 10Dzahn: [V:03+1 C:03+2] "replaced with much simpler (and more specific) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1201787" [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:46:18] (03CR) 10CI reject: [V:04-1] gerrit: add firewall rule to allow CDN caching servers to gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201787 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:46:35] (03CR) 10Scott French: [C:03+1] {api,rest}-gateway: Update to Envoy 1.32.12 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201730 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [20:46:56] (03PS2) 10Dzahn: gerrit: add firewall rule to allow CDN caching servers to gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201787 (https://phabricator.wikimedia.org/T365259) [20:47:21] (03CR) 10Dzahn: [V:03+1 C:03+2] "did not want to allow external users to ssh to the host IP" [puppet] - 10https://gerrit.wikimedia.org/r/1201316 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:48:15] (03CR) 10Scott French: [C:03+1] mw-*: Update to Envoy 1.32.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201731 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [20:48:42] (03CR) 10Scott French: [C:03+1] mw-videoscaler: Update to Envoy 1.32.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201732 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [20:51:48] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2009.codfw.wmnet with OS trixie [20:51:58] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11342033 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2009.codfw.wmnet with OS trixie executed with errors: 
- sretest2009 (**FAIL**) - Removed... [20:53:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [20:53:43] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11342034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2009.codfw.wmnet with OS trixie [20:54:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T407997)', diff saved to https://phabricator.wikimedia.org/P84795 and previous config saved to /var/cache/conftool/dbconfig/20251104-205420-marostegui.json [20:54:24] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:54:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance [20:54:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T407997)', diff saved to https://phabricator.wikimedia.org/P84796 and previous config saved to /var/cache/conftool/dbconfig/20251104-205433-marostegui.json [20:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:31] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1201787/7550/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1201787 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:58:33] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: add firewall rule to allow CDN caching servers to gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1201787 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) 
[21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T2100). [21:00:05] bvibber and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:45] o/ [21:00:59] having some trouble logging into dev servers myself atm, moment [21:01:48] resolved :D [21:03:31] ebernhardson: here? i can run all three together or i can run mine first [21:04:01] (03PS1) 10Dzahn: gerrit: remove firewall rule to accept Wikimania traffic [puppet] - 10https://gerrit.wikimedia.org/r/1201793 [21:05:01] i'll run mine first, should be quick [21:05:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11342084 (10cmooney) >>! In T405609#11341696, @VRiley-WMF wrote: > Spoke to @cmooney about... 
[21:05:08] (03PS2) 10Dzahn: gerrit: remove firewall rule to accept Wikimania traffic [puppet] - 10https://gerrit.wikimedia.org/r/1201793 [21:05:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1201722 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [21:05:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201723 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [21:05:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2009.codfw.wmnet with reason: host reimage [21:05:40] (03CR) 10Dzahn: "noticed this while making unrelated firewall changes and verifying rules. I assume it can be removed now: https://gerrit.wikimedia.org/r/c" [puppet] - 10https://gerrit.wikimedia.org/r/1175877 (owner: 10Jelto) [21:05:47] bvibber: sorry, yes here [21:06:02] cool we'll run it next :) [21:06:09] nice! 
[21:06:31] (03Merged) 10jenkins-bot: Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1201722 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [21:06:33] (03Merged) 10jenkins-bot: Guard against some null dereferences in CroppedImage [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201723 (https://phabricator.wikimedia.org/T409123) (owner: 10Bvibber) [21:07:08] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1201722|Guard against some null dereferences in CroppedImage (T409123 T409126)]], [[gerrit:1201723|Guard against some null dereferences in CroppedImage (T409123 T409126)]] [21:07:12] T409123: ImageBrowsing: production error on Android 10: Cannot read properties of null (reading 'src') - https://phabricator.wikimedia.org/T409123 [21:07:13] T409126: ImageBrowsing JS error when closing overlay before image loads - https://phabricator.wikimedia.org/T409126 [21:07:29] (03CR) 10Dzahn: [C:03+2] gerrit: remove firewall rule to accept Wikimania traffic [puppet] - 10https://gerrit.wikimedia.org/r/1201793 (owner: 10Dzahn) [21:08:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2009.codfw.wmnet with reason: host reimage [21:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:11:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T407997)', diff saved to https://phabricator.wikimedia.org/P84797 and previous 
config saved to /var/cache/conftool/dbconfig/20251104-211102-marostegui.json [21:11:06] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:11:20] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1201722|Guard against some null dereferences in CroppedImage (T409123 T409126)]], [[gerrit:1201723|Guard against some null dereferences in CroppedImage (T409123 T409126)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:12:09] !log bvibber@deploy2002 bvibber: Continuing with sync [21:12:12] looks good [21:13:16] 06SRE, 06Infrastructure-Foundations, 10netops: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217 (10cmooney) 03NEW p:05Triage→03Medium [21:18:25] (03PS44) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:18:31] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201722|Guard against some null dereferences in CroppedImage (T409123 T409126)]], [[gerrit:1201723|Guard against some null dereferences in CroppedImage (T409123 T409126)]] (duration: 11m 23s) [21:18:35] T409123: ImageBrowsing: production error on Android 10: Cannot read properties of null (reading 'src') - https://phabricator.wikimedia.org/T409123 [21:18:36] T409126: ImageBrowsing JS error when closing overlay before image loads - https://phabricator.wikimedia.org/T409126 [21:18:47] ok ebernhardson you're up! 
lemme pull up the patch [21:18:54] kk [21:19:01] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:19:23] ebernhardson: will this be something you can test when it hits test servers? [21:19:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [21:19:35] bvibber: not really [21:19:42] :D makes it easy then [21:20:19] (03Merged) 10jenkins-bot: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [21:20:32] (03PS45) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:20:50] 06SRE, 06Infrastructure-Foundations, 10netops: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11342153 (10Dzahn) I would expect the cause is that someone committed as root: ` root@netmon1003:/var/lib/rancid/core/.git/objects# find . -user root ./f2 ./f2/... [21:20:51] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1199054|cirrus: Start near match A/B test (T408154)]] [21:20:54] T408154: AB Test doubling near match field weights on commonswiki - https://phabricator.wikimedia.org/T408154 [21:24:02] !log bvibber@deploy2002 bvibber, ebernhardson: Backport for [[gerrit:1199054|cirrus: Start near match A/B test (T408154)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:24:16] (03PS1) 10Kamila Součková: deployment-server: generate clusterinfo for helm [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) [21:24:26] !log bvibber@deploy2002 bvibber, ebernhardson: Continuing with sync [21:24:38] and off it goes to copy to all the many boxes :D [21:24:45] thanks [21:24:54] yw [21:25:23] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:26:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:26:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2009.codfw.wmnet with OS trixie [21:26:08] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11342263 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2009.codfw.wmnet with OS trixie completed: - sretest2009 (**PASS**) - Removed from Puppet... 
[21:26:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P84798 and previous config saved to /var/cache/conftool/dbconfig/20251104-212609-marostegui.json [21:26:38] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [21:26:55] (03PS46) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:28:36] (03PS47) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:28:45] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199054|cirrus: Start near match A/B test (T408154)]] (duration: 07m 53s) [21:28:48] T408154: AB Test doubling near match field weights on commonswiki - https://phabricator.wikimedia.org/T408154 [21:30:25] ebernhardson: should be all done [21:30:34] (03PS48) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:30:45] bvibber: excellent [21:31:16] bvibber: if you should happen to finish with the window early, I'll roll out some envoy updates -- no pressure, but let me know if you don't need the whole thing :) [21:31:40] rzl: go for it, we're all done! [21:33:05] thanks! 
[21:33:05] (PS49) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:34:04] (PS50) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:34:50] (PS1) Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [21:35:57] (PS51) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:36:11] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update to Envoy 1.32.12 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1201730 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [21:36:45] (CR) Scott French: [C:+2] hieradata: pilot use_etcd_known_client_ident on cp2041 [puppet] - https://gerrit.wikimedia.org/r/1196544 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [21:38:00] (Merged) jenkins-bot: {api,rest}-gateway: Update to Envoy 1.32.12 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1201730 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [21:38:05] (CR) CI reject: [V:-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková) [21:38:32] (PS52) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:40:16] (PS53) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 
(https://phabricator.wikimedia.org/T395240) [21:41:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P84799 and previous config saved to /var/cache/conftool/dbconfig/20251104-214117-marostegui.json [21:41:49] hm, api-gateway is dirty -- I'll roll out rest-gateway at least [21:42:54] (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková) [21:43:19] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [21:43:29] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [21:45:42] (PS2) Kamila Součková: deployment-server: generate clusterinfo for helm [puppet] - https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) [21:45:44] (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková) [21:48:03] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [21:48:11] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [21:49:29] pausing there in case the web team is using their window today :) [21:50:51] (PS2) Dzahn: site: apply tcpproxy role on all VMs created for it [puppet] - https://gerrit.wikimedia.org/r/1201312 (https://phabricator.wikimedia.org/T408532) [21:51:38] (PS54) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:52:04] (CR) Dzahn: [C:+2] site: apply tcpproxy role on all VMs created for it [puppet] - https://gerrit.wikimedia.org/r/1201312 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn) [21:53:04] (CR) CDanis: [C:+1] tcpproxy: greatly reduce connection 
timeouts [puppet] - https://gerrit.wikimedia.org/r/1201745 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn) [21:54:27] (PS55) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [21:56:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T407997)', diff saved to https://phabricator.wikimedia.org/P84800 and previous config saved to /var/cache/conftool/dbconfig/20251104-215625-marostegui.json [21:56:29] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:56:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2221.codfw.wmnet with reason: Maintenance [21:56:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T407997)', diff saved to https://phabricator.wikimedia.org/P84801 and previous config saved to /var/cache/conftool/dbconfig/20251104-215649-marostegui.json [21:58:53] (PS1) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1201810 [21:59:34] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201810 (owner: CDanis) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251104T2200) [22:03:51] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1201777 (owner: Ncmonitor) [22:04:13] (PS1) Daimona Eaytoy: Drop $wgCampaignEventsCountrySchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) [22:04:14] (PS1) Scott French: P:cache::haproxy: update confd watch_keys for known-client DSL [puppet] - https://gerrit.wikimedia.org/r/1201811 (https://phabricator.wikimedia.org/T403220) [22:05:53] 
I'm assuming the web team window is unused and rolling out more envoy updates [22:06:15] (PS2) RLazarus: mw-*: Update to Envoy 1.32.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1201731 (https://phabricator.wikimedia.org/T405808) [22:06:30] (PS2) Aaron Schulz: Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1200441 (https://phabricator.wikimedia.org/T396805) [22:06:46] (CR) CDanis: [C:+1] P:cache::haproxy: update confd watch_keys for known-client DSL [puppet] - https://gerrit.wikimedia.org/r/1201811 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:07:51] (CR) Scott French: "Does what I expect, for PCC runs after https://gerrit.wikimedia.org/r/1196544:" [puppet] - https://gerrit.wikimedia.org/r/1201811 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:08:47] (CR) Scott French: "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/1201811 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:08:48] (CR) Scott French: [C:+2] P:cache::haproxy: update confd watch_keys for known-client DSL [puppet] - https://gerrit.wikimedia.org/r/1201811 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:09:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:10] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:36] (CR) RLazarus: [C:+2] mw-*: Update to Envoy 1.32.12 [deployment-charts] - 
https://gerrit.wikimedia.org/r/1201731 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [22:09:44] (PS1) BryanDavis: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/1201816 [22:12:03] (PS3) Aaron Schulz: Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1200441 (https://phabricator.wikimedia.org/T396805) [22:12:23] (PS2) CDanis: prometheus::ops: add tcpproxies scrape [puppet] - https://gerrit.wikimedia.org/r/1201810 [22:12:47] (PS3) CDanis: prometheus::ops: add tcpproxies scrape [puppet] - https://gerrit.wikimedia.org/r/1201810 (https://phabricator.wikimedia.org/T408532) [22:12:59] (Merged) jenkins-bot: mw-*: Update to Envoy 1.32.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1201731 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [22:13:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T407997)', diff saved to https://phabricator.wikimedia.org/P84802 and previous config saved to /var/cache/conftool/dbconfig/20251104-221306-marostegui.json [22:13:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:13:44] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201810 (https://phabricator.wikimedia.org/T408532) (owner: CDanis) [22:14:11] (CR) Dzahn: [C:+1] "thank you" [puppet] - https://gerrit.wikimedia.org/r/1201810 (https://phabricator.wikimedia.org/T408532) (owner: CDanis) [22:14:27] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1201731 T405808 [22:14:30] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [22:17:03] SRE, Infrastructure-Foundations, netops: Rancid network backups not being synced to git properly - 
https://phabricator.wikimedia.org/T409217#11342444 (cmooney) Thanks @Dzahn appreciate it! Yep that's what I thought, I will give your suggestion a try in the morning and see does it resolve the problem. [22:19:31] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1201731 T405808 (duration: 05m 39s) [22:19:34] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [22:21:49] (PS2) RLazarus: mw-videoscaler: Update to Envoy 1.32.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1201732 (https://phabricator.wikimedia.org/T405808) [22:23:17] (PS1) Dzahn: gerrit: allow production networks to connect to gerrit-ssh [puppet] - https://gerrit.wikimedia.org/r/1201820 (https://phabricator.wikimedia.org/T365259) [22:24:45] (CR) Dzahn: [C:+2] prometheus::ops: add tcpproxies scrape [puppet] - https://gerrit.wikimedia.org/r/1201810 (https://phabricator.wikimedia.org/T408532) (owner: CDanis) [22:25:51] (CR) RLazarus: [C:+2] mw-videoscaler: Update to Envoy 1.32.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1201732 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [22:27:41] (Merged) jenkins-bot: mw-videoscaler: Update to Envoy 1.32.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1201732 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus) [22:27:41] (PS1) Andrew Bogott: dnsrecursor config: fix a few broken settings in the yaml config [puppet] - https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) [22:28:02] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [22:28:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P84803 and previous config saved to /var/cache/conftool/dbconfig/20251104-222814-marostegui.json 
[22:29:43] (PS2) Dzahn: gerrit: allow production networks to connect to gerrit-ssh [puppet] - https://gerrit.wikimedia.org/r/1201820 (https://phabricator.wikimedia.org/T365259) [22:33:00] (CR) Dzahn: [C:+2] "needed a restart of haproxy on all hosts - done via cumin - after that I see listening port 9422 and new config" [puppet] - https://gerrit.wikimedia.org/r/1201810 (https://phabricator.wikimedia.org/T408532) (owner: CDanis) [22:33:41] rzl: let me know if you finish soon. I could sneak https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1200441 into the deploy window if no one's doing anything. [22:36:29] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [22:37:41] (PS1) Dzahn: tcpproxy: use notify to ensure service gets restarted on config changes [puppet] - https://gerrit.wikimedia.org/r/1201822 (https://phabricator.wikimedia.org/T408532) [22:37:46] (PS1) BCornwall: ncredir: Ignore wikimedia.support [puppet] - https://gerrit.wikimedia.org/r/1201823 [22:37:51] AaronSchulz: sure, checking one more thing then I'll wrap up and let you know [22:38:41] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [22:38:49] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [22:39:17] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [22:39:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [22:41:12] AaronSchulz: all yours! 
[22:42:25] (PS1) Scott French: hieradata: end use_etcd_known_client_ident pilot on cp2041 [puppet] - https://gerrit.wikimedia.org/r/1201824 (https://phabricator.wikimedia.org/T403220) [22:43:15] rzl: thanks [22:43:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P84804 and previous config saved to /var/cache/conftool/dbconfig/20251104-224321-marostegui.json [22:43:55] (CR) CDobbins: [C:+2] ncredir: Ignore wikimedia.support [puppet] - https://gerrit.wikimedia.org/r/1201823 (owner: BCornwall) [22:43:57] (CR) TrainBranchBot: [C:+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200441 (https://phabricator.wikimedia.org/T396805) (owner: Aaron Schulz) [22:45:00] (Merged) jenkins-bot: Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1200441 (https://phabricator.wikimedia.org/T396805) (owner: Aaron Schulz) [22:45:03] (PS2) Andrew Bogott: dnsrecursor config: fix a few broken settings in the yaml config [puppet] - https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) [22:45:08] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [22:45:34] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1200441|Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs (T396805)]] [22:45:37] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [22:46:41] (CR) Scott French: [C:+2] hieradata: end use_etcd_known_client_ident pilot on cp2041 [puppet] - https://gerrit.wikimedia.org/r/1201824 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:47:54] !log aaron@deploy2002 aaron: Backport for 
[[gerrit:1200441|Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs (T396805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:48:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:49:07] !log aaron@deploy2002 aaron: Continuing with sync [22:52:49] (PS8) Cwhite: prometheus: split targets into directories by source [puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) [22:52:49] (CR) Cwhite: "PCC looks right: https://puppet-compiler.wmflabs.org/output/1201773/7551/" [puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: Cwhite) [22:53:22] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200441|Add a wgRestSandboxSpecs entry for wikimedia.org (math) specs (T396805)]] (duration: 07m 48s) [22:53:25] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [22:53:33] (CR) Cwhite: [C:+1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) (owner: Andrea Denisse) [22:53:53] FIRING: [6x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:27] (PS1) Aaron Schulz: Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1201826 [22:58:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T407997)', diff saved to https://phabricator.wikimedia.org/P84805 and previous config saved to /var/cache/conftool/dbconfig/20251104-225829-marostegui.json [22:58:33] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:58:38] * AaronSchulz is done [22:58:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2222.codfw.wmnet with reason: Maintenance [22:58:52] FIRING: [7x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T407997)', diff saved to https://phabricator.wikimedia.org/P84806 and previous config saved to /var/cache/conftool/dbconfig/20251104-225853-marostegui.json [23:03:52] FIRING: [8x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:05:06] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1201820/7553/gerrit1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1201820 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn) [23:06:03] (PS1) Cwhite: logstash: reduce logstash-ml index utilization on ssd-class nodes [puppet] - https://gerrit.wikimedia.org/r/1201827 (https://phabricator.wikimedia.org/T390215) [23:07:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [23:07:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [23:08:02] (CR) Cwhite: [C:+2] logstash: reduce logstash-ml index utilization on ssd-class nodes [puppet] - https://gerrit.wikimedia.org/r/1201827 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [23:08:08] (CR) Dzahn: [C:+2] tcpproxy: use notify to ensure service gets restarted on config changes [puppet] - https://gerrit.wikimedia.org/r/1201822 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn) [23:08:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [23:08:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [23:16:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T407997)', diff saved to https://phabricator.wikimedia.org/P84807 and previous config saved to /var/cache/conftool/dbconfig/20251104-231628-marostegui.json [23:16:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [23:17:09] (CR) BPirkle: [C:+1] Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1201826 
(owner: Aaron Schulz) [23:20:16] (CR) Andrea Denisse: [C:+2] alertmanager: Add dashboard and runbook for Slack alerts [puppet] - https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) (owner: Andrea Denisse) [23:21:28] (PS1) Dzahn: tcpproxy: add simple puppet service resource to manage haproxy [puppet] - https://gerrit.wikimedia.org/r/1201828 (https://phabricator.wikimedia.org/T408532) [23:22:07] (PS2) Dzahn: tcpproxy: add simple puppet service resource to manage haproxy [puppet] - https://gerrit.wikimedia.org/r/1201828 (https://phabricator.wikimedia.org/T408532) [23:31:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P84808 and previous config saved to /var/cache/conftool/dbconfig/20251104-233135-marostegui.json [23:36:00] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1201828/7554/tcp-proxy1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1201828 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn) [23:46:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P84809 and previous config saved to /var/cache/conftool/dbconfig/20251104-234643-marostegui.json [23:56:51] SRE, collaboration-services, Traffic, Patch-For-Review, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11342826 (Dzahn) haproxy configured as TCP-proxy for gerrit-ssh has been deployed on all 14 VMs, across POPs. We can now con... [23:57:30] SRE, collaboration-services, Traffic, Patch-For-Review, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11342827 (Dzahn) [23:59:14] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: Cwhite)