[00:03:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:15:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:38:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1200697
[00:38:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1200697 (owner: TrainBranchBot)
[00:54:15] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1200697 (owner: TrainBranchBot)
[00:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:07:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:08:03] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1200709
[01:08:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1200709 (owner: TrainBranchBot)
[01:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:15:45] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 15m 04s)
[01:30:00] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1200709 (owner: TrainBranchBot)
[01:33:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:47:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:47:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:42:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:47:22] (PS1) Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T405080)
[02:47:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:48:09] (CR) CI reject: [V:-1] Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T405080) (owner: Pppery)
[02:48:39] (PS2) Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022)
[02:49:22] (PS3) Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022)
[02:49:27] (CR) CI reject: [V:-1] Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: Pppery)
[02:50:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:55:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:55:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:08:56] (PS1) MusikAnimal: AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1200746
[03:10:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:26:05] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1200746 (owner: MusikAnimal)
[03:27:04] (Merged) jenkins-bot: AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1200746 (owner: MusikAnimal)
[03:27:40] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]]
[03:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:52:31] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:53:27] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[04:07:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:07:35] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]] (duration: 39m 55s)
[04:12:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:12:36] SRE, Incident Tooling, Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804#11334175 (Pppery)
[04:21:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:23:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:26:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:28:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:34:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:49:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:55:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:59:56] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11334176 (Papaul)
[05:00:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:01:21] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.268 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:04:17] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11334177 (Papaul) @cmooney i update all the IP's to match the other POP sites. I will be re-running the configuration and validation sometimes this week in m...
[05:06:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:07:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 3.183 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:08:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:16:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:19:21] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 4.976 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:25:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:26:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.375 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:33:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:11] (CR) Fabfur: [C:+1] P:cache::varnish::frontend: render known-client rate limit VCL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[05:36:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:41:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:49:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:51:37] SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11334184 (Joe) Open→Resolved
[05:55:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:04:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:05:07] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:06:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:07:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.404 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:10:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:11:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.626 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:11:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[06:14:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:15:06] SRE, Hiddenparma, Traffic: Collect known client fingerprints for common libraries - https://phabricator.wikimedia.org/T409024 (Joe) NEW
[06:15:21] SRE, Hiddenparma, Traffic: Collect known client fingerprints for common libraries - https://phabricator.wikimedia.org/T409024#11334201 (Joe) p:Triage→Medium
[06:16:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 7.718 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:18:50] (PS1) Marostegui: mariadb: Move db1231 to s7 [puppet] - https://gerrit.wikimedia.org/r/1200751 (https://phabricator.wikimedia.org/T408829)
[06:19:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1231 T408829', diff saved to https://phabricator.wikimedia.org/P84568 and previous config saved to /var/cache/conftool/dbconfig/20251103-061906-marostegui.json
[06:19:14] T408829: Move one s6 eqiad host to s7 - https://phabricator.wikimedia.org/T408829
[06:20:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db[1174,1231].eqiad.wmnet with reason: Moving db1231 to s7
[06:20:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1174.eqiad.wmnet onto db1231.eqiad.wmnet
[06:20:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1174 - Depool db1174.eqiad.wmnet to then clone it to db1231.eqiad.wmnet - marostegui@cumin1003
[06:21:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1174 - Depool db1174.eqiad.wmnet to then clone it to db1231.eqiad.wmnet - marostegui@cumin1003
[06:21:30] (CR) Marostegui: [C:+2] mariadb: Move db1231 to s7 [puppet] - https://gerrit.wikimedia.org/r/1200751 (https://phabricator.wikimedia.org/T408829) (owner: Marostegui)
[06:22:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:23:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.270 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:25:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[06:25:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:26:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84570 and previous config saved to /var/cache/conftool/dbconfig/20251103-062603-marostegui.json
[06:26:06] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[06:27:19] (PS1) Marostegui: db2174: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200752 (https://phabricator.wikimedia.org/T407463)
[06:28:20] (CR) Marostegui: [C:+2] db2174: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200752 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[06:29:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance
[06:29:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2174 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84571 and previous config saved to /var/cache/conftool/dbconfig/20251103-062919-marostegui.json
[06:36:55] (PS9) Fabfur: P:cache:haproxy: introduce ua classes [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060)
[06:37:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84572 and previous config saved to /var/cache/conftool/dbconfig/20251103-063742-root.json
[06:37:56] (CR) Fabfur: P:cache:haproxy: introduce ua classes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: Fabfur)
[06:38:32] !log Drop afl_ip related triggers from s2 T408780
[06:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:34] T408780: Drop abuse_filter_log trigger for afl_ip column - https://phabricator.wikimedia.org/T408780
[06:38:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84573 and previous config saved to /var/cache/conftool/dbconfig/20251103-063838-marostegui.json
[06:38:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[06:41:51] SRE, Hiddenparma, Traffic: Collect known client fingerprints for common libraries and browsers - https://phabricator.wikimedia.org/T409024#11334225 (Joe)
[06:52:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84574 and previous config saved to /var/cache/conftool/dbconfig/20251103-065248-root.json
[06:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P84575 and previous config saved to /var/cache/conftool/dbconfig/20251103-065346-marostegui.json
[06:57:07] (PS1) Marostegui: db1177: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200753
[06:57:40] (CR) Marostegui: [C:+2] db1177: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200753 (owner: Marostegui)
[06:58:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[06:58:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1177 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84576 and previous config saved to /var/cache/conftool/dbconfig/20251103-065808-marostegui.json
[07:06:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84577 and previous config saved to /var/cache/conftool/dbconfig/20251103-070612-root.json
[07:07:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84578 and previous config saved to /var/cache/conftool/dbconfig/20251103-070753-root.json
[07:08:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P84579 and previous config saved to /var/cache/conftool/dbconfig/20251103-070853-marostegui.json
[07:15:04] (PS2) Ryan Kemper: wdqs: detect blazegraph deadlock [alerts] - https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859)
[07:16:23] (PS1) Marostegui: installserver: Do not reimage es1054 [puppet] - https://gerrit.wikimedia.org/r/1200754
[07:18:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:18:40] (CR) Marostegui: [C:+2] installserver: Do not reimage es1054 [puppet] - https://gerrit.wikimedia.org/r/1200754 (owner: Marostegui)
[07:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84580 and previous config saved to /var/cache/conftool/dbconfig/20251103-072118-root.json
[07:23:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84581 and previous config saved to /var/cache/conftool/dbconfig/20251103-072303-root.json
[07:23:34] (PS1) Marostegui: instances.yaml: Remove es1034 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1200755 (https://phabricator.wikimedia.org/T409025)
[07:24:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84582 and previous config saved to /var/cache/conftool/dbconfig/20251103-072405-marostegui.json
[07:24:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[07:24:23] (CR) Marostegui: [C:+2] instances.yaml: Remove es1034 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1200755 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[07:24:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:24:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84583 and previous config saved to /var/cache/conftool/dbconfig/20251103-072431-marostegui.json
[07:25:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1034 from dbctl T409025', diff saved to https://phabricator.wikimedia.org/P84584 and previous config saved to /var/cache/conftool/dbconfig/20251103-072527-marostegui.json
[07:25:34] T409025: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025
[07:26:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84585 and previous config saved to /var/cache/conftool/dbconfig/20251103-072647-marostegui.json
[07:27:32] (PS1) Marostegui: backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025)
[07:28:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:29:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:29:49] (CR) Marostegui: "Jaime, this is a NOOP so I am merging it without waiting for you. es1057 was cloned from es1034, but neither of them have the dump user. D" [puppet] - https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[07:29:52] (CR) Marostegui: [C:+2] backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[07:35:23] (CR) Marostegui: [C:+2] "Just checked, none of the RO (es1-es5) section have the dump user. If this is expected, then nothing else to be done here. If it is not, " [puppet] - https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[07:36:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84586 and previous config saved to /var/cache/conftool/dbconfig/20251103-073624-root.json
[07:39:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:40:37] (PS1) Marostegui: es1034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1200759 (https://phabricator.wikimedia.org/T409025)
[07:41:56] (CR) Marostegui: [C:+2] es1034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1200759 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[07:42:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P84587 and previous config saved to /var/cache/conftool/dbconfig/20251103-074156-marostegui.json
[07:51:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84588 and previous config saved to /var/cache/conftool/dbconfig/20251103-075130-root.json
[07:57:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P84589 and previous config saved to /var/cache/conftool/dbconfig/20251103-075706-marostegui.json
[07:57:42] marostegui@cumin1003 clone (PID 2864179) is awaiting input
[07:57:47] ops-eqiad, DBA, DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030 (Marostegui) NEW
[07:58:32] ops-eqiad, DBA, DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11334341 (Marostegui) p:Triage→Medium
[08:00:00] SRE, SRE-swift-storage, Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11334342 (MoritzMuehlenhoff) >>! In T407513#11332007, @LSobanski wrote: > To avoid confusion I believe the above statement should say "now available" instead of "...
[08:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T0800).
[08:00:05] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:34] o/ [08:07:23] I'll probably reschedule the patch for the next window since, as every Monday, the window will be empty :P [08:09:30] (03CR) 10Muehlenhoff: [C:03+2] Re-enable monitoring for maps/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:12:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84590 and previous config saved to /var/cache/conftool/dbconfig/20251103-081214-marostegui.json [08:12:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:12:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [08:12:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84591 and previous config saved to /var/cache/conftool/dbconfig/20251103-081238-marostegui.json [08:20:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [08:22:51] (03PS1) 10Marostegui: db1231: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) [08:23:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:24:09] (03PS2) 10Marostegui: db1231: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) [08:24:43] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [08:24:58] (03CR) 10Marostegui: [C:03+2] db1231: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) (owner: 10Marostegui) [08:25:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84593 and previous config saved to /var/cache/conftool/dbconfig/20251103-082543-root.json [08:25:50] PROBLEM - very high load average likely xfs on ms-be1074 is CRITICAL: CRITICAL - load average: 160.95, 108.51, 54.98 https://wikitech.wikimedia.org/wiki/Swift [08:27:46] PROBLEM - very high load average likely xfs on ms-be1074 is CRITICAL: CRITICAL - load average: 142.36, 117.39, 64.13 https://wikitech.wikimedia.org/wiki/Swift [08:28:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.648 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:29:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84594 and previous config saved to /var/cache/conftool/dbconfig/20251103-082909-marostegui.json [08:29:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:32:46] RECOVERY - very high load average likely xfs on ms-be1074 is OK: OK - load average: 16.79, 68.62, 59.75 https://wikitech.wikimedia.org/wiki/Swift [08:34:19] (03CR) 10Muehlenhoff: [C:03+2] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [08:40:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: After moving it to s7', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251103-084049-root.json [08:40:57] !log silence wikitech-static icinga alert for 
a couple of weeks - T409029 [08:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:12] T409029: Flapping wikitech-static icinga alert - https://phabricator.wikimedia.org/T409029 [08:44:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P84596 and previous config saved to /var/cache/conftool/dbconfig/20251103-084417-marostegui.json [08:45:49] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Docker [08:51:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84598 and previous config saved to /var/cache/conftool/dbconfig/20251103-085600-root.json [08:56:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:33] !log elukey@cumin1003 START - Cookbook sre.dns.netbox [08:59:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P84599 and previous config saved to /var/cache/conftool/dbconfig/20251103-085925-marostegui.json [08:59:59] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: fix uncommitted changes for mwdebug2002 - elukey@cumin1003" [09:00:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix uncommitted changes for mwdebug2002 - elukey@cumin1003" [09:00:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:02:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:03:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:09] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:08:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:58] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.mysql.clone (exit_code=0) of db1174.eqiad.wmnet onto db1231.eqiad.wmnet [09:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:11:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 15%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84600 and previous config saved to /var/cache/conftool/dbconfig/20251103-091109-root.json [09:11:46] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:14:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84601 and previous config saved to /var/cache/conftool/dbconfig/20251103-091435-marostegui.json [09:14:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [09:14:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T407997)', diff saved to https://phabricator.wikimedia.org/P84602 and previous config saved to /var/cache/conftool/dbconfig/20251103-091452-marostegui.json [09:14:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:15:23] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [09:17:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T407997)', diff saved to 
https://phabricator.wikimedia.org/P84603 and previous config saved to /var/cache/conftool/dbconfig/20251103-091708-marostegui.json [09:22:35] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11334494 (10elukey) Really interesting, I retried today a reimage and got a "no media present" when trying to pxe/http boot. Then I checked the Boot order and the wrong UEFI netwo... [09:25:50] (03PS1) 10Esanders: Freeze LiquidThreads on huwiki and svwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) [09:26:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84604 and previous config saved to /var/cache/conftool/dbconfig/20251103-092618-root.json [09:29:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) (owner: 10Esanders) [09:29:22] (03CR) 10Clément Goubert: [C:03+2] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [09:31:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036 (10MatthewVernon) 03NEW [09:31:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11334535 (10MatthewVernon) p:05Triage→03High [09:32:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P84605 and previous config saved to 
/var/cache/conftool/dbconfig/20251103-093218-marostegui.json [09:33:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:14] is there any way to get information / output about an mwscript-k8s job after it’s been cleaned up? (context: https://phabricator.wikimedia.org/T398177#11334550) [09:35:24] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:35:26] Lucas_WMDE: logstash [09:35:27] like, maybe it gets cleaned up from k8s but is still in logstash or somewhere else? [09:35:30] ooh [09:36:34] nice, Kubernetes Events has something [09:37:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:37:13] Lucas_WMDE: tell me if you need help, I still have about 1h free :) [09:37:31] claime: so far I have https://logstash.wikimedia.org/goto/d4e84efcce342199642dede2a735d8be and am trying to make sense of it ^^ [09:37:46] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[09:37:48] which looks like it had died within half a day of me launching it [09:37:57] not sure if I can see the error reason anywhere [09:38:11] like, if it was another oom sigkill or something else [09:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:56] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:39:32] !log elukey@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [09:39:46] Hmm. [09:40:09] !log elukey@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [09:40:13] oooh, https://logstash.wikimedia.org/goto/607cd49141903a654ac2a97f06710486 looks a lot better [09:40:16] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:40:21] (App Logs instead of Kubernetes Events) [09:40:31] that’s… the full output? :o [09:40:34] (until it died anyway) [09:41:10] Lucas_WMDE: yeah [09:41:19] full output, one line per message because it's stupid [09:41:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 30%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84606 and previous config saved to /var/cache/conftool/dbconfig/20251103-094126-root.json [09:42:16] nice [09:42:46] and can I get the error / failure status somewhere? 
I assume it must have died for some reason that I can’t see yet [09:43:36] also, “Logs are retained in Logstash for a maximum of 90 days by default” (https://wikitech.wikimedia.org/wiki/Logstash) so I should pull the logs out of there later ^^ [09:43:38] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) [09:45:44] Lucas_WMDE: Hmm for the failure status I'm not sure, I'll take a look [09:45:49] ok, thanks! [09:46:09] then I’ll hold off on commenting on the task for a bit :) [09:47:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P84607 and previous config saved to /var/cache/conftool/dbconfig/20251103-094726-marostegui.json [09:48:06] Lucas_WMDE: I'm not finding it [09:48:38] (03PS1) 10Marostegui: db1178: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1200961 [09:48:42] hm, ok [09:49:10] then I guess I’ll just write that OOM feels like a possibility [09:49:10] (03CR) 10Marostegui: [C:03+2] db1178: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1200961 (owner: 10Marostegui) [09:49:17] (since any PHP-level error should be visible in the logs) [09:49:19] Lucas_WMDE: I'm checking grafana to see if I can confirm that [09:49:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040 (10MatthewVernon) 03NEW [09:50:09] !log installing intel-microcode security updates [09:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:39] interesting idea https://grafana.wikimedia.org/goto/HZJHdSzDg?orgId=1 [09:50:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11334669 (10MatthewVernon) p:05Triage→03High [09:51:21] that doesn’t look super OOMy [09:51:28] (maybe you have a 
better grafana dashboard) [09:51:30] Nope [09:51:36] (to both) [09:51:59] I guess I could just try an enwiki dry run then, see if it crashes again [09:52:08] Yeah that would be the way to go [09:52:19] alright, then I’ll comment on the task [09:52:21] thanks for your help! \o/ [09:52:24] I'll make a note somewhere to see if we can record failure states in logstash *somehow* [09:54:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11334715 (10elukey) In theory the HttpBootPolicy should hit the right HTTP boot after some tries without stopping at the first failure: ` ['(B199/D0/F0) UEFI HTTP IPv4 Intel(R) I... [09:56:10] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [09:56:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84608 and previous config saved to /var/cache/conftool/dbconfig/20251103-095632-root.json [09:58:24] hm, if I narrow the date range, then https://grafana.wikimedia.org/goto/SYiEOSzDg?orgId=1 shows some suspicious spikes in the memory usage [09:58:45] it already came *very* close to the limit earlier (peaked at 1.13 out of 1.17 GiB limit) [09:59:18] Lucas_WMDE: Hah, sampling :D [09:59:22] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11334728 (10LSobanski) [09:59:27] un ts un ts un ts un ts [09:59:31] but it's not high at the moment of the cut [09:59:34] yeah [09:59:52] and it doesn’t feel like it could’ve spiked past the limit before even a single sample was recorded [09:59:53] Although it could have spiked hard and fast enough to get wrecked and the metrics not scraped [09:59:58] hah [10:00:06] I think it's 1m interval for the 
scrape [10:00:14] hm [10:00:42] yeah ok the previous spike hit its plateau within just over a minute apparently [10:01:01] Honestly I would try to repro [10:01:07] It's probably the easiest [10:01:39] alright [10:01:49] but I’ll leave that to MatmaRex first, it’s his maintenance script ^^ [10:02:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T407997)', diff saved to https://phabricator.wikimedia.org/P84609 and previous config saved to /var/cache/conftool/dbconfig/20251103-100233-marostegui.json [10:02:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:02:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [10:02:57] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11334741 (10Geagea) I've just received notification from October 29 (6 days). [10:02:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84610 and previous config saved to /var/cache/conftool/dbconfig/20251103-100257-marostegui.json [10:03:32] commented, feel free to unsubscribe again if you like ;) [10:04:07] (03PS1) 10David Caro: toolforge: add elasticsearch metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/1201011 (https://phabricator.wikimedia.org/T409047) [10:04:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11334768 (10LSobanski) I just checked and the junk queue is close to 500k at this time. 
[10:04:58] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:05:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84611 and previous config saved to /var/cache/conftool/dbconfig/20251103-100511-marostegui.json [10:07:33] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [10:07:51] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11334788 (10LSobanski) Here's the increase in disk space and inode usage since October 27th: {F69754786} [10:08:13] (03CR) 10David Caro: [V:03+1] "Tested in tools, all endpoints scraping ok https://phabricator.wikimedia.org/T409047#11334799" [puppet] - 10https://gerrit.wikimedia.org/r/1201011 (https://phabricator.wikimedia.org/T409047) (owner: 10David Caro) [10:11:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 60%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84612 and previous config saved to /var/cache/conftool/dbconfig/20251103-101138-root.json [10:16:08] (03CR) 10Jcrespo: [C:03+1] backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [10:17:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:19:35] (03PS1) 10Muehlenhoff: Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 [10:20:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [10:20:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to 
https://phabricator.wikimedia.org/P84614 and previous config saved to /var/cache/conftool/dbconfig/20251103-102018-marostegui.json [10:22:11] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-worker1001.eqiad.wmnet with OS bullseye [10:24:44] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11334866 (10TheDJ) Not sure if this font issue T408884 is related, but it was reported around the switch to the new services, so might be worth double checking if the k8s images have... [10:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:39] (03PS10) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:26:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84616 and previous config saved to /var/cache/conftool/dbconfig/20251103-102645-root.json [10:27:01] (03CR) 10Brouberol: [C:03+2] Enable normal caching for growthbook.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:27:04] (03CR) 10Brouberol: [C:03+2] Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:27:24] (03PS1) 10Marostegui: wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) [10:29:30] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - 
https://phabricator.wikimedia.org/T394357#11334890 (10elukey) I've set up the `UEFINetwork` list with `90:5A:08:9F:08:80` UEFI HTTP first, and it got reflected to `FixedBootOrder`. Ran a chassis reset, waited for the os t... [10:33:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:35:01] (03PS11) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:35:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20251103-103527-marostegui.json [10:38:49] (03PS12) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:38:51] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:39:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11334905 (10elukey) >>! In T381565#11334866, @TheDJ wrote: > Not sure if this font issue T408884 is related, but it was reported around the switch to the new services, so might be wo... 
[10:40:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84617 and previous config saved to /var/cache/conftool/dbconfig/20251103-104152-root.json [10:43:40] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:43:56] (03CR) 10Federico Ceratto: [C:03+1] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:44:00] (03CR) 10Marostegui: [C:03+2] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:44:06] !log marostegui@dns1006 START - running authdns-update [10:44:07] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:44:30] !log Switch m3 (phabricator) proxy to dbproxy1028 T408956 [10:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:41] T408956: Occasional database errors when using/browsing Phabricator - https://phabricator.wikimedia.org/T408956 [10:44:59] !log marostegui@dns1006 END - running authdns-update [10:46:56] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:47:07] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT (036 comments) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [10:49:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84618 and previous config saved to /var/cache/conftool/dbconfig/20251103-105038-marostegui.json [10:50:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:50:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:52:44] (03PS2) 10Muehlenhoff: Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 [10:52:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:54:20] 06SRE, 10AQS2.0, 10Cassandra, 06serviceops, 07Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855#11334980 (10Htriedman) I would love it to be but have no control over priorities here! What could I do to help move it forward?
[10:57:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [10:59:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1100) [11:01:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [11:01:09] 06SRE, 10Wikimedia-Mailing-lists: Reports of unsubscribe from wikitech-ambassadors failing to work - https://phabricator.wikimedia.org/T405153#11335012 (10Aklapper) 05Open→03Stalled > Tried again earlier today, we'll see if I get the mailing list mail again next week. @Technical13: Is this still an issue? [11:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84619 and previous config saved to /var/cache/conftool/dbconfig/20251103-110111-marostegui.json [11:01:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:03:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84620 and previous config saved to /var/cache/conftool/dbconfig/20251103-110326-marostegui.json [11:05:51] (03CR) 10JMeybohm: [C:03+1] "Cool!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [11:06:08] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335034 (10elukey) >>! In T404356#11331717, @elukey wrote: > There are still some provisioning issues for sretest2010 (see T394357)... [11:07:53] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335037 (10elukey) >>! In T404356#11335034, @elukey wrote: >>>! In T404356#11331717, @elukey wrote: >> There are still some provisio... [11:08:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [11:10:06] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-worker1001.eqiad.wmnet with OS bullseye [11:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:13:30] (03PS1) 10Muehlenhoff: Remove code to install hp-health [puppet] - 10https://gerrit.wikimedia.org/r/1201030 [11:14:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor 
over limit - https://phabricator.wikimedia.org/T408963#11335049 (10phaultfinder) [11:15:19] (03CR) 10Jcrespo: [C:03+2] "Merge for easier migration to gitlab." [software/transferpy] - 10https://gerrit.wikimedia.org/r/972446 (owner: 10Jcrespo) [11:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:55] (03CR) 10Jcrespo: [C:03+2] "https://gerrit.wikimedia.org/r/operations/software/transferpy" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 (owner: 10Jcrespo) [11:16:14] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Add a few fixes after lintering to clean up the code [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 (owner: 10Jcrespo) [11:16:35] (03CR) 10Jcrespo: [V:03+2 C:03+2] RemoteExecution: Restore RemoteExecution class back into transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:16:53] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:17:45] (03CR) 10Brouberol: [C:03+2] Create the growthbook.wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1200317 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [11:18:04] !log brouberol@dns1004 START - running authdns-update [11:18:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P84621 and previous config saved to /var/cache/conftool/dbconfig/20251103-111834-marostegui.json [11:18:43] (03CR) 10Jcrespo: 
[V:03+2 C:03+2] "Merge for easier migration to gitlab (issues were fixed on a later patch)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:18:59] !log brouberol@dns1004 END - running authdns-update [11:19:01] (03CR) 10Jcrespo: [C:03+2] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:19:05] (03PS13) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:19:18] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:20:03] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo) [11:20:55] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab (this was fixed on a later commit)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [11:21:48] (03PS3) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 [11:21:51] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [11:22:04] (03CR) 10Jcrespo: [C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:22:20] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676
(https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:22:36] (03PS4) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) [11:22:38] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:23:00] (03CR) 10Jcrespo: [C:03+2] transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 (owner: 10Jcrespo) [11:23:04] (03PS5) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 [11:23:06] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 (owner: 10Jcrespo) [11:23:13] (03CR) 10Jcrespo: [C:03+2] transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:23:17] (03PS2) 10Jcrespo: transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) [11:23:18] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:24:05] (03CR) 10Jcrespo: [V:03+2 C:03+2] "New command is here" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 (owner: 10Jcrespo) [11:24:11] (03PS2) 10Jcrespo: transferpy: Fix the check for empty directories [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 [11:24:17] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Fix the check for empty directories 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 (owner: 10Jcrespo) [11:24:28] (03PS2) 10Jcrespo: transferpy: Force ipv4 usage for now, fix bug with found port [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198521 [11:24:49] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Force ipv4 usage for now, fix bug with found port [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198521 (owner: 10Jcrespo) [11:25:01] (03PS2) 10Jcrespo: Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200112 [11:25:10] (03CR) 10Jcrespo: [V:03+2 C:03+2] Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200112 (owner: 10Jcrespo) [11:25:34] (03CR) 10Jcrespo: "And here is the second part of the fix" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:25:42] (03CR) 10Jcrespo: [C:03+2] Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:25:47] (03PS3) 10Jcrespo: Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) [11:25:51] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:26:31] (03Abandoned) 10Jcrespo: transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:27:03] (03Abandoned) 10Jcrespo: transferpy: Add support for nftables [software/transferpy] - 
10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [11:27:10] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [11:27:41] (03Abandoned) 10Jcrespo: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:28:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [11:28:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS trixie [11:28:27] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11335120 (10elukey) Found a little odd spike today in Pyrra for `xlab-standalone-event-validation-success-rate-v1`: [[ https://thanos.wikimedia.org/graph?g0.exp... 
[11:28:38] (03Abandoned) 10Jcrespo: Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [11:28:47] (03Abandoned) 10Jcrespo: Fix:: InvalidQueryException handling [software/transferpy] - 10https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [11:29:09] (03Abandoned) 10Jcrespo: [POC5 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/621898 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:29:13] (03Abandoned) 10Jcrespo: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:33:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P84622 and previous config saved to /var/cache/conftool/dbconfig/20251103-113341-marostegui.json [11:33:52] (03PS1) 10Jcrespo: [WIP]Prepare for release 2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1201036 [11:34:31] (03Abandoned) 10Jcrespo: [WIP]Prepare for release 2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1201036 (owner: 10Jcrespo) [11:35:07] (03PS1) 10Federico Ceratto: Flip es1, es2, es3 masters [dns] - 10https://gerrit.wikimedia.org/r/1201037 (https://phabricator.wikimedia.org/T402859) [11:35:41] (03CR) 10Elukey: [C:03+1] Remove code to install hp-health [puppet] - 10https://gerrit.wikimedia.org/r/1201030 (owner: 10Muehlenhoff) [11:36:26] (03CR) 10Marostegui: [C:03+1] Flip es1, es2, es3 masters [dns] - 10https://gerrit.wikimedia.org/r/1201037 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:43:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:48:25] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:48:27] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409060 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:48:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060 (10ops-monitoring-bot) 03NEW [11:48:38] (03PS14) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:48:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84623 and previous config saved to /var/cache/conftool/dbconfig/20251103-114849-marostegui.json [11:48:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:49:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [11:49:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T407997)', diff saved to https://phabricator.wikimedia.org/P84624 and previous config saved to 
/var/cache/conftool/dbconfig/20251103-114913-marostegui.json [11:51:57] (03CR) 10Vgutierrez: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:54:11] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11335237 (10cmooney) Thanks @papaul. One to discuss with @ayounsi when he is back are the IPv6 gateway addresses on the vlans. ` on asw1-22 irb.411 public1-ul... [11:55:14] (03PS15) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:58:01] !log move analytics1-c-eqiad gateway IPs to new spine switch ports eqiad T405579 [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:11] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [12:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T407997)', diff saved to https://phabricator.wikimedia.org/P84625 and previous config saved to /var/cache/conftool/dbconfig/20251103-120108-marostegui.json [12:01:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:01:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.844s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:11:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:06] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335269 (10hnowlan) 05Open→03In progress [12:12:44] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335273 (10hnowlan) Awaiting out of band verification of SSH key on Slack. Tagging @thcipriani as approver for `deployment` group. 
[12:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P84626 and previous config saved to /var/cache/conftool/dbconfig/20251103-121617-marostegui.json [12:16:18] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11335277 (10hnowlan) 05Open→03Stalled Blocked on approval from @mark. [12:16:50] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335279 (10hnowlan) [12:16:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335280 (10hnowlan) Key verified out of band. [12:18:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure