[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166494 (owner: TrainBranchBot)
[00:07:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530
[00:07:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530 (owner: TrainBranchBot)
[00:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:43] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530 (owner: TrainBranchBot)
[00:51:07] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[01:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:29:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:37:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29361 bytes in 4.056 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:42:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:43:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 2.880 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:46:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:47:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 7.267 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:50:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:51:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29770 bytes in 0.837 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:21:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 3.216 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:29:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29772 bytes in 7.457 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:35:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:44:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29507 bytes in 3.630 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:51:07] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:04:37] (PS1) KartikMistry: WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491)
[05:06:35] (CR) CI reject: [V:-1] WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:29] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 2 (gerrit1003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:22:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 6.472 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:25:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:26:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29872 bytes in 6.695 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:29:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:30:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 2.899 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:53:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:58:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 4.479 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:02:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:12:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29770 bytes in 0.496 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:16:27] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:22:21] (PS2) KartikMistry: WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491)
[06:24:17] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (604888s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:28:38] (PS2) Nikerabbit: Remove chararacterEditStatsTranslate [puppet] - https://gerrit.wikimedia.org/r/1164956 (https://phabricator.wikimedia.org/T398171)
[06:29:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[06:29:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[06:43:48] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:54:58] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[06:56:17] (CR) Marostegui: [C:+2] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: Jcrespo)
[07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:04:00] (PS1) Brouberol: mediawiki-dumps-legacy: define the globalusage.dblist file in the dblists configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1166670 (https://phabricator.wikimedia.org/T398788)
[07:04:46] !log testing haproxy 2.8.15 in cp5017 and cp5025 - T398720
[07:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:48] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:06:00] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: define the globalusage.dblist file in the dblists configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1166670 (https://phabricator.wikimedia.org/T398788) (owner: Brouberol)
[07:10:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:11:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:13:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x1 T397612
[07:13:59] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:20:48] (PS13) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917)
[07:21:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1220 with weight 0 T397612', diff saved to https://phabricator.wikimedia.org/P78760 and previous config saved to /var/cache/conftool/dbconfig/20250707-072157-root.json
[07:22:01] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:22:28] (CR) Giuseppe Lavagetto: [C:+1] cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[07:23:06] (CR) Marostegui: [C:+2] mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1162851 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:24:41] (CR) Vgutierrez: [C:+2] cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[07:24:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:25:20] !log Starting x1 eqiad failover from db1237 to db1220 - T397612
[07:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:42] !log depooling cp7006 to test Ia82b9354a5b9e7bd5443b4af0888325919ddb19e - T397917
[07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:47] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917
[07:25:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 1.375 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:29:02] (CR) Gmodena: [C:+1] Revert^2 "Clean up EventBus and jobs config" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1165169 (owner: Ladsgroup)
[07:37:17] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:37:47] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:40:23] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:41:21] ^ we are on that
[07:42:47] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:45:23] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:17] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:37] (CR) Volans: New structure for sshd_config starting with trixie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[07:50:00] (CR) Marostegui: [C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:50:05] (PS2) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612)
[07:50:15] (CR) Marostegui: [V:+2 C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:50:21] !log marostegui@dns1006 START - running authdns-update
[07:50:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:51:26] !log marostegui@dns1006 END - running authdns-update
[07:52:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1220 to x1 primary and set section read-write T397612', diff saved to https://phabricator.wikimedia.org/P78762 and previous config saved to /var/cache/conftool/dbconfig/20250707-075254-root.json
[07:53:00] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:53:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1237 T397612', diff saved to https://phabricator.wikimedia.org/P78763 and previous config saved to /var/cache/conftool/dbconfig/20250707-075308-root.json
[07:53:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:53:23] !log repooling cp7006 with Ia82b9354a5b9e7bd5443b4af0888325919ddb19e applied - T397917
[07:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:25] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917
[07:58:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:58:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29781 bytes in 7.826 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:59:10] (PS1) Giuseppe Lavagetto: Code changes: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166749
[07:59:28] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Code changes: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166749 (owner: Giuseppe Lavagetto)
[08:00:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1237.eqiad.wmnet with reason: Maintenance
[08:00:56] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Feature: logging of deny actions; add rename functionality - oblivian@cumin1003"
[08:00:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: logging of deny actions; add rename functionality - oblivian@cumin1003
[08:01:30] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: logging of deny actions; add rename functionality - oblivian@cumin1003
[08:01:31] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: logging of deny actions; add rename functionality - oblivian@cumin1003"
[08:02:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:05:12] (PS1) Vgutierrez: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1166751
[08:09:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29855 bytes in 0.721 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:21] (CR) Ayounsi: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[08:14:01] (CR) Muehlenhoff: New structure for sshd_config starting with trixie (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[08:14:10] (PS9) Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[08:15:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: Maintenance
[08:17:31] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166751 (owner: Vgutierrez)
[08:17:48] (CR) Volans: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[08:19:17] ops-eqiad, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794 (Marostegui) NEW
[08:19:24] ops-eqiad, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10977707 (Marostegui) p:Triage→Medium