[00:00:14] RECOVERY - Host db1179 #page is UP: PING WARNING - Packet loss = 50%, RTA = 30.34 ms
[00:01:17] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:01:54] (CR) CI reject: [V:-1] foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:02:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 53.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:03:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054036 (owner: TrainBranchBot)
[00:04:18] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:43] (CR) Pppery: "`" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:12:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[00:12:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[00:12:56] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:13:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66433 and previous config saved to /var/cache/conftool/dbconfig/20240714-001301-marostegui.json
[00:13:05] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[00:14:20] PROBLEM - Host cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[00:15:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:15:20] RECOVERY - Host cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.31 ms
[00:16:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:16] (PS4) Seawolf35gerrit: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979)
[00:21:22] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:27:14] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29668 bytes in 1.995 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:29:21] (CR) Seawolf35gerrit: "Used the magical power of copy and paste, I doubt I would end up on the allow list for Jenkins anytime soon, but that would be nice." [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:39:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:39:41] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:39:57] !incidents
[00:39:58] 4856 (UNACKED) Host cr3-ulsfo - PING - Packet loss = 100%
[00:40:00] !ack 4856
[00:40:00] 4856 (ACKED) Host cr3-ulsfo - PING - Packet loss = 100%
[00:41:47] (CR) Pppery: "The system is happy now! (My comments about scheduling for deployment from earlier still apply)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:43:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[00:51:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[01:13:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66434 and previous config saved to /var/cache/conftool/dbconfig/20240714-011317-marostegui.json
[01:13:24] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:24:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:28:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P66435 and previous config saved to /var/cache/conftool/dbconfig/20240714-012824-marostegui.json
[01:32:46] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:37:38] (PS5) Seawolf35gerrit: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979)
[01:37:46] FIRING: [5x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:43:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P66436 and previous config saved to /var/cache/conftool/dbconfig/20240714-014331-marostegui.json
[01:52:46] FIRING: [5x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:57:46] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:58:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66437 and previous config saved to /var/cache/conftool/dbconfig/20240714-015838-marostegui.json
[01:58:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[01:58:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:58:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[01:59:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T367856)', diff saved to https://phabricator.wikimedia.org/P66438 and previous config saved to /var/cache/conftool/dbconfig/20240714-015901-marostegui.json
[02:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[02:45:36] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:47:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[02:59:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:38] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:33:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:45:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:00:24] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:02:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29668 bytes in 4.301 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:43:26] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:46:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29687 bytes in 6.487 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:55:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[05:00:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T367856)', diff saved to https://phabricator.wikimedia.org/P66439 and previous config saved to /var/cache/conftool/dbconfig/20240714-050027-marostegui.json
[05:00:40] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:00:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[05:10:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P66440 and previous config saved to /var/cache/conftool/dbconfig/20240714-051535-marostegui.json
[05:30:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P66441 and previous config saved to /var/cache/conftool/dbconfig/20240714-053042-marostegui.json
[05:45:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T367856)', diff saved to https://phabricator.wikimedia.org/P66442 and previous config saved to /var/cache/conftool/dbconfig/20240714-054549-marostegui.json
[05:45:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[05:45:54] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:46:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[05:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T367856)', diff saved to https://phabricator.wikimedia.org/P66443 and previous config saved to /var/cache/conftool/dbconfig/20240714-054611-marostegui.json
[06:00:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 185811656 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:01:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:02:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240714T0700)
[07:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:24:41] (CR) Dreamrimmer: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[07:33:56] (CR) Dreamrimmer: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[07:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:52:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:54:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29687 bytes in 7.230 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:22:40] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:40:40] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:48:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:48:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:49:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T367856)', diff saved to https://phabricator.wikimedia.org/P66444 and previous config saved to /var/cache/conftool/dbconfig/20240714-084903-marostegui.json
[08:49:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:49:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T367856)', diff saved to https://phabricator.wikimedia.org/P66445 and previous config saved to /var/cache/conftool/dbconfig/20240714-084956-marostegui.json
[08:52:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:05:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P66446 and previous config saved to /var/cache/conftool/dbconfig/20240714-090504-marostegui.json
[09:20:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P66447 and previous config saved to /var/cache/conftool/dbconfig/20240714-092011-marostegui.json
[09:25:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 57.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:19] (CR) Tacsipacsi: "All wikis have wmf.13 now, could this be abandoned?" [extensions/Flow] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051396 (https://phabricator.wikimedia.org/T357600) (owner: Jdlrobson)
[09:35:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T367856)', diff saved to https://phabricator.wikimedia.org/P66448 and previous config saved to /var/cache/conftool/dbconfig/20240714-093518-marostegui.json
[09:35:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[09:35:22] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:35:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[09:35:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T367856)', diff saved to https://phabricator.wikimedia.org/P66449 and previous config saved to /var/cache/conftool/dbconfig/20240714-093540-marostegui.json
[09:39:15] (Abandoned) Umherirrender: Make Flow work in dark mode by disabling backgrounds and setting text [extensions/Flow] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051396 (https://phabricator.wikimedia.org/T357600) (owner: Jdlrobson)
[09:47:10] (PS4) AOkoth: vrts: fix proxy for download [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078)
[10:07:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:15:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[10:20:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[10:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:15:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:20:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[13:01:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[13:06:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[13:15:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T367856)', diff saved to https://phabricator.wikimedia.org/P66450 and previous config saved to /var/cache/conftool/dbconfig/20240714-131502-marostegui.json
[13:15:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:30:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P66451 and previous config saved to /var/cache/conftool/dbconfig/20240714-133010-marostegui.json
[13:45:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P66452 and previous config saved to /var/cache/conftool/dbconfig/20240714-134517-marostegui.json
[14:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T367856)', diff saved to https://phabricator.wikimedia.org/P66453 and previous config saved to /var/cache/conftool/dbconfig/20240714-140024-marostegui.json
[14:00:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[14:00:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[14:00:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[14:00:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T367856)', diff saved to https://phabricator.wikimedia.org/P66454 and previous config saved to /var/cache/conftool/dbconfig/20240714-140046-marostegui.json
[14:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:45:42] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 348.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:59:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:42] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[16:14:39] (PS6) Seawolf35gerrit: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979)
[16:27:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66455 and previous config saved to /var/cache/conftool/dbconfig/20240714-162755-root.json
[16:28:20] (CR) Seawolf35gerrit: foundationwiki: Restrict `unfuzzy` right to autoconfirmed users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[16:30:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[16:30:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[16:43:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66456 and previous config saved to /var/cache/conftool/dbconfig/20240714-164300-root.json
[16:56:05] (PS1) Ladsgroup: db1179: Disable notification for db1179 [puppet] - https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855)
[16:58:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66457 and previous config saved to /var/cache/conftool/dbconfig/20240714-165805-root.json
[16:58:16] ops-eqiad, DBA, DC-Ops, Patch-For-Review: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9979761 (Ladsgroup) Also noting that this is a candidate master.
[17:06:23] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9979768 (Papaul) I did some more testing this weekend by downgrading my PXELINUX to 6.03 see below . I was still able to pxeboot. {F56420534} ` Jul...
[17:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66458 and previous config saved to /var/cache/conftool/dbconfig/20240714-171311-root.json
[17:28:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66459 and previous config saved to /var/cache/conftool/dbconfig/20240714-172816-root.json
[17:43:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66460 and previous config saved to /var/cache/conftool/dbconfig/20240714-174322-root.json
[17:58:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66461 and previous config saved to /var/cache/conftool/dbconfig/20240714-175827-root.json
[18:56:04] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:57:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 39.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:01:04] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:08:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:59:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:00:44] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:01:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:03:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29698 bytes in 0.538 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:06:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:46:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T367856)', diff saved to https://phabricator.wikimedia.org/P66462 and previous config saved to /var/cache/conftool/dbconfig/20240714-214603-marostegui.json
[21:46:17] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[22:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P66463 and previous config saved to /var/cache/conftool/dbconfig/20240714-220110-marostegui.json
[22:03:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P66464 and previous config saved to /var/cache/conftool/dbconfig/20240714-221617-marostegui.json
[22:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T367856)', diff saved to https://phabricator.wikimedia.org/P66465 and previous config saved to /var/cache/conftool/dbconfig/20240714-223124-marostegui.json
[22:31:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[22:31:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[22:31:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[22:31:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T367856)', diff saved to https://phabricator.wikimedia.org/P66466 and previous config saved to /var/cache/conftool/dbconfig/20240714-223146-marostegui.json
[22:54:48] SRE, collaboration-services, Infrastructure-Foundations, Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9979871 (jhathaway)
[22:55:01] SRE, collaboration-services, Infrastructure-Foundations, Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9979872 (jhathaway) Open→Resolved
[22:56:19] SRE, Infrastructure-Foundations, Mail, Patch-For-Review: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9979874 (jhathaway)
[22:56:27] SRE, Infrastructure-Foundations, Mail, Patch-For-Review: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9979875 (jhathaway) Open→Resolved
[22:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:05:46] SRE, Infrastructure-Foundations, Mail: Implement MTA-STS - https://phabricator.wikimedia.org/T203883#9979898 (jhathaway)
[23:06:08] SRE-Sprint-Week-Sustainability-March2023, Infrastructure-Foundations, Mail, observability, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171#9979900 (jhathaway)
[23:07:12] (PS1) BCornwall: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069
[23:07:36] (CR) CI reject: [V:-1] Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069 (owner: BCornwall)
[23:08:25] SRE, Infrastructure-Foundations, Mail: Split MXes into inbound and outbound - https://phabricator.wikimedia.org/T175362#9979903 (jhathaway) Open→Resolved a: jhathaway With the new Postfix architecture the rolls are now split.
[23:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:23:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:32:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[23:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054071
[23:38:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054071 (owner: TrainBranchBot)
[23:39:15] (PS2) BCornwall: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069
[23:41:31] (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3219/console" [puppet] - https://gerrit.wikimedia.org/r/1054069 (owner: BCornwall)
[23:44:47] (PS1) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[23:51:30] (PS3) BCornwall: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069
[23:51:30] (PS2) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[23:53:41] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3221/co" [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: BCornwall)