[00:01:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1041254 (owner: TrainBranchBot)
[00:03:45] RESOLVED: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:45] FIRING: [8x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:10:46] FIRING: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:11:56] (CR) Ssingh: [C:+1] delete pk.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041245 (https://phabricator.wikimedia.org/T367012) (owner: Dzahn)
[00:15:46] FIRING: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:18:45] RESOLVED: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:46] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:23:45] FIRING: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:45] RESOLVED: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:35:46] FIRING: [7x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:45] FIRING: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:40:42] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-eqiad
[00:43:45] RESOLVED: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:45:13] (PS21) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[00:45:38] (CR) CI reject: [V:-1] varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[01:05:39] (PS22) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[01:06:04] (CR) CI reject: [V:-1] varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[01:07:56] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.9 [core] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1041277 (https://phabricator.wikimedia.org/T361403)
[01:07:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.9 [core] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1041277 (https://phabricator.wikimedia.org/T361403) (owner: TrainBranchBot)
[01:20:36] (PS23) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[01:21:22] (CR) CI reject: [V:-1] varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[01:33:31] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.9 [core] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1041277 (https://phabricator.wikimedia.org/T361403) (owner: TrainBranchBot)
[01:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0200)
[02:00:30] (PS1) Bartosz Dziewoński: Fix Linker::makeExternalLink build failures [core] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1041297 (https://phabricator.wikimedia.org/T367127)
[02:27:40] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1489:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:44] (PS1) Jdlrobson: Avoid wrapping floated tables using computed styles [skins/Vector] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1041311 (https://phabricator.wikimedia.org/T366314)
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0300)
[03:01:36] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041313 (https://phabricator.wikimedia.org/T361403)
[03:01:38] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041313 (https://phabricator.wikimedia.org/T361403) (owner: TrainBranchBot)
[03:02:16] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041313 (https://phabricator.wikimedia.org/T361403) (owner: TrainBranchBot)
[03:02:46] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.9 refs T361403
[03:02:50] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403
[03:13:21] (PS1) Jdrewniak: Enable Vector appearance menu & larger font-size on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1041320 (https://phabricator.wikimedia.org/T362148)
[03:32:35] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:34:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T352010)', diff saved to https://phabricator.wikimedia.org/P64564 and previous config saved to /var/cache/conftool/dbconfig/20240611-033418-ladsgroup.json
[03:34:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:42:33] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:49:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P64565 and previous config saved to /var/cache/conftool/dbconfig/20240611-034925-ladsgroup.json
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0400)
[04:00:05] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.9 refs T361403 (duration: 57m 19s)
[04:00:20] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403
[04:01:06] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.6 (duration: 01m 05s)
[04:04:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P64566 and previous config saved to /var/cache/conftool/dbconfig/20240611-040432-ladsgroup.json
[04:14:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 223, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:19:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T352010)', diff saved to https://phabricator.wikimedia.org/P64567 and previous config saved to /var/cache/conftool/dbconfig/20240611-041938-ladsgroup.json
[04:19:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:28:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:33:02] (PS2) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1038796 (https://phabricator.wikimedia.org/T366687)
[04:33:09] (PS2) Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1038795 (https://phabricator.wikimedia.org/T366687)
[04:33:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T366687
[04:33:29] T366687: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T366687
[04:33:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1162 with weight 0 T366687', diff saved to https://phabricator.wikimedia.org/P64568 and previous config saved to /var/cache/conftool/dbconfig/20240611-043333-marostegui.json
[04:33:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T366687
[04:34:48] (CR) Marostegui: [C:+2] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1038795 (https://phabricator.wikimedia.org/T366687) (owner: Gerrit maintenance bot)
[04:42:30] (PS1) Marostegui: Revert "db2140: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1041352
[04:43:00] (CR) Marostegui: [C:+2] Revert "db2140: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1041352 (owner: Marostegui)
[04:43:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64569 and previous config saved to /var/cache/conftool/dbconfig/20240611-044339-root.json
[04:45:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[04:46:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[04:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T364069)', diff saved to https://phabricator.wikimedia.org/P64570 and previous config saved to /var/cache/conftool/dbconfig/20240611-044616-marostegui.json
[04:46:22] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:47:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 223, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:48:00] (PS1) Marostegui: db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1041357
[04:48:44] (CR) Marostegui: [C:+2] db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1041357 (owner: Marostegui)
[04:53:25] !log Starting s2 eqiad failover from db1222 to db1162 - T366687
[04:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:32] T366687: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T366687
[04:53:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T366687', diff saved to https://phabricator.wikimedia.org/P64571 and previous config saved to /var/cache/conftool/dbconfig/20240611-045341-root.json
[04:54:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T366687', diff saved to https://phabricator.wikimedia.org/P64572 and previous config saved to /var/cache/conftool/dbconfig/20240611-045359-root.json
[04:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222 T366687', diff saved to https://phabricator.wikimedia.org/P64573 and previous config saved to /var/cache/conftool/dbconfig/20240611-045447-root.json
[04:55:29] (CR) Marostegui: [C:+2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1038796 (https://phabricator.wikimedia.org/T366687) (owner: Gerrit maintenance bot)
[04:56:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1222.eqiad.wmnet with reason: Long schema change
[04:56:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Long schema change
[04:57:08] !log dbmaint eqiad s2 deploy schema change on db1222 T364299
[04:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64574 and previous config saved to /var/cache/conftool/dbconfig/20240611-045845-root.json
[05:02:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:03:17] (PS1) Gerrit maintenance bot: mariadb: Promote db1157 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1041363 (https://phabricator.wikimedia.org/T367140)
[05:03:21] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1041364 (https://phabricator.wikimedia.org/T367140)
[05:03:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T367140
[05:03:49] T367140: Switchover s3 master (db1223 -> db1157) - https://phabricator.wikimedia.org/T367140
[05:03:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1157 with weight 0 T367140', diff saved to https://phabricator.wikimedia.org/P64575 and previous config saved to /var/cache/conftool/dbconfig/20240611-050351-root.json
[05:04:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T367140
[05:04:27] (CR) Marostegui: [C:+2] mariadb: Promote db1157 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1041363 (https://phabricator.wikimedia.org/T367140) (owner: Gerrit maintenance bot)
[05:06:31] (PS1) Marostegui: db1223: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1041365
[05:06:54] (CR) Marostegui: [C:+2] db1223: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1041365 (owner: Marostegui)
[05:13:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64576 and previous config saved to /var/cache/conftool/dbconfig/20240611-051351-root.json
[05:19:20] !log Starting s3 eqiad failover from db1223 to db1157 - T367140
[05:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:27] T367140: Switchover s3 master (db1223 -> db1157) - https://phabricator.wikimedia.org/T367140
[05:19:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T367140', diff saved to https://phabricator.wikimedia.org/P64577 and previous config saved to /var/cache/conftool/dbconfig/20240611-051941-root.json
[05:20:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1157 to s3 primary and set section read-write T367140', diff saved to https://phabricator.wikimedia.org/P64578 and previous config saved to /var/cache/conftool/dbconfig/20240611-052000-root.json
[05:20:18] (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1041364 (https://phabricator.wikimedia.org/T367140) (owner: Gerrit maintenance bot)
[05:21:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223 T367140', diff saved to https://phabricator.wikimedia.org/P64579 and previous config saved to /var/cache/conftool/dbconfig/20240611-052101-root.json
[05:21:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Long schema change
[05:21:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Long schema change
[05:22:43] !log dbmaint eqiad s3 deploy schema change on db1223 T364299
[05:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:47] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:24:27] !log dbmaint eqiad s3 deploy schema change on db1223 T364069
[05:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:31] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64580 and previous config saved to /var/cache/conftool/dbconfig/20240611-052856-root.json
[05:33:36] (CR) Brouberol: [C:+1] "This looks great!" [deployment-charts] - https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: Scott French)
[05:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:40:37] (PS1) KartikMistry: Update MinT to 2024-06-11-052620-production [deployment-charts] - https://gerrit.wikimedia.org/r/1041372 (https://phabricator.wikimedia.org/T364122)
[05:44:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64581 and previous config saved to /var/cache/conftool/dbconfig/20240611-054401-root.json
[05:48:10] (PS4) Giuseppe Lavagetto: mediawiki: allow passing variables to php-fpm [deployment-charts] - https://gerrit.wikimedia.org/r/1039724
[05:49:16] (CR) TrainBranchBot: [C:+2] "Approved by arnaudb@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: Arnaudb)
[05:50:47] (Merged) jenkins-bot: dbconfig: temporary disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: Arnaudb)
[05:51:39] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1041107|dbconfig: temporary disable writes on es6 (T367055)]]
[05:51:44] T367055: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T367055
[05:53:19] (CR) Giuseppe Lavagetto: [C:+2] mediawiki: allow passing variables to php-fpm [deployment-charts] - https://gerrit.wikimedia.org/r/1039724 (owner: Giuseppe Lavagetto)
[05:55:03] (Merged) jenkins-bot: mediawiki: allow passing variables to php-fpm [deployment-charts] - https://gerrit.wikimedia.org/r/1039724 (owner: Giuseppe Lavagetto)
[05:56:08] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1041107|dbconfig: temporary disable writes on es6 (T367055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:56:30] PROBLEM - MariaDB Replica SQL: s2 #page on db1233 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:57:43] depooling it
[05:57:48] thanks
[05:58:04] <_joe_> arnaudb: can you also ack the alert?
[05:58:14] ack will do
[05:58:14] <_joe_> !incidents
[05:58:15] 4730 (UNACKED) db1233 (paged)/MariaDB Replica SQL: s2 (paged)
[05:58:15] 4729 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[05:58:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1233', diff saved to https://phabricator.wikimedia.org/P64582 and previous config saved to /var/cache/conftool/dbconfig/20240611-055816-arnaudb.json
[05:58:18] <_joe_> I can do it
[05:58:23] <_joe_> !ack 4730
[05:58:24] 4730 (ACKED) db1233 (paged)/MariaDB Replica SQL: s2 (paged)
[05:58:29] <_joe_> (done)
[05:58:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: maintenance
[05:58:45] !log arnaudb@deploy1002 arnaudb: Continuing with sync
[05:58:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: maintenance
[05:59:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64583 and previous config saved to /var/cache/conftool/dbconfig/20240611-055907-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0600)
[06:00:05] marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0600).
[06:02:07] db1233 is fixed now
[06:02:30] RECOVERY - MariaDB Replica SQL: s2 #page on db1233 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:54] <_joe_> we get this poolcounter alert every day at this time
[06:07:21] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1041107|dbconfig: temporary disable writes on es6 (T367055)]] (duration: 15m 42s)
[06:07:27] T367055: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T367055
[06:07:39] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[06:09:12] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64584 and previous config saved to /var/cache/conftool/dbconfig/20240611-060935-root.json
[06:11:36] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[06:12:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[06:14:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64585 and previous config saved to /var/cache/conftool/dbconfig/20240611-061413-root.json
[06:17:27] (CR) Giuseppe Lavagetto: [C:+2] mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[06:17:35] (CR) CI reject: [V:-1] mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[06:17:43] (PS4) Giuseppe Lavagetto: mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265)
[06:17:47] (CR) Giuseppe Lavagetto: [V:+2 C:+2] mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[06:18:25] I plan to update MinT service. OK to deploy?
[06:19:58] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[06:21:26] kart_: We are deploying MW during our maintenance window cc arnaudb
[06:22:11] yep, will keep you posted when done with it kart_
[06:23:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es6 T367055
[06:23:07] T367055: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T367055
[06:23:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T367055
[06:23:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es1037 with weight 0 T367055', diff saved to https://phabricator.wikimedia.org/P64586 and previous config saved to /var/cache/conftool/dbconfig/20240611-062353-arnaudb.json
[06:24:35] !log oblivian@deploy1002 Locking from deployment [ALL REPOSITORIES]: incident in progress, blocking deploys --joe
[06:24:36] arnaudb: sure. Thanks!
[06:24:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64587 and previous config saved to /var/cache/conftool/dbconfig/20240611-062441-root.json
[06:26:38] (PS1) Giuseppe Lavagetto: Revert "mw-debug: start using php.envvars, expose statsd-exporter" [deployment-charts] - https://gerrit.wikimedia.org/r/1041392
[06:27:40] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1489:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:23] <_joe_> marostegui: sorry if you need to deploy mw you'll need to wait ~ 10 minutes
[06:28:33] arnaudb: ^
[06:28:45] (CR) Arnaudb: [C:+2] mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1041092 (https://phabricator.wikimedia.org/T367055) (owner: Gerrit maintenance bot)
[06:28:49] <_joe_> I did an infra change that isn't working
[06:28:58] <_joe_> and that would make your scap fail
[06:29:03] <_joe_> that's why I took the lock
[06:30:13] !log Starting es6 eqiad failover from es1038 to es1037 - T367055
[06:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:20] T367055: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T367055
[06:30:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[06:31:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es1037 to es6 primary T367055', diff saved to https://phabricator.wikimedia.org/P64588 and previous config saved to /var/cache/conftool/dbconfig/20240611-063109-arnaudb.json
[06:32:13] ack marostegui _joe_
[06:34:17] (CR) Arnaudb: [C:+2] wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1041093 (https://phabricator.wikimedia.org/T367055) (owner: Gerrit maintenance bot)
[06:34:27] (PS2) Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1041093 (https://phabricator.wikimedia.org/T367055)
[06:34:30] (CR) Arnaudb: [V:+2 C:+2] wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1041093 (https://phabricator.wikimedia.org/T367055) (owner: Gerrit maintenance bot)
[06:37:21] (CR) Giuseppe Lavagetto: [C:+2] Revert "mw-debug: start using php.envvars, expose statsd-exporter" [deployment-charts] - https://gerrit.wikimedia.org/r/1041392 (owner: Giuseppe Lavagetto)
[06:37:40] <_joe_> ok, when this revert is merged I'll unlock the deployments
[06:37:53] <_joe_> sorry for the inconvenience, I have no idea what's causing this
[06:38:05] (Merged) jenkins-bot: Revert "mw-debug: start using php.envvars, expose statsd-exporter" [deployment-charts] - https://gerrit.wikimedia.org/r/1041392 (owner: Giuseppe Lavagetto)
[06:39:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'mimic weight', diff saved to https://phabricator.wikimedia.org/P64589 and previous config saved to /var/cache/conftool/dbconfig/20240611-063903-arnaudb.json
[06:39:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64590 and previous config saved to /var/cache/conftool/dbconfig/20240611-063947-root.json
[06:40:09] !log oblivian@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: incident in progress, blocking deploys --joe (duration: 15m 33s)
[06:40:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'mimic weight', diff saved to https://phabricator.wikimedia.org/P64591 and previous config saved to /var/cache/conftool/dbconfig/20240611-064041-arnaudb.json
[06:42:08] (PS1) Arnaudb: Revert "dbconfig: temporary disable writes on es6" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041401
[06:51:13] _joe_: is it safe to deploy ?
[06:51:26] <_joe_> arnaudb: yep, I did unlock scap
[06:51:52] <_joe_> arnaudb: I tend to run scap lock --all "" whenever I want people not to deploy
[06:52:09] <_joe_> so that even if communication doesn't reach you, the command line will :)
[06:52:35] ack, thanks!
[06:53:13] (CR) TrainBranchBot: [C:+2] "Approved by arnaudb@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041401 (owner: Arnaudb)
[06:53:51] (Merged) jenkins-bot: Revert "dbconfig: temporary disable writes on es6" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041401 (owner: Arnaudb)
[06:54:09] (CR) Muehlenhoff: [C:+2] Remove iegreview module [puppet] - https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415) (owner: Muehlenhoff)
[06:54:20] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1041401|Revert "dbconfig: temporary disable writes on es6"]]
[06:54:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64592 and previous config saved to /var/cache/conftool/dbconfig/20240611-065453-root.json
[06:55:33] (CR) Brouberol: [C:+2] datahub: add securityContext to all containers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[06:55:37] (CR) Muehlenhoff: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[06:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:56:54] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1041227 (https://phabricator.wikimedia.org/T365574) (owner: Dzahn)
[06:56:56] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1041401|Revert "dbconfig: temporary disable writes on es6"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:57:17] !log arnaudb@deploy1002 arnaudb: Continuing with sync
[06:58:56] (CR) Muehlenhoff: admin: add radimer to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: Herron)
[07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:36] !log failover ganeti master in codfw to ganeti2020
[07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:34] (CR) Kosta Harlan: "Some follow-up notes in T367113" [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan)
[07:05:12] PROBLEM - ganeti-wconfd running on ganeti2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:05:57] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1041401|Revert "dbconfig: temporary disable writes on es6"]] (duration: 11m 36s)
[07:06:12] kart_: I'm done!
[07:07:27] arnaudb: cool.
[07:09:21] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-06-11-052620-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041372 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry) [07:09:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64593 and previous config saved to /var/cache/conftool/dbconfig/20240611-070958-root.json [07:11:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [07:12:07] (03Merged) 10jenkins-bot: Update MinT to 2024-06-11-052620-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041372 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry) [07:12:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64594 and previous config saved to /var/cache/conftool/dbconfig/20240611-071253-root.json [07:13:11] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:16:15] (03PS1) 10Marostegui: Revert "db1222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1041407 [07:17:23] (03CR) 10Marostegui: [C:03+2] Revert "db1222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1041407 (owner: 10Marostegui) [07:17:48] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:18:55] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [07:25:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64595 and previous config saved to /var/cache/conftool/dbconfig/20240611-072504-root.json [07:25:39] (03CR) 10Filippo Giunchedi: [C:03+2] titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 
(https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [07:25:45] (03PS3) 10Filippo Giunchedi: titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) [07:25:50] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [07:26:58] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:27:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64596 and previous config saved to /var/cache/conftool/dbconfig/20240611-072758-root.json [07:28:24] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:36:44] !log filippo@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw [07:37:14] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:38:07] (03PS1) 10Muehlenhoff: Run vrts_aliases in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/1041414 (https://phabricator.wikimedia.org/T284145) [07:38:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [07:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64597 and previous config saved to /var/cache/conftool/dbconfig/20240611-074009-root.json [07:40:29] !log Updated MinT to 2024-06-11-052620-production (T364122, T346226, T357548) [07:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:35] T364122: In zgh.wikipedia Content Translation use machine translation with MinT Translation with tzm code - https://phabricator.wikimedia.org/T364122 [07:40:35] T346226: Package 
conflicts for tox causes CI failure - https://phabricator.wikimedia.org/T346226 [07:40:35] T357548: Blubber Python builder: Always use a virtualenv - https://phabricator.wikimedia.org/T357548 [07:43:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64598 and previous config saved to /var/cache/conftool/dbconfig/20240611-074304-root.json [07:43:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:43:55] (03PS6) 10Brouberol: global_config: expose services for all mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) [07:44:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [07:45:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [07:45:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [07:47:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [07:48:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [07:49:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [07:52:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:54:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [07:54:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [07:55:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [07:58:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64599 and previous config saved to /var/cache/conftool/dbconfig/20240611-075809-root.json [07:58:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [08:02:04] (03CR) 10Volans: "There are few comments still open from previous reviews" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:02:10] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:02:12] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:02:19] (03PS1) 10Filippo Giunchedi: thanos: send sigkill to compact on stop [puppet] - 10https://gerrit.wikimedia.org/r/1041522 [08:02:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:03:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [08:05:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [08:07:37] PROBLEM - OpenSearch health check for shards on 9200 on logstash2023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa536ca8280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.w [08:07:37] org/wiki/Search%23Administration [08:09:21] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:10:41] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9878379 (10Fuzzy) What would you suggest to reduce the template size? The external ` I iterated on your code with dcl in wmcs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [08:20:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on logstash2023.codfw.wmnet with reason: reboot/ganeti [08:20:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash2023.codfw.wmnet with reason: reboot/ganeti [08:21:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [08:23:43] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp2027.codfw.wmnet [08:23:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64601 and previous config saved to /var/cache/conftool/dbconfig/20240611-082342-root.json [08:24:41] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp2027.codfw.wmnet [08:27:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9878402 (10Ifrahkhanyaree_WMDE) [08:27:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:27:56] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041526 [08:27:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9878403 (10Ifrahkhanyaree_WMDE) Hi @herron done! I hopefully did it the right way. Let me know if there's anything else, thank you! 
[08:27:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [08:28:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [08:28:58] !log restarting stashbot that disconnected [08:30:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:30:49] hello stashbot [08:30:55] !log Install 10.11 on db1153 (non used x2 replica) [08:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-codfw [08:31:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [08:31:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:31:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1153.eqiad.wmnet with reason: Long schema change [08:31:26] (03CR) 10Btullis: "One nit about making sure that it specifically references the analytics_meta in the commit message, but then feel free to proceed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:31:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1153.eqiad.wmnet with reason: Long schema change [08:31:52] !log Install 10.11 on db1153 (non used x2 replica) T365805 [08:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:55] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [08:32:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:32:35] (03PS7) 10Brouberol: global_config: expose services for all analytics mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) [08:32:37] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2027.codfw.wmnet [08:32:44] (03CR) 10Brouberol: global_config: expose services for all analytics mariadb hosts and masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:33:02] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp2027.ulsfo.wmnet [08:34:11] (03PS1) 10Marostegui: db1153: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1041528 (https://phabricator.wikimedia.org/T365805) [08:34:37] (03CR) 10Marostegui: [C:03+2] db1153: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1041528 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [08:36:08] (03CR) 10Brouberol: [C:03+2] global_config: expose services for all analytics mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:37:31] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [08:37:33] (03PS1) 10Marostegui: db1153: Add warning note [puppet] - 10https://gerrit.wikimedia.org/r/1041529 (https://phabricator.wikimedia.org/T365805) [08:38:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:38:04] <_joe_> jouncebot: nowandnext [08:38:04] No deployments scheduled for the next 1 hour(s) and 21 minute(s) [08:38:04] In 1 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1000) [08:38:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:38:15] <_joe_> ok I guess I can anticipate that a bit [08:38:29] (03CR) 10Marostegui: [C:03+2] db1153: Add warning note [puppet] - 10https://gerrit.wikimedia.org/r/1041529 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [08:38:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52066 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:38:53] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 --start '["55019880"]' 2>&1 | tee -a ~/T315510-enwiki-8; date [08:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:20] (03PS1) 10Giuseppe Lavagetto: mw-debug: remove statsd exporter sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041533 [08:41:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [08:42:14] (03PS2) 10Brouberol: datahub-next: replace IPs by Services in 
network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040874 (https://phabricator.wikimedia.org/T359423) [08:42:14] (03PS2) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [08:43:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [08:43:42] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: remove statsd exporter sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041533 (owner: 10Giuseppe Lavagetto) [08:44:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [08:44:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1041534 (https://phabricator.wikimedia.org/T367145) [08:44:37] (03Merged) 10jenkins-bot: mw-debug: remove statsd exporter sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041533 (owner: 10Giuseppe Lavagetto) [08:45:05] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1041535 (https://phabricator.wikimedia.org/T367146) [08:45:10] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1041536 (https://phabricator.wikimedia.org/T367146) [08:45:28] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:45:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:45:49] 
!log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:46:17] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:46:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [08:46:36] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:46:59] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:47:01] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:50:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:51:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [08:53:42] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:53:45] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:57:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [08:57:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [08:58:13] (03PS1) 10Giuseppe Lavagetto: mw-debug: Add env variables in codfw for mcrouter, statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041538 [09:01:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [09:03:36] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1040874 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:04:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [09:04:58] (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:07:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [09:07:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [09:08:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [09:13:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [09:16:16] !log filippo@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-eqiad [09:19:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [09:20:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [09:20:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [09:22:46] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9878558 (10dcaro) >>! In T348643#9868739, @wiki_willy wrote: > Ok, got it. Thanks for the info @dcaro. And just to... 
[09:22:55] PROBLEM - Host aux-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:23:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx2001.wikimedia.org [09:23:24] (03CR) 10JMeybohm: [C:03+1] mw-debug: Add env variables in codfw for mcrouter, statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041538 (owner: 10Giuseppe Lavagetto) [09:24:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [09:24:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [09:25:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T360332)', diff saved to https://phabricator.wikimedia.org/P64602 and previous config saved to /var/cache/conftool/dbconfig/20240611-092504-arnaudb.json [09:25:08] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:25:25] RECOVERY - Host aux-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [09:26:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [09:26:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [09:27:32] (03CR) 10JMeybohm: [C:03+1] k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [09:27:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx2001.wikimedia.org [09:28:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T360332)', diff saved to https://phabricator.wikimedia.org/P64603 and previous config saved to /var/cache/conftool/dbconfig/20240611-092839-arnaudb.json [09:30:23] (03PS1) 10Santiago Faci: Metrics Platform Instrument 
Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041547 (https://phabricator.wikimedia.org/T366918) [09:30:36] (03CR) 10JMeybohm: [C:03+1] kubernetes: alert on persistent unavailable replicas (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [09:31:31] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041547 (https://phabricator.wikimedia.org/T366918) (owner: 10Santiago Faci) [09:32:14] (03CR) 10JMeybohm: [C:03+1] kubernetes: alert on persistent unavailable replicas (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [09:32:18] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041547 (https://phabricator.wikimedia.org/T366918) (owner: 10Santiago Faci) [09:34:17] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [09:34:32] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [09:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:35:25] !log rebalance ganeti clusters in codfw following reboots [09:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:36] (03CR) 10JMeybohm: deployment_server: alert on admin-ng pending changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:36:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining 
ganeti node ganeti1028.eqiad.wmnet [09:37:08] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp2027.codfw.wmnet [09:38:08] (03CR) 10JMeybohm: [C:03+2] toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:38:12] (03PS5) 10Ssingh: geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) [09:39:04] (03Merged) 10jenkins-bot: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:39:57] (03CR) 10Ssingh: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [09:40:07] (03PS2) 10JMeybohm: calculator-service: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) [09:40:08] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2877/co" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney) [09:41:08] (03CR) 10Jelto: [V:03+1] "looks mostly good, one question in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney) [09:41:34] (03CR) 10JMeybohm: "Indeed. Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:41:51] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [09:42:17] !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki [09:42:40] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [09:42:43] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [09:42:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors [09:43:26] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [09:43:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64604 and previous config saved to /var/cache/conftool/dbconfig/20240611-094347-arnaudb.json [09:44:16] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [09:44:47] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [09:45:16] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [09:46:28] (03CR) 10JMeybohm: [C:03+2] linkrecommendation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:47:16] (03Merged) 10jenkins-bot: linkrecommendation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:49:10] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [09:49:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. 
on all recursors [09:49:17] (03CR) 10JMeybohm: [C:03+2] developer-portal: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:49:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [09:50:14] (03Merged) 10jenkins-bot: developer-portal: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:50:43] <_joe_> jouncebot: nowandnext [09:50:43] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [09:50:43] In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1000) [09:50:46] FIRING: [22x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:27] (03PS4) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. 
[software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [09:55:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [09:55:54] (CR) Brouberol: [C:+2] datahub-next: replace IPs by Services in network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1040874 (https://phabricator.wikimedia.org/T359423) (owner: Brouberol) [09:55:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [09:56:01] (PS76) Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [09:56:06] (PS5) Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [09:56:08] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7 [09:56:15] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2 [09:56:19] (CR) Arnaudb: mariadb: add some logic to allow instance conversion (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [09:56:33] (CR) Arnaudb: mariadb: add some logic to allow instance conversion (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [09:56:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [09:56:57] (PS6) Slyngshede: Fix bug where SSH keys are imported incorrectly.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [09:57:10] I'm going to apply some external-services/admin-ng changes to all k8s clusters [09:57:14] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:57:25] RESOLVED: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1489:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:40] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:58:10] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:58:12] (03CR) 10Ssingh: [V:03+2 C:03+2] geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [09:58:53] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:58:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64605 and previous config saved to /var/cache/conftool/dbconfig/20240611-095853-arnaudb.json [09:59:02] !log [start] running authdns-update to send Bolivia (BO) and Paraguay (PY) to magru [09:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:13] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:59:40] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1000) [10:00:14] !log [end] running authdns-update to send Bolivia (BO) and Paraguay (PY) to magru: T346722 [10:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:23] T346722: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722 [10:00:23] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:01:05] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:01:20] (CR) Giuseppe Lavagetto: [C:+2] mw-debug: Add env variables in codfw for mcrouter, statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1041538 (owner: Giuseppe Lavagetto) [10:01:44] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:01:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.pki.restart-reboot (exit_code=0) rolling reboot on A:pki [10:01:51] (PS7) Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [10:02:08] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:02:14] (Merged) jenkins-bot: mw-debug: Add env variables in codfw for mcrouter, statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1041538 (owner: Giuseppe Lavagetto) [10:02:28] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:02:34] (CR) CI reject: [V:-1] mariadb: add some logic to allow instance conversion [software/spicerack] - https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:02:54] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:03:25] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:03:45] PROBLEM - MariaDB Replica SQL: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:03:45] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:03:45] PROBLEM - MariaDB Replica SQL: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:03:47] PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:04:09] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1014.eqiad.wmnet [10:04:10] (03CR) 10Majavah: [C:03+1] delete langcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn) [10:04:10] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:04:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx1001.wikimedia.org [10:05:18] (03PS77) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [10:06:09] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:06:37] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:06:55] !log brouberol@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[10:06:57] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:07:18] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:07:40] !log brouberol@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:08:27] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [10:08:41] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [10:08:41] (03PS13) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) [10:08:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx1001.wikimedia.org [10:09:19] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [10:09:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [10:10:45] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [10:10:53] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [10:11:12] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [10:11:23] PROBLEM - Host logstash1023 is DOWN: PING CRITICAL - Packet loss = 100% [10:11:45] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:45] RECOVERY - MariaDB Replica SQL: s2 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:21] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:12:45] RECOVERY - MariaDB Replica SQL: s7 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:47] RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:45] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T360332)', diff saved to https://phabricator.wikimedia.org/P64606 and previous config saved to /var/cache/conftool/dbconfig/20240611-101400-arnaudb.json [10:14:05] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:14:28] (03PS1) 10Santiago Faci: page-analytics: Documentation improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041568 (https://phabricator.wikimedia.org/T363013) [10:14:32] !log filippo@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-eqiad [10:14:54] (03PS1) 10Santiago Faci: edit-analytics: Documentation improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) [10:15:23] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [10:15:25] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1014.eqiad.wmnet [10:15:34] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [10:15:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [10:15:50] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [10:15:51] RECOVERY - Host logstash1023 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [10:16:05] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [10:16:08] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2 [10:16:16] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7 [10:16:26] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [10:16:37] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [10:16:46] (03PS1) 10Santiago Faci: media-analytics: Documentation improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041573 (https://phabricator.wikimedia.org/T363012) [10:16:59] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [10:17:21] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 9 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:18:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [10:18:39] (03CR) 10JMeybohm: [C:03+2] machinetranslation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041055 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:18:40] (03CR) 10JMeybohm: [C:03+2] python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 
(https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:18:45] RESOLVED: [3x] ProbeDown: Service ganeti1029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:34] (03Merged) 10jenkins-bot: machinetranslation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041055 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:19:39] (03Merged) 10jenkins-bot: python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [10:20:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [10:21:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [10:21:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [10:21:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64607 and previous config saved to /var/cache/conftool/dbconfig/20240611-102125-ladsgroup.json [10:21:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:21:34] (03PS1) 10Santiago Faci: editor-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041577 (https://phabricator.wikimedia.org/T363015) [10:22:28] (03PS2) 10Santiago Faci: media-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041573 (https://phabricator.wikimedia.org/T363012) [10:23:14] (03PS2) 10Santiago Faci: edit-analytics: Documentation 
improvements and blubber updates [deployment-charts] - https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) [10:24:02] (PS2) Santiago Faci: page-analytics: Documentation improvements and blubber updates [deployment-charts] - https://gerrit.wikimedia.org/r/1041568 (https://phabricator.wikimedia.org/T363013) [10:24:15] (PS2) Santiago Faci: editor-analytics: Documentation improvements and blubber updates [deployment-charts] - https://gerrit.wikimedia.org/r/1041577 (https://phabricator.wikimedia.org/T363015) [10:24:25] (PS3) Santiago Faci: edit-analytics: Documentation improvements and blubber updates [deployment-charts] - https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) [10:24:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [10:24:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [10:24:39] (PS3) Santiago Faci: media-analytics: Documentation improvements and blubber updates [deployment-charts] - https://gerrit.wikimedia.org/r/1041573 (https://phabricator.wikimedia.org/T363012) [10:24:44] (PS1) Ssingh: Revert "geo-maps: define initial mapping for South America (magru)" [dns] - https://gerrit.wikimedia.org/r/1041580 [10:24:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64608 and previous config saved to /var/cache/conftool/dbconfig/20240611-102444-ladsgroup.json [10:24:51] ^ emergency depool patch if required [10:26:43] ack [10:27:28] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [10:27:54] (CR) Ssingh: [C:-2] Revert "geo-maps: define initial mapping for South America (magru)" [dns] - https://gerrit.wikimedia.org/r/1041580 (owner: Ssingh) [10:28:14] !log
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367145 [10:28:18] T367145: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T367145 [10:28:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2213 with weight 0 T367145', diff saved to https://phabricator.wikimedia.org/P64609 and previous config saved to /var/cache/conftool/dbconfig/20240611-102820-root.json [10:28:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367145 [10:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2213 from API/vslow/dump T367145', diff saved to https://phabricator.wikimedia.org/P64610 and previous config saved to /var/cache/conftool/dbconfig/20240611-102900-root.json [10:29:38] (CR) Marostegui: [C:+2] mariadb: Promote db2213 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1041534 (https://phabricator.wikimedia.org/T367145) (owner: Gerrit maintenance bot) [10:30:16] (PS1) Btullis: Enable the GPU on stat1008 [puppet] - https://gerrit.wikimedia.org/r/1041585 (https://phabricator.wikimedia.org/T367154) [10:31:43] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2878/co" [puppet] - https://gerrit.wikimedia.org/r/1041585 (https://phabricator.wikimedia.org/T367154) (owner: Btullis) [10:31:57] (CR) EoghanGaffney: lists: Add option to switch mailman root (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1040174 (owner: EoghanGaffney) [10:32:17] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:34:14] (CR) Clément Goubert: [C:+2] mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323)
(owner: 10Clément Goubert) [10:34:15] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [10:35:03] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:36:49] (03PS1) 10Clément Goubert: trafficserver: move 90% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1041589 (https://phabricator.wikimedia.org/T362323) [10:37:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [10:37:22] (03CR) 10Hnowlan: [C:03+1] trafficserver: move 90% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1041589 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:37:37] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:37:39] (03PS2) 10Hnowlan: mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323) [10:37:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:38:06] (03CR) 10Sg912: [C:03+1] editor-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041577 (https://phabricator.wikimedia.org/T363015) (owner: 10Santiago Faci) [10:38:20] (03CR) 10Clément Goubert: [C:03+1] mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [10:38:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:38:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:38:46] (03CR) 10Sg912: [C:03+1] edit-analytics: Documentation improvements and blubber updates [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) (owner: 10Santiago Faci) [10:38:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:39:06] (03PS78) 10Arnaudb: mariadb: rework mariadb_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [10:39:09] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:39:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:39:34] (03CR) 10Sg912: [C:03+1] page-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041568 (https://phabricator.wikimedia.org/T363013) (owner: 10Santiago Faci) [10:40:05] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4 [10:40:10] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s6 [10:40:56] (03CR) 10Sg912: [C:03+1] media-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041573 (https://phabricator.wikimedia.org/T363012) (owner: 10Santiago Faci) [10:41:37] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:41:57] (03CR) 10MVernon: "> One option you might want to consider is converting Puppet data structures to yaml directly, rather then templating out yaml files." 
[puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:42:06] !log Starting s5 codfw failover from db2123 to db2213 - T367145 [10:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:10] T367145: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T367145 [10:42:12] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [10:42:28] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [10:42:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T367145', diff saved to https://phabricator.wikimedia.org/P64611 and previous config saved to /var/cache/conftool/dbconfig/20240611-104232-root.json [10:42:43] sukhe: just checking, is it ok if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041589 or will it conflict with your work? [10:43:09] claime: thanks for checking. 
should be OK, we are getting very little traffic anyway right now [10:43:20] ok cool thanks [10:43:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [10:43:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [10:43:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2123 T367145', diff saved to https://phabricator.wikimedia.org/P64612 and previous config saved to /var/cache/conftool/dbconfig/20240611-104336-root.json [10:45:05] (03CR) 10Clément Goubert: [C:03+2] trafficserver: move 90% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1041589 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:45:31] !log move 90% of traffic to mw-on-k8s - T362323 [10:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:35] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [10:45:38] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1015.eqiad.wmnet [10:45:48] (03CR) 10Volans: [C:03+1] "Great! LGTM, as agreed we're ok to ship this for now without the tests that will be added soon to not block the current efforts and needs." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:47:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: Long schema change [10:47:40] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm now after the discussion." 
[puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney) [10:47:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: Long schema change [10:48:03] !log dbmaint codfw s5 deploy schema change on db2123 T364299 [10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:48:25] !log dbmaint codfw s5 deploy schema change on db2123 T364069 [10:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:48:45] RESOLVED: [22x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364069)', diff saved to https://phabricator.wikimedia.org/P64613 and previous config saved to /var/cache/conftool/dbconfig/20240611-104908-marostegui.json [10:49:36] (03CR) 10Arnaudb: [C:03+2] mariadb: rework mariadb_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:50:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [10:51:26] (03PS1) 10Giuseppe Lavagetto: mw-debug: enable envvars in eqiad too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041592 [10:51:26] (03PS1) 10Giuseppe Lavagetto: mw-debug: protect debug endpoints with a password [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041593 [10:52:21] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the GPU on stat1008 [puppet] - 
https://gerrit.wikimedia.org/r/1041585 (https://phabricator.wikimedia.org/T367154) (owner: Btullis) [10:53:38] (CR) Ladsgroup: mw-debug: protect debug endpoints with a password (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041593 (owner: Giuseppe Lavagetto) [10:53:48] SRE, MW-on-K8s, Observability-Logging, serviceops: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9879026 (fgiunchedi) [10:54:38] SRE, Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9879027 (Chqaz) @Dzahn Can you reset to the original administrator? Thank you. [10:57:15] (Merged) jenkins-bot: mariadb: rework mariadb_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:57:21] !log klausman@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:57:27] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:00:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [11:02:50] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net.
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1041553 (owner: 10L10n-bot) [11:03:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:04:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P64614 and previous config saved to /var/cache/conftool/dbconfig/20240611-110414-marostegui.json [11:04:47] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts stat1004.eqiad.wmnet [11:05:27] FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic2066:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:05:47] !log Starting kafka-main reboots in codfw [11:05:48] !log klausman@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:00] !log cgoubert@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-codfw [11:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [11:07:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [11:07:18] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1015.eqiad.wmnet [11:07:34] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: enable envvars in eqiad too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041592 (owner: 10Giuseppe Lavagetto) [11:08:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:08:17] (03Merged) 10jenkins-bot: mw-debug: enable envvars in eqiad too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041592 (owner: 10Giuseppe Lavagetto) [11:09:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [11:09:35] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s6 [11:09:39] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4 [11:10:27] FIRING: [16x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:11:09] (03PS3) 10Hnowlan: wmnet: remove similar-users [dns] - 
10https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274) [11:12:43] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [11:13:25] !log removing similar-users service - T345274 [11:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:33] T345274: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 [11:13:48] ^ Amir1, godog [11:13:58] just fyi [11:14:03] thanks for the headsup [11:14:15] (03CR) 10JMeybohm: [C:03+2] wmnet: remove similar-users [dns] - 10https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [11:15:26] (03PS1) 10Marostegui: Revert "db1223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1041600 [11:15:27] RESOLVED: [16x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:15:40] !log klausman@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:15:55] (03CR) 10Marostegui: [C:03+2] Revert "db1223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1041600 (owner: 10Marostegui) [11:16:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64615 and previous config saved to /var/cache/conftool/dbconfig/20240611-111616-root.json [11:16:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [11:18:51] (03PS2) 10Zabe: Add u4cwiki to private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1041242 (https://phabricator.wikimedia.org/T366649) [11:18:53] (03CR) 10Ladsgroup: [C:03+2] Add u4cwiki to private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1041242 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [11:18:55] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Add u4cwiki to private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1041242 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [11:19:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P64616 and previous config saved to /var/cache/conftool/dbconfig/20240611-111922-marostegui.json [11:20:21] (03CR) 10JMeybohm: [C:03+2] service: set similar-users to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014499 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [11:21:04] !log klausman@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[11:21:07] Amir1: feel free to merge with yours [11:21:27] sure [11:21:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64617 and previous config saved to /var/cache/conftool/dbconfig/20240611-112149-ladsgroup.json [11:21:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:22:54] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:23:14] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:23:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [11:23:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [11:24:29] !log klausman@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:26:14] !log klausman@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:27:28] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [11:29:35] !log failover ganeti master in ulsfo to ganeti4008 [11:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] (03PS2) 10JMeybohm: service: remove similar-users from realserver, set service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014500 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [11:30:03] (03PS1) 10JMeybohm: service: Remove similar-users from conftool-data and service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1041612 (https://phabricator.wikimedia.org/T345274) [11:30:46] (03CR) 10Hnowlan: [C:03+1] service: Remove similar-users from conftool-data and service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1041612 (https://phabricator.wikimedia.org/T345274) (owner: 10JMeybohm) [11:30:59] (03PS1) 10JMeybohm: deployment_server: Remove similar-users deploy users [puppet] - 10https://gerrit.wikimedia.org/r/1041613 (https://phabricator.wikimedia.org/T345274) [11:31:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64618 and previous config saved to /var/cache/conftool/dbconfig/20240611-113121-root.json [11:31:57] PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:32:38] (03PS2) 10Jforrester: Undeploy the 'similar-users' service, unused for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) [11:32:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [11:32:45] FIRING: CirrusConsumerFetchErrorRate: 
cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:34:19] (03CR) 10JMeybohm: [C:03+2] service: remove similar-users from realserver, set service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014500 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [11:34:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364069)', diff saved to https://phabricator.wikimedia.org/P64619 and previous config saved to /var/cache/conftool/dbconfig/20240611-113430-marostegui.json [11:34:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:34:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:34:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:34:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T364069)', diff saved to https://phabricator.wikimedia.org/P64620 and previous config saved to /var/cache/conftool/dbconfig/20240611-113452-marostegui.json [11:36:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64621 and previous config saved to /var/cache/conftool/dbconfig/20240611-113656-ladsgroup.json [11:37:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:38:30] (03Merged) 10jenkins-bot: media-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041573 (https://phabricator.wikimedia.org/T363012) (owner: 10Santiago Faci) [11:38:46] !log restarted pybal on lvs2014.codfw.wmnet,lvs1020.eqiad.wmnet - T345274 [11:38:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [11:39:03] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:39:10] expected [11:39:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [11:39:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [11:39:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts stat1004.eqiad.wmnet [11:40:04] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [11:41:02] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [11:41:06] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts stat1005.eqiad.wmnet [11:41:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [11:42:01] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [11:43:35] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:43:47] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [11:44:11] !log restarted pybal on lvs2013.codfw.wmnet,lvs1019.eqiad.wmnet - T345274 [11:45:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [11:45:36] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [11:46:09] !log ipvsadm --delete-service --tcp-service 10.2.[12].57:4110 - T345274 [11:46:13] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [11:46:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64622 and previous config saved to /var/cache/conftool/dbconfig/20240611-114627-root.json [11:46:40] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [11:46:59] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:47:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223', diff saved to https://phabricator.wikimedia.org/P64623 and previous config saved to /var/cache/conftool/dbconfig/20240611-114746-root.json [11:47:59] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [11:48:35] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:48:46] (03CR) 10Santiago Faci: [C:03+2] page-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041568 (https://phabricator.wikimedia.org/T363013) (owner: 10Santiago Faci) [11:49:01] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: 
no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:49:39] (03Merged) 10jenkins-bot: page-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041568 (https://phabricator.wikimedia.org/T363013) (owner: 10Santiago Faci) [11:50:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [11:50:46] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [11:51:32] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [11:51:46] !log removed similar-users deployments from all k8s clusters - T345274 [11:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:50] T345274: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 [11:52:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64624 and previous config saved to /var/cache/conftool/dbconfig/20240611-115203-ladsgroup.json [11:52:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:53:41] PROBLEM - MariaDB Replica SQL: s3 on db1240 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cywiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:54:07] (03CR) 10JMeybohm: [C:03+2] Undeploy the 'similar-users' service, unused for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [11:54:14] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [11:54:41] will depool it [11:55:36] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [11:55:51] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [11:55:58] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [11:57:04] (03CR) 10Stevemunene: [C:03+2] Delete datahub kubeconfigs on main [puppet] - 10https://gerrit.wikimedia.org/r/1039618 (https://phabricator.wikimedia.org/T366338) (owner: 10Stevemunene) [11:57:06] (03Merged) 10jenkins-bot: Undeploy the 'similar-users' service, unused for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [11:57:25] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [11:57:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: repl issues [11:57:33] (03PS1) 10Majavah: O:openstack: Install OpenTofu in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1041622 (https://phabricator.wikimedia.org/T365696) [11:57:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: repl issues [11:59:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2879/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041622 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [11:59:09] (03CR) 10Santiago Faci: [C:03+2] edit-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) (owner: 10Santiago Faci) [12:00:01] (03Merged) 10jenkins-bot: edit-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041569 (https://phabricator.wikimedia.org/T363014) (owner: 10Santiago Faci) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1200) [12:00:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [12:00:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [12:01:23] (03CR) 10Majavah: [V:03+1 C:03+2] O:openstack: Install OpenTofu in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1041622 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [12:02:17] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [12:03:03] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [12:04:11] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [12:04:13] !log rebalance ganeti cluster in ulsfo following reboots [12:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - 
btullis@cumin1002" [12:04:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:55] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts stat1005.eqiad.wmnet [12:05:38] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [12:05:45] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [12:05:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:06:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-codfw [12:06:38] !log Finished kafka-main reboots in codfw [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:11] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [12:07:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64625 and previous config saved to /var/cache/conftool/dbconfig/20240611-120710-ladsgroup.json [12:07:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:07:58] (03CR) 10Santiago Faci: [C:03+2] editor-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041577 (https://phabricator.wikimedia.org/T363015) (owner: 10Santiago Faci) [12:08:55] (03Merged) 10jenkins-bot: editor-analytics: Documentation improvements and blubber updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041577 (https://phabricator.wikimedia.org/T363015) (owner: 10Santiago Faci) [12:09:22] !log sfaci@deploy1002 helmfile [staging] START 
helmfile.d/services/editor-analytics: apply [12:26:51] !log cancelled previous command (text@eqiad is going to be depooled at the same time) [12:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:49] (03CR) 10Elukey: "Hi! We need to upgrade the base Bookworm image for eventgate, I just needed to trigger a rebuild and I sent patches for the old gerrit rep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [12:29:19] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic1@eqiad for text services [puppet] - 10https://gerrit.wikimedia.org/r/1041625 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:30:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [12:30:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [12:30:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T352010)', diff saved to https://phabricator.wikimedia.org/P64626 and previous config saved to /var/cache/conftool/dbconfig/20240611-123046-ladsgroup.json [12:30:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:31:09] (03PS5) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) [12:31:30] (03CR) 10CI reject: [V:04-1] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:32:22] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2883/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:32:36] (03PS6) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) [12:32:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:32:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:33:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [12:34:10] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on text@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1041626 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:35:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T352010)', diff saved to https://phabricator.wikimedia.org/P64627 and previous config saved to /var/cache/conftool/dbconfig/20240611-123521-ladsgroup.json [12:40:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [12:42:40] (03PS1) 10Vgutierrez: hiera: Fix 
role/eqiad/cache/text.yaml path [puppet] - 10https://gerrit.wikimedia.org/r/1041628 (https://phabricator.wikimedia.org/T366466) [12:44:20] (03CR) 10Ssingh: [C:03+1] hiera: Fix role/eqiad/cache/text.yaml path [puppet] - 10https://gerrit.wikimedia.org/r/1041628 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:44:31] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2884/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041628 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:44:55] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Fix role/eqiad/cache/text.yaml path [puppet] - 10https://gerrit.wikimedia.org/r/1041628 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:45:14] (03CR) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:49:49] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5 [12:49:55] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8 [12:50:05] !log rolling restart of pybal on lvs1020 and lvs1017 - T366466 [12:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:10] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:50:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P64628 and previous config saved to /var/cache/conftool/dbconfig/20240611-125028-ladsgroup.json [12:53:13] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1016.eqiad.wmnet [12:56:35] (03PS1) 10Vgutierrez: Revert "depool text@eqiad before enabling 
IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1041635 (https://phabricator.wikimedia.org/T366466) [12:58:14] (03PS1) 10Jelto: gitlab: notify ssh-gitlab service when interface aliases are added [puppet] - 10https://gerrit.wikimedia.org/r/1041636 (https://phabricator.wikimedia.org/T367021) [12:58:20] (03CR) 10Ottomata: "@ltoscano@wikimedia.org we were hoping to get Sandra some hands on experience doing this. I was going to help, but if the timing works ou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [12:58:28] jouncebot: now [12:58:28] For the next 0 hour(s) and 1 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1200) [12:58:35] jouncebot: Nemo_bis [12:58:37] grrr [12:58:40] jouncebot: next [12:58:40] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1300) [12:59:14] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts stat1006.eqiad.wmnet [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1300). Please do the needful. [13:00:05] cmelo and jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:39] o/ [13:00:48] (CR) JMeybohm: [C:+2] deployment_server: Remove similar-users deploy users [puppet] - https://gerrit.wikimedia.org/r/1041613 (https://phabricator.wikimedia.org/T345274) (owner: JMeybohm) [13:00:50] o/ [13:00:51] (CR) JMeybohm: [C:+2] service: Remove similar-users from conftool-data and service catalog [puppet] - https://gerrit.wikimedia.org/r/1041612 (https://phabricator.wikimedia.org/T345274) (owner: JMeybohm) [13:00:58] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9879565 (stjn) >>! In T275319#9878379, @Fuzzy wrote: > The external `` is nec... [13:00:59] o/ [13:01:10] SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9879559 (JMeybohm) I just had to deploy machinetranslation for {T346638} and noticed container startup times of ar... [13:01:12] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [13:01:14] SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173 (Ottomata) NEW [13:01:17] (CR) Ottomata: "On it!" [deployment-charts] - https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: Snwachukwu) [13:01:41] I can deploy, I think :) [13:01:48] unless effie wanted to do something else? [13:02:07] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:02:08] Lucas_WMDE go ahead please [13:02:10] ok!
[13:02:22] SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9879592 (Ottomata) @Snwachukwu needs this to finish {T344730} [13:02:40] (CR) Ssingh: [C:+1] Revert "depool text@eqiad before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1041635 (https://phabricator.wikimedia.org/T366466) (owner: Vgutierrez) [13:03:00] (CR) Vgutierrez: [C:+2] Revert "depool text@eqiad before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1041635 (https://phabricator.wikimedia.org/T366466) (owner: Vgutierrez) [13:03:14] !log repool text@eqiad with IPIP encapsulation enabled - T366466 [13:03:18] !oncall-now [13:03:18] Oncall now for team SRE, rotation business_hours: [13:03:18] A.mir1, g.odog, h.erron, j.hathaway [13:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:23] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:03:24] Amir1, godog, herron, jhathaway ^^ [13:03:50] ack thanks [13:03:52] (PS6) Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [13:04:01] (CR) Lucas Werkmeister (WMDE): [C:+1] Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo) [13:04:01] awesome [13:04:29] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1016.eqiad.wmnet [13:04:45] (CR) Lucas Werkmeister (WMDE): Enable CampaignEvents on swahili wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo) [13:04:58] (CR) TrainBranchBot: [C:+2] "Approved by
lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo) [13:05:06] cmelo: I’ll deploy your two patches separately [13:05:16] ok thank you [13:05:30] (and left a comment on the second one) [13:05:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P64629 and previous config saved to /var/cache/conftool/dbconfig/20240611-130535-ladsgroup.json [13:05:45] (Merged) jenkins-bot: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo) [13:06:00] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:06:14] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [13:06:15] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[13:06:17] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041094|Configures the necessary user rights for CampaignEvents on swahili (T366502)]] [13:06:20] !log start rebooting all cp-text_codfw hosts for T366555 (spaced 1.5 hrs) [13:06:25] T366502: Add configs on mediawiki-config to enable CampaignEvents on swahili wikipedia - https://phabricator.wikimedia.org/T366502 [13:06:26] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2885/console" [puppet] - 10https://gerrit.wikimedia.org/r/1041636 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [13:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] !log disable puppet on A:ncredir before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035724 - T365689 [13:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:34] T365689: Provide a ferm based alternative to tcp-mss-clamper - https://phabricator.wikimedia.org/T365689 [13:06:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [13:06:40] (03CR) 10Elukey: [C:04-1] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [13:06:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable Vector appearance menu & larger font-size on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041320 (https://phabricator.wikimedia.org/T362148) (owner: 10Jdrewniak) [13:06:58] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw [13:07:08] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:07:23] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:07:58] (03CR) 10Elukey: [C:04-1] "Sure no problem! I can have a chat with Sandra on IRC after the patch to read some documentation and chat about how to deploy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [13:08:04] (03PS2) 10Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) [13:08:34] hm, scap failed to connect to four kubernetes hosts (no route to host) [13:08:40] kubernetes1032..1035 [13:08:44] should I worry? [13:09:00] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:09:04] this is during the docker pull, so I guess it means the actual deployment might be slower if the nodes have to pull the image then [13:09:07] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:09:13] or if they’re still unreachable then, it’ll be a different issue…? [13:09:39] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:09:51] (03CR) 10Cmelo: Enable CampaignEvents on swahili wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:10:05] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [13:10:08] Lucas_WMDE: comment solved, I added the comment thank you https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1041096 [13:10:09] correction, kubernetes1031..1035 [13:10:31] cmelo: thanks :) [13:10:35] (03PS1) 10Jelto: gitlab: use IPv4 and IPv6 for SSH check [puppet] - 10https://gerrit.wikimedia.org/r/1041639 (https://phabricator.wikimedia.org/T367021) [13:10:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:10:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:10:59] !log lucaswerkmeister-wmde@deploy1002 cmelo, lucaswerkmeister-wmde: Backport for [[gerrit:1041094|Configures the necessary user rights for CampaignEvents on swahili (T366502)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:30] cmelo: the group changes should be live on the debug servers, please test! 
[13:11:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [13:12:17] Lucas_WMDE: ok, thanks, testing it now [13:12:22] (03PS1) 10Fabfur: Fixed typo in help [cookbooks] - 10https://gerrit.wikimedia.org/r/1041640 [13:13:10] https://sw.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2 looks good to me fwiw [13:13:29] (03CR) 10Herron: [C:03+2] admin: add rickijay to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041227 (https://phabricator.wikimedia.org/T365574) (owner: 10Dzahn) [13:13:33] (03CR) 10Vgutierrez: [V:03+1 C:03+2] lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [13:13:36] PROBLEM - Host aux-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:42] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:48] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:06] herron: can I go ahead and merge your CR? [13:14:13] vgutierrez: please and thank you! 
[13:14:18] merging [13:14:43] `sudo -u mwdeploy ssh kubernetes1031.eqiad.wmnet` now gives me a permission denied error from kubernetes1031 [13:14:47] which probably means I’m doing the ssh wrong [13:14:57] but also, if the host is replying to my ssh connection at all, that’s probably good ^^ [13:15:16] (03CR) 10JHathaway: [C:03+1] Run vrts_aliases in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/1041414 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [13:15:19] herron: done [13:15:23] so once cmelo confirms I’ll just go ahead with the deployment and hope that the kubernetes1031..35 errors just went away I guess [13:15:26] RECOVERY - Host aux-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [13:15:38] and hope that the docker pull doesn’t take too long [13:15:46] FIRING: [2x] ProbeDown: Service kubestagemaster1005:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:52] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [13:15:58] !log depool ncredir6001 - T365689 [13:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:02] T365689: Provide a ferm based alternative to tcp-mss-clamper - https://phabricator.wikimedia.org/T365689 [13:16:10] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [13:16:54] (03CR) 10Fabfur: [C:03+2] Fixed typo in help [cookbooks] - 10https://gerrit.wikimedia.org/r/1041640 (owner: 10Fabfur) [13:17:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [13:18:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [13:18:15] (03PS1) 
10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041642 [13:18:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9879663 (10herron) 05In progress→03Resolved a:03herron The patch to provision this access has been merged, and will fully propagate within 30 m... [13:18:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [13:18:45] RESOLVED: [3x] ProbeDown: Service ganeti1031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:28] Lucas_WMDE: I tested the first one and it is ok [13:19:32] ok! [13:19:33] !log lucaswerkmeister-wmde@deploy1002 cmelo, lucaswerkmeister-wmde: Continuing with sync [13:19:34] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041643 [13:19:38] we can go with the next one, thanks [13:20:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T352010)', diff saved to https://phabricator.wikimedia.org/P64630 and previous config saved to /var/cache/conftool/dbconfig/20240611-132043-ladsgroup.json [13:20:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:20:49] (03PS3) 10Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) [13:21:39] (03CR) 10Lucas Werkmeister (WMDE): Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:21:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet 
[13:22:28] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8 [13:22:31] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5 [13:24:54] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9879697 (10Fuzzy) > I get the point of external class, but I don’t get the point of the internal... [13:24:58] (03PS1) 10JMeybohm: Remove deprecated uses_ingress option from service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1041644 (https://phabricator.wikimedia.org/T346638) [13:26:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [13:26:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [13:26:36] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9879703 (10Jhancock.wm) proposed relocations for each server. lemme know if that works for you @Papaul wikikube-ctrl2001 curren... [13:28:47] hm, the k8s deployment has been stuck on 94% (113 left) for a minute or two now… [13:28:55] *looks up how many thingies per host we have* [13:30:03] 113 / 22.5 (“22-23 additional replicas” per T362323) is almost exactly 5, so that would track with the 5 kubernetes hosts that missed the docker pull earlier [13:30:05] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [13:30:07] * Lucas_WMDE waits patiently [13:30:41] (docker pull usually seems to take 2½ minutes, at least when it runs as a separate step) [13:30:42] effie: you rebooting nodes in eqiad rn? 
[13:31:16] Lucas_WMDE: probably k8s node reboots [13:31:39] looking at the cluster status dash, there's unschedulable nodes, and that's because they're cordoned off for reboots [13:31:49] hm, I see [13:32:01] but scap still seems to be waiting for them [13:32:13] !log failover ganeti cluster for esams02 to ganeti3006 [13:32:14] yeah we have no way of excluding them from the pull right now [13:32:15] (since 13:26:37 ftr) [13:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:20] :/ [13:32:35] the pull failing isn’t a huge problem afaict, but the actual deployment is now stuck too :/ [13:32:42] I guess I’ll see if scap times out [13:33:01] jan_drewniak: we might not get to your change within the window :/ [13:33:13] but I’m still here afterwards and there’s a free hour in the calendar, so if you’re still around, we could deploy then I guess [13:33:34] (03PS1) 10Vgutierrez: realserver::ipip: Fix ferm MSS clamping rule [puppet] - 10https://gerrit.wikimedia.org/r/1041645 (https://phabricator.wikimedia.org/T365689) [13:33:35] claime: yes [13:33:39] Lucas_WMDE: I'm fine to wait around a bit longer :) [13:33:53] !log failover ganeti cluster for esams01 to ganeti3005 [13:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:24] Lucas_WMDE: I assumed the backport would be faster [13:34:34] PROBLEM - ganeti-wconfd running on ganeti3008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:34:35] (03PS2) 10Vgutierrez: realserver::ipip: Fix ferm MSS clamping rule [puppet] - 10https://gerrit.wikimedia.org/r/1041645 (https://phabricator.wikimedia.org/T365689) [13:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors 
[13:35:17] any idea when kubernetes1031..35 will be schedulable / uncordoned again? [13:35:52] Lucas_WMDE: The problem isn't those 5 nodes [13:35:57] ok [13:36:00] It's the 28 that are cordoned as part of the batch [13:36:08] Makes CPU available in the negative [13:36:12] (do I have access to that cluster status dash by any chance? ^^) [13:36:34] PROBLEM - ganeti-wconfd running on ganeti3007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:36:42] (03PS1) 10JMeybohm: refresh_fixtures: Remove code that mocks listener upstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) [13:36:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [13:36:49] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-codfw [13:36:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1041030 (https://phabricator.wikimedia.org/T366695) (owner: 10Muehlenhoff) [13:36:53] !log repool ncredir6001 - T365689 [13:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:59] T365689: Provide a ferm based alternative to tcp-mss-clamper - https://phabricator.wikimedia.org/T365689 [13:37:07] Lucas_WMDE: https://grafana.wikimedia.org/goto/Rf-0A48Ig?orgId=1 [13:37:07] claime: where our other assumption was also wrong [13:37:27] (03CR) 10CI reject: [V:04-1] refresh_fixtures: Remove code that mocks listener upstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:37:28] effie: I think that's because I raised the number of replicas this morning [13:37:35] thanks! 
[13:37:49] claime: oh dear, I didn't make that connection [13:37:49] (03PS1) 10Fabfur: haproxy: adding haproxy30 component and support [puppet] - 10https://gerrit.wikimedia.org/r/1041647 (https://phabricator.wikimedia.org/T366885) [13:38:13] but still, the number of hosts was not particularly large [13:38:31] Lucas_WMDE: sorry you got caught up in some poor assumptions from my end [13:38:33] oh, scap’s k8s deployment progress is now at 301 left [13:38:35] 309 [13:38:39] is it rolling back again [13:38:43] (03PS8) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [13:38:48] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:50] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1041645 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [13:38:55] (I remember this confusing behavior from two weeks or so ago ^^) [13:38:58] Lucas_WMDE: let it do its thing [13:39:03] ok :) [13:39:08] (03CR) 10Vgutierrez: [C:03+2] realserver::ipip: Fix ferm MSS clamping rule [puppet] - 10https://gerrit.wikimedia.org/r/1041645 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [13:39:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:39:16] effie: can you ctrl-c after the 5 currently rebooting and manually uncordon so the deployment window can proceed [13:39:18] ? 
[13:39:22] I'll get on moving more appservers [13:39:37] (but it's gonna take more than the deployment window to get them there) [13:39:48] once the scap finishes (fails), I’m guessing everything will be back to before – k8s rolled back, and bare metal never got the change deployed (other than mwdebug, and maybe bare-metal canaries?) [13:40:03] claime: happy to do that too [13:40:08] and then, when I have the green light from you, I can probably scap-backport the second config change and let them roll out together? [13:40:14] (as the first one was already successfully tested on its own) [13:40:24] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9879781 (10stjn) Then it should be added only where anchors are involved. I suggest converting to... [13:40:29] hmmm I don't know if it'll roll back bare metal, but yeah, once the hosts are uncordoned you'll be able to proceed [13:40:34] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [13:40:34] (03PS9) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [13:40:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: stat1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [13:40:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts stat1006.eqiad.wmnet [13:40:38] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:40:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:40:46] FIRING: [7x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:52] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [13:40:52] If the patches can be deployed together there should be no problem rolling them out at the same time [13:40:55] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts stat1007.eqiad.wmnet [13:40:58] I’m assuming bare-metal is still on the old code because it gets deployed after k8s (iirc) [13:41:00] ok thanks! [13:41:23] but maybe scap will still proceed with the bare-metal deployment after k8s fails? 
idk, we’ll see… [13:41:36] claime: I will stop and check, then uncordon any hosts [13:41:53] ack [13:42:23] !log jiji@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:wikikube-worker-eqiad [13:42:31] okay, something timed out and scap printed lots of output [13:42:46] now it’s doing another k8s deployment progress that seems to be increasing again [13:43:14] so… k8s had started rolling back on its own (hence “left” increasing again), but now scap is also doing an explicit rollback, as a new deployment with the older image version? [13:43:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [13:43:41] huh [13:43:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [13:43:52] (scap says “rolling back to prior state...”) [13:43:58] I thought it would just let the helmfile roll back [13:44:27] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1035-38 - jclark@cumin1002" [13:44:36] or *maybe* it’s just tracking k8s’s own rollback in a different way (swapping “ok” and “left” or something like that)? 
no idea [13:44:54] I didn’t watch the output closely enough to see if the new deployment progress started at 0% or above [13:45:34] !log rolling switch from tcp-mss-clamper to ferm based MSS clamping on A:ncredir - T365689 [13:45:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [13:45:37] Lucas_WMDE: tell me when it's done I'll check what image is deployed [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:39] T365689: Provide a ferm based alternative to tcp-mss-clamper - https://phabricator.wikimedia.org/T365689 [13:45:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for cloudcephosd1035-38 - jclark@cumin1002" [13:45:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:45:46] RESOLVED: [6x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:10] claime: ok [13:46:17] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9879840 (10kamila) @Papaul Thanks for the additional details! I think moving to the new per-rack VLAN shouldn't be a problem for... 
[13:46:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot policy FORCED [13:46:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [13:46:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy FORCED [13:46:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [13:46:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [13:47:26] Lucas_WMDE: all nodes are schedulable now, where do you stand? [13:47:43] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:47:47] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:47:51] scap is now doing sync-apaches [13:47:57] claime: k8s is done afaict [13:48:06] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [13:48:13] so bare-metal will have the new code after all, scap didn’t abort the deployment as I thought it would [13:48:21] Lucas_WMDE: ok checking the image on k8s [13:48:21] should be mostly harmless [13:48:48] users *might* be confused by a new user group only appearing in 10% of requests but IIUC the user group doesn’t do anything yet, and we’ll fix it soon anyway [13:49:03] (now php-fpm restart) [13:49:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [13:49:08] mediawiki-multiversion:2024-06-11-065428-publish [13:49:14] So it's got the old code [13:49:15] Lucas_WMDE: I guess we can run another scap [13:49:19] <_joe_> that's from this morning I think [13:49:34] claime: yeah, the new one would’ve been 130626 [13:49:43] (from helm output 
scap printed earlier) [13:49:51] effie: okay, will do once this one’s done [13:49:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [13:50:11] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1017.eqiad.wmnet [13:50:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts stat1007.eqiad.wmnet [13:50:27] cool, sorry for the fuss [13:50:41] (03PS1) 10Giuseppe Lavagetto: Use the statsd-exporter service where available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) [13:50:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:50:46] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:09] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041094|Configures the necessary user rights for CampaignEvents on swahili (T366502)]] (duration: 44m 51s) [13:51:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:51:19] T366502: Add configs on mediawiki-config to enable 
CampaignEvents on swahili wikipedia - https://phabricator.wikimedia.org/T366502 [13:51:25] started the new backport now [13:51:48] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041644 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:52:00] 🤔 actually, I guess it’s good that bare-metal was synced after all? these patches touch different files, so I’m not sure if the second backport would sync the first change to bare metal… [13:52:16] (03Merged) 10jenkins-bot: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:52:19] <_joe_> it definitely would if files weren't changed [13:52:24] <_joe_> it's just rsync [13:52:27] (in k8s it’s all in one big image so I’m not worried, but for bare-metal I assume `scap backport` does the equivalent of `sync-file` only for files touched in that commit) [13:52:44] <_joe_> I think it does sync-full actually [13:52:44] hm, ok [13:52:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [13:52:46] <_joe_> I hope [13:52:46] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041096|Enable CampaignEvents on swahili wikipedia (T366502)]] [13:52:52] that would be more robust, at least ^^ [13:53:06] inflatador, dcausse: CirrusConsumerFetchErrorRate above, should we do something about it? 
[13:53:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [13:53:28] _joe_: yeah the source code looks like it does sync-world [13:53:45] FIRING: [12x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:53:53] <_joe_> uhhh [13:53:58] <_joe_> claime: ^^ [13:54:03] <_joe_> that doesn't sound right [13:54:09] <_joe_> if scap had rolled back [13:54:16] <_joe_> but also I don't know how that metric is counted [13:54:41] gehel: eqiad seems backlogged so processing more events than usual and causing this alert to flap, will tune it [13:55:05] <_joe_> Lucas_WMDE: are you running scap backport rn? [13:55:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, cmelo: Backport for [[gerrit:1041096|Enable CampaignEvents on swahili wikipedia (T366502)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:55:27] _joe_: yes [13:55:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [13:55:39] (now paused for debug server testing) [13:55:48] <_joe_> Lucas_WMDE: ack thanks [13:55:50] cmelo: can you test CampaignEvents on swahili wikipedia on mwdebug? [13:55:57] _joe_: looks like mw-api-ext still serves about 4k rps [13:56:08] <_joe_> yeah I was looking and there's nothing wrong [13:56:18] Testing it [13:56:24] should I wait before I continue with sync? 
(once the test is done) [13:57:03] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [13:58:06] It is working thank you [13:58:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy FORCED [13:58:15] ok [13:58:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot policy FORCED [13:58:35] _joe_ / claime / effie: can I go ahead or should I wait because of that alert? [13:58:43] Lucas_WMDE: go ahead [13:58:45] RESOLVED: [12x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:48] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, cmelo: Continuing with sync [13:58:50] ok! 
[13:59:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [14:00:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [14:00:21] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184 (10AudreyPenven_WMDE) 03NEW [14:00:37] (03PS1) 10Jforrester: [wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041659 (https://phabricator.wikimedia.org/T365627) [14:00:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [14:01:16] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1017.eqiad.wmnet [14:01:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [14:01:29] k8s deployment in progress… [14:02:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy FORCED [14:02:21] effie: there’s one more config change to deploy, can I do that after this one or did you want to do something else to the k8s nodes first? 
[14:03:00] (03CR) 10Muehlenhoff: [C:03+2] Change ping host in codfw to ping2004 [homer/public] - 10https://gerrit.wikimedia.org/r/1041030 (https://phabricator.wikimedia.org/T366695) (owner: 10Muehlenhoff) [14:03:33] Lucas_WMDE: just finish up your things [14:03:38] ok :) [14:03:47] RESOLVED: [2x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:04:16] (03CR) 10Cwhite: Use the statsd-exporter service where available (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:04:23] k8s deployment finished \o/ [14:04:33] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [14:04:37] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [14:04:41] (03PS2) 10Giuseppe Lavagetto: mw-debug: protect debug endpoints with a password [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041593 [14:04:46] Lucas_WMDE: thanks <3 [14:05:44] * Lucas_WMDE is still not used to seeing these small numbers in the php-fpm-restart :D [14:05:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [14:05:46] FIRING: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9879969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
cloudcephosd1035.eqiad.wmnet with OS bullseye [14:07:26] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041096|Enable CampaignEvents on swahili wikipedia (T366502)]] (duration: 14m 40s) [14:07:30] T366502: Add configs on mediawiki-config to enable CampaignEvents on swahili wikipedia - https://phabricator.wikimedia.org/T366502 [14:07:43] alright, then let’s go ahead with jan_drewniak [14:07:48] (03PS2) 10Jdrewniak: Enable Vector appearance menu & larger font-size on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041320 (https://phabricator.wikimedia.org/T362148) [14:08:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041320 (https://phabricator.wikimedia.org/T362148) (owner: 10Jdrewniak) [14:08:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy FORCED [14:08:45] FIRING: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:53] (03Merged) 10jenkins-bot: Enable Vector appearance menu & larger font-size on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041320 (https://phabricator.wikimedia.org/T362148) (owner: 10Jdrewniak) [14:09:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1036 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:09:22] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041320|Enable Vector appearance menu & larger font-size on wikipedias (T362148)]] [14:09:26] T362148: Deploy reading accessibility settings menu and new typography defaults to remaining Wikipedias - 
https://phabricator.wikimedia.org/T362148 [14:09:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [14:10:46] RESOLVED: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:01] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:33] ^ is someone looking at these? [14:11:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet [14:11:57] !log lucaswerkmeister-wmde@deploy1002 jdrewniak, lucaswerkmeister-wmde: Backport for [[gerrit:1041320|Enable Vector appearance menu & larger font-size on wikipedias (T362148)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:13:26] (03PS3) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) [14:13:29] jan_drewniak: please test :) [14:13:37] (03CR) 10CI reject: [V:04-1] admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: 10Herron) [14:13:45] FIRING: [15x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:46] * jan_drewniak Lucas_WMDE: alright, testing! 
[14:13:52] (PS1) DCausse: search: relax CirrusConsumerFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1041668 [14:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:03] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:16:57] FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:17:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [14:17:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [14:18:03] (PS4) Herron: admin: add radimer to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) [14:18:45] FIRING: [15x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:51] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl1002.eqiad.wmnet [14:19:15] Lucas_WMDE: ok, good to sync [14:19:26] !log lucaswerkmeister-wmde@deploy1002 jdrewniak, lucaswerkmeister-wmde: Continuing with sync [14:19:28] ok! 
[14:20:46] RESOLVED: [15x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:52] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:21:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:22:21] Lucas_WMDE: could you please let me know when you're done with the deployment? [14:22:26] sure [14:22:44] thanks! [14:24:57] (03PS24) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [14:26:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [14:27:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [14:28:01] (03PS1) 10Clément Goubert: kubernetes: rename and reimage 4 servers [puppet] - 10https://gerrit.wikimedia.org/r/1041670 (https://phabricator.wikimedia.org/T351074) [14:28:03] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:20:00 on lsw1-f5-eqiad.mgmt with reason: prep upgrade of device [14:28:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:20:00 on lsw1-f5-eqiad.mgmt with reason: prep upgrade of device [14:28:24] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880091 (10ops-monitoring-bot) Icinga downtime 
and Alertmanager silence (ID=adbdaf29-9da2-42ea-b64e-fc6d141eaf9e) set by cmooney... [14:28:31] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041320|Enable Vector appearance menu & larger font-size on wikipedias (T362148)]] (duration: 19m 08s) [14:28:35] T362148: Deploy reading accessibility settings menu and new typography defaults to remaining Wikipedias - https://phabricator.wikimedia.org/T362148 [14:28:45] FIRING: [15x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:48] * Lucas_WMDE done [14:28:56] ping kamila_ and effie ^^ [14:29:08] thanks! [14:29:19] !log UTC afternoon backport+config window done [14:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:42] Lucas_WMDE: tx [14:29:47] !log depooling mw1402 mw1403 mw1406 mw1411 for reimage to k8s - T351074 [14:29:49] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl1002.eqiad.wmnet [14:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [14:30:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es1038.eqiad.wmnet with reason: T365982 [14:30:49] T365982: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982 [14:30:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1038.eqiad.wmnet with reason: T365982 [14:31:23] (03PS3) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [14:31:23] (03PS1) 10Brouberol: datahub: update 
datahubsearch hostname to use external-services [deployment-charts] - https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) [14:31:40] (CR) Scott French: [C:+2] wikifeeds: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1037164 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [14:32:28] (CR) Hnowlan: [C:+1] kubernetes: rename and reimage 4 servers [puppet] - https://gerrit.wikimedia.org/r/1041670 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [14:32:46] (Merged) jenkins-bot: wikifeeds: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1037164 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [14:33:07] (CR) DCausse: [C:+2] search: relax CirrusConsumerFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1041668 (owner: DCausse) [14:33:45] RESOLVED: [15x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:09] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9880114 (Ahoelzl) Approved. 
[14:34:18] (03Merged) 10jenkins-bot: search: relax CirrusConsumerFetchErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1041668 (owner: 10DCausse) [14:34:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9880116 (10ttaylor) Approved in place of @thcipriani while he is on vacation [14:34:55] (03CR) 10Clément Goubert: [C:03+2] kubernetes: rename and reimage 4 servers [puppet] - 10https://gerrit.wikimedia.org/r/1041670 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:34:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [14:35:16] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:35:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet [14:36:02] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:37:16] (03PS3) 10Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) [14:37:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [14:37:56] (03CR) 10Snwachukwu: "Thanks @ltoscano@wikimedia.org. 
I would sync with you on IRC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [14:38:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1402 to wikikube-worker1013 [14:38:39] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:38:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:52] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [14:39:06] (03CR) 10Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [14:39:09] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:40:46] FIRING: [13x] ProbeDown: Service ganeti3008:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:42:16] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:42:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [14:43:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [14:43:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1107-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:44:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [14:44:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:44:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl1002.eqiad.wmnet [14:44:27] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1402 to wikikube-worker1013 - cgoubert@cumin1002" [14:44:31] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880133 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl1002.eqiad.... 
[14:44:32] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:44:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:44:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:44:56] yeah, the signs were clear [14:44:58] hmm hello [14:45:17] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:19] here but meeting [14:45:20] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:45:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1402 to wikikube-worker1013 - cgoubert@cumin1002" [14:45:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:24] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1013 [14:45:31] here [14:45:32] checking too [14:45:38] known/expected sukhe ? 
[14:45:46] RESOLVED: [13x] ProbeDown: Service ganeti3008:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:58] Looks like availability is back up [14:46:03] godog: see -private [14:46:07] (PS1) Giuseppe Lavagetto: mw-debug: add general values to the statsd releases [deployment-charts] - https://gerrit.wikimedia.org/r/1041673 [14:46:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 depool T365982', diff saved to https://phabricator.wikimedia.org/P64631 and previous config saved to /var/cache/conftool/dbconfig/20240611-144624-arnaudb.json [14:46:28] T365982: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982 [14:46:36] (PS1) Jdlrobson: Don't squish images in non-responsive skins e.g. Vector 2010 [core] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) [14:47:17] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:24] SRE, Traffic, Patch-For-Review: Add unique error IDs to 4xx responses - https://phabricator.wikimedia.org/T330973#9880141 (TheDJ) I randomly found this. It seems this was forgotten about, even though most agreed it was a good idea ? A quick revisit might help bring a result to this or a decision to... 
[14:47:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:20] (03PS3) 10JMeybohm: calculator-service: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) [14:48:20] (03PS2) 10JMeybohm: refresh_fixtures: Remove code that mocks listener upstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) [14:48:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:49:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:50:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1013 [14:50:15] (03CR) 10Scott French: [C:03+1] "Thank you! 
Will do :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [14:50:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1402 to wikikube-worker1013 [14:50:23] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9880169 (Snwachukwu) [14:50:24] (CR) Scott French: [C:+2] calculator-service: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [14:50:59] (Merged) jenkins-bot: calculator-service: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [14:51:01] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9880179 (Snwachukwu) [14:51:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [14:51:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1403 to wikikube-worker1014.eqiad.wmnet [14:51:24] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw1403 to wikikube-worker1014.eqiad.wmnet [14:51:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:51:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [14:51:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [14:51:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-f5-eqiad,lsw1-f5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: prep upgrade of device [14:51:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1403 to wikikube-worker1014 [14:52:04] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:52:09] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-f5-eqiad,lsw1-f5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: prep upgrade of device [14:52:21] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880192 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22e81c7a-3dde-4cd2-9376-bd003c744dc6) set by cmooney... 
[14:52:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:29] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1013.eqiad.wmnet on all recursors [14:52:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1013.eqiad.wmnet on all recursors [14:52:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1013.eqiad.wmnet with OS bullseye [14:53:45] FIRING: [13x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:05] (03PS1) 10Herron: admin: add ifrahkh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041675 (https://phabricator.wikimedia.org/T366558) [14:54:32] (03PS2) 10Brouberol: datahub: update datahubsearch hostname to use external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) [14:54:32] (03PS4) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [14:54:32] (03PS1) 10Brouberol: Deploy calico network policy templates to all datahub charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) [14:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2003.codfw.wmnet [14:55:55] (03PS1) 10Lucas Werkmeister (WMDE): Allow loading EntitySchema on client (only) wikis [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041677 (https://phabricator.wikimedia.org/T363153) [14:56:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:35:00 on 6 hosts with reason: upgrade lsw1-f5-eqiad [14:56:14] (03CR) 10CI reject: [V:04-1] datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [14:56:19] (03PS1) 10Lucas Werkmeister (WMDE): Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041678 (https://phabricator.wikimedia.org/T363153) [14:56:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on 6 hosts with reason: upgrade lsw1-f5-eqiad [14:56:37] (03CR) 10CI reject: [V:04-1] datahub: update datahubsearch hostname to use external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [14:56:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880204 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d67744a2-77a0-40dc-aff6-4af804b0b5ce) set by cmooney... 
[14:56:45] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:56:45] (CR) CI reject: [V:-1] admin: add ifrahkh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1041675 (https://phabricator.wikimedia.org/T366558) (owner: Herron) [14:56:51] (PS1) Lucas Werkmeister (WMDE): Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) [14:57:00] SRE, Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9880208 (MoritzMuehlenhoff) The routers in codfw have been reconfigured to use ping2004 (confirmed with tcpdump) instead of ping2003. [14:57:03] (CR) Dzahn: [C:+2] delete pk.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041245 (https://phabricator.wikimedia.org/T367012) (owner: Dzahn) [14:57:07] (CR) CI reject: [V:-1] Deploy calico network policy templates to all datahub charts [deployment-charts] - https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) (owner: Brouberol) [14:57:14] (CR) Dzahn: [C:+2] delete langcom.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) (owner: Dzahn) [14:57:19] (PS2) Dzahn: delete langcom.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) [14:57:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:34] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:59:06] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Renaming mw1403 to wikikube-worker1014 - cgoubert@cumin1002" [14:59:06] !log rebalance ganeti cluster in esams02 following reboots [14:59:06] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:59:06] (03PS2) 10Herron: admin: add ifrahkh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041675 (https://phabricator.wikimedia.org/T366558) [14:59:06] RESOLVED: [13x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [14:59:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet [14:59:16] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:59:16] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:59:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming mw1403 to wikikube-worker1014 - cgoubert@cumin1002" [14:59:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:18] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1014 [15:04:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2003.codfw.wmnet [15:04:48] !log rebooting lsw1-f5-eqiad to complete JunOS upgrade (T365982) [15:04:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1014 [15:04:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1403 to wikikube-worker1014 [15:04:48] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:48] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:48] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:04:48] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:48] FIRING: [15x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:59] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for new k8s workers - cgoubert@cumin1002" [15:06:56] FIRING: [14x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for new k8s workers - cgoubert@cumin1002" [15:06:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1013.eqiad.wmnet with reason: host reimage [15:19:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1406 to wikikube-worker1017 [15:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1013.eqiad.wmnet with reason: host reimage [15:19:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:19:05] !log rebalance ganeti cluster in esams01 following reboots [15:19:05] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1500). 
[15:19:05] !log beginning rolling ram upgrades for prometheus200[56] T360895 [15:19:05] RESOLVED: [13x] ProbeDown: Service ganeti3007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:05] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:05] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1406 to wikikube-worker1017 - cgoubert@cumin1002" [15:19:05] FIRING: [9x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es1038.eqiad.wmnet with reason: T365982 [15:19:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1038.eqiad.wmnet with reason: T365982 [15:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1406 to wikikube-worker1017 - cgoubert@cumin1002" [15:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:05] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1017 [15:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1017 [15:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1406 to wikikube-worker1017 [15:19:05] !log cgoubert@cumin1002 START 
- Cookbook sre.hosts.rename from mw1411 to wikikube-worker1018 [15:19:05] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:19:05] FIRING: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:05] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:17] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:17] arnaudb, inflatador, btullis, Emperor: lsw1-f5-eqiad back online after the upgrade, I'm looking at an issue for cloud services which has just hit so haven't done anything but the basic checks [15:24:17] please check things look ok and ping me if any doubts [15:24:17] thanks! [15:24:17] everything ok on my end, repooling es1038 [15:24:17] thanks!
[15:24:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64633 and previous config saved to /var/cache/conftool/dbconfig/20240611-152131-arnaudb.json [15:24:17] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:17] topranks: swift all good, thanks. [15:24:17] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1411 to wikikube-worker1018 - cgoubert@cumin1002" [15:24:17] topranks: did the switch upgrades affect anything for Ceph? cc andrewbogott and bd808 [15:24:17] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:24:17] RESOLVED: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1411 to wikikube-worker1018 - cgoubert@cumin1002" [15:24:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:17] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1018 [15:25:21] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:26:06] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:26:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1018 [15:26:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1411 to wikikube-worker1018 [15:26:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1035.eqiad.wmnet with OS bullseye [15:26:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:25] FIRING: [11x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:12] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002 [15:28:12] !log kamila@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-ctrl1002 [15:29:35] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002 [15:29:36] !log kamila@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-ctrl1002 [15:29:47] cdanis: no, switch upgrades were all in wmf prod land - just affected 6 hosts in rack F5 [15:29:56] thanks topranks [15:30:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1014.eqiad.wmnet wikikube-worker1017.eqiad.wmnet wikikube-worker1018.eqiad.wmnet on all recursors [15:30:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1014.eqiad.wmnet wikikube-worker1017.eqiad.wmnet wikikube-worker1018.eqiad.wmnet on all recursors [15:30:44] !log
cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1014.eqiad.wmnet with OS bullseye [15:30:46] FIRING: [11x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:54] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1017.eqiad.wmnet with OS bullseye [15:31:18] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:18] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1018.eqiad.wmnet with OS bullseye [15:32:25] FIRING: [12x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1013.eqiad.wmnet with OS bullseye [15:33:18] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:45] FIRING: [12x] ProbeDown: Service restbase2029-a:7000 has failed 
probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:54] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:18] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:31] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [15:35:46] FIRING: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker1013.eqiad.wmnet [15:35:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker1013.eqiad.wmnet [15:36:01] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:36:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [15:36:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64634 and previous config saved to /var/cache/conftool/dbconfig/20240611-153636-arnaudb.json [15:37:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:37:25] FIRING: [4x] 
SystemdUnitFailed: ferm.service on kubernetes1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:28] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:38:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:38:18] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:38:45] RESOLVED: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:40:31] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [15:41:41] hmm [15:41:54] claime: https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad+prometheus%2Fops&viewPanel=18 this doesn't look good I think? [15:42:07] yeah [15:42:07] not that I know anything about it but going by the dashboard :) [15:42:40] jobqueue errors [15:42:55] eventbus maybe? 
[15:42:56] yeah [15:43:45] FIRING: [9x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:44:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1014.eqiad.wmnet with reason: host reimage [15:44:26] erm [15:44:40] https://grafana.wikimedia.org/goto/5EN_yVUSg?orgId=1 [15:45:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage [15:45:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage [15:45:46] FIRING: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1014.eqiad.wmnet with reason: host reimage [15:48:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:49:02] Some throttling of changeprop but I'm not sure that explains it [15:49:07] I'm not finding a smoking gun 
[15:49:45] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [15:50:04] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:50:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage [15:50:44] restbase2030 seems okay from a cassandra perspective fwiw, probably not related [15:50:46] RESOLVED: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:01] There's a big hole in jobqueue codfw metrics between 1514 and 1527 [15:51:05] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:51:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64635 and previous config saved to /var/cache/conftool/dbconfig/20240611-155143-arnaudb.json [15:52:27] weird [15:53:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage [15:53:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:53:22] corresponding hole in eventgate codfw metrics [15:54:02] I feel like it's a metrics issue, not the actual problem [15:54:14] probably related to the wikikube-ctrl work [15:54:14] yeah seems eqiad is where the real problem is 
[15:54:33] as is tradition, the standard eventgate issues https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=All&var-kubernetes_namespace=All&var-destination=eventgate-analytics&var-destination=eventgate-main&from=now-3h&to=now&viewPanel=10 [15:54:48] ugh [15:54:52] Kick it [15:55:04] yeah, I'll roll-restart [15:55:19] that's safe enough I assume [15:55:32] is it just me or does that dashboard take forever to load? [15:55:46] it does [15:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:56:02] yeah, also makes my browser barf pretty hard [15:57:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:32] there are quite a few "JobQueueError: Could not enqueue jobs" errors, which started showing up ~14:50 [15:57:47] yeah, we're looking at those atm [15:58:12] okay this is pretty odd - could be unrelated but [15:58:14] eventgate-production-bdd67974b-9c2sj 2/2 Running 0 65m [15:58:23] started more or less exactly when this started? [15:58:45] FIRING: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:46] Was it a kill, or it being moved because of a reboot?
[15:58:47] deleted it [15:59:01] moved [15:59:11] or at least no reason specified, no restarts or anything [16:00:03] stuck Terminating for now [16:00:04] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1600). [16:00:04] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:19] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:00:31] gotta run to a meeting :/ [16:01:09] what did I break? [16:01:17] killed [16:01:18] I'm only doing wikikube-ctrl stuff in eqiad [16:02:04] hnowlan: is the roll restart running? [16:02:10] no I just killed the one pod [16:02:13] ack [16:02:16] to see if it changed anything [16:02:19] feel free to roll_restart [16:02:25] RESOLVED: [3x] SystemdUnitFailed: ferm.service on kubernetes1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:27] yeah [16:03:20] did killing one work? 
[16:03:42] downstream error rates seem unchanged AFAICt [16:03:45] RESOLVED: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:50] doesn't look like it [16:03:59] !log roll restarting eventgate-main eqiad [16:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [16:04:09] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update moved wikikube-ctrl1002 host in eqiad - kamila@cumin1002" [16:04:14] would it be worth trying envoy debug logging on one of the eventgates? [16:04:27] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [16:04:34] just mentioning since that came up when this happened last week :) [16:05:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update moved wikikube-ctrl1002 host in eqiad - kamila@cumin1002" [16:05:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:19] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 219730 MB (5% inode=91%): /srv/swift-storage/sde1 200236 MB (5% inode=92%): /srv/swift-storage/sdh1 199343 MB (5% inode=92%): /srv/swift-storage/sdc1 206297 MB (5% inode=91%): /srv/swift-storage/sdd1 184486 MB (4% inode=92%): /srv/swift-storage/sdg1 205038 MB (5% inode=91%): /srv/swift-storage/sdi1 196728 MB (5% inode=92%): /srv/swift-s [16:05:19] j1 203950 MB (5% inode=91%): /srv/swift-storage/sdk1 151074 MB (3% inode=90%): /srv/swift-storage/sdl1 208490 MB (5% inode=92%): 
/srv/swift-storage/sdm1 199173 MB (5% inode=91%): /srv/swift-storage/sdn1 192941 MB (5% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [16:05:38] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002 [16:05:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1014.eqiad.wmnet with OS bullseye [16:05:54] the pods take a while to terminate [16:06:14] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [16:06:45] eventgate-analytics looks in trouble as well, not just main [16:06:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64636 and previous config saved to /var/cache/conftool/dbconfig/20240611-160649-arnaudb.json [16:07:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1002 [16:08:05] doesn't look like the roll-restart helped [16:08:35] zabe: (btw if you're following along, I'll get your Apache config patch deployed after this settles down) [16:08:47] alright:) [16:08:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1018.eqiad.wmnet with OS bullseye [16:10:46] FIRING: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:19] swfrench-wmf: if you want to debug envoy go ahead [16:11:31] it ain't gonna get more broken [16:12:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1017.eqiad.wmnet with OS
bullseye [16:12:25] FIRING: [7x] SystemdUnitFailed: ferm.service on kubernetes1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:41] claime: ack, let me see if there's an obvious outlier pod to enable it on [16:15:46] RESOLVED: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:11] !log manual run of docker-report-k8s on build2001 (some failed results) [16:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:23] elukey: there will still be some [16:16:33] (one) [16:17:13] waiting on this https://gerrit.wikimedia.org/r/c/1038248 [16:17:25] RESOLVED: [7x] SystemdUnitFailed: ferm.service on kubernetes1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:48] claime: ack perfect, didn't know it, I'll let the service run and then we can re-run [16:19:48] swfrench-wmf: btw, the error reporting from the dashboard is from the caller envoy [16:20:02] yup, thanks! [16:21:05] I was breaking out local envoy metrics (i.e. 
-> local_service) by pod to look for outliers on envoy_cluster_upstream_cx_destroy_local_with_active_rq (which we saw as a correlate last time) [16:21:15] !log homer 'cr*eqiad*' commit 'T351074' [16:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:18] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:21:35] alas, it's a trickle that has no clear correlation with pod [16:21:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64637 and previous config saved to /var/cache/conftool/dbconfig/20240611-162154-arnaudb.json [16:22:10] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9880695 (10herron) a:03Soda Hi @Soda could you please coordinate obtaining a comment of support on this task from a sponsor as outlined in https://wikitech.wikimedia.org/wiki/Volunteer_NDA?...
[16:22:30] 06SRE, 06Infrastructure-Foundations, 10netops: Sub-optimal cloud routing for WMCS in eqiad when link fails - https://phabricator.wikimedia.org/T367203 (10cmooney) 03NEW p:05Triage→03Low [16:23:18] fwiw a bunch of eventgate-production pods *also* restarted at/from the start of the error rate spike [16:23:22] er eventgate-analytics [16:23:40] FIRING: KubernetesRsyslogDown: rsyslog on ml-serve1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:23:45] FIRING: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:09] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: optimize API calls to Netbox - https://phabricator.wikimedia.org/T271864#9880763 (10elukey) [16:24:16] hnowlan: I think it's been triggered by a k8s worker reboot, but why >< [16:24:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9880770 (10herron) 05Stalled→03Resolved a:03herron The patch to provision this access has been merged and will fully propagate within the nex... [16:25:55] hmmm no message in logstash for the last two minutes [16:26:09] can't look right now but - is it all pods or some pods? [16:26:16] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:26:18] and now the error rate is ok [16:26:28] hnowlan: all pods of? mw? 
[16:27:02] The errors come from every mediawiki pod, no outlier [16:27:14] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl1001.eqiad.wmnet [16:27:20] error rate back to normal [16:27:40] FIRING: [6x] SystemdUnitFailed: ferm.service on kubernetes1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:20] that's exceedingly puzzling ... [16:28:21] swfrench-wmf: did you do something? [16:28:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:35] claime: have not touched anything yet, no :) [16:28:39] wth [16:28:40] RESOLVED: KubernetesRsyslogDown: rsyslog on ml-serve1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:28:45] RESOLVED: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:55] hnowlan: in my case I was referring to all pods of eventgate-main [16:29:01] this is such a heisenbug [16:29:10] this sorta happened last week too .. [16:29:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:29:31] yeah it's happened a bunch of times actually [16:29:37] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq... 
[16:29:38] and we've never been able to find a cause [16:30:45] got side-tracked trying to guesstimate how much logging volume I was going to generate, and I guess missed my chance [16:31:16] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "updated wikikube-ctrl1002 status - kamila@cumin1002 - T366204" [16:31:20] T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204 [16:31:42] !log pool and uncordon wikikube-worker1013.eqiad.wmnet,wikikube-worker1014.eqiad.wmnet,wikikube-worker1017.eqiad.wmnet,wikikube-worker1018.eqiad.wmnet - T351074 [16:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:46] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:31:53] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1013.eqiad.wmnet|wikikube-worker1014.eqiad.wmnet|wikikube-worker1017.eqiad.wmnet|wikikube-worker1018.eqiad.wmnet),cluster=kubernetes,service=kubesvc [16:33:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "updated wikikube-ctrl1002 status - kamila@cumin1002 - T366204" [16:34:23] sorry I meant are errors coming from all pods of eventgate-* or just some? 
[16:35:07] !log ebernhardson@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:35:14] !log ebernhardson@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:35:46] FIRING: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:32] semi following [16:36:44] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:36:52] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad.... [16:37:27] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:37:40] RESOLVED: [2x] SystemdUnitFailed: ferm.service on kubernetes1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880841 (10cmooney) [16:39:12] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9880854 (10elukey) I had a chat with Riccardo about a possible first change that could help one of the use cases mentioned (a sort of version-0 of the final solution) could be si... 
[16:39:39] rzl: looks like it's done, I think you can proceed with the puppet window [16:39:47] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq... [16:39:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880864 (10cmooney) [16:40:17] claime: thanks! [16:40:19] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-codfw [16:40:46] FIRING: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:59] zabe: can I get back with you in 20 minutes? 
sorry, meeting conflict [16:41:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880893 (10cmooney) [16:43:45] RESOLVED: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:00] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:44:05] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-eqiad [16:44:11] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad.... 
[16:45:09] hnowlan: if the cx_destroy_local_with_active_rq metric emitted by the eventgate-side envoys is a reasonable correlate (since we can't "see" the errors on that side), then it looks like ~ all pods on both -main and -analytics [16:47:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [16:48:45] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:47] RECOVERY - Host elastic2088 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [16:51:24] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:52:27] (03CR) 10Brouberol: [C:03+2] superset: replace IP-based networkpolicy by its service counterpart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [16:52:34] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq... 
[16:53:20] (03PS1) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [16:53:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:53:25] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:27] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:53:45] RESOLVED: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:46] (03CR) 10CI reject: [V:04-1] varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [16:56:10] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [16:56:16] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [16:56:57] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:57:07] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad.... 
[16:58:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1107-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:58:45] FIRING: [11x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [16:59:41] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9880993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq... 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1700) [17:00:46] FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1107-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:03:45] RESOLVED: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9881001 (10Jclark-ctr) This server is out of warranty Will check decom servers to see if we have any suitable dimms [17:04:28] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [17:04:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [17:04:40] !log ebernhardson@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:04:46] !log ebernhardson@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:04:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9881008 (10Jclark-ctr) The system memory has 
faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1. [17:05:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9881014 (10Jclark-ctr) DIMM B1 BankLabel: B CacheSize: Information Not Available CurrentOperatingSpeed: 2400 MHz DeviceDescription: DIMM B1 DeviceType: Memory FQDD: DIM... [17:06:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881020 (10VRiley-WMF) Swapped 40Base-LR4 in port et-0/0/53. [17:08:26] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:08:35] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:08:48] rzl: okay:) [17:09:02] just getting set up now, here we go :) sorry for the delay [17:09:20] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [17:09:26] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9881034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad.... 
[17:09:30] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:09:37] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:10:43] PROBLEM - BFD status on cloudsw1-f4-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:46] FIRING: [11x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:33] thanks rzl, missed this as well [17:13:45] FIRING: [8x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:55] (03PS2) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:14:12] !log rzl@cumin2002:~$ sudo cumin 'C:profile::mediawiki::webserver' 'disable-puppet T366649' [17:14:15] (03CR) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [17:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:17] T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649 [17:14:43] RECOVERY - BFD status on cloudsw1-f4-eqiad.mgmt is OK: UP: 3 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:14:45] (03CR) 10RLazarus: [C:03+2] Add Apache configuration for u4c.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1041240 (https://phabricator.wikimedia.org/T366649) (owner: 
10Zabe) [17:15:48] cumin failed on mw1403,mw1406 which are getting kubernetized, proceeding [17:18:45] RESOLVED: [8x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:55] zabe: deployed to mwdebug1001, httpbb is running, have a look [17:20:38] "no wiki found" [17:20:41] that looks correct [17:20:46] 👍 [17:20:51] httpbb is happy too, no regression [17:21:11] let me deploy to mwdebug on k8s too out of an abundance of caution, and then I'll send it everywhere [17:23:45] FIRING: [8x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:10] (03PS2) 10Andrea Denisse: discovery: Add metafo entry for logstash [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) [17:25:46] (03CR) 10Andrea Denisse: "Thanks for taking a look, I've sent a new patch." 
[dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [17:26:39] !log rzl@deploy1002 Started scap: (no justification provided) [17:27:04] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:27:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T367209 (10ops-monitoring-bot) 03NEW [17:28:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1107-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:28:45] RESOLVED: [8x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364069)', diff saved to https://phabricator.wikimedia.org/P64638 and previous config saved to /var/cache/conftool/dbconfig/20240611-172928-marostegui.json [17:29:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:30:42] !log rzl@deploy1002 rzl: (no justification provided) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:31:10] zabe: lgty at k8s-mwdebug too? 
[17:31:28] rzl: yup [17:32:25] jouncebot: nowandnext [17:32:25] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1700) [17:32:25] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1800) [17:32:35] hello, are any deployers around? the next backport window seems a bit overfilled, i wonder if anyone might be available to deploy some of the scheduled changes now. https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T2000 [17:32:37] taavi: apache config deploy in progress, done shortly [17:33:04] rzl: thanks. I have a mw-config patch to sync out, but not sure if I have time before the train [17:33:10] httpbb against k8s-mwdebug is happy, sending it everywhere [17:33:19] !log rzl@deploy1002 rzl: Continuing with sync [17:33:23] (03PS3) 10Dduvall: admin_ng: remove blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) [17:33:37] (there's also a wmf.9 backport, we could probably save some time if it was merged before the train) [17:33:44] !log rzl@cumin2002:~$ sudo cumin 'C:profile::mediawiki::webserver' 'enable-puppet T366649' [17:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:48] T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649 [17:34:16] (03CR) 10Dduvall: "@effie@wikimedia.org rebased and ping :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors 
[17:35:46] FIRING: [7x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:57] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1041683 (https://phabricator.wikimedia.org/T367173) (owner: 10Herron) [17:36:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9881242 (10Dwisehaupt) [17:36:20] (03CR) 10Herron: [C:03+2] admin: add ebysans to group deployment [puppet] - 10https://gerrit.wikimedia.org/r/1041683 (https://phabricator.wikimedia.org/T367173) (owner: 10Herron) [17:37:25] (03PS2) 10Bking: dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) [17:37:44] !log rzl@deploy1002 Finished scap: (no justification provided) (duration: 11m 40s) [17:38:40] all set! 
sorry zabe for slipping the schedule so much [17:38:45] FIRING: [8x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:54] and not sure who's got the conch next but I'm through with it :) [17:39:09] (and I believe that's all for the MW infra window) [17:40:15] (03CR) 10Bking: [C:03+2] dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:40:30] (03CR) 10Bking: [V:03+2 C:03+2] dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:40:46] FIRING: [8x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:41:06] (03Merged) 10jenkins-bot: dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:41:28] (03PS7) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) [17:41:46] I should have enough time, so I'll deploy a config patch of mine [17:41:54] would it be possible to merge some wmf.9 changes before the train deployment, so that we can save time backporting them later? [17:42:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [17:42:37] MatmaRex: I don't think that would save time? 
[17:42:55] wmf.9 was synced to test wikis early this morning [17:43:05] oh. okay [17:43:35] (03Merged) 10jenkins-bot: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [17:43:45] RESOLVED: [8x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:03] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1038750|wikitech: Stop loading OpenStackManager (T161553 T338477 T359544)]] [17:44:10] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [17:44:10] T338477: `Nova Resource:` namespace should be declared in wmf-config, not in Extension:OpenStackManager - https://phabricator.wikimedia.org/T338477 [17:44:10] T359544: Disable SSH key management on Wikitech - https://phabricator.wikimedia.org/T359544 [17:44:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P64639 and previous config saved to /var/cache/conftool/dbconfig/20240611-174434-marostegui.json [17:45:07] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:45:12] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:46:39] (03CR) 10Bking: [C:03+2] dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:46:45] (03CR) 10CI reject: [V:04-1] dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:46:49] (03PS4) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) [17:46:58] !log taavi@deploy1002 taavi: Backport for [[gerrit:1038750|wikitech: Stop loading OpenStackManager (T161553 T338477 T359544)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:47:07] !log taavi@deploy1002 taavi: Continuing with sync [17:47:17] no way to test wikitech stuff on mwdebug :/ [17:48:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9881285 (10herron) 05Open→03Resolved a:03herron The patch to provision this access has been merged, and will be fully propagated... 
[17:48:45] FIRING: [7x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:26] (03CR) 10Bking: [V:03+2 C:03+2] dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:50:46] FIRING: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:20] (03PS1) 10Ebernhardson: cirrus: Update container image and set http user agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041730 [17:53:12] (03PS1) 10BCornwall: taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 [17:53:31] (03CR) 10CI reject: [V:04-1] taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [17:53:45] RESOLVED: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:16] (03PS2) 10BCornwall: taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 [17:56:02] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:56:03] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1038750|wikitech: Stop loading OpenStackManager (T161553 T338477 T359544)]] (duration: 12m 00s) [17:56:08] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:56:16] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [17:56:16] T338477: `Nova Resource:` namespace should be declared in wmf-config, not in Extension:OpenStackManager - https://phabricator.wikimedia.org/T338477 [17:56:16] T359544: Disable SSH key management on Wikitech - https://phabricator.wikimedia.org/T359544 [17:57:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9881345 (10Jclark-ctr) @MoritzMuehlenhoff Can i take server down to replace dimm? [17:58:37] (03PS1) 10Dzahn: admin: add Andrew Otto to approvers for analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1041735 [17:58:49] (03CR) 10CI reject: [V:04-1] admin: add Andrew Otto to approvers for analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1041735 (owner: 10Dzahn) [17:58:57] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:59:02] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:59:03] (03PS2) 10Dzahn: admin: add Andrew Otto to approvers for analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1041735 [17:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P64640 and previous config saved to /var/cache/conftool/dbconfig/20240611-175941-marostegui.json [18:00:05] brennen and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T1800). 
[18:00:23] (03PS2) 10Ebernhardson: cirrus: Update container image and set http user agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041730 (https://phabricator.wikimedia.org/T366363) [18:00:46] FIRING: [8x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:00] (03PS3) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [18:03:09] (03PS1) 10Dzahn: rename gitlab-replica-old to gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1041740 [18:03:13] (03CR) 10Scott French: [C:03+1] "Great, thanks for confirming." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [18:03:45] FIRING: [8x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:51] (03CR) 10Ottomata: [C:03+1] admin: add Andrew Otto to approvers for analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1041735 (owner: 10Dzahn) [18:03:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9881401 (10Ottomata) Thank you! 
[18:05:46] RESOLVED: [8x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:12] o/ [18:10:14] !log 1.43.0-wmf.9 train (T361403): no blockers, rolling to group0 [18:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:18] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [18:10:26] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041741 (https://phabricator.wikimedia.org/T361403) [18:10:28] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041741 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [18:11:11] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041741 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [18:13:21] (03PS1) 10Majavah: Stop loading OSM i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041742 (https://phabricator.wikimedia.org/T161553) [18:13:45] FIRING: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:48] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image and set http user agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041730 (https://phabricator.wikimedia.org/T366363) (owner: 10Ebernhardson) [18:14:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364069)', diff saved to https://phabricator.wikimedia.org/P64641 and previous config saved to 
/var/cache/conftool/dbconfig/20240611-181448-marostegui.json [18:14:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [18:14:53] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:15:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [18:15:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:15:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:15:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T364069)', diff saved to https://phabricator.wikimedia.org/P64642 and previous config saved to /var/cache/conftool/dbconfig/20240611-181526-marostegui.json [18:15:34] (03CR) 10Jbond: "lgtm comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [18:15:39] (03Merged) 10jenkins-bot: cirrus: Update container image and set http user agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041730 (https://phabricator.wikimedia.org/T366363) (owner: 10Ebernhardson) [18:15:46] RESOLVED: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:05] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 200475 MB (5% inode=92%): /srv/swift-storage/sdg1 210181 MB (5% inode=91%): /srv/swift-storage/sdc1 183702 MB (4% inode=92%): /srv/swift-storage/sdi1 200572 MB (5% inode=92%): /srv/swift-storage/sde1 205793 MB (5% inode=92%): /srv/swift-storage/sdh1 190892 MB (5% inode=91%): 
/srv/swift-storage/sdj1 206229 MB (5% inode=91%): /srv/swift-s [18:16:05] k1 183173 MB (4% inode=91%): /srv/swift-storage/sdd1 152175 MB (3% inode=90%): /srv/swift-storage/sdm1 204820 MB (5% inode=91%): /srv/swift-storage/sdl1 203993 MB (5% inode=92%): /srv/swift-storage/sdn1 183796 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [18:16:30] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [18:18:44] (03PS3) 10BCornwall: taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 [18:19:32] !log ebernhardson@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:19:41] !log ebernhardson@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:19:44] (03CR) 10BCornwall: taskgen: Ignore acme-chief certificate typos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [18:21:12] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9881520 (10Dzahn) Hello Audrey, please get one of the WMDE managers listed here: https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group to approve on this ticket. A... [18:21:53] (03CR) 10Scott French: [C:03+1] "Thanks, Janis!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1041644 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [18:22:53] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 2 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222 (10RobH) 03NEW p:05Triage→03Medium [18:22:58] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.9 refs T361403 [18:23:02] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [18:23:32] (03PS4) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) [18:23:45] FIRING: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:05] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9881535 (10Dzahn) cc: @KFrancis Audrey needs to sign an NDA as a WMDE employee. Thanks as always. [18:25:11] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9881573 (10Dzahn) 05Open→03In progress p:05Triage→03High [18:25:40] (03CR) 10Ottomata: [C:03+1] "I tested this on mwdebug2 by manually making a second .php file with the same code as index.php. It works!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:25:50] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 2 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881600 (10RobH) Accounting Report: https://netbox.wikimedia.org/extras/reports/results/5914366/ The only items for caching sites are listed in the tes... [18:26:30] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 2 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881626 (10RobH) Cables: No errors: https://netbox.wikimedia.org/extras/reports/results/5914371/ Rack: No errors: https://netbox.wikimedia.org/extras/rep... [18:26:51] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 2 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881631 (10RobH) Network: https://netbox.wikimedia.org/extras/reports/results/5914311/ : Errors but none in caching sites. Physical Hosts: https://netbo... [18:27:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9881640 (10MoritzMuehlenhoff) >>! In T367071#9881345, @Jclark-ctr wrote: > @MoritzMuehlenhoff Can i take server down to replace dimm? Yes, please! All VMs have been move... [18:27:39] (03CR) 10Ottomata: [C:03+1] "Since the code itself has previously been reviewed, I plan to merge and deploy this today or tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:28:05] 06SRE, 06Growth-Team, 10GrowthExperiments-Homepage, 07Grafana: Growth team product KPI Grafana dashboard has `update_` task type, which does not exist - https://phabricator.wikimedia.org/T362633#9881637 (10Michael) Please delete the metric `MediaWiki.jawiki.GrowthExperiments.NewcomerTask.update_.*` from Gr... 
[18:28:45] RESOLVED: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881669 (10cmooney) p:05Medium→03Low Thanks for the help with this @VRiley-WMF. The link has now b... [18:30:47] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 3 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881690 (10RobH) [18:31:07] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 3 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881694 (10RobH) [18:31:15] 10ops-drmrs, 10ops-eqsin, 10ops-esams, 10ops-magru, and 3 others: 2024-06-11 caching site netbox report sweep - https://phabricator.wikimedia.org/T367222#9881696 (10RobH) 05Open→03Resolved [18:33:45] FIRING: [7x] ProbeDown: Service aqs1017-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:46] FIRING: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:36:22] (03CR) 10Scott French: [C:03+1] "Interesting, so those failures to render don't fail CI overall (i.e., some kind of mystery build timeout in this case)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [18:37:46] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:38:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64643 and previous config saved to /var/cache/conftool/dbconfig/20240611-183841-ladsgroup.json [18:38:45] RESOLVED: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:40:44] (03CR) 10Dzahn: "for wikistats I don't mind either way. for "simplelamp2" I do think that something small is lost, it does explain better what this role do" [puppet] - 10https://gerrit.wikimedia.org/r/1040123 (owner: 10Muehlenhoff) [18:41:05] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:41:41] (03CR) 10Dzahn: [C:03+2] gitlab: use IPv4 and IPv6 for SSH check [puppet] - 10https://gerrit.wikimedia.org/r/1041639 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [18:44:15] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:45:46] FIRING: [8x] ProbeDown: Service aqs1018-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:46:03] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:46:17] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Enhancement: view status of all 
running cookbooks on demand - https://phabricator.wikimedia.org/T367210#9881758 (10Volans) p:05Triage→03Medium Locally I have a 90% done draft of a list locks cookbook I started a while ago that will show all the exis... [18:47:01] (03PS1) 10Ahmon Dancy: logstash_checker.py: Add --time option [puppet] - 10https://gerrit.wikimedia.org/r/1041746 [18:47:10] (03CR) 10Snwachukwu: [C:03+1] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:47:29] (03CR) 10CI reject: [V:04-1] logstash_checker.py: Add --time option [puppet] - 10https://gerrit.wikimedia.org/r/1041746 (owner: 10Ahmon Dancy) [18:48:34] (03CR) 10Snwachukwu: [C:03+1] "Thanks! I have checked CI and it looks good to me. I think we can merge" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:48:45] FIRING: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:02] (03CR) 10Dzahn: [C:03+1] "https://phabricator.wikimedia.org/T367021#9880167" [puppet] - 10https://gerrit.wikimedia.org/r/1041636 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [18:49:22] (03PS2) 10Ahmon Dancy: logstash_checker.py: Add --time option [puppet] - 10https://gerrit.wikimedia.org/r/1041746 [18:49:50] (03PS1) 10Jdlrobson: Disable quick surveys using deprecated configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) [18:50:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for ebysans - https://phabricator.wikimedia.org/T367173#9881782 (10Snwachukwu) Thanks! 
[18:50:46] RESOLVED: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:16] (03PS2) 10Jdlrobson: Disable quick surveys using deprecated configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) [18:51:33] (03CR) 10Jdlrobson: "Jason: could you check I haven't broken your surveys in any way?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [18:51:40] (03PS2) 10Dzahn: move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 [18:53:05] (03CR) 10Ahmon Dancy: "How to test:" [puppet] - 10https://gerrit.wikimedia.org/r/1041746 (owner: 10Ahmon Dancy) [18:53:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64644 and previous config saved to /var/cache/conftool/dbconfig/20240611-185348-ladsgroup.json [18:54:15] (03PS3) 10Dzahn: move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 [18:58:36] (03CR) 10BCornwall: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:58:45] FIRING: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:54] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-eqiad [19:00:46] FIRING: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:01:48] (03CR) 10Dzahn: "not a netbox change since the IPs are just described there as service IPs on a certain machine, see https://netbox.wikimedia.org/search/?q" [dns] - 10https://gerrit.wikimedia.org/r/1041740 (owner: 10Dzahn) [19:02:28] (03CR) 10Dzahn: "the puppet part would be:" [dns] - 10https://gerrit.wikimedia.org/r/1041740 (owner: 10Dzahn) [19:03:45] RESOLVED: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:02] (03PS1) 10Dzahn: acme_chief: add replica-a and replica-b to gitlab cert names [puppet] - 10https://gerrit.wikimedia.org/r/1041749 [19:07:06] (03CR) 10JHathaway: [C:03+1] taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [19:08:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64645 and previous config saved to /var/cache/conftool/dbconfig/20240611-190855-ladsgroup.json [19:09:55] (03PS1) 10Dzahn: acme_chief/gitlab: remove "old" and "new" 
service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 [19:11:05] (03CR) 10Dzahn: [C:03+1] "lol, yea, typo domains.. facepalm" [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [19:12:16] (03PS2) 10Dzahn: acme_chief/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 [19:14:00] (03PS1) 10Dzahn: gitlab: change service name on gitlab1003 to gitlab-replica-b [puppet] - 10https://gerrit.wikimedia.org/r/1041751 [19:19:45] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [19:20:07] (03PS1) 10Dwisehaupt: frack: Enable check_audit_downloads check [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) [19:20:21] 10SRE-tools, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230 (10taavi) 03NEW [19:20:23] 10SRE-tools, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367231 (10taavi) 03NEW [19:20:46] FIRING: ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:53] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367231#9881913 (10taavi) →14Duplicate dup:03T367230 [19:20:54] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9881916 (10taavi) [19:23:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [19:23:39] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9881923 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq... [19:24:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64646 and previous config saved to /var/cache/conftool/dbconfig/20240611-192403-ladsgroup.json [19:24:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:29:13] (03PS2) 10Pppery: MediaWiki.org: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041249 (https://phabricator.wikimedia.org/T366994) [19:29:20] (03PS4) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) [19:29:47] (03PS2) 10Pppery: [ptwikinews] Set atom feed link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038901 (https://phabricator.wikimedia.org/T356003) [19:29:52] (03PS2) 10Pppery: [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189) [19:30:17] jouncebot: next [19:30:17] In 0 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T2000) [19:30:24] (03CR) 10CDanis: [C:03+2] enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [19:30:59] there's 8 patches listed. 
if anyone feels like starting the deployments early… [19:31:02] (03PS2) 10Dwisehaupt: frack: Enable check_audit_downloads check [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) [19:33:34] T367229 [19:33:35] T367229: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229 [19:33:39] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye [19:33:55] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9881979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad.... [19:34:23] (03CR) 10Jgreen: [C:03+1] frack: Enable check_audit_downloads check [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) (owner: 10Dwisehaupt) [19:37:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:38:33] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002 [19:38:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1002 [19:43:45] (03PS1) 10JHathaway: mediawiki: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1041758 (https://phabricator.wikimedia.org/T365395) [19:44:28] (03CR) 10Krinkle: [POC] Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 
(https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [19:44:58] (03CR) 10Jdlrobson: CommonSettings: Restore the original behaviour of Reference Previews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: 10Func) [19:46:47] (03PS9) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [19:47:01] (03PS10) 10Jdlrobson: Drop unused config, enable responsive tables on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [19:47:05] (03PS11) 10Jdlrobson: Drop unused config, enable responsive tables on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [19:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [19:51:57] (03PS1) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:57:03] (03PS1) 10CDanis: puppetserver syncs: also add monitoring + timeout [puppet] - 10https://gerrit.wikimedia.org/r/1041760 (https://phabricator.wikimedia.org/T367113) [19:57:16] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041760 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [19:59:53] (03PS1) 10JHathaway: mw: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041763 
(https://phabricator.wikimedia.org/T365395) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T2000). [20:00:05] Pppery, jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] Here [20:01:02] hi [20:01:06] that's a busy window [20:01:13] I think the max is six [20:01:27] Yeah, but ScheduleDeploymentBot didn't know that so happily added it up to 8 [20:01:43] T367229 [20:01:44] T367229: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229 [20:01:52] (03CR) 10Ladsgroup: [C:03+2] MediaWiki.org: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041249 (https://phabricator.wikimedia.org/T366994) (owner: 10Pppery) [20:02:13] I'm here [20:02:27] (03Merged) 10jenkins-bot: MediaWiki.org: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041249 (https://phabricator.wikimedia.org/T366994) (owner: 10Pppery) [20:03:29] just the scap time will exceed 60 minutes [20:03:38] we can go over the window though [20:03:41] i've been asking for someone to volunteer to deploy a few of the patches early for the past couple of hours, alas no one did [20:03:43] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041249|MediaWiki.org: restrict unfuzzy rights to autoconfirmed (T366994)]] [20:03:48] T366994: Restrict unfuzzy rights on MediaWiki.org - https://phabricator.wikimedia.org/T366994 [20:06:17] !log ladsgroup@deploy1002 ladsgroup, pppery: Backport for [[gerrit:1041249|MediaWiki.org: restrict unfuzzy rights to autoconfirmed (T366994)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:25] Pppery: mwdebug [20:06:51] Checked Special:ListGroupRights, looks good [20:07:19] (03CR) 10CDanis: [C:03+2] puppetserver syncs: also add monitoring + timeout [puppet] - 10https://gerrit.wikimedia.org/r/1041760 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [20:07:34] !log ladsgroup@deploy1002 ladsgroup, pppery: Continuing with sync [20:08:20] (03PS5) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) [20:09:40] (03CR) 10Ladsgroup: [C:03+2] [ptwikinews] Set atom feed link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038901 (https://phabricator.wikimedia.org/T356003) (owner: 10Pppery) [20:10:16] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1041311 and https://gerrit.wikimedia.org/r/c/1031459/ derisks this week's train deploy by allowing more QA testing beforehand so pretty critical for this window. [20:10:30] Amir1: just an FYI ^ [20:10:58] I will try to get it out [20:11:03] (03Merged) 10jenkins-bot: [ptwikinews] Set atom feed link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038901 (https://phabricator.wikimedia.org/T356003) (owner: 10Pppery) [20:11:07] but is that okay for you to stay longer? 
[20:11:25] My remaining config patches are all not especially urgent, other than trying to avoid the experience where you file a site request and it gets no response for weeks, so feel free to do Jon's change ahead of them [20:11:43] (03CR) 10JHathaway: "kindly review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041763 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [20:11:57] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1041758 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [20:12:34] Amir1: yeh fine for me [20:13:35] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-codfw [20:14:31] (03PS3) 10Pppery: [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189) [20:14:34] (03CR) 10Ladsgroup: [C:03+2] [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189) (owner: 10Pppery) [20:14:50] I'm going to batch deploy two of them that look quite innocent [20:15:08] Re the jawikinews patch, someone probably needs to run updateArticleCount.php for it to have any effect [20:15:15] (03Merged) 10jenkins-bot: [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189) (owner: 10Pppery) [20:15:29] But that runs by itself periodically IIRC [20:16:17] yeah, weekly [20:16:37] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041249|MediaWiki.org: restrict unfuzzy rights to autoconfirmed (T366994)]] (duration: 12m 54s) [20:16:41] T366994: Restrict unfuzzy rights on MediaWiki.org - https://phabricator.wikimedia.org/T366994 [20:16:53] All that means is that I won't be able to test the effect of it on mwdebug [20:17:16] yeah, that's fine [20:17:21] (03PS1) 10Dzahn: gitlab: rename 
gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 [20:17:28] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1038901|[ptwikinews] Set atom feed link (T356003)]], [[gerrit:1038897|[jawikinews] Set $wgArticleCountMethod to any (T364189)]] [20:17:34] T356003: change atom feed link on ptwikinews - https://phabricator.wikimedia.org/T356003 [20:17:35] T364189: Number of articles on Statistics page of jawikinews should include unlinked articles - https://phabricator.wikimedia.org/T364189 [20:18:27] (03CR) 10Ebernhardson: [C:03+1] "Looks good, ready to ship in one of the backport windows." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [20:18:38] (03PS1) 10Dzahn: idp/gitlab: add gitlab-replica-a and -b to regex [puppet] - 10https://gerrit.wikimedia.org/r/1041768 [20:18:45] FIRING: [5x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:19] (03PS12) 10Jdlrobson: Drop unused config, enable responsive tables on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [20:19:21] (03CR) 10Ladsgroup: [C:03+2] Drop unused config, enable responsive tables on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:19:46] (03CR) 10Ladsgroup: [C:03+2] Avoid wrapping floated tables using computed styles [skins/Vector] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041311 (https://phabricator.wikimedia.org/T366314) (owner: 10Jdlrobson) [20:20:01] !log ladsgroup@deploy1002 pppery, ladsgroup: Backport for [[gerrit:1038901|[ptwikinews] Set atom feed link (T356003)]], [[gerrit:1038897|[jawikinews] Set $wgArticleCountMethod to any (T364189)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:08] Pppery: test server [20:20:10] (03PS3) 10Dzahn: acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 [20:20:19] (03Merged) 10jenkins-bot: Drop unused config, enable responsive tables on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:21:05] ptwikinews patch looks good [20:21:32] !log ladsgroup@deploy1002 pppery, ladsgroup: Continuing with sync [20:21:39] let's go [20:23:14] Per the puppet code it looks like UpdateArticleCount runs on the 21st of every month [20:23:45] FIRING: [5x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:24:28] ah, okay. [20:28:45] FIRING: [9x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:08] (03PS1) 10Ladsgroup: Stop writing to the old pagelinks columns of s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041769 (https://phabricator.wikimedia.org/T352010) [20:30:20] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038901|[ptwikinews] Set atom feed link (T356003)]], [[gerrit:1038897|[jawikinews] Set $wgArticleCountMethod to any (T364189)]] (duration: 12m 52s) [20:30:25] T356003: change atom feed link on ptwikinews - https://phabricator.wikimedia.org/T356003 [20:30:26] T364189: Number of articles on Statistics page of jawikinews should include unlinked articles - https://phabricator.wikimedia.org/T364189 [20:30:27] (03PS4) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 
(https://phabricator.wikimedia.org/T354718) [20:31:38] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1031459|Drop unused config, enable responsive tables on group 0 (T301212 T366314)]] [20:31:44] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:31:44] T366314: Deploy and QA responsive tables change - https://phabricator.wikimedia.org/T366314 [20:33:45] FIRING: [9x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:14] !log ladsgroup@deploy1002 ladsgroup, jdlrobson: Backport for [[gerrit:1031459|Drop unused config, enable responsive tables on group 0 (T301212 T366314)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:32] Jdlrobson: on debug servers [20:35:32] Amir1: on it [20:36:50] Amir1: LGTM please sync [20:36:55] !log ladsgroup@deploy1002 ladsgroup, jdlrobson: Continuing with sync [20:37:01] syncing [20:37:54] (03PS4) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) [20:39:45] (03CR) 10Dzahn: [V:03+1 C:03+1] "looks good to me! 
https://puppet-compiler.wmflabs.org/output/1041232/2892/" [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [20:40:46] FIRING: [9x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:15] PROBLEM - Check whether ferm is active by checking the default input chain on mw2369 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:41:45] (03CR) 10Scott French: "Thanks, Janis!" [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [20:43:44] jouncebot: next [20:43:44] In 9 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T0600) [20:43:51] cool [20:44:01] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:44:08] (03Merged) 10jenkins-bot: Avoid wrapping floated tables using computed styles [skins/Vector] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041311 (https://phabricator.wikimedia.org/T366314) (owner: 10Jdlrobson) [20:45:44] (03CR) 10Scott French: "Thank you both for the reviews. Unless there are any objections, I'll plan to merge / apply this at some point tomorrow." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:45:46] FIRING: [9x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:56] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1031459|Drop unused config, enable responsive tables on group 0 (T301212 T366314)]] (duration: 14m 18s) [20:46:06] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:46:06] T366314: Deploy and QA responsive tables change - https://phabricator.wikimedia.org/T366314 [20:46:45] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041311|Avoid wrapping floated tables using computed styles (T366314)]] [20:48:46] (03PS3) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) [20:49:22] !log ladsgroup@deploy1002 jdlrobson, ladsgroup: Backport for [[gerrit:1041311|Avoid wrapping floated tables using computed styles (T366314)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:49:38] Jdlrobson: the backport is live in test servers [20:50:25] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9882205 (10Dzahn) I sent an email to the current admins and asked them to clarify. [20:50:40] Amir1: on it [20:50:52] MatmaRex: are you around for the https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1041297 deploy? 
[20:50:55] (03PS1) 10BryanDavis: [DNM] Testing things in Gerrit UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 [20:51:17] Amir1: yeah [20:51:22] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9882207 (10Volans) @taavi technically it already can, taking advantage of filesystem autocompletion ;). As specified in the `cookbook -h` help message, the cookbook "name" can... [20:52:13] (03CR) 10Ladsgroup: [C:04-1] "That really can't go in with backports. It requires rebuilding the i18n cache, which takes a long time (last time I did it, it took an hour)," [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040139 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [20:52:25] Amir1 please sync! [20:52:30] !log ladsgroup@deploy1002 jdlrobson, ladsgroup: Continuing with sync [20:52:52] (03CR) 10Ladsgroup: [C:03+2] Fix Linker::makeExternalLink build failures [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041297 (https://phabricator.wikimedia.org/T367127) (owner: 10Bartosz Dziewoński) [20:53:40] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:45] FIRING: [9x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:22] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9882233 (10Dzahn) For reasons unrelated to the discussion so far I also noticed these large files
on people* hosts a... [20:55:46] FIRING: [9x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:45] FIRING: [9x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:12] (03CR) 10Dwisehaupt: "This is clear to go out at any time. Jeff and I can't deploy it. Thanks for the review and deploy." [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) (owner: 10Dwisehaupt) [21:01:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041311|Avoid wrapping floated tables using computed styles (T366314)]] (duration: 14m 28s) [21:01:19] T366314: Deploy and QA responsive tables change - https://phabricator.wikimedia.org/T366314 [21:01:59] (03CR) 10Ladsgroup: [C:03+2] Stop writing to the old pagelinks columns of s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041769 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [21:02:13] I'll deploy this in the meantime [21:02:18] Jdlrobson: your patches are out [21:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041769 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [21:02:27] (03CR) 10Dzahn: [C:03+2] frack: Enable check_audit_downloads check [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) (owner: 10Dwisehaupt) [21:02:33] Is my zghwiki patch going to get done? [21:02:37] Amir1: cool thanks for all your help!
[21:02:39] (03Merged) 10jenkins-bot: Stop writing to the old pagelinks columns of s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041769 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [21:03:10] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041769|Stop writing to the old pagelinks columns of s2 (T352010)]] [21:03:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:03:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9882249 (10RobH) [21:03:45] FIRING: [9x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:47] (03PS2) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [21:04:05] Pppery: that one has to stay for now, is it urgent? [21:04:33] No. [21:04:43] But I wanted to know whether I still needed to be here waiting for it to happen [21:05:03] Pppery: On a second thought, if you're okay with staying, I can deploy it after MatmaRex's patch [21:05:13] if you're not, that's totally fine [21:05:29] I can wait. [21:05:46] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1041769|Stop writing to the old pagelinks columns of s2 (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882267 (10VRiley-WMF) You're welcome @cmooney We do have spares if they are needed in the future. Clos... 
[21:06:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882268 (10VRiley-WMF) 05Open→03Resolved [21:06:32] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [21:07:00] thanks [21:08:45] FIRING: [9x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9882280 (10Jclark-ctr) @MoritzMuehlenhoff Replaced Dimm. looks like image is corrupt and might need to be reimaged. Also updated idrac firmware /bios T367075 was aut... [21:09:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9882292 (10Jclark-ctr) failed drive was replaced also [21:11:37] (03CR) 10Ladsgroup: [C:03+2] [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery) [21:11:49] selenium had a random failure [21:12:22] (03Merged) 10jenkins-bot: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery) [21:14:25] i'm just chillin, doing something else in the background [21:15:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041769|Stop writing to the old pagelinks columns of s2 (T352010)]] (duration: 12m 02s) [21:15:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:15:46] FIRING: [9x] ProbeDown: Service aqs2005-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:03] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1038899|[zghwiki] Add patroller and autopatrolled groups (T357411)]] [21:16:12] T357411: User access levels on zgh wikipedia - https://phabricator.wikimedia.org/T357411 [21:16:17] (03CR) 10CI reject: [V:04-1] Fix Linker::makeExternalLink build failures [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041297 (https://phabricator.wikimedia.org/T367127) (owner: 10Bartosz Dziewoński) [21:16:55] (03CR) 10Ladsgroup: [C:03+2] "try again" [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041297 (https://phabricator.wikimedia.org/T367127) (owner: 10Bartosz Dziewoński) [21:18:38] !log ladsgroup@deploy1002 pppery, ladsgroup: Backport for [[gerrit:1038899|[zghwiki] Add patroller and autopatrolled groups (T357411)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:44] btw, that patch doesn't do anything visible/testable, but it will avoid build failures when backporting other things to wmf.9 [21:18:55] !log ladsgroup@deploy1002 pppery, ladsgroup: Continuing with sync [21:19:09] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:19:39] PROBLEM - SSH on ganeti1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:19:40] Belatedly confirming the zghwiki patch looks good [21:20:46] FIRING: [9x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9882394 
(10Jclark-ctr) @MoritzMuehlenhoff after replacing failed drive looked like it might boot but still fails. Might need to be reimaged I do not have root access... [21:27:57] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038899|[zghwiki] Add patroller and autopatrolled groups (T357411)]] (duration: 11m 53s) [21:28:01] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 11.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:28:01] T357411: User access levels on zgh wikipedia - https://phabricator.wikimedia.org/T357411 [21:28:11] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:28:27] (03CR) 10Dzahn: [C:03+2] "deployed on alert1001 (icinga). the new services appear here in web UI: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_stri" [puppet] - 10https://gerrit.wikimedia.org/r/1041754 (https://phabricator.wikimedia.org/T365466) (owner: 10Dwisehaupt) [21:28:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041698 (owner: 10Ladsgroup) [21:28:45] FIRING: [9x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:40] (03Merged) 10jenkins-bot: Reduce the threshold for section wide circuit breaking to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041698 (owner: 10Ladsgroup) [21:29:53] (03PS3) 10Cwhite: discovery: Add metafo entry for logstash [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:30:08] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041698|Reduce the 
threshold for section wide circuit breaking to 300]] [21:30:46] FIRING: [9x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:19] (03CR) 10Cwhite: [C:03+1] "Looks good!" [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:32:44] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1041698|Reduce the threshold for section wide circuit breaking to 300]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:33:01] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [21:33:45] FIRING: [9x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:54] (03CR) 10Cwhite: [C:03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1041155 (https://phabricator.wikimedia.org/T366308) (owner: 10Filippo Giunchedi) [21:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:36:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw2314 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:37:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw2301 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:38:45] FIRING: [9x] 
ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:38] (03PS1) 10Eevans: data-gateway: Upgrade to v1.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041784 [21:40:48] (03CR) 10BCornwall: [C:03+2] taskgen: Ignore acme-chief certificate typos [puppet] - 10https://gerrit.wikimedia.org/r/1041731 (owner: 10BCornwall) [21:41:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw2369 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:42:10] (03PS2) 10Ncmonitor: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039849 [21:42:16] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041698|Reduce the threshold for section wide circuit breaking to 300]] (duration: 12m 08s) [21:43:08] (03Merged) 10jenkins-bot: Fix Linker::makeExternalLink build failures [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041297 (https://phabricator.wikimedia.org/T367127) (owner: 10Bartosz Dziewoński) [21:43:45] FIRING: [9x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:44:09] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041297|Fix Linker::makeExternalLink build failures (T367127)]] [21:44:14] T367127: CI reports possible XSS vulnerability in SecurePoll - https://phabricator.wikimedia.org/T367127 [21:47:32] !log ladsgroup@deploy1002 matmarex, ladsgroup: Backport for [[gerrit:1041297|Fix Linker::makeExternalLink build failures (T367127)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:47:37] !log ladsgroup@deploy1002 
matmarex, ladsgroup: Continuing with sync [21:48:35] (03CR) 10Cwhite: [C:03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [21:50:21] (03PS2) 10Cwhite: traffic: Route logstash.w.o to logstash.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1039887 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:50:31] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1039887 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:50:46] FIRING: [6x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1403 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:52:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw2378 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:53:45] FIRING: [9x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:55:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate 
[21:55:46] FIRING: [9x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:43] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041297|Fix Linker::makeExternalLink build failures (T367127)]] (duration: 12m 33s) [21:56:47] T367127: CI reports possible XSS vulnerability in SecurePoll - https://phabricator.wikimedia.org/T367127 [22:00:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [22:03:45] FIRING: [9x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:06:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/User:BryanDavis/Sandbox/D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 (owner: 10BryanDavis) [22:06:10] (03CR) 10ScheduleDeploymentBot: [DNM] Testing things in Gerrit UI (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 (owner: 10BryanDavis) [22:06:55] RECOVERY - Check whether ferm is active by checking the default input chain on mw2314 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:07:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw2301 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:08:45] FIRING: [9x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:45] FIRING: [11x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:15:46] FIRING: [9x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:43] (03CR) 10Scott French: [C:03+1] data-gateway: Upgrade to v1.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041784 (owner: 10Eevans) [22:18:45] FIRING: [9x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:57] RECOVERY - Check whether ferm is active by checking the default input chain on mw1403 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:22:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw2378 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:25:46] FIRING: [10x] ProbeDown: Service aqs2010-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:38] !log eevans@cumin1002 END (PASS) - Cookbook 
sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-codfw [22:30:46] FIRING: [9x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:33:45] FIRING: [9x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:00] (03CR) 10EoghanGaffney: [C:03+1] gitlab: use IPv4 and IPv6 for SSH check [puppet] - 10https://gerrit.wikimedia.org/r/1041639 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [22:48:56] (03CR) 10EoghanGaffney: [C:03+1] "Initially I was wondering should we keep the old names around as SNIs but people shouldn't be using them anyway!" [puppet] - 10https://gerrit.wikimedia.org/r/1041750 (owner: 10Dzahn) [22:50:11] (03CR) 10EoghanGaffney: [C:03+1] "lg, but what about" [dns] - 10https://gerrit.wikimedia.org/r/1041740 (owner: 10Dzahn) [22:56:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [22:56:24] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [23:01:45] (03CR) 10EoghanGaffney: [C:03+1] "This can be ignored ^" [dns] - 10https://gerrit.wikimedia.org/r/1041740 (owner: 10Dzahn) [23:09:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:10:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:11:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 3.627 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:37:30] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1041790 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1041790 (owner: 10TrainBranchBot) [23:41:49] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 189365 MB (4% inode=91%): /srv/swift-storage/sdg1 201751 MB (5% inode=92%): /srv/swift-storage/sdc1 151940 MB (3% inode=91%): /srv/swift-storage/sdh1 188467 MB (4% inode=91%): /srv/swift-storage/sde1 175874 MB (4% inode=92%): /srv/swift-storage/sdd1 154945 MB (4% inode=90%): /srv/swift-storage/sdj1 204472 MB (5% inode=92%): /srv/swift-s [23:41:49] k1 170228 MB (4% inode=91%): /srv/swift-storage/sdi1 186086 MB (4% inode=91%): /srv/swift-storage/sdl1 198371 MB (5% inode=92%): /srv/swift-storage/sdn1 195267 MB (5% inode=92%): /srv/swift-storage/sdm1 221008 MB (5% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [23:43:28] (03CR) 10Eevans: [C:03+2] data-gateway: Upgrade to v1.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041784 (owner: 10Eevans) [23:44:21] (03Merged) 10jenkins-bot: data-gateway: Upgrade to v1.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041784 (owner: 10Eevans) [23:45:10] !log eevans@deploy1002 helmfile [staging] START 
helmfile.d/services/data-gateway: apply [23:45:23] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [23:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [23:50:46] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9882614 (10Eevans) >>! In T362033#9814462, @Papaul wrote: > @Eevans like you mentioned on IRC "it's the same slot(s) that are having issues" I think we need to replace the main board and see. We... [23:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable