[00:02:38] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:05:43] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:10] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c1-codfw.mgmt.codfw.wmnet
[00:06:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:07:38] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:08:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c1-codfw - pt1979@cumin2002"
[00:09:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c1-codfw - pt1979@cumin2002"
[00:09:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:14:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[00:14:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[00:19:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[00:19:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[00:29:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364299)', diff saved to https://phabricator.wikimedia.org/P63613 and previous config saved to /var/cache/conftool/dbconfig/20240530-002930-marostegui.json
[00:29:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[00:35:41] (CR) Dzahn: "no changes in prod: https://puppet-compiler.wmflabs.org/output/1036771/2684/" [puppet] - https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[00:36:58] (CR) Dzahn: "but maybe it can't absent the class either if dest_host isn't set: https://puppet-compiler.wmflabs.org/output/1036771/2685/gerrit-bullsey" [puppet] - https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[00:37:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[00:37:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:38:24] (CR) Dzahn: [C:-1] "the problem is that this code isn't adjustable to cloud: $gerrit_replica_hosts = wmflib::role::hosts('gerrit').filter |$gerrit_host| { $ge" [puppet] - https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[00:41:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c1-codfw.mgmt.codfw.wmnet
[00:42:08] (CR) Dzahn: [gerrit] Add rsync job for lfs sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987135
(https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney)
[00:44:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P63614 and previous config saved to /var/cache/conftool/dbconfig/20240530-004438-marostegui.json
[00:50:41] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c2-codfw.mgmt.codfw.wmnet
[00:50:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:52:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c2-codfw - pt1979@cumin2002"
[00:53:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c2-codfw - pt1979@cumin2002"
[00:53:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:54:28] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c3-codfw.mgmt.codfw.wmnet
[00:54:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:56:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c3-codfw - pt1979@cumin2002"
[00:57:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c3-codfw - pt1979@cumin2002"
[00:57:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:59:44] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9844770 (Papaul)
[00:59:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to
https://phabricator.wikimedia.org/P63615 and previous config saved to /var/cache/conftool/dbconfig/20240530-005946-marostegui.json
[01:03:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P63616 and previous config saved to /var/cache/conftool/dbconfig/20240530-010302-ladsgroup.json
[01:03:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:06:10] (PS4) David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228)
[01:14:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364299)', diff saved to https://phabricator.wikimedia.org/P63617 and previous config saved to /var/cache/conftool/dbconfig/20240530-011454-marostegui.json
[01:14:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[01:15:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[01:15:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[01:15:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T364299)', diff saved to https://phabricator.wikimedia.org/P63618 and previous config saved to /var/cache/conftool/dbconfig/20240530-011518-marostegui.json
[01:18:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P63619 and previous config saved to /var/cache/conftool/dbconfig/20240530-011810-ladsgroup.json
[01:24:37] (PS5) David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228)
[01:28:54] !log pt1979@cumin2002 END
(PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c3-codfw.mgmt.codfw.wmnet
[01:33:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P63620 and previous config saved to /var/cache/conftool/dbconfig/20240530-013319-ladsgroup.json
[01:39:15] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c4-codfw.mgmt.codfw.wmnet
[01:39:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[01:45:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:47:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[01:47:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[01:47:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T364069)', diff saved to https://phabricator.wikimedia.org/P63621 and previous config saved to /var/cache/conftool/dbconfig/20240530-014725-marostegui.json
[01:47:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[01:48:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P63622 and previous config saved to /var/cache/conftool/dbconfig/20240530-014827-ladsgroup.json
[01:48:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[01:48:33] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:48:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[01:48:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P63623 and previous config saved to /var/cache/conftool/dbconfig/20240530-014850-ladsgroup.json
[01:55:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c4-codfw - pt1979@cumin2002"
[01:56:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c4-codfw - pt1979@cumin2002"
[01:56:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:59:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[01:59:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:14:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364299)', diff saved to https://phabricator.wikimedia.org/P63624 and previous config saved to /var/cache/conftool/dbconfig/20240530-021430-marostegui.json
[02:14:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[02:27:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c4-codfw.mgmt.codfw.wmnet
[02:29:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P63625 and previous config saved to /var/cache/conftool/dbconfig/20240530-022938-marostegui.json
[02:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P63626 and previous config saved to /var/cache/conftool/dbconfig/20240530-024447-marostegui.json
[02:47:25] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9844864 (Papaul)
[02:55:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364299)', diff saved to https://phabricator.wikimedia.org/P63627 and previous config saved to /var/cache/conftool/dbconfig/20240530-025955-marostegui.json
[03:00:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:03:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:44:36] SRE, Traffic-Icebox, Wiki-Setup, Wikimedia-Apache-configuration, Patch-Needs-Improvement: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#9844895 (Pppery)
[03:44:42] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-Needs-Improvement, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#9844896 (Pppery)
[03:46:46] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:47:34] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-Needs-Improvement, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#9844898 (Pppery)
[03:51:44] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[04:06:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:08:46] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:08:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[04:09:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:11:08] !log @deploy1002 helmfile [codfw] START
helmfile.d/services/cirrus-streaming-updater: apply
[04:11:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:13:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[04:13:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:20:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[04:20:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:23:44] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:30:44] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 56 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:42:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:42:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:42:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T364541
[04:42:46] T364541: Switchover s8 master (db1209 -> db1192) - https://phabricator.wikimedia.org/T364541
[04:42:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1192 with weight 0 T364541', diff saved to https://phabricator.wikimedia.org/P63628 and previous config saved to /var/cache/conftool/dbconfig/20240530-044249-root.json
[04:43:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T364541
[04:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1192 from API/vslow/dump T364541', diff saved to https://phabricator.wikimedia.org/P63629 and previous config saved to /var/cache/conftool/dbconfig/20240530-044328-root.json
[04:44:26] (PS5) Marostegui: mariadb: Promote db1192 to master [puppet] - https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541)
[04:45:02] (CR) Marostegui: [C:+2] mariadb: Promote db1192 to master [puppet] - https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) (owner: Marostegui)
[04:51:11] (PS1) Marostegui: db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037211
[04:52:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 725 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:54:38] (PS1) Marostegui: db1209: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/1037212 (https://phabricator.wikimedia.org/T363792)
[04:56:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[04:56:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:57:04] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 74 probes of 725 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:02:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[05:09:44] !log Starting s8 eqiad failover from db1209 to db1192 - T364541
[05:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:49]
T364541: Switchover s8 master (db1209 -> db1192) - https://phabricator.wikimedia.org/T364541
[05:10:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T364541', diff saved to https://phabricator.wikimedia.org/P63630 and previous config saved to /var/cache/conftool/dbconfig/20240530-051012-marostegui.json
[05:10:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1192 to s8 primary and set section read-write T364541', diff saved to https://phabricator.wikimedia.org/P63631 and previous config saved to /var/cache/conftool/dbconfig/20240530-051031-marostegui.json
[05:11:10] (PS3) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541)
[05:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1209 T364541', diff saved to https://phabricator.wikimedia.org/P63632 and previous config saved to /var/cache/conftool/dbconfig/20240530-051132-root.json
[05:11:56] (CR) Marostegui: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541) (owner: Gerrit maintenance bot)
[05:11:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[05:12:00] (CR) Marostegui: [V:+2 C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541) (owner: Gerrit maintenance bot)
[05:12:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[05:12:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T366123)', diff saved to https://phabricator.wikimedia.org/P63633 and previous config saved to /var/cache/conftool/dbconfig/20240530-051220-marostegui.json
[05:12:25] T366123: Drop gu_salt from globaluser in WMF prod
- https://phabricator.wikimedia.org/T366123
[05:13:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[05:14:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T366123)', diff saved to https://phabricator.wikimedia.org/P63634 and previous config saved to /var/cache/conftool/dbconfig/20240530-051433-marostegui.json
[05:15:49] (CR) Marostegui: [C:+2] db1209: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/1037212 (https://phabricator.wikimedia.org/T363792) (owner: Marostegui)
[05:16:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[05:16:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:17:43] !log Deploy schema changes on old s8 eqiad master (db1209) dbmaint T355609 T356166
[05:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:17:50] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[05:19:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[05:19:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[05:20:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T364299)', diff saved to https://phabricator.wikimedia.org/P63635 and previous config saved to /var/cache/conftool/dbconfig/20240530-052006-marostegui.json
[05:20:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:27:49] (PS2) Marostegui: db1*: Remove puppet 7 entries [puppet] -
https://gerrit.wikimedia.org/r/1037211
[05:27:50] (PS1) Marostegui: db1209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1037213
[05:28:29] (CR) Marostegui: [C:+2] db1209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1037213 (owner: Marostegui)
[05:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63636 and previous config saved to /var/cache/conftool/dbconfig/20240530-052941-marostegui.json
[05:35:27] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1037211 (owner: Marostegui)
[05:38:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[05:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63638 and previous config saved to /var/cache/conftool/dbconfig/20240530-054451-marostegui.json
[05:54:48] (CR) Abijeet Patro: [C:+1] Add Phabricator antivandalism extension to Phabricator translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036726 (https://phabricator.wikimedia.org/T365858) (owner: Pppery)
[05:56:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[05:56:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:59:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T366123)', diff saved to
https://phabricator.wikimedia.org/P63639 and previous config saved to /var/cache/conftool/dbconfig/20240530-055959-marostegui.json
[06:00:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T0600)
[06:00:04] marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T0600).
[06:00:10] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[06:00:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[06:00:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T366123)', diff saved to https://phabricator.wikimedia.org/P63640 and previous config saved to /var/cache/conftool/dbconfig/20240530-060023-marostegui.json
[06:18:48] (CR) Slyngshede: [C:+1] Remove skel files for former WMF staff members [puppet] - https://gerrit.wikimedia.org/r/1037064 (owner: Muehlenhoff)
[06:18:59] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1037064 (owner: Muehlenhoff)
[06:19:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:19:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:30:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T366123)', diff saved to https://phabricator.wikimedia.org/P63641 and previous config saved to /var/cache/conftool/dbconfig/20240530-063011-marostegui.json
[06:30:17] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[06:31:02] !log
@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:31:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:33:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:33:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:36:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:36:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:45:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63642 and previous config saved to /var/cache/conftool/dbconfig/20240530-064519-marostegui.json
[06:46:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:46:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:47:11] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 8674
[06:48:18] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 8674
[06:48:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:48:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:49:14] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 8674
[06:50:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8674
[06:53:05] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 8674
[06:55:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8674
[06:56:23] !log ayounsi@cumin1002 START -
Cookbook sre.network.peering with action 'configure' for AS: 49666
[06:56:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[06:56:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:56:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49666
[06:57:24] !log Deploy schema changes on old s8 eqiad master (db1209) dbmaint T364299
[06:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:30] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63643 and previous config saved to /var/cache/conftool/dbconfig/20240530-070027-marostegui.json
[07:01:01] (PS1) DCausse: Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1037135
[07:01:57] (CR) DCausse: [C:+1] cirrus: Send weighted tags to known clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/1037153 (owner: Ebernhardson)
[07:03:28] (CR) Marostegui: [C:+2] db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037211 (owner: Marostegui)
[07:05:03] (PS3) Marostegui: db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037211
[07:05:12] (CR) CI reject: [V:-1] db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037211 (owner: Marostegui)
[07:06:47] (PS1) Marostegui: db1*: Remove puppet 7 entries [puppet] -
https://gerrit.wikimedia.org/r/1037218
[07:07:51] (Abandoned) Marostegui: db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037211 (owner: Marostegui)
[07:07:56] (CR) Marostegui: [C:+2] db1*: Remove puppet 7 entries [puppet] - https://gerrit.wikimedia.org/r/1037218 (owner: Marostegui)
[07:15:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T366123)', diff saved to https://phabricator.wikimedia.org/P63644 and previous config saved to /var/cache/conftool/dbconfig/20240530-071535-marostegui.json
[07:15:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[07:15:42] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[07:15:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[07:16:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T366123)', diff saved to https://phabricator.wikimedia.org/P63645 and previous config saved to /var/cache/conftool/dbconfig/20240530-071559-marostegui.json
[07:23:36] (PS1) Marostegui: db1243: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1037369
[07:24:03] (CR) Marostegui: [C:+2] db1243: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1037369 (owner: Marostegui)
[07:32:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[07:32:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:34:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[07:34:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:39:43] (CR) Jelto: [C:+2] gerrit: enable change.diff3ConflictView [puppet] -
10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar) [07:40:12] (03CR) 10Aklapper: [C:03+2] Add Phabricator antivandalism extension to Phabricator translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036726 (https://phabricator.wikimedia.org/T365858) (owner: 10Pppery) [07:40:21] (03CR) 10Aklapper: [V:03+2 C:03+2] Add Phabricator antivandalism extension to Phabricator translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036726 (https://phabricator.wikimedia.org/T365858) (owner: 10Pppery) [07:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T366123)', diff saved to https://phabricator.wikimedia.org/P63648 and previous config saved to /var/cache/conftool/dbconfig/20240530-074501-marostegui.json [07:45:08] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [07:48:00] (03PS64) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [07:50:56] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:55:41] (03CR) 10CI reject: [V:04-1] mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [07:57:46] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 57 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:59:26] (03PS65) 
10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [08:00:05] dancy and andre: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T0800). [08:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63649 and previous config saved to /var/cache/conftool/dbconfig/20240530-080009-marostegui.json [08:01:21] (03CR) 10Arnaudb: "sorry for the deviation, this PS should bring us up to the agreed upon speed. I've specified in the noqa mentions why they were used in th" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:02:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:02:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:03:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:04:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:05] (03CR) 10CI reject: [V:04-1] mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:08:40] (03PS66) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [08:08:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:13:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:13:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:15:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63650 and previous config saved to /var/cache/conftool/dbconfig/20240530-081517-marostegui.json [08:17:32] (03PS67) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [08:23:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:23:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:29:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:29:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:30:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T366123)', diff saved to https://phabricator.wikimedia.org/P63651 and previous config saved to /var/cache/conftool/dbconfig/20240530-083025-marostegui.json [08:30:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [08:30:30] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1036601 (https://phabricator.wikimedia.org/T366241) [08:30:31] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 
[08:30:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [08:30:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:30:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:30:52] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1036602 (https://phabricator.wikimedia.org/T366242) [08:30:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T366123)', diff saved to https://phabricator.wikimedia.org/P63652 and previous config saved to /var/cache/conftool/dbconfig/20240530-083054-marostegui.json [08:31:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:31:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:32:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T366123)', diff saved to https://phabricator.wikimedia.org/P63653 and previous config saved to /var/cache/conftool/dbconfig/20240530-083204-marostegui.json [08:33:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:33:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:35:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:35:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:37:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:37:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:37:58] (03CR) 10Volans: "That's great, a small fix for the name->slug 
conversion and the test and we're good to go." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:39:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:39:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:46:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:46:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:47:03] (03PS1) 10Marostegui: db1215: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037435 (https://phabricator.wikimedia.org/T364296) [08:47:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63654 and previous config saved to /var/cache/conftool/dbconfig/20240530-084712-marostegui.json [08:47:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1215.eqiad.wmnet with OS bookworm [08:47:49] (03CR) 10Marostegui: [C:03+2] db1215: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037435 (https://phabricator.wikimedia.org/T364296) (owner: 10Marostegui) [08:48:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:48:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:51:06] (03PS1) 10Awight: Temporary monitoring for scraper [puppet] - 10https://gerrit.wikimedia.org/r/1037437 (https://phabricator.wikimedia.org/T366144) [08:51:34] (03PS1) 10Marostegui: db1125: Add clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/1037438 [08:52:17] (03CR) 10Marostegui: [C:03+2] db1125: Add clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/1037438 (owner: 10Marostegui) [08:54:52] (03PS1) 10Marostegui: redact_sanitarium.sh: Clarifying what this script 
is for [puppet] - 10https://gerrit.wikimedia.org/r/1037439 [08:55:16] (03CR) 10Filippo Giunchedi: [C:03+2] Temporary monitoring for scraper [puppet] - 10https://gerrit.wikimedia.org/r/1037437 (https://phabricator.wikimedia.org/T366144) (owner: 10Awight) [08:55:25] (03CR) 10Marostegui: [C:03+2] redact_sanitarium.sh: Clarifying what this script is for [puppet] - 10https://gerrit.wikimedia.org/r/1037439 (owner: 10Marostegui) [09:02:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [09:02:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:02:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63655 and previous config saved to /var/cache/conftool/dbconfig/20240530-090220-marostegui.json [09:02:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:02:27] (03CR) 10Filippo Giunchedi: [C:03+1] rsyslog kafka_shipper: use the new global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway) [09:03:05] (03CR) 10Filippo Giunchedi: [C:03+1] Add an icinga/nsca collector for Fundraising kafka client cert expire check. [puppet] - 10https://gerrit.wikimedia.org/r/1037075 (https://phabricator.wikimedia.org/T360779) (owner: 10Jgreen) [09:03:07] (03CR) 10Filippo Giunchedi: [C:03+2] Add an icinga/nsca collector for Fundraising kafka client cert expire check. [puppet] - 10https://gerrit.wikimedia.org/r/1037075 (https://phabricator.wikimedia.org/T360779) (owner: 10Jgreen) [09:04:08] godog: Thanks for the merge! Let me know if you also have time to refresh the service so that it picks up the new config. 
[09:04:37] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9845479 (10Fuzzy) [09:04:56] awight: yep should be live shortly! [09:05:07] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9845481 (10Aklapper) I don't think we want to interfere without any information //why// "all members of our user group suddenly had their admins removed"... should probably check who are the... [09:05:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [09:05:26] awight: see also the web interface at https://prometheus-eqiad.wikimedia.org/analytics/targets [09:05:28] (03PS1) 10Marostegui: site.pp: Reorganize es7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037443 [09:06:06] (03CR) 10Marostegui: "This is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1037443 (owner: 10Marostegui) [09:06:07] (03PS68) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [09:06:08] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize es7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037443 (owner: 10Marostegui) [09:07:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:07:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:07:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366241 [09:08:01] T366241: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366241 [09:08:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 
hosts with reason: Primary switchover s2 T366241 [09:08:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T366241', diff saved to https://phabricator.wikimedia.org/P63656 and previous config saved to /var/cache/conftool/dbconfig/20240530-090840-arnaudb.json [09:09:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:09:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:13:13] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9845519 (10Chqaz) @Aklapper @Ladsgroup Are you able to check who are the admin? [09:13:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 500 revert T366241', diff saved to https://phabricator.wikimedia.org/P63658 and previous config saved to /var/cache/conftool/dbconfig/20240530-091323-arnaudb.json [09:13:29] T366241: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366241 [09:13:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:14:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:14:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9845525 (10Clement_Goubert) As these servers are up for decom, they won't be migrated to k8s, and they are in the current secondary datacenter. It doesn't rea... 
[09:15:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:16:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T366123)', diff saved to https://phabricator.wikimedia.org/P63659 and previous config saved to /var/cache/conftool/dbconfig/20240530-091728-marostegui.json [09:17:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [09:17:35] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [09:17:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [09:17:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T366123)', diff saved to https://phabricator.wikimedia.org/P63660 and previous config saved to /var/cache/conftool/dbconfig/20240530-091751-marostegui.json [09:17:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:18:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:20:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:20:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T366123)', diff saved to https://phabricator.wikimedia.org/P63661 and previous config saved to /var/cache/conftool/dbconfig/20240530-092004-marostegui.json [09:20:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:20:24] (03PS1) 10Marostegui: Revert "db1215: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037136 [09:21:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1215.eqiad.wmnet with OS 
bookworm [09:21:14] (03CR) 10Marostegui: [C:03+2] Revert "db1215: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037136 (owner: 10Marostegui) [09:22:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:22:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:23:32] (03PS1) 10Marostegui: db1215: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1037448 [09:24:04] (03CR) 10Marostegui: [C:03+2] db1215: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1037448 (owner: 10Marostegui) [09:25:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:25:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:28:15] (03PS1) 10Hnowlan: api-gateway: add workaround for urlencoded characters in liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037449 (https://phabricator.wikimedia.org/T365439) [09:29:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T366242 [09:29:19] T366242: Switchover s3 master (db2205 -> db2127) - https://phabricator.wikimedia.org/T366242 [09:29:34] godog: Thanks for the URL--verified that the config is active now. 
[09:29:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T366242 [09:30:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2127 with weight 0 T366242', diff saved to https://phabricator.wikimedia.org/P63662 and previous config saved to /var/cache/conftool/dbconfig/20240530-093007-arnaudb.json [09:32:27] (03CR) 10Alexandros Kosiaris: [C:03+1] install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [09:32:51] (03CR) 10Klausman: [C:03+2] install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [09:33:58] (03PS1) 10Marostegui: site.pp: Reorganize es6 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037451 [09:35:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63663 and previous config saved to /var/cache/conftool/dbconfig/20240530-093514-marostegui.json [09:35:48] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize es6 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037451 (owner: 10Marostegui) [09:37:51] (03PS1) 10AikoChou: ml-services: remove WIKI_URL for revertrisk isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037452 (https://phabricator.wikimedia.org/T366250) [09:38:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:38:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:38:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P63664 and previous config saved to 
/var/cache/conftool/dbconfig/20240530-093850-ladsgroup.json [09:38:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:39:27] (03CR) 10Klausman: [C:03+1] ml-services: remove WIKI_URL for revertrisk isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037452 (https://phabricator.wikimedia.org/T366250) (owner: 10AikoChou) [09:40:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:40:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:40:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:42:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:44:02] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1036602 (https://phabricator.wikimedia.org/T366242) (owner: 10Gerrit maintenance bot) [09:45:11] !log Starting s3 codfw failover from db2205 to db2127 - T366242 [09:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:19] T366242: Switchover s3 master (db2205 -> db2127) - https://phabricator.wikimedia.org/T366242 [09:46:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2127 to s3 primary T366242', diff saved to https://phabricator.wikimedia.org/P63665 and previous config saved to /var/cache/conftool/dbconfig/20240530-094632-arnaudb.json [09:47:33] (03PS1) 10Effie Mouzeli: memcached: migrate 10 servers to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037454 (https://phabricator.wikimedia.org/T352891) [09:49:37] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 mirror former candidate master weight T366242', diff saved to https://phabricator.wikimedia.org/P63666 and previous config saved to /var/cache/conftool/dbconfig/20240530-094936-root.json [09:50:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63667 and previous config saved to /var/cache/conftool/dbconfig/20240530-095021-marostegui.json [09:53:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:53:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:53:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P63668 and previous config saved to /var/cache/conftool/dbconfig/20240530-095358-ladsgroup.json [09:54:16] (03PS2) 10Effie Mouzeli: Add wikikube-ctrl1003 to server SRV record for etcd 4 [dns] - 10https://gerrit.wikimedia.org/r/1036623 (https://phabricator.wikimedia.org/T353464) [09:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T364299)', diff saved to https://phabricator.wikimedia.org/P63669 and previous config saved to /var/cache/conftool/dbconfig/20240530-095447-marostegui.json [09:54:53] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:55:48] (03CR) 10Effie Mouzeli: [C:03+2] Add wikikube-ctrl1003 to server SRV record for etcd 4 [dns] - 10https://gerrit.wikimedia.org/r/1036623 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [09:58:22] (03CR) 10AikoChou: [C:03+2] ml-services: remove WIKI_URL for revertrisk isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037452 (https://phabricator.wikimedia.org/T366250) (owner: 10AikoChou) [09:59:17] (03Merged) 10jenkins-bot: ml-services: remove WIKI_URL for revertrisk isvc in staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1037452 (https://phabricator.wikimedia.org/T366250) (owner: 10AikoChou) [09:59:19] !log dcausse@deploy1002 Started deploy [airflow-dags/search@66de0db]: search: add missing lexeme fields [09:59:38] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@66de0db]: search: add missing lexeme fields (duration: 00m 19s) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1000) [10:05:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T366123)', diff saved to https://phabricator.wikimedia.org/P63670 and previous config saved to /var/cache/conftool/dbconfig/20240530-100531-marostegui.json [10:05:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [10:05:37] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [10:05:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [10:05:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T366123)', diff saved to https://phabricator.wikimedia.org/P63671 and previous config saved to /var/cache/conftool/dbconfig/20240530-100554-marostegui.json [10:07:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1036604 (https://phabricator.wikimedia.org/T366259) [10:07:12] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259) [10:07:57] !log add wikikube-ctrl1003 to etcd and run puppet - T353464 [10:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:02] T353464: Migrate wikikube control planes to hardware nodes - 
https://phabricator.wikimedia.org/T353464 [10:08:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:09:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P63672 and previous config saved to /var/cache/conftool/dbconfig/20240530-100906-ladsgroup.json [10:09:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P63673 and previous config saved to /var/cache/conftool/dbconfig/20240530-100955-marostegui.json [10:13:43] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:29] (03PS1) 10Marostegui: site.pp: Reorganize es7 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037460 [10:16:40] (03PS2) 10Effie Mouzeli: memcached: migrate 10 servers to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037454 (https://phabricator.wikimedia.org/T352891) [10:16:41] (03PS1) 10Marostegui: mariadb: Remove 10.4 bullseye files [software] - 10https://gerrit.wikimedia.org/r/1037462 [10:16:42] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize es7 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037460 (owner: 10Marostegui) [10:16:49] (03CR) 10CI reject: [V:04-1] mariadb: Remove 10.4 bullseye files [software] - 10https://gerrit.wikimedia.org/r/1037462 (owner: 10Marostegui) [10:18:43] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:01] (03CR) 10Clément Goubert: [C:03+1] Remove obsolete wikikube/staging etcd certificates [puppet] - 10https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:23:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1479:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1479 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:23:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:23:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:24:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P63674 and previous config saved to /var/cache/conftool/dbconfig/20240530-102414-ladsgroup.json [10:24:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [10:24:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:24:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [10:24:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P63675 and previous config saved to /var/cache/conftool/dbconfig/20240530-102439-ladsgroup.json [10:25:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P63676 and previous config saved to /var/cache/conftool/dbconfig/20240530-102503-marostegui.json [10:25:53] !log label wikikube-ctrl1003 as master [10:25:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] !log Restarted rsyslog on mw1479 [10:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:33] !log homer "cr*eqiad*" commit 'Add wikikube-ctrl1003' [10:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw1479:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1479 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:32:30] (03PS1) 10Marostegui: db1164: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1037465 [10:32:48] (03CR) 10Marostegui: [C:03+2] pc1014: Remove puppet7 entries [puppet] - 10https://gerrit.wikimedia.org/r/1037096 (owner: 10Marostegui) [10:32:56] (03CR) 10Marostegui: [C:03+2] db1164: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1037465 (owner: 10Marostegui) [10:35:23] (03PS1) 10Marostegui: Revert "db1209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 [10:35:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T366123)', diff saved to https://phabricator.wikimedia.org/P63677 and previous config saved to /var/cache/conftool/dbconfig/20240530-103523-marostegui.json [10:35:29] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [10:35:32] (03CR) 10CI reject: [V:04-1] Revert "db1209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 (owner: 10Marostegui) [10:35:51] (03PS2) 10Marostegui: Revert "db1209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 [10:36:03] (03CR) 10CI reject: [V:04-1] Revert "db1209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 (owner: 10Marostegui) [10:36:07] 06SRE, 06Traffic: Anycast 
ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9845869 (10ayounsi) > assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this As we couldn't get a /24 from LACNIC for magru, we only have two free /24s We have to decide between a few options:... [10:36:20] !log jiji@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl1003.eqiad.wmnet [10:36:36] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 (owner: 10Marostegui) [10:36:59] (03PS1) 10Hnowlan: apiserver: colocate API canary with scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) [10:37:47] (03Abandoned) 10Marostegui: Revert "db1209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1037137 (owner: 10Marostegui) [10:38:18] !log dcausse@deploy1002 Started deploy [airflow-dags/search@ded0f17]: search: fix alter table command [10:38:38] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@ded0f17]: search: fix alter table command (duration: 00m 20s) [10:39:19] (03CR) 10Clément Goubert: [C:03+1] apiserver: colocate API canary with scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) (owner: 10Hnowlan) [10:40:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T364299)', diff saved to https://phabricator.wikimedia.org/P63678 and previous config saved to /var/cache/conftool/dbconfig/20240530-104011-marostegui.json [10:40:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:40:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:40:27] (03PS1) 10Marostegui: db2205: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037467 [10:40:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1169.eqiad.wmnet with reason: Maintenance [10:40:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T364299)', diff saved to https://phabricator.wikimedia.org/P63679 and previous config saved to /var/cache/conftool/dbconfig/20240530-104034-marostegui.json [10:42:05] (03CR) 10Marostegui: [C:03+2] db2205: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037467 (owner: 10Marostegui) [10:42:08] (03PS2) 10Hnowlan: apiserver: colocate API canary with scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) [10:42:36] (03CR) 10Clément Goubert: "Needs change to conftool-data/node" [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) (owner: 10Hnowlan) [10:43:06] (03CR) 10Clément Goubert: [C:03+1] apiserver: colocate API canary with scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) (owner: 10Hnowlan) [10:43:21] !log Deploy schema changes on old s3 codfw master (db2205) dbmaint T364299ç [10:43:22] !log Deploy schema changes on old s3 codfw master (db2205) dbmaint T364299 [10:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:47] (03PS1) 10Marostegui: db2204: Remove puppet 7 entries [puppet] - 10https://gerrit.wikimedia.org/r/1037469 [10:44:59] (03CR) 10Marostegui: "This is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1037469 (owner: 10Marostegui) [10:45:17] (03CR) 10Marostegui: [C:03+2] db2204: Remove puppet 7 entries [puppet] - 10https://gerrit.wikimedia.org/r/1037469 (owner: 10Marostegui) [10:46:36] (03CR) 10Klausman: [C:03+1] api-gateway: add workaround for urlencoded characters in liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037449 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan) [10:47:42] 10ops-codfw, 06SRE, 06DC-Ops, 
06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9845938 (10jijiki) [10:47:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9845940 (10jijiki) [10:48:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:48:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:48:59] (03CR) 10Hnowlan: [C:03+2] apiserver: colocate API canary with scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1037466 (https://phabricator.wikimedia.org/T361856) (owner: 10Hnowlan) [10:50:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:50:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:50:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63680 and previous config saved to /var/cache/conftool/dbconfig/20240530-105031-marostegui.json [10:52:27] !log switched mw2300 to be an api canary + scap_proxy, removed mw228[34] as canaries [10:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:48] (03PS1) 10Marostegui: db2185: Remove puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037470 [10:54:10] (03CR) 10Marostegui: [C:03+2] db2185: Remove puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037470 (owner: 10Marostegui) [11:00:18] (03CR) 10Hnowlan: [C:03+2] utils: remove pem copying step from ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/1034952 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:05:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:05:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:05:40] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63681 and previous config saved to /var/cache/conftool/dbconfig/20240530-110539-marostegui.json [11:08:08] (03PS2) 10Stevemunene: Remove datahub from LVS [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) [11:08:08] (03PS1) 10Stevemunene: Remove datahub from LVS [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) [11:13:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:16:16] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/1037462 (owner: 10Marostegui) [11:17:57] (03CR) 10Marostegui: [C:03+2] mariadb: Remove 10.4 bullseye files [software] - 10https://gerrit.wikimedia.org/r/1037462 (owner: 10Marostegui) [11:18:22] (03Merged) 10jenkins-bot: mariadb: Remove 10.4 bullseye files [software] - 10https://gerrit.wikimedia.org/r/1037462 (owner: 10Marostegui) [11:19:44] (03CR) 10Hnowlan: [C:03+2] api-gateway: add workaround for urlencoded characters in liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037449 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan) [11:20:38] (03Merged) 10jenkins-bot: api-gateway: add workaround for urlencoded characters in liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037449 (https://phabricator.wikimedia.org/T365439) (owner: 10Hnowlan) [11:20:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T366123)', diff saved to https://phabricator.wikimedia.org/P63682 and previous config saved to /var/cache/conftool/dbconfig/20240530-112047-marostegui.json [11:20:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance 
[11:20:52] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [11:21:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [11:21:45] (03PS3) 10Stevemunene: Set datahub LVS to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) [11:21:45] (03PS2) 10Stevemunene: Set datahub LVS to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) [11:23:39] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:23:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:24:29] (03PS1) 10Clément Goubert: push-notifications: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037478 [11:24:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:25:05] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:26:36] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:26:56] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:28:56] (03PS1) 10Stevemunene: Remove datahub service entry [puppet] - 10https://gerrit.wikimedia.org/r/1037479 (https://phabricator.wikimedia.org/T366137) [11:30:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:30:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:31:58] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:32:09] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:32:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:32:38] !log 
@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:32:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:33:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:34:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:34:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:34:51] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:35:16] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:38:48] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9846069 (10cmooney) >>! In T366193#9845869, @ayounsi wrote: >> assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this > As we couldn't get a /24 from LACNIC for magru, we only have two free /24... 
[11:41:22] (03PS2) 10Clément Goubert: push-notifications: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037478 [11:42:36] (03PS1) 10Marostegui: db2186: Remove puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037481 [11:43:35] (03CR) 10Marostegui: [C:03+2] db2186: Remove puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037481 (owner: 10Marostegui) [11:44:06] (03CR) 10Hnowlan: [C:03+1] push-notifications: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037478 (owner: 10Clément Goubert) [11:44:17] (03CR) 10Clément Goubert: [C:03+2] push-notifications: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037478 (owner: 10Clément Goubert) [11:44:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:44:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:45:19] (03Merged) 10jenkins-bot: push-notifications: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037478 (owner: 10Clément Goubert) [11:45:52] (03CR) 10Stevemunene: "added the patches for this" [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [11:46:42] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply [11:46:52] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [11:47:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [11:47:12] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9846086 (10Ladsgroup) We can. But we are not allowed to disclose them. Did you send the email to `wikija-g-owner@lists.wikimedia.org`? 
[11:47:42] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [11:47:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [11:47:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:47:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:48:08] (03PS1) 10Vgutierrez: fifo_log_demux,ATS: Support prometheus and buffer options [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) [11:48:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [11:50:27] (03PS2) 10Hnowlan: conftool-data: remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1004637 [11:51:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [11:52:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [11:53:44] (03PS1) 10Slyngshede: P:netbox Move to OIDC for authentication [puppet] - 10https://gerrit.wikimedia.org/r/1037506 (https://phabricator.wikimedia.org/T308002) [11:54:15] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:58:05] (03PS3) 10Hnowlan: kubernetes: make 5 eqiad api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) [11:59:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:59:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1200) [12:01:58] !log Deploy schema changes on 
old s3 codfw master (db2205) dbmaint T364069 [12:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:03] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:04:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205', diff saved to https://phabricator.wikimedia.org/P63683 and previous config saved to /var/cache/conftool/dbconfig/20240530-120455-root.json [12:05:36] (03PS1) 10Marostegui: db2205: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037510 [12:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:07] (03CR) 10Marostegui: [C:03+2] db2205: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1037510 (owner: 10Marostegui) [12:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364069)', diff saved to https://phabricator.wikimedia.org/P63684 and previous config saved to /var/cache/conftool/dbconfig/20240530-120638-marostegui.json [12:08:26] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:16:28] (03CR) 10Brouberol: "question: According to https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service, we should always respect the following stat" [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [12:20:52] (03PS2) 10Gergő Tisza: [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 [12:21:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance [12:21:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P63685 and previous config saved to /var/cache/conftool/dbconfig/20240530-122146-marostegui.json [12:21:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance [12:22:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T366123)', diff saved to https://phabricator.wikimedia.org/P63686 and previous config saved to /var/cache/conftool/dbconfig/20240530-122206-marostegui.json [12:22:14] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [12:23:56] (03CR) 10Stevemunene: "Nope." [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [12:24:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T366123)', diff saved to https://phabricator.wikimedia.org/P63687 and previous config saved to /var/cache/conftool/dbconfig/20240530-122422-marostegui.json [12:25:15] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037517 (owner: 10L10n-bot) [12:27:38] (03PS2) 10Vgutierrez: fifo_log_demux,ATS: Support prometheus and buffer options [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) [12:29:20] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [12:30:21] !log joal@deploy1002 Started deploy [analytics/refinery@ac0b789]: Regular analytics weekly train [analytics/refinery@ac0b789b] [12:31:26] (03CR) 10Vgutierrez: [V:03+1] "cp4052.yaml hiera file will be dropped before merging, just there for PCC validation purposes" [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [12:33:41] (03CR) 10Gergő Tisza: [multiversion] Add 'manage-dblist init-labs' subcommand (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [12:36:03] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 151 probes of 725 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:36:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P63688 and previous config saved to /var/cache/conftool/dbconfig/20240530-123655-marostegui.json [12:39:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:39:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:39:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P63689 and previous config saved to /var/cache/conftool/dbconfig/20240530-123930-marostegui.json [12:40:48] (03CR) 10Vgutierrez: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [12:41:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:41:05] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 725 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:41:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:42:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:42:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:43:19] !log joal@deploy1002 Finished deploy [analytics/refinery@ac0b789]: Regular analytics weekly train [analytics/refinery@ac0b789b] (duration: 12m 58s) [12:43:41] !log dcausse@deploy1002 Started deploy [airflow-dags/search@0faf248]: search: use discolytics 0.23 [12:44:07] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@0faf248]: search: use discolytics 0.23 (duration: 00m 26s) [12:47:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:47:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:49:15] (03PS1) 10Kosta Harlan: geoip: Use GeoLite2 instead of GeoIP2 Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [12:50:44] !log joal@deploy1002 Started deploy [analytics/refinery@ac0b789] (thin): Regular analytics weekly train THIN [analytics/refinery@ac0b789b] 
[12:50:56] (03PS1) 10Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) [12:51:35] (03CR) 10Kosta Harlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:51:41] (03CR) 10Kosta Harlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:51:48] (03CR) 10CI reject: [V:04-1] geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:52:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364069)', diff saved to https://phabricator.wikimedia.org/P63690 and previous config saved to /var/cache/conftool/dbconfig/20240530-125204-marostegui.json [12:52:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [12:52:10] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:52:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [12:52:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:52:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T364069)', diff saved to https://phabricator.wikimedia.org/P63691 and previous config saved to /var/cache/conftool/dbconfig/20240530-125232-marostegui.json [12:53:05] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846286 
(10jijiki) >>! In T366094#9844588, @CDanis wrote: >>>! In T366094#9842327, @akosiaris wrote: >> I am gonna disagree on this one. [This](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t36... [12:53:18] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846288 (10jijiki) 05Open→03In progress [12:53:22] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:53:31] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:53:54] (03PS2) 10Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) [12:54:12] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846289 (10jijiki) [12:54:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9846291 (10jijiki) [12:54:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9846290 (10jijiki) [12:54:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P63692 and previous config saved to /var/cache/conftool/dbconfig/20240530-125438-marostegui.json [12:55:11] !log joal@deploy1002 Finished deploy [analytics/refinery@ac0b789] (thin): Regular analytics weekly train THIN [analytics/refinery@ac0b789b] (duration: 04m 27s) [12:56:40] jouncebot: nowandnext [12:56:40] For the next 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1200) [12:56:40] In 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1300) [12:58:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply 
[12:58:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:58:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9846328 (10Clement_Goubert) `mw2282` is a kubernetes server, so would need to be drained and cordoned as well. However since they are to be decommed and in th... [12:58:49] !log joal@deploy1002 Started deploy [analytics/refinery@ac0b789] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ac0b789b] [12:59:22] (03CR) 10Kosta Harlan: geoip: Download GeoLite2 ASN file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [13:00:04] (03CR) 10Jforrester: [C:03+1] Add a stream for tracking the API of WikiLambda (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1300). [13:00:05] dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:24] (03CR) 10Brouberol: "doh :facepalm:" [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [13:00:35] (03CR) 10Brouberol: [C:03+1] Set datahub LVS to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [13:00:50] o/ [13:01:03] (03CR) 10Effie Mouzeli: [C:03+2] memcached: migrate 10 servers to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037454 (https://phabricator.wikimedia.org/T352891) (owner: 10Effie Mouzeli) [13:01:19] I can deploy [13:01:34] dcausse: regarding the deployment window coming up in 1 minute, we wanna try out whether the thing that was contributing to creating deployment issues the other day still does (we already took various actions to alleviate that). Kindly asking to give cdanis a few mins to deploy the otel collector [13:01:38] dcausse: FYI, we're just turned back on a fea-- [13:01:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:01:41] akosiaris: it's deployed [13:01:43] !log joal@deploy1002 Finished deploy [analytics/refinery@ac0b789] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ac0b789b] (duration: 02m 54s) [13:01:45] perfect! [13:01:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:02:28] dcausse: you got the go ahead on our side, keep in mind the deployment might fail or take a while longer than usual (that's what we noticed last time). Please do keep us informed on how it goes [13:02:45] akosiaris, cdanis ack [13:03:05] (03CR) 10DCausse: [C:03+2] Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:03:06] in case it does, we can quickly rollback our change. 
[13:04:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037153 (owner: 10Ebernhardson) [13:04:15] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2044.codfw.wmnet with OS bookworm [13:04:28] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1044.eqiad.wmnet with OS bookworm [13:05:29] (03Merged) 10jenkins-bot: cirrus: Send weighted tags to known clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037153 (owner: 10Ebernhardson) [13:06:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:06:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:06:20] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1037153|cirrus: Send weighted tags to known clusters]] [13:08:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:08:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:08:15] (03PS1) 10Volans: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) [13:08:44] (03CR) 10Volans: "To be tested on netbox-next once back online" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans) [13:09:02] !log dcausse@deploy1002 dcausse and ebernhardson: Backport for [[gerrit:1037153|cirrus: Send weighted tags to known clusters]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T366123)', diff saved to https://phabricator.wikimedia.org/P63693 and previous config saved to /var/cache/conftool/dbconfig/20240530-130946-marostegui.json [13:09:51] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2220.codfw.wmnet with reason: Maintenance [13:09:54] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [13:10:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2220.codfw.wmnet with reason: Maintenance [13:10:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:10:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T366123)', diff saved to https://phabricator.wikimedia.org/P63694 and previous config saved to /var/cache/conftool/dbconfig/20240530-131012-marostegui.json [13:10:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:10:22] !log dcausse@deploy1002 dcausse and ebernhardson: Continuing with sync [13:11:45] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9846383 (10BBlack) Re: anycast-ns1 and future plans, etc (I won't quote all the relevant bits from both msgs above): * Currently we still strongly prefer to avoid mixing any other service with the DoH anycast, due to the d... 
[13:12:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:12:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:13:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1173 T356240', diff saved to https://phabricator.wikimedia.org/P63695 and previous config saved to /var/cache/conftool/dbconfig/20240530-131349-arnaudb.json [13:14:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: upgrade [13:14:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: upgrade [13:15:31] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet [13:15:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:15:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:17:42] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1044.eqiad.wmnet with reason: host reimage [13:17:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:17:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:19:04] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1037153|cirrus: Send weighted tags to known clusters]] (duration: 12m 43s) [13:20:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet [13:20:34] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1044.eqiad.wmnet with reason: host reimage [13:20:53] PROBLEM - MariaDB Replica Lag: s6 #page on db1173 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:22:17] should I depool ^ ? 
[13:22:17] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage [13:22:18] !incidents [13:22:19] 4711 (UNACKED) db1173 (paged)/MariaDB Replica Lag: s6 (paged) [13:22:25] !ack 4711 [13:22:26] 4711 (ACKED) db1173 (paged)/MariaDB Replica Lag: s6 (paged) [13:22:43] oops [13:22:45] my bad [13:23:02] it's supposed to be downtimed [13:23:29] arnaudb: no worries, thanks. [13:23:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:23:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:24:22] (03PS9) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [13:25:17] (03CR) 10Clément Goubert: [C:03+1] conftool-data: remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1004637 (owner: 10Hnowlan) [13:25:27] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage [13:25:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:25:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:25:47] (03CR) 10Elukey: redfish: expand support for Supermicro hosts (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:27:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:27:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:27:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: upgrade [13:28:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: upgrade 
[13:29:13] (03CR) 10CI reject: [V:04-1] Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:29:22] sigh... [13:30:37] 09:04:33 [PostBuildScript] - [ERROR] An error occured during post-build processing. [13:30:40] 09:04:33 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: hudson.AbortException: castor-save-workspace-cache aborted. [13:31:29] dcausse: I would guess that's a spurious CI failure [13:31:46] the pre-post-build-script output is identical to when it got V+1 before [13:31:56] cdanis: ok, thanks, re-submitting [13:32:08] (03CR) 10DCausse: Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:32:15] (03CR) 10DCausse: [C:03+2] Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:33:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T366123)', diff saved to https://phabricator.wikimedia.org/P63697 and previous config saved to /var/cache/conftool/dbconfig/20240530-133348-marostegui.json [13:33:54] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [13:33:55] RECOVERY - MariaDB Replica Lag: s6 #page on db1173 is OK: OK slave_sql_lag Replication lag: 40.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:34:10] yes mwgate-node18 passed now [13:36:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:37:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:38:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1044.eqiad.wmnet with OS bookworm [13:38:58] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [13:39:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:40:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:40:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:43:01] (03CR) 10Stevemunene: [C:03+2] Set datahub LVS to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [13:43:18] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2044.codfw.wmnet with OS bookworm [13:43:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:43:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:45:22] jouncebot: next [13:45:22] In 1 hour(s) and 14 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1500) [13:45:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [13:45:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [13:46:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1010.eqiad.wmnet with OS bullseye [13:47:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye executed... 
[13:47:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9846603 (10Jclark-ctr) [13:48:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:48:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:48:17] (03PS1) 10Umherirrender: rdbms: Pass array values to makeList on insert/upsert [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037138 (https://phabricator.wikimedia.org/T366268) [13:48:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9846613 (10Jclark-ctr) [13:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P63698 and previous config saved to /var/cache/conftool/dbconfig/20240530-134856-marostegui.json [13:50:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:50:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:51:14] dcausse: were you continuing with deploying that patch, or? [13:51:32] I'm not deploying anything myself, just watching the k8s control plane during deploys [13:51:33] cdanis: yes waiting for ci to finish [13:51:40] ah of course [13:51:46] it has to re-run the 20min+ tests too [13:51:52] yes :( [13:52:02] 😔 [13:52:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:52:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:53:26] cdanis: quick random question, when scap runs php-fpm-restarts, this is only done on baremetal hosts? 
[13:53:34] dcausse: correct [13:53:40] thanks [13:55:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [13:57:12] cdanis: ci is about to finish, starting another deploy [13:57:21] thanks! [13:57:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:57:57] dcausse: By the way I don't know if it was applicable to your patches, but you can scap backport multiple patches at once, fyi [13:58:20] (03Merged) 10jenkins-bot: Add UpdateGroup for weighted tags [extensions/CirrusSearch] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037135 (owner: 10DCausse) [13:58:51] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1037135|Add UpdateGroup for weighted tags]] [13:59:28] claime: did not know that thanks! what I did here is +2 the one patch while another one was being deployed [14:00:44] Yeah, that's the right solution if your patches need to be applied separately, but if they can be merged/deployed together, you can indeed do scap backport gerritId1 gerritId2... 
[14:01:34] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1037135|Add UpdateGroup for weighted tags]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:01:35] good to know, thanks :) [14:02:02] !log dcausse@deploy1002 dcausse: Continuing with sync [14:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P63699 and previous config saved to /var/cache/conftool/dbconfig/20240530-140404-marostegui.json [14:05:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:05:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [14:05:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846759 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [14:05:55] (03PS1) 10Tchanders: Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037139 [14:06:24] (03PS2) 10Tchanders: Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) [14:07:15] (03PS3) 10Tchanders: Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) [14:08:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:12] (03CR) 10Ssingh: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:10:42] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1037135|Add UpdateGroup for weighted tags]] (duration: 11m 51s) [14:11:17] !log backport window done [14:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:13:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:13:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:54] (03CR) 10Volans: [C:03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:15:42] (03PS3) 10Vgutierrez: fifo_log_demux,ATS: Support prometheus and buffer options [puppet] - 10https://gerrit.wikimedia.org/r/1037485 (https://phabricator.wikimedia.org/T364383) [14:17:55] (03CR) 10Santiago Faci: [C:03+1] Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin) [14:18:02] (03PS10) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [14:18:11] (03CR) 10Elukey: redfish: expand support for Supermicro hosts (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:18:36] (03CR) 10Vgutierrez: [C:03+2] fifo_log_demux,ATS: Support prometheus and buffer options [puppet] - 10https://gerrit.wikimedia.org/r/1037485 
(https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:19:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T366123)', diff saved to https://phabricator.wikimedia.org/P63700 and previous config saved to /var/cache/conftool/dbconfig/20240530-141914-marostegui.json [14:19:20] T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123 [14:20:25] (03CR) 10JHathaway: [C:03+2] rsyslog kafka_shipper: use the new global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway) [14:21:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9846804 (10Papaul) @Clement_Goubert thanks for the update. Since i can not edit your comment I updating it here. The move should be something like: serviceop... [14:22:48] (03PS1) 10Vgutierrez: hiera: Set prometheus port on fifo-log-demux@cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/1037546 (https://phabricator.wikimedia.org/T364383) [14:23:52] (03CR) 10Ssingh: [C:03+1] "🚢 it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1037546 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:24:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: upgrade db1165 [14:24:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: upgrade db1165 [14:25:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 depool for T356240', diff saved to https://phabricator.wikimedia.org/P63701 and previous config saved to /var/cache/conftool/dbconfig/20240530-142519-arnaudb.json [14:25:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63702 and previous config saved to /var/cache/conftool/dbconfig/20240530-142555-arnaudb.json [14:26:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:26:29] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1165.eqiad.wmnet [14:26:37] (03CR) 10Elukey: [C:03+2] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:26:58] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9846834 (10Chqaz) I sent an email to `wikija-g-owner@lists.wikimedia.org` asking them to contact me. 
[14:27:48] (03CR) 10Vgutierrez: [C:03+2] hiera: Set prometheus port on fifo-log-demux@cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/1037546 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:29:44] (03PS1) 10CDanis: re-enable otelcol in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037548 (https://phabricator.wikimedia.org/T366094) [14:31:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1165.eqiad.wmnet [14:31:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:33:47] (03Merged) 10jenkins-bot: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:34:21] (03CR) 10CDanis: [C:03+2] "already deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037548 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:34:35] (03CR) 10Kosta Harlan: [C:04-1] "There is also some stuff in volatile and ipinfo.pp that I am not sure about (ipinfo.pp hardcodes 506 as the Country data product ID)" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [14:35:29] (03CR) 10Hnowlan: [C:03+2] conftool-data: remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1004637 (owner: 10Hnowlan) [14:35:44] (03CR) 10Kosta Harlan: [C:03+1] Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) (owner: 10Tchanders) [14:35:59] (03PS3) 10Stevemunene: Set datahub LVS to service_setup [puppet] - 
10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) [14:37:42] (03Merged) 10jenkins-bot: re-enable otelcol in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037548 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [14:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:16] (03PS1) 10Vgutierrez: hiera: Set prometheus_port on fifo-log-demux@cp30[73,81] [puppet] - 10https://gerrit.wikimedia.org/r/1037551 (https://phabricator.wikimedia.org/T364383) [14:39:25] (03CR) 10Brouberol: [C:03+1] Set datahub LVS to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [14:40:13] (03CR) 10Stevemunene: [C:03+2] Set datahub LVS to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1037471 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [14:41:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63703 and previous config saved to /var/cache/conftool/dbconfig/20240530-144101-arnaudb.json [14:41:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63704 and previous config saved to /var/cache/conftool/dbconfig/20240530-144142-arnaudb.json [14:42:09] !log Running `decommission` on 5 eqiad api appservers [14:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:43] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:48:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:48:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:49:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:51:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:51:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:52:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main1010.eqiad.wmnet with OS bullseye [14:52:56] (03CR) 10Ahmon Dancy: [C:03+2] Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 (owner: 10RLazarus) [14:53:33] 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846950 (10CDanis) https://grafana.wikimedia.org/goto/1rSRUSsSg?orgId=1 As we expected/hoped, the increase in eqiad TX bytes was only about 10-15%. [14:54:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:54:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:54:44] (03Merged) 10jenkins-bot: Really fix tests for jsonschema. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 (owner: 10RLazarus) [14:55:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63705 and previous config saved to /var/cache/conftool/dbconfig/20240530-145607-arnaudb.json [14:56:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63706 and previous config saved to /var/cache/conftool/dbconfig/20240530-145648-arnaudb.json [14:57:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:57:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:57:25] James_F: Is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1037138 good to deploy? [14:57:38] dancy: Should be, yes. [14:57:47] Excellent. 
[14:58:04] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c5-codfw.mgmt.codfw.wmnet [14:58:06] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:58:07] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-c2-codfw - pt1979@cumin2002" [14:58:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037138 (https://phabricator.wikimedia.org/T366268) (owner: 10Umherirrender) [14:58:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-c2-codfw - pt1979@cumin2002" [14:58:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-c2-codfw.mgmt.codfw.wmnet [14:59:14] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c6-codfw.mgmt.codfw.wmnet [14:59:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:59:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:59:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:59:32] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:00:05] dancy and andre: How many deployers does it take to do Train log triage deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1500). [15:00:26] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9846964 (10ssingh) >>! 
In T366193#9845869, @ayounsi wrote: >> assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this Thanks for the detailed breakdown of the options! > As we couldn't get a /... [15:01:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:01:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:05:06] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:05:20] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c6-codfw - pt1979@cumin2002" [15:06:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-c6-codfw - pt1979@cumin2002" [15:06:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9847000 (10aborrero) I guess I never got to refresh the interfaces in netbox. 
[15:10:04] (03PS4) 10Hnowlan: kubernetes: rename and repurpose 5 api appservers as k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) [15:10:15] (03PS1) 10Alexandros Kosiaris: preseed: Remove kafka-main1009 exception [puppet] - 10https://gerrit.wikimedia.org/r/1037554 (https://phabricator.wikimedia.org/T363212) [15:11:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63707 and previous config saved to /var/cache/conftool/dbconfig/20240530-151113-arnaudb.json [15:11:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63708 and previous config saved to /var/cache/conftool/dbconfig/20240530-151155-arnaudb.json [15:12:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:12:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:13:05] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [15:14:27] (03CR) 10Alexandros Kosiaris: [C:03+2] preseed: Remove kafka-main1009 exception [puppet] - 10https://gerrit.wikimedia.org/r/1037554 (https://phabricator.wikimedia.org/T363212) (owner: 10Alexandros Kosiaris) [15:18:49] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:19:24] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1041 [15:19:40] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1041 [15:21:41] (03Merged) 10jenkins-bot: rdbms: Pass array values to makeList on insert/upsert 
[core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037138 (https://phabricator.wikimedia.org/T366268) (owner: 10Umherirrender) [15:22:14] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037138|rdbms: Pass array values to makeList on insert/upsert (T366268)]] [15:22:19] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [15:22:40] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9847060 (10aborrero) a:05aborrero→03Andrew Just run the cookbook: ` aborrero@cumin1002:~ $ sudo cookbook sre.network.configure-switch-interfaces cloudvirt1041 Acquir... [15:23:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:23:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:24:47] !log dancy@deploy1002 umherirrender and dancy: Backport for [[gerrit:1037138|rdbms: Pass array values to makeList on insert/upsert (T366268)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:24:58] !log dancy@deploy1002 umherirrender and dancy: Continuing with sync [15:25:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:25:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:26:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63709 and previous config saved to /var/cache/conftool/dbconfig/20240530-152619-arnaudb.json [15:27:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63710 and previous config saved to 
/var/cache/conftool/dbconfig/20240530-152703-arnaudb.json [15:32:56] (03PS1) 10MVernon: New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) [15:34:09] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1037138|rdbms: Pass array values to makeList on insert/upsert (T366268)]] (duration: 11m 55s) [15:34:15] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [15:34:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:34:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:35:21] (03PS2) 10MVernon: New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) [15:35:41] (03CR) 10CI reject: [V:04-1] New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:36:23] (03PS3) 10MVernon: New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) [15:36:55] !log joal@deploy1002 Started deploy [airflow-dags/analytics@3659547]: Regular analytics weekly train [airflow-dags/analytics@3659547f] [15:37:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c6-codfw.mgmt.codfw.wmnet [15:37:23] !log dancy@deploy1002 Started scap: Testing [15:37:25] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@3659547]: Regular analytics weekly train [airflow-dags/analytics@3659547f] (duration: 00m 29s) [15:37:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:37:46] dancy: how has scap execution speed been, btw? 
[15:37:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:38:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-c5-codfw.mgmt.codfw.wmnet [15:38:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye [15:38:18] cdanis: Stable... still 3m3s for sync-prod-k8s phase. [15:38:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye [15:38:26] great :) [15:38:51] (03PS1) 10Cwhite: logstash: limit LogstashKafkaConsumerLag to Logstash-specific consumer groups [alerts] - 10https://gerrit.wikimedia.org/r/1037487 (https://phabricator.wikimedia.org/T366227) [15:38:52] (03PS1) 10Cwhite: o11y: add BenthosKafkaConsumerLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1037488 (https://phabricator.wikimedia.org/T366227) [15:40:14] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:40:17] (03CR) 10Vgutierrez: [C:03+2] hiera: Set prometheus_port on fifo-log-demux@cp30[73,81] [puppet] - 10https://gerrit.wikimedia.org/r/1037551 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:41:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63712 and previous config saved to /var/cache/conftool/dbconfig/20240530-154127-arnaudb.json [15:42:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P63713 and previous config saved to 
/var/cache/conftool/dbconfig/20240530-154208-arnaudb.json [15:43:02] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-c7-codfw.mgmt.codfw.wmnet [15:43:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:43:24] (03PS4) 10MVernon: New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) [15:43:32] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:45:33] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-d2-codfw.mgmt.codfw.wmnet [15:45:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:45:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:47:46] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:48:07] !log dancy@deploy1002 Finished scap: Testing (duration: 10m 43s) [15:48:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:50:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d2-codfw - pt1979@cumin2002" [15:51:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d2-codfw - pt1979@cumin2002" [15:51:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:34] denisse: Do you have a way to test the effects of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029664 ? 
[15:53:28] dancy: Yes, I could look at the Grafana graph to see if the metrics are being ingested. [15:53:37] ok.. Starting the process [15:53:39] But that would require the change to be merged and applied. [15:53:43] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [15:54:24] (03Merged) 10jenkins-bot: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [15:54:53] dancy: nothing planned for the Puppet window today, feel free to roll into it [15:54:57] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1029664|Migrate `wmfstatic` metrics to Prometheus store (T359255)]] [15:55:02] T359255: Migrate MediaWiki.wmfstatic.* to statslib - https://phabricator.wikimedia.org/T359255 [15:55:06] rzl: Thanks! [15:55:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [15:57:35] !log dancy@deploy1002 denisse and dancy: Backport for [[gerrit:1029664|Migrate `wmfstatic` metrics to Prometheus store (T359255)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:57:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:57:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:57:58] denisse: Ready for testing using the test/debug servers. 
[15:58:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [15:59:29] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns700[1-2].wikimedia.org,service=authdns-ns2 [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:15] topranks: define ACAST_PS_ADVERTISE = [16:00:18] 198.35.27.27/32 [16:00:55] dancy: Thank you, one question. Do you have any ideas/suggestions to test with those servers? I'm unfamiliar with the process. 😬 [16:01:17] I was mostly thinking of looking at certain graphs like the MediaWiki StatsLib Migration graph. https://grafana.wikimedia.org/d/nCxX65cSk/mediawiki-statslib-migration?orgId=1&from=now-30m&to=now [16:01:25] denisse: https://wikitech.wikimedia.org/wiki/WikimediaDebug [16:01:33] cdanis: Thanks! [16:01:38] and btw denisse https://grafana.wikimedia.org/goto/OZfgrSySg?orgId=1 [16:02:36] sukhe: yep [16:02:45] cdanis: Thanks a lot for both the link to the docs and to the graph. Looking at the graph it seems to be working as expected! 🙌 [16:03:20] (03CR) 10Kamila Součková: [C:03+2] shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [16:03:32] denisse: Ok.. proceeding w/ the deployment [16:03:32] So this is my very 1st MediaWiki change live in production, what a milestone!
:D [16:03:36] !log dancy@deploy1002 denisse and dancy: Continuing with sync [16:04:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:04:36] (03Merged) 10jenkins-bot: shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [16:04:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:04:53] nice denisse \o/ [16:05:07] kamila_: Thanks! :3 [16:05:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T364299)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240530-160528-marostegui.json [16:05:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:05:44] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:13] (03CR) 10Tchanders: "I've commented on T366272#9847371 about whether we can use GeoIP2 data instead of Geolite2, since GeoLite2 was never actually used on prod" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [16:06:58] denisse: you beat me by a few days I guess, I don't think I have done a mediawiki-config one yet :D [16:07:37] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:08:04] kamila_: Hopefully soon! :D [16:08:05] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:08:16] yeah, I'll have one next week :D [16:08:30] That's so cool! What is it about? 
:) [16:08:33] (03PS1) 10Ssingh: config/sites: start advertising ns2 (198.35.27.27) from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1037568 [16:08:41] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:09:28] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:09:56] denisse: i'd watch a screen recording :) [16:10:25] mutante: Hahaha, yeah, I should've streamed it. :P [16:10:41] record it and upload a video to wikitech :) [16:11:18] https://asciinema.org/ [16:12:15] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1029664|Migrate `wmfstatic` metrics to Prometheus store (T359255)]] (duration: 17m 19s) [16:12:21] T359255: Migrate MediaWiki.wmfstatic.* to statslib - https://phabricator.wikimedia.org/T359255 [16:12:59] (03CR) 10Cathal Mooney: [C:03+1] config/sites: start advertising ns2 (198.35.27.27) from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1037568 (owner: 10Ssingh) [16:13:11] mutante: That's an excellent idea, however, this time dancy helped me with the deployment. Thanks dancy for that! [16:13:22] asciinema rulez [16:13:30] No problem. Next time around, you can log into the deploy server and run `scap backport ` [16:13:34] one of my favourite tools of all time [16:13:52] I'd like to get more familiar with the process and deploy it myself next time. Recording it with asciinema is an excellent idea! :D [16:13:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:14:08] (03PS5) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) [16:14:11] dancy: Thank you, I will do. 
:) [16:14:26] (03CR) 10Ssingh: [C:03+2] config/sites: start advertising ns2 (198.35.27.27) from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1037568 (owner: 10Ssingh) [16:14:34] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:14:46] denisse: you're a root, generally if there's not a major prod issue and if there's not a deploy window going on (you can ask a bot here, I'll demonstrate shortly), you can scap backport [16:14:50] jouncebot: nowandnext [16:14:50] For the next 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1600) [16:14:50] In 0 hour(s) and 45 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1700) [16:14:50] In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1700) [16:14:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye [16:15:03] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye comple... [16:15:11] or rather, if the deploy window isn't *active*, different from what the bot says [16:15:19] many of them go reserved but un-used [16:15:54] cdanis: Good to know, thank you. It's good to know about that bot. 
[16:16:18] ^ that's why I asked about borrowing the current deployment window for something else [16:16:59] there is also https://wikitech.wikimedia.org/wiki/Deployments where you can see all the deployment windows and whether they have patches or nothing in them [16:17:06] !log sudo homer asw*magru* commit "add 198.35.27.0/24 for magru to announce ns2.wikimedia.org": T346722 [16:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:12] T346722: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722 [16:17:41] sukhe: 🎉 [16:17:59] (03PS1) 10JHathaway: logstash: add postfix filters & patterns [puppet] - 10https://gerrit.wikimedia.org/r/1037571 [16:18:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:18:23] (03PS1) 10Elukey: sre.hardware: add useless-suppression to pylint disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1037572 [16:18:23] (03PS1) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [16:19:18] cdanis: it's a party and everyone is invited :) [16:19:43] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [16:19:58] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:20:05] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:20:07] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:20:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:13] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:20:30] !log [correction] sudo homer cr*magru* commit "add 198.35.27.0/24 for magru to announce ns2.wikimedia.org": T346722 [16:20:30] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[16:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:36] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P63714 and previous config saved to /var/cache/conftool/dbconfig/20240530-162040-marostegui.json [16:21:22] (03CR) 10CI reject: [V:04-1] logstash: add postfix filters & patterns [puppet] - 10https://gerrit.wikimedia.org/r/1037571 (owner: 10JHathaway) [16:21:26] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9847478 (10cmooney) > At least pdns-recursor seems to do this: Anecdotally Bind seems to do the same, in a test this morning my local server went to ns2 626 times when I queried a bunch of wikis, and ns1 and ns0 5 times ea... [16:21:26] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:21:32] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:22:11] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:22:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d2-codfw.mgmt.codfw.wmnet [16:22:56] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [16:23:03] (03PS2) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [16:23:35] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-d3-codfw.mgmt.codfw.wmnet [16:23:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:23:42] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [16:23:49] !log kamila@deploy1002 helmfile [codfw] START 
helmfile.d/services/shellbox-constraints: apply [16:23:51] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:23:57] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:24:26] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:24:32] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:24:35] (03PS1) 10Dzahn: gerrit/test: set lfs sync dest host to itself [puppet] - 10https://gerrit.wikimedia.org/r/1037574 (https://phabricator.wikimedia.org/T363196) [16:25:01] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:25:07] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [16:25:13] (03CR) 10Elukey: "This is just a proposal so the approach could be discussed, but I am not strongly voting for it. From my limited understanding of the curr" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:25:41] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [16:25:44] (03CR) 10Dzahn: "hrmm,, so first something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037574 and then back to this" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [16:26:01] (03CR) 10Dzahn: [C:04-1] gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [16:31:04] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:31:43] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:31:49] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:31:51] !log 
kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:31:58] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:32:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9847549 (10VRiley-WMF) @kamila is there a preferred time for this activity? I'm more than happy to schedule this at any time. [16:32:21] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:32:26] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d3-codfw - pt1979@cumin2002" [16:32:28] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:33:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d3-codfw - pt1979@cumin2002" [16:33:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:33] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:33:39] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:34:51] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:35:36] (03CR) 10Dzahn: [V:04-1] "doesn't like an array - https://puppet-compiler.wmflabs.org/output/1037574/2692/gerrit-bullseye.devtools.eqiad1.wikimedia.cloud/change.ger" [puppet] - 10https://gerrit.wikimedia.org/r/1037574 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [16:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P63715 and previous config saved to 
/var/cache/conftool/dbconfig/20240530-163549-marostegui.json [16:40:10] (03PS2) 10Dzahn: gerrit/test: set lfs sync dest host to itself [puppet] - 10https://gerrit.wikimedia.org/r/1037574 (https://phabricator.wikimedia.org/T363196) [16:41:47] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9847581 (10cmooney) >>! In T362421#9808055, @cmooney wrote: > Cogent are picking the magru announcement as best globally from Novvacore it seems also. We could add `28189:8094... [16:42:14] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9847571 (10Andrew) 05Open→03Resolved This host is now pooled and working properly. [16:43:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847589 (10akosiaris) kafka-main1009 is successfully imaged fully [17:08:15] 10ops-magru: magru account report errors - https://phabricator.wikimedia.org/T365500#9847724 (10RobH) The msws don't have any serial number in the elevation document (wasn't populated with other serials) or on the racking directions (where some serials were listed). I'm not sure what they are for the two msws,...
[17:08:27] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d5-codfw - pt1979@cumin2002" [17:09:04] (03PS1) 10JHathaway: Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - 10https://gerrit.wikimedia.org/r/1037141 [17:09:12] (03PS1) 10DCausse: noc: add a 'strict' param to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [17:09:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d5-codfw - pt1979@cumin2002" [17:09:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:26] (03CR) 10CI reject: [V:04-1] noc: add a 'strict' param to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [17:10:30] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9847731 (10cmooney) Ok after change Cogent are going to Telia, which we have at all the other POPs, so I think a better result. Novvacore are still announcing it at IX.br so d... 
[17:10:49] (03PS1) 10JHathaway: Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org""" [puppet] - 10https://gerrit.wikimedia.org/r/1037142 [17:11:12] (03Abandoned) 10JHathaway: Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - 10https://gerrit.wikimedia.org/r/1037141 (owner: 10JHathaway) [17:11:42] (03CR) 10JHathaway: [C:03+2] Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org""" [puppet] - 10https://gerrit.wikimedia.org/r/1037142 (owner: 10JHathaway) [17:12:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:12:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:12:19] (03PS2) 10DCausse: noc: add a 'strict' param to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [17:13:44] (03PS1) 10Cathal Mooney: Policy for Novvacore at magru to not announce Anycast to Cogent [homer/public] - 10https://gerrit.wikimedia.org/r/1037588 (https://phabricator.wikimedia.org/T362421) [17:14:00] 10ops-magru: magru account report errors - https://phabricator.wikimedia.org/T365500#9847770 (10RobH) 05Open→03Resolved Added the mr1 for magru, and now all items on the error report are for Accounting to add to the main ledger area (they handle that) and everything below the threshold is now done.
[17:14:35] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [17:14:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:14:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:16:32] (03PS3) 10DCausse: noc: add a 'strict' param to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [17:16:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:16:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:20:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:20:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:21:29] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9847808 (10KFrancis) Hi all, the NDA is out for signatures. I'll confirm when it's complete. Thanks! [17:22:25] (03CR) 10Cwhite: [C:04-1] "It would be nice to at least have one test that proves that the patterns and typecasting are behaving as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1037571 (owner: 10JHathaway) [17:24:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:24:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d4-codfw.mgmt.codfw.wmnet [17:32:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:32:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:33:07] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-d6-codfw.mgmt.codfw.wmnet [17:33:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:34:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:34:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:34:36] !log joal@deploy1002 Started deploy [airflow-dags/analytics@e74e164]: Regular analytics weekly train HOTFIX [airflow-dags/analytics@e74e164f] [17:35:03] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@e74e164]: Regular analytics weekly train HOTFIX [airflow-dags/analytics@e74e164f] (duration: 00m 27s) [17:36:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:36:27] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [17:36:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d6-codfw - pt1979@cumin2002" [17:37:24] James_F: Is there an expectation that the warning rate will increase on rollout to group2? [17:37:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d6-codfw - pt1979@cumin2002" [17:37:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:37:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [17:37:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:40:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:40:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:40:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d5-codfw.mgmt.codfw.wmnet [17:41:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:01] dancy: Yes, I think it's firing from a lot of code paths. 
[17:42:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:42:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:42:23] Thanks. Holding the train in that case. [17:42:27] :-( [17:42:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [17:42:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:43:24] And Gerrit is being hammered right now, so holding for that too. [17:44:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:44:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:46:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:49:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:50:14] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-d7-codfw.mgmt.codfw.wmnet [17:50:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:50:59] Gerrit has recovered. 
[17:55:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:57:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:57:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:57:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d7-codfw - pt1979@cumin2002" [17:58:30] dancy: Gerrit is struggling again, but hopefully not for long [17:58:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d7-codfw - pt1979@cumin2002" [17:58:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:04] dancy and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T1800). 
[18:00:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:53] Holding the train for T366268 [18:00:54] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [18:08:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d6-codfw.mgmt.codfw.wmnet [18:13:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:27] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:30] (03PS1) 10Jforrester: Temporarily silence noisy new warnings [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037143 (https://phabricator.wikimedia.org/T366268) [18:15:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:15:41] (03CR) 10Jforrester: [C:03+1] Temporarily silence noisy new warnings [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037143 (https://phabricator.wikimedia.org/T366268) (owner: 10Jforrester) [18:15:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:15:52] dancy: Should be good to deploy 1037143 now. [18:17:34] (03PS2) 10Jforrester: Temporarily silence noisy new warnings [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037143 (https://phabricator.wikimedia.org/T366268) [18:17:41] OK. 
Waiting for CI to be satisfied [18:17:42] (03PS1) 10Jdlrobson: Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) [18:18:34] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-d8-codfw.mgmt.codfw.wmnet [18:18:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:19:08] (03CR) 10CI reject: [V:04-1] Wrap tables in Vector 2022 for projects where legacy Vector is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037600 (https://phabricator.wikimedia.org/T366314) (owner: 10Jdlrobson) [18:20:43] FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:23:49] (03PS1) 10Cwhite: admin: add derenrich to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037489 (https://phabricator.wikimedia.org/T365381) [18:26:36] (03Abandoned) 10Cory Massaro: Update orchestrator image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032579 (owner: 10Cory Massaro) [18:26:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d8-codfw - pt1979@cumin2002" [18:27:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-d8-codfw - pt1979@cumin2002" [18:27:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:27:45] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9848175 (10colewhite) 05Stalled→03In progress a:05Milimetric→03colewhite [18:28:07] (03CR) 10Dzahn: [C:03+1] "lgtm, they said on ticket no shell needed, has approval now, UID matches LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/1037489 (https://phabricator.wikimedia.org/T365381) (owner: 10Cwhite) [18:29:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d7-codfw.mgmt.codfw.wmnet [18:31:10] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-d1-codfw.mgmt.codfw.wmnet [18:31:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:32:05] (03CR) 10JHathaway: "sure, what is the best way to support the testing with custom patterns?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1037571 (owner: 10JHathaway) [18:34:46] (03PS3) 10Krinkle: [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [18:34:49] (03CR) 10Krinkle: [C:03+1] [multiversion] Add 'manage-dblist init-labs' subcommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [18:35:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-d1-codfw - pt1979@cumin2002" [18:35:05] (03CR) 10Krinkle: [C:03+1] "I wonder why PHPCS didn't catch the whitespace issue? Anyway, fixed. LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [18:35:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-d1-codfw - pt1979@cumin2002" [18:35:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:40:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037143 (https://phabricator.wikimedia.org/T366268) (owner: 10Jforrester) [18:41:25] !log T365571 💙root@deploy1002.eqiad.wmnet ~ 🕝⁉ kubectl delete node kubernetes2032.codfw.wmnet [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:33] T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571 [18:43:43] FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:51:19] (03CR) 10Cwhite: [C:04-1] "On the logstash host, the patterns directory name must match 
what's in the profile/files/logstash directory (e.g. rename `patterns.d` -> `" [puppet] - 10https://gerrit.wikimedia.org/r/1037571 (owner: 10JHathaway) [18:51:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P63720 and previous config saved to /var/cache/conftool/dbconfig/20240530-185125-ladsgroup.json [18:51:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:52:22] (03CR) 10Cwhite: [C:03+2] admin: add derenrich to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037489 (https://phabricator.wikimedia.org/T365381) (owner: 10Cwhite) [18:52:45] (03PS4) 10Krinkle: Move etcd.php from wmf-config/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891733 (https://phabricator.wikimedia.org/T308932) [18:56:09] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:56:58] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9848366 (10colewhite) 05In progress→03Resolved The group membership change has been deployed. Please feel free to reopen if you encounter any r... [18:58:42] RESOLVED: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:59:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-d8-codfw.mgmt.codfw.wmnet [18:59:39] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9848405 (10Dzahn) Thanks! I'll upload a patch for this. 
[19:00:20] (03PS6) 10David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) [19:01:46] (03PS1) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:02:32] (03PS2) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:03:04] (03PS1) 10Dzahn: admin: add Joely Rooke (WMDE) to ldap_only (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) [19:03:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking) [19:03:52] (03CR) 10CI reject: [V:04-1] admin: add Joely Rooke (WMDE) to ldap_only (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) (owner: 10Dzahn) [19:04:07] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:04:29] (03Merged) 10jenkins-bot: Temporarily silence noisy new warnings [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037143 (https://phabricator.wikimedia.org/T366268) (owner: 10Jforrester) [19:05:01] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037143|Temporarily silence noisy new warnings (T366268)]] [19:05:07] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [19:06:04] (03PS3) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:06:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P63722 and previous config saved 
to /var/cache/conftool/dbconfig/20240530-190633-ladsgroup.json [19:08:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-d1-codfw.mgmt.codfw.wmnet [19:09:03] (03PS2) 10Dzahn: admin: add Joely Rooke (WMDE) to ldap_only (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) [19:09:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:10:43] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:11:43] !log dancy@deploy1002 jforrester and dancy: Backport for [[gerrit:1037143|Temporarily silence noisy new warnings (T366268)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:11:50] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [19:11:53] !log dancy@deploy1002 jforrester and dancy: Continuing with sync [19:12:01] Whee. 
[19:13:47] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:18:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:18:47] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [19:19:21] !log bouncing exim on mx1001 [19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:41] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1037143|Temporarily silence noisy new warnings (T366268)]] (duration: 15m 39s) [19:20:46] T366268: 12 million warnings of Wikimedia\Rdbms\Platform\SQLPlatform::makeList: array key {key} in list of values ignored (via SQLPlatform::makeInsertLists) - https://phabricator.wikimedia.org/T366268 [19:21:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P63723 and previous config saved to /var/cache/conftool/dbconfig/20240530-192145-ladsgroup.json [19:22:46] (03PS2) 10CDobbins: varnish: add better error page when HTTP status code 429 is returned [puppet] - 10https://gerrit.wikimedia.org/r/1035011 (https://phabricator.wikimedia.org/T354718) [19:22:54] Rolling the train to group2 [19:23:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:24:05] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037604 (https://phabricator.wikimedia.org/T361401) [19:24:08] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037604 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [19:24:11] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-d8-codfw.mgmt.codfw.wmnet [19:24:11] RESOLVED: MXQueueHigh: MX host mx1001:9100 has many queued messages: 4460 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [19:24:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:24:42] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037604 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [19:26:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-d8-codfw - pt1979@cumin2002" [19:27:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-d8-codfw - pt1979@cumin2002" [19:27:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:33] !log bounce exim on mx2001 [19:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9848500 (10Jclark-ctr) 
@akosiaris kafka-main1010 has imaged but is still failing cookbook for me would you be able to try that one for me? [19:32:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:32:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:34:22] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9848512 (10VRiley-WMF) The motherboard for parse1002 has been replaced with a brand new one that Dell has shipped out. All the cables have been hooked back into it. [19:34:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:34:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:35:34] (03PS4) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:35:43] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.7 refs T361401 [19:35:50] Yay! 
[19:35:51] T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401 [19:36:51] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P63724 and previous config saved to /var/cache/conftool/dbconfig/20240530-193653-ladsgroup.json [19:36:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:37:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:37:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:37:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P63725 and previous config saved to /var/cache/conftool/dbconfig/20240530-193717-ladsgroup.json [19:38:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:38:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:40:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:40:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:41:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:44:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:44:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: 
apply [19:47:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:47:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:48:21] (03PS5) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:50:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking) [19:53:20] (03PS6) 10Bking: wdqs: Add alert for maxlag [puppet] - 10https://gerrit.wikimedia.org/r/1037602 (https://phabricator.wikimedia.org/T361114) [19:57:33] !log cdanis@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [19:59:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-d8-codfw.mgmt.codfw.wmnet [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240530T2000). [20:00:05] Jdlrobson, dmartin-WMF, and Tchanders: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:24] Hello, I'm around [20:02:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9848597 (10Papaul) [20:02:04] Hi [20:02:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:02:21] !log cdanis@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:02:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:02:29] hi - i can deploy [20:02:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9848600 (10Papaul) [20:02:45] Hi Clare :-) [20:02:50] o/ [20:02:54] hi David! [20:02:59] cjming: thanks! [20:03:05] Jdlrobson: i'll do yours first [20:03:25] (03PS2) 10Jdlrobson: Popups setting should be string not integer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) [20:04:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [20:04:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:04:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:04:50] (03Merged) 10jenkins-bot: Popups setting should be string not integer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson) [20:05:08] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1037189|Popups setting should be string not integer (T364347)]] [20:05:13] (03PS1) 10Andrew Bogott: codfw1dev: update horizon release [puppet] - 10https://gerrit.wikimedia.org/r/1037609 
(https://phabricator.wikimedia.org/T365096) [20:05:14] T364347: Popups: Make use of conditional user defaults - https://phabricator.wikimedia.org/T364347 [20:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:26] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1037189|Popups setting should be string not integer (T364347)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:29] Jdlrobson: shall i sync? [20:10:12] cjming: yes please! [20:10:16] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:10:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:10:42] (03PS7) 10David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) [20:10:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:13:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:14:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:15:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:17:06] (03PS6) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [20:18:48] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1037189|Popups setting should be string not integer (T364347)]] (duration: 13m 39s) [20:18:54] T364347: Popups: Make use of conditional user defaults - 
https://phabricator.wikimedia.org/T364347 [20:19:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:19:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:20:07] not sure if any more seasoned deployers are around - but i saw this message from the last scap backport: "20:18:48 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=jdlrobson', 'Backport for [[gerrit:1037189|Popups setting should be string not integer (T364347)]]']' returned non-zero exit status 1." [20:20:30] ^^ does this mean it didn't finish properly? [20:21:01] Jdlrobson: can you verify if the popups change is live? [20:21:50] i think it did complete because i also see: "20:18:48 Finished scap: Backport for [[gerrit:1037189|Popups setting should be string not integer (T364347)]] (duration: 13m 39s)" [20:22:04] just not sure which message is the truth [20:22:12] and if i have to try again [20:22:17] * dancy reads [20:22:25] cjming: Are there any earlier errors in the output? [20:22:39] (rhetorical.. there must be) [20:23:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:23:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:23:16] https://www.irccloud.com/pastebin/6gyTng2g/ [20:23:27] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:23:33] ok.. that's a non-fatal issue so you're good. [20:23:43] cool ! 
thanks for confirming [20:23:49] carrying on then [20:24:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin) [20:24:13] There are some systems being reimaged, which is probably why that happened (and may happen again) [20:24:24] so just ignore for now if it happens again? [20:24:29] Yeah [20:24:32] gtk [20:24:41] Jdlrobson: should be live - lmk if not [20:24:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:24:47] (03Merged) 10jenkins-bot: Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin) [20:25:03] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1017962|Add a stream for tracking the API of WikiLambda (T356228 T360369)]] [20:25:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:25:07] dmartin-WMF: doing your patch now [20:25:09] thanks cjming [20:25:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:25:12] appreciated! 
[20:25:20] T356228: WikiLambda metrics: Track uses of the wikilambda_function_call API - https://phabricator.wikimedia.org/T356228 [20:25:20] T360369: Instrument the new public Wikifunction call API - https://phabricator.wikimedia.org/T360369 [20:25:35] cjming: Great; I'm here [20:25:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:25:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:26:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:26:43] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:27:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:28:21] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:29:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:29:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:29:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:29:51] !log cjming@deploy1002 cjming and dmartin: Backport for [[gerrit:1017962|Add a stream for tracking the API of WikiLambda (T356228 T360369)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:59] dmartin-WMF: can you test? [20:30:05] cjming: This is where I test using WikimediaDebug, right? Just a minute please ... [20:30:16] yup! do you have the snippet to try? [20:31:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:31:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:32:56] !log robh@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:33:26] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:33:29] cjming: yes, I'm trying the snippet, but it's failing. 
(Same test I ran a month or 2 ago, which worked for a different stream) [20:33:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:33:38] mw.loader.moduleRegistry['ext.eventLogging'].packageExports['ext.eventLogging/data.json'].streamConfigs[ 'mediawiki.product_metrics.wikifunctions_ui' ]; [20:33:55] (03PS6) 10MVernon: New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) [20:34:09] I mean this one is failing: [20:34:09] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [20:34:10] mw.loader.moduleRegistry['ext.eventLogging'].packageExports['ext.eventLogging/data.json'].streamConfigs[ 'mediawiki.product_metrics.wikilambda_api' ]; [20:34:51] hmm - bummer -- what does it return? [20:35:01] undefined [20:35:21] Just a moment I will double check the name of the stream in the patch [20:35:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:35:29] !log robh@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:35:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:35:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:35:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:36:05] Yes, the stream name looks right [20:36:17] cjming: I will try the test again [20:36:22] looks right to me too - idk if there's a latency maybe? 
[20:37:30] when you tested last time, did you get a response from the console right away? [20:37:50] cjming: yes, I did [20:37:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:38:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:38:25] I notice that WikimediaDebug is offering a couple additional server options now; no idea if that's relevant [20:38:34] But I've tried all of the server options and I still get undefined [20:39:39] SRE, Wikimedia-Mailing-lists: Toronto WikiClub mailing list - https://phabricator.wikimedia.org/T366307#9848718 (Ladsgroup) Hi, I'm not seeing any meta page about this club. Is it done by Wikimedia Canada? [20:40:45] Hmm, not sure what to do. I'm still getting undefined [20:41:17] Undefined for the new stream, but I get a valid response for the existing stream [20:41:55] hmm - i'm not sure why you wouldn't see it - we could revert or sync - i don't think syncing would harm anything [20:42:16] Wait - it seems to be working now ... [20:42:23] SRE, Wikimedia-Mailing-lists: Toronto WikiClub mailing list - https://phabricator.wikimedia.org/T366307#9848726 (Legowerewolf) You mean on meta.wikimedia.org? Yeah, no, our page is over on ca.wikimedia.org: https://ca.wikimedia.org/wiki/Toronto_WikiClub [20:42:23] oh good! [20:42:26] Let me double check [20:42:37] It started working after I refreshed the page [20:42:41] Just a minute please [20:43:31] refresh/reload -- always something to double-check [20:44:18] Okay, tests are passing now, so I'd say go ahead with the sync [20:44:20] SRE, Wikimedia-Mailing-lists: Toronto WikiClub mailing list - https://phabricator.wikimedia.org/T366307#9848730 (Ladsgroup) That's good enough. Thanks. I'll create it soon. [20:44:23] yay!! 
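The console check used in the exchange above can be wrapped in a small guard so that a missing module or stream yields undefined instead of throwing. This is only a sketch: getStreamConfig and fakeRegistry are hypothetical names, the stand-in registry and its { example: true } value are invented so the code runs outside a browser, and in a real session the first argument would be mw.loader.moduleRegistry after a page reload.

```javascript
// Hypothetical helper (not part of MediaWiki): looks up one stream's config in
// an object shaped like mw.loader.moduleRegistry, following the same path as
// the console snippet in the log. Returns undefined if any level is missing.
function getStreamConfig(moduleRegistry, streamName) {
  const configs =
    moduleRegistry?.['ext.eventLogging']
      ?.packageExports?.['ext.eventLogging/data.json']
      ?.streamConfigs;
  return configs?.[streamName];
}

// Stand-in registry for running outside a browser; the config value is invented.
const fakeRegistry = {
  'ext.eventLogging': {
    packageExports: {
      'ext.eventLogging/data.json': {
        streamConfigs: {
          'mediawiki.product_metrics.wikifunctions_ui': { example: true },
        },
      },
    },
  },
};

// An existing stream returns its config object; a stream that is not (yet)
// present returns undefined, which is what was seen before the refresh.
console.log(getStreamConfig(fakeRegistry, 'mediawiki.product_metrics.wikifunctions_ui'));
console.log(getStreamConfig(fakeRegistry, 'mediawiki.product_metrics.wikilambda_api'));
```

As the log notes, a stale page can keep returning undefined for a newly added stream, so reloading before re-running the check is worth doing first.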
[20:44:30] !log cjming@deploy1002 cjming and dmartin: Continuing with sync [20:51:14] SRE, Wikimedia-Mailing-lists: Toronto WikiClub mailing list - https://phabricator.wikimedia.org/T366307#9848732 (Ladsgroup) Open→Resolved a:Ladsgroup https://lists.wikimedia.org/postorius/lists/wikiclub-toronto.lists.wikimedia.org/ [20:51:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:53:12] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1017962|Add a stream for tracking the API of WikiLambda (T356228 T360369)]] (duration: 28m 08s) [20:53:19] T356228: WikiLambda metrics: Track uses of the wikilambda_function_call API - https://phabricator.wikimedia.org/T356228 [20:53:20] T360369: Instrument the new public Wikifunction call API - https://phabricator.wikimedia.org/T360369 [20:53:28] dmartin-WMF: should be live! [20:53:54] Tchanders: yours is up next [20:54:02] Super! Yes, I'm getting a correct result from production server. Thanks so much! [20:54:06] (PS4) Tchanders: Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) [20:54:07] cjming: just in time! [20:54:32] ya - sorry - sometimes things take longer than i think they should [20:54:57] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) (owner: Tchanders) [20:54:58] cjming: I'm used to it - wasn't sure if mine would get in as third... 
[20:55:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:55:31] no worries - should go quick 🤞 [20:55:37] (Merged) jenkins-bot: Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" [mediawiki-config] - https://gerrit.wikimedia.org/r/1037139 (https://phabricator.wikimedia.org/T361884) (owner: Tchanders) [20:55:56] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1037139|Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" (T361884)]] [20:56:02] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [20:56:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:58:23] !log cjming@deploy1002 tchanders and cjming: Backport for [[gerrit:1037139|Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:58:27] Tchanders: should i sync? [20:58:56] cjming: Yes please, go ahead! 
[20:59:00] !log cjming@deploy1002 tchanders and cjming: Continuing with sync [20:59:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:02:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:04:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:04:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:04:17] !log dropping old replication user from backup sources [21:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:06:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:06:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:07:12] (PS1) Scott French: shellbox: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) [21:07:20] (CR) CI reject: [V:-1] shellbox: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [21:07:39] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1037139|Revert "IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath" (T361884)]] (duration: 11m 43s) [21:07:45] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [21:07:54] Tchanders: should be live! 
[21:07:59] (PS1) Ebernhardson: cirrus: Remove cirrus_index.py script [deployment-charts] - https://gerrit.wikimedia.org/r/1037616 [21:08:41] !log end of UTC late backport window [21:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:59] cjming: Looks great. Thanks loads! [21:09:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:09:08] yw! [21:09:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:09:15] (PS2) Ebernhardson: cirrus: Remove cirrus_index.py script [deployment-charts] - https://gerrit.wikimedia.org/r/1037616 [21:09:35] (PS2) Scott French: shellbox: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) [21:11:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:11:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:11:45] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:13:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:13:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:15:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:16:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:17:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:17:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:18:21] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:20:53] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:21:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:21:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:23:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:23:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:33] (PS1) JHathaway: phab: query for inbound mail servers [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [21:29:52] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [21:44:00] (PS2) JHathaway: phab: query for inbound mail servers [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [21:44:04] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [22:10:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:35] PROBLEM - Disk 
space on backup1011 is CRITICAL: DISK CRITICAL - free space: /srv/objectstorage 2054674MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [22:15:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:15:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:16:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T364299)', diff saved to https://phabricator.wikimedia.org/P63726 and previous config saved to /var/cache/conftool/dbconfig/20240530-221604-marostegui.json [22:16:11] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:17:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:17:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:26:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:26:07] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:29:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:29:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:31:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P63727 and previous config saved to /var/cache/conftool/dbconfig/20240530-223112-marostegui.json [22:31:18] !log @deploy1002 
helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:31:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:38:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:38:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:45:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:45:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:46:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P63728 and previous config saved to /var/cache/conftool/dbconfig/20240530-224621-marostegui.json [22:47:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:47:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:49:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:49:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:51:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:51:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:53:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:54:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:01:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T364299)', diff saved to https://phabricator.wikimedia.org/P63729 and previous config saved to /var/cache/conftool/dbconfig/20240530-230129-marostegui.json [23:01:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1196.eqiad.wmnet with reason: Maintenance [23:01:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:01:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [23:01:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:02:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:02:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T364299)', diff saved to https://phabricator.wikimedia.org/P63730 and previous config saved to /var/cache/conftool/dbconfig/20240530-230212-marostegui.json [23:06:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:06:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:08:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:08:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:10:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:10:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:10:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364069)', diff saved to https://phabricator.wikimedia.org/P63731 and previous config saved to /var/cache/conftool/dbconfig/20240530-231052-marostegui.json [23:10:59] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:12:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:12:49] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [23:14:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:14:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:16:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:16:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:18:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:18:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:20:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:20:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:25:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:26:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P63732 and previous config saved to /var/cache/conftool/dbconfig/20240530-232600-marostegui.json [23:26:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:28:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:28:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:30:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:30:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:32:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:32:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:34:09] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [23:34:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:36:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:36:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037490 [23:38:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037490 (owner: TrainBranchBot) [23:39:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:41:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P63733 and previous config saved to /var/cache/conftool/dbconfig/20240530-234109-marostegui.json [23:41:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:41:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:43:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:44:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:46:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:46:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:48:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:48:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:50:05] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [23:50:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:56:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364069)', diff saved to https://phabricator.wikimedia.org/P63734 and previous config saved to /var/cache/conftool/dbconfig/20240530-235617-marostegui.json [23:56:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:56:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:56:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T364069)', diff saved to https://phabricator.wikimedia.org/P63735 and previous config saved to /var/cache/conftool/dbconfig/20240530-235640-marostegui.json