[00:05:25] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:49] :O
[00:08:34] ops-eqiad, SRE, Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9798051 (Eevans) The array has rebuilt, but I could swear I hear it ticking... `lang=sh-session eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 /dev/md2: Version : 1.2 Creation Time : Thu...
[00:13:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P62397 and previous config saved to /var/cache/conftool/dbconfig/20240515-001352-ladsgroup.json
[00:15:25] RESOLVED: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62398 and previous config saved to /var/cache/conftool/dbconfig/20240515-002900-ladsgroup.json
[00:29:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[00:29:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:29:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[00:29:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62399 and previous config saved to /var/cache/conftool/dbconfig/20240515-002923-ladsgroup.json
[01:13:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:38] RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[02:38:02] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:38] PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 9305 MB (2% inode=99%): /tmp 9305 MB (2% inode=99%): /var/tmp 9305 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[03:03:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:02] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:03:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:04:25] (PS2) Cwhite: wmerrors: add config and code to copy stats to dogstatsd [puppet] - https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814)
[03:18:12] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:18:34] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:18:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1423:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1423 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:23:36] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw1423:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1423 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:24:12] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:38] RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[04:08:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:09:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 5.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[04:26:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[04:31:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-magru.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[04:31:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw1-b4-magru.mgmt.magru.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[05:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0600)
[06:05:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:22] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948 (phaultfinder) NEW
[06:08:28] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798282 (phaultfinder)
[06:10:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:25] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798284 (phaultfinder)
[06:41:39] (CR) Muehlenhoff: [C:+2] parsoid/testing: Enable profile::auto_restarts::service for envoy [puppet] - https://gerrit.wikimedia.org/r/1028793 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:42:57] (PS1) KartikMistry: Enable Content/Section translation in io, nds, nds-nl and, mwl [mediawiki-config] - https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666)
[07:00:05] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0700). Please do the needful.
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[07:04:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:04:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:04:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:07:45] (PS1) Muehlenhoff: mx: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145)
[07:08:35] (PS1) Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145)
[07:12:08] (PS1) Fabfur: cache:benthos: switch to production topic names [puppet] - https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117)
[07:13:05] (CR) LSobanski: mx: Stop ignoring errors from alias sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[07:14:36] Sorry for late joining. I'll self-deploy
[07:14:53] jouncebot now
[07:14:53] For the next 0 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0700)
[07:17:27] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry)
[07:17:48] (CR) Muehlenhoff: mx: Stop ignoring errors from alias sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[07:18:08] (Merged) jenkins-bot: Enable Content/Section translation in io, nds, nds-nl and, mwl [mediawiki-config] - https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666) (owner: KartikMistry)
[07:18:42] (PS2) Muehlenhoff: mx: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145)
[07:19:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]
[07:19:07] (PS2) Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145)
[07:19:12] T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:19:35] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[07:19:35] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d1-codfw
[07:19:47] (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: Fabfur)
[07:20:11] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[07:20:12] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d1-codfw
[07:21:31] (CR) Fabfur: [V:+1 C:-2] "Do not merge until ready to switch to production topics" [puppet] - https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: Fabfur)
[07:21:51] !log kartik@deploy1002 kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:23:56] (CR) David Caro: openstack_apis: use a higher value for rgw (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1031494 (owner: David Caro)
[07:26:36] (PS3) David Caro: openstack_apis: use a higher value for rgw [alerts] - https://gerrit.wikimedia.org/r/1031494
[07:26:36] (CR) David Caro: openstack_apis: use a higher value for rgw (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1031494 (owner: David Caro)
[07:26:45] (CR) David Caro: openstack_apis: use a higher value for rgw (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1031494 (owner: David Caro)
[07:29:06] (CR) David Caro: [C:+2] openstack_apis: use a higher value for rgw [alerts] - https://gerrit.wikimedia.org/r/1031494 (owner: David Caro)
[07:30:20] (Merged) jenkins-bot: openstack_apis: use a higher value for rgw [alerts] - https://gerrit.wikimedia.org/r/1031494 (owner: David Caro)
[07:30:33] !log kartik@deploy1002 Sync cancelled.
[07:31:23] ah. wrong keypress :/
[07:31:28] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]
[07:31:32] T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:34:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1172.eqiad.wmnet
[07:34:08] !log kartik@deploy1002 kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:35:32] (PS1) Muehlenhoff: Switch db1172 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031802 (https://phabricator.wikimedia.org/T349619)
[07:36:36] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:37:08] !log kartik@deploy1002 kartik: Continuing with sync
[07:37:17] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:38:25] (CR) Filippo Giunchedi: "No real reason except that it isn't needed AFAIK, I'll put it back though since it isn't really a relevant change and merge." [puppet] - https://gerrit.wikimedia.org/r/1031462 (owner: Filippo Giunchedi)
[07:38:43] !log installing curl security updates
[07:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:20] (PS1) KartikMistry: Section Translation: Fix nds-nl language code [mediawiki-config] - https://gerrit.wikimedia.org/r/1031803
[07:39:39] (CR) LSobanski: [C:+1] mx: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[07:39:48] (CR) LSobanski: [C:+1] vrts: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[07:43:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:44:46] (CR) Muehlenhoff: [C:+2] Switch db1172 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031802 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1172.eqiad.wmnet
[07:49:35] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] (duration: 18m 06s)
[07:49:38] T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:50:42] (CR) Brouberol: [C:+1] "Looks good, and PCC shows a NOOP. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1031429 (owner: Muehlenhoff)
[07:52:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1177.eqiad.wmnet
[07:53:05] I would like to deploy another quick fix (wrong language code) from previous patch.
[07:53:12] (PS1) Muehlenhoff: Switch db1177 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031804 (https://phabricator.wikimedia.org/T349619)
[07:55:19] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1031803 (owner: KartikMistry)
[07:55:44] (CR) Filippo Giunchedi: [V:+1] "Thank you Scott for the review" [puppet] - https://gerrit.wikimedia.org/r/1031465 (owner: Filippo Giunchedi)
[07:56:01] (Merged) jenkins-bot: Section Translation: Fix nds-nl language code [mediawiki-config] - https://gerrit.wikimedia.org/r/1031803 (owner: KartikMistry)
[07:56:31] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]]
[07:58:58] (CR) Muehlenhoff: [C:+2] Switch db1177 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031804 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:59:16] !log kartik@deploy1002 kartik: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0800)
[08:00:15] o/
[08:00:32] (PS3) Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - https://gerrit.wikimedia.org/r/1031462
[08:00:32] (PS4) Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - https://gerrit.wikimedia.org/r/1031463
[08:00:32] (PS4) Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - https://gerrit.wikimedia.org/r/1031465
[08:00:36] (PS6) Ayounsi: Spicerack module for gNMI [software/spicerack] - https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325)
[08:01:26] !log kartik@deploy1002 kartik: Continuing with sync
[08:01:32] (CR) Filippo Giunchedi: [C:+2] utils: use HEAD for get_config7.sh [puppet] - https://gerrit.wikimedia.org/r/1031462 (owner: Filippo Giunchedi)
[08:01:46] andre: good morning. I am in the deployment google meet :)
[08:01:50] andre: I'm finishing my deployment..
[08:02:10] no worries kart_ , let us know when it has completed
[08:03:06] sure
[08:03:52] !log installing nodejs security updates on buster
[08:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1177.eqiad.wmnet
[08:04:58] (CR) Filippo Giunchedi: [V:+2 C:+2] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:05:29] (CR) Filippo Giunchedi: [C:+2] jaeger: update aux values [deployment-charts] - https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:05:31] (CR) Filippo Giunchedi: [C:+2] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:05:53] (CR) Filippo Giunchedi: [V:+2 C:+2] jaeger: update aux values [deployment-charts] - https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:06:02] (CR) Filippo Giunchedi: [V:+2 C:+2] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:07:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62401 and previous config saved to /var/cache/conftool/dbconfig/20240515-080700-marostegui.json
[08:07:05] (CR) CI reject: [V:-1] Spicerack module for gNMI [software/spicerack] - https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: Ayounsi)
[08:07:06] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:07:37] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:13:16] (PS1) Filippo Giunchedi: jaeger: add back port names for otlp [deployment-charts] - https://gerrit.wikimedia.org/r/1031805 (https://phabricator.wikimedia.org/T364477)
[08:13:46] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]] (duration: 17m 14s)
[08:13:49] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:13:52] hashar: done
[08:14:00] excellent!
[08:14:01] :)
[08:14:23] * hashar looks at logs
[08:15:11] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[08:15:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1178.eqiad.wmnet
[08:15:35] (CR) Filippo Giunchedi: [C:+2] jaeger: add back port names for otlp [deployment-charts] - https://gerrit.wikimedia.org/r/1031805 (https://phabricator.wikimedia.org/T364477) (owner: Filippo Giunchedi)
[08:16:03] (CR) Effie Mouzeli: [C:+2] flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[08:16:10] the !log there from me actually did nothing, I pressed "n" at confirmation time
[08:16:29] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:16:46] (PS1) TrainBranchBot: group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399)
[08:16:48] (CR) TrainBranchBot: [C:+2] group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[08:17:15] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:17:46] (Merged) jenkins-bot: group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[08:18:52] (Merged) jenkins-bot: flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[08:18:59] (PS1) Muehlenhoff: Switch db1178 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031807 (https://phabricator.wikimedia.org/T349619)
[08:19:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62402 and previous config saved to /var/cache/conftool/dbconfig/20240515-081934-ladsgroup.json
[08:19:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:20:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[08:21:31] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[08:22:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P62403 and previous config saved to /var/cache/conftool/dbconfig/20240515-082209-marostegui.json
[08:22:45] oh joy
[08:22:47] httpbb fails
[08:22:50] fun
[08:24:53] (CR) Muehlenhoff: [C:+2] Switch db1178 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031807 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:25:25] hmm
[08:26:06] so the scap deployment failed again due to httpbb
[08:26:19] mwdebug2002 yields a 503 error for one of the test page
[08:26:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[08:26:51] and looking at logstash for that host, php7.4-fpm sent a rsyslog message `[NOTICE] exiting, bye-bye!`
[08:27:04] which is hmm.. confusing
[08:27:30] oh that is the php fpm restart
[08:27:41] but that also mean after restarting the server is not immediately available
[08:27:50] and we tend to ignore those timeout/503 errors in log spam
[08:29:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1178.eqiad.wmnet
[08:29:57] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:30:10] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:30:48] !log installing openjdk-17/jetty9 security updates on Bookworm
[08:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1192.eqiad.wmnet
[08:32:28] (PS1) Muehlenhoff: Switch db1192 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031808 (https://phabricator.wikimedia.org/T349619)
[08:34:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62404 and previous config saved to /var/cache/conftool/dbconfig/20240515-083443-ladsgroup.json
[08:35:41] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.5 refs T361399
[08:35:44] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[08:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P62405 and previous config saved to /var/cache/conftool/dbconfig/20240515-083717-marostegui.json
[08:38:00] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:38:19] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:38:43] (PS1) JMeybohm: rdf-streaming-updater: Remove duplicate definition of k8s api-servers [deployment-charts] - https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491)
[08:40:18] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:40:36] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:42:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:45:47] (PS1) JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491)
[08:48:07] !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505]: Regular analytics weekly train [analytics/refinery@88ed505e]
[08:48:43] (CR) Muehlenhoff: [C:+2] Switch db1192 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031808 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:49:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62406 and previous config saved to /var/cache/conftool/dbconfig/20240515-084950-ladsgroup.json
[08:52:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62407 and previous config saved to /var/cache/conftool/dbconfig/20240515-085224-marostegui.json
[08:52:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:52:29] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:52:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1192.eqiad.wmnet
[08:52:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:52:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62408 and previous config saved to /var/cache/conftool/dbconfig/20240515-085247-marostegui.json
[08:58:39] (PS1) Muehlenhoff: profile::kafka::broker: Drop support for non PKI configs [puppet] - https://gerrit.wikimedia.org/r/1031813
[08:58:44] (PS1) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[09:00:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on seaborgium.wikimedia.org with reason: OS update
[09:01:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on seaborgium.wikimedia.org with reason: OS update
[09:01:15] SRE, Infrastructure-Foundations, LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9798634 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=be009031-0cc0-4a4d-97a0-f4d990831efe) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with...
[09:02:30] (PS1) JMeybohm: Remove kubestagetcd200[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307)
[09:02:38] (CR) CI reject: [V:-1] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[09:02:46] (PS1) Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708)
[09:02:48] !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505]: Regular analytics weekly train [analytics/refinery@88ed505e] (duration: 14m 41s)
[09:03:51] !log upgrade seaborgium to bullseye T364823
[09:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:58] T364823: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823
[09:04:44] (CR) JMeybohm: [V:+1 C:+2] "In my tests a couple of seconds (in staging). It's been a narrow race there and publish-sa-certs failed on 1 of 3 new masters." [puppet] - https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[09:04:57] !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505] (thin): Regular analytics weekly train THIN [analytics/refinery@88ed505e]
[09:04:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62409 and previous config saved to /var/cache/conftool/dbconfig/20240515-090458-ladsgroup.json
[09:05:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:05:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:05:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:05:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62410 and previous config saved to /var/cache/conftool/dbconfig/20240515-090522-ladsgroup.json
[09:07:59] (PS2) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[09:09:15] !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505] (thin): Regular analytics weekly train THIN [analytics/refinery@88ed505e] (duration: 04m 17s)
[09:09:19] (PS1) Fabfur: benthos:cache: removed unused fields from grok pattern [puppet] - https://gerrit.wikimedia.org/r/1031818 (https://phabricator.wikimedia.org/T358109)
[09:10:55] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.copy (exit_code=97) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:11:07] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubestagemaster200[12].codfw.wmnet
[09:13:33] !log arnaudb@cumin1002 START -
Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [09:13:51] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [09:14:07] (03PS3) 10Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) [09:14:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [09:19:41] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [09:20:21] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.copy (exit_code=97) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [09:20:22] (03CR) 10Ladsgroup: [C:03+1] "I think this is fine to merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [09:22:02] !log Starting MediaModeration script on group2 wikis for a test [09:22:03] (03PS1) 10Mvolz: Update Zotero to 2024-04-30-130428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880) [09:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff) [09:23:02] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:23:51] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:25:07] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [09:25:42] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [09:26:59] MediaWiki trains looks good [09:28:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [09:28:51] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:29:21] (03PS1) 10Effie Mouzeli: flink-kubernetes-operator: fix typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 [09:32:15] (03PS2) 10Zabe: filtered_tables: Remove gu_salt [puppet] - 10https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) [09:32:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [09:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:51] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:12] (03CR) 10Ladsgroup: "I'd say drop it in prod first, don't worry about this here, it won't break things AFAIK" [puppet] - 10https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: 10Zabe) [09:35:44] (03CR) 10DCausse: [C:03+1] flink-kubernetes-operator: fix typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 (owner: 10Effie Mouzeli) [09:36:25] (03CR) 10JMeybohm: [C:03+1] flink-kubernetes-operator: fix typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 (owner: 10Effie Mouzeli) [09:36:38] (03CR) 10Effie Mouzeli: [C:03+2] flink-kubernetes-operator: fix typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 (owner: 10Effie Mouzeli) [09:36:46] (03CR) 10JMeybohm: [C:04-1] "Chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 (owner: 10Effie Mouzeli) [09:37:53] (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.3 
[puppet] - 10https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656) [09:38:02] FIRING: [4x] ProbeDown: Service kubestagemaster2001:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:34] (03PS1) 10Effie Mouzeli: flink-kubernetes-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031823 [09:39:33] (03Merged) 10jenkins-bot: flink-kubernetes-operator: fix typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031821 (owner: 10Effie Mouzeli) [09:40:13] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2451/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:40:26] !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88ed505e] [09:41:48] 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9798753 (10MoritzMuehlenhoff) 05Open→03Resolved Both production LDAP r/w servers have been migrated to Bullseye. 
[09:42:31] (03CR) 10Effie Mouzeli: [C:03+2] flink-kubernetes-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031823 (owner: 10Effie Mouzeli) [09:42:49] (03PS1) 10DCausse: rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031824 [09:42:51] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: bump exporter version to v1.0.3 [puppet] - 10https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:43:14] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster[2001-2002].codfw.wmnet [09:43:19] !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88ed505e] (duration: 02m 53s) [09:43:56] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/1031451 (owner: 10Muehlenhoff) [09:44:55] (03Merged) 10jenkins-bot: flink-kubernetes-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031823 (owner: 10Effie Mouzeli) [09:47:20] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:47:33] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[09:48:51] FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:20] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031824 (owner: 10DCausse) [09:49:45] !log Manually relaunching mediawiki_job_update_special_pages_s5.service [09:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:59] (03Merged) 10jenkins-bot: rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031824 (owner: 10DCausse) [09:50:03] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:50:30] (03PS1) 10JMeybohm: Decom kubestagemaster200[12] [puppet] - 10https://gerrit.wikimedia.org/r/1031825 (https://phabricator.wikimedia.org/T363307) [09:52:32] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:53:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:53:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:53:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster[2001-2002].codfw.wmnet [09:53:54] (03CR) 10JMeybohm: [C:03+2] Decom kubestagemaster200[12] [puppet] - 10https://gerrit.wikimedia.org/r/1031825 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:54:11] !log 
dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:54:44] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031826 [09:54:47] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:57:06] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:57:13] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:58:43] (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) [09:59:20] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:59:30] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1000) [10:00:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [10:02:13] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [10:02:24] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [10:03:31] (03CR) 10Vgutierrez: [C:04-2] "do not merge till tcp-mss-clamper is ready" [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [10:06:00] !log btullis@deploy1002 Started deploy [airflow-dags/analytics_test@ecf603d]: (no justification provided) [10:06:11] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_test@ecf603d]: (no justification provided) (duration: 00m 11s) [10:06:23] !log 
btullis@deploy1002 Started deploy [airflow-dags/analytics@ecf603d]: (no justification provided) [10:06:45] (03CR) 10Fabfur: [C:03+1] "+1 for me" [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [10:06:54] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@ecf603d]: (no justification provided) (duration: 00m 30s) [10:08:42] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798853 (10phaultfinder) [10:09:34] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:09:51] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:09:51] (03PS2) 10JMeybohm: Remove kubestagetcd200[123] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307) [10:12:40] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798897 (10phaultfinder) [10:15:12] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:15:31] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:20:04] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:21:15] (03PS2) 10DCausse: rdf-streaming-updater: Remove duplicate definition of k8s and zk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [10:22:10] (03CR) 10DCausse: [C:03+1] "testing on staging showed that these policies are indeed no longer needed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [10:26:47] (03PS5) 10Slyngshede: Build Bitu contain image using Blubber. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) [10:27:00] (03CR) 10Ladsgroup: configure parsercache servers via dbconfig in etcd (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (owner: 10Scott French) [10:27:30] (03CR) 10Slyngshede: "This still needs to be hooked up to the build pipeline, but that will happen in another CR." [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [10:27:51] (03PS1) 10Alexandros Kosiaris: preseed for kafka-main10(0[6789]|10) [puppet] - 10https://gerrit.wikimedia.org/r/1031832 (https://phabricator.wikimedia.org/T363212) [10:28:00] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:28:25] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:29:00] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad [10:31:12] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device cloudsw1-e4-eqiad [10:32:10] (03PS1) 10Zabe: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) [10:32:30] (03PS1) 10Zabe: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) [10:32:32] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:32:43] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:32:47] (03CR) 10Alexandros Kosiaris: [C:03+2] preseed for kafka-main10(0[6789]|10) [puppet] - 10https://gerrit.wikimedia.org/r/1031832 (https://phabricator.wikimedia.org/T363212) (owner: 10Alexandros Kosiaris) [10:32:58] (03CR) 10Slyngshede: [C:03+2] Configuration for disabling signup. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1030891 (owner: 10Slyngshede) [10:34:20] (03Merged) 10jenkins-bot: Configuration for disabling signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/1030891 (owner: 10Slyngshede) [10:34:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9799039 (10akosiaris) >>! In T363212#9797805, @Jclark-ctr wrote: > @akosiaris could you please update preseed.yaml file? Done. Note t... [10:37:51] jouncebot: now [10:37:51] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1000) [10:38:20] zabe, Amir1: should we just deploy the FlaggedRevs fix now? (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1031484 and wmf.5) [10:39:02] sure [10:40:31] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:40:45] go ahead and thanks for the fix [10:40:47] (03CR) 10Zabe: [C:03+2] Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:40:49] just checking if I can reproduce the issue at the moment [10:40:51] (03CR) 10Zabe: [C:03+2] Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:40:58] yup [10:41:05] Lucas_WMDE: Thanks for the fix <3 [10:41:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:41:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:41:17] np :) [10:44:44] (03PS2) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) [10:44:57] (03CR) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE)) [10:45:56] (03PS1) 10Slyngshede: P:ganeti Prometheus monitoring of ganeti services. 
[puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) [10:47:23] (03CR) 10Clément Goubert: [C:03+1] Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE)) [10:49:36] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:49:39] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:50:04] (03Merged) 10jenkins-bot: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:50:06] (03Merged) 10jenkins-bot: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: 10Zabe) [10:50:40] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1031484|Fix capitalization of Subquery (T364974)]], [[gerrit:1031483|Fix capitalization of Subquery (T364974)]] [10:50:44] T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974 [10:52:09] FIRING: HelmReleaseBadStatus: Helm release flink-operator/flink-operator on k8s-staging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=flink-operator - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:52:41] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [10:52:57] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Move WMCS servers to 1 single NIC - 
https://phabricator.wikimedia.org/T319184#9799115 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [10:53:21] !log lucaswerkmeister-wmde@deploy1002 zabe and lucaswerkmeister-wmde: Backport for [[gerrit:1031484|Fix capitalization of Subquery (T364974)]], [[gerrit:1031483|Fix capitalization of Subquery (T364974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:53:33] (03CR) 10Muehlenhoff: [C:03+2] Zookeeper: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff) [10:53:49] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [10:53:51] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:53:55] oop, different error [10:54:33] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [10:54:36] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:54:49] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1041: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1031836 (https://phabricator.wikimedia.org/T319184) [10:55:00] pasted the error at https://phabricator.wikimedia.org/T364974#9799122 [10:56:43] at this point I’m tempted to say let’s just revert the SQB migration there [10:56:45] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1042: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1031839 (https://phabricator.wikimedia.org/T319184) [10:59:07] reverting at least in wmf.5 seems reasonable to me [10:59:08] (03PS1) 10Muehlenhoff: an-test-druid: Use firewall::service for Zookeeper firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1031842 [10:59:16] i posted an untested patch to fix that specific error [10:59:31] I’m about to test a 
very similar patch [10:59:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031842 (owner: 10Muehlenhoff) [11:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1100). [11:00:39] I’m still deploying, sorry mvolz :/ [11:00:53] okay, now the script worked [11:01:46] taavi: do you prefer ...$timeCondition or ->andWhere( $timeCondition )? [11:01:57] in either case I’d say let’s deploy that rather than revert [11:02:12] since it then seems to work [11:02:54] Lucas_WMDE: I think I slightly prefer andWhere() since that'd work even if $timeCondition would be changed to be something else than an array [11:03:01] yeah, makes sense [11:03:26] !log lucaswerkmeister-wmde@deploy1002 Sync cancelled. [11:03:43] (03PS1) 10Lucas Werkmeister (WMDE): backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974) [11:03:54] (03PS1) 10Lucas Werkmeister (WMDE): backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974) [11:04:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE)) [11:04:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE)) [11:05:36] !log aborrero@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
cloudvirt1041.eqiad.wmnet with OS bookworm [11:05:53] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with... [11:06:00] and what lesson do we learn from this? FlaggedRevs is cursed, avoid it like the plague [11:06:18] (or slightly more seriously, FlaggedRevs is severely undertested, so be extra careful when making changes to it) [11:06:40] hi mvolz! I’m deploying in your window because a backport took longer than expected, sorry :/ [11:06:58] Lucas_WMDE: no worries [11:07:34] if you'd ping when you're done that'd be great [11:07:41] sure, can do [11:08:49] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2452/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:09:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1193.eqiad.wmnet [11:09:56] heh, the backport will merge before the master change does, because the master change is chained behind a core change which has slower CI [11:10:50] (03PS1) 10Muehlenhoff: Switch db1193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031843 (https://phabricator.wikimedia.org/T349619) [11:10:59] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [11:11:11] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799171 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet... 
[11:12:37] (03PS1) 10Clément Goubert: mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) [11:12:38] (03Merged) 10jenkins-bot: backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE)) [11:12:40] (03Merged) 10jenkins-bot: backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE)) [11:13:13] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]] [11:13:18] T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974 [11:14:45] (03CR) 10Muehlenhoff: [C:03+2] Switch db1193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031843 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:15:53] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:16:02] testing the script on mwdebug2002 again… [11:16:14] works [11:16:16] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [11:16:50] (03PS2) 10Slyngshede: P:ganeti Prometheus monitoring of ganeti services. 
[puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) [11:18:12] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2454/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:18:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1193.eqiad.wmnet [11:24:50] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:28:02] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1041: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1031836 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:28:49] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]] (duration: 15m 36s) [11:28:53] T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974 [11:28:56] * Lucas_WMDE done deploying [11:28:58] mvolz: all yours :) [11:29:05] ty [11:29:11] claime: want to try starting the service again? 
(I doubt I have permission to do it ^^) [11:29:15] (03CR) 10Mvolz: [C:03+2] Update Zotero to 2024-04-30-130428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880) (owner: 10Mvolz) [11:29:21] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: openldap::rw [11:30:03] Lucas_WMDE: Sure, I'll find a broken section with a reasonable runtime [11:30:05] (03Merged) 10jenkins-bot: Update Zotero to 2024-04-30-130428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880) (owner: 10Mvolz) [11:30:16] hehe [11:30:28] (03PS1) 10Muehlenhoff: Switch openldap::rw to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031868 (https://phabricator.wikimedia.org/T349619) [11:31:03] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:31:25] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:31:27] * Lucas_WMDE afk for lunch [11:31:35] Which is none of them x) [11:31:59] Well I'll relaunch s5, it seems to be the shortest [11:32:44] The rest may have to wait until the next scheduled run during the night because I don't want to hammer the dbs [11:32:44] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:33:14] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:33:49] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:34:03] yeah, sounds fair [11:34:21] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:39:10] (03CR) 10Muehlenhoff: [C:03+2] Switch openldap::rw to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031868 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:42:32] (03PS4) 10Jsn.sherman: extension-list: Add AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 
(https://phabricator.wikimedia.org/T364034) [11:42:32] (03PS4) 10Jsn.sherman: InitialiseSettings.php: Add wmgUseAutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) [11:42:33] (03PS4) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) [11:42:33] (03PS5) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) [11:44:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: openldap::rw [11:45:54] 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984 (10aborrero) 03NEW [11:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:11] 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799282 (10aborrero) [11:50:25] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1031872 [11:50:58] PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2024-05-15 10:57:54 is 871 GiB, but the previous one was 1058 GiB, a change of -17.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:52:35] !log aborrero@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1041.eqiad.wmnet with OS bookworm [11:52:49] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 
single NIC - https://phabricator.wikimedia.org/T319184#9799293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with... [11:58:07] 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799298 (10aborrero) p:05Triage→03Medium hey @Jclark-ctr or @Jhancock.wm could you please advice / help with this server? thanks in advance. [11:58:19] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1031872 (owner: 10Muehlenhoff) [11:59:11] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9799310 (10MoritzMuehlenhoff) [12:02:40] (03PS15) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [12:03:10] 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799319 (10aborrero) additional information: when reimaging the server, the debian installer failed, complaining about the volume group name being in use already. To try to workarou... [12:04:23] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff) [12:05:21] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "Double-checked package names." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff) [12:05:58] (03CR) 10Filippo Giunchedi: [C:03+1] profile::kafka::broker: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff) [12:06:07] (03PS1) 10Clément Goubert: httpbb: Add tests for new redirects [puppet] - 10https://gerrit.wikimedia.org/r/1031874 (https://phabricator.wikimedia.org/T25216) [12:08:52] PROBLEM - Etcd cluster health on kubestagetcd2001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [12:08:52] PROBLEM - Etcd cluster health on kubestagetcd2003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [12:08:52] PROBLEM - Etcd cluster health on kubestagetcd2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [12:10:10] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff) [12:10:17] (03CR) 10Brouberol: [C:03+1] "Look good! 
Nice work :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:10:27] (03PS1) 10Muehlenhoff: Undeploy openldap prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1031875 [12:11:08] Lucas_WMDE: Well it didn't crash on dewiki this time so I'm inclined to call this solved [12:11:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [12:13:02] FIRING: [2x] JobUnavailable: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:23] (03PS2) 10Muehlenhoff: Undeploy openldap prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1031875 [12:14:43] (03CR) 10Hnowlan: [C:03+1] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [12:16:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [12:16:43] (03PS1) 10Majavah: P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) [12:18:12] (03CR) 10CI reject: [V:04-1] P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [12:18:44] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 5 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [12:19:10] !log jmm@cumin2002 START - Cookbook 
sre.puppet.migrate-host for host db1203.eqiad.wmnet [12:19:49] (03CR) 10Majavah: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [12:20:10] (03PS1) 10Muehlenhoff: Switch db1203 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031881 (https://phabricator.wikimedia.org/T349619) [12:20:13] (03PS1) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) [12:20:23] (03CR) 10Krinkle: "I believe, if the Varnish approach works out, this would not be needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024932 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [12:21:27] (03PS3) 10Phuedx: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:21:33] (03CR) 10JMeybohm: [C:03+2] Remove kubestagetcd200[123] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:22:08] (03CR) 10CI reject: [V:04-1] Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:23:05] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestagetcd[2001-2003].codfw.wmnet with reason: decom [12:23:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestagetcd[2001-2003].codfw.wmnet with reason: decom [12:25:12] (03CR) 10Phuedx: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:25:20] claime: nice \o/ [12:26:47] (03PS2) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 
10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) [12:28:19] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2457/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:30:26] (03CR) 10Phuedx: "The latest PS should have the effect that you want :) To test this locally, run composer buildConfigCache and check values in tests/data/c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:31:27] (03CR) 10Majavah: [C:03+1] httpbb: Add tests for new redirects [puppet] - 10https://gerrit.wikimedia.org/r/1031874 (https://phabricator.wikimedia.org/T25216) (owner: 10Clément Goubert) [12:37:38] (03PS3) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) [12:38:47] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:45:43] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [12:46:32] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagetcd[2001-2003].codfw.wmnet [12:51:32] (03PS1) 10DCausse: cirrus-streaming-updater: remove zk network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031892 (https://phabricator.wikimedia.org/T287491) [12:51:36] (03CR) 10Elukey: [C:03+1] kubernetes::master: Make etcd_urls optional [puppet] - 
10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:52:50] (03PS2) 10DCausse: cirrus-streaming-updater: remove zk network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031892 (https://phabricator.wikimedia.org/T287491) [12:53:58] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [12:54:00] (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [12:56:25] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:57:53] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:57:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [12:57:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagetcd[2001-2003].codfw.wmnet [12:58:26] (03CR) 10Gmodena: [C:03+1] "Ack on naming - LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1300). [13:00:04] JSherman and Jdrewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] (03PS2) 10JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491) [13:00:39] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:00:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:00:45] Roan has agreed to pair with me on my patches [13:01:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:01:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye [13:02:09] RESOLVED: HelmReleaseBadStatus: Helm release flink-operator/flink-operator on k8s-staging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=flink-operator - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:02:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye [13:02:29] JSherman: I was looking at your patches too [13:02:37] 0/ [13:02:38] (but happy to let RoanKattouw take the lead and deploy) [13:02:53] some of the “how to new extension” docs on wikitech seem quite outdated :S [13:03:05] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:03:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:03:30] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:04:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye [13:06:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [13:07:13] !log uploaded golang-github-florianl-go-tc 0.4.4-0.20240511074908-d584238bf6cb to apt.wm.o (bookworm-wikimedia) [13:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] (03Merged) 10jenkins-bot: extension-list: Add AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [13:09:39] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:09:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:09:51] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]] [13:09:56] T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034 [13:10:39] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:10:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:11:12] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:11:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:11:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:11:41] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:12:22] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:12:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:14:58] (03CR) 10Hashar: [C:03+1] Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE)) [13:15:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:15:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:16:09] (03PS1) 10Muehlenhoff: Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) [13:16:37] (03CR) 10Elukey: [C:03+1] profile::kafka::broker: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff) [13:16:46] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:16:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:17:40] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:17:40] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.reboot_sanitaria (exit_code=99) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:18:02] FIRING: [2x] JobUnavailable: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:37] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:18:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:19:14] (03PS1) 10Brouberol: admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010) [13:19:23] (03PS3) 10JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491) [13:19:23] (03PS1) 10JMeybohm: Remove kubernetesMasters definition from staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031901 (https://phabricator.wikimedia.org/T287491) [13:21:36] (03PS2) 10Muehlenhoff: Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) [13:22:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [13:23:02] (03CR) 10JMeybohm: [C:03+2] Remove kubernetesMasters definition from staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031901 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [13:24:13] JSherman: Hi, let me know when you're done deploying your patches [13:24:57] jan_drewniak: will do; still 
waiting on the extension list build steps [13:25:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [13:25:40] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:25:41] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.reboot_sanitaria (exit_code=99) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:26:14] JSherman: ok, gotcha [13:26:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:26:23] (03CR) 10Ssingh: "Looks good, thanks! One minor nit in-line; feel free to fix later/ignore." [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:26:29] (03CR) 10Effie Mouzeli: [C:03+1] admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010) (owner: 10Brouberol) [13:26:42] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.reboot_sanitaria (exit_code=97) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:27:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:27:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:27:27] !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts. [13:27:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:29:43] (03PS1) 10Cathal Mooney: Enable gNMI / gRPC on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) [13:30:53] the docker_pull_k8s step has been taking forever, but we're at 85% now [13:31:01] (03CR) 10Hashar: [C:03+1] Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) (owner: 10Muehlenhoff) [13:31:02] (03CR) 10Brouberol: [C:03+2] admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010) (owner: 10Brouberol) [13:31:54] (03CR) 10Muehlenhoff: [C:03+2] Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) (owner: 10Muehlenhoff) [13:32:08] (03PS1) 10Cathal Mooney: Add cloudsw to list of roles we enable gnmic telemtry for [puppet] - 10https://gerrit.wikimedia.org/r/1031905 (https://phabricator.wikimedia.org/T365012) [13:32:35] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:32:46] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:19] (03CR) 10Filippo Giunchedi: [C:03+1] P:ganeti Prometheus monitoring of ganeti services. 
[puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:34:21] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe1001.eqiad.wmnet [13:34:44] (03CR) 10Muehlenhoff: [C:03+2] Switch db1203 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031881 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:34:46] !log depool thanos-fe1001 and move envoy to PKI TLS cert [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:06] (03CR) 10Elukey: [C:03+2] Move Swift on thanos-fe1001 to PKI TLS cert (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:37:16] (03PS4) 10Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) [13:38:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:39:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1203.eqiad.wmnet [13:40:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1209.eqiad.wmnet [13:40:27] we're on scap-cdb-rebuild [13:40:31] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe1001.eqiad.wmnet [13:40:38] (03CR) 10Eevans: [C:03+2] echostore: update cluster hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [13:41:19] !log installing libpgjava security updates [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:29] (03Merged) 10jenkins-bot: echostore: update cluster hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [13:41:48] (03CR) 10Ayounsi: [C:04-1] "I was wondering 
why I didn't do it sooner, but remembered why." [homer/public] - 10https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) (owner: 10Cathal Mooney) [13:42:00] brouberol: o/ saw https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1030175 passing by, this can move to the new awesome calico netpolicies right? [13:42:09] (03CR) 10Ssingh: [C:03+1] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:42:21] (03PS1) 10Muehlenhoff: Switch db1209 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031927 (https://phabricator.wikimedia.org/T349619) [13:42:26] !log jsn@deploy1002 jsn: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:29] T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034 [13:42:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:43:00] elukey indeed this is a prime candidate [13:43:11] (03CR) 10Muehlenhoff: [C:03+2] Switch db1209 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031927 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:43:17] but we need to expose the restbase host IPs to k8s via puppet's global_config manifest first [13:43:38] !log jsn@deploy1002 jsn: Continuing with sync [13:44:01] oh wait, I'm seeing port 9042, so that smells like cassandra, which is already exposed [13:44:28] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [13:44:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 93 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:44:32] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:44:34] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:44:42] brouberol: it is cassandra yes [13:44:48] root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get svc -n external-services | grep rest [13:44:48] cassandra-restbase-a-codfw ClusterIP None 9042/TCP 16d [13:44:48] cassandra-restbase-a-eqiad ClusterIP None 9042/TCP 16d [13:44:48] cassandra-restbase-b-codfw ClusterIP None 9042/TCP 16d [13:44:48] cassandra-restbase-b-eqiad ClusterIP None 9042/TCP 16d [13:44:49] cassandra-restbase-c-codfw ClusterIP None 9042/TCP 16d [13:44:49] cassandra-restbase-c-eqiad ClusterIP None 9042/TCP 16d [13:44:52] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:44:58] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:45:27] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [13:45:49] so, theoretically, all you have to do is specify external_services.cassandra: [cassandra-restbase-a-codfw, cassandra-restbase-a-eqiad, cassandra-restbase-b-codfw, cassandra-restbase-b-eqiad, cassandra-restbase-c-codfw, cassandra-restbase-c-eqiad] [13:46:02] (03CR) 10TChin: [C:03+2] "Self-merging now :) Anything else we can fix later" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:46:09] (with a nested dict and not a dot, but I formatted it this way because IRC) [13:46:21] brouberol: ack thanks! 
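A minimal sketch of the "nested dict and not a dot" remark above, assuming the chart values are a plain nested mapping. Only the key name `external_services.cassandra` and the six service names come from the log. The helper `nest` and the variable names are hypothetical, and where this block belongs in a values file is not stated in the discussion.

```python
def nest(dotted_key, value):
    """Expand a dotted key such as 'a.b' into the nested dict {'a': {'b': value}}."""
    result = value
    # Wrap the value from the innermost key outwards.
    for part in reversed(dotted_key.split(".")):
        result = {part: result}
    return result


# Service names as listed by `kubectl get svc -n external-services` in the log.
cassandra_services = [
    "cassandra-restbase-a-codfw",
    "cassandra-restbase-a-eqiad",
    "cassandra-restbase-b-codfw",
    "cassandra-restbase-b-eqiad",
    "cassandra-restbase-c-codfw",
    "cassandra-restbase-c-eqiad",
]

# The dotted form was IRC shorthand; the values structure is nested.
values = nest("external_services.cassandra", cassandra_services)
print(values)
```

The resulting structure has a top-level `external_services` key whose `cassandra` entry holds the list, which is the shape the dotted shorthand stands for.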
[13:46:28] that diff will be nice :D [13:46:32] yw [13:46:45] (03CR) 10Elukey: "Same comment that Hugh made - let's use the new external-services-networkpolicies, so we can drop all IPs and let puppet populate the rela" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [13:46:47] (03Merged) 10jenkins-bot: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:48:26] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply [13:48:27] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms [13:48:29] PROBLEM - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:29] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-Y on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:29] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:32] (03CR) 10Cathal Mooney: Enable gNMI / gRPC on cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) (owner: 10Cathal Mooney) [13:48:35] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:48:35] RECOVERY - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-A-phase-X 374 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:35] RECOVERY - 
ps1-c2-codfw-infeed-load-tower-B-phase-Y on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-Y 282 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:35] RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-X 277 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:37] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [13:49:01] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:49:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:49:10] hmmm [13:49:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS bullseye [13:49:31] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 75 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:49:42] (03CR) 10TChin: [C:03+2] Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:49:50] (03CR) 10CI reject: [V:04-1] Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:49:54] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply 
[13:50:04] (03PS11) 10TChin: Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [13:50:44] (03CR) 10TChin: [V:03+2 C:03+2] Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:51:25] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply [13:51:33] (03Merged) 10jenkins-bot: Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:52:06] elukey: thanks for pointing this out. I'm really glad to see that this is getting traction :) [13:52:14] (03CR) 10Muehlenhoff: "PCC output is outdated in terms of exported hosts, seaborgium was reimaged to Bookworm (like serpens yesterday( to Bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [13:52:17] (03CR) 10Hashar: [C:04-1] "The two existing entries are there for legacy reasons. 
That was for mobile repositories which once were hosted on Gerrit and got migrated " [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: 10Addshore) [13:52:37] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [13:53:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1209.eqiad.wmnet [13:54:42] (03CR) 10Brouberol: "As I mentioned it to Elukey on IRC, this would entail having the following block in your values (wherever it makes sense)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [13:54:45] !log installing nghttp2 security updates [13:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1211.eqiad.wmnet [13:57:40] (03CR) 10Eevans: [C:03+2] "Sorry, I only just noticed your comment. I would definitely like to learn more!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [13:57:58] (03PS1) 10Muehlenhoff: Switch db1211 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031929 (https://phabricator.wikimedia.org/T349619) [13:58:46] (03CR) 10Muehlenhoff: [C:03+2] Switch db1211 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031929 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1400) [14:00:23] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [14:00:33] (03CR) 10Slyngshede: [C:03+1] "LGTM. No alerts appear to be based on the exporter. Didn't check dashboard though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [14:00:38] jouncebot: nowandnext [14:00:38] For the next 0 hour(s) and 59 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1400) [14:00:38] In 2 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700) [14:01:00] !log disable puppet on A:lvs before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031814 - T357257 [14:01:01] we're running over on backport [14:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:04] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:01:32] My first patch is nearly done, but it's been running for 50 minutes [14:01:35] JSherman: I guess it’s rebuilding the full l10n cache due to the new extension? [14:01:36] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]] (duration: 51m 44s) [14:01:36] We synced an extension-list change adding a new extension, which caused the i18n caches to be rebuilt, and deploying that apparently takes 50 minutes and counting [14:01:39] T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034 [14:01:44] jinx [14:01:44] Lucas_WMDE: Yes exactly [14:01:46] :/ [14:02:05] (03CR) 10Vgutierrez: [C:03+2] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:02:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [14:02:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye [14:02:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host 
kafka-main2009.codfw.wmnet with OS bullseye [14:02:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [14:02:35] hnowlan: are you here for the wikifunction services deploy? I was hoping to finish this extension deploy [14:02:47] JSherman: nah I'd like to deploy restbase, but there's no rush [14:02:57] I was also hoping to backport my changes :/ [14:02:59] take your time [14:03:11] Hopefully the next few changes will be faster [14:03:13] (03CR) 10Cathal Mooney: [C:04-1] "We can't do this yet as the devices in racks c8 and d5 eqiad do not fully support enabling gNMI via mgmt routing instance. Need to upgrad" [puppet] - 10https://gerrit.wikimedia.org/r/1031905 (https://phabricator.wikimedia.org/T365012) (owner: 10Cathal Mooney) [14:03:18] hnowlan: thanks! [14:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1211.eqiad.wmnet [14:04:23] (03CR) 10Muehlenhoff: P:ganeti Prometheus monitoring of ganeti services. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:05:37] (03CR) 10Muehlenhoff: "We have https://grafana.wikimedia.org/d/DnxQ26qmk/ldap?orgId=1, I'll delete it once the patch is deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [14:05:38] (03Merged) 10jenkins-bot: InitialiseSettings.php: Add wmgUseAutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:05:49] jan_drewniak: I can deploy for you after I get through this; I have RoanKattouw pairing with me for training. 
[14:06:03] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1214.eqiad.wmnet [14:06:06] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]] [14:06:38] JSherman: yeah sure, that'd be great. don't worry about it taking so long (these things usually do) [14:07:29] (03CR) 10Bking: [C:03+1] cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [14:07:50] at least mine can all be done with one command now :) `scap backport 1031477 1031479 1031478` [14:07:57] (03PS1) 10Muehlenhoff: Switch db1214 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031930 (https://phabricator.wikimedia.org/T349619) [14:08:23] JSherman: for future deployments, you could probably have backported 1026973 1026974 and 1026975 together in one command [14:08:36] (03CR) 10Hashar: [C:03+1] "An alternative is to use the hostname in the configuration file and when `profile::ci::manager_host` changes on one of the hosts, restart " [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:08:50] Oh you haven't started with those yet [14:08:59] (03PS1) 10Btullis: Remove kubernetesMasters definition from dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031593 (https://phabricator.wikimedia.org/T287491) [14:09:56] !log jsn@deploy1002 jsn: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:57] That would avoid 3 image rebuild and pulls and make it only one [14:09:58] !log re-enable puppet on A:lvs - T357257 [14:10:00] T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034 [14:10:00] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of 
db2114.codfw.wmnet onto db1125.eqiad.wmnet [14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:10:25] !log jsn@deploy1002 jsn: Continuing with sync [14:11:28] (03PS1) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [14:11:49] (03CR) 10Muehlenhoff: [C:03+2] Switch db1214 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031930 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:11:50] claime: I was just planning on doing a git pull for the labs-only changes [14:11:55] fair enough [14:14:12] (03CR) 10JHathaway: [C:03+1] mx: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [14:14:31] claime: Thanks for the tip, I forgot that scap backport could do that [14:14:41] I think `scap backport` does that anyway if the changes only touch *-labs.php [14:14:51] Oh really, is it that smart? 
[14:14:53] (“does that” = only pulling) [14:15:01] yeah [14:15:24] I wasn’t initially fond of it, since it does mean outdated code on the other servers (even if it’s only files that should™ never be used) [14:15:29] but it’s at least a timesaver [14:15:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1214.eqiad.wmnet [14:15:51] Lucas_WMDE: ah thanks, I'll backport the remaining patches together [14:16:10] (if you backport all three it’ll still be a full deploy because of InitialiseSettings.php, but that’s fine) [14:16:23] We're already running the InitialiseSettings.php now [14:16:29] ok [14:16:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:17:21] !log uploaded tcp-mss-clamper 0.5.1 to bullseye-wikimedia (apt.wm.o) - T357257 [14:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:24] Damn I keep learning about new `scap backport` smartness every time I do a deploy [14:17:26] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:17:31] Lucas_WMDE: after this, can I backport my remaining two changes + jan's in one swoop? 
[14:18:02] I’d just to your two remaining changes together [14:18:11] then you can see that (I think) scap backport will do the smart thing ^^ [14:18:15] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [14:18:25] and then maybe jan’s three changes together [14:18:53] ack [14:19:02] (oops, “I’d just do” → “to” ^^) [14:19:22] (03CR) 10Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:19:31] (03CR) 10Muehlenhoff: [C:03+2] mx: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [14:19:35] (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) [14:19:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [14:20:15] !log fab@deploy1002 Started deploy [airflow-dags/research@ecf603d]: (no justification provided) [14:20:48] !log fab@deploy1002 Finished deploy [airflow-dags/research@ecf603d]: (no justification provided) (duration: 00m 32s) [14:20:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [14:22:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [14:22:12] Lucas_WMDE: would +2ing jan's patches ahead of time save time for scap? 
[14:22:21] yes, that’s a good idea [14:22:27] Vector CI probably takes a while [14:22:34] (03PS2) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [14:22:48] (03CR) 10Jsn.sherman: [C:03+2] [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [14:22:51] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]] (duration: 16m 44s) [14:22:55] T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034 [14:23:14] (03CR) 10Jsn.sherman: [C:03+2] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:23:40] (03CR) 10Jsn.sherman: [C:03+2] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:24:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [14:24:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:24:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:24:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [14:25:17] (03Merged) 10jenkins-bot: 
InitialiseSettings-labs.php: Deploy AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:25:21] (03Merged) 10jenkins-bot: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:25:59] (03PS1) 10TChin: Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) [14:26:05] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [14:26:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [14:27:18] okay, we're merged, going to do jan_drewniak: patches now [14:27:22] (03PS1) 10Vgutierrez: "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) [14:27:41] (03PS2) 10Vgutierrez: "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) [14:27:51] (03PS3) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) [14:27:55] (03PS4) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) [14:28:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [14:28:08] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:28:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:28:44] zuul says we have about a 15 minute wait [14:28:51] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:02] (03CR) 10Ssingh: [C:03+1] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:32:28] (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:32:41] !log depool upload@ulsfo before enabling IPIP encapsulation - T357257 [14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:45] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:33:25] (03PS1) 10Muehlenhoff: Revert "mx: Stop ignoring errors from alias sync" [puppet] - 10https://gerrit.wikimedia.org/r/1031942 (https://phabricator.wikimedia.org/T284145) [14:36:55] (03CR) 10Muehlenhoff: [C:03+2] Revert "mx: Stop ignoring errors from alias sync" [puppet] - 10https://gerrit.wikimedia.org/r/1031942 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [14:37:24] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) [14:37:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:38:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:26] !log Removing downtime on mw2286.codfw.wmnet - T364863 [14:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:30] T364863: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863 [14:38:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2286.codfw.wmnet [14:38:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2286.codfw.wmnet [14:39:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:39:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2007.codfw.wmnet with OS bullseye [14:39:37] !log Repooling mw2286.codfw.wmnet - T364863 [14:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:41:19] (03CR) 10Gmodena: [C:03+1] Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:41:50] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:41:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS bullseye [14:41:56] (03CR) 10Gmodena: [C:04-1] Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:42:17] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:43:02] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:43:36] (03CR) 10Gmodena: "nit: deployment-charts is a multi-project repo. Could you specify the subsystem you are touching in your commit message?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:43:51] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:44:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS bullseye [14:45:19] (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) [14:45:46] (03PS1) 10Muehlenhoff: Remove obsolete wmflabs certs [puppet] - 10https://gerrit.wikimedia.org/r/1031947 [14:45:55] (03Merged) 10jenkins-bot: [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [14:46:33] (03Merged) 10jenkins-bot: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:46:51] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:47:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 75 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:47:52] (03Merged) 10jenkins-bot: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [14:48:27] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]] [14:48:33] T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861 [14:48:34] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [14:48:43] (03PS2) 10TChin: datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) [14:48:55] (03CR) 10Majavah: [C:03+1] Remove obsolete wmflabs certs [puppet] - 10https://gerrit.wikimedia.org/r/1031947 (owner: 10Muehlenhoff) [14:48:59] (03CR) 10TChin: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:50:07] (03CR) 10Majavah: [C:03+1] Remove obsolete wmflabs certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031947 (owner: 10Muehlenhoff) [14:51:08] !log jsn@deploy1002 jsn and jdlrobson: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:51:41] !log disable puppet on A:lvs before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031827- T357257 [14:51:46] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:46] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:52:04] (03PS3) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [14:52:21] jan_drewniak: your patches are ready for testing [14:52:32] JSherman: ok one sec [14:53:07] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:53:46] JSherman: ok, good to sync [14:53:55] syncing [14:53:56] !log jsn@deploy1002 jsn and jdlrobson: Continuing with sync [14:55:38] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [14:57:40] !log re-enable puppet on A:lvs - T357257 [14:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:58:29] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:58:51] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:10] (03CR) 10Gmodena: [C:03+1] datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:05:39] !log jhancock@cumin2002 END (FAIL) - 
Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [15:06:54] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]] (duration: 18m 26s) [15:07:01] T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861 [15:07:01] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [15:07:19] jan_drewniak: you should be good to go! [15:08:10] JSherman: thank you! sorry for making you stick around so long :P hopefully it won't usually take this long. [15:09:35] (03PS1) 10CDanis: otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907) [15:09:37] (03PS1) 10CDanis: otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) [15:09:39] (03PS1) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:10:03] (03PS1) 10C. 
Scott Ananian: [ParserCache] Preserve information from the JsonException when logging failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036) [15:10:05] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [15:10:46] jan_drewniak: no worries, it was that i18n cache rebuild on the new extension that ate up most of the time [15:12:36] (03CR) 10CDanis: [C:03+2] otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [15:14:22] (03Merged) 10jenkins-bot: otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [15:14:37] (03PS1) 10Vgutierrez: hiera: disable rp_filter on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257) [15:15:57] (03CR) 10CDanis: "I've tested this locally, and it seems to "work" inasmuch it doesn't do anything I don't expect. 
But I haven't yet managed to locally rep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:16:09] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2463/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:16:23] (03CR) 10JMeybohm: sre.hosts.rename: initial commit (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [15:16:49] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: disable rp_filter on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:18:33] (03CR) 10JMeybohm: sre.hosts.rename: initial commit (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [15:20:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:46] (03PS2) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:22:11] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [15:23:06] (03CR) 10Clément Goubert: [C:03+1] otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:24:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 104 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:24:48] jouncebot: nowandnext [15:24:48] No deployments scheduled for the next 1 hour(s) and 35 minute(s) [15:24:49] In 1 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700) [15:25:03] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:25:23] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [15:25:46] !log rolling restart of pybal on lvs4010 and lvs4009 - T357257 [15:25:48] claime: mind if I do a restbase deploy? 
[15:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:51] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:26:02] hnowlan: Nah, I can wait until after you do it, np [15:26:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:05] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:05] (03CR) 10CDanis: [C:03+2] otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:26:10] !log hnowlan@deploy1002 Started deploy [restbase/deploy@92abb6a]: Deploying new wikis T360304 T360311 T363244 T363250 T363257 T363264 T363271 [15:26:12] it'll just sit there for a minute [15:26:25] T360304: Add kuswiki to RESTBase - https://phabricator.wikimedia.org/T360304 [15:26:26] T360311: Add bewwiki to RESTBase - https://phabricator.wikimedia.org/T360311 [15:26:26] T363244: Add kawikisource to RESTBase - https://phabricator.wikimedia.org/T363244 [15:26:26] T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250 [15:26:27] T363257: Add kaawiktionary to RESTBase - https://phabricator.wikimedia.org/T363257 [15:26:27] T363264: Add iglwiki to RESTBase - https://phabricator.wikimedia.org/T363264 [15:26:28] T363271: Add mywikisource to RESTBase - https://phabricator.wikimedia.org/T363271 [15:26:31] claime: thanks for the +1 [15:26:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:49] (03Merged) 10jenkins-bot: mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [15:26:58] cdanis: figured you'd want to test this quickly, and since it's borked anyways [15:27:30] claime: oh if I didn't get a +1 soon I was just going to deploy it by hand without merging 😅 [15:27:36] (03Merged) 10jenkins-bot: otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:27:37] lol [15:28:37] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [15:28:42] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [15:29:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 80 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:29:52] (03CR) 10TChin: [C:03+2] datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:29:55] (03PS1) 10CDanis: otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407) [15:30:03] (03PS3) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:30:03] (03PS1) 10David Caro: openstack: use bobcat for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 [15:30:05] (03CR) 10CDanis: [C:03+2] otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:30:31] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch 
[puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [15:30:38] (03CR) 10CI reject: [V:04-1] openstack: use bobcat for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro) [15:30:50] (03Merged) 10jenkins-bot: datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:31:13] (03Merged) 10jenkins-bot: otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [15:31:40] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [15:31:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:31:53] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [15:32:42] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [15:32:42] (03PS2) 10David Caro: openstack: use bobcat/bookworm for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 [15:32:42] (03PS4) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:32:44] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [15:33:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:10] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 
10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [15:33:13] (03CR) 10CI reject: [V:04-1] openstack: use bobcat/bookworm for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro) [15:33:59] (03PS1) 10CDanis: I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959 [15:34:19] (03CR) 10CDanis: [C:03+2] I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959 (owner: 10CDanis) [15:34:57] * Lucas_WMDE resets “days since last YAML accident” to 0 [15:35:32] (03Merged) 10jenkins-bot: I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959 (owner: 10CDanis) [15:35:41] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [15:35:41] " This file only contains whitespace changes. Modify the whitespace setting to see the changes. " [15:35:43] yeah [15:35:49] yaml [15:35:51] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [15:36:29] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [15:36:34] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [15:36:59] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [15:37:18] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [15:38:16] sorry for the noise [15:38:40] I blame scap. [15:38:55] narrator: He didn't use scap [15:39:05] I blame it for not doing helm stuff. 
[15:39:07] (03PS1) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 [15:40:29] (03PS3) 10David Caro: openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 [15:40:29] (03PS5) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:40:57] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [15:41:00] (03CR) 10CI reject: [V:04-1] openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro) [15:41:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 99 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:41:47] (03CR) 10Ssingh: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez) [15:42:07] (03CR) 10BBlack: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez) [15:42:15] (03PS1) 10Kosta Harlan: AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) [15:43:03] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@92abb6a]: Deploying new wikis T360304 T360311 T363244 T363250 T363257 T363264 T363271 (duration: 16m 52s) [15:43:14] T360304: Add kuswiki to RESTBase - https://phabricator.wikimedia.org/T360304 [15:43:15] T360311: Add bewwiki to RESTBase - https://phabricator.wikimedia.org/T360311 [15:43:15] T363244: Add kawikisource to RESTBase 
- https://phabricator.wikimedia.org/T363244 [15:43:16] T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250 [15:43:16] T363257: Add kaawiktionary to RESTBase - https://phabricator.wikimedia.org/T363257 [15:43:17] T363264: Add iglwiki to RESTBase - https://phabricator.wikimedia.org/T363264 [15:43:17] T363271: Add mywikisource to RESTBase - https://phabricator.wikimedia.org/T363271 [15:43:19] James_F: We do use it for helmfile stuff for mw-on-k8s, but I don't know if we want it to be used for all helmfile things [15:43:30] Ack. [15:43:43] It's just helmfile is so much a black-box compared to scap's wrapper. [15:43:49] 'cause well, it'd be a wrapper around helmfile, that's a wrapper around helm, that's a wrapper around kube yaml [15:43:56] Fair. [15:44:02] But progress meters! [15:44:04] lol [15:45:08] James_F: yeahhhhhh [15:45:25] `helmfile apply` … and wait minutes to see if you maybe just broke stuff. [15:46:15] what's up with gerrit [15:46:22] was about to ask [15:46:28] * sukhe looks [15:46:30] +1 [15:46:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 79 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:47:01] hnowlan: all good for me to proceed with my deployment? [15:47:08] (I don't need no gerrit :p) [15:47:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:34] hmm I see nothing wrong on the dashboard though [15:47:35] oh hello [15:47:52] claime: yep, thanks! 
[15:47:55] sukhe: I see no data points on the dashboard for the past few minutes [15:48:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:03] something is up for sure yeah [15:49:34] !log cgoubert@deploy1002 Started scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323 [15:49:35] weird [15:49:40] T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323 [15:50:38] The last Puppet run was at Fri Apr 26 17:27:53 UTC 2024 (19 minutes ago). [15:50:59] !log cgoubert@deploy1002 Finished scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323 (duration: 02m 01s) [15:53:02] FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:53:08] Is gerrit down? [15:53:12] yeah it is [15:53:16] looking into it [15:53:20] Thanks [15:54:02] hashar is filing a task for it btw [15:54:08] (03PS4) 10David Caro: openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 [15:54:09] (03PS6) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [15:54:10] The last Puppet run was at Tue Apr 30 07:18:32 UTC 2024 (0 minutes ago). 
[15:54:13] fun [15:54:56] (03CR) 10CI reject: [V:04-1] AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [15:54:56] Seems back up for me now [15:55:11] https://phabricator.wikimedia.org/T365041 [15:55:15] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [15:55:16] it's back [15:55:19] thanks claime [15:55:24] same here [15:55:25] mutante: did you restart? [15:55:30] I was about to restart the service but then did not.. after I saw discussion in -releng [15:55:39] I see. so do we know what happened? [15:55:41] and then hashar made https://phabricator.wikimedia.org/T365041#9800848 [15:55:42] (03CR) 10Dreamy Jazz: "recheck" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [15:56:09] the system time is correct but the last puppet runs are certainly not [15:56:37] (03CR) 10Xcollazo: [C:03+1] "Interesting, I wonder who the original owner of this is? I'd like to understand if Data Products can also mine this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis) [15:56:40] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [15:56:54] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [15:57:26] so I guess two unanswered questions: [15:57:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:35] hi, Gerrit had some issue between 15:42 and 15:55 it is recovering [15:57:41] filed as v [15:57:42] hashar: hi [15:57:46] T365041 [15:57:46] T365041: Gerrit not reachable over HTTPS - https://phabricator.wikimedia.org/T365041 [15:58:02] FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:58:08] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez) [15:58:13] so two things though, we didn't get paged for it (maybe we should) and the puppet agent run motd is not correct, even though the system time is [15:58:17] just as an fyi for what is so far [15:58:20] and please don't blindly restart services :) [15:58:32] hashar: ok, nothing was restarted here [15:58:36] !log repool upload@ulsfo with IPIP encapsulation enabled - T357257 [15:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:40] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:59:30] 
(03CR) 10David Caro: openstack::bobcat: apply cloud yaml patch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [16:00:17] sukhe: The last Puppet run was at Fri Apr 26 17:27:53 UTC 2024 (19 minutes ago). [16:00:24] what the heck [16:00:42] yeah. the system time is correct though [16:00:55] That's fun [16:01:14] Active: inactive (dead) since Wed 2024-05-15 15:48:20 UTC; 12min ago [16:01:18] Loaded: loaded (/lib/systemd/system/puppet-agent-timer.service; static) [16:01:39] no failed units though [16:01:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:02:27] runs the same command that this service uses to run puppet [16:04:43] well, that finished but changed nothing.. then let's see the command that builds the motd from the snippets [16:06:15] so that Gerrit issue was transient. It has self-recovered and I have closed the task [16:06:48] !log Gerrit was briefly unreachable between 15:42 and 15:55 UTC | T365041 [16:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:54] T365041: Gerrit not reachable over HTTPS - https://phabricator.wikimedia.org/T365041 [16:07:10] that doesn't explain the wrong puppet run timer though but we will look at that independently [16:07:49] FIRING: HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:11:39] (03PS1) 10Superpes15: [pswiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851) [16:12:54] (03CR) 10Btullis: [C:03+1] "Looks good, 
thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1031842 (owner: 10Muehlenhoff) [16:17:17] so regarding the not-updating MOTD mystery: [16:17:32] when I manually run "run-parts /etc/update-motd.d" I get the right parts: [16:17:35] The last Puppet run was at Wed May 15 16:02:53 UTC 2024 (13 minutes ago). [16:17:46] and other bullseye hosts don't have this issue [16:24:26] !log gerrit1003 - MOTD wasn't updating anymore but manual "run-parts /etc/update-motd.d" showed updated data - while /run/motd.dynamic was outdated. fixed by manually renaming /run/motd.dynamic.new to /run/motd.dynamic and logging in because it's triggered by PAM.. but .. why [16:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:22] same issue on the other gerrit server.. so it's the puppet role? what [16:27:38] so far just these two hosts yep [16:28:05] (03CR) 10Aklapper: [C:03+2] "Applies cleanly locally, and from a quick random test seems correct too. :D Again thanks a lot for your patience!" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery) [16:28:30] (03Abandoned) 10Brennen Bearnes: WIP: gitlab: enable agent server for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/767249 (https://phabricator.wikimedia.org/T283894) (owner: 10Brennen Bearnes) [16:28:51] (03CR) 10Aklapper: [V:03+2 C:03+2] Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery) [16:30:16] also not a permission issue on those files in /run [16:30:50] even #debian is confused because motd became way too complex .. 
when it used to be a simple file to edit :) [16:31:04] (03PS1) 10Ebernhardson: cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966 [16:31:46] (03CR) 10Peter Fischer: [C:03+2] cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966 (owner: 10Ebernhardson) [16:31:53] !log gerrit2002 - mv /run/motd.dynamic.new /run/motd.dynamic [16:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:37] (03Merged) 10jenkins-bot: cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966 (owner: 10Ebernhardson) [16:33:32] !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [16:33:37] thanks mutante! [16:34:10] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [16:34:39] sukhe: yw, so it seems like it works normal again. it creates a new "motd.dynamic.new" file on login [16:34:52] the mystery remains on why just gerrit but I guess ... 
:) [16:37:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:37:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:39:07] I confirmed the relevant line is in /etc/pam.d/login and /etc/pam.d/sshd as normal [16:40:39] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [16:47:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62416 and previous config saved to /var/cache/conftool/dbconfig/20240515-164713-ladsgroup.json [16:47:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:48:23] (03CR) 10Ebernhardson: [C:03+1] CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper) [16:50:46] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [16:51:54] 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9801162 (10hashar) I had a similar issue while deploying the train this morning. One of the httpbb test failed due to mwdebug2002... [16:59:07] 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801220 (10SMMPakPanel) Its all-encompassing strategy for social media marketing in Pakistan makes [[ https://smmpakpanel.com/ | SMM Pak Panel ]] unique. It helps businesses efficiently improve their we... 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700) [17:01:55] deleted phab spam [17:02:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P62417 and previous config saved to /var/cache/conftool/dbconfig/20240515-170221-ladsgroup.json [17:02:50] 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801243 (10hashar) [17:07:17] phab spam on my humorous (?) task, how sad [17:07:55] (03PS1) 10Eevans: Add user xcollazo to cassandra-staging-devs group [puppet] - 10https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588) [17:08:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9801254 (10Eevans) [17:10:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801259 (10VRiley-WMF) a:03VRiley-WMF [17:15:57] (03PS4) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [17:17:09] (03PS5) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [17:17:17] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye [17:17:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P62418 and previous config saved to /var/cache/conftool/dbconfig/20240515-171729-ladsgroup.json [17:17:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - 
https://phabricator.wikimedia.org/T363212#9801312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye [17:20:16] (03PS6) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [17:21:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [17:22:19] !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [17:24:43] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [17:26:45] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 13), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9801327 (10Scott_French) Hi @SGupta-WMF and @mforns - Any updates on the timel... 
[17:28:48] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [17:32:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62419 and previous config saved to /var/cache/conftool/dbconfig/20240515-173236-ladsgroup.json [17:32:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:32:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:32:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:33:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62420 and previous config saved to /var/cache/conftool/dbconfig/20240515-173259-ladsgroup.json [17:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:23] ^ has anyone looked at this and why it alerts so frequently? it's a significant contributor to the AlertFatigue ratio [17:35:10] sukhe: I made https://phabricator.wikimedia.org/T364931 yesterday for that [17:35:33] wow thank you [17:35:47] adding my .1 cents :) [17:36:31] (03PS7) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:36:56] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:37:44] (03CR) 10Andrew Bogott: "latest version uses openstack::patch for consistency. 
I tested the patch application on bookworm and it worked with fuzz 1." [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:37:46] mutante: the real lesson here is to not care about alerts that are not paging I guess :) [17:38:11] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:39:01] (03CR) 10Dzahn: contint: set new default docker version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [17:40:05] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [17:41:58] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:43:54] (03PS8) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:43:54] (03PS1) 10Andrew Bogott: Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981 [17:44:20] (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [17:44:28] (03CR) 10CI reject: [V:04-1] Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981 (owner: 10Andrew Bogott) [17:46:05] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [17:49:17] (03CR) 10Scott French: "Thanks, Filippo!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [18:00:05] hashar and andre: Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1800) [18:01:25] (03PS1) 10CDanis: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 [18:01:31] (03PS2) 10CDanis: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 [18:01:37] (03CR) 10CDanis: [C:03+2] Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 (owner: 10CDanis) [18:02:29] (03Merged) 10jenkins-bot: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 (owner: 10CDanis) [18:03:28] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1007.eqiad.wmnet with OS bullseye [18:03:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed... 
[18:06:57] (03PS1) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 [18:08:15] (03PS2) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 [18:08:38] (03Abandoned) 10Andrew Bogott: Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981 (owner: 10Andrew Bogott) [18:09:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott) [18:10:03] (03Abandoned) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall) [18:10:11] (03PS1) 10BCornwall: [ncmonitor] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1031991 [18:10:35] (03Abandoned) 10BCornwall: [ncmonitor] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1031991 (owner: 10BCornwall) [18:11:27] (03CR) 10CI reject: [V:04-1] Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott) [18:11:59] (03PS3) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 [18:12:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott) [18:13:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. [18:18:44] (03CR) 10Andrew Bogott: [C:03+2] Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott) [18:20:55] (03PS9) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [18:22:00] (03CR) 10Dzahn: "Yea, fair enough. Though restarting zuul-merger might also be forgotten and puppet at least runs by itself after a while." 
[puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:22:50] (03CR) 10Dzahn: [C:03+2] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:24:12] (03CR) 10Dzahn: [C:03+2] "duh! DNS lookup failed for 127.0.0.1 Resolv::DNS::Resource::IN::A" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:39:04] (03PS1) 10Dzahn: ci/zuul: use localhost as gearman server [puppet] - 10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517) [18:39:38] (03CR) 10Dzahn: [C:03+2] ci/zuul: use localhost as gearman server [puppet] - 10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:44:45] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [18:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:35] (03PS1) 10Dzahn: zuul: add DNS lookup for gearman server IP [puppet] - 10https://gerrit.wikimedia.org/r/1032013 (https://phabricator.wikimedia.org/T334517) [18:52:33] (03CR) 10Dzahn: [C:03+2] zuul: add DNS lookup for gearman server IP [puppet] - 10https://gerrit.wikimedia.org/r/1032013 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:56:27] (03CR) 10Dzahn: [C:03+2] "needed 2 follow-ups but is working now:" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:56:56] (03CR) 10Dzahn: [C:03+2] "[contint2002:~] $ sudo grep -A1 "\[gearman\]" /etc/zuul/zuul-*.conf" [puppet] - 
10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:57:49] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:59:32] (03CR) 10Dzahn: [C:03+2] "first applied on contint2002 - fixed issues - then applied on contint1002 and it was complete noop. configs are the same before and after " [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:02:16] (03CR) 10Dzahn: [C:04-1] "looks like the function name was copied but not adjusted yet from "debian_php_version" to "wmf_php_version"." [puppet] - 10https://gerrit.wikimedia.org/r/1029900 (owner: 10Muehlenhoff) [19:03:06] (03PS1) 10Jdlrobson: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) [19:04:04] (03PS2) 10Jdlrobson: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) [19:05:13] (03CR) 10Jdrewniak: [C:03+2] [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [19:06:02] (03Merged) 10jenkins-bot: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [19:09:55] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:10:12] (03PS3) 10Scott French: configure parsercache servers via dbconfig in etcd 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 [19:13:47] (03PS2) 10Dzahn: contint: set new default docker version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) [19:15:55] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:56] (03PS1) 10Dzahn: ci: set puppet7 on role level [puppet] - 10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517) [19:18:25] (03PS2) 10Dzahn: ci: set puppet7 at role level [puppet] - 10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517) [19:26:39] (03CR) 10AOkoth: [C:03+1] vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff) [19:40:06] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031594 [19:41:03] jouncebot: nowandnext [19:41:03] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [19:41:04] In 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2000) [19:42:18] i'm here for the backport window [19:45:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62423 and previous config saved to /var/cache/conftool/dbconfig/20240515-194514-marostegui.json [19:45:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:48:28] cscott: shall we start your patch merging now? [19:48:36] Sure!  Thanks! 
[19:48:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian) [19:54:08] Sorry for the aliasing, joining from my phone as well as my desktop [19:55:11] np :) [19:55:32] (03PS2) 10Jdlrobson: Enable night mode as a desktop beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) [19:55:44] (03PS2) 10Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708) [19:56:12] (03CR) 10Scott French: "Thanks, Amir!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (owner: 10Scott French) [19:58:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:58:07] (03PS8) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 [19:58:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "no change on the current prod server: https://puppet-compiler.wmflabs.org/output/1020344/2465/" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:59:06] (03PS4) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (https://phabricator.wikimedia.org/T362786) [19:59:20] (03PS2) 10Jsn.sherman: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031999 (https://phabricator.wikimedia.org/T364034) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that 
lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2000). [20:00:04] Jdlrobson, Superpes, cscott, and Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] \o [20:00:21] You know, I've never gotten my sticker(s) [20:00:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P62424 and previous config saved to /var/cache/conftool/dbconfig/20240515-200022-marostegui.json [20:00:28] :D [20:00:39] and i've broken wikis back when i should have been rewarded with a t-shirt for the feat [20:00:57] I am happy to deploy, but don't mind if someone wants to combine my patch with another deploy. [20:01:02] (currently merging cscott's patch, another 10 minutes or so) [20:01:12] 👍 [20:01:39] o/ [20:01:42] (03CR) 10Scott French: "Thanks for the feedback on I0a62da18de21b609b7f07b075bd9be99cd8b8b9f, Amir. Let's go that route and continue the conversation there." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [20:02:17] (03Abandoned) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (owner: 10Scott French) [20:09:55] (03CR) 10Dzahn: [V:03+1 C:03+2] "no change on contint1002 - error on contint2002 will resolve once I reimage tomorrow morning" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [20:11:23] (03Merged) 10jenkins-bot: [ParserCache] Preserve information from the JsonException when logging failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [20:12:15] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]] [20:12:17] here's hoping cscott returns.. [20:12:19] T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036 [20:15:01] !log samtar@deploy1002 cscott and samtar: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P62425 and previous config saved to /var/cache/conftool/dbconfig/20240515-201529-marostegui.json [20:16:12] !log samtar@deploy1002 cscott and samtar: Continuing with sync [20:16:52] (03PS7) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [20:16:52] (03CR) 10Andrea Denisse: "Thanks for your review and the explanation on what is expected of these metrics. 
I've grouped them under the same metric name (`wmfstatic_" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:18:50] (03PS8) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [20:20:05] (03PS6) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [20:20:44] i'm back, sorry libera.chat bounced me [20:20:55] cscott: hi, I went ahead and started the sync for your patch, but you're welcome to test it on mwdebug now while it syncs [20:21:04] i'm back, sorry libera.chat bounced me [20:21:12] ok, testing now [20:21:39] (03PS7) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [20:24:11] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801879 (10Eevans) [20:25:46] (03PS1) 10Dzahn: lists: move definition of primary and standby host to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/1032032 [20:26:58] (03PS2) 10Dzahn: lists: move definition of primary and standby host to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/1032032 [20:28:28] TheresNoTime: tested, looks good. thanks. 
[20:28:38] cscott: ack :) [20:28:42] Jdlrobson: doing your config patch next (combined with Superpes' throttle rule) [20:28:56] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]] (duration: 16m 41s) [20:29:00] T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036 [20:29:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [20:29:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708) (owner: 10Superpes15) [20:30:02] (03Merged) 10jenkins-bot: Enable night mode as a desktop beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [20:30:02] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801884 (10Eevans) A Docker image is now published: ` docker pull docker-registry.wikimedia.org/repos/sre... 
[20:30:04] (03Merged) 10jenkins-bot: [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708) (owner: 10Superpes15) [20:30:37] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature (T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]] [20:30:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62426 and previous config saved to /var/cache/conftool/dbconfig/20240515-203037-marostegui.json [20:30:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [20:30:46] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [20:30:46] T364708: Temp lift of IP cap for Chronobiology Edit-a-thon 18th May 2024 - https://phabricator.wikimedia.org/T364708 [20:30:51] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:30:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [20:30:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [20:31:00] exciting stuff [20:31:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [20:31:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T364299)', diff saved to https://phabricator.wikimedia.org/P62427 and previous config saved to /var/cache/conftool/dbconfig/20240515-203116-marostegui.json [20:33:15] !log samtar@deploy1002 samtar and superpes and jdlrobson: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature 
(T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:33:19] Jdlrobson: live on mwdebug :) [20:34:27] TheresNoTime: looking! [20:35:09] TheresNoTime: please sync! [20:35:15] !log samtar@deploy1002 samtar and superpes and jdlrobson: Continuing with sync [20:35:18] (03PS1) 10Eevans: cassandra: add data_gateway Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) [20:36:28] Dreamy_Jazz: will start your patch merging now [20:36:36] Thanks [20:36:44] (03CR) 10Samtar: [C:03+2] "prep for deploy" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [20:36:46] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801918 (10Eevans) [20:44:38] thanks for the deploy! [20:44:47] np! [20:48:13] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature (T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]] (duration: 17m 35s) [20:48:18] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [20:48:18] T364708: Temp lift of IP cap for Chronobiology Edit-a-thon 18th May 2024 - https://phabricator.wikimedia.org/T364708 [20:50:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [20:51:56] Thanks TheresNoTime [20:51:57] :3 [20:52:07] no worries :D [20:54:54] Dreamy_Jazz: about 5mins to merging, you okay to hang on? [20:55:12] Yes, I can hang around till then. 
[20:56:07] gate-and-submit is certainly taking its time :D [20:56:22] (03PS9) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [20:56:41] didn't think ConfirmEdit was that slow to merge..! [20:57:27] Ikr. I'd expect this for CheckUser or something, but ConfirmEdit seems an odd one. Perhaps it's because it now loads AbuseFilter as a dependency? [20:58:08] Um, about the dark mode patch, is it expected to be automatic by default? [20:58:46] Do you have all beta features enabled? [20:59:03] (03Merged) 10jenkins-bot: AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [20:59:39] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]] [20:59:51] T20110: Define AbuseFilter consequence to display a CAPTCHA - https://phabricator.wikimedia.org/T20110 [21:00:02] I had the "Accessibility for Reading (Vector 2022)" feature enabled by default I guess [21:00:03] At least for me Dark Mode isn't enabled by default on production [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2100) [21:00:12] That might be it then [21:00:51] turning on "Accessibility for Reading (Vector 2022)" sets it to "Automatic" for me fwiw (so currently dark mode) [21:01:26] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801985 (10Eevans) p:05Triage→03High [21:01:34] (03PS10) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 
(https://phabricator.wikimedia.org/T359255) [21:02:20] !log samtar@deploy1002 samtar and kharlan: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:02:29] Dreamy_Jazz: on mwdebug [21:02:36] Ty [21:02:38] Testing [21:03:05] Yeah, I guess it was more sudden than I expected (also I was on Special:Watchlist which looks particularly bad (to me)) [21:03:27] TheresNoTime: Test successful. [21:03:32] !log samtar@deploy1002 samtar and kharlan: Continuing with sync [21:03:50] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801987 (10Eevans) p:05High→03Triage [21:03:51] Dreamy_Jazz: syncing :) [21:04:00] :D [21:04:23] I do see what you are saying about the Watchlist :) [21:04:58] Some of the colours seem to not yet be adapted for dark mode [21:06:58] (03PS1) 10Dzahn: admin: add Dennis Mburugu to ldap_only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/1032047 (https://phabricator.wikimedia.org/T364320) [21:07:07] Yep, looks like anything that has hardcoded styles is in pretty bad shape lol [21:07:27] Special:NewPagesFeed is also pretty bad [21:07:48] lucky you, getting to fix it! 
:D [21:08:50] :) [21:09:20] (03CR) 10Dzahn: [V:03+1] "https://wikimedia.namely.com/people/bc0ae9bc-9dd7-4390-afae-8bab4dc49684/show/personal/employee-information/" [puppet] - 10https://gerrit.wikimedia.org/r/1032047 (https://phabricator.wikimedia.org/T364320) (owner: 10Dzahn) [21:10:22] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9802031 (10Dzahn) 05Open→03In progress [21:14:02] (03CR) 10Dzahn: [C:03+1] "lgtm, has the approvals now" [puppet] - 10https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588) (owner: 10Eevans) [21:14:18] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [21:16:10] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]] (duration: 16m 31s) [21:16:14] T20110: Define AbuseFilter consequence to display a CAPTCHA - https://phabricator.wikimedia.org/T20110 [21:16:16] and done [21:16:24] !log UTC late backport window complete [21:16:24] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1-d1-codfw mgmt ip - cmooney@cumin1002" [21:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1-d1-codfw mgmt ip - cmooney@cumin1002" [21:17:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:17:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:21:11] (03PS3) 10JHathaway: postfix: prometheus ops config [puppet] - 
10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) [21:22:46] (03PS4) 10JHathaway: postfix: prometheus ops config for mx-out boxes [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) [21:23:02] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:23] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [21:44:07] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@718b2dd]: specify analytics-hadoop in hdfs urls [21:44:33] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@718b2dd]: specify analytics-hadoop in hdfs urls (duration: 00m 25s) [21:51:20] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074 (10Milimetric) 03NEW [21:54:13] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: delete ssw1-d1-codfw mgmt ip - cmooney@cumin1002" [21:55:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: delete ssw1-d1-codfw mgmt ip - cmooney@cumin1002" [21:55:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:32] (03CR) 10Eevans: [C:03+2] Add user xcollazo to cassandra-staging-devs group [puppet] - 10https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588) (owner: 10Eevans) [22:13:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9802260 (10Eevans) This is now done. 
The document is here: https://wikitech.wikimedia.org/wiki/Cassandra/Staging (it's still quite bare, so if you have any q... [22:13:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9802261 (10Eevans) 05In progress→03Resolved [22:13:55] (03PS11) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [22:17:17] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:18:02] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:28] (03PS2) 10Jdlrobson: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) [22:34:23] FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:49] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@12e0cb9]: bump discolytics to 0.19.0 [22:41:17] !log 
ebernhardson@deploy1002 Finished deploy [airflow-dags/search@12e0cb9]: bump discolytics to 0.19.0 (duration: 00m 27s) [22:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:16:09] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1032086 [23:21:25] PROBLEM - carbon-cache write error on graphite1005 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [8.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30 [23:28:32] (03PS1) 10Jdlrobson: Disable font size configuration on talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887) [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1031596 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1031596 (owner: 10TrainBranchBot) [23:43:25] RECOVERY - carbon-cache write error on graphite1005 is OK: OK: Less than 80.00% above the threshold [1.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30 [23:43:32] (03CR) 10Scott French: [C:03+1] "Thanks, Eric! Two minor questions, but otherwise looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [23:47:25] PROBLEM - carbon-cache write error on graphite1005 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [8.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30 [23:58:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable