[00:00:35] <icinga-wm>	 RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T354336)', diff saved to https://phabricator.wikimedia.org/P55433 and previous config saved to /var/cache/conftool/dbconfig/20240124-000802-marostegui.json
[00:08:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[00:08:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[00:08:19] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:08:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55434 and previous config saved to /var/cache/conftool/dbconfig/20240124-000824-marostegui.json
[00:10:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55435 and previous config saved to /var/cache/conftool/dbconfig/20240124-001044-marostegui.json
[00:25:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P55436 and previous config saved to /var/cache/conftool/dbconfig/20240124-002551-marostegui.json
[00:36:53] <icinga-wm>	 PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[00:39:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440
[00:39:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440 (owner: TrainBranchBot)
[00:40:41] <wikibugs>	 (CR) Cwhite: [C: +2] httpd: ErrorLogFormat for ECS [puppet] - https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: Hashar)
[00:40:51] <wikibugs>	 (PS4) Cwhite: httpd: ErrorLogFormat for ECS [puppet] - https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: Hashar)
[00:40:59] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:40:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P55437 and previous config saved to /var/cache/conftool/dbconfig/20240124-004058-marostegui.json
[00:41:11] <icinga-wm>	 PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[00:54:37] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:54:49] <icinga-wm>	 RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[00:56:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55438 and previous config saved to /var/cache/conftool/dbconfig/20240124-005605-marostegui.json
[00:56:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:56:11] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:56:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:56:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55439 and previous config saved to /var/cache/conftool/dbconfig/20240124-005627-marostegui.json
[00:56:35] <icinga-wm>	 RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
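The CirrusSearch latency alerts above fire on the share of recent datapoints that exceed a threshold: 20% of points over the critical threshold ([250.0] for comp_suggest) raises CRITICAL, and recovery is reported once fewer than 20% exceed the warning threshold ([100.0]). A minimal Python sketch of that rule, assuming a fixed window of p95 samples in milliseconds and a ">= 20%" trigger inferred from the messages; `classify_latency` and the sample values are hypothetical, and this is not the actual Icinga/Graphite check.

```python
from typing import Sequence


def classify_latency(samples: Sequence[float], warning: float, critical: float,
                     percent: float = 20.0) -> str:
    """Classify a window of p95 latency samples (ms) as OK/WARNING/CRITICAL.

    Hypothetical sketch: a state fires when at least `percent` % of the
    datapoints in the window are strictly above the matching threshold.
    """
    if not samples:
        return "UNKNOWN"

    def share_above(threshold: float) -> float:
        # Percentage of datapoints strictly above the threshold.
        return 100.0 * sum(1 for s in samples if s > threshold) / len(samples)

    if share_above(critical) >= percent:
        return "CRITICAL"
    if share_above(warning) >= percent:
        return "WARNING"
    return "OK"


# Ten hypothetical comp_suggest samples (warning 100 ms, critical 250 ms).
window = [90, 95, 260, 300, 88, 92, 97, 99, 85, 91]
print(classify_latency(window, warning=100.0, critical=250.0))  # CRITICAL (2 of 10 = 20%)
```

Keeping separate warning and critical thresholds mirrors why the PROBLEM and RECOVERY lines for the same check quote different numbers (e.g. [1500.0] vs [1000.0] for more_like).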
[00:58:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55440 and previous config saved to /var/cache/conftool/dbconfig/20240124-005849-marostegui.json
[01:01:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440 (owner: TrainBranchBot)
[01:10:35] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 113.77, 103.15, 87.07 https://wikitech.wikimedia.org/wiki/Swift
[01:13:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P55441 and previous config saved to /var/cache/conftool/dbconfig/20240124-011355-marostegui.json
[01:15:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:15] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2075 is OK: OK - load average: 59.22, 72.02, 78.74 https://wikitech.wikimedia.org/wiki/Swift
[01:29:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P55442 and previous config saved to /var/cache/conftool/dbconfig/20240124-012902-marostegui.json
[01:35:10] <wikibugs>	 SRE, serviceops: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (Mstyles)
[01:44:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55443 and previous config saved to /var/cache/conftool/dbconfig/20240124-014408-marostegui.json
[01:44:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[01:44:21] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[01:44:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[01:44:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55444 and previous config saved to /var/cache/conftool/dbconfig/20240124-014430-marostegui.json
[01:46:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55445 and previous config saved to /var/cache/conftool/dbconfig/20240124-014651-marostegui.json
[02:01:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P55447 and previous config saved to /var/cache/conftool/dbconfig/20240124-020157-marostegui.json
[02:02:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:17] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51307 bytes in 7.849 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:08:50] <wikibugs>	 (CR) Ssingh: "Please feel free to take 928 that is reserved for authdns (but we haven't used it anywhere so far, so all good)." [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[02:17:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P55448 and previous config saved to /var/cache/conftool/dbconfig/20240124-021704-marostegui.json
[02:32:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55449 and previous config saved to /var/cache/conftool/dbconfig/20240124-023210-marostegui.json
[02:32:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:32:16] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[02:32:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:39:21] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:11] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:14:21] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:20:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy: add awareness of 'unmanaged' role [puppet] - https://gerrit.wikimedia.org/r/992543 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[03:24:53] <wikibugs>	 (PS8) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[03:27:32] <wikibugs>	 (PS9) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[03:34:48] <wikibugs>	 (CR) Andrea Denisse: "I've clarified the situation with Sukhbir." [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[03:35:23] <wikibugs>	 (CR) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[03:45:13] <wikibugs>	 (PS1) Andrea Denisse: authdns: Add entry for the 'authdns' GID [puppet] - https://gerrit.wikimedia.org/r/992550
[03:47:09] <wikibugs>	 (CR) Andrea Denisse: "Hi, this patch is related to the issue discussed in 990795." [puppet] - https://gerrit.wikimedia.org/r/992550 (owner: Andrea Denisse)
[05:45:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:45:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:47:00] <wikibugs>	 (PS1) Marostegui: Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992511
[05:48:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:48:37] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992511 (owner: Marostegui)
[05:48:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:49:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[05:49:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55450 and previous config saved to /var/cache/conftool/dbconfig/20240124-054924-root.json
[05:49:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[05:49:31] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[05:49:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2112 (T354336)', diff saved to https://phabricator.wikimedia.org/P55451 and previous config saved to /var/cache/conftool/dbconfig/20240124-054932-marostegui.json
[05:49:38] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[05:51:05] <wikibugs>	 (PS1) Marostegui: mariadb: Disable notifications on A1 hosts [puppet] - https://gerrit.wikimedia.org/r/992555 (https://phabricator.wikimedia.org/T355437)
[05:51:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158 db2157 es2026 db2136 T355437', diff saved to https://phabricator.wikimedia.org/P55452 and previous config saved to /var/cache/conftool/dbconfig/20240124-055143-marostegui.json
[05:51:49] <stashbot>	 T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437
[05:51:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T354336)', diff saved to https://phabricator.wikimedia.org/P55453 and previous config saved to /var/cache/conftool/dbconfig/20240124-055157-marostegui.json
[05:52:15] <wikibugs>	 (Abandoned) Ammarpad: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: Ammarpad)
[05:52:18] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Disable notifications on A1 hosts [puppet] - https://gerrit.wikimedia.org/r/992555 (https://phabricator.wikimedia.org/T355437) (owner: Marostegui)
[05:56:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T354506', diff saved to https://phabricator.wikimedia.org/P55454 and previous config saved to /var/cache/conftool/dbconfig/20240124-055635-marostegui.json
[05:56:40] <stashbot>	 T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[05:57:08] <wikibugs>	 (PS1) Marostegui: db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992556 (https://phabricator.wikimedia.org/T354506)
[05:58:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2129.codfw.wmnet with OS bookworm
[05:58:24] <wikibugs>	 (CR) Marostegui: [C: +2] db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992556 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:04:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55455 and previous config saved to /var/cache/conftool/dbconfig/20240124-060429-root.json
[06:04:38] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:07:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P55456 and previous config saved to /var/cache/conftool/dbconfig/20240124-060703-marostegui.json
[06:15:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: host reimage
[06:18:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2129.codfw.wmnet with reason: host reimage
[06:19:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55457 and previous config saved to /var/cache/conftool/dbconfig/20240124-061934-root.json
[06:19:40] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:22:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P55458 and previous config saved to /var/cache/conftool/dbconfig/20240124-062210-marostegui.json
[06:24:40] <wikibugs>	 (PS1) Marostegui: Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992512
[06:34:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55459 and previous config saved to /var/cache/conftool/dbconfig/20240124-063440-root.json
[06:34:45] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:37:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T354336)', diff saved to https://phabricator.wikimedia.org/P55460 and previous config saved to /var/cache/conftool/dbconfig/20240124-063717-marostegui.json
[06:37:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:37:22] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:37:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:37:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55461 and previous config saved to /var/cache/conftool/dbconfig/20240124-063739-marostegui.json
[06:38:48] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992512 (owner: Marostegui)
[06:40:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55462 and previous config saved to /var/cache/conftool/dbconfig/20240124-064003-marostegui.json
[06:40:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55463 and previous config saved to /var/cache/conftool/dbconfig/20240124-064020-root.json
[06:40:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2129.codfw.wmnet with OS bookworm
[06:47:12] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/992442 (https://phabricator.wikimedia.org/T355739)
[06:47:16] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/992443 (https://phabricator.wikimedia.org/T355739)
[06:49:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55464 and previous config saved to /var/cache/conftool/dbconfig/20240124-064944-root.json
[06:49:50] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:55:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P55465 and previous config saved to /var/cache/conftool/dbconfig/20240124-065510-marostegui.json
[06:55:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55466 and previous config saved to /var/cache/conftool/dbconfig/20240124-065525-root.json
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0700)
[07:04:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55467 and previous config saved to /var/cache/conftool/dbconfig/20240124-070449-root.json
[07:04:55] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[07:10:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P55468 and previous config saved to /var/cache/conftool/dbconfig/20240124-071016-marostegui.json
[07:10:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55469 and previous config saved to /var/cache/conftool/dbconfig/20240124-071030-root.json
[07:19:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55470 and previous config saved to /var/cache/conftool/dbconfig/20240124-071954-root.json
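The db2175 repool above steps through 1, 5, 10, 25, 50, 75 and 100% roughly every 15 minutes (05:49 to 07:19). A sketch of that schedule in Python, assuming a fixed 15-minute interval read off the timestamps; `repool_schedule` is a hypothetical helper, not part of dbctl or the SRE cookbooks.

```python
from datetime import datetime, timedelta

# Percentages observed in the db2175 repool; the 15-minute step is an
# assumption read off the log timestamps, not a documented constant.
REPOOL_STEPS = (1, 5, 10, 25, 50, 75, 100)
STEP_INTERVAL = timedelta(minutes=15)


def repool_schedule(start: datetime, steps=REPOOL_STEPS, interval=STEP_INTERVAL):
    """Return (time, percent) pairs for a hypothetical gradual repool."""
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


for when, pct in repool_schedule(datetime(2024, 1, 24, 5, 49, 24)):
    print(f"{when:%H:%M:%S} pool at {pct}%")
```

The logged commits drift by about five seconds per step, presumably the time each commit takes, so the sketch's 07:19:24 end time lands slightly before the logged 07:19:55.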
[07:20:00] <stashbot>	 T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[07:25:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55471 and previous config saved to /var/cache/conftool/dbconfig/20240124-072523-marostegui.json
[07:25:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[07:25:29] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:25:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55472 and previous config saved to /var/cache/conftool/dbconfig/20240124-072535-root.json
[07:25:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[07:25:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55473 and previous config saved to /var/cache/conftool/dbconfig/20240124-072557-marostegui.json
[07:28:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55474 and previous config saved to /var/cache/conftool/dbconfig/20240124-072821-marostegui.json
[07:40:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55475 and previous config saved to /var/cache/conftool/dbconfig/20240124-074040-root.json
[07:43:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P55476 and previous config saved to /var/cache/conftool/dbconfig/20240124-074327-marostegui.json
[07:45:55] <wikibugs>	 (CR) Slyngshede: P:debmonitor::server rework debmonitor http monitoring. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:55:36] <wikibugs>	 (PS1) Mxmxchere: etcd 3.4: Fix ETCD_CLIENT_CERT_AUTH=false [puppet] - https://gerrit.wikimedia.org/r/992629
[07:55:40] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/992629 (owner: Mxmxchere)
[07:55:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55477 and previous config saved to /var/cache/conftool/dbconfig/20240124-075545-root.json
[07:58:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P55478 and previous config saved to /var/cache/conftool/dbconfig/20240124-075834-marostegui.json
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0800).
[08:00:04] <jouncebot>	 WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:26] <WMDE-Fisch>	 \o
[08:01:43] <WMDE-Fisch>	 I can self serve
[08:04:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:05:05] <wikibugs>	 (Merged) jenkins-bot: Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:05:56] <logmsgbot>	 !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]]
[08:06:01] <stashbot>	 T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:06:17] <wikibugs>	 (CR) Awight: [C: +1] Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:07:43] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:08:04] * WMDE-Fisch testing
[08:10:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55479 and previous config saved to /var/cache/conftool/dbconfig/20240124-081050-root.json
[08:12:38] <hashar>	 good morning
[08:13:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55480 and previous config saved to /var/cache/conftool/dbconfig/20240124-081340-marostegui.json
[08:13:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:13:46] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[08:13:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:14:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[08:14:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[08:14:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55481 and previous config saved to /var/cache/conftool/dbconfig/20240124-081445-marostegui.json
[08:17:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55482 and previous config saved to /var/cache/conftool/dbconfig/20240124-081708-marostegui.json
[08:17:18] <logmsgbot>	 !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]]
[08:17:23] <stashbot>	 T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:18:49] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:18:52] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync
[08:19:45] <wikibugs>	 (PS1) Hashar: Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680)
[08:20:22] <hashar>	 I will deploy that LiquidThreads patch as well
[08:20:30] <wikibugs>	 (CR) Hashar: [C: +2] Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680) (owner: Hashar)
[08:22:54] <wikibugs>	 (Merged) jenkins-bot: Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680) (owner: Hashar)
[08:25:50] <logmsgbot>	 !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] (duration: 08m 32s)
[08:25:56] <stashbot>	 T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:27:45] <WMDE-Fisch>	 I'm done here
[08:28:00] <hashar>	 excellent
[08:28:06] <hashar>	 I am doing the LiquidThreads patch
[08:28:45] <logmsgbot>	 !log hashar@deploy2002 Started scap: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]]
[08:28:50] <stashbot>	 T355680: InvalidArgumentException: Passing a raw callable is not allowed here. Use [ 'factory' => $callable ] instead. - https://phabricator.wikimedia.org/T355680
[08:30:14] <logmsgbot>	 !log hashar@deploy2002 hashar: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:30:17] <logmsgbot>	 !log hashar@deploy2002 hashar: Continuing with sync
[08:30:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[08:32:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P55483 and previous config saved to /var/cache/conftool/dbconfig/20240124-083215-marostegui.json
[08:34:37] <hashar>	 [{reqId}] {exception_url} Error: Class 'GuzzleHttp\Exception\ConnectException' not found
[08:34:39] * hashar whistles
[08:34:47] <hashar>	 [{reqId}] {exception_url} PHP Warning: socket_create(): Unable to create socket [24]: Too many open files
[08:34:50] <hashar>	 ahh computers...
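The `socket_create()` warning above carries errno 24, which on Linux is `EMFILE`: the process has hit its limit on open file descriptors. A minimal Python sketch, assuming a Linux host, that decodes that errno and reads the current `RLIMIT_NOFILE` soft and hard limits; it says nothing about how scap or the PHP runtime on the deployment host actually set their limits.

```python
import errno
import os
import resource


def describe_errno(code: int) -> str:
    """Return the symbolic name and system message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(describe_errno(24))  # on Linux: EMFILE: Too many open files
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard}")
```

Raising the soft limit up to the hard limit (`resource.setrlimit`) is the usual per-process remedy; whether that was the right fix here is not something this log establishes.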
[08:36:46] <logmsgbot>	 !log hashar@deploy2002 Finished scap: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]] (duration: 08m 00s)
[08:36:51] <stashbot>	 T355680: InvalidArgumentException: Passing a raw callable is not allowed here. Use [ 'factory' => $callable ] instead. - https://phabricator.wikimedia.org/T355680
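T355680 is raised by a registry that refuses a bare callable and asks for an object specification such as `[ 'factory' => $callable ]`; the backport resolved it by switching `LogActionsHandlers` to a class. The real code is PHP and goes through MediaWiki's ObjectFactory. As a rough Python analogue only, with `resolve_spec` and its rules being hypothetical, a resolver that accepts `factory` or `class` specs and rejects raw callables could look like:

```python
from typing import Any


def resolve_spec(spec: Any) -> Any:
    """Instantiate an object from a spec dict; raw callables are rejected.

    Hypothetical analogue of an ObjectFactory-style spec: either
    {"factory": callable, "args": [...]} or {"class": cls, "args": [...]}.
    """
    # Classes are callable too, so passing a bare class is also rejected.
    if callable(spec):
        raise ValueError(
            "Passing a raw callable is not allowed here. "
            "Use {'factory': callable} instead."
        )
    if not isinstance(spec, dict):
        raise TypeError(f"spec must be a dict, got {type(spec).__name__}")
    args = spec.get("args", [])
    if "factory" in spec:
        return spec["factory"](*args)
    if "class" in spec:
        return spec["class"](*args)
    raise ValueError("spec needs a 'factory' or 'class' key")
```

Wrapping the callable in a spec keeps the registry in control of construction, which is the point the exception message is making.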
[08:41:03] <wikibugs>	 (CR) Phuedx: [C: +1] Update Android Metrics Platform stream configs [mediawiki-config] - https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) (owner: Clare Ming)
[08:45:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1037.eqiad.wmnet
[08:45:54] <wikibugs>	 SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (Clement_Goubert)
[08:45:58] <wikibugs>	 SRE, serviceops: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (Clement_Goubert) Open→Resolved a:Clement_Goubert Scap deployments have been running fine following the proxy replacement. Re...
[08:46:13] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1037 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@eno12399np0.service,networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
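The `check_systemd_state` alerts list the failed units as a comma-separated run after `The following units failed:`. A small Python sketch, assuming that message shape holds (it is inferred from the alerts in this log, not from the check's source), for pulling the unit names out of an alert line:

```python
import re

# The unit list is a single comma-separated token ending at the next whitespace.
FAILED_UNITS = re.compile(r"The following units failed: (\S+)")


def failed_units(alert: str) -> list[str]:
    """Return the failed systemd unit names from a check_systemd_state alert."""
    match = FAILED_UNITS.search(alert)
    if match is None:
        return []
    return [unit for unit in match.group(1).split(",") if unit]
```

For the ganeti1037 alert this yields `ifup@eno12399np0.service` and `networking.service`; a RECOVERY line yields an empty list.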
[08:47:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P55484 and previous config saved to /var/cache/conftool/dbconfig/20240124-084721-marostegui.json
[08:54:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[08:56:37] <wikibugs>	 (CR) Slyngshede: [C: +2] Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede)
[08:58:52] <wikibugs>	 (Merged) jenkins-bot: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede)
[08:59:50] <wikibugs>	 (CR) Slyngshede: [C: +2] Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede)
[08:59:51] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:04] <jouncebot>	 hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0900)
[09:02:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55485 and previous config saved to /var/cache/conftool/dbconfig/20240124-090228-marostegui.json
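The db2145 repool commits land at roughly 08:17, 08:32, 08:47 and 09:02, and db2146 follows the same rhythm (09:05, 09:20, 09:35, 09:50), which suggests a staged repool every 15 minutes after each schema change. A minimal Python sketch of such a schedule, assuming four steps at a fixed 15-minute interval (inferred from these timestamps only; the traffic weight applied at each step is not visible in the log and is left out):

```python
from datetime import datetime, timedelta


def repool_schedule(start: datetime, steps: int = 4,
                    interval: timedelta = timedelta(minutes=15)) -> list[datetime]:
    """Return the planned time of each repool commit, evenly spaced from start."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [start + i * interval for i in range(steps)]


first = datetime(2024, 1, 24, 8, 17)
for when in repool_schedule(first):
    print(when.strftime("%H:%M"))  # prints 08:17, 08:32, 08:47, 09:02 (one per line)
```

The real commits drift by a few seconds per step, since each dbctl commit takes time to run.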
[09:02:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[09:02:34] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[09:02:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance
[09:02:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55486 and previous config saved to /var/cache/conftool/dbconfig/20240124-090250-marostegui.json
[09:03:02] <wikibugs>	 (Merged) jenkins-bot: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede)
[09:03:50] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1189/co" [puppet] - https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[09:04:12] <hashar>	 I will run the train in a few 
[09:04:20] <hashar>	 I am in the middle of completing a bug report
[09:05:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55487 and previous config saved to /var/cache/conftool/dbconfig/20240124-090512-marostegui.json
[09:08:17] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti1037.eqiad.wmnet
[09:10:06] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: add labels to thanos-rule blocks [puppet] - https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[09:10:52] <wikibugs>	 (CR) Muehlenhoff: "Ah, thanks for the pointer! I'll update this page to reflect that all allocations should only ever happen in data.yaml. Keeping two data s" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[09:11:22] <hashar>	 lets roll forward
[09:11:40] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433)
[09:11:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[09:11:51] <wikibugs>	 (CR) Muehlenhoff: "In addition that page is also terribly wrong, since there's no mention about the difference between local system users and system-wide use" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[09:12:26] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[09:19:59] <wikibugs>	 (PS1) WMDE-Fisch: Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798)
[09:20:15] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.15  refs T354433
[09:20:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P55488 and previous config saved to /var/cache/conftool/dbconfig/20240124-092019-marostegui.json
[09:20:21] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[09:20:49] <wikibugs>	 (CR) Awight: [C: +1] Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[09:23:33] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[09:24:17] <wikibugs>	 (CR) Muehlenhoff: Bird: move firewall and default neighbor to module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[09:27:10] <logmsgbot>	 !log hashar@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.15  refs T354433 (duration: 06m 55s)
[09:27:16] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[09:28:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:28:20] <stashbot>	 T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437
[09:28:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[09:28:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:29:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:29:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:29:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:29:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:30:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437
[09:30:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437
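The downtime lengths in these cookbook messages, `8:00:00` and `1 day, 0:00:00`, have exactly the shape of Python's `datetime.timedelta` string form. Spicerack cookbooks are written in Python, so that is a plausible source of the formatting, though this is an inference rather than something the log states. A quick check:

```python
from datetime import timedelta

# str() of a timedelta renders as "[D day[s], ]H:MM:SS".
print(str(timedelta(hours=8)))   # 8:00:00
print(str(timedelta(days=1)))    # 1 day, 0:00:00
print(str(timedelta(hours=24)) == str(timedelta(days=1)))  # True
```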
[09:31:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C
[09:32:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[09:32:30] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[09:35:26] <wikibugs>	 (CR) Clément Goubert: [C: +1] sre: add mw edit failures alert [alerts] - https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: Filippo Giunchedi)
[09:35:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P55489 and previous config saved to /var/cache/conftool/dbconfig/20240124-093526-marostegui.json
[09:35:32] <wikibugs>	 (CR) Clément Goubert: [C: +1] graphite: remove mw edit failures graphite alerts [puppet] - https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: Filippo Giunchedi)
[09:36:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: add mw edit failures alert [alerts] - https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: Filippo Giunchedi)
[09:36:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] graphite: remove mw edit failures graphite alerts [puppet] - https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: Filippo Giunchedi)
[09:36:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[09:37:59] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/992550 (owner: Andrea Denisse)
[09:38:33] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[09:41:28] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[09:41:29] <logmsgbot>	 !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[09:49:41] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C
[09:49:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: A1 codfw maintenance
[09:49:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: A1 codfw maintenance
[09:50:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55491 and previous config saved to /var/cache/conftool/dbconfig/20240124-095032-marostegui.json
[09:50:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance
[09:50:39] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[09:50:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance
[09:50:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55492 and previous config saved to /var/cache/conftool/dbconfig/20240124-095054-marostegui.json
[09:53:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55493 and previous config saved to /var/cache/conftool/dbconfig/20240124-095317-marostegui.json
[09:53:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS bullseye
[09:53:19] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Marostegui) @papaul @jhancock.wm db2158 db2157 db2136 es2026 are now off and ready to be moved anytime
[09:53:48] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Marostegui)
[09:53:49] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:25] <wikibugs>	 (PS14) Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[09:55:04] <wikibugs>	 (CR) CI reject: [V: -1] external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:58:08] <vgutierrez>	 !log depooling cp3066 - T354424
[09:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:13] <stashbot>	 T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424
[09:59:29] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS
[09:59:29] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS
[09:59:30] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS
[10:00:01] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 282598 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2024-03-24 11:27:11 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/HTTPS
[10:00:12] <vgutierrez>	  ^^ that was me, sorry about the noise
[10:00:15] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org has 539089 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/HTTPS
[10:00:23] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org has 539080 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/HTTPS
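The HTTPS check reports OCSP staple validity in raw seconds and certificate expiry in whole days. A minimal Python sketch for converting those figures, assuming the check's clock matched the alert's log timestamp in UTC and that "expires in N days" is the floor of the remaining time (both are assumptions; the check's own rounding rule is not shown in the log):

```python
from datetime import datetime, timedelta, timezone


def human_duration(seconds: int) -> str:
    """Render a seconds count as a timedelta string, e.g. '3 days, 6:29:58'."""
    return str(timedelta(seconds=seconds))


def days_until(expiry: datetime, now: datetime) -> int:
    """Whole days remaining until expiry (floor of the difference)."""
    return (expiry - now).days


now = datetime(2024, 1, 24, 10, 0, 15, tzinfo=timezone.utc)
print(human_duration(282598))  # 3 days, 6:29:58
print(human_duration(539089))  # 6 days, 5:44:49
print(days_until(datetime(2024, 10, 16, 23, 59, 59, tzinfo=timezone.utc), now))  # 266
```

Under those assumptions the numbers agree with the alerts: 60 days to the 2024-03-24 wikiworkshop.org expiry and 266 days to the 2024-10-16 wikipedia.org expiry.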
[10:00:38] <vgutierrez>	 !log repool cp3066 - T354424
[10:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:42] <wikibugs>	 (PS15) Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[10:08:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P55494 and previous config saved to /var/cache/conftool/dbconfig/20240124-100824-marostegui.json
[10:09:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "We should allow the current behavior for earlier versions of etcd, maybe, or fix our current configuration before this can be merged." [puppet] - https://gerrit.wikimedia.org/r/992629 (owner: Mxmxchere)
[10:10:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage
[10:11:19] <wikibugs>	 (CR) Ayounsi: "thanks for the feedback !" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:11:56] <wikibugs>	 (PS4) Ayounsi: Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[10:11:58] <wikibugs>	 (PS12) Ayounsi: Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[10:13:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage
[10:16:32] <wikibugs>	 (PS1) Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653)
[10:16:37] <wikibugs>	 (CR) Muehlenhoff: Bird: move firewall and default neighbor to module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:17:32] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations: Investigate crypto KDC deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544 (Gehel)
[10:19:07] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (Gehel)
[10:19:43] <wikibugs>	 (CR) Ayounsi: "Overall that makes sens to me, maybe rename the flag to a more explicit "--keep-mgmt-dns" ?" [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:21:09] <wikibugs>	 (CR) Samtar: "**Note to reviewer:** This change *may* depend on the user preference added in Ibbb59eb84f1dd0b40f9576e048f2ac76044f9014, but given it cur" [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[10:21:09] <hashar>	 I am rolling back
[10:21:20] <wikibugs>	 (PS1) Majavah: P:wmcs::kubeadm: worker: support containerd separate volume [puppet] - https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656)
[10:21:21] <hashar>	 that Echo issue sounds like it is breaking something
[10:22:01] <TheresNoTime>	 which issue?
[10:22:08] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1190/console" [puppet] - https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:22:13] <TheresNoTime>	 (curiosity)
[10:22:16] <hashar>	 https://phabricator.wikimedia.org/T355751
[10:22:18] <hashar>	 from Echo
[10:22:27] <hashar>	 which emits a notification with some `null` summary for the event
[10:22:32] <wikibugs>	 (PS2) Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490
[10:22:45] <hashar>	 which is passed to some Parser sanitizer function which now requires a String as input and thus bails out
[10:22:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (Arnoldokoth) @CCiufo-WMF Is this screenshot from Superset? If so, could you try accessing another service like https://icinga.wikimedia.org / https://turnilo.wikimedia.org and see if those work? It's al...
[10:23:00] <TheresNoTime>	 (ty, and oh dear :/)
[10:23:04] <hashar>	 that is for EchoRevertedPresentationModel
[10:23:11] <hashar>	 which I guess happens anytime some diff is reverted
[10:23:12] <hashar>	 maybe
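hashar's diagnosis is that `EchoRevertedPresentationModel` can hand a `null` summary to a Parser sanitizer that now only accepts strings. The real code is PHP; as a language-neutral sketch in Python, with `sanitize_summary` and `render_revert_notice` as hypothetical stand-ins rather than Echo or MediaWiki functions, the failure and one defensive pattern (coalesce a missing summary before the strict call) look like:

```python
import html
from typing import Optional


def sanitize_summary(text: str) -> str:
    """Hypothetical strict sanitizer: only strings are accepted."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return html.escape(text)


def render_revert_notice(summary: Optional[str]) -> str:
    """Build notification text, treating a missing summary as empty."""
    safe = sanitize_summary(summary if summary is not None else "")
    return f"Your edit was reverted. {safe}".rstrip()
```

Whether the proper fix belongs in the caller (as here) or in the sanitizer is a design call the log leaves open; the train was rolled back instead.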
[10:23:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P55495 and previous config saved to /var/cache/conftool/dbconfig/20240124-102330-marostegui.json
[10:24:12] <wikibugs>	 (CR) Majavah: "Good idea, done." [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:25:26] <wikibugs>	 (PS1) Hashar: Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433)
[10:25:28] <wikibugs>	 (CR) Hashar: [C: +2] Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433) (owner: Hashar)
[10:25:43] <hashar>	 rollbacks are cheap
[10:26:09] <wikibugs>	 (Merged) jenkins-bot: Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433) (owner: Hashar)
[10:26:36] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:26:57] <wikibugs>	 (CR) Majavah: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:28:26] * hashar looks at the process to raise awareness of the blocker :D
[10:29:26] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman)
[10:30:21] <wikibugs>	 (PS3) Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490
[10:30:23] <wikibugs>	 (PS1) Majavah: sre.mysql.clone: Silence SQL injection warning [cookbooks] - https://gerrit.wikimedia.org/r/992636
[10:31:53] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.42.0-wmf.15" - T354433
[10:31:57] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
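Rolling group1 forward and back amounts to rewriting which MediaWiki version each wiki is served from, then syncing that map; the rollback here was simply a Gerrit revert of the config change. A minimal Python sketch of the idea, assuming a flat mapping of wiki database names to `php-<version>` directories in the style of `wikiversions.json`; the wiki names, group membership and the previous version `1.42.0-wmf.14` below are hypothetical:

```python
def promote(versions: dict[str, str], group: set[str], version: str) -> dict[str, str]:
    """Return a new map with every wiki in `group` pointed at `version`.

    The input map is not modified, so the old state stays available for rollback.
    """
    updated = dict(versions)
    for wiki in group:
        if wiki not in updated:
            raise KeyError(f"unknown wiki: {wiki}")
        updated[wiki] = f"php-{version}"
    return updated


# Hypothetical starting state: group0 already on wmf.15, group1 still on wmf.14.
before = {
    "testwiki": "php-1.42.0-wmf.15",
    "commonswiki": "php-1.42.0-wmf.14",
    "wikidatawiki": "php-1.42.0-wmf.14",
}
group1 = {"commonswiki", "wikidatawiki"}

after = promote(before, group1, "1.42.0-wmf.15")
print(after["commonswiki"])   # php-1.42.0-wmf.15
# Rolling back means restoring the previous map, which promote() left untouched.
print(before["commonswiki"])  # php-1.42.0-wmf.14
```

Keeping the change as a pure function over a versioned config file is what makes "rollbacks are cheap": undoing it is one revert and one sync.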
[10:34:34] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:34:51] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:wmcs::kubeadm: worker: support containerd separate volume [puppet] - https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:34:55] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:34:59] <wikibugs>	 (CR) CI reject: [V: -1] sre.mysql.clone: Silence SQL injection warning [cookbooks] - https://gerrit.wikimedia.org/r/992636 (owner: Majavah)
[10:35:09] <wikibugs>	 (PS2) Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653)
[10:35:13] <wikibugs>	 (PS3) Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653)
[10:36:40] <moritzm>	 !log upgrading cumin1002 to pymysql 1.0.2-2~wmf11u1 T355531
[10:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:45] <stashbot>	 T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531
[10:37:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1014.eqiad.wmnet
[10:38:06] <hashar>	 jforrester@gerrit.wikimedia.org: Permission denied (publickey).
[10:38:08] * hashar whistles
[10:38:24] <wikibugs>	 (PS1) Muehlenhoff: Switch snapshot1014 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992637 (https://phabricator.wikimedia.org/T349619)
[10:38:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55496 and previous config saved to /var/cache/conftool/dbconfig/20240124-103837-marostegui.json
[10:38:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[10:38:43] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[10:38:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[10:38:57] <hashar>	 !log deployment-server: removing `gerrit` remote from `/srv/mediawiki-staging` given it is tied to a specific username and the `origin` remote already has ssh protocol for push # ping James_F 
[10:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55497 and previous config saved to /var/cache/conftool/dbconfig/20240124-103900-marostegui.json
[10:39:15] <wikibugs>	 (CR) Alexandros Kosiaris: jaeger: add oauth2-proxy sidecar (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[10:39:48] <TheresNoTime>	 a second set of eyes for `IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config]` (https://gerrit.wikimedia.org/r/992632) would be appreciated — namely if it would be safe to deploy without the referenced user option yet being in prod
[10:40:08] <wikibugs>	 (PS2) Majavah: sre.mysql.clone: Silence SQL injection warning [cookbooks] - https://gerrit.wikimedia.org/r/992636
[10:40:10] <wikibugs>	 (PS4) Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490
[10:40:27] <wikibugs>	 (PS5) Ayounsi: Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[10:40:29] <wikibugs>	 (PS13) Ayounsi: Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[10:40:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch snapshot1014 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992637 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:41:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55498 and previous config saved to /var/cache/conftool/dbconfig/20240124-104123-marostegui.json
[10:42:04] <wikibugs>	 (CR) Majavah: [C: -1] IS/CS: Add wmgEditRecoveryDefaultUserOptions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[10:42:19] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/992647 (https://phabricator.wikimedia.org/T355760)
[10:43:42] <wikibugs>	 (PS4) Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653)
[10:43:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1017.eqiad.wmnet with OS bullseye
[10:44:16] <wikibugs>	 (CR) Ayounsi: Bird: move firewall and default neighbor to module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:44:23] <wikibugs>	 (CR) Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[10:44:27] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:44:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1014.eqiad.wmnet
[10:45:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1014.eqiad.wmnet with OS bullseye
[10:49:41] <wikibugs>	 (CR) Majavah: [C: +1] "assuming the preference name is correct, which I have not checked :D" [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[10:49:53] <wikibugs>	 (CR) Ayounsi: sre.hosts.decommission: Add flag to disable removing mgmt DNS name (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[10:53:06] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (MoritzMuehlenhoff) > Assuming all our systems are no longer vulnerable   I double-checked and I can confirm that we have his consistently fixed across the fleet: The upstr...
[10:56:07] <wikibugs>	 (CR) Muehlenhoff: "Looks good, one last nit (which I missed earlier)" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:56:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P55499 and previous config saved to /var/cache/conftool/dbconfig/20240124-105630-marostegui.json
[10:57:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1173 with weight 0 T355760', diff saved to https://phabricator.wikimedia.org/P55500 and previous config saved to /var/cache/conftool/dbconfig/20240124-105702-root.json
[10:57:08] <stashbot>	 T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760
[10:57:41] <zabe>	 !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=rowikinews --fix # T350889
[10:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:46] <stashbot>	 T350889: Run maintenance script to fix BBC:* titles in all wikis following set up of Toba Batak Wikipedia - https://phabricator.wikimedia.org/T350889
[10:59:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1100)
[11:00:50] <wikibugs>	 (PS5) Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490
[11:02:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage
[11:03:53] <wikibugs>	 (PS1) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/992642 (https://phabricator.wikimedia.org/T355397)
[11:04:52] <wikibugs>	 (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/992642 (https://phabricator.wikimedia.org/T355397) (owner: Kosta Harlan)
[11:05:48] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/992642 (https://phabricator.wikimedia.org/T355397) (owner: Kosta Harlan)
[11:07:09] <wikibugs>	 (CR) Alexandros Kosiaris: "We chatted on IRC with Filippo, the best path forward is probably to follow the wmf-stable/secrets chart path in the same vein as e.g. ml-" [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[11:08:32] <wikibugs>	 (CR) Ladsgroup: [C: +2] "I have many questions..." [cookbooks] - https://gerrit.wikimedia.org/r/992636 (owner: Majavah)
[11:10:25] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[11:10:52] <wikibugs>	 (CR) Majavah: [C: +2] sre.hosts.decommission: Add flag to disable removing mgmt DNS name (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[11:11:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P55501 and previous config saved to /var/cache/conftool/dbconfig/20240124-111136-marostegui.json
[11:14:04] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (MoritzMuehlenhoff) >>! In T345724#9483239, @cmooney wrote: > I've been looking into these settings a little bit. >  > The man for //ipfrag_high_thresh// states: > ` > Maxi...
[11:14:45] <wikibugs>	 (PS3) Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555)
[11:15:23] <wikibugs>	 (Merged) jenkins-bot: sre.mysql.clone: Silence SQL injection warning [cookbooks] - https://gerrit.wikimedia.org/r/992636 (owner: Majavah)
[11:15:25] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/992490 (owner: Majavah)
[11:20:01] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/992645 (https://phabricator.wikimedia.org/T355066)
[11:20:07] <wikibugs>	 (PS1) Ladsgroup: GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992514
[11:20:17] <Amir1>	 jouncebot: nowandnext
[11:20:17] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1100)
[11:20:17] <jouncebot>	 In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400)
[11:20:31] * hashar lunches
[11:20:32] <wikibugs>	 (CR) Ladsgroup: [C: +2] GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992514 (owner: Ladsgroup)
[11:24:10] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[11:24:40] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[11:26:03] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[11:26:36] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[11:26:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55503 and previous config saved to /var/cache/conftool/dbconfig/20240124-112643-marostegui.json
[11:26:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[11:26:48] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[11:26:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[11:27:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55504 and previous config saved to /var/cache/conftool/dbconfig/20240124-112705-marostegui.json
[11:29:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55505 and previous config saved to /var/cache/conftool/dbconfig/20240124-112929-marostegui.json
[11:29:50] <wikibugs>	 (CR) Hnowlan: [C: +2] admin_ng: add namespace for mw-videoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[11:30:38] <wikibugs>	 (PS1) Majavah: hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - https://gerrit.wikimedia.org/r/992666
[11:31:18] <vgutierrez>	 !log depooling cp3066 - T354424
[11:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:23] <stashbot>	 T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424
[11:32:15] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[11:32:41] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[11:32:42] <wikibugs>	 (Merged) jenkins-bot: admin_ng: add namespace for mw-videoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[11:33:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1014.eqiad.wmnet with OS bullseye
[11:33:05] <vgutierrez>	 !log repool cp3066 - T354424
[11:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:23] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1193/co" [puppet] - https://gerrit.wikimedia.org/r/992666 (owner: Majavah)
[11:34:08] <wikibugs>	 (CR) Majavah: hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - https://gerrit.wikimedia.org/r/992666 (owner: Majavah)
[11:35:17] <wikibugs>	 (CR) Alexandros Kosiaris: "This should work. Before deployment via helmfile, we 'll need the corresponding private puppet change that populates" [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[11:35:22] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] jaeger: add oauth2-proxy sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[11:35:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] jaeger: add oauth2-proxy sidecar (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[11:38:36] <wikibugs>	 (Merged) jenkins-bot: GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992514 (owner: Ladsgroup)
[11:43:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host acmechief-test2001.codfw.wmnet
[11:44:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P55506 and previous config saved to /var/cache/conftool/dbconfig/20240124-114435-marostegui.json
[11:45:40] <wikibugs>	 (PS1) Muehlenhoff: Switch acmechief-test2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992669 (https://phabricator.wikimedia.org/T349619)
[11:46:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:47:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:48:06] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - https://gerrit.wikimedia.org/r/992666 (owner: Majavah)
[11:49:48] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet
[11:49:54] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet
[11:51:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch acmechief-test2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992669 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[11:52:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:52:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:54:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:55:42] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet
[11:55:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief-test2001.codfw.wmnet
[11:55:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:56:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet
[11:56:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:57:03] <wikibugs>	 (PS1) STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503)
[11:57:18] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:57:22] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]]
[11:57:45] <wikibugs>	 (CR) CI reject: [V: -1] Update beta configs to reflect new temp account naming pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: STran)
[11:58:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host acmechief-test1001.eqiad.wmnet
[11:58:51] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:59:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P55509 and previous config saved to /var/cache/conftool/dbconfig/20240124-115942-marostegui.json
[11:59:59] <wikibugs>	 (CR) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli)
[12:00:28] <wikibugs>	 (PS1) Superpes15: [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126)
[12:00:35] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[12:00:40] <wikibugs>	 (PS2) STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503)
[12:00:48] <wikibugs>	 (CR) Hnowlan: [C: +2] kubernetes: Add usernames for mw-videoscaler [puppet] - https://gerrit.wikimedia.org/r/992199 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[12:01:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355760
[12:01:50] <wikibugs>	 (PS1) Muehlenhoff: Switch acmechief-test1001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992673 (https://phabricator.wikimedia.org/T349619)
[12:02:08] <stashbot>	 T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760
[12:02:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355760
[12:03:35] <wikibugs>	 (CR) Dreamy Jazz: Update beta configs to reflect new temp account naming pattern (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: STran)
[12:04:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch acmechief-test1001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992673 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:07:18] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]] (duration: 09m 55s)
[12:08:17] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/992647 (https://phabricator.wikimedia.org/T355760) (owner: Gerrit maintenance bot)
[12:09:51] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[12:11:13] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Docker
[12:13:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[12:13:59] <stashbot>	 jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[12:14:47] <TheresNoTime>	 >.>
[12:14:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55510 and previous config saved to /var/cache/conftool/dbconfig/20240124-121448-marostegui.json
[12:14:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance
[12:14:54] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[12:15:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance
[12:15:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[12:15:15] <wikibugs>	 (PS3) STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503)
[12:15:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[12:15:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55511 and previous config saved to /var/cache/conftool/dbconfig/20240124-121526-marostegui.json
[12:15:30] <stashbot>	 marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[12:15:42] <wikibugs>	 (CR) STran: Update beta configs to reflect new temp account naming pattern (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: STran)
[12:16:55] <wikibugs>	 (PS1) Majavah: openstack: keystone: ensure keystone-admin is restarted when keystone is [puppet] - https://gerrit.wikimedia.org/r/992676
[12:17:45] <wikibugs>	 (CR) Dreamy Jazz: [C: +1] Update beta configs to reflect new temp account naming pattern [mediawiki-config] - https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: STran)
[12:18:42] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1194/co" [puppet] - https://gerrit.wikimedia.org/r/992676 (owner: Majavah)
[12:19:11] <wikibugs>	 (PS1) Majavah: wmcs-image-create: remove cloud-init-finished flag if present [puppet] - https://gerrit.wikimedia.org/r/992677
[12:19:44] <marostegui>	 !log Starting s6 eqiad failover from db1231 to db1173 - T355760
[12:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:49] <stashbot>	 T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760
[12:20:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1173 to s6 primary T355760', diff saved to https://phabricator.wikimedia.org/P55512 and previous config saved to /var/cache/conftool/dbconfig/20240124-122030-marostegui.json
[12:21:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231 T355760', diff saved to https://phabricator.wikimedia.org/P55513 and previous config saved to /var/cache/conftool/dbconfig/20240124-122148-root.json
[12:23:30] <wikibugs>	 (PS9) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237)
[12:23:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P55514 and previous config saved to /var/cache/conftool/dbconfig/20240124-122354-root.json
[12:25:45] <wikibugs>	 (PS1) Superpes15: [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041)
[12:28:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1052.eqiad.wmnet
[12:28:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc1052: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991299 (owner: Effie Mouzeli)
[12:33:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1052.eqiad.wmnet
[12:34:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2052.codfw.wmnet
[12:37:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc2052: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991300 (owner: Effie Mouzeli)
[12:39:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P55515 and previous config saved to /var/cache/conftool/dbconfig/20240124-123859-root.json
[12:42:33] <wikibugs>	 SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[12:42:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2052.codfw.wmnet
[12:42:49] <wikibugs>	 (PS6) Ayounsi: Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[12:42:51] <wikibugs>	 (PS14) Ayounsi: Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[12:43:00] <wikibugs>	 (CR) Ayounsi: "I also removed the 2 elements mentioning bird6 as they were for the bird to bird2 transition." [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:43:37] <wikibugs>	 SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[12:44:20] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:47:40] <wikibugs>	 (PS1) Hnowlan: kubernetes: move more jobrunner hosts to workers [puppet] - https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791)
[12:50:43] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete sysctls for setting lower boundary of IP frag [puppet] - https://gerrit.wikimedia.org/r/992680 (https://phabricator.wikimedia.org/T345724)
[12:54:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55516 and previous config saved to /var/cache/conftool/dbconfig/20240124-125404-root.json
[12:55:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:59:03] <wikibugs>	 (PS1) Cathal Mooney: Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724)
[12:59:48] <wikibugs>	 (Abandoned) Cathal Mooney: Remove obsolete sysctls for setting lower boundary of IP frag [puppet] - https://gerrit.wikimedia.org/r/992680 (https://phabricator.wikimedia.org/T345724) (owner: Muehlenhoff)
[13:04:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: Cathal Mooney)
[13:09:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55517 and previous config saved to /var/cache/conftool/dbconfig/20240124-130909-root.json
[13:16:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55518 and previous config saved to /var/cache/conftool/dbconfig/20240124-131600-marostegui.json
[13:16:06] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[13:16:37] <wikibugs>	 SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Jeff_G) >>! In T355433#9482934, @Wilfredor wrote: > I think the simplest way to correct this error is to lower the maximum upload limit to 1 GB for validation.  That would be reducing f...
[13:17:38] <wikibugs>	 SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Jeff_G) >>! In T355433#9482231, @MikhasikRV wrote: > @MatthewVernon I just used Upload Wizard to upload the file. I did not see neither attempt to delete the file after upload. After pr...
[13:21:50] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti1038 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/992689 (https://phabricator.wikimedia.org/T349925)
[13:22:31] <wikibugs>	 (CR) Samtar: [C: +1] Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: Varnent)
[13:23:14] <TheresNoTime>	 jouncebot: nowandnext
[13:23:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 36 minute(s)
[13:23:15] <jouncebot>	 In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400)
[13:23:29] <wikibugs>	 SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Jeff_G) >>! In T355433#9482093, @MatthewVernon wrote: > Right, those are all too far ago to still in the recent logs. Today's, however, I can find, and swift has done what was asked of...
[13:24:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55519 and previous config saved to /var/cache/conftool/dbconfig/20240124-132414-root.json
[13:28:22] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: Varnent)
[13:29:26] <wikibugs>	 (Merged) jenkins-bot: Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: Varnent)
[13:30:09] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. (T354790)]]
[13:30:14] <stashbot>	 T354790: Add Diff to RSS whitelist for Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T354790
[13:31:05] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet
[13:31:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P55520 and previous config saved to /var/cache/conftool/dbconfig/20240124-133107-marostegui.json
[13:31:10] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet
[13:32:03] <logmsgbot>	 !log samtar@deploy2002 samtar and varnent: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. (T354790)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:32:11] * TheresNoTime testing
[13:32:30] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] cache.mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - https://gerrit.wikimedia.org/r/991357 (owner: Effie Mouzeli)
[13:32:31] <logmsgbot>	 !log samtar@deploy2002 samtar and varnent: Continuing with sync
[13:33:29] <wikibugs>	 (Merged) jenkins-bot: cache.mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - https://gerrit.wikimedia.org/r/991357 (owner: Effie Mouzeli)
[13:33:40] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli)
[13:33:49] <wikibugs>	 (CR) CI reject: [V: -1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli)
[13:37:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet
[13:37:02] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet
[13:38:33] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[13:39:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55521 and previous config saved to /var/cache/conftool/dbconfig/20240124-133919-root.json
[13:39:23] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. (T354790)]] (duration: 09m 14s)
[13:39:43] <stashbot>	 T354790: Add Diff to RSS whitelist for Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T354790
[13:40:11] <wikibugs>	 (CR) Ayounsi: "nice !" [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: Cathal Mooney)
[13:40:20] <wikibugs>	 (PS4) Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555)
[13:40:47] <wikibugs>	 (PS1) Filippo Giunchedi: deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555)
[13:41:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make ganeti1038 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/992689 (https://phabricator.wikimedia.org/T349925) (owner: Muehlenhoff)
[13:41:35] <wikibugs>	 (CR) Filippo Giunchedi: "I have updated the secret name and pushed https://gerrit.wikimedia.org/r/c/labs/private/+/992699 based on what I could find both there and" [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[13:42:03] <wikibugs>	 (CR) Filippo Giunchedi: "Goes with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984143" [labs/private] - https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[13:42:20] <wikibugs>	 (PS10) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237)
[13:45:41] <wikibugs>	 (PS1) Muehlenhoff: Remove long-absented resource [puppet] - https://gerrit.wikimedia.org/r/992700
[13:46:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P55522 and previous config saved to /var/cache/conftool/dbconfig/20240124-134614-marostegui.json
[13:47:06] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (fgiunchedi) `ipmi_exporter` now has support to collect generic SEL entries and export metrics from those: https://github.com/...
[13:49:40] <wikibugs>	 (PS1) Muehlenhoff: Fold linux44 into the regular wmf kmod::blacklist [puppet] - https://gerrit.wikimedia.org/r/992702
[13:50:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1053.eqiad.wmnet
[13:50:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc1053: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991301 (owner: Effie Mouzeli)
[13:52:42] <godog>	 jouncebot: next
[13:52:43] <jouncebot>	 In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400)
[13:54:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P55523 and previous config saved to /var/cache/conftool/dbconfig/20240124-135424-root.json
[13:55:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1053.eqiad.wmnet
[13:59:38] <wikibugs>	 (PS5) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690)
[13:59:50] <wikibugs>	 (PS29) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400).
[14:00:04] <jouncebot>	 WMDE-Fisch and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] <WMDE-Fisch>	 \o
[14:00:15] <Superpes>	 Hi :)
[14:00:31] <Lucas_WMDE>	 o/
[14:01:12] <Lucas_WMDE>	 guess I’m deploying ^^
[14:01:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55524 and previous config saved to /var/cache/conftool/dbconfig/20240124-140120-marostegui.json
[14:01:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance
[14:01:26] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:01:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance
[14:01:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55525 and previous config saved to /var/cache/conftool/dbconfig/20240124-140142-marostegui.json
[14:01:46] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] kubernetes: move more jobrunner hosts to workers [puppet] - https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[14:01:56] <WMDE-Fisch>	 Lucas_WMDE: Sure. I could do mine but might not have time for more. So might make sense if you just go ahead.
[14:02:01] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[14:02:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[14:02:15] <Lucas_WMDE>	 alright, I’m deploying then
[14:02:23] <WMDE-Fisch>	 Thanks!
[14:03:30] <wikibugs>	 (Merged) jenkins-bot: Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[14:03:51] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437)
[14:03:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]]
[14:03:56] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437)
[14:04:00] <stashbot>	 T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437
[14:04:01] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f37d946c-6c32-4271-92ba-bc66a002809d) set by klausman@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with...
[14:04:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55526 and previous config saved to /var/cache/conftool/dbconfig/20240124-140406-marostegui.json
[14:04:11] <stashbot>	 T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[14:04:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2053.codfw.wmnet
[14:05:22] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15)
[14:05:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:05:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc2053: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991302 (owner: Effie Mouzeli)
[14:06:02] <Lucas_WMDE>	 WMDE-Fisch: can you test the change on mwdebug?
[14:06:11] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:06:19] <wikibugs>	 (PS6) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690)
[14:06:36] <wikibugs>	 (PS30) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[14:07:52] <WMDE-Fisch>	 Lucas_WMDE: Hard to tell atm. I tried but I'm not sure if there's a delay. Please go on. There seems to be no problem at least.
[14:08:14] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15)
[14:08:19] <Lucas_WMDE>	 alright
[14:08:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Continuing with sync
[14:08:21] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:08:44] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul) @Marostegui thank you.
[14:09:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2053.codfw.wmnet
[14:10:35] <wikibugs>	 (PS1) Eevans: restbase: upgrade Cassandra to 'dev' (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719)
[14:10:36] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (CCiufo-WMF) Hmm I'm getting the same warning when accessing https://icinga.wikimedia.org/ and https://turnilo.wikimedia.org/  I was also just trying to access https://superset.wikimedia.org/ previously,...
[14:11:19] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:22] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706
[14:11:30] <Lucas_WMDE>	 ^ decided to also add my own change ^^
[14:12:27] <Lucas_WMDE>	 anyone want to +1 it? :)
[14:12:30] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719) (owner: Eevans)
[14:13:24] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman)
[14:13:38] <wikibugs>	 (PS1) Samtar: EditRecovery: Add user preference [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992522 (https://phabricator.wikimedia.org/T350653)
[14:14:35] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:14:39] <wikibugs>	 (CR) WMDE-Fisch: [C: +1] "Makes sense :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE))
[14:14:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]] (duration: 10m 52s)
[14:14:50] <stashbot>	 T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[14:14:51] <Lucas_WMDE>	 thanks :)
[14:14:58] <wikibugs>	 (CR) Samtar: [C: +1] "ship it 🛳️" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE))
[14:15:15] <Lucas_WMDE>	 emoji in gerrit :screm:
[14:15:30] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15)
[14:15:31] <TheresNoTime>	 :D
[14:15:31] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman) ml-serve2005 is off and ready
[14:15:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15)
[14:15:54] <Superpes>	 :D Lol
[14:16:51] <wikibugs>	 (Merged) jenkins-bot: [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15)
[14:16:55] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul) @klausman thank you
[14:17:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]]
[14:17:25] <stashbot>	 T355126: Change the autoconfirmed user standard for gan.wikipedia - https://phabricator.wikimedia.org/T355126
[14:17:33] <wikibugs>	 (CR) Eevans: [C: +2] restbase: upgrade Cassandra to 'dev' (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719) (owner: Eevans)
[14:18:51] <wikibugs>	 (PS7) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690)
[14:19:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:19:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P55527 and previous config saved to /var/cache/conftool/dbconfig/20240124-141912-marostegui.json
[14:19:33] <wikibugs>	 (PS1) Alexandros Kosiaris: eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709
[14:19:42] <Lucas_WMDE>	 Superpes: anything to test about this change?
[14:19:48] <Lucas_WMDE>	 I guess it’s a bit difficult to test autoconfirmation settings
[14:20:16] <Superpes>	 Yep agree, nothing to test, I think who is already autoconfirmed won't be removed :/
[14:20:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Continuing with sync
[14:20:33] <wikibugs>	 (PS1) Andrea Denisse: grafana: Failover from grafana1002 to grafana2001 [puppet] - https://gerrit.wikimedia.org/r/992710 (https://phabricator.wikimedia.org/T352665)
[14:23:30] <wikibugs>	 SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (MikhasikRV) >>! In T355433#9484879, @Jeff_G wrote: >  > I was able to download the file as F 1-74-0217.PDF. In case one of us gets it to upload, what filename would you like and how wou...
[14:24:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet
[14:24:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:24:42] <wikibugs>	 (PS31) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[14:25:03] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:25:07] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet
[14:25:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet
[14:25:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:25:29] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:25:34] <wikibugs>	 (CR) CI reject: [V: -1] mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[14:25:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:25:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:26:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:27:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]] (duration: 09m 51s)
[14:27:10] <stashbot>	 T355126: Change the autoconfirmed user standard for gan.wikipedia - https://phabricator.wikimedia.org/T355126
[14:27:29] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15)
[14:27:36] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15)
[14:28:41] <wikibugs>	 (Abandoned) Samtar: EditRecovery: Add user preference [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992522 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[14:28:48] <wikibugs>	 (Merged) jenkins-bot: [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15)
[14:29:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]]
[14:29:14] <stashbot>	 T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041
[14:29:42] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[14:29:46] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[14:30:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:31:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet
[14:31:10] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet
[14:31:11] <Lucas_WMDE>	 Superpes: the azwiki change should be testable :)
[14:31:28] <Superpes>	 Yep I'm testing! just a minute since there are a lot of aliases
[14:31:29] <aqu>	 !log analytics/refinery weekly deployment train - begin
[14:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:15] <Lucas_WMDE>	 ok!
[14:32:36] <Superpes>	 Ok it's fine thanks Lucas_WMDE :)
[14:32:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync
[14:32:41] <Lucas_WMDE>	 ok!
[14:33:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1054.eqiad.wmnet
[14:33:36] <logmsgbot>	 !log aqu@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[14:33:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc1054: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991303 (owner: Effie Mouzeli)
[14:34:06] <logmsgbot>	 !log aqu@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[14:34:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P55529 and previous config saved to /var/cache/conftool/dbconfig/20240124-143419-marostegui.json
[14:34:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet
[14:34:50] <Lucas_WMDE>	 hm, there are some new MessageCache errors in logspam-watch…
[14:34:52] * Lucas_WMDE looks
[14:35:01] <logmsgbot>	 !log aqu@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[14:35:10] <logmsgbot>	 !log aqu@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[14:35:24] <Lucas_WMDE>	 ok but according to logstash they already went away again
[14:35:50] <logmsgbot>	 !log aqu@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[14:36:00] <logmsgbot>	 !log aqu@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[14:36:10] <logmsgbot>	 !log aqu@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[14:37:02] <logmsgbot>	 !log aqu@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[14:37:18] <wikibugs>	 (PS1) Andrea Denisse: grafana: Ensure user traffic goes to grafana2001 [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665)
[14:37:45] <Lucas_WMDE>	 let’s see if the current deployment triggers it again, I guess
[14:37:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1054.eqiad.wmnet
[14:38:20] <Lucas_WMDE>	 (it was apparently limited to frwikisource and shwiktionary, whatever it was)
[14:38:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2054.codfw.wmnet
[14:38:42] <Lucas_WMDE>	 (“LogicException: Process cache for 'fr' should be set by now.”, and same for sh instead of fr)
[14:38:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc2054: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991304 (owner: Effie Mouzeli)
[14:39:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]] (duration: 10m 00s)
[14:39:21] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:22] <Lucas_WMDE>	 no recurrence in logstash so far
[14:39:26] <stashbot>	 T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041
[14:39:27] <Lucas_WMDE>	 I’ll do the birthday logo cleanup then
[14:39:44] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706
[14:39:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE))
[14:40:33] <Superpes>	 Many thanks for your assistance (and for the various cleanups! Sometimes we forget to complete things lmao) Lucas_WMDE :)
[14:40:38] <logmsgbot>	 !log aqu@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[14:40:43] <Lucas_WMDE>	 np :)
[14:40:45] <wikibugs>	 (Merged) jenkins-bot: cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE))
[14:41:00] <Lucas_WMDE>	 yeah I looked for other *birthday* files yesterday and found this one :D
[14:41:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]]
[14:41:37] <logmsgbot>	 !log aqu@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[14:41:39] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c]: Regular analytics weekly train [analytics/refinery@d1ee04cc]
[14:43:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:43:32] <James_F>	 hashar: Whoops, sorry, thanks for cleaning up mw-staging; can you see when that was added? Don't think I've done any git stuff there for years…
[14:44:12] <Lucas_WMDE>	 checked that https://en.wikipedia.org/static/images/project-logos/cswiki-birthday.png goes away on mwdebug
[14:44:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[14:44:43] <wikibugs>	 (PS1) Jforrester: Fix EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751)
[14:46:12] <Lucas_WMDE>	 James_F: I’m done deploying in a few minutes if you want to backport that immediately
[14:46:46] <James_F>	 Lucas_WMDE: I was going to leave it to the train conductor.
[14:46:58] <Lucas_WMDE>	 alright
[14:47:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2054.codfw.wmnet
[14:48:59] <Lucas_WMDE>	 alright, https://en.wikipedia.org/static/images/project-logos/cswiki-birthday-1.5x.png is gone now
[14:49:21] <Lucas_WMDE>	 https://en.wikipedia.org/static/images/project-logos/cswiki-birthday.png is still in the front cache and will stay there for up to a year
[14:49:25] <Lucas_WMDE>	 I think that’s fine 🤷
[14:49:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55530 and previous config saved to /var/cache/conftool/dbconfig/20240124-144925-marostegui.json
[14:49:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:49:39] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:49:40] <wikibugs>	 (CR) Hnowlan: [C: +2] kubernetes: move more jobrunner hosts to workers [puppet] - https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[14:49:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:49:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55531 and previous config saved to /var/cache/conftool/dbconfig/20240124-144947-marostegui.json
[14:50:03] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709 (owner: Alexandros Kosiaris)
[14:50:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]] (duration: 09m 36s)
[14:50:50] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c]: Regular analytics weekly train [analytics/refinery@d1ee04cc] (duration: 09m 11s)
[14:51:54] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55532 and previous config saved to /var/cache/conftool/dbconfig/20240124-145211-marostegui.json
[14:52:25] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c] (thin): Regular analytics weekly train THIN [analytics/refinery@d1ee04cc]
[14:52:32] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c] (thin): Regular analytics weekly train THIN [analytics/refinery@d1ee04cc] (duration: 00m 06s)
[14:52:39] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d1ee04cc]
[14:53:04] <wikibugs>	 (Merged) jenkins-bot: eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709 (owner: Alexandros Kosiaris)
[14:55:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:55:46] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic2094.codfw.wmnet']
[14:55:49] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:55:54] <akosiaris>	 !log bump eventrouter limits/requests memory/cpu
[14:55:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:18] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:56:19] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d1ee04cc] (duration: 03m 40s)
[14:56:37] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:56:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:56:54] <wikibugs>	 (CR) Hnowlan: [C: +2] modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[14:56:58] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1198/co" [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[14:56:58] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:57:04] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:57:17] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:57:22] <wikibugs>	 (PS1) Majavah: Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725
[14:57:42] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06]: Regular analytics weekly train [analytics/refinery@13f7a06c]
[14:57:50] <wikibugs>	 (Merged) jenkins-bot: modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[14:57:55] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:58:25] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:30] <wikibugs>	 (CR) CI reject: [V: -1] Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah)
[14:59:00] <wikibugs>	 (PS2) Majavah: Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725
[14:59:09] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2427.codfw.wmnet with OS bullseye
[14:59:12] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2430.codfw.wmnet with OS bullseye
[14:59:18] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2446.codfw.wmnet with OS bullseye
[14:59:21] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1500)
[15:00:07] <wikibugs>	 (CR) CI reject: [V: -1] Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah)
[15:00:27] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1200/co" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah)
[15:00:32] <wikibugs>	 (CR) Majavah: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah)
[15:00:59] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2036 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:07] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1202/co" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah)
[15:03:15] <wikibugs>	 (PS1) Volans: setup.py: remove dependency on pytest-runner [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992726
[15:03:17] <wikibugs>	 (CR) CI reject: [V: -1] Fix EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester)
[15:04:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2055.codfw.wmnet
[15:04:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (Arnoldokoth) Hehe. Nice.
[15:05:08] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (Arnoldokoth) In progress→Resolved
[15:05:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:05:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mc2055: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991305 (owner: Effie Mouzeli)
[15:06:09] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:06:31] <wikibugs>	 (CR) Andrea Denisse: [C: +2] authdns: Add entry for the 'authdns' GID [puppet] - https://gerrit.wikimedia.org/r/992550 (owner: Andrea Denisse)
[15:07:13] <wikibugs>	 (CR) Hashar: ">  ArgumentCountError: Too few arguments to function MediaWiki\User\CentralId\CentralIdLookupFactory::__construct(), 3 passed in /workspac" [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester)
[15:07:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P55533 and previous config saved to /var/cache/conftool/dbconfig/20240124-150718-marostegui.json
[15:07:21] <wikibugs>	 (CR) Hashar: [C: +2] Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992367 (owner: Kosta Harlan)
[15:07:55] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06]: Regular analytics weekly train [analytics/refinery@13f7a06c] (duration: 10m 12s)
[15:08:27] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06] (thin): Regular analytics weekly train THIN [analytics/refinery@13f7a06c]
[15:08:33] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06] (thin): Regular analytics weekly train THIN [analytics/refinery@13f7a06c] (duration: 00m 05s)
[15:08:35] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@13f7a06c]
[15:09:02] <wikibugs>	 (CR) Slyngshede: [C: +1] "Yes please. Look good." [puppet] - https://gerrit.wikimedia.org/r/992700 (owner: Muehlenhoff)
[15:09:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2055.codfw.wmnet
[15:10:43] <wikibugs>	 (PS1) Arnaudb: mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674)
[15:10:57] <wikibugs>	 (CR) Hashar: [C: +2] "+2 ing after I have +2ed the CentralAuth tests fix Iac91046516a1c05da8a12de5cf03dde089050662" [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester)
[15:11:02] <moritzm>	 !log uploading pymsql 1.0.2-2~wmf11u1 to apt.wikimedia.org T355531
[15:11:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:13] <stashbot>	 T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531
[15:12:03] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@13f7a06c] (duration: 03m 28s)
[15:12:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (Arnoldokoth) Open→In progress
[15:12:37] <wikibugs>	 (Merged) jenkins-bot: Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992367 (owner: Kosta Harlan)
[15:13:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (Arnoldokoth) @SLopes-WMF @thcipriani Will need your approval for this.
[15:13:52] <wikibugs>	 (CR) Marostegui: "Remember to push the dbctl configuration once this is merged" [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[15:13:57] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[15:16:35] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage
[15:16:38] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[15:17:04] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage
[15:19:28] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage
[15:20:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:20:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:20:16] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[15:21:46] <wikibugs>	 (PS1) Alexandros Kosiaris: helm-state-metrics: Declare the healthcheck port [deployment-charts] - https://gerrit.wikimedia.org/r/992731 (https://phabricator.wikimedia.org/T355167)
[15:21:48] <aqu>	 !log Refinery weekly deployment train - end   (scap, then deployed onto hdfs) (test cluster deploy still broken T354703)
[15:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:59] <stashbot>	 T354703: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703
[15:22:05] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage
[15:22:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P55534 and previous config saved to /var/cache/conftool/dbconfig/20240124-152224-marostegui.json
[15:25:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[15:25:06] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@da2e61c]: Regular analytics weekly train [airflow-dags/analytics@da2e61c7]
[15:25:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:25:48] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@da2e61c]: Regular analytics weekly train [airflow-dags/analytics@da2e61c7] (duration: 00m 42s)
[15:26:12] <wikibugs>	 (PS1) Muehlenhoff: mc: Switch to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619)
[15:26:24] <wikibugs>	 Puppet, Infrastructure-Foundations, Toolforge, Goal, cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (dcaro)
[15:26:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:27:00] <wikibugs>	 Puppet, Toolforge, Documentation: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (dcaro) Open→Declined No more grid work is going to be done, we are retiring it :)
[15:29:02] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:29:12] <wikibugs>	 (PS1) Slyngshede: Add uwsgi plugin dependency [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739
[15:30:05] <wikibugs>	 (Merged) jenkins-bot: Fix EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester)
[15:30:10] <wikibugs>	 (PS2) Slyngshede: Debian packaging, dependencies and permissions [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739
[15:31:11] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM, that tripped me up in a couple of tests, so I won't miss it." [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992726 (owner: Volans)
[15:32:10] <wikibugs>	 (PS1) Alexandros Kosiaris: eventrouter: Add port 8080 to containerPorts [deployment-charts] - https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167)
[15:32:19] <wikibugs>	 (CR) Volans: [C: +2] setup.py: remove dependency on pytest-runner [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992726 (owner: Volans)
[15:32:21] <moritzm>	 !log imported jenkins 2.426.3 for buster/bullseye T355503
[15:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:23] <wikibugs>	 (Merged) jenkins-bot: setup.py: remove dependency on pytest-runner [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992726 (owner: Volans)
[15:36:03] <wikibugs>	 (CR) Volans: "question inline" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 (owner: Slyngshede)
[15:36:54] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:37:08] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:37:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host phab2002.codfw.wmnet
[15:37:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55536 and previous config saved to /var/cache/conftool/dbconfig/20240124-153730-marostegui.json
[15:37:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance
[15:37:36] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[15:37:38] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:37:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance
[15:37:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55537 and previous config saved to /var/cache/conftool/dbconfig/20240124-153752-marostegui.json
[15:37:53] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:37:59] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:38:15] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:38:31] <sukhe>	 !log sudo cumin 'A:dns-rec' "disable-puppet 'merging CR 980929'"
[15:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:51] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2427.codfw.wmnet with OS bullseye
[15:40:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55538 and previous config saved to /var/cache/conftool/dbconfig/20240124-154013-marostegui.json
[15:40:28] <wikibugs>	 (PS3) Slyngshede: Debian packaging, dependencies and permissions [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739
[15:41:02] <wikibugs>	 (CR) Slyngshede: Debian packaging, dependencies and permissions (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 (owner: Slyngshede)
[15:41:19] <wikibugs>	 (CR) Ssingh: [C: +2] dnsrecursor: forward_zones for wikimedia.org, too [puppet] - https://gerrit.wikimedia.org/r/980929 (https://phabricator.wikimedia.org/T347054) (owner: BBlack)
[15:42:15] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2430.codfw.wmnet with OS bullseye
[15:43:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:44:38] <wikibugs>	 (PS1) Muehlenhoff: Switch phab2002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619)
[15:45:03] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2446.codfw.wmnet with OS bullseye
[15:46:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:46:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch phab2002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:46:38] <wikibugs>	 (PS2) Muehlenhoff: Switch phab2002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619)
[15:47:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:47:34] <logmsgbot>	 !log hashar@deploy2002 Synchronized php-1.42.0-wmf.15/extensions/CentralAuth/tests/phpunit/CentralAuthIdLookupTest.php: Fix CentralIdLookup tests (duration: 11m 18s)
[15:48:00] <wikibugs>	 (CR) Tacsipacsi: IS/CS: Add wmgEditRecoveryDefaultUserOptions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[15:48:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:48:40] <hashar>	 more than 11 minutes :-\
[15:48:43] <hashar>	 poor scap
[15:48:51] <sukhe>	 !log sudo cumin -b1 -s120 'A:dns-rec' "enable-puppet 'merging CR 980929' && run-puppet-agent"
[15:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:05] <wikibugs>	 (PS1) Alexandros Kosiaris: cxserver: Remove all kademlia support from chart [deployment-charts] - https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167)
[15:50:07] <vgutierrez>	 !log disable puppet on cp3066 - T354424
[15:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:14] <stashbot>	 T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424
[15:51:10] <wikibugs>	 (CR) Muehlenhoff: Debian packaging, dependencies and permissions (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 (owner: Slyngshede)
[15:55:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P55539 and previous config saved to /var/cache/conftool/dbconfig/20240124-155519-marostegui.json
[15:55:43] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:25] <wikibugs>	 (CR) Ssingh: [C: +1] "Looks good, thanks for cleaning this up! Happy to take care of merging this and also we can reimage a durum host to see how the initial pu" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[15:57:36] <logmsgbot>	 !log hashar@deploy2002 Synchronized php-1.42.0-wmf.15/extensions/Echo/includes/Formatters/EchoRevertedPresentationModel.php: Fix EchoRevertedPresentationModel using null as string - T355751 (duration: 09m 06s)
[15:57:46] <stashbot>	 T355751: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::escapeHtmlAllowEntities() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/extensions/Echo/includes/DiscussionParser.php on line 1299 - https://phabricator.wikimedia.org/T355751
[15:58:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host phab2002.codfw.wmnet
[16:02:11] <wikibugs>	 (PS1) Bking: cloudelastic: lay the groundwork for private IP migration [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617)
[16:02:50] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:03:32] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:03:38] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:04:23] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:04:31] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:10:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P55540 and previous config saved to /var/cache/conftool/dbconfig/20240124-161026-marostegui.json
[16:11:30] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good based on the phab discussion" [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: Cathal Mooney)
[16:15:23] <wikibugs>	 (CR) Ssingh: Puppet: Routed Ganeti support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[16:18:23] <wikibugs>	 SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (Bawolff) Just trying to think up solutions - if thumbor gives a 429, could varnish inste...
[16:19:46] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Jhancock.wm)
[16:20:06] <wikibugs>	 (PS32) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[16:20:08] <wikibugs>	 (CR) BBlack: [C: +1] Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: Cathal Mooney)
[16:21:54] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/992702 (owner: Muehlenhoff)
[16:23:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:25:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55541 and previous config saved to /var/cache/conftool/dbconfig/20240124-162532-marostegui.json
[16:25:39] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:25:46] <wikibugs>	 (CR) Ayounsi: Puppet: Routed Ganeti support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[16:28:30] <wikibugs>	 (CR) Ssingh: [C: +1] Puppet: Routed Ganeti support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[16:30:58] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-eqiad: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[16:31:03] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[16:31:52] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (Clement_Goubert)
[16:35:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance
[16:35:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance
[16:38:01] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul)
[16:39:19] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase103[1-3].eqiad.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[16:39:24] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[16:41:14] <wikibugs>	 (PS1) Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360)
[16:42:29] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: Jcrespo)
[16:42:53] <wikibugs>	 (PS2) Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360)
[16:43:58] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596)
[16:44:06] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: Jcrespo)
[16:44:25] <wikibugs>	 (PS3) Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360)
[16:44:48] <wikibugs>	 (PS4) Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360)
[16:44:55] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: Jcrespo)
[16:49:54] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: Jcrespo)
[16:51:07] <wikibugs>	 (CR) BCornwall: [C: +2] fifo-log-demux: Update project homepage [puppet] - https://gerrit.wikimedia.org/r/973887 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[16:51:50] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992758
[16:54:34] <XioNoX>	 !log disable puppet on all the hosts running bird to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/991699
[16:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:52] <wikibugs>	 (CR) Ayounsi: Bird: move firewall and default neighbor to module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[16:54:57] <hashar>	 jouncebot: nowandnext
[16:54:57] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 5 minute(s)
[16:54:57] <jouncebot>	 In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1800)
[16:55:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:55:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:55:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55542 and previous config saved to /var/cache/conftool/dbconfig/20240124-165522-marostegui.json
[16:55:24] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992758 (owner: Volans)
[16:55:26] <sukhe>	 !log enable puppet on durum1001 to test CR 991699
[16:55:32] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:37] <vgutierrez>	 !log enable puppet on cp3066 - T354424
[16:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:43] <stashbot>	 T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424
[16:56:55] <wikibugs>	 (CR) Ayounsi: [C: +2] Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[16:57:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55543 and previous config saved to /var/cache/conftool/dbconfig/20240124-165732-marostegui.json
[16:57:59] <hashar>	 train blocker got lifted so I guess I can run the train again now?  poke thcipriani 
[16:58:23] <hashar>	 given I have to go in 40 minutes
[16:58:32] <wikibugs>	 (PS33) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[16:58:41] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul)
[16:58:54] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Marostegui) All db* and es* up and running
[16:59:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet
[16:59:36] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet
[17:02:01] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) Some 'raw' data on the last 30 days increase of errors per-host/drive: ` cloudcephosd1021-sdb 88 cloudceph...
[17:03:05] <wikibugs>	 (PS1) Jcrespo: Revert "dbbackups: Create temporary fileset for dbprov for dbbackups archival" [puppet] - https://gerrit.wikimedia.org/r/992766
[17:04:42] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul)
[17:05:22] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet
[17:05:32] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet
[17:05:38] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@16476a9] (releasing): (no justification provided)
[17:06:46] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@16476a9] (releasing): (no justification provided) (duration: 01m 07s)
[17:07:11] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992758 (owner: Volans)
[17:07:13] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman) ml-serve2005 is back up and working fine
[17:07:19] <wikibugs>	 (PS8) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690)
[17:07:23] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433)
[17:07:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[17:08:07] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[17:09:51] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase103[1-3].eqiad.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[17:10:02] <sukhe>	 !log sudo cumin -b1 -s60 "R:Class = Bird" "enable-puppet 'CR991699' && run-puppet-agent"
[17:10:08] <wikibugs>	 (PS34) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[17:10:08] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[17:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:24] <volans>	 sukhe: pro-tip C:bird is equivalent to R:Class = Bird ;)
[17:11:46] <sukhe>	 volans: thanks :)
[17:11:56] <sukhe>	 mostly used to A: and P: and hence
[17:12:14] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: Cathal Mooney)
[17:12:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P55544 and previous config saved to /var/cache/conftool/dbconfig/20240124-171238-marostegui.json
[17:13:16] <volans>	 ehehe, https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection is your friend :D
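volans's tip is that `C:bird` selects the same hosts as `R:Class = Bird`. A minimal sketch of that rewrite, assuming (as Puppet does when it titles Class resources) that the first letter of each `::` segment of the class name is upper-cased. The helper name is hypothetical, it ignores the rest of the query grammar, and the linked Cumin page remains the authority on what the real shortcut accepts.

```python
def expand_class_shortcut(query: str) -> str:
    """Rewrite a cumin-style 'C:<class>' shortcut as 'R:Class = <Title>'.

    Assumption: the title is the class name with the first letter of each
    '::' segment upper-cased, which is how Puppet titles Class resources.
    Queries that do not start with 'C:' are returned unchanged.
    """
    if not query.startswith("C:"):
        return query
    class_name = query[len("C:"):].strip()
    title = "::".join(seg[:1].upper() + seg[1:] for seg in class_name.split("::"))
    return f"R:Class = {title}"


print(expand_class_shortcut("C:bird"))           # R:Class = Bird
print(expand_class_shortcut("C:profile::bird"))  # R:Class = Profile::Bird (hypothetical class)
```

The short form is also easier to type inside a quoted cumin command line, since it contains no spaces.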
[17:14:16] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul)
[17:14:24] <wikibugs>	 SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[17:14:52] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "dbbackups: Create temporary fileset for dbprov for dbbackups archival" [puppet] - https://gerrit.wikimedia.org/r/992766 (owner: Jcrespo)
[17:16:48] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.15  refs T354433
[17:16:54] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[17:17:09] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2015-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[17:17:22] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[17:17:50] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:19:07] <hashar>	 of course
[17:19:11] <hashar>	 LiquidThreads is broken again
[17:20:31] <wikibugs>	 (PS3) Jcrespo: mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069)
[17:21:18] <Lucas_WMDE>	 damn, I didn’t realize we had it in production at all
[17:21:21] <Lucas_WMDE>	 (I only know it from TWN)
[17:22:21] <Nikerabbit>	 what's wrong with it this time?
[17:22:59] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 241, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:23:58] <logmsgbot>	 !log hashar@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.15  refs T354433 (duration: 07m 10s)
[17:24:11] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[17:24:13] <wikibugs>	 (PS1) Jforrester: Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639)
[17:25:19] <wikibugs>	 (CR) Jcrespo: [C: +2] mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) (owner: Jcrespo)
[17:26:21] <wikibugs>	 SRE: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (CDanis)
[17:27:07] <hashar>	 T355808
[17:27:08] <stashbot>	 T355808: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T355808
[17:27:12] <hashar>	 that is for liquidthreads
[17:27:17] <hashar>	 I haven't marked it as a blocker though
[17:27:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P55545 and previous config saved to /var/cache/conftool/dbconfig/20240124-172745-marostegui.json
[17:28:54] <Nikerabbit>	 ah, probably some corner case as we haven't seen it at twn (yet)
[17:29:16] <wikibugs>	 (CR) Jforrester: [C: -1] Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) (owner: Jforrester)
[17:29:51] <wikibugs>	 (CR) Cathal Mooney: "Good stuff thanks!  One comment on edge-case in line, otherwise LGTM.  I'll check on netbox-next and see if I can figure anything out abou" [software/netbox-extras] - https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: Ayounsi)
[17:29:55] <hashar>	 Nikerabbit: yeah I guess so  :)
[17:29:58] <hashar>	 and somehow
[17:30:06] <hashar>	 we have an error log from 1.42.0-wmf.13 ...
[17:30:20] <hashar>	 ah
[17:30:27] <hashar>	 that is from mwmaint2002
[17:31:06] <hashar>	 some `maintenance/migrateLinksTable.php(`  which yields  PHP Warning: EtcdConfig failed to fetch data: (curl error: 6) Couldn't resolve host name
[17:31:34] <hashar>	 I guess it is a very long ongoing migration
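The warning hashar pasted is curl error 6, which libcurl names CURLE_COULDNT_RESOLVE_HOST: the DNS lookup failed before any connection to etcd was attempted. A minimal way to check that failure mode from an affected host is to resolve the same name directly. This is a sketch assuming Python 3 is available there; the helper is hypothetical, and the etcd hostname EtcdConfig uses is not in the log.

```python
import socket
from typing import Callable, Optional


def can_resolve(host: str, port: Optional[int] = None,
                resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `host` resolves to at least one address.

    A failed lookup raises socket.gaierror, the same condition curl
    reports as error 6. `resolver` is injectable so the check can be
    exercised without touching real DNS.
    """
    try:
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        return False
```

Usage would be `can_resolve("<etcd host from the EtcdConfig settings>")`; a `False` there reproduces the warning's cause without involving MediaWiki at all.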
[17:33:01] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (cmooney) Open→Resolved >>! In T345724#9484317, @MoritzMuehlenhoff wrote: > Given that we specifically only added this for Fragmentsmack (and not for a specific sca...
[17:35:29] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2015-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[17:35:34] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[17:36:13] <wikibugs>	 (PS1) Klausman: ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516)
[17:36:36] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) I'm running a script now to gather nicer reports with smartctl included, will send it once it's finished.
[17:37:54] <hashar>	 so MediaWiki looks okish, I am off!
[17:38:51] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[17:42:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55546 and previous config saved to /var/cache/conftool/dbconfig/20240124-174251-marostegui.json
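The dbctl commits for db1144:3314 in this log are one depool followed by four "Repooling after maintenance" commits. Computing the gaps between their timestamps, copied verbatim from the lines above, shows the repools landing roughly every 15 minutes, which is consistent with a staged repool. The log does not say what weight each step applies, so the sketch only does the timestamp arithmetic.

```python
from datetime import datetime

# IRC timestamps of the dbctl commits for db1144:3314, copied from the log.
EVENTS = [
    ("depool", "16:55:22"),
    ("repool 1", "16:57:32"),
    ("repool 2", "17:12:39"),
    ("repool 3", "17:27:45"),
    ("repool 4", "17:42:52"),
]


def gaps_in_seconds(events):
    """Return (label, seconds since the previous event) for every event after the first."""
    times = [datetime.strptime(ts, "%H:%M:%S") for _, ts in events]
    return [
        (events[i][0], int((times[i] - times[i - 1]).total_seconds()))
        for i in range(1, len(events))
    ]


for label, gap in gaps_in_seconds(EVENTS):
    print(f"{label}: +{gap}s")
# repool 1: +130s
# repool 2: +907s
# repool 3: +906s
# repool 4: +907s
```

The same pattern repeats for db1146:3314, db1190, db1199 and db1221 later in the log.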
[17:42:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:43:08] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[17:43:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:43:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:43:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:43:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55547 and previous config saved to /var/cache/conftool/dbconfig/20240124-174332-marostegui.json
[17:44:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55548 and previous config saved to /var/cache/conftool/dbconfig/20240124-174442-marostegui.json
[17:44:51] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) Here you go, that has one directory per host, with one file per drive with the total increase of errors in...
[17:46:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[17:46:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[17:47:04] <wikibugs>	 (CR) Klausman: [C: +2] ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[17:50:02] <wikibugs>	 (Merged) jenkins-bot: ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[17:50:41] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[17:50:48] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[17:58:10] <wikibugs>	 SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney)
[17:59:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P55549 and previous config saved to /var/cache/conftool/dbconfig/20240124-175948-marostegui.json
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1800)
[18:02:09] <wikibugs>	 (PS1) Volans: Upstream release v0.3.4 [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/992788
[18:03:32] <sukhe>	 volans: quick question since I see you are around: for a long-running cumin command affecting multiple hosts, what's the best way to somehow see which host is currently being affected?
[18:03:46] <wikibugs>	 SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) @cmooney will you issue a downtime before the maintenance for each host?
[18:04:16] <volans>	 sukhe: already launched or to be launched?
[18:04:51] <sukhe>	 in this case, already launched but I do want to know for "to be launched" :)
[18:06:22] <volans>	 sukhe: in that case launch it with  -d, --debug and then tail /var/log/cumin/cumin.log
[18:07:51] * volans double checking as I'm going by memory
[18:08:11] <sukhe>	 hmm ok, that works. would you consider this as a feature request? sometimes it's helpful to know where the progress is
[18:08:25] <sukhe>	 I mean I can do some mental math given how many hosts it's affecting and what the current # is, but yeah
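The "mental math" for a batched run like the earlier `cumin -b1 -s60` can be written down. A rough sketch, assuming that each batch takes about the same time and that the `-s` sleep is inserted between consecutive batches but not after the last one; the function name is hypothetical, and the host count and per-run duration in the example are made-up numbers, not taken from the log.

```python
import math


def estimate_remaining(total_hosts: int, done_hosts: int, batch_size: int,
                       batch_sleep_s: float, per_batch_s: float) -> float:
    """Rough seconds left in a batched run, measured at the start of a batch.

    Assumptions: every batch takes about `per_batch_s`, and the batch sleep
    is inserted between consecutive batches but not after the last one.
    """
    remaining_batches = math.ceil((total_hosts - done_hosts) / batch_size)
    if remaining_batches <= 0:
        return 0
    return remaining_batches * per_batch_s + (remaining_batches - 1) * batch_sleep_s


# Hypothetical numbers: 20 hosts, 5 done, -b1 -s60, ~30 s per Puppet run.
print(estimate_remaining(20, 5, 1, 60, 30))  # 1290 (about 21.5 minutes)
```

With `-b1` the sleep dominates once the per-host run is shorter than 60 seconds, which is why a long rollout is hard to track by eye.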
[18:08:37] <logmsgbot>	 !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@fed6de3]: (no justification provided)
[18:08:52] <volans>	 but how would that show in the UI? as part of the progress bar?
[18:09:10] <logmsgbot>	 !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@fed6de3]: (no justification provided) (duration: 00m 32s)
[18:09:16] <sukhe>	 something like that yeah. I am even happy with some more verbose output
[18:09:38] <volans>	 (the progress bar re-writing the screen has already created too many issues in the past :D but I guess we could inject the hostname in there if asked)
[18:09:44] <volans>	 (or by default)
[18:09:54] <sukhe>	 I can file a task on why/how this came up too for some context
[18:10:00] <volans>	 the verbose output would mess with the aggregation of output though
[18:10:03] <volans>	 sure
[18:10:08] <volans>	 that would be great, thanks
[18:10:19] <rzl>	 (or whenever `-b 1` with multiple hosts)
[18:10:39] <volans>	 there is a cumin tag in phab
[18:11:12] <volans>	 rzl: yep that's the other angle I was thinking about, if no batch or large batches are used, there are multiple hosts in parallel and they will scroll very rapidly so it would be a mess anyway
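The two points in this exchange, injecting the hostname into the progress bar and the mess when many hosts run in parallel, can be reconciled by naming hosts only when few are in flight. A hypothetical rendering sketch, not cumin's code; the function name, bar style and `max_names` cut-off are assumptions, and the total of 12 hosts in the example is made up.

```python
from typing import Sequence


def progress_line(done: int, total: int, in_flight: Sequence[str],
                  width: int = 20, max_names: int = 3) -> str:
    """One-line progress bar that names the hosts currently being worked on.

    With -b 1 only one host is in flight, so naming it is cheap and useful.
    With no batching or large batches the names would change on every
    refresh, so above `max_names` only a count is shown.
    """
    filled = (done * width) // total if total else width
    bar = "#" * filled + "-" * (width - filled)
    if not in_flight:
        current = "idle"
    elif len(in_flight) <= max_names:
        current = ", ".join(in_flight)
    else:
        current = f"{len(in_flight)} hosts"
    return f"[{bar}] {done}/{total} running: {current}"


print(progress_line(3, 12, ["durum1001.eqiad.wmnet"]))
# [#####---------------] 3/12 running: durum1001.eqiad.wmnet
```

Keeping it to a single rewritten line avoids interleaving with the aggregated command output, the other concern raised above.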
[18:11:15] <sukhe>	 thanks! we can discuss it there, I just realized I spammed this channel
[18:11:33] <volans>	 better you than icinga-wm and jinxer-wm  :D
[18:13:36] <wikibugs>	 (CR) Volans: "Tested on build2002, lintian could be improved a bit more ideally:" [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/992788 (owner: Volans)
[18:14:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P55550 and previous config saved to /var/cache/conftool/dbconfig/20240124-181455-marostegui.json
[18:16:28] <volans>	 sukhe: for completeness, if you're running the same via a cookbook, the debug logs are always available in the -extended.log files
[18:16:59] <sukhe>	 volans: ok thanks. but just cumin directly in this case
[18:17:04] <sukhe>	 writing that task now, you can read it tomorrow
[18:17:16] <volans>	 <3
[18:23:58] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2017-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[18:24:03] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[18:25:40] <wikibugs>	 SRE, Cumin, Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811 (ssingh)
[18:25:49] <volans>	 sukhe: thanks for the task, pro-tip run-puppet-agent accepts a -e --enable MSG argument ;)
[18:26:11] <sukhe>	 haha, you have told me this before but that breaks my mental model :P
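Both pro-tips from this thread can be folded into sukhe's original command: `C:bird` in place of `R:Class = Bird`, and `run-puppet-agent --enable MSG` in place of the separate `enable-puppet 'CR991699' &&` step, per volans. A small sketch that only assembles the command string and runs nothing; the helper name is hypothetical, and the flag semantics are exactly as stated in the chat, nothing more.

```python
import shlex


def puppet_rollout_cmd(selector: str, reason: str,
                       batch_size: int = 1, batch_sleep_s: int = 60) -> str:
    """Build a cumin command line that re-enables and runs Puppet host by host.

    The remote command is a single cumin argument, so it is quoted once as
    a whole; the reason string is quoted inside it as well.
    """
    remote = f"run-puppet-agent --enable {shlex.quote(reason)}"
    argv = ["sudo", "cumin", f"-b{batch_size}", f"-s{batch_sleep_s}", selector, remote]
    return shlex.join(argv)


print(puppet_rollout_cmd("C:bird", "CR991699"))
# sudo cumin -b1 -s60 C:bird 'run-puppet-agent --enable CR991699'
```

Building the string with `shlex` rather than by hand keeps the nested quoting correct if the reason ever contains spaces.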
[18:30:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55551 and previous config saved to /var/cache/conftool/dbconfig/20240124-183001-marostegui.json
[18:30:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[18:30:08] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:30:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[18:30:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[18:30:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[18:30:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[18:30:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[18:31:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55552 and previous config saved to /var/cache/conftool/dbconfig/20240124-183059-marostegui.json
[18:33:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55553 and previous config saved to /var/cache/conftool/dbconfig/20240124-183308-marostegui.json
[18:48:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P55554 and previous config saved to /var/cache/conftool/dbconfig/20240124-184815-marostegui.json
[18:48:35] <wikibugs>	 (PS2) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[18:48:53] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[18:50:51] <wikibugs>	 (PS3) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[18:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[18:51:59] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[18:56:58] <wikibugs>	 (PS4) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[19:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:03:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P55555 and previous config saved to /var/cache/conftool/dbconfig/20240124-190322-marostegui.json
[19:09:27] <wikibugs>	 (PS5) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[19:10:33] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[19:12:04] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/992801 (https://phabricator.wikimedia.org/T355066)
[19:13:14] <wikibugs>	 (CR) CDanis: Add SameSite=Strict attribute to NetworkProbeLimit cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) (owner: Ayounsi)
[19:13:21] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2017-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[19:13:36] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[19:16:11] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2022-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
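The Cassandra roll-restart was started on restbase[2015-2035].codfw.wmnet, ended with exit code 99, was relaunched on restbase[2017-2035], failed again, and was then relaunched on restbase[2022-2035]. The bracket notation is the ClusterShell-style NodeSet syntax that cumin host lists use (also seen in the clouddb[1015,1019,1021] downtime later on). The sketch below is a simplified, hypothetical expander, not ClusterShell's implementation: it handles one bracket group per host pattern, and it counts candidate names rather than hosts that actually exist.

```python
import re

# One bracket group per host pattern: prefix[ranges]suffix
_GROUP = re.compile(r"^(?P<pre>[^\[]*)\[(?P<body>[^\]]+)\](?P<post>.*)$")


def _split_top_level(pattern: str) -> list:
    """Split on commas that are not inside [...]."""
    parts, current, depth = [], [], 0
    for ch in pattern:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def expand_nodeset(pattern: str) -> list:
    """Expand 'restbase[2015-2035].codfw.wmnet'-style patterns into names.

    Simplified: a single [...] group per comma-separated pattern, items are
    either N or N-M, and zero padding follows the width of the start value.
    """
    hosts = []
    for part in _split_top_level(pattern):
        m = _GROUP.match(part)
        if m is None:  # plain host name, no brackets
            hosts.append(part)
            continue
        for item in m.group("body").split(","):
            start, _, end = item.partition("-")
            width = len(start)
            for n in range(int(start), int(end or start) + 1):
                hosts.append(f"{m.group('pre')}{n:0{width}d}{m.group('post')}")
    return hosts


for p in ("restbase[2015-2035].codfw.wmnet",
          "restbase[2017-2035].codfw.wmnet",
          "restbase[2022-2035].codfw.wmnet"):
    print(p, len(expand_nodeset(p)))
# restbase[2015-2035].codfw.wmnet 21
# restbase[2017-2035].codfw.wmnet 19
# restbase[2022-2035].codfw.wmnet 14
```

Each relaunch starts later in the range, so the candidate set shrinks from 21 names to 14.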
[19:17:17] <wikibugs>	 (PS6) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[19:18:26] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[19:18:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55557 and previous config saved to /var/cache/conftool/dbconfig/20240124-191828-marostegui.json
[19:18:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[19:18:43] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[19:18:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[19:18:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55558 and previous config saved to /var/cache/conftool/dbconfig/20240124-191850-marostegui.json
[19:20:35] <wikibugs>	 (PS7) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[19:21:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55559 and previous config saved to /var/cache/conftool/dbconfig/20240124-192100-marostegui.json
[19:21:45] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[19:22:10] <wikibugs>	 (CR) CDanis: [C: +2] Fix various pylint warnings [software/conftool] - https://gerrit.wikimedia.org/r/992105 (owner: Clément Goubert)
[19:22:20] <wikibugs>	 (CR) CDanis: [C: +2] Raise yaml_log_error logging level to error [software/conftool] - https://gerrit.wikimedia.org/r/992104 (https://phabricator.wikimedia.org/T355256) (owner: Clément Goubert)
[19:23:04] <wikibugs>	 (PS8) Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577
[19:23:55] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:24:06] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:24:20] <wikibugs>	 SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul) Today's work is complete. The only node left to relocation is gitlab2002. Service ops will get back with us with a day for sometimes next week. All old ports in netbox and on a...
[19:24:35] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: (WIP) warn about large user home dirs [puppet] - https://gerrit.wikimedia.org/r/989577 (owner: Dzahn)
[19:25:27] <wikibugs>	 (Merged) jenkins-bot: Raise yaml_log_error logging level to error [software/conftool] - https://gerrit.wikimedia.org/r/992104 (https://phabricator.wikimedia.org/T355256) (owner: Clément Goubert)
[19:25:31] <wikibugs>	 (Merged) jenkins-bot: Fix various pylint warnings [software/conftool] - https://gerrit.wikimedia.org/r/992105 (owner: Clément Goubert)
[19:26:56] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066)
[19:28:51] <wikibugs>	 (Abandoned) Ebernhardson: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/992801 (https://phabricator.wikimedia.org/T355066) (owner: Ebernhardson)
[19:29:59] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066) (owner: Ebernhardson)
[19:30:49] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066) (owner: Ebernhardson)
[19:31:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:33:55] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:34:04] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:34:53] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:35:01] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:36:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P55560 and previous config saved to /var/cache/conftool/dbconfig/20240124-193606-marostegui.json
[19:37:42] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Align consumer-devnull with deployment [deployment-charts] - https://gerrit.wikimedia.org/r/992806
[19:38:48] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:39:00] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:51:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P55561 and previous config saved to /var/cache/conftool/dbconfig/20240124-195113-marostegui.json
[19:52:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu) There seems to be a problem with my developer account as well. I created my developer account through the [[ https://idm.wikimedia.org/signup/ | IDM signup page ]] last week, but I have...
[20:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[20:06:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55562 and previous config saved to /var/cache/conftool/dbconfig/20240124-200619-marostegui.json
[20:06:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[20:06:30] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[20:06:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[20:06:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[20:06:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[20:06:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55563 and previous config saved to /var/cache/conftool/dbconfig/20240124-200659-marostegui.json
[20:08:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55564 and previous config saved to /var/cache/conftool/dbconfig/20240124-200808-marostegui.json
[20:23:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P55565 and previous config saved to /var/cache/conftool/dbconfig/20240124-202315-marostegui.json
[20:26:46] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=scowiki --logwiki=metawiki 'TheBabushka' 'AshotGPT' # T355743
[20:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:51] <stashbot>	 T355743: Unblock stuck global rename of AshotGPT - https://phabricator.wikimedia.org/T355743
[20:29:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu) An update on my attempts to figure out my developer/wikitech account creation issue:  - When I go to the [[ https://idp.wikimedia.org/login#divAttributes | IDP login page ]], I get this...
[20:35:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu) For more context, if I go to the [[ https://idm.wikimedia.org/wikimedia/login/ | IDM login page ]] and click on the "Wikimedia Developer Single Sign On" button, I get this: {F41714657}
[20:37:37] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@2f514fc]: (no justification provided)
[20:38:10] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@2f514fc]: (no justification provided) (duration: 00m 33s)
[20:38:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P55566 and previous config saved to /var/cache/conftool/dbconfig/20240124-203821-marostegui.json
[20:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[20:53:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55567 and previous config saved to /var/cache/conftool/dbconfig/20240124-205327-marostegui.json
[20:53:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[20:53:33] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[20:53:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[20:53:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55568 and previous config saved to /var/cache/conftool/dbconfig/20240124-205350-marostegui.json
[20:56:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55569 and previous config saved to /var/cache/conftool/dbconfig/20240124-205600-marostegui.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2100). Please do the needful.
[21:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:02:57] <wikibugs>	 SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (hashar) + @jnuche from release engineering who knows even more about Jenkins than me :-)  `contint2002` hosts...
[21:05:10] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@5a0681b]: Regular analytics weekly train [airflow-dags/analytics@5a0681bc]
[21:05:47] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@5a0681b]: Regular analytics weekly train [airflow-dags/analytics@5a0681bc] (duration: 00m 37s)
[21:11:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P55570 and previous config saved to /var/cache/conftool/dbconfig/20240124-211107-marostegui.json
[21:12:28] <wikibugs>	 (PS1) Gmodena: eventstreams: redactions with underscores in title [deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456)
[21:25:10] <wikibugs>	 (CR) Htriedman: [C: +1] eventstreams: redactions with underscores in title [deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) (owner: Gmodena)
[21:26:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P55571 and previous config saved to /var/cache/conftool/dbconfig/20240124-212613-marostegui.json
[21:28:16] <wikibugs>	 (CR) Cathal Mooney: "Yeah this is really weird.  I'd a bit of a look and can't see why that part of the code is getting executed when the check fails." [software/netbox-extras] - https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: Ayounsi)
[21:38:51] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[21:41:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55572 and previous config saved to /var/cache/conftool/dbconfig/20240124-214120-marostegui.json
[21:41:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[21:41:25] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[21:41:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[21:41:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55573 and previous config saved to /var/cache/conftool/dbconfig/20240124-214141-marostegui.json
[21:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[21:43:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55574 and previous config saved to /var/cache/conftool/dbconfig/20240124-214351-marostegui.json
[21:45:02] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[2022-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002
[21:45:08] <stashbot>	 T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719
[21:58:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P55575 and previous config saved to /var/cache/conftool/dbconfig/20240124-215857-marostegui.json
[22:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2200)
[22:10:53] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2103.codfw.wmnet with OS bullseye
[22:11:05] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2104.codfw.wmnet with OS bullseye
[22:11:27] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2106.codfw.wmnet with OS bullseye
[22:11:30] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2105.codfw.wmnet with OS bullseye
[22:14:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P55576 and previous config saved to /var/cache/conftool/dbconfig/20240124-221403-marostegui.json
[22:28:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage
[22:28:53] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: promote new hosts to master-eligible [puppet] - https://gerrit.wikimedia.org/r/992538 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[22:29:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55577 and previous config saved to /var/cache/conftool/dbconfig/20240124-222910-marostegui.json
[22:29:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[22:29:15] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[22:29:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[22:29:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55578 and previous config saved to /var/cache/conftool/dbconfig/20240124-222932-marostegui.json
[22:31:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55579 and previous config saved to /var/cache/conftool/dbconfig/20240124-223142-marostegui.json
[22:33:04] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage
[22:34:31] <wikibugs>	 (PS1) Jforrester: Revert "Update <p> spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" [skins/Vector] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805)
[22:37:00] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 344 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1180, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 330, delayed_unassigned_shards: 0, number_of_pending_tas
[22:37:00] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3887, active_shards_percent_as_number: 77.42782152230971 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:08] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 429 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1102, relocating_shards: 0, initializing_shards: 56, unassigned_shards: 373, delayed_unassigned_shards: 0, number_of_pending_tas
[22:37:08] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 71.97909862834749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:08] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 373 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1225, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 364, delayed_unassigned_shards: 0, number_of_pending_ta
[22:37:08] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 13245, active_shards_percent_as_number: 76.65832290362954 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:14] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 339 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1185, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 325, delayed_unassigned_shards: 0, number_of_pending_tas
[22:37:14] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 19022, active_shards_percent_as_number: 77.75590551181102 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:16] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:16] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:24] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:26] <inflatador>	 ^^ we're working on this
[22:37:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 422 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1109, relocating_shards: 0, initializing_shards: 55, unassigned_shards: 367, delayed_unassigned_shards: 0, number_of_pending_tas
[22:37:40] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.43631613324625 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 422 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1109, relocating_shards: 0, initializing_shards: 55, unassigned_shards: 367, delayed_unassigned_shards: 0, number_of_pending_tas
[22:37:40] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.43631613324625 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:37:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 337 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1261, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 329, delayed_unassigned_shards: 0, number_of_pending_ta
[22:37:40] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 23547, active_shards_percent_as_number: 78.91113892365456 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:38:00] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1240, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 276, delayed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_fli
[22:38:00] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 1852, active_shards_percent_as_number: 81.36482939632546 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:38:16] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1249, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 267, delayed_unassigned_shards: 0, number_of_pending_tasks: 11, number_of_in_fli
[22:38:16] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 10555, active_shards_percent_as_number: 81.95538057742782 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:38:42] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1294, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 290, delayed_unassigned_shards: 0, number_of_pending_tasks: 19, number_of_i
[22:38:42] <icinga-wm>	 _fetch: 0, task_max_waiting_in_queue_millis: 10224, active_shards_percent_as_number: 80.97622027534418 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:38:50] <wikibugs>	 (CR) Jforrester: "Dose this need deploying?" [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: Kosta Harlan)
[22:39:08] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1320, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 274, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_
[22:39:08] <icinga-wm>	 etch: 0, task_max_waiting_in_queue_millis: 28697, active_shards_percent_as_number: 82.60325406758447 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:34] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: cloduelastic maintenance
[22:39:51] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: cloduelastic maintenance
[22:41:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:46:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55580 and previous config saved to /var/cache/conftool/dbconfig/20240124-224648-marostegui.json
[22:47:30] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1228, relocating_shards: 0, initializing_shards: 44, unassigned_shards: 259, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli
[22:47:30] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.20901371652515 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:47:32] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1228, relocating_shards: 0, initializing_shards: 44, unassigned_shards: 259, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli
[22:47:32] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.20901371652515 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:48:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1233, relocating_shards: 0, initializing_shards: 43, unassigned_shards: 255, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli
[22:48:06] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.5355976485957 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:50:39] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2106.codfw.wmnet with OS bullseye
[22:55:48] <James_F>	 jouncebot: nowandnext
[22:55:48] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2200)
[22:55:48] <jouncebot>	 In 8 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)
[22:55:48] <jouncebot>	 In 8 hour(s) and 4 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)
[22:56:44] <wikibugs>	 (PS1) Ryan Kemper: cloudelastic: add old masters back [puppet] - https://gerrit.wikimedia.org/r/992826
[22:56:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805) (owner: Jforrester)
[22:56:56] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: add old masters back [puppet] - https://gerrit.wikimedia.org/r/992826 (owner: Ryan Kemper)
[22:57:07] <wikibugs>	 (CR) Bking: [V: +2 C: +2] cloudelastic: add old masters back [puppet] - https://gerrit.wikimedia.org/r/992826 (owner: Ryan Kemper)
[23:01:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 697, active_shards: 1339, relocating_shards: 0, initializing_shards: 23, unassigned_shards: 169, delayed_unassigned_shards: 0, number_of_pending_tasks: 16, number_of_in_
[23:01:06] <icinga-wm>	 etch: 0, task_max_waiting_in_queue_millis: 52817, active_shards_percent_as_number: 87.45917700849118 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:10] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 733, active_shards: 1468, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 130, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_i
[23:01:10] <icinga-wm>	 _fetch: 0, task_max_waiting_in_queue_millis: 7119, active_shards_percent_as_number: 91.8648310387985 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:20] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 732, active_shards: 1404, relocating_shards: 0, initializing_shards: 34, unassigned_shards: 93, delayed_unassigned_shards: 0, number_of_pending_tasks: 13, number_of_in_f
[23:01:20] <icinga-wm>	 tch: 0, task_max_waiting_in_queue_millis: 40933, active_shards_percent_as_number: 91.70476812540824 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55581 and previous config saved to /var/cache/conftool/dbconfig/20240124-230155-marostegui.json
[23:04:34] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2103.codfw.wmnet with OS bullseye
[23:06:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.762% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:09:34] * James_F wonders if he can get away with a quick sleep whilst waiting for gerrit to merge the patch. ;-(
[23:14:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] disable_tool: remove the archive_db stage from the cron host [puppet] - https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) (owner: Andrew Bogott)
[23:14:08] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:14:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:17:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55582 and previous config saved to /var/cache/conftool/dbconfig/20240124-231701-marostegui.json
[23:17:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:17:07] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[23:17:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:17:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T354336)', diff saved to https://phabricator.wikimedia.org/P55583 and previous config saved to /var/cache/conftool/dbconfig/20240124-231723-marostegui.json
[23:17:49] <wikibugs>	 (PS2) Andrew Bogott: wikireplicas maintain-meta_p: don't store cursor in schema class [puppet] - https://gerrit.wikimedia.org/r/984626
[23:18:59] <wikibugs>	 (Merged) jenkins-bot: Revert "Update <p> spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" [skins/Vector] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805) (owner: Jforrester)
[23:19:29] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:992775|Revert "Update <p> spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]]
[23:19:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T354336)', diff saved to https://phabricator.wikimedia.org/P55584 and previous config saved to /var/cache/conftool/dbconfig/20240124-231933-marostegui.json
[23:19:35] <stashbot>	 T355805: Syntax highlighting in 2017 wikitext editor has extreme vertical cursor displacement - https://phabricator.wikimedia.org/T355805
[23:19:36] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[23:21:01] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:992775|Revert "Update <p> spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:23:20] <James_F>	 Kemayo: Everything look OK to you?
[23:25:39] <James_F>	 https://test.wikipedia.org/w/index.php?title=Foo&veaction=editsource&debug=1 is taking rather a while to load, sigh.
[23:26:05] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[23:32:06] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2104.codfw.wmnet with OS bullseye
[23:32:18] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:32:25] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2105.codfw.wmnet with OS bullseye
[23:32:59] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:992775|Revert "Update <p> spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]] (duration: 13m 29s)
[23:33:07] <stashbot>	 T355805: Syntax highlighting in 2017 wikitext editor has extreme vertical cursor displacement - https://phabricator.wikimedia.org/T355805
[23:33:07] <stashbot>	 T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[23:33:12] <James_F>	 Finally.
[23:34:29] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2103.codfw.wmnet with OS bullseye
[23:34:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P55585 and previous config saved to /var/cache/conftool/dbconfig/20240124-233439-marostegui.json
[23:36:00] <Kemayo>	 James_F: Yeah, that took forever to load. It does seem to work fine -- the bug I know about is gone, anyway.
[23:39:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (Scott_French)
[23:41:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (Scott_French) Open→In progress p:Triage→Medium
[23:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[23:43:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (RLazarus)
[23:44:45] <wikibugs>	 (PS1) Scott French: admin: add new SSH pubkey for swfrench [puppet] - https://gerrit.wikimedia.org/r/992829 (https://phabricator.wikimedia.org/T355834)
[23:49:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P55586 and previous config saved to /var/cache/conftool/dbconfig/20240124-234946-marostegui.json
[23:51:15] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2103.codfw.wmnet with reason: host reimage
[23:51:57] <wikibugs>	 (CR) RLazarus: [C: +2] admin: add new SSH pubkey for swfrench [puppet] - https://gerrit.wikimedia.org/r/992829 (https://phabricator.wikimedia.org/T355834) (owner: Scott French)
[23:54:46] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2103.codfw.wmnet with reason: host reimage
[23:56:36] <wikibugs>	 (PS1) Zabe: Start reading from af_user(_text)/afh_user(_text) in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/992830 (https://phabricator.wikimedia.org/T355616)
[23:59:45] <zabe>	 jouncebot: nowandnext
[23:59:45] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 0 minute(s)
[23:59:45] <jouncebot>	 In 7 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)
[23:59:45] <jouncebot>	 In 7 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)