[00:00:35] RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T354336)', diff saved to https://phabricator.wikimedia.org/P55433 and previous config saved to /var/cache/conftool/dbconfig/20240124-000802-marostegui.json
[00:08:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[00:08:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[00:08:19] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:08:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55434 and previous config saved to /var/cache/conftool/dbconfig/20240124-000824-marostegui.json
[00:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55435 and previous config saved to /var/cache/conftool/dbconfig/20240124-001044-marostegui.json
[00:25:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P55436 and previous config saved to /var/cache/conftool/dbconfig/20240124-002551-marostegui.json
[00:36:53] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[00:39:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440
[00:39:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440 (owner: TrainBranchBot)
[00:40:41] (CR) Cwhite: [C: +2] httpd: ErrorLogFormat for ECS [puppet] - https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: Hashar)
[00:40:51] (PS4) Cwhite: httpd: ErrorLogFormat for ECS [puppet] - https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: Hashar)
[00:40:59] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:40:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P55437 and previous config saved to /var/cache/conftool/dbconfig/20240124-004058-marostegui.json
[00:41:11] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[00:54:37] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:54:49] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[00:56:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354336)', diff saved to https://phabricator.wikimedia.org/P55438 and previous config saved to /var/cache/conftool/dbconfig/20240124-005605-marostegui.json
[00:56:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:56:11] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:56:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:56:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55439 and previous config saved to /var/cache/conftool/dbconfig/20240124-005627-marostegui.json
[00:56:35] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[00:58:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55440 and previous config saved to /var/cache/conftool/dbconfig/20240124-005849-marostegui.json
[01:01:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992440 (owner: TrainBranchBot)
[01:10:35] PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 113.77, 103.15, 87.07 https://wikitech.wikimedia.org/wiki/Swift
[01:13:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P55441 and previous config saved to /var/cache/conftool/dbconfig/20240124-011355-marostegui.json
[01:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:15] RECOVERY - very high load average likely xfs on ms-be2075 is OK: OK - load average: 59.22, 72.02, 78.74 https://wikitech.wikimedia.org/wiki/Swift
[01:29:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P55442 and previous config saved to /var/cache/conftool/dbconfig/20240124-012902-marostegui.json
[01:35:10] SRE, serviceops: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (Mstyles)
[01:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354336)', diff saved to https://phabricator.wikimedia.org/P55443 and previous config
saved to /var/cache/conftool/dbconfig/20240124-014408-marostegui.json
[01:44:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[01:44:21] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[01:44:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[01:44:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55444 and previous config saved to /var/cache/conftool/dbconfig/20240124-014430-marostegui.json
[01:46:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55445 and previous config saved to /var/cache/conftool/dbconfig/20240124-014651-marostegui.json
[02:01:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P55447 and previous config saved to /var/cache/conftool/dbconfig/20240124-020157-marostegui.json
[02:02:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51307 bytes in 7.849 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:08:50] (CR) Ssingh: "Please feel free to take 928 that is reserved for authdns (but we
haven't used it anywhere so far, so all good)." [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[02:17:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P55448 and previous config saved to /var/cache/conftool/dbconfig/20240124-021704-marostegui.json
[02:32:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354336)', diff saved to https://phabricator.wikimedia.org/P55449 and previous config saved to /var/cache/conftool/dbconfig/20240124-023210-marostegui.json
[02:32:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:32:16] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[02:32:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:39:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:14:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:20:18] (CR) Andrew Bogott: [C: +2] nova policy: add awareness of 'unmanaged' role [puppet] -
https://gerrit.wikimedia.org/r/992543 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[03:24:53] (PS8) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[03:27:32] (PS9) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[03:34:48] (CR) Andrea Denisse: "I've clarified the situation with Sukhbir." [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[03:35:23] (CR) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[03:45:13] (PS1) Andrea Denisse: authdns: Add entry for the 'authdns' GID [puppet] - https://gerrit.wikimedia.org/r/992550
[03:47:09] (CR) Andrea Denisse: "Hi, this patch is related to the issue discussed in 990795."
[puppet] - https://gerrit.wikimedia.org/r/992550 (owner: Andrea Denisse)
[05:45:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:45:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:47:00] (PS1) Marostegui: Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992511
[05:48:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:48:37] (CR) Marostegui: [C: +2] Revert "db2175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992511 (owner: Marostegui)
[05:48:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:49:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[05:49:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55450 and previous config saved to /var/cache/conftool/dbconfig/20240124-054924-root.json
[05:49:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[05:49:31] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[05:49:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2112 (T354336)', diff saved to https://phabricator.wikimedia.org/P55451 and previous config saved to /var/cache/conftool/dbconfig/20240124-054932-marostegui.json
[05:49:38] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[05:51:05] (PS1) Marostegui: mariadb: Disable notifications on A1 hosts
[puppet] - https://gerrit.wikimedia.org/r/992555 (https://phabricator.wikimedia.org/T355437)
[05:51:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158 db2157 es2026 db2136 T355437', diff saved to https://phabricator.wikimedia.org/P55452 and previous config saved to /var/cache/conftool/dbconfig/20240124-055143-marostegui.json
[05:51:49] T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437
[05:51:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T354336)', diff saved to https://phabricator.wikimedia.org/P55453 and previous config saved to /var/cache/conftool/dbconfig/20240124-055157-marostegui.json
[05:52:15] (Abandoned) Ammarpad: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: Ammarpad)
[05:52:18] (CR) Marostegui: [C: +2] mariadb: Disable notifications on A1 hosts [puppet] - https://gerrit.wikimedia.org/r/992555 (https://phabricator.wikimedia.org/T355437) (owner: Marostegui)
[05:56:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T354506', diff saved to https://phabricator.wikimedia.org/P55454 and previous config saved to /var/cache/conftool/dbconfig/20240124-055635-marostegui.json
[05:56:40] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[05:57:08] (PS1) Marostegui: db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992556 (https://phabricator.wikimedia.org/T354506)
[05:58:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2129.codfw.wmnet with OS bookworm
[05:58:24] (CR) Marostegui: [C: +2] db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992556 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:04:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repool
db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55455 and previous config saved to /var/cache/conftool/dbconfig/20240124-060429-root.json
[06:04:38] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:07:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P55456 and previous config saved to /var/cache/conftool/dbconfig/20240124-060703-marostegui.json
[06:15:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: host reimage
[06:18:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2129.codfw.wmnet with reason: host reimage
[06:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55457 and previous config saved to /var/cache/conftool/dbconfig/20240124-061934-root.json
[06:19:40] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:22:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P55458 and previous config saved to /var/cache/conftool/dbconfig/20240124-062210-marostegui.json
[06:24:40] (PS1) Marostegui: Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992512
[06:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55459 and previous config saved to /var/cache/conftool/dbconfig/20240124-063440-root.json
[06:34:45] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:37:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T354336)', diff saved to
https://phabricator.wikimedia.org/P55460 and previous config saved to /var/cache/conftool/dbconfig/20240124-063717-marostegui.json
[06:37:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:37:22] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:37:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:37:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55461 and previous config saved to /var/cache/conftool/dbconfig/20240124-063739-marostegui.json
[06:38:48] (CR) Marostegui: [C: +2] Revert "db2129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992512 (owner: Marostegui)
[06:40:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55462 and previous config saved to /var/cache/conftool/dbconfig/20240124-064003-marostegui.json
[06:40:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55463 and previous config saved to /var/cache/conftool/dbconfig/20240124-064020-root.json
[06:40:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2129.codfw.wmnet with OS bookworm
[06:47:12] (PS1) Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/992442 (https://phabricator.wikimedia.org/T355739)
[06:47:16] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/992443 (https://phabricator.wikimedia.org/T355739)
[06:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175
(re)pooling @ 50%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55464 and previous config saved to /var/cache/conftool/dbconfig/20240124-064944-root.json
[06:49:50] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[06:55:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P55465 and previous config saved to /var/cache/conftool/dbconfig/20240124-065510-marostegui.json
[06:55:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55466 and previous config saved to /var/cache/conftool/dbconfig/20240124-065525-root.json
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0700)
[07:04:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55467 and previous config saved to /var/cache/conftool/dbconfig/20240124-070449-root.json
[07:04:55] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[07:10:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P55468 and previous config saved to /var/cache/conftool/dbconfig/20240124-071016-marostegui.json
[07:10:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55469 and previous config saved to /var/cache/conftool/dbconfig/20240124-071030-root.json
[07:19:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repool db2175 after a crash T355489', diff saved to https://phabricator.wikimedia.org/P55470 and previous config saved to
/var/cache/conftool/dbconfig/20240124-071954-root.json
[07:20:00] T355489: db2175 replication lag - https://phabricator.wikimedia.org/T355489
[07:25:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T354336)', diff saved to https://phabricator.wikimedia.org/P55471 and previous config saved to /var/cache/conftool/dbconfig/20240124-072523-marostegui.json
[07:25:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[07:25:29] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55472 and previous config saved to /var/cache/conftool/dbconfig/20240124-072535-root.json
[07:25:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance
[07:25:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55473 and previous config saved to /var/cache/conftool/dbconfig/20240124-072557-marostegui.json
[07:28:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55474 and previous config saved to /var/cache/conftool/dbconfig/20240124-072821-marostegui.json
[07:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55475 and previous config saved to /var/cache/conftool/dbconfig/20240124-074040-root.json
[07:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P55476 and previous config saved to
/var/cache/conftool/dbconfig/20240124-074327-marostegui.json
[07:45:55] (CR) Slyngshede: P:debmonitor::server rework debmonitor http monitoring. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:55:36] (PS1) Mxmxchere: etcd 3.4: Fix ETCD_CLIENT_CERT_AUTH=false [puppet] - https://gerrit.wikimedia.org/r/992629
[07:55:40] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/992629 (owner: Mxmxchere)
[07:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55477 and previous config saved to /var/cache/conftool/dbconfig/20240124-075545-root.json
[07:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P55478 and previous config saved to /var/cache/conftool/dbconfig/20240124-075834-marostegui.json
[08:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0800).
[08:00:04] WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:26] \o
[08:01:43] I can self serve
[08:04:19] (CR) TrainBranchBot: [C: +2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:05:05] (Merged) jenkins-bot: Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:05:56] !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]]
[08:06:01] T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:06:17] (CR) Awight: [C: +1] Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[08:07:43] !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:08:04] * WMDE-Fisch testing
[08:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55479 and previous config saved to /var/cache/conftool/dbconfig/20240124-081050-root.json
[08:12:38] good morning
[08:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T354336)', diff saved to https://phabricator.wikimedia.org/P55480 and previous config saved to /var/cache/conftool/dbconfig/20240124-081340-marostegui.json
[08:13:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:13:46] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log -
https://phabricator.wikimedia.org/T354336
[08:13:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:14:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[08:14:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance
[08:14:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55481 and previous config saved to /var/cache/conftool/dbconfig/20240124-081445-marostegui.json
[08:17:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55482 and previous config saved to /var/cache/conftool/dbconfig/20240124-081708-marostegui.json
[08:17:18] !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]]
[08:17:23] T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:18:49] !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:18:52] !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync
[08:19:45] (PS1) Hashar: Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680)
[08:20:22] I will deploy that LiquidThreads patch as well
[08:20:30] (CR) Hashar: [C: +2] Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680) (owner: Hashar)
[08:22:54] (Merged) jenkins-bot: Use a class for 'LogActionsHandlers' [extensions/LiquidThreads] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992513 (https://phabricator.wikimedia.org/T355680) (owner: Hashar)
[08:25:50] !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:992411|Allow Cite events for reference previews baseline stats (T353798)]] (duration: 08m 32s)
[08:25:56] T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798
[08:27:45] I'm done here
[08:28:00] excellent
[08:28:06] I am doing the LiquidThreads patch
[08:28:45] !log hashar@deploy2002 Started scap: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]]
[08:28:50] T355680: InvalidArgumentException: Passing a raw callable is not allowed here. Use [ 'factory' => $callable ] instead. - https://phabricator.wikimedia.org/T355680
[08:30:14] !log hashar@deploy2002 hashar: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:30:17] !log hashar@deploy2002 hashar: Continuing with sync
[08:30:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[08:32:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P55483 and previous config saved to /var/cache/conftool/dbconfig/20240124-083215-marostegui.json
[08:34:37] [{reqId}] {exception_url} Error: Class 'GuzzleHttp\Exception\ConnectException' not found
[08:34:39] * hashar whistles
[08:34:47] [{reqId}] {exception_url} PHP Warning: socket_create(): Unable to create socket [24]: Too many open files
[08:34:50] ahh computers...
[08:36:46] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:992513|Use a class for 'LogActionsHandlers' (T355680)]] (duration: 08m 00s) [08:36:51] T355680: InvalidArgumentException: Passing a raw callable is not allowed here. Use [ 'factory' => $callable ] instead. - https://phabricator.wikimedia.org/T355680 [08:41:03] (03CR) 10Phuedx: [C: 03+1] Update Android Metrics Platform stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) (owner: 10Clare Ming) [08:45:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1037.eqiad.wmnet [08:45:54] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10Clement_Goubert) [08:45:58] 10SRE, 10serviceops: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Scap deployments have been running fine following the proxy replacement. Re... [08:46:13] PROBLEM - Check systemd state on ganeti1037 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@eno12399np0.service,networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P55484 and previous config saved to /var/cache/conftool/dbconfig/20240124-084721-marostegui.json [08:54:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [08:56:37] (03CR) 10Slyngshede: [C: 03+2] Changes to Python infrastucture to help building Debian package. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [08:58:52] (03Merged) 10jenkins-bot: Changes to Python infrastucture to help building Debian package. 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [08:59:50] (03CR) 10Slyngshede: [C: 03+2] Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [08:59:51] RECOVERY - Check systemd state on ganeti1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:04] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T0900) [09:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T354336)', diff saved to https://phabricator.wikimedia.org/P55485 and previous config saved to /var/cache/conftool/dbconfig/20240124-090228-marostegui.json [09:02:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:02:34] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:02:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:02:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55486 and previous config saved to /var/cache/conftool/dbconfig/20240124-090250-marostegui.json [09:03:02] (03Merged) 10jenkins-bot: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [09:03:50] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1189/co" [puppet] - 10https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [09:04:12] I will run the train 
in a few [09:04:20] I am in the middle of completing a bug report [09:05:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55487 and previous config saved to /var/cache/conftool/dbconfig/20240124-090512-marostegui.json [09:08:17] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti1037.eqiad.wmnet [09:10:06] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] thanos: add labels to thanos-rule blocks [puppet] - 10https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [09:10:52] (03CR) 10Muehlenhoff: "Ah, thanks for the pointer! I'll update this page to reflect that all allocations should only ever happen in data.yaml. Keeping two data s" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [09:11:22] lets roll forward [09:11:40] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433) [09:11:42] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [09:11:51] (03CR) 10Muehlenhoff: "In addition that page is also terribly wrong, since there's no mention about the difference between local system users and system-wide use" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [09:12:26] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992630 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [09:19:59] (03PS1) 10WMDE-Fisch: Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) [09:20:15] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.15 refs T354433 [09:20:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P55488 and previous config saved to /var/cache/conftool/dbconfig/20240124-092019-marostegui.json [09:20:21] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [09:20:49] (03CR) 10Awight: [C: 03+1] Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [09:23:33] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:24:17] (03CR) 10Muehlenhoff: Bird: move firewall and default neighbor to module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:27:10] !log hashar@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.15 refs T354433 (duration: 06m 55s) [09:27:16] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [09:28:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:28:20] T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 [09:28:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:28:40] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:29:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:29:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:29:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:29:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:30:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:30:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437 [09:31:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [09:32:30] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [09:32:30] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [09:35:26] (03CR) 10Clément Goubert: [C: 03+1] sre: add mw edit failures alert [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [09:35:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P55489 and previous config saved to /var/cache/conftool/dbconfig/20240124-093526-marostegui.json [09:35:32] (03CR) 10Clément Goubert: [C: 03+1] 
graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [09:36:04] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add mw edit failures alert [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [09:36:10] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [09:36:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [09:37:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/992550 (owner: 10Andrea Denisse) [09:38:33] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:41:28] !log ayounsi@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [09:41:29] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [09:49:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [09:49:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: A1 codfw maintenance [09:49:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: A1 codfw maintenance [09:50:33] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2146 (T354336)', diff saved to https://phabricator.wikimedia.org/P55491 and previous config saved to /var/cache/conftool/dbconfig/20240124-095032-marostegui.json [09:50:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [09:50:39] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:50:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [09:50:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55492 and previous config saved to /var/cache/conftool/dbconfig/20240124-095054-marostegui.json [09:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55493 and previous config saved to /var/cache/conftool/dbconfig/20240124-095317-marostegui.json [09:53:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS bullseye [09:53:19] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Marostegui) @papaul @jhancock.wm db2158 db2157 db2136 es2026 are now off and ready to be moved anytime [09:53:48] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Marostegui) [09:53:49] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:25] (03PS14) 10Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 
10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [09:55:04] (03CR) 10CI reject: [V: 04-1] external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:58:08] !log depooling cp3066 - T354424 [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:13] T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [09:59:29] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [09:59:29] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [09:59:30] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [10:00:01] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 282598 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2024-03-24 11:27:11 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:00:12] ^^ that was me, sorry about the noise [10:00:15] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org has 539089 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:00:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org 
has 539080 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:00:38] !log repool cp3066 - T354424 [10:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:42] (03PS15) 10Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [10:08:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P55494 and previous config saved to /var/cache/conftool/dbconfig/20240124-100824-marostegui.json [10:09:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We should allow the current behavior for earlier versions of etcd, maybe, or fix our current configuration before this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/992629 (owner: 10Mxmxchere) [10:10:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [10:11:19] (03CR) 10Ayounsi: "thanks for the feedback !" 
[puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:11:56] (03PS4) 10Ayounsi: Bird: move firewall and default neighbor to module [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) [10:11:58] (03PS12) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) [10:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [10:16:32] (03PS1) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [10:16:37] (03CR) 10Muehlenhoff: Bird: move firewall and default neighbor to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:17:32] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Infrastructure-Foundations: Investigate crypto KDC deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544 (10Gehel) [10:19:07] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10Gehel) [10:19:43] (03CR) 10Ayounsi: "Overall that makes sens to me, maybe rename the flag to a more explicit "--keep-mgmt-dns" ?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:21:09] (03CR) 10Samtar: "**Note to reviewer:** This change *may* depend on the user preference added in Ibbb59eb84f1dd0b40f9576e048f2ac76044f9014, but given it cur" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [10:21:09] I am rolling back [10:21:20] (03PS1) 10Majavah: P:wmcs::kubeadm: worker: support containerd separate volume [puppet] - 10https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) [10:21:21] that Echo issue sounds like it is breaking something [10:22:01] which issue? [10:22:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1190/console" [puppet] - 10https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:22:13] (curiosity) [10:22:16] https://phabricator.wikimedia.org/T355751 [10:22:18] from Echo [10:22:27] which emits a notification with some `null` summary for the event [10:22:32] (03PS2) 10Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 [10:22:45] which is passed to some Parser sanitizer function which now requires a String as input and thus bails out [10:22:55] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10Arnoldokoth) @CCiufo-WMF Is this screenshot from Superset? If so, could you try accessing another service like https://icinga.wikimedia.org / https://turnilo.wikimedia.org and see if those work? It's al... 
[10:23:00] (ty, and oh dear :/) [10:23:04] that is for EchoRevertedPresentationModel [10:23:11] which I guess happens anytime some diff is reverted [10:23:12] maybe [10:23:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P55495 and previous config saved to /var/cache/conftool/dbconfig/20240124-102330-marostegui.json [10:24:12] (03CR) 10Majavah: "Good idea, done." [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:25:26] (03PS1) 10Hashar: Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433) [10:25:28] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433) (owner: 10Hashar) [10:25:43] rollbacks are cheap [10:26:09] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.42.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992634 (https://phabricator.wikimedia.org/T354433) (owner: 10Hashar) [10:26:36] (03CR) 10CI reject: [V: 04-1] sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:26:57] (03CR) 10Majavah: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:28:26] * hashar looks at the process to raise awareness of the blocker :D [10:29:26] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10klausman) [10:30:21] (03PS3) 10Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 [10:30:23] (03PS1) 10Majavah: sre.mysql.clone: Silence SQL injection warning [cookbooks] - 10https://gerrit.wikimedia.org/r/992636 [10:31:53] !log hashar@deploy2002 rebuilt and synchronized 
wikiversions files: Revert "group1 wikis to 1.42.0-wmf.15" - T354433 [10:31:57] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [10:34:34] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:34:51] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::kubeadm: worker: support containerd separate volume [puppet] - 10https://gerrit.wikimedia.org/r/992633 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:34:55] (03CR) 10CI reject: [V: 04-1] sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:34:59] (03CR) 10CI reject: [V: 04-1] sre.mysql.clone: Silence SQL injection warning [cookbooks] - 10https://gerrit.wikimedia.org/r/992636 (owner: 10Majavah) [10:35:09] (03PS2) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [10:35:13] (03PS3) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [10:36:40] !log upgrading cumin1002 to pymsql 1.0.2-2~wmf11u1 T355531 [10:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:45] T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531 [10:37:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1014.eqiad.wmnet [10:38:06] jforrester@gerrit.wikimedia.org: Permission denied (publickey). 
[10:38:08] * hashar whistles [10:38:24] (03PS1) 10Muehlenhoff: Switch snapshot1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992637 (https://phabricator.wikimedia.org/T349619) [10:38:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T354336)', diff saved to https://phabricator.wikimedia.org/P55496 and previous config saved to /var/cache/conftool/dbconfig/20240124-103837-marostegui.json [10:38:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [10:38:43] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:38:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [10:38:57] !log deployment-server: removing `gerrit` remote from `/srv/mediawiki-staging` given it is tied to a specific username and the `origin` remote already has ssh protocol for push # ping James_F [10:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55497 and previous config saved to /var/cache/conftool/dbconfig/20240124-103900-marostegui.json [10:39:15] (03CR) 10Alexandros Kosiaris: jaeger: add oauth2-proxy sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:39:48] a second set of eyes for `IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config]` (https://gerrit.wikimedia.org/r/992632) would be appreciated, namely if it would be safe to deploy without the referenced user option yet being in prod [10:40:08] (03PS2) 10Majavah: sre.mysql.clone: Silence SQL injection warning [cookbooks] - 10https://gerrit.wikimedia.org/r/992636
[10:40:10] (03PS4) 10Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 [10:40:27] (03PS5) 10Ayounsi: Bird: move firewall and default neighbor to module [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) [10:40:29] (03PS13) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) [10:40:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch snapshot1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992637 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55498 and previous config saved to /var/cache/conftool/dbconfig/20240124-104123-marostegui.json [10:42:04] (03CR) 10Majavah: [C: 04-1] IS/CS: Add wmgEditRecoveryDefaultUserOptions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [10:42:19] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/992647 (https://phabricator.wikimedia.org/T355760) [10:43:42] (03PS4) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [10:43:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1017.eqiad.wmnet with OS bullseye [10:44:16] (03CR) 10Ayounsi: Bird: move firewall and default neighbor to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:44:23] (03CR) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [10:44:27] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:44:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1014.eqiad.wmnet [10:45:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1014.eqiad.wmnet with OS bullseye [10:49:41] (03CR) 10Majavah: [C: 03+1] "assuming the preference name is correct, which I have not checked :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [10:49:53] (03CR) 10Ayounsi: sre.hosts.decommission: Add flag to disable removing mgmt DNS name (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [10:53:06] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10MoritzMuehlenhoff) > Assuming all our systems are no longer vulnerable I double-checked and I can confirm that we have his consistently fixed across the fleet: The upstr... 
[10:56:07] (03CR) 10Muehlenhoff: "Looks good, one last nit (which I missed earlier)" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:56:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P55499 and previous config saved to /var/cache/conftool/dbconfig/20240124-105630-marostegui.json [10:57:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1173 with weight 0 T355760', diff saved to https://phabricator.wikimedia.org/P55500 and previous config saved to /var/cache/conftool/dbconfig/20240124-105702-root.json [10:57:08] T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760 [10:57:41] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=rowikinews --fix # T350889 [10:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:46] T350889: Run maintenance script to fix BBC:* titles in all wikis following set up of Toba Batak Wikipedia - https://phabricator.wikimedia.org/T350889 [10:59:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1100) [11:00:50] (03PS5) 10Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 [11:02:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1014.eqiad.wmnet with reason: host reimage [11:03:53] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992642 (https://phabricator.wikimedia.org/T355397) [11:04:52] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992642 
(https://phabricator.wikimedia.org/T355397) (owner: 10Kosta Harlan) [11:05:48] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992642 (https://phabricator.wikimedia.org/T355397) (owner: 10Kosta Harlan) [11:07:09] (03CR) 10Alexandros Kosiaris: "We chatted on IRC with Filippo, the best path forward is probably to follow the wmf-stable/secrets chart path in the same vein as e.g. ml-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:08:32] (03CR) 10Ladsgroup: [C: 03+2] "I have many questions..." [cookbooks] - 10https://gerrit.wikimedia.org/r/992636 (owner: 10Majavah) [11:10:25] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [11:10:52] (03CR) 10Majavah: [C: 03+2] sre.hosts.decommission: Add flag to disable removing mgmt DNS name (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [11:11:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P55501 and previous config saved to /var/cache/conftool/dbconfig/20240124-111136-marostegui.json [11:14:04] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10MoritzMuehlenhoff) >>! In T345724#9483239, @cmooney wrote: > I've been looking into these settings a little bit. > > The man for //ipfrag_high_thresh// states: > ` > Maxi... 
[11:14:45] (03PS3) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) [11:15:23] (03Merged) 10jenkins-bot: sre.mysql.clone: Silence SQL injection warning [cookbooks] - 10https://gerrit.wikimedia.org/r/992636 (owner: 10Majavah) [11:15:25] (03Merged) 10jenkins-bot: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 (owner: 10Majavah) [11:20:01] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992645 (https://phabricator.wikimedia.org/T355066) [11:20:07] (03PS1) 10Ladsgroup: GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992514 [11:20:17] jouncebot: nowandnext [11:20:17] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1100) [11:20:17] In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400) [11:20:31] * hashar lunches [11:20:32] (03CR) 10Ladsgroup: [C: 03+2] GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992514 (owner: 10Ladsgroup) [11:24:10] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:24:40] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:26:03] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:26:36] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:26:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P55503 and previous config saved to /var/cache/conftool/dbconfig/20240124-112643-marostegui.json [11:26:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [11:26:48] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:26:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [11:27:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55504 and previous config saved to /var/cache/conftool/dbconfig/20240124-112705-marostegui.json [11:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55505 and previous config saved to /var/cache/conftool/dbconfig/20240124-112929-marostegui.json [11:29:50] (03CR) 10Hnowlan: [C: 03+2] admin_ng: add namespace for mw-videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [11:30:38] (03PS1) 10Majavah: hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - 10https://gerrit.wikimedia.org/r/992666 [11:31:18] !log depooling cp3066 - T354424 [11:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:23] T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [11:32:15] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:32:41] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:32:42] (03Merged) 10jenkins-bot: admin_ng: add namespace for mw-videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) 
(owner: 10Hnowlan) [11:33:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1014.eqiad.wmnet with OS bullseye [11:33:05] !log repool cp3066 - T354424 [11:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:23] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1193/co" [puppet] - 10https://gerrit.wikimedia.org/r/992666 (owner: 10Majavah) [11:34:08] (03CR) 10Majavah: hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - 10https://gerrit.wikimedia.org/r/992666 (owner: 10Majavah) [11:35:17] (03CR) 10Alexandros Kosiaris: "This should work. Before deployment via helmfile, we 'll need the corresponding private puppet change that populates" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:35:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:35:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] jaeger: add oauth2-proxy sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:38:36] (03Merged) 10jenkins-bot: GenerateFancyCaptchas: Add ->disableSandbox() to shell command [extensions/ConfirmEdit] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992514 (owner: 10Ladsgroup) [11:43:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host acmechief-test2001.codfw.wmnet [11:44:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P55506 and previous config saved to 
/var/cache/conftool/dbconfig/20240124-114435-marostegui.json [11:45:40] (03PS1) 10Muehlenhoff: Switch acmechief-test2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992669 (https://phabricator.wikimedia.org/T349619) [11:46:03] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:47:38] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:48:06] (03CR) 10Majavah: [C: 03+2] hieradata: openstack: codfw1dev: use cloud-private names for LDAP [puppet] - 10https://gerrit.wikimedia.org/r/992666 (owner: 10Majavah) [11:49:48] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [11:49:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [11:51:51] (03CR) 10Muehlenhoff: [C: 03+2] Switch acmechief-test2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992669 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:52:06] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:52:47] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:54:32] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:55:42] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [11:55:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief-test2001.codfw.wmnet [11:55:55] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:56:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [11:56:38] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[11:57:03] (03PS1) 10STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) [11:57:18] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:57:22] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]] [11:57:45] (03CR) 10CI reject: [V: 04-1] Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [11:58:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host acmechief-test1001.eqiad.wmnet [11:58:51] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P55509 and previous config saved to /var/cache/conftool/dbconfig/20240124-115942-marostegui.json [11:59:59] (03CR) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [12:00:28] (03PS1) 10Superpes15: [ganwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) [12:00:35] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:00:40] (03PS2) 10STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) [12:00:48] (03CR) 10Hnowlan: [C: 03+2] kubernetes: Add usernames for mw-videoscaler [puppet] - 
10https://gerrit.wikimedia.org/r/992199 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [12:01:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355760 [12:01:50] (03PS1) 10Muehlenhoff: Switch acmechief-test1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992673 (https://phabricator.wikimedia.org/T349619) [12:02:08] T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760 [12:02:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355760 [12:03:35] (03CR) 10Dreamy Jazz: Update beta configs to reflect new temp account naming pattern (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [12:04:06] (03CR) 10Muehlenhoff: [C: 03+2] Switch acmechief-test1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992673 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:07:18] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:992514|GenerateFancyCaptchas: Add ->disableSandbox() to shell command]] (duration: 09m 55s) [12:08:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/992647 (https://phabricator.wikimedia.org/T355760) (owner: 10Gerrit maintenance bot) [12:09:51] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [12:11:13] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Docker [12:13:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief-test1001.eqiad.wmnet [12:13:59] jmm@cumin2002: Failed 
to log message to wiki. Somebody should check the error logs. [12:14:47] >.> [12:14:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T354336)', diff saved to https://phabricator.wikimedia.org/P55510 and previous config saved to /var/cache/conftool/dbconfig/20240124-121448-marostegui.json [12:14:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [12:14:54] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:15:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [12:15:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:15:15] (03PS3) 10STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) [12:15:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:15:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55511 and previous config saved to /var/cache/conftool/dbconfig/20240124-121526-marostegui.json [12:15:30] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[12:15:42] (03CR) 10STran: Update beta configs to reflect new temp account naming pattern (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [12:16:55] (03PS1) 10Majavah: openstack: keystone: ensure keystone-admin is restarted when keystone is [puppet] - 10https://gerrit.wikimedia.org/r/992676 [12:17:45] (03CR) 10Dreamy Jazz: [C: 03+1] Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [12:18:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1194/co" [puppet] - 10https://gerrit.wikimedia.org/r/992676 (owner: 10Majavah) [12:19:11] (03PS1) 10Majavah: wmcs-image-create: remove cloud-init-finished flag if present [puppet] - 10https://gerrit.wikimedia.org/r/992677 [12:19:44] !log Starting s6 eqiad failover from db1231 to db1173 - T355760 [12:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:49] T355760: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T355760 [12:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1173 to s6 primary T355760', diff saved to https://phabricator.wikimedia.org/P55512 and previous config saved to /var/cache/conftool/dbconfig/20240124-122030-marostegui.json [12:21:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231 T355760', diff saved to https://phabricator.wikimedia.org/P55513 and previous config saved to /var/cache/conftool/dbconfig/20240124-122148-root.json [12:23:30] (03PS9) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [12:23:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: 
After switchover', diff saved to https://phabricator.wikimedia.org/P55514 and previous config saved to /var/cache/conftool/dbconfig/20240124-122354-root.json [12:25:45] (03PS1) 10Superpes15: [azwiki] Add new namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) [12:28:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1052.eqiad.wmnet [12:28:53] (03CR) 10Muehlenhoff: [C: 03+2] mc1052: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991299 (owner: 10Effie Mouzeli) [12:33:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1052.eqiad.wmnet [12:34:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2052.codfw.wmnet [12:37:36] (03CR) 10Muehlenhoff: [C: 03+2] mc2052: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991300 (owner: 10Effie Mouzeli) [12:39:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P55515 and previous config saved to /var/cache/conftool/dbconfig/20240124-123859-root.json [12:42:33] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [12:42:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2052.codfw.wmnet [12:42:49] (03PS6) 10Ayounsi: Bird: move firewall and default neighbor to module [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) [12:42:51] (03PS14) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) [12:43:00] (03CR) 10Ayounsi: "I also removed the 2 elements mentioning bird6 as they were for the bird to bird2 transition." 
[puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:43:37] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [12:44:20] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:47:40] (03PS1) 10Hnowlan: kubernetes: move more jobrunner hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791) [12:50:43] (03PS1) 10Muehlenhoff: Remove obsolete sysctls for setting lower boundary of IP frag [puppet] - 10https://gerrit.wikimedia.org/r/992680 (https://phabricator.wikimedia.org/T345724) [12:54:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55516 and previous config saved to /var/cache/conftool/dbconfig/20240124-125404-root.json [12:55:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:59:03] (03PS1) 10Cathal Mooney: Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) [12:59:48] (03Abandoned) 10Cathal Mooney: Remove obsolete sysctls for setting lower boundary of IP frag [puppet] - 10https://gerrit.wikimedia.org/r/992680 (https://phabricator.wikimedia.org/T345724) (owner: 10Muehlenhoff) [13:04:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: 10Cathal Mooney) [13:09:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55517 and previous config saved to 
/var/cache/conftool/dbconfig/20240124-130909-root.json [13:16:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55518 and previous config saved to /var/cache/conftool/dbconfig/20240124-131600-marostegui.json [13:16:06] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:16:37] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Jeff_G) >>! In T355433#9482934, @Wilfredor wrote: > I think the simplest way to correct this error is to lower the maximum upload limit to 1 GB for validation. That would be reducing f... [13:17:38] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Jeff_G) >>! In T355433#9482231, @MikhasikRV wrote: > @MatthewVernon I just used Upload Wizard to upload the file. I did not see neither attempt to delete the file after upload. After pr... [13:21:50] (03PS1) 10Muehlenhoff: Make ganeti1038 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/992689 (https://phabricator.wikimedia.org/T349925) [13:22:31] (03CR) 10Samtar: [C: 03+1] Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: 10Varnent) [13:23:14] jouncebot: nowandnext [13:23:14] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [13:23:15] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400) [13:23:29] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Jeff_G) >>! 
In T355433#9482093, @MatthewVernon wrote: > Right, those are all too far ago to still in the recent logs. Today's, however, I can find, and swift has done what was asked of... [13:24:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55519 and previous config saved to /var/cache/conftool/dbconfig/20240124-132414-root.json [13:28:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: 10Varnent) [13:29:26] (03Merged) 10jenkins-bot: Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: 10Varnent) [13:30:09] !log samtar@deploy2002 Started scap: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. (T354790)]] [13:30:14] T354790: Add Diff to RSS whitelist for Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T354790 [13:31:05] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [13:31:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P55520 and previous config saved to /var/cache/conftool/dbconfig/20240124-133107-marostegui.json [13:31:10] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [13:32:03] !log samtar@deploy2002 samtar and varnent: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. 
(T354790)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:11] * TheresNoTime testing [13:32:30] (03CR) 10Effie Mouzeli: [C: 03+2] cache.mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991357 (owner: 10Effie Mouzeli) [13:32:31] !log samtar@deploy2002 samtar and varnent: Continuing with sync [13:33:29] (03Merged) 10jenkins-bot: cache.mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991357 (owner: 10Effie Mouzeli) [13:33:40] (03CR) 10Effie Mouzeli: [C: 03+2] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [13:33:49] (03CR) 10CI reject: [V: 04-1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [13:37:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [13:37:02] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [13:38:33] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:39:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55521 and previous config saved to /var/cache/conftool/dbconfig/20240124-133919-root.json [13:39:23] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:991100|Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. 
(T354790)]] (duration: 09m 14s) [13:39:43] T354790: Add Diff to RSS whitelist for Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T354790 [13:40:11] (03CR) 10Ayounsi: "nice !" [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: 10Cathal Mooney) [13:40:20] (03PS4) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) [13:40:47] (03PS1) 10Filippo Giunchedi: deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - 10https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) [13:41:00] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1038 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/992689 (https://phabricator.wikimedia.org/T349925) (owner: 10Muehlenhoff) [13:41:35] (03CR) 10Filippo Giunchedi: "I have updated the secret name and pushed https://gerrit.wikimedia.org/r/c/labs/private/+/992699 based on what I could find both there and" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [13:42:03] (03CR) 10Filippo Giunchedi: "Goes with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984143" [labs/private] - 10https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [13:42:20] (03PS10) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [13:45:41] (03PS1) 10Muehlenhoff: Remove long-absented resource [puppet] - 10https://gerrit.wikimedia.org/r/992700 [13:46:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P55522 and previous config saved to 
/var/cache/conftool/dbconfig/20240124-134614-marostegui.json [13:47:06] 10SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10fgiunchedi) `ipmi_exporter` now has support to collect generic SEL entries and export metrics from those: https://github.com/... [13:49:40] (03PS1) 10Muehlenhoff: Fold linux44 into the regular wmf kmod::blacklist [puppet] - 10https://gerrit.wikimedia.org/r/992702 [13:50:03] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1053.eqiad.wmnet [13:50:25] (03CR) 10Muehlenhoff: [C: 03+2] mc1053: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991301 (owner: 10Effie Mouzeli) [13:52:42] jouncebot: next [13:52:43] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400) [13:54:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P55523 and previous config saved to /var/cache/conftool/dbconfig/20240124-135424-root.json [13:55:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1053.eqiad.wmnet [13:59:38] (03PS5) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [13:59:50] (03PS29) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1400). [14:00:04] WMDE-Fisch and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] \o [14:00:15] Hi :) [14:00:31] o/ [14:01:12] guess I’m deploying ^^ [14:01:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T354336)', diff saved to https://phabricator.wikimedia.org/P55524 and previous config saved to /var/cache/conftool/dbconfig/20240124-140120-marostegui.json [14:01:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [14:01:26] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:01:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [14:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55525 and previous config saved to /var/cache/conftool/dbconfig/20240124-140142-marostegui.json [14:01:46] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: move more jobrunner hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [14:01:56] Lucas_WMDE: Sure. I could do mine but might not have time for more. So might make sense if you just go ahead. [14:02:01] (03PS2) 10Lucas Werkmeister (WMDE): Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:02:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:02:15] alright, I’m deploying then [14:02:23] Thanks! 
[14:03:30] (03Merged) 10jenkins-bot: Add mediawiki.reference_previews to wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992631 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:03:51] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437) [14:03:53] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]] [14:03:56] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437) [14:04:00] T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 [14:04:01] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f37d946c-6c32-4271-92ba-bc66a002809d) set by klausman@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with... 
[14:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55526 and previous config saved to /var/cache/conftool/dbconfig/20240124-140406-marostegui.json [14:04:11] T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798 [14:04:14] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2053.codfw.wmnet [14:05:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] [ganwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: 10Superpes15) [14:05:35] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:45] (03CR) 10Muehlenhoff: [C: 03+2] mc2053: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991302 (owner: 10Effie Mouzeli) [14:06:02] WMDE-Fisch: can you test the change on mwdebug? [14:06:11] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:06:19] (03PS6) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [14:06:36] (03PS30) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:07:52] Lucas_WMDE: Hard to tell atm. I tried but I'm not sure if there's a delay. Please go on. There seems to be no problem at least. 
[14:08:14] (CR) Lucas Werkmeister (WMDE): [C: +1] [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:08:19] alright [14:08:20] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Continuing with sync [14:08:21] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:44] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul) @Marostegui thank you. [14:09:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2053.codfw.wmnet [14:10:35] (PS1) Eevans: restbase: upgrade Cassandra to 'dev' (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719) [14:10:36] SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (CCiufo-WMF) Hmm I'm getting the same warning when accessing https://icinga.wikimedia.org/ and https://turnilo.wikimedia.org/ I was also just trying to access https://superset.wikimedia.org/ previously,... [14:11:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:22] (PS1) Lucas Werkmeister (WMDE): cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 [14:11:30] ^ decided to also add my own change ^^ [14:12:27] anyone want to +1 it? 
:) [14:12:30] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719) (owner: Eevans) [14:13:24] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman) [14:13:38] (PS1) Samtar: EditRecovery: Add user preference [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992522 (https://phabricator.wikimedia.org/T350653) [14:14:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:39] (CR) WMDE-Fisch: [C: +1] "Makes sense :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE)) [14:14:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992631|Add mediawiki.reference_previews to wgEventLoggingStreamNames (T353798)]] (duration: 10m 52s) [14:14:50] T353798: Fix the data collection for ReferencePreviews - https://phabricator.wikimedia.org/T353798 [14:14:51] thanks :) [14:14:58] (CR) Samtar: [C: +1] "ship it 🛳️" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE)) [14:15:15] emoji in gerrit :screm: [14:15:30] (PS2) Lucas Werkmeister (WMDE): [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15) [14:15:31] :D [14:15:31] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (klausman) ml-serve2005 is off and ready [14:15:46] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) 
(owner: Superpes15) [14:15:54] :D Lol [14:16:51] (Merged) jenkins-bot: [ganwiki] Change autoconfirmed setting [mediawiki-config] - https://gerrit.wikimedia.org/r/992671 (https://phabricator.wikimedia.org/T355126) (owner: Superpes15) [14:16:55] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul) @klausman thank you [14:17:13] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]] [14:17:25] T355126: Change the autoconfirmed user standard for gan.wikipedia - https://phabricator.wikimedia.org/T355126 [14:17:33] (CR) Eevans: [C: +2] restbase: upgrade Cassandra to 'dev' (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992705 (https://phabricator.wikimedia.org/T355719) (owner: Eevans) [14:18:51] (PS7) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [14:19:02] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:19:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P55527 and previous config saved to /var/cache/conftool/dbconfig/20240124-141912-marostegui.json [14:19:33] (PS1) Alexandros Kosiaris: eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709 [14:19:42] Superpes: anything to test about this change? 
[14:19:48] I guess it’s a bit difficult to test autoconfirmation settings [14:20:16] Yep agree, nothing to test, I think who is already autoconfirmed won't be removed :/ [14:20:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Continuing with sync [14:20:33] (PS1) Andrea Denisse: grafana: Failover from grafana1002 to grafana2001 [puppet] - https://gerrit.wikimedia.org/r/992710 (https://phabricator.wikimedia.org/T352665) [14:23:30] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (MikhasikRV) >>! In T355433#9484879, @Jeff_G wrote: > > I was able to download the file as F 1-74-0217.PDF. In case one of us gets it to upload, what filename would you like and how wou... [14:24:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [14:24:31] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [14:24:42] (PS31) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:25:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [14:25:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [14:25:13] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [14:25:13] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:25:29] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:25:34] (CR) CI reject: [V: -1] mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 
Effie Mouzeli) [14:25:37] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:25:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:26:08] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:27:05] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992671|[ganwiki] Change autoconfirmed setting (T355126)]] (duration: 09m 51s) [14:27:10] T355126: Change the autoconfirmed user standard for gan.wikipedia - https://phabricator.wikimedia.org/T355126 [14:27:29] (PS2) Lucas Werkmeister (WMDE): [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:27:36] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:28:41] (Abandoned) Samtar: EditRecovery: Add user preference [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992522 (https://phabricator.wikimedia.org/T350653) (owner: Samtar) [14:28:48] (Merged) jenkins-bot: [azwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/992678 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:29:09] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]] [14:29:14] T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041 [14:29:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Updated Cassandra to 4.1.1-wmf1 — T355719 - 
eevans@cumin1002 [14:29:46] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [14:30:44] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:31:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [14:31:10] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [14:31:11] Superpes: the azwiki change should be testable :) [14:31:28] Yep I'm testing! just a minute since there are a lot of aliases [14:31:29] !log analytics/refinery weekly deployment train - begin [14:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] ok! [14:32:36] Ok it's fine thanks Lucas_WMDE :) [14:32:39] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [14:32:41] ok! 
[14:33:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1054.eqiad.wmnet [14:33:36] !log aqu@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:33:50] (CR) Muehlenhoff: [C: +2] mc1054: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991303 (owner: Effie Mouzeli) [14:34:06] !log aqu@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P55529 and previous config saved to /var/cache/conftool/dbconfig/20240124-143419-marostegui.json [14:34:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [14:34:50] hm, there are some new MessageCache errors in logspam-watch… [14:34:52] * Lucas_WMDE looks [14:35:01] !log aqu@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:35:10] !log aqu@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:35:24] ok but according to logstash they already went away again [14:35:50] !log aqu@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:36:00] !log aqu@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:36:10] !log aqu@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:37:02] !log aqu@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:37:18] (PS1) Andrea Denisse: grafana: Ensure user traffic goes to grafana2001 [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665) [14:37:45] let’s see if the current deployment triggers it again, I guess [14:37:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1054.eqiad.wmnet [14:38:20] (it was apparently limited to frwikisource and shwiktionary, 
whatever it was) [14:38:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2054.codfw.wmnet [14:38:42] (“LogicException: Process cache for 'fr' should be set by now.”, and same for sh instead of fr) [14:38:54] (CR) Muehlenhoff: [C: +2] mc2054: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991304 (owner: Effie Mouzeli) [14:39:09] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992678|[azwiki] Add new namespace aliases (T355041)]] (duration: 10m 00s) [14:39:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:22] no recurrence in logstash so far [14:39:26] T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041 [14:39:27] I’ll do the birthday logo cleanup then [14:39:44] (PS2) Lucas Werkmeister (WMDE): cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 [14:39:54] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE)) [14:40:33] Many thanks for your assistance (and for the various cleanups! 
Sometimes we forget to complete things lmao) Lucas_WMDE :) [14:40:38] !log aqu@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:40:43] np :) [14:40:45] (Merged) jenkins-bot: cswiki: remove unused birthday logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/992706 (owner: Lucas Werkmeister (WMDE)) [14:41:00] yeah I looked for other *birthday* files yesterday and found this one :D [14:41:11] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]] [14:41:37] !log aqu@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:41:39] !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c]: Regular analytics weekly train [analytics/refinery@d1ee04cc] [14:43:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:43:32] hashar: Whoops, sorry, thanks for cleaning up mw-staging; can you see when that was added? Don't think I've done any git stuff there for years… [14:44:12] checked that https://en.wikipedia.org/static/images/project-logos/cswiki-birthday.png goes away on mwdebug [14:44:14] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:44:43] (PS1) Jforrester: Fix EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) [14:46:12] James_F: I’m done deploying in a few minutes if you want to backport that immediately [14:46:46] Lucas_WMDE: I was going to leave it to the train conductor. 
[14:46:58] alright [14:47:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2054.codfw.wmnet [14:48:59] alright, https://en.wikipedia.org/static/images/project-logos/cswiki-birthday-1.5x.png is gone now [14:49:21] https://en.wikipedia.org/static/images/project-logos/cswiki-birthday.png is still in the front cache and will stay there for up to a year [14:49:25] I think that’s fine 🤷 [14:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T354336)', diff saved to https://phabricator.wikimedia.org/P55530 and previous config saved to /var/cache/conftool/dbconfig/20240124-144925-marostegui.json [14:49:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:49:39] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:49:40] (CR) Hnowlan: [C: +2] kubernetes: move more jobrunner hosts to workers [puppet] - https://gerrit.wikimedia.org/r/992679 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [14:49:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:49:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55531 and previous config saved to /var/cache/conftool/dbconfig/20240124-144947-marostegui.json [14:50:03] (CR) Alexandros Kosiaris: [C: +2] eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709 (owner: Alexandros Kosiaris) [14:50:48] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992706|cswiki: remove unused birthday logo files]] (duration: 09m 36s) [14:50:50] !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c]: Regular analytics weekly train 
[analytics/refinery@d1ee04cc] (duration: 09m 11s) [14:51:54] !log UTC afternoon backport+config window done [14:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55532 and previous config saved to /var/cache/conftool/dbconfig/20240124-145211-marostegui.json [14:52:25] !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c] (thin): Regular analytics weekly train THIN [analytics/refinery@d1ee04cc] [14:52:32] !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c] (thin): Regular analytics weekly train THIN [analytics/refinery@d1ee04cc] (duration: 00m 06s) [14:52:39] !log aqu@deploy2002 Started deploy [analytics/refinery@d1ee04c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d1ee04cc] [14:53:04] (Merged) jenkins-bot: eventrouter: Bump requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/992709 (owner: Alexandros Kosiaris) [14:55:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:55:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:55:49] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:55:54] !log bump eventrouter limits/requests memory/cpu [14:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:18] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:56:19] !log aqu@deploy2002 Finished deploy [analytics/refinery@d1ee04c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d1ee04cc] (duration: 03m 40s) [14:56:37] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:56:46] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[14:56:54] (CR) Hnowlan: [C: +2] modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan) [14:56:58] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1198/co" [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse) [14:56:58] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:57:04] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:57:17] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:57:22] (PS1) Majavah: Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725 [14:57:42] !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06]: Regular analytics weekly train [analytics/refinery@13f7a06c] [14:57:50] (Merged) jenkins-bot: modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan) [14:57:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:58:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:30] (CR) CI reject: [V: -1] Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah) [14:59:00] (PS2) Majavah: Bring cloudrabbit1003 in service as a new cluster [puppet] - 
https://gerrit.wikimedia.org/r/992725 [14:59:09] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2427.codfw.wmnet with OS bullseye [14:59:12] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2430.codfw.wmnet with OS bullseye [14:59:18] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2446.codfw.wmnet with OS bullseye [14:59:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1500) [15:00:07] (CR) CI reject: [V: -1] Bring cloudrabbit1003 in service as a new cluster [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah) [15:00:27] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1200/co" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah) [15:00:32] (CR) Majavah: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah) [15:00:59] PROBLEM - Check systemd state on kubernetes2036 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:07] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1202/co" [puppet] - https://gerrit.wikimedia.org/r/992725 (owner: Majavah) [15:03:15] (PS1) Volans: setup.py: remove dependency on pytest-runner [software/debmonitor-client] - https://gerrit.wikimedia.org/r/992726 [15:03:17] (CR) CI reject: [V: -1] Fix 
EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester) [15:04:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2055.codfw.wmnet [15:04:55] SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (Arnoldokoth) Hehe. Nice. [15:05:08] SRE, LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (Arnoldokoth) In progress→Resolved [15:05:16] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:05:33] (CR) Muehlenhoff: [C: +2] mc2055: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991305 (owner: Effie Mouzeli) [15:06:09] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:06:31] (CR) Andrea Denisse: [C: +2] authdns: Add entry for the 'authdns' GID [puppet] - https://gerrit.wikimedia.org/r/992550 (owner: Andrea Denisse) [15:07:13] (CR) Hashar: "> ArgumentCountError: Too few arguments to function MediaWiki\User\CentralId\CentralIdLookupFactory::__construct(), 3 passed in /workspac" [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester) [15:07:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P55533 and 
previous config saved to /var/cache/conftool/dbconfig/20240124-150718-marostegui.json [15:07:21] (CR) Hashar: [C: +2] Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992367 (owner: Kosta Harlan) [15:07:55] !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06]: Regular analytics weekly train [analytics/refinery@13f7a06c] (duration: 10m 12s) [15:08:27] !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06] (thin): Regular analytics weekly train THIN [analytics/refinery@13f7a06c] [15:08:33] !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06] (thin): Regular analytics weekly train THIN [analytics/refinery@13f7a06c] (duration: 00m 05s) [15:08:35] !log aqu@deploy2002 Started deploy [analytics/refinery@13f7a06] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@13f7a06c] [15:09:02] (CR) Slyngshede: [C: +1] "Yes please. Look good." [puppet] - https://gerrit.wikimedia.org/r/992700 (owner: Muehlenhoff) [15:09:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2055.codfw.wmnet [15:10:43] (PS1) Arnaudb: mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) [15:10:57] (CR) Hashar: [C: +2] "+2 ing after I have +2ed the CentralAuth tests fix Iac91046516a1c05da8a12de5cf03dde089050662" [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester) [15:11:02] !log uploading pymsql 1.0.2-2~wmf11u1 to apt.wikimedia.org T355531 [15:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:13] T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531 [15:12:03] !log aqu@deploy2002 Finished deploy [analytics/refinery@13f7a06] (hadoop-test): Regular analytics weekly train TEST 
[analytics/refinery@13f7a06c] (duration: 03m 28s) [15:12:24] SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (Arnoldokoth) Open→In progress [15:12:37] (Merged) jenkins-bot: Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992367 (owner: Kosta Harlan) [15:13:03] SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (Arnoldokoth) @SLopes-WMF @thcipriani Will need your approval for this. [15:13:52] (CR) Marostegui: "Remember to push the dbctl configuration once this is merged" [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [15:13:57] (CR) Marostegui: [C: +1] mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [15:16:35] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage [15:16:38] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage [15:17:04] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage [15:19:28] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage [15:20:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:20:15] (MediaWikiHighErrorRate) firing: Elevated 
rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:20:16] (CR) Arnaudb: [C: +2] mariadb: preparing cloning db2169 to db2194 [puppet] - https://gerrit.wikimedia.org/r/992651 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [15:21:46] (PS1) Alexandros Kosiaris: helm-state-metrics: Declare the healthcheck port [deployment-charts] - https://gerrit.wikimedia.org/r/992731 (https://phabricator.wikimedia.org/T355167) [15:21:48] !log Refinery weekly deployment train - end (scap, then deployed onto hdfs) (test cluster deploy still broken T354703) [15:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:59] T354703: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 [15:22:05] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage [15:22:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P55534 and previous config saved to /var/cache/conftool/dbconfig/20240124-152224-marostegui.json [15:25:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage [15:25:06] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@da2e61c]: Regular analytics weekly train [airflow-dags/analytics@da2e61c7] [15:25:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:25:48] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@da2e61c]: Regular analytics weekly train [airflow-dags/analytics@da2e61c7] (duration: 00m 42s) [15:26:12] (PS1) Muehlenhoff: mc: Switch to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619) [15:26:24] Puppet, Infrastructure-Foundations, Toolforge, Goal, cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (dcaro) [15:26:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [15:27:00] Puppet, Toolforge, Documentation: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (dcaro) Open→Declined No more grid work is going to be done, we are retiring it :) [15:29:02] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [15:29:12] (PS1) Slyngshede: Add uwsgi plugin dependency [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 [15:30:05] (Merged) jenkins-bot: Fix EchoRevertedPresentationModel using null as string [extensions/Echo] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992523 (https://phabricator.wikimedia.org/T355751) (owner: Jforrester) [15:30:10] (PS2) Slyngshede: Debian packaging, dependencies and permissions [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 [15:31:11] (CR) Slyngshede: [C: +1] "LGTM, that tripped me up in a couple of tests, so I won't miss it." 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992726 (owner: 10Volans) [15:32:10] (03PS1) 10Alexandros Kosiaris: eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) [15:32:19] (03CR) 10Volans: [C: 03+2] setup.py: remove dependency on pytest-runner [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992726 (owner: 10Volans) [15:32:21] !log imported jenkins 2.426.3 for buster/bullseye T355503 [15:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:23] (03Merged) 10jenkins-bot: setup.py: remove dependency on pytest-runner [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992726 (owner: 10Volans) [15:36:03] (03CR) 10Volans: "question inline" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [15:36:54] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:37:08] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:37:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host phab2002.codfw.wmnet [15:37:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T354336)', diff saved to https://phabricator.wikimedia.org/P55536 and previous config saved to /var/cache/conftool/dbconfig/20240124-153730-marostegui.json [15:37:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance [15:37:36] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:37:38] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[15:37:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance [15:37:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55537 and previous config saved to /var/cache/conftool/dbconfig/20240124-153752-marostegui.json [15:37:53] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:37:59] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:38:15] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:38:31] !log sudo cumin 'A:dns-rec' "disable-puppet 'merging CR 980929'" [15:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2427.codfw.wmnet with OS bullseye [15:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55538 and previous config saved to /var/cache/conftool/dbconfig/20240124-154013-marostegui.json [15:40:28] (03PS3) 10Slyngshede: Debian packaging, dependencies and permissions [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 [15:41:02] (03CR) 10Slyngshede: Debian packaging, dependencies and permissions (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [15:41:19] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: forward_zones for wikimedia.org, too [puppet] - 10https://gerrit.wikimedia.org/r/980929 (https://phabricator.wikimedia.org/T347054) (owner: 10BBlack) [15:42:15] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2430.codfw.wmnet with OS bullseye [15:43:47] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [15:44:38] (03PS1) 10Muehlenhoff: Switch phab2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619) [15:45:03] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2446.codfw.wmnet with OS bullseye [15:46:27] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [15:46:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch phab2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:46:38] (03PS2) 10Muehlenhoff: Switch phab2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992743 (https://phabricator.wikimedia.org/T349619) [15:47:21] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:47:34] !log hashar@deploy2002 Synchronized php-1.42.0-wmf.15/extensions/CentralAuth/tests/phpunit/CentralAuthIdLookupTest.php: Fix CentralIdLookup tests (duration: 11m 18s) [15:48:00] (03CR) 10Tacsipacsi: IS/CS: Add wmgEditRecoveryDefaultUserOptions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [15:48:39] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Swift [15:48:40] more than 11 minutes :-\ [15:48:43] poor scap [15:48:51] !log sudo cumin -b1 -s120 'A:dns-rec' "enable-puppet 'merging CR 980929' && run-puppet-agent" [15:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:05] (03PS1) 10Alexandros Kosiaris: cxserver: Remove all kademlia support from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 
(https://phabricator.wikimedia.org/T355167) [15:50:07] !log disable puppet on cp3066 - T354424 [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [15:51:10] (03CR) 10Muehlenhoff: Debian packaging, dependencies and permissions (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [15:55:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P55539 and previous config saved to /var/cache/conftool/dbconfig/20240124-155519-marostegui.json [15:55:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:25] (03CR) 10Ssingh: [C: 03+1] "Looks good, thanks for cleaning this up! Happy to take care of merging this and also we can reimage a durum host to see how the initial pu" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [15:57:36] !log hashar@deploy2002 Synchronized php-1.42.0-wmf.15/extensions/Echo/includes/Formatters/EchoRevertedPresentationModel.php: Fix EchoRevertedPresentationModel using null as string - T355751 (duration: 09m 06s) [15:57:46] T355751: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::escapeHtmlAllowEntities() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/extensions/Echo/includes/DiscussionParser.php on line 1299 - https://phabricator.wikimedia.org/T355751 [15:58:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host phab2002.codfw.wmnet [16:02:11] (03PS1) 10Bking: cloudelastic: lay the groundwork for private IP migration [puppet] - 10https://gerrit.wikimedia.org/r/992748 
(https://phabricator.wikimedia.org/T355617) [16:02:50] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:03:32] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:03:38] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:04:23] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:04:31] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:10:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P55540 and previous config saved to /var/cache/conftool/dbconfig/20240124-161026-marostegui.json [16:11:30] (03CR) 10JHathaway: [C: 03+1] "looks good based on the phab discussion" [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: 10Cathal Mooney) [16:15:23] (03CR) 10Ssingh: Puppet: Routed Ganeti support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:18:23] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Bawolff) Just trying to think up solutions - if thumbor gives a 429, could varnish inste... 
[16:19:46] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Jhancock.wm) [16:20:06] (03PS32) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [16:20:08] (03CR) 10BBlack: [C: 03+1] Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: 10Cathal Mooney) [16:21:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/992702 (owner: 10Muehlenhoff) [16:23:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:25:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T354336)', diff saved to https://phabricator.wikimedia.org/P55541 and previous config saved to /var/cache/conftool/dbconfig/20240124-162532-marostegui.json [16:25:39] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:25:46] (03CR) 10Ayounsi: Puppet: Routed Ganeti support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:28:30] (03CR) 10Ssingh: [C: 03+1] Puppet: Routed Ganeti support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:30:58] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-eqiad: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [16:31:03] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [16:31:52] 10SRE, 
10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert) [16:35:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:35:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:38:01] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:39:19] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase103[1-3].eqiad.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [16:39:24] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [16:41:14] (03PS1) 10Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) [16:42:29] (03CR) 10CI reject: [V: 04-1] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: 10Jcrespo) [16:42:53] (03PS2) 10Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) [16:43:58] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596) [16:44:06] (03CR) 10CI reject: [V: 04-1] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: 10Jcrespo) 
[16:44:25] (03PS3) 10Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) [16:44:48] (03PS4) 10Jcrespo: dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) [16:44:55] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: 10Jcrespo) [16:49:54] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Create temporary fileset for dbprov for dbbackups archival [puppet] - 10https://gerrit.wikimedia.org/r/992755 (https://phabricator.wikimedia.org/T349360) (owner: 10Jcrespo) [16:51:07] (03CR) 10BCornwall: [C: 03+2] fifo-log-demux: Update project homepage [puppet] - 10https://gerrit.wikimedia.org/r/973887 (https://phabricator.wikimedia.org/T347623) (owner: 10BCornwall) [16:51:50] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992758 [16:54:34] !log disable puppet on all the hosts running bird to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/991699 [16:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:52] (03CR) 10Ayounsi: Bird: move firewall and default neighbor to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:54:57] jouncebot: nowandnext [16:54:57] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [16:54:57] In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1800) [16:55:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:55:16] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:55:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55542 and previous config saved to /var/cache/conftool/dbconfig/20240124-165522-marostegui.json [16:55:24] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992758 (owner: 10Volans) [16:55:26] !log enable puppet on durum1001 to test CR 991699 [16:55:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:37] !log enable puppet on cp3066 - T354424 [16:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:43] T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [16:56:55] (03CR) 10Ayounsi: [C: 03+2] Bird: move firewall and default neighbor to module [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:57:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55543 and previous config saved to /var/cache/conftool/dbconfig/20240124-165732-marostegui.json [16:57:59] train blocker got lifted so I guess I can run the train again now? 
poke thcipriani [16:58:23] given I have to go in 40 minutes [16:58:32] (03PS33) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [16:58:41] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:58:54] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Marostegui) All db* and es* up and running [16:59:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [16:59:36] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [17:02:01] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) Some 'raw' data on the last 30 days increase of errors per-host/drive: ` cloudcephosd1021-sdb 88 cloudceph... 
[17:03:05] (03PS1) 10Jcrespo: Revert "dbbackups: Create temporary fileset for dbprov for dbbackups archival" [puppet] - 10https://gerrit.wikimedia.org/r/992766 [17:04:42] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [17:05:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet [17:05:32] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [17:05:38] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@16476a9] (releasing): (no justification provided) [17:06:46] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@16476a9] (releasing): (no justification provided) (duration: 01m 07s) [17:07:11] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.3.4 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/992758 (owner: 10Volans) [17:07:13] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10klausman) ml-serve2005 is back up and working fine [17:07:19] (03PS8) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [17:07:23] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433) [17:07:27] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [17:08:07] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992759 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [17:09:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart 
(exit_code=0) for nodes matching restbase103[1-3].eqiad.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [17:10:02] !log sudo cumin -b1 -s60 "R:Class = Bird" "enable-puppet 'CR991699' && run-puppet-agent" [17:10:08] (03PS34) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [17:10:08] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [17:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:24] sukhe: pro-tip C:bird is equivalent to R:Class = Bird ;) [17:11:46] volans: thanks :) [17:11:56] mostly used to A: and P: and hence [17:12:14] (03CR) 10Cathal Mooney: [C: 03+2] Remove sysctl settings to override defualt IP frag buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/992682 (https://phabricator.wikimedia.org/T345724) (owner: 10Cathal Mooney) [17:12:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P55544 and previous config saved to /var/cache/conftool/dbconfig/20240124-171238-marostegui.json [17:13:16] ehehe, https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection is your friend :D [17:14:16] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [17:14:24] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [17:14:52] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Create temporary fileset for dbprov for dbbackups archival" [puppet] - 10https://gerrit.wikimedia.org/r/992766 (owner: 10Jcrespo) [17:16:48] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.15 refs T354433 [17:16:54] T354433: 1.42.0-wmf.15 deployment blockers - 
https://phabricator.wikimedia.org/T354433 [17:17:09] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2015-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [17:17:22] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [17:17:50] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:19:07] of course [17:19:11] LiquidThreads is broken again [17:20:31] (03PS3) 10Jcrespo: mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) [17:21:18] damn, I didn’t realize we had it in production at all [17:21:21] (I only know it from TWN) [17:22:21] what's wrong with it this time? 
[17:22:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 241, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:23:58] !log hashar@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.15 refs T354433 (duration: 07m 10s) [17:24:11] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [17:24:13] (03PS1) 10Jforrester: Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) [17:25:19] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [17:26:21] 10SRE: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) [17:27:07] T355808 [17:27:08] T355808: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T355808 [17:27:12] that is for liquidthreads [17:27:17] I haven't marked it as a blocker though [17:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P55545 and previous config saved to /var/cache/conftool/dbconfig/20240124-172745-marostegui.json [17:28:54] ah, probably some corner case as we haven't seen it at twn (yet) [17:29:16] (03CR) 10Jforrester: [C: 04-1] Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) (owner: 10Jforrester) [17:29:51] (03CR) 10Cathal Mooney: "Good stuff thanks! One comment on edge-case in line, otherwise LGTM. 
I'll check on netbox-next and see if I can figure anything out abou" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [17:29:55] Nikerabbit: yeah I guess so :) [17:29:58] and somehow [17:30:06] we have an error log from 1.42.0-wmf.13 ... [17:30:20] ah [17:30:27] that is from mwmaint2002 [17:31:06] some `maintenance/migrateLinksTable.php(` which yields PHP Warning: EtcdConfig failed to fetch data: (curl error: 6) Couldn't resolve host name [17:31:34] I guess it is a very long on going migration [17:33:01] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10cmooney) 05Open→03Resolved >>! In T345724#9484317, @MoritzMuehlenhoff wrote: > Given that we specifically only added this for Fragmentsmack (and not for a specific sca... [17:35:29] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2015-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [17:35:34] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [17:36:13] (03PS1) 10Klausman: ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - 10https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) [17:36:36] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) I'm running a script now to gather nicer reports with smartctl included, will send it once it's finished. [17:37:54] so MediaWiki looks okish, I am off! 
[17:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:42:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55546 and previous config saved to /var/cache/conftool/dbconfig/20240124-174251-marostegui.json [17:42:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:43:08] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:43:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:43:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:43:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:43:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55547 and previous config saved to /var/cache/conftool/dbconfig/20240124-174332-marostegui.json [17:44:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55548 and previous config saved to /var/cache/conftool/dbconfig/20240124-174442-marostegui.json [17:44:51] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - 
https://phabricator.wikimedia.org/T348643 (10dcaro) Here you go, that has one directory per host, with one file per drive with the total increase of errors in... [17:46:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:46:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:47:04] (03CR) 10Klausman: [C: 03+2] ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - 10https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [17:50:02] (03Merged) 10jenkins-bot: ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - 10https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [17:50:41] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [17:50:48] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:58:10] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10cmooney) [17:59:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P55549 and previous config saved to /var/cache/conftool/dbconfig/20240124-175948-marostegui.json [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T1800) [18:02:09] (03PS1) 10Volans: Upstream release v0.3.4 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 [18:03:32] volans: quick question since I see you are around: for a long-running cumin command affecting multiple hosts, what's the best way to somehow see which host is currently being affected? [18:03:46] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10Marostegui) @cmooney will you issue a downtime before the maintenance for each host? [18:04:16] sukhe: already launched or to be launched? [18:04:51] in this case, already launched but I do want to know for "to be launched" :) [18:06:22] sukhe: in that case launch it with -d, --debug and then tail /var/log/cumin/cumin.log [18:07:51] * volans double checking as I'm going by memory [18:08:11] hmm ok, that works. would you consider this as a feature request? sometimes it's helpful to know where the progress is [18:08:25] I mean I can do some mental math given how many hosts is affecting and what's the current # but yeah [18:08:37] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@fed6de3]: (no justification provided) [18:08:52] but how would that show in the UI? as part of the progress bar? 
[18:09:10] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@fed6de3]: (no justification provided) (duration: 00m 32s) [18:09:16] something like that yeah. I am even happy with some more verbose output [18:09:38] (the progress bar re-writing the screen has already created too many issues in the past :D but I guess we could inject the hostname in there if asked) [18:09:44] (or by default) [18:09:54] I can file a task on why/how this came up too for some context [18:10:00] the verbose output would mess with the aggregation of output though [18:10:03] sure [18:10:08] that would be great, thanks [18:10:19] (or whenever `-b 1` with multiple hosts) [18:10:39] there is a cumin tag in phab [18:11:12] rzl: yep that's the other angle I was thinking about, if no batch or large batches are used, there are multiple hosts in parallel and they will scroll very rapidly so it would be a mess anyway [18:11:15] thanks! we can discuss it there, I just realized I spammed this channel [18:11:33] better you than icinga-wm and jinxer-wm :D [18:13:36] (03CR) 10Volans: "Tested on build2002, lintian could be improved a bit more ideally:" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [18:14:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P55550 and previous config saved to /var/cache/conftool/dbconfig/20240124-181455-marostegui.json [18:16:28] sukhe: for completeness, if you're running the same via a cookbook, the debug logs are always available in the -extended.log files [18:16:59] volans: ok thanks. 
but just cumin directly in this case [18:17:04] writing that task now, you can read it tomorrow [18:17:16] <3 [18:23:58] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2017-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [18:24:03] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [18:25:40] 10SRE, 10Cumin, 10Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811 (10ssingh) [18:25:49] sukhe: thanks for the task, pro-tip run-puppet-agent accepts a -e --enable MSG argument ;) [18:26:11] haha, you have told me this before but that breaks my mental model :P [18:30:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55551 and previous config saved to /var/cache/conftool/dbconfig/20240124-183001-marostegui.json [18:30:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:30:08] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:30:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:30:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:30:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:30:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:30:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:31:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55552 and previous config saved to /var/cache/conftool/dbconfig/20240124-183059-marostegui.json [18:33:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55553 and previous config saved to /var/cache/conftool/dbconfig/20240124-183308-marostegui.json [18:48:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P55554 and previous config saved to /var/cache/conftool/dbconfig/20240124-184815-marostegui.json [18:48:35] (03PS2) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [18:48:53] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [18:50:51] (03PS3) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [18:51:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [18:51:59] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [18:56:58] (03PS4) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:03:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P55555 and previous config saved to /var/cache/conftool/dbconfig/20240124-190322-marostegui.json [19:09:27] (03PS5) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:10:33] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [19:12:04] (03PS1) 10Ebernhardson: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992801 (https://phabricator.wikimedia.org/T355066) [19:13:14] (03CR) 10CDanis: Add SameSite=Strict attribute to NetworkProbeLimit cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) (owner: 10Ayounsi) [19:13:21] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase[2017-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [19:13:36] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [19:16:11] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2022-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [19:17:17] (03PS6) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:18:26] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [19:18:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T354336)', 
diff saved to https://phabricator.wikimedia.org/P55557 and previous config saved to /var/cache/conftool/dbconfig/20240124-191828-marostegui.json [19:18:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [19:18:43] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:18:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [19:18:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55558 and previous config saved to /var/cache/conftool/dbconfig/20240124-191850-marostegui.json [19:20:35] (03PS7) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:21:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55559 and previous config saved to /var/cache/conftool/dbconfig/20240124-192100-marostegui.json [19:21:45] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [19:22:10] (03CR) 10CDanis: [C: 03+2] Fix various pylint warnings [software/conftool] - 10https://gerrit.wikimedia.org/r/992105 (owner: 10Clément Goubert) [19:22:20] (03CR) 10CDanis: [C: 03+2] Raise yaml_log_error logging level to error [software/conftool] - 10https://gerrit.wikimedia.org/r/992104 (https://phabricator.wikimedia.org/T355256) (owner: 10Clément Goubert) [19:23:04] (03PS8) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:23:55] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:24:06] !log ebernhardson@deploy2002 helmfile 
[staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:24:20] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) Today's work is complete. The only node left to relocation is gitlab2002. Service ops will get back with us with a day for sometimes next week. All old ports in netbox and on a... [19:24:35] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [19:25:27] (03Merged) 10jenkins-bot: Raise yaml_log_error logging level to error [software/conftool] - 10https://gerrit.wikimedia.org/r/992104 (https://phabricator.wikimedia.org/T355256) (owner: 10Clément Goubert) [19:25:31] (03Merged) 10jenkins-bot: Fix various pylint warnings [software/conftool] - 10https://gerrit.wikimedia.org/r/992105 (owner: 10Clément Goubert) [19:26:56] (03PS1) 10Ebernhardson: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066) [19:28:51] (03Abandoned) 10Ebernhardson: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992801 (https://phabricator.wikimedia.org/T355066) (owner: 10Ebernhardson) [19:29:59] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066) (owner: 10Ebernhardson) [19:30:49] (03Merged) 10jenkins-bot: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992803 (https://phabricator.wikimedia.org/T355066) (owner: 10Ebernhardson) [19:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:33:55] !log 
ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:34:04] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:34:53] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:35:01] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:36:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P55560 and previous config saved to /var/cache/conftool/dbconfig/20240124-193606-marostegui.json [19:37:42] (03PS1) 10Ebernhardson: cirrus updater: Align consumer-devnull with deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/992806 [19:38:48] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:39:00] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:51:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P55561 and previous config saved to /var/cache/conftool/dbconfig/20240124-195113-marostegui.json [19:52:22] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) There seems to be a problem with my developer account as well. I created my developer account through the [[ https://idm.wikimedia.org/signup/ | IDM signup page ]] last week, but I have... 
[20:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T354336)', diff saved to https://phabricator.wikimedia.org/P55562 and previous config saved to /var/cache/conftool/dbconfig/20240124-200619-marostegui.json [20:06:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance [20:06:30] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:06:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance [20:06:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:06:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:06:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55563 and previous config saved to /var/cache/conftool/dbconfig/20240124-200659-marostegui.json [20:08:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55564 and previous config saved to /var/cache/conftool/dbconfig/20240124-200808-marostegui.json [20:23:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P55565 and previous config saved to /var/cache/conftool/dbconfig/20240124-202315-marostegui.json 
[20:26:46] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=scowiki --logwiki=metawiki 'TheBabushka' 'AshotGPT' # T355743 [20:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:51] T355743: Unblock stuck global rename of AshotGPT - https://phabricator.wikimedia.org/T355743 [20:29:08] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) An update on my attempts to figure out my developer/wikitech account creation issue: - When I go to the[[ https://idp.wikimedia.org/login#divAttributes | IDP login page ]], I get this... [20:35:08] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) For more context, if I go to the[[ https://idm.wikimedia.org/wikimedia/login/ | IDM login page ]] and click on the "Wikimedia Developer Single Sign On" button, I get this: {F41714657} [20:37:37] !log fab@deploy2002 Started deploy [airflow-dags/research@2f514fc]: (no justification provided) [20:38:10] !log fab@deploy2002 Finished deploy [airflow-dags/research@2f514fc]: (no justification provided) (duration: 00m 33s) [20:38:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P55566 and previous config saved to /var/cache/conftool/dbconfig/20240124-203821-marostegui.json [20:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T354336)', diff saved to https://phabricator.wikimedia.org/P55567 and previous config saved to /var/cache/conftool/dbconfig/20240124-205327-marostegui.json [20:53:30] 
!log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance [20:53:33] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:53:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance [20:53:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55568 and previous config saved to /var/cache/conftool/dbconfig/20240124-205350-marostegui.json [20:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55569 and previous config saved to /var/cache/conftool/dbconfig/20240124-205600-marostegui.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2100). Please do the needful. [21:00:04] No Gerrit patches in the queue for this window AFAICS. [21:02:57] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10hashar) + @jnuche from release engineering who knows even more about Jenkins than me :-) `contint2002` hosts... 
[21:05:10] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@5a0681b]: Regular analytics weekly train [airflow-dags/analytics@5a0681bc] [21:05:47] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@5a0681b]: Regular analytics weekly train [airflow-dags/analytics@5a0681bc] (duration: 00m 37s) [21:11:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P55570 and previous config saved to /var/cache/conftool/dbconfig/20240124-211107-marostegui.json [21:12:28] (03PS1) 10Gmodena: eventstreams: redactions with underscores in title [deployment-charts] - 10https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) [21:25:10] (03CR) 10Htriedman: [C: 03+1] eventstreams: redactions with underscores in title [deployment-charts] - 10https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) (owner: 10Gmodena) [21:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P55571 and previous config saved to /var/cache/conftool/dbconfig/20240124-212613-marostegui.json [21:28:16] (03CR) 10Cathal Mooney: "Yeah this is really weird. I'd a bit of a look and can't see why that part of the code is getting executed when the check fails." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [21:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T354336)', diff saved to https://phabricator.wikimedia.org/P55572 and previous config saved to /var/cache/conftool/dbconfig/20240124-214120-marostegui.json [21:41:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance [21:41:25] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [21:41:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance [21:41:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55573 and previous config saved to /var/cache/conftool/dbconfig/20240124-214141-marostegui.json [21:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:43:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55574 and previous config saved to /var/cache/conftool/dbconfig/20240124-214351-marostegui.json [21:45:02] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching 
restbase[2022-2035].codfw.wmnet: Updated Cassandra to 4.1.1-wmf1 — T355719 - eevans@cumin1002 [21:45:08] T355719: Patch Cassandra for CASSANDRA-18733 (streaming receive deadlock) - https://phabricator.wikimedia.org/T355719 [21:58:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P55575 and previous config saved to /var/cache/conftool/dbconfig/20240124-215857-marostegui.json [22:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2200) [22:10:53] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2103.codfw.wmnet with OS bullseye [22:11:05] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2104.codfw.wmnet with OS bullseye [22:11:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2106.codfw.wmnet with OS bullseye [22:11:30] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2105.codfw.wmnet with OS bullseye [22:14:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P55576 and previous config saved to /var/cache/conftool/dbconfig/20240124-221403-marostegui.json [22:28:52] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage [22:28:53] (03CR) 10Bking: [C: 03+2] cloudelastic: promote new hosts to master-eligible [puppet] - 10https://gerrit.wikimedia.org/r/992538 (https://phabricator.wikimedia.org/T351354) (owner: 10Bking) [22:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T354336)', diff saved to https://phabricator.wikimedia.org/P55577 and previous config saved to /var/cache/conftool/dbconfig/20240124-222910-marostegui.json [22:29:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
8:00:00 on db1242.eqiad.wmnet with reason: Maintenance [22:29:15] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [22:29:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance [22:29:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55578 and previous config saved to /var/cache/conftool/dbconfig/20240124-222932-marostegui.json [22:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55579 and previous config saved to /var/cache/conftool/dbconfig/20240124-223142-marostegui.json [22:33:04] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage [22:34:31] (03PS1) 10Jforrester: Revert "Update
spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" [skins/Vector] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805) [22:37:00] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 344 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1180, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 330, delayed_unassigned_shards: 0, number_of_pending_tas [22:37:00] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3887, active_shards_percent_as_number: 77.42782152230971 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:08] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 429 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1102, relocating_shards: 0, initializing_shards: 56, unassigned_shards: 373, delayed_unassigned_shards: 0, number_of_pending_tas [22:37:08] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 71.97909862834749 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:08] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 373 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1225, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 364, delayed_unassigned_shards: 0, number_of_pending_ta 
[22:37:08] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 13245, active_shards_percent_as_number: 76.65832290362954 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:14] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 339 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1185, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 325, delayed_unassigned_shards: 0, number_of_pending_tas [22:37:14] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 19022, active_shards_percent_as_number: 77.75590551181102 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:16] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:16] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:24] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:26] ^^ we're working on this [22:37:40] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 422 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1109, relocating_shards: 0, initializing_shards: 55, unassigned_shards: 367, delayed_unassigned_shards: 0, number_of_pending_tas [22:37:40] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.43631613324625 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:40] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 422 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1109, relocating_shards: 0, initializing_shards: 55, unassigned_shards: 367, delayed_unassigned_shards: 0, number_of_pending_tas [22:37:40] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.43631613324625 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:40] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 337 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1261, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 329, delayed_unassigned_shards: 0, number_of_pending_ta [22:37:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 23547, active_shards_percent_as_number: 78.91113892365456 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:00] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1240, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 276, delayed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_fli [22:38:00] h: 0, task_max_waiting_in_queue_millis: 1852, active_shards_percent_as_number: 81.36482939632546 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:16] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 700, active_shards: 1249, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 267, delayed_unassigned_shards: 0, number_of_pending_tasks: 11, number_of_in_fli [22:38:16] h: 0, task_max_waiting_in_queue_millis: 10555, active_shards_percent_as_number: 81.95538057742782 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:42] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1294, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 290, delayed_unassigned_shards: 0, number_of_pending_tasks: 19, number_of_i [22:38:42] _fetch: 0, task_max_waiting_in_queue_millis: 10224, active_shards_percent_as_number: 80.97622027534418 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:50] (03CR) 10Jforrester: "Does this need deploying?"
[extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [22:39:08] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 733, active_shards: 1320, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 274, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_ [22:39:08] etch: 0, task_max_waiting_in_queue_millis: 28697, active_shards_percent_as_number: 82.60325406758447 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:34] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: cloduelastic maintenance [22:39:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: cloduelastic maintenance [22:41:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:46:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55580 and previous config saved to /var/cache/conftool/dbconfig/20240124-224648-marostegui.json [22:47:30] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1228, 
relocating_shards: 0, initializing_shards: 44, unassigned_shards: 259, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [22:47:30] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.20901371652515 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:47:32] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1228, relocating_shards: 0, initializing_shards: 44, unassigned_shards: 259, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [22:47:32] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.20901371652515 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:48:06] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 7, number_of_data_nodes: 7, active_primary_shards: 697, active_shards: 1233, relocating_shards: 0, initializing_shards: 43, unassigned_shards: 255, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [22:48:06] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.5355976485957 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:50:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2106.codfw.wmnet with OS bullseye [22:55:48] jouncebot: nowandnext [22:55:48] For the next 0 hour(s) and 4 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240124T2200) [22:55:48] In 8 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700) [22:55:48] In 8 hour(s) and 4 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700) [22:56:44] (03PS1) 10Ryan Kemper: cloudelastic: add old masters back [puppet] - 10https://gerrit.wikimedia.org/r/992826 [22:56:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805) (owner: 10Jforrester) [22:56:56] (03CR) 10Bking: [C: 03+2] cloudelastic: add old masters back [puppet] - 10https://gerrit.wikimedia.org/r/992826 (owner: 10Ryan Kemper) [22:57:07] (03CR) 10Bking: [V: 03+2 C: 03+2] cloudelastic: add old masters back [puppet] - 10https://gerrit.wikimedia.org/r/992826 (owner: 10Ryan Kemper) [23:01:06] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 697, active_shards: 1339, relocating_shards: 0, initializing_shards: 23, unassigned_shards: 169, delayed_unassigned_shards: 0, number_of_pending_tasks: 16, number_of_in_ [23:01:06] etch: 0, task_max_waiting_in_queue_millis: 52817, active_shards_percent_as_number: 87.45917700849118 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:10] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 733, active_shards: 1468, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 130, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_i [23:01:10] _fetch: 0, 
task_max_waiting_in_queue_millis: 7119, active_shards_percent_as_number: 91.8648310387985 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:20] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 732, active_shards: 1404, relocating_shards: 0, initializing_shards: 34, unassigned_shards: 93, delayed_unassigned_shards: 0, number_of_pending_tasks: 13, number_of_in_f [23:01:20] tch: 0, task_max_waiting_in_queue_millis: 40933, active_shards_percent_as_number: 91.70476812540824 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55581 and previous config saved to /var/cache/conftool/dbconfig/20240124-230155-marostegui.json [23:04:34] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2103.codfw.wmnet with OS bullseye [23:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 6.762% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:09:34] * James_F wonders if he can get away with a quick sleep whilst waiting for gerrit to merge the patch. 
;-( [23:14:07] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) (owner: 10Andrew Bogott) [23:14:08] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:14:16] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T354336)', diff saved to https://phabricator.wikimedia.org/P55582 and previous config saved to /var/cache/conftool/dbconfig/20240124-231701-marostegui.json [23:17:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [23:17:07] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [23:17:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [23:17:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T354336)', diff saved to https://phabricator.wikimedia.org/P55583 and previous config saved to /var/cache/conftool/dbconfig/20240124-231723-marostegui.json [23:17:49] (03PS2) 10Andrew Bogott: wikireplicas maintain-meta_p: don't store cursor in schema class [puppet] - 10https://gerrit.wikimedia.org/r/984626 [23:18:59] (03Merged) 10jenkins-bot: Revert "Update
spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" [skins/Vector] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992775 (https://phabricator.wikimedia.org/T355805) (owner: 10Jforrester) [23:19:29] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:992775|Revert "Update
spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]] [23:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T354336)', diff saved to https://phabricator.wikimedia.org/P55584 and previous config saved to /var/cache/conftool/dbconfig/20240124-231933-marostegui.json [23:19:35] T355805: Syntax highlighting in 2017 wikitext editor has extreme vertical cursor displacement - https://phabricator.wikimedia.org/T355805 [23:19:36] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [23:21:01] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:992775|Revert "Update
spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:23:20] Kemayo: Everything look OK to you? [23:25:39] https://test.wikipedia.org/w/index.php?title=Foo&veaction=editsource&debug=1 is taking rather a while to load, sigh. [23:26:05] !log jforrester@deploy2002 jforrester: Continuing with sync [23:32:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2104.codfw.wmnet with OS bullseye [23:32:18] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:32:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2105.codfw.wmnet with OS bullseye [23:32:59] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:992775|Revert "Update
spacing to improve consistency of ul/ol spacing, also update heading spacing to be more consistent, relying on mw defaults more" (T355805 T354433)]] (duration: 13m 29s) [23:33:07] T355805: Syntax highlighting in 2017 wikitext editor has extreme vertical cursor displacement - https://phabricator.wikimedia.org/T355805 [23:33:07] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [23:33:12] Finally. [23:34:29] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2103.codfw.wmnet with OS bullseye [23:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P55585 and previous config saved to /var/cache/conftool/dbconfig/20240124-233439-marostegui.json [23:36:00] James_F: Yeah, that took forever to load. It does seem to work fine -- the bug I know about is gone, anyway. [23:39:27] 10SRE, 10SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (10Scott_French) [23:41:10] 10SRE, 10SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (10Scott_French) 05Open→03In progress p:05Triage→03Medium [23:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [23:43:51] 10SRE, 10SRE-Access-Requests: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (10RLazarus) [23:44:45] (03PS1) 10Scott French: admin: add new SSH pubkey for swfrench [puppet] - 10https://gerrit.wikimedia.org/r/992829 (https://phabricator.wikimedia.org/T355834) [23:49:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to 
https://phabricator.wikimedia.org/P55586 and previous config saved to /var/cache/conftool/dbconfig/20240124-234946-marostegui.json [23:51:15] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2103.codfw.wmnet with reason: host reimage [23:51:57] (03CR) 10RLazarus: [C: 03+2] admin: add new SSH pubkey for swfrench [puppet] - 10https://gerrit.wikimedia.org/r/992829 (https://phabricator.wikimedia.org/T355834) (owner: 10Scott French) [23:54:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2103.codfw.wmnet with reason: host reimage [23:56:36] (03PS1) 10Zabe: Start reading from af_user(_text)/afh_user(_text) in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992830 (https://phabricator.wikimedia.org/T355616) [23:59:45] jouncebot: nowandnext [23:59:45] No deployments scheduled for the next 7 hour(s) and 0 minute(s) [23:59:45] In 7 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700) [23:59:45] In 7 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)