[00:01:57] (PS2) Ryan Kemper: analytics-admins: add xcollazo [puppet] - https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176)
[00:14:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:16:03] SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (Krinkle)
[00:16:06] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:58] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:17:01] SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (Krinkle)
[00:23:32] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Papaul) @Cmjohnson For the once in row A,B and C it looks like the OS is already installed on them so you can use the --no-pxe and --new flags to...
[00:25:38] SRE, Editing-Team-Request, Editing-team, MediaWiki-extensions-Score, and 4 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Krinkle)
[00:31:03] SRE, Observability-Logging, Patch-For-Review, Sustainability (Incident Followup): Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (Krinkle)
[00:31:32] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:33:06] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:38:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:16] !log slowly restarting (with batch 1 sleep 5) trafficserver on text caches to fully deploy g 817086 T313578
[00:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:24] T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578
[00:48:54] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:52:31] ctrl-C halfway through since the dashboard I was watching was showing a scary number of errors
[00:55:18] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,session-c1506.scope,session-c1507.scope,session-c1508.scope,session-c1510.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:55] seems like it
took more like 5 minutes to recover, not 5 seconds
[00:57:42] will do the remaining hosts with sleep 300
[01:17:28] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:23:27] SRE, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1), Sustainability (Incident Followup): Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (lmata)
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:46] (PS1) Krinkle: Disable BounceHandler on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097)
[02:08:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:17:45] (JobUnavailable) firing: (5) Reduced
availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:40] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:23:40] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 136, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:25:36] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:54:12] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[03:20:08] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:26] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:51:14] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal,
13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:12:59] SRE, ops-codfw, Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (calbon) Thanks @Papaul
[04:16:24] (CR) Tim Starling: [C: +2] "I restarted all the ATS services to deploy this, which ended up being slow and painful since it took 5 minutes for each server to recover " [puppet] - https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: Tim Starling)
[04:25:32] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:59:26] (PS1) Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[04:59:53] (CR) CI reject: [V: -1] drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: Marostegui)
[05:00:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 16 hosts with reason: codfw s8 sanitarium master switch
[05:00:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: codfw s8 sanitarium master switch
[05:01:41] (PS2) Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[05:06:31] (PS1) Marostegui: mariadb: db2072 no longer sanitarium master [puppet] -
https://gerrit.wikimedia.org/r/818292 (https://phabricator.wikimedia.org/T311493)
[05:07:51] (CR) Marostegui: [C: +2] mariadb: db2072 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/818292 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:12:15] (PS1) Marostegui: db2090: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/818293 (https://phabricator.wikimedia.org/T314109)
[05:15:27] (CR) Marostegui: [C: +2] db2090: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/818293 (https://phabricator.wikimedia.org/T314109) (owner: Marostegui)
[05:30:00] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:33:56] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:26:56] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002,
[06:26:56] e2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:36:42] (PS1) Marostegui: db2090: No longer candidatem master [puppet] - https://gerrit.wikimedia.org/r/818296 (https://phabricator.wikimedia.org/T314109)
[06:38:34] (CR) Marostegui: [C: +2] db2090: No longer candidatem master [puppet] - https://gerrit.wikimedia.org/r/818296 (https://phabricator.wikimedia.org/T314109) (owner: Marostegui)
[06:39:47] (CR) Slyngshede: Add per node vCPU allocations (1 comment) [debs/prometheus-ganeti-exporter] -
https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:42:56] (PS3) Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818
[06:43:41] (CR) Slyngshede: [C: +2] Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:43:45] (CR) Slyngshede: [V: +2 C: +2] Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32104 and previous config saved to /var/cache/conftool/dbconfig/20220729-065004-root.json
[06:51:37] (PS1) Marostegui: site.pp: Remove insetup from db217[3-4] [puppet] - https://gerrit.wikimedia.org/r/818298 (https://phabricator.wikimedia.org/T311493)
[06:51:44] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002,
[06:51:44] e2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220729T0700)
[07:01:04] (CR) Marostegui: [C: +2] site.pp: Remove insetup from db217[3-4] [puppet] - https://gerrit.wikimedia.org/r/818298 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32105 and previous config saved to /var/cache/conftool/dbconfig/20220729-070509-root.json
[07:16:39] (PS1) Kevin Bazira: ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307)
[07:20:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32106 and previous config saved to /var/cache/conftool/dbconfig/20220729-072013-root.json
[07:25:18] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:27:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:30:34] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:35:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32107 and previous config saved to /var/cache/conftool/dbconfig/20220729-073518-root.json
[07:40:53] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (Vgutierrez) @soworu you haven't submitted a SSH key.
This is ok for analytics-privatedata-users access per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Le...
[07:41:50] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: Ori)
[07:48:06] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:48:33] (PS1) Vgutierrez: admin: Add mraish to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429)
[07:49:59] (CR) CI reject: [V: -1] admin: Add mraish to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[07:50:05] thanks CI
[07:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32108 and previous config saved to /var/cache/conftool/dbconfig/20220729-075023-root.json
[07:59:26] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:00:00] (CR) Jelto: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[08:02:28] SRE, ops-eqiad, DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (fgiunchedi) >>! In T314027#8112820, @Papaul wrote: > @fgiunchedi f1-f4 PDU's are not setup yet Makes sense, thanks for the context Papaul!
[08:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32109 and previous config saved to /var/cache/conftool/dbconfig/20220729-080528-root.json
[08:12:31] !log depool ats-be on cp4026 for debugging purposes
[08:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:58] SRE, LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (Volans) Open→Resolved a:Volans @fnegri thanks for opening the task. I can confirm you're in LDAP `wmf` and I've added you to the #wmf-nda Phabricator group. Resolving, feel free to re-open i...
[08:20:43] (PS1) Filippo Giunchedi: sre: port zookeeper alerts [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[08:21:18] (PS2) Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[08:21:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Volans)
[08:25:14] (CR) Filippo Giunchedi: "I was looking into porting the "zookeeper server is down" alert. I was looking for higher-level zk metrics to indicate that an election ca" [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:28:57] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Volans) @MRaishWMF your user `mikeraish` is already part of the `analytics-privatedata-users` with no SSH key (access to that group can be configured in...
[08:30:34] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Volans)
[08:34:20] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Volans) @sgrabarczuk The access to `analytics-privatedata-users` can be configured in different ways depending on what you need to access. Could you plea...
[08:43:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Aline_Bruenger_WMDE)
[08:43:57] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (WMDE-leszek) I confirm @Aline_Bruenger_WMDE 's identity and approve the request on WMDE's side. Thanks
[08:44:43] SRE, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[08:44:52] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:50:18] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:57] (PS2) MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359)
[08:56:11] (PS1) Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.2 [puppet] -
https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119)
[08:57:41] (CR) MarcoAurelio: Amend license request contact form per Legal (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: MarcoAurelio)
[09:07:34] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:11:01] (CR) MarcoAurelio: Amend license request contact form per Legal (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: MarcoAurelio)
[09:11:54] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[09:12:09] SRE-swift-storage: flip/flop mounting filesystems between systemd and swift-drive-audit - https://phabricator.wikimedia.org/T265450 (MatthewVernon) swift-drive-audit needs to run `systemctl daemon-reload` after making changes to `/etc/fstab`. Thanks, systemd.
[09:13:00] SRE, LDAP-Access-Requests: Grant Access to wmf for Slavina Stefanova - https://phabricator.wikimedia.org/T314122 (Slst2020)
[09:13:30] SRE, SRE-swift-storage: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (MatthewVernon) ...this behaviour has reverted, since we've gone back to using upstream swift-drive-audit, which is a cron.d entry.
[09:14:16] SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) Resolved→Open
[09:15:25] SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai)
[09:17:46] SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) I am reopening this task because it was requested in {T314061}
[09:18:21] SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) a:pwangai→None
[09:18:50] (Restored) Pwangai: admin: Add pwangai to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: Pwangai)
[09:19:07] SRE-swift-storage: swift-drive-audit configuration broken on >= buster - https://phabricator.wikimedia.org/T314123 (MatthewVernon)
[09:25:06] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:10] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:08] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:49:52] (PS1) Vgutierrez: trafficserver: Fix logging template [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651)
[09:51:08] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36502/console" [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[09:55:38] (PS2) Pwangai: admin: Add pwangai to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794)
[09:58:45] (PS1) Marostegui: db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/818431 (https://phabricator.wikimedia.org/T311493)
[09:58:47] SRE, SRE-Access-Requests, Data Engineering Planning, Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (Volans) p:High→Medium a:RKemper→Volans For future reference, there is a process to follow for requesting [[ https://wikitech.wikimed...
[10:01:24] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) 10.6.8 hosts pooled again: * db1111 in s8 * db1132 s1 * db1127 in s7 All of them back to the version that has no urin...
[10:01:27] (CR) Marostegui: [C: +2] db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/818431 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[10:08:15] (PS1) Jbond: C:trafficserver: build yaml structure in puppet instead of erb [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651)
[10:10:34] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Volans) @Aline_Bruenger_WMDE given that the access to `analytics-privatedata-users` can be setup in different ways based on what you need, could you please clar...
[10:11:18] (PS1) Marostegui: instances.yaml: Add db2173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/818437 (https://phabricator.wikimedia.org/T311493)
[10:11:34] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Volans) p:Triage→Medium
[10:12:39] (CR) Marostegui: [C: +2] instances.yaml: Add db2173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/818437 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[10:13:06] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36504/console" [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:13:41] SRE, LDAP-Access-Requests: Grant Access to wmf for Slavina Stefanova - https://phabricator.wikimedia.org/T314122 (Volans) Open→Resolved p:Triage→Medium a:Volans @Slst2020 thanks for opening the task. I can confirm you're in LDAP `wmf `and I've added you to the #WMF-NDA Phabricator gro...
[10:15:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2173 into s1 T311493', diff saved to https://phabricator.wikimedia.org/P32110 and previous config saved to /var/cache/conftool/dbconfig/20220729-101507-marostegui.json
[10:15:13] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[10:15:31] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Marostegui)
[10:24:08] (PS2) Vgutierrez: C:trafficserver: build yaml structure in puppet instead of erb [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:25:12] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36505/console" [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:26:00] (CR) Volans: [C: +2] "Verified user, added to wmf LDAP group. LGTM, thanks for the patch." [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: Pwangai)
[10:28:44] SRE, LDAP-Access-Requests, Patch-For-Review, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (Volans) Open→Resolved a:Volans @pwangai I've added you to the `wmf` LDAP group and #wmf-nda Phabricator project. As an additiona...
[10:29:41] (Abandoned) Vgutierrez: trafficserver: Fix logging template [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[10:30:06] (CR) AikoChou: [C: +1] ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: Kevin Bazira)
[10:33:54] !log disable puppet on cp nodes to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/818436
[10:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:31] (PS1) Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:35:21] (CR) CI reject: [V: -1] statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[10:36:13] (PS2) Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:36:19] * Lucas_WMDE is not a fan of middle-of-the-line alignment
[10:38:20] (CR) Vgutierrez: [V: +1 C: +2] C:trafficserver: build yaml structure in puppet instead of erb [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:42:22] Lucas_WMDE: I know, it's quite annoying and ruins git history
[10:43:37] (PS3) Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:43:49] (PS1) Jbond: check_user: handle situation where user has no organisation [puppet] - https://gerrit.wikimedia.org/r/818441 (https://phabricator.wikimedia.org/T314129)
[10:45:40] (PS4)
Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:45:42] (CR) Michael Große: [C: +1] statistics::wmde::graphite: add max_runtime_seconds [puppet] - https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[10:53:02] (PS1) Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:54:43] (PS2) Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:56:05] (CR) Ladsgroup: drop_cx_translation_translators_T314087.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: Marostegui)
[10:56:17] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36508/console" [puppet] - https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:56:58] (PS3) Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:57:52] (PS3) Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[10:58:06] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36510/console" [puppet] - https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s)
on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:59:42] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] C:trafficserver: filter out logs without ensure == present [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: 10Jbond) [11:03:23] (03CR) 10Jbond: [C: 03+2] check_user: handle situation where user has no organisation [puppet] - 10https://gerrit.wikimedia.org/r/818441 (https://phabricator.wikimedia.org/T314129) (owner: 10Jbond) [11:03:56] !log repool ats-be@cp4026 - T309651 [11:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:01] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [11:04:27] (03CR) 10Ladsgroup: [C: 03+1] "We need to find a way to make it work on x1, I will think about it. Thanks <3" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui) [11:04:44] !log reenable puppet on cp nodes [11:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:07] (03CR) 10Marostegui: [C: 03+2] drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui) [11:06:32] (03Merged) 10jenkins-bot: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui) [11:06:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Aline_Bruenger_WMDE) @Volans For now, the purpose is only pulling numbers which can be done without ssh, but I'll probably be asked to build recurring reports [... 
[11:10:38] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:17:48] (03CR) 10Marostegui: [C: 03+2] drop_cx_translation_translators_T314087.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui) [11:21:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [11:23:27] (03CR) 10Jbond: "just to confirm, no i dont consider this an access request and" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [11:31:31] (03PS1) 10Ssingh: hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) [11:32:15] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:32:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36511/console" [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:33:14] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:34:03] (03CR) 10Vgutierrez: [C: 03+2] hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:36:34] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:37:33] !log update ATS to version 9.1.2 in cp4032 - T309651 [11:37:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:39] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [11:39:21] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10akosiaris) [11:40:37] (03PS1) 10Marostegui: instances.yaml: Remove db2088 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818446 (https://phabricator.wikimedia.org/T313797) [11:41:32] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2088 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818446 (https://phabricator.wikimedia.org/T313797) (owner: 10Marostegui) [11:42:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2088 from dbctl T313797', diff saved to https://phabricator.wikimedia.org/P32111 and previous config saved to /var/cache/conftool/dbconfig/20220729-114203-marostegui.json [11:42:08] T313797: decommission db2088 - https://phabricator.wikimedia.org/T313797 [11:56:46] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:14] 10Puppet, 10Infrastructure-Foundations: decomission puppetmaster[12]00[12] and replace them with puppetmaster[12]00[45] - https://phabricator.wikimedia.org/T314136 (10jbond) p:05Triage→03Medium [12:08:59] (03PS1) 10Jbond: puppetmaster1004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/818449 (https://phabricator.wikimedia.org/T314136) [12:09:16] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster1004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/818449 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [12:11:28] !log dbmaint s3@eqiad T314087 [12:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] T314087: Add primary key and drop unique index on cx_translators on wmf 
wikis - https://phabricator.wikimedia.org/T314087 [12:12:02] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:18:31] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10Ottomata) GRIZZLYYYYY? [12:26:45] 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) Yar, no sorry, I have had zero time to work on this. @JArguello-WMF we should find a sprint to put this into. [12:28:48] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) a:03Papaul Hi, This drive is now unmounted, so can be swapped at your earliest convenience, please :) Thanks! [12:31:09] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Ottomata) @soworu what data are you trying to access? I'm not aware of any usage data from wordpress extensions. Do you plan on [[ https://wikitech.wikimedia.org/wiki/Event... [12:33:56] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10Ottomata) Indeed, this is pretty privileged access. I think that ultimately for this purpose, Xabriel won't need this access, but Xabriel has been... 
[12:35:11] PROBLEM - Check systemd state on puppetmaster1004 is CRITICAL: CRITICAL - degraded: The following units failed: remove_old_puppet_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:26] (03CR) 10Jbond: [C: 03+2] beaker: add initial beaker files [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond) [12:41:00] (03PS1) 10Jbond: P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) [12:42:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36512/console" [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:43:55] (03CR) 10CI reject: [V: 04-1] P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:51:54] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Volans) To add a bit more context I'm asking because your request: > * The specific LDAP group that you want to be added to (optional): > analytics-privatedata-users If for an... [12:59:05] PROBLEM - puppetmaster backend https on puppetmaster1004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:01:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) @Aline_Bruenger_WMDE ok, let's proceed for the simple access for now and you can always request to integrate it later on. @KFrancis could you please co... 
[13:01:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) [13:01:36] jbond: ^^^ (puppetmaster backend) I guess is related to your changes [13:03:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) @Ottomata or @odimitrijevic: your approval is needed as `analytics-privatedata-users` group owner. [13:06:26] !log dbmaint s3@eqiad T314141 [13:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] T314141: Add primary key and drop unique index on translate_tmt wmf wikis - https://phabricator.wikimedia.org/T314141 [13:07:02] !log dbmaint s4@eqiad T314141T314140 [13:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:04] !log dbmaint s4@eqiad T314140 [13:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:10] T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140 [13:07:10] 10SRE-swift-storage, 10ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (10MatthewVernon) [13:07:42] !log dbmaint s8@eqiad T314140 [13:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] (03PS3) 10Volans: analytics-admins: add xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176) (owner: 10Ryan Kemper) [13:09:48] (03CR) 10Volans: [C: 03+2] "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176) (owner: 10Ryan Kemper) [13:10:56] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10Volans) Given the above approval of both 
@WDoranWMF as Xabriel's manager and @Ottomata as group approver I went ahead and merged the above patch to... [13:11:17] (03CR) 10Klausman: [C: 03+2] ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [13:11:32] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [13:11:36] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [13:12:33] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2047.codfw.wmnet with OS bullseye [13:12:40] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2047.codfw.wmnet with OS bullseye [13:12:55] volans: ahh yes can be ignored ill update to disable alerts [13:13:19] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:22] (03Merged) 10jenkins-bot: ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [13:15:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) [13:15:38] (03PS2) 10Jbond: P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) [13:18:30] (03CR) 10CI reject: [V: 04-1] P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 
(https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [13:20:08] (03PS1) 10Jbond: puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455 [13:20:49] (03PS2) 10Jbond: puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455 [13:21:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455 (owner: 10Jbond) [13:22:43] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) Thank you all for taking care of this. [13:24:53] (03PS1) 10Ssingh: hiera: enable ATS9 on cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) [13:25:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36513/console" [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:26:47] (03CR) 10Ssingh: [V: 03+1] "[not to be merged till Tuesday]" [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:26:50] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10EChetty) @RhinosF1 It's currently serving as a critical blocker for @xcollazo on being able to work on failures related to the Data Engineerings ins... [13:27:11] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) Confirmed sudo access: ` xcollazo@an-launcher1002:~$ hostname -f an-launcher1002.eqiad.wmnet xcollazo@an-launcher1002:~$ whoami xcollazo... 
[13:27:46] (03PS1) 10Ssingh: hiera: enable ATS9 on cp6016 [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) [13:27:58] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) 05Open→03Resolved [13:28:47] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36514/console" [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:29:07] (03CR) 10Ssingh: [V: 03+1] "[not to be merged till Tuesday]" [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:31:25] (03PS1) 10BBlack: cp4027: enable manual ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/818460 (https://phabricator.wikimedia.org/T308799) [13:33:24] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2047.codfw.wmnet with reason: host reimage [13:33:50] (03CR) 10BBlack: [C: 03+2] cp4027: enable manual ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/818460 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [13:36:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2047.codfw.wmnet with reason: host reimage [13:48:22] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:02] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:55:04] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:58:20] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) For `D7`, please ping @jbond once done so he can confirm the ms-be* nodes have come back up OK. [13:58:30] (03PS1) 10Jaime Nuche: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 [13:59:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2047.codfw.wmnet with OS bullseye [13:59:11] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2047.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [13:59:38] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) In rack `D2`, ms-fe2012 needs depooling before the power goes, and if you could ping me once the rack is done so I can check all the ms-be* nodes come back up again, that'd... 
[14:00:52] (03PS2) 10Jaime Nuche: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) [14:03:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [14:03:51] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) [14:03:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye [14:04:03] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with OS bullseye [14:04:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with... 
[14:05:54] (03Abandoned) 10Sbisson: Images for Wikipedia Preview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666680 (https://phabricator.wikimedia.org/T273674) (owner: 10Sbisson) [14:06:30] (03PS4) 10Jaime Nuche: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [14:06:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bullseye [14:06:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with OS bullseye [14:07:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [14:09:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bullseye [14:09:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with OS bullseye [14:09:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1188.eqiad.wmnet with OS bullseye [14:09:34] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with OS bullseye [14:09:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS bullseye [14:09:40] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: 
TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with OS bullseye [14:10:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [14:10:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage [14:10:20] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [14:10:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [14:11:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10rook) I think the problem might be with the gateway_ip, it is set to none. Or perhaps it is the allocation pool starts at 17 rather than 18, and if we... [14:11:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [14:12:00] (03CR) 10Jaime Nuche: "These new config files were configured in Puppet to belong to root and specific groups, e.g. local/phd.json belonged to group "phd" (https" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [14:13:30] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10BBlack) Sorry for all the delays on my end as well, but we're getting somewhere. On the earlier points about attachment via varnish vs ats-be: For now, it's just simpl... 
[14:13:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage [14:14:17] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:26] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2043.codfw.wmnet with OS bullseye [14:15:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [14:15:32] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2043.codfw.wmnet with OS bullseye [14:21:37] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:03] (03PS3) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) [14:23:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36516/console" [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [14:23:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS bullseye [14:23:35] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with OS 
bullseye completed: -... [14:26:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1186.eqiad.wmnet with OS bullseye [14:26:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with OS bullseye completed: -... [14:27:26] (03PS1) 10FNegri: Add node16 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/818474 (https://phabricator.wikimedia.org/T310821) [14:28:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:28:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1188.eqiad.wmnet with OS bullseye [14:28:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with OS bullseye completed: -... [14:30:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS bullseye [14:30:25] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with OS bullseye completed: -... 
[14:34:20] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2043.codfw.wmnet with reason: host reimage [14:35:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Cmjohnson) [14:37:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2043.codfw.wmnet with reason: host reimage [14:39:13] !log dbmaint s3@eqiad T314140 [14:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:17] T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140 [14:41:04] 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) A sample of alerts I found while looking for IRC floods from `icinga-wm` (reporting a sample of alert, not repeating the flood here) ` 2022-04-20T... [14:52:38] (03PS2) 10Volans: admin: add sstefanova user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) [14:52:40] (03PS2) 10Volans: admin: add raymond-ndibe user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817843 (https://phabricator.wikimedia.org/T313876) [14:55:09] (03PS4) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) [14:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:58:49] (03CR) 10Volans: [C: 03+2] "I've just inverted the 2 patches as this one can already go." 
[puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) (owner: 10Volans) [14:59:23] !log dbmaint s7@eqiad T314140 [14:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:28] T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140 [15:00:08] PROBLEM - Check systemd state on search-loader1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2043.codfw.wmnet with OS bullseye [15:00:42] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2043.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [15:01:40] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:03] PROBLEM - MariaDB Replica SQL: s7 #page on db1174 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP INDEX tmi_key: check that it exists on query. Default database: metawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:02:14] ^ that's probably me [15:02:15] hello [15:02:21] fixing [15:02:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Volans) @Slst2020 All set. The change has been merged, it will take up to ~30 minutes to propagate. 
After that please verify your access and close this task if it's... [15:02:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P32112 and previous config saved to /var/cache/conftool/dbconfig/20220729-150256-root.json [15:03:00] thanks [15:03:12] let me ack on victorops [15:03:17] jynus: ACKed [15:03:21] ah, ok [15:03:24] thanks [15:03:25] thanks :) [15:03:26] * volans still here if needed [15:03:29] volans: go :P [15:03:40] it's still work time :) [15:03:42] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2030.codfw.wmnet with OS bullseye [15:03:43] * sukhe turns volans to volans|off :) [15:03:48] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2030.codfw.wmnet with OS bullseye [15:03:50] ahahah [15:03:57] (03PS10) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:04:46] should be fixed [15:05:37] RECOVERY - MariaDB Replica SQL: s7 #page on db1174 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:06:00] marostegui: all good! thanks! [15:06:11] thanks! 
[15:07:08] (03PS3) 10Ebernhardson: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 [15:08:02] (03PS1) 10Marostegui: db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818479 (https://phabricator.wikimedia.org/T314154) [15:09:49] (03CR) 10Marostegui: [C: 03+2] db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818479 (https://phabricator.wikimedia.org/T314154) (owner: 10Marostegui) [15:12:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10RobH) [15:12:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10RobH) [15:14:12] here [15:14:26] I was late it seems [15:17:08] Amir1: all good, marostegu.i took care of it [15:17:08] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2030.codfw.wmnet with reason: host reimage [15:19:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2030.codfw.wmnet with reason: host reimage [15:20:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) [15:23:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:23:50] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:26] (03CR) 10JHathaway: [C: 03+1] "I think the rationale makes sense to me, just a couple of questions about the code." 
[puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [15:37:38] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2030.codfw.wmnet with OS bullseye [15:37:45] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2030.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [15:40:17] (03PS1) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) [15:40:39] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2058.codfw.wmnet with OS bullseye [15:40:45] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye [15:44:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for a nit" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse) [15:46:06] (03PS2) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) [15:47:25] (03CR) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse) [15:47:56] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36517/console" [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:48:42] (03CR) 10Majavah: [V: 03+1 C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:49:48] (03CR) 10Majavah: [C: 04-1] hieradata: switch traffic to cloudrabbit1001-3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [15:50:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice, thank you !" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse) [15:50:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) See also {T304478} As noted, I won't be available to coordinate, but someone else is welcome to do the depooling step in my absence (I... [15:55:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2058.codfw.wmnet with reason: host reimage [15:58:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2058.codfw.wmnet with reason: host reimage [16:13:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2058.codfw.wmnet with OS bullseye [16:21:10] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [16:21:13] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye executed with errors: - elastic... [16:28:03] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:27] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2042.codfw.wmnet with OS bullseye [16:30:34] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2042.codfw.wmnet with OS bullseye [16:30:49] (03CR) 10Andrew Bogott: [C: 03+2] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [16:33:23] (03PS1) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503 [16:34:29] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:35:05] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:36:15] (03PS2) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503 [16:37:23] PROBLEM - DNS on db1186.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.3.0 but got 10.65.2.255 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:53] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:41:40] (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503 (owner: 10Andrew Bogott) [16:42:55] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:37] 10SRE, 10ops-codfw, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) 05Open→03Resolved There is a decom task for this server so we can resolve this. [16:48:52] (03PS7) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [16:50:31] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2042.codfw.wmnet with reason: host reimage [16:53:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2042.codfw.wmnet with reason: host reimage [16:54:13] (03CR) 10Dzahn: [C: 03+2] "https://netbox.wikimedia.org/search/?q=+208.80.153.8&obj_type=" [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [16:55:47] PROBLEM - DNS on db1187.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.3.1 but got 10.65.3.0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:57:48] (03CR) 10Dzahn: [C: 03+2] "[authdns1001:~] $ host gitlab-replica-new.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) 
(owner: 10Dzahn) [17:03:55] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:10:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2042.codfw.wmnet with OS bullseye [17:11:15] PROBLEM - DNS on db1188.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.2.255 but got 10.65.3.1 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:15] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2042.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [17:14:33] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdb3 1236 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [17:20:47] 10SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10mpopov) [17:22:03] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:23:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36518/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:24:51] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:25:54] (03CR) 10Ssingh: 
[V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36519/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:32:21] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:32:54] 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) 05Open→03In progress [17:33:28] 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) a:05Vgutierrez→03sgrabarczuk [17:35:06] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Dzahn) 05Open→03In progress a:03Slst2020 [17:35:16] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:35:19] (03PS4) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [17:36:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10Dzahn) 05Open→03In progress a:05Vgutierrez→03MRaishWMF [17:36:51] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Dzahn) 05Open→03In progress a:05Vgutierrez→03soworu [17:38:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Dzahn) 05Open→03In progress a:03Raymond_Ndibe [17:38:07] 
10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Dzahn) 05Open→03In progress a:05Volans→03Jclark-ctr [17:38:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) 05Open→03In progress [17:38:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) a:03odimitrijevic [17:40:05] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) 05Open→03In progress a:03ERayfield [17:41:43] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@85585b0]: (no justification provided) [17:41:46] (03PS4) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:41:52] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@85585b0]: (no justification provided) (duration: 00m 09s) [17:41:58] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:24] 10SRE, 10Platform Engineering, 10Wikimedia-Mailing-lists, 10User-AKlapper: Close / shut down public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Dzahn) 05Open→03Resolved a:03Dzahn Seems like this is done. If the mail already auto-responds with that... 
[17:42:47] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36522/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:47:08] (03PS5) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:47:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2057.codfw.wmnet with OS bullseye [17:47:28] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2057.codfw.wmnet with OS bullseye [17:47:58] (03CR) 10CI reject: [V: 04-1] trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:48:05] 10SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn) [17:48:09] 10SRE, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10Dzahn) [17:48:28] (03PS6) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:50:59] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36525/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:51:58] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - 
https://phabricator.wikimedia.org/T313986 (10Dzahn) 05Open→03Resolved a:03Dzahn This was caused by an incident but the incident is over. There will be a report in the future. [17:55:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [17:55:36] (03PS7) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) [17:56:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36526/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:58:17] 10SRE, 10Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn) [17:59:00] 10SRE, 10Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn) p:05Triage→03Medium [17:59:10] (03CR) 10Ssingh: [V: 03+1] "I was trying to make the output for the !ATS9 hosts to not change at all, fighting ERB in the process and that finally seems to have worke" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:59:32] 10SRE, 10Phabricator, 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn) p:05Triage→03High [17:59:59] 10SRE, 10Phabricator, 10serviceops-radar, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn) [18:00:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - 
https://phabricator.wikimedia.org/T313977 (10rook) The subnet is updated but seeing the same kinds of results: ` openstack subnet show a9439c35-f465-475c-85a0-8e0f0f41ac4d +----------------------... [18:00:16] 10SRE, 10Phabricator, 10serviceops-radar, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn) [18:02:44] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2057.codfw.wmnet with reason: host reimage [18:06:21] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2057.codfw.wmnet with reason: host reimage [18:06:45] 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10Dzahn) p:05Triage→03Medium [18:10:35] 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10Dzahn) For "Apache HTTP on mw" I guess ideally it would be replaced by 2 things: - a paging alert based on "too many mw servers have failed apaches" with som... 
[18:11:49] 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10Dzahn) [18:14:56] (03CR) 10Dzahn: [C: 03+2] phabricator: add phabricator-roots on new phabricator hardware [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:25:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10wiki_willy) a:03Cmjohnson [18:28:34] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2057.codfw.wmnet with OS bullseye [18:28:40] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2057.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [18:32:25] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:30] (03PS1) 10AOkoth: gitlab: add gitlab role to gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) [18:39:03] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36527/" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [18:51:09] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:11] (03CR) 10Dzahn: [C: 03+1] "this looks good to me. it matches looking at gitlab1003 which is currently gitlab-replica.wikimedia.org. 
nothing seems off in the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [19:03:55] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:06:03] (03CR) 10Dzahn: [C: 03+1] "is the plan to enable 'enable_restore' later and have it restore on both replicas or is it enough on one of them? just thinking out loud a" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [19:06:09] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:33] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [19:15:17] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:17:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:22:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:23:07] PROBLEM - 
ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:34:03] RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:50:55] (03CR) 10Dzahn: "got assigned 208.80.153.104. amending" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:51:50] (03PS1) 10Ryan Kemper: 6.8.23-wmf2 search-extra for bullseye [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078) [19:56:35] (03CR) 10Ebernhardson: [C: 03+1] "I'm not entirely sure on the proper debian packaging way to ship the same update for stretch and bullseye, but this seems like it should d" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078) (owner: 10Ryan Kemper) [20:01:13] (03PS5) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [20:03:57] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:59] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36528/" [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [20:07:06] (03PS1) 10Jcrespo: Add json output 
when adding the ?format=json GET parameter [software/pampinus] - 10https://gerrit.wikimedia.org/r/818508 [20:12:02] (03PS3) 10Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.104 [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) [20:12:59] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2029.codfw.wmnet with OS bullseye [20:13:05] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye [20:13:59] RECOVERY - Check systemd state on puppetmaster1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:56] (03CR) 10Dzahn: [C: 03+2] "https://netbox.wikimedia.org/search/?q=+208.80.153.104&obj_type=" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:18:16] !log authdns-update - adding gerrit-replica-new.wikimedia.org [20:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:55] (03CR) 10Dzahn: [C: 03+2] "[authdns1001:~] $ host gerrit-replica-new.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:20:00] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) new in DNS: [authdns1001:~] $ host gerrit-replica-new.wikimedia.org gerrit-replica-new.wikimedia.org has address 208.80.153.104 gerrit-replica-new.wikimedia.or... 
[20:25:49] (03PS4) 10Dzahn: gerrit: add hiera settings for replica to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) [20:26:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2029.codfw.wmnet with reason: host reimage [20:26:51] (03PS5) 10Dzahn: gerrit: add hiera settings and IP for new replica gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) [20:27:49] (03CR) 10Dzahn: [C: 03+2] "We can do this now after we got assigned 208.80.153.104 ( gerrit-replica-new.wikimedia.org.)" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:29:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2029.codfw.wmnet with reason: host reimage [20:30:11] (03CR) 10Dzahn: [C: 03+2] "noop, but it makes sure when the role gets applied it already knows the right IP and won't try to set the wrong one or fail" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:30:39] (03PS2) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [20:34:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:36:27] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:38:17] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:45] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [20:44:39] (03PS3) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [20:45:10] (03CR) 10Dzahn: gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:45:42] (03CR) 10Dzahn: gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:46:07] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:50] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2029.codfw.wmnet with OS bullseye [20:46:55] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR... [20:46:59] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye executed with errors: - elastic... 
[20:54:21] SRE, Gerrit, serviceops, serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn) T313972#8116463
[21:06:13] !log phab1004 - mkdir /srv/repos ; mkdir /srv/dumps
[21:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:18:02] PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:28] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:29:06] (PS1) Dzahn: phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:31:24] (PS2) Dzahn: phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:34:01] (PS3) Dzahn: phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:36:54] (CR) CI reject: [V: -1] phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:37:50] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:39:36] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:03] (CR) Dzahn: [C: +1] aptrepo: update gitlab-ce & gitlab-runner to 15.2 [puppet] - https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119) (owner: Jelto)
[21:42:29] (PS4) Dzahn: phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:43:55] (PS5) Dzahn: phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:49:16] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36530/" [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:50:38] (CR) Dzahn: [V: +1] "doing this in 2 steps will also make the diff smaller when we actually apply the full role" [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:52:49] (CR) Dzahn: [V: +1 C: +2] phabricator: use migration role for pre-syncing data [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:56:09] (CR) Dzahn: [V: +1 C: +2] "noop confirmed on phab1001, phab2001 - adds all the "base" things to phab1004, phab2002" [puppet] - https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:57:00] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2041.codfw.wmnet with OS bullseye
[21:57:06] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2041.codfw.wmnet with OS bullseye
[21:59:34] (PS1) Cwhite: nova_fullstack_test: rename error.stack to stack_trace [puppet] - https://gerrit.wikimedia.org/r/818516
[22:01:12] !log phab1001 - rsync -avp --bwlimit=1000 /srv/repos/ rsync://phab1004.eqiad.wmnet/phabricator-srv-repos (running slowly inside a screen session as root) (T313360, T280597)
[22:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:18] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360
[22:01:18] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[22:02:36] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:09:16] !log findBadBlobs.php nlwiktionary --revisions 22 --mark 'Invalid gzip, T265989'
[22:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:21] T265989: nl.wiktionary.org faces "PHP Warning: gzinflate(): data error" (sometimes with fatal RevisionAccessException) - https://phabricator.wikimedia.org/T265989
[22:17:55] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2041.codfw.wmnet with reason: host reimage
[22:19:21] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36531/phab1001.eqiad.wmnet/index.html does this look expected with all those files not " [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[22:20:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2041.codfw.wmnet with reason: host reimage
[22:21:08] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:25:13] RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2041.codfw.wmnet with OS bullseye
[22:37:16] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2041.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[22:39:06] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:45] !log krinkle@mwmaint1002$ mwscript findBadBlobs.php nlwiktionary; mark 2371 blobs from May 2004 as "Invalid gzip, T265989"
[22:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:51] T265989: nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException) - https://phabricator.wikimedia.org/T265989
[23:03:53] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:27:57] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:56:43] (CR) Cwhite: "Thanks for putting this together!" [alerts] - https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[23:57:17] (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: Andrea Denisse)