[00:01:57] <wikibugs>	 (PS2) Ryan Kemper: analytics-admins: add xcollazo [puppet] - https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176)
[00:14:40] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:16:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (Krinkle)
[00:16:06] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:17:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (Krinkle)
[00:23:32] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Papaul) @Cmjohnson For the once in row A,B and C it looks like the OS is already installed on them so you can use the --no-pxe and --new flags to...
[00:25:38] <wikibugs>	 SRE, Editing-Team-Request, Editing-team, MediaWiki-extensions-Score, and 4 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Krinkle)
[00:31:03] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review, Sustainability (Incident Followup): Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (Krinkle)
[00:31:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:33:06] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:38:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:16] <TimStarling>	 !log slowly restarting (with batch 1 sleep 5) trafficserver on text caches to fully deploy g 817086 T313578
[00:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:24] <stashbot>	 T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578
[00:48:54] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:52:31] <TimStarling>	 ctrl-C halfway through since the dashboard I was watching was showing a scary number of errors
[00:55:18] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,session-c1506.scope,session-c1507.scope,session-c1508.scope,session-c1510.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:55] <TimStarling>	 seems like it took more like 5 minutes to recover, not 5 seconds
[00:57:42] <TimStarling>	 will do the remaining hosts with sleep 300
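Editorial sketch of the batching arithmetic discussed above ("batch 1 sleep 5", then "sleep 300"). This is only a rough illustration, assuming a hypothetical list of eight cache hosts; the function names and host names are invented here and are not the actual tooling used for the restart.

```python
from typing import Iterator, List, Sequence


def batches(hosts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size hosts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(hosts), batch_size):
        yield list(hosts[start:start + batch_size])


def total_sleep_seconds(host_count: int, batch_size: int, sleep_s: int) -> int:
    """Time spent sleeping between batches (no sleep after the last batch)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if host_count == 0:
        return 0
    n_batches = -(-host_count // batch_size)  # ceiling division
    return (n_batches - 1) * sleep_s


# Hypothetical host names, for illustration only.
example_hosts = [f"cache-host-{i}" for i in range(1, 9)]
plan = list(batches(example_hosts, 1))
fast = total_sleep_seconds(len(example_hosts), 1, 5)     # sleep 5
slow = total_sleep_seconds(len(example_hosts), 1, 300)   # sleep 300
```

With one host per batch, the sleep value has to cover the per-host recovery time; a 5 second pause assumes recovery within seconds, while the roughly 5 minute recovery reported above is what motivates the 300 second pause.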
[01:17:28] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:23:27] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1), Sustainability (Incident Followup): Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (lmata)
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:46] <wikibugs>	 (PS1) Krinkle: Disable BounceHandler on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097)
[02:08:40] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:40] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:23:40] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 136, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:25:36] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:54:12] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[03:20:08] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:26] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:51:14] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:12:59] <wikibugs>	 SRE, ops-codfw, Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (calbon) Thanks @Papaul
[04:16:24] <wikibugs>	 (CR) Tim Starling: [C: +2] "I restarted all the ATS services to deploy this, which ended up being slow and painful since it took 5 minutes for each server to recover " [puppet] - https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: Tim Starling)
[04:25:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:59:26] <wikibugs>	 (PS1) Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[04:59:53] <wikibugs>	 (CR) CI reject: [V: -1] drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: Marostegui)
[05:00:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 16 hosts with reason: codfw s8 sanitarium master switch
[05:00:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: codfw s8 sanitarium master switch
[05:01:41] <wikibugs>	 (PS2) Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[05:06:31] <wikibugs>	 (PS1) Marostegui: mariadb: db2072 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/818292 (https://phabricator.wikimedia.org/T311493)
[05:07:51] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: db2072 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/818292 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:12:15] <wikibugs>	 (PS1) Marostegui: db2090: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/818293 (https://phabricator.wikimedia.org/T314109)
[05:15:27] <wikibugs>	 (CR) Marostegui: [C: +2] db2090: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/818293 (https://phabricator.wikimedia.org/T314109) (owner: Marostegui)
[05:30:00] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:33:56] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:26:56] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, 
[06:26:56] <icinga-wm>	 e2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:36:42] <wikibugs>	 (PS1) Marostegui: db2090: No longer candidatem master [puppet] - https://gerrit.wikimedia.org/r/818296 (https://phabricator.wikimedia.org/T314109)
[06:38:34] <wikibugs>	 (CR) Marostegui: [C: +2] db2090: No longer candidatem master [puppet] - https://gerrit.wikimedia.org/r/818296 (https://phabricator.wikimedia.org/T314109) (owner: Marostegui)
[06:39:47] <wikibugs>	 (CR) Slyngshede: Add per node vCPU allocations (1 comment) [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:42:56] <wikibugs>	 (PS3) Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818
[06:43:41] <wikibugs>	 (CR) Slyngshede: [C: +2] Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:43:45] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 (owner: Slyngshede)
[06:50:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32104 and previous config saved to /var/cache/conftool/dbconfig/20220729-065004-root.json
[06:51:37] <wikibugs>	 (PS1) Marostegui: site.pp: Remove insetup from db217[3-4] [puppet] - https://gerrit.wikimedia.org/r/818298 (https://phabricator.wikimedia.org/T311493)
[06:51:44] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, 
[06:51:44] <icinga-wm>	 e2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220729T0700)
[07:01:04] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove insetup from db217[3-4] [puppet] - https://gerrit.wikimedia.org/r/818298 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:05:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32105 and previous config saved to /var/cache/conftool/dbconfig/20220729-070509-root.json
[07:16:39] <wikibugs>	 (PS1) Kevin Bazira: ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307)
[07:20:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32106 and previous config saved to /var/cache/conftool/dbconfig/20220729-072013-root.json
[07:25:18] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:27:54] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:30:34] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:35:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32107 and previous config saved to /var/cache/conftool/dbconfig/20220729-073518-root.json
[07:40:53] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (Vgutierrez) @soworu you haven't submitted a SSH key. This is ok for analytics-privatedata-users access per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Le...
[07:41:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: Ori)
[07:48:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:48:33] <wikibugs>	 (PS1) Vgutierrez: admin: Add mraish to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429)
[07:49:59] <wikibugs>	 (CR) CI reject: [V: -1] admin: Add mraish to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[07:50:05] <vgutierrez>	 thanks CI
[07:50:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32108 and previous config saved to /var/cache/conftool/dbconfig/20220729-075023-root.json
[07:59:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:00:00] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[08:02:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (fgiunchedi) >>! In T314027#8112820, @Papaul wrote: > @fgiunchedi f1-f4 PDU's are not setup yet   Makes sense, thanks for the context Papaul!
[08:05:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32109 and previous config saved to /var/cache/conftool/dbconfig/20220729-080528-root.json
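Editorial sketch of the repooling cadence visible in the dbctl commits above for db1132 (1%, 5%, 10%, 50%, 75%, 100%). The timestamps and percentages are copied from the log lines; the helper name `step_intervals` is invented for illustration and is not part of dbctl or any cookbook.

```python
from datetime import datetime
from typing import List, Sequence, Tuple

# (log timestamp, pooled percentage) as they appear in the dbctl commit messages.
REPOOL_STEPS: List[Tuple[str, int]] = [
    ("06:50:04", 1),
    ("07:05:09", 5),
    ("07:20:14", 10),
    ("07:35:19", 50),
    ("07:50:23", 75),
    ("08:05:28", 100),
]


def step_intervals(steps: Sequence[Tuple[str, int]]) -> List[int]:
    """Return the number of seconds between consecutive repooling steps."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in steps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


intervals = step_intervals(REPOOL_STEPS)
```

The intervals come out at roughly 15 minutes each, so the host went from 1% to fully pooled in about 75 minutes.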
[08:12:31] <vgutierrez>	 !log depool ats-be on cp4026 for debugging purposes
[08:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:58] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (Volans) Open→Resolved a:Volans @fnegri thanks for opening the task. I can confirm you're in LDAP `wmf` and I've added you to the #wmf-nda Phabricator group. Resolving, feel free to re-open i...
[08:20:43] <wikibugs>	 (PS1) Filippo Giunchedi: sre: port zookeeper alerts [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[08:21:18] <wikibugs>	 (PS2) Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[08:21:54] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Volans)
[08:25:14] <wikibugs>	 (CR) Filippo Giunchedi: "I was looking into porting the "zookeeper server is down" alert. I was looking for higher-level zk metrics to indicate that an election ca" [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:28:57] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Volans) @MRaishWMF your user `mikeraish` is already part of the `analytics-privatedata-users` with no SSH key (access to that group can be configured in...
[08:30:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Volans)
[08:34:20] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Volans) @sgrabarczuk The access to `analytics-privatedata-users` can be configured in different ways depending on what you need to access. Could you plea...
[08:43:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Aline_Bruenger_WMDE)
[08:43:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (WMDE-leszek) I confirm @Aline_Bruenger_WMDE 's identity and approve the request on WMDE's side. Thanks
[08:44:43] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[08:44:52] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:50:18] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:57] <wikibugs>	 (PS2) MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359)
[08:56:11] <wikibugs>	 (PS1) Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.2 [puppet] - https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119)
[08:57:41] <wikibugs>	 (CR) MarcoAurelio: Amend license request contact form per Legal (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: MarcoAurelio)
[09:07:34] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:11:01] <wikibugs>	 (CR) MarcoAurelio: Amend license request contact form per Legal (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: MarcoAurelio)
[09:11:54] <icinga-wm>	 RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[09:12:09] <wikibugs>	 SRE-swift-storage: flip/flop mounting filesystems between systemd and swift-drive-audit - https://phabricator.wikimedia.org/T265450 (MatthewVernon) swift-drive-audit needs to run `systemctl daemon-reload` after making changes to `/etc/fstab`. Thanks, systemd.
[09:13:00] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Slavina Stefanova - https://phabricator.wikimedia.org/T314122 (Slst2020)
[09:13:30] <wikibugs>	 SRE, SRE-swift-storage: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (MatthewVernon) ...this behaviour has reverted, since we've gone back to using upstream swift-drive-audit, which is a cron.d entry.
[09:14:16] <wikibugs>	 SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) Resolved→Open
[09:15:25] <wikibugs>	 SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai)
[09:17:46] <wikibugs>	 SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) I am reopening this task because it was requested in {T314061}
[09:18:21] <wikibugs>	 SRE, LDAP-Access-Requests, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) a:pwangai→None
[09:18:50] <wikibugs>	 (Restored) Pwangai: admin: Add pwangai to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: Pwangai)
[09:19:07] <wikibugs>	 SRE-swift-storage: swift-drive-audit configuration broken on >= buster - https://phabricator.wikimedia.org/T314123 (MatthewVernon)
[09:25:06] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:10] <icinga-wm>	 PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:08] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:49:52] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Fix logging template [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651)
[09:51:08] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36502/console" [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[09:55:38] <wikibugs>	 (PS2) Pwangai: admin: Add pwangai to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794)
[09:58:45] <wikibugs>	 (PS1) Marostegui: db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/818431 (https://phabricator.wikimedia.org/T311493)
[09:58:47] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering Planning, Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (Volans) p:High→Medium a:RKemper→Volans For future reference, there is a process to follow for requesting [[ https://wikitech.wikimed...
[10:01:24] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) 10.6.8 hosts pooled again:  * db1111 in s8 * db1132 s1 * db1127 in s7  All of them back to the version that has no urin...
[10:01:27] <wikibugs>	 (CR) Marostegui: [C: +2] db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/818431 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[10:08:15] <wikibugs>	 (PS1) Jbond: C:trafficserver: build yaml structure in puppet instead of erb [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651)
[10:10:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Volans) @Aline_Bruenger_WMDE given that the access to `analytics-privatedata-users` can be setup in different ways based on what you need, could you please clar...
[10:11:18] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/818437 (https://phabricator.wikimedia.org/T311493)
[10:11:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Volans) p:Triage→Medium
[10:12:39] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/818437 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[10:13:06] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36504/console" [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:13:41] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Slavina Stefanova - https://phabricator.wikimedia.org/T314122 (Volans) Open→Resolved p:Triage→Medium a:Volans @Slst2020 thanks for opening the task. I can confirm you're in LDAP `wmf `and I've added you to the #WMF-NDA Phabricator gro...
[10:15:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2173 into s1 T311493', diff saved to https://phabricator.wikimedia.org/P32110 and previous config saved to /var/cache/conftool/dbconfig/20220729-101507-marostegui.json
[10:15:13] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[10:15:31] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Marostegui)
[10:24:08] <wikibugs>	 (PS2) Vgutierrez: C:trafficserver: build yaml structure in puppet instead of erb [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:25:12] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36505/console" [puppet] - https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: Jbond)
[10:26:00] <wikibugs>	 (CR) Volans: [C: +2] "Verified user, added to wmf LDAP group. LGTM, thanks for the patch." [puppet] - https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: Pwangai)
[10:28:44] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review, User-zeljkofilipin: Request for wmf group access for user: pwangai - https://phabricator.wikimedia.org/T313794 (Volans) Open→Resolved a:Volans @pwangai I've added you to the `wmf` LDAP group and #wmf-nda Phabricator project. As an additiona...
[10:29:41] <wikibugs>	 (Abandoned) Vgutierrez: trafficserver: Fix logging template [puppet] - https://gerrit.wikimedia.org/r/818429 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[10:30:06] <wikibugs>	 (CR) AikoChou: [C: +1] ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: Kevin Bazira)
[10:33:54] <vgutierrez>	 !log disable puppet on cp nodes to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/818436
[10:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:31] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:35:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE))
[10:36:13] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:36:19] * Lucas_WMDE is not a fan of middle-of-the-line alignment
[10:38:20] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] C:trafficserver: build yaml structure in puppet instead of erb [puppet] - 10https://gerrit.wikimedia.org/r/818436 (https://phabricator.wikimedia.org/T309651) (owner: 10Jbond)
[10:42:22] <Amir1>	 Lucas_WMDE: I know, it's quite annoying and ruins git history
[10:43:37] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:43:49] <wikibugs>	 (03PS1) 10Jbond: check_user: handle situation where user has no organisation [puppet] - 10https://gerrit.wikimedia.org/r/818441 (https://phabricator.wikimedia.org/T314129)
[10:45:40] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130)
[10:45:42] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE))
[10:53:02] <wikibugs>	 (03PS1) 10Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:54:43] <wikibugs>	 (03PS2) 10Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:56:05] <wikibugs>	 (03CR) 10Ladsgroup: drop_cx_translation_translators_T314087.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui)
[10:56:17] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36508/console" [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: 10Jbond)
[10:56:58] <wikibugs>	 (03PS3) 10Jbond: C:trafficserver: filter out logs without ensure == present [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651)
[10:57:52] <wikibugs>	 (03PS3) 10Marostegui: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087)
[10:58:06] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36510/console" [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: 10Jbond)
[10:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:59:42] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] C:trafficserver: filter out logs without ensure == present [puppet] - 10https://gerrit.wikimedia.org/r/818442 (https://phabricator.wikimedia.org/T309651) (owner: 10Jbond)
[11:03:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] check_user: handle situation where user has no organisation [puppet] - 10https://gerrit.wikimedia.org/r/818441 (https://phabricator.wikimedia.org/T314129) (owner: 10Jbond)
[11:03:56] <vgutierrez>	 !log repool ats-be@cp4026 - T309651
[11:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:01] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[11:04:27] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "We need to find a way to make it work on x1, I will think about it. Thanks <3" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui)
[11:04:44] <vgutierrez>	 !log reenable puppet on cp nodes
[11:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui)
[11:06:32] <wikibugs>	 (03Merged) 10jenkins-bot: drop_cx_translation_translators_T314087.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui)
[11:06:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Aline_Bruenger_WMDE) @Volans For now, the purpose is only pulling numbers which can be done without ssh, but I'll probably be asked to build recurring reports [...
[11:10:38] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:17:48] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] drop_cx_translation_translators_T314087.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/818291 (https://phabricator.wikimedia.org/T314087) (owner: 10Marostegui)
[11:21:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[11:23:27] <wikibugs>	 (03CR) 10Jbond: "just to confirm, no i dont consider this an access request and" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[11:31:31] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651)
[11:32:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:32:27] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36511/console" [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:33:14] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:34:03] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hiera: enable ATS9 on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/818445 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:36:34] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:37:33] <vgutierrez>	 !log update ATS to version 9.1.2 in cp4032 - T309651
[11:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:39] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[11:39:21] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10akosiaris)
[11:40:37] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2088 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818446 (https://phabricator.wikimedia.org/T313797)
[11:41:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2088 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818446 (https://phabricator.wikimedia.org/T313797) (owner: 10Marostegui)
[11:42:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2088 from dbctl T313797', diff saved to https://phabricator.wikimedia.org/P32111 and previous config saved to /var/cache/conftool/dbconfig/20220729-114203-marostegui.json
[11:42:08] <stashbot>	 T313797: decommission db2088 - https://phabricator.wikimedia.org/T313797
[11:56:46] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:14] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: decomission puppetmaster[12]00[12] and replace them with puppetmaster[12]00[45] - https://phabricator.wikimedia.org/T314136 (10jbond) p:05Triage→03Medium
[12:08:59] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/818449 (https://phabricator.wikimedia.org/T314136)
[12:09:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster1004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/818449 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond)
[12:11:28] <marostegui>	 !log dbmaint s3@eqiad T314087
[12:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:34] <stashbot>	 T314087: Add primary key and drop unique index on cx_translators on wmf wikis - https://phabricator.wikimedia.org/T314087
[12:12:02] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:18:31] <wikibugs>	 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10Ottomata) GRIZZLYYYYY?
[12:26:45] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) Yar, no sorry, I have had zero time to work on this.  @JArguello-WMF we should find a sprint to put this into.
[12:28:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) a:03Papaul Hi, This drive is now unmounted, so can be swapped at your earliest convenience, please :) Thanks!
[12:31:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Ottomata) @soworu what data are you trying to access?  I'm not aware of any usage data from wordpress extensions.  Do you plan on [[ https://wikitech.wikimedia.org/wiki/Event...
[12:33:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10Ottomata) Indeed, this is pretty privileged access.  I think that ultimately for this purpose, Xabriel won't need this access, but Xabriel has been...
[12:35:11] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1004 is CRITICAL: CRITICAL - degraded: The following units failed: remove_old_puppet_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] beaker: add initial beaker files [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond)
[12:41:00] <wikibugs>	 (03PS1) 10Jbond: P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067)
[12:42:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36512/console" [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond)
[12:43:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond)
[12:51:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Volans) To add a bit more context I'm asking because your request:  > * The specific LDAP group that you want to be added to (optional): > analytics-privatedata-users  If for an...
[12:59:05] <icinga-wm>	 PROBLEM - puppetmaster backend https on puppetmaster1004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:01:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) @Aline_Bruenger_WMDE ok, let's proceed for the simple access for now and you can always request to integrate it later on.  @KFrancis could you please co...
[13:01:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans)
[13:01:36] <volans>	 jbond: ^^^ (puppetmaster backend) I guess it's related to your changes
[13:03:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans) @Ottomata or @odimitrijevic: your approval is needed as `analytics-privatedata-users` group owner.
[13:06:26] <marostegui>	 !log dbmaint s3@eqiad T314141
[13:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:30] <stashbot>	 T314141: Add primary key and drop unique index on translate_tmt wmf wikis - https://phabricator.wikimedia.org/T314141
[13:07:02] <marostegui>	 !log dbmaint s4@eqiad T314141T314140
[13:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:04] <marostegui>	 !log dbmaint s4@eqiad T314140
[13:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:10] <stashbot>	 T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140
[13:07:10] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (10MatthewVernon)
[13:07:42] <marostegui>	 !log dbmaint s8@eqiad T314140
[13:07:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:53] <wikibugs>	 (03PS3) 10Volans: analytics-admins: add xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176) (owner: 10Ryan Kemper)
[13:09:48] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176) (owner: 10Ryan Kemper)
[13:10:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10Volans) Given the above approval of both @WDoranWMF as Xabriel's manager and @Ottomata as group approver I went ahead and merged the above patch to...
[13:11:17] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[13:11:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[13:11:36] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[13:12:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2047.codfw.wmnet with OS bullseye
[13:12:40] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2047.codfw.wmnet with OS bullseye
[13:12:55] <jbond>	 volans: ahh yes, can be ignored, I'll update to disable alerts
[13:13:19] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:15:22] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Add vi & wikidata wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818394 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[13:15:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Volans)
[13:15:38] <wikibugs>	 (03PS2) 10Jbond: P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067)
[13:18:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:base: add stages [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond)
[13:20:08] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455
[13:20:49] <wikibugs>	 (03PS2) 10Jbond: puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455
[13:21:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster1004: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818455 (owner: 10Jbond)
[13:22:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) Thank you all for taking care of this.
[13:24:53] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable ATS9 on cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651)
[13:25:46] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36513/console" [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:26:47] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "[not to be merged till Tuesday]" [puppet] - 10https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:26:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10EChetty) @RhinosF1 It's currently serving as a critical blocker for @xcollazo on being able to work on failures related to the Data Engineerings ins...
[13:27:11] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) Confirmed sudo access:   ` xcollazo@an-launcher1002:~$ hostname -f an-launcher1002.eqiad.wmnet xcollazo@an-launcher1002:~$ whoami xcollazo...
[13:27:46] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable ATS9 on cp6016 [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651)
[13:27:58] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Patch-For-Review: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10xcollazo) 05Open→03Resolved
[13:28:47] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36514/console" [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:29:07] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "[not to be merged till Tuesday]" [puppet] - 10https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:31:25] <wikibugs>	 (03PS1) 10BBlack: cp4027: enable manual ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/818460 (https://phabricator.wikimedia.org/T308799)
[13:33:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2047.codfw.wmnet with reason: host reimage
[13:33:50] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cp4027: enable manual ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/818460 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack)
[13:36:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2047.codfw.wmnet with reason: host reimage
[13:48:22] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:02] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:55:04] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:58:20] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) For `D7`, please ping @jbond once done so he can confirm the ms-be* nodes have come back up OK.
[13:58:30] <wikibugs>	 (03PS1) 10Jaime Nuche: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465
[13:59:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2047.codfw.wmnet with OS bullseye
[13:59:11] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2047.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[13:59:38] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon) In rack `D2`, ms-fe2012 needs depooling before the power goes, and if you could ping me once the rack is done so I can check all the ms-be* nodes come back up again, that'd...
[14:00:52] <wikibugs>	 (03PS2) 10Jaime Nuche: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950)
[14:03:49] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
[14:03:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10MatthewVernon)
[14:03:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye
[14:04:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with OS bullseye
[14:04:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with...
[14:05:54] <wikibugs>	 (03Abandoned) 10Sbisson: Images for Wikipedia Preview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666680 (https://phabricator.wikimedia.org/T273674) (owner: 10Sbisson)
[14:06:30] <wikibugs>	 (03PS4) 10Jaime Nuche: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[14:06:45] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bullseye
[14:06:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with OS bullseye
[14:07:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage
[14:09:13] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bullseye
[14:09:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with OS bullseye
[14:09:26] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1188.eqiad.wmnet with OS bullseye
[14:09:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with OS bullseye
[14:09:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS bullseye
[14:09:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with OS bullseye
[14:10:09] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage
[14:10:18] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage
[14:10:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage
[14:10:30] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[14:11:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10rook) I think the problem might be with the gateway_ip, it is set to none. Or perhaps it is the allocation pool starts at 17 rather than 18, and if we...
[14:11:08] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage
[14:12:00] <wikibugs>	 (03CR) 10Jaime Nuche: "These new config files were configured in Puppet to belong to root and specific groups, e.g. local/phd.json belonged to group "phd" (https" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[14:13:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10BBlack)  Sorry for all the delays on my end as well, but we're getting somewhere.  On the earlier points about attachment via varnish vs ats-be: For now, it's just simpl...
[14:13:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage
[14:14:17] <icinga-wm>	 PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2043.codfw.wmnet with OS bullseye
[14:15:26] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[14:15:32] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2043.codfw.wmnet with OS bullseye
[14:21:37] <icinga-wm>	 RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:03] <wikibugs>	 (03PS3) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067)
[14:23:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36516/console" [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond)
[14:23:26] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS bullseye
[14:23:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with OS bullseye completed: -...
[14:26:27] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1186.eqiad.wmnet with OS bullseye
[14:26:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with OS bullseye completed: -...
[14:27:26] <wikibugs>	 (03PS1) 10FNegri: Add node16 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/818474 (https://phabricator.wikimedia.org/T310821)
[14:28:05] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:28:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1188.eqiad.wmnet with OS bullseye
[14:28:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with OS bullseye completed: -...
[14:30:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS bullseye
[14:30:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with OS bullseye completed: -...
[14:34:20] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2043.codfw.wmnet with reason: host reimage
[14:35:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Cmjohnson)
[14:37:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2043.codfw.wmnet with reason: host reimage
[14:39:13] <marostegui>	 !log dbmaint s3@eqiad T314140
[14:39:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:17] <stashbot>	 T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140
[14:41:04] <wikibugs>	 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) A sample of alerts I found while looking for IRC floods from `icinga-wm` (reporting a sample of alert, not repeating the flood here)  ` 2022-04-20T...
[14:52:38] <wikibugs>	 (03PS2) 10Volans: admin: add sstefanova user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934)
[14:52:40] <wikibugs>	 (03PS2) 10Volans: admin: add raymond-ndibe user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817843 (https://phabricator.wikimedia.org/T313876)
[14:55:09] <wikibugs>	 (03PS4) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067)
[14:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:58:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "I've just inverted the 2 patches as this one can already go." [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) (owner: 10Volans)
[14:59:23] <marostegui>	 !log dbmaint s7@eqiad T314140
[14:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:28] <stashbot>	 T314140: Add primary key and drop unique index on translate_messageindex on wmf wikis - https://phabricator.wikimedia.org/T314140
[15:00:08] <icinga-wm>	 PROBLEM - Check systemd state on search-loader1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:35] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2043.codfw.wmnet with OS bullseye
[15:00:42] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2043.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[15:01:40] <icinga-wm>	 PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:03] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 #page on db1174 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP INDEX tmi_key: check that it exists on query. Default database: metawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:02:14] <marostegui>	 ^ that's probably me
[15:02:15] <sukhe>	 hello
[15:02:21] <marostegui>	 fixing
[15:02:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Volans) @Slst2020 All set. The change has been merged, it will take up to ~30 minutes to propagate. After that please verify your access and close this task if it's...
[15:02:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P32112 and previous config saved to /var/cache/conftool/dbconfig/20220729-150256-root.json
[15:03:00] <sukhe>	 thanks
[15:03:12] <jynus>	 let me ack on victorops
[15:03:17] <sukhe>	 jynus: ACKed
[15:03:21] <jynus>	 ah, ok
[15:03:24] <marostegui>	 thanks
[15:03:25] <sukhe>	 thanks :)
[15:03:26] * volans still here if needed
[15:03:29] <sukhe>	 volans: go :P
[15:03:40] <volans>	 it's still work time :)
[15:03:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2030.codfw.wmnet with OS bullseye
[15:03:43] * sukhe turns volans to volans|off :)
[15:03:48] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2030.codfw.wmnet with OS bullseye
[15:03:50] <volans>	 ahahah
[15:03:57] <wikibugs>	 (03PS10) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:04:46] <marostegui>	 should be fixed
[15:05:37] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s7 #page on db1174 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:06:00] <sukhe>	 marostegui: all good! thanks!
[15:06:11] <marostegui>	 thanks!
[15:07:08] <wikibugs>	 (03PS3) 10Ebernhardson: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004
[15:08:02] <wikibugs>	 (03PS1) 10Marostegui: db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818479 (https://phabricator.wikimedia.org/T314154)
[15:09:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818479 (https://phabricator.wikimedia.org/T314154) (owner: 10Marostegui)
[15:12:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10RobH)
[15:12:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10RobH)
[15:14:12] <Amir1>	 here
[15:14:26] <Amir1>	 I was late it seems
[15:17:08] <sukhe>	 Amir1: all good, marostegu.i took care of it
[15:17:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2030.codfw.wmnet with reason: host reimage
[15:19:48] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2030.codfw.wmnet with reason: host reimage
[15:20:28] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH)
[15:23:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[15:23:50] <icinga-wm>	 RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:26] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "I think the rationale makes sense to me, just a couple of questions about the code." [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond)
[15:37:38] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2030.codfw.wmnet with OS bullseye
[15:37:45] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2030.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[15:40:17] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162)
[15:40:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2058.codfw.wmnet with OS bullseye
[15:40:45] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye
[15:44:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for a nit" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse)
[15:46:06] <wikibugs>	 (03PS2) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162)
[15:47:25] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Add regex that matches the netmon instances to get certs from Acme Chief (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse)
[15:47:56] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36517/console" [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:48:42] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:49:48] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] hieradata: switch traffic to cloudrabbit1001-3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[15:50:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice, thank you !" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse)
[15:50:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) See also {T304478}  As noted, I won't be available to coordinate, but someone else is welcome to do the depooling step in my absence (I...
[15:55:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2058.codfw.wmnet with reason: host reimage
[15:58:24] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2058.codfw.wmnet with reason: host reimage
[16:13:19] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:21:05] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2058.codfw.wmnet with OS bullseye
[16:21:10] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[16:21:13] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2058.codfw.wmnet with OS bullseye executed with errors: - elastic...
[16:28:03] <icinga-wm>	 PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2042.codfw.wmnet with OS bullseye
[16:30:34] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2042.codfw.wmnet with OS bullseye
[16:30:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[16:33:23] <wikibugs>	 (03PS1) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503
[16:34:29] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:35:05] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:36:15] <wikibugs>	 (03PS2) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503
[16:37:23] <icinga-wm>	 PROBLEM - DNS on db1186.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.3.0 but got 10.65.2.255 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:40:53] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[16:41:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: add cname records for rabbitmq in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/818503 (owner: 10Andrew Bogott)
[16:42:55] <icinga-wm>	 RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:37] <wikibugs>	 10SRE, 10ops-codfw, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) 05Open→03Resolved There is a decom task for this server so we can resolve this.
[16:48:52] <wikibugs>	 (03PS7) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[16:50:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2042.codfw.wmnet with reason: host reimage
[16:53:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2042.codfw.wmnet with reason: host reimage
[16:54:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://netbox.wikimedia.org/search/?q=+208.80.153.8&obj_type=" [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[16:55:47] <icinga-wm>	 PROBLEM - DNS on db1187.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.3.1 but got 10.65.3.0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:57:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[authdns1001:~] $ host gitlab-replica-new.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[17:03:55] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:10:20] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2042.codfw.wmnet with OS bullseye
[17:11:15] <icinga-wm>	 PROBLEM - DNS on db1188.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.2.255 but got 10.65.3.1 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:11:15] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2042.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[17:14:33] <icinga-wm>	 PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdb3 1236 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[17:20:47] <wikibugs>	 10SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10mpopov)
[17:22:03] <wikibugs>	 (03PS1) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:23:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36518/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:24:51] <wikibugs>	 (03PS2) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:25:54] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36519/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:32:21] <wikibugs>	 (03PS3) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:32:54] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) 05Open→03In progress
[17:33:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) a:05Vgutierrez→03sgrabarczuk
[17:35:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Dzahn) 05Open→03In progress a:03Slst2020
[17:35:16] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:35:19] <wikibugs>	 (03PS4) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[17:36:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10Dzahn) 05Open→03In progress a:05Vgutierrez→03MRaishWMF
[17:36:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Dzahn) 05Open→03In progress a:05Vgutierrez→03soworu
[17:38:00] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Dzahn) 05Open→03In progress a:03Raymond_Ndibe
[17:38:07] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Dzahn) 05Open→03In progress a:05Volans→03Jclark-ctr
[17:38:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) 05Open→03In progress
[17:38:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) a:03odimitrijevic
[17:40:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) 05Open→03In progress a:03ERayfield
[17:41:43] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@85585b0]: (no justification provided)
[17:41:46] <wikibugs>	 (03PS4) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:41:52] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@85585b0]: (no justification provided) (duration: 00m 09s)
[17:41:58] <icinga-wm>	 PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:24] <wikibugs>	 10SRE, 10Platform Engineering, 10Wikimedia-Mailing-lists, 10User-AKlapper: Close / shut down public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Dzahn) 05Open→03Resolved a:03Dzahn Seems like this is done. If the mail already auto-responds with that...
[17:42:47] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36522/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:47:08] <wikibugs>	 (03PS5) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:47:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2057.codfw.wmnet with OS bullseye
[17:47:28] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2057.codfw.wmnet with OS bullseye
[17:47:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:48:05] <wikibugs>	 10SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn)
[17:48:09] <wikibugs>	 10SRE, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10Dzahn)
[17:48:28] <wikibugs>	 (03PS6) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:50:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36525/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:51:58] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10Dzahn) 05Open→03Resolved a:03Dzahn This was caused by an incident but the incident is over. There will be a report in the future.
[17:55:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10Dzahn) 05Open→03In progress p:05Triage→03Medium
[17:55:36] <wikibugs>	 (03PS7) 10Ssingh: trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651)
[17:56:25] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36526/console" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:58:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn)
[17:59:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (10Dzahn) p:05Triage→03Medium
[17:59:10] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "I was trying to make the output for the !ATS9 hosts to not change at all, fighting ERB in the process and that finally seems to have worke" [puppet] - 10https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:59:32] <wikibugs>	 10SRE, 10Phabricator, 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn) p:05Triage→03High
[17:59:59] <wikibugs>	 10SRE, 10Phabricator, 10serviceops-radar, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn)
[18:00:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10rook) The subnet is updated but seeing the same kinds of results:  ` openstack subnet show a9439c35-f465-475c-85a0-8e0f0f41ac4d +----------------------...
[18:00:16] <wikibugs>	 10SRE, 10Phabricator, 10serviceops-radar, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Dzahn)
[18:02:44] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2057.codfw.wmnet with reason: host reimage
[18:06:21] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2057.codfw.wmnet with reason: host reimage
[18:06:45] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10Dzahn) p:05Triage→03Medium
[18:10:35] <wikibugs>	 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10Dzahn) For "Apache HTTP on mw" I guess  ideally it would be replaced by 2 things:  - a paging alert based on "too many mw servers have failed apaches" with som...
[18:11:49] <wikibugs>	 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10Dzahn)
[18:14:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: add phabricator-roots on new phabricator hardware [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[18:25:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10wiki_willy) a:03Cmjohnson
[18:28:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2057.codfw.wmnet with OS bullseye
[18:28:40] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2057.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[18:32:25] <icinga-wm>	 PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:30] <wikibugs>	 (03PS1) 10AOkoth: gitlab: add gitlab role to gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713)
[18:39:03] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36527/" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[18:51:09] <icinga-wm>	 RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "this looks good to me. it matches looking at gitlab1003 which is currently gitlab-replica.wikimedia.org. nothing seems off in the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[19:03:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:06:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "is the plan to enable 'enable_restore' later and have it restore on both replicas or is it enough on one of them? just thinking out loud a" [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[19:06:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:33] <wikibugs>	 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn)
[19:15:17] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:17:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:22:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:23:07] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:34:03] <icinga-wm>	 RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:19] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:50:55] <wikibugs>	 (03CR) 10Dzahn: "got assigned 208.80.153.104. amending" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[19:51:50] <wikibugs>	 (03PS1) 10Ryan Kemper: 6.8.23-wmf2 search-extra for bullseye [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078)
[19:56:35] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "I'm not entirely sure on the proper debian packaging way to ship the same update for stretch and bullseye, but this seems like it should d" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078) (owner: 10Ryan Kemper)
[20:01:13] <wikibugs>	 (03PS5) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[20:03:57] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:04:59] <wikibugs>	 (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36528/" [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[20:07:06] <wikibugs>	 (03PS1) 10Jcrespo: Add json output when adding the ?format=json GET parameter [software/pampinus] - 10https://gerrit.wikimedia.org/r/818508
[20:12:02] <wikibugs>	 (03PS3) 10Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.104 [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250)
[20:12:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2029.codfw.wmnet with OS bullseye
[20:13:05] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye
[20:13:59] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://netbox.wikimedia.org/search/?q=+208.80.153.104&obj_type=" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:18:16] <mutante>	 !log authdns-update - adding gerrit-replica-new.wikimedia.org
[20:18:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[authdns1001:~] $ host gerrit-replica-new.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:20:00] <wikibugs>	 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) new in DNS:  [authdns1001:~] $ host gerrit-replica-new.wikimedia.org gerrit-replica-new.wikimedia.org has address 208.80.153.104 gerrit-replica-new.wikimedia.or...
[20:25:49] <wikibugs>	 (03PS4) 10Dzahn: gerrit: add hiera settings for replica to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250)
[20:26:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2029.codfw.wmnet with reason: host reimage
[20:26:51] <wikibugs>	 (03PS5) 10Dzahn: gerrit: add hiera settings and IP for new replica gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250)
[20:27:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "We can do this now after we got assigned 208.80.153.104 ( gerrit-replica-new.wikimedia.org.)" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:29:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2029.codfw.wmnet with reason: host reimage
[20:30:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop, but it makes sure when the role gets applied it already knows the right IP and won't try to set the wrong one or fail" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:30:39] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250)
[20:34:41] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:36:27] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:38:17] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:43:45] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[20:44:39] <wikibugs>	 (03PS3) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250)
[20:45:10] <wikibugs>	 (03CR) 10Dzahn: gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:45:42] <wikibugs>	 (03CR) 10Dzahn: gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:46:07] <icinga-wm>	 RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:50] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2029.codfw.wmnet with OS bullseye
[20:46:55] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[20:46:59] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2029.codfw.wmnet with OS bullseye executed with errors: - elastic...
[20:54:21] <wikibugs>	 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) T313972#8116463
[21:06:13] <mutante>	 !log phab1004 - mkdir /srv/repos ; mkdir /srv/dumps
[21:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:18:02] <icinga-wm>	 PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:28] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:29:06] <wikibugs>	 (03PS1) 10Dzahn: phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:31:24] <wikibugs>	 (03PS2) 10Dzahn: phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:34:01] <wikibugs>	 (03PS3) 10Dzahn: phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:36:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:37:50] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:39:36] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 15.2 [puppet] - 10https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119) (owner: 10Jelto)
[21:42:29] <wikibugs>	 (03PS4) 10Dzahn: phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:43:55] <wikibugs>	 (03PS5) 10Dzahn: phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360)
[21:49:16] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36530/" [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:50:38] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "doing this in 2 steps will also make the diff smaller when we actually apply the full role" [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:52:49] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: use migration role for pre-syncing data [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:56:09] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on phab1001, phab2001 - adds all the "base" things to phab1004, phab2002" [puppet] - 10https://gerrit.wikimedia.org/r/818513 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:57:00] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2041.codfw.wmnet with OS bullseye
[21:57:06] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2041.codfw.wmnet with OS bullseye
[21:59:34] <wikibugs>	 (03PS1) 10Cwhite: nova_fullstack_test: rename error.stack to stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/818516
[22:01:12] <mutante>	 !log phab1001 - rsync -avp --bwlimit=1000 /srv/repos/ rsync://phab1004.eqiad.wmnet/phabricator-srv-repos (running slowly inside a screen session as root)  (T313360, T280597)
[22:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:18] <stashbot>	 T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360
[22:01:18] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[22:02:36] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:09:16] <Krinkle>	 !log findBadBlobs.php nlwiktionary --revisions 22 --mark 'Invalid gzip, T265989'
[22:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:21] <stashbot>	 T265989: nl.wiktionary.org faces "PHP Warning: gzinflate(): data error" (sometimes with fatal RevisionAccessException) - https://phabricator.wikimedia.org/T265989
[22:17:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2041.codfw.wmnet with reason: host reimage
[22:19:21] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36531/phab1001.eqiad.wmnet/index.html does this look expected with all those files not " [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche)
[22:20:35] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2041.codfw.wmnet with reason: host reimage
[22:21:08] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:25:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2041.codfw.wmnet with OS bullseye
[22:37:16] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2041.codfw.wmnet with OS bullseye completed: - elastic2047 (**WAR...
[22:39:06] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:45] <Krinkle>	 !log krinkle@mwmaint1002$ mwscript findBadBlobs.php nlwiktionary; mark 2371 blobs from May 2004 as "Invalid gzip, T265989"
[22:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:51] <stashbot>	 T265989: nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException) - https://phabricator.wikimedia.org/T265989
[23:03:53] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:27:57] <icinga-wm>	 PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:56:43] <wikibugs>	 (03CR) 10Cwhite: "Thanks for putting this together!" [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[23:57:17] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse)