[00:03:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:04:03] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047610 (owner: TrainBranchBot)
[00:16:37] <icinga-wm_>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2024-06-11 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:19:35] <icinga-wm_>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2024-06-11 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:22:35] <icinga-wm_>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2024-06-11 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:37] <icinga-wm_>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2024-06-11 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:25:01] <wikibugs>	 ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9908152 (Papaul) Your replacement part associated with RMA R200519348 Item # 100 has been successfully shipped. Details of which are provided below. Tracking URL: https://wwwapps.ups.com/We...
[01:37:09] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 80%, RTA = 30.32 ms
[01:38:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367856)', diff saved to https://phabricator.wikimedia.org/P65212 and previous config saved to /var/cache/conftool/dbconfig/20240620-013827-marostegui.json
[01:38:33] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:38:55] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[16:30:40] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bookworm
[16:33:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: post T365987 repool', diff saved to https://phabricator.wikimedia.org/P65256 and previous config saved to /var/cache/conftool/dbconfig/20240620-163348-arnaudb.json
[16:33:57] <stashbot>	 T365987: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987
[16:34:22] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:36:57] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810#9910823 (herron) >>! In T253810#9485013, @fgiunchedi wrote: > `ipmi_exporter` now has support to collect generic SEL entries and export metrics from those...
[16:37:21] <wikibugs>	 (CR) EoghanGaffney: [V:+1 C:+2] lists: Untaint exim domain variable [puppet] - https://gerrit.wikimedia.org/r/1048027 (https://phabricator.wikimedia.org/T368063) (owner: EoghanGaffney)
[16:37:36] <wikibugs>	 (PS73) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:43:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368065#9910861 (VRiley-WMF) a:VRiley-WMF
[16:43:56] <wikibugs>	 (PS74) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:44:26] <wikibugs>	 (CR) CI reject: [V:-1] prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[16:48:26] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:48:47] <jinxer-wm>	 FIRING: ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1013-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: post T365987 repool', diff saved to https://phabricator.wikimedia.org/P65257 and previous config saved to /var/cache/conftool/dbconfig/20240620-164857-arnaudb.json
[16:50:48] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:07] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:53:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9910877 (RKemper) Host has been downtimed. Accidentally associated to wrong ticket: https://phabricator.wikimedia.org/T367825#9908323
[17:00:05] <jouncebot>	 bd808: #bothumor I � Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1700)
[17:07:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9910948 (bking) Thanks @Jhancock.wm ! If you or anyone else reading this is interested in alerting based on SEL errors, I created T367790 , feel free to add yo...
[17:08:55] <wikibugs>	 (PS1) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:09:19] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:14:07] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[17:15:23] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[17:15:26] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:17:24] <wikibugs>	 (PS1) Andrew Bogott: hieradata: Move cloudvirt1053 to OVS [puppet] - https://gerrit.wikimedia.org/r/1048041 (https://phabricator.wikimedia.org/T364457)
[17:19:25] <wikibugs>	 (PS2) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:19:48] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:20:53] <wikibugs>	 (PS3) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:22:30] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3002/console" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:24:07] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3003/console" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:24:50] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:25:02] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3004/co" [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:25:54] <wikibugs>	 (CR) BCornwall: [V:+1 C:+1] haproxy:cache: raise max uri len to 2048B [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:26:22] <wikibugs>	 (PS4) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:29:15] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 24.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:29:31] <wikibugs>	 (PS5) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:30:20] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bookworm
[17:30:40] <wikibugs>	 (CR) Andrew Bogott: [C:+2] hieradata: Move cloudvirt1053 to OVS [puppet] - https://gerrit.wikimedia.org/r/1048041 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott)
[17:30:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65258 and previous config saved to /var/cache/conftool/dbconfig/20240620-173050-marostegui.json
[17:30:56] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:33:21] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:37:47] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 60%, RTA = 0.41 ms
[17:38:26] <wikibugs>	 ops-eqiad, cloud-services-team, DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093 (Andrew) NEW
[17:40:37] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:12] <wikibugs>	 (PS6) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:41:16] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[17:41:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[17:41:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T352010)', diff saved to https://phabricator.wikimedia.org/P65259 and previous config saved to /var/cache/conftool/dbconfig/20240620-174125-ladsgroup.json
[17:41:31] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:43:36] <wikibugs>	 (PS3) Ladsgroup: prometheus: Change footer icon ping url [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190)
[17:43:40] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] prometheus: Change footer icon ping url [puppet] - https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: Ladsgroup)
[17:44:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2088.codfw.wmnet
[17:44:40] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy:cache: raise max uri len to 2048B [puppet] - https://gerrit.wikimedia.org/r/1048024 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[17:44:53] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 90%, RTA = 33.83 ms
[17:44:56] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:45:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65260 and previous config saved to /var/cache/conftool/dbconfig/20240620-174557-marostegui.json
[17:48:55] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[17:51:44] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[17:55:48] <jinxer-wm>	 FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:25] <wikibugs>	 (PS7) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[17:57:01] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[17:57:59] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9911180 (wiki_willy) During my call with the Dell Account team today, I asked them to push on this a bit more.  Th...
[18:00:05] <jouncebot>	 jnuche and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1800).
[18:01:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P65261 and previous config saved to /var/cache/conftool/dbconfig/20240620-180104-marostegui.json
[18:03:37] <wikibugs>	 (PS8) DCausse: [WIP] wdqs: allow to configure internal federated enpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[18:04:03] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[18:06:52] <inflatador>	 !log bking@an-airflow1007 install `ripgrep` deb pkg
[18:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:44] <wikibugs>	 (CR) Scott French: [C:+2] service: move data-gateway service to production [puppet] - https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[18:13:08] <wikibugs>	 (PS1) Dzahn: aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960)
[18:14:15] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:15:10] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:16:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T364069)', diff saved to https://phabricator.wikimedia.org/P65262 and previous config saved to /var/cache/conftool/dbconfig/20240620-181613-marostegui.json
[18:16:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:16:19] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[18:16:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:16:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65263 and previous config saved to /var/cache/conftool/dbconfig/20240620-181635-marostegui.json
[18:19:10] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 31.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:19:42] <wikibugs>	 (CR) DCausse: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[18:20:41] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911239 (Ladsgroup) Somehow [[https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaMaintenance/+/refs/heads/master/addWiki.php#249|s...
[18:21:04] <Amir1>	 sigh
[18:21:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bookworm
[18:21:09] <Amir1>	 I know what's going on
[18:22:15] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 32.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:17] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911243 (Ladsgroup) Before run: ` root@ms-fe1009:~# swift list | grep u4c root@ms-fe1009:~#   `  After run: ` root@ms-fe1009:~# swift list |...
[18:24:16] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911245 (Ladsgroup) Done for itwiki arbcom too: ` root@ms-fe1009:~# for i in $(swift list | grep wikipedia-arbcom-it) ;    do echo "$i:"...
[18:25:04] <wikibugs>	 (CR) Scott French: [C:+2] envoy: add data-gateway service listener [puppet] - https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[18:25:51] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9911246 (Ladsgroup) Open→Resolved a:Ladsgroup
[18:34:18] <Amir1>	 jouncebot: nowandnext
[18:34:19] <jouncebot>	 For the next 1 hour(s) and 25 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T1800)
[18:34:19] <jouncebot>	 In 1 hour(s) and 25 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T2000)
[18:34:42] <wikibugs>	 SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911291 (Scott_French)
[18:35:07] <wikibugs>	 (PS2) Dzahn: aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960)
[18:39:15] <wikibugs>	 (PS1) Zabe: Enable local uploads on newly created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649)
[18:43:15] <wikibugs>	 (CR) Dzahn: [C:+1] "I see! Alright, +1 then" [puppet] - https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[18:45:54] <wikibugs>	 (CR) Dzahn: [C:+1] "This has happened meanwhile." [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[18:46:28] <wikibugs>	 SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911340 (Scott_French)
[18:47:23] <wikibugs>	 SRE, Cassandra, Data Products, serviceops, Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9911350 (Scott_French) In progress→Resolved I believe that should be everything now. I'll follow-up in T368096 for ite...
[18:51:15] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:51:32] <wikibugs>	 (PS1) Jforrester: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610)
[18:52:21] <wikibugs>	 (PS1) Ryan Kemper: sre.wdqs.data-transfer: new graph split instances [cookbooks] - https://gerrit.wikimedia.org/r/1048060 (https://phabricator.wikimedia.org/T364077)
[18:53:36] <wikibugs>	 (CR) Dzahn: [C:+2] logstash_checker.py: Add --time option [puppet] - https://gerrit.wikimedia.org/r/1041746 (owner: Ahmon Dancy)
[18:58:03] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[19:00:48] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:01:19] <wikibugs>	 SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098 (Dzahn) NEW
[19:01:41] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[19:03:09] <wikibugs>	 SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911439 (Dzahn)
[19:03:47] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:38] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2088.codfw.wmnet
[19:05:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9911440 (VRiley-WMF) Swapped out B1 with another compatible DIMM and the unit should be coming back online.
[19:05:09] <icinga-wm_>	 ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368099 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:05:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368099 (ops-monitoring-bot) NEW
[19:07:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9911467 (VRiley-WMF) Open→Resolved
[19:07:01] <wikibugs>	 (CR) Dzahn: [C:+2] "tests passed:" [puppet] - https://gerrit.wikimedia.org/r/1041746 (owner: Ahmon Dancy)
[19:10:15] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9911479 (VRiley-WMF) I swapped the mainboard with a compatible server. Upon booting, it didn't seem to see any memory again. Troubleshot this with @Papaul to no avail. Was instructed to put the...
[19:10:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T367004#9911476 (Jclark-ctr) Open→Resolved a:Jclark-ctr No faults since jun 9th
[19:12:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:11] <wikibugs>	 (PS9) Ryan Kemper: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:16:19] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, DBA, Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#9911494 (Ladsgroup) @Marostegui To get the list of direct replicas, something like this would...
[19:17:38] <wikibugs>	 SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911495 (Ladsgroup)
[19:18:44] <wikibugs>	 SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9911497 (Ladsgroup) It has limit of 50,000 and it hits them with 10 of them at the same time.
[19:18:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105 for T348977 - bking@cumin2002
[19:18:45] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1105 for T348977 - bking@cumin2002
[19:18:50] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[19:18:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105* for T348977 - bking@cumin2002
[19:18:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1105* for T348977 - bking@cumin2002
[19:21:17] <wikibugs>	 (PS1) Ssingh: hiera dnsbox and P:bird: remove references to ntp.anycast.wmnet [puppet] - https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360)
[19:22:48] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3006/co" [puppet] - https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh)
[19:23:00] <wikibugs>	 (PS1) Ssingh: policies/cr-labs: remove obsolete ntp.anycast.wmnet [homer/public] - https://gerrit.wikimedia.org/r/1048066 (https://phabricator.wikimedia.org/T366360)
[19:23:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:25:26] <wikibugs>	 (PS1) Ssingh: conftool-data: remove ntp service [puppet] - https://gerrit.wikimedia.org/r/1048067 (https://phabricator.wikimedia.org/T366360)
[19:26:39] <wikibugs>	 (CR) Ssingh: [V:+1] "To be merged week of Jun 24." [puppet] - https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh)
[19:32:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:40:34] <wikibugs>	 (PS3) Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128)
[19:42:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:07] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:45:07] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:47:26] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:49:45] <wikibugs>	 (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774)
[19:50:37] <wikibugs>	 (PS2) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774)
[19:50:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[19:51:41] <wikibugs>	 (PS2) Scott French: kubernetes: split unavailable-replicas alert per team [alerts] - https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932)
[19:52:04] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[19:52:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[19:53:08] <wikibugs>	 (Merged) jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1048071 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[19:54:36] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[19:55:59] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[19:56:00] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[19:57:31] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[19:57:32] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[19:57:48] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[19:57:50] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[19:57:51] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[19:57:54] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[19:57:55] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[19:58:24] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[19:58:51] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Indeed, good to merge." [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240620T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:04:14] <wikibugs>	 (CR) Scott French: "Thank you both!" [alerts] - https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: Scott French)
[20:10:16] <wikibugs>	 (PS1) Bking: team-data-platform: Add all team-search-platform alerts to team-data-platform [alerts] - https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107)
[20:20:18] <wikibugs>	 (CR) Scott French: [C:+2] aqs-http-gateway: remove initialDelaySeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: Scott French)
[20:22:02] <wikibugs>	 (Merged) jenkins-bot: aqs-http-gateway: remove initialDelaySeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: Scott French)
[20:24:50] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[20:25:03] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[20:25:23] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[20:25:35] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[20:26:04] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[20:26:15] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[20:26:50] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[20:27:01] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[20:27:23] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[20:27:33] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[20:27:55] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[20:28:06] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[20:30:50] <wikibugs>	 (PS1) Ebernhardson: cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079
[20:33:58] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[20:34:24] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[20:36:28] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[20:36:46] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[20:38:58] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[20:39:16] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[20:40:42] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[20:40:59] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[20:42:42] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[20:43:01] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[20:44:35] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[20:44:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[20:44:53] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[20:53:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977
[20:53:07] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[20:53:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977
[20:53:51] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms
[20:59:35] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310)
[21:00:19] <wikibugs>	 (CR) Peter Fischer: [C:+2] Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer)
[21:02:02] <wikibugs>	 (Merged) jenkins-bot: Search update pipeline: use rate-limited HTTP client [deployment-charts] - https://gerrit.wikimedia.org/r/1048083 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer)
[21:03:12] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[21:03:26] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[21:03:45] <wikibugs>	 (CR) Ebernhardson: [C:+2] cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079 (owner: Ebernhardson)
[21:04:16] <wikibugs>	 (Merged) jenkins-bot: cirrus: Bump container version [deployment-charts] - https://gerrit.wikimedia.org/r/1048079 (owner: Ebernhardson)
[21:04:44] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[21:04:50] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[21:05:05] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[21:05:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368065#9911919 (VRiley-WMF) Swapped out power supply. It looks like it's now reporting properly
[21:06:14] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[21:06:30] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[21:07:46] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:07:47] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[21:08:03] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[21:08:32] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:08:52] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:09:15] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:09:42] <brett>	 !log Include ncmonitor 1.0.0 in wikimedia-bookworm apt repo
[21:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:53] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[21:10:08] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[21:11:47] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[21:12:04] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[21:12:21] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[21:12:21] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:14:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[21:34:57] <wikibugs>	 (CR) JHathaway: [C:+1] Move puppetdb.tuning.conf template [puppet] - https://gerrit.wikimedia.org/r/1047498 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[21:35:53] <wikibugs>	 (CR) JHathaway: [C:+1] Move update-netboot-image.sh to the puppetserver module [puppet] - https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[21:37:03] <wikibugs>	 (CR) JHathaway: [C:+1] Remove obsolete motd [puppet] - https://gerrit.wikimedia.org/r/1047949 (owner: Muehlenhoff)
[21:43:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367856)', diff saved to https://phabricator.wikimedia.org/P65265 and previous config saved to /var/cache/conftool/dbconfig/20240620-214326-marostegui.json
[21:43:32] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[21:45:56] <wikibugs>	 (PS3) Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373)
[21:50:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:54:39] <wikibugs>	 (PS1) David Martin: Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435)
[21:54:39] <wikibugs>	 (CR) David Martin: "The table creation patch merged today (June 20); will deploy next week." [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: David Martin)
[21:58:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P65266 and previous config saved to /var/cache/conftool/dbconfig/20240620-215833-marostegui.json
[22:03:11] <icinga-wm_>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[22:03:33] <wikibugs>	 (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:04:49] <wikibugs>	 (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:10:11] <icinga-wm_>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[22:12:00] <wikibugs>	 (CR) EoghanGaffney: [C:+2] lists: Add symlink to /var/lib/mailman3 when using different root [puppet] - https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[22:13:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P65267 and previous config saved to /var/cache/conftool/dbconfig/20240620-221340-marostegui.json
[22:14:13] <wikibugs>	 (CR) Dzahn: [C:+1] "https://puppet-compiler.wmflabs.org/output/1047470/3007/gitlab2002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: Jelto)
[22:15:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:17] <wikibugs>	 (CR) Dzahn: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[22:19:58] <wikibugs>	 (CR) EoghanGaffney: [C:+1] aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960) (owner: Dzahn)
[22:20:07] <wikibugs>	 (CR) Dzahn: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[22:21:46] <zabe>	 jouncebot: nowandnext
[22:21:47] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 38 minute(s)
[22:21:47] <jouncebot>	 In 7 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T0600)
[22:22:26] <wikibugs>	 (CR) Zabe: [C:+2] Enable local uploads on newly created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649) (owner: Zabe)
[22:23:05] <wikibugs>	 (Merged) jenkins-bot: Enable local uploads on newly created wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1048055 (https://phabricator.wikimedia.org/T366649) (owner: Zabe)
[22:24:19] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "lgtm!" [alerts] - https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: Bking)
[22:26:38] <wikibugs>	 (PS3) Dzahn: admin: Extend access for AndyRussG [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[22:27:31] <wikibugs>	 (CR) Dzahn: [C:+2] aphlict: remove duplicate sytemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/1048052 (https://phabricator.wikimedia.org/T367960) (owner: Dzahn)
[22:28:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367856)', diff saved to https://phabricator.wikimedia.org/P65268 and previous config saved to /var/cache/conftool/dbconfig/20240620-222847-marostegui.json
[22:28:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:28:52] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[22:29:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:29:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367856)', diff saved to https://phabricator.wikimedia.org/P65269 and previous config saved to /var/cache/conftool/dbconfig/20240620-222909-marostegui.json
[22:30:29] <wikibugs>	 (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3008/console" [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[22:33:44] <wikibugs>	 (CR) EoghanGaffney: [V:+1 C:+2] mailman3: remove buster support [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706) (owner: Dzahn)
[22:33:50] <logmsgbot>	 !log zabe@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T361041 T363825 T366649 (duration: 09m 55s)
[22:33:59] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[22:33:59] <stashbot>	 T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825
[22:33:59] <stashbot>	 T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649
[22:39:57] <mutante>	 !log aphlict1002/aphlict2001 - systemctl stop aphlict_logrotate.timer (and .service); systemctl disable aphlict_logrotate.timer (and .service); systemctl daemon-reload; systemctl reset-failed T367960
[22:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:02] <stashbot>	 T367960: SystemdUnitFailed - aphlict1002 - logrotate - https://phabricator.wikimedia.org/T367960
[22:42:19] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:44:45] <wikibugs>	 (CR) Dzahn: "gerrit thinks it's still my turn (attention set) even though this is already merged. so +1 :)" [puppet] - https://gerrit.wikimedia.org/r/1047116 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes)
[22:45:13] <wikibugs>	 (CR) Dzahn: [C:+1] "related: updated secret in private repo as requested" [puppet] - https://gerrit.wikimedia.org/r/1047116 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes)
[22:48:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65270 and previous config saved to /var/cache/conftool/dbconfig/20240620-224803-marostegui.json
[22:48:09] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[23:03:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P65271 and previous config saved to /var/cache/conftool/dbconfig/20240620-230310-marostegui.json
[23:13:35] <wikibugs>	 (PS1) Zabe: Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038)
[23:14:39] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] "Looks good enough to try on the beta cluster. I have a few questions about some details." [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[23:14:46] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[23:16:00] <wikibugs>	 (CR) Zabe: [C:+2] Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038) (owner: Zabe)
[23:16:41] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for btmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048116 (https://phabricator.wikimedia.org/T368038) (owner: Zabe)
[23:18:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P65272 and previous config saved to /var/cache/conftool/dbconfig/20240620-231817-marostegui.json
[23:20:38] <zabe>	 !log create Wikipedia Mandailing # T368038
[23:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:43] <stashbot>	 T368038: Create Wikipedia Mandailing - https://phabricator.wikimedia.org/T368038
[23:20:58] <logmsgbot>	 !log zabe@deploy1002 Started scap: Creating btmwiki (T368038)
[23:23:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:33:19] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Creating btmwiki (T368038) (duration: 12m 20s)
[23:33:24] <stashbot>	 T368038: Create Wikipedia Mandailing - https://phabricator.wikimedia.org/T368038
[23:33:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364069)', diff saved to https://phabricator.wikimedia.org/P65273 and previous config saved to /var/cache/conftool/dbconfig/20240620-233324-marostegui.json
[23:33:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[23:33:30] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[23:33:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[23:33:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T364069)', diff saved to https://phabricator.wikimedia.org/P65274 and previous config saved to /var/cache/conftool/dbconfig/20240620-233346-marostegui.json
[23:34:59] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=btmwiki --cluster=all 2>&1 | tee /tmp/btmwiki.UpdateSearchIndexConfig.log # T368038
[23:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:41] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121
[23:35:42] <wikibugs>	 (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121 (owner: Zabe)
[23:36:29] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1048121 (owner: Zabe)
[23:36:57] <logmsgbot>	 !log zabe@deploy1002 Started scap: Update interwiki cache
[23:38:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122
[23:38:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122 (owner: TrainBranchBot)
[23:45:19] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:47:09] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Update interwiki cache (duration: 10m 12s)
[23:47:29] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:21] <icinga-wm_>	 PROBLEM - Webrequests Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:21] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3072 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:22] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3068 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:24] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:25] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:27] <icinga-wm_>	 PROBLEM - statsv Varnishkafka log producer on cp3070 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:48:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:50:27] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp3072 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:50:48] <sukhe>	 !incidents
[23:50:49] <sirenbot>	 4765 (UNACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[23:50:49] <sirenbot>	 4764 (RESOLVED)  db1206 (paged)/MariaDB Replica Lag: s1 (paged)
[23:50:52] <sukhe>	 !ack 4765
[23:50:52] <sirenbot>	 4765 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[23:53:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:53:52] <AntiComposite>	 many user reports of 503s
[23:53:56] <AntiComposite>	 but fine here
[23:54:12] <AntiComposite>	 ok many is 2 they just said it more than once
[23:55:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:55:38] <sukhe>	 AntiComposite: should be recovering