[00:02:19] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:23] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:09:07] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:24:43] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 149.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[00:24:54] <wikibugs>	 (Abandoned) Tim Starling: Increase wgMaxUserDBWriteDuration to 10 on votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713024 (https://phabricator.wikimedia.org/T288831) (owner: Tim Starling)
[00:27:09] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:31] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[00:47:29] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:55] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[01:01:37] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:21] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[01:26:17] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:57] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1145.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:52:15] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[02:02:03] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:53] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[02:26:17] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:51] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:50] <icinga-wm>	 PROBLEM - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:07:35] <legoktm>	 O.o
[03:08:30] <rzl>	 👋
[03:08:38] <icinga-wm>	 RECOVERY - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 4.656 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:08:52] <rzl>	 eqiad, so hopefully not too serious? let's see
[03:09:15] <legoktm>	 https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=30s
[03:10:13] <rzl>	 traffic spike about 2:50, latency and 5xx spike about 3:02?
[03:10:26] <legoktm>	 yeah, seems like
[03:11:09] <rzl>	 interesting that it's all hosts in eqiad, but uneven in codfw
[03:11:16] <legoktm>	 oh, thumbor is pooled in eqiad, not codfw
[03:11:52] <rzl>	 really? qps looks like it's in the same ballpark
[03:12:08] <legoktm>	 https://config-master.wikimedia.org/discovery/discovery-basic.yaml
[03:12:38] <rzl>	 huh
[03:13:10] <legoktm>	 we discussed moving swift over so new codfw hardware could be added: https://phabricator.wikimedia.org/T288458#7300647
[03:13:20] <legoktm>	 guess it makes sense that thumbor followed as well
[03:13:46] <rzl>	 nod
[03:17:10] <rzl>	 digging for logs a bit
[03:17:26] <legoktm>	 someone ("Python urllib2") is scrapping but getting hit by 429s
[03:17:37] <legoktm>	 scraping*
[03:17:59] <rzl>	 nod
[03:26:15] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:19] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 139.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:02:05] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:43] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:09:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:15:15] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:16] <wikibugs>	 (CR) Krinkle: Update configuration related to disabling Score functionality (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715194 (owner: Legoktm)
[04:20:59] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:33] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 123.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:25:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:55] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:57] <wikibugs>	 SRE, User-Joe, User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (jijiki)
[04:35:59] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:36:12] <wikibugs>	 SRE, ChangeProp, serviceops, SCB, and 2 others: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (jijiki) Open→Declined Bluntly closing this, no updates/findings for a long time
[05:01:21] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:29] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:15:55] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db2110 to Buster [puppet] - https://gerrit.wikimedia.org/r/715347 (https://phabricator.wikimedia.org/T288803)
[05:16:29] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 63.05 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[05:16:57] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db2110 to Buster [puppet] - https://gerrit.wikimedia.org/r/715347 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[05:18:50] <wikibugs>	 (PS1) Marostegui: db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715348 (https://phabricator.wikimedia.org/T288803)
[05:22:38] <wikibugs>	 (CR) Marostegui: [C: +2] db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715348 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[05:23:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 for reimage T288803', diff saved to https://phabricator.wikimedia.org/P17105 and previous config saved to /var/cache/conftool/dbconfig/20210830-052336-marostegui.json
[05:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:42] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:26:15] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2110.codfw.wmnet with reason: REIMAGE
[05:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2110.codfw.wmnet with reason: REIMAGE
[05:44:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:29] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:55] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:19] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:04] <wikibugs>	 SRE, Research, observability, Patch-For-Review: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (Legoktm) Open→Resolved This is no longer an issue because SCB is long gone, and there are no flapping alerts for this service that I've seen rec...
[06:33:19] <wikibugs>	 SRE, Discovery, Recommendation-API, Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445 (Legoktm) Open→Resolved This is no longer an issue because SCB is long gone, and there are no flapping alerts for this service that I'v...
[06:38:28] <godog>	 !log more weight to ms-be20[62-65] - T288458
[06:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:34] <stashbot>	 T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[06:46:04] <wikibugs>	 (PS1) Marostegui: pc[12]007-010: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715441 (https://phabricator.wikimedia.org/T289112)
[06:46:54] <wikibugs>	 (CR) Marostegui: [C: +2] pc[12]007-010: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715441 (https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[06:53:06] <elukey>	 !log drop an-airflow1001's old airflow logs to fix root partition almost filled up
[06:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:37] <wikibugs>	 SRE, SRE-swift-storage, Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (fgiunchedi) The fact that this is a new wiki suggests to me the maintenance scripts to give Thumbor access to the containers haven't been run...
[06:57:11] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) Essentially a puppet setting yes, `rsync::server::wrap_with_stunnel` for the server bits and then e.g. `rsync::quickdatacopy` has the option to turn on ssl on the...
[06:58:57] <wikibugs>	 SRE-swift-storage, MediaWiki-extensions-Score, I18n, Patch-For-Review: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (fgiunchedi) >>! In T184871#7315757, @TheDJ wrote: > @fgiunchedi you know if that patch makes...
[07:01:03] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:33] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (jcrespo) p:Triage→High a:jcrespo Hi, @Bethany, I can process your request with no issue, but might I request to update your email (and verify it) on your account on Wikitech at https://...
[07:04:57] <wikibugs>	 (PS3) Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305)
[07:05:28] <godog>	 I can't op myself here for whatever reason, can someone change the topic to set me on clinic duty?
[07:05:33] <wikibugs>	 (PS1) Mforns: Fix --unitl for monitor_refine_event_sanitized_analytics_delayed [puppet] - https://gerrit.wikimedia.org/r/715442
[07:06:25] <wikibugs>	 (PS4) Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305)
[07:10:07] <wikibugs>	 (PS1) Jcrespo: admin: Add bgwiki (Bethany) to the list of privileged ldap only users [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892)
[07:10:13] <wikibugs>	 (CR) Jelto: helmfile.d admin rename view rbac resources (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[07:10:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:12] <marostegui>	 godog: done :)
[07:12:13] <wikibugs>	 (CR) Jcrespo: [C: -1] "-1 waiting for LDAP and HR records for mail to be identical (see ticket)." [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[07:12:40] <godog>	 marostegui: thank you <3 <3
[07:13:57] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:34] <jynus>	 I was actually still doing stuff myself, as normally meeting was in the afternoon
[07:14:39] <jynus>	 happy to handover now
[07:16:13] <godog>	 jynus: yeah I don't know tbh when the handover is supposed to happen but anytime works for me
[07:16:39] <jynus>	 let me cleanup the maint-announce for you and it will be all yours :-)
[07:16:46] <jynus>	 for things during the weekend
[07:18:11] <godog>	 heheh ok, LMK jynus 
[07:22:16] <wikibugs>	 (CR) Filippo Giunchedi: profile: adapt alertmanager-webhook-logger to ECS (2 comments) [puppet] - https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: Cwhite)
[07:28:26] <wikibugs>	 (CR) RhinosF1: "the email is a -ctr email. does that not mean we need expiry date & contact" [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[07:29:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:22] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:24] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:48:06] <elukey>	 godog: o/ as FYI yesterday I have downtimed the cr2-esams / cr2 eqiad alerts due to the Lumen maintenance (that will last days IIUC sigh) so it may start alarming again in a couple of hours
[07:49:01] <godog>	 elukey: ack, thank you! will keep it in mind
[07:50:06] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived - https://phabricator.wikimedia.org/T289928 (fgiunchedi) p:Triage→Medium
[07:50:23] <wikibugs>	 SRE, Traffic: cp2027 powercycled - https://phabricator.wikimedia.org/T289908 (fgiunchedi) p:Triage→Medium
[07:50:46] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (fgiunchedi) p:Triage→Medium
[07:50:53] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) p:Triage→Medium
[07:51:08] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (fgiunchedi) p:Triage→Medium
[07:55:15] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[07:55:15] <wikibugs>	 (PS1) PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446
[07:56:32] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (JMeybohm) Open→Resolved
[07:58:36] <wikibugs>	 (PS6) DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275)
[08:01:48] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:09] <wikibugs>	 SRE, MW-on-K8s, SRE Observability, serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (fgiunchedi) p:Triage→Medium
[08:03:37] <wikibugs>	 SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) Unless there are objections let's go with (b), do you need command line access or web interface is fine @bren...
[08:03:50] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) >>! In T255871#7261361, @Ottomata wrote: > I think that will do it.  helm template looks good locally. >  > @JMeybohm is it ok that I mov...
[08:04:45] <wikibugs>	 SRE, Anti-Harassment, IP Info, serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (fgiunchedi) p:Triage→Medium
[08:05:11] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:18] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (JMeybohm) Open→Resolved a:JMeybohm Remove the non-TLS k8s service will be handled via T255871
[08:05:24] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:34] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:40] <wikibugs>	 SRE, Traffic: cp2027 powercycled - https://phabricator.wikimedia.org/T289908 (Vgutierrez) Open→Resolved a:Vgutierrez all seems to be OK with cp2027, I just repooled it. Thanks @elukey!
[08:05:46] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (JMeybohm) Open→Resolved Remove the non-TLS k8s service will be handled via T255871
[08:06:04] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (JMeybohm) Open→Resolved Remove the non-TLS k8s service will be handled via T255871
[08:06:15] <wikibugs>	 (PS1) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:06:17] <wikibugs>	 (PS1) JMeybohm: termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581)
[08:06:19] <wikibugs>	 (PS1) JMeybohm: citoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868)
[08:06:21] <wikibugs>	 (PS1) JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869)
[08:06:24] <wikibugs>	 (PS1) JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875)
[08:06:26] <wikibugs>	 (PS1) JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878)
[08:06:28] <wikibugs>	 (PS1) JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879)
[08:07:48] <wikibugs>	 SRE, Services (watching), User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (fgiunchedi) p:Triage→Medium
[08:07:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] citoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: JMeybohm)
[08:07:55] <wikibugs>	 SRE, MediaWiki-Uploading, Traffic, serviceops, Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (fgiunchedi) p:Triage→Medium
[08:08:01] <wikibugs>	 SRE, docker-pkg, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (fgiunchedi) p:Triage→Medium
[08:08:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[08:08:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: JMeybohm)
[08:08:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[08:08:27] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (fgiunchedi) p:Triage→Medium
[08:08:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: JMeybohm)
[08:08:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: JMeybohm)
[08:08:41] <wikibugs>	 SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (fgiunchedi) p:Triage→Medium
[08:09:08] <wikibugs>	 SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) p:Triage→Medium
[08:11:24] <wikibugs>	 (CR) DCausse: [C: +2] rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446 (owner: PipelineBot)
[08:14:23] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446 (owner: PipelineBot)
[08:19:22] <wikibugs>	 (CR) Jcrespo: [C: -1] admin: Add bgwiki (Bethany) to the list of privileged ldap only users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[08:25:45] <wikibugs>	 (PS2) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:25:47] <wikibugs>	 (PS2) JMeybohm: termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581)
[08:25:49] <wikibugs>	 (PS2) JMeybohm: citoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868)
[08:25:51] <wikibugs>	 (PS2) JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869)
[08:25:53] <wikibugs>	 (PS2) JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875)
[08:25:55] <wikibugs>	 (PS2) JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878)
[08:25:57] <wikibugs>	 (PS2) JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879)
[08:25:59] <wikibugs>	 (PS1) JMeybohm: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - https://gerrit.wikimedia.org/r/715454
[08:26:28] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:17] <wikibugs>	 (CR) RhinosF1: admin: Add bgwiki (Bethany) to the list of privileged ldap only users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[08:27:48] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (jcrespo) >>! In T281780#7315315, @AntiCompositeNumber wrote: > The thumbnail not existing in Swift is certainly...
[08:29:12] <wikibugs>	 (CR) JMeybohm: [C: +1] helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:37:46] <wikibugs>	 SRE, Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (ema) @MMandere: is there anything left to do here? If not, let's close the task!
[08:41:44] <wikibugs>	 SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema)
[08:41:51] <wikibugs>	 SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema) p:Triage→Low
[08:42:02] <wikibugs>	 (PS5) Volans: lldp fact: add new parent key to lldp [puppet] - https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: Jbond)
[08:42:57] <wikibugs>	 SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema)
[08:51:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:55] <wikibugs>	 10SRE-swift-storage: Puppetize container creation for applications that don't create containers - https://phabricator.wikimedia.org/T289976 (10fgiunchedi)
[08:56:45] <wikibugs>	 (03CR) 10Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo)
[08:57:20] <godog>	 !log +100G to prometheus/global in codfw
[08:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:37] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet
[08:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:50] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet
[08:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:27] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[09:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:00] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[09:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:20] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[09:01:22] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master
[09:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:42] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1150.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:11:11] <wikibugs>	 (03PS1) 10David Caro: ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700)
[09:11:53] <wikibugs>	 10SRE, 10Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (10MMandere) @ema: All is set here. I will go ahead and close the task.
[09:12:04] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro)
[09:12:45] <wikibugs>	 10SRE, 10Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (10MMandere) 05Open→03Resolved
[09:14:37] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) a:05JMeybohm→03Jelto
[09:22:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:51] <wikibugs>	 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10JMeybohm)
[09:24:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:25:48] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137)
[09:31:11] <wikibugs>	 (03PS4) 10Ladsgroup: Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944)
[09:31:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup)
[09:32:30] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup)
[09:34:07] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:703476|Set $wgIncludejQueryMigrate to false in group0 (T280944)]] (duration: 00m 57s)
[09:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:11] <stashbot>	 T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944
[09:34:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The debian-glue job failed because" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi)
[09:37:59] <wikibugs>	 (03PS4) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216
[09:38:07] <wikibugs>	 (03CR) 10David Caro: "All the changes in PCC are expected (adding the before relationship)." [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro)
[09:38:32] <wikibugs>	 (03CR) 10Jcrespo: "Followup?" [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo)
[09:45:39] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Do not assume that UDS implies PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/713226 (https://phabricator.wikimedia.org/T285374)
[09:45:41] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[09:46:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:46:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:29] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30905/console" [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[09:48:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[09:49:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] osm: migrate cron osm_sync_lag to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[09:50:24] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[09:51:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:51] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[09:58:23] <wikibugs>	 (03CR) 10Vgutierrez: "tested in labs setting profile::cache::varnish::frontend::uds_owner to envoy:" [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[09:59:41] <wikibugs>	 (03PS1) 10Jbond: puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578)
[10:00:17] <wikibugs>	 (03PS2) 10Jbond: puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578)
[10:01:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30906/console" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[10:02:22] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:05] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10jbond) 05Open→03Resolved a:03jbond this seems resolved, boldly closing, please re-open if missed something
[10:08:40] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:13:44] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse)
[10:14:18] <wikibugs>	 (03CR) 10Ema: varnish: Allow configuring UDS owner/group/perms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[10:14:38] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I took the liberty to fix some typos in the commit message. The code looks sane to me, although if you want to be extra careful and want t" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond)
[10:15:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM once the parent change has been tested." [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond)
[10:16:13] <wikibugs>	 (03Merged) 10jenkins-bot: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse)
[10:19:27] <wikibugs>	 (03CR) 10Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo)
[10:21:28] <wikibugs>	 (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond)
[10:21:37] <wikibugs>	 (03CR) 10Volans: "LGTM but I'd like to get some buy-in from the service owners that use those facts in their code to know if they might need to use them fro" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[10:21:46] <wikibugs>	 (03PS4) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[10:21:52] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[10:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:07] <wikibugs>	 (03CR) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[10:25:26] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:24] <wikibugs>	 (03CR) 10Ema: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere)
[10:28:56] <wikibugs>	 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10fgiunchedi)
[10:30:04] <wikibugs>	 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10fgiunchedi) @Dzahn perhaps do you know what to do ? or know who might know? thank you!
[10:30:04] <jouncebot>	 jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1030).
[10:44:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:44:44] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:45:27] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[10:48:10] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[10:53:37] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:24] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:55:30] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:56:58] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583)
[11:00:00] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1100).
[11:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:02:08] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:15:58] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583)
[11:17:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:17:52] <wikibugs>	 (03PS3) 10Jbond: puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578)
[11:21:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/715477 (owner: 10L10n-bot)
[11:26:50] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:19] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:16] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the updated commit message with the analysis. I guess that in case of use cases in which people might need those facts we" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[11:41:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:43:54] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:47:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:48:01] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:48:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:23] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:41] <wikibugs>	 (03PS2) 10Mforns: Fix --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/715442
[11:52:49] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:52:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:32] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 43 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:00:48] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:01:18] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:52] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Bennylin) @Aklapper: Got it @fgiunchedi: Maybe because not many new wiki chose to enable upload, so this step was not added to the SOP for cre...
[12:02:40] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488
[12:06:22] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10RhinosF1) >>! In T289860#7317728, @fgiunchedi wrote: > The fact that this is a new wiki suggests to me the maintenance scripts to give Thumbor...
[12:06:54] <RhinosF1>	 Amir1: that's new wiki related ^
[12:08:23] <Amir1>	 I fix it
[12:08:58] <Amir1>	 tbh I'm not sure if it really warrants a UBN
[12:09:00] <Amir1>	 but meh
[12:09:00] <RhinosF1>	 Amir1: was wondering if you knew whether add wiki was being silly that day too
[12:09:14] <RhinosF1>	 No I'm not entirely sure either it's UBN
[12:10:24] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:12:20] <Amir1>	 !log ladsgroup@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=jvwikisource --backend=local-multiwrite (T289860)
[12:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:27] <stashbot>	 T289860: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860
[12:12:53] <RhinosF1>	 godog: found your person to blame and give the sticker for fixing it ^
[12:13:29] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10RhinosF1) \o/ works!
[12:13:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10ayounsi)
[12:13:34] <godog>	 RhinosF1: haha! I'll have to buy $beverage for Amir1
[12:13:37] <godog>	 Amir1: thank you <3
[12:13:47] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor, 10User-Ladsgroup: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Somehow that was missed or didn't run properly during creation of the wik...
[12:13:52] <godog>	 but yeah definitely not UBN
[12:14:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10ayounsi) 05Resolved→03Open Some of them seem duplicated, see the red ones on https://netbox.wikimedia.org/extras/reports/cables.Cables/
[12:14:46] <RhinosF1>	 Amir1: ty!
[12:14:49] <Amir1>	 godog: ^^ I tried marostegui but he's grumpy today :D
[12:15:04] <marostegui>	 hahaha
[12:15:04] <Amir1>	 I'll find my beer somehow :P
[12:15:44] <RhinosF1>	 Amir1: if I ever get to an in person event (one day I will), I'll get you one
[12:15:45] <Amir1>	 actually, I should run pdf cleaner (what I ran for commons image table) on all wikisource wikis
[12:15:57] <godog>	 ahah yeah no a problem, the local Speti will help
[12:16:14] <godog>	 Späti
[12:16:16] <Amir1>	 RhinosF1: no Lager, I hate Lager :D
[12:16:32] <RhinosF1>	 Amir1: :)
[12:16:56] <Amir1>	 godog: the local Spati next to my home speak Persian. It's sooooo convenient
[12:17:16] <godog>	 haha! that's amazing
[12:18:50] <wikibugs>	 (03PS2) 10Jbond: P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488
[12:22:10] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor, 10User-Ladsgroup: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Bennylin) It works now. Tyvm!
[12:26:26] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488 (owner: 10Jbond)
[12:30:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Overall LGTM. We might need to revisit it the day we have Linux based switches." [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond)
[12:33:58] <wikibugs>	 (03PS1) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921)
[12:38:26] <wikibugs>	 (03PS1) 10Jbond: puppetdb - cloud: add readonly user to config [puppet] - 10https://gerrit.wikimedia.org/r/715493
[12:39:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) > (So row B in 10G, C8, and/or D5.) New cloud hosts only go in cloud racks, so no row B.
[12:40:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb - cloud: add readonly user to config [puppet] - 10https://gerrit.wikimedia.org/r/715493 (owner: 10Jbond)
[12:44:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:44:41] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ayounsi) Note that DNS PTRs are missing at least for: ` 185.71.138.139 185.71.138.141 185.71.138.140 `
[12:44:45] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "I think there is a missing entry for wcqs.svc.codfw.wmnet. See inline comment." [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson)
[12:45:25] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) >>! In T289536#7318613, @ayounsi wrote: > Note that DNS PTRs are missing at least for: > ` > 185.71.138.139 > 185.71.138.141 > 185.71.138.140 > `  Oh right, good catch, thank...
[12:48:49] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) 05Stalled→03Resolved a:03ayounsi Noted, thanks! Yeah fine to close for now, and re-open if any issues.
[12:50:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:51:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:52:13] <wikibugs>	 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi)
[12:52:19] <wikibugs>	 (03CR) 10Gehel: blazegraph: Setup new wcqs instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson)
[12:59:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10fgiunchedi) It occurred to me that as part of uid/gid preprovision we should detect if swift was previously on the host (i.e. there are labeled filesyste...
[12:59:54] <wikibugs>	 (03PS5) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216
[13:01:14] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:54] <icinga-wm>	 PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[13:08:13] <wikibugs>	 (03PS5) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[13:08:44] <wikibugs>	 (03CR) 10Ema: [C: 03+1] varnish: Allow configuring UDS owner/group/perms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[13:09:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:10:07] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Do not assume that UDS implies PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/713226 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[13:11:18] <wikibugs>	 (03PS6) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374)
[13:14:16] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[13:15:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:17:10] <wikibugs>	 (03PS6) 10Vgutierrez: varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374)
[13:18:21] <wikibugs>	 (03CR) 10Jcrespo: "Will do." [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo)
[13:20:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) 05Resolved→03Open @SimoneThisDot by any chance, do you have a @wikimedia.org email that was provided to you? An alert has been fired about this access on production, and we woul...
[13:21:30] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez)
[13:21:58] <wikibugs>	 (03PS1) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305)
[13:26:22] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:24] <wikibugs>	 (03PS1) 10Ssingh: wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536)
[13:29:54] <wikibugs>	 (03CR) 10Ssingh: "I am not sure how to handle the *.check case, so I came up with "yes" and "no". (Do we/should we have PTR records for this case? :)" [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[13:32:55] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[13:40:30] <wikibugs>	 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) Hello!  > I believe that the top-level http and meta properties are in a sense "owned" by the intake...
[13:43:45] <urbanecm>	 jouncebot: now
[13:43:45] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 16 minute(s)
[13:44:17] <urbanecm>	 jouncebot: next
[13:44:17] <jouncebot>	 In 3 hour(s) and 15 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1700)
[13:44:26] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm)
[13:44:31] <wikibugs>	 (03PS2) 10Urbanecm: Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224
[13:44:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm)
[13:45:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm)
[13:45:54] <wikibugs>	 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[13:47:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:48:02] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6fbcc93f429ff3fbca98aeecdee4f33f022ca7c3: Add missing edit*protected rights to $wgAvailableRights (duration: 00m 56s)
[13:48:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:14] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[13:49:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:49:32] <wikibugs>	 (03CR) 10Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo)
[13:50:29] <wikibugs>	 (03PS2) 10Urbanecm: knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333)
[13:50:33] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) (owner: 10Urbanecm)
[13:51:55] <wikibugs>	 (03Merged) 10jenkins-bot: knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) (owner: 10Urbanecm)
[13:52:25] <wikibugs>	 (03PS2) 10Urbanecm: Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307)
[13:52:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm)
[13:52:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro)
[13:53:13] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f1a178e1d4d7c98a1988da68982f97848f390c68: knwiki: Disable wmgNewUserMessageOnAutoCreate (T289333) (duration: 00m 57s)
[13:53:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:18] <stashbot>	 T289333: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwiki - https://phabricator.wikimedia.org/T289333
[13:53:50] <wikibugs>	 (03Merged) 10jenkins-bot: Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm)
[13:54:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:55:08] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b17015395cc592e021a4ca8ce6f81b699bb77381:  Growth mentor dashboard: Enable beta features only on beta wikis (T280307) (duration: 00m 55s)
[13:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:13] <stashbot>	 T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307
[13:55:16] * urbanecm done
[13:56:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:59:23] <wikibugs>	 (03PS2) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305)
[13:59:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:00:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:40] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 32 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:06:02] <wikibugs>	 (03PS1) 10Ladsgroup: alertmanager: Add Wikidata team to alert manager [puppet] - 10https://gerrit.wikimedia.org/r/715505 (https://phabricator.wikimedia.org/T287741)
[14:18:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[14:18:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: Add Wikidata team to alert manager [puppet] - 10https://gerrit.wikimedia.org/r/715505 (https://phabricator.wikimedia.org/T287741) (owner: 10Ladsgroup)
[14:21:09] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet
[14:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet
[14:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:26] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master
[14:21:27] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master
[14:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:38] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:28:23] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10fgiunchedi)
[14:31:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:37:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:39:00] <icinga-wm>	 RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms
[14:39:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:39:19] <wikibugs>	 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[14:41:00] <icinga-wm>	 RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[14:44:27] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[14:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:13] <wikibugs>	 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[14:45:16] <wikibugs>	 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) > Unless there are objections let's go with (b), do you need command line access or web interface is fine @brenn...
[14:47:28] <wikibugs>	 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10Dzahn) a:03Dzahn
[14:47:57] <wikibugs>	 (03PS2) 10Ssingh: wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536)
[14:48:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:49:16] <wikibugs>	 (03PS1) 10Urbanecm: Add mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369)
[14:52:04] <wikibugs>	 (03PS2) 10Urbanecm: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369)
[14:52:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:53:26] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: Adjust memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715531
[14:53:28] <wikibugs>	 (03PS1) 10DCausse: flink-session-cluster: Remove service.name from the ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/715532
[14:58:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:01:32] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:32] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10akosiaris) Thanks for this writeup  Let me start by saying that of the 3 solutions, the basic idea of the 3rd one should be the one that we aim for in the long run (but not now). Deploym...
[15:03:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH)
[15:03:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) >>! In T289882#7318587, @ayounsi wrote: >> (So row B in 10G, C8, and/or D5.) > New cloud hosts only go in cloud racks, so no...
[15:03:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:05:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) a:03nskaggs @nskaggs: Who in WMCS is going to be point on these servers?   I ask so we can assign them this racking task, s...
[15:05:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:07:44] <wikibugs>	 (03PS1) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009)
[15:08:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:10:09] <wikibugs>	 (03CR) 10Ema: varnish: Containerize varnish test environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere)
[15:11:20] <icinga-wm>	 PROBLEM - Long running screen/tmux on snapshot1009 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 32809, 1728364s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[15:13:16] <apergos>	 oh?
[15:13:22] <apergos>	 lemme see about that
[15:13:53] <apergos>	 fixed, apologies!
[15:15:08] <dancy>	 tsk tsk!
[15:15:27] <apergos>	 yeah totally my bad, I usually close those out when done and I dropped the ball on that one
[15:16:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:17:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:22:24] <wikibugs>	 (03CR) 10Ema: Add Varnish SLO dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema)
[15:24:26] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421)
[15:25:55] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "I think we're in good shape to deploy this!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema)
[15:26:44] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:44] <wikibugs>	 (03PS9) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421)
[15:28:46] <wikibugs>	 (03PS9) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421)
[15:28:49] <wikibugs>	 (03PS9) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421)
[15:28:50] <wikibugs>	 (03PS8) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421)
[15:28:52] <wikibugs>	 (03PS8) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421)
[15:28:54] <wikibugs>	 (03PS8) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421)
[15:28:56] <wikibugs>	 (03PS3) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421)
[15:28:58] <wikibugs>	 (03PS2) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421)
[15:29:00] <wikibugs>	 (03PS2) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421)
[15:29:02] <wikibugs>	 (03PS3) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421)
[15:31:15] <wikibugs>	 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10akosiaris)
[15:32:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:33:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:37:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:38:36] <wikibugs>	 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10fgiunchedi) Update: Grafana-based performance alerts have been migrated to AlertManager, and as such show up at https://alerts.wikimed...
[15:38:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:42:48] <wikibugs>	 (03PS1) 10Ladsgroup: icinga: Drop grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/715543 (https://phabricator.wikimedia.org/T287741)
[15:44:50] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[15:50:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: Drop grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/715543 (https://phabricator.wikimedia.org/T287741) (owner: 10Ladsgroup)
[15:53:33] <Jdlrobson>	 Hey @paladox I was wondering if there's anything I can do to help move https://gerrit-review.googlesource.com/c/gerrit/+/313490 along?
[15:56:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:11] <wikibugs>	 (03PS1) 10Ssingh: durum: fix notifying the uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715546
[15:57:27] <paladox>	 Oh, I forgot about that. I just need to add some docs, I think... though I'm not exactly sure if I did that correctly.
[15:58:28] <wikibugs>	 (03PS1) 10Jbond: C:puppetdb::app: move blacklist file to correct config [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578)
[15:58:52] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:59:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30907/console" [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[15:59:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetdb::app: move blacklist file to correct config [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[16:00:14] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30908/console" [puppet] - 10https://gerrit.wikimedia.org/r/715546 (owner: 10Ssingh)
[16:01:40] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro)
[16:02:27] <Jdlrobson>	 @paladox yeh it looks like you did it correctly. They just want it expanded with documentation.
[16:02:28] <wikibugs>	 (03PS2) 10Andrew Bogott: ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro)
[16:02:39] <Jdlrobson>	 (and tests)
[16:04:44] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[16:06:48] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02183 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:12:14] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30909/console" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[16:13:38] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "pcc output: https://puppet-compiler.wmflabs.org/compiler1003/30909/" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[16:15:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[16:15:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:16:21] <sukhe>	 !log running authdns-update for Gerrit 715499
[16:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:28] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:18:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: fix notifying the uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715546 (owner: 10Ssingh)
[16:20:01] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet
[16:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:08] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1008.eqiad.wmnet
[16:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:46] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1008.eqiad.wmnet with reason: Resyncing from master
[16:20:47] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1008.eqiad.wmnet with reason: Resyncing from master
[16:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:49] <wikibugs>	 (03PS1) 10Zabe: osm: remove absented osm_sync_lag cron [puppet] - 10https://gerrit.wikimedia.org/r/715552 (https://phabricator.wikimedia.org/T273673)
[16:23:36] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:26:48] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:54] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:32:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:32:35] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] Set permission of creating short url to everyone everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup)
[16:33:49] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 (owner: 10MSantos)
[16:34:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE
[16:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:36:39] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE
[16:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:43:13] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call `ml-services` dir under `helmfile.d` (see...
[16:43:15] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:43:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:47:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:49:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:59:48] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (10Bethany) >>! In T289892#7317738, @jcrespo wrote: > Hi, @Bethany, I can process your request with no issue, but might I request to update your email (and verify it) on your acc...
[17:00:05] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1700).
[17:00:10] <ryankemper>	 !log T289483 Pooled `wdqs1013`
[17:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:15] <stashbot>	 T289483: asw2-c-eqiad:ge-5/0/39 - wdqs1013 - Inbound interface errors - https://phabricator.wikimedia.org/T289483
[17:00:15] <wikibugs>	 (03PS1) 10Ssingh: test_dns: add tests for durum check service [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/715561 (https://phabricator.wikimedia.org/T289536)
[17:02:18] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.84`. Pre-deploy tests passing on canary `wdqs1003`
[17:02:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:34] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@a17833c]: 0.3.84
[17:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:27] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] test_dns: add tests for durum check service [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/715561 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh)
[17:04:29] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.84` on canary `wdqs1003`; proceeding to rest of fleet
[17:04:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:05:30] <wikibugs>	 (03PS5) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493)
[17:05:59] <wikibugs>	 (03CR) 10MMandere: varnish: Containerize varnish test environment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere)
[17:09:19] <wikibugs>	 (03Abandoned) 10Hnowlan: maps: reenable tilerator on codfw new cluster [puppet] - 10https://gerrit.wikimedia.org/r/705684 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan)
[17:10:50] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@a17833c]: 0.3.84 (duration: 08m 16s)
[17:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:25] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:23] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[17:12:26] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across both test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[17:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:33] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[17:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
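The `cumin -b 1 ... 'depool && ... && pool'` line above takes one host out of rotation, restarts the service, and repools it before moving on to the next host. A minimal Python sketch of that batching idea follows. It is not how cumin is implemented: `rolling_run` and `run_step` are hypothetical names, hosts inside a batch are handled one after another rather than in parallel, and aborting the whole run on the first failed step is an assumption of the sketch.

```python
from typing import Callable, List, Sequence, Tuple


def rolling_run(
    hosts: Sequence[str],
    batch_size: int,
    steps: Sequence[str],
    run_step: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Run every step on one batch of hosts before touching the next batch.

    Steps on a host are chained like shell `&&`: the first step that
    returns False stops the run (an assumption of this sketch).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done: List[Tuple[str, str]] = []
    for start in range(0, len(hosts), batch_size):
        # Only this slice of hosts is disturbed at any one time.
        for host in hosts[start:start + batch_size]:
            for step in steps:
                if not run_step(host, step):
                    raise RuntimeError(f"{step!r} failed on {host}")
                done.append((host, step))
    return done
```

For a dry run, `run_step` can be a function that only records or prints the step and returns `True`; with `batch_size=1` and the steps `depool`, restart, `pool`, at most one host is out of rotation at any moment.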
[17:25:15] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:55] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:37:04] <wikibugs>	 (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420)
[17:37:32] <wikibugs>	 (03PS1) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569
[17:38:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (owner: 10Ryan Kemper)
[17:38:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[17:40:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:41:17] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 77.57 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:41:23] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420)
[17:43:09] <wikibugs>	 (03PS1) 10Ryan Kemper: wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001)
[17:44:39] <wikibugs>	 (03PS4) 10Kosta Harlan: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797)
[17:44:56] <ryankemper>	 !log [WDQS Deploy] Test query passing on `query.wikidata.org` and icinga looks good. This deploy is done.
[17:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:42] <wikibugs>	 (03PS2) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001)
[17:48:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper)
[17:49:14] <wikibugs>	 (03PS3) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001)
[17:49:26] <paladox>	 Jdlrobson: done: https://gerrit-review.googlesource.com/c/gerrit/+/313490,edit
[17:49:35] <paladox>	 i changed it to use self()
[17:51:48] <wikibugs>	 (03PS1) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664)
[17:52:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[17:53:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:53:29] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10Performance-Team (Radar), 10Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (10Krinkle)
[17:54:13] <wikibugs>	 (03CR) 10Urbanecm: "you need to edit wmf-config/config/itwiki.yaml (and then run composer buildDBLists to build dblists/*)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[17:56:32] <wikibugs>	 (03PS2) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664)
[17:56:36] <wikibugs>	 (03CR) 10Jdlrobson: "Is this enough to make Italian Wikipedia a group 1 wiki? I'm not too familiar with the process here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[17:56:42] <RhinosF1>	 urbanecm: won't that also need a change to wikiversions.yaml if it's not deployed on a Monday/Thursday
[17:56:42] <wikibugs>	 (03PS3) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664)
[17:57:30] <wikibugs>	 (03CR) 10RhinosF1: Italian Wikipedia is now a group 1 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson)
[17:57:58] <urbanecm>	 RhinosF1: yes, but today it's Monday 🙂
[17:58:24] <RhinosF1>	 urbanecm: yeah if it goes now it'll be good
[17:59:37] <urbanecm>	 the deployment would also differ (if wikiversions.json is changed, you also need to run scap sync-wikiversions to rebuild it on the app hosts)
[17:59:54] <RhinosF1>	 urbanecm: yeah
[18:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1800).
[18:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:01:05] <tgr>	 There are some patches in the queue now.
[18:01:14] <urbanecm>	 i like last minute additions :)
[18:01:21] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "This is ready for deploy, waiting for a review from Filippo and you or me can merge and deploy the ldap change." [puppet] - 10https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: 10Jcrespo)
[18:02:25] <urbanecm>	 tgr: wondering if https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/715529 can/should be squashed here too?
[18:02:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (10jcrespo) a:05jcrespo→03fgiunchedi
[18:03:02] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan)
[18:03:11] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[18:04:26] <tgr>	 urbanecm: sure, can you add it to the wiki?
[18:04:39] <urbanecm>	 Will do tgr. 
[18:05:17] <tgr>	 how much does it depend on the GrowthExperiments patch? do we need to backport that?
[18:06:38] <urbanecm>	 tgr: I don't think so, I intend to test it at beta anyway. 
[18:07:41] <tgr>	 ack. Should I do the deployment?
[18:08:57] <urbanecm>	 tgr: sure. 
[18:09:29] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan)
[18:10:15] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan)
[18:10:43] <tgr>	 we can also deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/714549 while we are at it
[18:10:54] <tgr>	 I guess that's more "merge" than "deploy"
[18:12:17] <urbanecm>	 Yup, just a git pull once it merges
[18:14:09] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714548|GrowthExperiments: Switch image recommendations flag off (T288797)]] (duration: 00m 57s)
[18:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:13] <stashbot>	 T288797: Add Image: Create image-recommendation task type - https://phabricator.wikimedia.org/T288797
[18:14:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 5794 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:14:29] <wikibugs>	 (03PS3) 10Gergő Tisza: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan)
[18:14:40] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan)
[18:15:28] <wikibugs>	 (03Merged) 10jenkins-bot: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan)
[18:16:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:16:20] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483)
[18:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:22] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[18:16:57] <wikibugs>	 (03PS3) 10Gergő Tisza: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[18:17:19] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[18:17:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:18:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:39] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan)
[18:21:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:22:08] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715568|GrowthExperiments: Enable link recommendation for dewiki and nlwiki (T288420 T285254)]] (duration: 00m 56s)
[18:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:13] <stashbot>	 T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254
[18:22:13] <stashbot>	 T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420
[18:22:41] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:24:04] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483)
[18:24:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:24:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:59] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:29:56] <urbanecm>	 tgr: you probably need to rebase (and re-+2) the config patch, too
[18:31:01] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:33:25] <wikibugs>	 10SRE, 10DNS, 10Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10Asaf) 05Open→03Resolved Never mind, opened a new task with updated request.
[18:34:05] <tgr>	 sorry, had to go afk for a sec.
[18:34:10] <wikibugs>	 (03PS3) 10Gergő Tisza: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:34:17] <urbanecm>	 np
[18:34:35] <tgr>	 yeah for some reason Gerrit only allows rebasing after the initial +2
[18:34:50] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:35:47] <wikibugs>	 (03Merged) 10jenkins-bot: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:38:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:57] <tgr>	 urbanecm: it's on mwdebug2001
[18:39:23] <urbanecm>	 tgr: I can't check it there, as the instrumentation code is only at beta
[18:41:03] <tgr>	 I guess we can backport the core patch at any time if we run into trouble
[18:41:21] <urbanecm>	 yeah, but this AFAIK only registers the stream
[18:41:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:51] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:41:57] <tgr>	 yeah but production will log with a different stream name for a few days
[18:42:07] <wikibugs>	 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) With the elastic SSPL changes that happened this year (T272111 T272238  etc.) is...
[18:42:19] <tgr>	 but then schema errors are not a big deal and the feature is not getting any traffic
[18:42:37] <urbanecm>	 tgr: why would it? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/715108 was merged today, few hours before you merged the fix of the schema name
[18:43:12] <logmsgbot>	 !log tgr@deploy1002 scap failed: average error rate on 3/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details)
[18:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
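The scap message above says the error rate on 3 of 6 canaries "increased by 10x". A rough sketch of what a check of that shape could look like is below. It is an illustration only, not scap's actual logic: the function name, the `floor` parameter, and the per-host rate dictionaries are all assumptions made for the example.

```python
from typing import Dict, List


def canaries_over_threshold(
    before: Dict[str, float],
    after: Dict[str, float],
    factor: float = 10.0,
    floor: float = 1.0,
) -> List[str]:
    """Return the canaries whose error rate grew by at least `factor`.

    `floor` (hypothetical) keeps a near-zero baseline from turning a
    single stray error into an alarm.
    """
    flagged = []
    for host, rate_after in after.items():
        baseline = max(before.get(host, 0.0), floor)
        if rate_after >= factor * baseline:
            flagged.append(host)
    return sorted(flagged)
```

A caller would compare `len(canaries_over_threshold(...))` against the number of canaries and refuse to continue the sync when too many are flagged.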
[18:43:19] <tgr>	 oh right the logging code is in the same branch
[18:43:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:43:29] <tgr>	 !log morning deploys done
[18:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:41] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:43:45] <tgr>	 ...actually not done.
[18:43:45] <urbanecm>	 tgr: you are aware of "scap failed: average error rate on 3/6 canaries increased by 10x " above, right? :)
[18:45:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:45:24] <dancy>	 error counters ticking up rapidly.  
[18:46:12] <tgr>	 urbanecm: I guess that array syntax for the stream name is invalid?
[18:46:17] <urbanecm>	 looks so
[18:46:24] <urbanecm>	 tgr: please revert, uploading a followup
[18:46:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 177 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:47:09] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 270 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:47:41] <wikibugs>	 (03PS1) 10Urbanecm: Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369)
[18:48:22] <logmsgbot>	 !log tgr@deploy1002 Scap failed!: 5/6 canaries failed their endpoint checks(https://en.wikipedia.org)
[18:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:01] * urbanecm running sync with --force
[18:49:17] <urbanecm>	 not
[18:49:22] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:715529|Add mediawiki.mentor_dashboard.visit schema (T289369)]] (duration: 00m 26s)
[18:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:26] <stashbot>	 T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369
[18:49:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:49:55] <tgr>	 scap could be clearer on whether a failure means that the code was deployed or not
[18:50:45] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:50:46] <urbanecm>	 i think it means "deployed to canaries", as scap doesn't know anything about git
[18:50:49] <tgr>	 "18:49:22 14 hosts had failures restarting php-fpm"
[18:51:20] <urbanecm>	 tgr: do you want to try to sync the fix, or merge revert in gerrit and try later?
[18:51:41] <tgr>	 let's sync it, it's a trivial fix
[18:51:51] <urbanecm>	 ok
[18:52:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:53:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:53:56] <urbanecm>	 tgr: the fix is at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/715579/, adding to calendar too
[18:53:58] <tgr>	 it doesn't know about git, sure. But when it says "18:48:22 sync-file failed: <RuntimeError> Scap failed!: 5/6 canaries failed their endpoint checks", does that mean the code has been deployed everywhere, and then failed a check? deployed to canaries, failed the check and still serving traffic from the canary hosts? failed the check so it has been undeployed?
[18:55:04] <tgr>	 I guess we are just rsyncing and not using a separate directory for the new code so it has to be the middle one
[18:55:35] <urbanecm>	 yeah
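tgr's reasoning above (files are rsynced in place, so a failed canary check leaves the canaries serving the new code) can be sketched as follows. This is a toy model under that stated assumption, not scap's code; `sync_in_place`, `CanaryCheckFailed`, `deploy`, and `healthy` are hypothetical names.

```python
from typing import Callable, List, Sequence


class CanaryCheckFailed(RuntimeError):
    """Raised when canaries fail their check; records which hosts hold new code."""

    def __init__(self, failed: List[str], touched: List[str]) -> None:
        super().__init__(f"{len(failed)}/{len(touched)} canaries failed their checks")
        self.failed = failed
        self.touched = touched


def sync_in_place(
    canaries: Sequence[str],
    fleet: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to canaries, check them, then deploy to the rest of the fleet."""
    touched: List[str] = []
    for host in canaries:
        deploy(host)  # overwrites the live files; there is no staging directory
        touched.append(host)
    failed = [host for host in canaries if not healthy(host)]
    if failed:
        # No rollback here: the canaries keep serving the new code until a
        # revert is synced, which is the "middle" case in the discussion.
        raise CanaryCheckFailed(failed, touched)
    for host in fleet:
        deploy(host)
        touched.append(host)
    return touched
```

In this model a failed check stops the rollout before the wider fleet, but the exception's `touched` list shows that the canaries already carry the change.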
[18:56:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:56:22] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:57:17] <wikibugs>	 (03Merged) 10jenkins-bot: Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm)
[18:58:19] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/715442 (owner: 10Mforns)
[18:59:38] <tgr>	 I did check some random wiki page with X-WM-DBG, still don't understand why that worked.
[19:00:01] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:00:07] <urbanecm>	 tgr: because events are either sent from client side or using deferred updates
[19:00:17] <urbanecm>	 there was something in logstash for mwdebug hosts
[19:00:49] <tgr>	 oh, deferred, that makes sense.
[19:01:05] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:06] <tgr>	 yeah, I didn't check the mwdebug log, shame on me.
[19:03:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:42] <tgr>	 well, it is empty now, so I guess we are good
[19:04:51] <urbanecm>	 let's hope!
[19:05:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:06:22] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715579|Fix schema definition for mediawiki.mentor_dashboard.visit (T289369)]] (duration: 00m 56s)
[19:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:26] <stashbot>	 T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369
[19:08:09] <tgr>	 !log morning deploys done for real
[19:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:35] <urbanecm>	 thanks tgr 
[19:08:54] <tgr>	 dancy: sorry, took me a while to realize the code is in production. Should I file an incident report?
[19:09:57] <dancy>	 I don't think that will be necessary. We're all back to normal now?
[19:11:06] <tgr>	 yeah.
[19:12:15] <icinga-wm>	 RECOVERY - Long running screen/tmux on snapshot1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[19:12:34] <tgr>	 1K web request errors, 10K API errors. I think the web ones have minimal impact because of the deferred, just that event logging failed. The API does not seem to use deferred so probably those requests errored out?
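tgr's reading of the error counts can be illustrated with a toy model. This is a minimal sketch, not MediaWiki code: `serve`, `web_handler`, `api_handler` and `send_event` are hypothetical names, and the assumption (tgr's own, unconfirmed in the log) is that web requests queue event submission to run after the response while API requests submit inline.

```python
from typing import Callable, List, Tuple

# A queue of callbacks to run after the response has been produced.
Deferred = List[Callable[[], None]]


def send_event() -> None:
    """Stand-in for an event submission that fails (for example a bad stream config)."""
    raise RuntimeError("event rejected")


def web_handler(deferred: Deferred) -> str:
    # Assumed web path: event logging is queued, not run inline.
    deferred.append(send_event)
    return "page"


def api_handler(deferred: Deferred) -> str:
    # Assumed API path: event logging runs inline, so its failure fails the request.
    send_event()
    return "result"


def serve(handler: Callable[[Deferred], str]) -> Tuple[int, int]:
    """Return (HTTP status, number of failed deferred jobs)."""
    deferred: Deferred = []
    try:
        handler(deferred)
    except RuntimeError:
        return 500, 0
    failed = 0
    for job in deferred:
        try:
            job()
        except RuntimeError:
            # The response already succeeded; only the side effect is lost.
            failed += 1
    return 200, failed
```

Under these assumptions the web request still returns 200 and only loses the event, while the API request returns 500, which matches the impact tgr guesses at above.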
[19:20:10] <urbanecm>	 tgr: so, i still don't see any events coming in beta (any at all), but i guess that is because of T289029
[19:20:11] <stashbot>	 T289029: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T289029
[19:24:24] <wikibugs>	 (PS6) Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493)
[19:25:21] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:48] <wikibugs>	 (PS1) Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215)
[19:46:38] <wikibugs>	 (CR) Ottomata: Add mediawiki.mentor_dashboard.visit schema (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[19:46:58] <ottomata>	 tgr urbanecm  ^
[19:47:09] <urbanecm>	 ottomata: that's why it doesn't work! thanks, you just saved me hours of digging :)
[19:47:26] <urbanecm>	 at least i got beta's ingest endpoint up again :D
[19:48:34] <wikibugs>	 (PS1) Urbanecm: Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369)
[19:48:38] <urbanecm>	 ottomata: so like that? ^^
[19:49:08] <ottomata>	 yup looks good
[19:49:15] <urbanecm>	 jouncebot: now
[19:49:15] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 10 minute(s)
[19:49:16] <wikibugs>	 (CR) Ottomata: [C: +1] Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[19:49:18] <urbanecm>	 jouncebot: next
[19:49:18] <jouncebot>	 In 0 hour(s) and 10 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2000)
[19:49:23] <urbanecm>	 i guess i can get that out then
[19:49:43] <wikibugs>	 (CR) Urbanecm: [C: +2] "third time's the charm" [mediawiki-config] - https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[19:50:41] <wikibugs>	 (Merged) jenkins-bot: Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[19:52:17] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9a92e2ae7526717a0a42b825a34b4595e75a544b: Fix mediawiki.mentor_dashboard.visits definition (duration: 00m 56s)
[19:52:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:25] <urbanecm>	 and let's see if that did the trick
[19:57:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2000). Please do the needful.
[20:00:33] <wikibugs>	 (PS5) Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673)
[20:01:55] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:05] <wikibugs>	 (CR) Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[20:02:09] <tgr>	 dancy: urbanecm: follow-ups: T290036 T290037 T290038 T290039
[20:02:09] <stashbot>	 T290039: Structure tests for stream settings in operations/mediawiki-config - https://phabricator.wikimedia.org/T290039
[20:02:09] <stashbot>	 T290037: Scap should be clearer about the need for a revert after a failed canary check - https://phabricator.wikimedia.org/T290037
[20:02:09] <stashbot>	 T290038: scap sync-file --force warns "sudo: no tty present and no askpass program specified" - https://phabricator.wikimedia.org/T290038
[20:02:10] <stashbot>	 T290036: Scap revert commands should use --force - https://phabricator.wikimedia.org/T290036
[20:02:16] <urbanecm>	 thanks!
[20:02:45] <dancy>	 Thanks tgr. We will triage those and try to make improvements.
[20:05:34] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate - Disable http service if tls.enabled [deployment-charts] - https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) (owner: Ottomata)
[20:06:25] <wikibugs>	 (CR) Ssingh: envoyproxy: Allow setting http2 protocol options (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[20:06:41] <wikibugs>	 (CR) Ssingh: [C: +1] envoyproxy: Allow setting http2 protocol options [puppet] - https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[20:07:30] <urbanecm>	 ottomata: so, that fix got deployed to beta, but I don't see any events in https://stream-beta.wmflabs.org/v2/ui/#/?streams=eventlogging_HomepageVisit,mediawiki.mentor_dashboard.visit :/. Am I doing something wrong?
[20:08:37] <urbanecm>	 EventLogging.log says nothing, EventBus.log says "DEBUG: Using destination_event_service eventgate-analytics-external for stream mediawiki.mentor_dashboard.visit.", `kafkacat -C -b deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud -t eqiad.mediawiki.mentor_dashboard.visit` is silent, stream-beta is silent :/
[20:08:44] <wikibugs>	 SRE: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (Peachey88)
[20:09:04] <ottomata>	 urbanecm:  looking
[20:09:08] <urbanecm>	 appreciated!
[20:09:20] <wikibugs>	 (CR) Ssingh: [C: +1] envoyproxy: Support alpn_protocols configuration [puppet] - https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[20:12:27] <wikibugs>	 (PS1) Zabe: labstore: remove absented /etc/exports.d/public_root.exports [puppet] - https://gerrit.wikimedia.org/r/715591
[20:14:22] <ottomata>	 urbanecm:  everything looks good from the config side
[20:14:29] <ottomata>	 i can post an event to eventgate in beta and it shows up in stream-beta
[20:14:41] <urbanecm>	 oh, so it just took some more time to fully propagate
[20:14:52] <urbanecm>	 thanks for your help ottomata 
[20:14:55] <wikibugs>	 (PS2) Zabe: labstore: remove absented /etc/exports.d/public_root.exports file [puppet] - https://gerrit.wikimedia.org/r/715591
[20:14:55] <ottomata>	 oh ya!
[20:14:56] <ottomata>	 ok cool
[20:14:59] <ottomata>	 great glad it works
[20:15:09] <ottomata>	 urbanecm:  mforns  was telling me you had some trouble getting the dev env to work?
[20:17:42] <urbanecm>	 yes! I was able to get the "new" schemas working (by running the devserver and setting `$wgEventLoggingServiceUri = 'http://localhost:8192/v1/events';`), but schemas like `HomepageVisit` (in the legacy folder) complain about `wgEventLoggingBaseUri` not being set
[20:22:03] <urbanecm>	 (also, i had to edit `node_modules/eventgate-wikimedia/eventgate-wikimedia.js` in EventLogging's devserver to load `uriHasProtocol` from `@wikimedia/url-get`; https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/master/eventgate-wikimedia.js#8 has it fixed, but I'm not sure how much https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/devserver/package.json#L8 can be bumped)
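The local check urbanecm describes (pointing `$wgEventLoggingServiceUri` at the devserver) could be exercised by hand roughly as follows. This is a hedged sketch: the URL is the one quoted in the chat, but `build_event`, `build_request`, the schema URI in the usage note, and the minimal `$schema` / `meta.stream` / `dt` field layout are assumptions for illustration, not something the log confirms. Nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Devserver endpoint as quoted in the chat above.
DEVSERVER_URL = "http://localhost:8192/v1/events"


def build_event(stream: str, schema_uri: str, now: Optional[datetime] = None) -> dict:
    """Build a minimal event body; the field layout is an assumption."""
    now = now or datetime.now(timezone.utc)
    return {
        "$schema": schema_uri,
        "meta": {"stream": stream},
        "dt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def build_request(event: dict, url: str = DEVSERVER_URL) -> urllib.request.Request:
    """Wrap the event in a JSON POST request without sending it."""
    body = json.dumps([event]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, `build_request(build_event("mediawiki.mentor_dashboard.visit", "/hypothetical/schema/1.0.0"))` yields a request that can be handed to `urllib.request.urlopen` while the devserver is running; the schema URI there is a placeholder and has to match whatever the stream config actually declares.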
[20:22:44] <James_F>	 Hmm, I note that Beta Cluster wikis still have `Musical scores are temporarily disabled`; I guess we didn't set up shellbox there yet?
[20:23:55] <urbanecm>	 James_F: I don't think deployment-prep has kubernetes to begin with
[20:24:55] <urbanecm>	 T276650 is still open
[20:24:56] <stashbot>	 T276650: Re-consider setting up a Kubernetes cluster on the Beta cluster - https://phabricator.wikimedia.org/T276650
[20:26:25] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:26:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,webperf_navtiming} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:27:08] <James_F>	 Indeed, but new services are meant to be mocked in Beta when added to prod.
[20:27:40] <urbanecm>	 in theory, but since T215217 is open, there's no one responsible for that :)
[20:27:41] <stashbot>	 T215217: deployment-prep: Code stewardship request - https://phabricator.wikimedia.org/T215217
[20:28:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:29:55] <urbanecm>	 (ottomata: see my reply a few lines above -- happy to move this discussion somewhere else, too)
[20:30:57] <wikibugs>	 (CR) Brennen Bearnes: [C: -1] gitlab cas: update uid field to use uid not CN (1 comment) [gitlab-ansible] - https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: Jbond)
[20:31:28] <James_F>	 It's also going to be a massive pain for Wikifunctions that Beta doesn't have a k8s equivalent.
[20:31:40] <wikibugs>	 (CR) Ssingh: [C: +1] envoyproxy: Allow configuring TLS handshake timeout [puppet] - https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[20:34:49] <ottomata>	 hmm interesting.
[20:34:52] <ottomata>	 urbanecm:  sorry in other convos too
[20:35:01] <ottomata>	 urbanecm:  the eventgate-wikimedia dep can be bumped to latest
[20:35:19] <urbanecm>	 np, just wanted to make sure you didn't miss it :)
[20:35:43] <wikibugs>	 SRE, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-8), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ldelench_wmf)
[20:42:40] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) @JMeybohm, I merged that and am trying to apply for eventgate-logging-external staging.  Diff looks good:  ` 20:23:31 [@deploy1002:/srv/...
[20:42:42] <wikibugs>	 (CR) Ssingh: [C: +1] envoyproxy: Add downstream idle_timeout config option [puppet] - https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[20:42:52] <urbanecm>	 ottomata: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/715595, fyi
[20:44:23] <ottomata>	 urbanecm:  i'm unsure if we should use master, or pin to an explicit version
[20:44:44] <urbanecm>	 me too -- happy to change for current latest hash :)
[21:00:04] <jouncebot>	 Reedy and sbassett: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2100).
[21:01:31] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:43] <wikibugs>	 (PS1) Zabe: swift: migrate swift-drive-audit cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806)
[21:10:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:16:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:20:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:21:54] <wikibugs>	 SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (brennen)
[21:25:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:45] <wikibugs>	 (CR) Ssingh: [C: +1] envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[21:34:32] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived - https://phabricator.wikimedia.org/T289928 (Effeietsanders) Open→Resolved a:Effeietsanders Thanks @Legoktm for digging into this! It is surprising that I'm not on wlm-announce as a member, because once i was, and i...
[21:34:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:37:48] <wikibugs>	 (PS5) BryanDavis: toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[21:37:50] <wikibugs>	 (PS2) BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881)
[21:37:52] <wikibugs>	 (PS1) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604
[21:38:29] <wikibugs>	 (PS2) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881)
[21:38:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:41:22] <wikibugs>	 (PS3) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881)
[21:47:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:52:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:57:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:57:45] <wikibugs>	 (PS1) Urbanecm: Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369)
[21:58:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:01:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:02:31] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:09] <wikibugs>	 (PS4) Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664)
[22:03:17] <wikibugs>	 (PS7) Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493)
[22:03:34] <wikibugs>	 (PS2) Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215)
[22:04:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:09:02] <wikibugs>	 (CR) Urbanecm: Italian Wikipedia is now a group 1 wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: Jdlrobson)
[22:13:11] <wikibugs>	 (CR) RobH: [C: +1] "This change updates the quotereview tool to parse the equotes for me properly now and still works for the dell team prepared quote format." [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[22:13:59] <wikibugs>	 (PS2) RobH: quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[22:15:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[22:25:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:39:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:42:35] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:43:00] <urbanecm>	 jouncebot: next
[22:43:00] <jouncebot>	 In 0 hour(s) and 16 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2300)
[22:43:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:43:53] <wikibugs>	 (CR) Urbanecm: [C: +2] "will deploy during the evening window; CI failure was an unrelated one" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[22:51:57] <wikibugs>	 (PS1) Bstorm: cloud osmdb: set num_threads in the sync job [puppet] - https://gerrit.wikimedia.org/r/715623 (https://phabricator.wikimedia.org/T285668)
[22:54:56] <wikibugs>	 (PS1) Bstorm: cloud osmdb: don't use proxy for cloud [puppet] - https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668)
[22:59:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] prometheus_local_crontabs: use a systemd timer [puppet] - https://gerrit.wikimedia.org/r/714173 (https://phabricator.wikimedia.org/T273673) (owner: Majavah)
[23:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2300).
[23:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:12] * urbanecm still waiting on CI
[23:02:09] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:04:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:05:44] <wikibugs>	 (Merged) jenkins-bot: Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm)
[23:05:48] <urbanecm>	 \o
[23:07:48] <wikibugs>	 (PS6) Andrew Bogott: rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:08:05] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/includes/Specials/SpecialHomepage.php: 9e2264a0c9a48548da4795b2a5b9d7275d254ac7: Instrument Special:MentorDashboard (T289369) (duration: 00m 57s)
[23:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:11] <stashbot>	 T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369
[23:08:13] * urbanecm done
[23:09:40] <urbanecm>	 would be done...if i synced the right file
[23:10:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:11:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/includes/Specials/SpecialMentorDashboard.php: 9e2264a0c9a48548da4795b2a5b9d7275d254ac7: Instrument Special:MentorDashboard (T289369) (duration: 00m 55s)
[23:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:24] <urbanecm>	 now it works :)
[23:11:32] <urbanecm>	 !log Evening B&C done
[23:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:02] <wikibugs>	 (PS7) Andrew Bogott: rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:13:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:13:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:14:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "This only runs on cloudcontrols, all Buster and soon to be Bullseye." [puppet] - https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:16:11] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (wiki_willy) a:Cmjohnson
[23:18:20] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337 (wiki_willy) a:Cmjohnson
[23:21:10] <wikibugs>	 (CR) Andrew Bogott: [C: +2] check_keystone_roles.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670925 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:22:13] <wikibugs>	 (CR) Andrew Bogott: [C: +1] confd/confd-lint-wrap.py: Port for Python 3 [puppet] - https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[23:22:54] <wikibugs>	 (CR) Andrew Bogott: "Does this need attention still or have y'all long since worked around it?" [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[23:24:00] <wikibugs>	 (PS2) Andrew Bogott: Nova vendordata.txt: delete systemd-coredump user [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[23:24:48] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "This is pre-puppet so should be ok." [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[23:25:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:27:21] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:39:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:43:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:50:37] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:53:04] <wikibugs>	 (CR) Bstorm: "I'd like to try merging this if you can add that dependency @Majavah. I can add it if you like as well." [puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah)
[23:55:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:27] <wikibugs>	 (CR) Bstorm: "I don't think we want ssh client in the docker images. Is there a specific use you had in mind?" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: Kosta Harlan)
[23:57:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:57:25] <wikibugs>	 (CR) Bstorm: [C: +1] "I hope we don't have to set up a new OS before the grid is decommissioned." [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[23:59:40] <wikibugs>	 (CR) Bstorm: [C: +2] "Thanks for the cleanup!" [puppet] - https://gerrit.wikimedia.org/r/715591 (owner: Zabe)