[00:02:19] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:23] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:09:07] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:24:43] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 149.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[00:24:54] (Abandoned) Tim Starling: Increase wgMaxUserDBWriteDuration to 10 on votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713024 (https://phabricator.wikimedia.org/T288831) (owner: Tim Starling)
[00:27:09] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:31] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[00:47:29] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:55] RECOVERY -
Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[01:01:37] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:21] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[01:26:17] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:57] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1145.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:52:15] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[02:02:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:53] RECOVERY - Rate of JVM GC Old
generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[02:26:17] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:50] PROBLEM - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:07:35] O.o
[03:08:30] 👋
[03:08:38] RECOVERY - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 4.656 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:08:52] eqiad, so hopefully not too serious? let's see
[03:09:15] https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=30s
[03:10:13] traffic spike about 2:50, latency and 5xx spike about 3:02?
[03:10:26] yeah, seems like
[03:11:09] interesting that it's all hosts in eqiad, but uneven in codfw
[03:11:16] oh, thumbor is pooled in eqiad, not codfw
[03:11:52] really?
qps looks like it's in the same ballpark
[03:12:08] https://config-master.wikimedia.org/discovery/discovery-basic.yaml
[03:12:38] huh
[03:13:10] we discussed moving swift over so new codfw hardware could be added: https://phabricator.wikimedia.org/T288458#7300647
[03:13:20] guess it makes sense that thumbor followed as well
[03:13:46] nod
[03:17:10] digging for logs a bit
[03:17:26] someone ("Python urllib2") is scrapping but getting hit by 429s
[03:17:37] scraping*
[03:17:59] nod
[03:26:15] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:19] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 139.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:02:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:43] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:09:51] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:15:15] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:16] (CR) Krinkle: Update configuration related to disabling Score functionality (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715194 (owner: Legoktm)
[04:20:59] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:33] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 123.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:25:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:55] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:57] SRE, User-Joe, User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (jijiki)
[04:35:59] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:36:12] SRE, ChangeProp, serviceops, SCB, and 2 others: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (jijiki)
Open→Declined Bluntly closing this, no updates/findings for a long time
[05:01:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:29] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:15:55] (PS1) Marostegui: install_server: Reimage db2110 to Buster [puppet] - https://gerrit.wikimedia.org/r/715347 (https://phabricator.wikimedia.org/T288803)
[05:16:29] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 63.05 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[05:16:57] (CR) Marostegui: [C: +2] install_server: Reimage db2110 to Buster [puppet] - https://gerrit.wikimedia.org/r/715347 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[05:18:50] (PS1) Marostegui: db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715348 (https://phabricator.wikimedia.org/T288803)
[05:22:38] (CR) Marostegui: [C: +2] db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715348 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[05:23:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 for reimage T288803', diff saved to https://phabricator.wikimedia.org/P17105 and previous config saved to /var/cache/conftool/dbconfig/20210830-052336-marostegui.json
[05:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:42] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:26:15] PROBLEM - Check
systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2110.codfw.wmnet with reason: REIMAGE
[05:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2110.codfw.wmnet with reason: REIMAGE
[05:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:29] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:19] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:04] SRE, Research, observability, Patch-For-Review: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (Legoktm) Open→Resolved This is no longer an issue because SCB is long gone, and there are no flapping alerts for this service that I've seen rec...
[06:33:19] SRE, Discovery, Recommendation-API, Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445 (Legoktm) Open→Resolved This is no longer an issue because SCB is long gone, and there are no flapping alerts for this service that I'v...
[06:38:28] !log more weight to ms-be20[62-65] - T288458
[06:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:34] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[06:46:04] (PS1) Marostegui: pc[12]007-010: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715441 (https://phabricator.wikimedia.org/T289112)
[06:46:54] (CR) Marostegui: [C: +2] pc[12]007-010: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/715441 (https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[06:53:06] !log drop an-airflow1001's old airflow logs to fix root partition almost filled up
[06:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:37] SRE, SRE-swift-storage, Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (fgiunchedi) The fact that this is a new wiki suggests to me the maintenance scripts to give Thumbor access to the containers haven't been run...
[06:57:11] SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) Essentially a puppet setting yes, `rsync::server::wrap_with_stunnel` for the server bits and then e.g. `rsync::quickdatacopy` has the option to turn on ssl on the...
[06:58:57] SRE-swift-storage, MediaWiki-extensions-Score, I18n, Patch-For-Review: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (fgiunchedi) >>! In T184871#7315757, @TheDJ wrote: > @fgiunchedi you know if that patch makes...
[07:01:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:33] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T289892 (jcrespo) p:Triage→High a:jcrespo Hi, @Bethany, I can process your request with no issue, but might I request to update your email (and verify it) on your account on Wikitech at https://...
[07:04:57] (PS3) Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305)
[07:05:28] I can't op myself here for whatever reason, can someone change the topic to set me on clinic duty?
[07:05:33] (PS1) Mforns: Fix --unitl for monitor_refine_event_sanitized_analytics_delayed [puppet] - https://gerrit.wikimedia.org/r/715442
[07:06:25] (PS4) Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305)
[07:10:07] (PS1) Jcrespo: admin: Add bgwiki (Bethany) to the list of privileged ldap only users [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892)
[07:10:13] (CR) Jelto: helmfile.d admin rename view rbac resources (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[07:10:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:12] godog: done :)
[07:12:13] (CR) Jcrespo: [C: -1] "-1 waiting for LDAP and HR records for mail to be identical (see ticket)."
[puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[07:12:40] marostegui: thank you <3 <3
[07:13:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:34] I was actually still doing stuff myself, as normally the meeting was in the afternoon
[07:14:39] happy to hand over now
[07:16:13] jynus: yeah I don't know tbh when the handover is supposed to happen but anytime works for me
[07:16:39] let me clean up the maint-announce for you and it will be all yours :-)
[07:16:46] for things during the weekend
[07:18:11] heheh ok, LMK jynus
[07:22:16] (CR) Filippo Giunchedi: profile: adapt alertmanager-webhook-logger to ECS (2 comments) [puppet] - https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: Cwhite)
[07:28:26] (CR) RhinosF1: "the email is a -ctr email. does that not mean we need expiry date & contact" [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[07:29:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:22] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:48:06] godog: o/ as FYI yesterday I downtimed the cr2-esams / cr2 eqiad alerts due to the Lumen maintenance (that will last days IIUC, sigh) so it may start alarming again in a couple of hours
[07:49:01] elukey: ack, thank you!
will keep it in mind
[07:50:06] SRE, Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived - https://phabricator.wikimedia.org/T289928 (fgiunchedi) p:Triage→Medium
[07:50:23] SRE, Traffic: cp2027 powercycled - https://phabricator.wikimedia.org/T289908 (fgiunchedi) p:Triage→Medium
[07:50:46] SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (fgiunchedi) p:Triage→Medium
[07:50:53] SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) p:Triage→Medium
[07:51:08] SRE, ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (fgiunchedi) p:Triage→Medium
[07:55:15] SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[07:55:15] (PS1) PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446
[07:56:32] SRE, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (JMeybohm) Open→Resolved
[07:58:36] (PS6) DCausse: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275)
[08:01:48] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:09] SRE, MW-on-K8s, SRE Observability, serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (fgiunchedi) p:Triage→Medium
[08:03:37] SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to
downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) Unless there are objections let's go with (b), do you need command line access or web interface is fine @bren...
[08:03:50] SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) >>! In T255871#7261361, @Ottomata wrote: > I think that will do it. helm template looks good locally. > > @JMeybohm is it ok that I mov...
[08:04:45] SRE, Anti-Harassment, IP Info, serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (fgiunchedi) p:Triage→Medium
[08:05:11] SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:18] SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (JMeybohm) Open→Resolved a:JMeybohm Remove the non-TLS k8s service will be handled via T255871
[08:05:24] SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:34] SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:40] SRE, Traffic: cp2027 powercycled - https://phabricator.wikimedia.org/T289908 (Vgutierrez) Open→Resolved a:Vgutierrez all seems to be OK with cp2027, I just repooled it. Thanks @elukey!
[08:05:46] SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (JMeybohm) Open→Resolved Remove the non-TLS k8s service will be handled via T255871
[08:06:04] SRE, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (JMeybohm) Open→Resolved Remove the non-TLS k8s service will be handled via T255871
[08:06:15] (PS1) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:06:17] (PS1) JMeybohm: termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581)
[08:06:19] (PS1) JMeybohm: citoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868)
[08:06:21] (PS1) JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869)
[08:06:24] (PS1) JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875)
[08:06:26] (PS1) JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878)
[08:06:28] (PS1) JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879)
[08:07:48] SRE, Services (watching), User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (fgiunchedi) p:Triage→Medium
[08:07:48] (CR) jerkins-bot: [V: -1] citoid:
Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: JMeybohm)
[08:07:55] SRE, MediaWiki-Uploading, Traffic, serviceops, Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (fgiunchedi) p:Triage→Medium
[08:08:01] SRE, docker-pkg, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (fgiunchedi) p:Triage→Medium
[08:08:03] (CR) jerkins-bot: [V: -1] termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[08:08:07] (CR) jerkins-bot: [V: -1] zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: JMeybohm)
[08:08:15] (CR) jerkins-bot: [V: -1] mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[08:08:27] SRE, SRE-swift-storage, Thumbor, Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (fgiunchedi) p:Triage→Medium
[08:08:29] (CR) jerkins-bot: [V: -1] wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: JMeybohm)
[08:08:33] (CR) jerkins-bot: [V: -1] cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: JMeybohm)
[08:08:41] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json -
https://phabricator.wikimedia.org/T257093 (fgiunchedi) p:Triage→Medium
[08:09:08] SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) p:Triage→Medium
[08:11:24] (CR) DCausse: [C: +2] rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446 (owner: PipelineBot)
[08:14:23] (Merged) jenkins-bot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715446 (owner: PipelineBot)
[08:19:22] (CR) Jcrespo: [C: -1] admin: Add bgwiki (Bethany) to the list of privileged ldap only users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[08:25:45] (PS2) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:25:47] (PS2) JMeybohm: termbox: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581)
[08:25:49] (PS2) JMeybohm: citoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868)
[08:25:51] (PS2) JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869)
[08:25:53] (PS2) JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875)
[08:25:55] (PS2) JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878)
[08:25:57] (PS2)
JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879)
[08:25:59] (PS1) JMeybohm: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - https://gerrit.wikimedia.org/r/715454
[08:26:28] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:17] (CR) RhinosF1: admin: Add bgwiki (Bethany) to the list of privileged ldap only users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[08:27:48] SRE, SRE-swift-storage, Thumbor, Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (jcrespo) >>! In T281780#7315315, @AntiCompositeNumber wrote: > The thumbnail not existing in Swift is certainly...
[08:29:12] (CR) JMeybohm: [C: +1] helmfile.d admin rename view rbac resources [deployment-charts] - https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:37:46] SRE, Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (ema) @MMandere: is there anything left to do here? If not, let's close the task!
[08:41:44] SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema)
[08:41:51] SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema) p:Triage→Low
[08:42:02] (PS5) Volans: lldp fact: add new parent key to lldp [puppet] - https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: Jbond)
[08:42:57] SRE, SRE Observability, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (ema)
[08:51:48] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:55] SRE-swift-storage: Puppetize container creation for applications that don't create containers - https://phabricator.wikimedia.org/T289976 (fgiunchedi)
[08:56:45] (CR) Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: Jcrespo)
[08:57:20] !log +100G to prometheus/global in codfw
[08:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:37] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet
[08:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet
[08:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:27] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[09:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:00] !log hnowlan@cumin1001
END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [09:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [09:01:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1006.eqiad.wmnet with reason: Resyncing from master [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:50] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:42] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1150.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:11:11] (03PS1) 10David Caro: ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) [09:11:53] 10SRE, 10Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (10MMandere) @ema: All is set here. I will go ahead and close the task. 
[09:12:04] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro) [09:12:45] 10SRE, 10Traffic: Prometheus Varnish exporter unit should depend on Varnish - https://phabricator.wikimedia.org/T283660 (10MMandere) 05Open→03Resolved [09:14:37] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) a:05JMeybohm→03Jelto [09:22:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:51] 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10JMeybohm) [09:24:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:25:48] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:36] (03PS1) 10Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) [09:31:11] (03PS4) 10Ladsgroup: Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) [09:31:46] (03CR) 10Ladsgroup: [C: 03+2] Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [09:32:30] (03Merged) 10jenkins-bot: Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [09:34:07] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:703476|Set $wgIncludejQueryMigrate to false in group0 (T280944)]] (duration: 00m 57s) [09:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:11] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [09:34:31] (03CR) 10Filippo Giunchedi: "The debian-glue job failed because" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi) [09:37:59] (03PS4) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 [09:38:07] (03CR) 10David Caro: "All the changes in PCC are expected (adding the before relationship)." 
[puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro) [09:38:32] (03CR) 10Jcrespo: "Followup?" [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo) [09:45:39] (03PS3) 10Vgutierrez: varnish: Do not assume that UDS implies PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/713226 (https://phabricator.wikimedia.org/T285374) [09:45:41] (03PS1) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [09:46:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:29] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30905/console" [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [09:48:39] (03CR) 10Ladsgroup: [C: 04-1] dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:49:35] (03CR) 10Ladsgroup: [C: 03+1] osm: migrate cron osm_sync_lag to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:50:24] (03PS2) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [09:51:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] (03PS3) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [09:58:23] (03CR) 10Vgutierrez: "tested in labs setting profile::cache::varnish::frontend::uds_owner to envoy:" [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [09:59:41] (03PS1) 10Jbond: puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) [10:00:17] (03PS2) 10Jbond: puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) [10:01:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30906/console" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:02:22] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:05] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10jbond) 05Open→03Resolved a:03jbond this seems resolved, boldly closing, please re-open if missed something [10:08:40] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:13:44] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse) [10:14:18] (03CR) 10Ema: varnish: Allow configuring UDS 
owner/group/perms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [10:14:38] (03CR) 10Volans: [C: 03+1] "I took the liberty to fix some typos in the commit message. The code looks sane to me, although if you want to be extra careful and want t" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:15:33] (03CR) 10Volans: [C: 03+1] "LGTM once the parent change has been tested." [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:16:13] (03Merged) 10jenkins-bot: flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse) [10:19:27] (03CR) 10Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo) [10:21:28] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:21:37] (03CR) 10Volans: "LGTM but I'd like to get some buy-in from the service owners that use those facts in their code to know if they might need to use them fro" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:21:46] (03PS4) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [10:21:52] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[10:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:07] (03CR) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [10:25:26] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:24] (03CR) 10Ema: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [10:28:56] 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10fgiunchedi) [10:30:04] 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10fgiunchedi) @Dzahn perhaps do you know what to do ? or know who might know? thank you! [10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1030). 
[10:44:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:44:44] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:45:27] (03CR) 10Jelto: [C: 03+2] helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [10:48:10] (03Merged) 10jenkins-bot: helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [10:53:37] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:24] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:55:30] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:56:58] (03PS1) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) [11:00:00] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:02:08] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:02] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:15:58] (03PS2) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) [11:17:36] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:17:52] (03PS3) 10Jbond: puppetdb: 
block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) [11:21:44] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/715477 (owner: 10L10n-bot) [11:26:50] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:19] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:16] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:34] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the updated commit message with the analysis. I guess that in case of use cases in which people might need those facts we" [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [11:41:58] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:43:54] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:47:18] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[11:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:48] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:48:01] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:23] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:41] (03PS2) 10Mforns: Fix --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/715442 [11:52:49] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:32] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 43 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:00:48] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:01:18] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:52] 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Bennylin) @Aklapper: Got it @fgiunchedi: Maybe because not many new wiki chose to enable upload, so this step was not added to the SOP for cre... 
[12:02:40] (03PS1) 10Jbond: P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488 [12:06:22] 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10RhinosF1) >>! In T289860#7317728, @fgiunchedi wrote: > The fact that this is a new wiki suggests to me the maintenance scripts to give Thumbor... [12:06:54] Amir1: that's new wiki related ^ [12:08:23] I fix it [12:08:58] tbh I'm not sure if it really warrants a UBN [12:09:00] but meh [12:09:00] Amir1: was wondering if you knew whether add wiki was being silly that day too [12:09:14] No I'm not entirely sure either it's UBN [12:10:24] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:12:20] !log ladsgroup@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=jvwikisource --backend=local-multiwrite (T289860) [12:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] T289860: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 [12:12:53] godog: found your person to blame and give the sticker for fixing it ^ [12:13:29] 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10RhinosF1) \o/ works! [12:13:32] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10ayounsi) [12:13:34] RhinosF1: haha! 
I'll have to buy $beverage for Amir1 [12:13:37] Amir1: thank you <3 [12:13:47] 10SRE, 10SRE-swift-storage, 10Thumbor, 10User-Ladsgroup: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Somehow that was missed or didn't run properly during creation of the wik... [12:13:52] but yeah definitely not UBN [12:14:03] 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10ayounsi) 05Resolved→03Open Some of them seem duplicated, see the red ones on https://netbox.wikimedia.org/extras/reports/cables.Cables/ [12:14:46] Amir1: ty! [12:14:49] godog: ^^ I tried marostegui but he's grumpy today :D [12:15:04] hahaha [12:15:04] I'll find my beer somehow :P [12:15:44] Amir1: if I ever get to an in person event (one day I will), I'll get you one [12:15:45] actually, I should run pdf cleaner (what I ran for commons image table) on all wikisource wikis [12:15:57] ahah yeah no a problem, the local Speti will help [12:16:14] Späti [12:16:16] RhinosF1: no Lager, I hate Lager :D [12:16:32] Amir1: :) [12:16:56] godog: the local Spati next to my home speak Persian. It's sooooo convenient [12:17:16] haha! that's amazing [12:18:50] (03PS2) 10Jbond: P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488 [12:22:10] 10SRE, 10SRE-swift-storage, 10Thumbor, 10User-Ladsgroup: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10Bennylin) It works now. Tyvm! 
[12:26:26] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:02] (03CR) 10Jbond: [C: 03+2] P:puppetdb::database: ensure users are all created before db's [puppet] - 10https://gerrit.wikimedia.org/r/715488 (owner: 10Jbond) [12:30:46] (03CR) 10Ayounsi: [C: 03+1] "Overall LGTM. We might need to revisit it the day we have Linux based switches." [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [12:33:58] (03PS1) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) [12:38:26] (03PS1) 10Jbond: puppetdb - cloud: add readonly user to config [puppet] - 10https://gerrit.wikimedia.org/r/715493 [12:39:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) > (So row B in 10G, C8, and/or D5.) New cloud hosts only go in cloud racks, so no row B. [12:40:12] (03CR) 10Jbond: [C: 03+2] puppetdb - cloud: add readonly user to config [puppet] - 10https://gerrit.wikimedia.org/r/715493 (owner: 10Jbond) [12:44:36] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:41] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ayounsi) Note that DNS PTRs are missing at least for: ` 185.71.138.139 185.71.138.141 185.71.138.140 ` [12:44:45] (03CR) 10Gehel: [C: 04-1] "I think there is a missing entry for wcqs.svc.codfw.wmnet. See inline comment." 
[dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [12:45:25] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) >>! In T289536#7318613, @ayounsi wrote: > Note that DNS PTRs are missing at least for: > ` > 185.71.138.139 > 185.71.138.141 > 185.71.138.140 > ` Oh right, good catch, thank... [12:48:49] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) 05Stalled→03Resolved a:03ayounsi Noted, thanks! Yeah fine to close for now, and re-open if any issues. [12:50:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:51:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:13] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) [12:52:19] (03CR) 10Gehel: blazegraph: Setup new wcqs instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [12:59:38] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10fgiunchedi) It occurred to me that as part of uid/gid preprovision we should detect if swift was previously on the host (i.e. there are labeled filesyste... 
[12:59:54] (03PS5) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 [13:01:14] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:54] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [13:08:13] (03PS5) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [13:08:44] (03CR) 10Ema: [C: 03+1] varnish: Allow configuring UDS owner/group/perms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [13:09:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:07] (03CR) 10Vgutierrez: [C: 03+2] varnish: Do not assume that UDS implies PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/713226 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [13:11:18] (03PS6) 10Vgutierrez: varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) [13:14:16] (03CR) 10Vgutierrez: [C: 03+2] varnish: Allow configuring UDS owner/group/perms [puppet] - 10https://gerrit.wikimedia.org/r/715460 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [13:15:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:10] (03PS6) 10Vgutierrez: varnish: 
Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) [13:18:21] (03CR) 10Jcrespo: "Will do." [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo) [13:20:28] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) 05Resolved→03Open @SimoneThisDot by any chance, do you have a @wikimedia.org email that was provided to you? An alert has been fired about this access on production, and we woul... [13:21:30] (03CR) 10Vgutierrez: [C: 03+2] varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [13:21:58] (03PS1) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) [13:26:22] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:24] (03PS1) 10Ssingh: wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) [13:29:54] (03CR) 10Ssingh: "I am not sure how to handle the *.check case, so I came up with "yes" and "no". (Do we/should we have PTR records for this case? 
:)" [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [13:32:55] (03CR) 10Ayounsi: [C: 03+1] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [13:40:30] 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) Hello! > I believe that the top-level http and meta properties are in a sense "owned" by the intake... [13:43:45] jouncebot: now [13:43:45] No deployments scheduled for the next 3 hour(s) and 16 minute(s) [13:44:17] jouncebot: next [13:44:17] In 3 hour(s) and 15 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1700) [13:44:26] (03CR) 10Urbanecm: [C: 03+2] Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm) [13:44:31] (03PS2) 10Urbanecm: Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 [13:44:36] (03CR) 10Urbanecm: [C: 03+2] Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm) [13:45:23] (03Merged) 10jenkins-bot: Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 (owner: 10Urbanecm) [13:45:54] 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [13:47:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:48:02] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6fbcc93f429ff3fbca98aeecdee4f33f022ca7c3: Add missing edit*protected rights to $wgAvailableRights (duration: 00m 56s) [13:48:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:14] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:49:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:49:32] (03CR) 10Jbond: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: 10Jcrespo) [13:50:29] (03PS2) 10Urbanecm: knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) [13:50:33] (03CR) 10Urbanecm: [C: 03+2] knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) (owner: 10Urbanecm) [13:51:55] (03Merged) 10jenkins-bot: knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) (owner: 10Urbanecm) [13:52:25] (03PS2) 10Urbanecm: Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 
(https://phabricator.wikimedia.org/T280307) [13:52:28] (03CR) 10Urbanecm: [C: 03+2] Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [13:52:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:51] (03CR) 10Andrew Bogott: [C: 03+1] ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro) [13:53:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f1a178e1d4d7c98a1988da68982f97848f390c68: knwiki: Disable wmgNewUserMessageOnAutoCreate (T289333) (duration: 00m 57s) [13:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:18] T289333: Disable wmgNewUserMessageOnAutoCreate from Extension:NewUserMessage on knwiki - https://phabricator.wikimedia.org/T289333 [13:53:50] (03Merged) 10jenkins-bot: Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [13:54:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:55:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b17015395cc592e021a4ca8ce6f81b699bb77381: Growth mentor dashboard: Enable beta features only on beta wikis (T280307) (duration: 00m 55s) [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:13] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 
[13:55:16] * urbanecm done [13:56:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:59:23] (03PS2) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) [13:59:50] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:00:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:40] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:28] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 32 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:06:02] (03PS1) 10Ladsgroup: alertmanager: Add Wikidata team to alert manager [puppet] - 10https://gerrit.wikimedia.org/r/715505 (https://phabricator.wikimedia.org/T287741) [14:18:00] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[14:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: Add Wikidata team to alert manager [puppet] - 10https://gerrit.wikimedia.org/r/715505 (https://phabricator.wikimedia.org/T287741) (owner: 10Ladsgroup) [14:21:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet [14:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet [14:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master [14:21:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master [14:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:38] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:23] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10fgiunchedi) [14:31:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:37:10] PROBLEM - 
Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:39:00] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [14:39:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:39:19] 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [14:41:00] RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [14:44:27] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [14:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:13] 10SRE, 10Traffic: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [14:45:16] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) > Unless there are objections let's go with (b), do you need command line access or web interface is fine @brenn... 
[14:47:28] 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10Dzahn) a:03Dzahn [14:47:57] (03PS2) 10Ssingh: wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) [14:48:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:49:16] (03PS1) 10Urbanecm: Add mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) [14:52:04] (03PS2) 10Urbanecm: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) [14:52:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:26] (03PS1) 10DCausse: rdf-streaming-updater: Adjust memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715531 [14:53:28] (03PS1) 10DCausse: flink-session-cluster: Remove service.name from the ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/715532 [14:58:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:01:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:32] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10akosiaris) Thanks for this writeup. Let me 
start by saying that of the 3 solutions, the basic idea of the 3rd one should be the one that we aim for in the long run (but not now). Deploym... [15:03:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) [15:03:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) >>! In T289882#7318587, @ayounsi wrote: >> (So row B in 10G, C8, and/or D5.) > New cloud hosts only go in cloud racks, so no... [15:03:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) a:03nskaggs @nskaggs: Who in WMCS is going to be point on these servers? I ask so we can assign them this racking task, s... 
[15:05:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:07:44] (03PS1) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009) [15:08:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:09] (03CR) 10Ema: varnish: Containerize varnish test environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [15:11:20] PROBLEM - Long running screen/tmux on snapshot1009 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 32809, 1728364s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [15:13:16] oh? [15:13:22] lemme see about that [15:13:53] fixed, apologies! [15:15:08] tsk tsk! 
[15:15:27] yeah totally my bad, I usually close those out when done and I dropped the ball on that one [15:16:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:17:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:24] (03CR) 10Ema: Add Varnish SLO dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [15:24:26] (03PS1) 10Vgutierrez: varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) [15:25:55] (03CR) 10Herron: [C: 03+1] "I think we're in good shape to deploy this!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [15:26:44] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:44] (03PS9) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [15:28:46] (03PS9) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [15:28:49] (03PS9) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [15:28:50] (03PS8) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 
(https://phabricator.wikimedia.org/T271421) [15:28:52] (03PS8) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [15:28:54] (03PS8) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [15:28:56] (03PS3) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [15:28:58] (03PS2) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) [15:29:00] (03PS2) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) [15:29:02] (03PS3) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [15:31:15] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10akosiaris) [15:32:58] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:33:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:37:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:38:36] 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10fgiunchedi) Update: Grafana-based performance alerts have been migrated to AlertManager, and as such show up at https://alerts.wikimed... [15:38:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:48] (03PS1) 10Ladsgroup: icinga: Drop grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/715543 (https://phabricator.wikimedia.org/T287741) [15:44:50] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:50:42] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: Drop grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/715543 (https://phabricator.wikimedia.org/T287741) (owner: 10Ladsgroup) [15:53:33] Hey @paladox I was wondering if there's anything I can do to help move https://gerrit-review.googlesource.com/c/gerrit/+/313490 along? 
[15:56:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:11] (03PS1) 10Ssingh: durum: fix notifying the uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715546 [15:57:27] Oh, i forgot about that. I just need to add some docs i think... though i'm not exactly sure if i did that correctly. [15:58:28] (03PS1) 10Jbond: C:puppetdb::app: move blacklist file to correct config [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578) [15:58:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30907/console" [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:59:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetdb::app: move blacklist file to correct config [puppet] - 10https://gerrit.wikimedia.org/r/715547 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:00:14] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30908/console" [puppet] - 10https://gerrit.wikimedia.org/r/715546 (owner: 10Ssingh) [16:01:40] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:22] (03CR) 10Andrew Bogott: [C: 03+2] ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro) [16:02:27] @paladox yeh it looks like 
you did it correctly. They just want it expanded with documentation. [16:02:28] (03PS2) 10Andrew Bogott: ceph: fix keyring race condition [puppet] - 10https://gerrit.wikimedia.org/r/715455 (https://phabricator.wikimedia.org/T289700) (owner: 10David Caro) [16:02:39] (and tests) [16:04:44] (03CR) 10BBlack: [C: 03+1] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:06:48] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02183 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:12:14] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30909/console" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:13:38] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "pcc output: https://puppet-compiler.wmflabs.org/compiler1003/30909/" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:15:03] (03CR) 10Ssingh: [C: 03+2] wikimedia-dns: update PTR records for durum [dns] - 10https://gerrit.wikimedia.org/r/715499 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:15:50] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:16:21] !log running authdns-update for Gerrit 715499 [16:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:28] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:18:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: fix notifying the uWSGI service [puppet] - 10https://gerrit.wikimedia.org/r/715546 (owner: 10Ssingh) [16:20:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet [16:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1008.eqiad.wmnet [16:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1008.eqiad.wmnet with reason: Resyncing from master [16:20:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1008.eqiad.wmnet with reason: Resyncing from master [16:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:49] (03PS1) 10Zabe: osm: remove absented osm_sync_lag cron [puppet] - 10https://gerrit.wikimedia.org/r/715552 (https://phabricator.wikimedia.org/T273673) [16:23:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:26:48] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:54] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cassandra-metrics-collector.timer 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:32:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:35] (03CR) 10Legoktm: [C: 04-1] Set permission of creating short url to everyone everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [16:33:49] (03CR) 10Hnowlan: [C: 03+2] maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 (owner: 10MSantos) [16:34:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE [16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:43:13] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably 
call `ml-services` dir under `helmfile.d` (see... [16:43:15] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:47:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:59:48] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T289892 (10Bethany) >>! In T289892#7317738, @jcrespo wrote: > Hi, @Bethany, I can process your request with no issue, but might I request to update your email (and verify it) on your acc... [17:00:05] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1700). [17:00:10] !log T289483 Pooled `wdqs1013` [17:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:15] T289483: asw2-c-eqiad:ge-5/0/39 - wdqs1013 - Inbound interface errors - https://phabricator.wikimedia.org/T289483 [17:00:15] (03PS1) 10Ssingh: test_dns: add tests for durum check service [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/715561 (https://phabricator.wikimedia.org/T289536) [17:02:18] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.84`. 
Pre-deploy tests passing on canary `wdqs1003` [17:02:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:34] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@a17833c]: 0.3.84 [17:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:27] (03CR) 10Ssingh: [C: 03+2] test_dns: add tests for durum check service [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/715561 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [17:04:29] !log [WDQS Deploy] Tests passing following deploy of `0.3.84` on canary `wdqs1003`; proceeding to rest of fleet [17:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:30] (03PS5) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [17:05:59] (03CR) 10MMandere: varnish: Containerize varnish test environment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [17:09:19] (03Abandoned) 10Hnowlan: maps: reenable tilerator on codfw new cluster [puppet] - 10https://gerrit.wikimedia.org/r/705684 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [17:10:50] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@a17833c]: 0.3.84 (duration: 08m 16s) [17:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:25] RECOVERY - Check systemd state on 
planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:23] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [17:12:26] !log [WDQS Deploy] Restarted `wdqs-categories` across both test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [17:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:33] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:15] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:37:04] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) [17:37:32] (03PS1) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 [17:38:03] (03CR) 10jerkins-bot: [V: 04-1] wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (owner: 10Ryan Kemper) [17:38:34] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Enable link recommendation for dewiki and 
nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [17:40:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:17] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 77.57 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:41:23] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) [17:43:09] (03PS1) 10Ryan Kemper: wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001) [17:44:39] (03PS4) 10Kosta Harlan: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) [17:44:56] !log [WDQS Deploy] Test query passing on `query.wikidata.org` and icinga looks good. This deploy is done. 
[17:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:42] (03PS2) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) [17:48:13] (03CR) 10jerkins-bot: [V: 04-1] wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [17:49:14] (03PS3) 10Ryan Kemper: wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) [17:49:26] Jdlrobson: done: https://gerrit-review.googlesource.com/c/gerrit/+/313490,edit [17:49:35] i changed it to use self() [17:51:48] (03PS1) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) [17:52:55] (03CR) 10jerkins-bot: [V: 04-1] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:53:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:29] 10SRE, 10Privacy Engineering, 10Traffic, 10Performance-Team (Radar), 10Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (10Krinkle) [17:54:13] (03CR) 10Urbanecm: "you need to edit wmf-config/config/itwiki.yaml (and then run composer buildDBLists to build dblists/*)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:56:32] (03PS2) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) [17:56:36] (03CR) 10Jdlrobson: "Is this enough 
to make Italian Wikipedia a group 1 wiki? I'm not too familiar with the process here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:56:42] urbanecm: won't that also need a change to wikiversions.yaml if not deploy on a monday/Thursday [17:56:42] (03PS3) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) [17:57:30] (03CR) 10RhinosF1: Italian Wikipedia is now a group 1 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:57:58] RhinosF1: yes, but today it's Monday 🙂 [17:58:24] urbanecm: yeah if it goes now it'll be good [17:59:37] the deployment would also differ (if wikiversions.json is changed, you also need to run scap sync-wikiversions to rebuild it on the app hosts) [17:59:54] urbanecm: yeah [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:01:05] There are some patches in the queue now. [18:01:14] i like last minute additions :) [18:01:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:36] (03CR) 10Jcrespo: [C: 03+1] "This is ready for deploy, waiting for a review from Filippo and you or me can merge and deploy the ldap change." [puppet] - 10https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: 10Jcrespo) [18:02:25] tgr: wondering if https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/715529 can/should be squashed here too? 
[18:02:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T289892 (10jcrespo) a:05jcrespo→03fgiunchedi [18:03:02] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan) [18:03:11] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [18:04:26] urbanecm: sure, can you add it to the wiki? [18:04:39] Will do tgr. [18:05:17] how much does it depend on the GrowthExperiments patch? do we need to backport that? [18:06:38] tgr: I don't think so, I intend to test it at beta anyway. [18:07:41] ack. Should I do the deployment? [18:08:57] tgr: sure. [18:09:29] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan) [18:10:15] (03Merged) 10jenkins-bot: GrowthExperiments: Switch image recommendations flag off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714548 (https://phabricator.wikimedia.org/T288797) (owner: 10Kosta Harlan) [18:10:43] we can also deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/714549 while we are at it [18:10:54] I guess that's more "merge" than "deploy" [18:12:17] Yup, just a git pull once it merges [18:14:09] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714548|GrowthExperiments: Switch image recommendations flag off (T288797)]] (duration: 00m 57s) [18:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:13] T288797: Add Image: Create image-recommendation task type - 
https://phabricator.wikimedia.org/T288797 [18:14:17] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 5794 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:14:29] (03PS3) 10Gergő Tisza: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan) [18:14:40] (03CR) 10Gergő Tisza: [C: 03+2] [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan) [18:15:28] (03Merged) 10jenkins-bot: [labs] GrowthExperiments: Switch image recommendations flag on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714549 (owner: 10Kosta Harlan) [18:16:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:16:20] (03PS1) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) [18:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:22] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [18:16:57] (03PS3) 10Gergő Tisza: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [18:17:19] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [18:17:31] PROBLEM - Router interfaces 
on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:18:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:39] (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation for dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715568 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [18:21:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:22:08] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715568|GrowthExperiments: Enable link recommendation for dewiki and nlwiki (T288420 T285254)]] (duration: 00m 56s) [18:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:13] T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254 [18:22:13] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [18:22:41] (03CR) 10Gergő Tisza: [C: 03+2] Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:24:04] (03PS2) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) [18:24:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:59] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:29:56] tgr: you probably need to rebase (and re-+2) the config patch, too [18:31:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:33:25] 10SRE, 10DNS, 10Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10Asaf) 05Open→03Resolved Never mind, opened a new task with updated request. [18:34:05] sorry, had to go afk for a sec. 
[18:34:10] (03PS3) 10Gergő Tisza: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:34:17] np [18:34:35] yeah for some reason Gerrit only allows rebasing after the initial +2 [18:34:50] (03CR) 10Gergő Tisza: [C: 03+2] Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:35:47] (03Merged) 10jenkins-bot: Add mediawiki.mentor_dashboard.visit schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:38:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:57] urbanecm: it's on mwdebug2001 [18:39:23] tgr: I can't check it there, as the instrumentation code is only at beta [18:41:03] I guess we can backport the core patch at any time if we run into trouble [18:41:21] yeah, but this AFAIK only registers the stream [18:41:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:51] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:57] yeah but production will log with a different stream name for a few days [18:42:07] 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) With the elastic SSPL changes that happened this year (T272111 T272238 etc.) is... [18:42:19] but then schema errors are not a big deal and the feature is not getting any traffic [18:42:37] tgr: why would it? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/715108 was merged today, few hours before you merged the fix of the schema name [18:43:12] !log tgr@deploy1002 scap failed: average error rate on 3/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [18:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:19] oh right the logging code is in the same branch [18:43:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:43:29] !log morning deploys done [18:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:45] ...actually not done. [18:43:45] tgr: you are aware of "scap failed: average error rate on 3/6 canaries increased by 10x " above, right? :) [18:45:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:45:24] error counters ticking up rapidly. [18:46:12] urbanecm: I guess that array syntax for the stream name is invalid? [18:46:17] looks so [18:46:24] tgr: please revert, uploading a followup [18:46:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 177 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:09] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 270 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:41] (03PS1) 10Urbanecm: Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369) [18:48:22] !log tgr@deploy1002 Scap failed!: 5/6 canaries failed their endpoint checks(https://en.wikipedia.org) [18:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:01] * urbanecm running sync with --force [18:49:17] not [18:49:22] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:715529|Add mediawiki.mentor_dashboard.visit schema (T289369)]] (duration: 00m 26s) [18:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:26] T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369 [18:49:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:49:55] scap could be clearer on whether a failure means that the code was deployed or not [18:50:45] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:50:46] i think it means "deployed to canaries", as scap doesn't know anything about git [18:50:49] "18:49:22 14 hosts had failures restarting php-fpm" [18:51:20] tgr: do you want to try to sync the fix, or merge revert in gerrit and try later? [18:51:41] let's sync it, it's a trivial fix [18:51:51] ok [18:52:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:53:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:53:56] tgr: the fix is at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/715579/, adding to calendar too [18:53:58] it doesn't know about git, sure. But when it says "18:48:22 sync-file failed: Scap failed!: 5/6 canaries failed their endpoint checks", does that mean the code has been deployed everywhere, and then failed a check? deployed to canaries, failed the check and still serving traffic from the canary hosts? failed the check so it has been undeployed? 
[18:55:04] I guess we are just rsyncing and not using a separate directory for the new code so it has to be the middle one [18:55:35] yeah [18:56:21] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:56:22] (03CR) 10Gergő Tisza: [C: 03+2] Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:57:17] (03Merged) 10jenkins-bot: Fix schema definition for mediawiki.mentor_dashboard.visit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715579 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [18:58:19] (03CR) 10Ottomata: [C: 03+2] Fix --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/715442 (owner: 10Mforns) [18:59:38] I did check some random wiki page with X-WM-DBG, still don't understand why that worked. [19:00:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:07] tgr: because events are either sent from client side or using DeferredUpdates [19:00:17] there was something in logstash for mwdebug hosts [19:00:49] oh, deferred, that makes sense. [19:01:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:06] yeah, I didn't check the mwdebug log, shame on me. [19:03:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:42] well, it is empty now, so I guess we are good [19:04:51] let's hope! 
[19:05:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:06:22] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715579|Fix schema definition for mediawiki.mentor_dashboard.visit (T289369)]] (duration: 00m 56s) [19:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:26] T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369 [19:08:09] !log morning deploys done for real [19:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:35] thanks tgr [19:08:54] dancy: sorry, took me a while to realize the code is in production. Should I file an incident report? [19:09:57] I don't think that will be necessary. We're all back to normal now? [19:11:06] yeah. [19:12:15] RECOVERY - Long running screen/tmux on snapshot1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:12:34] 1K web request errors, 10K API errors. I think the web ones have minimal impact because of the deferred, just that event logging failed. The API does not seem to use deferred so probably those requests errored out? 
[19:20:10] tgr: so, i still don't see any events coming in beta (any at all), but i guess that is because of T289029 [19:20:11] T289029: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T289029 [19:24:24] (03PS6) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [19:25:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:48] (03PS1) 10Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) [19:46:38] (03CR) 10Ottomata: Add mediawiki.mentor_dashboard.visit schema (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715529 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [19:46:58] tgr urbanecm ^ [19:47:09] ottomata: that's why it doesn't work! thanks, you just saved me hours of digging :) [19:47:26] at least i got beta's ingest endpoint up again :D [19:48:34] (03PS1) 10Urbanecm: Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) [19:48:38] ottomata: so like that? 
^^ [19:49:08] yup looks good [19:49:15] jouncebot: now [19:49:15] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [19:49:16] (03CR) 10Ottomata: [C: 03+1] Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [19:49:18] jouncebot: next [19:49:18] In 0 hour(s) and 10 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2000) [19:49:23] i guess i can get that out then [19:49:43] (03CR) 10Urbanecm: [C: 03+2] "third time's the charm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [19:50:41] (03Merged) 10jenkins-bot: Fix mediawiki.mentor_dashboard.visit's definition #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715588 (https://phabricator.wikimedia.org/T289369) (owner: 10Urbanecm) [19:52:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9a92e2ae7526717a0a42b825a34b4595e75a544b: Fix mediawiki.mentor_dashboard.visits definition (duration: 00m 56s) [19:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:25] and let's see if that did the trick [19:57:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2000). Please do the needful. 
[20:00:33] (03PS5) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [20:01:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:05] (03CR) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:02:09] dancy: urbanecm: follow-ups: T290036 T290037 T290038 T290039 [20:02:09] T290039: Structure tests for stream settings in operations/mediawiki-config - https://phabricator.wikimedia.org/T290039 [20:02:09] T290037: Scap should be clearer about the need for a revert after a failed canary check - https://phabricator.wikimedia.org/T290037 [20:02:09] T290038: scap sync-file --force warns "sudo: no tty present and no askpass program specified" - https://phabricator.wikimedia.org/T290038 [20:02:10] T290036: Scap revert commands should use --force - https://phabricator.wikimedia.org/T290036 [20:02:16] thanks! [20:02:45] Thanks tgr. We will triage those and try to make improvements. 
[20:05:34] (03CR) 10Ottomata: [C: 03+2] eventgate - Disable http service if tls.enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) (owner: 10Ottomata) [20:06:25] (03CR) 10Ssingh: envoyproxy: Allow setting http2 protocol options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [20:06:41] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [20:07:30] ottomata: so, that fix got deployed to beta, but I don't see any events in https://stream-beta.wmflabs.org/v2/ui/#/?streams=eventlogging_HomepageVisit,mediawiki.mentor_dashboard.visit :/. Am I doing something wrong? [20:08:37] EventLogging.log says nothing, EventBus.log says "DEBUG: Using destination_event_service eventgate-analytics-external for stream mediawiki.mentor_dashboard.visit.", `kafkacat -C -b deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud -t eqiad.mediawiki.mentor_dashboard.visit` is silent, stream-beta is silent :/ [20:08:44] 10SRE: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10Peachey88) [20:09:04] urbanecm: looking [20:09:08] appreciated! 
[20:09:20] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [20:12:27] (03PS1) 10Zabe: labstore: remove absented /etc/exports.d/public_root.exports [puppet] - 10https://gerrit.wikimedia.org/r/715591 [20:14:22] urbanecm: everything looks good from the config side [20:14:29] i can post an event to eventgate in beta and it shows up in stream-beta [20:14:41] oh, so it just took same time to fully propagate [20:14:45] *some more [20:14:52] thanks for your help ottomata [20:14:55] (03PS2) 10Zabe: labstore: remove absented /etc/exports.d/public_root.exports file [puppet] - 10https://gerrit.wikimedia.org/r/715591 [20:14:55] oh ya! [20:14:56] ok cool [20:14:59] great glad it works [20:15:09] urbanecm: mforns was telling me you had some trouble getting the dev env to work? [20:17:42] yes! I was able to get the "new" schemas working (by running the devserver and setting `$wgEventLoggingServiceUri = 'http://localhost:8192/v1/events';`), but schemas like `HomepageVisit` (in the legacy folder) complain about `wgEventLoggingBaseUri` not being set [20:22:03] (also, i had to edit `node_modules/eventgate-wikimedia/eventgate-wikimedia.js` in EventLogging's devserver to load `uriHasProtocol` from `@wikimedia/url-get`; https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/master/eventgate-wikimedia.js#8 has it fixed, but I'm not sure how much https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/devserver/package.json#L8 can be bumped) [20:22:44] Hmm, I note that Beta Cluster wikis still have `Musical scores are temporarily disabled`; I guess we didn't set up shellbox there yet? 
[20:23:55] James_F: I don't think deployment-prep has kubernetes to begin with [20:24:55] T276650 is still open [20:24:56] T276650: Re-consider setting up a Kubernetes cluster on the Beta cluster - https://phabricator.wikimedia.org/T276650 [20:26:25] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,webperf_navtiming} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:27:08] Indeed, but new services are meant to be mocked in Beta when added to prod. [20:27:40] in theory, but since T215217 is open, there's no one responsible for that :) [20:27:41] T215217: deployment-prep: Code stewardship request - https://phabricator.wikimedia.org/T215217 [20:28:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:29:55] (ottomata: see my reply a few lines above -- happy to move this discussion somewhere else, too) [20:30:57] (03CR) 10Brennen Bearnes: [C: 04-1] gitlab cas: update uid field to use uid not CN (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [20:31:28] It's also going to be a massive pain for Wikifunctions that Beta doesn't have a k8s equivalent. [20:31:40] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [20:34:49] hmm interesting. 
[20:34:52] urbanecm: sorry in other convos too [20:35:01] urbanecm: the eventgate-wikimedia dep can be bumped to latest [20:35:19] np, just wanted to make sure you didn't miss it :) [20:35:43] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-8), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) [20:42:40] 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) @JMeybohm, I merged that and am trying to apply for eventgate-logging-external staging. Diff looks good: ` 20:23:31 [@deploy1002:/srv/... [20:42:42] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [20:42:52] ottomata: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/715595, fyio [20:42:53] *fyi [20:44:23] urbanecm: i'm unsure if we should use master, or pin to an explicit version [20:44:44] me too -- happy to change for current latest hash :) [21:00:04] Reedy and sbassett: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2100). 
[21:01:31] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:43] (03PS1) 10Zabe: swift: migrate swift-drive-audit cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) [21:10:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:16:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:20:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:21:54] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) [21:25:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:45] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [21:34:32] 10SRE, 10Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived - https://phabricator.wikimedia.org/T289928 (10Effeietsanders) 05Open→03Resolved a:03Effeietsanders Thanks @Legoktm for digging into this! 
It is surprising that I'm not on wlm-announce as a member, because once i was, and i... [21:34:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:37:48] (PS5) BryanDavis: toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [21:37:50] (PS2) BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [21:37:52] (PS1) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 [21:38:29] (PS2) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) [21:38:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:41:22] (PS3) BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) [21:47:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:52:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:57:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:57:45] (PS1) Urbanecm: Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) [21:58:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:01:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:02:31] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:09] (PS4) Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) [22:03:17] (PS7) Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [22:03:34] (PS2) Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) [22:04:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:09:02] (CR) Urbanecm: Italian Wikipedia is now a group 1 wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: Jdlrobson) [22:13:11] (CR) RobH: [C: +1] "This change updates the quotereview tool to parse the equotes for me properly now and still works for the dell team prepared quote format." [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans) [22:13:59] (PS2) RobH: quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans) [22:15:55] (CR) jerkins-bot: [V: -1] Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm) [22:25:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:39:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:42:35] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:43:00] jouncebot: next [22:43:00] In 0 hour(s) and 16 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2300) [22:43:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:43:53] (CR) Urbanecm: [C: +2] "will deploy during the evening window; CI failure was an unrelated one" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm) [22:51:57] (PS1) Bstorm: cloud osmdb: set num_threads in the sync job [puppet] - https://gerrit.wikimedia.org/r/715623 (https://phabricator.wikimedia.org/T285668) [22:54:56] (PS1) Bstorm: cloud osmdb: don't use proxy for cloud [puppet] - https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668) [22:59:20] (CR) Andrew Bogott: [C: +2] prometheus_local_crontabs: use a systemd timer [puppet] - https://gerrit.wikimedia.org/r/714173 (https://phabricator.wikimedia.org/T273673) (owner: Majavah) [23:00:05] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210830T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:00:12] * urbanecm still waiting on CI [23:02:09] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:04:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:05:44] (Merged) jenkins-bot: Instrument Special:MentorDashboard [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715610 (https://phabricator.wikimedia.org/T289369) (owner: Urbanecm) [23:05:48] \o [23:07:48] (PS6) Andrew Bogott: rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:08:05] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/includes/Specials/SpecialHomepage.php: 9e2264a0c9a48548da4795b2a5b9d7275d254ac7: Instrument Special:MentorDashboard (T289369) (duration: 00m 57s) [23:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:11] T289369: Instrument mentor dashboard for views - https://phabricator.wikimedia.org/T289369 [23:08:13] * urbanecm doe [23:08:15] *done [23:09:40] would be done...if i synced the right file [23:10:15] (CR) jerkins-bot: [V: -1] rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:11:04] !log urbanecm@deploy1002 Synchronized 
php-1.37.0-wmf.20/extensions/GrowthExperiments/includes/Specials/SpecialMentorDashboard.php: 9e2264a0c9a48548da4795b2a5b9d7275d254ac7: Instrument Special:MentorDashboard (T289369) (duration: 00m 55s) [23:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:24] now it works :) [23:11:32] !log Evening B&C done [23:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:02] (PS7) Andrew Bogott: rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:13:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:34] (CR) Andrew Bogott: [C: +2] rabbitmqadmin.py: Update to latest available upstream version [puppet] - https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:14:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:06] (CR) Andrew Bogott: [C: +2] "This only runs on cloudcontrols, all Buster and soon to be Bullseye." 
[puppet] - https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:16:11] SRE, ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (wiki_willy) a:Cmjohnson [23:18:20] SRE, ops-eqiad, Infrastructure-Foundations, netops: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337 (wiki_willy) a:Cmjohnson [23:21:10] (CR) Andrew Bogott: [C: +2] check_keystone_roles.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670925 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:22:13] (CR) Andrew Bogott: [C: +1] confd/confd-lint-wrap.py: Port for Python 3 [puppet] - https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:22:54] (CR) Andrew Bogott: "Does this need attention still or have y'all long since worked around it?" [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott) [23:24:00] (PS2) Andrew Bogott: Nova vendordata.txt: delete systemd-coredump user [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond) [23:24:48] (CR) Andrew Bogott: [C: +2] "This is pre-puppet so should be ok." 
[puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond) [23:25:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:27:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:39:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:43:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:50:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:04] (CR) Bstorm: "I'd like to try merging this if you can add that dependency @Majavah. I can add it if you like as well." 
[puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah) [23:55:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:55:27] (CR) Bstorm: "I don't think we want ssh client in the docker images. Is there a specific use you had in mind?" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: Kosta Harlan) [23:57:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:57:25] (CR) Bstorm: [C: +1] "I hope we don't have to set up a new OS before the grid is decommissioned." [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah) [23:59:40] (CR) Bstorm: [C: +2] "Thanks for the cleanup!" [puppet] - https://gerrit.wikimedia.org/r/715591 (owner: Zabe)