[00:15:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/925121
[00:39:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/925121 (owner: TrainBranchBot)
[00:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:46:46] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/925121 (owner: TrainBranchBot)
[01:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability
for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:01] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/925119/41512/" [puppet] - https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[04:11:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (cloudcontrol2004-dev), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:00:07] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230602T0600)
[06:38:26] (CR) Jelto: [C: +2] gitlab: use production idp for gitlab hosts [puppet] - https://gerrit.wikimedia.org/r/924525 (https://phabricator.wikimedia.org/T320390) (owner: Jelto)
[06:53:39] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Legoktm)
[06:53:47] (CR) Slyngshede: [C: +2] C:IDM Remove absent systemd services. [puppet] - https://gerrit.wikimedia.org/r/923559 (owner: Slyngshede)
[06:57:49] (PS4) Slyngshede: mgmt module [software/bitu] - https://gerrit.wikimedia.org/r/918245
[06:58:43] (CR) Slyngshede: [C: +2] sre.ganeti.reimage: Remove specialised cookbook.
(1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/922065 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[06:59:34] (Abandoned) Slyngshede: P:idp: use production variable [puppet] - https://gerrit.wikimedia.org/r/891287 (owner: Jbond)
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230602T0700)
[07:00:46] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[07:00:53] (CR) Muehlenhoff: [C: +2] Cloud VPS: Remove support for stretch in various roles [puppet] - https://gerrit.wikimedia.org/r/925831 (owner: Muehlenhoff)
[07:06:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:11:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:16:30] SRE, Infrastructure-Foundations, Puppet-Core, User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (MoritzMuehlenhoff) Open→Declined The old-style syntax is used all over the place and it would be a significant effort to change.
Since it wil be...
[07:20:04] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Xover) >>! In T325607#8897400, @SCherukuwada wrote: > As far as I can tell, a lot of the pages that aren't appearing in the index are simply not li...
[07:21:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org
[07:22:07] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad or A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[07:27:36] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org
[07:29:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:30:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:30:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad or A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[07:31:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50135 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:22] RECOVERY - mailman
list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.543 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:36:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3006.wikimedia.org
[07:39:52] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3006.wikimedia.org
[07:47:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6002.wikimedia.org
[07:53:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6002.wikimedia.org
[08:01:01] (CR) Alexandros Kosiaris: [C: +1] "ferm depends on iptables anyway (see https://packages.debian.org/bookworm/ferm) so we could just require ferm instead.
That being said, th" [puppet] - https://gerrit.wikimedia.org/r/925877 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:13:59] (CR) Alexandros Kosiaris: [C: +2] debian: remove cadvisor from the kubelet's systemd unit [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: Elukey)
[08:26:36] (PS1) Muehlenhoff: Point codfw URL downloader to new bullseye host [dns] - https://gerrit.wikimedia.org/r/926421 (https://phabricator.wikimedia.org/T329945)
[08:28:16] SRE, serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (akosiaris) Adding one more data point that just might be in some cases related: in T292663 we noticed (after going first down the wrong rabbithole) that in some cases we have connections `mw (en...
[08:31:47] (PS1) Stevemunene: Add new stat1009 to the stat servers rsync hosts_allow [puppet] - https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036)
[08:32:17] (CR) Alexandros Kosiaris: [C: +2] debian: remove cadvisor from the kubelet's systemd unit (1 comment) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: Elukey)
[08:33:09] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (MatthewVernon) I restarted the swift frontends this morning, and the 5xx rate has returned to normal.
[08:34:57] (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41513/console" [puppet] - https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[08:37:40] (PS1) Slyngshede: Netbox: Rename OIDC secret variable.
[labs/private] - https://gerrit.wikimedia.org/r/926423
[08:38:06] (PS1) Kosta Harlan: ipoid: Set sources to point to GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/926424 (https://phabricator.wikimedia.org/T337714)
[08:38:27] (PS35) Slyngshede: P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002)
[08:39:03] (CR) Slyngshede: [V: +2 C: +2] Netbox: Rename OIDC secret variable. [labs/private] - https://gerrit.wikimedia.org/r/926423 (owner: Slyngshede)
[08:40:57] (CR) CI reject: [V: -1] P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[08:42:22] (PS36) Slyngshede: P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002)
[08:42:26] !log installing traceroute bugfix updates from Bullseye point release
[08:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:05] (CR) Kosta Harlan: ipoid: add helmfile.d config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: Effie Mouzeli)
[08:44:23] (PS2) Kosta Harlan: ipoid: Update for GitLab migration [deployment-charts] - https://gerrit.wikimedia.org/r/926424 (https://phabricator.wikimedia.org/T337714)
[08:44:52] (CR) CI reject: [V: -1] P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[08:44:59] (CR) Kosta Harlan: ipoid: add helmfile.d config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: Effie Mouzeli)
[08:46:26] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575
(MoritzMuehlenhoff)
[08:48:03] (PS1) Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010)
[08:48:38] (PS37) Slyngshede: P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002)
[08:49:02] (CR) CI reject: [V: -1] P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[08:50:06] (CR) CI reject: [V: -1] Alert if there's a big change in image-suggestions compared to yesterday [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: Cparle)
[08:50:28] (PS38) Slyngshede: P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002)
[08:51:29] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 0 hosts:
[08:51:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 0 hosts:
[08:51:46] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cp5013.eqsin.wmnet
[08:51:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp5013.eqsin.wmnet
[08:51:54] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cp5014.eqsin.wmnet
[08:51:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp5014.eqsin.wmnet
[08:51:59] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cp5015.eqsin.wmnet
[08:52:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp5015.eqsin.wmnet
[08:52:05] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cp5016.eqsin.wmnet
[08:52:06] !log jmm@cumin2002 END
(PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp5016.eqsin.wmnet
[08:52:54] (PS2) Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010)
[08:54:22] (CR) CI reject: [V: -1] Alert if there's a big change in image-suggestions compared to yesterday [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: Cparle)
[08:54:33] (CR) Alexandros Kosiaris: [C: +1] prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: Filippo Giunchedi)
[08:56:04] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41514/console" [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[08:56:39] (CR) MVernon: [C: +2] swift: remove support for pre-bullseye [puppet] - https://gerrit.wikimedia.org/r/925807 (https://phabricator.wikimedia.org/T279637) (owner: MVernon)
[09:09:15] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41515/console" [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[09:10:12] (PS3) Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010)
[09:13:18] (CR) Slyngshede: debmonitor: Install Debian Django packages on Bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[09:18:21] !log update kubernetes-node to 1.23.14-2 on all P:kubernetes::node hosts (88 in total)
T337836. Reload systemd for unit changes to take effect
[09:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:25] T337836: Cadvisor may be breaking Kubernetes worker nodes - https://phabricator.wikimedia.org/T337836
[09:19:11] (PS1) Arturo Borrero Gonzalez: interface::route: add persist option [puppet] - https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758)
[09:20:35] (CR) Muehlenhoff: debmonitor: Install Debian Django packages on Bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[09:26:06] (PS39) Slyngshede: P:netbox reconfigure to used OIDC [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002)
[09:29:52] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41516/console" [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[09:30:43] (PS3) Muehlenhoff: debmonitor: Install Debian Django packages on Bookworm [puppet] - https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049)
[09:35:52] !log installing texlive-security updates on buster
[09:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:24] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[09:40:04] (CR) Slyngshede: [V: +1] P:netbox reconfigure to used OIDC (10 comments) [puppet] - https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: Slyngshede)
[09:51:36] (CR) Muehlenhoff: [C: +2] debmonitor: Install Debian Django packages on Bookworm [puppet] - https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner:
Muehlenhoff)
[10:03:49] (PS1) Ladsgroup: wikireplicas: Repool clouddb10[17-20] [puppet] - https://gerrit.wikimedia.org/r/926453 (https://phabricator.wikimedia.org/T337734)
[10:04:01] (CR) CI reject: [V: -1] wikireplicas: Repool clouddb10[17-20] [puppet] - https://gerrit.wikimedia.org/r/926453 (https://phabricator.wikimedia.org/T337734) (owner: Ladsgroup)
[10:04:35] (PS2) Ladsgroup: wikireplicas: Repool clouddb10[17-20] [puppet] - https://gerrit.wikimedia.org/r/926453 (https://phabricator.wikimedia.org/T337734)
[10:05:29] (CR) Ladsgroup: [C: +2] wikireplicas: Repool clouddb10[17-20] [puppet] - https://gerrit.wikimedia.org/r/926453 (https://phabricator.wikimedia.org/T337734) (owner: Ladsgroup)
[10:20:42] (CR) Ladsgroup: "I'll deploy this early next week." [puppet] - https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: Matthias Mullie)
[10:30:35] (PS1) Muehlenhoff: debmonitor: Handle the deployment directory if not using scap [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049)
[10:31:20] (PS2) Muehlenhoff: debmonitor: Handle the deployment directory if not using scap [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049)
[10:32:33] (CR) Clément Goubert: "This change is ready for review."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/926455 (https://phabricator.wikimedia.org/T338014) (owner: Clément Goubert)
[10:33:53] (PS2) Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609)
[10:35:19] (PS3) Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609)
[10:36:18] (CR) Jbond: [C: -1] planet: restrict firewall source range for port 443 to envoy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/924604 (owner: Dzahn)
[10:36:27] (PS4) Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609)
[10:37:33] (PS5) Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609)
[10:40:59] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: let all designate traffic happen using cloud-private [puppet] - https://gerrit.wikimedia.org/r/926456 (https://phabricator.wikimedia.org/T336808)
[10:43:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[10:44:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:11] (PS3) Muehlenhoff: debmonitor: Handle the deployment directory if not using scap [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049)
[10:45:06] (CR) Jbond: "lgtm but See comment inline" [puppet] - https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[10:46:28] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[10:46:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[10:49:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:52:23] (CR) Jbond: [C: -1] "lgtm but i think we should add the glob type when we add the glob backend (unless im missing something)" [puppet] - https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[10:55:19] (CR) Muehlenhoff: [C: +2] Bitu: Add an alert for the front page [puppet] - https://gerrit.wikimedia.org/r/925650 (https://phabricator.wikimedia.org/T320603) (owner: Muehlenhoff)
[10:57:29] (PS2) Arturo Borrero Gonzalez: openstack: codfw1dev: let all designate traffic happen using cloud-private [puppet] - https://gerrit.wikimedia.org/r/926456 (https://phabricator.wikimedia.org/T336808)
[11:00:21] SRE-tools, Ganeti, Infrastructure-Foundations: sre.ganeti.makevm cook book only allows specifying RAM size in full gigabytes - https://phabricator.wikimedia.org/T230712 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff sre.ganeti.makevm now supports fractions of gigabytes.
[11:02:08] (CR) Slyngshede: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[11:04:14] (CR) Jelto: [C: +2] microsites: remove bienvenida.wikimedia.org, migrated to k8s [puppet] - https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[11:08:05] (Abandoned) Clément Goubert: trafficserver::lua_script: Allow templated config [puppet] - https://gerrit.wikimedia.org/r/924494 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[11:08:41] (CR) Muehlenhoff: [C: +2] debmonitor: Handle the deployment directory if not using scap [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[11:09:07] (ProbeDown) firing: Service miscweb1003:443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:33] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/926456/41518/" [puppet] - https://gerrit.wikimedia.org/r/926456 (https://phabricator.wikimedia.org/T336808) (owner: Arturo Borrero Gonzalez)
[11:14:07] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:16] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:14:59] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate group0 to Kubernetes -
https://phabricator.wikimedia.org/T337490 (Clement_Goubert) Open→Stalled Blocked by {T331609}
[11:16:20] RECOVERY - Check systemd state on puppetboard1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:19] (PS3) Jbond: puppetserver: add additional config options [puppet] - https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[11:17:22] (PS1) Jbond: puppetserver::git: make ensureable [puppet] - https://gerrit.wikimedia.org/r/926464
[11:18:31] (PS1) Muehlenhoff: debmonitor: Use wmflib::dir::mkdir_p() [puppet] - https://gerrit.wikimedia.org/r/926465
[11:18:51] (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:19:07] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:19:36] ^ miscweb probe down should be fixed soon, looking at it
[11:24:03] (CR) Jbond: [C: -1] "LGTM, -1 is for the docs strings some other nits/advice inline" [puppet] - https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[11:24:10] (CR) Clément Goubert: [C: +1] opentelemetry-collector: New chart [deployment-charts] - https://gerrit.wikimedia.org/r/925015 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[11:26:12] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/925922 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[11:27:53] (CR) Muehlenhoff: [C: +2]
debmonitor: Use wmflib::dir::mkdir_p() [puppet] - https://gerrit.wikimedia.org/r/926465 (owner: Muehlenhoff)
[11:29:07] (ProbeDown) resolved: (2) Service miscweb1003:443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:32:12] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41520/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[11:32:52] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41519/console" [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[11:37:27] (CR) Jbond: [V: +1 C: +1] "LGTM but would be good to have moritz take a look before merge" [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[11:42:53] (CR) Jbond: [C: -1] "-1 on the namespacing.
the container_build fact seems like it may be a bit fragile but happy to merge and see" [puppet] - https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[11:45:44] (CR) Jbond: "general idea looks good to me, see comment inline and also need to fix spec test" [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[11:48:51] (ProbeDown) resolved: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:11] (CR) Muehlenhoff: [C: +2] Remove KDC role from krb2001 [puppet] - https://gerrit.wikimedia.org/r/922068 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff)
[11:52:11] (PS1) Arturo Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to all internal networks [puppet] - https://gerrit.wikimedia.org/r/926474 (https://phabricator.wikimedia.org/T336808)
[12:03:14] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/926474/41521/" [puppet] - https://gerrit.wikimedia.org/r/926474 (https://phabricator.wikimedia.org/T336808) (owner: Arturo Borrero Gonzalez)
[12:03:33] !log installing at-spi2-core bugfix updates from Bullseye point release
[12:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:02] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[12:13:51] (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm2001:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:20:16] (PS1) Ottomata: flink-operator - Add ingress for health check probe port [deployment-charts] - https://gerrit.wikimedia.org/r/926479
[12:20:44] (PS2) Ottomata: mw-page-content-change-enrich - code comment correction [deployment-charts] - https://gerrit.wikimedia.org/r/925821
[12:20:46] (PS2) Ottomata: flink-operator - Add ingress for health check probe port [deployment-charts] - https://gerrit.wikimedia.org/r/926479
[12:20:57] (CR) Ottomata: [V: +2 C: +2] mw-page-content-change-enrich - code comment correction [deployment-charts] - https://gerrit.wikimedia.org/r/925821 (owner: Ottomata)
[12:29:47] (PS4) EoghanGaffney: releases: Add new hosts to failover servers list [puppet] - https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435)
[12:32:43] (PS1) Slyngshede: P:IDM Ensure that uwsgi app is running. [puppet] - https://gerrit.wikimedia.org/r/926482
[12:34:28] (PS1) Kosta Harlan: checkuser: Disable client hints feature by default [mediawiki-config] - https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944)
[12:37:38] (PS2) Slyngshede: P:IDM Ensure that uwsgi app is running. [puppet] - https://gerrit.wikimedia.org/r/926482
[12:39:20] (CR) Muehlenhoff: "Two comments inline, I'll have a closer look in the next days."
[puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [12:39:28] (03CR) 10JMeybohm: [C: 03+1] flink-operator - Add ingress for health check probe port [deployment-charts] - 10https://gerrit.wikimedia.org/r/926479 (owner: 10Ottomata) [12:39:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:41:20] (03CR) 10Matthias Mullie: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:42:04] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) @Soda Yeah Navboxes would indeed have helped. Tell me if this makes sense re: categories. If a page isn't in any category or is in a... [12:42:33] (03CR) 10Ottomata: [C: 03+2] flink-operator - Add ingress for health check probe port [deployment-charts] - 10https://gerrit.wikimedia.org/r/926479 (owner: 10Ottomata) [12:42:39] (03PS3) 10Slyngshede: P:IDM Ensure that uwsgi app is running. [puppet] - 10https://gerrit.wikimedia.org/r/926482 [12:42:55] PROBLEM - puppet last run on puppetmaster2004 is CRITICAL: CRITICAL: Puppet has been disabled for 605056 seconds, message: test submit_only - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:43:01] (03CR) 10CI reject: [V: 04-1] P:IDM Ensure that uwsgi app is running. [puppet] - 10https://gerrit.wikimedia.org/r/926482 (owner: 10Slyngshede) [12:43:06] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Ladsgroup) So I looked at categorylinks tables everywhere. 
There are the top ten biggest ones: ` root@clouddb1021:/srv# ls -Ssh sqldata.s*/*/categorylinks.ibd | head 188G sqldata.s4/commonswiki/categorylink... [12:44:50] (03Merged) 10jenkins-bot: flink-operator - Add ingress for health check probe port [deployment-charts] - 10https://gerrit.wikimedia.org/r/926479 (owner: 10Ottomata) [12:49:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refresh toolforge-k8s-prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/926484 [12:50:02] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refresh toolforge-k8s-prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/926484 (https://phabricator.wikimedia.org/T338025) [12:52:32] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/926484 (https://phabricator.wikimedia.org/T338025) (owner: 10Arturo Borrero Gonzalez) [12:53:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: refresh toolforge-k8s-prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/926484 (https://phabricator.wikimedia.org/T338025) (owner: 10Arturo Borrero Gonzalez) [12:53:29] (03CR) 10David Caro: [C: 03+1] toolforge: refresh toolforge-k8s-prometheus certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926484 (https://phabricator.wikimedia.org/T338025) (owner: 10Arturo Borrero Gonzalez) [13:04:22] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) @Xover That's very useful to know. Whatever is linked temporarily on Main: might not necessarily be picked up. Do you have a sense o... 
[13:17:35] (03PS2) 10Ayounsi: Initial gNMI support for network automation cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/924896 [13:17:37] (03PS1) 10Ayounsi: WIP: Cookbook to manage network users over gNMI [cookbooks] - 10https://gerrit.wikimedia.org/r/926491 (https://phabricator.wikimedia.org/T338028) [13:21:00] (03CR) 10CI reject: [V: 04-1] Initial gNMI support for network automation cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/924896 (owner: 10Ayounsi) [13:22:38] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:22:44] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:22:46] (03PS1) 10Ayounsi: Add /.vscode/ to .gitignore [cookbooks] - 10https://gerrit.wikimedia.org/r/926493 [13:24:26] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:24:35] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:24:46] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:24:58] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:25:39] !log deploying flink-operator change to dse-k8s and wikikube to add ingress for health check port - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/926479 [13:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:44] !log otto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:25:57] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:26:02] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:26:12] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:34:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10Papaul) @Southparkfan are you still having issues here, or can we close this task?
[13:36:40] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10ssingh) Hi @Papaul: I plan to come back to this once we do the LVS next week and see if we have the same issue, otherwise I will clos... [13:37:12] (03PS1) 10Muehlenhoff: Also apply url downloader role to remaining Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/926496 (https://phabricator.wikimedia.org/T329945) [13:37:35] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10ssingh) [13:39:31] (03PS1) 10Matthias Mullie: [SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) [13:40:13] 10SRE, 10ops-codfw, 10Cloud-Services, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10Papaul) @Jhancock.wm can you please check what is the IDRAC firmware version and what is the latest one on Dell website. Thanks. 
[13:41:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [13:44:09] (03CR) 10Muehlenhoff: [C: 03+2] Also apply url downloader role to remaining Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/926496 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [13:47:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:15] (03PS2) 10Samtar: logspam-watch: Add a fox emoji [puppet] - 10https://gerrit.wikimedia.org/r/921050 [13:53:30] (Processor usage over 85%) firing: Alert for device cr2-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [13:54:59] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:13] (03CR) 10JHathaway: [C: 03+2] ferm: Ensure iptables is installed before configuring alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925877 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:59:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:13] (03CR) 10JHathaway: [C: 03+2] expand_path, regex_data: use yaml 
safe_load when available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925862 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:01:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:15] (03CR) 10JHathaway: puppetserver: hiera type defs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:43] (03PS6) 10Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) [14:09:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10ssingh) [14:09:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) [14:09:57] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice! Using a dict seems indeed more reasonable for these cases. LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 (owner: 10Elukey) [14:10:32] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:52] (03CR) 10JHathaway: [C: 03+2] "thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/925922 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:12:54] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) Here are some unindexed articles (confirmed from search console and from Google). I came upon them by simply hitting "Zufällige Seite... [14:13:57] (03PS7) 10Clément Goubert: mediawiki: Handle pod termination gracefully [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) [14:15:17] (03CR) 10Herron: [C: 03+1] opensearch_dashboards: remove alerting and observability plugins [puppet] - 10https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [14:15:50] (03PS4) 10Slyngshede: P:IDM Ensure that uwsgi app is running. [puppet] - 10https://gerrit.wikimedia.org/r/926482 [14:16:03] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10TheDJ) Shouldn't we have an automatic alert on this many 5xx's ? [14:16:13] (03CR) 10CI reject: [V: 04-1] P:IDM Ensure that uwsgi app is running. 
[puppet] - 10https://gerrit.wikimedia.org/r/926482 (owner: 10Slyngshede) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:34] (03PS1) 10Muehlenhoff: gdnsd: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/926509 [14:17:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41525/console" [puppet] - 10https://gerrit.wikimedia.org/r/926482 (owner: 10Slyngshede) [14:19:07] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) We do monitor and graph the 5xx rate, but it was only ever a tiny fraction of the request... [14:20:08] (03CR) 10JHathaway: bookworm: Change to deb822 format for sources.list (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [14:20:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:58] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Albertoleoncio) @SCherukuwada I have an interesting one: [[:s:pt:O Movimento Modernista]] Looking for ` "O Movimento Modernista" site:wikisource... 
[14:26:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff) [14:27:38] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:01] (03PS1) 10Majavah: mariadb: toolsdb: move default-character-set under mysql [puppet] - 10https://gerrit.wikimedia.org/r/926518 [14:29:44] (03Abandoned) 10Slyngshede: P:IDM Ensure that uwsgi app is running. [puppet] - 10https://gerrit.wikimedia.org/r/926482 (owner: 10Slyngshede) [14:36:37] (03PS1) 10BBlack: ntp: avoid DC-specific data in peering code [puppet] - 10https://gerrit.wikimedia.org/r/926519 [14:38:02] (03PS1) 10Muehlenhoff: idm: Mask the generic uwsgi service [puppet] - 10https://gerrit.wikimedia.org/r/926520 [14:39:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926520 (owner: 10Muehlenhoff) [14:39:41] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10Jhancock.wm) 05Open→03Resolved server remained reachable by ssh for 2 days. resolving. [14:40:27] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:57] (03CR) 10CDanis: [C: 03+1] puppet-merge: implement Lock out, tag out (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [14:53:11] 10SRE, 10ops-codfw, 10Cloud-Services, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10Jhancock.wm) idrac firmware on the host is version 3.32 Dell's latest version is 6.10 I will start the upgrade process on this. 
good call and thank you @Papaul [14:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:57:51] (03CR) 10Muehlenhoff: bookworm: Change to deb822 format for sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [14:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:59:19] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:21] (03CR) 10Klausman: [C: 03+1] kserve-inference: use dict instead of lists for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 (owner: 10Elukey) [15:00:54] (03CR) 10Muehlenhoff: bookworm: Change to deb822 format for sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [15:01:11] (03PS3) 10JHathaway: don't export resources when wmflib::have_puppetdb() is false [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) [15:01:17] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - 
AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:46] (03CR) 10Andrew Bogott: [C: 03+1] sudo: remove sudoldap support [puppet] - 10https://gerrit.wikimedia.org/r/924982 (owner: 10Majavah) [15:03:37] (03CR) 10CI reject: [V: 04-1] don't export resources when wmflib::have_puppetdb() is false [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:03:57] (03CR) 10Andrew Bogott: [C: 03+2] ldap: Drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/924981 (owner: 10Majavah) [15:04:39] (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs::instance: drop stretch-backports config [puppet] - 10https://gerrit.wikimedia.org/r/924983 (owner: 10Majavah) [15:05:17] (03CR) 10Andrew Bogott: [C: 03+2] sudo: remove sudoldap support [puppet] - 10https://gerrit.wikimedia.org/r/924982 (owner: 10Majavah) [15:07:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:07:27] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:30] (Processor usage over 85%) resolved: Device cr2-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [15:10:29] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:13:23] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:16:43] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:08] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10aborrero) >>! In T336587#8897657, @Andrew wrote: > I haven't dug much, but... 
[15:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:29:24] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Bodhisattwa) @SCherukuwada , [[ https://bn.wikisource.org/wiki/%E0%A6%AE%E0%A7%8B%E0%A6%97%E0%A6%B2-%E0%A6%AC%E0%A6%BF%E0%A6%A6%... [15:35:02] !log restart ntp.service on dns1002 [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:53] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:04] (03PS2) 10BBlack: ntp: avoid DC-specific data in peering code [puppet] - 10https://gerrit.wikimedia.org/r/926519 [15:42:06] (03PS1) 10BBlack: ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 [15:42:32] (03CR) 10CI reject: [V: 04-1] ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 (owner: 10BBlack) [15:42:39] 10SRE, 10ops-codfw, 10Cloud-Services, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10Jhancock.wm) idrac version has been updated [15:42:53] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:43:26] (03PS2) 10BBlack: ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 [15:43:51] (03CR) 10CI 
reject: [V: 04-1] ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 (owner: 10BBlack) [15:48:37] (03PS3) 10BBlack: ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 [16:03:19] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) [16:03:54] !log dns*: removed faulty authdns[12]001 lines from /etc/hosts via cumin+sed [16:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:41] (03PS1) 10Cathal Mooney: Add KCVelaga to analytics-product-users users group [puppet] - 10https://gerrit.wikimedia.org/r/926543 (https://phabricator.wikimedia.org/T337766) [16:06:03] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:08:20] (03PS4) 10JHathaway: don't export resources when wmflib::have_puppetdb() is false [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) [16:08:40] (03PS1) 10Ahmon Dancy: Fix profile::gitlab::active_host and profile::gitlab::passive_hosts for devtools [puppet] - 10https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) [16:14:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:17:10] (03PS3) 10BBlack: ntp: avoid DC-specific data in peering code [puppet] - 10https://gerrit.wikimedia.org/r/926519 [16:17:12] (03PS4) 10BBlack: ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 [16:20:47] (03CR) 10JHathaway: java: ensure wmf-certificates is installed, when required (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 
10JHathaway) [16:27:20] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [16:27:55] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [16:28:22] (03PS3) 10JHathaway: add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) [16:28:35] (03CR) 10JHathaway: add container facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:28:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10cmooney) @odimitrijevic @Ottomata can one of you advise if you are happy for zabe to be added to analytics-privatedata-users? Non-staff member already has LDAP user and ssh ke... [16:38:31] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10cmooney) @Dreamy_Jazz can you update the task description with the following so we can process this request following the approval? [] The **username** of your existing account on wikitech.wikimedia.... [16:39:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [16:42:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10Ottomata) Hm, usually for non-staff access to analytics-privatedata-users, we require a WMF sponsor (usually a manager?) 
and a MOU [[ https://wikitech.wikimedia.org/wiki/Analyt... [16:44:43] (03CR) 10Brennen Bearnes: "I think this seems mostly right, but a few changes here I'm not sure about:" [puppet] - 10https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) (owner: 10Ahmon Dancy) [16:47:05] (03CR) 10Ahmon Dancy: Fix profile::gitlab::active_host and profile::gitlab::passive_hosts for devtools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) (owner: 10Ahmon Dancy) [16:47:09] (03CR) 10Ssingh: [C: 03+1] ntp: avoid DC-specific data in peering code [puppet] - 10https://gerrit.wikimedia.org/r/926519 (owner: 10BBlack) [16:50:36] (03PS1) 10Krinkle: Profiler: Replace copy of ExcimerClient.php with git submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926574 (https://phabricator.wikimedia.org/T337873) [16:52:16] (03CR) 10Ssingh: [C: 03+1] ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 (owner: 10BBlack) [16:53:48] (03CR) 10Brennen Bearnes: Fix profile::gitlab::active_host and profile::gitlab::passive_hosts for devtools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) (owner: 10Ahmon Dancy) [16:57:21] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:57:37] (03CR) 10JHathaway: puppetserver: add additional config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:57:39] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [16:57:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, 
interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:00:25] (03Merged) 10jenkins-bot: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [17:18:21] (03PS1) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) [17:18:23] (03PS1) 10Eevans: sessionstore: upgrade sessionstore2002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) [17:18:25] (03PS1) 10Eevans: sessionstore: upgrade sessionstore2003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) [17:22:37] (03PS2) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) [17:22:39] (03PS2) 10Eevans: sessionstore: upgrade sessionstore2002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) [17:22:41] (03PS2) 10Eevans: sessionstore: upgrade sessionstore2003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) [17:23:31] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:26:58] (03PS3) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) [17:27:02] (03PS3) 10Eevans: sessionstore: upgrade sessionstore2002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) [17:27:06] (03PS3) 10Eevans: sessionstore: upgrade sessionstore2003 to Cassandra 
4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) [17:27:08] !log dns*: disabling puppet to control rollout of NTP config fixups [17:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:13] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:28:16] (03CR) 10BBlack: [C: 03+2] ntp: avoid DC-specific data in peering code [puppet] - 10https://gerrit.wikimedia.org/r/926519 (owner: 10BBlack) [17:29:18] (03CR) 10BBlack: [C: 03+2] ntp: revamp and refresh our "tos" options a bit [puppet] - 10https://gerrit.wikimedia.org/r/926539 (owner: 10BBlack) [17:34:15] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:34:40] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:45:23] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:45:24] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:47:30] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [17:48:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [17:48:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:11] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10Dreamy_Jazz) > [] The **username** of your existing account on wikitech.wikimedia.org: Dreamy Jazz > [] Do you currently 
have **shell access** (Yes/No)? No > [] **Purpose** (Specify which service you...
[18:00:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:03:53] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:04:27] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:06:42] SRE, SRE-tools, Infrastructure-Foundations, netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) @Volans i tested the cookbook today on ssw1-a1 the switch did recevied dhcp but was not able to fetch the config file from the tftp serv...
[18:17:43] SRE, SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (TheresNoTime) >>! In T337829#8893753, @nskaggs wrote: > I presume resolving {T337848} would unblock the situation you described, yes? And if I'm reading correctly, your desire...
[18:17:47] SRE, SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (MusikAnimal) It would be great to have someone with this access on our team (#community-tech). I.e., recently I waited over a month for {T334041} to be processed.
[18:19:45] RECOVERY - BFD status on cr2-codfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:20:43] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:22:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:23:49] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:31:31] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:31:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:36:17] (PS1) Ssingh: ntp: do not restart the ntp service on conf change [puppet] - https://gerrit.wikimedia.org/r/926598
[18:37:42] (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41529/console" [puppet] - https://gerrit.wikimedia.org/r/926598 (owner: Ssingh)
[18:38:13] (PS1) Milimetric: labstore: Add glamwikidashboard project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/926599 (https://phabricator.wikimedia.org/T338063)
[18:39:03] (CR) Milimetric: "Sorry I'm not sure what the gid is, I can't login to the project yet."
[puppet] - https://gerrit.wikimedia.org/r/926599 (https://phabricator.wikimedia.org/T338063) (owner: Milimetric)
[18:40:32] (CR) Muehlenhoff: [C: +1] "LGTM, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/926598 (owner: Ssingh)
[18:42:59] !log dns*: puppets are all re-enabled, ntp restarts are done, etc
[18:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:03] (PS2) AikoChou: changeprop: allow match_not in match_config for liftwing [deployment-charts] - https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899)
[18:45:10] (PS3) AikoChou: changeprop: allow match_not in match_config for liftwing [deployment-charts] - https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899)
[18:50:17] (PS1) Ottomata: mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303)
[18:52:58] (PS1) Dzahn: admin/miscweb: remove admin group sitemaps-admins [puppet] - https://gerrit.wikimedia.org/r/926602 (https://phabricator.wikimedia.org/T338064)
[18:54:29] (CR) Dzahn: [C: +2] admin/miscweb: remove admin group sitemaps-admins [puppet] - https://gerrit.wikimedia.org/r/926602 (https://phabricator.wikimedia.org/T338064) (owner: Dzahn)
[18:56:20] (CR) JHathaway: don't export resources when wmflib::have_puppetdb() is false (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[18:57:34] (PS1) Dzahn: httpbb/site: remove sitemaps tests and comment [puppet] - https://gerrit.wikimedia.org/r/926603 (https://phabricator.wikimedia.org/T338064)
[18:58:49] (CR) Dzahn: [C: +2] httpbb/site: remove sitemaps tests and comment [puppet] - https://gerrit.wikimedia.org/r/926603 (https://phabricator.wikimedia.org/T338064) (owner: Dzahn)
[19:05:24] (PS1) Dzahn: trafficserver: remove map for sitemaps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064)
[19:07:09] SRE, SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (taavi) From experience: `wmcs-roots` is basically useless for wiki replica work. If {T337848} ends up expanding access for that group to wiki replica cookbooks and the replica...
[19:08:55] (PS1) Dzahn: miscweb: remove sitemaps profile from role [puppet] - https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064)
[19:10:06] (PS2) Dzahn: miscweb: remove sitemaps profile from role [puppet] - https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064)
[19:15:37] (CR) Herron: [V: +2 C: +2] import upstream 0.6.2 [debs/pyrra] - https://gerrit.wikimedia.org/r/923685 (owner: Herron)
[19:19:30] (PS1) Dzahn: varnish: remove/adjust rewrites and tests for sitemaps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064)
[19:22:39] (PS1) Dzahn: remove sitemaps.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/926613 (https://phabricator.wikimedia.org/T338064)
[19:37:18] (CR) Herron: "This change is ready for review."
(3 comments) [debs/pyrra] - https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:37:33] (PS1) Clare Ming: Add initial stream configs for Android article events using Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355)
[19:37:41] (CR) Herron: pyrra: initial packaging for v0.6.2 (1 comment) [debs/pyrra] - https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:38:16] (CR) CI reject: [V: -1] Add initial stream configs for Android article events using Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: Clare Ming)
[19:39:49] (PS2) Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355)
[19:54:18] (PS4) JHathaway: puppetserver: add additional config options [puppet] - https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972)
[19:56:33] (CR) JHathaway: "thanks for reviewing John" [puppet] - https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[20:01:41] ops-pmtpa: decom racktables - "db12" "db19", "db28" - https://phabricator.wikimedia.org/T81623 (Aklapper)
[20:16:01] !log rsync in ariel screen session, bwlimit 100000, running on dumpsdata1003, pulling from dumpsdata1002, copying over 'other dumps'
[20:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:37] (CR) Dzahn: httpd: set legacy_compat to absent (1 comment) [puppet] - https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: Jbond)
[20:26:32] SRE, Maps: Allow Wikimedia Maps usage on -
https://phabricator.wikimedia.org/T338069 (Pereibri)
[20:29:57] SRE, Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (Pereibri)
[20:33:23] (PS3) Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355)
[20:37:36] (PS4) Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355)
[20:39:10] (CR) BCornwall: [C: +1] "I prefer a more explicit approach, particularly if the defaults were to change in the future. However, I imagine this would be fine." [puppet] - https://gerrit.wikimedia.org/r/926509 (owner: Muehlenhoff)
[20:39:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[20:42:08] (PS1) Kimberly Sarabia: Remove VectorLimitedWidthIndicator [mediawiki-config] - https://gerrit.wikimedia.org/r/926626 (https://phabricator.wikimedia.org/T336197)
[20:42:36] SRE, Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (bd808) > maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Affiliates. > — https://wikitech.wikimedia.org/wiki/Maps/External_usage > Wikimedia Maps may not be used by...
[20:52:03] SRE, Infrastructure-Foundations, Puppet-Core, User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (Dzahn) I am breaking out the ones that sre-collab can fix into a subtask.
After that there are a handful for observability (thanos, prometheus, graphit...
[20:53:16] SRE, Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (Pereibri) This was being added by me in the past using http://maps.wikimedia.org/osm-intl/%7Bz%7D/%7Bx%7D/%7By%7D.png We don’t have a affiliate org that I know of, is this something that can be purchased? Thi...
[20:54:47] (CR) Dzahn: [C: +2] gerrit/bacula: adjust Gerrit file paths to be backed up [puppet] - https://gerrit.wikimedia.org/r/924608 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)
[20:56:22] (Abandoned) Superpes15: [ruwiki] Add 'sboverride' right to the bots [mediawiki-config] - https://gerrit.wikimedia.org/r/908973 (https://phabricator.wikimedia.org/T334344) (owner: Superpes15)
[21:01:39] SRE, Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (bd808) Open→Declined >>! In T338069#8899884, @Pereibri wrote: > The free api worked great until now, is there an alternative if we can’t use this any longer? https://developers.google.com/maps is serv...
[21:02:32] (PS2) Superpes15: [itwiktionary] Add a tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/926036 (https://phabricator.wikimedia.org/T337688)
[21:09:24] SRE, Traffic-Icebox: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (BCornwall) @Vgutierrez Three years later, have you experienced these?
[21:14:03] (PS1) Superpes15: [fiwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - https://gerrit.wikimedia.org/r/926628 (https://phabricator.wikimedia.org/T337412)
[21:14:59] (PS14) Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015)
[21:35:13] (PS1) Superpes15: [ruwiki] Add an editautoreviewprotected level protecion [mediawiki-config] - https://gerrit.wikimedia.org/r/926632 (https://phabricator.wikimedia.org/T337430)
[21:46:13] (CR) Cwhite: [C: +1] prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: Filippo Giunchedi)
[22:30:45] PROBLEM - SSH on bast3006 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:32:17] RECOVERY - SSH on bast3006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:01:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:14:16]
(MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:40:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:47:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:52:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:52:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:59:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded