[00:33:32] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:30] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fdecfee1280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:34:30] org/wiki/Search%23Administration
[00:35:04] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:58] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 669, active_shards: 1512, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:35:58] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:42:04] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:58:03] (PS1) Stang: jawikisource: Update project logo and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/876364 (https://phabricator.wikimedia.org/T326488)
[01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:37:46] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:18:42] (PS4) KartikMistry: WIP: Enable Content Translation/Section Translation on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714)
[04:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:19:02] (PS3) Samwilson: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033)
[06:26:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[06:28:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:14:09] SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) Unfortunately that didn't solve it for all switches: asw2-c-eqiad is all good, but A and B are still showing errors. asw2-a-eqiad: fpc1:port: 1/1 - CRC alignment er...
[07:19:56] SRE, LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (Wangombe) Thanks!
[07:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T0800).
[08:00:05] samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:25] Hullo. I'm here.
[08:01:02] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:29] Amir1 urbanecm are either of you deploying today?
[08:05:08] SRE, Infrastructure-Foundations, vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (SLyngshede-WMF) ` cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 --network public --cluster eqiad --group D idm-test1001 `
[08:05:41] (PS2) KartikMistry: ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721)
[08:06:07] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host idm-test1001.wikimedia.org
[08:06:10] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox
[08:06:43] kart_: any chance you can deploy for samwilson if you’re around?
[08:08:19] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm-test1001.wikimedia.org - slyngshede@cumin1001"
[08:08:41] TheresNoTime: ^
[08:10:44] (PS1) Muehlenhoff: Remove LDAP access for ddw [puppet] - https://gerrit.wikimedia.org/r/876797
[08:12:02] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm-test1001.wikimedia.org - slyngshede@cumin1001"
[08:12:02] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:12:02] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache idm-test1001.wikimedia.org on all recursors
[08:12:05] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idm-test1001.wikimedia.org on all recursors
[08:16:17] (CR) Muehlenhoff: [C: +2] Remove LDAP access for ddw [puppet] - https://gerrit.wikimedia.org/r/876797 (owner: Muehlenhoff)
[08:16:38] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[08:17:51] RhinosF1: Sorry, in a meeting now. Missed notification.
[08:21:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32035
[08:21:43] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idm-test1001.wikimedia.org
[08:23:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32035
[08:24:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 48237
[08:25:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 48237
[08:26:13] (PS1) Slyngshede: idm-test: Add Ganeti VM for IDM test/demo deployment. [puppet] - https://gerrit.wikimedia.org/r/876881 (https://phabricator.wikimedia.org/T326406)
[08:26:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 327700
[08:26:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 327700
[08:27:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:38] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:58] (PS2) Slyngshede: idm-test: Add Ganeti VM for IDM test/demo deployment. [puppet] - https://gerrit.wikimedia.org/r/876881 (https://phabricator.wikimedia.org/T326406)
[08:28:43] RhinosF1 kart_ If no one's around to deploy now, it can be held over to the afternoon window, when I think maybe TheresNoTime will be around.
[08:29:33] (CR) Slyngshede: [C: +2] C:ldap::client::utils remove ldapsupportlib [puppet] - https://gerrit.wikimedia.org/r/870524 (owner: Slyngshede)
[08:35:12] (PS1) Ayounsi: Depool ulsfo for network maintenance [dns] - https://gerrit.wikimedia.org/r/877091 (https://phabricator.wikimedia.org/T316532)
[08:36:22] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/876881 (https://phabricator.wikimedia.org/T326406) (owner: Slyngshede)
[08:37:05] (CR) Slyngshede: [C: +2] idm-test: Add Ganeti VM for IDM test/demo deployment. [puppet] - https://gerrit.wikimedia.org/r/876881 (https://phabricator.wikimedia.org/T326406) (owner: Slyngshede)
[08:43:43] (PS1) Muehlenhoff: Remove access for jhernandez [puppet] - https://gerrit.wikimedia.org/r/877092
[08:48:02] (CR) Muehlenhoff: [C: +2] Remove access for jhernandez [puppet] - https://gerrit.wikimedia.org/r/877092 (owner: Muehlenhoff)
[08:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:54:47] (PS1) Hashar: admin: bash config for Antoine [puppet] - https://gerrit.wikimedia.org/r/877093
[08:55:51] (CR) Ayounsi: [C: +2] Depool ulsfo for network maintenance [dns] - https://gerrit.wikimedia.org/r/877091 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[08:56:43] !log depool ulsfo for network maintenance - T316532
[08:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:47] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[08:58:54] !log installing glibc security updates
[08:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:55] SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (jcrespo) Thank you, while I understand why this was left open- sometimes a partial fix may only make things worse- in this specific case, I think this was a requirement t...
[09:03:25] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) >>! In T326425#8505438, @Dzahn wrote: > 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100...
[09:04:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[09:11:11] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[09:11:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[09:11:36] SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (hashar) Thanks @Jelto I will look at preparing patches for option B (//relocate Gerrit installation from...
[09:12:37] (PS2) Muehlenhoff: Add Cumin aliases for analytics postgres hosts [puppet] - https://gerrit.wikimedia.org/r/875354
[09:13:01] (CR) Muehlenhoff: Add Cumin aliases for analytics postgres hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875354 (owner: Muehlenhoff)
[09:18:22] (CR) Filippo Giunchedi: [C: +1] "Yes LGTM! Thank you for the heads up (assuming we're okay capacity wise!)" [puppet] - https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:33:02] (CR) Filippo Giunchedi: [C: +1] swift: move accounts_keys to common hiera [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[09:33:08] (CR) Filippo Giunchedi: [C: +1] hiera: move swift accounts_keys into common [labs/private] - https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[09:35:33] !log restarting blazegraph on wdqs1006 (BlazegraphFreeAllocatorsDecreasingRapidly)
[09:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:34] (PS1) Muehlenhoff: standard_packages: Remove multiarch-support on bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/877099
[09:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:25] (CR) Filippo Giunchedi: "LGTM overall (see inline)" [puppet] - https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[09:44:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:54] SRE-OnFire, Data-Engineering-Planning, Event-Platform Value Stream, serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (JMeybohm) >>! In T324994#8463619, @Clement_Goubert wrote: > We have the resources to keep it at 30 replic...
[09:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:46:32] (CR) Filippo Giunchedi: [C: +1] standard_packages: Remove multiarch-support on bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/877099 (owner: Muehlenhoff)
[09:47:28] (CR) Muehlenhoff: [C: +2] standard_packages: Remove multiarch-support on bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/877099 (owner: Muehlenhoff)
[09:48:06] SRE-OnFire, Data-Engineering-Planning, Event-Platform Value Stream, serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) No worries, I took a look at the resources and it seemed fine to leave it like that. We...
[09:49:55] (PS1) MVernon: hiera: remove ms-be2050 from servers_per_port 0 setting [puppet] - https://gerrit.wikimedia.org/r/877101 (https://phabricator.wikimedia.org/T308677)
[09:52:09] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[09:54:12] (PS1) Jelto: admin: add zabe to deployment group [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327)
[09:55:33] (CR) Jelto: "needs approval from group owner (Tyler)" [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: Jelto)
[09:56:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Jelto)
[09:57:27] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/877101 (https://phabricator.wikimedia.org/T308677) (owner: MVernon)
[09:57:45] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Jelto) @thcipriani we need your approval here to add zabe to `deployment` group. Can you have a look?
[09:58:18] (CR) Btullis: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/875354 (owner: Muehlenhoff)
[09:58:37] (CR) CI reject: [V: -1] admin: add zabe to deployment group [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: Jelto)
[10:00:28] SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Jelto) a:BCornwall→Eileenmcnaughton
[10:03:10] (CR) Volans: [C: +2] grammars: remove usage of leaveWhitespace [software/cumin] - https://gerrit.wikimedia.org/r/875985 (owner: Volans)
[10:03:17] (CR) Volans: [C: +2] setup.py: support Python 3.10 and Pyparsing 3 [software/cumin] - https://gerrit.wikimedia.org/r/875986 (owner: Volans)
[10:04:04] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Jelto) a:BCornwall→KHurd-WMF @KHurd-WMF does your access works as expected (like SSH into stat machine)? Feel free...
[10:05:25] (CR) Muehlenhoff: [C: +1] "Looks good" [software/cumin] - https://gerrit.wikimedia.org/r/875986 (owner: Volans)
[10:06:12] (CR) Btullis: [C: +1] Also include the Turnilo staging host in the analytics-tools alias [puppet] - https://gerrit.wikimedia.org/r/875970 (owner: Muehlenhoff)
[10:08:21] (CR) JMeybohm: flink and flink-kubernetes-operator image (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[10:12:36] (CR) JMeybohm: [C: -1] "You need to bump the version in Chart.yaml" [deployment-charts] - https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[10:12:53] (Merged) jenkins-bot: grammars: remove usage of leaveWhitespace [software/cumin] - https://gerrit.wikimedia.org/r/875985 (owner: Volans)
[10:12:55] (Merged) jenkins-bot: setup.py: support Python 3.10 and Pyparsing 3 [software/cumin] - https://gerrit.wikimedia.org/r/875986 (owner: Volans)
[10:13:13] (PS6) JMeybohm: flink-operator - add admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[10:18:41] (CR) Muehlenhoff: [C: +2] Also include the Turnilo staging host in the analytics-tools alias [puppet] - https://gerrit.wikimedia.org/r/875970 (owner: Muehlenhoff)
[10:20:49] (PS3) Muehlenhoff: Add Cumin aliases for analytics postgres hosts [puppet] - https://gerrit.wikimedia.org/r/875354
[10:25:21] (CR) JMeybohm: [C: +1] flink-operator - add admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[10:28:38] (CR) Muehlenhoff: admin: add data types to validate UIDs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875446 (owner: Dzahn)
[10:30:41] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/877101 (https://phabricator.wikimedia.org/T308677) (owner: MVernon)
[10:30:47] SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[10:30:49] SRE, Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (SLyngshede-WMF) In progress→Resolved
[10:33:53] (CR) Muehlenhoff: phabricator: change phd home dir to /var/lib/phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[10:34:37] (CR) Muehlenhoff: [C: +2] Add Cumin aliases for analytics postgres hosts [puppet] - https://gerrit.wikimedia.org/r/875354 (owner: Muehlenhoff)
[10:35:59] (PS2) Jelto: admin: add zabe to deployment group [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327)
[10:36:15] (CR) Filippo Giunchedi: [C: +1] hiera: remove ms-be2050 from servers_per_port 0 setting [puppet] - https://gerrit.wikimedia.org/r/877101 (https://phabricator.wikimedia.org/T308677) (owner: MVernon)
[10:41:04] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond)
[10:41:39] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[10:42:21] (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/877108 (owner: Clément Goubert)
[10:42:52] (PS1) Ayounsi: Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/876425
[10:43:40] jouncebot: next
[10:43:40] In 0 hour(s) and 16 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1100)
[10:44:49] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet
[10:45:39] !log installing avahi security updates
[10:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:19] SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (hashar)
[10:46:21] !log switching maps to eqiad
[10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:23] (PS3) Clément Goubert: hiera: Fix mw_releases values for mw-debug [puppet] - https://gerrit.wikimedia.org/r/877108 (https://phabricator.wikimedia.org/T326542)
[10:46:49] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[10:47:28] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[10:47:56] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39002/console" [puppet] - https://gerrit.wikimedia.org/r/877108 (https://phabricator.wikimedia.org/T326542) (owner: Clément Goubert)
[10:48:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet
[10:49:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet
[10:49:39] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet
[10:50:48] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:51:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet
[10:52:22] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:52:41] (CR) MVernon: [C: +2] hiera: remove ms-be2050 from servers_per_port 0 setting [puppet] - https://gerrit.wikimedia.org/r/877101 (https://phabricator.wikimedia.org/T308677) (owner: MVernon)
[10:54:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet
[10:54:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[10:54:26] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[10:54:57] !log Starting codfw appserver rolling reboot
[10:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:59] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad
[10:55:03] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[10:55:29] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1100)
[11:01:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[11:06:10] (CR) Ayounsi: [C: +2] Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/876425 (owner: Ayounsi)
[11:06:39] !log repool ulsfo - T316532
[11:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:42] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[11:08:40] SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[11:08:43] SRE, Icinga, SRE Observability, serviceops: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:08:55] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[11:09:12] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[11:11:36] SRE, Icinga, SRE Observability, serviceops: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:13:46] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[11:14:46] SRE, Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (LSobanski) @Dzahn it looks like this task could be closed, or is there anything else that needs to happen?
[11:16:12] SRE, Infrastructure-Foundations: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (LSobanski) Looks like this still needs to happen as majority of hosts are on 2.4-1: https://debmonitor.wikimedia.org/packages/ferm
[11:16:58] SRE, Icinga, SRE Observability, serviceops: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (jcrespo)
[11:18:19] SRE, SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) I didn't proceed with the upgrade as there was errors. I opened a JTAC case 2023-0109-616616 with: > Hi, > I'm trying to...
[11:19:19] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet
[11:22:38] SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (jcrespo) Wait, did my patch broke the check for enwiki? @Dzahn ? {F36067467}
[11:23:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet
[11:26:03] SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (jcrespo) No, I think it is a new breakage (albeit similar to the wikibooks one), due to this other edit: https://en.wikipedia.org/w/index.php?title=MediaWiki%3AWikimedia-...
[11:28:43] !log depool cp5025 due to purging issues [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:45] !log restart purged on cp5025 [11:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:44] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [11:40:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15954 [11:41:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15954 [11:42:30] (03PS4) 10Clément Goubert: hiera: Fix mw_releases values for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/877108 (https://phabricator.wikimedia.org/T326542) [11:46:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "This will make the mw configuration equivalent to the configuration we have on kubernetes." [puppet] - 10https://gerrit.wikimedia.org/r/875894 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [11:49:14] 10SRE, 10Observability-Alerting, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) [11:49:18] 10SRE, 10Observability-Alerting, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) 05Resolved→03Open [11:49:41] (03CR) 10Hnowlan: [C: 03+2] Use blubber via Docker tooling; no longer requires local binary [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 (owner: 10Brion VIBBER) [11:50:26] 10SRE, 10Observability-Alerting, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) Increasing the scope to make sure the alert doesn't 
reoccur. I have sent a message on the change request, h... [11:51:47] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) JTAC replied with (I cherry picked the useful info): > As JTAC we would suggest that you perform a step upgrade in your c... [11:52:12] (03PS1) 10Muehlenhoff: postgresql::user: No longer compare password hashes on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) [11:53:57] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) As the #WMF-Legal project tag was added to this task, some general in... [11:57:00] (03Merged) 10jenkins-bot: Use blubber via Docker tooling; no longer requires local binary [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 (owner: 10Brion VIBBER) [11:57:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [12:00:14] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/875894/39001/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/875894 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [12:02:51] (03PS1) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [12:03:12] (03CR) 10CI reject: [V: 04-1] role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [12:18:55] !log repool cp5025 [12:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:50] !log hnowlan@deploy1002 Started deploy [restbase/deploy@bcb0a69]: New wikis T321284 T321290 T321296 T326140 [12:34:57] T321290: Add guwwikiquote to RESTBase - https://phabricator.wikimedia.org/T321290 [12:34:58] T326140: Add gorwiktionary to RESTBase - https://phabricator.wikimedia.org/T326140 [12:34:58] T321296: Add aswikiquote to RESTBase - https://phabricator.wikimedia.org/T321296 [12:34:58] T321284: Add shnwikibooks to RESTBase - https://phabricator.wikimedia.org/T321284 [12:40:16] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10LSobanski) [12:42:00] (03PS2) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [12:42:35] (03CR) 10CI reject: [V: 04-1] role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [12:42:55] 10SRE, 10Wikimedia-GitHub, 10serviceops-collab: stop syncing and delete labs/private repo from github - https://phabricator.wikimedia.org/T315925 (10LSobanski) [12:44:00] 10SRE, 10Infrastructure-Foundations: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (10LSobanski) [12:44:16] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:45:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:46:21] (03Abandoned) 10Clément Goubert: hiera: Fix mw_releases values for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/877108 (https://phabricator.wikimedia.org/T326542) (owner: 10Clément Goubert) [12:46:55] (DDoSDetected) firing: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [12:51:30] (03PS3) 10Slyngshede: role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [12:51:55] (DDoSDetected) resolved: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [12:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:53:46] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@bcb0a69]: New wikis T321284 T321290 T321296 T326140 (duration: 18m 56s) [12:53:54] T321290: Add guwwikiquote to RESTBase - https://phabricator.wikimedia.org/T321290 [12:53:54] T326140: Add gorwiktionary to RESTBase - https://phabricator.wikimedia.org/T326140 [12:53:54] T321296: Add aswikiquote to RESTBase - https://phabricator.wikimedia.org/T321296 [12:53:55] T321284: Add shnwikibooks to RESTBase - https://phabricator.wikimedia.org/T321284 [12:58:14] (03PS4) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [12:58:35] (03CR) 10CI reject: [V: 04-1] role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [13:00:24] 10SRE, 10Wikimedia-Portals, 10serviceops, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10LSobanski) [13:00:42] (03CR) 10Muehlenhoff: "The PCC error can be ignored, these (abandoned) Stretch instances have been broken before" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff) [13:01:03] (03PS5) 10Slyngshede: role:IDM assign IDM role to test VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [13:01:23] (03CR) 10CI reject: [V: 04-1] role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [13:02:11] (03CR) 10Muehlenhoff: role:IDM assign IDM role to test VM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [13:08:44] (03PS6) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:35:06] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [13:35:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:35:31] (03PS24) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [13:36:34] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:36:37] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1003.eqiad.wmnet [13:41:53] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1003.eqiad.wmnet [13:43:37] (03PS25) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 
10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [13:44:44] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:50:19] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39003/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [13:52:13] (03PS3) 10Jelto: P:spicerack: add python-gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) [13:52:40] (03PS36) 10Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) [13:55:15] !log installing systemd bugfix updates from Bullseye point release [13:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:15] (03PS1) 10KartikMistry: CX: Fix usage of categories translation unit as array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) [13:56:43] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [13:56:45] (03CR) 10Jelto: [C: 03+2] P:spicerack: add python-gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [13:58:51] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: 
(Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1400). Please do the needful. [14:00:05] cirno and samwilson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:32] o/ [14:00:53] \o [14:02:26] o/ [14:03:07] I can deploy [14:03:51] sounds good [14:04:19] (03PS3) 10Hashar: gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) [14:05:11] (03PS1) 10KartikMistry: CX: Allow composer/installers plugin [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877139 [14:05:14] skimming the mw.o discussion about flow [14:06:53] (03CR) 10Hashar: "Jelto and I talked about early in December. This will use a 'gerrit-deploy' user top push the material to the user and reserve the 'gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [14:07:56] (03PS2) 10Lucas Werkmeister (WMDE): mediawikiwiki: Disable Flow on new pages by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871286 (https://phabricator.wikimedia.org/T325907) (owner: 10Stang) [14:08:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871286 (https://phabricator.wikimedia.org/T325907) (owner: 10Stang) [14:08:57] (03Merged) 10jenkins-bot: mediawikiwiki: Disable Flow on new pages by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871286 (https://phabricator.wikimedia.org/T325907) (owner: 10Stang) [14:09:16] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:871286|mediawikiwiki: Disable Flow on new pages by default (T325907)]] [14:09:19] T325907: Disable 
Flow on new pages by default on MediaWiki.org - https://phabricator.wikimedia.org/T325907 [14:09:34] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service,planet_sync_tile_generation-gis.service,regen-zoom-level-tilerator-regen.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:50] (03CR) 10CI reject: [V: 04-1] CX: Fix usage of categories translation unit as array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [14:09:54] (03CR) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [14:11:58] RECOVERY - Host mw1486 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:13:36] Lucas_WMDE: Added patch, [wmf.17] {{gerrit|877139}} to the deployment window, hopefully we can deploy it. [14:14:26] mh, I was already skeptical the existing patches were doable [14:14:38] as scap seems to be taking longer and longer (due to k8s? not sure) [14:14:47] we’ll see, I guess [14:15:09] wait, does the composer/installers fix really need to be backported? [14:15:29] I would’ve thought master and REL1_{35,38,39} would be enough [14:15:57] Lucas_WMDE: It seems backport of wmf.17 failing otherwise.. [14:16:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) Preformed Flea Power Drain As requested by Dell [14:16:24] CX backports or any backports? [14:16:28] Lucas_WMDE: See: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/877138 [14:16:33] Lucas_WMDE: CX Backport. 
[14:17:05] I see… I didn’t see any other planned CX backports on the calendar [14:17:30] Yeah, because it is failing, I wanted to do composer fix first :) [14:18:07] And, since window is full, haven't added 2 more patches.. [14:18:12] (03CR) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [14:18:13] ok [14:18:41] I already rescheduled the WikiEditor one from this morning's window. Any chance of moving that up the queue? It's just that we said it'd go out today, and I'm not sure any of my team can make it to the next window. [14:18:51] (this one: https://gerrit.wikimedia.org/r/c/868816/ ) [14:18:53] (03PS3) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) [14:19:03] samwilson: I’ll try [14:19:25] I probably should’ve postponed the mediawikiwiki change for last, actually [14:19:28] that seems like the least important one [14:19:30] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:871286|mediawikiwiki: Disable Flow on new pages by default (T325907)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:19:33] but way too late for that [14:19:35] T325907: Disable Flow on new pages by default on MediaWiki.org - https://phabricator.wikimedia.org/T325907 [14:19:39] cirno: ^ there we go, can you test?
[14:19:44] looking [14:19:57] (03PS17) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [14:20:32] Lucas_WMDE: thanks [14:20:39] I went to Special:NewPages for some fresh talk pages and it looks good to me so far [14:20:39] Lucas_WMDE, I randomly chose a page which has no talkpage, and I noticed there's no flow related stuff anymore, so LGTM [14:20:48] (i.e. on mwdebug the blank talk page shows the new DiscussionTools interface) [14:20:52] cirno: ack, thanks [14:20:57] syncing [14:21:32] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [14:24:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] arwiki: Create extendedmover group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:25:59] (03PS2) 10Stang: arwiki: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) [14:26:26] (03CR) 10Stang: arwiki: Create extendedmover group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:26:26] cirno: sorry, I just meant “I’m mentioning it here in a review comment”, but okay ^^ [14:26:43] samwilson: trying to understand your change [14:26:55] what’s the current status of feature? beta feature enabled by default?
[14:27:35] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:871286|mediawikiwiki: Disable Flow on new pages by default (T325907)]] (duration: 18m 19s) [14:27:39] T325907: Disable Flow on new pages by default on MediaWiki.org - https://phabricator.wikimedia.org/T325907 [14:28:01] Lucas_WMDE: no, beta feature that's being graduated to not a beta feature [14:28:14] okay, but how does removing it from the beta features whitelist do that? [14:28:23] and there's also a default-on gadget on plwiki that's being turned off at the same time [14:28:48] 10SRE: skylake CPU numa clustering settting discussion - https://phabricator.wikimedia.org/T207312 (10LSobanski) 05Open→03Resolved a:03LSobanski We are 3 CPU generations later and not much discussion has happened here so I'll resolve this task. [14:28:55] does the extension check $wgBetaFeaturesWhitelist to decide whether it offers the feature as a beta feature or all the time? [14:29:55] (03CR) 10Hnowlan: [C: 03+2] Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik) [14:30:08] yep, indirectly through BetaFeatures::isFeatureEnabled() [14:30:54] https://gerrit.wikimedia.org/g/mediawiki/extensions/WikiEditor/+/cf73418e9cd7408c39525b24e79f447752c1065b/includes/Hooks.php#246 [14:31:26] !log upgrade thanos to 0.30.1 on prometheus2005 - T303154 [14:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:29] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [14:33:00] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10hashar) [14:33:05] 10Puppet, 10SRE, 10Infrastructure-Foundations: Knock down puppet 4 deprecation warnings - https://phabricator.wikimedia.org/T193664 (10hashar) 05Open→03Resolved This has most probably 
been entirely fixed. [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/333012 | Gerrit 333012 - test: puppet-syntax... [14:33:11] I’m still confused [14:33:18] that will change $betaFeatureEnabled to false, right? [14:33:27] but $betaFeaturesInstalled will still be true, the extension is loaded [14:33:37] so the module won’t be loaded anymore, right? [14:33:59] (03Merged) 10jenkins-bot: Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik) [14:34:09] we can still try it out on mwdebug, I guess… [14:34:16] Lucas_WMDE: eek, you're right! [14:34:22] we totally overlooked that, sorry. [14:35:06] 10SRE: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632 (10LSobanski) 05Open→03Declined This request goes against T254646, resolving. If you still feel strongly about renaming production to something else (main?), a new task is the way t... [14:35:10] ok, then I’ll do the arwiki change next, since I already reviewed that [14:35:15] and I guess WikiEditor will need a backport too? [14:35:21] some new “enable all the time” flag? [14:35:39] (03PS3) 10Lucas Werkmeister (WMDE): arwiki: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:35:43] (03PS4) 10Lucas Werkmeister (WMDE): arwiki: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:35:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:35:59] yep, I'll get that sorted. The config change will have to happen later.
[14:36:01] let’s see if the k8s stuff takes so long again [14:36:05] samwilson: good luck :/ [14:36:13] :-) [14:36:17] jouncebot: next [14:36:18] In 1 hour(s) and 53 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1630) [14:36:38] (I’d be available to deploy stuff in the break between windows too, if it helps) [14:36:40] (03Merged) 10jenkins-bot: arwiki: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) (owner: 10Stang) [14:36:56] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:876310|arwiki: Create extendedmover group (T326434)]] [14:36:59] T326434: Create Page Mover user group at arwiki - https://phabricator.wikimedia.org/T326434 [14:38:36] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:876310|arwiki: Create extendedmover group (T326434)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:39:32] looks good to me, cirno do you want to confirm? 
[14:39:34] this group appears correctly in https://ar.wikipedia.org/w/index.php?title=Special:Listgrouprights [14:39:43] ack (I checked https://ar.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&formatversion=2) [14:39:49] syncing [14:39:59] and nice that scap didn’t take as long this time [14:40:40] (03PS3) 10Muehlenhoff: redis: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868706 (https://phabricator.wikimedia.org/T308013) [14:41:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876364 (https://phabricator.wikimedia.org/T326488) (owner: 10Stang) [14:42:59] 10SRE, 10Traffic: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10LSobanski) [14:45:12] 10SRE, 10Traffic: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10BBlack) [14:45:16] 10SRE, 10Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10BBlack) [14:45:39] 10SRE, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Seen): Expand Gerrit Manager permissions - https://phabricator.wikimedia.org/T234474 (10LSobanski) [14:45:52] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:876310|arwiki: Create extendedmover group (T326434)]] (duration: 08m 56s) [14:45:55] T326434: Create Page Mover user group at arwiki - https://phabricator.wikimedia.org/T326434 [14:46:05] (03PS2) 10Lucas Werkmeister (WMDE): jawikisource: Update project logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876364 (https://phabricator.wikimedia.org/T326488) (owner: 10Stang) [14:46:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876364 (https://phabricator.wikimedia.org/T326488) (owner: 10Stang) [14:46:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot 
failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10ssingh) >>! In T326425#8508075, @Clement_Goubert wrote: >>>! In T326425#8505438, @Dzahn wrote: >> 18:16 <+icinga-wm> PROBLEM - Host mw1486 i... [14:46:58] (03Merged) 10jenkins-bot: jawikisource: Update project logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876364 (https://phabricator.wikimedia.org/T326488) (owner: 10Stang) [14:47:11] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:876364|jawikisource: Update project logo and wordmark (T326488)]] [14:47:14] T326488: Requesting permanent logo and wordmark change for ja.wikisource.org - https://phabricator.wikimedia.org/T326488 [14:47:15] (03CR) 10Muehlenhoff: [C: 03+2] redis: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868706 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:47:34] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet [14:47:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:48:36] (03PS2) 10Muehlenhoff: Add role_contacts for role::mariadb::misc::analytics::backup [puppet] - 10https://gerrit.wikimedia.org/r/868670 [14:48:51] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:876364|jawikisource: Update project logo and wordmark (T326488)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:48:58] cirno: ^ [14:49:06] looking [14:49:20] looks good on my end, I’d say [14:50:23] checked on vector and vector-2022 and LGTM [14:50:30] great, thanks [14:51:35] TIL “ext” is a language code and “extwiki” exists, nice [14:51:55] 10SRE, 10WMF-General-or-Unknown: 
Increase $wgHTTPImportTimeout to a higher value on WMF wikis - https://phabricator.wikimedia.org/T155209 (10LSobanski) 05Open→03Resolved a:03LSobanski It doesn't seem like there's anything actionable left in this task, resolving. [14:51:57] 10SRE, 10MediaWiki-Core-Snapshots, 10WMF-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000 (10LSobanski) [14:52:00] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Jclark-ctr) Cleared CEL Dell requested set the system profile to performance [14:52:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet [14:54:35] (03CR) 10JMeybohm: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [14:54:45] (03PS26) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [14:55:02] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2003.codfw.wmnet [14:56:36] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:876364|jawikisource: Update project logo and wordmark (T326488)]] (duration: 09m 24s) [14:56:39] T326488: Requesting permanent logo and wordmark change for ja.wikisource.org - https://phabricator.wikimedia.org/T326488 [14:58:23] (03PS1) 10Aklapper: phab: Improve error message for too large file uploads [puppet] - 10https://gerrit.wikimedia.org/r/877188 (https://phabricator.wikimedia.org/T155130) [14:59:20] !log lucaswerkmeister-wmde@mwmaint1002:~$ echo 'https://en.wikipedia.org/static/images/project-logos/jawikisource.png' | mwscript purgeList.php # T326488 [14:59:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:28] (03PS2) 10Lucas Werkmeister (WMDE): extwiki: Install SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876311 (https://phabricator.wikimedia.org/T326450) (owner: 10Stang) [14:59:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876311 (https://phabricator.wikimedia.org/T326450) (owner: 10Stang) [14:59:43] backport window will run over a bit [14:59:50] 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [15:00:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2003.codfw.wmnet [15:00:25] (03Merged) 10jenkins-bot: extwiki: Install SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876311 (https://phabricator.wikimedia.org/T326450) (owner: 10Stang) [15:00:42] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:876311|extwiki: Install SandboxLink extension (T326450)]] [15:00:46] T326450: Activate SandboxLink extensions for ext.wikipedia - https://phabricator.wikimedia.org/T326450 [15:02:18] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:876311|extwiki: Install SandboxLink extension (T326450)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:02:24] cirno: ^ [15:02:28] looking [15:03:12] it works [15:03:14] seems to work on my end, assuming «zona de pruebas» = sandbox [15:03:15] yay [15:03:21] syncing [15:04:35] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet [15:04:53] (03PS4) 10Samwilson: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) [15:07:28] (03PS1) 10Jelto: sre.gitlab.upgrade: remove call to super init methode [cookbooks] - 10https://gerrit.wikimedia.org/r/877190 (https://phabricator.wikimedia.org/T323569) [15:07:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:07:54] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/877190 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:08:04] (03CR) 10JMeybohm: [C: 03+1] Update flink-kubernetes-operator chart with upstream changes for 1.3.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:09:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet [15:09:19] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:876311|extwiki: Install SandboxLink extension (T326450)]] (duration: 08m 37s) [15:09:22] T326450: Activate SandboxLink extensions for ext.wikipedia - https://phabricator.wikimedia.org/T326450 [15:09:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877139 (owner: 10KartikMistry) [15:09:57] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: remove call to super init methode [cookbooks] - 10https://gerrit.wikimedia.org/r/877190 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:10:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:11:16] 
!log disable puppet on all 'P:mediawiki::mcrouter_wancache' hosts to merge 875894 [15:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:53] (03PS4) 10Hnowlan: maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) [15:11:55] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: remove call to super init methode [cookbooks] - 10https://gerrit.wikimedia.org/r/877190 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:12:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:14:50] (03CR) 10Effie Mouzeli: [C: 03+2] P:mediawiki::mcrouter_wancache: add gutter pools for /*/mw-wan keys [puppet] - 10https://gerrit.wikimedia.org/r/875894 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [15:17:17] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Xaosflux) c.f. https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Text_... 
[15:17:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on maps2009.codfw.wmnet,maps1009.eqiad.wmnet with reason: Removing redis service [15:17:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on maps2009.codfw.wmnet,maps1009.eqiad.wmnet with reason: Removing redis service [15:17:59] (03CR) 10Hnowlan: [C: 03+2] maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:18:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39004/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [15:22:18] (03PS27) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [15:22:51] (03PS1) 10Jelto: sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) [15:24:09] (03Merged) 10jenkins-bot: CX: Allow composer/installers plugin [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877139 (owner: 10KartikMistry) [15:24:24] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:877139|CX: Allow composer/installers plugin]] [15:25:54] Backport to WMF branches with scap backport is easier but taking long time as we can't do +2 in advance :/ [15:26:06] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and kartik: Backport for [[gerrit:877139|CX: Allow composer/installers plugin]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:27:02] (03PS2) 10Btullis: Detect the correct disks for the O/S on the cephosd servers [puppet] - 
10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) [15:28:00] !log Starting codfw jobrunner rolling reboot [15:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:35] yeah [15:28:44] Oh, no. [15:28:48] You're still deploying [15:28:50] Not rebooting [15:29:02] ah, sorry [15:29:07] !log Not starting codfw jobrunner rolling reboot, deploy in progress [15:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:11] Lucas_WMDE: No no don't [15:29:18] claime: is that why scap just got a check failure on mw1449? [15:29:22] No [15:29:26] :) [15:29:33] I logged before starting and haven't actually started [15:29:38] Lucas_WMDE: What's the failure? [15:29:45] (03CR) 10JMeybohm: [C: 03+1] flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:29:58] Check 'Check endpoints for mw1449.eqiad.wmnet' failed: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='mw1449.eqiad.wmnet', port=443): Read timed out. (read timeout=5)")': /wiki/Special%3AVersion [15:29:58] /wiki/{title} (Special Version) timed out before a response was received [15:30:11] (I hope that wasn’t long enough to trigger truncation) [15:30:29] It was but it's no problem [15:30:40] that was during the canary phase [15:30:51] sync-apaches was okay for all 340 of them [15:30:59] so… hopefully fine? [15:31:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (10thcipriani) >>! In T326327#8508324, @Jelto wrote: > @thcipriani we need your approval here to add zabe to `deployment` group. Can you have a look? Approved! Would be happ... [15:31:26] Lucas_WMDE: Yeah, it's up so.. 
I don't know what happened to it [15:31:46] (03CR) 10Btullis: Detect the correct disks for the O/S on the cephosd servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [15:33:43] (03PS2) 10Lucas Werkmeister (WMDE): CX: Fix usage of categories translation unit as array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [15:33:56] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:34:27] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:877139|CX: Allow composer/installers plugin]] (duration: 10m 03s) [15:34:37] ok, I think I’m done [15:34:44] Lucas_WMDE: It passes all httpbb checks, I'd say hiccup? [15:34:46] kart_: I hope https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/877138 can wait for the next backport window? 
[15:34:49] claime: yeah, sounds ok to me [15:35:38] !log UTC afternoon backport+config window done [15:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] If you need more time to finish deploying it's all good to me, don't worry [15:35:46] Ok then :D [15:35:59] I think I’ve dragged out the window for long enough already ^^ [15:36:05] :D [15:36:30] (I think the “max 6 patches” for backport windows is pretty optimistic at the moment, and maybe 4 would be more realistic) [15:36:44] (OTOH it doesn’t mean “6 patches guaranteed” so you could say it’s fine ^^) [15:37:25] (03CR) 10Btullis: [C: 03+1] flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:37:33] jouncebot: nowandnext [15:37:33] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [15:37:33] In 0 hour(s) and 52 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1630) [15:37:43] !log Starting codfw jobrunner rolling reboot [15:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:46] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:38:04] (03CR) 10Btullis: [C: 03+2] Correct the description for insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/867592 (owner: 10Btullis) [15:38:38] samwilson: just out of curiosity, are you still planning to deploy the WikiEditor change today? 
[15:38:46] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:38:50] (assuming the new config would get reviewed in time) [15:38:58] (03CR) 10Ottomata: [C: 03+2] flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:39:05] (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/868670 (owner: 10Muehlenhoff) [15:39:09] Lucas_WMDE: yeah. [15:39:39] (03CR) 10Ottomata: [C: 03+2] Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:39:50] Lucas_WMDE: Thanks a lot. [15:39:57] np, good luck with the backport! [15:40:02] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) Let's wait and see what is legal's opinion first, and then we can pla... [15:41:46] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/860556 (owner: 10Muehlenhoff) [15:42:07] (03PS3) 10Muehlenhoff: sre.misc-clusters.roll-restart-reboot-eventschemas: Also restart envoyproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/860556 [15:43:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:45:04] (03Merged) 10jenkins-bot: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:45:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) >>! In T326425#8509196, @Jclark-ctr wrote: > Preformed Flea Power Drain As requested by Dell Can we pool it back, or do... [15:46:28] (03PS28) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:48:16] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) >>! In T326119#8509402, @Jclark-ctr wrote: > Cleared SEL Dell requested set the system profile to performance The cpu governor is alrea... 
[15:48:36] !log enable puppet on all mw hosts [15:48:54] (03CR) 10Muehlenhoff: [C: 03+2] sre.misc-clusters.roll-restart-reboot-eventschemas: Also restart envoyproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/860556 (owner: 10Muehlenhoff) [15:49:32] (03CR) 10Muehlenhoff: [C: 03+2] Add role_contacts for role::mariadb::misc::analytics::backup [puppet] - 10https://gerrit.wikimedia.org/r/868670 (owner: 10Muehlenhoff) [15:54:57] (03PS1) 10Ottomata: Add flink to profile::docker::builder::known_uid_mappings [puppet] - 10https://gerrit.wikimedia.org/r/877193 (https://phabricator.wikimedia.org/T316519) [15:56:37] (03PS1) 10Ottomata: Add comment about syncing known_uid_mappings with puppet [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877194 [15:56:45] (03PS2) 10Ottomata: Add comment about syncing known_uid_mappings with puppet [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877194 [15:56:59] (03CR) 10Ottomata: [C: 03+2] Add flink to profile::docker::builder::known_uid_mappings [puppet] - 10https://gerrit.wikimedia.org/r/877193 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:57:15] (03CR) 10Ottomata: [C: 03+2] Add comment about syncing known_uid_mappings with puppet [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877194 (owner: 10Ottomata) [15:57:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add comment about syncing known_uid_mappings with puppet [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877194 (owner: 10Ottomata) [15:58:55] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2019.codfw.wmnet [16:01:34] (03CR) 10Ahmon Dancy: "Clement, can you delete the old files from deploy1002:/etc/kubernetes/ please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [16:02:50] (03PS1) 10Marostegui: db1176: Clarify db1176 status [puppet] - 10https://gerrit.wikimedia.org/r/877195 [16:03:26] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [16:03:35] (03CR) 10Jcrespo: [C: 03+1] db1176: Clarify db1176 status [puppet] - 10https://gerrit.wikimedia.org/r/877195 (owner: 10Marostegui) [16:04:17] marostegui: I think I didn't update zarcillo for those 2 hosts, in case it is needed [16:04:36] !log start VC link maintenance in eqiad - T325803 [16:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:39] T325803: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 [16:05:09] (03CR) 10Marostegui: [C: 03+2] db1176: Clarify db1176 status [puppet] - 10https://gerrit.wikimedia.org/r/877195 (owner: 10Marostegui) [16:05:22] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:08:12] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [16:09:57] (03PS1) 10Eigyan: [config]: Deploy GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877197 (https://phabricator.wikimedia.org/T325136) [16:11:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [16:11:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2019.codfw.wmnet [16:11:14] (03PS2) 10Eigyan: [config]: Deploy 
GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877197 (https://phabricator.wikimedia.org/T325136) [16:11:23] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2020.codfw.wmnet [16:16:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:19:24] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) Replaced and counters cleared. Let's check in a couple days. [16:22:31] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [16:29:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1630). 
[16:32:50] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [16:34:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST jobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:35:09] (03PS1) 10Ottomata: flink images - Use hkps:// and conditionally set http-proxy gpg keyserver-options [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877201 (https://phabricator.wikimedia.org/T316519) [16:36:36] (03PS1) 10Ayounsi: mr: explicitely set "mode event" when not in mode stream [homer/public] - 10https://gerrit.wikimedia.org/r/877202 (https://phabricator.wikimedia.org/T325806) [16:39:06] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink images - Use hkps:// and conditionally set http-proxy gpg keyserver-options [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877201 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [16:39:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:08] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [16:40:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [16:45:18] I'm not sure what exactly is causing this, but I'm seeing a lot of the following error in Huggle: WARNING: Failed to obtain diff for Priyamaana Thozhi (TV series) the error was: code: assertuserfailed details: See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at for notice of API deprecations and [16:45:18] breaking changes. 
[16:46:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [16:46:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [16:46:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:51] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc2020.codfw.wmnet [16:48:49] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [16:49:13] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [16:49:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST jobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:50:27] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-01-09-162934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/877204 [16:51:28] 10SRE, 10Machine-Learning-Team, 10ORES: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10akosiaris) [16:51:55] 10SRE, 10Machine-Learning-Team, 10ORES: Stress test ORES on kubernetes (above 4.5k scores/second) - https://phabricator.wikimedia.org/T214054 (10akosiaris) 05Stalled→03Declined ORES isn't going to ever be on Kubernetes, closing this as Declined. 
[16:52:52] (03PS1) 10Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) [16:52:54] (03CR) 10Marostegui: [C: 03+1] ToolsDB: stop replicating a big problematic table [puppet] - 10https://gerrit.wikimedia.org/r/876011 (https://phabricator.wikimedia.org/T326261) (owner: 10FNegri) [16:54:39] (03CR) 10Ayounsi: [C: 03+2] mr: explicitely set "mode event" when not in mode stream [homer/public] - 10https://gerrit.wikimedia.org/r/877202 (https://phabricator.wikimedia.org/T325806) (owner: 10Ayounsi) [16:54:58] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST jobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:55:20] (03Merged) 10jenkins-bot: mr: explicitely set "mode event" when not in mode stream [homer/public] - 10https://gerrit.wikimedia.org/r/877202 (https://phabricator.wikimedia.org/T325806) (owner: 10Ayounsi) [16:59:49] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2021.codfw.wmnet [17:06:46] (03CR) 10FNegri: [C: 03+2] ToolsDB: stop replicating a big problematic table [puppet] - 10https://gerrit.wikimedia.org/r/876011 (https://phabricator.wikimedia.org/T326261) (owner: 10FNegri) [17:10:13] (03PS1) 10BCornwall: errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) [17:13:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:15:00] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39005/console" [puppet] - 
10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [17:15:10] (03CR) 10BCornwall: errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [17:15:38] (03PS1) 10JMeybohm: Increase memory requests/limits of cert-manager cainjector [deployment-charts] - 10https://gerrit.wikimedia.org/r/877229 [17:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:23:19] (03PS1) 10Ottomata: flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877230 (https://phabricator.wikimedia.org/T316519) [17:24:03] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877230 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [17:24:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877230 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [17:25:52] (03PS29) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:26:05] (03CR) 10Ottomata: flink-app chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:31:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [17:32:24] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [17:33:01] (03CR) 10JMeybohm: [C: 03+2] Increase memory requests/limits of cert-manager cainjector [deployment-charts] - 10https://gerrit.wikimedia.org/r/877229 (owner: 10JMeybohm) [17:34:18] !log Finished codfw jobrunner rolling reboot [17:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:20] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:35:25] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:36:13] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [17:38:37] (03Merged) 10jenkins-bot: Increase memory requests/limits of cert-manager cainjector [deployment-charts] - 10https://gerrit.wikimedia.org/r/877229 (owner: 10JMeybohm) [17:41:01] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:41:35] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:41:43] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:42:04] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:43:56] (03CR) 10JMeybohm: flink-app chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:45:35] (03PS2) 10BCornwall: errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) [17:46:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [17:46:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:46:44] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc2021.codfw.wmnet [17:47:07] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [17:53:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [17:56:05] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2022.codfw.wmnet [17:58:47] (03PS3) 10BCornwall: errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) [17:59:50] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Jclark-ctr) 05Open→03Resolved @clement_goubert. I updated this morning. Dell has said this will resolve our issue I am closing this ticket and hope i... [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1800) [18:00:04] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T1800). 
[18:00:38] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [18:02:11] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [18:02:48] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39006/console" [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [18:02:59] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [18:04:50] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:06:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:12] (03PS1) 10Clément Goubert: Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - 10https://gerrit.wikimedia.org/r/877207 [18:07:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [18:07:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2022.codfw.wmnet [18:07:13] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/865207/39007/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn) [18:07:40] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: /sec-warning page: please add an HTML comment that is more easily visible to API and transport-level inspection/debugging - https://phabricator.wikimedia.org/T240794 
(10BCornwall) 05Open→03In progress a:03BCornwall [18:09:10] (03PS2) 10Clément Goubert: Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - 10https://gerrit.wikimedia.org/r/877207 (https://phabricator.wikimedia.org/T326119) [18:09:35] 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) @Jclark-ctr Thank you :) I'll repool the machine and remove the downtimes tomorrow. [18:11:34] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this changed the main group of vcs to "phd" but we only wanted it to be an additional group" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn) [18:12:02] (03CR) 10JMeybohm: [C: 03+1] "Apart from the duplicate annotations in the pod template, this LGTM. So feel free to merge with that fixed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:14:39] (03PS1) 10Dzahn: phabricator: have vcs system user in group vcs and phd [puppet] - 10https://gerrit.wikimedia.org/r/877235 [18:15:50] (03CR) 10Dzahn: [C: 03+2] phabricator: have vcs system user in group vcs and phd [puppet] - 10https://gerrit.wikimedia.org/r/877235 (owner: 10Dzahn) [18:20:13] (03PS1) 10Ottomata: flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877237 (https://phabricator.wikimedia.org/T316519) [18:20:40] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877237 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [18:20:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877237 
(https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [18:21:02] (03CR) 10Dzahn: [C: 03+2] "User[vcs]/groups: groups changed phd to ['phd', 'vcs'] (corrective) - situation is like before we used systemd::sysuser now." [puppet] - 10https://gerrit.wikimedia.org/r/877235 (owner: 10Dzahn) [18:22:53] (03CR) 10Dzahn: [C: 03+1] "does this need phab deploy or only puppet?" [puppet] - 10https://gerrit.wikimedia.org/r/877188 (https://phabricator.wikimedia.org/T155130) (owner: 10Aklapper) [18:24:27] (03PS30) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [18:24:29] (03CR) 10Ottomata: flink-app chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:29:56] 10SRE, 10Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (10BCornwall) @Vgutierrez, since you were involved with the libvmod-netmapper upgrades, would you say that this 2-year-old issue is fixed? [18:30:01] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2023.codfw.wmnet [18:30:30] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [18:30:30] 10SRE, 10Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (10BCornwall) Ugh, sorry, meant to put that in T266651 [18:31:31] 10SRE, 10Traffic-Icebox: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 (10BCornwall) @Vgutierrez, since you were involved with the libvmod-netmapper upgrades, would you say that this 2-year-old issue is fixed? 
[18:33:06] (03PS1) 10Ottomata: flink-kubernetes-operator - add -Dmaven.antrun.skip=true to mvn package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877241 (https://phabricator.wikimedia.org/T316519) [18:33:23] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-kubernetes-operator - add -Dmaven.antrun.skip=true to mvn package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/877241 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [18:34:19] 10SRE, 10Traffic-Icebox: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (10BCornwall) 05Open→03Invalid As ATS no longer serves client-facing traffic via HTTP/2 (HAProxy handles this now), this is no longer relevant. While it's possible for ATS to handle such tr... [18:36:02] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [18:38:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [18:39:42] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) @RLazarus Do you agree this is resolved (enough)? 
[18:41:29] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [18:42:17] (03CR) 10Ottomata: [C: 03+2] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:43:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [18:43:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2023.codfw.wmnet [18:43:36] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [18:44:08] (03PS4) 10BCornwall: errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) [18:44:40] (03CR) 10BBlack: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [18:47:22] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39009/console" [puppet] - 10https://gerrit.wikimedia.org/r/866641 (https://phabricator.wikimedia.org/T270526) (owner: 10BCornwall) [18:48:32] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2024.codfw.wmnet [18:48:36] (03Merged) 10jenkins-bot: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:49:14] (03CR) 10BBlack: [C: 03+1] "Awesome work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/866641 (https://phabricator.wikimedia.org/T270526) (owner: 10BCornwall) [18:49:34] (03PS7) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [18:49:47] (03PS8) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [18:50:05] (03PS1) 10Ladsgroup: Remove Flow as default in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877244 [18:50:16] 10SRE, 10Performance-Team, 10Traffic, 10Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (10MBinder_WMF) I can report today that while the site is still slower... [18:51:05] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39010/console" [puppet] - 10https://gerrit.wikimedia.org/r/866641 (https://phabricator.wikimedia.org/T270526) (owner: 10BCornwall) [18:52:00] (03CR) 10BBlack: [C: 03+1] varnish: Export runtime params for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/863406 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [18:56:54] (03CR) 10Ladsgroup: "I'm definitely pro this change but currently some changes happening on storage of PC. 
Most notably switching the parsoid cache from restba" [puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: 10Marostegui) [18:57:34] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [19:00:13] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [19:00:39] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39011/console" [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [19:02:24] (03CR) 10BCornwall: [V: 03+1 C: 03+2] errorpage: Send a comment on browsersec errors [puppet] - 10https://gerrit.wikimedia.org/r/877227 (https://phabricator.wikimedia.org/T240794) (owner: 10BCornwall) [19:04:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [19:04:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:04:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2024.codfw.wmnet [19:04:45] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-01-09-162934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/877204 (owner: 10BryanDavis) [19:05:08] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2025.codfw.wmnet [19:05:28] (03CR) 10Dzahn: [C: 03+1] "no change that could be seen in compiler. it's unclear how this gets deployed exactly. 
https://puppet-compiler.wmflabs.org/output/877188/3" [puppet] - 10https://gerrit.wikimedia.org/r/877188 (https://phabricator.wikimedia.org/T155130) (owner: 10Aklapper) [19:09:09] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: /sec-warning page: please add an HTML comment that is more easily visible to API and transport-level inspection/debugging - https://phabricator.wikimedia.org/T240794 (10BCornwall) 05In progress→03Resolved /sec-warning now serves the following comment at the to... [19:10:07] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-01-09-162934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/877204 (owner: 10BryanDavis) [19:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:11:31] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [19:12:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:15:50] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2025.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [19:15:58] (KubernetesAPILatency) firing: (17) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:17:09] (03CR) 10Marostegui: "Sure, fine by me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: 10Marostegui) [19:20:59] (KubernetesAPILatency) firing: (21) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:22:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2025.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [19:22:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:22:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2025.codfw.wmnet [19:25:58] (KubernetesAPILatency) resolved: (17) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:26:28] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (UPDATE clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:26:46] !log cp5032: set param transit_buffer=1M via varnishadm [19:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:41] (03CR) 10Brennen Bearnes: [C: 04-1] "Have discussed this with Dzahn on IRC. 
It looks to me like the translation.override value is actually hardcoded in the deployment repo -" [puppet] - 10https://gerrit.wikimedia.org/r/877188 (https://phabricator.wikimedia.org/T155130) (owner: 10Aklapper) [19:27:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:31:14] (KubernetesAPILatency) resolved: (17) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:33:18] !log cp5032: set param transit_buffer=4M via varnishadm [19:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:53] (03PS1) 10Zabe: Start reading from cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877249 (https://phabricator.wikimedia.org/T233004) [19:37:18] !log cp5032: set param transit_buffer=1M via varnishadm [19:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:25] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10BBlack) We have the patched package on cp5032 (bullseye). Did some manual testing on it today: * With stock config, can still reproduce the large transient spike by running `hey` with default params against a large... 
[19:51:23] (03CR) 10ArielGlenn: query_service: Allow query hosts to rsync data from clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [19:53:29] (03CR) 10Dzahn: query_service: Allow query hosts to rsync data from clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [19:53:57] (03CR) 10Dzahn: query_service: Allow query hosts to rsync data from clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [20:04:30] (03CR) 10Bking: [C: 03+2] Update ltr plugin to 7.10.2-wmf1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/865178 (https://phabricator.wikimedia.org/T324247) (owner: 10Ebernhardson) [20:14:49] (03CR) 10BCornwall: [V: 03+1 C: 03+2] varnish: Export runtime params for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/863406 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [20:20:04] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [20:20:31] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [20:20:56] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [20:21:39] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [20:22:24] (03CR) 10BCornwall: [V: 03+1 C: 03+2] varnish: Set CORS headers on upload error pages [puppet] - 10https://gerrit.wikimedia.org/r/866641 (https://phabricator.wikimedia.org/T270526) (owner: 10BCornwall) [20:24:52] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:24:55] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for 
non-matching terms - https://phabricator.wikimedia.org/T324247 [20:25:13] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:28:24] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [20:28:24] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:29:50] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 163 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 169, active_shards: 169, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 163, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [20:29:50] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.903614457831324 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:30:27] 10SRE, 10Traffic, 10Patch-For-Review: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (10BCornwall) 05Open→03Resolved Thankfully, we have thumbor to give us a test! :D ` [~]$ curl -I 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Circuit_de_la_Sarthe_track_map.sv... 
[20:30:32] 10SRE, 10Thumbor, 10serviceops: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10BCornwall) [20:33:39] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [20:34:20] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [20:35:51] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:35:54] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [20:36:00] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 169, active_shards: 338, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [20:36:00] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:36:08] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:36:08] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:36:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) 
Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:36:33] !log deleting global usage coming from commons in commons (T322588) [20:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:36] T322588: Run `refreshGlobalimagelinks.php --pages=nonexisting` from the GlobalUsage extension - https://phabricator.wikimedia.org/T322588 [20:44:13] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:44:13] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:44:16] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [20:44:38] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:52:12] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247 [20:52:15] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [20:52:30] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - bking@cumin1001 - T324247 [20:52:40] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:17] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/877257 [20:57:25] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2026.codfw.wmnet [20:57:40] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [20:59:35] Greetings All \o/ [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T2100) [21:00:05] eigyan and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:41] o/ [21:01:00] I can deploy. [21:01:23] eigyan, you here? [21:01:37] Hello kindrobot I am here [21:02:17] OK, we'll go in order of the ticket. :) So I'll deploy eigyan's then zabe's. 
[21:02:18] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method= [21:02:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:19] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1362.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, mw1447.eqiad.wmnet, mw1471.eqiad.wmnet, mw1361.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1485.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1410.eqiad.wmnet, mw1379.eq [21:02:36] t, mw1470.eqiad.wmnet, mw1464.eqiad.wmnet, mw1490.eqiad.wmnet, mw1462.eqiad.wmnet, mw1381.eqiad.wmnet, mw1450.eqiad.wmnet, mw1482.eqiad.wmnet, mw1484.eqiad.wmnet, mw1463.eqiad.wmnet, mw1492.eqiad.wmnet, mw1377.eqiad.wmnet, mw1396.eqiad.wmnet, mw1489.eqiad.wmnet, mw1359.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1443.eqiad.wmnet, mw1444.eqiad.wmnet, mw1404.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, 
[21:02:36] eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1465.eqiad.wmnet, mw1449.eqiad.wmnet, mw1394.eqiad.wmnet, mw1402.eqiad.wmnet, mw1483.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad https://wikitech.wikimedia.org/wiki/PyBal [21:02:38] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/p [21:02:38] ary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:02:38] thank you kindrobot [21:02:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:02:42] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 2817 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:02:46] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:02:49] here [21:02:53] what's going on [21:02:56] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned 
the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [21:02:58] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:02:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1447.eqiad.wmnet, mw1471.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1485.eqiad.wmnet, mw1386.eqiad.wmnet, mw1470.eqiad.wmnet, mw1490.eqiad.wmnet, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1482.eq [21:02:58] t, mw1449.eqiad.wmnet, mw1377.eqiad.wmnet, mw1396.eqiad.wmnet, mw1489.eqiad.wmnet, mw1424.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1423.eqiad.wmnet, mw1450.eqiad.wmnet, mw1425.eqiad.wmnet, mw1444.eqiad.wmnet, mw1400.eqiad.wmnet, mw1402.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1392.eqiad.wmnet, mw1375.eqiad.wmnet, mw1376.eqiad.wmnet, mw1464.eqiad.wmnet are marked down but pooled https://wikit [21:02:58] media.org/wiki/PyBal [21:02:59] deployment? [21:03:00] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9677 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:03:08] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [21:03:11] Yup. 
[21:03:12] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:12] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:12] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:13] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read arti [21:03:13] January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [21:03:13] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [21:03:14] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out 
before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response [21:03:14] ived: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page [21:03:14] out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:03:14] Backports [21:03:16] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:16] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:16] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:16] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by 
title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:03:18] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:20] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was [21:03:20] : /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the [21:03:20] ted status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:03:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [21:03:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:03:22] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [21:03:22] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [21:03:22] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:23] kindrobot: site looks broken [21:03:26] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.3973 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:03:27] Yeah. 
[21:03:27] kindrobot: can we revert [21:03:30] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:32] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:03:32] I haven't deployed yet. [21:03:34] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:38] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:38] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method [21:03:42] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:44] PROBLEM - restbase endpoints health 
on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:03:58] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:03:58] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:00] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:00] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:06] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [21:04:06] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [21:04:08] PROBLEM - MediaWiki exceptions and 
fatals per minute for appserver on alert1001 is CRITICAL: 2714 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:04:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [21:04:30] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:41] But I'll probably postpone the backports 'til is calms down. [21:04:48] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:48] PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:48] *it [21:04:50] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:04:50] PROBLEM - PHP7 rendering on mw1485 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:54] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:54] PROBLEM - PHP7 rendering on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:54] PROBLEM - PHP7 rendering on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:54] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:58] PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:10] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} ( [21:05:10] a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:05:12] PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:12] PROBLEM - PHP7 rendering on mw1464 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:12] PROBLEM - PHP7 rendering on mw1471 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:14] PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:20] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:20] PROBLEM - PHP7 rendering on mw1470 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:20] PROBLEM - PHP7 rendering on mw1462 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:20] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:22] PROBLEM - PHP7 rendering on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:24] effie: hello, is a mc reboot possibly related ^ [21:05:24] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:26] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:28] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:30] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:30] PROBLEM - PHP7 rendering on mw1490 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:32] 
PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:34] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:36] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:36] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:42] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:42] PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:44] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:44] PROBLEM - PHP7 rendering on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:44] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:05:50] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:05:54] PROBLEM - PHP7 rendering on mw1463 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:06] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:06:07] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [21:06:08] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:10] PROBLEM - PHP7 rendering on mw1489 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:12] PROBLEM - PHP7 rendering on mw1482 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:22] PROBLEM - restbase endpoints health on 
restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:06:30] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:06:38] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:06:46] RECOVERY - PHP7 rendering on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 8.762 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:52] PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:54] RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 9.217 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:56] PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:06:58] PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:07:04] PROBLEM - PHP7 rendering on mw1483 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:07:08] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:07:10] PROBLEM - PHP7 rendering on mw1491 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:07:14] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:07:17] I'll check with SRE. [21:07:20] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:07:22] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [21:07:38] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:08:00] RECOVERY - PHP7 rendering on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 8.957 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:02] RECOVERY - PHP7 rendering on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 9.684 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:10] RECOVERY - Check systemd state on relforge1003 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:24] RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 8.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:24] PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:32] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:36] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:08:44] RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 9.581 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:50] PROBLEM - PHP7 rendering on mw1448 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:58] PROBLEM - PHP7 rendering on mw1492 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:00] PROBLEM - PHP7 rendering on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:00] PROBLEM - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:04] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title 
from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [21:09:06] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:09:06] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:06] PROBLEM - PHP7 rendering on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:06] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:09:08] PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:18] PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:30] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:09:30] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:32] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P42936 and previous config saved to /var/cache/conftool/dbconfig/20230109-210940-marostegui.json [21:09:46] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [21:10:00] PROBLEM - PHP7 rendering on mw1465 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:08] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:10:12] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:10:14] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:10:14] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:10:14] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:10:14] RECOVERY - PHP7 rendering on mw1491 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 7.494 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:16] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:10:20] RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 518 
bytes in 5.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:20] RECOVERY - PHP7 rendering on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 3.991 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:20] RECOVERY - PHP7 rendering on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 3.914 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:22] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 8.243 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:24] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 3.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:28] RECOVERY - PHP7 rendering on mw1492 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 2.826 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:28] RECOVERY - PHP7 rendering on mw1427 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 2.346 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:30] RECOVERY - PHP7 rendering on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 3.751 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:30] RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:32] RECOVERY - PHP7 rendering on mw1463 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 5.271 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:32] RECOVERY - PHP7 rendering 
on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 1.595 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:32] RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:38] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:42] RECOVERY - PHP7 rendering on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:42] RECOVERY - PHP7 rendering on mw1489 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:46] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:10:46] RECOVERY - PHP7 rendering on mw1482 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 1.625 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:46] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [21:10:54] RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:54] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.573 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:56] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 516 
bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:58] RECOVERY - PHP7 rendering on mw1485 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 1.318 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:58] RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:58] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [21:10:58] RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:11:02] RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:06] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:11:08] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [21:11:18] RECOVERY - PHP7 rendering on mw1464 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:18] RECOVERY - PHP7 rendering on mw1471 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.382 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:20] RECOVERY - PHP7 rendering on 
mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:22] RECOVERY - PHP7 rendering on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:24] RECOVERY - PHP7 rendering on mw1465 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:26] RECOVERY - PHP7 rendering on mw1470 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:26] RECOVERY - PHP7 rendering on mw1462 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:28] RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:28] RECOVERY - PHP7 rendering on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:30] RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.418 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:30] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:32] RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.721 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:34] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:36] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:11:36] RECOVERY - PHP7 rendering on mw1490 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:36] RECOVERY - PHP7 rendering on mw1483 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:36] RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:46] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:11:48] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:11:48] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [21:11:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:11:52] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.886 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:53] eigyan and zabe: SRE has advised me to delay the window while they're working on the issue. I'll keep you updated. [21:11:58] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:11:58] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:00] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:16] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:16] ok kindrobot I will hang out [21:12:16] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:34] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:38] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:38] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:40] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:12:40] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:40] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:40] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:42] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:44] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:12:46] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:12:48] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:48] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:48] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:12:48] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:48] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [21:12:50] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:13:00] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:13:00] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:06] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:06] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:14] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:13:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:13:18] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:13:19] PROBLEM - MariaDB Replica Lag: s4 
#page on db1143 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 865.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:13:20] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:20] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:30] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:13:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:13:33] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:34] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:38] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:13:42] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:13:42] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:48] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:13:56] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [21:14:08] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:14:12] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:14:12] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:14:13] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:14:13] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:14:46] (03PS1) 10Marostegui: db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/877258 [21:15:12] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:15:38] (03CR) 10Marostegui: [C: 03+2] db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/877258 (owner: 10Marostegui) [21:15:41] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [21:16:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (3) 
WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:17:48] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:49] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [21:18:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:18:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [21:18:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:18:40] !log jiji@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts mc2026.codfw.wmnet [21:19:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:21:27] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2027.codfw.wmnet [21:21:38] OK, eigyan, we've got the green light. [21:21:42] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:21:54] excellent kindrobot [21:21:59] !log starting UTC late backport window [21:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:40] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:22:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877197 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [21:23:44] (03Merged) 10jenkins-bot: [config]: Deploy GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877197 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [21:23:51] (03PS3) 10Zabe: beta: Start writing to rev_comment_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873888 (https://phabricator.wikimedia.org/T299954) [21:24:01] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:877197|[config]: Deploy GDI Safety Survey Wave 4 (T325136)]] [21:24:04] T325136: Deploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - week of January 9, 2023 - https://phabricator.wikimedia.org/T325136 [21:24:13] RECOVERY - MariaDB Replica Lag: s4 #page on db1143 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:26:21] !log kindrobot@deploy1002 kindrobot and essexigyan: Backport for [[gerrit:877197|[config]: Deploy GDI Safety Survey Wave 4 (T325136)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:26:48] eigyan: can you confirm? [21:27:14] will do, confirming now [21:27:43] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [21:29:59] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2027.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [21:31:22] kindrobot ES and FR wikis are looking good I can see my surveys. But EN, FA and PT wikis are crashing hard with a fatal error MediaWiki internal error. [21:31:23] Original exception: [a742342a-fe89-4c47-a859-f2aa7a65fb43] 2023-01-09 21:29:17: Fatal exception of type "TypeError" [21:31:23] Exception caught inside exception handler. [21:31:24] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [21:32:03] I am using server mwdebug1001.eqiad.wmnet in the debug tool kindrobot [21:33:14] Do you want to do some more debugging or shall I roll it back? [21:33:35] when you "wikis crashing hard" you mean only this survey.. 
or everything [21:34:00] eigyan ^ [21:34:02] kindrobot: I also see this new error on enwiki https://logstash.wikimedia.org/goto/e17c6d1b33b529872e1bb5d6e37aa665 [21:34:06] mutante I mean the server [21:34:16] not the survey mutante [21:34:28] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - bking@cumin1001 - T324247 [21:34:31] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [21:35:03] Thanks cdanis [21:35:14] eigyan: well, pt.wikipedia.org works for me but then let's roll it back I guess [21:35:22] I'm going to roll it back for now, OK eigyan? [21:35:36] kindrobot: it does look like all of those errors are coming from SurveyFactory [21:35:44] Let's wait for a moment please [21:35:54] could be on my end something I am doing wrong [21:36:08] Sure thing, lmk when you're ready. [21:37:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2027.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [21:37:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:37:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2027.codfw.wmnet [21:37:42] eigyan, the survey needs to be an array within the array for the wiki (I hope this formulation makes sense), I can try writing a follow-up if you like [21:37:48] Yes I see that now [21:38:09] zabe [21:38:18] can I fix this and resubmit [21:38:37] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2029.codfw.wmnet [21:39:00] Can it wait until tomorrow's window? [21:39:22] sure thing @kindrobot [21:39:40] OK, great.
I'll rollback for now, and we'll do it tomorrow :) [21:39:48] !log kindrobot@deploy1002 Sync cancelled. [21:40:11] hmmm.. sync cancelled? [21:40:37] it's still going to be the same on all appservers? [21:40:46] no, only on debug servers [21:40:53] it never reached the other appservers [21:40:55] ok [21:42:35] I haven't been in this scenario yet. Do I need to manually revert the commit to the config or is there a scap command to do it? [21:42:38] Rollback? [21:43:07] `scap revert` [21:43:11] no [21:43:20] `scap backport --revert ...` [21:43:30] ^^ that one [21:43:32] Ah, OK. Great. Thank you. :) [21:44:32] (03PS1) 10TrainBranchBot: Revert "[config]: Deploy GDI Safety Survey Wave 4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877260 [21:44:34] (03CR) 10TrainBranchBot: "kindrobot@deploy1002 created a revert of this change as Ib4a2af4b5de3f3ee7584ebc03ab26eac92c6faa3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877197 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [21:44:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877260 (owner: 10TrainBranchBot) [21:45:41] (03Merged) 10jenkins-bot: Revert "[config]: Deploy GDI Safety Survey Wave 4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877260 (owner: 10TrainBranchBot) [21:45:55] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:877260|Revert "[config]: Deploy GDI Safety Survey Wave 4"]] [21:45:59] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247 [21:46:02] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [21:46:13] dancy: I imagine it's going to ask me if I want to sync it. 
Is it safe to say "no" because it never made it out to production and then go on with the next backport? [21:46:31] Yes that is safe. [21:46:41] OK, great. Thank you. [21:47:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [21:47:43] thank you kindrobot [21:47:45] !log kindrobot@deploy1002 kindrobot and trainbranchbot: Backport for [[gerrit:877260|Revert "[config]: Deploy GDI Safety Survey Wave 4"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:47:59] !log kindrobot@deploy1002 Sync cancelled. [21:48:10] No problem eigyan, thank you. :) [21:48:22] zabe, you ready? [21:49:29] yep [21:49:43] Great, I'll start the merge now. [21:50:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873888 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:50:29] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [21:51:00] (03Merged) 10jenkins-bot: beta: Start writing to rev_comment_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873888 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:51:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:52:04] OK, it's deployed to beta zabe. Do you need anything else from me? [21:52:34] nope, thanks :) [21:52:59] !log close UTC late backport window [21:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:07] Thanks everyone for your help. 
<3 [21:54:05] thanks kindrobot [21:54:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [21:58:09] (03PS1) 10BCornwall: prometheus: Add Varnish thread percent usage rule [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) [22:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230109T2200). [22:00:36] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2029.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [22:03:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2029.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [22:03:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:03:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2029.codfw.wmnet [22:05:13] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2030.codfw.wmnet [22:11:44] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [22:15:41] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2030.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [22:16:58] (03PS1) 10Eigyan: [config]: GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) [22:23:32] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39017/console" [puppet] - 
10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [22:23:55] (03PS5) 10Samwilson: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) [22:25:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2030.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [22:25:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:25:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2030.codfw.wmnet [22:27:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:28:09] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [22:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:32:41] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247 [22:32:47] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [22:33:10] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - 
bking@cumin1001 - T324247 [22:34:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [22:36:58] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:41:45] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:41:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:45:23] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:46:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:47:40] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10phaultfinder) [22:50:35] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [22:51:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (UPDATE 
clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:52:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:56:58] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:00:07] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:55] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [23:11:49] (03PS1) 10Dzahn: admin: split system user data type into local and global [puppet] - 10https://gerrit.wikimedia.org/r/877274 [23:12:12] (03CR) 10Dzahn: [C: 03+2] admin: add data types to validate UIDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [23:13:50] (03CR) 10Dzahn: "I believe we could start with changes like this that can happen before doing the actual switch and reduce number of things that have to be" [puppet] - 10https://gerrit.wikimedia.org/r/867703 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [23:19:29] (03CR) 10MusikAnimal: [C: 03+1] Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868816 
(https://phabricator.wikimedia.org/T323033) (owner: 10Samwilson) [23:20:19] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:45] (03PS1) 10Dzahn: phabricator: use specific data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877275 [23:22:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:27:15] (03PS1) 10Dzahn: statistics: use new data type for globally reserved UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877276 [23:27:36] (03CR) 10CI reject: [V: 04-1] statistics: use new data type for globally reserved UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn) [23:29:56] (03PS1) 10Dzahn: scap: data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 [23:30:17] (03CR) 10CI reject: [V: 04-1] scap: data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn) [23:30:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:32:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:33:26] 10SRE, 10Fundraising-Backlog, 10Traffic: nginx SSL_do_handshake failed - https://phabricator.wikimedia.org/T326601 (10AnnWF) [23:34:05] 10SRE, 10Fundraising-Backlog, 10Traffic: nginx 
SSL_do_handshake failed - https://phabricator.wikimedia.org/T326601 (10AnnWF) [23:35:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:36:41] (03PS2) 10Dzahn: scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 [23:37:03] (03CR) 10CI reject: [V: 04-1] scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn) [23:39:28] (03PS2) 10Dzahn: statistics: assert new data type for globally reserved UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877276