[00:00:13] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:03:01] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:03:31] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01108 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:07:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24623 and previous config saved to /var/cache/conftool/dbconfig/20220414-000740-ladsgroup.json [00:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24624 and previous config saved to /var/cache/conftool/dbconfig/20220414-002245-ladsgroup.json [00:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24625 and previous config saved to /var/cache/conftool/dbconfig/20220414-003750-ladsgroup.json [00:37:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [00:37:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [00:37:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:38] (03PS6) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [00:49:59] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34835/console" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [00:52:22] (03PS7) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [00:55:03] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34836/console" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [01:25:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [01:25:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [01:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [02:12:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [02:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24626 and previous config saved to /var/cache/conftool/dbconfig/20220414-021214-ladsgroup.json [02:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:25:45] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [02:27:51] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [02:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:53:43] > upstream connect error or disconnect/reset before headers. reset reason: connection failure [02:53:58] page loaded after ~15s [02:54:36] forget if there's a stalkword i'm supposed to use :P [02:55:35] also seeing it [02:55:41] "upstream connect error or disconnect/reset before headers. reset reason: overflow"? [02:55:49] 2 minutes late :P [02:55:52] lol [02:55:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:55:58] Just saw that someone else reported it. [02:56:01] actually, no, slightly different messages [02:56:06] Figured I'd add my $0.02. 
[02:56:13] Tamzin: Indeed [02:56:18] (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:56:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:56:59] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:57:03] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [02:58:16] looking [02:58:58] ack [03:00:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:01:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:01:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:30] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [03:01:54] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:05:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24627 and previous config saved to /var/cache/conftool/dbconfig/20220414-030550-ladsgroup.json [03:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:20:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved 
to https://phabricator.wikimedia.org/P24628 and previous config saved to /var/cache/conftool/dbconfig/20220414-032055-ladsgroup.json [03:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:49] Tamzin, AntiComposite, Oshwah: thanks for the reports <3 sorry it was quiet in here, we were talking elsewhere -- everything should be cleared up now, are you still seeing any issues? [03:23:09] smooth sailing here in Greater eqiad [03:23:13] rzl: Oh yeah, looks good. I saw the graphs clear up about 15+ minutes ago. [03:23:14] :-) [03:23:19] yeah, all fine here [03:23:35] 👍 [03:25:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 41.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:29:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 49.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:29:17] ^ expected [03:29:17] you fixed it alarm ^ [03:29:21] haha [03:31:17] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 70.42 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:31:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 89.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:36:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24629 and previous config saved to /var/cache/conftool/dbconfig/20220414-033600-ladsgroup.json [03:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24630 and previous config saved to /var/cache/conftool/dbconfig/20220414-035105-ladsgroup.json [03:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:00:13] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:37] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:00] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:58:31] PROBLEM - Router interfaces on cr1-codfw is 
CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:59:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:04] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0600). [06:09:58] (03PS1) 10Ayounsi: wmf-netbox: refactor _get_junos_interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/780304 [06:12:44] (03PS1) 10Ayounsi: Use the new _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/780310 [06:21:43] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:39:14] (03PS1) 10Razzi: wikireplicas: repool clouddb1017-1020 following reimaging [puppet] - 10https://gerrit.wikimedia.org/r/780435 (https://phabricator.wikimedia.org/T299480) [06:41:54] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34837/console" [puppet] - 10https://gerrit.wikimedia.org/r/780435 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [06:44:08] (03CR) 10Razzi: [V: 03+1 C: 03+2] wikireplicas: repool clouddb1017-1020 following reimaging [puppet] - 10https://gerrit.wikimedia.org/r/780435 
(https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [06:45:33] (03PS2) 10Ayounsi: Use the new _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/780310 [06:48:25] (03CR) 10Ayounsi: "Full diff in drmrs: https://phabricator.wikimedia.org/P24631" [homer/public] - 10https://gerrit.wikimedia.org/r/780310 (owner: 10Ayounsi) [07:00:04] Amir1, apergos, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0700). [07:00:49] o/ looks like nothing to do [07:01:54] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:05:07] hello [07:05:12] I didn't check but gtk [07:05:36] no patches in the window, let me see about trainees [07:06:17] TheresNoTime: you around? [07:07:12] no training because no patches, TheresNoTime :-( [07:08:27] (03CR) 10Ayounsi: [C: 03+1] "lgtm." 
[dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [07:08:38] (03CR) 10Ayounsi: [C: 03+1] Add an A record for datahub.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [07:10:22] (03CR) 10Ayounsi: Add a trafficserver backend mapping rule for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [07:10:29] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:12:33] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:31] 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10akosiaris) [07:34:55] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:35:03] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:35:03] 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10akosiaris) [07:35:21] 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10akosiaris) [07:35:43] 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - 
https://phabricator.wikimedia.org/T306162 (10akosiaris) 05Open→03Stalled Stalling until T306121 is done. [07:36:22] (03PS1) 10Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 [07:52:32] (03CR) 10Volans: [C: 03+1] "Code looks sane, but only a diff with existing config can ensure it's all correct 😊" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/780304 (owner: 10Ayounsi) [07:56:21] (03CR) 10Volans: [C: 03+1] wmf-netbox: refactor _get_junos_interfaces (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/780304 (owner: 10Ayounsi) [07:56:35] (03CR) 10Volans: "Too much juniper for me, I'll leave this to Cathal. But the diff linked in the comments looks sane." [homer/public] - 10https://gerrit.wikimedia.org/r/780310 (owner: 10Ayounsi) [07:59:23] (03CR) 10Volans: [C: 03+1] "seems sane, nit inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 (owner: 10Ayounsi) [08:00:05] dancy and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0800).
[08:00:13] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:07:28] !log fleet wide update of scap to 4.6.1 - T305949 [08:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:34] T305949: Deploy Scap version 4.6.1 - https://phabricator.wikimedia.org/T305949 [08:12:49] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:01] (03PS5) 10JMeybohm: Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) [08:15:19] (03PS5) 10JMeybohm: Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) [08:24:09] (03CR) 10Jforrester: "Please remember that before merging and deploying logo changes, files must be crushed, which the tox job does automatically for PNGs but n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [08:35:22] (03CR) 10JMeybohm: [C: 03+2] Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [08:36:46] !log added ops members to deplotment group - T305729 [08:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:50] T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 [08:47:19] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10Lydia_Pintscher) Yes, we've had several occurrences over the last months where editors complained about maxlag being... [08:58:47] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Addshore) I think @Ladsgroup s point is rather than we potentially do not need to include wdqs delay in maxlag now,... [09:02:04] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10MatthewVernon) I see that's recently uploaded; did you do something unusual in uploading? I can confirm it's swift saying 401: ` mvernon@ms-fe1011:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http:... [09:07:57] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10MatthewVernon) At least some test wiki images are working, too: https://test.wikipedia.org/wiki/File:Kirchspiel,_R%C3%B6dder,_M%C3%A4usescheune_--_2014_--_2915-19.jpg [09:11:43] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10MatthewVernon) ` Apr 14 09:09:02 ms-fe1011 proxy-server: 127.0.0.1 127.0.0.1 14/Apr/2022/09/09/02 GET /v1/AUTH_mw/wikipedia-commons-local-public.88/8/88/Kirchspiel%252C_R%25C3%2 5B6dder%252C_M%25C3%25A4usesc... 
[09:12:05] (03CR) 10Vgutierrez: [C: 03+1] Add a trafficserver backend mapping rule for datahub [puppet] - 10https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [09:16:17] (03CR) 10Vgutierrez: [C: 03+1] "LGTM after https://gerrit.wikimedia.org/r/c/operations/puppet/+/779840" [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [09:17:27] (03CR) 10Btullis: [C: 03+2] Add a trafficserver backend mapping rule for datahub [puppet] - 10https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [09:23:14] (03PS1) 10Ayounsi: Prevent re-using network ports when provisioning hosts in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068) [09:23:30] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Prevent re-using network ports when provisioning hosts in Netbox - https://phabricator.wikimedia.org/T272068 (10ayounsi) a:03ayounsi [09:26:33] (03PS2) 10Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 [09:27:15] (03CR) 10jerkins-bot: [V: 04-1] Network report: alert on disabled but configured interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 (owner: 10Ayounsi) [09:28:39] (03PS3) 10Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 [09:29:00] (03CR) 10JMeybohm: [C: 03+2] Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [09:30:08] (03CR) 10Ayounsi: [C: 03+2] "Thanks" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780463 (owner: 10Ayounsi) [09:33:39] 10SRE, 10Infrastructure-Foundations, 10netbox: Evaluate Nautobot fork of 
Netbox and decide whether to use. - https://phabricator.wikimedia.org/T288515 (10cmooney) 05Open→03Resolved a:03cmooney Should have updated this previously, the discussion and decision-making moved to live meetings / irc, but I ne... [09:37:03] (03CR) 10Btullis: [C: 03+2] Add an A record for datahub.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [09:38:21] (03PS1) 10Ayounsi: Refine "test_disabled_configured" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780542 [09:42:23] (03PS5) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [09:42:56] (03CR) 10jerkins-bot: [V: 04-1] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [09:42:58] (03CR) 10MVernon: "Thanks for this - have updated the tests both to check for extra-/-removal, and that purging attempts get 403."
[puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [09:43:58] (03PS6) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [09:45:00] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:50:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 54.24 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:52:17] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:52:58] ^^ big spike on 4xx requests on esams, eqiad and eqsin: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&viewPanel=2&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=4 [09:54:58] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) datahub.wikimedia.org is now up {F35051150,width=60%} Now working on getting the datahub-gms.discovery.wmnet service up and running too. [09:58:26] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10Ladsgroup) That image seems to be hitting commons swift URL. 
Compare: https://upload.wikimedia.org/wikipedia/commons/8/88/Kirchspiel%2C_R%C3%B6dder%2C_M%C3%A4usescheune_--_2014_--_2915-19.jpg And https://up... [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1000) [10:00:40] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (10Ladsgroup) Is this a duplicate of the above ticket? [10:01:41] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (10RhinosF1) >>! In T306117#7854785, @Ladsgroup wrote: > Is this a duplicate of the above ticket? one is LDAP, one is shell [10:27:19] (03PS1) 10Btullis: Add datahub-gms to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [10:29:09] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:36] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780542 (owner: 10Ayounsi) [10:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:34:30] (03CR) 10Cathal Mooney: [C: 03+1] "lgtm!
nice work, I'm planning a few updates to this file once I can confirm the QFX support for mixed speeds / port blocks, so good examp" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068) (owner: 10Ayounsi) [10:35:24] (03PS2) 10Btullis: Add datahub-gms to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [10:40:08] (03PS1) 10Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) [10:42:53] (03PS1) 10Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) [10:44:22] (03PS2) 10Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) [10:46:51] (03PS2) 10Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) [10:56:21] (03CR) 10Vgutierrez: Add datahub-gms to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [10:58:58] (03CR) 10Cathal Mooney: [C: 03+1] "Super work! I think this will be a lot cleaner, given the automation there is no real benefit from using the 'interface-range' stuff. 
I'" [homer/public] - 10https://gerrit.wikimedia.org/r/780310 (owner: 10Ayounsi) [11:00:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:01:52] (03PS3) 10Btullis: Add datahub-gms to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [11:01:54] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:36] (03PS1) 10Zabe: icinga: migrate sync-icinga-state cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) [11:04:38] (03PS1) 10Zabe: icinga: remove absented sync-icinga-state cron [puppet] - 10https://gerrit.wikimedia.org/r/780672 (https://phabricator.wikimedia.org/T273673) [11:06:52] (03CR) 10Btullis: Add datahub-gms to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [11:08:26] (03CR) 10Cathal Mooney: [C: 03+1] "Nice work, loving the reduction in complexity! I'd agree with Volans about using the 'set' for the vlans (but tbh might not have hit on t" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/780304 (owner: 10Ayounsi) [11:08:36] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10akosiaris) > * The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignor... 
[11:13:50] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/34839/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:19:14] (03CR) 10Zabe: [V: 03+1] icinga: migrate sync-icinga-state cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:28:29] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10BTullis) >> The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored... [11:35:25] (03PS4) 10Btullis: Add datahub-gms to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [11:41:45] (03PS1) 104nn1l2: fawiki: Change wordmark & tagline for new Vector and logo for legacy Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) [11:43:33] (03CR) 10jerkins-bot: [V: 04-1] fawiki: Change wordmark & tagline for new Vector and logo for legacy Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [11:44:32] (03PS2) 104nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) [11:45:35] (03CR) 10jerkins-bot: [V: 04-1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [12:02:49] (03CR) 10Ayounsi: [C: 03+2] Refine "test_disabled_configured" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780542 (owner: 
10Ayounsi) [12:23:21] Any idea why jenkins-bot gave -2 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/780728/ ? [12:25:04] nn1l2: looks like the file you’re adding has Wikipedia uppercase but the config has wikipedia lowercase [12:25:15] so it’s not the same file path [12:25:44] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Generated Data Platform: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (10Ottomata) https://www.mediawiki.org/wiki/Readers/Structured_Data [12:28:37] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) >>! In T305358#7854870, @akosiaris wrote: >> * The monitoring: stanza can't be added as having that without lv... [12:34:05] (03PS1) 10JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - 10https://gerrit.wikimedia.org/r/780629 [12:34:23] (03PS2) 10JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - 10https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) [12:34:32] (03PS3) 10JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - 10https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) [12:36:43] (03PS3) 104nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) [12:41:09] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:24] Thanks [12:50:32] (03PS1) 10Hoo man: Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780753 [12:54:30] (03PS8) 10Cathal Mooney: Cleanup new interface 
creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [12:55:28] (03PS2) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 [12:56:19] (03PS6) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857 [12:58:58] (03CR) 10JMeybohm: "Be aware that this might get pulled again, depending of the outcome of T305358" [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [12:59:49] (03CR) 10JMeybohm: [C: 04-2] Add datahub-gms to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1300). [13:00:05] nn1l2 and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] (03PS3) 10Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) [13:00:19] (03CR) 10Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [13:00:32] I’m in a meeting, might be able to deploy in half an hour or so if nobody else is around [13:00:44] hi [13:01:05] I'm happy to go ahead with my change [13:01:38] (03PS11) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [13:01:50] (03PS5) 10Btullis: Add datahub-gms to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [13:02:05] nn1l2: I'll go ahead and start with my change and then Lucas_WMDE or I will do yours [13:02:21] (03CR) 10Btullis: Add datahub-gms to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [13:02:25] Thanks [13:02:32] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10MatthewVernon) I'm not quite sure how all the plumbing works, but the container seems to be meant to be readable: ` root@ms-fe1009:/etc/swift# swift stat wikipedia-testcommons-local-public | grep 'Read ACL'... 
[13:02:37] (03CR) 10Hoo man: [C: 03+2] Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780753 (owner: 10Hoo man) [13:03:21] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:03:43] (03Merged) 10jenkins-bot: Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780753 (owner: 10Hoo man) [13:05:45] (03CR) 10Btullis: [C: 03+2] Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [13:06:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:06:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:01] alright, I’m available now :) [13:08:22] (03CR) 10Andrew Bogott: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:08:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I didn’t see it in time but LGTM :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780753 (owner: 10Hoo man) [13:09:18] Lucas_WMDE: Something seems off [13:09:25] Way more results than before [13:09:28] 
(03PS4) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 [13:09:31] including some redirects :( [13:09:33] more results? [13:09:35] ouch [13:09:43] which wiki? [13:09:53] dewiki [13:09:58] https://de.wikipedia.org/wiki/Spezial:Nicht_verbundene_Seiten?limit=500&namespace=0 [13:10:04] compare normal vs mwdebug1001 [13:10:16] oh, is it only on mwdebug so far? [13:10:34] Yeah, I haven't synched it [13:10:45] hm, on mwdebug the list starts with the Module namespace for me o_O [13:11:02] why is it not starting with the article namespace [13:11:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:11:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:40] ah, because the reverse sorting is only in wmf.7 [13:11:46] and I don’t think we backported that to wmf.6 [13:11:47] Ah, thought so [13:11:56] but that still doesn't explain the redirects [13:12:00] no [13:12:28] how do you recognize the redirects? just click some of the module links? [13:12:57] aha, some of the article namespace ones are redirects indeed [13:12:57] I'm comparing the articles [13:13:10] grmbl [13:13:11] Yeah, and I've no idea what's wrong there [13:14:31] Lucas_WMDE: Revert and investigate later? [13:14:43] e.g. 
Ravita has page_is_redirect = 1 in the page table [13:14:47] probably yeah [13:15:11] Yeah… probably not coming through the migration script but the regular logic [13:15:28] probably [13:15:32] The pages seem to have a very recent page_touched [13:15:35] there’s no way to find out when a page prop was written right? [13:15:47] since the table has no auto_increment PK [13:15:54] No, but page_touched is an indicator… I mean, we don't set that, but link update does [13:15:59] I see [13:16:15] Ah, page_links_updated is there [13:16:18] that's even more explicit [13:16:21] nice [13:16:31] So, that's most definitely not the migration code's fault [13:16:34] anyway, let's revert [13:18:16] yeah [13:18:35] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:780753|Read from the "unexpectedUnconnectedPage" page prop]] (duration: 00m 56s) [13:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] oh, I thought it wouldn’t need a sync since it was only on mwdebug to begin with ^^ [13:19:30] Yeah, true [13:20:01] but that way we have it in the logs [13:20:14] I wonder if $title->isRedirect() in ClientParserOutputDataUpdater::setUnexpectedUnconnectedPage() needs to be a check on the $parserOutput instead [13:20:38] if the Title method reads the database, and the code runs before the database has been written, or something [13:20:43] and ok fair point [13:21:38] hm, not sure if that information is available in ParserOutput though [13:25:09] I'm done [13:25:21] should I deploy nn1l2’s change then? 
[13:26:10] IMO, yes [13:26:12] I'm available [13:26:28] ok :) [13:26:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:26:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:05] hmm, some bots died [13:27:56] hm, looks like it [13:29:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:29:21] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:29:29] I found something minor to criticize in the patch anyways ;) [13:29:43] so stashbot has a few minutes to return before I want to deploy anything [13:29:44] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:30:01] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:30:43] bd808, greg-g, hashar: can you check on stashbot? 
(maintainers according to https://admin.toolforge.org/tool/stashbot) [13:32:16] ~tools.stashbot/stashbot.log has a bunch of errors writing to twitter [13:32:34] (03CR) 104nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:32:48] though for all I know those errors might be pretty old [13:33:35] ok I think those twitter errors have been happening for at least two days [13:33:40] so that’s probably not the cause of the disconnect now [13:34:44] (03CR) 104nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:36:13] ah, stashbot seems to be back [13:37:07] !log relogging four messages that stashbot missed: 13:26 mwdebug-deploy@deploy1002 helmfile [eqiad/codfw] START/DONE helmfile.d/services/mwdebug: apply [13:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:39:57] (03CR) 104nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:41:03] (03PS4) 10Lucas Werkmeister (WMDE): fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:41:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] fawiki: Change wordmark & tagline (new Vector) and logo (legacy 
Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:42:12] (03Merged) 10jenkins-bot: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [13:42:27] (03PS8) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [13:44:09] nn1l2: the change is on mwdebug1001, can you test it? [13:44:23] (I’m not sure if logo changes can actually be tested on mwdebug alone, I might have to sync at least the PNGs already [13:44:26] ) [13:44:35] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34843/console" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [13:44:56] I was fixing that error by the other user, [13:44:57] https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C?useskin=vector looks good to me [13:44:59] but OKay [13:45:00] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:05] let me check [13:45:13] nn1l2: I think that can wait a bit, but thanks :) [13:45:20] (03CR) 10Ssingh: [V: 03+1] "PCC looks happy for: dns1001, doh1001, cloudservices1003" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [13:45:26] (I was also already grepping for where those task numbers need to be added to end up in the file ^^) [13:45:47] LGTM [13:46:13] alright, so I assume I should sync static/images/ first [13:46:19] and then… any particular order for the other three files? 
[13:46:25] probably doesn’t matter [13:46:33] (03CR) 10Ssingh: [V: 03+1] "Ready for review but let's plan to merge this on or after Tuesday." [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [13:46:34] 10SRE, 10Analytics, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Vgutierrez) p:05Triage→03Medium [13:46:34] no specific order [13:46:38] ok [13:46:47] everything was fine IMO [13:47:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:47:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:30] (03PS9) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589) [13:48:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (1/4) (duration: 00m 56s) [13:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:19] T306030: Change the logo of Farsi Wikipedia for 900K milestone - https://phabricator.wikimedia.org/T306030 [13:50:00] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo 
(legacy Vector) (T306030)]] (2/4) (duration: 00m 55s) [13:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:01] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/776232 (owner: 10PipelineBot) [13:51:05] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/776231 (owner: 10PipelineBot) [13:51:11] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778277 (owner: 10PipelineBot) [13:51:23] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778278 (owner: 10PipelineBot) [13:51:24] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (3/4) (duration: 01m 00s) [13:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (4/4) (duration: 00m 53s) [13:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] (03CR) 10Hnowlan: "LGTM, some minor notes on style in the Dockerfile." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: 10Roman Stolar) [13:53:35] !log UTC afternoon backport+config window done [13:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:20] jouncebot: next [13:55:21] In 2 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600) [13:55:47] nn1l2: if you want to fix that comment now, feel free to ping me and I can deploy it, we have two hours until the next window :) [13:56:04] Okay, will do [14:00:43] (03CR) 10Volans: [C: 03+1] "LGTM, one last question inline." [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [14:11:21] (03PS1) 10Volans: admin: add ldap-only user nathillard [puppet] - 10https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) [14:17:40] (03PS1) 104nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) [14:17:46] (03PS1) 10Ayounsi: Add script to move devices attributes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) [14:18:03] (03CR) 10Zabe: Declare new research-deployers group for airflow instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [14:18:28] (03CR) 10Ayounsi: "Tested in https://netbox-next.wikimedia.org/extras/scripts/replace_device.ReplaceDevice/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [14:19:24] (03PS2) 104nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) [14:20:37] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Move device 
attributes - https://phabricator.wikimedia.org/T259166 (10ayounsi) a:03ayounsi [14:25:57] (03PS3) 104nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) [14:27:58] please see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/780844 [14:28:13] * Lucas_WMDE looks [14:29:15] LGTM [14:29:16] (03PS4) 104nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) [14:29:17] jouncebot: nowandnext [14:29:17] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [14:29:17] In 1 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600) [14:30:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 104nn1l2) [14:30:25] (03PS5) 104nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) [14:30:37] (03CR) 10Lucas Werkmeister (WMDE): Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 104nn1l2) [14:31:23] nn1l2: are you done yet? 
:P [14:31:40] yes, I fixed the commit message, there was a typo [14:32:05] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [14:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:08] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 03s) [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 104nn1l2) [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:33:29] (03PS1) 10David Caro: DONOTMERGE: skeleteon for the replicaconfig service [puppet] - 10https://gerrit.wikimedia.org/r/780853 [14:33:31] (03Merged) 10jenkins-bot: Wikispecies: Fix logo ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 104nn1l2) [14:34:14] (03CR) 10jerkins-bot: [V: 04-1] DONOTMERGE: skeleteon for the replicaconfig service [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [14:36:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:780844|Wikispecies: Fix logo ticket numbers (T306037)]] (1/2, expected no-op) (duration: 00m 55s) [14:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:12] T306037: Optimize logo for Wikispecies - https://phabricator.wikimedia.org/T306037 [14:36:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:59] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:780844|Wikispecies: Fix logo ticket numbers (T306037)]] (2/2, expected no-op) (duration: 00m 55s) [14:38:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:53] PROBLEM - Host kubestage2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:49] PROBLEM - Host mc2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:48:23] (03PS7) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [14:49:09] RECOVERY - Host kubestage2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [14:49:14] (03CR) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) 
[14:51:51] * Lucas_WMDE experimenting on mwdebug1002 [14:52:07] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @hnowlan i am ready for restbase2021 [14:55:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet [14:55:32] (03CR) 10Btullis: [C: 03+2] Update datahub to use version 0.8.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/779898 (https://phabricator.wikimedia.org/T306019) (owner: 10Btullis) [14:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:28] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10hnowlan) >>! In T305469#7855361, @Papaul wrote: > @hnowlan i am ready for restbase2021 Go ahead! [14:56:29] Lucas_WMDE: just took a quick peek, but if what you were seeing in the stashbot logs was 'Status is a duplicate.' errors those are fairly common. Usually they are true as well (meaning that logmsgbot has sent the same irc message 2x in a row as happened in this channel at 14:38 UTC) [14:56:39] that’s what it was yeah [14:56:46] thanks for looking [14:57:33] I should fix that particular log to have timestamps too... 
[14:59:31] RECOVERY - Host mc2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [14:59:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:46] (03CR) 10Volans: [C: 03+1] "LGTM (with the caveats of my first +1 :D )" [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [15:00:17] PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:00:21] PROBLEM - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.155 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:00:31] PROBLEM - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.154 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:00:39] PROBLEM - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:00:49] PROBLEM - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:00:52] (03Merged) 10jenkins-bot: Update datahub to use version 0.8.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/779898 (https://phabricator.wikimedia.org/T306019) (owner: 10Btullis) [15:01:01] hnowlan: expected? ^^^ [15:01:12] volans: yep, oops. 
[15:01:54] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:31] PROBLEM - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:04:00] hnowlan: do i have to power it off or you will do it? [15:04:12] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:04:59] 10SRE, 10Infrastructure-Foundations, 10netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) [15:05:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:19] papaul: I will [15:05:26] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:05:28] hnowlan: ok let me know [15:05:35] * Lucas_WMDE done experimenting on mwdebug1002 (changes reset with scap pull) [15:05:44] (03CR) 10Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [15:06:25] !log powerdown ganeti2020 for relocation [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase2021.codfw.wmnet with reason: Relocation [15:06:54] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
restbase2021.codfw.wmnet with reason: Relocation [15:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:25] papaul: done [15:07:25] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:08:20] hnowlan: thanks [15:09:47] (03CR) 10Ssingh: [C: 03+1] admin: add ldap-only user nathillard [puppet] - 10https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: 10Volans) [15:09:49] PROBLEM - Host ganeti2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:53] hnowlan: that spike in api server latency in *codfw* is probably related to the restbase depoolage, right? [15:12:04] (03CR) 10Volans: [C: 03+2] admin: add ldap-only user nathillard [puppet] - 10https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: 10Volans) [15:12:10] thanks sukhe! [15:13:03] PROBLEM - Host ganeti2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:03] PROBLEM - Host restbase2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:09] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:14:19] rzl: I wouldn't expect it to cause it... 
[15:14:38] or at least it's not a regularly seen side effect of depooling a single node [15:15:39] I can't figure out a mechanism either -- but also note codfw only gets like 13 rps, so it's easy to throw the percentiles off [15:15:51] well, I guess that alert is the mean, but still [15:15:54] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:07] ah that could be it [15:16:15] it looks like mcrouter was unhappy [15:16:17] SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) [15:16:43] (CR) Ottomata: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata) [15:17:52] anyway nothing that needs to be investigated all that deeply, especially since it's codfw-only and self-healing [15:18:24] (PS1) Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [15:19:43] (CR) Zabe: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata) [15:20:21] RECOVERY - Host restbase2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [15:24:03] (CR) Ottomata: "Oof, thanks! Sorry about that removal, dunno how that thappened." [puppet] - https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: Volans) [15:24:23] (CR) Ottomata: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata) [15:27:23] hnowlan: server will be coming up soon just commiting the switch change now [15:27:35] papaul: great, thanks! 
[15:30:55] did already already on ganeti2020 the interface is only set to access [15:32:49] hnowlan: server is up [15:33:33] RECOVERY - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-c valid until 2023-11-25 11:38:52 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:33:57] RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.033 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886 [15:33:59] RECOVERY - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is OK: TCP OK - 0.037 second response time on 10.192.16.155 port 9042 https://phabricator.wikimedia.org/T93886 [15:34:01] RECOVERY - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-b valid until 2023-11-25 11:38:50 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:34:09] RECOVERY - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is OK: TCP OK - 0.033 second response time on 10.192.16.154 port 9042 https://phabricator.wikimedia.org/T93886 [15:34:13] RECOVERY - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-a valid until 2023-11-25 11:38:47 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:38:17] (03PS9) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [15:40:34] 10SRE, 10DBA, 10Trust-and-Safety, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370 (10Zabe) [15:40:38] 10SRE, 10DBA, 10Trust-and-Safety, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370 (10Zabe) 
[15:40:44] (03PS10) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [15:41:47] (03CR) 10jerkins-bot: [V: 04-1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [15:42:54] (03PS1) 10Eigyan: [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) [15:43:10] (03CR) 10Volans: "Some improvement suggestions inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [15:45:03] RECOVERY - Host ganeti2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.18 ms [15:45:49] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:46:29] RECOVERY - Host ganeti2020 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [15:49:02] (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [15:49:21] (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [15:49:35] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:51:47] (03CR) 10Krinkle: [C: 03+1] "Good to go. As always, stage on mwdebug and look out for warnings in Logstash before rolling out. 
These renames can be really sneaky at ti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [15:52:12] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) All the nodes that are not cloud are now out of Rack B1. Thanks to all helping me to de-pool the servers and power them off. [15:56:12] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10Ladsgroup) If you're asking me, I have no idea how swift ACL or swift-mediawiki relation works. Sorry. [15:57:54] (03PS11) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [15:58:04] !log krinkle@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 55s) [15:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:43] (03CR) 10Krinkle: [C: 03+1] "The structured in prod has diverged, but, I've applied effectively the same rename + alias and deployed it. This is good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:00:04] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:34] (03CR) 10Ahmon Dancy: [C: 03+1] Revert "mwdebug_deploy: switch back to using the root user" [puppet] - 10https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [16:01:00] papaul: thanks! 
[16:03:04] rzl: Can you deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/780629 for me (with the possibility of re-reverting in case it doesn't work out) [16:03:30] dancy: sure! looking [16:03:35] thx! [16:03:39] hnowlan: you welcome [16:04:00] ohh I see what we're doing, yeah [16:04:09] jayme: if you're still about, any objections? ^ [16:04:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:04:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:30] going ahead [16:07:42] (CR) RLazarus: [C: +2] Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) (owner: JMeybohm) [16:09:56] dancy: merged, and ran puppet on deploy1002 [16:10:07] Thanks. Testing now. 
[16:10:08] (03PS1) 10Majavah: site: fix cloudstore1009,1010 definitions [puppet] - 10https://gerrit.wikimedia.org/r/780889 [16:12:23] jouncebot now [16:12:23] For the next 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600) [16:13:09] !log dancy@deploy1002 Started scap: (no justification provided) [16:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:20] ^ Testing image build and deploy [16:15:44] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Andrew) I have a dentist appointment at 2PM CDT on Monday the 18th; otherwise I'm available to help with this. Please be aware that I'm largely ignorant of network topology vs. racks so will be re... [16:17:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet [16:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:53] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:20:56] (03PS2) 10Majavah: site: fix cloudstore1009,1010 definitions [puppet] - 10https://gerrit.wikimedia.org/r/780889 [16:22:07] (03PS3) 10Majavah: site: fix cloudstore1009,1010 definitions [puppet] - 10https://gerrit.wikimedia.org/r/780889 [16:27:11] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [16:27:33] (03PS1) 10MSantos: mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/780891 [16:27:46] (03PS1) 10Zabe: httpbb: move redirect tests to test_redirects.yaml [puppet] - 10https://gerrit.wikimedia.org/r/780892 [16:38:57] (03CR) 10MSantos: [C: 03+2] 
mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/780891 (owner: 10MSantos) [16:41:01] (03PS1) 10Ottomata: Actually set REQUESTS_CA_BUNDLE [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) [16:45:08] (03Merged) 10jenkins-bot: mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/780891 (owner: 10MSantos) [16:46:07] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:52] dancy: ^ what do you think? [16:47:07] Taking a look.. probably permissions issues. [16:47:27] ah yeah, `error: insufficient permission for adding an object to repository database .git/objects` [16:48:09] `chown -R mwbuilder: /etc/helmfile-defaults/mediawiki/release` should take care of that [16:48:49] and `rm /var/lib/deploy-mwdebug/error` afterward [16:48:51] do you know where it is in puppet? [16:49:20] checking.. [16:49:31] !log depooling & disabling puppet on cp2027 for some manual testing T303534 [16:49:34] ah found it, modules/profile/manifests/kubernetes/deployment_server/mediawiki/release.pp [16:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:08] Prior to the revert that you merged today, the commands were run as root, so some of the files in the .git directory are owned by root.. and some not. [16:51:24] ahh okay yeah [16:51:54] I was just digging to see if puppet has an idea of who ought to own the recursive contents of that directory, I didn't want to be fighting it back and forth [16:52:08] chown -R sounds good in that case, running [16:52:21] Nod. It would be nice if something magical happened to keep things in order. 
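Editor's note: the mixed ownership described above (some entries under `.git` owned by root, others not) could be audited before running `chown -R`. The following is a minimal sketch, not anything used in the channel. The function name `find_misowned` is hypothetical, and it assumes the caller supplies the expected numeric uid.

```python
import os
from typing import Iterator


def find_misowned(root: str, expected_uid: int) -> Iterator[str]:
    """Yield every path under root (root included) not owned by expected_uid.

    Hypothetical helper for illustration only.
    """
    if os.lstat(root).st_uid != expected_uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat checks a symlink itself rather than the file it points to.
            if os.lstat(path).st_uid != expected_uid:
                yield path
```

On the deploy host the uid could be looked up with `pwd.getpwnam("mwbuilder").pw_uid` and the function pointed at `/etc/helmfile-defaults/mediawiki/release`. An empty result would suggest no `chown` is needed.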
[16:53:19] (03PS1) 10JHathaway: smart_data_dump: silence log output when running tests [puppet] - 10https://gerrit.wikimedia.org/r/780902 [16:53:38] I think adding recurse => true to that directory would do the right thing, but I'm not confident about unwanted side effects, and anyway we don't expect to be flipping this all that often [16:53:54] !log rzl@deploy1002:~$ sudo chown -R mwbuilder: /etc/helmfile-defaults/mediawiki/release [16:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:23] deleted the error file, another run should be coming up shortly [16:54:30] ok. Waiting eagerly. [16:54:59] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/780902 (owner: 10JHathaway) [16:55:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:06] success \o/ [16:56:45] hm, except it looks like the 16:39:54 attempt didn't succeed and now we're at "nothing to deploy" -- does this need a --force run? [16:57:47] hmmm [17:00:15] I'll trigger that. [17:00:29] !log dancy@deploy1002 Started scap: (no justification provided) [17:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:50] alright, something should happen in 2 minutes [17:02:55] sweet [17:04:43] (03CR) 10Cwhite: [C: 03+1] "Tests pass CI and is not a functional change to SDD. LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/780902 (owner: 10JHathaway) [17:05:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:05:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:49] hmm, different errors at least [17:07:14] nod. 
The one about not being able to download wmf-stable/mediawiki is new. Not sure what that's about. [17:07:20] !log bking@deploy1002 Started deploy [wdqs/wdqs@76ee675]: WDQS: Allow federated queries with Publication Office and European Commission [17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:56] (03CR) 10JHathaway: [C: 03+2] smart_data_dump: silence log output when running tests [puppet] - 10https://gerrit.wikimedia.org/r/780902 (owner: 10JHathaway) [17:11:09] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:33] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:06] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:12:07] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] (03PS1) 10Btullis: Update the container images used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/780906 (https://phabricator.wikimedia.org/T306019) [17:12:17] I'm not sure if `WARNING: Kubernetes configuration file is group-readable. This is insecure.` is turning from a warning into an error downstream but I wouldn't be shocked [17:12:28] It does appear that way. 
[17:12:47] (03CR) 10Andrew Bogott: [C: 03+2] openstack: remove enc api from puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [17:12:51] (03PS1) 10JHathaway: smart_data_dump: Use lsblk's json output [puppet] - 10https://gerrit.wikimedia.org/r/780907 [17:13:37] (03CR) 10JHathaway: "Kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/780907 (owner: 10JHathaway) [17:14:01] !log bking@deploy1002 Finished deploy [wdqs/wdqs@76ee675]: WDQS: Allow federated queries with Publication Office and European Commission (duration: 06m 41s) [17:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:40] rzl: Is it feasible to set ownership of /etc/kubernetes/mwdebug-deploy-eqiad.config to `mwbuilder` and mode 0600? [17:17:02] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) (owner: 10Eigyan) [17:18:19] digging into that now -- it seems like most of /etc/kubernetes/*.config is mwdeploy:deployment and 0640 though, so I'm still digging into what's going on there [17:19:21] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [17:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:07] rzl: The mwbuilder account can sudo to mwdeploy, so I'll try that approach first., [17:20:31] !log mwbuilder@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:20:32] !log mwbuilder@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:23] I think `Error: unknown 
command "diff" for "helm"` may be the real problem [17:21:40] s/the/a/ [17:22:50] (03CR) 10Btullis: [C: 03+2] Update the container images used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/780906 (https://phabricator.wikimedia.org/T306019) (owner: 10Btullis) [17:23:17] The helm-diff package is installed. [18:14:47] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.7 refs T305213 [18:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:52] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [18:17:21] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.5 (duration: 02m 10s) [18:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:52] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) Hi all - thanks to @jcrespo and @Dzahn for your help. Just now I tested the apps listed here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#wmf_group , with the following re... 
[18:19:10] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.3 (duration: 01m 48s) [18:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:46] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.4 (duration: 01m 35s) [18:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:21:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:12] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.2 (duration: 01m 25s) [18:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:17] !log dancy@deploy1002 Pruned MediaWiki: 1.37.0-wmf.1 (duration: 01m 04s) [18:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:04] Done cleaning cruft. 
[18:26:03] RECOVERY - Disk space on mw2289 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops [18:26:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:26:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:18] RECOVERY - Disk space on mw2276 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:15:26] PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100% [19:20:31] (03PS1) 10JHathaway: smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 
(https://phabricator.wikimedia.org/T294564) [19:24:34] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [19:24:52] RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:25:00] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:21] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) [19:25:31] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) a:03razzi [19:27:20] PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:14] !log cdanis@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=ats-be,name=cp2027.* [19:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:21] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [19:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:13] ACKNOWLEDGEMENT - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T306215 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:30:18] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T306215 (10ops-monitoring-bot) [19:33:07] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/780892 (owner: 10Zabe) [19:35:38] !log cdanis@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=ats-be,name=cp2027.* [19:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:48] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T306215 (10wiki_willy) a:03Cmjohnson [19:38:48] PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:54] RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [19:42:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1068.eqiad.wmnet with OS stretch [19:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch [19:43:12] (03PS1) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 [19:43:34] RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [19:44:26] (03CR) 104nn1l2: fawiki: Change logo for 900K milestone (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [19:44:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS stretch [19:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:34] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host 
ms-be1069.eqiad.wmnet with OS stretch [19:44:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1070.eqiad.wmnet with OS stretch [19:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:43] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch [19:44:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS stretch [19:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:53] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch [19:50:47] (03CR) 10jerkins-bot: [V: 04-1] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson) [19:52:41] 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) [19:53:00] (03PS1) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013 [19:53:15] (03PS2) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 [19:53:26] (03PS2) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013 [19:53:46] (03PS3) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 [19:54:21] (03PS3) 10Dzahn: 
icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013 [19:55:04] (03PS3) 10Dzahn: webperf: migrate warm_up_coal_cache cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [19:56:41] (03CR) 10Zabe: "Maybe I missunderstand this patch. But I actually can login at icinga.wikimedia.org (or is this for a different page?)." [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn) [19:57:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [19:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [19:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [19:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:59] (03CR) 10Dzahn: icinga: don't claim wmf or nda group gets you a login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn) [20:00:04] brennen: May I have your attention please! UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T2000) [20:00:04] zabe and eigyan: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:06] (03Abandoned) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn) [20:00:12] o/ [20:00:21] PROBLEM - Number of messages locally queued by purged for processing on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [20:00:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [20:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:55] hey zabe [20:01:06] hi [20:01:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [20:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:19] * urbanecm waves to everyone [20:02:20] (03CR) 10Dzahn: [C: 03+2] "not entirely sure if the bash loop will be ok in the command string but easiest is to test it" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:02:39] (03CR) 10jerkins-bot: [V: 04-1] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson) [20:03:11] zabe: i'll roll out that first one [20:03:20] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.* [20:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:30] ok [20:03:39] greetings all [20:03:50] (03CR) 10Brennen Bearnes: [C: 03+2] Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:03:54] !log cmjohnson@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [20:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:16] hi eigyan [20:04:47] Hello guys! [20:04:55] (03Merged) 10jenkins-bot: Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:05:18] eigyan: your patch looks like it's beta-only, is that correct? [20:05:33] that is correct thcipriani [20:05:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [20:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:58] (03CR) 10Dzahn: "[webperf1001:~] $ sudo systemctl status warm_up_coal_cache.service" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:06:04] eigyan: cool, thanks for verifying, we'll get it merged shortly :) [20:06:17] RECOVERY - Number of messages locally queued by purged for processing on cp2027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [20:06:20] perfect thank you thcipriani [20:06:48] (03CR) 10Dzahn: "looks good on webperf1001. I did not do anything on deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud though" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:06:50] zabe: on mwdebug1002, looks like en wikipedia loads, assume that's basically all there is to check? 
[20:07:04] brennen, yep [20:07:08] cool, syncing [20:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:07:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] (03CR) 10Dzahn: "good to go.. IF ... puppet ran on deployment-prep host" [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:08:26] (03CR) 10Brennen Bearnes: [C: 03+2] [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) (owner: 10Eigyan) [20:08:29] !log brennen@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:776259|Stop writing to $wmfUdp2logDest (T45956)]] (duration: 00m 48s) [20:08:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [20:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:34] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:05] (03Merged) 10jenkins-bot: [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 
(https://phabricator.wikimedia.org/T303963) (owner: 10Eigyan) [20:09:10] eigyan: going ahead with yours real quick and then i'll hand off to cjming for the rest of zabe's stuff [20:09:36] thanks brennen [20:10:05] !log gitlab - pausing and then deleting runner-1015, creating new bullseye runner-1026 instance to replace it [20:10:07] eigyan: now that this is merged, this will go live on the next run of: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ [20:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:16] but we'll sync for completeness, too :) [20:10:40] excellent thank you thcipriani [20:11:20] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:780881|[wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA (T303963)]] (duration: 00m 48s) [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:24] T303963: Undeploy Safety Survey for EN, ES wikis from BETA - https://phabricator.wikimedia.org/T303963 [20:12:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:12:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:51] (03CR) 10Zabe: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901
(https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:12:52] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:780881|[wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA (T303963)]] (duration: 00m 49s) [20:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:31] ^ first of those was a no-op, forgot a rebase. [20:13:40] (i suppose really they're both no-ops, effectively.) [20:14:22] (03PS3) 10Clare Ming: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:14:34] (03CR) 10Clare Ming: [C: 03+2] Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:14:55] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [20:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:02] Do not forget me [20:15:27] Juan_90264: don't worry, you're on our radar :D [20:15:38] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.* [20:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:42] (03Merged) 10jenkins-bot: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:15:49] Okay [20:16:20] (03PS4) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 [20:16:31] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:16:44] thcipriani brennen validated on my end [20:16:58] thank you all [20:17:12] zabe: your 2nd patch is up on 
mwdebug1001 [20:17:27] cjming many thanks [20:17:30] (03PS3) 10Zabe: webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) [20:17:41] thanks eigyan :) [20:18:20] cjming, it's a doc-only change, I can't test anything [20:18:35] ok - syncing then [20:19:37] !log cjming@deploy1002 Synchronized private/readme.php: Config: [[gerrit:768259|Write the same value to wmgSwiftConfig as to wmfSwiftConfig (T45956)]] (duration: 00m 48s) [20:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:43] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:20:04] (03PS2) 10Thcipriani: Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:20:48] (03CR) 10Thcipriani: [C: 03+2] Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:21:16] (03CR) 10Zabe: webperf: remove absented warm_up_coal_cache cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:22:04] (03Merged) 10jenkins-bot: Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:22:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] (03CR) 10Dave Pifke: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:26:03] !log thcipriani@deploy1002 Synchronized wmf-config/filebackend.php: Config: [[gerrit:779856|Migrate $wmfSwiftConfig to $wmgSwiftConfig (T45956)]] (duration: 00m 49s) [20:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:07] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:26:15] ^ zabe there's your last one [20:26:23] thanks :) [20:26:26] Juan_90264: could you rebase your patch for me? I'm having some trouble [20:26:35] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [20:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:02] (03CR) 10Dzahn: "it's NOT showing that warning in prod. though:" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:27:28] (03CR) 10Dzahn: "is it because webperf1001 is still stretch?" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:27:32] Yes I can rebase [20:27:35] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.* [20:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10dpifke) systemd in bullseye (and up?) 
appears to object if `User=nobody`, see: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969... [20:27:50] (03CR) 10Juan90264: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:28:01] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.23 ms [20:28:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:28:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:57] (03CR) 10Dave Pifke: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:29:50] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) p:05Triage→03Medium [20:30:03] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) [20:31:03] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) [20:31:46] thcipriani: I'm not able to rebase, this error appears: "Could not perform action: The change could not be rebased due to a conflict during merge."
[20:32:04] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) This was detailed on the procurement task, and I've migrated the testing to this onsite related task. [20:32:25] (03PS6) 10Juan90264: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:32:46] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [20:32:47] (03CR) 10jerkins-bot: [V: 04-1] Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:05] you need to do a manual rebase [20:33:39] https://www.mediawiki.org/wiki/Gerrit/Advanced_usage#Manually_rebase_(on_a_branch) [20:34:29] Juan_90264: git review -d 774834 ; git rebase -i origin/master ; (errors show up); git rebase --continue (it tells you where the error is).. manually look at the file and look for lines with "<<<". save, git add.. git rebase --continue until errors are gone, git commit --amend, git review [20:34:51] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.* [20:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:30] (03PS7) 10Thcipriani: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:37:53] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:38:21] ^ Juan_90264 does the patch look correct to you now? I rebased. 
[20:38:54] if so can you +1? [20:39:47] (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:40:10] thcipriani: That's better. [20:40:17] cool :) [20:40:32] (03CR) 10Thcipriani: [C: 03+2] Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:40:38] I've never seen this error before [20:41:18] (03Merged) 10jenkins-bot: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:41:19] it happens when something has merged since you made your change that conflicts with your change in a way git can't automatically resolve: requires a human [20:42:14] Juan_90264: live on mwdebug1002, check please :) [20:42:39] I'll check [20:46:15] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [20:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:39] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10jhathaway) I think having a syntax validity check would be a great first start. I think using yamllint, a ruby script or a short python script would work well: ` $ yamllint -d... 
[20:48:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:51] thcipriani: I tested and approved [20:49:28] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.* [20:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:32] Juan_90264: cool, going live [20:51:00] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774834|Add extendedconfirmed user group for testwiki (T302860)]] (duration: 01m 04s) [20:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:03] T302860: Consider turning on extendedconfirmed user group for testwiki - https://phabricator.wikimedia.org/T302860 [20:51:10] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot verify NTP status asw1-b12-drmrs - https://phabricator.wikimedia.org/T305840 (10cmooney) I've opened a case with Juniper, let's see what they say. [20:51:12] ^ Juan_90264 should be live now [20:53:19] Change is already working, thanks for deploying thcipriani! 
[20:53:36] thanks for the change Juan_90264 :) [20:53:39] 10SRE, 10conftool, 10Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10CDanis) @RLazarus were you going to work on the rest of this? We still need more plumbing inside requestctl correct? [20:56:27] (03CR) 10Ebernhardson: "might need more work, or at least more tests. Trying to understand how this works with multiple clusters on each host." [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson) [20:58:48] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049 [21:00:13] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049 (owner: 10Ahmon Dancy) [21:00:52] 10SRE, 10Traffic: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis) [21:01:25] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049 (owner: 10Ahmon Dancy) [21:01:51] (03PS1) 10Zabe: osm: migrate import_waterlines cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673) [21:01:53] (03PS1) 10Zabe: osm: remove absented import_waterlines cron [puppet] - 10https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) [21:02:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jhathaway) >>! In T211750#7853874, @Volans wrote: > Although there are no doubt that an automatic formatter is of great help, there are also... 
[21:06:54] !log cdanis@cumin1001 conftool action : set/weight=100; selector: dc=codfw,service=ats-be,name=cp2027.* [21:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:00] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.* [21:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:49] !log enabled puppet on cp2027, restarted ats-be, & repooled after some manual testing T303534 [21:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) So I've been able to check the options here on the QFX5120 platform. It is **not** possible to mix 10G and 25G SFP modules in the... [21:10:22] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10cmooney) 05Open→03Resolved Thanks for the help on this one @Jclark-ctr. All done with the testing you can remove those cables and leave them with our others. thanks!
[21:12:15] (03CR) 10Cathal Mooney: [C: 03+2] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [21:13:32] (03PS1) 10Zabe: wikitech: migrate mw-xml cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781053 (https://phabricator.wikimedia.org/T273673) [21:13:34] (03PS1) 10Zabe: wikitech: remove absented mw-xml cron [puppet] - 10https://gerrit.wikimedia.org/r/781054 (https://phabricator.wikimedia.org/T273673) [21:14:08] (03Merged) 10jenkins-bot: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [21:15:55] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223) [21:17:31] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223) (owner: 10CDanis) [21:19:45] !log Updated netbox-extras / interface_automation script for Netbox to add logic to rename interfaces (CR769729) [21:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10wiki_willy) a:03Jclark-ctr [21:20:46] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34846/" [puppet] - 10https://gerrit.wikimedia.org/r/781053 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:26:44] 10SRE, 10conftool, 10Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10RLazarus) a:03RLazarus Yeah -- I can do the implementation but I'm not sure if we've settled on what we want it 
to look like. I don't have a strong opini... [21:29:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) Actually I should clarify, it *may* be possible to use the channel-speed syntax to configure the switch in blocks of 2, it allows... [21:36:06] (03PS1) 10Zabe: Stop writing to $wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) [21:38:02] (03CR) 10Dzahn: [C: 03+2] webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:38:17] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10cmooney) 05Open→03Resolved Closing ticket, duplicate. Results detailed in https://phabricator.wikimedia.org/T303529#7856797 [21:38:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) [21:38:25] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:39:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10cmooney) 05Resolved→03Open Actually I spoke too soon, there is one other combination I want to check. @Jclark-ctr could you move the 10G cable in port xe-0/0/1 to port xe-0/0/2...
[21:58:53] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) a:03Zabe [21:59:35] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/1286/" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:01:04] !log gitlab deleting runner-1026, creating runner-1027 [22:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:24] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) [22:11:31] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:12:04] (03PS1) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 [22:15:44] (03PS2) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 [22:15:49] log gitlab deleting runner-1017, creating runner-1028 [22:15:53] !log gitlab deleting runner-1017, creating runner-1028 [22:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:03] (03CR) 10Ahmon Dancy: "This is a subset of changes that have been used in the train-dev branch for a long time. 
Dealing with merge conflicts when pulling from m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [22:17:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [22:19:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:24:38] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (10maryyang) [22:28:01] !log gitlab - deleting runner-1018, runner-1019, creating runner-1029, runner-1030 T297659 [22:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:06] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:40:23] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) >>! In T211750#7856763, @jhathaway wrote: > Thanks for the detailed write up of all the issues. It would be great at some point to c... [22:44:24] got a few reports of 503s [22:44:34] I didn't see them, seems to have recovered now [22:50:31] (03PS5) 10Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) [22:57:10] AntiComposite: i saw them, but everything seems work fine on my end now. 
[23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:09:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 31.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:10:19] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 33.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:10:25] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 16.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:10:25] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 37.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:10:47] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 17.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:12:25] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 73.78 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:12:31] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:12:33] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:12:35] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:12:40] (expected) [23:12:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:14:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 101.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:26:09] (PS1) Andrew Bogott: cloudbackup2002: Fix up lvm params for bullseye [puppet] - https://gerrit.wikimedia.org/r/781064 [23:26:54] (CR) Andrew Bogott: [C: +2] cloudbackup2002: Fix up lvm params for bullseye [puppet] - https://gerrit.wikimedia.org/r/781064 (owner: Andrew Bogott) [23:30:30] (PS1) Andrew Bogott: cloudbackup2002: Fix up lvm params for bullseye [puppet] - https://gerrit.wikimedia.org/r/781070 [23:31:06] (CR) Andrew Bogott: [C: +2] cloudbackup2002: Fix up lvm params for bullseye [puppet] - https://gerrit.wikimedia.org/r/781070 (owner: Andrew Bogott) [23:33:15] (PS1) Andrew Bogott: cloudbackup2002: further attempt to keep lvm happy [puppet] - https://gerrit.wikimedia.org/r/781077
[23:34:37] (CR) Andrew Bogott: [C: +2] cloudbackup2002: further attempt to keep lvm happy [puppet] - https://gerrit.wikimedia.org/r/781077 (owner: Andrew Bogott)