[00:00:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:03:01] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:03:31] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01108 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:07:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24623 and previous config saved to /var/cache/conftool/dbconfig/20220414-000740-ladsgroup.json
[00:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24624 and previous config saved to /var/cache/conftool/dbconfig/20220414-002245-ladsgroup.json
[00:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24625 and previous config saved to /var/cache/conftool/dbconfig/20220414-003750-ladsgroup.json
[00:37:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[00:37:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[00:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:55] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[00:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:38] <wikibugs>	 (PS6) Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - https://gerrit.wikimedia.org/r/779936
[00:49:59] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34835/console" [puppet] - https://gerrit.wikimedia.org/r/779936 (owner: Ssingh)
[00:52:22] <wikibugs>	 (PS7) Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - https://gerrit.wikimedia.org/r/779936
[00:55:03] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34836/console" [puppet] - https://gerrit.wikimedia.org/r/779936 (owner: Ssingh)
[01:25:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:25:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[02:12:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[02:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24626 and previous config saved to /var/cache/conftool/dbconfig/20220414-021214-ladsgroup.json
[02:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:18] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[02:25:45] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[02:27:51] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[02:32:54] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:53:43] <Tamzin>	 > upstream connect error or disconnect/reset before headers. reset reason: connection failure
[02:53:58] <Tamzin>	 page loaded after ~15s
[02:54:36] <Tamzin>	 forget if there's a stalkword i'm supposed to use :P
[02:55:35] <AntiComposite>	 also seeing it
[02:55:41] <Oshwah>	 "upstream connect error or disconnect/reset before headers. reset reason: overflow"?
[02:55:49] <Tamzin>	 2 minutes late :P
[02:55:52] <Oshwah>	 lol
[02:55:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:55:58] <Oshwah>	 Just saw that someone else reported it.
[02:56:01] <Tamzin>	 actually, no, slightly different messages
[02:56:06] <Oshwah>	 Figured I'd add my $0.02.
[02:56:13] <Oshwah>	 Tamzin: Indeed
[02:56:18] <jinxer-wm>	 (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:56:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:56:59] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:57:03] <icinga-wm>	 PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f
[02:58:16] <rzl>	 looking
[02:58:58] <jhathaway>	 ack
[03:00:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:01:11] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:01:18] <jinxer-wm>	 (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:30] <icinga-wm>	 RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f
[03:01:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:05:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24627 and previous config saved to /var/cache/conftool/dbconfig/20220414-030550-ladsgroup.json
[03:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:56] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[03:20:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24628 and previous config saved to /var/cache/conftool/dbconfig/20220414-032055-ladsgroup.json
[03:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:49] <rzl>	 Tamzin, AntiComposite, Oshwah: thanks for the reports <3 sorry it was quiet in here, we were talking elsewhere -- everything should be cleared up now, are you still seeing any issues?
[03:23:09] <Tamzin>	 smooth sailing here in Greater eqiad
[03:23:13] <Oshwah>	 rzl: Oh yeah, looks good. I saw the graphs clear up about 15+ minutes ago.
[03:23:14] <Oshwah>	 :-)
[03:23:19] <AntiComposite>	 yeah, all fine here
[03:23:35] <rzl>	 👍
[03:25:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:05] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 41.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:29:15] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 49.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:29:17] <rzl>	 ^ expected
[03:29:17] <AntiComposite>	 you fixed it alarm ^
[03:29:21] <rzl>	 haha
[03:31:17] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 70.42 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:31:27] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 89.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:36:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24629 and previous config saved to /var/cache/conftool/dbconfig/20220414-033600-ladsgroup.json
[03:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24630 and previous config saved to /var/cache/conftool/dbconfig/20220414-035105-ladsgroup.json
[03:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:10] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[04:00:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:37] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:45:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:58:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:59:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0600).
[06:09:58] <wikibugs>	 (PS1) Ayounsi: wmf-netbox: refactor _get_junos_interfaces [software/homer/deploy] - https://gerrit.wikimedia.org/r/780304
[06:12:44] <wikibugs>	 (PS1) Ayounsi: Use the new _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/780310
[06:21:43] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:32:54] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:39:14] <wikibugs>	 (PS1) Razzi: wikireplicas: repool clouddb1017-1020 following reimaging [puppet] - https://gerrit.wikimedia.org/r/780435 (https://phabricator.wikimedia.org/T299480)
[06:41:54] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34837/console" [puppet] - https://gerrit.wikimedia.org/r/780435 (https://phabricator.wikimedia.org/T299480) (owner: Razzi)
[06:44:08] <wikibugs>	 (CR) Razzi: [V: +1 C: +2] wikireplicas: repool clouddb1017-1020 following reimaging [puppet] - https://gerrit.wikimedia.org/r/780435 (https://phabricator.wikimedia.org/T299480) (owner: Razzi)
[06:45:33] <wikibugs>	 (PS2) Ayounsi: Use the new _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/780310
[06:48:25] <wikibugs>	 (CR) Ayounsi: "Full diff in drmrs: https://phabricator.wikimedia.org/P24631" [homer/public] - https://gerrit.wikimedia.org/r/780310 (owner: Ayounsi)
[07:00:04] <jouncebot>	 Amir1, apergos, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0700).
[07:00:49] <taavi>	 o/ looks like nothing to do
[07:01:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:05:07] <apergos>	 hello
[07:05:12] <apergos>	 I didn't check but gtk
[07:05:36] <apergos>	 no patches in the window, let me see about trainees
[07:06:17] <apergos>	 TheresNoTime: you around?
[07:07:12] <apergos>	 no training because no patches, TheresNoTime :-(
[07:08:27] <wikibugs>	 (CR) Ayounsi: [C: +1] "lgtm." [dns] - https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[07:08:38] <wikibugs>	 (CR) Ayounsi: [C: +1] Add an A record for datahub.wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[07:10:22] <wikibugs>	 (CR) Ayounsi: Add a trafficserver backend mapping rule for datahub (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[07:10:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:12:33] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:31] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (akosiaris)
[07:34:55] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:35:03] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (akosiaris)
[07:35:21] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (akosiaris)
[07:35:43] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (akosiaris) Open→Stalled Stalling until T306121 is done.
[07:36:22] <wikibugs>	 (PS1) Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463
[07:52:32] <wikibugs>	 (CR) Volans: [C: +1] "Code looks sane, but only a diff with existing config can ensure it's all correct 😊" [software/homer/deploy] - https://gerrit.wikimedia.org/r/780304 (owner: Ayounsi)
[07:56:21] <wikibugs>	 (CR) Volans: [C: +1] wmf-netbox: refactor _get_junos_interfaces (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/780304 (owner: Ayounsi)
[07:56:35] <wikibugs>	 (CR) Volans: "Too much juniper for me, I'll leave this to Cathal. But the diff linked in the comments looks sane." [homer/public] - https://gerrit.wikimedia.org/r/780310 (owner: Ayounsi)
[07:59:23] <wikibugs>	 (CR) Volans: [C: +1] "seems sane, nit inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463 (owner: Ayounsi)
[08:00:05] <jouncebot>	 dancy and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0800).
[08:00:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:07:28] <jayme>	 !log fleet wide update of scap to 4.6.1 - T305949
[08:07:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:34] <stashbot>	 T305949: Deploy Scap version 4.6.1 - https://phabricator.wikimedia.org/T305949
[08:12:49] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:01] <wikibugs>	 (PS5) JMeybohm: Add all members of the ops group to the deployment group [puppet] - https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729)
[08:15:19] <wikibugs>	 (PS5) JMeybohm: Switch default group for Kubernetes credentials files to deployment [puppet] - https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729)
[08:24:09] <wikibugs>	 (CR) Jforrester: "Please remember that before merging and deploying logo changes, files must be crushed, which the tox job does automatically for PNGs but n" [mediawiki-config] - https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[08:35:22] <wikibugs>	 (CR) JMeybohm: [C: +2] Add all members of the ops group to the deployment group [puppet] - https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: JMeybohm)
[08:36:46] <jayme>	 !log added ops members to deployment group - T305729
[08:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:50] <stashbot>	 T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729
[08:47:19] <wikibugs>	 SRE-OnFire, Wikidata, wdwb-tech, Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (Lydia_Pintscher) Yes, we've had several occurrences over the last months where editors complained about maxlag being...
[08:58:47] <wikibugs>	 SRE-OnFire, Wikidata, wdwb-tech, Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (Addshore) I think @Ladsgroup s point is rather than we potentially do not need to include wdqs delay in maxlag now,...
[09:02:04] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (MatthewVernon) I see that's recently uploaded; did you do something unusual in uploading?  I can confirm it's swift saying 401: ` mvernon@ms-fe1011:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http:...
[09:07:57] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (MatthewVernon) At least some test wiki images are working, too: https://test.wikipedia.org/wiki/File:Kirchspiel,_R%C3%B6dder,_M%C3%A4usescheune_--_2014_--_2915-19.jpg
[09:11:43] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (MatthewVernon) ` Apr 14 09:09:02 ms-fe1011 proxy-server: 127.0.0.1 127.0.0.1 14/Apr/2022/09/09/02  GET /v1/AUTH_mw/wikipedia-commons-local-public.88/8/88/Kirchspiel%252C_R%25C3%25B6dder%252C_M%25C3%25A4usesc...
[09:12:05] <wikibugs>	 (CR) Vgutierrez: [C: +1] Add a trafficserver backend mapping rule for datahub [puppet] - https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[09:16:17] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM after https://gerrit.wikimedia.org/r/c/operations/puppet/+/779840" [dns] - https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[09:17:27] <wikibugs>	 (CR) Btullis: [C: +2] Add a trafficserver backend mapping rule for datahub [puppet] - https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[09:23:14] <wikibugs>	 (PS1) Ayounsi: Prevent re-using network ports when provisioning hosts in Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068)
[09:23:30] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Prevent re-using network ports when provisioning hosts in Netbox - https://phabricator.wikimedia.org/T272068 (ayounsi) a:ayounsi
[09:26:33] <wikibugs>	 (PS2) Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463
[09:27:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] Network report: alert on disabled but configured interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463 (owner: Ayounsi)
[09:28:39] <wikibugs>	 (PS3) Ayounsi: Network report: alert on disabled but configured interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463
[09:29:00] <wikibugs>	 (CR) JMeybohm: [C: +2] Switch default group for Kubernetes credentials files to deployment [puppet] - https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: JMeybohm)
[09:30:08] <wikibugs>	 (CR) Ayounsi: [C: +2] "Thanks" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780463 (owner: Ayounsi)
[09:33:39] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: Evaluate Nautobot fork of Netbox and decide whether to use. - https://phabricator.wikimedia.org/T288515 (cmooney) Open→Resolved a:cmooney Should have updated this previously, the discussion and decision-making moved to live meetings / irc, but I ne...
[09:37:03] <wikibugs>	 (CR) Btullis: [C: +2] Add an A record for datahub.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[09:38:21] <wikibugs>	 (PS1) Ayounsi: Refine "test_disabled_configured" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780542
[09:42:23] <wikibugs>	 (PS5) MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942)
[09:42:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: MVernon)
[09:42:58] <wikibugs>	 (CR) MVernon: "Thanks for this - have updated the tests both to check for extra-/-removal, and that purging attempts get 403." [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: MVernon)
[09:43:58] <wikibugs>	 (PS6) MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942)
[09:45:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:50:05] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 54.24 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:52:17] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:52:58] <vgutierrez>	 ^^ big spike on 4xx requests  on esams, eqiad and eqsin: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&viewPanel=2&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=4
[09:54:58] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) datahub.wikimedia.org is now up {F35051150,width=60%} Now working on getting the datahub-gms.discovery.wmnet service up and running too.
[09:58:26] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (Ladsgroup) That image seems to be hitting commons swift URL. Compare: https://upload.wikimedia.org/wikipedia/commons/8/88/Kirchspiel%2C_R%C3%B6dder%2C_M%C3%A4usescheune_--_2014_--_2915-19.jpg  And  https://up...
[10:00:04] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1000)
[10:00:40] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (Ladsgroup) Is this a duplicate of the above ticket?
[10:01:41] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (RhinosF1) >>! In T306117#7854785, @Ladsgroup wrote: > Is this a duplicate of the above ticket?  one is LDAP, one is shell
[10:27:19] <wikibugs>	 (PS1) Btullis: Add datahub-gms to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358)
[10:29:09] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:36] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780542 (owner: Ayounsi)
[10:32:54] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:34:30] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "lgtm!  nice work, I'm planning a few updates to this file once I can confirm the QFX support for mixed speeds / port blocks, so good examp" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068) (owner: Ayounsi)
[10:35:24] <wikibugs>	 (PS2) Btullis: Add datahub-gms to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358)
[10:40:08] <wikibugs>	 (PS1) Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358)
[10:42:53] <wikibugs>	 (PS1) Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249)
[10:44:22] <wikibugs>	 (PS2) Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358)
[10:46:51] <wikibugs>	 (PS2) Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249)
[10:56:21] <wikibugs>	 (CR) Vgutierrez: Add datahub-gms to the service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[10:58:58] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "Super work!  I think this will be a lot cleaner, given the automation there is no real benefit from using the 'interface-range' stuff.  I'" [homer/public] - https://gerrit.wikimedia.org/r/780310 (owner: Ayounsi)
[11:00:55] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:01:52] <wikibugs>	 (PS3) Btullis: Add datahub-gms to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358)
[11:01:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:04:36] <wikibugs>	 (PS1) Zabe: icinga: migrate sync-icinga-state cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673)
[11:04:38] <wikibugs>	 (PS1) Zabe: icinga: remove absented sync-icinga-state cron [puppet] - https://gerrit.wikimedia.org/r/780672 (https://phabricator.wikimedia.org/T273673)
[11:06:52] <wikibugs>	 (CR) Btullis: Add datahub-gms to the service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[11:08:26] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "Nice work, loving the reduction in complexity!  I'd agree with Volans about using the 'set' for the vlans (but tbh might not have hit on t" [software/homer/deploy] - https://gerrit.wikimedia.org/r/780304 (owner: Ayounsi)
[11:08:36] <wikibugs>	 SRE, Prod-Kubernetes, Traffic, serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (akosiaris) > * The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignor...
[11:13:50] <wikibugs>	 (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/34839/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[11:19:14] <wikibugs>	 (CR) Zabe: [V: +1] icinga: migrate sync-icinga-state cron to systemd timer job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[11:28:29] <wikibugs>	 SRE, Prod-Kubernetes, Traffic, serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (BTullis) >> The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored...
[11:35:25] <wikibugs>	 (PS4) Btullis: Add datahub-gms to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358)
[11:41:45] <wikibugs>	 (PS1) 4nn1l2: fawiki: Change wordmark & tagline for new Vector and logo for legacy Vector [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030)
[11:43:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] fawiki: Change wordmark & tagline for new Vector and logo for legacy Vector [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[11:44:32] <wikibugs>	 (PS2) 4nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030)
[11:45:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[12:02:49] <wikibugs>	 (CR) Ayounsi: [C: +2] Refine "test_disabled_configured" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780542 (owner: Ayounsi)
[12:23:21] <nn1l2>	 Any idea why jenkins-bot gave -2 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/780728/ ?
[12:25:04] <Lucas_WMDE>	 nn1l2: looks like the file you’re adding has Wikipedia uppercase but the config has wikipedia lowercase
[12:25:15] <Lucas_WMDE>	 so it’s not the same file path
[12:25:44] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Generated Data Platform: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (Ottomata) https://www.mediawiki.org/wiki/Readers/Structured_Data
[12:28:37] <wikibugs>	 SRE, Prod-Kubernetes, Traffic, serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (JMeybohm) >>! In T305358#7854870, @akosiaris wrote: >> * The monitoring: stanza can't be added as having that without lv...
[12:34:05] <wikibugs>	 (PS1) JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629
[12:34:23] <wikibugs>	 (PS2) JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729)
[12:34:32] <wikibugs>	 (PS3) JMeybohm: Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729)
[12:36:43] <wikibugs>	 (PS3) 4nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030)
[12:41:09] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:43:24] <nn1l2>	 Thanks <Lucas_WMDE>
[12:50:32] <wikibugs>	 (PS1) Hoo man: Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - https://gerrit.wikimedia.org/r/780753
[12:54:30] <wikibugs>	 (PS8) Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392)
[12:55:28] <wikibugs>	 (PS2) Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - https://gerrit.wikimedia.org/r/737858
[12:56:19] <wikibugs>	 (PS6) Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/737857
[12:58:58] <wikibugs>	 (CR) JMeybohm: "Be aware that this might get pulled again, depending of the outcome of T305358" [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[12:59:49] <wikibugs>	 (CR) JMeybohm: [C: -2] Add datahub-gms to the service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1300).
[13:00:05] <jouncebot>	 nn1l2 and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <wikibugs>	 (PS3) Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358)
[13:00:19] <wikibugs>	 (CR) Btullis: Add a CNAME reference for datahub-gms.discovery.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[13:00:32] <Lucas_WMDE>	 I’m in a meeting, might be able to deploy in half an hour or so if nobody else is around
[13:00:44] <nn1l2>	 hi
[13:01:05] <hoo>	 I'm happy to go ahead with my change
[13:01:38] <wikibugs>	 (PS11) Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/737859
[13:01:50] <wikibugs>	 (PS5) Btullis: Add datahub-gms to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358)
[13:02:05] <hoo>	 nn1l2: I'll go ahead and start with my change and then Lucas_WMDE or I will do yours
[13:02:21] <wikibugs>	 (CR) Btullis: Add datahub-gms to the service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[13:02:25] <nn1l2>	 Thanks
[13:02:32] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (MatthewVernon) I'm not quite sure how all the plumbing works, but the container seems to be meant to be readable: ` root@ms-fe1009:/etc/swift# swift stat wikipedia-testcommons-local-public | grep 'Read ACL'...
[13:02:37] <wikibugs>	 (CR) Hoo man: [C: +2] Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - https://gerrit.wikimedia.org/r/780753 (owner: Hoo man)
[13:03:21] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:03:43] <wikibugs>	 (Merged) jenkins-bot: Read from the "unexpectedUnconnectedPage" page prop [mediawiki-config] - https://gerrit.wikimedia.org/r/780753 (owner: Hoo man)
[13:05:45] <wikibugs>	 (CR) Btullis: [C: +2] Add a CNAME reference for datahub-gms.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/780658 (https://phabricator.wikimedia.org/T305358) (owner: Btullis)
[13:06:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:06:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:01] <Lucas_WMDE>	 alright, I’m available now :)
[13:08:22] <wikibugs>	 (CR) Andrew Bogott: Create REST api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:08:41] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "I didn’t see it in time but LGTM :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/780753 (owner: Hoo man)
[13:09:18] <hoo>	 Lucas_WMDE: Something seems off
[13:09:25] <hoo>	 Way more results than before
[13:09:28] <wikibugs>	 (PS4) Thiemo Kreuz (WMDE): Make use of the ?? operator in more trivial situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740304
[13:09:31] <hoo>	 including some redirects :(
[13:09:33] <Lucas_WMDE>	 more results?
[13:09:35] <Lucas_WMDE>	 ouch
[13:09:43] <Lucas_WMDE>	 which wiki?
[13:09:53] <hoo>	 dewiki
[13:09:58] <hoo>	 https://de.wikipedia.org/wiki/Spezial:Nicht_verbundene_Seiten?limit=500&namespace=0
[13:10:04] <hoo>	 compare normal vs mwdebug1001
[13:10:16] <Lucas_WMDE>	 oh, is it only on mwdebug so far?
[13:10:34] <hoo>	 Yeah, I haven't synched it
[13:10:45] <Lucas_WMDE>	 hm, on mwdebug the list starts with the Module namespace for me o_O
[13:11:02] <Lucas_WMDE>	 why is it not starting with the article namespace
[13:11:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:11:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:40] <Lucas_WMDE>	 ah, because the reverse sorting is only in wmf.7
[13:11:46] <Lucas_WMDE>	 and I don’t think we backported that to wmf.6
[13:11:47] <hoo>	 Ah, thought so
[13:11:56] <hoo>	 but that still doesn't explain the redirects
[13:12:00] <Lucas_WMDE>	 no
[13:12:28] <Lucas_WMDE>	 how do you recognize the redirects? just click some of the module links?
[13:12:57] <Lucas_WMDE>	 aha, some of the article namespace ones are redirects indeed
[13:12:57] <hoo>	 I'm comparing the articles
[13:13:10] <Lucas_WMDE>	 grmbl
[13:13:11] <hoo>	 Yeah, and I've no idea what's wrong there
[13:14:31] <hoo>	 Lucas_WMDE: Revert and investigate later?
[13:14:43] <Lucas_WMDE>	 e.g. Ravita has page_is_redirect = 1 in the page table
[13:14:47] <Lucas_WMDE>	 probably yeah
[13:15:11] <hoo>	 Yeah… probably not coming through the migration script but the regular logic
[13:15:28] <Lucas_WMDE>	 probably
[13:15:32] <hoo>	 The pages seem to have a very recent page_touched
[13:15:35] <Lucas_WMDE>	 there’s no way to find out when a page prop was written right?
[13:15:47] <Lucas_WMDE>	 since the table has no auto_increment PK
[13:15:54] <hoo>	 No, but page_touched is an indicator… I mean, we don't set that, but link update does
[13:15:59] <Lucas_WMDE>	 I see
[13:16:15] <hoo>	 Ah, page_links_updated is there
[13:16:18] <hoo>	 that's even more explicit
[13:16:21] <Lucas_WMDE>	 nice
[13:16:31] <hoo>	 So, that's most definitely not the migration code's fault
[13:16:34] <hoo>	 anyway, let's revert
[13:18:16] <Lucas_WMDE>	 yeah
[13:18:35] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:780753|Read from the "unexpectedUnconnectedPage" page prop]] (duration: 00m 56s)
[13:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:58] <Lucas_WMDE>	 oh, I thought it wouldn’t need a sync since it was only on mwdebug to begin with ^^
[13:19:30] <hoo>	 Yeah, true
[13:20:01] <hoo>	 but that way we have it in the logs
[13:20:14] <Lucas_WMDE>	 I wonder if $title->isRedirect() in ClientParserOutputDataUpdater::setUnexpectedUnconnectedPage() needs to be a check on the $parserOutput instead
[13:20:38] <Lucas_WMDE>	 if the Title method reads the database, and the code runs before the database has been written, or something
[13:20:43] <Lucas_WMDE>	 and ok fair point
[13:21:38] <Lucas_WMDE>	 hm, not sure if that information is available in ParserOutput though
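
The check discussed above (find redirects that still carry the page prop, then use page_links_updated to see when links update last ran) can be sketched as follows. This is only an illustration under assumptions: an in-memory SQLite database stands in for a wiki replica, only the columns named in the conversation are modelled, and the timestamps, the second page and the prop values are made up. The helper name redirects_with_prop is hypothetical.

```python
import sqlite3

# Hypothetical miniature of the MediaWiki `page` and `page_props` tables.
# Only the columns mentioned in the discussion are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT,
    page_is_redirect INTEGER,
    page_links_updated TEXT
);
CREATE TABLE page_props (
    pp_page INTEGER,
    pp_propname TEXT,
    pp_value TEXT
);
""")

# Made-up rows: "Ravita" is the redirect named in the log; the timestamps
# and the second page are placeholders, not real data.
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?, ?)", [
    (1, 0, "Ravita", 1, "20220413101500"),
    (2, 0, "Example_article", 0, "20220301080000"),
])
conn.executemany("INSERT INTO page_props VALUES (?, ?, ?)", [
    (1, "unexpectedUnconnectedPage", "0"),
    (2, "unexpectedUnconnectedPage", "0"),
])


def redirects_with_prop(conn, propname):
    """Return (title, page_links_updated) for redirects carrying the prop,
    most recently updated first."""
    rows = conn.execute("""
        SELECT page_title, page_links_updated
        FROM page JOIN page_props ON pp_page = page_id
        WHERE pp_propname = ? AND page_is_redirect = 1
        ORDER BY page_links_updated DESC
    """, (propname,))
    return rows.fetchall()


suspects = redirects_with_prop(conn, "unexpectedUnconnectedPage")
print(suspects)
```

A recent page_links_updated on such a row would point at the regular links update path rather than a one-off migration script, which is the inference drawn in the conversation.
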
[13:25:09] <hoo>	 I'm done
[13:25:21] <Lucas_WMDE>	 should I deploy nn1l2’s change then?
[13:26:10] <hoo>	 IMO, yes
[13:26:12] <nn1l2>	 I'm available
[13:26:28] <Lucas_WMDE>	 ok :)
[13:26:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:26:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:26:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:26:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:27:05] <zabe>	 hmm, some bots died
[13:27:56] <Lucas_WMDE>	 hm, looks like it
[13:29:20] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: -1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:29:21] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[13:29:29] <Lucas_WMDE>	 I found something minor to criticize in the patch anyways ;)
[13:29:43] <Lucas_WMDE>	 so stashbot has a few minutes to return before I want to deploy anything
[13:29:44] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[13:30:01] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[13:30:43] <Lucas_WMDE>	 bd808, greg-g, hashar: can you check on stashbot? (maintainers according to https://admin.toolforge.org/tool/stashbot)
[13:32:16] <Lucas_WMDE>	 ~tools.stashbot/stashbot.log has a bunch of errors writing to twitter
[13:32:34] <wikibugs>	 (CR) 4nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:32:48] <Lucas_WMDE>	 though for all I know those errors might be pretty old
[13:33:35] <Lucas_WMDE>	 ok I think those twitter errors have been happening for at least two days
[13:33:40] <Lucas_WMDE>	 so that’s probably not the cause of the disconnect now
[13:34:44] <wikibugs>	 (CR) 4nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:36:13] <Lucas_WMDE>	 ah, stashbot seems to be back
[13:37:07] <Lucas_WMDE>	 !log relogging four messages that stashbot missed: 13:26 mwdebug-deploy@deploy1002 helmfile [eqiad/codfw] START/DONE helmfile.d/services/mwdebug: apply
[13:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:47] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: -1] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:39:57] <wikibugs>	 (CR) 4nn1l2: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:41:03] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:41:25] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:42:12] <wikibugs>	 (Merged) jenkins-bot: fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/780728 (https://phabricator.wikimedia.org/T306030) (owner: 4nn1l2)
[13:42:27] <wikibugs>	 (PS8) Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - https://gerrit.wikimedia.org/r/779936
[13:44:09] <Lucas_WMDE>	 nn1l2: the change is on mwdebug1001, can you test it?
[13:44:23] <Lucas_WMDE>	 (I’m not sure if logo changes can actually be tested on mwdebug alone, I might have to sync at least the PNGs already
[13:44:26] <Lucas_WMDE>	 )
[13:44:35] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34843/console" [puppet] - https://gerrit.wikimedia.org/r/779936 (owner: Ssingh)
[13:44:56] <nn1l2>	 I was fixing that error by the other user, <Lucas_WMDE>
[13:44:57] <Lucas_WMDE>	 https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C?useskin=vector looks good to me
[13:44:59] <nn1l2>	 but OKay
[13:45:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:45:05] <nn1l2>	 let me check
[13:45:13] <Lucas_WMDE>	 nn1l2: I think that can wait a bit, but thanks :)
[13:45:20] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC looks happy for: dns1001, doh1001, cloudservices1003" [puppet] - https://gerrit.wikimedia.org/r/779936 (owner: Ssingh)
[13:45:26] <Lucas_WMDE>	 (I was also already grepping for where those task numbers need to be added to end up in the file ^^)
[13:45:47] <nn1l2>	 LGTM
[13:46:13] <Lucas_WMDE>	 alright, so I assume I should sync static/images/ first
[13:46:19] <Lucas_WMDE>	 and then… any particular order for the other three files?
[13:46:25] <Lucas_WMDE>	 probably doesn’t matter
[13:46:33] <wikibugs>	 (CR) Ssingh: [V: +1] "Ready for review but let's plan to merge this on or after Tuesday." [puppet] - https://gerrit.wikimedia.org/r/779936 (owner: Ssingh)
[13:46:34] <wikibugs>	 SRE, Analytics, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) p:Triage→Medium
[13:46:34] <nn1l2>	 no specific order
[13:46:38] <Lucas_WMDE>	 ok
[13:46:47] <nn1l2>	 everything was fine IMO
[13:47:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:47:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:47:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:47:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:30] <wikibugs>	 (PS9) Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589)
[13:48:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (1/4) (duration: 00m 56s)
[13:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:19] <stashbot>	 T306030: Change the logo of Farsi Wikipedia for 900K milestone - https://phabricator.wikimedia.org/T306030
[13:50:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (2/4) (duration: 00m 55s)
[13:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:01] <wikibugs>	 (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/776232 (owner: PipelineBot)
[13:51:05] <wikibugs>	 (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/776231 (owner: PipelineBot)
[13:51:11] <wikibugs>	 (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/778277 (owner: PipelineBot)
[13:51:23] <wikibugs>	 (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/778278 (owner: PipelineBot)
[13:51:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (3/4) (duration: 01m 00s)
[13:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:780728|fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector) (T306030)]] (4/4) (duration: 00m 53s)
[13:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:56] <wikibugs>	 (CR) Hnowlan: "LGTM, some minor notes on style in the Dockerfile." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: Roman Stolar)
[13:53:35] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:20] <Lucas_WMDE>	 jouncebot: next
[13:55:21] <jouncebot>	 In 2 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600)
[13:55:47] <Lucas_WMDE>	 nn1l2: if you want to fix that comment now, feel free to ping me and I can deploy it, we have two hours until the next window :)
[13:56:04] <nn1l2>	 Okay, will do
[14:00:43] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, one last question inline." [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: MVernon)
[14:11:21] <wikibugs>	 (PS1) Volans: admin: add ldap-only user nathillard [puppet] - https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978)
[14:17:40] <wikibugs>	 (PS1) 4nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037)
[14:17:46] <wikibugs>	 (PS1) Ayounsi: Add script to move devices attributes [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166)
[14:18:03] <wikibugs>	 (CR) Zabe: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata)
[14:18:28] <wikibugs>	 (CR) Ayounsi: "Tested in https://netbox-next.wikimedia.org/extras/scripts/replace_device.ReplaceDevice/" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[14:19:24] <wikibugs>	 (PS2) 4nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037)
[14:20:37] <wikibugs>	 SRE-tools, Infrastructure-Foundations, netbox, Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (ayounsi) a:ayounsi
[14:25:57] <wikibugs>	 (PS3) 4nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037)
[14:27:58] <nn1l2>	 <Lucas_WMDE> please see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/780844
[14:28:13] * Lucas_WMDE looks
[14:29:15] <Lucas_WMDE>	 LGTM
[14:29:16] <wikibugs>	 (PS4) 4nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037)
[14:29:17] <Lucas_WMDE>	 jouncebot: nowandnext
[14:29:17] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 30 minute(s)
[14:29:17] <jouncebot>	 In 1 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600)
[14:30:21] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 4nn1l2)
[14:30:25] <wikibugs>	 (PS5) 4nn1l2: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037)
[14:30:37] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 4nn1l2)
[14:31:23] <Lucas_WMDE>	 nn1l2: are you done yet? :P
[14:31:40] <nn1l2>	 yes, I fixed the commit message, there was a typo
[14:32:05] <logmsgbot>	 !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided)
[14:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:08] <logmsgbot>	 !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 03s)
[14:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:49] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 4nn1l2)
[14:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:33:29] <wikibugs>	 (PS1) David Caro: DONOTMERGE: skeleteon for the replicaconfig service [puppet] - https://gerrit.wikimedia.org/r/780853
[14:33:31] <wikibugs>	 (Merged) jenkins-bot: Wikispecies: Fix logo ticket numbers [mediawiki-config] - https://gerrit.wikimedia.org/r/780844 (https://phabricator.wikimedia.org/T306037) (owner: 4nn1l2)
[14:34:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] DONOTMERGE: skeleteon for the replicaconfig service [puppet] - https://gerrit.wikimedia.org/r/780853 (owner: David Caro)
[14:36:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:780844|Wikispecies: Fix logo ticket numbers (T306037)]] (1/2, expected no-op) (duration: 00m 55s)
[14:36:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:12] <stashbot>	 T306037: Optimize logo for Wikispecies - https://phabricator.wikimedia.org/T306037
[14:36:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:780844|Wikispecies: Fix logo ticket numbers (T306037)]] (2/2, expected no-op) (duration: 00m 55s)
[14:38:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:53] <icinga-wm>	 PROBLEM - Host kubestage2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:49] <icinga-wm>	 PROBLEM - Host mc2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:23] <wikibugs>	 (PS7) MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942)
[14:49:09] <icinga-wm>	 RECOVERY - Host kubestage2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[14:49:14] <wikibugs>	 (CR) MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: MVernon)
[14:51:51] * Lucas_WMDE experimenting on mwdebug1002
[14:52:07] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) @hnowlan i am ready for restbase2021
[14:55:29] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet
[14:55:32] <wikibugs>	 (CR) Btullis: [C: +2] Update datahub to use version 0.8.32 [deployment-charts] - https://gerrit.wikimedia.org/r/779898 (https://phabricator.wikimedia.org/T306019) (owner: Btullis)
[14:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:28] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (hnowlan) >>! In T305469#7855361, @Papaul wrote: > @hnowlan i am ready for restbase2021  Go ahead!
[14:56:29] <bd808>	 Lucas_WMDE: just took a quick peek, but if what you were seeing in the stashbot logs was 'Status is a duplicate.' errors those are fairly common. Usually they are true as well (meaning that logmsgbot has sent the same irc message 2x in a row as happened in this channel at 14:38 UTC)
[14:56:39] <Lucas_WMDE>	 that’s what it was yeah
[14:56:46] <Lucas_WMDE>	 thanks for looking
[14:57:33] <bd808>	 I should fix that particular log to have timestamps too...
[14:59:31] <icinga-wm>	 RECOVERY - Host mc2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[14:59:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:46] <wikibugs>	 (CR) Volans: [C: +1] "LGTM (with the caveats of my first +1 :D )" [puppet] - https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: MVernon)
[15:00:17] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:00:21] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.155 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:00:31] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.154 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:00:39] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:00:49] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:00:52] <wikibugs>	 (Merged) jenkins-bot: Update datahub to use version 0.8.32 [deployment-charts] - https://gerrit.wikimedia.org/r/779898 (https://phabricator.wikimedia.org/T306019) (owner: Btullis)
[15:01:01] <volans>	 hnowlan: expected? ^^^
[15:01:12] <hnowlan>	 volans: yep, oops. 
[15:01:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:02:31] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:04:00] <papaul>	 hnowlan: do I have to power it off or will you do it?
[15:04:12] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[15:04:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi)
[15:05:13] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:19] <hnowlan>	 papaul: I will 
[15:05:26] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[15:05:28] <papaul>	 hnowlan: ok let me know 
[15:05:35] * Lucas_WMDE done experimenting on mwdebug1002 (changes reset with scap pull)
[15:05:44] <wikibugs>	 (CR) Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[15:06:25] <papaul>	 !log powerdown ganeti2020 for relocation 
[15:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:53] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase2021.codfw.wmnet with reason: Relocation
[15:06:54] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase2021.codfw.wmnet with reason: Relocation
[15:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:25] <hnowlan>	 papaul: done
[15:07:25] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:08:20] <papaul>	 hnowlan: thanks
[15:09:47] <wikibugs>	 (CR) Ssingh: [C: +1] admin: add ldap-only user nathillard [puppet] - https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: Volans)
[15:09:49] <icinga-wm>	 PROBLEM - Host ganeti2020 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:53] <rzl>	 hnowlan: that spike in api server latency in *codfw* is probably related to the restbase depoolage, right?
[15:12:04] <wikibugs>	 (CR) Volans: [C: +2] admin: add ldap-only user nathillard [puppet] - https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: Volans)
[15:12:10] <volans>	 thanks sukhe!
[15:13:03] <icinga-wm>	 PROBLEM - Host ganeti2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:14:03] <icinga-wm>	 PROBLEM - Host restbase2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:14:09] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:14:19] <hnowlan>	 rzl: I wouldn't expect it to cause it... 
[15:14:38] <hnowlan>	 or at least it's not a regularly seen side effect of depooling a single node 
[15:15:39] <rzl>	 I can't figure out a mechanism either -- but also note codfw only gets like 13 rps, so it's easy to throw the percentiles off
[15:15:51] <rzl>	 well, I guess that alert is the mean, but still
[15:15:54] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:07] <hnowlan>	 ah that could be it 
[15:16:15] <rzl>	 it looks like mcrouter was unhappy
[15:16:17] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[15:16:43] <wikibugs>	 (CR) Ottomata: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata)
[15:17:52] <rzl>	 anyway nothing that needs to be investigated all that deeply, especially since it's codfw-only and self-healing
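To illustrate rzl's point about low traffic: at roughly 13 requests per second, a small burst of slow requests is enough to move the mean noticeably, while the same burst barely registers at higher volume. A minimal sketch with made-up numbers (the 0.2 s baseline, 5 s slow latency, 20 slow requests, 60 s window, and the 1300 rps comparison are all assumptions, not values taken from the dashboard):

```python
from statistics import mean


def window_mean(rps, seconds, base_latency, slow_count, slow_latency):
    """Mean latency over a window in which `slow_count` requests are slow."""
    total = rps * seconds
    latencies = [base_latency] * (total - slow_count) + [slow_latency] * slow_count
    return mean(latencies)


# Same burst of 20 slow requests, two hypothetical traffic levels.
low = window_mean(13, 60, 0.2, 20, 5.0)     # low-traffic site, as discussed above
high = window_mean(1300, 60, 0.2, 20, 5.0)  # hypothetical busier site

print(f"mean at 13 rps:   {low:.3f}s")
print(f"mean at 1300 rps: {high:.3f}s")
```

The low-traffic mean lands well above the baseline, while the high-traffic mean stays close to it, which is consistent with the alert firing only in codfw and clearing on its own.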
[15:18:24] <wikibugs>	 (PS1) Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399)
[15:19:43] <wikibugs>	 (CR) Zabe: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata)
[15:20:21] <icinga-wm>	 RECOVERY - Host restbase2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms
[15:24:03] <wikibugs>	 (CR) Ottomata: "Oof, thanks! Sorry about that removal, dunno how that happened." [puppet] - https://gerrit.wikimedia.org/r/780838 (https://phabricator.wikimedia.org/T305978) (owner: Volans)
[15:24:23] <wikibugs>	 (CR) Ottomata: Declare new research-deployers group for airflow instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779887 (owner: Ottomata)
[15:27:23] <papaul>	 hnowlan: server will be coming up soon, just committing the switch change now
[15:27:35] <hnowlan>	 papaul: great, thanks! 
[15:30:55] <papaul>	 did already on ganeti2020, the interface is only set to access 
[15:32:49] <papaul>	 hnowlan: server is up 
[15:33:33] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-c valid until 2023-11-25 11:38:52 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:33:57] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.033 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886
[15:33:59] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is OK: TCP OK - 0.037 second response time on 10.192.16.155 port 9042 https://phabricator.wikimedia.org/T93886
[15:34:01] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-b valid until 2023-11-25 11:38:50 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:34:09] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is OK: TCP OK - 0.033 second response time on 10.192.16.154 port 9042 https://phabricator.wikimedia.org/T93886
[15:34:13] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-a valid until 2023-11-25 11:38:47 +0000 (expires in 589 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:38:17] <wikibugs>	 (PS9) Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392)
[15:40:34] <wikibugs>	 SRE, DBA, Trust-and-Safety, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370 (Zabe)
[15:40:38] <wikibugs>	 SRE, DBA, Trust-and-Safety, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370 (Zabe)
[15:40:44] <wikibugs>	 (PS10) Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392)
[15:41:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: Cathal Mooney)
[15:42:54] <wikibugs>	 (PS1) Eigyan: [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963)
[15:43:10] <wikibugs>	 (CR) Volans: "Some improvement suggestions inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[15:45:03] <icinga-wm>	 RECOVERY - Host ganeti2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.18 ms
[15:45:49] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[15:46:29] <icinga-wm>	 RECOVERY - Host ganeti2020 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[15:49:02] <wikibugs>	 (CR) Krinkle: Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[15:49:21] <wikibugs>	 (CR) Krinkle: Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[15:49:35] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[15:51:47] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go. As always, stage on mwdebug and look out for warnings in Logstash before rolling out. These renames can be really sneaky at ti" [mediawiki-config] - https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[15:52:12] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) All the nodes that are not cloud are now out of Rack B1. Thanks to all helping me to de-pool the servers and power them off.
[15:56:12] <wikibugs>	 SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (Ladsgroup) If you're asking me, I have no idea how swift ACL or swift-mediawiki relation works. Sorry.
[15:57:54] <wikibugs>	 (PS11) Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392)
[15:58:04] <logmsgbot>	 !log krinkle@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 00m 55s)
[15:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:43] <wikibugs>	 (CR) Krinkle: [C: +1] "The structured in prod has diverged, but, I've applied effectively the same rename + alias and deployed it. This is good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:00:04] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:34] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) (owner: JMeybohm)
[16:01:00] <hnowlan>	 papaul: thanks! 
[16:03:04] <dancy>	 rzl: Can you deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/780629 for me (with the possibility of re-reverting in case it doesn't work out)
[16:03:30] <rzl>	 dancy: sure! looking
[16:03:35] <dancy>	 thx!
[16:03:39] <papaul>	 hnowlan: you're welcome
[16:04:00] <rzl>	 ohh I see what we're doing, yeah
[16:04:09] <rzl>	 jayme: if you're still about, any objections? ^
[16:04:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:04:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:30] <rzl>	 going ahead
[16:07:42] <wikibugs>	 (CR) RLazarus: [C: +2] Revert "mwdebug_deploy: switch back to using the root user" [puppet] - https://gerrit.wikimedia.org/r/780629 (https://phabricator.wikimedia.org/T305729) (owner: JMeybohm)
[16:09:56] <rzl>	 dancy: merged, and ran puppet on deploy1002
[16:10:07] <dancy>	 Thanks.  Testing now.
[16:10:08] <wikibugs>	 (PS1) Majavah: site: fix cloudstore1009,1010 definitions [puppet] - https://gerrit.wikimedia.org/r/780889
[16:12:23] <dancy>	 jouncebot now
[16:12:23] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T1600)
[16:13:09] <logmsgbot>	 !log dancy@deploy1002 Started scap: (no justification provided)
[16:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:20] <dancy>	 ^ Testing image build and deploy 
[16:15:44] <wikibugs>	 SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Andrew) I have a dentist appointment at 2PM CDT on Monday the 18th; otherwise I'm available to help with this.  Please be aware that I'm largely ignorant of network topology vs. racks so will be re...
[16:17:09] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet
[16:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:53] <wikibugs>	 (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[16:20:56] <wikibugs>	 (PS2) Majavah: site: fix cloudstore1009,1010 definitions [puppet] - https://gerrit.wikimedia.org/r/780889
[16:22:07] <wikibugs>	 (PS3) Majavah: site: fix cloudstore1009,1010 definitions [puppet] - https://gerrit.wikimedia.org/r/780889
[16:27:11] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: Cathal Mooney)
[16:27:33] <wikibugs>	 (PS1) MSantos: mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - https://gerrit.wikimedia.org/r/780891
[16:27:46] <wikibugs>	 (PS1) Zabe: httpbb: move redirect tests to test_redirects.yaml [puppet] - https://gerrit.wikimedia.org/r/780892
[16:38:57] <wikibugs>	 (CR) MSantos: [C: +2] mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - https://gerrit.wikimedia.org/r/780891 (owner: MSantos)
[16:41:01] <wikibugs>	 (PS1) Ottomata: Actually set REQUESTS_CA_BUNDLE [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197)
[16:45:08] <wikibugs>	 (Merged) jenkins-bot: mobileapps: bump to 2022-04-13-110715-production [deployment-charts] - https://gerrit.wikimedia.org/r/780891 (owner: MSantos)
[16:46:07] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:52] <rzl>	 dancy: ^ what do you think?
[16:47:07] <dancy>	 Taking a look.. probably permissions issues.
[16:47:27] <rzl>	 ah yeah, `error: insufficient permission for adding an object to repository database .git/objects`
[16:48:09] <dancy>	 `chown -R mwbuilder: /etc/helmfile-defaults/mediawiki/release` should take care of that
[16:48:49] <dancy>	 and `rm /var/lib/deploy-mwdebug/error` afterward
[16:48:51] <rzl>	 do you know where it is in puppet?
[16:49:20] <dancy>	 checking..
[16:49:31] <cdanis>	 !log depooling & disabling puppet on cp2027 for some manual testing T303534
[16:49:34] <rzl>	 ah found it, modules/profile/manifests/kubernetes/deployment_server/mediawiki/release.pp
[16:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:08] <dancy>	 Prior to the revert that you merged today, the commands were run as root, so some of the files in the .git directory are owned by root.. and some not.
[16:51:24] <rzl>	 ahh okay yeah
[16:51:54] <rzl>	 I was just digging to see if puppet has an idea of who ought to own the recursive contents of that directory, I didn't want to be fighting it back and forth
[16:52:08] <rzl>	 chown -R sounds good in that case, running
[16:52:21] <dancy>	 Nod.  It would be nice if something magical happened to keep things in order.
[16:53:19] <wikibugs>	 (PS1) JHathaway: smart_data_dump: silence log output when running tests [puppet] - https://gerrit.wikimedia.org/r/780902
[16:53:38] <rzl>	 I think adding recurse => true to that directory would do the right thing, but I'm not confident about unwanted side effects, and anyway we don't expect to be flipping this all that often
[16:53:54] <rzl>	 !log rzl@deploy1002:~$ sudo chown -R mwbuilder: /etc/helmfile-defaults/mediawiki/release
[16:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
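The mixed-ownership state dancy describes (some files under .git owned by root, some not) can be checked before and after a recursive chown. A minimal sketch, assuming a hypothetical helper named `paths_not_owned_by` that is not part of scap or puppet; it only reads ownership and changes nothing:

```python
import os


def paths_not_owned_by(root, uid):
    """Yield `root` and every path beneath it whose owner is not `uid`."""
    if os.lstat(root).st_uid != uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are reported by their own owner
            if os.lstat(path).st_uid != uid:
                yield path
```

A numeric uid can be looked up from an account name with `pwd.getpwnam("mwbuilder").pw_uid`; after the chown above, iterating over `paths_not_owned_by("/etc/helmfile-defaults/mediawiki/release", uid)` would be expected to yield nothing.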
[16:54:23] <rzl>	 deleted the error file, another run should be coming up shortly
[16:54:30] <dancy>	 ok.  Waiting eagerly.
[16:54:59] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/780902 (owner: JHathaway)
[16:55:05] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:06] <rzl>	 success \o/
[16:56:45] <rzl>	 hm, except it looks like the 16:39:54 attempt didn't succeed and now we're at "nothing to deploy" -- does this need a --force run?
[16:57:47] <dancy>	 hmmm
[17:00:15] <dancy>	 I'll trigger that.
[17:00:29] <logmsgbot>	 !log dancy@deploy1002 Started scap: (no justification provided)
[17:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:50] <dancy>	 alright, something should happen in 2 minutes
[17:02:55] <rzl>	 sweet
[17:04:43] <wikibugs>	 (CR) Cwhite: [C: +1] "Tests pass CI and is not a functional change to SDD.  LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/780902 (owner: JHathaway)
[17:05:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:05:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:05:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:49] <rzl>	 hmm, different errors at least
[17:07:14] <dancy>	 nod.   The one about not being able to download wmf-stable/mediawiki is new.  Not sure what that's about.
[17:07:20] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@76ee675]: WDQS: Allow federated queries with Publication Office and European Commission
[17:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:43] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:56] <wikibugs>	 (CR) JHathaway: [C: +2] smart_data_dump: silence log output when running tests [puppet] - https://gerrit.wikimedia.org/r/780902 (owner: JHathaway)
[17:11:09] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:11:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:33] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:11:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:06] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:12:07] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:15] <wikibugs>	 (PS1) Btullis: Update the container images used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/780906 (https://phabricator.wikimedia.org/T306019)
[17:12:17] <rzl>	 I'm not sure if `WARNING: Kubernetes configuration file is group-readable. This is insecure.` is turning from a warning into an error downstream but I wouldn't be shocked
[17:12:28] <dancy>	 It does appear that way.
[17:12:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack: remove enc api from puppet masters [puppet] - https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[17:12:51] <wikibugs>	 (PS1) JHathaway: smart_data_dump: Use lsblk's json output [puppet] - https://gerrit.wikimedia.org/r/780907
[17:13:37] <wikibugs>	 (CR) JHathaway: "Kindly review" [puppet] - https://gerrit.wikimedia.org/r/780907 (owner: JHathaway)
[17:14:01] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@76ee675]: WDQS: Allow federated queries with Publication Office and European Commission (duration: 06m 41s)
[17:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:40] <dancy>	 rzl: Is it feasible to set ownership of /etc/kubernetes/mwdebug-deploy-eqiad.config to `mwbuilder` and mode 0600? 
[17:17:02] <wikibugs>	 (CR) Jsn.sherman: [C: +1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) (owner: Eigyan)
[17:18:19] <rzl>	 digging into that now -- it seems like most of /etc/kubernetes/*.config is mwdeploy:deployment and 0640 though, so I'm still digging into what's going on there
[17:19:21] <inflatador>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
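Note on the restart logged above: it handles one host at a time, depooling it, waiting, restarting the service, waiting again, then repooling. A minimal Python sketch of that ordering follows, assuming a caller-supplied `run(host, step)` callback. The callback, the step names, and the function name are hypothetical stand-ins, not part of cumin or conftool.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    hosts: Iterable[str],
    run: Callable[[str, str], None],
    settle_seconds: float = 45.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart hosts strictly one at a time.

    `run` is a hypothetical callback that performs a named step on a host
    and raises on failure. An exception aborts the whole rollout, much as
    `&&` stops a shell command chain at the first failing command.
    """
    for host in hosts:
        run(host, "depool")
        sleep(settle_seconds)  # let in-flight requests drain
        run(host, "restart")
        sleep(settle_seconds)  # let the service warm up
        run(host, "pool")


if __name__ == "__main__":
    # Dry run: print the steps instead of executing anything, and skip waits.
    rolling_restart(
        ["host-a", "host-b"],
        run=lambda host, step: print(f"{host}: {step}"),
        sleep=lambda seconds: None,
    )
```

Passing `sleep` as a parameter keeps the sketch testable without real waits.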
[17:19:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:07] <dancy>	 rzl: The mwbuilder account can sudo to mwdeploy, so I'll try that approach first.
[17:20:31] <logmsgbot>	 !log mwbuilder@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:20:32] <logmsgbot>	 !log mwbuilder@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:23] <dancy>	 I think `Error: unknown command "diff" for "helm"` may be the real problem
[17:21:40] <dancy>	 s/the/a/
[17:22:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the container images used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/780906 (https://phabricator.wikimedia.org/T306019) (owner: 10Btullis)
[17:23:17] <dancy>	 The helm-diff package is installed.
[18:14:47] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.7  refs T305213
[18:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:52] <stashbot>	 T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213
[18:17:21] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.5 (duration: 02m 10s)
[18:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) Hi all - thanks to @jcrespo and @Dzahn for your help. Just now I tested the apps listed here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#wmf_group , with the following re...
[18:19:10] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.3 (duration: 01m 48s)
[18:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:46] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.4 (duration: 01m 35s)
[18:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:21:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:12] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.2 (duration: 01m 25s)
[18:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:17] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.37.0-wmf.1 (duration: 01m 04s)
[18:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:04] <dancy>	 Done cleaning cruft.
[18:26:03] <icinga-wm>	 RECOVERY - Disk space on mw2289 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops
[18:26:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:26:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:26:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:18] <icinga-wm>	 RECOVERY - Disk space on mw2276 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops
[18:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:15:26] <icinga-wm>	 PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100%
[19:20:31] <wikibugs>	 (03PS1) 10JHathaway: smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564)
[19:24:34] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway)
[19:24:52] <icinga-wm>	 RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[19:25:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi)
[19:25:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) a:03razzi
[19:27:20] <icinga-wm>	 PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100%
[19:28:14] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=ats-be,name=cp2027.*
[19:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:21] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[19:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:13] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T306215 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:30:18] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T306215 (10ops-monitoring-bot)
[19:33:07] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/780892 (owner: 10Zabe)
[19:35:38] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=ats-be,name=cp2027.*
[19:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:48] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T306215 (10wiki_willy) a:03Cmjohnson
[19:38:48] <icinga-wm>	 PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100%
[19:42:54] <icinga-wm>	 RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[19:42:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1068.eqiad.wmnet with OS stretch
[19:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch
[19:43:12] <wikibugs>	 (03PS1) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009
[19:43:34] <icinga-wm>	 RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[19:44:26] <wikibugs>	 (03CR) 104nn1l2: fawiki: Change logo for 900K milestone (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2)
[19:44:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS stretch
[19:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:34] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch
[19:44:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1070.eqiad.wmnet with OS stretch
[19:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch
[19:44:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS stretch
[19:44:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch
[19:50:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson)
[19:52:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn)
[19:53:00] <wikibugs>	 (03PS1) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013
[19:53:15] <wikibugs>	 (03PS2) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009
[19:53:26] <wikibugs>	 (03PS2) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013
[19:53:46] <wikibugs>	 (03PS3) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009
[19:54:21] <wikibugs>	 (03PS3) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013
[19:55:04] <wikibugs>	 (03PS3) 10Dzahn: webperf: migrate warm_up_coal_cache cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[19:56:41] <wikibugs>	 (03CR) 10Zabe: "Maybe I misunderstand this patch. But I actually can login at icinga.wikimedia.org (or is this for a different page?)." [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn)
[19:57:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage
[19:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage
[19:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage
[19:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:59] <wikibugs>	 (03CR) 10Dzahn: icinga: don't claim wmf or nda group gets you a login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn)
[20:00:04] <jouncebot>	 brennen: May I have your attention please! UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T2000)
[20:00:04] <jouncebot>	 zabe and eigyan: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] <wikibugs>	 (03Abandoned) 10Dzahn: icinga: don't claim wmf or nda group gets you a login [puppet] - 10https://gerrit.wikimedia.org/r/781013 (owner: 10Dzahn)
[20:00:12] <zabe>	 o/
[20:00:21] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027
[20:00:41] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage
[20:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:55] <thcipriani>	 hey zabe 
[20:01:06] <zabe>	 hi
[20:01:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage
[20:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:19] * urbanecm waves to everyone
[20:02:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "not entirely sure if the bash loop will be ok in the command string but easiest is to test it" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:02:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson)
[20:03:11] <brennen>	 zabe: i'll roll out that first one
[20:03:20] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:03:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:30] <zabe>	 ok
[20:03:39] <eigyan>	 greetings all
[20:03:50] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:03:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage
[20:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:16] <thcipriani>	 hi eigyan 
[20:04:47] <Juan_90264>	 Hello guys!
[20:04:55] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:05:18] <thcipriani>	 eigyan: your patch looks like it's beta-only, is that correct?
[20:05:33] <eigyan>	 that is correct thcipriani
[20:05:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage
[20:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:58] <wikibugs>	 (03CR) 10Dzahn: "[webperf1001:~] $ sudo systemctl status warm_up_coal_cache.service" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:06:04] <thcipriani>	 eigyan: cool, thanks for verifying, we'll get it merged shortly :)
[20:06:17] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp2027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027
[20:06:20] <eigyan>	 perfect thank you thcipriani
[20:06:48] <wikibugs>	 (03CR) 10Dzahn: "looks good on webperf1001. I did not do anything on  deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud  though" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:06:50] <brennen>	 zabe: on mwdebug1002, looks like en wikipedia loads, assume that's basically all there is to check?
[20:07:04] <zabe>	 brennen, yep
[20:07:08] <brennen>	 cool, syncing
[20:07:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:07:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:22] <wikibugs>	 (03CR) 10Dzahn: "good to go.. IF ... puppet ran on deployment-prep host" [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:08:26] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) (owner: 10Eigyan)
[20:08:29] <logmsgbot>	 !log brennen@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:776259|Stop writing to $wmfUdp2logDest (T45956)]] (duration: 00m 48s)
[20:08:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage
[20:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:34] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:05] <wikibugs>	 (03Merged) 10jenkins-bot: [wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780881 (https://phabricator.wikimedia.org/T303963) (owner: 10Eigyan)
[20:09:10] <brennen>	 eigyan: going ahead with yours real quick and then i'll hand off to cjming for the rest of zabe's stuff
[20:09:36] <eigyan>	 thanks brennen
[20:10:05] <mutante>	 !log gitlab - pausing and then deleting runner-1015, creating new bullseye runner-1026 instance to replace it
[20:10:07] <thcipriani>	 eigyan: now that this is merged, this will go live on the next run of: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/
[20:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:16] <thcipriani>	 but we'll sync for completeness, too :)
[20:10:40] <eigyan>	 excellent thank you thcipriani
[20:11:20] <logmsgbot>	 !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:780881|[wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA (T303963)]] (duration: 00m 48s)
[20:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:24] <stashbot>	 T303963: Undeploy Safety Survey for EN, ES wikis from BETA - https://phabricator.wikimedia.org/T303963
[20:12:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:12:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:51] <wikibugs>	 (03CR) 10Zabe: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:12:52] <logmsgbot>	 !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:780881|[wmf-config] Undeploy Safety Survey for EN, ES wikis from BETA (T303963)]] (duration: 00m 49s)
[20:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:31] <brennen>	 ^ first of those was a no-op, forgot a rebase.
[20:13:40] <brennen>	 (i suppose really they're both no-ops, effectively.)
[20:14:22] <wikibugs>	 (03PS3) 10Clare Ming: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:14:34] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:14:55] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:02] <Juan_90264>	 Do not forget me
[20:15:27] <thcipriani>	 Juan_90264: don't worry, you're on our radar :D
[20:15:38] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:42] <wikibugs>	 (03Merged) 10jenkins-bot: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:15:49] <Juan_90264>	 Okay
[20:16:20] <wikibugs>	 (03PS4) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009
[20:16:31] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:16:44] <eigyan>	 thcipriani brennen validated on my end
[20:16:58] <eigyan>	 thank you all
[20:17:12] <cjming>	 zabe: your 2nd patch is up on mwdebug1001
[20:17:27] <eigyan>	 cjming many thanks
[20:17:30] <wikibugs>	 (03PS3) 10Zabe: webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673)
[20:17:41] <thcipriani>	 thanks eigyan :)
[20:18:20] <zabe>	 cjming, it's a doc-only change, I can't test anything
[20:18:35] <cjming>	 ok - syncing then
[20:19:37] <logmsgbot>	 !log cjming@deploy1002 Synchronized private/readme.php: Config: [[gerrit:768259|Write the same value to wmgSwiftConfig as to wmfSwiftConfig (T45956)]] (duration: 00m 48s)
[20:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:43] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:20:04] <wikibugs>	 (03PS2) 10Thcipriani: Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:20:48] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:21:16] <wikibugs>	 (03CR) 10Zabe: webperf: remove absented warm_up_coal_cache cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:22:04] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[20:22:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:22:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:25] <wikibugs>	 (03CR) 10Dave Pifke: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:26:03] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/filebackend.php: Config: [[gerrit:779856|Migrate $wmfSwiftConfig to $wmgSwiftConfig (T45956)]] (duration: 00m 49s)
[20:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:07] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:26:15] <thcipriani>	 ^ zabe  there's your last one
[20:26:23] <zabe>	 thanks :)
[20:26:26] <thcipriani>	 Juan_90264: could you rebase your patch for me? I'm having some trouble
[20:26:35] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:02] <wikibugs>	 (03CR) 10Dzahn: "it's NOT showing that warning in prod. though:" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:27:28] <wikibugs>	 (03CR) 10Dzahn: "is it because webperf1001 is still stretch?" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:27:32] <Juan_90264>	 Yes I can rebase
[20:27:35] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:37] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10dpifke) systemd in bullseye (and up?) appears to object if `User=nobody`, see: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969...
[20:27:50] <wikibugs>	 (03CR) 10Juan90264: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:28:01] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.23 ms
[20:28:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:28:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:57] <wikibugs>	 (03CR) 10Dave Pifke: webperf: migrate warm_up_coal_cache cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[20:29:50] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) p:05Triage→03Medium
[20:30:03] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH)
[20:31:03] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH)
[20:31:46] <Juan_90264>	 thcipriani: I'm not able to rebase, this error appears: "Could not perform action: The change could not be rebased due to a conflict during merge."
[20:32:04] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10RobH) This was detailed on the procurement task, and I've migrated the testing to this onsite related task.
[20:32:25] <wikibugs>	 (03PS6) 10Juan90264: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:32:46] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:32:47] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:05] <zabe>	 you need to do a manual rebase
[20:33:39] <zabe>	 https://www.mediawiki.org/wiki/Gerrit/Advanced_usage#Manually_rebase_(on_a_branch)
[20:34:29] <mutante>	 Juan_90264: git review -d 774834 ; git rebase -i origin/master ; (errors show up); git rebase --continue (it tells you where the error is).. manually look at the file and look for lines with "<<<". save, git add.. git rebase --continue until errors are gone, git commit --amend, git review
[20:34:51] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:30] <wikibugs>	 (03PS7) 10Thcipriani: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:37:53] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:38:21] <thcipriani>	 ^ Juan_90264 does the patch look correct to you now? I rebased.
[20:38:54] <thcipriani>	 if so can you +1?
[20:39:47] <wikibugs>	 (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:40:10] <Juan_90264>	 thcipriani: That's better.
[20:40:17] <thcipriani>	 cool :)
[20:40:32] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:40:38] <Juan_90264>	 I've never seen this error before
[20:41:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh)
[20:41:19] <thcipriani>	 it happens when something has merged since you made your change that conflicts with your change in a way git can't automatically resolve: requires a human
[20:42:14] <thcipriani>	 Juan_90264: live on mwdebug1002, check please :)
[20:42:39] <Juan_90264>	 I'll check
[20:46:15] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:46:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:39] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10jhathaway) I think having a syntax validity check would be a great first start. I think using yamllint, a ruby script or a short python script would work well:  ` $ yamllint -d...
[20:48:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:48:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:51] <Juan_90264>	 thcipriani: I tested and approved
[20:49:28] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=ats-be,name=cp2027.*
[20:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:32] <thcipriani>	 Juan_90264: cool, going live
[20:51:00] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774834|Add extendedconfirmed user group for testwiki (T302860)]] (duration: 01m 04s)
[20:51:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:03] <stashbot>	 T302860: Consider turning on extendedconfirmed user group for testwiki - https://phabricator.wikimedia.org/T302860
[20:51:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Cannot verify NTP status asw1-b12-drmrs - https://phabricator.wikimedia.org/T305840 (10cmooney) I've opened a case with Juniper, let's see what they say.
[20:51:12] <thcipriani>	 ^ Juan_90264 should be live now
[20:53:19] <Juan_90264>	 Change is already working, thanks for deploying thcipriani!
[20:53:36] <thcipriani>	 thanks for the change Juan_90264 :)
[20:53:39] <wikibugs>	 10SRE, 10conftool, 10Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10CDanis) @RLazarus were you going to work on the rest of this?  We still need more plumbing inside requestctl correct?
[20:56:27] <wikibugs>	 (03CR) 10Ebernhardson: "might need more work, or at least more tests. Trying to understand how this works with multiple clusters on each host." [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (owner: 10Ebernhardson)
[20:58:48] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049
[21:00:13] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049 (owner: 10Ahmon Dancy)
[21:00:52] <wikibugs>	 10SRE, 10Traffic: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis)
[21:01:25] <wikibugs>	 (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/781049 (owner: 10Ahmon Dancy)
[21:01:51] <wikibugs>	 (03PS1) 10Zabe: osm: migrate import_waterlines cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673)
[21:01:53] <wikibugs>	 (03PS1) 10Zabe: osm: remove absented import_waterlines cron [puppet] - 10https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673)
[21:02:36] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jhathaway) >>! In T211750#7853874, @Volans wrote: > Although there are no doubt that an automatic formatter is of great help, there are also...
[21:06:54] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/weight=100; selector: dc=codfw,service=ats-be,name=cp2027.*
[21:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:00] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=ats-be,name=cp2027.*
[21:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:49] <cdanis>	 !log enabled puppet on cp2027, restarted ats-be, & repooled after some manual testing T303534
[21:07:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) So I've been able to check the options here on the QFX5120 platform.  It is **not** possible to mix 10G and 25G SFP modules in the...
[21:10:22] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10cmooney) 05Open→03Resolved Thanks for the help on this one @Jclark-ctr.  All done with the testing you can remove those cables and leave them with our others.  thanks!
[21:12:15] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney)
[21:13:32] <wikibugs>	 (03PS1) 10Zabe: wikitech: migrate mw-xml cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781053 (https://phabricator.wikimedia.org/T273673)
[21:13:34] <wikibugs>	 (03PS1) 10Zabe: wikitech: remove absented mw-xml cron [puppet] - 10https://gerrit.wikimedia.org/r/781054 (https://phabricator.wikimedia.org/T273673)
[21:14:08] <wikibugs>	 (03Merged) 10jenkins-bot: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney)
[21:15:55] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223)
[21:17:31] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223) (owner: 10CDanis)
[21:19:45] <topranks>	 !log Updated netbox-extras / interface_automation script for Netbox to add logic to rename interfaces (CR769729)
[21:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10wiki_willy) a:03Jclark-ctr
[21:20:46] <wikibugs>	 (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34846/" [puppet] - 10https://gerrit.wikimedia.org/r/781053 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:26:44] <wikibugs>	 10SRE, 10conftool, 10Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10RLazarus) a:03RLazarus Yeah -- I can do the implementation but I'm not sure if we've settled on what we want it to look like.  I don't have a strong opini...
[21:29:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) Actually I should clarify, it *may* be possible to use the channel-speed syntax to configure the switch in blocks of 2, it allows...
[21:36:06] <wikibugs>	 (03PS1) 10Zabe: Stop writing to $wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956)
[21:38:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:38:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10cmooney) 05Open→03Resolved Closing ticket, duplicate.  Results detailed in https://phabricator.wikimedia.org/T303529#7856797
[21:38:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney)
[21:38:25] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:39:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10cmooney) 05Resolved→03Open Actually I spoke too soon, there is one other combination I want to check.  @Jclark-ctr could you move the 10G cable in port xe-0/0/1 to port xe-0/0/2...
[21:58:53] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) a:03Zabe
[21:59:35] <wikibugs>	 (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/1286/" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[22:01:04] <mutante>	 !log gitlab deleting runner-1026, creating runner-1027
[22:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:24] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe)
[22:11:31] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:12:04] <wikibugs>	 (03PS1) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060
[22:15:44] <wikibugs>	 (03PS2) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060
[22:15:49] <mutante>	 log gitlab deleting runner-1017, creating runner-1028
[22:15:53] <mutante>	 !log gitlab deleting runner-1017, creating runner-1028
[22:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:03] <wikibugs>	 (03CR) 10Ahmon Dancy: "This is a subset of changes that have been used in the train-dev branch for a long time.  Dealing with merge conflicts when pulling from m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy)
[22:17:49] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[22:19:55] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:24:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (10maryyang)
[22:28:01] <mutante>	 !log gitlab - deleting runner-1018, runner-1019, creating runner-1029, runner-1030 T297659
[22:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:06] <stashbot>	 T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659
[22:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:40:23] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) >>! In T211750#7856763, @jhathaway wrote: > Thanks for the detailed write up of all the issues. It would be great at some point to c...
[22:44:24] <AntiComposite>	 got a few reports of 503s
[22:44:34] <AntiComposite>	 I didn't see them, seems to have recovered now
[22:50:31] <wikibugs>	 (03PS5) 10Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129)
[22:57:10] <urbanecm>	 AntiComposite: i saw them, but everything seems to work fine on my end now.
[23:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:09:59] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 31.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:10:19] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 33.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:10:25] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 16.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:10:25] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 37.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:10:47] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 17.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:12:25] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 73.78 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:12:31] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:12:33] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:12:35] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:12:40] <AntiComposite>	 (expected)
[23:12:55] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:14:11] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 101.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:26:09] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup2002: Fix up lvm params for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/781064
[23:26:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2002: Fix up lvm params for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/781064 (owner: 10Andrew Bogott)
[23:30:30] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup2002: Fix up lvm params for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/781070
[23:31:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2002: Fix up lvm params for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/781070 (owner: 10Andrew Bogott)
[23:33:15] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup2002: further attempt to keep lvm happy [puppet] - 10https://gerrit.wikimedia.org/r/781077
[23:34:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2002: further attempt to keep lvm happy [puppet] - 10https://gerrit.wikimedia.org/r/781077 (owner: 10Andrew Bogott)