[00:03:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P34557 and previous config saved to /var/cache/conftool/dbconfig/20220913-000340-ladsgroup.json
[00:04:55] <wikibugs>	 (PS1) Dzahn: phabricator: move scap user sudo defaults to file, fix puppet [puppet] - https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259)
[00:05:20] <wikibugs>	 (CR) Dzahn: "as done in modules/profile/manifests/toolforge/base.pp" [puppet] - https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[00:08:04] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: move scap user sudo defaults to file, fix puppet [puppet] - https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[00:14:46] <wikibugs>	 (PS1) Dzahn: phabricator: use content, not source with a plain file [puppet] - https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259)
[00:15:16] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: use content, not source with a plain file [puppet] - https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[00:15:56] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] phabricator: use content, not source with a plain file [puppet] - https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[00:18:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34558 and previous config saved to /var/cache/conftool/dbconfig/20220913-001846-ladsgroup.json
[00:18:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[00:18:50] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[00:19:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[00:19:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34559 and previous config saved to /var/cache/conftool/dbconfig/20220913-001908-ladsgroup.json
[00:21:06] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:30:26] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:00] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: syntax error in sudo
[00:48:15] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: syntax error in sudo
[00:49:02] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: syntax error in sudo
[00:49:18] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: syntax error in sudo
[00:49:44] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2001.codfw.wmnet with reason: syntax error in sudo
[00:49:59] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2001.codfw.wmnet with reason: syntax error in sudo
[00:51:09] <wikibugs>	 (PS1) Dzahn: phabricator: add double quotes around sudo config file [puppet] - https://gerrit.wikimedia.org/r/831640
[00:52:42] <wikibugs>	 (PS2) Dzahn: phabricator: add double quotes around sudo config file [puppet] - https://gerrit.wikimedia.org/r/831640
[00:53:02] <wikibugs>	 (PS3) Dzahn: phabricator: add double quotes around sudo config line [puppet] - https://gerrit.wikimedia.org/r/831640
[00:59:46] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: add double quotes around sudo config line [puppet] - https://gerrit.wikimedia.org/r/831640 (owner: Dzahn)
[01:06:32] <wikibugs>	 (PS1) Dzahn: phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - https://gerrit.wikimedia.org/r/831642
[01:08:32] <wikibugs>	 (PS2) Dzahn: phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - https://gerrit.wikimedia.org/r/831642
[01:08:38] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - https://gerrit.wikimedia.org/r/831642 (owner: Dzahn)
[01:08:43] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - https://gerrit.wikimedia.org/r/831642 (owner: Dzahn)
[01:24:04] <wikibugs>	 SRE, Observability-Metrics: SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (lmata)
[01:35:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34560 and previous config saved to /var/cache/conftool/dbconfig/20220913-013555-ladsgroup.json
[01:35:59] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:51:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P34561 and previous config saved to /var/cache/conftool/dbconfig/20220913-015102-ladsgroup.json
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0200)
[02:06:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P34562 and previous config saved to /var/cache/conftool/dbconfig/20220913-020608-ladsgroup.json
[02:06:14] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:07:19] <jinxer-wm>	 (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190)
[02:07:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[02:07:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:07:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:08:40] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: OpenSent - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:09:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:10:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:10:36] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page  - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:10:40] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:11:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5003.eqsin.wmnet, cp5006.eqsin.wmnet, cp5014.eqsin.wmnet, cp5005.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5014.eqsin.wmnet, cp5006.eqsin.wmnet, cp5005.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:11:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5014.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet, cp5005.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5003.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:11:36] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 148 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:11:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:19] <jinxer-wm>	 (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:12:24] <rzl>	 hi, looking
[02:12:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:13:00] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:14] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 346, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:13:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:13:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:13:29] <jhathaway>	 just acked, looking as well rzl 
[02:13:54] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:00] <rzl>	 looks like a transit blip in eqsin, I was just about to do a precautionary depool but maybe we're back? checking
[02:14:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:15:10] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:15:23] <rzl>	 HTTP requests received at eqsin went to zero for about a minute but look fully recovered
[02:15:26] <jhathaway>	 rzl: how did you pinpoint it to eqsin?
[02:15:28] <mutante>	 I am seeing "socket: permission denied" as the most common error message, but it looks as if it's over
[02:15:36] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page  - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:16:09] <rzl>	 jhathaway: just the alert text -- BGP status on cr3-eqsin, pybal alerts for cp5xxx, etc
[02:16:21] <jhathaway>	 good point!
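[editor's note] The approach rzl describes above, reading the affected site out of the hostnames in the alert text, can be sketched in Python. This is only an illustrative sketch: the function name `sites_in_alert`, the regular expression, and the list of site labels are assumptions based on the hostnames that appear in this log (for example `cr3-eqsin` and `cp5003.eqsin.wmnet`), not a tool used in the channel.

```python
import re
from collections import Counter

# Hypothetical pattern: matches the site label in names such as
# "cr3-eqsin" or "cp5003.eqsin.wmnet". The site list is an assumption
# drawn only from the sites that appear in this log.
SITE_RE = re.compile(r"[-.](eqsin|eqiad|codfw)\b")


def sites_in_alert(lines):
    """Count how often each site label appears across alert lines."""
    counts = Counter()
    for line in lines:
        counts.update(SITE_RE.findall(line))
    return counts


alerts = [
    "PROBLEM - BGP status on cr3-eqsin is CRITICAL",
    "PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: "
    "Servers cp5003.eqsin.wmnet, cp5006.eqsin.wmnet are marked down but pooled",
]

# The most frequently named site is the likely locus of the problem.
print(sites_in_alert(alerts).most_common(1))
```

A count like this only suggests where to look first; as in the conversation, it would still need confirming against traffic dashboards.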
[02:16:50] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:23] <jinxer-wm>	 (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:17:45] <rzl>	 and that HTTP requests data point was from https://grafana.wikimedia.org/goto/joxhmCGVz?orgId=1
[02:17:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: (3) Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:18:16] <rzl>	 I believe "firing: [...] recovered from" means it is a recovery
[02:19:38] <mutante>	 yea, the original "firing" is still shown as 6 minutes ago
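[editor's note] rzl's reading of the bot output (a message tagged "firing" whose text says "recovered from" is really a recovery) could be written down as a small classifier. This is a sketch under assumptions: the `(Name) status:` layout is inferred from the jinxer-wm lines in this log rather than from any documented format, and `classify` is a hypothetical helper that simply encodes rzl's guess.

```python
import re

# Hypothetical layout inferred from this log: "(AlertName) firing: ..."
# or "(AlertName) resolved: ...".
STATUS_RE = re.compile(r"^\((?P<name>.*?)\)\s+(?P<status>firing|resolved):")


def classify(message):
    """Return 'recovery', 'firing', or 'unknown' for one bot message."""
    match = STATUS_RE.match(message.strip())
    if match is None:
        return "unknown"
    # Treat an explicit "resolved" status, or "recovered from" in the
    # text of a "firing" message, as a recovery (rzl's interpretation).
    if match["status"] == "resolved" or "recovered from" in message:
        return "recovery"
    return "firing"


print(classify(
    "(Primary outbound port utilisation over 80%  #page) firing: (3) "
    "Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Primary outbound "
    "port utilisation over 80%  #page"
))
```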
[02:20:16] <rzl>	 best guess, we dropped about 950,000 requests for cache-text in eqsin
[02:20:22] <rzl>	 over the course of about a minute
[02:20:32] <rzl>	 no lossage for upload, curiously
[02:21:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34563 and previous config saved to /var/cache/conftool/dbconfig/20220913-022114-ladsgroup.json
[02:21:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:21:19] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:21:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:21:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34564 and previous config saved to /var/cache/conftool/dbconfig/20220913-022136-ladsgroup.json
[02:21:48] <rzl>	 I don't see anything still broken though, and I don't think there's any sense in depooling just in case it happens again -- maybe if there's a second blip we'd depool before there's a third, but not otherwise
[02:22:16] <rzl>	 jhathaway, mutante: anything you think we need to do here? if not I'll write a very tiny IR and call it a night
[02:22:16] <mutante>	 also see mail from fastnetmon, btw
[02:22:33] <mutante>	 No, I agree with you 
[02:22:36] <jhathaway>	 rzl: nope, I think that makes sense
[02:22:43] <rzl>	 yeah, I assume we lost one transit route and so we saturated the other
[02:22:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: (2) Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:23:58] <mutante>	 clicking that link first showed "10 min ago" and then it disappeared
[02:24:08] <mutante>	 kind of not matching the IRC message
[02:24:10] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:24:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[02:27:24] <mutante>	 the RIPE Atlas probes are green again when looking from Singapore itself
[02:27:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:30:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:31:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:31:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:32:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0300)
[03:01:14] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190)
[03:01:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[03:01:32] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:57] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[03:02:22] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.1  refs T314190
[03:02:25] <stashbot>	 T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[03:06:12] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:08:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:08:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:09:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:13:10] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:37:59] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1  refs T314190 (duration: 35m 37s)
[03:38:02] <stashbot>	 T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[03:39:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:40:01] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.39.0-wmf.27 (duration: 01m 59s)
[03:46:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:46:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:50:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:50:20] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[03:52:40] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:53:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:58:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:01:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:01:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:08:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:12:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34565 and previous config saved to /var/cache/conftool/dbconfig/20220913-041251-ladsgroup.json
[04:12:55] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:27:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P34566 and previous config saved to /var/cache/conftool/dbconfig/20220913-042758-ladsgroup.json
[04:43:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P34567 and previous config saved to /var/cache/conftool/dbconfig/20220913-044304-ladsgroup.json
[04:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34568 and previous config saved to /var/cache/conftool/dbconfig/20220913-045811-ladsgroup.json
[04:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:58:15] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:58:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:58:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34569 and previous config saved to /var/cache/conftool/dbconfig/20220913-045832-ladsgroup.json
[05:00:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 249 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:18:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 257 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:28:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:35:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:42:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 192 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:53:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0600).
[06:00:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:00:52] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:07:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:09:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34570 and previous config saved to /var/cache/conftool/dbconfig/20220913-060938-ladsgroup.json
[06:09:42] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:17:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:21:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:24:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P34571 and previous config saved to /var/cache/conftool/dbconfig/20220913-062444-ladsgroup.json
[06:26:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:33:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 285 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:38:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:38:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:39:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:39:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34572 and previous config saved to /var/cache/conftool/dbconfig/20220913-063908-ladsgroup.json
[06:39:11] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:39:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P34573 and previous config saved to /var/cache/conftool/dbconfig/20220913-063951-ladsgroup.json
[06:42:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 192 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:49:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend Tumult labs access by a week, current contract extension still WIP [puppet] - 10https://gerrit.wikimedia.org/r/831766
[06:53:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend Tumult labs access by a week, current contract extension still WIP [puppet] - 10https://gerrit.wikimedia.org/r/831766 (owner: 10Muehlenhoff)
[06:54:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 265 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:54:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34574 and previous config saved to /var/cache/conftool/dbconfig/20220913-065457-ladsgroup.json
[06:54:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:55:02] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:55:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[06:55:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:03:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 289 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:06:16] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[07:08:20] <wikibugs>	 (03PS3) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013)
[07:11:40] <logmsgbot>	 !log jhuneidi@deploy1002 deploy-promote aborted:  (duration: 00m 09s)
[07:13:14] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190)
[07:13:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[07:13:18] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:00] <jeena>	 we are rolling back from testwikis due to the high rate of fatals since the sync
[07:14:03] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[07:14:18] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.28  refs T314190
[07:14:21] <stashbot>	 T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[07:14:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[07:16:46] <moritzm>	 log installing zlib security updates on buster
[07:17:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:18:47] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.28  refs T314190 (duration: 04m 29s)
[07:19:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:24:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:24:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:24:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:24:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:31:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:36:20] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm)
[07:40:06] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm)
[07:43:27] <wikibugs>	 (03PS1) 10Cathal Mooney: Depool codfw prior to core router upgrades. [dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690)
[07:43:43] <wikibugs>	 (03CR) 10Hashar: systemd: allow changing override filename (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[07:43:58] <wikibugs>	 (03PS2) 10Hashar: systemd: allow changing override filename [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637)
[07:46:56] <wikibugs>	 (03PS5) 10Hashar: jenkins: use upstream systemd definition [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637)
[07:48:21] <wikibugs>	 (03CR) 10Hashar: "I have rebased since the parent change had some tweaks." [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[07:51:46] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] Depool codfw prior to core router upgrades. [dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney)
[08:03:46] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez)
[08:06:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (10Vgutierrez) 05Open→03Stalled I'm waiting for a while after merging https://gerrit.wikimedia.org/r/831528, next steps aren't feasible in the short term
[08:09:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:10:00] <hashar>	 the train got rolled back by jeena at ~ 7:15 UTC
[08:10:07] <hashar>	 so we are solely running 1.39.0-wmf.28
[08:11:39] <hashar>	 and there are a bunch of errors from MySQLPrimaryPos such as `PHP Notice: Undefined index: position`  | ` InvalidArgumentException: GTID set cannot be empty.`
[08:13:43] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Depool codfw prior to core router upgrades. [dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney)
[08:15:27] <topranks>	 !log de-pooling codfw ahead of core router upgrades at the site
[08:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:22] <jynus>	 hashar: are the errors ongoing? or were past?
[08:17:47] <moritzm>	 !log roll-restarting apache/FPM on mw canaries to pick up zlib security updates
[08:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:12] <hashar>	 jynus: it is ongoing, apparently due to a serialization issue between .28 and the new 1.40.0-wmf.1
[08:18:16] <hashar>	 I am digging ;-]
[08:19:33] <jynus>	 what I see is "Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending" on the job queue
[08:23:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 168 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:27:47] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.cf
[08:27:47] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[08:27:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet
[08:27:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:28:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:28:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add the locations of the new hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/831532 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis)
[08:31:35] <hashar>	 hmm I am not qualified for that serialization issue, it is marked as a blocker ( https://phabricator.wikimedia.org/T317606 )
[08:32:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:32:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:33:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[08:34:17] <wikibugs>	 (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[08:35:05] <hashar>	 jbond: I think the first can be merged right now, the second affects Jenkins and requires some manual steps for deployment but I am gathering evidence for the mw train blocker ;)
[08:36:22] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet
[08:37:30] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10Tanuja_Doriya) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks...
[08:37:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:39:56] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet
[08:40:26] <hashar>	 Amir1: duesen: I am around for the train blocker if need be, but I can't say I understand what is going on :-\
[08:40:34] <wikibugs>	 (03PS1) 10MVernon: swift: remove ms-be20[28-39] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549)
[08:41:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10WMDE-leszek)
[08:42:12] <hashar>	 and there are bunches of `PHP Notice: apcu_fetch(): Error at offset 42 of 856 bytes`
[08:43:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34575 and previous config saved to /var/cache/conftool/dbconfig/20220913-084307-ladsgroup.json
[08:43:11] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:43:14] <Amir1>	 hashar: the error doesn't happen on the new version, it happens in the old. It's just an incompatibility between the old and new serialization
[08:43:45] <hashar>	 what I noticed is that the chronologyprotector cache key version got bumped
[08:44:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10WMDE-leszek) Tagging wmf-legal was a mistake, apologies.
[08:44:02] <hashar>	 so I kind of expect the caches to be namespaced by that
[08:44:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap, wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Tanuja_Doriya)
[08:46:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet
[08:46:59] <topranks>	 !log Disabled LVS/PyBal peerings on cr1-codfw in advance of upgrade to router.
[08:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:18] <wikibugs>	 (03CR) 10Vgutierrez: "please let's move forward with this. It is taking too much time|energy" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[08:53:59] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db2112 to s1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614)
[08:54:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T317614
[08:54:05] <stashbot>	 T317614: Switchover codfw s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T317614
[08:54:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T317614
[08:54:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2112 with weight 0 T317614', diff saved to https://phabricator.wikimedia.org/P34576 and previous config saved to /var/cache/conftool/dbconfig/20220913-085456-marostegui.json
[08:56:31] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm)
[08:56:37] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10JMeybohm)
[08:56:40] <topranks>	 !log Flipping primary routing engine to RE1 on cr1-codfw (disruptive) as part of upgrade.
[08:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:47] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) 05Open→03Resolved Merged as `sre.k8s.pool-depool-cluster`
[08:57:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "I think switchmaster can do this 😄" [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614) (owner: 10Marostegui)
[08:57:11] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2112 to s1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614) (owner: 10Marostegui)
[08:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P34577 and previous config saved to /var/cache/conftool/dbconfig/20220913-085814-ladsgroup.json
[08:59:21] <icinga-wm>	 PROBLEM - Host cr1-codfw #page is DOWN: PING CRITICAL - Packet loss = 100%
[08:59:38] <jynus>	 I think that is expected?
[08:59:42] <vgutierrez>	 I think so
[08:59:48] <Amir1>	 ok
[09:00:02] <jynus>	 but we should make sure it is 0 impact
[09:00:02] * Emperor arrives from the p.age
[09:00:19] <Amir1>	 gonna ack it
[09:01:02] <Emperor>	 this is the work topranks is doing, presumably?
[09:01:12] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "I went through the linked tasks from 2db4b19f660 and couldn't find the reason why we need to extend ephemeral port range. Do we actually h" [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: 10Dzahn)
[09:01:20] <topranks>	 em yes...
[09:01:33] <topranks>	 Emperor: yes... apologies, thought I'd downtimed the host
[09:01:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:01:54] <icinga-wm>	 PROBLEM - Host cr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:02:28] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt with reason: router upgrade
[09:02:46] <wikibugs>	 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Page on etcdmirror critical status - https://phabricator.wikimedia.org/T317402 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert
[09:02:50] <topranks>	 I messed up the syntax, it seems, and it didn't do what I thought.
[09:02:54] <topranks>	 apologies for noise
[09:02:55] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt with reason: router upgrade
[09:02:55] <icinga-wm>	 RECOVERY - Host cr1-codfw #page is UP: PING OK - Packet loss = 0%, RTA = 45.62 ms
[09:03:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=927fadc1-f5b2-478f-95ce-98bfc47881a9) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th...
[09:03:07] <jynus>	 I can access codfw and edit as normal
[09:03:19] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:04:16] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:06:10] <jynus>	 topranks: for clarification, I know codfw was depooled, but the page itself was not expected within your maintenance?
[09:06:12] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:02] <topranks>	 jynus: Page shouldn't have fired - host should have been downtimed but I'd messed up the command and that didn't happen
[09:07:08] <topranks>	 It's properly downtimed now.
[09:07:20] <jynus>	 ok, that is not important
[09:07:27] <jynus>	 but the maintenance was done as expected, right?
[09:07:54] <topranks>	 maintenance is ongoing, but all going to plan, everything right now routing via cr2-codfw so no services should be affected
[09:08:03] <jynus>	 thanks for clarification
[09:08:13] <jynus>	 just a monitoring issue then
[09:08:18] <icinga-wm>	 RECOVERY - Host cr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms
[09:08:30] <jynus>	 *downtime issue
[09:08:50] <topranks>	 yeah exactly
[09:11:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.network.cf
[09:11:06] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[09:13:03] <wikibugs>	 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez)
[09:13:14] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:18] <wikibugs>	 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) p:05Triage→03Medium
[09:13:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P34578 and previous config saved to /var/cache/conftool/dbconfig/20220913-091320-ladsgroup.json
[09:14:18] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap, wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Aklapper)
[09:14:20] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10Aklapper)
[09:15:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Aklapper)
[09:19:42] <marostegui>	 !log Starting s1 codfw failover from db2103 to db2112 - T317614
[09:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:45] <stashbot>	 T317614: Switchover codfw s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T317614
[09:20:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 primary T317614', diff saved to https://phabricator.wikimedia.org/P34579 and previous config saved to /var/cache/conftool/dbconfig/20220913-092032-root.json
[09:21:42] <wikibugs>	 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez)
[09:22:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103 T317614', diff saved to https://phabricator.wikimedia.org/P34580 and previous config saved to /var/cache/conftool/dbconfig/20220913-092200-root.json
[09:23:41] <wikibugs>	 (03PS1) 10Marostegui: db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831824
[09:24:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831824 (owner: 10Marostegui)
[09:24:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd: allow changing override filename [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[09:24:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] jenkins: use upstream systemd definition [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar)
[09:25:01] <jbond>	 marostegui: happy for me to merge yours
[09:25:33] <hashar>	 !log Stopped Puppet on contint2001 for a Jenkins systemd change
[09:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:49] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device cr1-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:26:52] <marostegui>	 jbond: there are multiple puppet changes pending from you
[09:27:02] <marostegui>	 jbond: go ahead
[09:27:11] <jbond>	 ack merging (cc hashar )
[09:27:27] <hashar>	 ready to run puppet and validate on releases1002
[09:27:37] <hashar>	 will continue over private chat
[09:28:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34581 and previous config saved to /var/cache/conftool/dbconfig/20220913-092826-ladsgroup.json
[09:28:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[09:28:30] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:28:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[09:28:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:28:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:29:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34582 and previous config saved to /var/cache/conftool/dbconfig/20220913-092904-ladsgroup.json
[09:33:14] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[09:33:38] <hashar>	 !log Enabling Puppet on contint2001 for Jenkins systemd change
[09:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:48] <jinxer-wm>	 (Emergency syslog message) resolved: Device cr1-codfw.wikimedia.org recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:37:04] <hashar>	 !log Restarting CI Jenkins on contint2001 (with new systemd service)
[09:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:49] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device cr1-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:41:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[09:41:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10HasanAkgun_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public task...
[09:42:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet
[09:43:28] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831564
[09:45:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons.
[09:45:42] <wikibugs>	 (CR) Sergio Gimeno: [C: -1] "Do not merge until T305406 is resolved" [mediawiki-config] - https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: Sergio Gimeno)
[09:46:08] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2103: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831564 (owner: Marostegui)
[09:46:49] <jinxer-wm>	 (Device rebooted) firing: Alert for device cr1-codfw.wikimedia.org - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[09:50:50] <wikibugs>	 SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (Aklapper) Hi, please use the template at https://phabricator.wikimedia.org/project/profile/1564/ and update any potential onboarding docs, if applicable. Thanks.
[09:50:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging2002.codfw.wmnet with reason: Kafka PKI upgrade
[09:51:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging2002.codfw.wmnet with reason: Kafka PKI upgrade
[09:51:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34583 and previous config saved to /var/cache/conftool/dbconfig/20220913-095137-root.json
[09:51:45] <wikibugs>	 (CR) Elukey: [C: +2] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[09:51:49] <jinxer-wm>	 (Device rebooted) resolved: Device cr1-codfw.wikimedia.org recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[09:52:46] <elukey>	 !log move kafka-logging2002 to PKI-based TLS certs
[09:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:22] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM, also adding kieth for an additional sanity check" [puppet] - https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: JHathaway)
[10:00:54] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Milimetric)
[10:02:52] <wikibugs>	 (PS1) Elukey: role::kafka::logging: move kafka on all codfw nodes to PKI certificates [puppet] - https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130)
[10:04:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons.
[10:04:29] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37242/console" [puppet] - https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[10:05:00] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:06:18] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:21] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Milimetric) > @Milimetric Can you verify that this really is what you want? Going forward, it sounds like perhaps amending the documentation to be a little...
[10:06:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34584 and previous config saved to /var/cache/conftool/dbconfig/20220913-100642-root.json
[10:10:26] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:00] <wikibugs>	 SRE, Data-Persistence, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez) @MatthewVernon we would like to get your input here. Before tuning Swift's current TLS termination we'd like to know what are your plans regarding it. Is a migration to envoy in...
[10:14:26] <wikibugs>	 (PS1) Muehlenhoff: ml-etcd: Also include staging hosts [puppet] - https://gerrit.wikimedia.org/r/831832
[10:16:03] <topranks>	 !log Flipping master RE on cr1-codfw to backup as part of upgrade
[10:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:17:45] <wikibugs>	 (PS1) FNegri: Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219)
[10:20:38] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:20:50] <wikibugs>	 (CR) Elukey: "We don't have ml-etcd staging nodes, is it the right alias?" [puppet] - https://gerrit.wikimedia.org/r/831832 (owner: Muehlenhoff)
[10:21:20] <wikibugs>	 (CR) Elukey: [C: +1] "of course we have, sorry, forgot about them :D" [puppet] - https://gerrit.wikimedia.org/r/831832 (owner: Muehlenhoff)
[10:21:36] <icinga-wm>	 PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:21:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34585 and previous config saved to /var/cache/conftool/dbconfig/20220913-102147-root.json
[10:22:06] <icinga-wm>	 PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:22:16] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:22:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:22:32] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:22:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34586 and previous config saved to /var/cache/conftool/dbconfig/20220913-102232-ladsgroup.json
[10:22:36] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:22:52] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:24:22] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Jclark-ctr) Resolved→Open sudo cookbook -d sre.dns.netbox  This command is requiring me to enter password and not working
[10:24:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:00] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:54] <icinga-wm>	 RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:26:14] <wikibugs>	 (PS1) Krinkle: rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606)
[10:26:26] <icinga-wm>	 RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:26:36] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:26:50] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:26:56] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:26:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:27:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:27:59] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) @Jclark-ctr you need to use the `secure-cookbook` binary instead of the `cookbook` one. See also the related patch above for how thats configu...
[10:31:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:35:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T317627
[10:35:39] <stashbot>	 T317627: Switchover s2 codfw master (db2104 -> db2107) - https://phabricator.wikimedia.org/T317627
[10:35:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T317627
[10:35:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons.
[10:36:09] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:36:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:36:19] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2107 to s2 codfw master [puppet] - https://gerrit.wikimedia.org/r/831837 (https://phabricator.wikimedia.org/T317627)
[10:36:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2107 with weight 0 T317627', diff saved to https://phabricator.wikimedia.org/P34587 and previous config saved to /var/cache/conftool/dbconfig/20220913-103621-marostegui.json
[10:36:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:36:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2107 from api T317627', diff saved to https://phabricator.wikimedia.org/P34588 and previous config saved to /var/cache/conftool/dbconfig/20220913-103658-marostegui.json
[10:37:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34589 and previous config saved to /var/cache/conftool/dbconfig/20220913-103705-root.json
[10:38:26] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2107 to s2 codfw master [puppet] - https://gerrit.wikimedia.org/r/831837 (https://phabricator.wikimedia.org/T317627) (owner: Marostegui)
[10:39:08] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, some nits" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri)
[10:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:43:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:46:07] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v0.6.0 [software/homer] - https://gerrit.wikimedia.org/r/831838
[10:46:14] <wikibugs>	 (CR) Ladsgroup: [C: +2] rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606) (owner: Krinkle)
[10:48:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:50:10] <wikibugs>	 (CR) FNegri: [C: -1] "I tried running 'cookbook wmcs.ceph.osd.bootstrap_and_add --new-osd-fqdn cloudcephosd1030.eqiad.wmnet --only-check' and the jumbo ping che" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[10:52:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34590 and previous config saved to /var/cache/conftool/dbconfig/20220913-105210-root.json
[10:55:59] <marostegui>	 !log Starting s2 codfw failover from db2104 to db2107 - T317627
[10:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:02] <stashbot>	 T317627: Switchover s2 codfw master (db2104 -> db2107) - https://phabricator.wikimedia.org/T317627
[10:56:20] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.6.0 [software/homer] - https://gerrit.wikimedia.org/r/831838 (owner: Volans)
[10:56:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons.
[10:56:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2107 to s2 codfw primary T317627', diff saved to https://phabricator.wikimedia.org/P34591 and previous config saved to /var/cache/conftool/dbconfig/20220913-105642-marostegui.json
[10:57:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T317627', diff saved to https://phabricator.wikimedia.org/P34592 and previous config saved to /var/cache/conftool/dbconfig/20220913-105733-root.json
[10:59:46] <wikibugs>	 (PS1) Cathal Mooney: Disable VRRP auth between CRs in codfw [homer/public] - https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690)
[11:01:54] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.6.0 [software/homer] - https://gerrit.wikimedia.org/r/831838 (owner: Volans)
[11:01:55] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (jcrespo) Ganeti exporter has been unavailable since 20:17:30: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1  W...
[11:02:57] <icinga-wm>	 PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 12 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[11:03:06] <wikibugs>	 (Merged) jenkins-bot: rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606) (owner: Krinkle)
[11:03:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[11:04:08] <wikibugs>	 (PS1) Btullis: Put the new hadoop nodes into service [puppet] - https://gerrit.wikimedia.org/r/831841 (https://phabricator.wikimedia.org/T311210)
[11:06:48] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Disable VRRP auth between CRs in codfw [homer/public] - https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[11:06:52] <wikibugs>	 (PS2) Jgreen: DMARC External Domain Verification for wikipedia.org and w.wiki. [dns] - https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401)
[11:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:07:15] <wikibugs>	 (CR) Cathal Mooney: [V: +2 C: +2] Disable VRRP auth between CRs in codfw [homer/public] - https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[11:07:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34593 and previous config saved to /var/cache/conftool/dbconfig/20220913-110715-root.json
[11:07:28] <wikibugs>	 (Merged) jenkins-bot: Disable VRRP auth between CRs in codfw [homer/public] - https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[11:07:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:07:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[11:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[11:07:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34594 and previous config saved to /var/cache/conftool/dbconfig/20220913-110755-ladsgroup.json
[11:07:58] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[11:08:02] <wikibugs>	 (CR) Jgreen: [C: +2] DMARC External Domain Verification for wikipedia.org and w.wiki. [dns] - https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: Jgreen)
[11:08:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:08:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:08:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34595 and previous config saved to /var/cache/conftool/dbconfig/20220913-110850-root.json
[11:09:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:09:32] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.1/includes/libs/rdbms/ChronologyProtector.php: Backport: [[gerrit:831847|rdbms: Bump ChronologyProtector cache key version (T317606)]] (duration: 03m 49s)
[11:09:35] <stashbot>	 T317606: PHP Notice: Undefined index: asOfTime - https://phabricator.wikimedia.org/T317606
[11:11:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:12:27] <icinga-wm>	 RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[11:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt
[11:14:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt
[11:15:02] <topranks>	 !log completed cr1-codfw upgrade, will proceed to cr2-codfw shortly
[11:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:20] <wikibugs>	 (PS1) Jgreen: Fix DMARC external domain verification records. [dns] - https://gerrit.wikimedia.org/r/831843 (https://phabricator.wikimedia.org/T211401)
[11:17:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:24] <wikibugs>	 (CR) Jgreen: [C: +2] Fix DMARC external domain verification records. [dns] - https://gerrit.wikimedia.org/r/831843 (https://phabricator.wikimedia.org/T211401) (owner: Jgreen)
[11:19:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: router upgrade
[11:21:12] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: router upgrade
[11:21:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34596 and previous config saved to /var/cache/conftool/dbconfig/20220913-112112-ladsgroup.json
[11:21:16] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:23:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549) (owner: MVernon)
[11:23:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34597 and previous config saved to /var/cache/conftool/dbconfig/20220913-112355-root.json
[11:24:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[11:24:33] <wikibugs>	 (CR) MVernon: [C: +2] swift: remove ms-be20[28-39] from the rings [puppet] - https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549) (owner: MVernon)
[11:27:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[11:27:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[11:27:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:28:09] <icinga-wm>	 PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7281 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:28:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:28:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34598 and previous config saved to /var/cache/conftool/dbconfig/20220913-112818-ladsgroup.json
[11:28:22] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:34:23] <hashar>	 !log Upgrading CI Jenkins T317418
[11:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:27] <stashbot>	 T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418
[11:36:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P34599 and previous config saved to /var/cache/conftool/dbconfig/20220913-113619-ladsgroup.json
[11:37:36] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Jgreen) Open→Resolved a:Jgreen ;; ANSWER SECTION: w.wiki._report._dmarc.wikimedia.org. 3600 IN TXT "v=DMARC1;"  ;; ANSWER SECTION: wiki...
[11:39:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34600 and previous config saved to /var/cache/conftool/dbconfig/20220913-113900-root.json
[11:39:06] <wikibugs>	 SRE, Data-Persistence, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon) Thanks for asking! No, we don't currently have a move to envoy on our roadmap (I'm afraid there is too much higher-priority stuff there right now), though I'm not opposed to...
[11:51:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P34601 and previous config saved to /var/cache/conftool/dbconfig/20220913-115125-ladsgroup.json
[11:54:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34602 and previous config saved to /var/cache/conftool/dbconfig/20220913-115405-root.json
[11:54:38] <icinga-wm>	 PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[11:57:09] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (SLyngshede-WMF) One of the hosts actually do report having "None" oper_vcpus, rather than 0.    ` instances[30] {'disk_usage': 51328,...
[11:57:57] <topranks>	 !Disabling transit and ixp BGP on cr2-codfw in advance of software upgrade
[11:58:04] <topranks>	 !log Disabling transit and ixp BGP on cr2-codfw in advance of software upgrade
[11:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (MoritzMuehlenhoff) >>! In T311288#8232068, @SLyngshede-WMF wrote: > One of the hosts actually do report having "None" oper_vcpus, rat...
[12:02:50] <icinga-wm>	 RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 76 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:03:35] <wikibugs>	 (PS1) Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - https://gerrit.wikimedia.org/r/831868
[12:03:37] <wikibugs>	 (PS1) Jbond: P:spicerack: add firmware directory [puppet] - https://gerrit.wikimedia.org/r/831869
[12:03:44] <icinga-wm>	 PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:04:41] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[12:05:46] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/831868 (owner: Jbond)
[12:06:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34603 and previous config saved to /var/cache/conftool/dbconfig/20220913-120632-ladsgroup.json
[12:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[12:06:36] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:06:46] <wikibugs>	 (PS2) Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - https://gerrit.wikimedia.org/r/831868
[12:06:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[12:06:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34604 and previous config saved to /var/cache/conftool/dbconfig/20220913-120653-ladsgroup.json
[12:06:55] <wikibugs>	 (PS2) Jbond: P:spicerack: add firmware directory [puppet] - https://gerrit.wikimedia.org/r/831869
[12:08:04] <wikibugs>	 (CR) Volans: [C: -1] "wrong config file?" [puppet] - https://gerrit.wikimedia.org/r/831869 (owner: Jbond)
[12:08:24] <wikibugs>	 (PS3) Jbond: P:spicerack: add firmware directory [puppet] - https://gerrit.wikimedia.org/r/831869
[12:09:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34605 and previous config saved to /var/cache/conftool/dbconfig/20220913-120910-root.json
[12:09:47] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37245/console" [puppet] - https://gerrit.wikimedia.org/r/831869 (owner: Jbond)
[12:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34606 and previous config saved to /var/cache/conftool/dbconfig/20220913-121204-ladsgroup.json
[12:12:08] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:12:46] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device cr2-codfw.wikimedia.org - Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[12:14:02] <wikibugs>	 (CR) Jbond: [V: +1] P:spicerack: add firmware directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831869 (owner: Jbond)
[12:14:36] <wikibugs>	 (CR) Btullis: [C: +2] Put the new hadoop nodes into service [puppet] - https://gerrit.wikimedia.org/r/831841 (https://phabricator.wikimedia.org/T311210) (owner: Btullis)
[12:16:38] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (SLyngshede-WMF) The problem is this host: dispatch-be1001.eqiad.wmnet which is configured to be down. It does in fact have no vCPUs a...
[12:17:46] <jinxer-wm>	 (Processor usage over 85%) resolved: Device cr2-codfw.wikimedia.org recovered from Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[12:21:46] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device cr2-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:24:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34607 and previous config saved to /var/cache/conftool/dbconfig/20220913-122415-root.json
[12:26:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:27:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P34608 and previous config saved to /var/cache/conftool/dbconfig/20220913-122710-ladsgroup.json
[12:31:46] <jinxer-wm>	 (Emergency syslog message) resolved: Device cr2-codfw.wikimedia.org recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:31:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:33:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (MoritzMuehlenhoff) >>! In T311288#8232088, @SLyngshede-WMF wrote: > The problem is this host: dispatch-be1001.eqiad.wmnet which is co...
[12:34:17] <wikibugs>	 (PS1) Jgreen: Add fundraising host frdm1001.frack.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/831870 (https://phabricator.wikimedia.org/T317443)
[12:36:00] <wikibugs>	 (PS1) Slyngshede: Downed VMs will report None as vCPU allocation. [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288)
[12:36:26] <wikibugs>	 (CR) Jgreen: [C: +2] Add fundraising host frdm1001.frack.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/831870 (https://phabricator.wikimedia.org/T317443) (owner: Jgreen)
[12:38:50] <wikibugs>	 (PS2) Slyngshede: Downed VMs will report None as vCPU allocation. [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288)
[12:41:10] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Observability-Metrics, Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (SLyngshede-WMF) The patch uses the oper_state of the instances, rather than just assuming that None should be 0....
[12:42:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P34609 and previous config saved to /var/cache/conftool/dbconfig/20220913-124217-ladsgroup.json
[12:46:31] <topranks>	 !log forcing non-graceful RE switchover on cr2-codfw as part of upgrade
[12:46:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34610 and previous config saved to /var/cache/conftool/dbconfig/20220913-124758-ladsgroup.json
[12:48:01] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:48:21] <wikibugs>	 (CR) Jcrespo: "LGTM, but note I don't have the context of this." [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) (owner: Slyngshede)
[12:52:38] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:52:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:52:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:53:52] <icinga-wm>	 PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:54:00] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:54:36] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:55:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:55:08] <icinga-wm>	 PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:56:12] <icinga-wm>	 RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:56:22] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:56:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:57:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:57:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34611 and previous config saved to /var/cache/conftool/dbconfig/20220913-125723-ladsgroup.json
[12:57:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[12:57:28] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:57:30] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:57:32] <icinga-wm>	 RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:57:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[12:57:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T314041)', diff saved to https://phabricator.wikimedia.org/P34612 and previous config saved to /var/cache/conftool/dbconfig/20220913-125745-ladsgroup.json
[12:59:53] <topranks>	 !log Switching active RE back to RE1 on cr2-codfw as firmware hadn't been loaded while it was master
[12:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1300)
[13:00:05] <jouncebot>	 phuedx and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1300)
[13:00:11] <phuedx>	 o/
[13:00:17] <koi>	 o/
[13:00:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1144 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:51] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:02:11] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:02:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P34613 and previous config saved to /var/cache/conftool/dbconfig/20220913-130304-ladsgroup.json
[13:03:07] <Lucas_WMDE>	 o/
[13:03:28] <Lucas_WMDE>	 I can deploy!
[13:03:35] <icinga-wm>	 RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:04:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:05:08] <wikibugs>	 (PS1) KartikMistry: Enable Section Translation in Odia Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300)
[13:05:47] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685 (owner: Phuedx)
[13:05:55] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685 (owner: Phuedx)
[13:06:44] <wikibugs>	 (Merged) jenkins-bot: Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685 (owner: Phuedx)
[13:07:14] <Lucas_WMDE>	 ok, grep confirms wmf-config/InitialiseSettings.php is the only remaining file with a reference to SearchRelevancePages
[13:07:43] <Lucas_WMDE>	 phuedx: the first change is on mwdebug1001, do you quickly want to check that nothing’s broken?
[13:07:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:07:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance
[13:07:52] <Lucas_WMDE>	 (otherwise I’m also okay with syncing it directly, looks safe enough)
[13:07:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance
[13:08:09] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:08:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:08:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:08:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance
[13:08:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance
[13:08:53] <phuedx>	 Lucas_WMDE: A spot check of a couple of wikis on mwdebug1001 and I see no obvious breakages. As you say, the variable isn't used anywhere
[13:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:08:59] <Lucas_WMDE>	 ok!
[13:10:11] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:11:18] <wikibugs>	 (PS4) Jbond: P:spicerack: add firmware directory [puppet] - https://gerrit.wikimedia.org/r/831869
[13:11:29] <Lucas_WMDE>	 reviewing the second change
[13:11:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34614 and previous config saved to /var/cache/conftool/dbconfig/20220913-131148-ladsgroup.json
[13:11:49] <Lucas_WMDE>	 apparently conf-labs-en_rtlwiki.json gets "rate": 0, not "rate": 1, in the diffConfig output
[13:11:52] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:12:19] <Lucas_WMDE>	 I guess that’s not in the wikipedia dblist
[13:12:23] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37246/console" [puppet] - https://gerrit.wikimedia.org/r/831869 (owner: Jbond)
[13:12:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:01] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:13:14] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: Phuedx)
[13:13:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824685|Remove $wgWMESearchRelevancePages]] (unused) (duration: 03m 53s)
[13:13:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:05] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "diffConfig looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: Phuedx)
[13:14:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:46] <topranks>	 !log Flipping back to RE0 on cr2-codfw (last disruptive switch)
[13:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:55] <wikibugs>	 (Merged) jenkins-bot: testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: Phuedx)
[13:15:51] <Lucas_WMDE>	 phuedx: the edit_attempt change is on mwdebug1001, can you test it?
[13:15:55] <phuedx>	 On it
[13:17:49] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:17:49] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:17:49] <phuedx>	 Lucas_WMDE: LGTM. I see the stream definition on testwiki but not on enwiki or dewiki for example
[13:17:57] <Lucas_WMDE>	 ok \o/
[13:18:03] <Lucas_WMDE>	 thanks
[13:18:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P34615 and previous config saved to /var/cache/conftool/dbconfig/20220913-131811-ladsgroup.json
[13:18:27] <Lucas_WMDE>	 syncing
[13:19:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:19:15] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:19:18] <Emperor>	 !log set thanos ring replicas to 3.85 T311690
[13:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:21] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[13:20:19] <wikibugs>	 (PS1) Elukey: admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915)
[13:20:27] <icinga-wm>	 PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:20:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:20:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:21:25] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:21:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:21:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:22:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:826234|testwiki: Add mediawiki.edit_attempt stream (T309013)]] (1/2) (duration: 03m 39s)
[13:22:06] <stashbot>	 T309013: EditAttemptStep Migration to MP - https://phabricator.wikimedia.org/T309013
[13:22:19] <icinga-wm>	 RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:22:23] <phuedx>	 Lucas_WMDE: Thanks <3
[13:22:27] <Lucas_WMDE>	 np :)
[13:22:30] <Lucas_WMDE>	 (still syncing IS-labs ^^)
[13:23:02] <Lucas_WMDE>	 koi: don’t you think it’s too early for that tnwiki change? it doesn’t look like anyone else voted for it (or reacted at all) at https://tn.wikipedia.org/wiki/Wikipedia:Patlelo_ya_set%C5%A1haba#Enabling_Extended_Confirmed_User_Group
[13:23:18] <Lucas_WMDE>	 I can see in the recent changes that Rebel Agent is the most active editor there, but they’re not the only one either
[13:24:05] <koi>	 hmm, more than one week passed, and the author of that task is a (temporary) sysop at tnwiki
[13:24:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:24:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:24:57] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:25:12] <Lucas_WMDE>	 let me see if there are any other recent-ish tnwiki config changes and what kind of community approval they had
[13:25:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:25:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:826234|testwiki: Add mediawiki.edit_attempt stream (T309013)]] (2/2) (duration: 03m 33s)
[13:26:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:26:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P34616 and previous config saved to /var/cache/conftool/dbconfig/20220913-132654-ladsgroup.json
[13:27:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:27:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:27:53] <Lucas_WMDE>	 hm, not finding much in the way of tnwiki config changes
[13:28:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:29:35] <Lucas_WMDE>	 koi: I don’t know if other deployers would handle this differently, but to me there’s not enough community consensus to deploy that, sorry
[13:29:52] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (HasanAkgun_WMDE)
[13:30:21] <koi>	 Lucas_WMDE: fair enough, I'll wait for another couple of days
[13:31:05] <Lucas_WMDE>	 I would feel better if the community page had an update like “if no one objects until X then this will be deployed”
[13:31:08] <Lucas_WMDE>	 but idk if that’s usual or not
[13:31:21] <Lucas_WMDE>	 if another deployer wants to go ahead with that config change, I don’t mind either
[13:33:17] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:33:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34617 and previous config saved to /var/cache/conftool/dbconfig/20220913-133317-ladsgroup.json
[13:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[13:33:22] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:33:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[13:33:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T314041)', diff saved to https://phabricator.wikimedia.org/P34618 and previous config saved to /var/cache/conftool/dbconfig/20220913-133339-ladsgroup.json
[13:36:01] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (karapayneWMDE) I am the Engineering manager for wikidata and I approve this request and confirm Hasan's affiliation with WMDE.
[13:41:29] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:42:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P34619 and previous config saved to /var/cache/conftool/dbconfig/20220913-134201-ladsgroup.json
[13:51:15] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:55:57] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:56:28] <sukhe>	 ^ restarting bird on doh/durum, so expected. should clear up by themselves
[13:56:47] <sukhe>	 if not, then it's a problem and we will see :)
[13:57:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34620 and previous config saved to /var/cache/conftool/dbconfig/20220913-135707-ladsgroup.json
[13:57:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:57:12] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:57:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:57:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34621 and previous config saved to /var/cache/conftool/dbconfig/20220913-135729-ladsgroup.json
[13:59:24] <wikibugs>	 (PS22) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/768723
[14:00:18] <wikibugs>	 (PS23) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/768723
[14:01:35] <wikibugs>	 (CR) Volans: [C: -1] "one typo, lgtm otherwise" [puppet] - https://gerrit.wikimedia.org/r/831869 (owner: Jbond)
[14:02:15] <wikibugs>	 (PS24) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/768723
[14:02:17] <wikibugs>	 (CR) Hnowlan: [C: +2] Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik)
[14:02:19] <wikibugs>	 (CR) Jbond: C:varnish: Rate limit hotlinking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768723 (owner: Jbond)
[14:03:16] <wikibugs>	 (CR) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[14:03:19] <wikibugs>	 (CR) Jbond: [C: +2] O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[14:07:41] <topranks>	 !log re-activating Transit on IX BGP on cr2-codfw
[14:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:29] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:12:46] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt
[14:12:47] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt
[14:13:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:16:06] <wikibugs>	 (03Merged) 10jenkins-bot: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: 10Vlad.shapik)
[14:17:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:28] <topranks>	 !log Core router upgrade in codfw complete - maintenance closed.
[14:18:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:20:05] <wikibugs>	 (03PS1) 10Cathal Mooney: Re-pool codfw after upgrading core routers on site [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690)
[14:26:06] <wikibugs>	 (03CR) 10MacFan4000: [C: 03+1] ExtensionDistributor: Add REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925) (owner: 10Jforrester)
[14:27:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney)
[14:28:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey)
[14:29:22] <icinga-wm>	 PROBLEM - MariaDB read only db_inventory #page on db2093 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.26-MariaDB-log, Uptime 6574s, event_scheduler: True, 285.80 QPS, connection latency: 0.004492s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:29:36] * volans here
[14:29:37] <jynus>	 is that maintenance?
[14:29:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Re-pool codfw after upgrading core routers on site [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney)
[14:29:43] <jynus>	 downtime expired maybe?
[14:29:48] <volans>	 rebooted
[14:29:48] <jynus>	 should cause no impact
[14:29:50] <marostegui>	 grrr
[14:29:50] * Emperor here
[14:29:52] <volans>	 uptime 1:52
[14:29:52] <marostegui>	 yeah
[14:29:55] <marostegui>	 fixing it
[14:30:00] <Emperor>	 TY
[14:30:04] <jynus>	 did it crash?
[14:30:09] <marostegui>	 nop
[14:31:15] <topranks>	 !re-pooling codfw on authdns after router upgrades completed.
[14:32:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:33:07] <marostegui>	 I am actually going to disable notifications for that host, it shouldn't have them enabled anymore
[14:33:16] <marostegui>	 It is no longer active orchestrator db master
[14:33:22] <marostegui>	 So it shouldn't create noise like this
[14:33:47] <jynus>	 +1
[14:33:50] <icinga-wm>	 RECOVERY - MariaDB read only db_inventory #page on db2093 is OK: Version 10.4.26-MariaDB-log, Uptime 6842s, read_only: True, event_scheduler: True, 70.98 QPS, connection latency: 0.004412s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:35:04] <jynus>	 marostegui: I saw some hosts that were paging but probably shouldn't on misc, this was one of them
[14:35:17] <jynus>	 but there were others (passive misc hosts)
[14:35:48] <marostegui>	 jynus: db2078 perhaps?
[14:35:50] <jynus>	 it is ok if they alerted without paging
[14:35:57] <jynus>	 let me see
[14:36:13] <wikibugs>	 (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891
[14:36:33] <jynus>	 that host doesn't exist anymore, right?
[14:36:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 (owner: 10Marostegui)
[14:36:59] <marostegui>	 yeah, I am asking in case you saw that one some time ago
[14:37:59] <jynus>	 MariaDB read only m1
[14:38:28] <wikibugs>	 (03PS2) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891
[14:38:49] <marostegui>	 jynus: not sure what that check is and where it comes from?
[14:38:51] <wikibugs>	 (03PS1) 10Volans: Release v0.6.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831892
[14:38:52] <marostegui>	 which host was that?
[14:38:53] <jynus>	 ok to deploy that, but probably better disabling paging
[14:39:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: analytics-reportupdater-logs-rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:39:34] <jynus>	 db2132 db2133 db2134 db2135
[14:39:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 (owner: 10Marostegui)
[14:40:12] <jynus>	 meaning probably a deeper review has to be done (but doesn't have to happen now)
[14:40:28] <jynus>	 to disable pages on non critical servers
[14:40:34] <marostegui>	 jynus: so what's wrong with those hosts?
[14:40:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) cr1-codfw and cr2-codfw successfully upgraded today.  Took a while with the firmware upgrades too, I've added some notes [[https://wikitech.wikimedia.o...
[14:41:04] <jynus>	 marostegui: as far as I understand, they page but have no user traffic (they are misc)
[14:41:18] <jynus>	 it is ok for them to alert, but paging may be too much
[14:41:33] <marostegui>	 jynus: ah ok, yeah. I don't think they should even send IRC notifications, icinga should be enough
[14:41:36] <marostegui>	 I will check them tomorrow
[14:41:55] <jynus>	 yeah, just noticing that, no rush now
[14:42:27] <marostegui>	 I can just do profile::monitoring::is_critical: false for them
[14:42:32] <marostegui>	 I will check tomorrow
[14:42:32] <wikibugs_>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:42:38] <jynus>	 yeah, another day with more time :-)
[14:42:40] <dancy>	 jouncebot now
[14:42:40] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 17 minute(s)
[14:42:53] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:43:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:28] <moritzm>	 !log installing libxslt security updates on buster
[14:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:00] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190)
[14:46:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[14:46:33] <moritzm>	 !log restarting FPM/Apache on mediawiki canaries
[14:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:44] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[14:47:01] <logmsgbot>	 !log dancy@deploy1002 prep aborted:  (duration: 00m 12s)
[14:47:01] <logmsgbot>	 !log dancy@deploy1002 deploy-promote aborted:  (duration: 01m 03s)
[14:47:11] <dancy>	 moritzm: Lemme know when you're done please.
[14:47:21] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) @Volans  still seems to have a issue ` Traceback (most recent call last):   File "/usr/lib/python3/dist-packages/wmflib/config.py", line 3...
[14:49:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:50:05] <wikibugs>	 (03PS5) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869
[14:50:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:50:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:50:20] <wikibugs>	 (03PS3) 10Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868
[14:50:26] <wikibugs>	 (03PS6) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869
[14:51:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:54:04] <moritzm>	 dancy: you can proceed, I'll do the rest when the deployments are complete
[14:54:18] <dancy>	 Thanks! My part should take about 3 minutes
[14:54:45] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.1  refs T314190
[14:54:48] <stashbot>	 T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[14:55:51] <wikibugs>	 (03PS2) 10KartikMistry: Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300)
[14:56:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34622 and previous config saved to /var/cache/conftool/dbconfig/20220913-145631-ladsgroup.json
[14:56:35] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[14:59:29] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1  refs T314190 (duration: 04m 43s)
[14:59:40] <dancy>	 moritzm: Back atcha
[15:00:50] <moritzm>	 dancy: cheers, I'll resume
[15:01:17] <wikibugs>	 (03CR) 10BCornwall: varnish/tests: Remove extraneous test checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[15:02:05] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.6.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831892 (owner: 10Volans)
[15:08:36] <logmsgbot>	 !log dancy@deploy1002 deploy-promote aborted:  (duration: 00m 02s)
[15:09:20] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190)
[15:09:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[15:10:10] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot)
[15:10:23] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.28  refs T314190
[15:10:26] <stashbot>	 T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[15:11:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P34623 and previous config saved to /var/cache/conftool/dbconfig/20220913-151138-ladsgroup.json
[15:12:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.0 - volans@cumin1001
[15:13:20] <wikibugs>	 (03PS1) 10Btullis: Failover hive to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/831906 (https://phabricator.wikimedia.org/T311807)
[15:13:51] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.0 - volans@cumin1001
[15:14:23] <wikibugs>	 (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) (owner: 10Jdlrobson)
[15:14:55] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.28  refs T314190 (duration: 04m 31s)
[15:16:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:17:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:17:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:17:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868 (owner: 10Jbond)
[15:17:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond)
[15:17:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitiz
[15:17:49] <icinga-wm>	 immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:spicerack: add firmware directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond)
[15:18:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:18:56] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Failover hive to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/831906 (https://phabricator.wikimedia.org/T311807) (owner: 10Btullis)
[15:23:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:25:53] <wikibugs>	 (03PS1) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908
[15:26:07] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:26:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P34624 and previous config saved to /var/cache/conftool/dbconfig/20220913-152644-ladsgroup.json
[15:30:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:30:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:31:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff)
[15:34:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: heavily sample k8s proxy/httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite)
[15:36:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:38:25] <wikibugs>	 (03PS9) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367
[15:40:32] <wikibugs>	 (03PS10) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367
[15:40:37] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@031604d]: Automatically drop hitsorical partitions of subgraph analysis
[15:41:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34625 and previous config saved to /var/cache/conftool/dbconfig/20220913-154151-ladsgroup.json
[15:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[15:41:54] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:41:58] <wikibugs>	 (03CR) 10BCornwall: varnish/tests: Remove extraneous test checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[15:42:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[15:42:45] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@031604d]: Automatically drop hitsorical partitions of subgraph analysis (duration: 02m 07s)
[15:42:58] <wikibugs>	 (03PS1) 10Hashar: gerrit: disable automatic plugin handling [puppet] - 10https://gerrit.wikimedia.org/r/831913 (https://phabricator.wikimedia.org/T317412)
[15:47:09] <icinga-wm>	 PROBLEM - Host db1189 #page is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:17] <rzl>	 here
[15:47:27] <jynus>	 a crash maybe?
[15:47:27] <marostegui>	 woot
[15:47:31] <marostegui>	 maybe 
[15:47:38] <jynus>	 marostegui: you depool?
[15:47:49] <marostegui>	 yes
[15:48:05] <jynus>	 checking impact meanwhile
[15:48:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P34626 and previous config saved to /var/cache/conftool/dbconfig/20220913-154810-root.json
[15:48:20] <rzl>	 letting the two of you run things but I'm here to help if needed :)
[15:48:34] <jynus>	 "Wikimedia\Rdbms\LoadMonitor::computeServerStates: host db1189 is unreachable" but that is to be expected
[15:48:56] <jynus>	 it should stop receiving connections and just error out on the retries
[15:49:09] <icinga-wm>	 RECOVERY - Host db1189 #page is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:49:16] <marostegui>	 it got rebooted
[15:50:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: down
[15:50:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: down
[15:50:34] <jynus>	 10.4, so shouldn't be an issue
[15:50:43] <marostegui>	 Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM.
[15:50:46] <marostegui>	 Memory 
[15:50:50] <marostegui>	 I will create a task
[15:51:01] <wikibugs>	 (03PS1) 10Hashar: gerrit: scap checks script to automatize deployment [puppet] - 10https://gerrit.wikimedia.org/r/831916 (https://phabricator.wikimedia.org/T317412)
[15:51:18] <jynus>	 not the master candidate, so we should be ok without it
[15:51:19] * jbond also here if needed
[15:51:47] <jynus>	 no more errors now
[15:52:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: scap checks script to automatize deployment [puppet] - 10https://gerrit.wikimedia.org/r/831916 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar)
[15:52:14] <wikibugs>	 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui)
[15:52:39] <wikibugs>	 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) p:05Triage→03Medium
[15:53:58] <rzl>	 marostegui, jynus: thanks!
[15:54:23] <wikibugs>	 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Started mysql for now. Will do a data check but will leave the host depooled. @Cmjohnson @Jclark-ctr once the DIMM is received and ready to  be replaced, please let us know so we can power off the host for you.
[15:55:03] <jynus>	 won't create an outage report, as even if we did nothing, there would be almost no user impact, just the summary on the handover doc
[15:55:17] <rzl>	 resolved in VO
[15:55:31] <jynus>	 only on-the-fly (read only) queries get affected + monitoring spam
[15:56:17] <jynus>	 handling it is also very well documented: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica
[15:56:48] <jynus>	 (so mw doesn't keep retrying connecting and alerting)
[15:56:49] <wikibugs>	 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10wiki_willy) a:05wiki_willy→03Cmjohnson
[15:57:53] <wikibugs>	 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10wiki_willy) @Cmjohnson - just a heads up, this was just recently installed, so it's under warranty for submitting a RMA with Dell.  Thanks, Willy
[15:59:40] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919
[15:59:42] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:04:42] <wikibugs>	 (03PS1) 10Marostegui: db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831922 (https://phabricator.wikimedia.org/T317662)
[16:05:21] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831922 (https://phabricator.wikimedia.org/T317662) (owner: 10Marostegui)
[16:05:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T314041)', diff saved to https://phabricator.wikimedia.org/P34628 and previous config saved to /var/cache/conftool/dbconfig/20220913-160536-ladsgroup.json
[16:05:40] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[16:07:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance
[16:07:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance
[16:07:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet with reason: Maintenance
[16:07:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet with reason: Maintenance
[16:09:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[16:09:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[16:11:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:09] <godog>	 !log add 200G to prometheus/eqiad instance ops
[16:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P34629 and previous config saved to /var/cache/conftool/dbconfig/20220913-162043-ladsgroup.json
[16:31:04] <wikibugs>	 (03PS2) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908
[16:33:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey)
[16:34:43] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] role::kafka::logging: move kafka on all codfw nodes to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[16:34:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff)
[16:35:02] <wikibugs>	 (03PS11) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367
[16:35:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P34630 and previous config saved to /var/cache/conftool/dbconfig/20220913-163549-ladsgroup.json
[16:35:55] <wikibugs>	 (03PS3) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908
[16:36:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:36:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:36:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:36:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:37:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:37:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:39:11] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:47:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34631 and previous config saved to /var/cache/conftool/dbconfig/20220913-164734-ladsgroup.json
[16:47:38] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[16:50:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T314041)', diff saved to https://phabricator.wikimedia.org/P34632 and previous config saved to /var/cache/conftool/dbconfig/20220913-165056-ladsgroup.json
[16:50:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:51:00] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[16:51:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:51:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34633 and previous config saved to /var/cache/conftool/dbconfig/20220913-165117-ladsgroup.json
[16:52:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34634 and previous config saved to /var/cache/conftool/dbconfig/20220913-165202-ladsgroup.json
[17:02:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34635 and previous config saved to /var/cache/conftool/dbconfig/20220913-170241-ladsgroup.json
[17:07:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P34636 and previous config saved to /var/cache/conftool/dbconfig/20220913-170708-ladsgroup.json
[17:14:45] <wikibugs>	 (03PS1) 10Hashar: gerrit: ignore lint error in role [puppet] - 10https://gerrit.wikimedia.org/r/831932
[17:14:47] <wikibugs>	 (03PS1) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933
[17:15:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar)
[17:16:48] <wikibugs>	 (03CR) 10Herron: "Masking (bulk?) mail originating from a 3rd party system as our own has risks.  High volume or problematic content could cause deliverabil" [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway)
[17:17:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34637 and previous config saved to /var/cache/conftool/dbconfig/20220913-171747-ladsgroup.json
[17:19:55] <wikibugs>	 (03PS2) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933
[17:22:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P34638 and previous config saved to /var/cache/conftool/dbconfig/20220913-172215-ladsgroup.json
[17:22:28] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar)
[17:24:56] <wikibugs>	 (03PS4) 10Ryan Kemper: wcqs/wdqs: New rolling restart nginx cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff)
[17:26:27] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:10] <wikibugs>	 (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/pcc-worker1003/1436/" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar)
[17:29:16] <wikibugs>	 (03PS1) 10Volans: wmf-netbox plugin: fix pynetbox issues [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936
[17:32:46] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936 (owner: 10Volans)
[17:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34639 and previous config saved to /var/cache/conftool/dbconfig/20220913-173254-ladsgroup.json
[17:32:58] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[17:36:59] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] wmf-netbox plugin: fix pynetbox issues [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936 (owner: 10Volans)
[17:37:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34640 and previous config saved to /var/cache/conftool/dbconfig/20220913-173721-ladsgroup.json
[17:37:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:37:25] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:37:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:40:21] <wikibugs>	 (03CR) 10BCornwall: admin: Add Hannah Okwelum to analytics-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall)
[17:40:55] <wikibugs>	 (03PS1) 10Volans: Deploy fix for the wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831937
[17:41:42] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Upgrade wmf-netbox plugin - volans@cumin1001
[17:43:19] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Upgrade wmf-netbox plugin - volans@cumin1001
[17:44:24] <wikibugs>	 (03Abandoned) 10Volans: Deploy fix for the wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831937 (owner: 10Volans)
[17:46:00] <wikibugs>	 (03PS3) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174)
[17:46:07] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686)
[17:47:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34642 and previous config saved to /var/cache/conftool/dbconfig/20220913-174718-ladsgroup.json
[17:47:22] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:47:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:51:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @Milimetric Thanks for the reply and for expanding the docs. I think that Wikitech is a more appropriate place for documentation of the group th...
[17:51:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 125 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:52:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] admin: Add Hannah Okwelum to analytics-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall)
[17:52:38] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] admin: Add Hannah Okwelum to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall)
[17:52:44] <wikibugs>	 (03PS2) 10BCornwall: admin: Add Hannah Okwelum to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545)
[17:53:54] <dancy>	 The mediawiki errors are https://phabricator.wikimedia.org/T317606
[17:54:17] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 24 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:56:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) @Jclark-ctr just to avoid misunderstanding, did you run it with sudo?  ` sudo secure-cookbook -d sre.dns.netbox "noop" `
[17:57:39] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:58:53] <wikibugs>	 (03PS1) 10MusikAnimal: InitialiseSettings-labs.php: Set $wgPhonosPath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417)
[18:00:04] <jouncebot>	 dancy and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1800).
[18:00:30] <dancy>	 The train is blocked on https://phabricator.wikimedia.org/T317606
[18:00:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P34643 and previous config saved to /var/cache/conftool/dbconfig/20220913-180225-ladsgroup.json
[18:02:57] <wikibugs>	 (03PS1) 10Cwhite: bugfixes [software/ecs] - 10https://gerrit.wikimedia.org/r/831942
[18:02:59] <wikibugs>	 (03PS1) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098)
[18:03:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bugfixes [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite)
[18:03:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite)
[18:05:05] <wikibugs>	 (03PS1) 10Dduvall: scap: Remove use of --preserve-env for sudo'd scripts [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953)
[18:05:07] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:05:51] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:06:14] <wikibugs>	 (03PS4) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174)
[18:06:15] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:37] <wikibugs>	 (03PS2) 10Cwhite: bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942
[18:06:39] <wikibugs>	 (03PS2) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098)
[18:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:07:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite)
[18:08:29] <wikibugs>	 (03Merged) 10jenkins-bot: bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite)
[18:08:31] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) That was without sudo.    With Sudo still requires password
[18:10:27] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:11:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 (owner: 10Jbond)
[18:11:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:12:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, it just need a manual clean/move of the existing files once deployed" [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 (owner: 10Jbond)
[18:13:09] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:15:23] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, and 3 others: Mediawiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10aaron)
[18:17:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:17:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P34644 and previous config saved to /var/cache/conftool/dbconfig/20220913-181731-ladsgroup.json
[18:17:44] <wikibugs>	 (03PS5) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174)
[18:19:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) Ahhh I think I know what happened here, it's the dry-run option. Try to run it for real: ` sudo secure-cookbook sre.dns.netbox "noop" `  @jbon...
[18:19:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:21:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:24:40] <TheresNoTime>	 Hi, I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/831941/, which only touches the beta cluster (`wmf-config/InitialiseSettings-labs.php`) — any pressing reasons why I shouldn't? I believe the current window for the train isn't happening because it's blocked
[18:25:04] <dancy>	 Go for it.
[18:25:10] <TheresNoTime>	 :)
[18:25:44] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "Beta cluster deploy, no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal)
[18:26:28] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Set $wgPhonosPath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal)
[18:27:11] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) >>! In T317545#8233696, @BCornwall wrote: > I think that Wikitech is a more appropriate place for documentation of the group than the codebase:...
[18:27:49] <wikibugs>	 (03PS1) 10Ssingh: P:wikidough: remove redundant resource absentees [puppet] - 10https://gerrit.wikimedia.org/r/831946
[18:28:35] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37247/console" [puppet] - 10https://gerrit.wikimedia.org/r/831946 (owner: 10Ssingh)
[18:28:49] <TheresNoTime>	 !log deploying a beta cluster only config change, T317417
[18:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:53] <stashbot>	 T317417: Phonos links to unroutable domain/URL for the MP3 file - https://phabricator.wikimedia.org/T317417
[18:29:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:29:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall)
[18:31:35] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:31:48] <logmsgbot>	 !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:831941|InitialiseSettings-labs.php: Set $wgPhonosPath (T317417)]] (duration: 03m 45s)
[18:32:08] <TheresNoTime>	 (done, thanks)
[18:32:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34645 and previous config saved to /var/cache/conftool/dbconfig/20220913-183238-ladsgroup.json
[18:32:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:32:45] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[18:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:33:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34646 and previous config saved to /var/cache/conftool/dbconfig/20220913-183259-ladsgroup.json
[18:33:25] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: remove redundant resource absentees [puppet] - 10https://gerrit.wikimedia.org/r/831946 (owner: 10Ssingh)
[18:33:55] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If that's true, would you be kind enough to prov...
[18:36:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:36:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:36:42] <wikibugs>	 (03PS1) 10Cwhite: logstash: expand ecs pre and post filter gates [puppet] - 10https://gerrit.wikimedia.org/r/831949 (https://phabricator.wikimedia.org/T292585)
[18:38:09] <wikibugs>	 (03PS1) 10Volans: cli: add --version option [software/homer] - 10https://gerrit.wikimedia.org/r/831951
[18:38:28] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686)
[18:38:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall Hi! Thanks for the request.  Could I get Hasan's shell username? I'm unable to find that information.
[18:39:27] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37248/console" [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper)
[18:40:41] <wikibugs>	 (03PS3) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933
[18:41:01] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar)
[18:41:21] <wikibugs>	 (03PS1) 10Cwhite: logstash: migrate mediawiki_ecs to ecs 1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/831952 (https://phabricator.wikimedia.org/T314098)
[18:41:28] <wikibugs>	 (03CR) 10Hashar: "Patchset 3 moves the Apache templates from the Gerrit module to the profile." [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar)
[18:42:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:43:07] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper)
[18:43:12] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper)
[18:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:46:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall Hi, Tanuja! I'll need approval from your manager before proceeding. Could you tag them here, please?
[18:46:09] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:46:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686
[18:46:58] <stashbot>	 T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686
[18:47:50] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686
[18:50:31] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10BCornwall) p:05Triage→03Medium a:03BCornwall
[18:52:56] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) p:05Triage→03Medium a:03ayounsi @ayounsi as you are a bureaucrat on wikitech, I choose you for the privilege of renaming!  (Thanks for doing that if you can!)
[18:54:23] <wikibugs>	 (03PS1) 10MusikAnimal: rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417)
[18:55:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal)
[19:00:26] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10taavi) @ayounsi @bcornwall Please don't if you don't fully understand the effects of an account rename on all the systems that use developer account/LDAP authentication.  We don't usually...
[19:00:41] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:00:45] <wikibugs>	 (03PS2) 10MusikAnimal: rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417)
[19:01:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686
[19:01:57] <stashbot>	 T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686
[19:08:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BTullis) >>! In T317545#8233883, @BCornwall wrote: > @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If tha...
[19:16:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34647 and previous config saved to /var/cache/conftool/dbconfig/20220913-191632-ladsgroup.json
[19:16:37] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:17:36] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) a:05ayounsi→03None
[19:17:40] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "I think a batch size of 1 with a 1 second delay is probably fine, given how fast nginx comes back up." [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff)
[19:17:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) 05Open→03Invalid Thanks for the information, @taavi. I missed the banner since the given link was an anchor, TBH. Given this, I think it's safe to close this as invalid.
[19:18:07] <icinga-wm>	 PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:43] <icinga-wm>	 PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:11] <inflatador>	 ^^ ryankemper I'm looking at 1080 now
[19:19:34] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686
[19:19:38] <stashbot>	 T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686
[19:20:34] <ryankemper>	 inflatador: we ran puppet across the whole fleet once, perhaps we should have run it twice
[19:21:42] <inflatador>	 ryankemper ACK, continuing convo in search
[19:28:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall)
[19:28:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10BCornwall)
[19:31:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P34648 and previous config saved to /var/cache/conftool/dbconfig/20220913-193139-ladsgroup.json
[19:32:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10WMDE-leszek) @taavi I am not sure I have understood the reasoning fully. The request is about removing the WMDE suffix that Hasan has accidentally included? Or is your suggestion to not r...
[19:34:30] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10Dzahn) Creating a new account will be MUCH easier than renaming. And if has just recently been created and therefore not much history then it especially makes the most sense to simply cre...
[19:35:44] <wikibugs>	 (03CR) 10JHathaway: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway)
[19:45:51] <icinga-wm>	 RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:46:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P34649 and previous config saved to /var/cache/conftool/dbconfig/20220913-194645-ladsgroup.json
[19:47:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10WMDE-leszek) alright thanks @Dzahn. I somehow managed to miss the topmost banner as well.
[19:51:07] <icinga-wm>	 RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:29] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686
[19:55:33] <stashbot>	 T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T2000).
[20:00:05] <jouncebot>	 Aishik: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:29] <cjming>	 o/
[20:00:32] <cjming>	 i can deploy
[20:00:37] <TheresNoTime>	 woo
[20:00:40] <TheresNoTime>	 ^^
[20:00:42] <cjming>	 lol
[20:01:11] <Aishik>	 🙂 TheresNoTime
[20:01:36] <Aishik>	 Your 'emoji'
[20:01:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34650 and previous config saved to /var/cache/conftool/dbconfig/20220913-200152-ladsgroup.json
[20:01:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[20:01:56] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[20:01:59] <cjming>	 hi Aishik: getting started with your patch
[20:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[20:02:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34651 and previous config saved to /var/cache/conftool/dbconfig/20220913-200214-ladsgroup.json
[20:02:24] <wikibugs>	 (03PS6) 10Clare Ming: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:03:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:03:14] <wikibugs>	 (03PS1) 10Gmodena: charts:eventstreams bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390)
[20:03:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[20:03:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[20:03:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34652 and previous config saved to /var/cache/conftool/dbconfig/20220913-200344-ladsgroup.json
[20:03:57] <cjming>	 Aishik: CI needs a newline -- can you take care of that? otherwise i can push up a quick fix
[20:04:21] <Aishik>	 Do it please
[20:04:42] <cjming>	 np
[20:05:52] <Aishik>	 (:
[20:06:18] <wikibugs>	 (03PS7) 10Clare Ming: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:06:57] <icinga-wm>	 PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:22] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:07:37] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:09] <wikibugs>	 (03Merged) 10jenkins-bot: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:08:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman)
[20:09:03] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]]
[20:09:06] <stashbot>	 T313174: add tagline and wordmark in ptwikinews  - https://phabricator.wikimedia.org/T313174
[20:09:24] <logmsgbot>	 !log cjming@deploy1002 cjming and aishik: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:09:26] <cjming>	 Aishik: can you verify on mwdebug?
[20:10:24] <Aishik>	 tagline is working!
[20:10:30] <cjming>	 yay - going live
[20:10:40] <Aishik>	 but not the wordmark!
[20:11:27] <icinga-wm>	 RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:11:34] <cjming>	 hmm - whoops - it might need to be purged - i already started the sync
[20:13:45] <Aishik>	 😴
[20:13:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:14:26] <cjming>	 TheresNoTime: i forget - do i run the purgeList script on the deployment server?
[20:14:53] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]] (duration: 05m 50s)
[20:14:56] <stashbot>	 T313174: add tagline and wordmark in ptwikinews  - https://phabricator.wikimedia.org/T313174
[20:15:02] <cjming>	 so something like: "echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikinews-tagline-pt.svg' | mwscript purgeList.php"?
[20:15:22] <TheresNoTime>	 cjming: nae, https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging - on `mwmaint` :)
[20:15:34] <cjming>	 ah - thanks
[20:16:44] <cjming>	 Aishik: just purged that svg - can you check on prod?
[20:17:30] <Aishik>	 Wordmark is not working yet...
[20:17:46] <cjming>	 gah - purged the wrong svg - 1 sec
[20:17:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:17:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:18:04] <wikibugs>	 (03PS3) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098)
[20:18:26] <cjming>	 Aishik: how about now?
[20:18:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:19:13] <Aishik>	 Nope!
[20:21:40] <Aishik>	 ...
[20:21:57] <zabe>	 The wordmark is the thing you see in the upper left corner with the new vector skin, isn't it?
[20:22:02] <cjming>	 hmm - not sure what to say about that - maybe it takes a minute?
[20:22:39] <Aishik>	 @zabe yeap!
[20:22:54] <Aishik>	 vector 2022 skin
[20:22:56] <zabe>	 Aishik, could you try clearing your browser cache?
[20:23:29] <Aishik>	 It's working!
[20:23:34] <cjming>	 yay!
[20:23:43] <cjming>	 thanks zabe - it's always cache
[20:23:51] <Aishik>	 Thank you (:
[20:24:21] <cjming>	 alrighty closing the backport window seeing there's nothing else in the queue
[20:25:00] <zabe>	 yw
[20:25:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) ` jclark@cumin1001:~$ sudo secure-cookbook sre.dns.netbox "noop"  We trust you have received the usual lecture from the local System Admin...
[20:25:41] <cjming>	 !log end of UTC late backport window
[20:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:01] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, thanos-fe1002, thanos-fe1003, t
[20:27:01] <icinga-wm>	 2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[20:29:29] <wikibugs>	 (03PS1) 10Hashar: gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963
[20:30:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar)
[20:34:49] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, thanos-fe1002, thanos-fe1003, t
[20:34:49] <icinga-wm>	 2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[20:48:31] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:01:08] <wikibugs>	 (03PS1) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259)
[21:01:21] <wikibugs>	 (03PS2) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259)
[21:01:22] <dancy>	 jouncebot nowandnext
[21:01:22] <jouncebot>	 No deployments scheduled for the next 9 hour(s) and 58 minute(s)
[21:01:22] <jouncebot>	 In 9 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T0700)
[21:03:46] <wikibugs>	 (03PS3) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259)
[21:04:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[21:04:32] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing T299648
[21:04:36] <stashbot>	 T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648
[21:06:16] <wikibugs>	 (03PS1) 10Volans: admin: fix sudo permission for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/831987 (https://phabricator.wikimedia.org/T306654)
[21:07:11] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Patch-For-Review: Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) @Jclark-ctr whoops, that's not what's supposed to happen. On second review I think that the original patch has an error,...
[21:08:37] <wikibugs>	 (03PS4) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259)
[21:10:33] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:12:54] <wikibugs>	 (03PS5) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259)
[21:13:52] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Makes sense to me.  Merge at will I'd say." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall)
[21:14:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:14:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:14:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:14:32] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:14:51] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:14:54] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:15:10] <logmsgbot>	 !log dancy@deploy1002 dancy: testing T299648 synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:15:13] <stashbot>	 T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648
[21:15:44] <wikibugs>	 (03CR) 10Dduvall: [V: 03+1] "Sorry for the noise. Manually verified in devtools." [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[21:16:05] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[21:16:42] <dancy>	 !log dancy@deploy1002  touch /var/lib/deploy-mwdebug/pause
[21:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:09] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] "Thanks! Verified in devtools. This won't work until I09cb4161712257f27999bc322a1bd80206afe82a is merged but the deployment doesn't work cu" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall)
[21:18:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:36:51] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[21:37:12] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ MW_CONFIG_BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. (duration: 00m 20s)
[21:39:13] <wikibugs>	 (03CR) 10Herron: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway)
[21:47:36] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[21:48:18] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:50:37] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:50:53] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:54:47] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:54:54] <logmsgbot>	 !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:55:08] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[21:55:17] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[21:55:58] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:56:52] <wikibugs>	 (03CR) 10JHathaway: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway)
[21:58:39] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:58:55] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:01:03] <wikibugs>	 (03PS1) 10RLazarus: httpbb: In PHP version routing tests, allow either 7.2 or 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/831997
[22:01:16] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:01:32] <logmsgbot>	 !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[22:01:35] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[22:02:36] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[22:03:15] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:05:59] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:06:07] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:06:33] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:06:45] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:07:02] <logmsgbot>	 !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:07:07] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[22:07:46] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[22:08:26] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:10:49] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:11:05] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:11:50] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:11:59] <logmsgbot>	 !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[22:12:07] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[22:12:45] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[22:13:00] <dancy>	 Sorry for the noise. I think this will be the last run for the day.
[22:13:25] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:13:35] <wikibugs>	 (03CR) 10Dzahn: "wow, great idea to use the validate_command with file. thanks for that. will get to it soon! currently a bit afk" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[22:14:05] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:14:12] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:14:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:14:54] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:15:01] <logmsgbot>	 !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[22:15:04] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[22:15:09] <wikibugs>	 (03PS2) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318)
[22:16:06] <dancy>	 !log dancy@deploy1002$ rm /var/lib/deploy-mwdebug/pause
[22:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34653 and previous config saved to /var/cache/conftool/dbconfig/20220913-221734-ladsgroup.json
[22:17:39] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:19:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:19:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:19:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:19:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:27:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34654 and previous config saved to /var/cache/conftool/dbconfig/20220913-222738-ladsgroup.json
[22:27:42] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:30:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34655 and previous config saved to /var/cache/conftool/dbconfig/20220913-223025-ladsgroup.json
[22:32:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P34656 and previous config saved to /var/cache/conftool/dbconfig/20220913-223241-ladsgroup.json
[22:32:50] <wikibugs>	 (03PS15) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[22:34:19] <wikibugs>	 (03CR) 10Raymond Ndibe: "Hello David, the new tests you added are failing because of the sudo command we are using. I'm currently looking for a way to fix this" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[22:36:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[22:39:39] <wikibugs>	 (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[22:40:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "compiled and confirmed with manual visudo on phab2001, disabled puppet on other hosts" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[22:42:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P34657 and previous config saved to /var/cache/conftool/dbconfig/20220913-224244-ladsgroup.json
[22:43:26] <wikibugs>	 (03PS1) 10Andrew Bogott: toolviews.py: run through black in advance of some changes [puppet] - 10https://gerrit.wikimedia.org/r/832000
[22:43:28] <wikibugs>	 (03PS1) 10Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - 10https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[22:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:44:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] toolviews.py: run through black in advance of some changes [puppet] - 10https://gerrit.wikimedia.org/r/832000 (owner: 10Andrew Bogott)
[22:45:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "root@phab2001:/etc/sudoers.d# su phab-deploy" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[22:47:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P34658 and previous config saved to /var/cache/conftool/dbconfig/20220913-224749-ladsgroup.json
[22:49:24] <wikibugs>	 (03PS2) 10Andrew Bogott: toolviews.py: run through black in advance of some changes [puppet] - 10https://gerrit.wikimedia.org/r/832000
[22:49:26] <wikibugs>	 (03PS2) 10Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - 10https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[22:57:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P34659 and previous config saved to /var/cache/conftool/dbconfig/20220913-225750-ladsgroup.json
[23:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34660 and previous config saved to /var/cache/conftool/dbconfig/20220913-230026-ladsgroup.json
[23:00:30] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:01:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "@dduval it works. I tested it like this:" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[23:02:49] <wikibugs>	 (03CR) 10Dduvall: [V: 03+1] phabricator: Fix sudo env_keep format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[23:02:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34661 and previous config saved to /var/cache/conftool/dbconfig/20220913-230255-ladsgroup.json
[23:02:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[23:03:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[23:03:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T314041)', diff saved to https://phabricator.wikimedia.org/P34662 and previous config saved to /var/cache/conftool/dbconfig/20220913-230317-ladsgroup.json
[23:06:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolviews.py: run through black in advance of some changes [puppet] - 10https://gerrit.wikimedia.org/r/832000 (owner: 10Andrew Bogott)
[23:10:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "deployed on all phab servers now" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall)
[23:12:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34663 and previous config saved to /var/cache/conftool/dbconfig/20220913-231257-ladsgroup.json
[23:12:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:13:03] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:13:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:15:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34664 and previous config saved to /var/cache/conftool/dbconfig/20220913-231533-ladsgroup.json
[23:19:17] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:30:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34665 and previous config saved to /var/cache/conftool/dbconfig/20220913-233039-ladsgroup.json
[23:45:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34666 and previous config saved to /var/cache/conftool/dbconfig/20220913-234546-ladsgroup.json
[23:45:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[23:45:50] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[23:46:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34667 and previous config saved to /var/cache/conftool/dbconfig/20220913-234607-ladsgroup.json
[23:47:35] <wikibugs>	 (03PS3) 10Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - 10https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)