[00:00:38] <icinga-wm>	 RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:06:02] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 4 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[00:07:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:08:22] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[00:20:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28374 and previous config saved to /var/cache/conftool/dbconfig/20220524-002006-ladsgroup.json
[00:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:13] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[00:22:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28375 and previous config saved to /var/cache/conftool/dbconfig/20220524-002246-ladsgroup.json
[00:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:53] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[00:35:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28376 and previous config saved to /var/cache/conftool/dbconfig/20220524-003511-ladsgroup.json
[00:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28377 and previous config saved to /var/cache/conftool/dbconfig/20220524-003752-ladsgroup.json
[00:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:10] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 170, active_shards: 300, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma
[00:43:10] <icinga-wm>	 g_in_queue_millis: 0, active_shards_percent_as_number: 97.71986970684038 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:43:26] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, active_shards_percent_as_number: 99.6742671009772, active_shards: 306, timed_out: False, delayed_unassigned_shards: 0, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 2, initializing_shards: 1, active_primary_shards: 170, relocating_shards: 0, status: yellow, nu
[00:43:27] <icinga-wm>	 in_flight_fetch: 0, number_of_pending_tasks: 0, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:45:10] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:50:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28378 and previous config saved to /var/cache/conftool/dbconfig/20220524-005016-ladsgroup.json
[00:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28379 and previous config saved to /var/cache/conftool/dbconfig/20220524-005257-ladsgroup.json
[00:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:54] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0100)
[01:05:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28380 and previous config saved to /var/cache/conftool/dbconfig/20220524-010521-ladsgroup.json
[01:05:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[01:05:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[01:05:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[01:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:27] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[01:05:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[01:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28381 and previous config saved to /var/cache/conftool/dbconfig/20220524-010534-ladsgroup.json
[01:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298555)', diff saved to https://phabricator.wikimedia.org/P28382 and previous config saved to /var/cache/conftool/dbconfig/20220524-010622-ladsgroup.json
[01:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:42] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1004.eqiad.wmnet with OS bullseye
[01:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28383 and previous config saved to /var/cache/conftool/dbconfig/20220524-010802-ladsgroup.json
[01:08:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[01:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[01:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:08] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:08:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28384 and previous config saved to /var/cache/conftool/dbconfig/20220524-010810-ladsgroup.json
[01:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:47] <wikibugs>	 (PS1) Ryan Kemper: sre.hosts.reimage: update usage w/ req arg [cookbooks] - https://gerrit.wikimedia.org/r/797712
[01:09:30] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:42] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 21 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 21, active_shards: 21, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 21, delayed_unassigned_shards: 0, number_of_pending_tasks: 0
[01:13:42] <icinga-wm>	 _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:14:58] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 167 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 137, active_shards: 137, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 167, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number
[01:14:58] <icinga-wm>	 light_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 45.06578947368421 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:16:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:48] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1004.eqiad.wmnet with reason: host reimage
[01:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:33] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1004.eqiad.wmnet with reason: host reimage
[01:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:54] <icinga-wm>	 PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28385 and previous config saved to /var/cache/conftool/dbconfig/20220524-012127-ladsgroup.json
[01:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:00] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 167 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 137, active_shards: 137, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 167, delayed_unassigned_shards: 0, number_of_pending_tasks: 0
[01:26:00] <icinga-wm>	 _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 45.06578947368421 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:26:00] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 21 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 21, active_shards: 21, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 21, delayed_unassigned_shards: 0, number_of_pending_
[01:26:00] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:36:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28386 and previous config saved to /var/cache/conftool/dbconfig/20220524-013632-ladsgroup.json
[01:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:30] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:47] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1004.eqiad.wmnet with OS bullseye
[01:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:01] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:50:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298555)', diff saved to https://phabricator.wikimedia.org/P28387 and previous config saved to /var/cache/conftool/dbconfig/20220524-015137-ladsgroup.json
[01:51:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[01:51:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[01:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:44] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[01:51:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28388 and previous config saved to /var/cache/conftool/dbconfig/20220524-015145-ladsgroup.json
[01:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:12] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823
[02:07:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823 (owner: TrainBranchBot)
[02:09:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:09:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:12] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:11:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:31] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: retry rails console, don't keep gitlab-secrets.json [puppet] - https://gerrit.wikimedia.org/r/797301 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[02:24:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823 (owner: TrainBranchBot)
[02:32:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:36:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28389 and previous config saved to /var/cache/conftool/dbconfig/20220524-024333-ladsgroup.json
[02:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:39] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[02:44:14] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:47] <wikibugs>	 (PS4) Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961)
[02:50:10] <wikibugs>	 (PS5) Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961)
[02:53:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:58:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28390 and previous config saved to /var/cache/conftool/dbconfig/20220524-025838-ladsgroup.json
[02:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28391 and previous config saved to /var/cache/conftool/dbconfig/20220524-031343-ladsgroup.json
[03:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:32] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:21:20] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 4.583 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:21:41] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:12] <icinga-wm>	 RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:34] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:28:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28392 and previous config saved to /var/cache/conftool/dbconfig/20220524-032848-ladsgroup.json
[03:28:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[03:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[03:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:55] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[03:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:48] <wikibugs>	 (PS1) KartikMistry: Enable Content and Section Translation in Serbian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304858)
[04:07:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:10:26] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 274, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 33, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_
[04:10:26] <icinga-wm>	 in_queue_millis: 0, active_shards_percent_as_number: 89.25081433224756 https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:10:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28393 and previous config saved to /var/cache/conftool/dbconfig/20220524-041034-ladsgroup.json
[04:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:41] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:15:48] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:23:54] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (netbox1002), No backups: 2 (backup1002, ...), Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:25:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28394 and previous config saved to /var/cache/conftool/dbconfig/20220524-042539-ladsgroup.json
[04:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:06] <wikibugs>	 (CR) Samwilson: [C: +1] Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: Abijeet Patro)
[04:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28395 and previous config saved to /var/cache/conftool/dbconfig/20220524-044044-ladsgroup.json
[04:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:40] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:52:24] <wikibugs>	 SRE, Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (Marostegui) @valhallasw if you can update https://www.mediawiki.org/wiki/Wikibugs to make it clearer...I think that'd be the only pending thing before we can close t...
[04:53:46] <wikibugs>	 (PS1) Marostegui: db1172: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/798043
[04:54:41] <wikibugs>	 (CR) Marostegui: [C: +2] db1172: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/798043 (owner: Marostegui)
[04:55:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 1%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28396 and previous config saved to /var/cache/conftool/dbconfig/20220524-045508-root.json
[04:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28397 and previous config saved to /var/cache/conftool/dbconfig/20220524-045549-ladsgroup.json
[04:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:55:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:54] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:55:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28398 and previous config saved to /var/cache/conftool/dbconfig/20220524-045602-ladsgroup.json
[04:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:15] <wikibugs>	 SRE-OnFire, DBA, Blocked-on-schema-change, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui)
[05:07:38] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_tasks: 0
[05:07:38] <icinga-wm>	 _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:10:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28399 and previous config saved to /var/cache/conftool/dbconfig/20220524-051011-root.json
[05:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:11] <marostegui>	 !log Rename revision_actor_temp on s6 T307906
[05:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:16] <stashbot>	 T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[05:17:02] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:20:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_
[05:20:40] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:20:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_
[05:20:40] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:25:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28400 and previous config saved to /var/cache/conftool/dbconfig/20220524-052515-root.json
[05:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:08] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:38:00] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28401 and previous config saved to /var/cache/conftool/dbconfig/20220524-054019-root.json
[05:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:55:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28402 and previous config saved to /var/cache/conftool/dbconfig/20220524-055523-root.json
[05:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0600).
[06:08:09] <marostegui>	 !log Rename revision_actor_temp on s8 T307906
[06:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:15] <stashbot>	 T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[06:10:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28403 and previous config saved to /var/cache/conftool/dbconfig/20220524-061027-root.json
[06:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:58] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:11:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28404 and previous config saved to /var/cache/conftool/dbconfig/20220524-061119-ladsgroup.json
[06:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:25] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:12:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:12:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28405 and previous config saved to /var/cache/conftool/dbconfig/20220524-061237-ladsgroup.json
[06:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:45] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[06:15:06] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:17:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:25:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28406 and previous config saved to /var/cache/conftool/dbconfig/20220524-062531-root.json
[06:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28407 and previous config saved to /var/cache/conftool/dbconfig/20220524-062625-ladsgroup.json
[06:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:06] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:39:42] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:41:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28408 and previous config saved to /var/cache/conftool/dbconfig/20220524-064130-ladsgroup.json
[06:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:56:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28409 and previous config saved to /var/cache/conftool/dbconfig/20220524-065635-ladsgroup.json
[06:56:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[06:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[06:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:42] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:56:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28410 and previous config saved to /var/cache/conftool/dbconfig/20220524-065643-ladsgroup.json
[06:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0700).
[07:00:04] <jouncebot>	 mainframe98: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28411 and previous config saved to /var/cache/conftool/dbconfig/20220524-070052-ladsgroup.json
[07:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:58] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:01:17] <wikibugs>	 (PS1) Muehlenhoff: Add more contributors [puppet] - https://gerrit.wikimedia.org/r/798352
[07:02:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add more contributors [puppet] - https://gerrit.wikimedia.org/r/798352 (owner: Muehlenhoff)
[07:05:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Looks good. merging." [puppet] - https://gerrit.wikimedia.org/r/797366 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:06:36] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:10:33] <wikibugs>	 (CR) Muehlenhoff: "This module only ships an args.erb file which isn't used anywhere in Puppet, I think instead we can simply remove it for good?" [puppet] - https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:12:02] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:13:01] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[07:13:28] <_joe_>	 ok
[07:13:31] <_joe_>	 timeouts
[07:13:40] <Amir1>	 here
[07:13:46] <godog>	 uhoh
[07:13:47] <_joe_>	 I'll look at the backends
[07:13:49] <godog>	 here too
[07:14:42] <_joe_>	 can someone look at the nel dashboard for patterns?
[07:14:50] <godog>	 not seeing any obvious drop in frontend traffic
[07:14:53] <godog>	 _joe_: ok I will
[07:14:57] * jelto around
[07:15:41] <godog>	 looks like a spike now btw
[07:15:57] <_joe_>	 nothing of note on either mediawiki or the edge
[07:15:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28412 and previous config saved to /var/cache/conftool/dbconfig/20220524-071557-ladsgroup.json
[07:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:46] <wikibugs>	 (CR) Zabe: tmpreaper: Add SPDX header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:17:02] <godog>	 https://logstash.wikimedia.org/goto/c08e26277bf98ea92a6a8c33361a6aaa  the spike
[07:17:43] <godog>	 I'm happy to tweak the alert a little bit too in terms of how sensitive it is
[07:18:00] <godog>	 should be recovering soon 
[07:18:01] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[07:18:24] <_joe_>	 godog: nah I think it's ok
[07:18:33] <_joe_>	 I mean this wasn't a false positive
[07:19:11] * volans here
[07:19:14] <volans>	 did I miss the fun?
[07:19:25] <volans>	 anything I can do?
[07:19:54] <godog>	 volans: turns out it was a spike
[07:20:53] <godog>	 _joe_: yeah fair enough, I was thinking something like a higher threshold, and/or a smaller threshold but with a slightly longer 'for' duration; anyway, let's see what happens
[07:21:41] <volans>	 ack
[07:21:43] <godog>	 while we're on the subject, I'm happy to report that the shower of individual pages for failing services from icinga will be going away soon: https://gerrit.wikimedia.org/r/q/topic:bug%252FT291946-monitoring-and-host-removal
[07:22:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:22:06] <Amir1>	 godog: I owe you a beer
[07:22:27] <godog>	 Amir1: awww <3 will gladly accept, thank you!
[07:22:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28413 and previous config saved to /var/cache/conftool/dbconfig/20220524-072243-ladsgroup.json
[07:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:50] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:23:36] <wikibugs>	 (CR) Volans: [C: +2] "Thanks for the patch!" [cookbooks] - https://gerrit.wikimedia.org/r/797712 (owner: Ryan Kemper)
[07:23:39] <wikibugs>	 (PS1) Majavah: nrpe::plugin: don't require a source with ensure => absent [puppet] - https://gerrit.wikimedia.org/r/798372
[07:24:47] <mainframe98>	 Amir1: Now that the crisis is over, can we deploy/merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793402?
[07:25:17] <Amir1>	 mainframe98: sure go ahead
[07:25:44] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35507/console" [puppet] - https://gerrit.wikimedia.org/r/798372 (owner: Majavah)
[07:26:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (fgiunchedi) Open→Resolved Thank you @papaul! Resolving, we'll be following up in {T309074}
[07:27:00] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: update usage w/ req arg [cookbooks] - https://gerrit.wikimedia.org/r/797712 (owner: Ryan Kemper)
[07:27:40] <mainframe98>	 Amir1: I don't have +2 in operations/mediawiki-config and I'm not a deployer myself; I need someone to +2 the change and do the follow up steps; how do I do that?
[07:28:01] <Amir1>	 mainframe98: you ping me ;)
[07:28:57] <Amir1>	 so the config is removed in code but it's not deployed yet. It looks like they are exactly the same so it should still be a noop
[07:29:39] <mainframe98>	 That's right
[07:30:09] <wikibugs>	 (CR) Ladsgroup: "Since it's the same as the default values. It's fine to deploy this." [mediawiki-config] - https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: Mainframe98)
[07:30:13] <wikibugs>	 (PS3) Ladsgroup: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: Mainframe98)
[07:30:16] <wikibugs>	 (CR) Ladsgroup: [C: +2] Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: Mainframe98)
[07:31:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28414 and previous config saved to /var/cache/conftool/dbconfig/20220524-073102-ladsgroup.json
[07:31:03] <wikibugs>	 (Merged) jenkins-bot: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: Mainframe98)
[07:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:45] <Amir1>	 pulled to mwdebug and confirmed it didn't change
[07:33:40] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793402|Remove wgPriorityHints and wgPriorityHintsRatio (T308707)]] (duration: 00m 50s)
[07:33:40] <mainframe98>	 \o/
[07:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:45] <stashbot>	 T308707: Remove inactive code from Priority Hints experiment in MW core - https://phabricator.wikimedia.org/T308707
[07:34:49] <wikibugs>	 (PS2) Majavah: nrpe::plugin: don't require a source with ensure => absent [puppet] - https://gerrit.wikimedia.org/r/798372
[07:35:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:42] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35508/console" [puppet] - https://gerrit.wikimedia.org/r/798372 (owner: Majavah)
[07:35:54] <mainframe98>	 Amir1: Thanks!
[07:36:19] <Amir1>	 mainframe98: Thank you for doing the work, I just pressed some shiny buttons
[07:36:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:36:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[07:37:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[07:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28415 and previous config saved to /var/cache/conftool/dbconfig/20220524-073738-ladsgroup.json
[07:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:43] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[07:37:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28416 and previous config saved to /var/cache/conftool/dbconfig/20220524-073748-ladsgroup.json
[07:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "The code looks ok; however I have one main issue here. Specifically, we're tying the schema of service::catalog to the spicerack code quit" [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[07:46:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28417 and previous config saved to /var/cache/conftool/dbconfig/20220524-074607-ladsgroup.json
[07:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:14] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:46:53] <wikibugs>	 (PS1) KartikMistry: Enable Section Translation for Hindi in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/798389 (https://phabricator.wikimedia.org/T308834)
[07:47:22] <wikibugs>	 (CR) Volans: [C: +2] service: add new module to expose service::catalog (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[07:48:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
[07:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:31] <moritzm>	 kubetcd2005 will be going down temporarily
[07:49:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Remove unused rewrite_static_assets param [puppet] - https://gerrit.wikimedia.org/r/778602 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[07:50:06] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:51:42] <icinga-wm>	 PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:17] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:52:21] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28418 and previous config saved to /var/cache/conftool/dbconfig/20220524-075253-ladsgroup.json
[07:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28419 and previous config saved to /var/cache/conftool/dbconfig/20220524-075259-ladsgroup.json
[07:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:04] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[07:53:05] <wikibugs>	 SRE-tools, Discovery, Infrastructure-Foundations, Discovery-Search (Current work), IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (Volans) >>! In T271143#7951004, @bking wrote: > @volans , we are ready to do "brave mode" on the remaining CODF...
[07:53:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
[07:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:50] <icinga-wm>	 RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[07:56:06] <wikibugs>	 (Merged) jenkins-bot: service: add new module to expose service::catalog [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[07:56:21] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:56:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:13] <wikibugs>	 (PS1) Muehlenhoff: Add some additional SPDX headers [puppet] - https://gerrit.wikimedia.org/r/798393
[07:57:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:49] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:07] <wikibugs>	 (PS1) PipelineBot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/798394
[08:00:13] <wikibugs>	 SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (daniel) p:Triage→High
[08:00:53] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/798393 (owner: Muehlenhoff)
[08:02:20] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: remove static assets rewrite clause. [deployment-charts] - https://gerrit.wikimedia.org/r/798395 (https://phabricator.wikimedia.org/T302465)
[08:02:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Thanks, looks good. Will merge." [puppet] - https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:02:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:06:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "The values in the chart are for an example site, it doesn't really make sense to add them there." [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[08:06:56] <wikibugs>	 (PS1) Jbond: spdx: add Rakefile, README and .conf files [puppet] - https://gerrit.wikimedia.org/r/798401
[08:07:11] <wikibugs>	 (CR) Jbond: [C: +2] spdx: add Rakefile, README and .conf files [puppet] - https://gerrit.wikimedia.org/r/798401 (owner: Jbond)
[08:07:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:07:28] <wikibugs>	 (CR) CI reject: [V: -1] spdx: add Rakefile, README and .conf files [puppet] - https://gerrit.wikimedia.org/r/798401 (owner: Jbond)
[08:07:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28420 and previous config saved to /var/cache/conftool/dbconfig/20220524-080758-ladsgroup.json
[08:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:04] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[08:08:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28421 and previous config saved to /var/cache/conftool/dbconfig/20220524-080804-ladsgroup.json
[08:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:51] <wikibugs>	 (PS2) Jbond: spdx: add Rakefile, README and .conf files [puppet] - https://gerrit.wikimedia.org/r/798401
[08:10:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[08:11:30] <wikibugs>	 (CR) Jbond: [C: +2] spdx: add Rakefile, README and .conf files [puppet] - https://gerrit.wikimedia.org/r/798401 (owner: Jbond)
[08:12:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[08:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:46] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[08:15:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:17:50] <wikibugs>	 (PS6) Giuseppe Lavagetto: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - https://gerrit.wikimedia.org/r/764919 (owner: Ahmon Dancy)
[08:20:03] <wikibugs>	 (CR) Muehlenhoff: tmpreaper: Add SPDX header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:20:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[08:20:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] toil: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:22:12] <godog>	 !log resume deletion of 'swift-tegola-container' on thanos-fe2001 - T307184
[08:22:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:17] <stashbot>	 T307184: Followups for Tegola and Swift interactions  - https://phabricator.wikimedia.org/T307184
[08:23:00] <kart_>	 Amir1: marostegui Anything except: https://phabricator.wikimedia.org/T306963#7949625 needed from the Language team to go ahead?
[08:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28422 and previous config saved to /var/cache/conftool/dbconfig/20220524-082309-ladsgroup.json
[08:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - https://gerrit.wikimedia.org/r/764919 (owner: Ahmon Dancy)
[08:27:28] <Amir1>	 kart_: grants are important, I guess only SELECT for now?
[08:30:11] <wikibugs>	 SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (fgiunchedi) hi @Jgiannelos, I have resumed work on this and was wondering what's the theoretical limit of tiles per container? As...
[08:30:26] <wikibugs>	 (PS1) Volans: Netbox: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/798425
[08:30:58] <wikibugs>	 (Merged) jenkins-bot: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - https://gerrit.wikimedia.org/r/764919 (owner: Ahmon Dancy)
[08:31:19] <wikibugs>	 (PS2) Volans: Netbox: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013)
[08:33:06] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet
[08:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:47] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:36] <wikibugs>	 (PS1) Jbond: spdx: add task to convert modules [puppet] - https://gerrit.wikimedia.org/r/798426
[08:35:22] <wikibugs>	 (CR) CI reject: [V: -1] spdx: add task to convert modules [puppet] - https://gerrit.wikimedia.org/r/798426 (owner: Jbond)
[08:36:37] <wikibugs>	 (CR) Daniel Kinzler: [C: +1] "yes, please" [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 (owner: D3r1ck01)
[08:38:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28423 and previous config saved to /var/cache/conftool/dbconfig/20220524-083814-ladsgroup.json
[08:38:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:38:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:20] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[08:38:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28424 and previous config saved to /var/cache/conftool/dbconfig/20220524-083822-ladsgroup.json
[08:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013) (owner: Volans)
[08:40:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet
[08:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:03] <kart_>	 Amir1: yes. SELECT only for now.
[08:42:40] <Amir1>	 I will do it ASAP
[08:42:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] debian: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793778 (owner: Muehlenhoff)
[08:43:11] <wikibugs>	 (CR) Volans: [C: +2] Netbox: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013) (owner: Volans)
[08:43:25] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:43:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:29] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:09] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: fix alert 'source' url [puppet] - https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081)
[08:44:37] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:45:41] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35509/console" [puppet] - https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) (owner: Filippo Giunchedi)
[08:47:59] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alerts: allow deploying site-specific alerts [puppet] - https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:48:08] <wikibugs>	 (PS3) Filippo Giunchedi: alerts: allow deploying site-specific alerts [puppet] - https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847)
[08:49:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:50:08] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add some additional SPDX headers [puppet] - https://gerrit.wikimedia.org/r/798393 (owner: Muehlenhoff)
[08:50:33] <wikibugs>	 (CR) Volans: "Some typos and a suggestion inline" [puppet] - https://gerrit.wikimedia.org/r/798426 (owner: Jbond)
[08:50:37] <moritzm>	 godog: shall I merge along?
[08:50:40] <wikibugs>	 (PS2) Filippo Giunchedi: alerts: take rule file site into consideration when deploying [puppet] - https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847)
[08:50:53] <godog>	 moritzm: yes please!
[08:51:06] <godog>	 sorry about that, totally forgot 
[08:51:13] <moritzm>	 ack, done
[08:52:48] <wikibugs>	 (PS1) Slyngshede: Allow for Apache2 to not bind to port 80. [puppet] - https://gerrit.wikimedia.org/r/798446
[08:53:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28425 and previous config saved to /var/cache/conftool/dbconfig/20220524-085314-ladsgroup.json
[08:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:20] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[08:53:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alerts: take rule file site into consideration when deploying [puppet] - https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:55:02] <Amir1>	 kart_: one other thing, the db you have is sqlite, can you make a mysql dump instead? it'd make moving the data much easier
[08:55:32] <wikibugs>	 (PS2) Jbond: rake spdx: add convert task for profiles [puppet] - https://gerrit.wikimedia.org/r/798426
[08:56:17] <wikibugs>	 (PS1) Ladsgroup: ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673)
[08:57:14] <wikibugs>	 (PS1) Ladsgroup: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673)
[08:57:40] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35510/console" [puppet] - https://gerrit.wikimedia.org/r/798446 (owner: Slyngshede)
[08:58:16] <wikibugs>	 (PS3) Jbond: rake spdx: add convert task for profiles [puppet] - https://gerrit.wikimedia.org/r/798426
[08:58:18] <wikibugs>	 (CR) Jbond: "updated" [puppet] - https://gerrit.wikimedia.org/r/798426 (owner: Jbond)
[08:58:42] <wikibugs>	 (PS3) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[08:58:44] <wikibugs>	 (PS1) Filippo Giunchedi: sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847)
[08:59:37] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/798426 (owner: Jbond)
[08:59:48] <wikibugs>	 (CR) Jbond: [C: +2] rake spdx: add convert task for profiles [puppet] - https://gerrit.wikimedia.org/r/798426 (owner: Jbond)
[09:00:13] <wikibugs>	 (PS2) Filippo Giunchedi: sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847)
[09:00:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[09:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[09:07:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[09:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:56] <wikibugs>	 (CR) Btullis: [C: +2] Enable cassandra encryption (aqs cluster) [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[09:08:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28427 and previous config saved to /var/cache/conftool/dbconfig/20220524-090819-ladsgroup.json
[09:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:02] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:26] <wikibugs>	 (PS1) Slyngshede: Move Apache2 to alternative port [puppet] - https://gerrit.wikimedia.org/r/798450
[09:13:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001
[09:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:03] <wikibugs>	 (CR) Slyngshede: [C: +2] Move Apache2 to alternative port [puppet] - https://gerrit.wikimedia.org/r/798450 (owner: Slyngshede)
[09:16:27] <wikibugs>	 SRE, SRE-swift-storage, User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (fgiunchedi) Open→Invalid I can't find any more errors for now, tentatively and optimistically resolving as invalid, will reopen if issues pop up again
[09:20:23] <wikibugs>	 (CR) Vgutierrez: [C: +1] purged: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793401 (owner: Muehlenhoff)
[09:22:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet
[09:22:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28428 and previous config saved to /var/cache/conftool/dbconfig/20220524-092324-ladsgroup.json
[09:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:54] <wikibugs>	 (PS1) Jcrespo: mariadb::misc: Fix motd that was marking misc hosts as core [puppet] - https://gerrit.wikimedia.org/r/798467
[09:25:02] <wikibugs>	 (PS3) Giuseppe Lavagetto: varnish: Expand static.php optimisation regarless of query string [puppet] - https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[09:25:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] varnish: Expand static.php optimisation regarless of query string [puppet] - https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[09:28:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
[09:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:48] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: mediawiki: remove static assets rewrite clause. [deployment-charts] - https://gerrit.wikimedia.org/r/798395 (https://phabricator.wikimedia.org/T302465) (owner: Giuseppe Lavagetto)
[09:29:59] <wikibugs>	 (PS1) Slyngshede: Add listen port to move repo from 80 to 8080 [puppet] - https://gerrit.wikimedia.org/r/798478
[09:30:57] <wikibugs>	 (CR) Slyngshede: [C: +2] Add listen port to move repo from 80 to 8080 [puppet] - https://gerrit.wikimedia.org/r/798478 (owner: Slyngshede)
[09:32:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet
[09:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:22] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5001.eqsin.wmnet to ganeti01.svc.eqsin.wmnet
[09:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:12] <logmsgbot>	 !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5001.eqsin.wmnet to ganeti01.svc.eqsin.wmnet
[09:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
[09:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] varnish: Expand static.php optimisation regarless of query string [puppet] - https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[09:38:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28430 and previous config saved to /var/cache/conftool/dbconfig/20220524-093830-ladsgroup.json
[09:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:36] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[09:41:55] <wikibugs>	 (PS1) Jbond: rake spdx: update file binary check [puppet] - https://gerrit.wikimedia.org/r/798504
[09:42:41] <wikibugs>	 (CR) CI reject: [V: -1] rake spdx: update file binary check [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[09:46:23] <wikibugs>	 (PS2) Jbond: rake spdx: update file binary check [puppet] - https://gerrit.wikimedia.org/r/798504
[09:49:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[09:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:22] <moritzm>	 !log installing openssl security updates
[09:50:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:50:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28431 and previous config saved to /var/cache/conftool/dbconfig/20220524-095030-ladsgroup.json
[09:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:38] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[09:51:56] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001
[09:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:21] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, couple of optional nits" [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[09:52:53] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:54:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[09:54:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:56] <wikibugs>	 (CR) Marostegui: [C: +1] "thanks for catching this" [puppet] - https://gerrit.wikimedia.org/r/798467 (owner: Jcrespo)
[09:59:11] <wikibugs>	 (PS4) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[09:59:13] <wikibugs>	 (PS1) Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847)
[10:04:37] <wikibugs>	 (CR) Jbond: rake spdx: update file binary check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[10:05:05] <wikibugs>	 (PS1) Jbond: spdx: add role task [puppet] - https://gerrit.wikimedia.org/r/798537
[10:05:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28432 and previous config saved to /var/cache/conftool/dbconfig/20220524-100553-ladsgroup.json
[10:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:00] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[10:06:52] <wikibugs>	 (PS3) Jbond: rake spdx: update file binary check [puppet] - https://gerrit.wikimedia.org/r/798504
[10:06:54] <wikibugs>	 (CR) Jbond: rake spdx: update file binary check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[10:07:26] <moritzm>	 !log installing imagemagick security updates
[10:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:09] <wikibugs>	 (PS2) Jbond: spdx: add role task [puppet] - https://gerrit.wikimedia.org/r/798537
[10:08:38] <jbond>	 hi all, I'm going to temporarily disable puppet fleet-wide to perform puppetmaster/db reboots
[10:09:00] <wikibugs>	 (CR) CI reject: [V: -1] spdx: add role task [puppet] - https://gerrit.wikimedia.org/r/798537 (owner: Jbond)
[10:09:54] <wikibugs>	 (CR) Filippo Giunchedi: "Not 100% sure about the PCC diff, I wasn't expecting all the new resources, is that expected ?" [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[10:10:23] <wikibugs>	 (PS3) Jbond: spdx: add role task [puppet] - https://gerrit.wikimedia.org/r/798537
[10:10:45] <godog>	 jbond: ack
[10:13:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[10:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:41] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet
[10:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:26] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet
[10:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:39] <icinga-wm>	 PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:45] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (MoritzMuehlenhoff)
[10:18:03] <moritzm>	 !log rebalance Ganeti cluster in eqsin T308211
[10:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:08] <stashbot>	 T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211
[10:18:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:18:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:55] <wikibugs>	 SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (daniel) @BBlack do you have thoughts on this?
[10:19:21] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:18] <wikibugs>	 (CR) Jbond: [C: +2] nrpe::plugin: don't require a source with ensure => absent [puppet] - https://gerrit.wikimedia.org/r/798372 (owner: Majavah)
[10:20:25] <moritzm>	 !log installing vim security updates
[10:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:37] <icinga-wm>	 RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms
[10:20:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28434 and previous config saved to /var/cache/conftool/dbconfig/20220524-102058-ladsgroup.json
[10:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:46] <wikibugs>	 (PS2) Majavah: nrpe: manage sudo rules via nrpe::check [puppet] - https://gerrit.wikimedia.org/r/797422
[10:22:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:22:15] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:48] <wikibugs>	 (CR) Hnowlan: [V: +1 C: +2] aqs: allow Kubernetes nodes access to cassandra [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[10:25:54] <wikibugs>	 (PS3) Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891)
[10:25:58] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet
[10:25:59] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35511/console" [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[10:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:24] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1002.eqiad.wmnet
[10:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:37] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet
[10:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:38] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2002.codfw.wmnet
[10:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:03] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1002.eqiad.wmnet
[10:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:08] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet
[10:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:25] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1004.eqiad.wmnet
[10:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:13] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1004.eqiad.wmnet
[10:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:49] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1005.eqiad.wmnet
[10:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:55] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2002.codfw.wmnet
[10:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:06] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet
[10:33:09] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2003.codfw.wmnet
[10:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:14] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1003.eqiad.wmnet
[10:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:34] <wikibugs>	 (PS1) Slyngshede: Handle port deallocation for Apache. [puppet] - https://gerrit.wikimedia.org/r/798570
[10:33:38] <wikibugs>	 (PS1) Filippo Giunchedi: test_alerts: report filename on assertion failure [alerts] - https://gerrit.wikimedia.org/r/798571
[10:34:07] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2022 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:34:45] <wikibugs>	 (CR) Majavah: [V: +1] nrpe: manage sudo rules via nrpe::check (3 comments) [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[10:36:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28435 and previous config saved to /var/cache/conftool/dbconfig/20220524-103603-ladsgroup.json
[10:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:24] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1003.eqiad.wmnet
[10:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:11] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1005.eqiad.wmnet
[10:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:20] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2003.codfw.wmnet
[10:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:44] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2004.codfw.wmnet
[10:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:15] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2005.codfw.wmnet
[10:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) (owner: Ssingh)
[10:40:57] <wikibugs>	 (CR) Slyngshede: [C: +2] Handle port deallocation for Apache. [puppet] - https://gerrit.wikimedia.org/r/798570 (owner: Slyngshede)
[10:41:02] <wikibugs>	 (CR) Jbond: [C: +2] rake spdx: update file binary check [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[10:41:06] <wikibugs>	 (CR) Jbond: [C: +2] spdx: add role task [puppet] - https://gerrit.wikimedia.org/r/798537 (owner: Jbond)
[10:42:13] <jbond>	 slyngs: happy for me to merge your Apache port deallocation CR?
[10:42:23] <slyngs>	 Yes
[10:43:05] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2004.codfw.wmnet
[10:43:08] <wikibugs>	 (CR) Volans: rake spdx: update file binary check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/798504 (owner: Jbond)
[10:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:37] <wikibugs>	 (PS1) Jbond: spdx: drop puts/debugging [puppet] - https://gerrit.wikimedia.org/r/798585
[10:43:47] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2005.codfw.wmnet
[10:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:12] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] spdx: drop puts/debugging [puppet] - https://gerrit.wikimedia.org/r/798585 (owner: Jbond)
[10:44:47] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[10:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:43] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:51:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28436 and previous config saved to /var/cache/conftool/dbconfig/20220524-105108-ladsgroup.json
[10:51:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[10:51:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[10:51:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:14] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[10:51:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28437 and previous config saved to /var/cache/conftool/dbconfig/20220524-105116-ladsgroup.json
[10:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:55] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:54:31] <taavi>	 jbond: just to confirm: do you want someone else to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/795380?
[10:55:51] <wikibugs>	 (PS1) Jbond: C:httpd: add documentation [puppet] - https://gerrit.wikimedia.org/r/798604
[11:00:32] <jynus>	 !log restart db1150 T308315
[11:00:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:39] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:03:29] <wikibugs>	 (PS1) Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/798615
[11:04:07] <wikibugs>	 (CR) Muehlenhoff: C:httpd: add documentation (2 comments) [puppet] - https://gerrit.wikimedia.org/r/798604 (owner: Jbond)
[11:04:22] <wikibugs>	 (CR) CI reject: [V: -1] C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/798615 (owner: Jbond)
[11:07:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28438 and previous config saved to /var/cache/conftool/dbconfig/20220524-110728-ladsgroup.json
[11:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:34] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[11:12:34] <wikibugs>	 (PS1) Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617
[11:14:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[11:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
[11:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[11:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:58] <wikibugs>	 (CR) Jbond: [C: +2] C:httpd: add documentation [puppet] - https://gerrit.wikimedia.org/r/798604 (owner: Jbond)
[11:21:03] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:21:27] <wikibugs>	 (CR) Jbond: [C: +2] "thanks" [puppet] - https://gerrit.wikimedia.org/r/798604 (owner: Jbond)
[11:22:29] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:22:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28439 and previous config saved to /var/cache/conftool/dbconfig/20220524-112233-ladsgroup.json
[11:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:56] <wikibugs>	 (PS2) Filippo Giunchedi: test_alerts: report filename on assertion failure [alerts] - https://gerrit.wikimedia.org/r/798571
[11:23:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:10] <wikibugs>	 (PS2) Jbond: C:httpd: add documentation [puppet] - https://gerrit.wikimedia.org/r/798604
[11:27:12] <wikibugs>	 (PS2) Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/798615
[11:28:36] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35513/console" [puppet] - https://gerrit.wikimedia.org/r/798615 (owner: Jbond)
[11:30:12] <jbond>	 disabling puppet again i missed puppetmaster1001
[11:30:18] <jbond>	 !log disable puppet fleet wide
[11:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] test_alerts: report filename on assertion failure [alerts] - https://gerrit.wikimedia.org/r/798571 (owner: Filippo Giunchedi)
[11:31:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28440 and previous config saved to /var/cache/conftool/dbconfig/20220524-113112-ladsgroup.json
[11:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:18] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[11:33:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[11:33:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:38] <moritzm>	 kubetcd2004 will be going down temporarily
[11:34:10] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet
[11:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:41] <icinga-wm>	 PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28441 and previous config saved to /var/cache/conftool/dbconfig/20220524-113738-ladsgroup.json
[11:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:56] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/798604 (owner: Jbond)
[11:39:01] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:27] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster1001.eqiad.wmnet
[11:40:29] <icinga-wm>	 RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms
[11:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:43] <wikibugs>	 (CR) Jbond: [C: +2] C:httpd: add documentation [puppet] - https://gerrit.wikimedia.org/r/798604 (owner: Jbond)
[11:40:51] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/798615 (owner: Jbond)
[11:44:33] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:45:01] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw2382 is CRITICAL: connect to address 10.192.0.45 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[11:45:09] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: connect to address 10.64.48.84 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[11:45:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw2306 is CRITICAL: connect to address 10.192.0.176 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:45:18] <jinxer-wm>	 (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:23] <icinga-wm>	 PROBLEM - Apache HTTP on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:45:25] <jbond>	 _joe_: 
[11:45:29] <jbond>	 #page
[11:45:33] <jbond>	 ok i think i broke things
[11:45:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:45:41] <icinga-wm>	 PROBLEM - Apache HTTP on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:45:52] <_joe_>	 uh what's going on with apache
[11:45:53] <jbond>	 !log disable puppet on mw servers
[11:45:55] <godog>	 jbond: ack, how can we help ?
[11:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1440 is CRITICAL: connect to address 10.64.48.79 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:46:07] <jbond>	 _joe_: i rolled out a patch which caused an apache reload
[11:46:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw2377 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw2389 is CRITICAL: connect to address 10.192.0.52 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw2403 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:11] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: connect to address 10.64.48.79 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[11:46:16] <jbond>	 i have disabled puppet everywhere and am rolling back the patch now
[11:46:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2306 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28442 and previous config saved to /var/cache/conftool/dbconfig/20220524-114617-ladsgroup.json
[11:46:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:46:21] <icinga-wm>	 PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:21] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw2351 is CRITICAL: connect to address 10.192.32.201 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[11:46:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw2273 is CRITICAL: connect to address 10.192.48.95 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:23] <icinga-wm>	 PROBLEM - Apache HTTP on mw2254 is CRITICAL: connect to address 10.192.16.53 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:31] <wikibugs>	 (PS1) Jbond: Revert "C:httpd: allow users to pass the listen_ports to use" [puppet] - https://gerrit.wikimedia.org/r/797222
[11:46:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2351 is CRITICAL: connect to address 10.192.32.201 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:46:37] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:46:38] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] Revert "C:httpd: allow users to pass the listen_ports to use" [puppet] - https://gerrit.wikimedia.org/r/797222 (owner: Jbond)
[11:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2382 is CRITICAL: connect to address 10.192.0.45 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:46:46] <_joe_>	 oh sigh, I think we're down
[11:46:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1330 is CRITICAL: connect to address 10.64.32.32 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:46:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2254 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:59] <marostegui>	 eswiki still up for me
[11:47:04] <_joe_>	 jbond: start from eqiad with the forced puppet run
[11:47:05] <RhinosF1>	 enwiki up here
[11:47:06] <jbond>	 en still up for me
[11:47:09] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1445 is CRITICAL: connect to address 10.64.48.84 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:47:11] <icinga-wm>	 PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:11] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:11] <icinga-wm>	 PROBLEM - Check systemd state on mw1366 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw2335 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:17] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:17] <jbond>	 ack just reverting now
[11:47:20] <_joe_>	 yeah we won't be for long
[11:47:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw2339 is CRITICAL: connect to address 10.192.32.117 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1036 is CRITICAL: connect to address 10.64.16.91 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:29] <icinga-wm>	 PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:32] <_joe_>	 revert and target eqiad first
[11:47:35] <icinga-wm>	 PROBLEM - Apache HTTP on mw2297 is CRITICAL: connect to address 10.192.0.167 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:37] <icinga-wm>	 PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1366 is CRITICAL: connect to address 10.64.48.208 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:44] <jbond>	 ack targeting eqiad now
[11:47:45] <icinga-wm>	 PROBLEM - Check systemd state on mw2360 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:47:49] <icinga-wm>	 PROBLEM - HTTPS-peopleweb on people1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.003 second response time https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[11:47:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw2360 is CRITICAL: connect to address 10.192.32.210 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:47:51] <icinga-wm>	 PROBLEM - Check systemd state on mw2407 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw2307 is CRITICAL: connect to address 10.192.0.177 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:48:05] <icinga-wm>	 PROBLEM - Check systemd state on mw2297 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:07] <icinga-wm>	 PROBLEM - Check systemd state on people1003 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:48:10] <_joe_>	 jbond: please first verify the fix works
[11:48:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1333 is CRITICAL: connect to address 10.64.32.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:48:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2307 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw2268 is CRITICAL: connect to address 10.192.16.69 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:48:17] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01073 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:48:19] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw2411 is CRITICAL: connect to address 10.192.0.122 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[11:48:21] <icinga-wm>	 PROBLEM - Check systemd state on mw2382 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:21] <icinga-wm>	 PROBLEM - Check systemd state on mw2335 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2273 is CRITICAL: connect to address 10.192.48.95 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:48:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:48:23] <icinga-wm>	 PROBLEM - Check systemd state on mw2411 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:31] <icinga-wm>	 PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:48:35] <icinga-wm>	 PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:37] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:48:38] <jbond>	 _joe_: ack
[11:48:40] <icinga-wm>	 PROBLEM - LVS kibana7 eqiad port 443/tcp - Kibana v7 env - HTTPS IPv4 #page on kibana7.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[11:48:41] <icinga-wm>	 PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1395 is CRITICAL: connect to address 10.64.16.153 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:48:45] <icinga-wm>	 PROBLEM - piwik.wikimedia.org requires authentication on matomo1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:48:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:49] <icinga-wm>	 PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:59] <icinga-wm>	 PROBLEM - Check systemd state on wtp1036 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2403 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2377 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:03] <icinga-wm>	 PROBLEM - Check systemd state on mw1445 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:11] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:49:17] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:21] <icinga-wm>	 PROBLEM - Check systemd state on mw1421 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1330 is CRITICAL: connect to address 10.64.32.32 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw2272 is CRITICAL: connect to address 10.192.48.94 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:49:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2254 is CRITICAL: connect to address 10.192.16.53 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2311 is CRITICAL: connect to address 10.192.16.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:49:41] <icinga-wm>	 PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:48] <_joe_>	 jbond: it's still not working AFAICT
[11:49:56] <_joe_>	 so please don't run puppet everywhere
[11:49:59] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1033 is CRITICAL: connect to address 10.64.16.88 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:50:01] <icinga-wm>	 PROBLEM - Apache HTTP on parse2002 is CRITICAL: connect to address 10.192.0.183 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:50:09] <icinga-wm>	 PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:50:15] <jbond>	 _joe_: ack, not running puppet anywhere, troubleshooting on mw1395
[11:50:19] <icinga-wm>	 PROBLEM - Check systemd state on mw2268 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2339 is CRITICAL: connect to address 10.192.32.117 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:50:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2407 is CRITICAL: connect to address 10.192.0.75 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:50:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:50:25] <icinga-wm>	 PROBLEM - Check systemd state on doc2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2403 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:27] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:39] <icinga-wm>	 PROBLEM - Check systemd state on wtp1033 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:45] <icinga-wm>	 PROBLEM - Check systemd state on mw2408 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:46] <_joe_>	 jbond: it seems it tries to listen on port 443
[11:50:59] <icinga-wm>	 PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1333 is CRITICAL: connect to address 10.64.32.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:51:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:51:17] <taavi>	 jbond: your patch makes apache2 listen on 443 regardless of whether mod_ssl is enabled
[11:51:19] <jbond>	 yes, and envoy is on there; one sec, let me remove that from the config via cumin
[11:51:25] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:51:28] <taavi>	 the default file wraps it in IfModule
[11:51:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2360 is CRITICAL: connect to address 10.192.32.210 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:51:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2408 is CRITICAL: connect to address 10.192.0.76 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:51:43] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1048 is CRITICAL: connect to address 10.64.48.166 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:51:43] <icinga-wm>	 PROBLEM - Check systemd state on parse2012 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1395 is CRITICAL: connect to address 10.64.16.153 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:51:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw2292 is CRITICAL: connect to address 10.192.0.162 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:51:53] <icinga-wm>	 PROBLEM - Apache HTTP on mw2408 is CRITICAL: connect to address 10.192.0.76 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:51:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2411 is CRITICAL: connect to address 10.192.0.122 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:51:57] <icinga-wm>	 PROBLEM - Apache HTTP on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:51:59] <_joe_>	 yeah thanks taavi I was about to point that out
[11:52:07] <icinga-wm>	 PROBLEM - Apache HTTP on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:52:23] <icinga-wm>	 PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1366 is CRITICAL: connect to address 10.64.48.208 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2292 is CRITICAL: connect to address 10.192.0.162 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2297 is CRITICAL: connect to address 10.192.0.167 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2307 is CRITICAL: connect to address 10.192.0.177 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:37] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2268 is CRITICAL: connect to address 10.192.16.69 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:39] <icinga-wm>	 PROBLEM - PHP7 rendering on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28443 and previous config saved to /var/cache/conftool/dbconfig/20220524-115243-ladsgroup.json
[11:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:52:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:52:47] <icinga-wm>	 RECOVERY - Apache HTTP on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:52:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:50] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[11:52:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28444 and previous config saved to /var/cache/conftool/dbconfig/20220524-115251-ladsgroup.json
[11:52:53] <icinga-wm>	 PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:55] <icinga-wm>	 PROBLEM - PHP7 rendering on parse2002 is CRITICAL: connect to address 10.192.0.183 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2351 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:03] <icinga-wm>	 PROBLEM - Check systemd state on mw2377 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:15] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:15] <icinga-wm>	 PROBLEM - Check systemd state on wtp1048 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:37] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: connect to address 10.64.16.88 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:53:49] <icinga-wm>	 PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:57] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: connect to address 10.64.48.166 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:54:01] <icinga-wm>	 PROBLEM - PHP7 rendering on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:54:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:54:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:29] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:35] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.03066 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:54:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:54:51] <icinga-wm>	 PROBLEM - PHP7 rendering on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:01] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:05] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2306 is CRITICAL: connect to address 10.192.0.176 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:09] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2335 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:19] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: connect to address 10.64.16.91 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:23] <icinga-wm>	 PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:25] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:55:25] <icinga-wm>	 PROBLEM - Check systemd state on mw1330 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:39] <icinga-wm>	 PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:56:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:56:41] <icinga-wm>	 PROBLEM - Apache HTTP on mw2311 is CRITICAL: connect to address 10.192.16.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:56:45] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:56:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw2407 is CRITICAL: connect to address 10.192.0.75 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:56:53] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[11:57:07] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:57:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:57:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:57:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:57:25] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:57:27] <icinga-wm>	 RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:29] <icinga-wm>	 RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:33] <icinga-wm>	 RECOVERY - Check systemd state on mw1333 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2408 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2377 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:41] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2351 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:57:41] <icinga-wm>	 RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:45] <icinga-wm>	 RECOVERY - Check systemd state on mw1330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:51] <icinga-wm>	 RECOVERY - Check systemd state on mw2339 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2382 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:57:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:57:55] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:09] <icinga-wm>	 PROBLEM - Check systemd state on parse2009 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2377 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:11] <icinga-wm>	 RECOVERY - Check systemd state on mw1445 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:11] <icinga-wm>	 RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2254 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:15] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw2382 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:58:15] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:58:21] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:23] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:58:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:25] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:25] <icinga-wm>	 RECOVERY - Check systemd state on mw1366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:27] <icinga-wm>	 RECOVERY - Apache HTTP on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:27] <icinga-wm>	 RECOVERY - Check systemd state on mw1421 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:27] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw2272 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:33] <icinga-wm>	 RECOVERY - Apache HTTP on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:35] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:37] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:43] <icinga-wm>	 RECOVERY - Apache HTTP on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:45] <icinga-wm>	 RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:45] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2411 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:58:45] <icinga-wm>	 RECOVERY - Apache HTTP on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:47] <icinga-wm>	 RECOVERY - Apache HTTP on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2272 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:58:57] <icinga-wm>	 RECOVERY - Check systemd state on mw2360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:59] <icinga-wm>	 RECOVERY - Apache HTTP on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:01] <icinga-wm>	 RECOVERY - Apache HTTP on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:03] <icinga-wm>	 RECOVERY - Check systemd state on mw2407 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:11] <icinga-wm>	 RECOVERY - Apache HTTP on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:15] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:19] <icinga-wm>	 RECOVERY - Check systemd state on mw2297 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:19] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw2377 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:25] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:27] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:27] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:27] <icinga-wm>	 RECOVERY - Check systemd state on mw2268 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:59:29] <icinga-wm>	 RECOVERY - Check systemd state on mw2307 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:59:31] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw2411 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:59:33] <icinga-wm>	 RECOVERY - Check systemd state on mw2403 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2382 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2306 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2335 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:39] <icinga-wm>	 RECOVERY - Check systemd state on mw2411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:39] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw2351 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:59:39] <icinga-wm>	 RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:01:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28445 and previous config saved to /var/cache/conftool/dbconfig/20220524-120122-ladsgroup.json
[12:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:54] <wikibugs>	 (03PS1) 10Majavah: httpd: use default ports.conf if nothing else was configured [puppet] - 10https://gerrit.wikimedia.org/r/798631
[12:01:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:52] <icinga-wm>	 PROBLEM - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:03:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798631 (owner: 10Majavah)
[12:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:03:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpd: reintroduce the default debian ports.conf where no changes were expected. [puppet] - 10https://gerrit.wikimedia.org/r/798633
[12:03:41] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:47] <godog>	 mmhhh looking at the thanos-query thing, I suspect that's the apache issue jbond 
[12:04:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[12:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:16] <jbond>	 godog: definitely possible, patch being reviewed now
[12:04:31] <jynus>	 so far I see no end users reporting problems anywhere
[12:04:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] httpd: use default ports.conf if nothing else was configured [puppet] - 10https://gerrit.wikimedia.org/r/798631 (owner: 10Majavah)
[12:04:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] httpd: reintroduce the default debian ports.conf where no changes were expected. [puppet] - 10https://gerrit.wikimedia.org/r/798633 (owner: 10Giuseppe Lavagetto)
[12:05:01] <jynus>	 I am monitoring phabricator and other channels
[12:05:25] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] httpd: use default ports.conf if nothing else was configured [puppet] - 10https://gerrit.wikimedia.org/r/798631 (owner: 10Majavah)
[12:05:29] <RhinosF1>	 At no point did enwiki go down as far as I saw
[12:05:44] <godog>	 jbond: ack
[12:05:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:06:02] * Emperor here (sorry, was eating lunch)
[12:06:05] <jbond>	 !log disable puppet on c:httpd
[12:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:06:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:06:55] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.129 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:06:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:07:28] <icinga-wm>	 RECOVERY - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:07:43] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:13] <icinga-wm>	 RECOVERY - HTTPS-peopleweb on people1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 1.008 second response time https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[12:08:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28446 and previous config saved to /var/cache/conftool/dbconfig/20220524-120816-ladsgroup.json
[12:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:22] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[12:08:25] <icinga-wm>	 RECOVERY - Check systemd state on people1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:09] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:21] <icinga-wm>	 RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.010 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:09:33] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:09:37] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.134 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:09:41] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:15] <wikibugs>	 (PS8) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[12:10:57] <icinga-wm>	 RECOVERY - Check systemd state on wtp1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:57] <icinga-wm>	 RECOVERY - PHP7 rendering on parse2002 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:11:03] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:11:15] <icinga-wm>	 RECOVERY - Check systemd state on wtp1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:17] <icinga-wm>	 RECOVERY - Check systemd state on parse2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:25] <icinga-wm>	 RECOVERY - Check systemd state on wtp1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:33] <icinga-wm>	 RECOVERY - Check systemd state on parse2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:33] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:11:45] <icinga-wm>	 RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:51] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:11:55] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:11:55] <icinga-wm>	 RECOVERY - PHP7 rendering on parse2012 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:11:55] <icinga-wm>	 RECOVERY - Apache HTTP on parse2001 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:11:57] <icinga-wm>	 RECOVERY - Check systemd state on parse2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:59] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:03] <icinga-wm>	 RECOVERY - Check systemd state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:07] <icinga-wm>	 RECOVERY - Apache HTTP on parse2009 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:17] <icinga-wm>	 RECOVERY - Apache HTTP on parse2012 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:23] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:25] <icinga-wm>	 RECOVERY - Apache HTTP on parse2002 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:29] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:12:33] <icinga-wm>	 PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:41] <icinga-wm>	 RECOVERY - PHP7 rendering on parse2009 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:12:45] <icinga-wm>	 RECOVERY - PHP7 rendering on parse2001 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:12:51] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:14:08] <wikibugs>	 (PS9) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[12:16:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28447 and previous config saved to /var/cache/conftool/dbconfig/20220524-121627-ladsgroup.json
[12:16:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[12:16:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[12:16:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:34] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:16:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:39] <icinga-wm>	 RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:41] <icinga-wm>	 RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28448 and previous config saved to /var/cache/conftool/dbconfig/20220524-121641-ladsgroup.json
[12:16:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 564 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:57] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:57] <icinga-wm>	 RECOVERY - Check systemd state on doc2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:17:21] <icinga-wm>	 RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:23] <icinga-wm>	 RECOVERY - LVS kibana7 eqiad port 443/tcp - Kibana v7 env - HTTPS IPv4 #page on kibana7.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10033 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:17:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[12:17:27] <icinga-wm>	 RECOVERY - piwik.wikimedia.org requires authentication on matomo1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 542 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:31] <icinga-wm>	 RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:53] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:33] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 550 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:18:36] <wikibugs>	 (PS10) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[12:19:36] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35516/console" [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[12:20:18] <jinxer-wm>	 (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:18] <jinxer-wm>	 (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:22] <godog>	 interestingly for kibana it was a partial failure as seen by prometheus, due to lvs hashing, 1005 did see the failure but 1006 did not
[12:22:34] <wikibugs>	 (PS1) Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/797223
[12:22:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[12:22:54] <wikibugs>	 (PS2) Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617
[12:22:56] <wikibugs>	 (CR) CI reject: [V: -1] C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/797223 (owner: Jbond)
[12:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:02] <wikibugs>	 (CR) Elukey: [V: +1] "After an email exchange with Eric we decided to move the config to a multi-instance config with a single cassandra instance for each node," [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[12:23:20] <wikibugs>	 (CR) CI reject: [V: -1] P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617 (owner: Jbond)
[12:23:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28449 and previous config saved to /var/cache/conftool/dbconfig/20220524-122321-ladsgroup.json
[12:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:33] <wikibugs>	 (Abandoned) Slyngshede: Allow for Apache2 to not bind to port 80. [puppet] - https://gerrit.wikimedia.org/r/798446 (owner: Slyngshede)
[12:27:17] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) Happening again. Nothing in the apache2.log on mediawiki12 since 11:55 (UTC?)
[12:27:38] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) >>! In T302699#7952842, @dom_walden wrote: > Happening again. Nothing in the apache2.log on mediawiki...
[12:30:32] <moritzm>	 !log installing openldap security updates
[12:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[12:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:36] <wikibugs>	 (PS2) Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/797223
[12:32:04] <wikibugs>	 (PS3) Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617
[12:33:18] <wikibugs>	 (PS2) Muehlenhoff: Only add component/memcached16 on Buster [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214)
[12:33:59] <icinga-wm>	 PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:34:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:35:45] <icinga-wm>	 RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms
[12:36:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[12:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28450 and previous config saved to /var/cache/conftool/dbconfig/20220524-123826-ladsgroup.json
[12:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:00] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Set fixed uid/gid for kafka by default [puppet] - https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: Elukey)
[12:41:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:46:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[12:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:00] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[12:52:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[12:52:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28452 and previous config saved to /var/cache/conftool/dbconfig/20220524-125331-ladsgroup.json
[12:53:36] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) @colewhite hi! There is no rush at the moment of course, but I am wondering what remaining clients needed to be migrated before being able to switch the broker's T...
[12:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:38] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1300).
[13:00:05] <jouncebot>	 MdsShakil and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:39] <koi>	 hello
[13:00:51] <MdsShakil>	 Good afternoon koi (•‿•)
[13:01:43] <wikibugs>	 (PS4) BBlack: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[13:03:23] <taavi>	 hey, I can deploy if no-one else is around
[13:04:59] <Amir1>	 jouncebot: nowandnext
[13:04:59] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1300)
[13:04:59] <jouncebot>	 In 2 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1600)
[13:05:17] <wikibugs>	 (PS12) Majavah: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil)
[13:05:23] <taavi>	 MdsShakil: are you familiar with the backport process?
[13:05:26] <taavi>	 hello Amir1
[13:05:37] <Amir1>	 hello :D
[13:06:00] <Amir1>	 I'll wait if you want to do it, once done I have some stuff
[13:06:02] <MdsShakil>	 Yes, I had an opportunity once
[13:06:21] <wikibugs>	 (CR) Majavah: [C: +2] Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil)
[13:06:38] <taavi>	 great! I'll ping you once your patch is testable
[13:07:09] <wikibugs>	 (Merged) jenkins-bot: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil)
[13:07:18] <wikibugs>	 (CR) Vgutierrez: [WIP] esitest service (4 comments) [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[13:08:15] <taavi>	 MdsShakil: please test your change on mwdebug1001.eqiad.wmnet
[13:09:45] <MdsShakil>	 Looking good
[13:09:52] <taavi>	 ok, deploying
[13:10:58] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793790|Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki (T308945)]] (duration: 00m 53s)
[13:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:05] <stashbot>	 T308945: Remove patrol rights from autoconfirmed users and create a separate user group on bnwiki - https://phabricator.wikimedia.org/T308945
[13:11:22] <wikibugs>	 (PS1) Ladsgroup: mariadb: Add cxserverdb grant [puppet] - https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963)
[13:13:07] <taavi>	 koi: with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/792752, is it expected that the logos downloaded from commons via `python3 logos/manage.py update zhwikisource` don't match what's already in the repository?
[13:13:37] <MdsShakil>	 Thanks taavi
[13:13:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:09] <koi>	 yes for the size, as its width is not 135px
[13:14:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:14:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:07] <wikibugs>	 (CR) CDanis: "one nit one question" [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[13:15:39] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:05] <wikibugs>	 (CR) CDanis: [C: +1] fastnetmon: remove alert, ported to Prometheus / Alertmanager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[13:17:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:25] <taavi>	 not sure if I understand - are you saying that the commons file can't actually be used to generate the logo files? I thought that was the point of declaring commons: in logos.yaml
[13:19:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/791567 (owner: Jbond)
[13:19:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:34] <koi>	 I mean the file currently inside this repository was indeed generated from the SVG file on commons - and recently I made some amendments to the file on commons (for its width-height ratio)
[13:19:53] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:58] <koi>	 and the file in the other submitted patch is generated from the new commons file
[13:20:00] <wikibugs>	 (PS1) Jbond: netbox: add discovery name [puppet] - https://gerrit.wikimedia.org/r/798663
[13:20:07] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (akosiaris)
[13:20:17] <taavi>	 ahh, now I understand. thanks!
[13:20:26] <wikibugs>	 (CR) Herron: [C: +1] Enforce hashtag-page in summary [alerts] - https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[13:20:39] <wikibugs>	 (PS4) Majavah: zhwikisource: Declare commons files for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:20:41] <wikibugs>	 (CR) Jbond: [C: +2] netbox: add discovery name [puppet] - https://gerrit.wikimedia.org/r/798663 (owner: Jbond)
[13:20:52] <wikibugs>	 (CR) Majavah: [C: +2] zhwikisource: Declare commons files for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:20:53] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti2030 is CRITICAL: CRIT Memory 97% used. Largest process: qemu-system-x86 (20037) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[13:20:54] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Add cxserverdb grant [puppet] - https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) (owner: Ladsgroup)
[13:21:24] <wikibugs>	 (PS2) Ladsgroup: mariadb: Add cxserverdb grant [puppet] - https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963)
[13:21:26] <wikibugs>	 (PS1) Alexandros Kosiaris: Add aikochou and kevinbazira to deployment [puppet] - https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308)
[13:21:29] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Add cxserverdb grant [puppet] - https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) (owner: Ladsgroup)
[13:21:43] <wikibugs>	 (Merged) jenkins-bot: zhwikisource: Declare commons files for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:22:02] <wikibugs>	 (CR) Herron: [C: +1] thanos: fix alert 'source' url [puppet] - https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) (owner: Filippo Giunchedi)
[13:22:36] <taavi>	 I'm guessing the first one can't be tested since it's comments only?
[13:22:38] <wikibugs>	 (CR) Elukey: "Hi Alex! Already filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/791036, lemme know if it is ok or if I have to drop it :)" [puppet] - https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) (owner: Alexandros Kosiaris)
[13:23:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:23:53] <koi>	 yes, but it needs a sync
[13:24:03] <taavi>	 ack, will do
[13:24:14] <wikibugs>	 (PS2) Majavah: zhwikisource: Optimize logo per commons files [mediawiki-config] - https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:24:21] <wikibugs>	 (CR) Majavah: [C: +2] zhwikisource: Optimize logo per commons files [mediawiki-config] - https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:25:01] <logmsgbot>	 !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:792752|zhwikisource: Declare commons files for logo (T308620)]] (duration: 00m 53s)
[13:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:08] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[13:25:09] <wikibugs>	 (Merged) jenkins-bot: zhwikisource: Optimize logo per commons files [mediawiki-config] - https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:25:54] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792752|zhwikisource: Declare commons files for logo (T308620)]] (duration: 00m 52s)
[13:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:03] <wikibugs>	 (PS2) Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847)
[13:26:05] <wikibugs>	 (PS5) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[13:26:17] <wikibugs>	 (PS1) Alexandros Kosiaris: Add sgimeno to deployment [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045)
[13:26:20] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the review" [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[13:26:25] <taavi>	 koi: the second one can now be tested on mwdebug1001
[13:26:32] <koi>	 looking
[13:26:47] <koi>	 and LGTM
[13:27:06] <taavi>	 ok, syncing
[13:27:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:57] <wikibugs>	 (03PS2) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847)
[13:27:58] <logmsgbot>	 !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:793127|zhwikisource: Optimize logo per commons files (T308620)]] (duration: 00m 55s)
[13:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:28:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:28:18] <taavi>	 done
[13:28:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:24] <koi>	 ty!
[13:28:25] <taavi>	 anyone have anything else to deploy?
[13:28:28] <taavi>	 Amir1: ^
[13:28:37] <Amir1>	 mine takes time to merge
[13:28:56] <Amir1>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/797220
[13:29:04] <Amir1>	 I +2 it and I can self-serve
[13:29:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:29:08] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup)
[13:29:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:16] <wikibugs>	 (03CR) 10Majavah: "There's also a group called 'restricted' which may be more suitable here" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris)
[13:29:25] <taavi>	 ack
[13:30:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] thanos: fix alert 'source' url [puppet] - 10https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) (owner: 10Filippo Giunchedi)
[13:34:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:59] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01022 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:35:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:35:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:20] <jbond>	 i think that's a lie
[13:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:07] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:36:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:38] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:36:58] <Amir1>	 jbond: icinga is keeping you on your toes 
[13:37:17] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002044 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:37:30] <jbond>	 :D
[13:37:36] <jbond>	 that's better icinga
[13:37:37] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004088 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:39:33] <jynus>	 unrelated, ganeti2030 has started swapping due to memory pressure
[13:39:44] <jynus>	 a rebalance may be needed
[13:39:55] <jynus>	 (this means my alert is working as intended)
[13:40:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:40:17] <wikibugs>	 (03PS6) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[13:40:54] <jynus>	 if there are ongoing ganeti reboots it will fix itself, otherwise I will have a look after lunch
[13:42:33] <moritzm>	 this will balance itself out along with the reboots
[13:42:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[13:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:51] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: testing non-RAIDing SSDs
[13:43:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:43:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:44:17] <jynus>	 thanks moritzm! Still, I think the alert is useful to surface unnoticed issues :-D
[13:45:00] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: retry rails console, don't keep gitlab-secrets.json [puppet] - 10https://gerrit.wikimedia.org/r/797301 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[13:45:08] <wikibugs>	 (03Merged) 10jenkins-bot: ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup)
[13:47:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:47:25] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847)
[13:48:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, will let others vote on this tho" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah)
[13:49:54] <wikibugs>	 (03PS1) 10Jbond: O:gerrit: move code around [puppet] - 10https://gerrit.wikimedia.org/r/798677
[13:50:29] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/api/ApiQueryBacklinksprop.php: Backport: [[gerrit:797220|ApiQueryBacklinksprop: Completely remove index hints (T306673)]] (duration: 00m 55s)
[13:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:35] <stashbot>	 T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[13:51:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:52:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:38] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673)
[13:53:43] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup)
[13:54:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:54:32] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup)
[13:54:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:04] <godog>	 mhhh checking, I think it might be timeouts 
[13:55:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:55:27] <_joe_>	 godog: what do you mean?
[13:55:37] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797221|Revert "Revert read new on frwiki for templatelinks migration"]] (duration: 00m 52s)
[13:55:37] * Emperor twitches
[13:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:57] <godog>	 _joe_: the probe hitting timeouts, though I thought I tweaked it
[13:56:32] <_joe_>	 frankly I'm not sure I see issues with thumbor
[13:56:43] <_joe_>	 Also not sure if it means eqiad or codfw
[13:57:08] <wikibugs>	 (03PS2) 10Jbond: O:gerrit: Pass rendered ports.conf config to httpd file [puppet] - 10https://gerrit.wikimedia.org/r/798677
[13:57:53] <_joe_>	 heh the 75th percentile is a bit elevated
[13:58:28] <godog>	 that'd be eqiad (from the dashboard and the page) I'm also checking the 20s timeout we've set is actually being honored
[13:58:45] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:58:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35518/console" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[13:59:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:59:34] <_joe_>	 it is a real problem though
[13:59:38] <_joe_>	 thumbor is overloaded
[14:00:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:01:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:01:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:32] <Emperor>	 there is a bit of a spike in originals uploads (but since about 10:00 when it peaked)
[14:02:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet
[14:03:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:35] <jbond>	 don't see a high load of thu/me don't see anything in sampled-1000 that looks strange for thumb
[14:03:40] <Emperor>	 cf https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=eqiad&var-prometheus=eqiad+prometheus%2Fops&from=1653374687094&to=1653400790979&viewPanel=31
[14:03:43] <godog>	 also definitely not the 20s timeout we specified in the blackbox-exporter config
[14:03:46] <godog>	 ts=2022-05-24T14:02:25.359Z caller=main.go:169 module=http_thumbor_ip4 target=http://[10.2.2.24]:8800/healthcheck level=debug msg="Beginning probe" probe=http timeout_seconds=2.5
[14:04:26] <Emperor>	 (I think new uploads => new thumbs => more thumbor load?)
[14:05:06] <Emperor>	 we've had bigger peaks in the last 6 months, though, so perhaps a red herring
[14:05:22] <godog>	 Emperor: yeah the => implications are correct
[14:06:55] <Emperor>	 so are we looking at higher-than-usual load, but overly-aggressive paging because the timeout isn't the 20s we were expecting?
[14:08:21] <godog>	 indeed
[14:13:40] <godog>	 yeah so we set 3s in prometheus at the job level, and that takes precedence over the blackbox configuration
[14:19:45] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:42] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, but I'd like someone else too to go through it with more context on what happened earlier today." [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:23:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:ssh::client: Add GSSAPIDelegateCredentials support to ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791567 (owner: 10Jbond)
[14:27:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:34] <_joe_>	 to be clear: thumbor is not down, just slow
[14:28:12] <wikibugs>	 (03CR) 10Dzahn: "The comment "what happened earlier today" makes me curious. This is a reaction to a specific event? Was gerrit down or something?" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:28:16] <_joe_>	 one thing we could do as an emergency measure is to cancel all thumbnailrender jobs for these pdfs
[14:28:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] O:gerrit: Pass rendered ports.conf config to httpd file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:30:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is not the correct way to handle this, and could have backscatter effects. Please wait before touching httpd/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:31:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: temp disable paging for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/798706
[14:31:29] <godog>	 if anyone wants to stamp ^
[14:32:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] O:gerrit: Pass rendered ports.conf config to httpd file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:32:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:50] <Emperor>	 👀
[14:35:26] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM, thanks - though we should capture the need to get the timeout honoured properly?" [puppet] - 10https://gerrit.wikimedia.org/r/798706 (owner: 10Filippo Giunchedi)
[14:36:24] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:27] <Emperor>	 properly> err, I meant in a phab item or similar which is easier to track than a comment :)
[14:38:00] <godog>	 Emperor: thank you, yes I'll attach a phab task too
[14:38:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707
[14:39:01] <_joe_>	 jbond: ^^
[14:39:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:39:25] <_joe_>	 I still need to remove the class from the role, sigh
[14:39:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[14:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:52] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707
[14:40:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: temp disable paging for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/798706 (https://phabricator.wikimedia.org/T309107)
[14:40:29] <jbond>	 _joe_: i think you will need to lint:ignore the httpd class
[14:40:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:40:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: temp disable paging for thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798706 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi)
[14:41:10] <wikibugs>	 (03Abandoned) 10Jbond: O:gerrit: Pass rendered ports.conf config to httpd file [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond)
[14:41:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35520/console" [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:41:23] <_joe_>	 jbond: yeah w/e
[14:41:36] <_joe_>	 I mean the right thing to do is to do those things in a profile
[14:41:45] <jbond>	 yes i agree 
[14:42:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Allow new idp-test hosts in Ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/798709 (https://phabricator.wikimedia.org/T308214)
[14:42:30] <wikibugs>	 (03PS3) 10Jbond: gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:42:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:42:57] <_joe_>	 jbond: so the way the httpd module was written, the idea was that ports would be managed this way
[14:43:06] <_joe_>	 if someone wanted something non-standard
[14:43:31] <jbond>	 ack I'll leave httpd alone then and use this going forward
[14:43:38] <jbond>	 thx
[14:44:05] <_joe_>	 We clearly need better docs :)
[14:44:13] <_joe_>	 I'll merge this and see if it works
[14:44:20] <jbond>	 thanks
[14:44:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto)
[14:44:48] <jbond>	 docs are probably fine, i should have just been patient and waited for a review
[14:46:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[14:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:47] <_joe_>	 change merged, gerrit works
[14:49:56] <jbond>	 great thanks
[14:51:58] <wikibugs>	 (03PS4) 10Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - 10https://gerrit.wikimedia.org/r/798617
[14:52:26] <wikibugs>	 (03CR) 10Jbond: P:aptrepo::private: update to use httpd listen_ports (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798617 (owner: 10Jbond)
[14:57:49] <wikibugs>	 (03PS1) 10Jbond: C:requesttracker: drop requesttracker::apache [puppet] - 10https://gerrit.wikimedia.org/r/798727
[14:59:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35522/console" [puppet] - 10https://gerrit.wikimedia.org/r/798727 (owner: 10Jbond)
[14:59:45] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:requesttracker: drop requesttracker::apache [puppet] - 10https://gerrit.wikimedia.org/r/798727 (owner: 10Jbond)
[15:04:48] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 55.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:05:54] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 56.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:07:11] <wikibugs>	 (03PS1) 10Jbond: C:reprepro: ensure /var/lib/reprepro/.bashrc exists [puppet] - 10https://gerrit.wikimedia.org/r/798740
[15:09:20] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:09:57] <bblack>	 seems like some unusual ripples that legitimately trip those, but so far doesn't look really out of whack, either, may just be "normal" -ish external variation from heavier clients
[15:10:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[15:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798709 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[15:10:58] <bblack>	 ulsfo in particular, and I remember some thumbor load mentioned earlier? could be related (ulsfo tends to take some of the big tech company traffic, and they tend to do thumbory things sometimes?)
[15:12:18] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 77.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:12:30] <_joe_>	 bblack: no, thumbor is self-inflicted pain
[15:12:36] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 86.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:12:58] <_joe_>	 someone uploads a 1k pages pdf and we generate 4k thumbnails
[15:13:02] <_joe_>	 one page at a time
[15:13:08] <_joe_>	 so we render that pdf 4k times
[15:15:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[15:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/798740 (owner: 10Jbond)
[15:17:49] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10MatthewVernon) There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit `^r` during boot), the general BIOS setup...
[15:20:08] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 4 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[15:20:41] <vgutierrez>	 uh?
[15:21:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[15:21:45] <vgutierrez>	 vgutierrez@acmechief1001:~$ sudo -i openssl x509 -dates -noout -in /var/lib/acme-chief/certs/gerrit/live/rsa-2048.crt
[15:21:45] <vgutierrez>	 notBefore=Apr 28 20:28:01 2022 GMT
[15:21:45] <vgutierrez>	 notAfter=Jul 27 20:28:00 2022 GMT
[15:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:12] <wikibugs>	 (03PS1) 10Jbond: C:cfssl: create a refresh only resource to force resigns [puppet] - 10https://gerrit.wikimedia.org/r/798765
[15:22:15] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[15:22:16] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[15:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35523/console" [puppet] - 10https://gerrit.wikimedia.org/r/798765 (owner: 10Jbond)
[15:23:16] <vgutierrez>	 _joe_: manual reload of httpd?
[15:23:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:cfssl: create a refresh only resource to force resigns [puppet] - 10https://gerrit.wikimedia.org/r/798765 (owner: 10Jbond)
[15:23:28] <_joe_>	 vgutierrez: what?
[15:23:44] <_joe_>	 no I did not
[15:23:57] <vgutierrez>	 _joe_: I saw you logged on gerrit1001 and I assumed that you fixed it
[15:24:11] <_joe_>	 vgutierrez: nope
[15:24:26] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 56.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:24:30] <_joe_>	 I'm not even sure that check is performed against gerrit1001 directly
[15:25:31] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@644075e]: increase executor jvm heap for convert_to_esbulk
[15:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:38] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 72.64 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:27:42] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:53] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@644075e]: increase executor jvm heap for convert_to_esbulk (duration: 02m 22s)
[15:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:33] <wikibugs>	 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) I've updated Netbox running the following code:  `lang=python >>> import uuid >>> request_id = uuid.uui...
[15:29:07] <jbond>	 vgutierrez: _joe_: the ports.conf change would have caused an apache reload on gerrit
[15:29:25] <_joe_>	 jbond: yeah but I ran that like 30 minutes ago
[15:29:30] <_joe_>	 puppet, I mean
[15:29:52] <jbond>	 ack, and also doesn't explain why it would fail then recover
[15:29:57] <_joe_>	 yeah
[15:30:01] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour [puppet] - 10https://gerrit.wikimedia.org/r/798772
[15:30:24] <_joe_>	 vgutierrez: can you check what's the actual check performed by icinga?
[15:30:34] <vgutierrez>	 gerrit.wm.o
[15:30:36] <vgutierrez>	 that's the hostname
[15:30:52] <_joe_>	 yeah I mean the whole command
[15:30:56] <_joe_>	 check_httpd?
[15:30:58] <vgutierrez>	 check_https_expiry!gerrit.wikimedia.org!443
[15:31:30] <_joe_>	 err the whole command line is not that :)
[15:31:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[15:31:48] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:22] <vgutierrez>	 _joe_: check_http --ssl --sni --certificate 9,7 -I $HOSTADDRESS$ -H gerrit.wikimedia.org -p 443
[15:32:41] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10Volans) There is a 5th way and is via Redfish API ;) We do have basic support for redfish API in spicerack right now and there is plan to add suppor...
[15:32:45] <_joe_>	 and hostaddress is I guess gerrit.wikimedia.org
[15:32:49] <vgutierrez>	 yep
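Note on the check above: `--certificate 9,7` asks check_http to warn when the certificate has 9 days or fewer left and to go critical at 7 days or fewer. A minimal Python sketch of that threshold logic follows. The function name `classify_expiry` is hypothetical, and the exact boundary comparisons are an assumption rather than a copy of the plugin's code.

```python
from datetime import datetime, timezone


def classify_expiry(not_after, now, warn_days=9, crit_days=7):
    """Return (state, whole_days_left) for a certificate expiry time.

    Assumption: thresholds are inclusive, as in "--certificate 9,7".
    """
    remaining = not_after - now
    days_left = remaining.days
    if remaining.total_seconds() <= 0:
        return "CRITICAL", days_left  # already expired
    if days_left <= crit_days:
        return "CRITICAL", days_left
    if days_left <= warn_days:
        return "WARNING", days_left
    return "OK", days_left


# Expiry times taken from the 00:06 and 00:08 icinga messages in this log.
old_cert = datetime(2022, 5, 28, 20, 33, 22, tzinfo=timezone.utc)
new_cert = datetime(2022, 7, 27, 20, 27, 52, tzinfo=timezone.utc)
checked_at = datetime(2022, 5, 24, 0, 6, 2, tzinfo=timezone.utc)

print(classify_expiry(old_cert, checked_at))
print(classify_expiry(new_cert, checked_at))
```

With these inputs the old certificate lands in the critical band and the new one is fine, which matches the PROBLEM and RECOVERY pair seen when different connections are served different certificates.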
[15:33:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test.py: increase timeout for DNS check [puppet] - 10https://gerrit.wikimedia.org/r/795730 (https://phabricator.wikimedia.org/T305909) (owner: 10Andrew Bogott)
[15:35:57] <wikibugs>	 (03PS5) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761)
[15:36:02] <wikibugs>	 (03PS6) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761)
[15:37:12] <wikibugs>	 (03PS1) 10Jbond: C:netbox: Add discovery namer as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777
[15:38:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:reprepro: ensure /var/lib/reprepro/.bashrc exists [puppet] - 10https://gerrit.wikimedia.org/r/798740 (owner: 10Jbond)
[15:39:35] <vgutierrez>	 so puppet reloaded apache2 on gerrit1001 at 14:48 and the alert was triggered at 15:20
[15:40:31] <wikibugs>	 (CR) David Caro: [C: +1] OpenStack nova.conf: set reclaim_instance_interval to half an hour (1 comment) [puppet] - https://gerrit.wikimedia.org/r/798772 (owner: Andrew Bogott)
[15:40:42] <wikibugs>	 (03PS2) 10Jbond: C:netbox: Add discovery namer as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777
[15:41:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35525/console" [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond)
[15:42:52] <RhinosF1>	 vgutierrez: that's been flapping for days
[15:43:16] <wikibugs>	 (03PS3) 10Jbond: C:netbox: Add discovery name as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777
[15:43:25] <vgutierrez>	 RhinosF1: uh? that's interesting
[15:43:25] <RhinosF1>	 I assumed it was the bug where apache serves old + new cert until restart
[15:43:40] <RhinosF1>	 vgutierrez: there's a task
[15:44:08] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35526/console" [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond)
[15:44:25] <RhinosF1>	 vgutierrez: https://phabricator.wikimedia.org/T308908#7946277
[15:46:13] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:netbox: Add discovery name as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond)
[15:47:34] <wikibugs>	 10SRE, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Joe)
[15:49:06] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:49:35] <vgutierrez>	 so httpd has some worker lurking around up to 1 month without being killed?
[15:49:49] <vgutierrez>	 that's pretty bad and not only for TLS material purposes
[15:50:16] <vgutierrez>	 acme-chief reissues certificates one month before the current one expires
[15:50:42] <wikibugs>	 10SRE, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle)
[15:50:49] <_joe_>	 vgutierrez: I don't think that's it tbh
[15:50:57] <vgutierrez>	 and gerrit1001 got the new one on April 28th, lrwxrwxrwx  1 root root   54 Apr 28 21:33 live -> /etc/acmecerts/gerrit/3a11664f5fdd45f48b53bd646c3bda1e
[15:51:13] <RhinosF1>	 vgutierrez: legokt.m links the upstream bug I believe on the task.
[15:54:56] <volans>	 vgutierrez: we don't set MaxConnectionsPerChild?
[15:55:30] <wikibugs>	 10SRE, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10Vgutierrez) that's intended, every time that acme-chief fetches fresh OCSP stapling responses it issues a reload of apache2  >>! In T293826#7446839, @Legoktm wrote...
[15:56:07] <vgutierrez>	 volans: nope apparently
[15:56:24] <volans>	 I wish there was also a MaxDaysPerChild :D
[15:56:47] <vgutierrez>	 actually
[15:56:49] <volans>	 for low traffic or secondary hosts for example
[15:56:50] <vgutierrez>	 mods-enabled/mpm_event.conf:    MaxConnectionsPerChild   0
[15:57:31] <volans>	 there you go, immortal :D
[15:58:09] <dancy>	 Is the idea that the lurking old worker is delivering an old copy of the cert?
[15:58:31] <wikibugs>	 (03CR) 10Btullis: "Hi, sorry that I'm late to the party here." [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan)
[15:59:00] <icinga-wm>	 RECOVERY - Ganeti memory on ganeti2030 is OK: OK Memory 67% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:59:04] <vgutierrez>	 dancy: yes
[15:59:33] <vgutierrez>	 the old version that expires on May 28th
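Note on the hypothesis above: if a long-lived worker still holds the old certificate, repeated connections should show more than one distinct expiry date. A rough Python sketch of how that could be sampled follows. `fetch_not_after`, `sample_expiries` and `is_mixed` are hypothetical helper names, and this is not the check icinga runs.

```python
import socket
import ssl
from collections import Counter


def fetch_not_after(host, port=443, timeout=5.0):
    """Open one TLS connection and return the served certificate's notAfter string.

    Note: the default context verifies the certificate, so an already
    expired certificate would raise instead of being returned.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def sample_expiries(fetch, attempts=20):
    """Call fetch() repeatedly and count how often each expiry value is seen."""
    return Counter(fetch() for _ in range(attempts))


def is_mixed(counts):
    """True when more than one distinct certificate expiry was served."""
    return len(counts) > 1


# Example use (performs real network connections, so left commented out):
# counts = sample_expiries(lambda: fetch_not_after("gerrit.wikimedia.org"))
# print(counts, is_mixed(counts))
```

Each call opens a fresh connection, so with enough attempts a stale worker has a chance of answering at least once. A single distinct value does not prove that no stale worker exists.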
[15:59:58] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:04] <jouncebot>	 jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:50] <wikibugs>	 10SRE, 10MediaWiki-File-management, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Joe)
[16:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:07:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10cscott)
[16:11:26] <wikibugs>	 (03PS1) 10C. Scott Ananian: CONTRIBUTORS: Add C. Scott Ananian [puppet] - 10https://gerrit.wikimedia.org/r/798800 (https://phabricator.wikimedia.org/T308013)
[16:12:09] <wikibugs>	 (03PS2) 10Zabe: tmpreaper: Remove args.erb [puppet] - 10https://gerrit.wikimedia.org/r/797362
[16:15:11] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] dumps: remove generic python 2.25.1 user agent block [puppet] - 10https://gerrit.wikimedia.org/r/793550 (owner: 10JHathaway)
[16:17:57] <wikibugs>	 10SRE, 10MediaWiki-File-management, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle)
[16:18:06] <wikibugs>	 10SRE, 10MediaWiki-Uploading, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle)
[16:28:54] <wikibugs>	 (03PS2) 10KartikMistry: Enable Content and Section Translation in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304858)
[16:32:12] <wikibugs>	 (03PS1) 10BBlack: ntp.drmrs should use dns6001 [dns] - 10https://gerrit.wikimedia.org/r/798856
[16:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28455 and previous config saved to /var/cache/conftool/dbconfig/20220524-163221-ladsgroup.json
[16:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:30] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[16:33:02] <mutante>	 !log gitlab1003 - restarting rsync, trying to debug mysterious "rsync - read-only file system" error we ran into before but could not reproduce
[16:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:30] <dancy>	 That sounds scary
[16:38:43] <wikibugs>	 (03PS2) 10Cathal Mooney: Modifications to install server netboot.cfg ommited in previous change [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989)
[16:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:42:21] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modifications to install server netboot.cfg ommited in previous change [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[16:45:10] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye
[16:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28456 and previous config saved to /var/cache/conftool/dbconfig/20220524-164726-ladsgroup.json
[16:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:53] <wikibugs>	 (03CR) 10David Caro: "Thanks! LGTM, can you run pcc on it before getting it merged?" [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[16:48:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "To be merge after the previous one has run a few times right?" [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[16:50:03] <wikibugs>	 Puppet, Infrastructure-Foundations, Machine-Learning-Team, ORES: Restructure ORES labs redis puppet role - https://phabricator.wikimedia.org/T281495 (elukey) Open→Resolved a:elukey This has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/785111 in theory, closing...
[16:50:04] <mutante>	 !log gitlab1003 (gitlab-replica-new) - rebooting for fsck - T307142
[16:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:11] <stashbot>	 T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[16:50:41] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab1003.wikimedia.org with reason: fsck
[16:50:44] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab1003.wikimedia.org with reason: fsck
[16:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:24] <wikibugs>	 (03PS1) 10BryanDavis: base: remove "managed by puppet" notice on /etc/skel/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/798874
[16:59:55] <wikibugs>	 (CR) Zabe: acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:00:49] <wikibugs>	 (03CR) 10David Caro: "LGTM, I'll wait for Jbond to do the final ack and merge." [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah)
[17:00:56] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage
[17:00:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28457 and previous config saved to /var/cache/conftool/dbconfig/20220524-170231-ladsgroup.json
[17:02:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:12] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage
[17:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:25] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808
[17:06:47] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809
[17:09:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[17:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:52] <Amir1>	 jouncebot: nowandnext
[17:11:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[17:11:52] <jouncebot>	 In 0 hour(s) and 48 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1800)
[17:12:13] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808 (owner: 10Ladsgroup)
[17:12:15] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: 10Jdlrobson)
[17:12:19] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809 (owner: 10Ladsgroup)
[17:14:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[17:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:41] <wikibugs>	 (03PS1) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883
[17:14:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging." [puppet] - 10https://gerrit.wikimedia.org/r/798800 (https://phabricator.wikimedia.org/T308013) (owner: 10C. Scott Ananian)
[17:16:30] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:16:42] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:17:32] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:17:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28459 and previous config saved to /var/cache/conftool/dbconfig/20220524-171736-ladsgroup.json
[17:17:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[17:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[17:17:43] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[17:17:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[17:17:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:48] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:49] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bullseye
[17:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[17:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:09] <wikibugs>	 (CR) David Caro: [C: +1] acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:18:22] <moritzm>	 !log failover ganeti master in codfw to ganeti2022
[17:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:37] <wikibugs>	 (03PS1) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224)
[17:21:53] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[17:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:34] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[17:23:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite)
[17:23:45] <mutante>	 !log gitlab1003 - short downtime for maintenance
[17:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:20] <wikibugs>	 (03PS2) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224)
[17:25:32] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) >>! In T308940#7951736, @Dzahn wrote: > https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting  "A flood of API traffic from an...
[17:31:07] <wikibugs>	 (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/35531/" [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite)
[17:32:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808 (owner: 10Ladsgroup)
[17:32:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809 (owner: 10Ladsgroup)
[17:32:38] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add target index validation step [puppet] - 10https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[17:33:27] <wikibugs>	 10ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (10RobH) All servers in B12 fixed and the power draw went from 8.9/1.9 to 5.6/5.5 so disabling the 'hot spare' option and splitting the load evenly ends up saving power.  Going to give it a bit out of paranoia and then a...
[17:35:42] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/api/ApiQueryBacklinksprop.php: Backport: [[gerrit:798808|Revert "ApiQueryBacklinksprop: Completely remove index hints"]] (duration: 00m 50s)
[17:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:49] <wikibugs>	 (03PS1) 10Cathal Mooney: Add v6 reverse zone for Vlan1116 / cloudsw1-c8 to cloudsw1-d5 linknet [dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936)
[17:36:00] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:00] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[17:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:39:36] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH)
[17:39:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:44] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[17:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:40:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:16] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:14] <wikibugs>	 10ops-drmrs: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH)
[17:50:45] <wikibugs>	 10ops-drmrs, 10Traffic-Icebox: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH) This is a dns server, so we'll have to check with traffic before we go taking it offline for repair.
[17:50:59] <wikibugs>	 10ops-drmrs, 10Traffic: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH)
[17:52:27] <wikibugs>	 ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (RobH) Open→Resolved All hosts except dns6002 have been fixed.  T309124 filed for dns6002 repair
[17:53:03] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[17:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:52] <dancy>	 Starting train operations
[17:59:55] <wikibugs>	 ops-drmrs, Traffic: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (RobH) Open→Resolved fixed the power via the idrac ssh cli
[17:59:59] <wikibugs>	 10ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (10RobH)
[18:00:11] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] ntp.drmrs should use dns6001 [dns] - 10https://gerrit.wikimedia.org/r/798856 (owner: 10BBlack)
[18:02:14] <wikibugs>	 (03PS1) 10Ahmon Dancy: testwikis wikis to 1.39.0-wmf.13  refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915
[18:02:16] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.39.0-wmf.13  refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915 (owner: 10Ahmon Dancy)
[18:02:57] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.13  refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915 (owner: 10Ahmon Dancy)
[18:03:54] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.13  refs T305219
[18:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:01] <stashbot>	 T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219
[18:06:11] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney)
[18:06:17] <wikibugs>	 (03PS2) 10BBlack: Add v6 reverse zone for Vlan1116 / cloudsw1-c8 to cloudsw1-d5 linknet [dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney)
[18:06:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:49] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH) It fixed itself with reboot    ` Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2., Normal,Tue 24 May 2022 18:06:22,The self-h...
[18:07:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:07:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:36] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH)
[18:08:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks this looks really good, have let some minor nits but no blockers will merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah)
[18:15:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798874 (owner: 10BryanDavis)
[18:18:02] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:33:43] <wikibugs>	 (CR) BBlack: [WIP] esitest service (4 comments) [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[18:34:12] <wikibugs>	 (03PS5) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[18:36:07] <icinga-wm>	 PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:36:44] <SandraEbele>	 !log deploying analytics refinery as part of the weekly deployment
[18:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:09] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31]: Regular analytics weekly train [analytics/refinery@8314d31]
[18:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:52] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) While I was out, they closed the task and I had to re-open.  They will be sending a new motherboard was where it left off.   New ticket Successfully Submitted Case Number: 5...
[18:41:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:41:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10Cmjohnson) @wiki_willy @Jclark-ctr We do not have a spare on-site.
[18:42:47] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547
[18:44:36] <wikibugs>	 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10Traffic, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10Krinkle)
[18:44:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:44:59] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:45:05] <wikibugs>	 (03PS2) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[18:45:29] <wikibugs>	 (CR) Gehel: elastic: add reimage to rolling-operation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: Bking)
[18:45:39] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.13  refs T305219 (duration: 41m 45s)
[18:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:45] <stashbot>	 T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219
[18:46:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:46:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:07] <wikibugs>	 (03PS6) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[18:48:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[18:48:11] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.9, 1.39.0-wmf.8, 1.39.0-wmf.10 (duration: 02m 28s)
[18:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:07] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) a:03Krinkle
[18:52:11] <wikibugs>	 (03PS3) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[18:52:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:42] <wikibugs>	 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle)
[18:53:50] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle)
[18:53:53] <wikibugs>	 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle)
[18:53:56] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle)
[18:54:06] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle)
[18:54:12] <wikibugs>	 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle) 05duplicate→03Open
[18:55:10] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) >>! In T308893#7947393, @Alexey_Skripnik wrote: > User Vladis13 did a great job importing some public domain texts in Russian Wikisource...
[18:55:17] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle)
[18:59:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:01:49] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31]: Regular analytics weekly train [analytics/refinery@8314d31] (duration: 23m 40s)
[19:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:09] <wikibugs>	 (03PS4) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:04:46] <wikibugs>	 (03PS1) 10Ahmon Dancy: group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219)
[19:04:48] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy)
[19:05:01] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547 (owner: 10Ryan Kemper)
[19:06:01] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy)
[19:06:35] <wikibugs>	 (03PS5) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:07:16] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.13  refs T305219
[19:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:23] <stashbot>	 T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219
[19:09:03] <wikibugs>	 (03PS6) 10Ryan Kemper: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:10:20] <wikibugs>	 (03PS2) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883
[19:12:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:20] <wikibugs>	 (03PS11) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[19:13:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:13:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:25] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31] (thin): Regular analytics weekly train THIN [analytics/refinery@8314d31]
[19:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:31] <wikibugs>	 (03CR) 10Ryan Kemper: elastic: add reimage to rolling-operation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:14:33] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31] (thin): Regular analytics weekly train THIN [analytics/refinery@8314d31] (duration: 00m 08s)
[19:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:13] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8314d31]
[19:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:47] <wikibugs>	 (03PS7) 10Ryan Kemper: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:16:51] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking)
[19:18:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10wiki_willy) Hi @BTullis - I noticed analytics1068 has a failed status and is set to be refreshed after @Cmjohnson finishes up T293922.  As a quick fix, would we be able to pull the RA...
[19:18:32] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:19:22] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10aaron) It would be good to look at the performance of pages at https://ru.wikisource.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1...
[19:21:20] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:25] <stashbot>	 T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606
[19:22:14] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:34] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8314d31] (duration: 07m 21s)
[19:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:40] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:23:36] <topranks>	 ^^ sry this was me adding new peerings here.
[19:23:58] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig
[19:23:58] <icinga-wm>	 : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:24:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:24:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:08] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:34] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:50] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:25] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: rolling reimage is missing req os arg [cookbooks] - 10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606)
[19:28:13] <wikibugs>	 (03CR) 10Bking: [V: 03+1] elastic: rolling reimage is missing req os arg [cookbooks] - 10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[19:28:35] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: rolling reimage is missing req os arg [cookbooks] - 10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[19:29:17] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798974 (https://phabricator.wikimedia.org/T305175)
[19:29:37] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@3ae51e7]: (no justification provided)
[19:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:45] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3ae51e7]: (no justification provided) (duration: 00m 08s)
[19:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:21] <stashbot>	 T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606
[19:31:03] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:22] <icinga-wm>	 PROBLEM - IPMI Sensor Status on aqs1014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:34:18] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:39:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:52] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:42:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:42:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:16] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig
[19:42:16] <icinga-wm>	 : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[19:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:16] <stashbot>	 T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606
[19:46:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:46:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:02] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068)
[19:47:31] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders)
[19:48:09] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders)
[19:49:27] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1003.eqiad.wmnet with OS bullseye
[19:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:38] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Update beta cluster DiscussionTools A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030)
[19:53:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:43] <wikibugs>	 (03PS1) 10Eevans: Allow `LOGIN` for image_suggestions Cassandra user [puppet] - 10https://gerrit.wikimedia.org/r/798977
[19:54:02] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] beta-logs: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798974 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[19:54:31] <SandraEbele>	 !Log Refinery Deployment is complete
[19:55:14] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:26] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:55:37] <wikibugs>	 (03CR) 10Eevans: [C: 04-1] "I'm marking this -1 until we've established that this fixes the current connection failures.  If it does we can merge it (it's already bee" [puppet] - 10https://gerrit.wikimedia.org/r/798977 (owner: 10Eevans)
[19:55:44] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:59:33] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage
[19:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T2000).
[20:00:04] <jouncebot>	 Tran, zabe, cjming, koi, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <zabe>	 heya o/
[20:00:44] <Tran>	 👋 I'm here!
[20:00:51] <MatmaRex>	 hi
[20:00:53] <koi>	 here/
[20:01:25] <cjming>	 hi all - i can deploy
[20:01:35] <cjming>	 if anyone can/wants to self-serve, just lmk
[20:02:21] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage
[20:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:33] <cjming>	 Tran: I'll start with your patches
[20:02:34] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Alexey_Skripnik) This is one of the longest pages in Russian Wikisource: https://ru.wikisource.org/wiki/%D0%A4%D0%B8%D0%BD%D0%B8%...
[20:02:47] <Tran>	 :+1
[20:02:50] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders)
[20:03:01] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606)
[20:03:50] <wikibugs>	 (03Merged) 10jenkins-bot: Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders)
[20:04:04] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:04] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:04:04] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:04:59] <cjming>	 Tran: I think because your 1st patch is labs, I can move on to your 2nd? When confirming rebase on master, it says it breaks the relation chain, but I'm assuming that's ok?
[20:05:02] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606)
[20:05:45] <Tran>	 Yes that's fine. The first two are just comments.
[20:05:54] <wikibugs>	 (03PS2) 10Clare Ming: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders)
[20:06:25] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Alexey_Skripnik) In comparison, this is some **random short page** from Russian Wikisource: https://ru.wikisource.org/wiki/%D0%95%...
[20:07:15] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:07:43] <AntiComposite>	 seeing general slowness, one report of "upstream connect error or disconnect/reset before headers. reset reason: overflow"
[20:08:11] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[20:08:14] <wikibugs>	 (03PS1) 10Cwhite: logstash: curator support new and legacy index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798982 (https://phabricator.wikimedia.org/T305175)
[20:08:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:08:18] <jinxer-wm>	 (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:29] <rzl>	 looking
[20:08:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:08:49] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:08:55] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:793848|Remove outdated comment about IPInfo from CommonSettings-labs.php (T308876)]] (duration: 00m 49s)
[20:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:08:56] <rzl>	 cjming: pause deploying please
[20:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:59] <stashbot>	 T308876: Improve comments in mediawiki-config about IPInfo - https://phabricator.wikimedia.org/T308876
[20:09:03] <cjming>	 rzl: ok
[20:09:12] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[20:09:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:09:20] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[20:09:23] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[20:09:32] <rzl>	 cjming not sure if this is deployment related or not, checking, but if you rolled anything out in the last few minutes, please prepare a rollback and don't merge it yet
[20:09:34] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:09:47] * jbond here, I can IC, will set up the doc
[20:09:54] <rzl>	 jbond: ack, thanks
[20:10:19] <cjming>	 rzl: I deployed only the first patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793848
[20:10:21] <bblack>	 here too
[20:10:28] <jhathaway>	 here as well
[20:10:48] <cjming>	 rzl: should it be rolled back?
[20:10:52] <zabe>	 the only deployed patch is labs-only. It doesn't seem like it could have caused this.
[20:10:59] <rzl>	 looks like a spike of DB queries to s5 that saturated php-fpm workers, seems like it's already cleared
[20:11:09] <cjming>	 that's what I was thinking - i.e. labs only
[20:11:13] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:11:16] <rzl>	 cjming: yeah, safe to assume that's unrelated, just sit tight for a minute, thank you
[20:11:23] <cjming>	 rzl: sure thing
[20:11:28] <rzl>	 sorry, just ruling stuff out :D
[20:11:42] <cjming>	 rzl: i'll wait for your green light before proceeding
[20:11:48] <rzl>	 perfect, thanks, will let you know
[20:12:53] <rzl>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s5&var-role=All&from=now-3h&to=now s5 did see a traffic spike but recovered, still digging
[20:13:18] <jinxer-wm>	 (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:13:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:13:44] <bblack>	 aside from current s5, note also that s5 replag has been growing since ~4.5h ago, not sure if that's a problem or related
[20:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:13:56] <bblack>	 https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?orgId=1&viewPanel=6
[20:14:19] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:14:34] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:15:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:15:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:50] <jbond>	 rzl: do we know more than "issue with S5"?
[20:16:53] <rzl>	 jbond: issue driven by about a 6x spike in qps, and cwhite has some information in the other channel, but that's where we're at as far as I know
[20:17:06] <jbond>	 thx
[20:18:45] <rzl>	 https://orchestrator.wikimedia.org/web/cluster/alias/s5 shows high replication lag to db1154 but I think it's still depooled
[20:19:01] <rzl>	 ^ any SRE with a pair of hands free, can you verify that please?
[20:19:23] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:46] <rzl>	 I see lag to dbstore1003 and to codfw also, but I'm not worried about that right now
[20:20:22] <zabe>	 it got downtimed for 2 days
[20:20:52] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1003.eqiad.wmnet with OS bullseye
[20:20:54] <zabe>	 I think Amir is running a schema change on them
[20:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:10] <zabe>	 T298560
[20:21:11] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[20:21:13] <jbond>	 rzl: that db is still depooled
[20:21:17] <rzl>	 zabe: yes :) what I need to find out is whether it's still depooled but thank you
[20:21:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:22] <rzl>	 jbond: rad thanks
[20:22:09] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:45] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:26:25] <rzl>	 cjming: fyi we've switched channels but still digging, haven't forgotten you :) it looks like we're stable but we'd like to get a better sense of what's going on before we unblock deploys, will still let you know
[20:26:31] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig
[20:26:31] <icinga-wm>	 : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:26:46] <cjming>	 rzl: sounds good - i'll be standing by
[20:36:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[20:38:25] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:51] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 122 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 152, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 120, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num
[20:40:51] <icinga-wm>	 n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 55.47445255474452 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:42:19] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:44:15] <icinga-wm>	 RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:55] <rzl>	 cjming: all clear, thanks for your patience!
[20:52:11] <cjming>	 rzl: thanks!
[20:52:25] <wikibugs>	 (CR) Clare Ming: [C: +2] Add comment to consult Legal before updating IPInfo access [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[20:52:43] <_joe_>	 cjming: sorry for the wait
[20:52:53] <cjming>	 Tran: I'm going to sync your 2nd patch since it's just a comment as well
[20:53:03] <cjming>	 _joe_: np! glad it all got sorted out
[20:53:06] <Tran>	 👍 thanks!
[20:53:29] <wikibugs>	 (Merged) jenkins-bot: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[20:53:32] <_joe_>	 we were trying to be sure of the root cause so that if the problem happens again we won't get in your way :)
[20:54:20] <cjming>	 gtk we're all in good hands
[20:54:44] <wikibugs>	 (PS2) Clare Ming: Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: Tchanders)
[20:54:48] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:793849|Add comment to consult Legal before updating IPInfo access (T308876)]] (duration: 00m 52s)
[20:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:55] <stashbot>	 T308876: Improve comments in mediawiki-config about IPInfo - https://phabricator.wikimedia.org/T308876
[20:55:47] <wikibugs>	 (CR) Clare Ming: [C: +2] Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: Tchanders)
[20:56:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:21] <wikibugs>	 (Merged) jenkins-bot: Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: Tchanders)
[20:57:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:57:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:57] <cjming>	 Tran: is your 3rd patch something that can be checked? on mwdebug1001
[20:58:03] <cjming>	 otherwise I can just sync
[20:58:11] <Tran>	 Yes I think I can check the version page to see if it's installed. Please hold
[20:58:15] <icinga-wm>	 PROBLEM - Host ml-serve1007 is DOWN: PING CRITICAL - Packet loss = 100%
[20:58:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (BTullis) Hi Willy, That works for me. Shall I shut down analytics1068 at a convenient time tomorrow?  Many thanks, Ben
[20:58:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:24] <Tran>	 Yes I can confirm it's installed on mwdebug1001
[20:59:29] <cjming>	 cool - syncing
[20:59:56] <wikibugs>	 (PS5) Clare Ming: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:00:28] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793841|Deploy IPInfo to all wikis by default (T260597)]] (duration: 00m 52s)
[21:00:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:34] <stashbot>	 T260597: Deploy IP Info extension to all wikis (as a beta feature) - https://phabricator.wikimedia.org/T260597
[21:00:36] <cjming>	 Tran: should be live
[21:00:47] <Tran>	 Looks good thank you!
[21:00:53] <cjming>	 np!
[21:01:00] <cjming>	 Zabe: I can do yours next if you're still around
[21:01:14] <zabe>	 i am still here
[21:01:19] <wikibugs>	 (CR) Clare Ming: [C: +2] Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:02:09] <icinga-wm>	 RECOVERY - Host ml-serve1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[21:02:15] <wikibugs>	 (CR) Clare Ming: [C: +2] mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: Jdlrobson)
[21:02:53] <wikibugs>	 (Merged) jenkins-bot: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:03:24] <cjming>	 Zabe: is your patch testable? on mwdebug1001
[21:03:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:04:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:04:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:07] <zabe>	 cjming, it's not really testable. I made sure editing does not result in fatals. I will keep an eye on logstash after you've synced it.
[21:05:16] <cjming>	 sounds good - syncing then
[21:05:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:52] <cjming>	 koi: doing my patch real quick and will do yours next if you're still around
[21:06:10] <koi>	 still here
[21:06:18] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797294|Start writing to cuc_actor in s3, kcgwiki and labtestwiki (T233004)]] (duration: 00m 52s)
[21:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:23] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:06:24] <cjming>	 Zabe: should be live
[21:06:41] <zabe>	 ok, thanks :)
[21:07:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (9) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:08:53] <wikibugs>	 (PS2) Clare Ming: zhwikisource: Adjust workmark size [mediawiki-config] - https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[21:10:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:14:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:14:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:13] <wikibugs>	 SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (aaron) It would be good to look at the performance of pages at  https://he.wikisource.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%93%D7%A4%D7%99%...
[21:15:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:52] <wikibugs>	 (PS1) Catrope: CONTRIBUTORS: Add myself (Roan Kattouw) [puppet] - https://gerrit.wikimedia.org/r/798991 (https://phabricator.wikimedia.org/T308013)
[21:21:44] <cjming>	 koi: sorry - my patch is taking forever to merge -- it's almost there
[21:22:03] <wikibugs>	 (Merged) jenkins-bot: mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: Jdlrobson)
[21:23:35] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.12/resources/src/mediawiki.skinning/accessibility.less: Backport: [[gerrit:797219|mediawiki.skinning: `transition-duration` accessibility override set to `0` (T308979)]] (duration: 00m 51s)
[21:23:37] <wikibugs>	 (CR) Clare Ming: [C: +2] zhwikisource: Adjust workmark size [mediawiki-config] - https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[21:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:42] <stashbot>	 T308979: Infinite motion when "Reduces motion" is enabled on mobile device for skins that are not responsive (Modern, Vector legacy) - https://phabricator.wikimedia.org/T308979
[21:24:30] <wikibugs>	 (Merged) jenkins-bot: zhwikisource: Adjust workmark size [mediawiki-config] - https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[21:24:59] <cjming>	 koi: can you check mwdebug1001?
[21:25:07] <koi>	 looking
[21:25:33] <koi>	 cjming: LGTM
[21:25:39] <cjming>	 great - syncing
[21:26:41] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikisource-wordmark-zh.svg: Config: [[gerrit:792971|zhwikisource: Adjust workmark size (T308620)]] (duration: 00m 50s)
[21:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:47] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[21:27:20] <zabe>	 Column 'cuc_actor' cannot be null
[21:27:21] <zabe>	 bah
[21:27:53] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792971|zhwikisource: Adjust workmark size (T308620)]] (duration: 00m 50s)
[21:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:14] <cjming>	 koi: should be live - purged the svg
[21:28:48] <koi>	 indeed, thx
[21:29:26] <cjming>	 MatmaRex: are you still around? happy to do your patches unless you'd like to self-serve
[21:29:50] <MatmaRex>	 cjming: yeah, i'm around if you're still deploying
[21:30:09] <MatmaRex>	 (i don't have deploy access)
[21:30:14] <cjming>	 sure - np
[21:30:25] <wikibugs>	 (PS5) Clare Ming: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: Esanders)
[21:30:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (Jclark-ctr) @BTullis  i am available tomorrow at 3pm est
[21:31:41] <wikibugs>	 (CR) Clare Ming: [C: +2] Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: Esanders)
[21:31:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:31:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:31:44] <wikibugs>	 (CR) Ori: [C: +2] CONTRIBUTORS: Add myself (Roan Kattouw) [puppet] - https://gerrit.wikimedia.org/r/798991 (https://phabricator.wikimedia.org/T308013) (owner: Catrope)
[21:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:26] <wikibugs>	 (Merged) jenkins-bot: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: Esanders)
[21:33:10] <cjming>	 MatmaRex: your 1st patch is on mwdebug1001 if it's verifiable
[21:33:46] <wikibugs>	 (CR) Clare Ming: [C: +2] Update beta cluster DiscussionTools A/B test config [mediawiki-config] - https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030) (owner: Bartosz Dziewoński)
[21:33:46] <MatmaRex>	 cjming: it should be a no-op
[21:33:56] <cjming>	 alrighty then - syncing
[21:34:11] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606
[21:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:16] <stashbot>	 T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606
[21:34:56] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771872|Disable autotopicsub user option by default (T297966)]] (duration: 00m 48s)
[21:35:02] <wikibugs>	 (Merged) jenkins-bot: Update beta cluster DiscussionTools A/B test config [mediawiki-config] - https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030) (owner: Bartosz Dziewoński)
[21:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:06] <stashbot>	 T297966: Auto topic subscription should be enabled by default on 3rd party installs - https://phabricator.wikimedia.org/T297966
[21:35:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:41] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:35:57] <wikibugs>	 (CR) Clare Ming: [C: +2] Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068) (owner: Bartosz Dziewoński)
[21:36:37] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:798976|Update beta cluster DiscussionTools A/B test config (T304030)]] (duration: 00m 49s)
[21:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:44] <stashbot>	 T304030: Implement Topic Subscriptions A/B test bucketing - https://phabricator.wikimedia.org/T304030
[21:37:01] <cjming>	 MatmaRex: just waiting for your last patch to merge
[21:37:44] <MatmaRex>	 thanks
[21:38:54] <zabe>	 cjming, sorry, but I need to revert my patch, it's causing fatals
[21:39:09] <cjming>	 zabe: ok
[21:40:03] <wikibugs>	 (PS1) Zabe: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004)
[21:40:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:41:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:50] <wikibugs>	 (PS2) Clare Ming: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:46:19] <cjming>	 zabe: just waiting for the current patch to go thru and we can do your revert
[21:46:29] <zabe>	 ok :)
[21:52:13] <MatmaRex>	 so what is taking 15 minutes there
[21:52:20] <MatmaRex>	 oh, selenium tests
[21:52:49] <cjming>	 ya - so slow
[21:53:34] <wikibugs>	 (Merged) jenkins-bot: Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068) (owner: Bartosz Dziewoński)
[21:54:13] <MatmaRex>	 apparently it takes 2 minutes just to install the npm dependencies for them. truly we're doomed
[21:54:24] <cjming>	 lol
[21:54:40] <cjming>	 MatmaRex: your last patch is on mwdebug1001 if you can confirm
[21:54:56] <MatmaRex>	 yeah. testing
[21:55:19] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 51420 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[21:55:30] <mutante>	 awww. man
[21:55:37] <MatmaRex>	 looks fixed at https://m.mediawiki.org/wiki/Project:Sandbox (i did a null edit)
[21:55:45] <cjming>	 cool - syncing then
[21:56:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/MobileFrontend: Backport: [[gerrit:798811|Follow-up I97c27fd7: Fix after-edit reload in source editor (T309068)]] (duration: 00m 48s)
[21:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:56] <cjming>	 MatmaRex: should be live
[21:56:57] <stashbot>	 T309068: [betalabs-mobile] Publishing edits from source editor re-opens page in editing mode - https://phabricator.wikimedia.org/T309068
[21:57:05] <MatmaRex>	 thank you cjming. have a good evening
[21:57:10] <cjming>	 thanks! you too
[21:57:18] <cjming>	 ok zabe: onto your revert
[21:57:25] <wikibugs>	 (CR) Clare Ming: [C: +2] Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:57:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:41] <MatmaRex>	 (oh, still not done :( )
[21:58:20] <cjming>	 MatmaRex: ?
[21:58:28] <wikibugs>	 (Merged) jenkins-bot: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:58:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:58:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:56] <MatmaRex>	 cjming: sorry, everything's alright, i'm just bemoaning the deployment running over :)
[21:59:15] <cjming>	 appreciate the empathy lol
[21:59:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:21] <cjming>	 zabe: i can go ahead and sync
[22:00:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (Jclark-ctr) All host racked , powered  and management.  will update once network is completed name rack position mw1457 A8 1 mw1458 A8 2 mw1459 A8 3 mw1460 A8 12 mw14...
[22:01:01] <cjming>	 unless you want to verify on mwdebug1001
[22:01:19] <cjming>	 gonna sync cuz i gotta run
[22:02:14] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:798813|Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" (T233004 T309148)]] (duration: 00m 49s)
[22:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:22] <stashbot>	 T309148: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'cuc_actor' cannot be nullFunction: MediaWiki\CheckUser\Hooks::updateCheckUserDataQuery: INSERT INTO `cu_changes` (cuc_namespace,cuc_title,cuc_minor,cuc_user,cuc_user_text,cuc_ - https://phabricator.wikimedia.org/T309148
[22:02:22] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[22:03:16] <zabe>	 yes
[22:03:33] <zabe>	 thanks for helping with this :)
[22:03:34] <mutante>	 !log centrallog2002 - alerted because running out of disk. /srv/syslog# find . -name *.gz -mtime +100 -delete
[22:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:17] <cjming>	 zabe: np ! revert should be live
[22:04:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:04:41] <cjming>	 !log end of UTC late backport window
[22:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:08:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:03] <wikibugs>	 (CR) Dzahn: "what Majavah says,"restricted" will give access to mediawiki::maintenance hosts and deployment isn't really needed afaict." [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris)
[22:15:54] <wikibugs>	 (CR) Krinkle: Move out ORES extension configuration out of InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[22:19:00] <wikibugs>	 (CR) Dzahn: "looking at https://phabricator.wikimedia.org/T307452#7930485  if this is really just that command and it's running twice a week... would i" [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris)
[22:21:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) >>! In T309045#7950982, @MShilova_WMF wrote: > I confirm that @sgs needs access to a production server and it is currently blocking {https://phabric...
[22:22:53] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 23.42 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:23:07] <rzl>	 ^ OK
[22:24:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 41.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:24:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 41.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:25:07] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:26:31] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 81.09 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:26:31] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 99.39 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:28:17] <wikibugs>	 (PS1) Cwhite: logstash: enable pipeline-managed index patterns [puppet] - https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175)
[22:42:05] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:47:46] <wikibugs>	 (CR) BryanDavis: [C: +2] Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah)
[22:51:27] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 268, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma
[22:51:27] <icinga-wm>	 g_in_queue_millis: 0, active_shards_percent_as_number: 97.8102189781022 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:52:19] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 273, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma
[22:52:19] <icinga-wm>	 g_in_queue_millis: 0, active_shards_percent_as_number: 99.63503649635037 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:52:49] <wikibugs>	 (Merged) jenkins-bot: Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah)
[23:10:01] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:12:07] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Krinkle)
[23:15:12] <wikibugs>	 (PS1) MewOphaswongse: Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/799007 (https://phabricator.wikimedia.org/T309152)
[23:15:28] <wikibugs>	 (PS1) MewOphaswongse: Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152)
[23:40:09] <wikibugs>	 (CR) CI reject: [V: -1] Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: MewOphaswongse)
[23:43:13] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:44:56] <wikibugs>	 (PS7) BryanDavis: helmfile.d: add developer-portal [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah)
[23:55:42] <wikibugs>	 (CR) BryanDavis: [C: +1] "I plan to merge and deploy this for the first time on 2022-05-25." [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah)