[00:01:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:20] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (KFrancis) Thanks for signing! I'm just waiting on legal counsel to counter sign.
[00:18:07] (CR) Dzahn: "I also have https://gerrit.wikimedia.org/r/c/operations/puppet/+/812142 to fall back to monitoring port 80 and not envoy, so 3 changes alt" [puppet] - https://gerrit.wikimedia.org/r/812282 (owner: Dzahn)
[00:18:57] (CR) Dzahn: "same here, stalled for a week, then will be merged" [puppet] - https://gerrit.wikimedia.org/r/812326 (owner: Dzahn)
[00:19:36] (CR) Dzahn: [C: -1] "only a fallback if we decide to not monitor envoy but just the backend, stalled for a week" [puppet] - https://gerrit.wikimedia.org/r/812142 (owner: Dzahn)
[00:27:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:05] (PS1) BCornwall: varnish: Port over traffic_drop from Icinga [alerts] - https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723)
[00:32:17] (PS1) CDanis: haproxy: also log high client concurrency [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580)
[00:33:29] SRE, Observability-Alerting, Traffic, Patch-For-Review, User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) (forgot one last one!)
[00:35:59] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[00:45:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:47:11] (PS1) Eevans: [DRAFT]: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802)
[00:47:52] (CR) Eevans: [C: -1] "Not yet ready; Needs additional IPs added" [puppet] - https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: Eevans)
[01:05:29] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[01:22:52] (CR) Krinkle: [C: +2] ResourceLoader: Switch Image.php to injected log channel [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812108 (https://phabricator.wikimedia.org/T32956) (owner: Krinkle)
[01:22:56] (CR) Krinkle: [C: +2] Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795 (owner: Krinkle)
[01:23:06] (CR) CI reject: [V: -1] Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795 (owner: Krinkle)
[01:27:59] (PS2) Krinkle: Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795
[01:28:03] (CR) Krinkle: [C: +2] Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795 (owner: Krinkle)
[01:28:56] (Merged) jenkins-bot: Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795 (owner: Krinkle)
[01:35:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:35:31] !log krinkle@deploy1002 Synchronized wmf-config/: I1bb97d1d601 (duration: 03m 24s)
[01:36:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:36:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:37:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:02] (Merged) jenkins-bot: ResourceLoader: Switch Image.php to injected log channel [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812108 (https://phabricator.wikimedia.org/T32956) (owner: Krinkle)
[01:42:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:43:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:44:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:47:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:33] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.19/includes/ResourceLoader/: I3e43b10d26858c5b (duration: 03m 37s)
[01:51:27] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:47] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:09] (CR) Dzahn: [C: +2] gitlab: add prometheus blackbox http monitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806476 (owner: Dzahn)
[02:31:45] (PS1) Dzahn: gitlab: for now, only monitor the active host, not the replica [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170)
[02:33:38] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36239/" [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170) (owner: Dzahn)
[02:35:17] (CR) Dzahn: [C: +2] gitlab: for now, only monitor the active host, not the replica [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170) (owner: Dzahn)
[02:38:16] (CR) Dzahn: [C: +2] "should stop this now:" [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170) (owner: Dzahn)
[03:09:12] (PS2) Dzahn: vrts/blackbox: adjust monitoring back to port 80, but fix path [puppet] - https://gerrit.wikimedia.org/r/812142 (https://phabricator.wikimedia.org/T312194)
[03:09:27] (PS2) Dzahn: Revert "vrts/prometheus: comment out broken check" [puppet] - https://gerrit.wikimedia.org/r/812282 (https://phabricator.wikimedia.org/T312194)
[03:09:34] (PS2) Dzahn: vrts/prometheus: re-activate commented check after fixing path [puppet] - https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194)
[03:09:36] (CR) CI reject: [V: -1] vrts/blackbox: adjust monitoring back to port 80, but fix path [puppet] - https://gerrit.wikimedia.org/r/812142 (https://phabricator.wikimedia.org/T312194) (owner: Dzahn)
[03:16:03] SRE, Znuny, serviceops, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Dzahn)
[03:30:04] (PS1) Dzahn: doc: set role_owner to serviceops [puppet] - https://gerrit.wikimedia.org/r/812430
[03:30:36] (PS2) Dzahn: doc: set role_contacts [puppet] - https://gerrit.wikimedia.org/r/812430
[03:34:45] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[04:41:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:42:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:46:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:47:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:57:59] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:27] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:09] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:25:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:35] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:49] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:31] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:58:55] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220709T0700)
[08:03:39] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8818.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:27] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:25:57] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8818.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:09] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:09] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8818.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:10] (PS1) Volans: redfish: better compare Dell SCP attributes [software/spicerack] - https://gerrit.wikimedia.org/r/812442
[09:58:09] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:35] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service,thumbor@8820.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[10:57:47] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:13] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service,thumbor@8820.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[11:22:13] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:39] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:43] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service,thumbor@8820.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s
[13:27:03] (CR) Majavah: [C: +2] "beta-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/812412 (owner: Zabe)
[13:27:54] (Merged) jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/812412 (owner: Zabe)
[13:32:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:33:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:33:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:34:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:21:08] (PS1) Volans: sre.hosts.provision: ask to setup the RAID [cookbooks] - https://gerrit.wikimedia.org/r/812448
[14:22:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:23:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:29:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:33:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.850 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:34:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:34:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:15] (PS2) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[14:53:17] (PS1) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449
[14:57:09] (CR) CI reject: [V: -1] wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449 (owner: Jbond)
[14:59:37] (PS2) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449
[14:59:39] (PS3) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[15:16:33] (Abandoned) Ori: admin: steal Giuseppe's docker shortcuts [puppet] - https://gerrit.wikimedia.org/r/800122 (owner: Ori)
[16:24:02] (PS4) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[16:24:04] (PS1) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455
[16:25:07] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[16:25:43] (PS2) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455
[16:47:43] (PS5) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[16:47:45] (PS1) Jbond: P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457
[16:48:46] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[16:51:41] (CR) CI reject: [V: -1] P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457 (owner: Jbond)
[16:57:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:18:58] (PS6) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[17:20:02] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[17:23:05] (CR) Ori: "I cherry-picked this on Beta and confirmed it works via some manual testing." [puppet] - https://gerrit.wikimedia.org/r/812450 (https://phabricator.wikimedia.org/T138093) (owner: Ori)
[17:24:39] (PS2) Jbond: P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457
[17:24:41] (PS7) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[17:26:23] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[17:27:52] (CR) CI reject: [V: -1] P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457 (owner: Jbond)
[17:37:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:57:43] (PS8) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[17:58:25] (CR) Jbond: beaker: add initial beaker files (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[17:59:05] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[18:11:24] (PS9) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[18:11:26] (PS1) Jbond: base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461
[18:12:55] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[18:15:30] (CR) CI reject: [V: -1] base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461 (owner: Jbond)
[18:39:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 39.53 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[18:41:41] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[19:01:21] (PS10) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[19:04:46] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[19:20:51] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8817.service,thumbor@8820.service
[19:20:51] @8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 54.12 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:22:43] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 92.82 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:45:25] (PS11) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[20:48:51] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[20:57:57] (CR) Jbond: beaker: add initial beaker files (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[20:58:51] (CR) Andrew Bogott: wmcs: add alerts for any node going down (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro)
[20:59:59] (CR) Andrew Bogott: "one comment about a comment, otherwise +1" [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro)
[21:08:41] (CR) Jbond: beaker: add initial beaker files (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[21:50:38] (CR) Andrew Bogott: [C: +1] wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 (owner: David Caro)
[22:22:25] (PS1) Krinkle: Limit "CentralAuth" log channel to level=info and above [mediawiki-config] - https://gerrit.wikimedia.org/r/812478 (https://phabricator.wikimedia.org/T312704)
[22:22:27] (PS1) Krinkle: Remove unused 'CentralAuthRename' log config [mediawiki-config] - https://gerrit.wikimedia.org/r/812479 (https://phabricator.wikimedia.org/T312704)
[22:57:37] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8803.service.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8826
[22:57:37] https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:19:36] (CR) Andrew Bogott: [C: -1] "Thanks for this -- it looks very complete and well-thought-through!" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 (owner: David Caro)
[23:33:13] (CR) Andrew Bogott: [C: +1] openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810854 (owner: David Caro)
[23:53:35] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state