[00:02:49] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:05] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:15] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:51] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:21:53] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:31] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [00:34:15] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f31dc256280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:34:37] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:53] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 669, active_shards: 1514, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:36:12] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 3 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10thcipriani) >>! In T326788#8526494, @RhinosF1 wrote: > Hi, > > Can som...
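(Context for the checks flapping above: the "Check systemd state" alert fires whenever any unit on the host is in a failed state, and the OpenSearch check polls the local cluster-health URL quoted in the alert. A minimal sketch of verifying both by hand, using standard commands rather than the exact probes Icinga runs, would be:)

  # on an-worker1125: confirm the host is "degraded" and see which unit is failing
  systemctl is-system-running        # prints "degraded" while any unit is failed
  systemctl --failed                 # should list systemd-timedated.service
  sudo journalctl -u systemd-timedated.service -n 50
  # on logstash1023: the health endpoint named in the OpenSearch alert
  curl -s 'http://localhost:9200/_cluster/health?pretty'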
[00:41:01] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:48:31] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 3 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Ladsgroup) I'm around in my capacity as SRE. Going to help on this. [00:49:09] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:21] hey Amir1 thanks for volunteering as tribute to help with backporting https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/879798/ I'm going to go ahead and get the backport going [00:49:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879798 (https://phabricator.wikimedia.org/T316559) (owner: 10Func) [00:49:52] Thank you for doing the backport. Do you need me for anything specific as well? [00:50:05] Should I glue my eyes to some monitor or something [00:50:30] if you can help me verify the fix once it's out on test servers: that'd be amazing :) [00:51:34] I actually saw it somewhere and was annoyed by the bug [00:51:44] ah I do [00:51:59] yup [00:52:16] ping me once it's in mwdebug [00:52:28] cool, looks like I can recreate it too from the task [00:52:30] will do [00:52:59] (zuul says 10 minutes, but zuul often speaks lies about these things) [00:55:37] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:03] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:54] (03Merged) 10jenkins-bot: LanguageDropdown: Check if the page is in talk namespaces instead [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879798 (https://phabricator.wikimedia.org/T316559) (owner: 10Func) [01:05:09] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879798|LanguageDropdown: Check if the page is in talk namespaces instead (T316559 T326788)]] [01:05:15] T326788: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 [01:05:16] T316559: Communicate when pages are not supported in other languages in the current language selector - https://phabricator.wikimedia.org/T316559 [01:11:45] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:49] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:15:17] !log thcipriani@deploy1002 thcipriani and func: Backport for [[gerrit:879798|LanguageDropdown: Check if the page is in talk namespaces instead (T316559 T326788)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [01:15:22] T326788: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 [01:15:23] T316559: Communicate when pages are not supported in other languages in the current language selector - https://phabricator.wikimedia.org/T316559 [01:15:31] ^ Amir1 ok, finally on mwdebug [01:15:37] awesome [01:16:23] thcipriani: I confirm it fixes the issue and doesn't break anything I can find [01:17:37] great! Looks good to me, too. logs look clear on mwdebug (unrelated timeout). Syncing! [01:21:10] k8s builds/deploys seem a bit sluggish, but maybe that's my Sunday night impatience... [01:21:27] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:40] thcipriani: no, the thing is that if it hasn't been rebuilt for a while, it must get fully rebuilt which takes way more time [01:21:50] and it doesn't get rebuilt because weekend [01:22:30] usually the deploy without the full rebuilt takes around 8 minutes and that's already too much because we have to do k8s and php fpm restart in the old cluster [01:22:32] ah, right, I remember danc.y mentioned he was working on a fix for this [01:23:30] that's gooood to know [01:23:54] blarg, bunch of lvs depool errors :\ [01:24:38] > Error depooling the servers: enabled/up/pooled [01:24:40] fun [01:27:10] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 4 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Ladsgroup) I'd appreciate an incident report on this: https://wikitech.... [01:28:34] I take a look [01:28:47] thanks, mw2289 is the latest example I see [01:28:58] but there are a bunch [01:29:57] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879798|LanguageDropdown: Check if the page is in talk namespaces instead (T316559 T326788)]] (duration: 24m 47s) [01:30:02] T326788: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 [01:30:02] T316559: Communicate when pages are not supported in other languages in the current language selector - https://phabricator.wikimedia.org/T316559 [01:30:28] !log 01:29:56 php-fpm-restart: 100% (in-flight: 0; ok: 184; fail: 112; left: 0) [01:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:06] failing that many restarts will cause a cascading error [01:31:14] like pybal wouldn't be able to depool all [01:31:39] we batch them, we don't try to depool more than 10% at one time [01:32:45] hmm [01:33:42] do you want me to run a rolling php-fpm restart from cumin? 
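(The rolling php-fpm restart offered here is the batched kind described a few lines up, where no more than roughly 10% of the cluster is depooled at once. A sketch of what such a cumin invocation could look like follows; the 'A:mw-codfw' alias and the batch/sleep numbers are illustrative assumptions, and the per-host script with its arguments is the one scap itself runs, quoted just below.)

  # run the same per-host restart script scap uses, but in small batches with a pause between them
  sudo cumin -b 30 -s 60 'A:mw-codfw' '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807'
  # -b 30: act on at most 30 hosts at a time (roughly 10% of the fleet); -s 60: wait 60s between batches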
[01:34:16] thcipriani: ^ let me once the scap finishes, I can do that part [01:34:55] Amir1: yep, scap is finished now, so you're clear [01:35:25] !log rolling restart of php-fpm across the fleet [01:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:28] sounds like a good plan as long as it's batched [01:35:59] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:10] any insight into depooling failures? [01:36:51] I haven't seen anything in the logs [01:37:03] maybe you can point me in the right direction? [01:37:31] (where to find logs) [01:38:49] dm'd you scap output. FWIW the command run by scap in each server is /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 [01:40:49] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:15] the reboots failed [01:45:57] nothing in icinga [01:48:50] ugh [01:50:31] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:50] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw2283.codfw.wmnet [01:55:23] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:37] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver,service=nginx [02:03:55] what a mess [02:04:44] the api cluster doesn't seem unhappy or broken, moving to the appserver cluster [02:04:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:07] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver,service=nginx [02:06:33] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 3 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Func) 05In progress→03Open [02:06:39] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 3 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Func) p:05Unbreak!→03High Lower the priority since backported. 
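(The "conftool action" entries being !logged here correspond to confctl runs on the cumin host. A rough equivalent, reusing the same selectors as the log lines above; treat it as a sketch rather than the literal commands typed:)

  # repool a single server that the failed restart left depooled
  sudo confctl select 'name=mw2283.codfw.wmnet' set/pooled=yes
  # repool an entire cluster/service in codfw
  sudo confctl select 'dc=codfw,cluster=api_appserver,service=nginx' set/pooled=yes
  # inspect the resulting state
  sudo confctl select 'dc=codfw,cluster=appserver,service=nginx' get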
[02:07:46] (JobUnavailable) firing: (5) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:03] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:22] I will personally strangle an-worker1125 [02:11:11] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:30] threats seemed to have worked [02:12:46] (JobUnavailable) firing: (12) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:53] 10SRE, 10Desktop Improvements (Vector 2022), 10Language-Team, 10Release-Engineering-Team, and 3 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Ladsgroup) The deployment went way bumpier than anticipated. I assume a... [02:22:21] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:33] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:43] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,service=parsoid-php [02:27:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:03] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:47:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:05] T327042 [02:50:06] T327042: an-worker1125 has been flapping non-stop - https://phabricator.wikimedia.org/T327042 [02:51:25] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:15] RECOVERY - Check 
systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:55] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:04:19] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:10:42] 10SRE, 10Gerrit, 10serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10demon) >>! In T165631#8485123, @Dzahn wrote: > @ayounsi It would mean a considerable effort to recreate an entire LVS service, which we just recently shut do... [03:10:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:07] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:37] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:31:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:55] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:15] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:57:45] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:27] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:07:25] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:39] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:25] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [04:38:11] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:54] (03PS7) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [04:43:24] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39138/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [04:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:47:03] (03CR) 10Stevemunene: [V: 03+1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [04:52:45] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:57:39] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:11] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:53] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:15] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:28:21] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:43] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:19] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:34] (03CR) 10Santhosh: [C: 03+2] testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) (owner: 
10KartikMistry) [05:47:20] (03Merged) 10jenkins-bot: testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) (owner: 10KartikMistry) [05:51:05] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:19] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.21.8: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:56:33] PROBLEM - Hadoop DataNode on an-worker1125 is CRITICAL: NRPE: Call to popen() failed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [05:56:33] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: NRPE: Call to popen() failed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:58:19] PROBLEM - puppet last run on an-worker1125 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.21.8: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:59:07] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:15:18] (03PS1) 10KartikMistry: Revert "testwiki: Use Parsoid in Mediawiki Core for Content Translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879986 [06:16:04] (03CR) 10KartikMistry: [C: 03+2] Revert "testwiki: Use Parsoid in Mediawiki Core for Content Translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879986 (owner: 10KartikMistry) [06:16:46] (03Merged) 10jenkins-bot: Revert "testwiki: Use Parsoid in Mediawiki Core for Content Translation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879986 (owner: 10KartikMistry) [06:19:58] PROBLEM - Host db1198 #page is DOWN: PING CRITICAL - Packet loss = 100% [06:21:54] <_joe_> yeah it is completely down [06:22:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1198 maint', diff saved to https://phabricator.wikimedia.org/P43157 and previous config saved to /var/cache/conftool/dbconfig/20230116-062211-ladsgroup.json [06:22:21] I'm gonna depooled [06:23:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maint [06:24:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maint [06:24:12] downtimed it, I go sleep and will check it once I wake up [06:24:54] <_joe_> Amir1: thanks I acked the alert and I'll take a look at the consol [06:24:58] <_joe_> *console [06:27:32] [06:32:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:23] (03CR) 10Ayounsi: [C: 03+2] network:external: add wikidough v6 + descriptions [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) 
[06:47:41] !log add 2001:67c:930::/48 to network:external in data.yaml [06:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:53:33] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:55:31] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:56:08] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:55:56. https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [06:56:08] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:55:56. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:56:08] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:55:56. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:08] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:55:56. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:56:15] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:59:33] ACKNOWLEDGEMENT - thanos.wikimedia.org tls expiry on thanos-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [06:59:33] ACKNOWLEDGEMENT - thanos.wikimedia.org requires authentication on thanos-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [06:59:33] ACKNOWLEDGEMENT - Thanos swift https on thanos-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. https://wikitech.wikimedia.org/wiki/Thanos [06:59:33] ACKNOWLEDGEMENT - SSH on thanos-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:59:33] ACKNOWLEDGEMENT - Memcached on thanos-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. https://wikitech.wikimedia.org/wiki/Memcached [06:59:33] ACKNOWLEDGEMENT - Host thanos-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. [06:59:33] ACKNOWLEDGEMENT - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T327001 - The acknowledgement expires at: 2023-01-18 06:59:18. https://wikitech.wikimedia.org/wiki/Swift [07:10:02] <_joe_> !log depooling mediawiki in codfw [07:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:15] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=(mw.*|appservers|api)-ro,name=codfw [07:13:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_research:poc.service,swift-account-stats_search:platform.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:35] 10SRE, 10ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10ayounsi) There is also the link connecting lvs2009 to row B connected to that switch. So the impact is larger than expected. [07:16:07] 10ops-codfw, 10Infrastructure-Foundations, 10conftool, 10netops, 10serviceops: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Joe) a:03Joe [07:17:05] (03PS1) 10Giuseppe Lavagetto: lvs2009: contact an etcd server not in row B [puppet] - 10https://gerrit.wikimedia.org/r/880304 (https://phabricator.wikimedia.org/T327041) [07:17:12] <_joe_> XioNoX: ^^ [07:17:24] alright [07:19:19] 10SRE, 10ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10ayounsi) [07:19:25] 10ops-codfw, 10Infrastructure-Foundations, 10conftool, 10netops, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10ayounsi) [07:20:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs2009: contact an etcd server not in row B [puppet] - 10https://gerrit.wikimedia.org/r/880304 (https://phabricator.wikimedia.org/T327041) (owner: 10Giuseppe Lavagetto) [07:26:57] <_joe_> !log restarting pybal on lvs2009 [07:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:39] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:53] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:03] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [07:32:03] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:09] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:25] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:37] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:41:59] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw22(59|6[0-9]|70).codfw.wmnet [07:44:39] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw23([12][0-9]|3[0-4]).codfw.wmnet [07:48:53] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=parse20(0[6-9]|10).codfw.wmnet [07:50:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Joe) p:05Unbreak!→03High The situation is as follows: * I depooled codfw from mediawiki; before repooling, we'll need to do a scap pull... [07:50:57] PROBLEM - mediawiki-installation DSH group on mw2268 is CRITICAL: Host mw2268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:51:08] <_joe_> ah rigfht [07:51:10] <_joe_> sigh [07:51:23] <_joe_> we'll get all those alerts, which is expected sadly [07:51:54] _joe_: should I add to the draft IR that we should have depooled [07:51:57] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:27] PROBLEM - mediawiki-installation DSH group on mw2260 is CRITICAL: Host mw2260 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:52:41] <_joe_> RhinosF1: I don't think it's your place to do so, but thanks for offering [07:53:05] _joe_: I put all the information in an IR format so it’s not on wikitech yet [07:53:06] https://phabricator.wikimedia.org/P43154 [07:53:21] Did that on Sunday so alerts / timings were there for an SRE [07:53:25] That’s what I meant [07:54:09] <_joe_> RhinosF1: right now I don't have time to discuss this, I'm trying to get us to a stable situation :) [07:54:17] PROBLEM - mediawiki-installation DSH group on mw2310 is CRITICAL: Host mw2310 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:54:43] Okay, I’ll see pings when not busy all day [07:56:29] PROBLEM - mediawiki-installation DSH group on mw2317 is CRITICAL: Host mw2317 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:56:29] PROBLEM - mediawiki-installation DSH group on mw2316 is CRITICAL: Host mw2316 is not in mediawiki-installation dsh group 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:56:55] PROBLEM - mediawiki-installation DSH group on parse2006 is CRITICAL: Host parse2006 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:57:07] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:57:27] PROBLEM - mediawiki-installation DSH group on mw2267 is CRITICAL: Host mw2267 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:58:37] PROBLEM - mediawiki-installation DSH group on mw2269 is CRITICAL: Host mw2269 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230116T0800) [08:01:27] PROBLEM - mediawiki-installation DSH group on mw2321 is CRITICAL: Host mw2321 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:01:53] 10SRE, 10ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Joe) >>! In T327001#8527049, @ayounsi wrote: > There is also the link connecting lvs2009 to row B connected to that switch. So the impact is larger than expected. Specifically, the following servers can't be reached by lvs200... [08:02:17] PROBLEM - mediawiki-installation DSH group on mw2270 is CRITICAL: Host mw2270 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:07:19] PROBLEM - mediawiki-installation DSH group on mw2312 is CRITICAL: Host mw2312 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:08:19] PROBLEM - mediawiki-installation DSH group on mw2319 is CRITICAL: Host mw2319 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:08:19] PROBLEM - mediawiki-installation DSH group on mw2320 is CRITICAL: Host mw2320 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:08:19] PROBLEM - mediawiki-installation DSH group on mw2318 is CRITICAL: Host mw2318 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:09:23] !log stopped swift_rclone_sync on ms-be1069 [08:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:47] PROBLEM - mediawiki-installation DSH group on mw2262 is CRITICAL: Host mw2262 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:11:51] PROBLEM - mediawiki-installation DSH group on mw2261 is CRITICAL: Host mw2261 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:13:50] (03CR) 10Ayounsi: [C: 03+2] BGPalerter: monitorPathNeighbors bump threshold [puppet] - 10https://gerrit.wikimedia.org/r/879747 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [08:13:58] (03PS2) 10Ayounsi: BGPalerter: monitorPathNeighbors bump threshold [puppet] - 10https://gerrit.wikimedia.org/r/879747 (https://phabricator.wikimedia.org/T230600) [08:14:03] !log oblivian@deploy1002 Synchronized README: test null deployment for T327041 (duration: 07m 
12s) [08:14:07] T327041: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 [08:16:48] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Joe) Confirmed that now scap works and we can do deployments normally. Please @papaul @ayounsi ping serviceops so that we can bring things b... [08:18:03] PROBLEM - mediawiki-installation DSH group on mw2327 is CRITICAL: Host mw2327 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:18:03] PROBLEM - mediawiki-installation DSH group on mw2329 is CRITICAL: Host mw2329 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:18:03] PROBLEM - mediawiki-installation DSH group on mw2325 is CRITICAL: Host mw2325 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:19:55] PROBLEM - mediawiki-installation DSH group on parse2009 is CRITICAL: Host parse2009 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:24:19] PROBLEM - mediawiki-installation DSH group on parse2010 is CRITICAL: Host parse2010 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:26:09] PROBLEM - mediawiki-installation DSH group on mw2311 is CRITICAL: Host mw2311 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:26:53] PROBLEM - mediawiki-installation DSH group on mw2331 is CRITICAL: Host mw2331 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:26:53] PROBLEM - mediawiki-installation DSH group on mw2333 is CRITICAL: Host mw2333 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:28:37] PROBLEM - mediawiki-installation DSH group on mw2330 is CRITICAL: Host mw2330 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:28:45] PROBLEM - mediawiki-installation DSH group on mw2334 is CRITICAL: Host mw2334 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:28:45] PROBLEM - mediawiki-installation DSH group on mw2332 is CRITICAL: Host mw2332 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:31:35] PROBLEM - mediawiki-installation DSH group on mw2323 is CRITICAL: Host mw2323 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:32:48] (03CR) 10Muehlenhoff: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [08:38:01] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:38:43] PROBLEM - mediawiki-installation DSH group on mw2313 is CRITICAL: Host mw2313 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:39:39] PROBLEM - mediawiki-installation DSH group on parse2008 is CRITICAL: Host parse2008 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:41:25] PROBLEM - mediawiki-installation DSH group on mw2326 is CRITICAL: Host mw2326 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:42:15] PROBLEM - mediawiki-installation DSH group on mw2263 is CRITICAL: Host mw2263 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:42:15] PROBLEM - mediawiki-installation DSH group on mw2259 is CRITICAL: Host mw2259 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:42:15] PROBLEM - mediawiki-installation DSH group on mw2328 is CRITICAL: Host mw2328 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:43:23] PROBLEM - mediawiki-installation DSH group on mw2315 is CRITICAL: Host mw2315 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:43:33] PROBLEM - mediawiki-installation DSH group on mw2314 is CRITICAL: Host mw2314 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:45:21] PROBLEM - mediawiki-installation DSH group on mw2322 is CRITICAL: Host mw2322 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:46:01] PROBLEM - mediawiki-installation DSH group on parse2007 is CRITICAL: Host parse2007 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:46:27] !log powercycle an-worker1125 - soft lockup traces registered in the tty, host frozen [08:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:53] PROBLEM - mediawiki-installation DSH group on mw2324 is CRITICAL: Host mw2324 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:48:03] PROBLEM - mediawiki-installation DSH group on mw2264 is CRITICAL: Host mw2264 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:48:03] PROBLEM - mediawiki-installation DSH group on mw2265 is CRITICAL: Host mw2265 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:48:03] PROBLEM - mediawiki-installation DSH group on mw2266 is CRITICAL: Host mw2266 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:50:17] RECOVERY - SSH on an-worker1125 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:19] RECOVERY - Hadoop DataNode on an-worker1125 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [08:50:19] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:51:23] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:57] PROBLEM - puppet last run on an-worker1125 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:53:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:55:35] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [08:56:14] doing a roll restart to pick up the new TLS PKI certs, first time that happens so I want to check if all goes fine etc.. [08:57:33] RECOVERY - puppet last run on an-worker1125 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:00:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:00:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:03:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.967 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:04:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:19:45] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:07] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:33:26] (03PS3) 10Muehlenhoff: quarry: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863302 (https://phabricator.wikimedia.org/T308013) [09:34:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [09:45:43] (03CR) 10Muehlenhoff: [C: 03+2] quarry: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863302 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:22:45] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:31] (03PS1) 10Muehlenhoff: Add new bastions [puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) [10:36:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [10:37:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:01] (03PS2) 10Jelto: gitlab: Allow use of any rustlang/rust image [puppet] - 10https://gerrit.wikimedia.org/r/879693 (https://phabricator.wikimedia.org/T326515) (owner: 10Legoktm) [10:43:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:41] (03CR) 10Jelto: [C: 03+2] gitlab: Allow use of any rustlang/rust image [puppet] - 10https://gerrit.wikimedia.org/r/879693 (https://phabricator.wikimedia.org/T326515) (owner: 10Legoktm) [10:47:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:48:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:48:08] !log installing libtasn1-6 security updates on Bullseye [10:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:12] (03PS1) 10Muehlenhoff: Make bast4004 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/880435 [10:54:12] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:55:02] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:56:00] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:57:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:58:18] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [10:58:21] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [10:59:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [11:01:27] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:06] 10SRE, 10Desktop Improvements (Vector 2022), 10Release-Engineering-Team, 10UniversalLanguageSelector, and 4 others: Unexpected "Page contents not supported in other languages" in non-article namespace - https://phabricator.wikimedia.org/T326788 (10Nikerabbit) [11:09:13] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
swift-account-stats_research:poc.service,swift-account-stats_thanos:prod.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:20:13] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:33] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:26:24] (03PS1) 10Muehlenhoff: Update repository key for k8s repo [puppet] - 10https://gerrit.wikimedia.org/r/880437 [11:32:02] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [11:38:41] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:42:40] (03CR) 10Muehlenhoff: [C: 03+2] Make bast4004 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/880435 (owner: 10Muehlenhoff) [11:42:47] (03PS2) 10Muehlenhoff: Make bast4004 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/880435 [11:48:54] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:49:26] (03CR) 10Jelto: [C: 03+1] "lgtm, compared to key at https://packages.cloud.google.com/apt" [puppet] - 10https://gerrit.wikimedia.org/r/880437 (owner: 10Muehlenhoff) [11:49:37] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:51:31] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:56:09] (03PS1) 10Muehlenhoff: Cleanup bastion role [puppet] - 10https://gerrit.wikimedia.org/r/880439 [11:58:43] (03CR) 10Hashar: "Replaced "kibana" with "opensearch-dashboards" ;)" [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [11:58:52] (03PS4) 10Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) [12:02:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[123].eqiad.wmnet [12:05:52] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes101[123].eqiad.wmnet [12:09:22] (03CR) 10Slyngshede: "One typo in IP6 addr." 
[puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [12:13:04] (03PS1) 10Jgiannelos: maps: Remove unused geoshapes sql reference [puppet] - 10https://gerrit.wikimedia.org/r/880443 [12:17:17] (03CR) 10Muehlenhoff: [C: 03+2] Cleanup bastion role [puppet] - 10https://gerrit.wikimedia.org/r/880439 (owner: 10Muehlenhoff) [12:27:05] (03CR) 10Muehlenhoff: [C: 03+2] Update repository key for k8s repo [puppet] - 10https://gerrit.wikimedia.org/r/880437 (owner: 10Muehlenhoff) [12:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:34:54] !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes102.eqiad.wmnet [12:35:35] !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes102[012].eqiad.wmnet [12:35:52] !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes101[5-9].eqiad.wmnet [12:36:16] !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes100[5-9].eqiad.wmnet [12:39:36] 10SRE, 10Parsoid, 10Scap, 10serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10MoritzMuehlenhoff) [12:41:43] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) https://netbox.wikimedia.org/dcim/cables/5899 Cable is Ran and plugged in [12:43:40] !log power cycled db1198 [12:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:47:18] RECOVERY - Host db1198 #page is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [12:47:24] !log hnowlan@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: service=thumbor,name=kubernetes100[5-9].eqiad.wmnet [12:50:19] !log drain eqiad-eqord link - T304712 [12:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:23] T304712: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 [12:51:27] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes1008.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:53:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes100[5-9].eqiad.wmnet [12:56:00] !log hnowlan@puppetmaster1001 
conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[1234].eqiad.wmnet [12:57:04] (03PS1) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) [12:58:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:42] (03CR) 10Filippo Giunchedi: [C: 03+1] Alerts: stop alerting on thumb number mismatch [puppet] - 10https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T326857) (owner: 10MVernon) [13:02:12] (03CR) 10Filippo Giunchedi: [C: 03+1] webperf: use rsync::quickdatacopy for arclamp data [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [13:02:14] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: use rsync::quickdatacopy for arclamp data [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [13:02:50] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39140/console" [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:02:54] (03CR) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:03:23] (03CR) 10Muehlenhoff: Add a third-party apt repo for ceph-quincy packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:04:04] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39141/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:04:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:05:03] (03CR) 10Filippo Giunchedi: [C: 03+2] arclamp: move to EnvironmentFile for generate/compress jobs [puppet] - 10https://gerrit.wikimedia.org/r/879814 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [13:05:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:08:42] (03CR) 10Filippo Giunchedi: [C: 03+2] systemd: send ::syslog output to remote destination [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [13:09:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10StephaneRebai) 05Open→03Resolved I was able to review code on Gerrit. 
Thanks [13:13:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:14:30] that's likely me ^ checking [13:17:38] (03PS2) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) [13:17:51] (03CR) 10Btullis: Add a third-party apt repo for ceph-quincy packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:23:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:23:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:24:01] 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10jcrespo) A draft has been posted at https://wikitech.wikimedia.org/wiki/Incidents/20... [13:24:05] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:25:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39142/console" [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:27:18] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39143/console" [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:28:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:29:30] (03CR) 10Muehlenhoff: "Looks good in general, two remaining questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [13:30:31] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8116799408 and 1640 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:30:33] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13227226384 and 1644 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:31:09] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9580781640 and 1678 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:31:11] RECOVERY - 
Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:31:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:33:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:33:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:34:53] !log repool eqiad-eqord link - T304712 [13:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:56] T304712: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 [13:35:13] !log disable one of 3 cr1-cr2 eqiad links - T304712 [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:44] (03CR) 10MVernon: [C: 03+2] Alerts: stop alerting on thumb number mismatch [puppet] - 10https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T326857) (owner: 10MVernon) [13:39:10] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:39:21] RECOVERY - Number of mw swift objects in eqiad greater than codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [13:39:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:40:06] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) [13:40:45] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Number of mw swift objects in eqiad greater than codfw - https://phabricator.wikimedia.org/T326857 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Resolved, thanks for the report. 
[13:42:30] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10taavi) [13:44:10] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:44:26] still me ^ [13:44:43] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 719224 and 2492 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:44:49] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2499 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:48:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:48:51] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1684648 and 2740 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:48:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:48:58] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, 10GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) According to [GitLab deprecations list](https://docs.gitlab.com/ee/update/deprecations.html#cas-omniauth-provider)... [13:49:10] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:49:29] (03PS1) 10Muehlenhoff: Extend MOUs for shubhankar/dalezhou [puppet] - 10https://gerrit.wikimedia.org/r/880471 [13:54:36] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOUs for shubhankar/dalezhou [puppet] - 10https://gerrit.wikimedia.org/r/880471 (owner: 10Muehlenhoff) [13:57:56] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [14:04:42] (03CR) 10Jcrespo: "+1 for the bacula part, I haven't checked the rest. 
but Please, plase make sure to clean up any leftovers after deployment to avoid copyin" [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [14:08:43] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_tegola:prod.service,swift-account-stats_wdqs:flink.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:55] (03PS1) 10Muehlenhoff: Add new bastions in esams/eqsin/drmrs [puppet] - 10https://gerrit.wikimedia.org/r/880477 (https://phabricator.wikimedia.org/T324974) [14:10:15] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:35] (03CR) 10Muehlenhoff: [C: 03+2] Add new bastions in esams/eqsin/drmrs [puppet] - 10https://gerrit.wikimedia.org/r/880477 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [14:16:35] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:24:15] (03PS1) 10Ayounsi: eqiad: move GRE tunnels out of FPC4 [homer/public] - 10https://gerrit.wikimedia.org/r/880478 (https://phabricator.wikimedia.org/T304712) [14:25:36] (03PS4) 10Elukey: knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) [14:29:08] (03PS2) 10Ayounsi: eqiad: move links out of FPC4 [homer/public] - 10https://gerrit.wikimedia.org/r/880478 (https://phabricator.wikimedia.org/T304712) [14:30:14] (03CR) 10Ayounsi: [C: 03+2] eqiad: move links out of FPC4 [homer/public] - 10https://gerrit.wikimedia.org/r/880478 (https://phabricator.wikimedia.org/T304712) (owner: 10Ayounsi) [14:32:05] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: 10Samtar) [14:35:27] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:40:49] (03PS1) 10Ayounsi: cr1-eqiad: no need for transport-in from codfw [homer/public] - 10https://gerrit.wikimedia.org/r/880480 [14:44:16] (03CR) 10Ayounsi: [C: 03+2] cr1-eqiad: no need for transport-in from codfw [homer/public] - 10https://gerrit.wikimedia.org/r/880480 (owner: 10Ayounsi) [14:45:48] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Thanks John and Papaul, as soon as Netbox is updated this can be closed! 
[14:47:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:46] (03CR) 10Btullis: [V: 03+1] Add a third-party apt repo for ceph-quincy packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [14:53:49] (03PS3) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) [14:54:35] (03CR) 10Btullis: Add a third-party apt repo for ceph-quincy packages (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [14:57:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [14:58:18] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) Awesome, I think that covers everything we need. Should be able to try it out soon. Thanks a lot... [15:17:43] (03PS8) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [15:35:13] (03PS9) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [15:35:58] (03PS19) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [15:41:06] (03CR) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [15:43:04] (03PS1) 10AOkoth: vrts: add vrts2001 hieradata and database port [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) [15:45:27] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/880488/39144/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [15:46:43] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3976960128 and 9812 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:46:49] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 10000855840 and 9820 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:46:49] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3122389888 and 9820 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:47:17] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6978595416 and 9847 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:47:49] !log jiji@cumin1001 
START - Cookbook sre.hosts.reimage for host mc1042.eqiad.wmnet with OS bullseye [15:51:31] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 229448 and 134 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:53:15] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 303527024 and 238 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:54:51] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:55:19] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 94 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:56:27] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 162 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:56:33] (03PS2) 10Muehlenhoff: Add new bastions [puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) [15:56:49] (03CR) 10Muehlenhoff: Add new bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [15:59:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1042.eqiad.wmnet with reason: host reimage [16:00:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/879557 (owner: 10Giuseppe Lavagetto) [16:00:17] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [16:01:55] RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:05] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/880493 [16:02:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1042.eqiad.wmnet with reason: host reimage [16:02:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Gehel) [16:04:17] 10SRE, 10API Platform, 10Traffic, 10Discovery-Search (Current work): Generic strategy to deal with high volume / expensive traffic from cloud providers - https://phabricator.wikimedia.org/T326782 (10Gehel) [16:04:47] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/880493 (owner: 10Muehlenhoff) [16:05:14] (03Merged) 10jenkins-bot: Update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/879557 (owner: 10Giuseppe Lavagetto) [16:16:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1042.eqiad.wmnet with OS bullseye [16:18:14] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) @Papaul, this is my bad, thank you for taking care of Netbox (or the gentle soul that did so). 
[16:20:09] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11459016752 and 1583 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:20:17] PROBLEM - Query Service HTTP Port on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:20:27] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17297854664 and 1601 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:20:27] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18798364848 and 1601 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:23:49] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1044.eqiad.wmnet with OS bullseye [16:26:25] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:56] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [16:30:57] (03PS1) 10Filippo Giunchedi: aptrepo: update Grafana url and key [puppet] - 10https://gerrit.wikimedia.org/r/880496 [16:35:02] (03PS1) 10Hnowlan: thumbor: move liveness check to hit haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/880498 (https://phabricator.wikimedia.org/T233196) [16:35:36] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1044.eqiad.wmnet with reason: host reimage [16:36:59] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 106722520 and 2594 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:38:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1044.eqiad.wmnet with reason: host reimage [16:38:54] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3989602352 and 2708 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:39:02] (03CR) 10Ladsgroup: "This needs to have gradual deployment. e.g. 
add something in CommonSetting.php and then with mt_rand being between 0 and 0.1, then turn on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [16:39:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] thumbor: move liveness check to hit haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/880498 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:41:09] (03CR) 10Effie Mouzeli: [C: 03+2] maps: Remove unused geoshapes sql reference [puppet] - 10https://gerrit.wikimedia.org/r/880443 (owner: 10Jgiannelos) [16:41:35] (03PS1) 10Elukey: kserve: upgrade to version 0.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/880499 (https://phabricator.wikimedia.org/T325528) [16:41:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "should be okay to deploy tomorrow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878927 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [16:42:40] (03CR) 10Hnowlan: [C: 03+2] thumbor: move liveness check to hit haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/880498 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:43:10] (03PS2) 10Elukey: kserve: upgrade to version 0.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/880499 (https://phabricator.wikimedia.org/T325528) [16:43:48] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1402848 and 3003 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:46:16] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 344232 and 3150 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:47:32] (03Merged) 10jenkins-bot: thumbor: move liveness check to hit haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/880498 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:53:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1044.eqiad.wmnet with OS bullseye [16:53:30] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:56:52] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:58] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35700768 and 3792 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:01:22] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5016 and 4056 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:02:39] (03PS1) 10Filippo Giunchedi: WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) [17:04:10] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:04:29] (03CR) 10CI reject: [V: 04-1] WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo 
Giunchedi) [17:04:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [17:05:35] (03CR) 10Filippo Giunchedi: "This is only a sketch of a solution, tests can be run locally with (docker or podman required)" [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [17:05:50] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1556264 and 4324 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:06:00] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 580712 and 4334 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:06:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [17:07:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[1234].eqiad.wmnet [17:28:13] 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10lmata) >>! In T325477#8527736, @jcrespo wrote: > A draft has been posted at https://... [17:39:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[1234].eqiad.wmnet [17:47:04] (03PS1) 10Zabe: admin: add .gitconfig for zabe [puppet] - 10https://gerrit.wikimedia.org/r/880505 [18:02:31] 10ops-eqiad, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10jcrespo) [18:07:56] 10ops-eqiad, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10jcrespo) [18:09:11] 10ops-eqiad, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10jcrespo) Datacenter ops, please request a memory module replacement, I think this host is still under warranty (Willy or Rob may confirm). Please then get in contact with @Ladsgroup to prepare the host for h... [18:16:09] Since 12:20 there has been a consistent background level of memory exhaustions - not sure if it is deployment-caused, traffic-caused, or caused by some other maintenance process. It is very small but consistent (45 errors every 10 minutes) [18:16:27] s/errors/exceptions/ [18:16:53] seems to affect only dewiki, so betting on traffic-related [18:17:33] and it is api [18:18:31] funnily, it almost matches the disappearance of jsonTruncated warnings [18:41:14] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:47:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:48:07] I've reported it at https://phabricator.wikimedia.org/T327111 [18:58:40] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [19:03:37] 10SRE, 10ops-eqiad, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10jcrespo) a:03Jclark-ctr Assigning to john as per robh suggestion.
[19:12:58] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:23:32] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:41:11] (03PS1) 10Ssingh: Release 0.44.0+ds1-1 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/880530 (https://phabricator.wikimedia.org/T325557) [19:43:24] (03PS1) 10Jdrewniak: Add enwiki to desktop-improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880531 (https://phabricator.wikimedia.org/T326892) [19:47:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:28] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3314809496 and 14022 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:47:28] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3919700120 and 14022 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:48:06] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8439458792 and 14060 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:50:42] (03CR) 10Ssingh: "Debian packaging is ready for review. Tested a successful built with this CR cherry-picked on build2001." [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/880530 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [19:52:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:35] (03PS1) 10Jdrewniak: Show edit button in sticky header for desktop-improvement wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880533 (https://phabricator.wikimedia.org/T324799) [20:01:16] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5090507056 and 14850 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:04:54] 10SRE, 10Parsoid, 10Scap, 10serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10Arlolra) > Bastionhosts are used by parsoid deployers to restart parsoid machines and they use the dsh groups that are maintained in scap::dsh. https://github.com/wikimedia/puppet/c... 
[20:05:50] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:14:00] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 190739800 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:16:24] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:18:48] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:00] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:19:24] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:22:06] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:22:34] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 262 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:23:10] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 299 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:47:56] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 323262552 and 1783 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:48:10] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18856991648 and 1797 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:48:10] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 15833152728 and 1798 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:48:46] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18789213968 and 1833 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:49:10] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3279757344 and 1857 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:52:44] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3936 and 2071 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:53:56] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4520 and 2144 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:12:08] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2009152 and 3235 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:14:18] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 434088 and 3366 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:19:54] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:29:42] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 311140824 and 4290 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:32:54] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32 and 4481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:47:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:18:26] PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 892143 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops