[00:00:04] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:05] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T0000).
[00:00:05] Juan_90264: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:01] Juan_90264: I'm still around so I can deploy for you
[00:02:47] (CR) Ahmon Dancy: [C: +2] Change the Traditional Chinese and Simplified Chinese logo for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/751530 (https://phabricator.wikimedia.org/T298550) (owner: Juan90264)
[00:04:08] (Merged) jenkins-bot: Change the Traditional Chinese and Simplified Chinese logo for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/751530 (https://phabricator.wikimedia.org/T298550) (owner: Juan90264)
[00:05:44] Juan_90264: Changes have been pulled to mwdebug. Can you check it out before I proceed?
[00:06:45] Of course
[00:06:54] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:08:04] (CR) Cwhite: [C: +2] "PCC NOOP https://puppet-compiler.wmflabs.org/pcc-worker1001/33148/" [puppet] - https://gerrit.wikimedia.org/r/751785 (owner: Cwhite)
[00:08:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:34] (CR) Razzi: [C: +2] clouddb: depool clouddb1018 to update views [puppet] - https://gerrit.wikimedia.org/r/751824 (https://phabricator.wikimedia.org/T298505) (owner: Razzi)
[00:12:20] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:15:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:40] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:17:52] Juan_90264: Any report? I'm chained to my computer in the meantime.
[00:19:32] dancy: The common logo (zhwikinews - zh) is ok, but I'm still testing the logo in zh-hans
[00:19:40] ok
[00:19:52] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:26] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:26:53] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[00:27:23] dancy: I tested and approve
[00:27:35] great.. moving forward
[00:29:34] !log dancy@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:751530|Change the Traditional Chinese and Simplified Chinese logo for zhwikinews (T298550)]] (duration: 01m 17s)
[00:30:05] dancy@deploy1002: Failed to log message to wiki. Somebody should check the error logs.
[00:30:06] T298550: Requesting logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T298550
[00:30:31] Hmm.. that's a new one.
[00:30:40] huh
[00:30:56] !log dancy@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:751530|Change the Traditional Chinese and Simplified Chinese logo for zhwikinews (T298550)]] (duration: 01m 07s)
[00:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:10] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:16] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:31:53] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[00:32:14] !log dancy@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:751530|Change the Traditional Chinese and Simplified Chinese logo for zhwikinews (T298550)]] (duration: 01m 07s)
[00:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:26] Juan_90264: all set
[00:32:46] I'm done working for the day. Have a good one everybody
[00:32:58] Perfect, thanks dancy!
[00:33:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ldap site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:36:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:36:18] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:49:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={sidekiq,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:51:30] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:05] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T0100).
[01:02:48] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:20] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:21:28] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:22:10] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:14] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 82.15 ms
[01:34:10] SRE, Data-Engineering, Observability-Metrics, Superset: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (odimitrijevic)
[01:35:02] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:49] SRE, Analytics, Data-Engineering, Discovery, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (odimitrijevic)
[01:43:32] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:54:13] SRE, Analytics, Data-Engineering: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (odimitrijevic)
[02:07:50] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:22] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:55:00] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:57:50] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:36] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:31] SRE, Analytics, Data-Engineering, Traffic-Icebox: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (odimitrijevic)
[03:23:17] SRE, Analytics, Data-Engineering, Research-Backlog, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (odimitrijevic)
[03:29:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:26] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:14] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:59:02] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:54] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:04] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:31] SRE, Analytics, Data-Engineering, Traffic-Icebox: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source - https://phabricator.wikimedia.org/T225786 (odimitrijevic)
[04:24:48] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:16] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:56] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:56:27] (CR) 20after4: [C: +1] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy)
[05:01:24] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:00] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:54] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:40] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:12] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:20] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:10:08] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:02] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:31:42] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:28] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:34] !log revoke DROP from wikiadmin globally
[07:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:02] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:20] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:57] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751914
[07:36:32] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:44] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751918
[07:41:48] (PS1) PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751919
[07:49:24] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:28] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:07:08] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:47] (CR) ZPapierski: [C: +1] sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson)
[08:20:46] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:05] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (hashar) Resolved→Open I no more receive alarms from `contint2001.mgmt` wh...
[08:38:52] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:34] (CR) Hashar: "recheck due to some unit tests having a time based race condition ( AssertionError: '0.0.1-20220105-183950' != '0.0.1-20220105-183951' )" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: Hashar)
[08:50:28] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm) The above change still needs to be deployed, I won't have time until mid next week if someone wants to beat me to it.
[08:51:45] (PS4) Hashar: Provide current $PATH to the verify script [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[08:52:34] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:56] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-omega-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:44] (CR) Hashar: [C: +1] "Rebased on top of https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/747104/ which fixes mypy related build issues" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[09:03:10] PROBLEM - MD RAID on elastic2051 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:03:11] ACKNOWLEDGEMENT - MD RAID on elastic2051 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T298674 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:03:16] SRE, ops-codfw: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot)
[09:10:44] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:00] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[09:15:33] ^ is because elastic2051 went down
[09:17:22] SRE, ops-codfw, Discovery-Search: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (dcausse) elastic2051 being an eligible master on the omega cluster we might perhaps want to change the list of masters if this host is going to be down for long.
[09:21:52] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:39] (CR) David Caro: {p,r}:gerrit:migration/migration_base: remove unused role/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:34:36] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:36:48] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:38:56] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/738370 (https://phabricator.wikimedia.org/T187897) (owner: Hashar)
[09:40:49] (CR) David Caro: statsd: remove unused module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:41:10] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:27] hey, lists.wikimedia.org is getting seconds to answer for almost any request on my end.
[09:50:40] (CR) David Caro: [C: +1] "I have not tested it (don't have mac), but looks good (nits can be ignored)." [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[09:51:52] (CR) David Caro: [C: +2] zuul: send errors from git-daemon to client [puppet] - https://gerrit.wikimedia.org/r/738370 (https://phabricator.wikimedia.org/T187897) (owner: Hashar)
[09:52:37] ACKNOWLEDGEMENT - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service Btullis Investigating https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:49] !log Restarting zuul-merger on contint2001 and contint1001 | https://gerrit.wikimedia.org/r/c/operations/puppet/+/738370/ | T187897
[09:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:54] T187897: fatal: remote error: access denied or repository not exported: /mediawiki/extensions/ReadingLists - https://phabricator.wikimedia.org/T187897
[10:01:20] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:01:40] (CR) David Caro: Be strict on undefined variables such as seed_image (2 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: Hashar)
[10:02:18] PROBLEM - Device not healthy -SMART- on elastic2051 is CRITICAL: cluster=elasticsearch device=sda instance=elastic2051 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2051&var-datasource=codfw+prometheus/ops
[10:24:37] (CR) Hashar: "I am rebasing due to a conflict with https://gerrit.wikimedia.org/r/c/operations/puppet/+/738370/ which I got merged this morning. There" [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[10:24:47] (PS8) Hashar: Refactor git-daemon use in profile::zuul::merger [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[10:25:24] (CR) jerkins-bot: [V: -1] Refactor git-daemon use in profile::zuul::merger [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[10:25:48] (CR) Jelto: [C: +2] charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[10:25:59] (CR) jerkins-bot: [V: -1] charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[10:27:22] (PS9) Hashar: Refactor git-daemon use in profile::zuul::merger [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[10:28:04] (PS7) Jelto: charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750)
[10:29:47] (CR) Hashar: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33150/" [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[10:30:43] (CR) David Caro: {role:,profile:,}peek: remove unused classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:31:16] (CR) David Caro: [C: +2] service::deploy::scap: remove unused define [puppet] - https://gerrit.wikimedia.org/r/751732 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:35:11] (CR) JMeybohm: [C: +1] charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[10:35:47] (CR) Jelto: [C: +2] charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[10:39:28] (Merged) jenkins-bot: charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[10:48:44] (CR) David Caro: osm: remove unused profile/role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751703 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:49:14] (CR) David Caro: [C: +2] r:wmcs:openstack:eqiad1:cumin_controller: remove unused role [puppet] - https://gerrit.wikimedia.org/r/751461 (https://phabricator.wikimedia.org/T234462) (owner: David Caro)
[10:50:20] (CR) Ladsgroup: "Generally looks good. I suggest splitting this to two patches, first introducing ClusterConfig and the second using it, given the fact tha" [mediawiki-config] - https://gerrit.wikimedia.org/r/749717 (owner: Giuseppe Lavagetto)
[10:51:40] (CR) David Caro: [C: +2] p:wmcs::nfs::misc/misc_backup/backup_keys: remove unused profiles [puppet] - https://gerrit.wikimedia.org/r/751460 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:52:50] (CR) David Caro: [C: +2] r:wmcs:paws:k8s:etcd: remove unused role [puppet] - https://gerrit.wikimedia.org/r/751463 (https://phabricator.wikimedia.org/T188912) (owner: David Caro)
[10:53:19] (CR) David Caro: [C: +2] sonofagridengine: cleanup unused classes [puppet] - https://gerrit.wikimedia.org/r/751456 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:53:50] (CR) David Caro: [C: +2] lshell: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751130 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:55:53] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[10:56:06] (CR) David Caro: [C: +2] labs_lvm:swap: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751103 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:56:34] (CR) David Caro: [C: +2] profile::ceph::common: remove unused profile [puppet] - https://gerrit.wikimedia.org/r/751403 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:58:56] (CR) David Caro: [C: +2] service::packages: remove unused define [puppet] - https://gerrit.wikimedia.org/r/751734 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[11:00:02] (CR) David Caro: [C: +2] role::memcached: remove unused role [puppet] - https://gerrit.wikimedia.org/r/751727 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[11:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1100).
[11:00:08] (PS2) David Caro: role::memcached: remove unused role [puppet] - https://gerrit.wikimedia.org/r/751727 (https://phabricator.wikimedia.org/T272559)
[11:02:32] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:05:53] (PS3) Jelto: changeprop/eventgate: bump kafka-dev dependencie to 0.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750)
[11:07:44] (CR) JMeybohm: [C: +1] changeprop/eventgate: bump kafka-dev dependencie to 0.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[11:10:56] (CR) Hnowlan: conftool: clean up references to obsolete restbase service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747098 (https://phabricator.wikimedia.org/T244843) (owner: Hnowlan)
[11:11:43] (CR) Jelto: [C: +2] changeprop/eventgate: bump kafka-dev dependencie to 0.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[11:15:00] (Merged) jenkins-bot: changeprop/eventgate: bump kafka-dev dependencie to 0.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750) (owner: Jelto)
[11:21:00] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[11:21:31] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[11:28:29] (Abandoned) Cparle: Filter out non-string keys/values from query string before using [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747702 (https://phabricator.wikimedia.org/T297828) (owner: Lucas Werkmeister (WMDE))
[11:29:30] (CR) Hnowlan: [C: +2] maps: write tegola swift credentials out to file [puppet] - https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[11:30:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:32:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:47:20] (CR) Hashar: Be strict on undefined variables such as seed_image (2 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: Hashar)
[11:48:59] (CR) David Caro: [C: +2] r:analytics_test_cluster::{turnilo,webserver}: remove unused roles [puppet] - https://gerrit.wikimedia.org/r/751714 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[12:00:05] Amir1, Lucas_WMDE, and apergos: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1200).
[12:00:19] nothing to do, it seems
[12:00:21] no trainees signed up for training, no patches in the window
[12:06:43] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[12:09:10] (PS1) Hnowlan: maps: correctly template swift credentials [puppet] - https://gerrit.wikimedia.org/r/751928 (https://phabricator.wikimedia.org/T292700)
[12:12:19] (CR) Hnowlan: "lgtm but adding some data engineering folks just to be sure!" [puppet] - https://gerrit.wikimedia.org/r/751089 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[12:45:30] (CR) Hashar: [C: +1] Provide current $PATH to the verify script (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[12:48:31] (PS5) Hashar: Provide current $PATH to the verify script [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[13:09:05] SRE, Fundraising-Backlog: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (jgleeson) Thanks @Dzahn I've just tried ACKing another alert after logging in with all lower case chars but the outcome is still the same sadly. {F34909313}
[13:12:32] ö/back
[13:20:24] (PS1) Ssingh: hieradata: add Wikidough cluster [puppet] - https://gerrit.wikimedia.org/r/751937
[13:51:57] !log deploy cfssl_1.6.1-0+deb9u1_amd64 to stretch systems
[13:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:04] twentyafterfour and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1400).
[14:07:45] (PS1) Majavah: wikitech: Re-enable users after unblock [mediawiki-config] - https://gerrit.wikimedia.org/r/751941 [14:11:09] (PS2) Ssingh: hieradata: add Wikidough cluster [puppet] - https://gerrit.wikimedia.org/r/751937 [14:13:52] (CR) Ema: [C: +1] hieradata: add Wikidough cluster [puppet] - https://gerrit.wikimedia.org/r/751937 (owner: Ssingh) [14:14:56] (CR) Ssingh: [C: +2] hieradata: add Wikidough cluster [puppet] - https://gerrit.wikimedia.org/r/751937 (owner: Ssingh) [14:38:39] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) [14:39:24] (CR) Ottomata: [C: +1] e:sevice:consumer: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751089 (https://phabricator.wikimedia.org/T272559) (owner: David Caro) [14:39:41] SRE, Parsoid-Tests, Traffic, serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry) Resolved→Open Broken again .. not sure if someone reverted the patch or something else overwrote your changes but https:... [14:41:00] (CR) Ottomata: [C: +1] "These roles are for use in cloud vps when testing kafka, so they were used."
[puppet] - 10https://gerrit.wikimedia.org/r/751723 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:41:35] (03CR) 10Ottomata: [C: 03+1] r:cloud_analytics: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751716 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:41:44] (03CR) 10Ottomata: [C: 03+1] jmxtrans: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751112 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:41:53] (03CR) 10Ottomata: [C: 03+1] b:h:j:{metatstore,server}: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751080 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:42:14] (03CR) 10Ottomata: [C: 03+1] icinga:nsca:client: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:42:24] (03CR) 10David Caro: [C: 03+2] e:sevice:consumer: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751089 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:43:09] (03CR) 10Ottomata: [C: 03+1] c:kafka::mirrors: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751086 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:43:26] (03CR) 10David Caro: [C: 03+2] r:kafka::simple::mirror: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751723 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:43:41] (03CR) 10Jgiannelos: [C: 03+1] maps: correctly template swift credentials [puppet] - 10https://gerrit.wikimedia.org/r/751928 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [14:43:55] (03CR) 10David Caro: [C: 03+2] r:cloud_analytics: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751716 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:44:15] (03CR) 10Ottomata: [C: 03+1] bigtop:spark: remove unused modules [puppet] - 
10https://gerrit.wikimedia.org/r/751083 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:44:21] (03CR) 10Ottomata: [C: 03+1] b:hadoop:httpfs: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751082 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:44:51] (03CR) 10David Caro: [C: 03+2] b:hadoop:httpfs: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751082 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:45:10] (03CR) 10David Caro: [C: 03+2] bigtop:spark: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751083 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:46:06] (03CR) 10David Caro: [C: 03+2] b:h:j:{metatstore,server}: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751080 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:46:23] (03CR) 10David Caro: [C: 03+2] jmxtrans: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751112 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:47:03] (03CR) 10David Caro: [C: 03+2] c:kafka::mirrors: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751086 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:47:34] (03CR) 10David Caro: [C: 03+2] icinga:nsca:client: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:53:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:00:08] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] (03PS1) 10Btullis: Add the aqs_next hosts to the deployment targets [software/logstash-logback-encoder] - 
10https://gerrit.wikimedia.org/r/751948 (https://phabricator.wikimedia.org/T297460) [15:07:51] (03PS1) 10Cparle: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751836 (https://phabricator.wikimedia.org/T297484) [15:09:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet [15:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:44] (03PS1) 10Majavah: Clean up nova-network remains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751949 [15:36:31] (03PS1) 10Majavah: reverse-proxy: add drmrs ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751952 (https://phabricator.wikimedia.org/T282787) [15:40:52] (03CR) 10BryanDavis: [C: 03+1] wikitech: Re-enable users after unblock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751941 (owner: 10Majavah) [15:41:19] (03CR) 10Ahmon Dancy: [C: 03+1] Refactor git-daemon use in profile::zuul::merger [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [15:42:53] twentyafterfour: hashar: are you using the current train slot or can I sneak in a operations/mediawiki-config change? [15:43:27] * bd808 just tabbed over here to check on the same thing [15:43:40] is there a current train slot? 
[15:43:46] jouncebot: now [15:43:46] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1400) [15:44:19] not using it so go ahead [15:44:44] cool thx [15:45:00] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751941 (owner: 10Majavah) [15:45:55] (03Merged) 10jenkins-bot: wikitech: Re-enable users after unblock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751941 (owner: 10Majavah) [15:50:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:22] (03PS1) 10Majavah: wikitech: Re-add missing use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751955 [15:50:36] bd808: ^ mind quickly reviewing that too? [15:51:02] (03CR) 10BryanDavis: [C: 03+1] wikitech: Re-add missing use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751955 (owner: 10Majavah) [15:51:09] (03CR) 10Majavah: [C: 03+2] wikitech: Re-add missing use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751955 (owner: 10Majavah) [15:51:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:51:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:23] taavi: sorry for not noticing that they were still needed [15:51:38] I guess I'm too used to Phan noticing these things [15:51:54] (03Merged) 10jenkins-bot: wikitech: Re-add missing use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751955 (owner: 10Majavah) [15:52:24] !log mwdebug-deploy@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:32] (03PS1) 10Aqu: admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 [15:53:34] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/751956 (owner: 10Aqu) [15:53:54] * taavi sees some errors in logstash :( [15:55:02] (03CR) 10jerkins-bot: [V: 04-1] admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 (owner: 10Aqu) [15:55:56] bd808: do you happen to know what could cause "Phab user.ldapquery error '{"result":null,"error_code":"ERR-INVALID-SESSION","error_info":"Session key is not present."}'"? [15:57:04] If I channel my inner Reedy, it's because the session key is not present. ;) [15:57:23] * bd808 stares at code [15:57:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:44] as far as I can see, it worked last night, now it doesn't after I moved some things around [15:59:04] where are you seeing the error? [15:59:19] In wikitech logs after deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751941/ [15:59:30] 10SRE, 10Discovery-Search (Current work): Consider filesystem/disk based improvements on WQDS servers - https://phabricator.wikimedia.org/T298570 (10Kormat) Minor nit: XFS was introduced in 1994. Ext4 was introduced in 2008, 14 years _later_. :) [16:00:13] use ( $wmfPhabricatorApiToken ) got removed? 
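For readers unfamiliar with the scoping rule behind the question above: in PHP, a plain function does not see global variables unless it declares them with `global`, and an anonymous function only sees outer variables listed in its `use (...)` clause. Python, by contrast, lets a function read a module-level name without any declaration. A minimal Python sketch of that contrast follows. The names `API_TOKEN`, `read_global_implicitly`, `read_passed_in` and `make_callback` are hypothetical and do not come from the wikitech config.

```python
# Hypothetical stand-in for a module-level config value; not the real token.
API_TOKEN = "example-token"


def read_global_implicitly():
    # Python looks up an unassigned name in the enclosing module scope,
    # so reading a global needs no declaration. The equivalent plain PHP
    # function would need `global $var;` first.
    return API_TOKEN


def read_passed_in(token):
    # Passing the value as an argument avoids relying on outer scope at all
    # and works the same way in both languages.
    return token


def make_callback(token):
    # Rough analogue of PHP's `function () use ( $token ) { ... }`:
    # the inner function carries the value it was created with.
    # (PHP's `use` copies the value at definition time; here `token`
    # is never reassigned, so the result is the same.)
    def callback():
        return token

    return callback


if __name__ == "__main__":
    print(read_global_implicitly())
    print(read_passed_in("abc"))
    print(make_callback("xyz")())
```

Passing the value in explicitly, as in `read_passed_in`, keeps the dependency visible at the call site and does not depend on either language's scoping rules.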
[16:00:51] (03CR) 10Zabe: wikitech: Re-enable users after unblock (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751941 (owner: 10Majavah) [16:00:53] seems like that'd be the cause [16:01:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:57] and the '$username = strtolower( $block->getTargetName() )' in line 203 probably needs to be removed [16:02:12] zabe: yeah, I already saw that but wanted to focus on the phab first [16:02:22] ah ok [16:02:51] right, this isn't Python.. I need a `global $wmfPhabricatorApiToken;` even when only reading I think [16:03:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:19] I'm getting confused by the different syntax (`global $wmfPhabricatorApiToken;` vs `use ( $wmfPhabricatorApiToken )`) [16:03:23] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add the aqs_next hosts to the deployment targets [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/751948 (https://phabricator.wikimedia.org/T297460) (owner: 10Btullis) [16:03:35] taavi: or just pass it into the function you extracted as an arg [16:03:53] which is what the `use` does [16:04:03] yeah ^ that [16:04:28] yeah, I'll do that [16:05:06] (03PS1) 10Bking: elasticsearch: changed master eligible on codfw omega to 2052 [puppet] - 10https://gerrit.wikimedia.org/r/751958 (https://phabricator.wikimedia.org/T298674) [16:05:33] (03PS2) 10Aqu: admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 [16:06:16] 
(03CR) 10jerkins-bot: [V: 04-1] admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 (owner: 10Aqu) [16:06:50] (03PS1) 10Majavah: wikitech: Fix credential access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751960 [16:06:57] that should hopefully work [16:07:47] (03CR) 10Zabe: admin: create shell user aqu, add to analytics-privatedata-users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/751956 (owner: 10Aqu) [16:08:01] could any of you quickly double-check before merging? [16:08:31] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751958 (https://phabricator.wikimedia.org/T298674) (owner: 10Bking) [16:08:48] (03CR) 10Bking: [C: 03+2] elasticsearch: changed master eligible on codfw omega to 2052 [puppet] - 10https://gerrit.wikimedia.org/r/751958 (https://phabricator.wikimedia.org/T298674) (owner: 10Bking) [16:10:13] (03CR) 10Zabe: [C: 03+1] wikitech: Fix credential access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751960 (owner: 10Majavah) [16:10:30] (03CR) 10Majavah: [C: 03+2] wikitech: Fix credential access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751960 (owner: 10Majavah) [16:10:58] !log btullis@deploy1002 Started deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production [16:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:18] (03Merged) 10jenkins-bot: wikitech: Fix credential access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751960 (owner: 10Majavah) [16:12:24] (03PS2) 10Hnowlan: maps: correctly template swift credentials [puppet] - 10https://gerrit.wikimedia.org/r/751928 (https://phabricator.wikimedia.org/T292700) [16:14:56] it works on phabricator [16:15:33] I see "[6fd9c1ad-4825-4680-96a3-bbc41fbeb54a] /wiki/Special:Unblock/Majavah_test PHP Notice: Undefined variable: username" for gerrit, but that's a simple fix and only 
affects logging [16:16:00] maybe we should consider enabling phan in mediawiki-config? [16:17:15] (03PS1) 10Majavah: wikitech: Fix logging for Gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751962 [16:17:17] (03PS3) 10Aqu: admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) [16:17:25] if it understands the repo ($wgConf etc strangeness), then sure [16:17:34] that should be the final patch. sorry this is such a mess [16:18:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:14] !log btullis@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production (duration: 07m 16s) [16:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:20] (03CR) 10Zabe: [C: 03+1] wikitech: Fix logging for Gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751962 (owner: 10Majavah) [16:18:30] (03CR) 10Majavah: [C: 03+2] wikitech: Fix logging for Gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751962 (owner: 10Majavah) [16:18:30] !log btullis@deploy1002 Started deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production [16:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] (03CR) 10jerkins-bot: [V: 04-1] admin: create shell user aqu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) (owner: 10Aqu) [16:19:11] !log btullis@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production (duration: 00m 41s) [16:19:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:59] (03Merged) 10jenkins-bot: wikitech: Fix logging for Gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751962 (owner: 10Majavah) [16:20:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:51] !log taavi@deploy1002 Synchronized wmf-config/wikitech.php: wikitech: Re-enable Phabricator and Gerrit users after unblock (duration: 01m 09s) [16:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:09] ok I think I'm done touching production [16:25:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:26:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:43] (03PS1) 10Andrew Bogott: designate sink: fix proxy cleanup when proxy domain == project domain 
[puppet] - 10https://gerrit.wikimedia.org/r/751963 (https://phabricator.wikimedia.org/T298681) [16:33:43] !log reset wikitech email for User:Iniquity per T298683 [16:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:47] T298683: Account recovery help needed for Developer account Iniquity - https://phabricator.wikimedia.org/T298683 [16:34:46] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Yann) One more: https://commons.wikimedi... [16:37:17] !log restarting elastic2052 for configuration change - T298674 [16:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:20] T298674: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 [16:39:43] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10colewhite) [16:39:54] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:41:56] !log otto@deploy1002 Started deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production [16:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:30] !log otto@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production (duration: 00m 34s) [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1700). 
[17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:11:02] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:38] huh, globalblocks does not seem to be updated when global renames are performed [17:14:43] I guess it's pretty rare that stewards are renamed, but nice to fix... [17:18:51] yeah, we are getting a bunch of 'Blocker must be a local user or a name that cannot be a local user' now [17:25:50] zabe: is there a task / something we can do to fix the errors? [17:26:04] T298707 [17:26:04] T298707: InvalidArgumentException: Blocker must be a local user or a name that cannot be a local user - https://phabricator.wikimedia.org/T298707 [17:27:21] long-term let's update that to use CentralIdLookup instead of storing usernames [17:27:45] we can probably write a small maintenance script to fix the currently wrong entries [17:28:17] yeah, sounds reasonable [17:28:25] do you want to write it or should I? [17:29:59] I can do it [17:30:10] thanks [17:38:58] (PS4) Aqu: admin: create shell user aqu, add to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) [17:39:50] (CR) Ebernhardson: "Not sure who i should ping for merge, but we are hoping to use this next tuesday (jan 11) to load data into wcqs servers." [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [17:42:56] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@6f5caf9]: allow for null columns in export to relforge [17:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:30] (CR) Aqu: "I've put the user's definition in the right group, fixed the commit message, and set my uid."
[puppet] - 10https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) (owner: 10Aqu) [17:45:07] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@6f5caf9]: allow for null columns in export to relforge (duration: 02m 11s) [17:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:01] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10wiki_willy) I just brought up the issue on my regular call with our Dell account reps today. If things still don't work... [18:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1800). [18:00:08] !log btullis@deploy1002 Started deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production [18:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:18] !log btullis@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@fb10de1] (aqs): Deploying logstash-logback-encoder to production (duration: 00m 09s) [18:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:29] (03CR) 10Dzahn: "let me keep it one more round of cleanup please, I will delete it myself if I don't end up using it again this year." 
[puppet] - 10https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:10:13] (03CR) 10Dzahn: [C: 03+1] "+1 per Ori" [puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:10:39] 7 [18:11:32] (03CR) 10Dzahn: [C: 03+1] "While not necessarily convinced that anything has been restored once it's gone, I am in no way meaning to block this or have a strong opin" [puppet] - 10https://gerrit.wikimedia.org/r/751703 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:13:03] 8 [18:13:20] 9 [18:23:36] A [18:29:20] (03CR) 10Dzahn: "I would not delete this unless we think we'll never use passive checks with Icinga." [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:30:05] (03CR) 10Dzahn: "currently only done by FR and they forked from our puppet repo I guess. but passive checks wouldn't be bad for scaling" [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:30:09] (03PS1) 10AOkoth: kubernetes: point to new kubestage node [dns] - 10https://gerrit.wikimedia.org/r/751976 (https://phabricator.wikimedia.org/T293729) [18:31:18] (03CR) 10Dzahn: "I would suggest to let observability decide about Icinga related things." [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:32:52] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) its not 404ing for me right now. I suspect it was still cached on some of the caching servers. 
change is not reverted [18:35:19] (03CR) 10Dzahn: "respectfully removing myself until releng wants to do this, I have waited myself with the identical change" [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:39:08] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) [18:40:12] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10ssastry) 05Open→03Resolved Okay, thanks! :) Yes, working for me as well now. [18:43:10] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:06] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33151/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [18:47:27] !log contint* - deploying zuul-merger puppet refactor change, first codfw-only [18:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:34] (03PS1) 10Razzi: Revert "clouddb: depool clouddb1018 to update views" [puppet] - 10https://gerrit.wikimedia.org/r/751840 [18:49:48] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Andrew) Thanks for your ongoing attention on this! I'm frustrated by how easily I can produce this issue in production bu... 
[18:50:03] !log run sudo maintain-views --databases centralauth --replace-all on clouddb1018 for T298505 [18:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:06] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [18:50:08] (03CR) 10Dzahn: "+++ /tmp/puppet-file20220106-14019-y8y085 2022-01-06 18:49:00.055070679 +0000" [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [18:51:40] !log contint1001 - after contint2001 also re-enabled puppet and deployed 751816 zuul-merger refactor - service git-daemon refreshed and runnning [18:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:54] (03CR) 10Dzahn: "18:51 < mutante> !log contint1001 - after contint2001 also re-enabled puppet and deployed 751816 zuul-merger refactor - service git-daemon" [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [18:53:02] (03CR) 10Razzi: [C: 03+2] Revert "clouddb: depool clouddb1018 to update views" [puppet] - 10https://gerrit.wikimedia.org/r/751840 (owner: 10Razzi) [18:55:38] 10SRE, 10Fundraising-Backlog: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) a:03Dzahn [18:55:47] 10SRE, 10Fundraising-Backlog: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) 05Open→03In progress [18:59:02] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10wiki_willy) Yup, for sure. I definitely hear ya on that @Andrew. @Cmjohnson - maybe we can open a Dell Tech Direct tick... 
[18:59:32] !log puppetmaster1001 - creating missing Icinga contact for jgleeson in private puppet repo T298649 [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:36] T298649: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 [19:00:04] RoanKattouw and Urbanecm: May I have your attention please! UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1900) [19:00:04] nn1l2: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:13] hi [19:00:33] hey [19:01:26] (03CR) 10Majavah: [C: 03+2] Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) (owner: 104nn1l2) [19:01:30] Hello [19:01:46] Hi there! [19:02:08] (03Merged) 10jenkins-bot: Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) (owner: 104nn1l2) [19:02:35] taavi: Looks like you're doing the deployment already? 
[19:02:38] !log systemctl restart haproxy on dbproxy1018 to repool clouddb1018 for T298505 [19:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:41] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [19:02:42] nn1l2: your patch is on mwdebug1001, please test [19:02:45] RoanKattouw: indeed [19:03:44] (03PS7) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [19:04:02] LGTM https://commons.wikimedia.org/wiki/File:NHMUK013623127.jpg [19:04:32] Let's synch, taavi [19:04:37] sure [19:05:32] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:751538|Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T298451)]] (duration: 01m 09s) [19:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:35] T298451: Add https://www.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T298451 [19:05:45] done [19:05:50] anyone have anything else to deploy? [19:05:57] Thanks [19:06:34] It was tonight [19:06:44] You should call it a day! 
[19:07:09] !log UTC evening deploys done [19:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:49] (03PS1) 10Dzahn: icinga: let Jack Gleeson run commands for any host or service [puppet] - 10https://gerrit.wikimedia.org/r/751980 (https://phabricator.wikimedia.org/T298649) [19:17:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10Ottomata) Approved. [19:18:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10Ottomata) [19:18:51] 10SRE, 10Fundraising-Backlog, 10Patch-For-Review: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) Hey @jgleeson soo.. you did not have an Icinga contact and that had to be created in the private puppet repository. I just did... 
[19:19:27] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar, 10Patch-For-Review: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) [19:32:24] (03PS3) 10Juan90264: Adjusting wordmark size in bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751841 (https://phabricator.wikimedia.org/T298033) [19:37:50] (03PS1) 10Tpt: Makes sure $imgContHorizontal is always initialized [extensions/ProofreadPage] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751843 (https://phabricator.wikimedia.org/T298694) [19:40:40] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar, 10Patch-For-Review: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Jgreen) >>! In T298649#7602585, @Dzahn wrote: > > When I did a " grep contact_groups puppet_h... [19:40:51] urbanecm: Are they still deploying? [19:42:42] jouncebot: now [19:42:42] For the next 0 hour(s) and 17 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T1900) [19:43:26] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) [19:45:22] Any deployers available for the time? This change is fast [19:45:50] I'm here [19:46:17] preparing to deploy the train but I can help deploy a patch if you need me [19:46:48] i think i need you now [19:46:54] I'm also somewhat around if needed [19:47:09] Perfect [19:49:13] Juan_90264: what is in need of deploying? [19:49:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751841 [19:49:51] Also on the calendar [19:49:56] twentyafterfour: are you deploying or should I? 
[19:50:15] I can do it [19:53:10] (03CR) 1020after4: [C: 03+2] Adjusting wordmark size in bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751841 (https://phabricator.wikimedia.org/T298033) (owner: 10Juan90264) [19:53:53] (03Merged) 10jenkins-bot: Adjusting wordmark size in bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751841 (https://phabricator.wikimedia.org/T298033) (owner: 10Juan90264) [19:54:58] Perfect, merged! [19:55:08] Mwdebug1001 or 1002? [19:56:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:12] mwdebug1001 [19:57:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:31] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@63c162d]: generate entity revision maps for commons / wcqs [19:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:36] I see the change myself. The text got smaller [19:57:47] look good Juan_90264? [19:59:38] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@63c162d]: generate entity revision maps for commons / wcqs (duration: 02m 07s) [19:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:41] Yes [20:00:04] twentyafterfour and hashar: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T2000) [20:00:16] syncing it [20:00:59] twentyafterfour: The text is correct to appear, the community had requested this in the task [20:01:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:07] !log twentyafterfour@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751841 (duration: 01m 08s) [20:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:37] Juan_90264: Understood. It should be live now everywhere. [20:04:16] (03CR) 10Razzi: [C: 03+1] "Looks good to clean up 👍" [puppet] - 10https://gerrit.wikimedia.org/r/751085 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [20:04:21] Okay, thanks twentyafterfour! [20:06:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:28] Juan_90264: You're welcome! 
[20:07:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:19] (03CR) 10JHathaway: [C: 03+2] hieradata: fix incorrect yaml [puppet] - 10https://gerrit.wikimedia.org/r/751794 (owner: 10JHathaway) [20:08:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:34] !log banned elastic2051 from both chi and omega search clusters - T298674 [20:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:38] T298674: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 [20:21:13] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@3297991]: update rdf-spark-tools jar to 0.3.98 [20:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:04] (03CR) 10Dzahn: [C: 03+1] "!thanks. I am kind of surprised CI allowed this. 
years ago we did a huge "re-format everything from tabs to spaces" in the repo and I thou" [puppet] - 10https://gerrit.wikimedia.org/r/751794 (owner: 10JHathaway) [20:23:28] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@3297991]: update rdf-spark-tools jar to 0.3.98 (duration: 02m 15s) [20:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:06] 10SRE, 10Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (10jhathaway) [20:26:08] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10Gehel) @Papaul do you know if we have spare SSDs for this host? The host is already banned from the cluster, you can take it offline and reboot it whenever you want. (@b... [20:27:29] (03PS1) 10JHathaway: sodium: change role to insetup, to prep for decom [puppet] - 10https://gerrit.wikimedia.org/r/751990 [20:44:23] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10Papaul) @Gehel we have some disks that we took out from decom servers I will look when i am back on site tomorrow if we can find one. 
[20:48:38] (03PS1) 1020after4: Revert "mw.title: Add pageLanguage property" [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752006 (https://phabricator.wikimedia.org/T298659) [20:56:58] (03CR) 10Herron: [C: 03+1] icinga: let Jack Gleeson run commands for any host or service [puppet] - 10https://gerrit.wikimedia.org/r/751980 (https://phabricator.wikimedia.org/T298649) (owner: 10Dzahn) [21:01:29] (03CR) 10Dzahn: [C: 03+2] icinga: let Jack Gleeson run commands for any host or service [puppet] - 10https://gerrit.wikimedia.org/r/751980 (https://phabricator.wikimedia.org/T298649) (owner: 10Dzahn) [21:25:32] (03CR) 10Krinkle: [C: 03+1] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744836 (owner: 10Ahmon Dancy) [21:33:35] (03CR) 1020after4: [C: 03+2] "Landing this to unblock the train." [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752006 (https://phabricator.wikimedia.org/T298659) (owner: 1020after4) [21:42:06] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) >>! In T298649#7602627, @Jgreen wrote: > Fundraising-related services are done with passive checks, so the r... [21:46:32] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:47:02] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) Thanks for all the digging on this @Dzahn, hugely appreciated! Sorry to be a pain but a few others on fr... 
[21:48:14] 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10bking) [21:48:44] (03PS1) 10Dzahn: nagios_common: add jgleeson to fr-tech-ops Icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/752002 (https://phabricator.wikimedia.org/T298649) [21:49:52] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:51:07] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10wiki_willy) a:03Papaul [21:51:18] ryankemper: maintenance or surprise hardware fail? wdqs1012 [21:51:33] (03Merged) 10jenkins-bot: Revert "mw.title: Add pageLanguage property" [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752006 (https://phabricator.wikimedia.org/T298659) (owner: 1020after4) [21:51:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:51:57] uh oh [21:52:02] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:53:28] nah, it's up, blazegraph is running [21:53:53] mutante: I’m OOO but here’s some cc’s: inflatador & ebernhardson (wdqs 1012) [21:54:18] mutante: most likely blazegraph is locked up tho, systemd will look happy but the actual system is likely deadlocked [21:54:24] ryankemper: not really host down. 
was just busy blazegraph [21:54:35] ack [21:55:44] (03CR) 10Dzahn: [C: 03+2] nagios_common: add jgleeson to fr-tech-ops Icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/752002 (https://phabricator.wikimedia.org/T298649) (owner: 10Dzahn) [21:58:34] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:59:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:24] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar, 10Patch-For-Review: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) @jgleeson Let's see if it works now, as "jgleeson". it should work both via global rig... 
[22:01:02] (03PS1) 10Bking: icinga: enable host and service commands for Brian King (bking) [puppet] - 10https://gerrit.wikimedia.org/r/752005 (https://phabricator.wikimedia.org/T298738) [22:01:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:47] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar, 10Patch-For-Review: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) Would still have to confirm whether it's actually enough to use only the fr-tech-ops g... [22:03:39] (03CR) 10Dzahn: [C: 03+1] icinga: enable host and service commands for Brian King (bking) [puppet] - 10https://gerrit.wikimedia.org/r/752005 (https://phabricator.wikimedia.org/T298738) (owner: 10Bking) [22:04:20] (03CR) 10Dzahn: [C: 03+1] "looks good to me! upon merge, please run puppet manually on the icinga host and do the icinga config syntax check again" [puppet] - 10https://gerrit.wikimedia.org/r/752005 (https://phabricator.wikimedia.org/T298738) (owner: 10Bking) [22:14:11] !log twentyafterfour@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/Scribunto/: sync Scribunto to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/752006/ (duration: 01m 08s) [22:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:16] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:15:30] (03CR) 10Ebernhardson: [C: 03+1] icinga: enable host and service commands for Brian King (bking) [puppet] - 10https://gerrit.wikimedia.org/r/752005 (https://phabricator.wikimedia.org/T298738) (owner: 10Bking) [22:15:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install 
cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) I am not the partman recipe master but typically even with a h/w controller, there is a partman recipe that breaks down how to in... [22:17:06] (03CR) 10Bking: [C: 03+2] icinga: enable host and service commands for Brian King (bking) [puppet] - 10https://gerrit.wikimedia.org/r/752005 (https://phabricator.wikimedia.org/T298738) (owner: 10Bking) [22:18:09] (03CR) 10CDanis: [C: 03+1] logstash: update weekly indexes to use weekyear pattern syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) (owner: 10Cwhite) [22:19:16] (03CR) 10CDanis: [C: 03+1] "Thanks especially for tracking down the tricky logstash-vs-elasticsearch formatting difference." [puppet] - 10https://gerrit.wikimedia.org/r/751766 (https://phabricator.wikimedia.org/T298619) (owner: 10Cwhite) [22:20:38] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Cmjohnson) [22:20:42] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) 05Open→03Resolved @ayounsi I am resolving this task, re-open if something is not right or still needed. 
Thanks [22:21:49] (03PS1) 1020after4: all wikis to 1.38.0-wmf.16 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752026 [22:21:51] (03CR) 1020after4: [C: 03+2] all wikis to 1.38.0-wmf.16 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752026 (owner: 1020after4) [22:22:40] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.16 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752026 (owner: 1020after4) [22:25:15] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.16 refs T293958 [22:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:19] T293958: 1.38.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T293958 [22:26:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:43] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10Cmjohnson) A dell dispatch has been created You have successfully submitted request SR1080863256. 
[22:27:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:27:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:23] (03PS1) 10Cmjohnson: Trying a different partman recipe for an-test-worker servers [puppet] - 10https://gerrit.wikimedia.org/r/752028 (https://phabricator.wikimedia.org/T293938) [22:32:23] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:33:03] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33152/console" [puppet] - 10https://gerrit.wikimedia.org/r/751990 (owner: 10JHathaway) [22:33:15] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:34:48] (03PS2) 10Cmjohnson: Trying a different partman recipe for an-test-worker servers [puppet] - 10https://gerrit.wikimedia.org/r/752028 (https://phabricator.wikimedia.org/T293938) [22:35:50] (03CR) 10Cmjohnson: [C: 03+2] Trying a different partman recipe for an-test-worker servers [puppet] - 10https://gerrit.wikimedia.org/r/752028 (https://phabricator.wikimedia.org/T293938) (owner: 10Cmjohnson) [22:55:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host 
an-test-coord1002.eqiad.wmnet with OS buster [22:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:20] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet wit... [23:02:56] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) @Papaul Checked the box with 'hdparm', the failed disk is at sda, but it is not displaying its serial number. The working disk (sdb) has a serial number of 68GS10... [23:23:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-coord1002.eqiad.wmnet with OS buster [23:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:06] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with... [23:31:10] RECOVERY - Device not healthy -SMART- on elastic2051 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2051&var-datasource=codfw+prometheus/ops [23:39:49] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:41:50] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:43:35] second time this has paged and flapped, anyone know if something changed? [23:45:05] the graphs: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-6h&to=now [23:45:14] look very different starting at 21:45 [23:47:19] (03PS1) 104nn1l2: viwiktiobary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) [23:47:37] to my eyes, it looks like several of the wdqs servers are so backlogged they've taken themselves out of rotation for serving queries, which is causing a capacity crunch on the rest of the cluster [23:48:02] blazegraph is known to get, ah, stuck sometimes [23:48:07] perhaps that has happened here? 
[23:48:08] sounds reasonable [23:48:09] (03CR) 10jerkins-bot: [V: 04-1] viwiktiobary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) (owner: 104nn1l2) [23:48:45] (03PS2) 104nn1l2: viwiktionary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) [23:49:16] wdqs1004, wdqs1005, wdqs1006, wdqs1012 are the ones where rps plummeted and lag is really high [23:49:30] yeah [23:49:40] (03CR) 10jerkins-bot: [V: 04-1] viwiktionary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) (owner: 104nn1l2) [23:49:57] java is pegging a single cpu, perhaps it is hung? [23:50:06] yeah https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock [23:50:18] the symptoms match this -- for instance 'triples' has been unavailable on those servers for a while [23:50:38] sigh, blazegraph :( [23:51:15] I can try bouncing the daemon on wdqs1004? 
[23:51:40] please do :) [23:52:02] I'm away from my actual laptop atm so I can't actually log in to production [23:52:16] !log bouncing blazegraph on wdqs1004 [23:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:16] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:53:55] now it is using many cores, which seems much better [23:54:11] perhaps we can use a systemd watchdog to "autofix" this issue [23:54:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:54:53] looks like it has a couple hours of writes to catch up on, to be expected [23:55:18] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:57:46] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal