[00:01:26] (03PS2) 10Nray: Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) [00:01:28] (03CR) 10Dave Pifke: "OK, I think this is ready to roll out next week. The dependencies have all been merged and deployed." [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [00:01:45] (03PS3) 10Nray: Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) [00:01:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [00:18:23] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [00:19:08] 10SRE, 10Observability-Logging, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [00:19:42] 10SRE, 10Wikimedia-Logstash, 10observability, 10SRE Observability (FY2021/2022-Q2), 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10colewhite) [00:24:01] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [00:36:13] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [00:44:21] (03CR) 10Jdlrobson: [C: 03+1] "@Juan90264 please organize a backport for this using wikitech.org/wiki/Deployments." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [02:01:28] (03CR) 10Juan90264: [C: 03+1] Adding and use wordmark in azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [02:05:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:19] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:44:16] (03PS1) 10BryanDavis: toolhub: set https_proxy envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725180 (https://phabricator.wikimedia.org/T292027) [03:00:56] (03CR) 10BryanDavis: [C: 03+2] toolhub: set https_proxy envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725180 (https://phabricator.wikimedia.org/T292027) (owner: 10BryanDavis) [03:04:01] (03PS1) 10BryanDavis: toolhub: Bump container version to 021-10-01-024845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/725181 (https://phabricator.wikimedia.org/T292027) [03:04:58] (03Merged) 10jenkins-bot: toolhub: set https_proxy envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725180 (https://phabricator.wikimedia.org/T292027) (owner: 10BryanDavis) [03:08:20] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 021-10-01-024845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/725181 (https://phabricator.wikimedia.org/T292027) (owner: 10BryanDavis) [03:12:29] 
(03Merged) 10jenkins-bot: toolhub: Bump container version to 021-10-01-024845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/725181 (https://phabricator.wikimedia.org/T292027) (owner: 10BryanDavis) [03:15:56] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [03:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:23] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:24:46] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [03:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:11] (03CR) 10Legoktm: [C: 03+2] Have PdfHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724576 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [03:52:54] (03Merged) 10jenkins-bot: Have PdfHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724576 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [03:57:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [03:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[04:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:31] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Have PdfHandler use Shellbox on Commons for 10% of requests (T289228) (duration: 00m 59s) [04:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:37] T289228: Convert media handling code (PdfHandler, PagedTiffHandler) to use Shellbox - https://phabricator.wikimedia.org/T289228 [04:53:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10wiki_willy) [04:54:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10wiki_willy) Updated task description based on @JMeybohm's comment >>! In T290202#7327682, @JMeybohm wrote: > The latest kubernetes node there is is kubernetes... [04:55:06] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Data-Engineering: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (10wiki_willy) 05Open→03Declined Resolving this racking task, since the project has been pushed back to Q3. 
[05:05:34] (03PS1) 10Marostegui: Revert "db2080: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/725130 [05:13:41] (03CR) 10Dzahn: "took me forever to find this attempt, please let me do it, it's just temp and mostly a copy of existing stuff :)" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [05:19:23] (03CR) 10Marostegui: [C: 03+2] Revert "db2080: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/725130 (owner: 10Marostegui) [05:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 for upgrade', diff saved to https://phabricator.wikimedia.org/P17377 and previous config saved to /var/cache/conftool/dbconfig/20211001-052133-marostegui.json [05:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:05] !log Upgrade db1119 [05:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17378 and previous config saved to /var/cache/conftool/dbconfig/20211001-052438-root.json [05:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 for upgrade', diff saved to https://phabricator.wikimedia.org/P17379 and previous config saved to /var/cache/conftool/dbconfig/20211001-052509-marostegui.json [05:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:02] !log Upgrade db1114 [05:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17380 and previous config saved to /var/cache/conftool/dbconfig/20211001-052831-root.json [05:28:36] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17381 and previous config saved to /var/cache/conftool/dbconfig/20211001-053942-root.json [05:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17382 and previous config saved to /var/cache/conftool/dbconfig/20211001-054335-root.json [05:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17383 and previous config saved to /var/cache/conftool/dbconfig/20211001-055445-root.json [05:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [05:58:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17384 and previous config saved to /var/cache/conftool/dbconfig/20211001-055838-root.json [05:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17385 and previous config saved to 
/var/cache/conftool/dbconfig/20211001-060949-root.json [06:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:39] (03CR) 10DCausse: [C: 03+1] Validate deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [06:13:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [06:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17386 and previous config saved to /var/cache/conftool/dbconfig/20211001-061342-root.json [06:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:42] (03CR) 10DCausse: [C: 03+1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [06:18:54] 10SRE, 10SRE-Access-Requests: Add Majavah to #mediawiki_security - https://phabricator.wikimedia.org/T292214 (10Joe) 05Open→03Resolved a:03Joe Done :) [06:21:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:24:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17387 and previous config saved to /var/cache/conftool/dbconfig/20211001-062453-root.json [06:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) Hi @rhuang and welcome! 
Can you please confirm you've also read https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities? @CMacholan can you... [06:26:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) p:05Triage→03Medium [06:27:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [06:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17388 and previous config saved to /var/cache/conftool/dbconfig/20211001-062846-root.json [06:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:54] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Joe) [06:29:49] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) [06:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) [06:45:21] (03CR) 10Juan90264: [C: 03+1] "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725132 (https://phabricator.wikimedia.org/T291344) (owner: 10Juan90264) [06:49:32] (03PS5) 10Giuseppe Lavagetto: service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211001T0700) [07:02:20] (03CR) 10Jelto: [V: 03+1 C: 04-1] "the dedicated sshd is missing. Working and amending the change" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [07:10:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [07:11:06] (03CR) 10Muehlenhoff: [C: 04-1] Standardize the stats system user uid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [07:12:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724946 (https://phabricator.wikimedia.org/T292069) (owner: 10Giuseppe Lavagetto) [07:25:35] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:32:56] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Deploy only to k8s [alerts] - 10https://gerrit.wikimedia.org/r/724999 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [07:33:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 (owner: 10Giuseppe Lavagetto) [07:34:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:35:26] (03Merged) 10jenkins-bot: rdf-streaming-updater: Deploy only to k8s [alerts] - 10https://gerrit.wikimedia.org/r/724999 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [07:37:47] RECOVERY - Elevated latency for icinga checks in codfw on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:39:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:45:25] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10User-jbond, 10cloud-services-team (dcaro): Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668 (10dcaro) [07:48:24] (03Abandoned) 10Gehel: [DNM] quick hack to start discussion [software/spicerack] - 10https://gerrit.wikimedia.org/r/724443 (owner: 10Gehel) [07:49:43] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s-staging) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [07:52:35] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [07:52:55] this is me ^ (silencing) [08:03:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on testvm[2001,2003].codfw.wmnet with reason: Ganeti tests [08:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on testvm[2001,2003].codfw.wmnet with reason: Ganeti tests [08:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:19] (03PS1) 10Jbond: P:monitoring::host: rename 
to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 [08:13:53] (03CR) 10jerkins-bot: [V: 04-1] P:monitoring::host: rename to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [08:15:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31435/console" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [08:15:23] <_joe_> !log restarting pybal in codfw to pick up config changes [08:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [08:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:28] (03PS2) 10Majavah: Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) [08:21:16] (03PS3) 10Majavah: Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) [08:21:18] (03CR) 10Filippo Giunchedi: [C: 03+2] Validate deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [08:21:22] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Validate deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [08:21:27] (03PS3) 10Filippo Giunchedi: Validate deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) [08:21:44] (03CR) 10Majavah: [C: 03+2] Identify when venvs are for wrong Python versions (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [08:21:50] 
(03CR) 10jerkins-bot: [V: 04-1] Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [08:22:20] (03PS4) 10Majavah: Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) [08:22:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Ok given this is a naming proposal:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [08:22:31] (03CR) 10Majavah: [C: 03+2] Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [08:23:14] (03Merged) 10jenkins-bot: Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [08:23:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I thought I'd never see this day. Thanks a ton!" [puppet] - 10https://gerrit.wikimedia.org/r/725099 (owner: 10Legoktm) [08:23:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Pillage and pour salt on the remains please." [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [08:24:36] (03CR) 10Majavah: [C: 03+2] "Let's ship this in the next release, now that we cleaned up the ingress objects for tools.wmflabs.org URLs. 
This won't affect directly any" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [08:24:49] (03CR) 10jerkins-bot: [V: 04-1] Route Grid engine web requests via Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [08:26:09] (03PS2) 10Jbond: P:monitoring::host: rename to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 [08:27:06] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [08:27:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31436/console" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [08:27:52] (03PS1) 10Ladsgroup: mediawiki: Drop absented systemd timers of test wikidata change dispatching [puppet] - 10https://gerrit.wikimedia.org/r/725261 (https://phabricator.wikimedia.org/T291610) [08:29:09] (03PS3) 10Jbond: P:monitoring::host: rename to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 [08:30:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31437/console" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [08:32:09] (03PS12) 10Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) [08:32:30] (03CR) 10Majavah: [C: 03+2] "re-+2 after fixing merge conflicts" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [08:33:06] (03Merged) 10jenkins-bot: Route Grid engine web 
requests via Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [08:36:02] (03PS4) 10Jbond: P:monitoring::host: rename to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 [08:36:18] (03PS1) 10Muehlenhoff: Switch ganeti2016 to 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/725262 [08:39:14] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [08:40:11] (03PS2) 10Muehlenhoff: Switch ganeti2016 to 2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/725262 [08:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2080 T290868', diff saved to https://phabricator.wikimedia.org/P17390 and previous config saved to /var/cache/conftool/dbconfig/20211001-084345-marostegui.json [08:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:52] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [08:44:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 for upgrade', diff saved to https://phabricator.wikimedia.org/P17391 and previous config saved to /var/cache/conftool/dbconfig/20211001-084411-marostegui.json [08:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172 for upgrade', diff saved to https://phabricator.wikimedia.org/P17392 and previous config saved to /var/cache/conftool/dbconfig/20211001-084435-marostegui.json [08:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] !log Upgrade db1135 and db1172 [08:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti2016 to 
2.16 backport [puppet] - 10https://gerrit.wikimedia.org/r/725262 (owner: 10Muehlenhoff) [08:48:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17393 and previous config saved to /var/cache/conftool/dbconfig/20211001-084847-root.json [08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17394 and previous config saved to /var/cache/conftool/dbconfig/20211001-084859-root.json [08:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:51] (03CR) 10Jbond: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [08:53:53] (03PS2) 10KartikMistry: Remove deprecated SectionTranslationTargetLanguage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) [08:54:47] (03CR) 10Jbond: [C: 03+1] profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [08:55:45] (03PS5) 10Jbond: P:monitoring::host: rename to P:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/725257 [08:58:32] (03CR) 10David Caro: "LGTM, I'll leave the +1 to @fgiunchedi" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [09:00:06] <_joe_> !log restarting pybal low-traffic in eqiad to pick up the drop of proxyfetch to kubernetes services [09:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet [09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:03:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [09:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17395 and previous config saved to /var/cache/conftool/dbconfig/20211001-090351-root.json [09:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17396 and previous config saved to /var/cache/conftool/dbconfig/20211001-090402-root.json [09:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:46] (03PS1) 10Urbanecm: Let DB expressions intersect DB lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725263 (https://phabricator.wikimedia.org/T290609) [09:14:54] (03PS1) 10Urbanecm: WIP: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:15:37] (03PS2) 10Urbanecm: WIP: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:18:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 206): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31439/console" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [09:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17397 and previous config saved to /var/cache/conftool/dbconfig/20211001-091854-root.json [09:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P17398 and previous config saved to /var/cache/conftool/dbconfig/20211001-091906-root.json [09:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:00] (03PS3) 10Urbanecm: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:22:04] (03CR) 10Urbanecm: "diff generated by puppet compiler looks how" [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:24:09] I'm really wondering if I got that right at the first try. Tbh, I expected something to yell at me for writing invalid puppet code 🙂 [09:25:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2002.codfw.wmnet [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:38] urbanecm: I think you need to leave the old job in there with ensure => absent, at least for one full puppet run for it to be removed [09:27:13] good point [09:27:50] (03PS1) 10Muehlenhoff: Update MAC for testvm2002 [puppet] - 10https://gerrit.wikimedia.org/r/725265 [09:28:53] (03PS4) 10Urbanecm: WIP: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:29:25] (03CR) 10jerkins-bot: [V: 04-1] WIP: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:29:54] (03PS1) 10Urbanecm: growthexperiments: Remove absented systemd job [puppet] - 10https://gerrit.wikimedia.org/r/725286 (https://phabricator.wikimedia.org/T290609) [09:30:57] majavah: like this? 
[09:31:18] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [09:31:20] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: clean up unused cache clear script [puppet] - 10https://gerrit.wikimedia.org/r/725105 (https://phabricator.wikimedia.org/T144396) (owner: 10Cwhite) [09:31:30] with proper indentation, but yes [09:31:35] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10hashar) Clearing a few projects. @ayounsi mentioned at T283582#7111164 that it is most probably an issue with an unmanaged switc... [09:31:41] uploading that, but git review takes more time than i want it to [09:31:49] (03PS5) 10Urbanecm: WIP: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:31:51] finally [09:32:10] (03PS6) 10Urbanecm: growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) [09:32:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [09:33:47] (03CR) 10Muehlenhoff: [C: 03+2] Update MAC for testvm2002 [puppet] - 10https://gerrit.wikimedia.org/r/725265 (owner: 10Muehlenhoff) [09:33:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17399 and previous config saved to /var/cache/conftool/dbconfig/20211001-093358-root.json [09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:10] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1172 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17400 and previous config saved to /var/cache/conftool/dbconfig/20211001-093410-root.json [09:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:28] (03PS1) 10Ladsgroup: Make two new jobs of Wikidata dispatcher high priority [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) [09:37:22] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [09:37:35] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: (3) WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [09:37:59] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=frwiki --force # to get an idea about timing for T290609, runs in a tmux session under my account [09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:05] T290609: Make mentee overview module's updateMenteeData.php scale better - https://phabricator.wikimedia.org/T290609 [09:40:40] <_joe_> dcausse: any idea whiat's up with flink? [09:41:10] _joe_: it's me, renewed the downtime [09:41:16] sorry for the noise [09:41:33] (03PS1) 10Elukey: helmfile.d: increase Typha's replicas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/725289 (https://phabricator.wikimedia.org/T292077) [09:41:35] I scheduled 2 hours but got distracted [09:42:10] <_joe_> dcausse: no problems, I just wanted to be sure I didn't need to care :) [09:42:21] sure thanks! 
:) [09:49:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17401 and previous config saved to /var/cache/conftool/dbconfig/20211001-094902-root.json [09:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17402 and previous config saved to /var/cache/conftool/dbconfig/20211001-094913-root.json [09:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] (03PS1) 10Hashar: zuul: raise zuul queue alarm [puppet] - 10https://gerrit.wikimedia.org/r/725290 (https://phabricator.wikimedia.org/T292284) [09:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177 and db1164 for upgrade', diff saved to https://phabricator.wikimedia.org/P17403 and previous config saved to /var/cache/conftool/dbconfig/20211001-095433-marostegui.json [09:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:31] !log Upgrade db1164 and db1177 [09:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10CMacholan) Hi @Joe -- confirming approval here. Thanks for your help! 
[09:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17404 and previous config saved to /var/cache/conftool/dbconfig/20211001-095720-root.json [09:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17405 and previous config saved to /var/cache/conftool/dbconfig/20211001-095834-root.json [09:58:35] (03CR) 10Jbond: [C: 03+2] "All changes where no op" [puppet] - 10https://gerrit.wikimedia.org/r/725257 (owner: 10Jbond) [09:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:53] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@c123ab9] (eqiad): Increase mirrored traffic to 80% for eqiad [09:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:01] moritzm: merging 37ecc11e1a [10:00:03] (03Abandoned) 10Hnowlan: apt::package_from_component: add update condition for multiple packages [puppet] - 10https://gerrit.wikimedia.org/r/721275 (owner: 10Hnowlan) [10:00:08] jbond: ack, sorry [10:00:16] np :) [10:00:44] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@c123ab9] (eqiad): Increase mirrored traffic to 80% for eqiad (duration: 00m 51s) [10:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:57] (03PS2) 10Urbanecm: growthexperiments: Remove absented systemd job [puppet] - 10https://gerrit.wikimedia.org/r/725286 (https://phabricator.wikimedia.org/T290609) [10:07:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:12:25] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1177 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17406 and previous config saved to /var/cache/conftool/dbconfig/20211001-101224-root.json [10:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:13:00] (03PS1) 10Hashar: alertmanager: add release engineering team [puppet] - 10https://gerrit.wikimedia.org/r/725294 (https://phabricator.wikimedia.org/T292284) [10:13:17] yeah the spike is fine, yesterday there was a spike [10:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17407 and previous config saved to /var/cache/conftool/dbconfig/20211001-101338-root.json [10:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:20] (03CR) 10Hashar: "The Alert Manager system is described at https://wikitech.wikimedia.org/wiki/Alertmanager . It is the new way of emitting notifications." [puppet] - 10https://gerrit.wikimedia.org/r/725294 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:22:29] (03CR) 10Hashar: "For Joe since he mentioned on IRC the alert is annoying and I definitely agree it is too sensible. It fires off for normal workload." 
[puppet] - 10https://gerrit.wikimedia.org/r/725290 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:23:12] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add release engineering team [puppet] - 10https://gerrit.wikimedia.org/r/725294 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:25:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] zuul: raise zuul queue alarm [puppet] - 10https://gerrit.wikimedia.org/r/725290 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:25:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Thanks hashar 😊" [puppet] - 10https://gerrit.wikimedia.org/r/725290 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:26:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:55] (03PS8) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [10:27:28] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17408 and previous config saved to /var/cache/conftool/dbconfig/20211001-102728-root.json [10:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:07] (03PS26) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [10:28:11] wdqs1009 is me [10:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17409 and previous config saved to 
/var/cache/conftool/dbconfig/20211001-102841-root.json [10:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31441/console" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [10:30:37] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [10:34:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) [10:40:35] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10cmooney) a:03cmooney Thanks @hashar. I would agree with @ayounsi's analysis, if considering contint2001.mgmt **in... [10:42:12] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.366e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [10:42:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17410 and previous config saved to /var/cache/conftool/dbconfig/20211001-104232-root.json [10:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@d4caf6d] (eqiad): Increase mirrored traffic to 100% for eqiad [10:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: After 
upgrade', diff saved to https://phabricator.wikimedia.org/P17411 and previous config saved to /var/cache/conftool/dbconfig/20211001-104345-root.json [10:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:59] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@d4caf6d] (eqiad): Increase mirrored traffic to 100% for eqiad (duration: 00m 49s) [10:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:54] (03CR) 10Effie Mouzeli: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [10:46:20] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [10:47:54] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 4.56e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:52:52] (03PS9) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [10:54:00] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31443/console" [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [10:54:06] (03PS1) 10Jgiannelos: tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/725296 
(https://phabricator.wikimedia.org/T283159) [10:57:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17412 and previous config saved to /var/cache/conftool/dbconfig/20211001-105735-root.json [10:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17413 and previous config saved to /var/cache/conftool/dbconfig/20211001-105849-root.json [10:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:58] (03PS2) 10Ladsgroup: changeprop-jobqueue: Make new jobs of Wikidata dispatcher high priority [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) [11:03:49] (03CR) 10Jbond: [C: 04-1] "some minor nit's and comments see inline. 
-1 is highlighted and relates to a wrong file name" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [11:07:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:09:47] (03PS18) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:09:49] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/725296 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [11:10:19] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:11:08] !log manually migrating some vms out of ganeti1009 to avoid excessive memory pressure [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:13:54] I think that is expected [11:13:59] (03Merged) 10jenkins-bot: tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/725296 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [11:15:28] and I think soon ganeti1009 should stop swapping [11:15:48] (03PS19) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:16:24] (03PS20) 
10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:17:01] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:17:10] (03CR) 10Jbond: P:base: move production specific code to there own profile (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:18:19] one thing that is strange on logstash is a higher baseline of messages since around 16h yesterday [11:18:39] (03PS21) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:19:13] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:19:18] (03PS22) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:19:48] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:20:15] (03PS23) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:20:45] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:23:16] increase happened between 16:02-16:06, which would fit with 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/725019/ and probably normal, but I will report on ticket for a heads up [11:27:39] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10hashar) That is quite an epic diagnostic @cmooney ! It is definitely not trivial to end up root causing some specific... [11:28:08] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10aborrero) [11:28:37] (03PS1) 10Jbond: P:base: move more monitoring stuff to the monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/725298 [11:28:50] jynus: can you let me know more about it? [11:28:58] I just commented on ticket [11:29:10] I think on the right one, hopefully [11:29:25] https://phabricator.wikimedia.org/T48643#7394374 [11:30:50] (03PS2) 10Jbond: P:base: move more monitoring stuff to the monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/725298 [11:31:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31447/console" [puppet] - 10https://gerrit.wikimedia.org/r/725298 (owner: 10Jbond) [11:31:43] I doubt it would make that many logs, I check [11:32:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31448/console" [puppet] - 10https://gerrit.wikimedia.org/r/725298 (owner: 10Jbond) [11:35:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base: move more monitoring stuff to the monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/725298 (owner: 10Jbond) [11:42:47] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[11:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:30] Amir1, I found the logs, not sure whay they are, let me try to find a link [11:47:16] Amir1: https://logstash.wikimedia.org/goto/263ff9c2e9bdb0668b0864f9cebdb0d5 [11:47:30] (03PS24) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:47:44] thanks! [11:48:03] maybe they are unrelated and only happend to happen at the same time as your deploy! [11:48:43] as a quick look shows a lot of cirrusSearch errors [11:49:06] or it could be like an indirect cause [11:49:26] (03PS25) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:49:55] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:50:19] (03CR) 10Jbond: P:base: move production specific code to there own profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:52:48] jynus: it doesn't seem related, the wikis it's failing are not the same [11:53:09] yeah, maybe the deploy just triggered an existing issue [11:53:35] poked infra and made another issue obvious or something [11:53:43] I think I will report it on a separate ticket [11:53:52] Thanks [11:54:24] Do you want to comment with that "it doesn't seem related" as a response on ticket, or I do? 
[11:55:24] maybe it is actually T292048 [11:55:25] T292048: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 [11:55:54] I will comment there before creating a new ticket [12:02:16] (03PS6) 10Jelto: profile::gitlab start using gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) [12:02:17] I make a comment there [12:02:52] I did https://phabricator.wikimedia.org/T292048#7394439 [12:04:18] (03PS26) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [12:07:21] certainly, failing jobs won't make things faster, even if unrelated :-( [12:08:19] (03PS1) 10David Caro: base::environment: add types to the parameter [puppet] - 10https://gerrit.wikimedia.org/r/725301 [12:08:21] (03PS1) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [12:09:03] (03PS1) 10Jelto: aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.3 [puppet] - 10https://gerrit.wikimedia.org/r/725303 (https://phabricator.wikimedia.org/T292256) [12:23:19] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31449/console" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:33:05] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 7 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Pginer-WMF) [12:34:10] (03PS1) 10Ema: Release 6.0.8-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/725307 (https://phabricator.wikimedia.org/T268736) [12:40:20] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 
10Wikimedia-Fundraising, and 8 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10jcrespo) [12:46:19] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-fgiunchedi: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10fgiunchedi) [12:48:13] (03PS1) 10Elukey: WIP: add support for admin_ng helmfile secrets [puppet] - 10https://gerrit.wikimedia.org/r/725310 [12:50:25] (03PS1) 10Jbond: P:rsyslog::kafka_shipper: set hiera config directly in this profile [puppet] - 10https://gerrit.wikimedia.org/r/725311 (https://phabricator.wikimedia.org/T289661) [12:51:53] (03PS2) 10Elukey: WIP: add support for admin_ng helmfile secrets [puppet] - 10https://gerrit.wikimedia.org/r/725310 [12:54:12] (03PS3) 10Elukey: WIP: add support for admin_ng helmfile secrets [puppet] - 10https://gerrit.wikimedia.org/r/725310 [13:00:30] (03CR) 10Jbond: [C: 03+2] P:rsyslog::kafka_shipper: set hiera config directly in this profile [puppet] - 10https://gerrit.wikimedia.org/r/725311 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:03:32] (03PS27) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [13:04:27] (03CR) 10jerkins-bot: [V: 04-1] Release 6.0.8-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/725307 (https://phabricator.wikimedia.org/T268736) (owner: 10Ema) [13:04:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:04:49] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:05:22] (03PS28) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [13:10:28] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:55] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:11:56] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [13:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:11] (03PS14) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [13:12:15] (03CR) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:12:22] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:48] (03CR) 10jerkins-bot: [V: 04-1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:15:44] (03PS10) 10Gehel: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:16:47] (03CR) 10Gehel: 
[C: 03+2] wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:16:58] (03PS4) 10Dzahn: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724115 [13:17:15] (03CR) 10Dzahn: [C: 03+1] Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [13:17:37] (03PS27) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [13:23:41] !log manually trying LE expired root workaround on mwdebug1001 with puppet disabled ... [13:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:08] (03PS4) 10Elukey: WIP: add support for admin_ng helmfile secrets [puppet] - 10https://gerrit.wikimedia.org/r/725310 [13:29:39] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31456/console" [puppet] - 10https://gerrit.wikimedia.org/r/725310 (owner: 10Elukey) [13:32:07] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10hashar) I went to fetch the IRC log from https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/ which are fr... 
[13:32:55] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 2.753e+06 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:33:47] (03PS1) 10Elukey: role::deployment_server: move admin helmfile secrets under services [labs/private] - 10https://gerrit.wikimedia.org/r/725316 [13:34:12] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::deployment_server: move admin helmfile secrets under services [labs/private] - 10https://gerrit.wikimedia.org/r/725316 (owner: 10Elukey) [13:36:35] (03PS1) 10Jbond: debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) [13:37:10] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:38:17] (03PS2) 10Jbond: debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) [13:38:37] (03PS5) 10Elukey: Add support for admin_ng helmfile secrets used by ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/725310 [13:40:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/725303 (https://phabricator.wikimedia.org/T292256) (owner: 10Jelto) [13:42:15] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:35] (03CR) 10Michael Große: [C: 03+1] "Looks good to me, though I don't know enough about the details of this system to have an opinion on the specific numbers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [13:42:48] wdqs2008 
problem is "normal" it's being reloaded [13:42:51] (03CR) 10Elukey: [C: 03+2] helmfile.d: increase Typha's replicas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/725289 (https://phabricator.wikimedia.org/T292077) (owner: 10Elukey) [13:43:35] (03PS3) 10Jbond: debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) [13:44:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31462/console" [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:45:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:28] (03CR) 10Jbond: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:49:34] (03CR) 10Muehlenhoff: debdeploy: move base::autorestart into debdeploy module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:56:57] (03PS4) 10Jbond: debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) [13:57:10] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:01:35] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [14:01:44] ACKNOWLEDGEMENT - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The 
following units failed: wdqs-updater.service Guillaume Lederrey data reload in progress - https://phabricator.wikimedia.org/T288231 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:44] ACKNOWLEDGEMENT - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Guillaume Lederrey data reload in progress - https://phabricator.wikimedia.org/T288231 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:44] ACKNOWLEDGEMENT - WDQS high update lag on wdqs2008 is CRITICAL: 2.75e+06 ge 3600 Guillaume Lederrey data reload in progress - https://phabricator.wikimedia.org/T288231 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:04:11] !log C:envoyproxy (appservers and others): ca-certificates updated via cumin to workaround T292291 issues [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:18] T292291: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 [14:04:22] !log C:envoyproxy (appservers and others): restarting envoyproxy [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:43] (03CR) 10Jbond: geoip: create transitional class geoip::data::maxmind::ipinfo (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:08:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/725301 (owner: 10David Caro) [14:12:05] (03PS1) 10Giuseppe Lavagetto: admin: add Rui Huang to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/725325 (https://phabricator.wikimedia.org/T292258) [14:17:54] (03CR) 10Muehlenhoff: debdeploy: move base::autorestart into debdeploy module (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:19:52] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s-staging) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [14:20:34] hm.. this alert should not be deployed on k8s-staging ^ [14:22:56] godog: I wonder if switching from a deploy-tag: local to something more specific like deploy-tag: k8s does not require a cleanup [14:24:30] (03PS1) 10Elukey: Create new deploy group for k8s ML services [puppet] - 10https://gerrit.wikimedia.org/r/725326 [14:25:07] (03CR) 10jerkins-bot: [V: 04-1] Create new deploy group for k8s ML services [puppet] - 10https://gerrit.wikimedia.org/r/725326 (owner: 10Elukey) [14:25:55] (03CR) 10Jbond: "looks good to me so far. have left some comments some for some for now, but most can be handled in future CR's, ping me on irc if anything" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [14:27:32] (03PS1) 10Muehlenhoff: Enable ganeti216 also for ganeti2025 [puppet] - 10https://gerrit.wikimedia.org/r/725327 [14:28:30] (03CR) 10Physikerwelt: [C: 03+1] "I think this is a no brainer... I am not sure what is going to happen here. 
I suppose we are stuck in step 4 of https://meta.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [14:32:46] (03PS3) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009) [14:33:19] (03PS6) 10Elukey: Add support for admin_ng helmfile secrets used by ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/725310 [14:33:21] (03PS2) 10Elukey: Create new deploy group for k8s ML services [puppet] - 10https://gerrit.wikimedia.org/r/725326 [14:36:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31463/console" [puppet] - 10https://gerrit.wikimedia.org/r/725310 (owner: 10Elukey) [14:38:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/725326 (owner: 10Elukey) [14:39:18] (03CR) 10Ppchelko: Clean up temporary variable wgMathUseRestBase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [14:39:56] (03CR) 10Jelto: [V: 03+1] "I added the dedicated ssh daemon using the gitlab::ssh class and tested a full install cycle on WMCS. I also made sure that we keep all ss" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [14:40:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) @odimitrijevic I would ask you to approve for analytics. [14:41:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for ksiebert - https://phabricator.wikimedia.org/T292053 (10KSiebert) Hey all, thanks for processing this so quickly!!! Just saw it now. 
[14:42:27] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/725328 [14:43:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add support for admin_ng helmfile secrets used by ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/725310 (owner: 10Elukey) [14:46:43] (03PS29) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [14:52:49] dcausse: mmhh I'll check [14:58:53] (03CR) 10Cwhite: [C: 03+1] Revert "prometheus: add ThanosSidecarUploadFailure to prometheus/ops" [puppet] - 10https://gerrit.wikimedia.org/r/724949 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [14:58:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10odimitrijevic) Approved [14:59:13] (03CR) 10Cwhite: [C: 03+1] o11y: restore thanos sidecar upload failure [alerts] - 10https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [14:59:34] (03PS1) 10Elukey: helmfile.d: update the private paths for knative and kfserving [deployment-charts] - 10https://gerrit.wikimedia.org/r/725329 [14:59:37] dcausse: yeah it is an error during cleanup, I'll send a patch [14:59:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) [14:59:47] godog: thanks! [15:01:34] also definitely an alerts deploy failure must result in a puppet failure, so we get an alert [15:04:37] (03PS1) 10Filippo Giunchedi: alerts: cleanup only files [puppet] - 10https://gerrit.wikimedia.org/r/725330 (https://phabricator.wikimedia.org/T289662) [15:06:37] dcausse: ^ [15:09:23] (03CR) 10Elukey: [C: 03+1] "uid ok from LDAP, krb flag present, LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/725325 (https://phabricator.wikimedia.org/T292258) (owner: 10Giuseppe Lavagetto) [15:09:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add Rui Huang to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/725325 (https://phabricator.wikimedia.org/T292258) (owner: 10Giuseppe Lavagetto) [15:09:49] (03CR) 10Elukey: [C: 03+2] helmfile.d: update the private paths for knative and kfserving [deployment-charts] - 10https://gerrit.wikimedia.org/r/725329 (owner: 10Elukey) [15:10:07] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10BBlack) Recapping from an IRC conversation: this was a fallout of the great Let's En... [15:10:46] 10SRE, 10LDAP-Access-Requests: Add to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Deniz_WMDE) [15:12:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) 05Open→03Resolved a:03Joe Hi @rhuang, your access should be set up in half an hour. You should then be able to ssh into stat1006, given you've set up ssh correc... [15:12:23] (03PS2) 10BBlack: Add wikiworkshop.org to HSTS regex [puppet] - 10https://gerrit.wikimedia.org/r/723590 (https://phabricator.wikimedia.org/T251732) [15:12:25] (03PS1) 10BBlack: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) [15:12:51] (03PS2) 10BBlack: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) [15:13:54] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10ifried) I have read & signed the L3 document. 
Thank you everyone! [15:14:01] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus: add ThanosSidecarUploadFailure to prometheus/ops" [puppet] - 10https://gerrit.wikimedia.org/r/724949 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [15:14:06] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: restore thanos sidecar upload failure [alerts] - 10https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [15:14:11] (03PS3) 10Filippo Giunchedi: o11y: restore thanos sidecar upload failure [alerts] - 10https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662) [15:14:22] (03CR) 10jerkins-bot: [V: 04-1] sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:14:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] o11y: restore thanos sidecar upload failure [alerts] - 10https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [15:15:06] (03CR) 10jerkins-bot: [V: 04-1] sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:16:27] (03PS1) 10Elukey: helmfile.d: fix knative-serving's istio helmfile path [deployment-charts] - 10https://gerrit.wikimedia.org/r/725332 [15:21:47] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: cleanup only files [puppet] - 10https://gerrit.wikimedia.org/r/725330 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [15:24:11] (03CR) 10Elukey: [C: 03+2] helmfile.d: fix knative-serving's istio helmfile path [deployment-charts] - 10https://gerrit.wikimedia.org/r/725332 (owner: 10Elukey) [15:26:20] (03PS3) 10BBlack: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) [15:27:08] 10SRE, 10CirrusSearch, 
10Discovery-Search, 10Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10jcrespo) For more longer term, I also would like to wonder if there something we cou... [15:27:57] (03PS27) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [15:28:26] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [15:29:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10Joe) I forgot to add - you should receive an email with instructions on how to enable your kerberos access. [15:29:34] (03CR) 10Jbond: [C: 03+1] "lgtm minor comment" [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:31:38] (03PS5) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [15:32:03] dcausse: I was a little hasty in my puppet, next week I'll fix the deployment for good with T292303 [15:32:03] T292303: Move alerts-deploy exec to systemd unit - https://phabricator.wikimedia.org/T292303 [15:32:11] for now I'll bandaid it [15:32:35] godog: np, no rush on my side I have acked the alerts [15:33:06] SGTM, thanks! [15:35:15] (03CR) 10Giuseppe Lavagetto: "Not sure if important enough to rework the patch, but there is a dependency issue in its current form." 
[puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:35:58] (03CR) 10Bearloga: "Based on @ottomata's comment in a related CR I got rid of PYTHONPATH environment variable (since the scheduled script main.sh activates An" [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [15:36:01] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: add external postgres db [puppet] - 10https://gerrit.wikimedia.org/r/725048 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [15:41:27] (03CR) 10Jbond: "just a note that this was originally added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/570637/" [puppet] - 10https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [15:45:08] (03PS1) 10Jbond: "P:base: drop broad dependency" [puppet] - 10https://gerrit.wikimedia.org/r/725276 [15:45:26] (03PS2) 10Jbond: "P:base: drop broad dependency" [puppet] - 10https://gerrit.wikimedia.org/r/725276 [15:46:59] (03CR) 10jerkins-bot: [V: 04-1] "P:base: drop broad dependency" [puppet] - 10https://gerrit.wikimedia.org/r/725276 (owner: 10Jbond) [15:51:11] (03CR) 10Jbond: [C: 03+1] sslcert::ca_deselect_dstx3 for envoyproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [15:56:33] (03PS15) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [15:56:51] (03CR) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:57:14] (03CR) 10jerkins-bot: [V: 04-1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860
(https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:00:15] (03PS16) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [16:15:51] (03PS1) 10AOkoth: gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) [16:16:25] (03CR) 10jerkins-bot: [V: 04-1] gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:20:17] !log testing upcoming Scap 4.0.2 release on beta [16:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] (03PS2) 10AOkoth: gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) [16:36:59] (03CR) 10AOkoth: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31464/" [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:42:40] (03CR) 10Dzahn: gitlab: install backup restore script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:47:00] 10SRE, 10ops-eqiad, 10Analytics-Clusters: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10Cmjohnson) Replaced the cable but still don't have access, this server will require me to power it off and drain flea power. That has been the standard fix for... 
[16:49:52] (03CR) 10AOkoth: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31464/" [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:54:52] (03PS3) 10AOkoth: gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) [16:56:45] (03CR) 10Dzahn: [C: 03+1] gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [16:58:03] (03PS10) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [17:02:24] (03PS4) 10AOkoth: gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) [17:03:45] (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [17:03:50] (03CR) 10Dzahn: [C: 03+2] gitlab: install backup restore script [puppet] - 10https://gerrit.wikimedia.org/r/725340 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [17:09:24] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10RhinosF1) [17:14:48] (03CR) 10Hnowlan: "Seems okay to me although I'm ccing petr to make sure the numbers make sense" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [17:16:10] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10SWakiyama) Thank you, Joe. I'm confirming that I read the data access user responsibilities. 
Cheers, Shari [17:24:45] (03CR) 10Physikerwelt: [C: 03+1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/725328 (owner: 10PipelineBot) [17:34:14] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) The assertion was added f... [17:36:58] (03Abandoned) 10Reedy: CommonSettings.php: Minor code style tweak inside $wmgUseFooterContactLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688506 (owner: 10Reedy) [17:39:51] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) @robh confirmed, they only have 2 disks. I'm not sure what the next step is for them [17:42:46] (03CR) 10Ppchelko: [C: 04-1] "one question inlined. if this is no-issue, feel free to remove my -1. Otherwise looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [17:46:44] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) So I'm now reviewing the entire purchase history of this request. T286517 was filed, for config C-1G which is only 2... [17:49:23] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) thanks! 
@RobH [17:49:45] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [17:51:05] (03PS1) 10RobH: fixing partition for an-db hosts [puppet] - 10https://gerrit.wikimedia.org/r/725350 (https://phabricator.wikimedia.org/T289632) [17:53:12] (03CR) 10RobH: [C: 03+2] fixing partition for an-db hosts [puppet] - 10https://gerrit.wikimedia.org/r/725350 (https://phabricator.wikimedia.org/T289632) (owner: 10RobH) [17:58:51] !log depool mw1025, mw1319, mw1312 for test [17:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:57] !log robh@cumin1001 START - Cookbook sre.experimental.reimage for host an-db1001.eqiad.wmnet [18:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:08] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet [18:06:45] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) 05Open→03In progress [18:07:12] !log robh@cumin1001 END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host an-db1001.eqiad.wmnet [18:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:22] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet executed w... 
[18:08:13] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-db1001.eqiad.wmnet', 'a... [18:10:13] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [18:40:38] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-db1001.eqiad.wm... [18:53:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1001.eqiad.wmnet with reason: REIMAGE [18:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1002.eqiad.wmnet with reason: REIMAGE [18:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1001.eqiad.wmnet with reason: REIMAGE [18:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1002.eqiad.wmnet with reason: REIMAGE [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org 
[19:07:45] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet'] ` and were **ALL... [19:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:20:35] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10colewhite) [19:35:30] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:37:35] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for TTaylor - https://phabricator.wikimedia.org/T292299 (10ttaylor) [19:39:02] (03PS1) 10Ssingh: dnsdist: allow setting additional custom HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/725359 [19:41:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31465/console" [puppet] - 10https://gerrit.wikimedia.org/r/725359 (owner: 10Ssingh) [19:48:00] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) [19:48:23] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) 05In progress→03Resolved These are now ready for use! 
[19:50:28] (03CR) 10Ladsgroup: changeprop-jobqueue: Make new jobs of Wikidata dispatcher high priority (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [19:57:46] (03Abandoned) 10Ssingh: dnsdist: allow setting additional custom HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/725359 (owner: 10Ssingh) [20:57:45] (03PS1) 10Legoktm: Revert "Have PdfHandler use Shellbox on Commons for 10% of requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725281 [20:58:44] (03CR) 10Legoktm: [C: 03+2] Revert "Have PdfHandler use Shellbox on Commons for 10% of requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725281 (owner: 10Legoktm) [20:58:47] !log temp disabling puppet on puppetmasters - deploying gerrit:724115 (gerrit:723310) T273673 [20:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:54] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 [20:59:32] (03Merged) 10jenkins-bot: Revert "Have PdfHandler use Shellbox on Commons for 10% of requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725281 (owner: 10Legoktm) [20:59:39] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [21:01:03] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Revert "Have PdfHandler use Shellbox on Commons for 10% of requests" (duration: 00m 59s) [21:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:05:00] !log puppetmaster1001 - re-enabled puppet, noop as expected, the passive host pulls from the active one, so only 2001 has the cron/job/timer [21:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:28] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@d309a6e] (eqiad): tegola: reduce load to 50% during the weekend [21:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:14] (03Abandoned) 10Reedy: Remove old list of sites listed on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/627551 (owner: 10Reedy) [21:06:22] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@d309a6e] (eqiad): tegola: reduce load to 50% during the weekend (duration: 00m 54s) [21:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:38] !log puppetmaster1004, puppetmaster1005, puppetmaster2004, puppetmaster2005: re-enabled puppet, they are "insetup" role [21:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:03] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10KFrancis) Hi all, I am preparing the NDA now. Will send for signatures once it's approved. [21:12:32] !log puppetmaster1002, puppetmaster1003, puppetmaster2002, puppetmaster2003: re-enabled puppet, they are backends. backends don't have the sync cron/job/timer, so noop as well, just like 1004/1005/2004/2005. 
this just leaves the actual change on 2001 - T273673
[21:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:38] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673
[21:19:26] !log puppetmaster2001 - puppet removed cron sync_volatile and cron sync_ca - starting and verifying new timers: 'systemctl status sync-puppet-volatile', 'systemctl status sync-puppet-ca' T273673
[21:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:33] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673
[21:22:58] (PS2) Dzahn: puppetmaster: rename cron references to jobs/timers [puppet] - https://gerrit.wikimedia.org/r/723313
[21:26:06] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/31466/" [puppet] - https://gerrit.wikimedia.org/r/723313 (owner: Dzahn)
[21:30:24] (CR) Dzahn: "carefully deployed (disabled puppet on all via cumin, re-enabled again etc..) noop everywhere" [puppet] - https://gerrit.wikimedia.org/r/723313 (owner: Dzahn)
[21:30:45] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[21:30:55] (PS2) Dzahn: puppetmaster::rsync: remove absented cron code [puppet] - https://gerrit.wikimedia.org/r/723311 (https://phabricator.wikimedia.org/T273673)
[21:32:42] (CR) Ladsgroup: [C: +1] puppetmaster::rsync: remove absented cron code [puppet] - https://gerrit.wikimedia.org/r/723311 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[21:32:50] (CR) Dzahn: [C: +2] "only affected one server in prod and it's been half an hour" [puppet] - https://gerrit.wikimedia.org/r/723311 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[21:34:45] (WdqsStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[21:36:26] (PS17) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673)
[21:37:28] (CR) RLazarus: "LGTM - I don't know the subject matter well enough to review for e.g. ethtool specifics, but on a surface level it looks good. The file co" [puppet] - https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: Jbond)
[21:37:41] (CR) RLazarus: [C: +1] interface: update rps script to also set the number of queues via ethtool [puppet] - https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: Jbond)
[21:38:00] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/31467/" [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[21:39:45] (WdqsStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[21:40:45] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[21:43:04] (CR) Dzahn: [V: +1 C: +2] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[21:44:33] !log puppetmasters - temp. disabling puppet one more time, now for a different deploy, to fetch an additional MaxMind database - T288844
[21:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:40] T288844: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844
[21:45:34] (PS1) BryanDavis: toolhub: Add "localhost" to no_proxy envvar [deployment-charts] - https://gerrit.wikimedia.org/r/725376 (https://phabricator.wikimedia.org/T292027)
[21:50:56] (CR) BryanDavis: [C: +2] toolhub: Add "localhost" to no_proxy envvar [deployment-charts] - https://gerrit.wikimedia.org/r/725376 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis)
[21:53:21] (PS1) Dzahn: geoip/maxmind: fix/rename resources related to new ipinfo job [puppet] - https://gerrit.wikimedia.org/r/725378
[21:55:28] (Merged) jenkins-bot: toolhub: Add "localhost" to no_proxy envvar [deployment-charts] - https://gerrit.wikimedia.org/r/725376 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis)
[21:56:30] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[21:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:54] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/31468/" [puppet] - https://gerrit.wikimedia.org/r/725378 (owner: Dzahn)
[22:00:18] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Stanglavine) on ptwiki, same error: http...
[22:07:16] (PS1) Dzahn: geoip: fix syntax in erb template for new maxmind job [puppet] - https://gerrit.wikimedia.org/r/725379 (https://phabricator.wikimedia.org/T288844)
[22:07:54] (CR) Dzahn: [C: +2] geoip: fix syntax in erb template for new maxmind job [puppet] - https://gerrit.wikimedia.org/r/725379 (https://phabricator.wikimedia.org/T288844) (owner: Dzahn)
[22:14:35] (CR) Dzahn: "after some small follow-ups:" [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[22:15:03] (PS1) Dduvall: train-dev: Fix indentation in wmf-config/DevServices.php [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725380
[22:15:42] !log puppetmaster2001 - sudo /usr/local/bin/geoipupdate_job after adding new shell command and timer - successfully downloaded enterprise database for T288844
[22:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:49] (PS1) Dduvall: train-dev: Use an array for etcd configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725381
[22:15:49] T288844: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844
[22:16:58] !log puppetmaster2001 systemctl disable geoip_update_ipinfo.timer
[22:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:16] (PS2) Dduvall: train-dev: Use an array for etcd configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725381
[22:20:59] (PS1) BryanDavis: toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027)
[22:21:52] (CR) Dduvall: [C: +2] "Merging since this branch is currently broken in train-dev." [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725380 (owner: Dduvall)
[22:22:09] SRE, MediaWiki-General, observability, serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Krinkle)
[22:22:13] (CR) Dduvall: [C: +2] "Merging since this branch is currently broken in train-dev." [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725381 (owner: Dduvall)
[22:22:38] (Merged) jenkins-bot: train-dev: Fix indentation in wmf-config/DevServices.php [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725380 (owner: Dduvall)
[22:22:40] ^ fyi just the train-dev branch. no production impact
[22:22:51] (Merged) jenkins-bot: train-dev: Use an array for etcd configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725381 (owner: Dduvall)
[22:24:46] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_ipinfo.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:33] (CR) Ppchelko: [C: +1] changeprop-jobqueue: Make new jobs of Wikidata dispatcher high priority [deployment-charts] - https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: Ladsgroup)
[22:26:44] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:01] !log puppetmaster2001 - systemctl reset-failed
[22:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:03] (PS1) Gergő Tisza: Add image_suggestion_interaction event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/725386
[22:34:42] (PS2) BryanDavis: toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027)
[22:35:22] (CR) jerkins-bot: [V: -1] toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis)
[22:36:44] (PS4) Dzahn: geoip: replace maxmind update cron with system timer and config [puppet] - https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673)
[22:37:23] (CR) jerkins-bot: [V: -1] geoip: replace maxmind update cron with system timer and config [puppet] - https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[22:53:07] (PS5) Dzahn: geoip: replace maxmind update cron with system timer and config [puppet] - https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673)
[22:53:44] (CR) jerkins-bot: [V: -1] geoip: replace maxmind update cron with system timer and config [puppet] - https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[22:55:53] (PS3) BryanDavis: toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027)
[23:06:59] (CR) BryanDavis: [C: +2] toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis)
[23:08:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[23:11:13] (Merged) jenkins-bot: toolhub: Add envoy and mcrouter sidecars to cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/725384 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis)
[23:13:59] (PS1) Dzahn: puppetmaster/geoip: do not duplicate pulling of maxmind on all servers [puppet] - https://gerrit.wikimedia.org/r/725390
[23:19:25] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[23:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:47] SRE, Anti-Harassment, IP Info, serviceops, Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (Dzahn) After quite some necessary puppet changes (see above) we are now at a state where we could successfully do...
[23:27:54] (PS1) Dduvall: train-dev: Add missing service configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725391
[23:27:56] (PS1) Dduvall: train-dev: Remove hardcoding of datacenters in redis configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725392
[23:29:08] (CR) jerkins-bot: [V: -1] train-dev: Remove hardcoding of datacenters in redis configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725392 (owner: Dduvall)
[23:30:16] (PS2) Dduvall: train-dev: Remove hardcoding of datacenters in redis configuration [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/725392
[23:58:35] (CR) Gergő Tisza: [C: +1] "Looks good, although I wonder if there will be an expectation of & having higher priority." [mediawiki-config] - https://gerrit.wikimedia.org/r/725263 (https://phabricator.wikimedia.org/T290609) (owner: Urbanecm)