[00:00:05] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T0000). [00:01:33] (PS4) Gergő Tisza: Add Wikivoyage in wgImportSources to enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/736520 (https://phabricator.wikimedia.org/T294928) (owner: Juan90264) [00:01:53] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.6/extensions/GrowthExperiments: Backport: [[gerrit:736518|Add Image: add HTTP proxy config (T290949)]] [[gerrit:736519|Add Image: Harden API response parsing]] (duration: 01m 05s) [00:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:56] T290949: Add an image: Enable on test wikis - https://phabricator.wikimedia.org/T290949 [00:02:18] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at reading authorization packet, system error: 104 Connection reset by peer https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:03:08] ...
[00:04:21] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:05:17] legoktm: I got this error: https://phabricator.wikimedia.org/P17677 [00:05:32] "Error running command with poolcounter: timed out" seems to be the gist of it [00:06:21] o.O [00:06:42] and it failed on mwdebug1001 [00:06:51] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:07:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:51] tgr: feel free to keep deploying [00:07:54] eh, didn't you already roll back scap to the previous version? [00:07:57] thx [00:08:36] mutante: I did [00:08:49] this is something unrelated [00:08:53] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:09:12] tgr: ... Sorry to interrupt the conversation, but will my change roll-out continue? [00:09:19] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:736593|Enable GrowthExperiments image recommendations on ar,bn,cs,vi (T294878)]] (duration: 01m 03s) [00:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:23] T294878: Add Image: Enable on pilot wikis in dark mode - https://phabricator.wikimedia.org/T294878 [00:09:34] yeah, in a sec [00:09:41] Okay [00:09:56] tgr: did that sync go through fine? 
[00:10:01] yes [00:10:29] (CR) Gergő Tisza: [C: +2] Add Wikivoyage in wgImportSources to enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/736520 (https://phabricator.wikimedia.org/T294928) (owner: Juan90264) [00:10:49] Cool, are you looking to enable dark mode on wikis in the future? [00:11:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:17] (Merged) jenkins-bot: Add Wikivoyage in wgImportSources to enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/736520 (https://phabricator.wikimedia.org/T294928) (owner: Juan90264) [00:11:30] Great merged [00:11:56] Will we use WikimediaDebug? [00:11:57] legoktm: ACK, so the check-and-restart-php command on mwdebug1001.. just "fails" because it's not restarting php-fpm because "Number of cached keys 2299" [00:12:25] no, it failed connecting to poolcounter [00:12:40] ok, so now that is over then [00:12:59] and now it just says "over MAX_CACHED_KEYS $limit, nothing to do" [00:13:26] Juan_90264: yeah, it's on mwdebug1001 [00:15:14] and from mwdebug1001, I was able to ping the two hosts specified in /etc/poolcounter-backends.yaml just fine [00:16:52] (CR) Legoktm: [C: +1] cumin: add parsoid-canary to mw-canary and reuse other aliases [puppet] - https://gerrit.wikimedia.org/r/736594 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [00:21:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:23] (CR) Legoktm: [C: -1] cumin: reorganize mediawiki aliases (3 comments) [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [00:22:26] tgr: I approved [00:24:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:43] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:736520|Add Wikivoyage in wgImportSources to enwikiversity (T294928)]] (duration: 01m 05s) [00:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:46] T294928: Add Wikivoyage to en.wikiversity Special:Import - https://phabricator.wikimedia.org/T294928 [00:25:55] Juan_90264: thx, it's live [00:26:04] !log UTC late deploys done [00:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:58] Okay, thanks tgr! [00:27:58] there's a small spike of "Undefined index: image-recommendation" errors, in theory that should be transitional [00:29:03] ...although I'm not quite sure why it's happening at all. [00:30:54] tgr: multiple hosts or just one? [00:32:05] there has also been a small but steady stream of deadlocks from UserOptionsManager::saveOptionsInternal, from the timing that's probably train-related [00:32:28] legoktm: many hosts [00:32:59] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:33:02] Would it be a bad idea to create the possibility to enable dark mode directly on the wiki with extension usage? This one came sketched out in my mind, and I'm maybe thinking about developing it.
(Without putting my full name on it, too big) [00:33:46] https://logstash.wikimedia.org/goto/14e20d94526124a98d5d7001136a88fb [00:34:53] do you know where it's coming from? [00:35:01] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:35:31] It doesn't seem to be a user-visible error. I'll debug. [00:36:15] ack [00:36:47] it's definitely related to the patch I deployed, just not sure how since that did create the image-recommendation type and as far as I can tell it's working properly. [00:39:49] (CR) Dzahn: cumin: reorganize mediawiki aliases (3 comments) [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [00:39:57] (PS3) Dzahn: cumin: reorganize mediawiki aliases [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [00:40:08] in other news I broke mailman [00:40:24] I guess the apache2 upgrade has some regression, the path matching is broken [00:41:34] Juan_90264: I guess I would expect a dark mode setting under user preferences -> appearance [00:41:40] wow, the mwdebug log has 2500 events from en.wikiversity [00:42:01] legoktm: the webserver seems up though? [00:42:07] live hacked [00:42:10] ack [00:42:14] oh, I guess that's from Juan_90264 testing the import feature [00:43:18] (PS1) Legoktm: lists: Unbreak Mailman's web interface [puppet] - https://gerrit.wikimedia.org/r/736602 [00:43:19] Juan_90264: are you asking about dark mode support for some skin? [00:43:34] (CR) Legoktm: [V: +2 C: +2] lists: Unbreak Mailman's web interface [puppet] - https://gerrit.wikimedia.org/r/736602 (owner: Legoktm) [00:43:45] tgr: Yes [00:44:19] mutante: Great place to add the activation button, but I think it could possibly have a quick activation button at some strategic point.
[00:44:26] wow, regression in ProxyPath behaviour? I am glad this did not hit mediawiki rewrites [00:44:30] these days there's CSS-level support for dark mode, I think [00:44:45] ProxyPass [00:45:00] mutante: well I upgraded all the MW hosts too, I guess that config is written differently [00:45:08] so some OSes can enable/disable directly [00:45:30] anyway that's probably a better conversation for #mediawiki or #wikimedia-tech [00:47:55] Juan_90264: you mean to change what the default is per wiki unless the user overrides it in their settings? [00:50:38] legoktm: ACK, not in mw redirects but Gerrit does have this pattern [00:51:54] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996570 [00:53:49] ah, just in combination with uwsgi? ack [00:54:47] that means just lists and ... idp [00:54:53] and striker [00:55:00] and debmon [00:55:02] mutante: I'm not looking to change, it was just an idea of a shortcut to this dark mode [00:55:19] filed T294995 about the deadlock issue [00:55:19] T294995: Deadlocks from job setting VectorSkinVersion user preference to 1 - https://phabricator.wikimedia.org/T294995 [00:55:31] 30 instances in the last hour, probably not a blocker? 
[00:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [00:59:47] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 184 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:00:26] that's a bit of a spike there [01:00:35] um [01:01:41] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 298 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:03:21] I think it recovered [01:03:36] it was all parsoid but recovering? [01:03:38] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:03:43] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:08:15] yeah, seems fine now [01:08:20] just a storm of bad requests I think [01:08:27] did not see errors in Parsoid.log either [01:08:33] on mwlog [01:08:51] they're all in exception.log [01:09:03] it was someone/thing trying to parse very big pages that were timing out/OOMing [01:09:22] I'm really going off now :) [01:09:33] ACK, ok.
thanks, cya [01:18:21] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:27:41] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:28:12] the undefined index errors should be fixed now [01:57:56] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (copypaste) Also just affected [File:Systemd-on-fedora.svg](https://commons.wikimedia.org/wiki/File:Systemd-on-fedora.s... [02:00:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [02:02:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:19:55] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:21:55] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:37:03] (PS1) Gergő Tisza: Add Image: Do not use proxy in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/736623
(https://phabricator.wikimedia.org/T294987) [03:40:48] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 10.58 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [03:48:54] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 12.7 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [03:53:00] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 1.606 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [04:12:58] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:14:45] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:16:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:26:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:34] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 12.1 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [04:49:38] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 1.407 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [04:58:53] (Processor usage over 85%) firing: Processor usage 
over 85% - https://alerts.wikimedia.org [05:10:06] SRE, Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (Legoktm) > And if you would like to be kept informed of the discussion process, please let me know. (This is a private discussion and you will need to have a Telegram account)... [05:10:09] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:12:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:33:59] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:35:59] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:48:00] (PS1) Marostegui: Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/736523 [05:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for the old special replicas T263127', diff saved to https://phabricator.wikimedia.org/P17679 and previous config saved to /var/cache/conftool/dbconfig/20211104-055419-marostegui.json [05:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:23] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [06:09:10] (CR) Legoktm: [C: +2] Ignore flake8-bugbear B904 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734965 (owner: Hashar) [06:09:47] (Merged) jenkins-bot: Ignore flake8-bugbear B904
[wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734965 (owner: Hashar) [06:11:14] (CR) Legoktm: "Personally I think some of these are funny and others are dated. I think it would reasonable for you to suggest some newer, funny replacem" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734964 (owner: Hashar) [06:21:46] (PS3) Juan90264: Allow bureaucrats to grant and revoke the importer rights to enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/736522 (https://phabricator.wikimedia.org/T294930) [06:34:03] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:50] (PS1) Legoktm: lists: Fix internal apache2 config, for real [puppet] - https://gerrit.wikimedia.org/r/736632 [06:47:25] (CR) Legoktm: [C: +2] lists: Fix internal apache2 config, for real [puppet] - https://gerrit.wikimedia.org/r/736632 (owner: Legoktm) [06:56:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:23] SRE, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (Legoktm) >>! In T286066#7476400, @Legoktm wrote: > because of issues serving it over localhost, Mailman connects to itself over https://lists.wikimedia.org/ which is not terrible w... [07:05:31] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:08] SRE, Traffic, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Legoktm) >>!
In T294800#7473846, @Joe wrote: > If anything, I think we should go in the other direction, and progressively and drastically red... [07:11:15] (PS2) Legoktm: lists: Set profile::base::firewall::block_abuse_nets: true [puppet] - https://gerrit.wikimedia.org/r/736552 [07:15:51] SRE, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (Legoktm) Setting up envoy as a tlsproxy should be straightforward. The one thing I'm not sure about is how to have it talk to Apache over HTTP, since we currently have Apache enf... [07:15:56] (CR) Legoktm: [C: +2] lists: Set profile::base::firewall::block_abuse_nets: true [puppet] - https://gerrit.wikimedia.org/r/736552 (owner: Legoktm) [07:32:32] SRE, Traffic, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Joe) >>! In T294800#7480542, @Legoktm wrote: > Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST r... [07:34:55] SRE, Traffic, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Joe) Let me add another data point: Of those 8 requests over 175 seconds, only 2 were to POSTs to Special:Upload.
[07:38:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:23] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:43] SRE-swift-storage, MediaWiki-Uploading: FAILED: stashfailed: Internal error: Server failed to publish temporary file. - https://phabricator.wikimedia.org/T279407 (Legoktm) [07:41:11] SRE, serviceops, MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), Patch-For-Review, Sustainability: Jobrunner timeouts on cross-DC file uploads because of HTTP/2 - https://phabricator.wikimedia.org/T275752 (Legoktm) [07:41:26] SRE, serviceops, MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), Patch-For-Review, Sustainability: Jobrunner timeouts on cross-DC file uploads because of HTTP/2 - https://phabricator.wikimedia.org/T275752 (Legoktm) [07:43:13] (CR) Marostegui: [C: +2] Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/736523 (owner: Marostegui) [07:43:28] SRE-swift-storage, Internet-Archive, MediaWiki-Uploading: Uploading ~160MB DjVu by URL results in 503 error at Commons and failed upload - https://phabricator.wikimedia.org/T280048 (Legoktm) I suspect this is the same cause as T292954#7473355. Any logs from April are long gone now unfortunately so it...
[07:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17681 and previous config saved to /var/cache/conftool/dbconfig/20211104-074346-root.json [07:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:33] SRE-tools, Infrastructure-Foundations, IPv6: Some Data Persistence DB clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271140 (Marostegui) Removing the DBA tag [07:53:19] (PS3) Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735712 [07:53:41] (PS3) Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 [07:58:39] SRE, Community-Tech, serviceops, wikidiff2, and 2 others: Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (Nardog) [07:59:31] (PS1) Giuseppe Lavagetto: wcqs: remove from loadbalancers to avoid alerting [puppet] - https://gerrit.wikimedia.org/r/736644 (https://phabricator.wikimedia.org/T294865) [08:09:09] SRE, Internet-Archive, serviceops: Improve download speed from archive.org on appservers - https://phabricator.wikimedia.org/T295009 (Legoktm) [08:10:40] (CR) Muehlenhoff: [C: +1] cumin: reorganize mediawiki aliases [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [08:11:43] SRE, Commons, MediaWiki-Uploading, Structured Data Engineering, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (Legoktm)...
[08:12:23] SRE-swift-storage, Commons, Internet-Archive, MediaWiki-API, and 3 others: Large PDF upload issue - https://phabricator.wikimedia.org/T254459 (Legoktm) Is this still an issue? [08:17:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1163', diff saved to https://phabricator.wikimedia.org/P17682 and previous config saved to /var/cache/conftool/dbconfig/20211104-081726-marostegui.json [08:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17683 and previous config saved to /var/cache/conftool/dbconfig/20211104-081729-root.json [08:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:26] SRE-swift-storage, User-Inductiveload: Unable to upload to Commons: uploadstash-file-not-found: Key "187kyl5ozj74.xtav8j.51508.djvu" not found in stash - https://phabricator.wikimedia.org/T278104 (Legoktm) [08:22:07] (CR) Giuseppe Lavagetto: [C: +2] wcqs: remove from loadbalancers to avoid alerting [puppet] - https://gerrit.wikimedia.org/r/736644 (https://phabricator.wikimedia.org/T294865) (owner: Giuseppe Lavagetto) [08:25:54] (PS1) Muehlenhoff: Remove obsolete group memberships [puppet] - https://gerrit.wikimedia.org/r/736647 [08:27:08] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:27:26] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:27:27] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:28:01] (PS2) Muehlenhoff: Add further engineering managers for ops: approval [puppet] -
https://gerrit.wikimedia.org/r/736451 [08:28:34] (CR) Muehlenhoff: Add further engineering managers for ops: approval (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736451 (owner: Muehlenhoff) [08:29:19] <_joe_> !log restarting pybal on low-traffic nodes in eqiad and codfw [08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:07] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:31:27] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17684 and previous config saved to /var/cache/conftool/dbconfig/20211104-083233-root.json [08:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:58] (CR) Kosta Harlan: [C: +1] Add Image: Do not use proxy in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/736623 (https://phabricator.wikimedia.org/T294987) (owner: Gergő Tisza) [08:37:03] <_joe_> !log ipvsadm -Dt 10.2.2.67:443 on lvs101{5,6} [08:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:44] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:41:02] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:41:04] SRE, ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (ayounsi) a:ayounsi→RobH If it needs to be mass updated across all PDUs, like we did in https://wikitech.wikimedia.org/wiki/LibreNMS#Mass_update_PDU_alerting_thresholds then
I can t... [08:43:32] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:47:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17685 and previous config saved to /var/cache/conftool/dbconfig/20211104-084736-root.json [08:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:43] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, minor comments below." [puppet] - https://gerrit.wikimedia.org/r/736590 (owner: Andrew Bogott) [08:53:50] (Abandoned) Arturo Borrero Gonzalez: cloudgw: keepalived: set same priority on the 2 VRRP instances [puppet] - https://gerrit.wikimedia.org/r/736548 (https://phabricator.wikimedia.org/T294956) (owner: Arturo Borrero Gonzalez) [08:56:39] !log restarting blazegraph on wdqs1012 (stuck for the past 6 hours) [08:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:02:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17686 and previous config saved to /var/cache/conftool/dbconfig/20211104-090240-root.json [09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] (CR) David Caro: [C: +1] "LGTM, there's the missing '{}' around data, but that seems to be a copy-pasted error" [puppet] - https://gerrit.wikimedia.org/r/736590 (owner: Andrew Bogott) [09:07:37] (PS1) Volans: Add section and host to error log message [software/wmfbackups] - https://gerrit.wikimedia.org/r/736652 [09:09:28] !log depool cp4034.ulsfo.wmnet - T290694 [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:31] T290694: Q1:(Need By: TBD)
rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [09:12:54] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:00] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [09:13:31] SRE, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (MoritzMuehlenhoff) >>! In T294961#7479399, @EBernhardson wrote: > Without having better information, i would guess we are trig... [09:17:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17687 and previous config saved to /var/cache/conftool/dbconfig/20211104-091744-root.json [09:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:47] (CR) David Caro: [C: -1] "All nits can be ignored" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/736581 (owner: Andrew Bogott) [09:29:03] SRE-swift-storage, Internet-Archive, MediaWiki-Uploading: Uploading ~160MB DjVu by URL results in 503 error at Commons and failed upload - https://phabricator.wikimedia.org/T280048 (Inductiveload) Open→Resolved a:Inductiveload Well, I have also completely forgotten the context of this and...
[09:32:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17688 and previous config saved to /var/cache/conftool/dbconfig/20211104-093247-root.json [09:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:56] (03PS1) 10Muehlenhoff: Also apply ganeti_test role to 2002/2003 [puppet] - 10https://gerrit.wikimedia.org/r/736707 (https://phabricator.wikimedia.org/T286206) [09:44:08] (03PS2) 10Muehlenhoff: Also apply ganeti_test role to 2002/2003 [puppet] - 10https://gerrit.wikimedia.org/r/736707 (https://phabricator.wikimedia.org/T286206) [09:47:09] (03PS9) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [09:47:11] (03PS9) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [09:47:13] (03PS3) 10Vgutierrez: prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) [09:47:15] (03PS2) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [09:47:17] (03PS1) 10Vgutierrez: haproxy: Add missing TLS configuration options [puppet] - 10https://gerrit.wikimedia.org/r/736708 (https://phabricator.wikimedia.org/T290005) [09:52:20] (03CR) 10Jelto: [C: 03+1] "lgtm." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/736554 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [09:54:34] (03PS2) 10Vgutierrez: haproxy: Add missing TLS configuration options [puppet] - 10https://gerrit.wikimedia.org/r/736708 (https://phabricator.wikimedia.org/T290005) [09:54:36] (03PS10) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [09:54:38] (03PS10) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [09:54:40] (03PS4) 10Vgutierrez: prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) [09:54:42] (03PS3) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [09:56:47] I won't make the backport window this week, and this week only: thanks to the daylight saving time change, it conflicts with another meeting I really need to be at. cc Amir1, I don't see Lucas in here. [09:59:42] er, drmrs mgmt might alert [10:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1000). [10:01:21] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4034.ulsfo.wmnet with OS buster [10:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed: - cp4034 (**WARN**...
[10:03:01] (03CR) 10ArielGlenn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:03:43] (03CR) 10ArielGlenn: [C: 03+1] "Fine once the previous patch has taken effect." [puppet] - 10https://gerrit.wikimedia.org/r/736600 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:04:34] (03CR) 10Muehlenhoff: [C: 03+2] Also apply ganeti_test role to 2002/2003 [puppet] - 10https://gerrit.wikimedia.org/r/736707 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [10:05:41] (03PS1) 10Marostegui: parsercache.my.cnf: Add innodb_checksum_algorithm = full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/736709 (https://phabricator.wikimedia.org/T287244) [10:05:56] (03PS1) 10MMandere: install_server: Add drmrs nodes to partman configs [puppet] - 10https://gerrit.wikimedia.org/r/736710 (https://phabricator.wikimedia.org/T282787) [10:06:16] (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf: Add innodb_checksum_algorithm = full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/736709 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui) [10:09:41] (03CR) 10Btullis: [C: 03+1] "Yep, looks great. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [10:16:43] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10aborrero) note: for cloudcephosd1022 in Netbox, manually updated `Status` from `Planned` to `Staged`. 
[10:18:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-enc-cli: added set_prefix_roles subcommand (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736590 (owner: 10Andrew Bogott) [10:21:16] !log pool cp4034.ulsfo.wmnet - T290694 [10:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:20] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [10:27:11] !log depool cp4036.ulsfo.wmnet - T290694 [10:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:14] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [10:28:31] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4036.ulsfo.wmnet with OS buster [10:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster [10:31:32] (03PS4) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) [10:31:39] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [10:34:50] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Add missing TLS configuration options [puppet] - 10https://gerrit.wikimedia.org/r/736708 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:39:11] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: eqiad: include new nodes in the farm [puppet] - 10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) [10:43:41] (03PS2) 10Ssingh: dnsdist: update configuration template for dnsdist 
1.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/734711 [10:44:40] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32113/console" [puppet] - 10https://gerrit.wikimedia.org/r/734711 (owner: 10Ssingh) [10:45:16] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: update configuration template for dnsdist 1.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/734711 (owner: 10Ssingh) [10:45:34] (03CR) 10Vgutierrez: [C: 03+2] cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:50:12] (03CR) 10David Caro: "LGTM just one question" [puppet] - 10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [10:52:23] (03PS1) 10David Caro: Remove future/rich data modes support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) [10:52:28] (03PS1) 10David Caro: Remove resource_filter support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736721 (https://phabricator.wikimedia.org/T294630) [10:53:03] (03PS1) 10Ssingh: dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 [10:53:36] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 (owner: 10Ssingh) [10:55:01] (03PS2) 10Ssingh: dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 [10:55:29] (03CR) 10David Caro: Remove future/rich data modes support (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [10:55:34] (03PS2) 10David Caro: Remove future/rich data modes support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 
(https://phabricator.wikimedia.org/T294540) [10:55:36] (03PS2) 10David Caro: Remove resource_filter support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736721 (https://phabricator.wikimedia.org/T294630) [10:55:39] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 (owner: 10Ssingh) [10:56:59] (03PS3) 10Ssingh: dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 [10:58:19] (03CR) 10Ssingh: [C: 03+2] dnsdist: disable check_cmd for version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/736722 (owner: 10Ssingh) [10:58:29] (03CR) 10David Caro: [C: 03+1] wmcs-enc-cli: added set_prefix_roles subcommand (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736590 (owner: 10Andrew Bogott) [10:59:26] (03PS1) 10Muehlenhoff: Enable RAPI Netbox sync for new test cluster [puppet] - 10https://gerrit.wikimedia.org/r/736724 (https://phabricator.wikimedia.org/T286206) [11:00:04] Amir1, Lucas_WMDE, and apergos: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1100). 
[11:00:28] sorry, not here, conflicting meeting, blame daylight savings time [11:01:03] !log upload dnsdist 1.6.1-1wm1 to apt.wm.o (buster) - T273679 [11:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:06] T273679: Wikidough: Upgrade to dnsdist 1.6.0 - https://phabricator.wikimedia.org/T273679 [11:11:54] (03CR) 10Arturo Borrero Gonzalez: cloud: ceph: eqiad: include new nodes in the farm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [11:12:27] (03PS1) 10David Caro: Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 [11:13:38] (03CR) 10David Caro: [C: 03+1] cloud: ceph: eqiad: include new nodes in the farm [puppet] - 10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [11:14:15] (03CR) 10Ema: "The job definition is missing, see for example $varnish_jobs" [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:14:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: ceph: eqiad: include new nodes in the farm [puppet] - 10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [11:14:35] (03CR) 10Majavah: [C: 04-1] "This probably needs helm3 support like added in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/735979" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [11:14:37] (03PS1) 10Jelto: hiera::role::common::deployment_server update helmBinary staging [puppet] - 10https://gerrit.wikimedia.org/r/736727 (https://phabricator.wikimedia.org/T251305) [11:14:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: ceph: eqiad: include new nodes in the farm (032 comments) [puppet] -
10https://gerrit.wikimedia.org/r/736719 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [11:15:40] (03CR) 10David Caro: Black and isort all the things (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [11:16:02] (03PS2) 10David Caro: Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 [11:17:19] (03PS1) 10Majavah: Re-assign role to cloudcephosd1020 [puppet] - 10https://gerrit.wikimedia.org/r/736728 [11:17:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Re-assign role to cloudcephosd1020 [puppet] - 10https://gerrit.wikimedia.org/r/736728 (owner: 10Majavah) [11:21:14] if there is a patch that needs to be deployed, let me know [11:22:46] 10SRE, 10Traffic-Icebox: Wikidough: Upgrade to dnsdist 1.6.0 - https://phabricator.wikimedia.org/T273679 (10ssingh) On `doh1001`, ` $ dnsdist --version dnsdist 1.6.1 (Lua 5.1.4 [LuaJIT 2.1.0-beta3]) ` ` kdig @185.71.138.138 +tls-ca +tls-host=wikimedia-dns.org wikipedia.org +nsid ;; TLS session (TLS1.3)-(E... [11:24:01] !log update dnsdist on O:wikidough [11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:02] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [11:26:05] 10SRE, 10Traffic-Icebox: Wikidough: Upgrade to dnsdist 1.6.0 - https://phabricator.wikimedia.org/T273679 (10ssingh) 05Open→03Resolved ` ===== NODE GROUP ===== (10) doh[1001-1002,2001-2002,3001-30...
[11:28:32] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4036.ulsfo.wmnet with OS buster [11:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed: - cp4036 (**WARN**... [11:29:06] (03PS1) 10Jbond: DO NOT MEREG - broken commit for testing pcc changes [puppet] - 10https://gerrit.wikimedia.org/r/736729 [11:29:46] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MEREG - broken commit for testing pcc changes [puppet] - 10https://gerrit.wikimedia.org/r/736729 (owner: 10Jbond) [11:30:32] (03PS2) 10Volans: Add section and host to error log message [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/736652 (https://phabricator.wikimedia.org/T293975) [11:32:06] (03PS1) 10Ssingh: Revert "dnsdist: disable check_cmd for version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/736730 [11:32:30] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736724 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:35:13] (03CR) 10Muehlenhoff: [C: 03+2] Enable RAPI Netbox sync for new test cluster [puppet] - 10https://gerrit.wikimedia.org/r/736724 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:40:21] (03CR) 10Jbond: [C: 03+1] "This all looks good and think its good to clean this out. 
In general i think its usefull to be able to test different puppet.conf configu" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [11:40:36] (03CR) 10Jbond: [C: 03+1] "fyii also tested this and seemd to workfine" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [11:41:18] (03PS1) 10MMandere: site: Add dns instances for drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) [11:42:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32114/console" [puppet] - 10https://gerrit.wikimedia.org/r/736727 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:42:17] (03CR) 10Jbond: [C: 03+1] "LGTM and tested" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736721 (https://phabricator.wikimedia.org/T294630) (owner: 10David Caro) [11:44:22] (03PS2) 10Jbond: DO NOT MEREG - broken commit for testing pcc changes [puppet] - 10https://gerrit.wikimedia.org/r/736729 [11:44:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "dnsdist: disable check_cmd for version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/736730 (owner: 10Ssingh) [11:46:09] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MEREG - broken commit for testing pcc changes [puppet] - 10https://gerrit.wikimedia.org/r/736729 (owner: 10Jbond) [11:53:51] !log pool cp4036.ulsfo.wmnet - T290694 [11:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:54] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - 
https://phabricator.wikimedia.org/T290694 [11:54:40] (03CR) 10Hashar: Remove bot humors for deployers (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/734964 (owner: 10Hashar) [11:55:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:38] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: fix interface name on newest osd servers [puppet] - 10https://gerrit.wikimedia.org/r/736733 (https://phabricator.wikimedia.org/T295012) [11:58:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: ceph: fix interface name on newest osd servers [puppet] - 10https://gerrit.wikimedia.org/r/736733 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [11:59:17] (03CR) 10David Caro: [C: 03+1] cloud: ceph: fix interface name on newest osd servers [puppet] - 10https://gerrit.wikimedia.org/r/736733 (https://phabricator.wikimedia.org/T295012) (owner: 10Arturo Borrero Gonzalez) [12:02:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:14] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10LClightcat) >>! 在T294676#7480475中,@Legoktm写道: >> And if you would like to be kept informed of the discussion process, please let me know. (This is a private discussion and you... 
[12:05:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:41] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736730 (owner: 10Ssingh) [12:06:51] (03CR) 10Ssingh: [C: 03+2] Revert "dnsdist: disable check_cmd for version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/736730 (owner: 10Ssingh) [12:09:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:56] (03CR) 10Jbond: "LGTM, minor comment which can be dealt with later." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [12:12:37] (03CR) 10Jbond: [C: 03+1] Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [12:14:41] (03CR) 10David Caro: Remove future/rich data modes support (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [12:16:09] (03CR) 10David Caro: Black and isort all the things (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [12:18:58] (03PS11) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [12:19:00] (03PS5) 10Vgutierrez: prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) [12:19:03] (03PS4) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [12:19:23] (03CR) 10Jbond: [C: 03+1] "thanks" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) 
(owner: 10David Caro) [12:21:02] 10SRE, 10Commons, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (10AlexisJazz... [12:24:03] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:24:04] \o\ |o| /o/ welcome to the team Amir1 XD [12:24:36] ^^ [12:25:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2144 for mysql upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17689 and previous config saved to /var/cache/conftool/dbconfig/20211104-122504-ladsgroup.json [12:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:08] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [12:25:13] there you go Amir1 [12:28:25] (03PS1) 10Majavah: P::kerberos: automate principal management [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) [12:28:56] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [12:29:12] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:30:33] (03CR) 10Ema: [C: 03+1] prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:32:52] (03PS1) 10Ema: prometheus: remove varnish-canary job [puppet] - 10https://gerrit.wikimedia.org/r/736755 [12:40:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:45] !log Upgrade db2144 (kernel and mariadb) T295026 [12:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:48] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [12:45:41] (03PS1) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [12:45:53] Amir1: now that you're ops, do you need to be in mailman-roots / whatever deployment is etc.? [12:46:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:17] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10Jonathan5566) To be clear, what kind of on-wiki dissociation will SRE like to see? Will we need to do that on VP? Or consensus between reviewers is fine? If on-wiki dissociatio... [12:47:29] (03PS1) 10Muehlenhoff: Update names of nodes in new test cluster [puppet] - 10https://gerrit.wikimedia.org/r/736759 (https://phabricator.wikimedia.org/T286206) [12:50:16] (03CR) 10Muehlenhoff: [C: 03+2] Update names of nodes in new test cluster [puppet] - 10https://gerrit.wikimedia.org/r/736759 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:51:22] RhinosF1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736647 [12:51:32] I'm planning to comment there [12:52:34] Ah!
[12:52:54] (03PS2) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [12:52:59] (03CR) 10jerkins-bot: [V: 04-1] populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 (owner: 10Jbond) [12:54:43] (03PS3) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [12:54:59] PROBLEM - Host db2144 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:32] I'm going to redeploy all services in staging Kubernetes cluster. Expect some helmfile sync SAL spam the next hour [12:55:55] RECOVERY - Host db2144 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [12:56:42] I downtimed everything :/ sorry [12:58:23] (03PS4) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [12:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [12:59:09] (03CR) 10BBlack: [C: 03+1] install_server: Add drmrs nodes to partman configs [puppet] - 10https://gerrit.wikimedia.org/r/736710 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:01:11] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[13:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:16] (03PS14) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [13:01:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:34] (03PS2) 10Ssingh: dnsrecursor: prepare pdns-recursor for the 4.5.5 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 [13:02:10] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:24] (03CR) 10BBlack: "Good stuff! As noted by @volans as well, there's a few items here that have to wait until after the dns hosts are up and working (chicken " [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:03:04] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:48] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[13:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:08] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32115/console" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [13:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2144 (re)pooling @ 50%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17690 and previous config saved to /var/cache/conftool/dbconfig/20211104-130412-root.json [13:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:15] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [13:04:30] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [13:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:10] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:03] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [13:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:53] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:33] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs nodes to partman configs [puppet] - 10https://gerrit.wikimedia.org/r/736710 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:09:42] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[13:09:42] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:09:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1124.eqiad.wmnet with reason: Testing with the test host [13:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1124.eqiad.wmnet with reason: Testing with the test host [13:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:44] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [13:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:35] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:31] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] (03CR) 10Hnowlan: [C: 03+1] R:cassandra::instance::monitoring: make sure cassandra is loaded (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735012 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:14:19] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[13:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:58] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [13:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:54] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [13:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:58] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [13:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2144 (re)pooling @ 100%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17691 and previous config saved to /var/cache/conftool/dbconfig/20211104-131916-root.json [13:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:19] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [13:24:11] (03CR) 10MMandere: site: Add dns instances for drmrs DC site (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:26:51] (03PS5) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) [13:26:52] !log update eqiad & esams cp nodes to ATS 8.0.8-1wm5 - T294897 [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:41] (03PS2) 10MMandere: site: Add dns instances for drmrs DC site [puppet] - 
10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) [13:27:45] (03CR) 10JMeybohm: [C: 03+1] hiera::role::common::deployment_server update helmBinary staging [puppet] - 10https://gerrit.wikimedia.org/r/736727 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:28:27] (03PS1) 10Elukey: Add sslcert::trusted_root_ca [puppet] - 10https://gerrit.wikimedia.org/r/736765 (https://phabricator.wikimedia.org/T291905) [13:28:35] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [13:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:23] (03CR) 10Elukey: "First draft, let me know :)" [puppet] - 10https://gerrit.wikimedia.org/r/736765 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:29:25] (03CR) 10BBlack: [C: 03+1] "Looks good to merge!" [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:29:28] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [13:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] (03CR) 10MMandere: [C: 03+2] site: Add dns instances for drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/736731 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:33:35] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [13:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:54] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [13:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:46] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[13:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:28] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [13:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:04] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [13:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:01] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [13:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:29] (03CR) 10Jgiannelos: tegola-vector-tiles: Setup cronjob parallelism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/736554 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [13:40:01] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [13:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:02] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:40:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:59] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . [13:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[13:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:35] (03PS3) 10Ssingh: dnsrecursor: prepare pdns-recursor for the 4.5.5 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 [13:43:38] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:21] (03CR) 10Vgutierrez: [C: 03+2] cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:44:42] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32117/console" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [13:45:16] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:27] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 3: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [13:46:54] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [13:46:55] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [13:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:57] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:02] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:47] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:53:46] (03CR) 10Jelto: [V: 03+1 C: 03+2] hiera::role::common::deployment_server update helmBinary staging [puppet] - 10https://gerrit.wikimedia.org/r/736727 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:54:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:26] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[13:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:56] (03CR) 10Ladsgroup: "I don't mind it either way and don't know the policies but I want to say that I got most of these for different occasions and different ti" [puppet] - 10https://gerrit.wikimedia.org/r/736647 (owner: 10Muehlenhoff) [14:10:20] (03CR) 10David Caro: [C: 03+2] Remove future/rich data modes support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [14:10:40] (03CR) 10David Caro: [C: 03+2] Remove future/rich data modes support (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [14:10:51] (03CR) 10David Caro: [C: 03+2] Remove resource_filter support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736721 (https://phabricator.wikimedia.org/T294630) (owner: 10David Caro) [14:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [14:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] (03CR) 10Andrew Bogott: wmcs-enc-cli: added set_prefix_roles subcommand (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736590 (owner: 10Andrew Bogott) [14:14:36] (03PS3) 10Andrew Bogott: wmcs-enc-cli: added set_prefix_roles subcommand [puppet] - 10https://gerrit.wikimedia.org/r/736590 [14:15:40] (03PS1) 10Ssingh: dnsrecursor: add support for enabling EDNS padding [puppet] - 10https://gerrit.wikimedia.org/r/736776 
(https://phabricator.wikimedia.org/T274431) [14:16:16] (03PS1) 10Ottomata: analytics-meta.cnf.erb - set character set to binary [puppet] - 10https://gerrit.wikimedia.org/r/736777 (https://phabricator.wikimedia.org/T279440) [14:16:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32118/console" [puppet] - 10https://gerrit.wikimedia.org/r/736776 (https://phabricator.wikimedia.org/T274431) (owner: 10Ssingh) [14:16:52] (03PS1) 10Muehlenhoff: Update VIP for ganeti test cluster following netbox change [dns] - 10https://gerrit.wikimedia.org/r/736778 [14:17:10] (03Merged) 10jenkins-bot: Remove future/rich data modes support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736720 (https://phabricator.wikimedia.org/T294540) (owner: 10David Caro) [14:17:19] (03Abandoned) 10Ottomata: analytics-meta.cnf.erb - set character set to binary [puppet] - 10https://gerrit.wikimedia.org/r/736777 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:17:50] (03CR) 10Herron: [C: 03+2] centrallog: prep rsync from centrallog2001 -> centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/736563 (owner: 10Herron) [14:18:12] (03PS2) 10Ssingh: dnsrecursor: add support for enabling EDNS padding [puppet] - 10https://gerrit.wikimedia.org/r/736776 (https://phabricator.wikimedia.org/T274431) [14:18:33] (03PS2) 10ArielGlenn: add fake enterprise api dumps downloader credentials [labs/private] - 10https://gerrit.wikimedia.org/r/736527 (https://phabricator.wikimedia.org/T273585) [14:18:46] (03CR) 10Muehlenhoff: Remove obsolete group memberships (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736647 (owner: 10Muehlenhoff) [14:19:17] (03Merged) 10jenkins-bot: Remove resource_filter support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736721 (https://phabricator.wikimedia.org/T294630) (owner: 10David Caro) [14:19:43] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32119/console" [puppet] - 10https://gerrit.wikimedia.org/r/736776 (https://phabricator.wikimedia.org/T274431) (owner: 10Ssingh) [14:19:54] (03PS3) 10David Caro: Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 [14:19:56] (03CR) 10David Caro: Black and isort all the things (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [14:19:58] (03PS1) 10David Caro: Show the directory where fact files are searched on error [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 [14:20:18] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add fake enterprise api dumps downloader credentials [labs/private] - 10https://gerrit.wikimedia.org/r/736527 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [14:20:23] (03CR) 10Muehlenhoff: [C: 03+2] Update VIP for ganeti test cluster following netbox change [dns] - 10https://gerrit.wikimedia.org/r/736778 (owner: 10Muehlenhoff) [14:20:25] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 2: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/736776 (https://phabricator.wikimedia.org/T274431) (owner: 10Ssingh) [14:21:03] (03PS1) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:21:07] (03PS1) 10Ema: trafficserver: set ssl.client.verify.server to 1 [puppet] - 10https://gerrit.wikimedia.org/r/736781 (https://phabricator.wikimedia.org/T294897) [14:21:34] (03CR) 10jerkins-bot: [V: 04-1] Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:21:41] (03PS2) 10Urbanecm: Add Image: Do not use proxy in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736623 (https://phabricator.wikimedia.org/T294987) (owner: 10Gergő Tisza) 
[14:21:53] jouncebot: nowandnext [14:21:53] No deployments scheduled for the next 1 hour(s) and 38 minute(s) [14:21:53] In 1 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1600) [14:22:05] (03CR) 10Urbanecm: [C: 03+2] Add Image: Do not use proxy in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736623 (https://phabricator.wikimedia.org/T294987) (owner: 10Gergő Tisza) [14:22:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [14:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:40] (03CR) 10BBlack: [C: 03+1] "I'd probably +1 anything that removes lines without adding any! :)" [puppet] - 10https://gerrit.wikimedia.org/r/736755 (owner: 10Ema) [14:22:42] (03PS2) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:22:50] (03Merged) 10jenkins-bot: Add Image: Do not use proxy in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736623 (https://phabricator.wikimedia.org/T294987) (owner: 10Gergő Tisza) [14:23:12] (03CR) 10jerkins-bot: [V: 04-1] Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:23:26] (03CR) 10Ema: [C: 04-1] "To be merged after we backport the upstream patch mentioned in commit log" [puppet] - 10https://gerrit.wikimedia.org/r/736781 (https://phabricator.wikimedia.org/T294897) (owner: 10Ema) [14:25:00] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 1e5b2503c47f0eb3fd8d797acf94dd48a8e7a6f6: Add Image: Do not use proxy in Beta (T294987) (duration: 01m 05s) [14:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:03] T294987: [betalabs] Add image - "Suggestions are no longer available..." 
with 403 error - https://phabricator.wikimedia.org/T294987 [14:25:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:32] (03CR) 10jerkins-bot: [V: 04-1] Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [14:26:06] (03CR) 10jerkins-bot: [V: 04-1] Show the directory where fact files are searched on error [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 (owner: 10David Caro) [14:26:08] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/735012 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [14:26:32] (03CR) 10Andrew Bogott: dnsrecursor: prepare pdns-recursor for the 4.5.5 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [14:27:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:25] (03PS3) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:29:23] (03CR) 10Ssingh: [V: 03+1] dnsrecursor: prepare pdns-recursor for the 4.5.5 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [14:30:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [14:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:31] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [14:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:00] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32121/console" [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:34:05] (03CR) 10Jbond: [C: 03+1] "LGTM lets give it a test" [puppet] - 10https://gerrit.wikimedia.org/r/736765 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:36:25] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache codfw on all recursors [14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) codfw on all recursors [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736647 (owner: 10Muehlenhoff) [14:36:58] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti-test01.svc.codfw.wmnet on all recursors [14:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti-test01.svc.codfw.wmnet on all recursors [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [14:37:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:41] (03PS4) 10Jbond: Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [14:38:25] (03CR) 10Jbond: [C: 03+1] Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [14:38:41] (03CR) 10Jbond: [C: 03+1] "FYI i fixed the minor ci issue" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [14:38:46] (03PS4) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:39:47] (03PS1) 10Jbond: hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) [14:40:10] !log disabling puppet on C:cassandra in advance of merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631789 [14:40:10] (03CR) 10Ema: [C: 03+2] prometheus: remove varnish-canary job [puppet] - 10https://gerrit.wikimedia.org/r/736755 (owner: 10Ema) [14:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] (03CR) 10jerkins-bot: [V: 04-1] hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [14:41:14] (03PS2) 10Jbond: hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) [14:41:43] (03PS3) 10Jbond: hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) [14:41:51] (03CR) 10jerkins-bot: [V: 04-1] hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 
10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [14:42:33] (03PS4) 10Jbond: hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) [14:42:57] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [14:43:35] (03PS5) 10Ottomata: Sync db1108's my.cnf settings with analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) [14:44:05] !log imported jenkins 2.303.3 to thirdparty/ci for buster-wikimedia T294838 [14:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:06] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:45:18] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) 05Open→03In progress I migrated all services in `staging` to helm3 using the snippet https://phabricator.wikimedia.org/P17671. It took around 1 hour. helm2 list shows no relea... 
[14:46:38] (03PS5) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [14:47:04] (03PS1) 10Hnowlan: cassandra: fix incorrect parameter for version [puppet] - 10https://gerrit.wikimedia.org/r/736784 (https://phabricator.wikimedia.org/T261966) [14:48:34] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32124/console" [puppet] - 10https://gerrit.wikimedia.org/r/736780 (https://phabricator.wikimedia.org/T279440) (owner: 10Ottomata) [14:49:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [14:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:53] (03CR) 10ArielGlenn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:49:57] !log Upgrading CI Jenkins [14:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:14] !log disable cr1-codfw:et-0/0/0 [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:17] (03PS1) 10Elukey: profile::kafka::broker: add truststore for pki-based tls certs [puppet] - 10https://gerrit.wikimedia.org/r/736785 (https://phabricator.wikimedia.org/T291905) [14:51:28] (03CR) 10Ssingh: [V: 03+1] dnsrecursor: prepare pdns-recursor for the 4.5.5 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [14:52:55] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:53:41] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Setup cronjob parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/736554 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [14:53:45] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Gehel) [14:53:51] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Gehel) [14:53:55] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:54:01] (03PS1) 10Elukey: Enable PKI TLS certificates for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) [14:54:24] (03CR) 10MSantos: [C: 03+1] maps: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [14:54:50] (03PS1) 10Ema: prometheus: remove varnish_2layer [puppet] - 10https://gerrit.wikimedia.org/r/736787 (https://phabricator.wikimedia.org/T241239) [14:55:06] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Gehel) 05Open→03Resolved Work is continuing on more specific tickets. 
[14:55:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:31] PROBLEM - cassandra SSL 10.64.16.27:7001 on maps1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:55:31] PROBLEM - cassandra SSL 10.192.48.166:7001 on maps2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:56:57] hnowlan: ^^ related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/631789/? [14:57:28] majavah: probably not, there was an earlier change merged that added absent checks [14:57:29] PROBLEM - cassandra-a SSL 10.64.0.213:7001 on aqs1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:57:30] PROBLEM - cassandra-a SSL 10.64.48.122:7001 on aqs1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:57:54] (03Merged) 10jenkins-bot: tegola-vector-tiles: Setup cronjob parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/736554 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [14:57:59] puppet is disabled on those hosts since the profile::java change was merged, but looking [14:58:03] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:58:25] (03PS2) 10Elukey: profile::kafka::broker: add truststore for pki-based tls certs [puppet] - 10https://gerrit.wikimedia.org/r/736785 (https://phabricator.wikimedia.org/T291905) [14:58:27] (03PS2) 10Elukey: Enable PKI TLS certificates for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) [14:58:43] cassandra 
is still up, probably a problem with the checks [14:58:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:22] there might be more noise soon though [14:59:55] PROBLEM - cassandra-b SSL 10.64.48.123:7001 on aqs1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:00:28] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736795 (https://phabricator.wikimedia.org/T128546) [15:00:34] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10MPhamWMF) [15:00:47] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32126/console" [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:03:10] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:09] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.213:7001 on aqs1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:09] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.237:7001 on aqs1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. 
https://phabricator.wikimedia.org/T120662 [15:04:10] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.122:7001 on aqs1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:10] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.123:7001 on aqs1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:10] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.65:7001 on aqs1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:10] ACKNOWLEDGEMENT - cassandra SSL 10.64.0.12:7001 on maps1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:10] ACKNOWLEDGEMENT - cassandra SSL 10.64.16.27:7001 on maps1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:11] ACKNOWLEDGEMENT - cassandra SSL 10.192.16.31:7001 on maps2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:11] ACKNOWLEDGEMENT - cassandra SSL 10.192.48.166:7001 on maps2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services. https://phabricator.wikimedia.org/T120662 [15:04:24] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[15:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] (03PS4) 10Dzahn: cumin: reorganize mediawiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [15:05:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [15:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] (03CR) 10Dzahn: "Have to check which other places exactly might be using these, for example switchdc or other cookbooks or users might have them in their p" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [15:09:30] PROBLEM - cassandra-b SSL 10.64.48.67:7001 on aqs1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:10:13] (03PS1) 10Hnowlan: cassandra: allow empty tls_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/736799 [15:11:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [15:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:36] (03PS5) 10Dzahn: cumin: reorganize mediawiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [15:11:47] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.88:7001 on aqs1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking non-SSL services, erroneous check https://phabricator.wikimedia.org/T120662 [15:11:48] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.67:7001 on aqs1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Hnowlan Checking 
non-SSL services, erroneous check https://phabricator.wikimedia.org/T120662 [15:15:00] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10jbond) [15:15:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: in puppet 6 some core types have been moved to external modules. check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) [15:15:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:16:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) [15:19:00] PROBLEM - cassandra-a SSL 10.64.32.146:7001 on aqs1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:19:53] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:21:13] PROBLEM - cassandra-b SSL 10.64.0.120:7001 on aqs1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:21:23] (03CR) 10Ladsgroup: [C: 03+1] Remove obsolete group memberships (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736647 (owner: 10Muehlenhoff) [15:22:41] (03PS1) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [15:23:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/736784 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [15:23:37] PROBLEM - cassandra-b SSL 10.64.32.147:7001 on aqs1013 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:24:05] PROBLEM - cassandra SSL 10.192.32.46:7001 on maps2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:24:32] (03CR) 10Dzahn: [C: 03+2] "welcome to root, Ladsgroup" [puppet] - 10https://gerrit.wikimedia.org/r/736647 (owner: 10Muehlenhoff) [15:26:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [15:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [15:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32128/console" [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [15:29:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2143 for mysql upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17694 and previous config saved to /var/cache/conftool/dbconfig/20211104-152919-ladsgroup.json [15:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:23] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [15:29:51] 10SRE, 10SRE-swift-storage, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (10thcipriani) [15:29:56] (03PS2) 10David Caro: Show the directory where fact files are searched on error [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 
[15:30:27] (03PS2) 10David Caro: workspace: Improve docs and create missing directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735596 [15:30:27] !log drain codfw-ulsfo link [15:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:44] 10SRE, 10SRE-swift-storage, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (10thcipriani) ==== Error ==== * mwversion: `1.38.0-wmf.7` * reqI... [15:31:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2143.codfw.wmnet with reason: Maintenance T295026 [15:31:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2143.codfw.wmnet with reason: Maintenance T295026 [15:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:04] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [15:32:12] (03CR) 10Jbond: [V: 03+1 C: 04-1] "-01 just to get confirmation that the produced changes are expected?" 
[puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [15:34:02] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:36:19] (03CR) 10Cwhite: P:rsyslog: ship puppetmaster logs to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [15:37:07] (03PS3) 10David Caro: Show the directory where fact files are searched on error [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 [15:37:49] (03PS1) 10Jbond: pcc: add support for cumin:R syntax [puppet] - 10https://gerrit.wikimedia.org/r/736804 [15:37:53] !log Upgrade db2143 T295026 [15:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:57] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [15:38:40] (03PS3) 10Elukey: Enable PKI TLS certificates for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) [15:38:42] (03PS1) 10Elukey: profile::kafka::broker: allow to override super_users [puppet] - 10https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905) [15:38:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] add credentials file for downloading enterprise html dumps [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:39:58] (03CR) 10Majavah: "Do you have a labs/private patch too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:40:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32129/console" [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:40:41] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:40:44] (03CR) 10ArielGlenn: add credentials file for downloading enterprise html dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:40:46] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] add credentials file for downloading enterprise html dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:40:55] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:41:37] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:41:51] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [15:42:14] we lost ulsfo transport? 
[15:44:05] (03PS1) 10JMeybohm: Rename everything to cfssl-issuer, ensure e2e completed [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736807 [15:44:07] (03PS1) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 [15:44:09] (03PS1) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 [15:45:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 50%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17695 and previous config saved to /var/cache/conftool/dbconfig/20211104-154543-root.json [15:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:47] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [15:45:49] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Use --no-ssh-key-check for initial node setup [cookbooks] - 10https://gerrit.wikimedia.org/r/736810 [15:45:52] (03PS2) 10JMeybohm: Rename everything to cfssl-issuer, ensure e2e completed [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736807 (https://phabricator.wikimedia.org/T294560) [15:45:54] (03PS2) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) [15:45:56] (03PS2) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [15:46:19] (03PS2) 10Elukey: profile::kafka::broker: allow to override super_users [puppet] - 10https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905) [15:46:21] (03PS4) 10Elukey: Enable PKI TLS certificates for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) [15:46:23] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 
10https://gerrit.wikimedia.org/r/736810 (owner: 10Muehlenhoff) [15:46:25] (03CR) 10ArielGlenn: add credentials file for downloading enterprise html dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736461 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:46:43] (03PS1) 10Jbond: logstash - reporter: drop superfluous total [puppet] - 10https://gerrit.wikimedia.org/r/736811 [15:46:51] [re ulsfo transport for anyone watching - known/aware, dcops+netops working on related things in dcops chan] [15:48:23] (03CR) 10Majavah: Add simple-cfssl image for development and e2e tests (031 comment) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [15:48:47] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736811 (owner: 10Jbond) [15:49:40] (03CR) 10Jbond: [C: 03+2] logstash - reporter: drop superfluous total [puppet] - 10https://gerrit.wikimedia.org/r/736811 (owner: 10Jbond) [15:50:10] !log disable puppet fleet wide to deploy a puppet change [15:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [15:51:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32131/console" [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [15:52:19] RECOVERY - puppet last run on wcqs2002 is OK: OK: Puppet is currently disabled (merge Gerrit:736811), not alerting. 
Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:52:45] !log ppchelko@deploy1002 Started deploy [restbase/deploy@0848b15]: Add new wikis T292422 T294587 T294588 [15:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:50] T292422: Add amiwiki to RESTBase - https://phabricator.wikimedia.org/T292422 [15:52:50] T294587: Add pwnwiki to RESTBase - https://phabricator.wikimedia.org/T294587 [15:52:51] T294588: Add lmowiktionary to RESTBase - https://phabricator.wikimedia.org/T294588 [15:53:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736784 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [15:54:03] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Reedy) 05In progress→03Resolved a:03Reedy >>! In T294106#7452667, @Reedy wrote: >>list:member:digest:footer > > ` > _______________________________________________ > $display_name mai... 
[15:54:12] (03PS1) 10Dzahn: dumps: add a 'silent' flag fulldumps.sh to run it from timers [puppet] - 10https://gerrit.wikimedia.org/r/736815 (https://phabricator.wikimedia.org/T273673) [15:54:33] (03CR) 10Hnowlan: cassandra: allow empty tls_cluster_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [15:54:50] (03CR) 10Hnowlan: [C: 03+2] cassandra: fix incorrect parameter for version [puppet] - 10https://gerrit.wikimedia.org/r/736784 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [15:58:12] (03PS5) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) [15:58:28] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Use --no-ssh-key-check for initial node setup [cookbooks] - 10https://gerrit.wikimedia.org/r/736810 (owner: 10Muehlenhoff) [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:20] puppet window complete ✅ [16:00:26] checks, oh, heh [16:00:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 100%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17696 and previous config saved to /var/cache/conftool/dbconfig/20211104-160047-root.json [16:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:51] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [16:01:06] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [16:01:13] (03CR) 10JMeybohm: "Dear reviewers," [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:01:25] (03CR) 10Herron: [C: 03+1] "thanks for this! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [16:01:51] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [16:02:54] (03PS2) 10Dzahn: dumps: add a 'silent' flag fulldumps.sh to run it from timers [puppet] - 10https://gerrit.wikimedia.org/r/736815 (https://phabricator.wikimedia.org/T273673) [16:03:55] (03PS3) 10Dzahn: snapshot:: add a 'silent' flag to fulldumps.sh to run it from timers [puppet] - 10https://gerrit.wikimedia.org/r/736815 (https://phabricator.wikimedia.org/T273673) [16:04:01] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:06:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-enc-cli: added set_prefix_roles subcommand [puppet] - 10https://gerrit.wikimedia.org/r/736590 (owner: 10Andrew Bogott) [16:07:02] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-enc-cli: added set_prefix_roles subcommand [puppet] - 10https://gerrit.wikimedia.org/r/736590 (owner: 10Andrew Bogott) [16:08:51] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@0848b15]: Add new wikis T292422 T294587 T294588 (duration: 16m 06s) [16:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:56] T292422: Add amiwiki to RESTBase - https://phabricator.wikimedia.org/T292422 [16:08:57] T294587: Add pwnwiki to RESTBase - https://phabricator.wikimedia.org/T294587 [16:08:57] T294588: Add lmowiktionary to RESTBase - 
https://phabricator.wikimedia.org/T294588 [16:14:51] 10SRE, 10Internet-Archive, 10serviceops: Improve download speed from archive.org on appservers - https://phabricator.wikimedia.org/T295009 (10Yann) Slow bandwith from IA seems indeed the issue. I expected that upload-by-url (i.e. direct transfer from IA servers to WM servers) would just trump any limit to an... [16:16:47] 10SRE, 10Commons, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (10Legoktm)... [16:16:53] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.04018 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:17:56] (03PS3) 10Muehlenhoff: Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 [16:18:59] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/736804 (owner: 10Jbond) [16:20:38] (03CR) 10Muehlenhoff: [C: 03+2] Add further engineering managers for ops: approval [puppet] - 10https://gerrit.wikimedia.org/r/736451 (owner: 10Muehlenhoff) [16:20:56] (03PS1) 10Dzahn: parsoid::testing: (scandium) remove MediaWiki font packages [puppet] - 10https://gerrit.wikimedia.org/r/736818 (https://phabricator.wikimedia.org/T294378) [16:24:22] (03PS2) 10Dzahn: parsoid: remove mw font packages from test servers (scandium, testreduce1001) [puppet] - 10https://gerrit.wikimedia.org/r/736818 (https://phabricator.wikimedia.org/T294378) [16:26:07] (03PS1) 10Arturo Borrero Gonzalez: cloud: introduce network tests This test suite runs a bunch of commands to check if the cloud network is actually working as expected. 
Bug: T294955 Signed-off-by: Arturo Borrero Gonzalez Change-Id: Iab2dd5bbfe02d60c5f0ddc45e8fe0aba1a879b3a [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [16:27:50] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests This test suite runs a bunch of commands to check if the cloud network is actually working as expected. Bug: T294955 Signed-off-by: Arturo Borrero Gonzalez Change-Id: Iab2dd5bbfe02d60c5f0ddc45e8fe0aba1a879b3a [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Bor [16:28:05] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 308 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:29:13] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:30:28] (03PS3) 10Dzahn: parsoid: remove mw font packages from test server (scandium) [puppet] - 10https://gerrit.wikimedia.org/r/736818 (https://phabricator.wikimedia.org/T294378) [16:34:44] (03PS3) 10Andrew Bogott: start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 [16:34:46] (03PS1) 10MMandere: site: Add drmrs ganeti instances [puppet] - 10https://gerrit.wikimedia.org/r/736820 (https://phabricator.wikimedia.org/T282787) [16:35:33] (03CR) 10Andrew Bogott: start_instance_with_prefix: return id and fqdn of new instance (035 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [16:36:02] (03PS3) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 
10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) [16:36:04] (03PS3) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [16:39:02] (03CR) 10Dzahn: [C: 03+2] "since this is just a test server I am going ahead. added parsoid team members as CC as an FYI." [puppet] - 10https://gerrit.wikimedia.org/r/736818 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [16:39:20] (03CR) 10David Caro: start_instance_with_prefix: return id and fqdn of new instance (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [16:40:07] (03CR) 10jerkins-bot: [V: 04-1] start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [16:41:48] (03PS2) 10Arturo Borrero Gonzalez: cloud: introduce network tests This test suite runs a bunch of commands to check if the cloud network is actually working as expected. 
Bug: T294955 Signed-off-by: Arturo Borrero Gonzalez Change-Id: Iab2dd5bbfe02d60c5f0ddc45e8fe0aba1a879b3a [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [16:41:58] (03PS1) 10Majavah: cloud cumin: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/736821 [16:42:26] (03PS3) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [16:42:47] !log scandium (parsoid::testing) - purging MW font packages [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:14] (03CR) 10Dzahn: "Notice: /Stage[main]/Mediawiki::Packages::Fonts/Package[fonts-arabeyes]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/736818 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [16:44:37] (03PS1) 10Jelto: hiera::role::common::deployment_server update helmBinary codfw [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305) [16:45:21] (03CR) 10Jelto: [C: 04-1] "do not merge, services need to be redeployed first" [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [16:47:05] (03PS4) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [16:47:23] (03PS1) 10RLazarus: admin: Add aokoth to ops 🎉 [puppet] - 10https://gerrit.wikimedia.org/r/736823 [16:47:44] (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/736821 (owner: 10Majavah) [16:48:00] (03PS2) 10RLazarus: admin: Add aokoth to ops 🎉 [puppet] - 10https://gerrit.wikimedia.org/r/736823 [16:48:02] (03CR) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736599 
(https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:48:45] !log T294961 [WCQS] Power cycled all 6 wcqs* hosts via the mgmt console (`racadm serveraction powercycle`) [16:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:48] T294961: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 [16:49:05] (03CR) 10AOkoth: [C: 03+1] admin: Add aokoth to ops 🎉 [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [16:49:49] (03CR) 10RLazarus: [C: 03+2] admin: Add aokoth to ops 🎉 [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [16:49:51] (03CR) 10Legoktm: [C: 03+1] "🎉🎉🎉" [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [16:51:15] RECOVERY - SSH on wcqs2003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:51:15] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:51:23] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Add support to run pcc on cloud and production hosts - https://phabricator.wikimedia.org/T295062 (10jbond) [16:51:35] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Add support to run pcc on cloud and production hosts - https://phabricator.wikimedia.org/T295062 (10jbond) 05Open→03In progress p:05Triage→03Medium a:03jbond [16:51:44] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) [16:52:52] (03PS1) 10Dzahn: parsoid: remove mw font packages from canary servers [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) [16:53:18] (03PS1) 10Majavah: hieradata: add cloud-cumin-03 [puppet] - 
10https://gerrit.wikimedia.org/r/736846 [16:53:31] (03CR) 10jerkins-bot: [V: 04-1] parsoid: remove mw font packages from canary servers [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [16:54:13] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: add cloud-cumin-03 [puppet] - 10https://gerrit.wikimedia.org/r/736846 (owner: 10Majavah) [16:54:32] (03CR) 10Dzahn: [C: 03+1] "✨" [puppet] - 10https://gerrit.wikimedia.org/r/736823 (owner: 10RLazarus) [16:55:15] (03CR) 10David Caro: [C: 03+2] "thanks!" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [16:55:57] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 3 others: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10Milimetric) One idea is to move the logic that finds the host/port to wmfdata, and then to make refinery depend on wmfdata. Pro... [16:56:11] (03CR) 10David Caro: [C: 03+2] Black and isort all the things (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [16:56:47] (03PS2) 10Dzahn: parsoid: remove mw font packages from canary servers [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) [16:58:35] (03PS1) 10MMandere: site: Add drmrs ganeti instances [puppet] - 10https://gerrit.wikimedia.org/r/736847 (https://phabricator.wikimedia.org/T282787) [16:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [16:59:23] (03PS1) 10Elukey: role::statistics::explorer: add ml profile [puppet] - 10https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) [16:59:53] (03CR) 10Jbond: [C: 03+1] workspace: Improve docs and create missing directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735596 (owner: 10David Caro) [17:00:05] chrisalbon and accraze: #bothumor Q:Why did functions 
stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1700). [17:00:14] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:00:39] (03CR) 10Jbond: [C: 03+2] P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [17:01:05] (03CR) 10Jbond: [C: 03+2] hiera: logstash::curator_actions: Ensure keys are strings (second try) [puppet] - 10https://gerrit.wikimedia.org/r/736679 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [17:01:07] (03Merged) 10jenkins-bot: Black and isort all the things [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736726 (owner: 10David Caro) [17:02:16] (03CR) 10BBlack: [C: 03+1] site: Add drmrs ganeti instances [puppet] - 10https://gerrit.wikimedia.org/r/736847 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [17:02:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 (owner: 10David Caro) [17:03:01] (03CR) 10MMandere: [C: 03+2] site: Add drmrs ganeti instances [puppet] - 10https://gerrit.wikimedia.org/r/736847 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [17:03:39] (03CR) 10David Caro: [C: 03+2] Show the directory where fact files are searched on error [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736779 (owner: 10David Caro) [17:03:47] (03CR) 10Jbond: pcc: add support for cumin:R syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736804 (owner: 10Jbond) [17:03:54] (03CR) 10David Caro: [C: 03+2] workspace: Improve docs and create missing directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735596 (owner: 10David Caro) [17:05:08] (03CR) 10Jbond: pcc: add support for cumin:R 
syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736804 (owner: 10Jbond) [17:05:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736804 (owner: 10Jbond) [17:05:31] (03CR) 10Jbond: [C: 03+2] pcc: add support for cumin:R syntax [puppet] - 10https://gerrit.wikimedia.org/r/736804 (owner: 10Jbond) [17:09:26] (03PS2) 10Hnowlan: cassandra: allow empty tls_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/736799 [17:10:15] (03Merged) 10jenkins-bot: workspace: Improve docs and create missing directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735596 (owner: 10David Caro) [17:12:11] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32134/console" [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [17:13:12] (03CR) 10Jbond: cassandra: allow empty tls_cluster_name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [17:13:41] RECOVERY - puppet last run on wcqs1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:14:34] (03CR) 10Herron: "AIUI display_name may be used in the CGI, but it would not be output in alerts since we don't currently use $HOSTDISPLAYNAME$ in notificat" [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [17:14:51] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:24] (03CR) 10Jbond: [C: 03+1] "seems i responded too late 😊 (nit is minor)" [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [17:20:45] RECOVERY - SSH on wcqs2001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:21:45] 
(03PS3) 10Hnowlan: cassandra: allow empty tls_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/736799 [17:23:57] !log T294961 [WCQS] Installed kernel version `Linux 5.10.0-0.bpo.9-amd64` on all wcqs* hosts [17:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:01] T294961: Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 [17:24:29] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [17:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:44] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32136/console" [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [17:26:53] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: allow empty tls_cluster_name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736799 (owner: 10Hnowlan) [17:28:50] (03CR) 10Ottomata: [C: 03+1] profile::kafka::broker: allow to override super_users [puppet] - 10https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:29:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:27] (03CR) 10Herron: [C: 03+2] add logstash gelf relay to elastic1049 [puppet] - 10https://gerrit.wikimedia.org/r/721364 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [17:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1152 for mysql upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17697 and previous config saved to /var/cache/conftool/dbconfig/20211104-172950-ladsgroup.json [17:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:53] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [17:30:20] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db1152.eqiad.wmnet with reason: Maintenance T295026 [17:30:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1152.eqiad.wmnet with reason: Maintenance T295026 [17:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:36] (03PS2) 10Elukey: role::statistics::explorer: add ml profile [puppet] - 10https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) [17:33:07] !log Upgrade db1152 T295026 [17:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:56] (03CR) 10Ottomata: [C: 03+1] "Looks fine to me!" [puppet] - 10https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) (owner: 10Elukey) [17:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 50%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17698 and previous config saved to /var/cache/conftool/dbconfig/20211104-173926-root.json [17:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:30] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [17:40:32] (03CR) 10Jbond: "LGTM by as said over irc im not reviewing the go stuff 😊, please do keep me looped in on any progress though as im interested" [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [17:42:59] (03CR) 10Kormat: Add role::analytics_cluster::database::meta on an-db100[12] (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [17:46:34] !log enabling puppet on C:cassandra after profile::java transition [17:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:29] !log T288620 
[Elastic] Rebooting `elastic1049.eqiad.wmnet` to uptake new gelf settings change [17:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:32] T288620: Document path forward and Retire remaining non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T288620 [17:48:05] PROBLEM - Host elastic1049 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:10] (03PS3) 10Elukey: role::statistics::explorer: add ml profile [puppet] - 10https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) [17:49:12] (03PS1) 10Elukey: admin: reduce privileges for ml-team-admins [puppet] - 10https://gerrit.wikimedia.org/r/736852 [17:49:47] RECOVERY - Host elastic1049 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [17:50:46] !log restarted puppetdb.service on puppetdb2002 [17:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 100%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17699 and previous config saved to /var/cache/conftool/dbconfig/20211104-175429-root.json [17:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:34] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [17:54:59] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:19] 10SRE, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10hnowlan) [17:55:48] 10SRE, 10Cassandra, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability): Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966 (10hnowlan) 05Open→03Resolved a:03hnowlan [17:56:07] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1153 for mysql upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17700 and previous config saved to /var/cache/conftool/dbconfig/20211104-175606-ladsgroup.json [17:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1153.eqiad.wmnet with reason: Maintenance T295026 [17:57:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1153.eqiad.wmnet with reason: Maintenance T295026 [17:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:46] !log Upgrade db1153 T295026 [17:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1800). [18:00:04] jan_drewniak: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10DannyH) I approve; thank you. [18:01:23] Looks like I'm the only one. I'll deploy my patch then. 
[18:01:51] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736795 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:02:55] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736795 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:04:12] (03PS1) 10Herron: role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) [18:05:28] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:736795| Bumping portals to master (T128546)]] (duration: 01m 04s) [18:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [18:06:16] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:06:33] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:736795| Bumping portals to master (T128546)]] (duration: 01m 03s) [18:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:58] (03PS2) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [18:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:43] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [18:07:51] (03PS3) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (https://phabricator.wikimedia.org/T295062) [18:08:06] !log uploaded scap 4.0.3-2 to apt.wm.o for buster/stretch (T294966) [18:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:09] T294966: Deploy Scap version 4.0.3 - https://phabricator.wikimedia.org/T294966 [18:08:38] jan_drewniak: are you done deploying? if not just ping me whenever you are :) [18:08:56] legoktm: yup I'm done now :) [18:09:09] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (https://phabricator.wikimedia.org/T295062) (owner: 10Jbond) [18:09:14] ty [18:09:43] (03PS2) 10Herron: role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) [18:10:41] (03PS4) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [18:11:15] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-psi-eqiad.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:25] !log upgrading to scap 4.0.3 on canaries again (T294966) [18:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1153 (re)pooling @ 50%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17701 and previous config saved to /var/cache/conftool/dbconfig/20211104-181151-root.json [18:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:55] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [18:11:58] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [18:18:35] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:20:39] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1153 (re)pooling @ 100%: After upgrade T295026', diff saved to https://phabricator.wikimedia.org/P17703 and previous config saved to /var/cache/conftool/dbconfig/20211104-182655-root.json [18:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:59] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [18:34:20] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005643 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:44:22] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn fiber cut, Telia currently working on it https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:45:37] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn ongoing incident, Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:46:43] (03CR) 10Legoktm: [C: 03+1] parsoid: remove mw font packages from canary servers [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [18:46:49] (03PS1) 10Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) [18:53:08] (03PS2) 10Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) [18:53:55] (03CR) 10jerkins-bot: [V: 04-1] Enable 
TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [18:53:58] (03CR) 10Dzahn: [C: 03+2] parsoid: remove mw font packages from canary servers [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [18:56:12] (03PS3) 10Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) [18:56:18] (03PS1) 10Herron: logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) [18:56:58] (03CR) 10jerkins-bot: [V: 04-1] logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:59:13] (03PS2) 10Herron: logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) [19:00:04] dduvall and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T1900). [19:00:28] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:30] (03CR) 10Dzahn: "This is noop.. but I did not expect it to be noop and the fonts are still there.. must be a difference in how we include the fonts class.." 
[puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:01:30] (03PS3) 10Herron: logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) [19:03:04] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:07:17] (03PS1) 10Ladsgroup: mailman: rename public hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) [19:08:02] (03PS4) 10Herron: logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) [19:08:06] dduvall: around? need any help with train or do you have it handled? [19:09:16] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:09:44] (03CR) 10Dzahn: "Want to do all of them? there are some more using profile::mailman3: (the ones looking up passwords from private repo)" [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:10:31] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:10:35] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Legoktm) 05Open→03Resolved [19:10:41] twentyafterfour: here! 
always good to have backup, yeah [19:10:59] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Legoktm) [19:11:23] (03CR) 10Ladsgroup: mailman: rename public hiera keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:11:38] * twentyafterfour fires up kibana [19:11:53] (03CR) 10Dzahn: mailman: rename public hiera keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:12:50] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "PCC noop: Merging https://puppet-compiler.wmflabs.org/compiler1003/1052/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:13:48] (03CR) 10Dzahn: [C: 03+1] "lgtm, would still compile it" [puppet] - 10https://gerrit.wikimedia.org/r/736866 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:14:20] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736867 [19:14:22] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736867 (owner: 10Dduvall) [19:14:28] here we go [19:15:08] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736867 (owner: 10Dduvall) [19:15:24] (03PS1) 10RLazarus: admin: Shell account and analytics-privatedata-users for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/736868 (https://phabricator.wikimedia.org/T291508) [19:15:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10RLazarus) 05Open→03In progress [19:16:27] !log dduvall@deploy1002 rebuilt and synchronized 
wikiversions files: all wikis to 1.38.0-wmf.7 refs T293948 [19:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:31] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:18:01] (03CR) 10Herron: "PCC for elastic1049 https://puppet-compiler.wmflabs.org/compiler1002/32139/elastic1049.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:18:42] (03PS1) 10Bartosz Dziewoński: Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) [19:21:25] (03CR) 10Dzahn: [C: 03+1] "all looks good to me, has approval, UID matches, upgrade from ldap-only" [puppet] - 10https://gerrit.wikimedia.org/r/736868 (https://phabricator.wikimedia.org/T291508) (owner: 10RLazarus) [19:21:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:06] (03CR) 10Cwhite: [C: 03+1] "It's hardcoded, but that's probably fine for now." [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:22:21] (03CR) 10RLazarus: [C: 03+2] admin: Shell account and analytics-privatedata-users for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/736868 (https://phabricator.wikimedia.org/T291508) (owner: 10RLazarus) [19:23:39] (03CR) 10Jbond: [C: 03+1] puppetmaster::gitsync: Replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/732991 (https://phabricator.wikimedia.org/T273673) (owner: 10Majavah) [19:24:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/736499 (owner: 10Majavah) [19:25:20] twentyafterfour: looks good so far. 
i think i'll call it a deployment unless you see anything of concern [19:25:30] (03CR) 10Herron: [C: 03+2] "thx for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:25:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:49] (03PS5) 10Herron: logstash: switch monitoring API to port 9675 [puppet] - 10https://gerrit.wikimedia.org/r/736865 (https://phabricator.wikimedia.org/T288620) [19:27:52] dduvall: everything looks good to me [19:28:09] right on. thanks! [19:28:24] You're welcome! [19:28:45] Congrats on a fairly uneventful train [19:29:10] !log 1.38.0-wmf.7 on all wikis. no new errors or increase in error rates (refs T293948) [19:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:13] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:29:23] twentyafterfour: always a pleasant surprise :) [19:29:32] the uneventful part that is [19:29:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-access for NRodriguez - https://phabricator.wikimedia.org/T291508 (10RLazarus) 05In progress→03Resolved Shell account created with the above change. Kerberos principal created: ` rzl@krb1001:~$ sudo manage_princ... [19:30:00] (03CR) 10Dzahn: "Yea... so this turns out to be not as simple as I thought. The font class is used in profile::mediawiki::webserver. And while we do have t" [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:30:31] jbond: can I ask for merges for those two patches too? 
still can't do it myself [19:33:48] (03PS1) 10Herron: logstash_exporter: add service notify to defaults file [puppet] - 10https://gerrit.wikimedia.org/r/736872 (https://phabricator.wikimedia.org/T288620) [19:34:19] (03CR) 10Dzahn: "my previous comment is also not entirely correct. After testing to remove one of the font packages manually and running puppet.. it got re" [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:35:05] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736872 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:35:54] (03PS5) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [19:36:26] (03CR) 10DLynch: [C: 03+1] Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) (owner: 10Bartosz Dziewoński) [19:37:10] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [19:37:13] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) Thank you for the reply! > 1. Are you OK with using the `S3` protocol (rather than the Swift protocol)? Yes that would work well with the largish file like objects we intend to st... 
[19:39:23] (03CR) 10Herron: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/32140/" [puppet] - 10https://gerrit.wikimedia.org/r/736872 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:41:13] (03PS1) 10Ladsgroup: Rename mailman3 hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/736873 (https://phabricator.wikimedia.org/T282303) [19:42:13] (03CR) 10Cwhite: [C: 04-1] Add the first eventgate alert to Alertmanager (0315 comments) [alerts] - 10https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [19:42:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Rename mailman3 hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/736873 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:46:50] (03PS1) 10Ladsgroup: lists: Use the new private hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/736876 (https://phabricator.wikimedia.org/T282303) [19:50:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "PCC noop: https://puppet-compiler.wmflabs.org/compiler1002/32141/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/736876 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:55:20] (03CR) 10Accraze: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) (owner: 10Elukey) [19:58:53] (03PS6) 10Jbond: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 [19:59:01] PROBLEM - MariaDB Replica Lag: x2 #page on db1153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7246.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:59:16] (03PS6) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [19:59:33] Amir1: ^ [19:59:34] hello hello [19:59:38] downtime expired? 
[19:59:42] X2 isn't in use [19:59:46] hi [19:59:53] I messed it up [19:59:53] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [20:00:00] it should be repooled [20:00:01] o/ [20:00:10] it's not in use, ignore [20:00:15] yeah [20:00:16] ak thanks marostegui Amir1 [20:00:21] back to lunch then :) thanks both [20:00:23] Amir1: feel free to downtime it for 24h [20:00:35] yeah, let me downtime it [20:00:58] I'll resolve it in VO so it doesn't page again tomorrow [20:01:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1153.eqiad.wmnet with reason: Maintenance T295026 [20:01:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1153.eqiad.wmnet with reason: Maintenance T295026 [20:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:06] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026 [20:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:28] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:28] thanks legoktm [20:02:00] I'll investigate why it pages. [20:02:14] sorry for the spam [20:06:45] the lag here is 0 https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?viewPanel=13&orgId=1 [20:08:51] https://www.irccloud.com/pastebin/4eyPrajz/ [20:09:36] restarting mariadb [20:10:56] it should be fixed now, I restarted mysql [20:11:45] > Slave_IO_State: Waiting for master to send event [20:51:48] is something up with the irc bots? seems strange to have nothing for so long 
[20:56:36] 10Puppet, 10Infrastructure-Foundations: tes irc bots - https://phabricator.wikimedia.org/T295081 (10jbond) [20:57:02] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10colewhite) [20:57:04] 10SRE, 10Discovery-Search, 10Observability-Logging: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10colewhite) 05Open→03Resolved a:03colewhite This was completed. [20:57:26] 10Puppet, 10Infrastructure-Foundations: test irc bots - https://phabricator.wikimedia.org/T295081 (10jbond) 05Open→03Invalid [20:58:52] hmm, not sure; it seems to have picked back up and I don't see anything on icinga or irc1001 [20:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [20:59:21] maybe we were all just taking a few minutes of quiet contemplation [20:59:53] I mean not me, I was eating a burrito, but that's close enough [20:59:54] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Pchelolo) 05Open→03Resolved Ok, fixed all the stuck renames. Done. [21:00:02] lol maybe :D [21:00:33] although I did just close T1 and didn't see it [21:00:33] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1 [21:00:41] :o [21:01:01] perhaps it's too short to work with the api :P [21:01:42] * bd808 adds rzl to bash -- https://bash.toolforge.org/quip/ynjB7HwBa_6PSCT9o1fk [21:01:46] this channel is configured `SRE(-.*)?`, but that task is in "SRE Observability" [21:02:55] ahh that'll do it [21:03:01] OMG jbond! You closed T1!!! 
My hero :) [21:03:16] thanks AntiComposite [21:03:40] and bd hoping it stays closed but yes \o/ thanks for the comments and work btw [21:04:29] I got to close T2 when we ripped salt out of prod and made it moot :) [21:04:30] T2: Get salt logs into logstash - https://phabricator.wikimedia.org/T2 [21:05:10] 10SRE, 10Privacy Engineering, 10Security-Team, 10Wikimedia-Mailing-lists, and 2 others: /var/log/mailman/subscribe* has PII (IP addresses) from August 2020 - https://phabricator.wikimedia.org/T281619 (10Ladsgroup) [21:05:51] bd808: nice :) (had just checked to see if T2 was still open ;)) [21:13:29] hmmm I wonder what the lowest-numbered open task is, then [21:13:38] it's not single-digit but that's as far as I'm going by hand :) [21:14:12] (03PS1) 10BBlack: Remove digicert-2020 from puppet config [puppet] - 10https://gerrit.wikimedia.org/r/736890 (https://phabricator.wikimedia.org/T289507) [21:14:18] looks like T45 [21:14:19] T45: Phabricator should suggest possible duplicates when creating a new task - https://phabricator.wikimedia.org/T45 [21:15:01] I wonder if there's a dupe for T45? Guess we'll never know! :) [21:15:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [21:15:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: ensure_packages should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) [21:15:36] (03CR) 10BBlack: [C: 03+2] Remove digicert-2020 from puppet config [puppet] - 10https://gerrit.wikimedia.org/r/736890 (https://phabricator.wikimedia.org/T289507) (owner: 10BBlack) [21:15:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: ensure_packages should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) 05Resolved→03Open >>! In T195981#7477690, @MoritzMuehlenhoff wrote: >>>! 
In T195981#7477405, @jbond wrote: >> reqiure_package has... [21:17:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: the package resource should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) [21:18:03] RECOVERY - MariaDB Replica Lag: x2 #page on db1153 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:18:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: the package resource should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) have updated the title, the underlying issue is actually with the `apt` provider for the `package` type [21:20:16] Fun fact: T45 was the #1 most requested fix in the 2017 developer wishlist survey. [21:20:17] T45: Phabricator should suggest possible duplicates when creating a new task - https://phabricator.wikimedia.org/T45 [21:20:54] And basically upstream said "meh, I didn't like that feature inside Facebook and think it's too hard" [21:22:42] indeed that does sound useful [21:22:58] * jbond is reminded they need to check on the current state of phab development [21:23:28] https://we.phorge.it/ [21:24:39] thanks Reedy [21:25:06] https://we.phorge.it/T15023 seems very relevant and very open [21:26:16] nice, thx [21:26:31] !log cpNNNN: manual (cumin) removal of outdated digicert-2020 ocsp configuration and output files, to avoid icinga alerts and clean up [21:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:07] (03PS3) 10Hashar: role: system::role for all mediawiki roles [puppet] - 10https://gerrit.wikimedia.org/r/730004 [21:31:12] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:32:53] (03CR) 10Hashar: "That is for puppet-lint wmf style guide which 
complains when a role lacks a system::role." [puppet] - 10https://gerrit.wikimedia.org/r/730004 (owner: 10Hashar) [21:33:10] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:16] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 3891.32 ms [21:34:17] (03PS1) 10Jdlrobson: Fix loading of related articles via IntersectionObserver [extensions/RelatedArticles] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736830 (https://phabricator.wikimedia.org/T223844) [21:35:08] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:36:23] (03PS4) 10Hashar: Split canary jobrunner to their own role [puppet] - 10https://gerrit.wikimedia.org/r/724694 (https://phabricator.wikimedia.org/T291870) [21:39:16] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.58 ms [21:39:19] (03PS7) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [21:39:22] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 312.98 ms [21:39:56] (03CR) 10jerkins-bot: [V: 04-1] controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [21:45:04] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736907 [21:45:06] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736908 [21:49:41] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736909 [21:49:43] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736910 [21:53:40] (03CR) 10Jbond: [C: 04-1] 
"added moritz who has been doing some clean up in this area, but i think we shouldn't have this in the common role (which should possibly b" [puppet] - 10https://gerrit.wikimedia.org/r/730004 (owner: 10Hashar) [21:54:36] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736912 [21:54:38] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736913 [22:06:57] (03PS4) 10Andrew Bogott: start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 [22:06:59] (03PS1) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [22:08:47] (03PS1) 10Bearloga: statistics::product_analytics: Update contact group for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T291957) [22:09:52] (03CR) 10jerkins-bot: [V: 04-1] start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 (owner: 10Andrew Bogott) [22:09:54] (03CR) 10jerkins-bot: [V: 04-1] Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [22:14:19] (03PS5) 10Andrew Bogott: start_instance_with_prefix: return id and fqdn of new instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736581 [22:14:21] (03PS2) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [22:15:37] (03CR) 10Bearloga: statistics::product_analytics: Update contact group for monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [22:17:04] (03CR) 10jerkins-bot: [V: 04-1] Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 
(owner: 10Andrew Bogott) [22:18:27] (03PS3) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [22:21:56] (03PS2) 10Bearloga: statistics::product_analytics: Update contact group for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/736916 (https://phabricator.wikimedia.org/T291957) [22:22:41] (03CR) 10jerkins-bot: [V: 04-1] Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [22:39:46] (03PS2) 10Bartosz Dziewoński: Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) [22:44:27] (03PS1) 10Bartosz Dziewoński: Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 [22:45:51] (03PS2) 10Bartosz Dziewoński: Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 (https://phabricator.wikimedia.org/T270536) [22:50:14] jouncebot: next [22:50:14] In 0 hour(s) and 9 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T2300) [22:50:17] jouncebot: refresh [22:50:18] I refreshed my knowledge about deployments. [23:00:05] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T2300). [23:00:05] Juan_90264, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:22] hello [23:00:43] Hello, I'm present [23:00:52] * thcipriani waves [23:01:11] * thcipriani thought he updated jouncebot to ping him to deploy [23:01:13] ¯\_(ツ)_/¯ [23:01:28] hello [23:01:34] howdy all [23:02:53] (03CR) 10Thcipriani: [C: 03+2] Fix loading of related articles via IntersectionObserver [extensions/RelatedArticles] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736830 (https://phabricator.wikimedia.org/T223844) (owner: 10Jdlrobson) [23:02:58] jouncebot: now [23:02:59] For the next 0 hour(s) and 57 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211104T2300) [23:03:27] (03CR) 10Clare Ming: [C: 03+2] Allow bureaucrats to grant and revoke the importer rights to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736522 (https://phabricator.wikimedia.org/T294930) (owner: 10Juan90264) [23:04:24] (03Merged) 10jenkins-bot: Allow bureaucrats to grant and revoke the importer rights to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736522 (https://phabricator.wikimedia.org/T294930) (owner: 10Juan90264) [23:05:30] (03PS1) 10Dzahn: parsoid: move inclusion of mediawiki::webserver profile to role [puppet] - 10https://gerrit.wikimedia.org/r/736923 [23:06:31] Perfect merged! [23:07:19] Let's test? [23:07:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:36] getting it staged now :) [23:09:06] hi Juan_90264: can you test on mwdebug1002? [23:11:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:32] (03Merged) 10jenkins-bot: Fix loading of related articles via IntersectionObserver [extensions/RelatedArticles] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736830 (https://phabricator.wikimedia.org/T223844) (owner: 10Jdlrobson) [23:12:49] (03CR) 10Dzahn: "duuuh.. I used FQDNs in .yaml file names but it needs to be just $(hostname -s) of course, that explains" [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:12:49] Juan_90264: standing by for your thumbs up before syncing live [23:13:01] (03PS1) 10Esanders: Disable DiscussionTools mobile everywhere except for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736924 [23:13:18] Okay [23:14:18] Juan_90264: "Okay" it's good? Or "Okay" you are checking? [23:14:40] checking [23:15:09] (03Abandoned) 10Esanders: Disable DiscussionTools mobile everywhere except for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736924 (owner: 10Esanders) [23:15:12] I approved [23:15:20] (03CR) 10Bartosz Dziewoński: "Oh, I'm doing this in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/736918 D:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736924 (owner: 10Esanders) [23:15:31] (03CR) 10Esanders: [C: 03+1] Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 (https://phabricator.wikimedia.org/T270536) (owner: 10Bartosz Dziewoński) [23:15:37] cool, thank you, syncing! 
[23:15:54] (03PS1) 10Dzahn: parsoid: remove mw font packages from parsoid-canary, for real [puppet] - 10https://gerrit.wikimedia.org/r/736925 (https://phabricator.wikimedia.org/T294378) [23:16:45] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:736522|Allow bureaucrats to grant and revoke the importer rights to enwikiversity (T294930)]] (duration: 00m 56s) [23:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:48] T294930: Allow en.wikiversity Bureaucrats to grant and revoke Importer right - https://phabricator.wikimedia.org/T294930 [23:17:26] (03CR) 10Dzahn: [C: 03+2] parsoid: remove mw font packages from parsoid-canary, for real [puppet] - 10https://gerrit.wikimedia.org/r/736925 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:17:27] ^ Juan_90264 should be live! Thanks for the patch! [23:19:34] Change working, thanks! [23:19:54] !log wtp1025, wtp1026, parse2001, parse2002 (parsoid-canary): purging mediawiki font packages (T294378) [23:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:57] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [23:20:25] hi Jdlrobson: can you check on mwdebug1002? [23:20:59] yep [23:21:01] looking [23:21:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:22] (03CR) 10Dzahn: "after follow-up fix the packages are now actually removed from the 4 canary servers (wtp1025/1026, parse2001/2002)" [puppet] - 10https://gerrit.wikimedia.org/r/736825 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:21:31] cjming: it's alive! phew. Please sync. [23:21:37] will do! 
[23:21:45] (03Abandoned) 10Dzahn: parsoid: move inclusion of mediawiki::webserver profile to role [puppet] - 10https://gerrit.wikimedia.org/r/736923 (owner: 10Dzahn) [23:22:34] (03PS3) 10Thcipriani: Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) (owner: 10Bartosz Dziewoński) [23:22:38] (03CR) 10Thcipriani: [C: 03+2] Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) (owner: 10Bartosz Dziewoński) [23:22:56] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/RelatedArticles: Backport: [[gerrit:736830|Fix loading of related articles via IntersectionObserver (T223844)]] (duration: 00m 55s) [23:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:00] T223844: Reserve space for the "Related articles" container - https://phabricator.wikimedia.org/T223844 [23:23:04] Jdlrobson: should be live now [23:23:24] checking cjming [23:23:24] (03Merged) 10jenkins-bot: Fix value of wgDTSchemaEditAttemptStepSamplingRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736869 (https://phabricator.wikimedia.org/T295052) (owner: 10Bartosz Dziewoński) [23:24:23] cjming: doesn't seem to be live.. [23:24:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:58] ^ cjming [23:26:08] we're investigating [23:27:40] Jdlrobson: the code is in place on a random appserver: client-side cache? or is this a resourceloader thing? 
[23:27:40] ack [23:27:51] (03PS1) 10Dzahn: cloudweb2002-dev (labtestwikitech): purge mediawiki font packages [puppet] - 10https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378) [23:27:54] thcipriani: the stylesheet is not loading [23:29:28] what's the link? [23:29:58] https://en.m.wikipedia.org/w/load.php?lang=en&modules=ext.cite.styles%7Cext.relatedArticles.styles%7Cext.wikimediaBadges%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cmobile.init.styles%7Cskins.minerva.base.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.wikimedia%7Cskins.minerva.mainMenu.icons%2Cstyles&only=styles&skin=minerva [23:30:03] Problematic modules: {"ext.relatedArticles.styles":"missing"} [23:30:28] when I cache bust it is fixed [23:30:33] so perhaps this is going to resolve itself in 5 mins? [23:30:43] Sounds likely [23:30:51] ResourceLoader has been acting a bit weird recently. Did anything change recently? [23:31:05] that's my hope. I can't seem to find the instructions for how to force resourceloader to do its thing anymore :\ [23:31:15] This happened during the last deploy: https://phabricator.wikimedia.org/T295079 [23:31:22] which I've never seen before [23:32:22] It's been longer than 5 minutes though, which is odd [23:32:31] okay it seems to be working now [23:32:40] hrm [23:32:40] So perhaps 10 minutes is the cache lifetime? [23:33:00] maybe... [23:33:11] anyway, should re-find the instructions for how to purge via mwscript [23:33:24] the startup module's cache lifetime is 5 minutes [23:33:33] thcipriani: purgeList.php won't help in this case [23:33:46] nah, that's not the one I'm thinking of [23:33:49] if you added a dependency to your code, it won't be available for those 5 minutes, because dependencies are stored in the startup module [23:34:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:34:09] there were instructions that I used once upon a time for making a resource loader instance and purging via mwscript iirc [23:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:18] but this is a very old memory :) [23:34:28] You can probably hack it manually, sure [23:34:29] if you added a new module, it likewise won't be available [23:34:39] Shouldn't be worth the effort though.. Just wait a little bit ;D [23:35:01] we generally just groan and accept the breakage for train deployments [23:35:13] fun [23:35:45] FWIW, i also know scap does something here [23:36:07] hi MatmaRex: can you check on mwdebug1002 for your 1st patch? [23:36:17] yeah, looking [23:36:38] Fun indeed! Anyway looks like we're good for now. [23:36:44] cjming: seems good [23:37:04] cool - syncing now [23:37:05] (03PS1) 10Dzahn: profile::openstack::base::wikitech::web: add parameter to remove font packages [puppet] - 10https://gerrit.wikimedia.org/r/736928 (https://phabricator.wikimedia.org/T294378) [23:37:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:54] (03CR) 10Dzahn: "Let me make it possible that we do this separately on "labtestwikitech" before production wikitech. Then we can actually be sure. 
Needs ht" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:38:21] (03PS1) 10Tim Starling: Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) [23:38:32] (03PS3) 10Clare Ming: Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 (https://phabricator.wikimedia.org/T270536) (owner: 10Bartosz Dziewoński) [23:38:36] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:736869|Fix value of wgDTSchemaEditAttemptStepSamplingRate (T295052)]] (duration: 00m 55s) [23:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:39] T295052: DivisionByZeroError: Modulo by zero - https://phabricator.wikimedia.org/T295052 [23:39:19] (03PS4) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [23:39:22] (03CR) 10Clare Ming: [C: 03+2] Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 (https://phabricator.wikimedia.org/T270536) (owner: 10Bartosz Dziewoński) [23:40:07] (03Merged) 10jenkins-bot: Disable upcoming DiscussionTools mobile interface, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736918 (https://phabricator.wikimedia.org/T270536) (owner: 10Bartosz Dziewoński) [23:40:09] (03PS2) 10Dzahn: profile::openstack::base::wikitech::web: add parameter to remove font packages [puppet] - 10https://gerrit.wikimedia.org/r/736928 (https://phabricator.wikimedia.org/T294378) [23:41:18] MatmaRex: can you test patch #2 on same test server? 
[23:42:07] cjming: it has no effect on production, the new value is identical to the default [23:42:19] ok then - syncing now [23:42:23] cjming: i don't see the expected effect on beta though, i guess that might happen separately? [23:43:57] MatmaRex: I think that *just* finished: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/19946/ [23:44:00] !log cjming@deploy1002 Synchronized wmf-config: Config: [[gerrit:736918|Disable upcoming DiscussionTools mobile interface, enable on beta (T270536)]] (duration: 00m 55s) [23:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:05] T270536: Introduce the Reply Tool on mobile - https://phabricator.wikimedia.org/T270536 [23:44:25] cjming: oh, indeed! thanks [23:44:30] thcipriani: ^ [23:44:30] (03CR) 10Arlolra: [C: 03+2] Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) (owner: 10Tim Starling) [23:44:38] (03CR) 10Scardenasmolinar: [C: 03+1] Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736863 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [23:44:54] !log end of UTC late backport & config window [23:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:03] (03CR) 10Arlolra: Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) (owner: 10Tim Starling) [23:45:11] (03CR) 10Arlolra: [C: 03+1] Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) (owner: 10Tim Starling) [23:46:00] (03CR) 10Tim Starling: [C: 03+2] Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) (owner: 10Tim Starling) 
[23:46:16] (03Merged) 10jenkins-bot: Add shorttimeout option to X-Wikimedia-Debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736929 (https://phabricator.wikimedia.org/T293568) (owner: 10Tim Starling) [23:46:35] (03CR) 10Dzahn: [V: 03+1] "being bold, effective noop on labweb* https://puppet-compiler.wmflabs.org/compiler1003/32146/" [puppet] - 10https://gerrit.wikimedia.org/r/736928 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:46:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] profile::openstack::base::wikitech::web: add parameter to remove font packages [puppet] - 10https://gerrit.wikimedia.org/r/736928 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:47:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:19] !log tstarling@deploy1002 Synchronized src/XWikimediaDebug.php: XWD timeout testing (duration: 00m 54s) [23:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:23] (03CR) 10Dzahn: "confirmed noop on both labweb (wikitech prod)" [puppet] - 10https://gerrit.wikimedia.org/r/736928 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:50:01] (03PS2) 10Dzahn: cloudweb2002-dev (labtestwikitech): purge mediawiki font packages [puppet] - 10https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378) [23:50:21] (03CR) 10Dzahn: "now possible after I13e96cb4ec6f569954" [puppet] - 10https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:51:03] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: XWD timeout testing T293568 (duration: 00m 54s) [23:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:05] T293568: PHP Notice: Undefined offset in wikimedia/remex-html when rendering rest.php error page - 
https://phabricator.wikimedia.org/T293568 [23:51:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:20] (03PS3) 10Dzahn: cloudweb2002-dev (labtestwikitech): purge mediawiki font packages [puppet] - 10https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378)