[00:17:12] (PS12) Krinkle: Create a Jenkins job for Fresh that runs tests inside a Qemu VM [integration/config] - https://gerrit.wikimedia.org/r/593034 (https://phabricator.wikimedia.org/T250808)
[00:22:25] (PS13) Krinkle: Create a Jenkins job for Fresh that runs tests inside a Qemu VM [integration/config] - https://gerrit.wikimedia.org/r/593034 (https://phabricator.wikimedia.org/T250808)
[00:27:31] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments, User-DannyS712, User-brennen: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (LucasWerkmeister)
[00:32:39] (PS1) Krinkle: wmui-osproject: remove dangling caret icon from right side of tile [integration/docroot] - https://gerrit.wikimedia.org/r/594816
[00:36:34] (PS2) Krinkle: wmui-osproject: remove dangling caret icon from right side of tile [integration/docroot] - https://gerrit.wikimedia.org/r/594816
[00:55:02] (CR) VolkerE: [C: +1] wmui-osproject: remove dangling caret icon from right side of tile [integration/docroot] - https://gerrit.wikimedia.org/r/594816 (owner: Krinkle)
[01:13:10] (CR) Krinkle: [C: +2] wmui-osproject: remove dangling caret icon from right side of tile [integration/docroot] - https://gerrit.wikimedia.org/r/594816 (owner: Krinkle)
[01:13:38] (Merged) jenkins-bot: wmui-osproject: remove dangling caret icon from right side of tile [integration/docroot] - https://gerrit.wikimedia.org/r/594816 (owner: Krinkle)
[02:16:17] Beta-Cluster-Infrastructure, Product-Infrastructure-Team-Backlog, Wikidata: Description property missing in beta cluster WP - https://phabricator.wikimedia.org/T251550 (Mholloway) Now visible on group1 production Wikipedias, for example, hewiki: https://he.wikipedia.org/w/api.php?action=query&prop=d...
[03:04:50] 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Release, 10Train Deployments, 10User-DannyS712, 10User-brennen: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (10brennen) Rolled back to group0 at 02:56 UTC for T252079; writing a "blocked" status update mail. [03:44:25] 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Release, 10Train Deployments, 10User-DannyS712, 10User-brennen: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (10brennen) a:05brennen→03hashar Logging off; reassigning to Antoine for now as backup conductor i... [03:54:31] (03PS1) 10BearND: Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 [03:55:39] (03CR) 10jerkins-bot: [V: 04-1] Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [03:58:30] (03CR) 10BearND: "Not sure if this is the right approach and where the previously used templates (service-pipeline-test + service-pipeline-test-and-publish)" [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [06:28:57] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:14] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:31:51] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:46] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92907 bytes in 0.988 second response time [06:54:07] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 52086 bytes in 1.025 second response time [06:56:43] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 93217 bytes in 0.995 
second response time [07:51:42] 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO, 10Operations, 10Patch-For-Review: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) @hashar So.. when do we schedule the ma... [09:43:41] (03CR) 10Lars Wirzenius: [C: 03+2] Create a Jenkins job for Fresh that runs tests inside a Qemu VM (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/593034 (https://phabricator.wikimedia.org/T250808) (owner: 10Krinkle) [09:44:40] (03Merged) 10jenkins-bot: Create a Jenkins job for Fresh that runs tests inside a Qemu VM [integration/config] - 10https://gerrit.wikimedia.org/r/593034 (https://phabricator.wikimedia.org/T250808) (owner: 10Krinkle) [10:20:23] 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO, 10ChangeProp, 10Operations, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10hnowlan) a:05holger.knust→03hnowlan [11:20:40] o/ I just +2ed a revert on the WIkibase branch for the thing blocking the train btw [11:20:51] will aim to deploy it soon ish [11:20:56] / backport it [11:58:07] last shinken [12:13:57] 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Release, 10Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (10Addshore) [12:13:59] 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Release, 10Train Deployments, 10User-DannyS712, 10User-brennen: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (10Addshore) [13:14:56] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:13] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:53] PROBLEM - English Wikipedia Main page on beta-cluster is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:15] "check coverage" should work on any repo we do coverage checking, right? [13:28:13] hashar: ^ [13:29:47] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92907 bytes in 0.951 second response time [13:30:04] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 52056 bytes in 0.979 second response time [13:32:43] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 93217 bytes in 0.990 second response time [13:35:53] Reedy: might? ;) [13:36:02] need a job to be defined [13:36:12] I mean, that has to be configured in zuul/layout.yaml [13:36:19] https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/ [13:36:26] It's not working on https://gerrit.wikimedia.org/r/#/c/mediawiki/libs/IPUtils/+/594895/ though [13:36:45] Or is the pre-merge different? [13:52:00] https://github.com/wikimedia/integration-config/blob/master/zuul/layout.yaml#L7997-L8002 [13:52:23] Reedy: so that is only published AFTER the change got merged [13:52:33] Right, I'm not looking for that to change [13:52:45] I'm looking for the "code coverage increased/decreased" response from jerkins [13:52:55] ahh [13:53:00] we don't have that for libs [13:53:05] boo :( [13:53:14] Any particular reason? [13:53:39] nobody went to add it ? 
;)
[13:53:49] makes sense
[13:53:52] there is a "coverage" pipeline in zuul
[13:53:54] * Reedy files a task in the first instance
[13:54:12] and the only job it has is mwext-phpunit-coverage-patch-docker
[13:54:28] which is tied to mediawiki/core / mediawiki extensions
[13:54:36] Continuous-Integration-Config: Make "check coverage" work on libs - https://phabricator.wikimedia.org/T252120 (Reedy)
[13:54:49] I guess it would not be too complicated to create a generic coverage job :]
[13:57:41] Continuous-Integration-Config: Make "check coverage" work on libs - https://phabricator.wikimedia.org/T252120 (Reedy)
[14:03:01] looks like quite a bit of copy paste
[14:03:35] Though, running phpunit is different (`composer cover` instead)
[14:03:59] But shouldn't need all the DB setup and stuff
[14:28:12] Scap, Operations, Wikidata, Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Gehel)
[14:41:32] hia, i'm locked out of my phabricator account due to the 2fa TOTP reset that happened in january
[14:41:49] My laptop is on the fritz, so i'm using an older computer
[14:42:07] and it has been forever since I had to re-log into phabricator
[14:42:37] i don't seem to have any 2FA set up on my phone for phabricator...I think I got a new phone since I had to log in
[14:43:13] twentyafterfour: can you help? or point me to where I should ask?
[14:47:36] (03PS2) 10BearND: Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 [14:48:27] (03CR) 10jerkins-bot: [V: 04-1] Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [14:49:10] (03PS3) 10BearND: Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 [14:49:56] (03CR) 10jerkins-bot: [V: 04-1] Make typescript services use pipeline config [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [14:52:06] (03CR) 10BearND: "Ouch! That commit message check doesn't even allow long URLs. No idea why the zuul_tests fail." [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [14:52:47] ottomata: can you verify your identity for me? [14:52:56] I can remove your auth tokens after that [14:53:05] how would you like me to? [14:53:06] "I am Andrew Otto" [14:53:16] no, i AM Otto [14:53:50] "I am Andrew Otto and I love bicycles and am a loyal member of the church of Kafka and event streams" [14:54:34] haha, that's like 95% of the way there, but I could know that from your instagram ;) [14:54:39] * greg-g sent you a hangout url [14:57:35] ottomata: done [14:58:06] awesome thank you, adding new 2fa [14:58:08] hashar: Hey. Does https://gerrit.wikimedia.org/r/c/integration/config/+/594804 look right to you? How can we test it? Just merge and try to deploy? [14:59:33] James_F: I guess you can try running one of the fab target? [14:59:50] if that works, well it means integration-cumin-01 is fine ;) [14:59:56] hashar: Could you? I've only ever run cumin once. [15:00:14] My DevOps-fu is clearly lacking. ;-) [15:00:19] fab deploy_save_scripts [15:00:20] ;D [15:00:43] * hashar tries [15:01:18] No hosts found that matches the query [15:01:25] cause it selects 'name:slave-docker' [15:02:12] What label is that trying to read? 
[15:02:17] Has it been broken for a while? [15:07:37] (03PS3) 10Hashar: Point fabric at new, stretch-based CI cumin host [integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [15:07:39] (03PS1) 10Hashar: fab: update cumin selector for Docker agents [integration/config] - 10https://gerrit.wikimedia.org/r/594976 [15:08:43] (03CR) 10Hashar: "integration-cumin-01:~$ sudo cumin --force '*' 'hostname'" [integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [15:09:19] gotta arm the keyholder [15:10:56] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:13] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:44] !log mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=enwiki --print-orphaned-records | T246539 [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:11:46] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [15:12:01] !log Armed keyholder on integration-cumin-01 using key from integration-puppetmaster-02:/var/lib/git/labs/private/files/ssh/ [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:12:25] !log Armed keyholder on integration-cumin-01 using key from integration-puppetmaster-02:/var/lib/git/labs/private/files/ssh/ # T236576 [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:12:28] T236576: Move all Wikimedia CI (WMCS integration project) instances from jessie to stretch - https://phabricator.wikimedia.org/T236576 [15:13:04] (03CR) 10Hashar: [C: 04-1] "I have armed the keyholder on integration-cumin-01 but somehow it can not ssh to the instances :-(" 
[integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [15:13:24] James_F: there is an issue with the keyholder/ssh from cumin-01 to the instances [15:13:38] no bandwith to debug it now though [15:13:51] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:53] :-( [15:21:00] James_F: trying now ;) [15:25:23] Authentication tried for root with correct key but not from a permitted host (host=172.16.1.185, ip=172.16.1.185). [15:25:24] oh [15:26:10] (03CR) 10Jforrester: "> Patch Set 1:" [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [15:28:06] I guess there is some restriction somewhere via hiera [15:29:12] (03CR) 10BearND: "> Making your own service runner that also just runs the test pipeline like the generic runner is essentially a no-op; it has some trivial" [integration/config] - 10https://gerrit.wikimedia.org/r/594823 (owner: 10BearND) [15:29:31] I couldn't find the IPs in git anywhere. 
[15:30:16] yeah
[15:30:21] so we need cumin_masters
[15:30:29] maybe it is set in horizon at the instance level
[15:32:46] Path "/etc/puppet/hieradata/cloud/eqiad1/integration/hosts/integration-cumin.nuyaml"
[15:32:46] Original path: "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}"
[15:32:46] Found key: "cumin_masters" value: [
[15:32:46] "172.16.4.46",
[15:32:46] "172.16.6.133"
[15:32:47] ]
[15:33:02] from integration-puppetmaster-02: sudo puppet lookup --explain --compile --node integration-cumin.integration.eqiad.wmflabs cumin_masters
[15:33:43] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 93209 bytes in 1.014 second response time
[15:35:48] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92914 bytes in 1.128 second response time
[15:35:59] which are the WMCS IPs, bah
[15:36:04] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 52077 bytes in 1.031 second response time
[15:36:23] hashar: Found it
[15:36:24] https://horizon.wikimedia.org/project/puppet/
[15:36:30] Oh, that URL isn't useful.
[15:36:41] so we gotta add 172.16.1.185
[15:36:43] profile::openstack::eqiad1::cumin::project_masters:
[15:36:43] - 172.16.1.103
[15:36:53] and !log it ;]
[15:37:02] then run puppet on all agents
[15:37:22] Hmm. But that value isn't what the boxes report?
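Editor's sketch, not part of the channel log: the "correct key but not from a permitted host" rejection and the `cumin_masters` lookup above suggest a source-address allowlist. The Python below is only an illustration of that idea. The helper name `is_permitted` is hypothetical, and it is an assumption that the agents' permitted-host list is exactly the looked-up value.

```python
import ipaddress

def is_permitted(source_ip, permitted):
    """Return True if source_ip matches any allowlist entry.

    Entries may be single addresses or CIDR ranges. This is a simplified
    model of a permitted-host check, not how sshd or puppet implement it.
    """
    src = ipaddress.ip_address(source_ip)
    return any(
        src in ipaddress.ip_network(entry, strict=False)
        for entry in permitted
    )

# Addresses quoted in the log; treating them as the live allowlist is an assumption.
cumin_masters = ["172.16.4.46", "172.16.6.133"]
new_cumin_host = "172.16.1.185"  # integration-cumin-01

rejected_before = not is_permitted(new_cumin_host, cumin_masters)
accepted_after = is_permitted(new_cumin_host, cumin_masters + [new_cumin_host])
```

Under those assumptions the new host fails the check until its address is appended, which matches the fix discussed in the channel: add 172.16.1.185, then run puppet on the agents.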
[15:38:04] !log foreachwiki extensions/AbuseFilter/maintenance/updateVarDumps.php --print-orphaned-records | T246539
[15:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:38:06] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539
[15:41:12] James_F: maybe I got the IP wrong, sorry
[15:41:30] (CR) Jforrester: "> Patch Set 3:" [integration/config] - https://gerrit.wikimedia.org/r/594823 (owner: BearND)
[15:41:38] integration-cumin 172.16.1.103
[15:42:00] integration-cumin-01: 172.16.1.185
[15:42:11] so those need to be in hiera variable "cumin_masters"
[15:42:21] then using integration-cumin we can batch run puppet on all agents
[15:42:43] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Patch-For-Review, Release, Train Deployments, and 2 others: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (brennen) a:hashar→brennen
[15:43:06] !log Added integration-cumin-01 (172.16.1.185) to WMCS integration cumin master list
[15:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:43:30] hashar: Puppet will run in 10 mins anyway, right?
[15:43:40] Well, I guess some might be behind if there are issues.
[15:44:26] integration-cumin$ sudo cumin --force --batch-size 5 '*' 'puppet-run'
[15:44:37] Sure.
[15:45:30] hashar: Are you running or should I?
[15:47:31] I'll run it.
[15:48:50] James_F: it is running
[15:48:58] cumin is a good tool
[15:49:33] Oh, I guess we're both running it, then. :-)
[15:49:37] one has to get familiar with the ways to select hosts (I just use either '*' for everything or 'name:agent' to select based on instance names)
[15:49:47] --force to avoid the "are you sure?" prompt
[15:51:38] ok that ran
[15:51:56] so on integration-cumin-01 you can try: sudo cumin --force '*' 'hostname'
[15:52:01] Should I do it?
[15:52:04] yeah
[15:52:10] as an exercise I guess
[15:52:14] the commands run as root
[15:52:28] to make it easier to brick the remote instances ;]
[15:52:30] Permission denied (publickey).
[15:52:34] :/
[15:52:36] All 26 failed.
[15:52:46] bah
[15:52:48] Which is better than some of them working and some not, I guess.
[15:52:53] should have run puppet on a single host to confirm the change
[15:54:06] yeah that worked
[15:54:20] but now it fails for some other reason bah
[15:54:54] Hmm.
[15:58:09] sign_and_send_pubkey: signing failed: agent refused operation
[15:58:26] That's new.
[15:59:02] so keyholder rejects it
[15:59:30] (CR) Dduvall: [C: -1] Make typescript services use pipeline config (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/594823 (owner: BearND)
[15:59:53] restarted keyholder, arming it again
[16:01:14] MOAUAHAHAHHAHA
[16:01:16] puppet sucks
[16:01:23] or our manifests do
[16:01:30] I guess what happened is that keyholder got started
[16:01:34] then its configuration got deployed
[16:01:41] so it was not taken into account
[16:01:45] restarting it did the trick
[16:02:10] marxarelli: Wow, I managed to summon you to look at that without even asking yet. :-)
[16:02:20] Helpful.
[16:02:25] :)
[16:02:48] (Abandoned) Thcipriani: Gerrit 2.16.16 [software/gerrit] (deploy/wmf/stable-2.16) - https://gerrit.wikimedia.org/r/574092 (https://phabricator.wikimedia.org/T200739) (owner: Thcipriani)
[16:03:11] integration-agent-stretch-1001.integration.eqiad.wmflabs is the only one not reachable
[16:03:34] hashar: Yeah, integration-agent-stretch-1001 needs puppet fixes.
[16:03:36] No space left on device
[16:03:42] ?:
[16:03:42] !
[16:03:53] Delete the logs and see what happens.
[16:04:15] But / has 11G free?
[16:04:23] * hashar changes title to Yak Shaver
[16:04:28] * James_F laughs.
[16:04:36] What even is integration-agent-stretch-1001?
[16:05:17] OHH
[16:05:17] no
[16:05:19] easy
[16:05:23] so df
[16:05:27] reports there is plenty of space
[16:05:34] but if you look at inodes with df -hi
[16:05:44] /dev/vda2 1.2M 1.2M 0 100% /
[16:05:48] 1.2 million inodes
[16:05:49] Ooooh.
[16:05:54] Lots of tiny log files?
[16:06:20] cause the pbuilder builds are not cleaned out in /var/cache/pbuilder/build bah
[16:06:43] something weird happened at some point
[16:06:59] !log integration-agent-stretch-1001 : clearing out /var/cache/pbuilder/build
[16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:07:48] I am not sure what that agent is for to be fair
[16:07:55] Me neither.
[16:08:00] Switch it off and see what breaks?
[16:08:05] hold on ;)
[16:08:50] It's not listed on https://integration.wikimedia.org/ci/computer/
[16:09:17] and not on https://sal.toolforge.org/releng?p=0&q=%22integration-agent-stretch-1001%22&d=
[16:09:25] Yeah.
[16:09:28] Experiment?
[16:09:30] it is not attached to jenkins (no agent running)
[16:09:35] Let's kill it.
[16:09:38] horizon might let us know who spawned it and when
[16:09:53] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:10:02] Looking.
[16:10:20] You created it on 27 September 2019.
[16:10:25] damn You
[16:10:29] always breaking stuff
[16:10:31] It's a role::ci::slave::labs.
[16:10:38] * James_F grins.
[16:10:43] I suggest we just kill it.
[16:10:45] might have been to test the migration of the debian package builder
[16:10:49] yeah +1
[16:10:51] Save some server kitties.
[16:10:55] OK, shutting it down now.
[16:11:08] and I would then have gone with https://integration.wikimedia.org/ci/computer/integration-agent-pkgbuilder-1001/
[16:11:15] !log Shutting down integration-agent-stretch-1001, unused.
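Editor's sketch, not part of the channel log: the "No space left on device" puzzle above (plenty of free bytes, but `df -hi` showing 100% inode use) can be checked programmatically. The Python below is a minimal illustration using the standard `os.statvfs` call. The helper names and the 99% threshold are assumptions for illustration, and the percentages are computed more simply than `df` does.

```python
import os

def percent_used(total, free):
    """Percentage of a resource in use; 0.0 when the filesystem reports no total."""
    if total <= 0:
        return 0.0
    return (total - free) * 100.0 / total

def usage(path="/"):
    """Report block (byte) and inode usage for the filesystem holding path."""
    st = os.statvfs(path)  # Unix-only
    return {
        "blocks_pct": percent_used(st.f_blocks, st.f_bfree),
        "inodes_pct": percent_used(st.f_files, st.f_ffree),
    }

def exhausted(path="/", threshold=99.0):
    """Names of the resources at or above threshold (hypothetical cut-off)."""
    return [name for name, pct in usage(path).items() if pct >= threshold]
```

On a host in the state described above, `exhausted("/")` would be expected to list `inodes_pct` but not `blocks_pct`, which points at many small files (here, leftover pbuilder builds) rather than a few large ones.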
[16:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:11:25] meanwhile cumin-01 works ;] [16:11:36] so I guess we can merge your integration/config fab.py change [16:11:42] OK, so do we merge my patch and shut off the old one and remove its IP from puppet? [16:11:43] Yeah. [16:11:45] delete integration-cumin and strike that task [16:11:52] yeah [16:11:55] it works ;] [16:11:59] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:04] well done! [16:12:14] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:21] Well done you, too. And Jeena and Alex and everyone who helped. [16:12:25] (03CR) 10Hashar: [C: 03+2] Point fabric at new, stretch-based CI cumin host [integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [16:12:31] (03CR) 10Hashar: [C: 03+2] "works now ;)" [integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [16:12:47] James_F: and https://gerrit.wikimedia.org/r/#/c/integration/config/+/594976/ fix the cumin selector ( slave > agent ) [16:13:02] (03CR) 10Jforrester: [C: 03+2] fab: update cumin selector for Docker agents [integration/config] - 10https://gerrit.wikimedia.org/r/594976 (owner: 10Hashar) [16:13:54] (03Merged) 10jenkins-bot: fab: update cumin selector for Docker agents [integration/config] - 10https://gerrit.wikimedia.org/r/594976 (owner: 10Hashar) [16:13:57] (03Merged) 10jenkins-bot: Point fabric at new, stretch-based CI cumin host [integration/config] - 10https://gerrit.wikimedia.org/r/594804 (https://phabricator.wikimedia.org/T236576) (owner: 10Jforrester) [16:14:27] !log Dropped integration-cumin (172.16.1.103) from WMCS integration cumin master list [16:14:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:15:05] !log Shutting off integration-cumin for T236576 [16:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:15:08] T236576: Move all Wikimedia CI (WMCS integration project) instances from jessie to stretch - https://phabricator.wikimedia.org/T236576 [16:16:17] PROBLEM - Host integration-agent-stretch-1001 is DOWN: CRITICAL - Host Unreachable (172.16.0.192) [16:16:29] 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review: Move all Wikimedia CI (WMCS int... - https://phabricator.wikimedia.org/T236576 [16:16:49] two less instances in one shot \o/ [16:17:11] We also should probably drop cumin-02 (on buster, doesn't work). [16:17:57] !log Shut down integration-cumin-02, non-functional buster cumin host. [16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:19:50] yeah [16:20:06] and later we can create an integration-cumin with buster when v.olans has build the deb package for it [16:20:12] Yeah. [16:20:44] !log Deleted integration-cumin.integration.eqiad.wmflabs for last part of T236576 [16:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:20:47] T236576: Move all Wikimedia CI (WMCS integration project) instances from jessie to stretch - https://phabricator.wikimedia.org/T236576 [16:21:02] 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review: Move all Wikimedia CI (WMCS int... - https://phabricator.wikimedia.org/T236576 [16:21:17] hashar: Now we can start working on T252071! 
;-) [16:21:18] T252071: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster - https://phabricator.wikimedia.org/T252071 [16:21:52] 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review: Move all Wikimedia CI (WMCS int... - https://phabricator.wikimedia.org/T236576 [16:22:45] marxarelli: https://gerrit.wikimedia.org/r/c/mediawiki/services/push-notifications/+/594352 looks wrong to me but I can't work out why. [16:23:21] which part? the .pipeline/config.yaml? [16:25:57] i see an issue there [16:27:22] looks like the 'build' npm-script needs to be integrated into the blubber.yaml somehow [16:31:31] 10Continuous-Integration-Infrastructure, 10Operations, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) [16:32:35] brennen: looks like the train went all fine at least last time I checked the logs (1h30 ago) [16:32:41] 10Continuous-Integration-Infrastructure, 10Operations, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) [16:32:55] brennen: so i guess you will probably be able to promote to all wikis later today ;] I will try to be around [16:34:35] hashar: yep [16:35:04] keeping an eye on logs at present. i can't quite bring myself to believe that there won't be another blocker before promoting to all wikis, but perhaps. [16:37:02] marxarelli: James_F: Is it using the added .pipeline/config.yaml? I can't seem to figure out whether this new file has any effect and if not how to make it so. 
My thinking was that this would define the test variant somehow but frankly I may be way off since I'm very new to blubber/pipeline, and my Dockerfu is quite limited. This is just cobbled together from what I could find on the guides mentioned in the commit message. [16:38:20] The other thing I found very confusing was that the common build variant was called "build". It seems to be more a baseline thing/sharing config. [16:38:29] 10Continuous-Integration-Infrastructure, 10Operations, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10hashar) Must be a stale cache somewhere. It would help to have the headers. I haven't looked at t... [16:39:03] brennen: we will see ;:] [16:39:05] What I want to happen is basically: execute `npm run build` while it can still write to the file system [16:39:07] I am off for a bit [16:39:52] Also not sure if '${npm-run.imageID}' is a valid name in the config.yaml [16:41:49] 10Continuous-Integration-Infrastructure, 10Operations, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) ` HTTP/2 200 OK date: Thu, 07 May 2020 15:42:30 GMT server: Apache content-security-policy... 
[16:44:43] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 93217 bytes in 0.927 second response time [16:46:48] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92907 bytes in 0.901 second response time [16:47:05] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 52087 bytes in 1.079 second response time [16:48:03] 10Continuous-Integration-Infrastructure, 10Operations, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) ` HTTP/2 200 OK date: Thu, 07 May 2020 16:46:02 GMT server: Apache content-security-policy... [17:05:49] bearND: re: `npm run build` right! so here's the basic pattern: 1) define all image building stuff in blubber.yaml (think of that as packaging your software for distinctly useful runtimes). 2) define ci orchestration in .pipeline/config.yaml, which variants (distinct package) to build, run, publish, test-deploy, etc. [17:06:18] since running `npm run build` is part of building/packaging your software, it needs to go in blubber.yaml [17:07:56] and since it's a specialized (non generic npm thing like `npm install`) you'll probably want to utilize blubber's `builder` directive for that [17:07:57] https://wikitech.wikimedia.org/wiki/Blubber/User_Guide#Variant_Config [17:08:29] i can comment inline on your patchset if that's more helpful [17:09:12] marxarelli: ok. yes. that would be good. ty [17:10:26] is your `build` npm script typically run after `npm install`? [17:10:41] marxarelli: yes [17:11:22] well, `npm install` for now but eventually we'll add a package-lock.json. Then we would use `npm ci` instead [17:12:14] Also I think we'll want the build output to be available for production images, not just the test run [17:20:49] no problem. 
yeah, longma recently added support for `npm ci` and `.pipeline/config.yaml` supports publishing of built images to our registry
[17:31:06] bearND: commented on your patch. also, the reason the pipeline isn't being run yet is because your changes to integration/config have not yet merged
[17:31:49] (the need to alter integration/config is one roadblock keeping us from having this thing be fully self serve, fwiw)
[17:31:54] yeah, i remember that patch about npm ci
[17:32:54] marxarelli: so, i still need to go ahead with https://gerrit.wikimedia.org/r/c/integration/config/+/594823?
[17:33:15] you do, yes
[17:33:57] ok, got it. Will work on this in a bit. Right now I'm dealing with an issue deploying wikifeed changes.
[17:34:04] that part binds zuul (which schedules jobs for gerrit events) to your project's pipelines
[17:34:27] You wouldn't know by any chance where to find the `.hfenv` files?
[17:34:54] I was hoping they were in /srv/deployment-charts/helmfile.d/services/staging/wikifeeds
[17:36:06] could be that I'm reading this wrong or the steps have changed but that is not reflected on https://wikitech.wikimedia.org/wiki/Pipeline_admin#Managing_applications_in_the_cluster
[17:37:47] i'm totally sure about that bit
[17:38:00] _joe_ or akosiaris (not in channel atm) might be able to help you out there
[17:38:33] <_joe_> bearND: that's where they should be
[17:38:43] <_joe_> if they're not there still, there is some setup missing
[17:39:35] _joe_: shouldn't they be in the git repo?
[17:39:52] <_joe_> no
[17:40:01] <_joe_> they should be in production
[17:40:21] <_joe_> deploy1001:/srv/deployment-charts/helmfile.d/services/staging/wikifeeds has the file
[17:40:44] ok, then i'm probably on the wrong server. I was trying beta cluster first
[17:41:06] i'm on deployment-deploy01
[17:41:52] <_joe_> yeah, no. In beta we don't have kubernetes
[17:42:10] ah, that explains it
[17:43:52] Thanks, I'm just gonna do production then. There the `.hfenv` files exist.
:) [17:49:08] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Tested with helm 2." [releng/local-charts] - 10https://gerrit.wikimedia.org/r/594798 (owner: 10Jeena Huneidi) [18:02:19] (03PS1) 10Jeena Huneidi: Run helm init if using helm 2 [releng/local-charts] - 10https://gerrit.wikimedia.org/r/595010 [18:06:00] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Checked under helm 2 with no pre-existing cluster, looks good." [releng/local-charts] - 10https://gerrit.wikimedia.org/r/595010 (owner: 10Jeena Huneidi) [18:13:45] 10Beta-Cluster-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Wikidata: Description property missing in beta cluster WP - https://phabricator.wikimedia.org/T251550 (10Mholloway) Update: The breaking change was reverted in wmf.31 but left in master, so production wikis shouldn't be affected any long... [18:32:20] 10Phabricator (Search): Adjustments for substring matching - https://phabricator.wikimedia.org/T252149 (10kostajh) [18:33:27] 10Phabricator (Search): Adjust defaults for Search - https://phabricator.wikimedia.org/T252150 (10kostajh) [18:53:24] (03PS1) 10Krinkle: Enable Qemu VM job for Fresh [integration/config] - 10https://gerrit.wikimedia.org/r/595026 (https://phabricator.wikimedia.org/T250808) [19:09:58] 10Scap, 10Operations, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10colewhite) p:05Triage→03Medium [19:22:39] 10Scap, 10Operations, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10thcipriani) The `scap::dsh::groups` hiera configuration variable is capable of populating dsh group files from conftoo... 
[19:25:45] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Patch-For-Review, Release, Train Deployments, and 2 others: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (brennen) [19:33:46] (CR) Krinkle: [C: +2] Enable Qemu VM job for Fresh [integration/config] - https://gerrit.wikimedia.org/r/595026 (https://phabricator.wikimedia.org/T250808) (owner: Krinkle) [19:34:35] (Merged) jenkins-bot: Enable Qemu VM job for Fresh [integration/config] - https://gerrit.wikimedia.org/r/595026 (https://phabricator.wikimedia.org/T250808) (owner: Krinkle) [19:41:00] PROBLEM - Free space - all mounts on deployment-snapshot01 is CRITICAL: CRITICAL: deployment-prep.deployment-snapshot01.diskspace._data.byte_percentfree (No valid datapoints found) deployment-prep.deployment-snapshot01.diskspace.root.byte_percentfree (<10.00%) [19:46:00] RECOVERY - Free space - all mounts on deployment-snapshot01 is OK: OK: deployment-prep.deployment-snapshot01.diskspace._data.byte_percentfree (No valid datapoints found) [19:49:01] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (thcipriani) @DannyS712 @Pchelolo: hi both! Last train there were a few blockers related to ongoing revision work (thank you for the respo... [19:52:36] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/595026 [19:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:53:10] Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Fresh, Patch-For-Review: Decide how to run a test involving docker inside WMF CI - https://phabricator.wikimedia.org/T250808 (Krinkle) [20:11:17] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (DannyS712) >>! 
In T249964#6117475, @thcipriani wrote: > @DannyS712 @Pchelolo: hi both! Last train there were a few blockers related to on... [20:13:02] Continuous-Integration-Infrastructure, Operations, Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (colewhite) p:Triage→Medium [20:13:43] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (Reedy) > As for lessening risk, I started looking into requesting logstash access so I could help monitor new Revision issues. I'm not s... [20:15:17] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (DannyS712) >>! In T249964#6117539, @Reedy wrote: >> As for lessening risk, I started looking into requesting logstash access so I could h... [20:16:27] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (Reedy) >>! In T249964#6117553, @DannyS712 wrote: >>>! In T249964#6117539, @Reedy wrote: >>> As for lessening risk, I started looking into... [20:16:47] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (Pchelolo) Hey @thcipriani yeah... somehow the issue stacked up on this train. I've been thinking why did that happen as well and can't... [20:31:33] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (Reedy) >>! In T249964#6117553, @DannyS712 wrote: >>>! In T249964#6117539, @Reedy wrote: >>> As for lessening risk, I started looking into... 
[20:32:13] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 (thcipriani) >>! In T249964#6117528, @DannyS712 wrote: > Yeah, sorry about all of the UBNs :( Thanks for all the work you've been doing!... [20:39:34] brennen - What's the status on the roll-back of group 1 to wmf.30? [21:12:59] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:14] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:51] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:19] Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Fresh, Patch-For-Review: Decide how to run a test involving docker inside WMF CI - https://phabricator.wikimedia.org/T250808 (Krinkle) @hashar Looks like it worked when I triggered the builds manually, but not via Zuul. The iss... [21:25:43] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 93217 bytes in 1.034 second response time [21:27:49] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 92907 bytes in 0.857 second response time [21:28:09] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 52087 bytes in 2.126 second response time [21:41:32] (PS4) BearND: Make typescript services use pipeline config [integration/config] - https://gerrit.wikimedia.org/r/594823 [21:59:41] (PS5) BearND: Allow a service to use pipeline config file [integration/config] - https://gerrit.wikimedia.org/r/594823 [22:02:13] (CR) BearND: "If you think of a better name than `typescript-service` let me know or feel free to change it later. It's not really TypeScript specific. 
" (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/594823 (owner: BearND) [22:21:43] (CR) Dduvall: "> Patch Set 5:" (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/594823 (owner: BearND) [22:26:33] (CR) BearND: "> For publishing, you'll want to define a separate project pipeline that can be scheduled under postmerge as we don't want images publishe" [integration/config] - https://gerrit.wikimedia.org/r/594823 (owner: BearND) [22:28:21] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Scap, WMF-JobQueue, and 2 others: Add a jobrunner server to the Scap canary pool - https://phabricator.wikimedia.org/T172480 (Krinkle) [23:05:58] Release-Engineering-Team, Gerrit-Privilege-Requests, Pywikibot, Pywikibot-tests: Grant Pywikibot-gerritbot an access to Gerrit stream (stream-events group) - https://phabricator.wikimedia.org/T248262 (Legoktm) >>! In T248262#6083132, @Dvorapa wrote: > "Pywikibot-gerritbot" has been created For w... [23:09:37] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Patch-For-Review, Release, Train Deployments, and 2 others: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (brennen) [23:14:36] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Scap: scap clean leaving lots of empty directories on mw hosts - https://phabricator.wikimedia.org/T252177 (thcipriani) [23:15:19] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Scap: scap clean leaving lots of empty directories on mw hosts - https://phabricator.wikimedia.org/T252177 (thcipriani) p:Triage→Medium [23:53:49] Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Patch-For-Review, Release, Train Deployments, and 2 others: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 (DannyS712)