[00:43:52] Yippee, build fixed!
[00:43:53] Project beta-code-update-eqiad build #53422: FIXED in 52 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/53422/
[00:53:33] Continuous-Integration-Infrastructure, Deployment-Systems: Failed to create a temp file on /mnt/home/jenkins-deploy/workspace/beta-code-update-eqiad - https://phabricator.wikimedia.org/T97257#1237322 (thcipriani) From what I've been seeing all weekend, this job keeps failing because `deployment-bastion`...
[01:20:19] Yippee, build fixed!
[01:20:20] Project beta-update-databases-eqiad build #9203: FIXED in 19 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9203/
[02:07:13] PROBLEM - Puppet staleness on deployment-elastic07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[02:51:53] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[03:21:54] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:29:19] PROBLEM - Puppet staleness on deployment-redis01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[06:40:37] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[09:39:08] Release-Engineering, MediaWiki-Debug-Logging, Security-Team, operations, Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1237716 (fgiunchedi) ok even with unsampled xff fluorine grows at ~12G/day with ~800G free, if we're short on space again we can e...
[09:45:11] Deployment-Systems, Release-Engineering, Services, operations: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1237721 (akosiaris)
[13:08:09] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #628: FAILURE in 36 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/628/
[13:13:31] Continuous-Integration-Infrastructure, Deployment-Systems: Failed to create a temp file on /mnt/home/jenkins-deploy/workspace/beta-code-update-eqiad - https://phabricator.wikimedia.org/T97257#1238053 (greg) Could be core dumps from T93194 ?
[13:38:09] thcipriani|afk: when you return -- do you think that deployment-prep was stable enough over the weekend that we can declare victory?
[14:27:05] andrewbogott: I'd say the virt host fix was a success.
[14:27:20] great, thanks.
[14:27:25] the problems this weekend were all related to free space on libvirt guests
[15:32:55] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:34:36] hmm ^d I think zuul may be having a bad time.
[15:35:11] gate pipeline seems...high, also, merged this https://gerrit.wikimedia.org/r/#/c/206786 and no news...
[15:36:37] Deployment-Systems, Release-Engineering, Epic, releng-201415-Q4: EPIC: The future of MediaWiki deployment: Tooling - https://phabricator.wikimedia.org/T94620#1238329 (greg)
[15:37:29] Release-Engineering, Staging, releng-201415-Q3, releng-201415-Q4: [Quarterly Success Metric] Green nightly builds on the staging cluster (tracking) - https://phabricator.wikimedia.org/T88701#1238330 (greg) p:Low>Lowest
[15:38:01] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:42:13] nvmd, operator error.
[15:42:57] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 9.821 second response time
[15:57:54] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:37:51] (Abandoned) MarkAHershberger: Start up php mess detector rule set [tools/codesniffer] - https://gerrit.wikimedia.org/r/201956 (owner: MarkAHershberger)
[16:46:20] Hi, we would like to deploy a new extension, Josa, on the Beta Cluster, more specifically on ko.wikipedia.beta.wmflabs.org. Configuration change is located at https://gerrit.wikimedia.org/r/#/c/203627/. How could we proceed?
[17:02:29] Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1238670 (mmodell)
[17:02:31] Deployment-Systems, Release-Engineering, Epic, releng-201415-Q4: EPIC: The future of MediaWiki deployment: Tooling - https://phabricator.wikimedia.org/T94620#1238669 (mmodell)
[17:04:27] Deployment-Systems: Come up with an abstract deployment model that roughly addresses the needs of existing projects - https://phabricator.wikimedia.org/T97068#1238685 (mobrovac) Draft of the current node.js services deployment process: https://wikitech.wikimedia.org/wiki/User:Mobrovac/Service_Deployment
[17:04:31] ^d: ping for deployment meeting
[17:22:59] (PS2) JanZerebecki: [TODO] Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag)
[17:25:09] (CR) jenkins-bot: [V: -1] [TODO] Added job for WikidataQuality extension.
[integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag)
[17:31:10] !log Jenkins slave deployment-bastion deadlock waiting for executors
[17:31:12] Logged the message, Master
[17:31:33] Project beta-update-databases-eqiad build #9219: FAILURE in 18 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9219/
[17:32:01] !log Relaunch slave agent on deployment-bastion
[17:32:03] Logged the message, Master
[17:32:18] Krinkle: thcipriani said that /tmp on -bastion has been repeatedly full this weekend
[17:34:45] greg-g: Unrelated to that bug is the deadlocks. I believe https://phabricator.wikimedia.org/T96199 would help avoid the deadlocks. Could someone who knows more about the beta cluster perhaps work on that at some point?
[17:34:56] I'm checking -bastion now
[17:35:51] Krinkle: so, something weird I noticed on bastion
[17:36:12] getent passwd 992
[17:36:19] and getent passwd 603
[17:36:23] $ du -sh /tmp
[17:36:23] 1.9M mw-cache-master
[17:36:23] 1.2G scap_l10n_1283037312
[17:36:24] both map to mwdeploy
[17:36:25] 1.2G scap_l10n_1909596178
[17:36:27] That doesn't look good
[17:36:31] Does it not clean up after itself?
[17:36:45] Krinkle: it should, there is a find delete in scap
[17:36:59] thcipriani: Is that ensured to run if the process is aborted?
[17:37:08] Processes get aborted all the time when commits cancel each other out
[17:37:32] That directory is new right? I don't remember it being there in the past.
[17:37:44] Krinkle: no, it doesn't seem like it is ensured.
[17:37:57] yeah, it's a new directory, rather than generate files in place it moves them from /tmp
[17:38:06] l10nupdate files
[17:38:24] not sure of why the change other than the code looks a little cleaner
[17:38:28] thcipriani: Maybe source global-setup.sh from the start of the job, and global-teardown in postbuild
[17:38:38] That will give it a dedicated /tmp compartment, and then delete it afterwards
[17:38:43] without needing to know what happened
[17:38:50] assuming scap honours $TMPDIR
[17:39:18] We've had similar tmp leakage in other builds and those were all fixed by this
[17:39:22] Krinkle: pretty sure there would need to be modifications made :\
[17:40:03] probably easy to make though
[17:40:06] there are some other ongoing problems that I noticed over the weekend: 1. the dual mwdeploy users on deployment-bastion
[17:40:13] thcipriani: I can update the job to make use of our global-setup/teardown (surprised those aren't used by the beta jobs yet).
[17:40:22] Can you make sure it honours $TMPDIR?
[17:40:35] Unless it hardcodes /tmp, it probably works naturally already
[17:40:51] Krinkle: it does use hardcoded /tmp I'm pretty sure
[17:40:56] I'll look
[17:41:02] and make necessary mods
[17:43:07] Beta-Cluster, Continuous-Integration-Config, Math: beta-recompile-math-texvc-eqiad job fails with "/usr/local/bin/scap-recompile: No such file or directory" - https://phabricator.wikimedia.org/T91191#1238847 (Krinkle)
[17:43:36] thcipriani: Which task is about /tmp full ?
[17:43:40] Can't find it :/
[17:44:08] * thcipriani looks
[17:44:39] could be renamed: https://phabricator.wikimedia.org/T97257
[17:45:55] Beta-Cluster: beta-scap-eqiad fails due to rsync permissions issues - https://phabricator.wikimedia.org/T97314#1238855 (thcipriani) NEW
[17:46:48] Continuous-Integration-Infrastructure, Deployment-Systems: Failed to create a temp file on /mnt/home/jenkins-deploy/workspace/beta-code-update-eqiad - https://phabricator.wikimedia.org/T97257#1238875 (Krinkle)
[17:47:37] Continuous-Integration-Infrastructure, Deployment-Systems: Failed to create a temp file in beta-code-update-eqiad (Full deployment-bastion:/tmp) - https://phabricator.wikimedia.org/T97257#1238884 (Krinkle)
[17:48:13] :(
[17:50:38] Continuous-Integration-Infrastructure, Deployment-Systems: Failed to create a temp file in beta-code-update-eqiad (Full deployment-bastion:/tmp) - https://phabricator.wikimedia.org/T97257#1238941 (Krinkle) ``` $ du -sh /tmp 1.9M mw-cache-master 1.2G scap_l10n_1283037312 1.2G scap_l10n_1909596178 ``` Sca...
[17:51:43] greg-g: thcipriani: I'll update the job for you
[17:52:21] Krinkle: ty, I'm updating scap to use tempfile
[17:55:10] (PS1) Krinkle: Add global-setup and global-teardown to beta-code-update [integration/config] - https://gerrit.wikimedia.org/r/206850 (https://phabricator.wikimedia.org/T97257)
[17:57:05] Woops, deployment-beta:/srv/deployment/integration hasn't been updated since March 18 2014
[17:57:31] <^d> thcipriani, twentyafterfour: Apologies, I had a scheduling conflict this AM.
[17:58:50] ^d: it's ok, we just went over https://wikitech.wikimedia.org/wiki/User:Mobrovac/Service_Deployment and plan to continue in the same direction as last week, no new homework
[17:59:13] <^d> Ah, that the working draft?
[17:59:15] * ^d bookmarks
[17:59:36] yeah, that wikipage is helpful for me understanding the services deployment process better
[18:13:01] (PS1) Thcipriani: Make scap localization cache build $TMPDIR aware [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257)
[18:13:10] Hm.. something is wrong in how deployment-bastion is set up. When I add CI's fake /srv/deployment polyfill (which uses git::clone ensure latest instead of trebuchet) it fails because /srv/deployment is already declared. Apparently that host thinks it is a Trebuchet target.
[18:13:21] Except that when we deploy from tin that isn't really what happens.
[18:13:35] so it probably created the initial clone last year and has never been updated since.
[18:14:03] How is this handled for the other projects cloned under /srv/deployment?
[18:16:44] Krinkle: looks like deployment-bastion is a target for "scap" which is nothing (pretty sure) and "scap/scap" which _is_ scap.
[18:17:13] thcipriani: From where do you deploy?
[18:17:31] I guess we have a tin equivalent?
[18:17:57] I just need /srv/deployment/integration/slave-scripts to be more recent instead of last year's version
[18:18:19] Oh, right. This is the tin equiv
[18:18:29] Those directories aren't there as a target, they're there as a source
[18:19:33] Krinkle: deployment-prep always confuses me a bit wrt tin v. bastion v. salt
[18:19:52] thcipriani: Do you see a feasible path to make that directory git::clone latest?
[18:20:13] https://gerrit.wikimedia.org/r/#/c/206853/2/manifests/role/ci.pp doesn't work since it conflicts with role/deployment which creates the same directories
[18:20:23] * thcipriani looks
[18:20:29] include contint::slave-scripts creates the same directories as git clones
[18:22:33] hmmm, would probably have to do package install with the trebuchet provider?
[18:25:46] thcipriani: For maintenance it's important that it either tracks the deployment tags we create on tin in prod (for gallium and lanthanum, and potentially this host as well), or that it ensures latest master (like we do for other non-trebuchet targets, like the main integration slaves)
[18:27:01] Having to git-pull or deploy for beta specifically will likely be forgotten. And would increase maintenance during CI changes to the number of projects. (we'd need the same for other project-embedded CI slaves)
[18:29:08] so from the sounds of it, we need a jenkins job, but not sure of the appropriate hook to trigger a git deploy
[18:29:30] or at least, that seemingly is what jenkins jobs have been used for
[18:32:27] thcipriani: It being a deployment host (not target) means it's safe to do a git-pull there without something overwriting it.
[18:32:37] So I'll do that for the interim.
[18:40:21] Beta-Cluster, Continuous-Integration-Infrastructure: Ensure /srv/deployment/integration/slave-scripts is latest master on deployment-bastion - https://phabricator.wikimedia.org/T97324#1239246 (Krinkle) NEW
[18:43:51] (CR) Krinkle: [C: 2] "Compiled/Deployed updated 'beta-code-update-eqiad' job." [integration/config] - https://gerrit.wikimedia.org/r/206850 (https://phabricator.wikimedia.org/T97257) (owner: Krinkle)
[18:44:02] (CR) BryanDavis: Make scap localization cache build $TMPDIR aware (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani)
[18:45:48] thcipriani: bd808: Is the directory being 1.2G normal?
[18:45:58] (Merged) jenkins-bot: Add global-setup and global-teardown to beta-code-update [integration/config] - https://gerrit.wikimedia.org/r/206850 (https://phabricator.wikimedia.org/T97257) (owner: Krinkle)
[18:45:59] The tmpdir Jenkins creates is 512M
[18:47:13] So we may need another workaround..
[18:47:27] Krinkle: The prod php-1.26wmf3/cache/l10n is 2.4G
[18:47:33] :/
[18:47:34] Okay
[18:47:39] O_O
[18:47:42] That's a lot of data
[18:47:58] a lot of i18n strings * 2
[18:48:11] once in cdb and once as json blobs
[18:48:51] well i18n strings + origin file + timestamps for every message
[18:51:41] (CR) Thcipriani: Make scap localization cache build $TMPDIR aware (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani)
[19:00:27] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL: CRITICAL: deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<30.00%)
[19:10:52] (PS1) Dduvall: Prototyping on riotjs [integration/raita] - https://gerrit.wikimedia.org/r/206867
[19:10:54] (PS1) Dduvall: Refactor into one-way data flow [integration/raita] - https://gerrit.wikimedia.org/r/206868
[19:10:56] (PS1) Dduvall: Allow filtering features/scenarios by status or tag [integration/raita] - https://gerrit.wikimedia.org/r/206869
[19:10:58] (PS1) Dduvall: Defines some fixtures for testing [integration/raita] - https://gerrit.wikimedia.org/r/206870
[19:11:00] (PS1) Dduvall: Reimplement filtering using ES filters [integration/raita] - https://gerrit.wikimedia.org/r/206871
[19:11:02] (PS1) Dduvall: Omit background steps from scenarios [integration/raita] - https://gerrit.wikimedia.org/r/206872
[19:11:04] (PS1) Dduvall: Simplify feature/element status filter using aggregates [integration/raita] - https://gerrit.wikimedia.org/r/206873
[19:11:06] (PS1) Dduvall: Prototypal refactoring [integration/raita] - https://gerrit.wikimedia.org/r/206874
[19:11:08] (PS1) Dduvall: Renamed application to Raita [integration/raita] - https://gerrit.wikimedia.org/r/206875
[19:11:10] (PS1) Dduvall: Improve API documentation [integration/raita] - https://gerrit.wikimedia.org/r/206876
[19:17:58] (CR) Dduvall: [C: 2]
Prototyping on riotjs [integration/raita] - https://gerrit.wikimedia.org/r/206867 (owner: Dduvall)
[19:18:08] (CR) Dduvall: [C: 2] Refactor into one-way data flow [integration/raita] - https://gerrit.wikimedia.org/r/206868 (owner: Dduvall)
[19:18:16] (CR) Dduvall: [C: 2] Allow filtering features/scenarios by status or tag [integration/raita] - https://gerrit.wikimedia.org/r/206869 (owner: Dduvall)
[19:18:21] (CR) Dduvall: [C: 2] Defines some fixtures for testing [integration/raita] - https://gerrit.wikimedia.org/r/206870 (owner: Dduvall)
[19:18:27] (CR) Dduvall: [C: 2] Reimplement filtering using ES filters [integration/raita] - https://gerrit.wikimedia.org/r/206871 (owner: Dduvall)
[19:18:32] (CR) Dduvall: [C: 2] Omit background steps from scenarios [integration/raita] - https://gerrit.wikimedia.org/r/206872 (owner: Dduvall)
[19:18:37] (CR) Dduvall: [C: 2] Simplify feature/element status filter using aggregates [integration/raita] - https://gerrit.wikimedia.org/r/206873 (owner: Dduvall)
[19:18:43] (CR) Dduvall: [C: 2] Prototypal refactoring [integration/raita] - https://gerrit.wikimedia.org/r/206874 (owner: Dduvall)
[19:18:49] (CR) Dduvall: [C: 2] Renamed application to Raita [integration/raita] - https://gerrit.wikimedia.org/r/206875 (owner: Dduvall)
[19:53:21] !log Jenkins unable to re-create Gearman connection. (HTTP 503 error from /configure). Have to force restart Jenkins
[19:53:24] Logged the message, Master
[20:04:42] is it still coming back up?
[20:15:07] !log Relaunched Gearman connection
[20:15:10] Logged the message, Master
[20:15:47] twentyafterfour: is there a bug for https://gerrit.wikimedia.org/r/#/c/205988/1 ?
[20:16:35] legoktm: I don't think so
[20:17:34] I probably should have filed one but it was interrupting my scap and I was already running late on my deployment window
[20:17:46] sorry about that :/
[20:19:27] hi guys. is beta not updating itself again?
[20:19:46] after https://gerrit.wikimedia.org/r/#/c/206842/ was merged, https://phabricator.wikimedia.org/T97262 should be fixed now, but it isn't
[20:19:56] (CR) BryanDavis: Make scap localization cache build $TMPDIR aware (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani)
[20:20:23] Yippee, build fixed!
[20:20:23] Project beta-update-databases-eqiad build #9222: FIXED in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9222/
[20:22:01] legoktm: no biggie it wasn't exactly obvious - the calling script was using include and checking the return value to see if the include worked - it shouldn't have been built that way, so we fixed it to not care about the return value
[20:22:12] (thanks to bd808 for catching that for me)
[20:22:59] it was wacky stuff
[20:24:11] A good rule of thumb is: if you see code that says `$foo = include bar;` bad things will happen eventually
[20:24:29] yeah it's never good to have a return value that represents two different things (the success of the include, or the return value of the included code)
[20:26:44] Browser-Tests, Continuous-Integration-Infrastructure, Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1239581 (Jdlrobson)
[20:26:45] Browser-Tests, Gather Sprint Forward, Mobile-Web, Mobile-Web-Sprint-45-Snakes-On-A-Plane, Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1239579 (Jdlrobson) Open>Resolved \o/ https://integration.wikimedia.org/ci/view/Mobile/...
[20:30:38] (CR) Dduvall: [V: 2] Prototyping on riotjs [integration/raita] - https://gerrit.wikimedia.org/r/206867 (owner: Dduvall)
[20:31:48] (CR) Dduvall: [V: 2] "Initial import."
[integration/raita] - https://gerrit.wikimedia.org/r/206868 (owner: Dduvall)
[20:32:24] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206869 (owner: Dduvall)
[20:32:57] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206870 (owner: Dduvall)
[20:33:10] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206871 (owner: Dduvall)
[20:33:23] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206872 (owner: Dduvall)
[20:33:35] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206873 (owner: Dduvall)
[20:33:49] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206874 (owner: Dduvall)
[20:34:02] (CR) Dduvall: [V: 2] "Initial import." [integration/raita] - https://gerrit.wikimedia.org/r/206875 (owner: Dduvall)
[20:34:41] Yippee, build fixed!
[20:34:41] Project beta-code-update-eqiad build #53515: FIXED in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/53515/
[20:34:42] * marxarelli is sorry for the noise, should have done this via a force push
[20:38:43] yes :P
[20:39:10] marxarelli: what is raita btw? I only know of it as indian food
[20:39:37] beta labs is *not* updating :(
[20:39:51] legoktm: new dashboard for cucumber test results :)
[20:40:07] fifteen commits behind :(
[20:40:32] MatmaRex: I think Krinkle was fiddling with it earlier?
[20:40:34] should i file a bug? is there monitoring for it? known issue?
[20:40:44] marxarelli: haha, then that's well named :P
[20:41:12] legoktm: thanks!
[20:41:32] MatmaRex, maybe https://phabricator.wikimedia.org/T97257 this?
[20:42:28] grrr...
ownership issues
[20:42:33] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/50687/console
[20:43:10] because of this issue, CI -> beta bridge is currently not available
[20:43:34] MatmaRex: thcipriani|afk opened this one too -- https://phabricator.wikimedia.org/T97314
[20:43:45] looks like things are borked in multiple ways :/
[20:44:26] what colourful messages
[20:48:29] twentyafterfour: are you seeing the beta cluster/scap issues ^
[20:51:11] devunt: The bridge should be find now
[20:51:16] fine
[20:52:35] Beta-Cluster: beta-scap-eqiad fails due to rsync permissions issues - https://phabricator.wikimedia.org/T97314#1239666 (bd808) Not sure what made Puppet freak out and do this, but the mwdeploy user is in both ldap (expected) and the local /etc/passwd on deployment-bastion (unexpected).
[20:53:20] !log removed mwdeploy user from deployment-bastion:/etc/passwd
[20:53:23] Logged the message, Master
[20:55:38] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
[20:55:46] beta-scap-equid is still failing...
[20:55:54] eqiad
[20:55:55] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/50690/console
[20:56:06] it's getting further this time
[20:57:30] just finished with failure
[20:57:38] Failing for a different reason this time -- failed: No space left on device (28)
[20:57:39] 20:56:52 rsync: write failed on "/srv/mediawiki/wikiversions-labs.cdb": No space left on device (28)
[20:58:08] That's on deployment-jobrunner01.eqiad.wmflabs
[21:00:09] !log Root partition full on deployment-jobrunner01
[21:00:13] Logged the message, Master
[21:01:14] what's eating all the disk there...
[21:04:02] Beta-Cluster: beta-scap-eqiad fails due to rsync permissions issues - https://phabricator.wikimedia.org/T97314#1239722 (bd808) Open>Resolved a:bd808 I ran `sudo vipw` and removed the local `mwdeploy` user from deployment-bastion's /etc/passwd. A forced puppet run didn't recreate the user so I'm goin...
[21:06:10] !log deployment-jobrunner01 missing role::labs::lvm::srv
[21:06:12] Logged the message, Master
[21:06:29] failure again
[21:06:43] there's still no space on deployment-jobrunner01
[21:06:56] devunt: I'm working on it
[21:08:03] !log Deleted deployment-jobrunner01:/srv/* in preparation for applying role::labs::lvm::srv
[21:08:05] Logged the message, Master
[21:08:14] that's great. thank you
[21:08:48] !log Applied role::labs::lvm::srv on deployment-jobrunner01 and forced puppet run
[21:08:51] Logged the message, Master
[21:14:54] Yippee, build fixed!
[21:14:55] Project beta-scap-eqiad build #50692: FIXED in 1 min 0 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50692/
[21:19:06] Yippee
[21:19:14] devunt: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/50692/
[21:19:48] yeah I'd seen it
[21:20:27] !log beta-scap-eqiad job green again after adding a /srv/ disk to deployment-jobrunner01
[21:20:29] Logged the message, Master
[21:23:30] so now, all of the pending deployments targeting the beta cluster should be deployed normally?
[21:31:09] (PS2) BryanDavis: Make scap localization cache build $TMPDIR aware [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani)
[21:32:51] bd808, shouldn't they?
[21:33:43] devunt: All the deploy jobs were green last I checked, so yes beta cluster should be up to date
[21:33:58] https://integration.wikimedia.org/ci/view/Beta/
[21:34:21] That one failed job for texvc is a known issue and not causing any problems
[21:34:55] thcipriani: I kinda re-wrote your scap patch
[21:35:10] and I didn't ask first which I should have :/
[21:35:15] (PS1) Dduvall: Config for integration/raita [integration/config] - https://gerrit.wikimedia.org/r/206967
[21:35:37] bd808: I see that, I think moving that to utils is likely a good thing.
[21:35:57] in terms of maintainability
[21:36:02] I mostly wanted contextmanager magic there
[21:36:08] I guess https://gerrit.wikimedia.org/r/#/c/203627/ hadn't deployed
[21:36:15] to make the cleanup more deterministic
[21:36:51] that, too. I still have no idea why cleanup didn't happen over the weekend.
[21:37:51] I couldn't find any entries with name Josa in http://ko.wikipedia.beta.wmflabs.org/wiki/Special:Version?uselang=en
[21:38:06] devunt: It should be deployed... I'll check on the hosts
[21:39:17] thank you
[21:39:24] devunt: hmmm.... it is missing. I'll see if I can figure out why
[21:40:41] devunt: there are 6 config changes missing (which is probably what MatmaRex was pointing out earlier)
[21:41:16] no, i was pointing out mediawiki/core being a few commits behind master
[21:41:23] which is fixed now, by the way, thanks
[21:41:40] yw
[21:44:43] so there's a little problem with deployment system
[21:45:01] !log Manually triggered beta-mediawiki-config-update-eqiad for zuul build df1e789c726ad4aae60d7676e8a4fc8a2f6841fb
[21:45:02] ^ understatement, etc :)
[21:45:03] Logged the message, Master
[21:45:06] * YuviPanda goes away after that hit and run
[21:45:16] * bd808 chases YuviPanda with a trout
[21:45:33] * YuviPanda stops to tie shoe laces, gets beaten with trout
[21:47:12] devunt: I'm not sure what caused the config changes to be missed; I'm going to file a task for hashar to look into that and see if he can figure out what happened
[21:47:48] as soon as this job completes -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/50697/console -- your new extension should be live on kowiki
[21:49:09] bd808: maybe scap failed :/
[21:49:40] hasharAway-: it was worse than that.
The changes didn't get triggered by zuul
[21:50:03] o
[21:50:03] there were 6 config changes that didn't make it to deployment-bastion
[21:50:11] I'm writing it up now
[21:50:32] bd808: make sure to give at least one change #
[21:50:33] hasharAway-: bd808 I think Krinkle restarted jenkins this morning due to the gearman plugin locking up
[21:50:44] then I / someone can dig in /var/log/zuul somewhere
[21:51:09] the state should be kept in Zuul, but yeah maybe restarting Jenkins removed the changes
[21:51:34] potentially, a merge in operations/mediawiki-config should redeploy everything
[21:51:49] (ie the job checkout the ref that has been merged and scap the resulting checkout)
[21:51:52] the 6 changes were queued in Zuul because deployment-bastion was locked. After that was resolved we also took it down manually temporarily while resolving https://phabricator.wikimedia.org/T97257
[21:52:12] ahh bastion locked :/
[21:52:17] stupid jenkins
[21:52:32] I tried restarting that slave but it wouldn't come back in gearman
[21:52:46] and restarting gearman didn't work because of the Configure 503 error
[21:52:49] cool. mystery solved
[21:52:51] hence full jenkins restart
[21:52:56] yeah that is a deadlock somewhere in Jenkins scheduler. Apparently the only way to solve it is to restart Jenkins
[21:53:08] should I finish the ticket or just hit cancel?
[21:53:20] the configure 503 is usually because it is deadlocked waiting for the IRC plugin to disconnect
[21:53:40] such a pity :/
[21:53:55] bd808: can probably cancel it
[21:54:02] I am off
[21:54:05] The list of known/recurring issues in CI (see https://wikitech.wikimedia.org/wiki/Release_Engineering/Argh) is getting larger. We should really see to it that some of those get resolved. It happened more than once that we had hours of outage going from one issue into the next, all known issues.
[21:54:08] o/
[21:54:13] been building disk images all day long :]
[21:55:39] yeah I'd seen that six jobs were in the queue about one or two hours ago
[21:55:46] postmerge queue I think
[21:55:49] Yep
[22:01:22] devunt: \o/ -- http://ko.wikipedia.beta.wmflabs.org/w/index.php?title=%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84&uselang=en -- has Josa finally
[22:01:25] bd808, scap task finished successfully and Josa is live in kowiki normally
[22:01:33] :)
[22:01:38] hurray
[22:01:56] such a long trip :p
[22:02:38] heh. Some days the computers revolt. Those of us in the resistance have to dive in and get them to do their jobs again
[22:04:01] always automation is the root of evil
[22:11:36] greg-g: no I was away making a late lunch
[22:12:09] twentyafterfour: all better now
[22:12:44] but you could review the patch for scap that thcipriani and I made -- https://gerrit.wikimedia.org/r/#/c/206856/
[22:20:18] (CR) 20after4: [C: 1] Make scap localization cache build $TMPDIR aware [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani)
[22:34:30] Browser-Tests, Documentation: Document how to make browser tests running after every commit and voting - https://phabricator.wikimedia.org/T94023#1240031 (bd808)
[23:29:34] PROBLEM - Puppet failure on integration-raita is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:31:20] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:31:24] PROBLEM - Puppet staleness on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[23:31:44] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:35:00] twentyafterfour: so it should be safe now to have the extension entry points return null now? or should they still be returning true?
[23:39:41] PROBLEM - Host integration-raita is DOWN: CRITICAL - Host Unreachable (10.68.16.53)
[23:46:11] RECOVERY - Host integration-raita is UP: PING OK - Packet loss = 0%, RTA = 5.64 ms
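
Editor's note on the /tmp leak discussed at 17:36-17:41 and 21:36: the log talks about making the localization cache build honour $TMPDIR and about using "contextmanager magic" to make cleanup more deterministic. The following is only a minimal sketch of that idea, not the code from the scap patch. The helper name `scratch_dir` is hypothetical, and the `scap_l10n_` prefix is borrowed from the directory names in the `du` output purely for illustration.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def scratch_dir(prefix="scap_l10n_"):
    """Yield a fresh temporary directory and remove it on exit.

    Hypothetical helper for illustration; not part of scap.
    """
    # Use $TMPDIR when the job has set one; with dir=None, mkdtemp
    # falls back to the platform default instead of a hardcoded /tmp.
    path = tempfile.mkdtemp(prefix=prefix, dir=os.environ.get("TMPDIR"))
    try:
        yield path
    finally:
        # Runs on normal exit and when an exception propagates out of the
        # with-block, so a failed build does not leave the directory behind.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with scratch_dir() as tmp:
        # Stand-in for generating cache files before moving them into place.
        with open(os.path.join(tmp, "example.cdb"), "w") as handle:
            handle.write("placeholder")
        print("working in", tmp)
    print("still exists afterwards?", os.path.exists(tmp))
```

A `finally` block does not run if the process is killed outright (for example by SIGKILL, or by SIGTERM with Python's default handling), so a job-level teardown step such as the global-teardown mentioned in the log would still be a useful backstop for aborted builds.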