[00:38:08] (03CR) 10Legoktm: [C: 04-1] "https://github.com/squizlabs/PHP_CodeSniffer/commit/6172062f09eccea2c851cbc11daad137ebe52f98 is apparently a 3.2.0 regression so I'm think" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/398050 (owner: 10Ricordisamoa)
[03:29:41] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<44.44%)
[04:54:08] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[05:14:07] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[07:14:37] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:23:15] 10Release-Engineering-Team, 10Maps-Sprint, 10Maps (Tilerator): Setup diffusion and github sync for kartotherian and tilerator package repositories - https://phabricator.wikimedia.org/T182848#3836615 (10Gehel)
[08:02:36] hi hashar :) not sure if you've seen https://phabricator.wikimedia.org/T182750 yet, I'd appreciate your input
[08:03:03] legoktm: yup I have seen it :]
[08:03:22] legoktm: but I don't have good answers as to how we can store the backlogs
[08:03:42] right now I'm just storing them on toolforge
[08:03:45] potentially we could make the Jenkins job keep the build info for a longer period of time, but there is not much guarantee it is going to persist
[08:04:18] I filed a separate task (https://phabricator.wikimedia.org/T182751) about moving the storage to the CI infra properly
[08:04:27] is the idea to keep all the clover.xml files?
[08:04:45] one way or the other, that surely can be stored on the CI master (contint1001) at some point
[08:04:50] eg whenever you get something that seems to work
[08:04:59] ah yeah
[08:05:42] then the mediawiki code coverage job could potentially publish / save the clover.xml to some persistent place
[08:05:51] right now I'm storing the clover.xml files with all of the per-file information stripped out of them so they're very small, but we could easily condense those into one file per month or a sqlite database or whatever
[08:08:11] legoktm: I remember you wrote a clover.xml minification tool. Then contint1001 has a lot of disk space so we can store them as long as needed
[08:08:34] and yeah potentially the clover.xml could be processed and saved to some database
[08:08:46] but flat files are probably just fine
[08:08:51] one sure thing
[08:08:55] yeah, that's what I'm using to shrink the size :)
[08:09:04] I like the idea of offering a long-term trend of coverage improvement
[08:09:46] yeah hmm /srv has 570GB free :]
[08:10:24] https://gitlab.com/legoktm/tool-coverage/blob/master/fetch.py my download script is very stupid on purpose :)
[08:10:37] s/stupid/kept very simple/
[08:10:41] ;]
[08:12:35] legoktm: an alternative is to have a cronjob on contint1001 that saves/snapshots the files such as /srv/jenkins/builds/mediawiki-core-code-coverage/*/cloverphp/clover.xml
[08:12:40] and keeps piling them up
[08:13:35] (which would be 3.5G per year without minification/gzip)
[08:14:14] * hashar coffee
[08:14:18] probably, I don't really have a strong opinion on how to implement it right now, my current goal is to get the backfill thing to work :p
[08:18:41] legoktm: a few months would probably be enough :]
[08:18:52] also we have a task to run the coverage job with php7 / phpdbg
[08:18:57] that should make the generation faster
[08:19:36] on my laptop running coverage in a docker container (php7+xdebug+stretch) took 15-18 minutes
[08:20:07] but I was getting values that differed by 1% from what jenkins was running
[08:20:16] and I didn't really debug further on where it was coming from
[08:21:15] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Release Pipeline, and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836675 (10hashar) \o/
[08:30:14] legoktm: also most of the tox and npm jobs are on docker nowadays. I will look at migrating some of the composer ones :]
[08:39:55] (03PS1) 10Hashar: dib: contint::worker_localhost is now a profile [integration/config] - 10https://gerrit.wikimedia.org/r/398228
[08:41:29] (03CR) 10jerkins-bot: [V: 04-1] dib: contint::worker_localhost is now a profile [integration/config] - 10https://gerrit.wikimedia.org/r/398228 (owner: 10Hashar)
[08:51:56] 10Continuous-Integration-Config, 10TestMe: fix or mark as inactive extensions currently failing CI - https://phabricator.wikimedia.org/T134090#3836743 (10Tpt)
[09:00:12] (03PS1) 10Hashar: Bump puppet to 4.8.2 to match production [integration/config] - 10https://gerrit.wikimedia.org/r/398229
[09:05:35] (03PS2) 10Hashar: dib: contint::worker_localhost is now a profile [integration/config] - 10https://gerrit.wikimedia.org/r/398228
[09:06:58] (03CR) 10jerkins-bot: [V: 04-1] dib: contint::worker_localhost is now a profile [integration/config] - 10https://gerrit.wikimedia.org/r/398228 (owner: 10Hashar)
[09:14:24] (03PS3) 10Hashar: dib: contint::worker_localhost is now a profile [integration/config] - 10https://gerrit.wikimedia.org/r/398228
[09:16:55] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Release Pipeline, and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836778 (10akosiaris) A more interesting question would...
[09:17:17] (03CR) 10Hashar: "Was failing because rspec-puppet has no hiera configuration." [integration/config] - 10https://gerrit.wikimedia.org/r/398228 (owner: 10Hashar)
[09:17:23] (03CR) 10Hashar: [C: 032] Bump puppet to 4.8.2 to match production [integration/config] - 10https://gerrit.wikimedia.org/r/398229 (owner: 10Hashar)
[09:18:33] (03CR) 10jerkins-bot: [V: 04-1] Bump puppet to 4.8.2 to match production [integration/config] - 10https://gerrit.wikimedia.org/r/398229 (owner: 10Hashar)
[09:19:25] (03CR) 10Hashar: [C: 032] Bump puppet to 4.8.2 to match production [integration/config] - 10https://gerrit.wikimedia.org/r/398229 (owner: 10Hashar)
[09:20:32] (03Merged) 10jenkins-bot: Bump puppet to 4.8.2 to match production [integration/config] - 10https://gerrit.wikimedia.org/r/398229 (owner: 10Hashar)
[09:29:25] (03CR) 10Ricordisamoa: "> https://github.com/squizlabs/PHP_CodeSniffer/commit/6172062f09eccea2c851cbc11daad137ebe52f98" [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/398050 (owner: 10Ricordisamoa)
[10:00:50] 10Continuous-Integration-Config, 10Proton, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking): Set up Jenkins for chromium-render and chromium-render-deploy repositories - https://phabricator.wikimedia.org/T179552#3836810 (10phuedx) @hashar: How would we go about setting that environment variable in the...
[10:05:11] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[11:15:52] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836942 (10akosiaris) >>! In T181661#3834939, @awight wrote: > Looks like I'm getting the same error. > >> commit b67bba77acb7c0ffc678201c9f3f...
[11:23:43] 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3836956 (10hashar)
[11:23:59] (03CR) 10Hashar: "Filed as T182860" [integration/config] - 10https://gerrit.wikimedia.org/r/398086 (owner: 10Thcipriani)
[11:24:21] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3836966 (10hashar)
[11:26:49] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3836970 (10hashar) twist: In sudoers: %contint-admins ALL = (:docker) NOPASSWD: /usr/bin/docker* Then: sudo...
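The sudoers rule hashar sketches in the T182860 comment above (truncated in this log) would let members of contint-admins run docker as the `docker` group without a password, rather than as root. A minimal sketch of how such a rule is typically laid out and exercised follows; the file name and the `sudo -g docker ...` usage are assumptions here, since the original comment is cut off:

```
# Assumed location, e.g. /etc/sudoers.d/contint-admins-docker
# (the rule itself is quoted from the task comment above):
#
#   %contint-admins ALL = (:docker) NOPASSWD: /usr/bin/docker*
#
# The "(:docker)" runas-group form grants the docker group, not root, so a
# member of contint-admins could then run something like:
sudo -g docker docker ps
sudo -g docker docker images
```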
[11:28:36] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836975 (10awight) Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error code 70: {P6463}
[11:44:50] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Ops-Access-Requests: Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3837027 (10hashar)
[11:45:43] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/#/c/398240/ would let one do:" [integration/config] - 10https://gerrit.wikimedia.org/r/398086 (owner: 10Thcipriani)
[12:00:06] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837059 (10mmodell) `scap deploy-log -v` reveals more: ``` 11:27:10 [ores1001.eqiad.wmnet] Unhandled error: Traceback (most recent call last):...
[12:04:26] twentyafterfour: hi! ^ Do you think that situation would be improved by rm’ing deploy-cache?
[12:04:58] oh wat. Those revisions aren’t in gerrit.
[12:05:13] will comment on the task
[12:05:37] awight: maybe, but I'm trying to get to the bottom of this rather than just work around it. if it's seriously holding up your work then I'll expedite a workaround, still I'm debugging it now and close to hopefully getting to the root cause
[12:06:09] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837081 (10awight) Those revisions aren't in gerrit. I think the github -> gerrit mirroring broke when we were messing around with pointing to...
[12:06:24] (I realized I can ssh as deploy-service and inspect the target's git state... I'm not sure if I'm supposed to be able to do that but it's quite helpful for debugging this issue)
[12:06:40] twentyafterfour: I think it’s a much different issue than the timeout, sorry to blindly mix the topics!
[12:06:54] awight: yeah, it's not a timeout this time
[12:07:00] lol please do break the rules :). These aren’t technically production machines yet, fwiw.
[12:07:12] They’ll be reimaged and stuff before we use them.
[12:07:26] awight: that's what I thought so I'll take advantage of that to learn some things about the context of the failure
[12:09:44] twentyafterfour: https://phabricator.wikimedia.org/source/editquality/manage/uris/
[12:10:06] twentyafterfour: https://github.com/wiki-ai/editquality/commit/15d5283b7422919d85203b5ba907027f9356e421
[12:10:58] Here it says the last “update” was earlier today, https://phabricator.wikimedia.org/source/editquality/manage/basics/
[12:13:06] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[12:18:38] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837099 (10awight) The Phabricator control panels look happy, https://phabricator.wikimedia.org/source/editquality/manage/uris/ shows that we'r...
[12:19:00] twentyafterfour: ^ a clue! The revision is in Phabricator but not in gerrit. Unless I’m just looking at gerrit wrong? Or it’s not even supposed to be there?
[12:19:31] It’s not supposed to be in gerrit. Sorry.
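Most of the debugging that follows boils down to "where does that commit actually exist?". A generic way to answer that on the deploy host or on a target is sketched below; the checkout path is illustrative rather than quoted from the log, and the sha is the editquality commit awight links above:

```
# Illustrative path; adapt to the actual staging/target checkout.
cd /srv/deployment/ores/deploy/submodules/editquality

# Is the object present in the local repository at all?
git cat-file -e 15d5283b7422919d85203b5ba907027f9356e421 && echo "object present"

# Which local branches (refs/heads/*) contain it? An empty answer while the
# object exists is the "only reachable from HEAD / remote-tracking refs" case
# diagnosed later in the conversation.
git branch --contains 15d5283b7422919d85203b5ba907027f9356e421
git branch -r --contains 15d5283b7422919d85203b5ba907027f9356e421
```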
[12:19:44] So maybe something broke with the clone that we’re --reference-ing?
[12:24:37] 10Scap, 10ORES, 10Scoring-platform-team: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837105 (10awight)
[12:24:53] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797363 (10awight) Oops—we aren't expecting this repo to be mirrored to gerrit. So the surprise is that the revision exists in Phabricator but...
[12:31:09] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837143 (10mmodell) @awight: yeah, I'm getting to the bottom of it now. The issue is that the commit does not exist on a local branch on tin, i...
[12:31:39] awight: the issue is that the commit only exists at HEAD and not on a local branch on tin
[12:32:07] I think what needs to happen is that you check out the branch in the submodule on tin so that there is a local ref that can be fetched
[12:32:23] ores is making the most complex use of git submodules that we've seen I think
[12:32:34] and it's stretching what scap knows how to handle
[12:32:54] twentyafterfour: I did notice that, * I found the code in an updated state this morning and went ahead with deployment, * after deploying, I noticed that the two submodules weren’t updated, and did a git submodule update
[12:33:05] Which totally matches the symptoms
[12:33:20] However, then I deploy -f’d after doing the submodule update
[12:33:44] the deal is that those submodule pointers are pointing to a branch that isn't checked out on tin
[12:33:44] I think I hit an edge case where the submodule repos won’t be pushed correctly once I’ve done it wrong the first time?
[12:33:58] submodules always suck, honestly
[12:33:58] But it is checked out now...
[12:33:59] I hate them
[12:34:05] +1
[12:34:13] is it checked out as a detached head? because that won't work
[12:34:24] I mean they need to exist in a regular branch on tin
[12:34:36] oho
[12:34:36] not only in remotes/origin/* or HEAD
[12:34:52] I can probably fix scap's refspec to also fetch the remotes
[12:35:07] I’m pretty sure that detached head is the default for submodules tho
[12:35:11] but that'll take an update to the scap package which will take time to build and deploy
[12:35:18] So it’s really surprising that this didn’t hit us until today
[12:35:39] awight: yeah it is ok if the submodule is in a detached head but the commit that it's pointing to needs to also exist on a branch
[12:36:11] Awesome debugging, that gives me a simple workaround at least, of going into the submodules, checking out master, then back to the root and updating to the correct snapshot.
[12:36:25] right, that _should_ work
[12:36:39] I haven't tried it yet but I poked all of the pieces and that seems to be the issue
[12:36:49] wanna try that and see if it helps?
[12:36:51] I made a new task btw, where we can work on the long-term fix, T182865
[12:36:51] T182865: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865
[12:36:56] yep lemme try that
[12:37:29] there are so many layers to this onion it's kinda hard to peel ;)
[12:37:42] Sorry if my ping woke you up btw—aren’t you in UTC-6?
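The workaround awight and twentyafterfour settle on above (enter each submodule on tin, check out master so the pinned commit is reachable from a local branch, then re-sync the submodule pointers from the root) would look roughly like the following. This is a sketch: the `/srv/deployment/ores/deploy` path is assumed, `editquality` is one of the submodules named in the conversation, and the `git pull` step assumes master needs to be fast-forwarded to include the pinned commit.

```
# On the deploy host (tin); paths are assumed for illustration.
cd /srv/deployment/ores/deploy

# Put the submodule commit on a local branch so targets can fetch it via
# refs/heads/* instead of only HEAD / remote-tracking refs.
cd submodules/editquality
git checkout master
git pull --ff-only        # assumption: bring master up to the pinned commit

# Back at the root, restore the exact snapshot the superproject pins
# (the submodule returns to a detached HEAD, which is fine now that the
# commit is also reachable from a branch).
cd ../..
git submodule update --init --recursive

# then re-run the deployment, e.g. scap deploy -f "..."
```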
[12:38:06] it's ok, my sleep schedule is about 120 degrees off from normal
[12:38:19] lol
[12:38:30] So, in submodules/editquality: Your branch is behind 'origin/master' by 225 commits
[12:38:37] That makes me think this can’t really be the bug
[12:38:39] * twentyafterfour plans to sleep a day away sometime soon to get back on normal daytime hours
[12:38:49] cos we would have noticed earlier.
[12:38:57] I’ll try the workaround nonetheless
[12:39:02] hmm
[12:40:05] :D works like a charm.
[12:40:10] (WTF)
[12:40:33] this may be a change in behavior in scap due to the --reference magic
[12:41:11] I think that the old way that we cloned submodules might have ended up with a different refspec (treating the repo as bare vs non-bare is the only thing I can think of)
[12:41:32] When was this change deployed to production scap?
[12:42:26] \o/ parallel deployment to ores* just completed perfectly, and in 58 seconds
[12:42:41] That’s gonads to the wall.
[12:43:13] awight: Monday the 11th is when it got published
[12:43:23] That’s gotta be it, then.
[12:45:22] yeah so the revs/b67bba77.../submodules/* repos have fetch = +refs/heads/*:refs/remotes/origin/*
[12:45:33] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837171 (10awight) 05Open>03Resolved a:03awight Using a workaround for T182865, where we go into submodules and checkout master before su...
[12:46:26] twentyafterfour: That’s a bit over my head, but your lay-person explanation did work for me.
[12:46:34] and http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality/refs/heads/* doesn't have a lot of useful refs
[12:47:07] essentially tin has all of the refs in refs/remotes/origin/* and the final submodule is looking for refs/heads/*
[12:48:06] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:15:19] 10Scap, 10ORES, 10Scoring-platform-team: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837105 (10mmodell) Ok I'm going to try to summarize what we learned by quite a lot of manual poking at the `ores1001` target to form a hypothesis and...
[13:16:16] 10Scap, 10ORES, 10Scoring-platform-team: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837277 (10mmodell) So I'm going to figure out what needs to change to make git use the right refspec in the submodule update in deploy-local. Probabl...
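twentyafterfour's diagnosis above amounts to a refspec mismatch: the per-revision submodule clones fetch with `+refs/heads/*:refs/remotes/origin/*`, while the checkout on tin that they fetch from only exposes the wanted commit under refs/remotes/origin/* and HEAD, so nothing matches. A rough way to inspect that on tin is sketched below; the path is an assumption reconstructed from the URL abbreviated in the log:

```
# The submodule's git dir inside the superproject on tin (assumed path).
cd /srv/deployment/ores/deploy/.git/modules/submodules/editquality

# The configured refspec sends everything into remote-tracking refs...
git config --get remote.origin.fetch
#   +refs/heads/*:refs/remotes/origin/*

# ...so refs/heads/* stays nearly empty, which is exactly what a downstream
# fetch of refs/heads/* would be asking for:
git for-each-ref refs/heads refs/remotes/origin
```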
[13:16:44] 10Release-Engineering-Team (Kanban), 10Scap, 10ORES, 10Scoring-platform-team: Source revision is in Phabricator, but can't be found by deployment tools - https://phabricator.wikimedia.org/T182865#3837297 (10mmodell) p:05Triage>03High a:03mmodell
[13:23:11] 10Release-Engineering-Team (Kanban), 10MediaWiki-Debian, 10Patch-For-Review, 10User-zeljkofilipin: Upgrade RuboCop and Rubyzip in mediawiki/debian - https://phabricator.wikimedia.org/T182401#3837330 (10zeljkofilipin) 05Open>03Resolved
[13:33:30] PROBLEM - Puppet errors on deployment-sca02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:34:06] PROBLEM - Puppet errors on deployment-cache-text04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[13:34:25] PROBLEM - Puppet errors on deployment-redis05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:34:41] PROBLEM - Puppet errors on integration-slave-k8s-1002 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:34:49] PROBLEM - Puppet errors on deployment-parsoid09 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[13:35:10] PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:35:36] PROBLEM - Puppet errors on integration-slave-docker-1005 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:35:42] PROBLEM - Puppet errors on deployment-videoscaler01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[13:35:54] PROBLEM - Puppet errors on integration-slave-docker-1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[13:36:00] PROBLEM - Puppet errors on saucelabs-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:36:02] PROBLEM - Puppet errors on deployment-changeprop is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:36:56] PROBLEM - Puppet errors on deployment-cassandra3-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[13:36:56] PROBLEM - Puppet errors on deployment-cumin is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:37:40] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:38:00] PROBLEM - Puppet errors on deployment-kafka05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:38:33] PROBLEM - Puppet errors on deployment-cpjobqueue is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[13:39:04] PROBLEM - Puppet errors on deployment-sca01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:39:42] PROBLEM - Puppet errors on deployment-etcd-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:39:54] PROBLEM - Puppet errors on saucelabs-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:42:17] PROBLEM - Puppet errors on integration-slave-jessie-android is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[13:42:37] PROBLEM - Puppet errors on deployment-elastic05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:42:45] PROBLEM - Puppet errors on deployment-memc06 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[13:43:44] PROBLEM - Puppet errors on deployment-imagescaler02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[13:43:58] PROBLEM - Puppet errors on deployment-db04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:10:53] RECOVERY - Puppet errors on integration-slave-docker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:12:19] RECOVERY - Free space - all mounts on integration-slave-jessie-1003 is OK: OK: All targets OK
[14:13:30] RECOVERY - Puppet errors on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:13:32] RECOVERY - Puppet errors on deployment-cpjobqueue is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:05] RECOVERY - Puppet errors on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:24] RECOVERY - Puppet errors on deployment-redis05 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:40] RECOVERY - Puppet errors on integration-slave-k8s-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:42] RECOVERY - Puppet errors on deployment-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:14:48] RECOVERY - Puppet errors on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:15:12] RECOVERY - Puppet errors on deployment-mediawiki07 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:15:35] RECOVERY - Puppet errors on integration-slave-docker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:15:43] RECOVERY - Puppet errors on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:15:52] thcipriani|afk: hashar greg-g just a heads up for deployment-prep or contint things: we have recently been revising how unattended upgrades work and the default is now to include wmf and distro packages. This can be disabled like so per project atm https://gerrit.wikimedia.org/r/#/c/398259/3/hieradata/labs/project-proxy/common.yaml
[14:15:59] RECOVERY - Puppet errors on saucelabs-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:16:03] RECOVERY - Puppet errors on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[14:16:07] I don't know if it's a problem for releng but I wanted to give a shoutout
[14:16:54] RECOVERY - Puppet errors on deployment-cumin is OK: OK: Less than 1.00% above the threshold [0.0]
[14:16:54] RECOVERY - Puppet errors on deployment-cassandra3-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:17:36] RECOVERY - Puppet errors on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:17:37] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:17:48] RECOVERY - Puppet errors on deployment-memc06 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:18:01] chasemp: I think I reviewed the patch Arturo made a while back. At least for integration we already blindly updated from wikimedia/updates :]
[14:18:02] RECOVERY - Puppet errors on deployment-kafka05 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:18:42] hashar: we have extended / revised a bit, but if that's the case then all is well, that's still the default
[14:18:46] RECOVERY - Puppet errors on deployment-imagescaler02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:18:55] chasemp: good :]
[14:18:59] RECOVERY - Puppet errors on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:19:07] RECOVERY - Puppet errors on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:19:53] RECOVERY - Puppet errors on saucelabs-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:22:18] RECOVERY - Puppet errors on integration-slave-jessie-android is OK: OK: Less than 1.00% above the threshold [0.0]
[14:33:27] (03PS1) 10Hashar: docker: convert zuul-cloner to docker-pkg [integration/config] - 10https://gerrit.wikimedia.org/r/398263
[14:58:17] (03CR) 10Hashar: "My first attempt ever :] The image got created by Addshore originally, it is a way for someone to easily run the same zuul-cloner version" [integration/config] - 10https://gerrit.wikimedia.org/r/398263 (owner: 10Hashar)
[15:13:47] wheee hashar :D
[15:13:59] I think I can start reviewing these new docker things now and possibly merging them :O
[15:14:53] maybe
[15:15:02] PROBLEM - Puppet errors on deployment-redis06 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:15:41] addshore: I am not quite sure how it works really :)
[15:15:59] addshore: for deployment, I guess we will have them built and published on contint1001
[15:22:26] hashar: well, there is already a fab command for that?
[15:23:33] "There is a new Fab task called deploy_docker. This task will build and publish all docker-pkg images in integration/config on contint1001"
[15:24:54] it pulls integration-config on contint1001, runs docker_pkg as jenkins-slave
[15:27:45] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:33:43] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:18:00] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[16:20:13] 10Continuous-Integration-Config, 10Pywikibot-core: Pywikibot's Post-merge Jenkins builds are failing - https://phabricator.wikimedia.org/T182895#3837872 (10Dalba)
[16:22:13] 10Continuous-Integration-Config, 10Pywikibot-core: Pywikibot's Post-merge Jenkins builds are failing - https://phabricator.wikimedia.org/T182895#3837882 (10Dalba)
[16:52:10] 10Gerrit, 10Developer-Relations, 10Developer-Wishlist (2017): Add a welcome bot to Gerrit for first time contributors - https://phabricator.wikimedia.org/T73357#3837944 (10Paladox) That tool will not work with newer releases of gerrit as the db is going away and is deprecated from version 2.14+. And also thi...
[16:52:58] RECOVERY - Puppet errors on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:59:36] no_justification: hi, I wonder, should we install https://gerrit.googlesource.com/plugins/high-availability/+/master ? It would allow us to use gerrit2001 as a backup.
[17:13:52] "For this to work, http must be enabled in both instances, the plugin must be configured with valid credentials and a shared directory must be accessible from both instances. For further information, refer to config documentation."
[17:13:57] So what, NFS?
[17:14:01] Yeah no
[17:15:11] Otherwise, why not just share the caches over NFS to begin with? Hehehe
[17:16:38] heh
[17:20:44] https://gerrit.wikimedia.org/r/#/q/I39c60b2d059d1cb2c1c0d3a4206232d961536697
[17:20:46] ffs gerrit
[17:35:37] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3836956 (10RobH) As this increases sudo permissions for a...
[18:04:05] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:28:58] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[18:39:21] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:27:00] 10Scap, 10ORES, 10Scoring-platform-team: scap deploy --service-restart doesn't affect ORES celery - https://phabricator.wikimedia.org/T182912#3838509 (10awight)
[20:34:30] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:42:21] 10Release-Engineering-Team, 10Scap, 10ORES, 10Operations, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3838537 (10thcipriani)
[20:42:24] 10Scap, 10Operations, 10Packaging: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3838535 (10thcipriani) 05Resolved>03Open I'm unclear what happened here, but we're missing an important configuration flag `scap3_mediawiki` from the current release on tin: https:/...
[20:47:48] 10Scap, 10Operations, 10Packaging: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3838567 (10thcipriani) I don't see the `debian/3.7.4-1` tag in the repo, but I do see `3.7.4` tagged. The configuration flag is present at that tag, could an opsen rebuild and upload a `...
[21:24:08] 10Scap, 10ORES, 10Operations, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3838636 (10awight)
[21:25:26] 10Release-Engineering-Team (Kanban), 10ORES, 10Operations, 10Scoring-platform-team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3838638 (10awight) 05Open>03Resolved I haven't seen this issue in a few weeks, closing. Thank you!
[22:14:45] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is OK: OK: Less than 100.00% above the threshold [0.0]
[22:17:14] When someone gets a chance, can you make https://gerrit.wikimedia.org/r/#/admin/groups/1332,info owned by itself so we can edit the group w/o bugging gerrit admins? Thank you!
[22:20:15] PROBLEM - Puppet errors on deployment-sentry01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[22:20:48] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3838762 (10hashar) @RobH thanks! I was at first hesitant t...
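On the scap packaging thread (T182347) above, the open question is whether a `debian/3.7.4-1` tag exists next to the upstream `3.7.4` tag. A quick local check is sketched below; the clone URL is the usual Gerrit location for the scap repository and is an assumption here rather than something stated in the log:

```
# Assumed clone URL; adjust if the repository lives elsewhere.
git clone https://gerrit.wikimedia.org/r/mediawiki/tools/scap
cd scap

# Upstream release tags vs. Debian packaging tags:
git tag -l '3.7.*'
git tag -l 'debian/*'

# An empty second list would match thcipriani's observation that 3.7.4 is
# tagged but debian/3.7.4-1 is not.
```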
[22:35:43] PROBLEM - Puppet errors on deployment-etcd-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:30:57] PROBLEM - Puppet errors on deployment-kafka-jumbo-2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:36:08] PROBLEM - Puppet errors on deployment-netbox is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[23:37:25] PROBLEM - Puppet errors on deployment-kafka-jumbo-1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]