[00:00:58] By all means
[00:19:25] Krinkle: I'm also killing a few lagging test queue jobs that were superseded by gate-and-submit.
[00:19:32] (since we're +2ing on sight)
[00:19:39] indeed
[00:19:48] I just noticed :)
[00:20:18] You can push for review with CR attached
[00:20:18] I've disabled mwcore-jsduck and -doxygen for the moment (instant auto-skip using a prepended shell script)
[00:20:22] So I did that :)
[00:20:40] aha, nice
[00:20:53] `git push origin HEAD:refs/for/%l=Code-Review+2`
[00:21:17] Which zuul correctly (thankfully) throws straight into gate-and-submit
[00:21:38] (it meant the changes all landed in master pretty quick, which was the first branch I did)
[00:29:38] RainbowSprinkles: REL1_27 qunit failing again (wikiBase)
[00:29:49] Yeah I saw
[00:30:35] also phpcs :/
[00:30:36] And a minor phpcs failure
[00:30:39] I'll tidy that up
[00:30:46] I might wait a few minutes to let the test queue unclog further
[00:30:50] So I can have more nodes
[00:30:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[00:31:41] Yeah.
[00:31:52] A few of them got stuck btw, not sure how or why but it's SNAFU with Zuul
[00:32:10] post-merge has a few >1h old that are queued but jenkins never got it
[00:32:15] and it'll never go away until the next zuul restart
[00:32:26] as long as they don't occupy ghost nodes, it's all fine though
[00:34:15] I was wondering if we have any ghosts that haven't been reclaimed
[00:34:19] But dunno how to check tbh
[00:38:21] yeah, me neither. afaik the stuckness is just Zuul seeing it still in its internal state array. It doesn't relate to nodes or anything; that happens via Gearman/Jenkins only when it is allocated, and from there it eventually starts to run without issue afaik, and then is reclaimed once aborted/completed.
[00:38:27] So this just means it never got a node in the first place
[00:38:43] Or it got one, was aborted, reclaimed, but Zuul still thinks it's pending in Jenkins
[00:47:09] A bunch of queued post-merge ones disappeared
[00:47:11] Something caught up
[00:49:24] And a gazillion fundraising patches just appeared :)
[00:49:35] Oh that chain has been there
[00:49:36] Pushed way down
[00:49:50] And someone playing with MobileFrontend ;-)
[00:50:21] Sadly, it's not really possible to stop people from committing short of taking gerrit offline :p
[00:50:28] Which might upset someone!
[01:19:34] Krinkle: And this time qunit works, figures
[01:24:16] Oh crap I forgot to fix the phpcs issue
[01:24:17] dammit
[01:29:18] YAY THE QUEUE IS ALL GONE
[04:57:54] Browser-Tests-Infrastructure, Documentation, Easy, Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#1152205 (Rammanojpotla) can anyone give me more information about this issue?
[05:05:15] (PS1) KartikMistry: Add apertium-spa-cat package [integration/config] - https://gerrit.wikimedia.org/r/346941
[07:50:10] Scap, HHVM: [scap] Compile HHVM bytecode cache as deployment step - https://phabricator.wikimedia.org/T66272#685932 (MoritzMuehlenhoff) I'm not convinced that this would be an overall win. Last week we had problems with a depleted byte code cache. I checked monitoring after pruning the cache for effects o...
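A quick aside on the push command mentioned at 00:20:53: Gerrit lets you attach a review label as a push option on the refs/for/ refspec, which is why the change lands in the queue already carrying a Code-Review+2. The branch name was omitted in the log, so the sketch below assumes `master` and assumes your account has +2 rights on the repository:

```sh
# Minimal sketch: push the current HEAD for review on master and attach a
# Code-Review+2 vote in the same push ("%l=" is Gerrit's short form of "%label=").
git push origin HEAD:refs/for/master%l=Code-Review+2
```

As noted at 00:21:17, Zuul then sees a change that already has +2 and enqueues it straight into the gate-and-submit pipeline instead of the plain test pipeline.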
[08:33:17] Scap (Scap3-Adoption-Phase1), Operations, RESTBase, Patch-For-Review, and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3162764 (akosiaris) Agreed.
[10:48:44] Browser-Tests-Infrastructure, Documentation, Easy, Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3163027 (zeljkofilipin) p:Normal>Low
[11:08:05] PROBLEM - Puppet run on integration-c1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[11:15:26] Browser-Tests-Infrastructure, Documentation, Easy, Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3163114 (zeljkofilipin) Looks like this is the context: T312#1152213 @Mattflaschen-WMF: the license should be similar to license in the fo...
[11:20:27] Release-Engineering-Team, Operations, Goal, Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3163120 (MoritzMuehlenhoff) p:Triage>High
[11:46:17] Browser-Tests-Infrastructure, Documentation, Easy, Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3163542 (hashar) Just have the README.md link to the LICENSE files that are in the repo. That will then show up on the doc page. Eventu...
[12:28:09] elukey: good afternoon.
[12:28:21] I noticed in logstash a bunch of redis connections are closed due to "bad file descriptor"
[12:28:49] no clue if that could be related. And I don't know why opening a tcp socket would lead to a "bad file" :D
[12:29:31] hashar: o/
[12:29:53] do you have an example of a filter that I can use? I tried a config on mw1306 today but now it is reverted
[12:30:20] (a socket is a fd so maybe it is complaining about a socket that is not there anymore?)
[12:30:35] (s/is/has/)
[12:30:41] ahhhhhh
[12:30:51] so that explains the "file" prt :]
[12:30:55] part
[12:31:23] for logstash I have looked at https://logstash.wikimedia.org/app/kibana#/dashboard/Redis
[12:31:35] and dropping the filter at top left in a green bubble box
[12:33:54] elukey: from your last comment maybe that is redis not being fast enough
[12:34:06] or maybe the server can't accept enough simultaneous connections
[12:34:46] still not finding a log with the bad file desc, do you have a link that I can check?
[12:34:57] I can see only the usual "Unable to connect to redis server rdb1005.eqiad.wmnet:6379."
[12:35:00] etc..
[12:37:18] I am lost
[12:38:37] elukey: https://logstash.wikimedia.org/goto/c17233aad31878d5c3d06be602029efb
[12:38:56] that is one event per minute or so. Probably less concerning
[12:40:30] and that would be another reason for the "Could not connect to server "rdb1001.eqiad.wmnet:6381""
[12:41:18] ah these are appservers, not jobrunners!
[12:41:24] I believe those are different problems
[12:41:50] yeah and I filed that one as https://phabricator.wikimedia.org/T158770
[12:41:59] bad file desc in my experience is usually a write() to a socket already closed
[12:42:19] the bad file descriptors appeared after Tim applied a fix to "Route PHP warnings from the handler into udp2log"
[12:43:23] ah nice :)
[12:43:31] this could also fit very well
[12:43:53] I got one for mw1287 which is an appserver::api
[12:44:31] which queries site information, and that ends up hitting the jobqueue redis I guess to get the size of the queues
[12:47:05] elukey: eg https://en.wikipedia.org//w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
[12:47:16] which yields statistics including the # of jobs
[12:47:23] and that is apparently uncached
[13:03:05] (CR) Hashar: [C: +2] Add apertium-spa-cat package [integration/config] - https://gerrit.wikimedia.org/r/346941 (owner: KartikMistry)
[13:04:41] (Merged) jenkins-bot: Add apertium-spa-cat package [integration/config] - https://gerrit.wikimedia.org/r/346941 (owner: KartikMistry)
[13:09:00] (Abandoned) Hashar: Add more jobs to test-prio [integration/config] - https://gerrit.wikimedia.org/r/346656 (owner: Zppix)
[13:49:09] PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:37:33] Project selenium-MobileFrontend » chrome,beta,Linux,BrowserTests build #383: FAILURE in 15 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/383/
[15:46:16] Project selenium-MobileFrontend » firefox,beta,Linux,BrowserTests build #383: FAILURE in 24 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/383/
[19:41:21] MediaWiki-Codesniffer: Provide an MWCS rule to enforce "short" type definitions: int and bool, not integer and boolean - https://phabricator.wikimedia.org/T145162#3164729 (Krinkle)
[19:56:37] Continuous-Integration-Config, Continuous-Integration-Scaling, releng-201516-q3, Patch-For-Review, WorkType-NewFunctionality: [keyresult] Migrate as many misc CI jobs as possible to Nodepool - https://phabricator.wikimedia.org/T119140#3164757 (EddieGP)
[20:00:07] Release-Engineering-Team (Deployment-Blockers), Release: MW-1.29.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T160551#3164770 (demon)
[21:45:16] Continuous-Integration-Infrastructure, Jenkins, Upstream, WorkType-NewFunctionality: Jenkins trilead-ssh2 doesn't support our MAC/KEX algorithms - https://phabricator.wikimedia.org/T103351#3165201 (Paladox) A new release may happen next week :)
[22:43:28] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:58:14] (PS1) Thcipriani: PoC: Port existing Trusty jobs to Docker [integration/config] - https://gerrit.wikimedia.org/r/347130
[22:59:39] (CR) jerkins-bot: [V: -1] PoC: Port existing Trusty jobs to Docker [integration/config] - https://gerrit.wikimedia.org/r/347130 (owner: Thcipriani)
[22:59:41] thcipriani: what's wrong with just using trusty?
[23:00:15] just another thing to support, but nothing generally :)
[23:03:17] (PS2) Thcipriani: PoC: Port existing Trusty jobs to Docker [integration/config] - https://gerrit.wikimedia.org/r/347130
[23:08:01] Do we really need 3 instances of saucelabs? I barely even see 1 instance being used
[23:17:50] Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Zppix)
[23:18:33] Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Paladox) This always happens. Workaround is to refresh the page. It's a varnish problem.
[23:18:51] Release-Engineering-Team, Operations: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165440 (Paladox)
[23:19:09] Release-Engineering-Team, Operations, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Paladox)
[23:19:13] Release-Engineering-Team, Operations, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165442 (Zppix) Instead of a constant workaround, why don't we fix it?
[23:20:07] Release-Engineering-Team, Operations, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165443 (Paladox) It would be good to fix that problem.
[23:22:25] Release-Engineering-Team, Operations, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Dzahn) >>! In T162505#3165438, @Paladox wrote: > Varnish looks like the problem. I don't think so. Varnish says "Backend fetch failed". The Backend is integration.wm.org.
[23:29:30] Release-Engineering-Team, Operations, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165452 (Paladox) Oh
[23:47:54] Continuous-Integration-Config, Regression: doc.wikimedia.org generates master branch docs with label of old release - https://phabricator.wikimedia.org/T162506#3165466 (Krinkle)
[23:48:29] Continuous-Integration-Config, Regression: doc.wikimedia.org docs for old releases is actually master - https://phabricator.wikimedia.org/T162506#3165479 (Krinkle)
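A footnote on the 12:44–12:47 exchange about api appservers touching the jobqueue redis: the siteinfo statistics the API returns include the current job queue size, which is why a plain read request can end up asking redis for queue lengths, and per the discussion that result is apparently uncached. A minimal sketch of the query mentioned at 12:47:05, assuming `jq` is available for picking out the field:

```sh
# Fetch site statistics and extract the number of queued jobs
# (the value that comes from the jobqueue redis, per the discussion above).
curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' \
  | jq '.query.statistics.jobs'
```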