[00:00:11] PROBLEM - Puppet run on deployment-ircd is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:00:44] this is the full query it runs: https://github.com/wikimedia/puppet/blob/production/modules/service/files/logstash_checker.py#L115 [00:00:53] the hhvm messages might not show up in logstash for a while, especially if the events are happening a lot. there are 2 layers of rsyslog buffering between hhvm and logstash [00:00:56] doesn't do notice or info [00:01:10] https://github.com/wikimedia/puppet/blob/production/modules/service/files/logstash_checker.py#L143-L150 [00:02:20] twentyafterfour is this meant to fail like [00:02:21] https://phabricator.wikimedia.org/diffusion/EBSES/browse/master/ [00:02:25] The Almanac service for this repository is invalid or could not be loaded. [00:03:28] bd808: we do do a 20 second wait for messages to show up FWIW [00:04:02] *nod* [00:05:13] Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182008 (Paladox) BSExtendedSearch is now BlueSpiceExtendedSearch [00:07:33] thcipriani: Hm.. should probably only exclude DEBUG/INFO, not NOTICE/WARNING (that only leaves error/critical). And in addition to hhvm.log and mediawiki/channel:exception, may also wanna include mediawiki/channel:error.log. [00:07:40] Those three channels afaik don't even have any debug/info entries [00:08:01] exception=fatal, and error and hhvm are both php notices/warnings/errors [00:08:38] which is probably mapped incorrectly, we may wanna consider all php errors as level=error, php doesn't follow PSR internally (PHP Notice = an error) [00:08:46] but anyway :) [00:09:34] the levels in the hhvm channel are set by rsyslog [00:12:34] so that leaves us with something like: https://logstash.wikimedia.org/goto/440b4c7822c826d46dae732b84185233 [00:12:38] seem correct? 
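(Editor's note: a minimal sketch of the filter proposed above, i.e. drop only DEBUG/INFO and keep NOTICE and above for the hhvm, exception and error channels. It is not the real logstash_checker.py query; the field names "type", "channel" and "level" and the function name are assumptions made for illustration.)

```python
# Hypothetical illustration only: field names and values are assumed,
# not taken from the actual logstash_checker.py query.
IGNORED_LEVELS = {"DEBUG", "INFO"}
MEDIAWIKI_CHANNELS = {"exception", "error"}


def is_canary_error(event):
    """Return True if a log event should count against the canary check."""
    level = str(event.get("level", "")).upper()
    if level in IGNORED_LEVELS:
        return False  # only DEBUG/INFO are excluded; NOTICE/WARNING still count
    if event.get("type") == "hhvm":
        return True
    return (
        event.get("type") == "mediawiki"
        and event.get("channel") in MEDIAWIKI_CHANNELS
    )
```

With this shape, a NOTICE-level "undefined variable" message from hhvm would be counted, which is the case the canary check missed.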
[00:15:03] seemingly would have caught the "undefined variable" one anyway [00:15:33] Scap: Scap's canary check should stop us from deploying config changes that cause floods of "undefined variable" errors - https://phabricator.wikimedia.org/T162974#3182085 (Catrope) [00:15:38] (task ----^^) [00:20:07] PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:20:45] Scap: Scap's canary check should stop us from deploying config changes that cause floods of "undefined variable" errors - https://phabricator.wikimedia.org/T162974#3182111 (thcipriani) p:Triage>Normal canary check missed this as it ignores NOTICE log level messages in logstash https://github.com/wik... [00:21:12] updated [00:21:18] gotta run tho [00:21:28] * thcipriani nick -all thcipriani|afk [00:21:53] embarrassing IRC there :) [00:22:55] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:24:06] Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182118 (demon) p:High>Low [00:25:44] Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3178119 (demon) Open>Resolved a:demon [00:29:06] bd808: channel:error I think, error.log? [00:29:21] But yeah, I suppose tracking hhvm with notice enabled is the same [00:29:23] they go to both right now [00:29:35] channel:error is just for the stacktraces I suppose [00:30:22] oh channel:error is from our MWException stuff [00:33:23] Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182158 (Betacommand) Resolved>Open looks like //DetectLanguage// is also throwing the same issue, and I suspect there will be a half dozen or so others that need fixed too. [00:34:21] Yup. 
[00:45:31] !log cherry-picking 348184/1 (T161563) [00:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [00:45:35] T161563: ORES logs not being saved to logstash - https://phabricator.wikimedia.org/T161563 [00:57:27] rebase failed on puppetmaster [00:59:07] PROBLEM - Puppet run on deployment-restbase02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:59:29] !log three cherry-picks failed to merge, skipped them 93dad5bec8e937ef93bdd63046b0bbbf14ad9722 92c7d0b002a02ff46a00e79d6d89fe83d5f65c17 21d60a478ffb21160049679eea235bdb1010a489 [00:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [01:01:02] 10Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182200 (10Betacommand) and //Notifications// [01:10:09] RECOVERY - Puppet run on deployment-ircd is OK: OK: Less than 1.00% above the threshold [0.0] [01:22:59] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:25:27] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:31:10] PROBLEM - Puppet run on deployment-ircd is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [01:32:30] 10Gerrit: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182224 (10Betacommand) and //PhpTagsDebugger// [01:39:09] RECOVERY - Puppet run on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:49:20] These ^ were probably because of me (rebasing puppetmaster) [02:01:01] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10Icinga, 06Operations, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182243 (10Dzahn) fixed. gone on 2001, exists on 1001, no more cruft in Icinga https://icinga.w... 
[02:01:15] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3182246 (10Dzahn) [02:01:17] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10Icinga, 06Operations, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182244 (10Dzahn) 05Open>03Resolved [02:02:06] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2795939 (10Dzahn) once jenkins is running on both servers, don't forget to remove https://gerrit.wikimedia.org/r/#/c/348171/... [03:19:43] 06Release-Engineering-Team, 06Operations, 10Phabricator, 10hardware-requests, 10ops-eqiad: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3182333 (10faidon) Sure, that's OK. [06:22:12] Yippee, build fixed! [06:22:12] Project selenium-Wikibase » chrome,test,Linux,BrowserTests build #330: 09FIXED in 1 hr 42 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=BrowserTests/330/ [07:51:37] !sal [07:51:38] https://tools.wmflabs.org/sal/releng [07:52:27] !log Puppet failing on deployment-tin and deployment-mira . 
Some patches have been dropped from the puppet master :-(( [07:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:03:42] !log beta: resetting puppetmaster to last good tag snapshot-20170414T0030 A cherry pick for T161563 end up dropping three patches which broke other parts of the infrastructure [08:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:03:45] T161563: Use gzip for logstash - https://phabricator.wikimedia.org/T161563 [08:04:12] hashar: I feel a bit stupid today [08:04:39] elukey@mw1306:~$ sudo netstat -tuap | grep TIME_WAIT | grep rdb | wc -l [08:04:42] 70083 [08:04:43] elukey: given it is a Friday and we are all tired, I guess it is to be expected? :} [08:04:44] elukey@mw1306:~$ sudo sysctl net.ipv4.ip_local_port_range [08:04:45] ahhh [08:04:47] net.ipv4.ip_local_port_range = 32768 60999 [08:04:52] so out of sockets? [08:05:03] or fd or connections? [08:05:05] I have no idea why I missed this, I guess I checked only on the rdb hosts [08:05:19] I added a comment today with all the data that we have [08:05:32] and it didn't make sense why only the jobrunners were so different [08:05:32] yesterday night I found that diamond used to report servers.xxx.network.connections.TIME_WAIT etc [08:05:38] but that is no longer collected apparently [08:05:53] this is what I checked this morning and told myself "you are stupid" [08:05:56] :D [08:06:22] well the errors about fsockopen() reporting "invalid file descriptor" kind of hint about it [08:06:39] but really with all the different levels and moving parts, it is hard to figure out a good lead :} [08:07:26] elukey: can't the kernel just reuse TIME_WAIT connections? 
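(Editor's note: the netstat | grep | wc pipeline above gives one total. As a rough sketch, the same count can be broken down per remote port by parsing /proc/net/tcp, where socket state 06 is TIME_WAIT and addresses are written as hex IP:hex port. The function name is made up for illustration; this covers IPv4 only.)

```python
from collections import Counter

TIME_WAIT = "06"  # state code for TIME_WAIT in /proc/net/tcp


def count_time_wait_by_remote_port(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote port from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first row is the header
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) < 4 or fields[3] != TIME_WAIT:
            continue
        # rem_address looks like "0200000A:18EB" (hex IP, hex port)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        counts[remote_port] += 1
    return counts
```

On a live host it could be called as count_time_wait_by_remote_port(open("/proc/net/tcp").read()); a large count on the Redis or MySQL port would match what is described in the conversation.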
[08:07:28] hashar: now I am wondering if the persistent connection thing that aaron removed from mw-config might be the solution [08:07:41] might be [08:07:57] but the problem with the persistent connection is that when you set some specific parameters to that connection, they hmm [08:07:59] persist [08:08:05] which might lead to a bunch of other issues [08:08:28] I tracked down why the fix was made, and it was during the last big outage that we had with the jobqueues [08:08:40] in which they were growing indefinitely [08:12:48] elukey: or [08:13:06] could it be that because of the RST the connection is kept around in TIME_WAIT state? [08:13:40] nono a TCP active close forces TIME_WAIT as last step before giving up the socket [08:14:32] found some doc hinting at net.ipv4.tcp_tw_reuse on the client side [08:15:27] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [08:17:16] yes sometimes those options might be good to use but they are risky, and I think that the fix should be about reducing the number of conns.. [08:17:16] !log beta: cherry picking again 348184/4 'service: use gzip for logging in uwsgi' for T161563 [08:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:17:20] T161563: Use gzip for logstash - https://phabricator.wikimedia.org/T161563 [08:17:59] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [08:18:17] brb [08:20:33] elukey: https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux ! [08:24:02] yep :) [08:24:16] I have read it in French [08:24:32] found out the English version, and the English one is easier to read ironically [08:26:40] mw1306:~$ ss -s [08:26:41] TCP: 85713 (estab 196, closed 85476, orphaned 12, synrecv 0, timewait 85472/0), ports 0 [08:26:45] eeek [08:26:58] yep [08:27:05] I feel so stupid, haven't checked before [08:27:20] dont blame yourself [08:27:30] I do!!! 
: [08:27:30] :D [08:27:34] dns/tcp issues are probably the last things one wanna check [08:27:46] it is taken for granted, much like the server having proper power supply [08:29:41] there is also 11k mysql connections in time_wait [08:34:03] hashar: https://phabricator.wikimedia.org/T129517#2113526 [08:35:15] i like the blog conclusion: "Moreover, when designing protocols, don’t let clients close first. " [08:36:29] oh that is that bug that prompted me to polish up the jobrunner grafana board [08:36:31] neat [08:39:21] elukey: I am tempted to restore the persistent state [08:39:28] but I guess that will be for the week after the dc switch [08:40:32] I think that if we try now people could kill us [08:40:33] :D [08:40:50] ahah [08:40:53] good luck catching me! [08:41:13] I am at my Zombie invasion shelter [08:41:55] all right then let's deploy :P [08:44:31] elukey: also arent we using nutcracker as redis proxy for something? [08:44:59] hashar: only for mw session cache [08:45:18] (hhvm -> mc* hosts, where memcached also live) [08:47:13] well at least we know what is falling now [08:47:19] as for how to fix it properly. 
Really I dont know :( [08:47:39] I found base::mysterious_sysctl that is applied only to mc* nodes [08:47:53] in that one we use 'net.ipv4.ip_local_port_range' => [ 1024, 65535 ], [08:48:02] that could mitigate a bit the issue [08:52:38] ohh [08:52:45] probably for similar reasons [08:54:07] and also in swift and caching hosts [08:55:07] I am waiting patiently for systemd-communicate [08:55:12] that would overhaul tcp/udp [08:55:35] hashar: check cache/perf.pp [08:55:40] really nice reading :) [08:57:31] elukey: 50$ that is by Brandon Blak [08:58:00] yep :D [08:58:47] so hmm [08:58:53] probably want to bump local port range [08:59:28] yes definitely, and it could be a easy test to do on one host [08:59:40] but it feels that the issue is not really addressed [08:59:50] and potentially [09:00:12] when the client does a QUIT , maybe the server replies with a OK and then close() the socket [09:00:19] so the time wait would be on the redis server side [09:00:26] instead of on the client side [09:00:48] not sure what is worse :D [09:01:02] anyhow, with the new hhvm the QUIT fix will prevent this [09:01:17] even if we should see something even now about it [09:01:21] not sure [09:01:32] the QUIT fix ? [09:01:33] at this point I have 100 variables in mind after this redis tour [09:01:46] yep the commit to remove QUIT from Redis.php close [09:02:02] (hhvm) [09:02:27] how that would fix it ? [09:03:16] if I got it right, which ever side invokes close() ends up with a time_wait anyway [09:03:46] nono sorry I meant to say that it will not move the TIME_WAIT to Redis [09:03:55] that could be even riskier [09:04:02] ahh [09:04:03] yeah [09:04:14] though maybe it is easier to manage on the server side [09:04:34] should you try bumping the local port range a bit ? 
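(Editor's note: a back-of-the-envelope sketch of why bumping the local port range helps but does not address the root cause. It assumes the client closes first, that Linux holds each closed socket in TIME_WAIT for 60 seconds, and that ports are not reused early. The limit applies per destination IP and port, so the totals seen on the host can exceed it. Function names are made up for illustration.)

```python
TIME_WAIT_SECONDS = 60  # Linux holds a closed socket in TIME_WAIT for 60 s


def ephemeral_ports(low, high):
    """Number of local ports available in an inclusive port range."""
    return high - low + 1


def max_new_connections_per_second(low, high, time_wait=TIME_WAIT_SECONDS):
    """Upper bound on sustained new connections to ONE destination IP:port
    when every closed connection parks its local port for time_wait seconds."""
    return ephemeral_ports(low, high) / time_wait


# Range seen on mw1306 versus the range applied on the mc* hosts
for low, high in [(32768, 60999), (1024, 65535)]:
    ports = ephemeral_ports(low, high)
    rate = max_new_connections_per_second(low, high)
    print(f"{low}-{high}: {ports} ports, about {rate:.0f} new conns/s per destination")
```

Widening the range roughly doubles the ceiling, which is why it reads as a mitigation in the discussion while reducing the number of connections is the actual fix.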
[09:05:04] I could do it via sysctl but nobody acked me on the sec chan and it is the friday before Easter [09:05:43] it shouldn't be a huge deal though a single test on a jobrunner [09:12:49] 10Browser-Tests-Infrastructure, 07Documentation, 07Easy, 07Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3182492 (10Rammanojpotla) @hashar I have made the required edits needed but when I am typing git status it gives me nothing the directory doc... [09:20:59] 10Browser-Tests-Infrastructure, 07Documentation, 07Easy, 07Software-Licensing: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3182493 (10hashar) All the content in `doc/` (for example `doc/README.html`) is automatically generated when running `bundle exec yard`. So... [11:19:38] hashar great news my polygerrit change was merged :) [11:19:46] * paladox backports it to 2.14 [12:15:13] 10Gerrit, 13Patch-For-Review: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182727 (10Betacommand) //PipVideoJs// [12:29:21] !log Delete integration-c1 instance (32GB RAM) on labvirt1004. It was used as a workaround for T161006 [12:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:29:26] T161006: Convince nova-scheduler to pay attention to CPU metrics - https://phabricator.wikimedia.org/T161006 [12:33:30] PROBLEM - Host integration-c1 is DOWN: CRITICAL - Host Unreachable (10.68.21.131) [13:06:07] 10Gerrit, 13Patch-For-Review: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182781 (10Paladox) @Betacommand hi, could you try removing the ones that is blocking you and give a full list of the ones to be removed please? 
[13:25:19] https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/28387/console [13:25:21] Test failing [13:29:35] That failure looks like it was caused by jquery 3 [13:29:36] update [13:29:39] Reedy: most probably https://phabricator.wikimedia.org/T162876 [13:29:56] Interestingly, it worked fine on the first PS [13:36:02] 10Continuous-Integration-Infrastructure: Investigate disk usage of integration-slave-jessie-1002 - https://phabricator.wikimedia.org/T162635#3182860 (10hashar) 05Open>03Resolved a:03hashar The build cache is way smaller now `/mnt/home/jenkins-deploy/.android/build-cache` so at least that is a thing. Lets... [13:36:21] Reedy: yeah it is some kind of race condition :/ [13:36:33] A race as to which slave it runs on? :P [13:41:12] Reedy: race between tests [13:44:16] (03CR) 10Hashar: "I would rather have PHP CodeSniffer to be run from the 'composer test' job and to speed it up make phpcs to only run against files changed" [integration/config] - 10https://gerrit.wikimedia.org/r/339666 (https://phabricator.wikimedia.org/T158974) (owner: 10Paladox) [13:45:07] (03PS2) 10Hashar: dib: point imagecache to local directory [integration/config] - 10https://gerrit.wikimedia.org/r/347769 [13:46:56] (03CR) 10Hashar: [C: 032] dib: point imagecache to local directory [integration/config] - 10https://gerrit.wikimedia.org/r/347769 (owner: 10Hashar) [13:47:13] (03PS1) 10Rammanojpotla: Ruby gem documentation should state license [selenium] - 10https://gerrit.wikimedia.org/r/348222 [13:48:15] (03Merged) 10jenkins-bot: dib: point imagecache to local directory [integration/config] - 10https://gerrit.wikimedia.org/r/347769 (owner: 10Hashar) [13:49:41] (03Abandoned) 10Hashar: Migrate wikimedia/fundraising/dash node-0.10 test to node-4.3 test [integration/config] - 10https://gerrit.wikimedia.org/r/291603 (owner: 10Paladox) [13:50:22] (03Abandoned) 10Hashar: [WIP] Add a sqlite variant of extension-unittests-* [integration/config] - 
10https://gerrit.wikimedia.org/r/269653 (owner: 10JanZerebecki) [13:50:57] (03Abandoned) 10Hashar: Migrate mediawiki-phpunit-phpflavour-composer to nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/290806 (https://phabricator.wikimedia.org/T135001) (owner: 10Paladox) [13:52:39] (03Abandoned) 10Hashar: Decouple composer package test for oojs-ui [integration/config] - 10https://gerrit.wikimedia.org/r/288320 (https://phabricator.wikimedia.org/T134946) (owner: 10Paladox) [13:54:28] (03Abandoned) 10Hashar: mediawiki/core: Run nodepool jobs in check pipeline [integration/config] - 10https://gerrit.wikimedia.org/r/291302 (owner: 10Legoktm) [13:54:30] (03Abandoned) 10Hashar: pywikibot/core: Run nodepool jobs in check pipeline [integration/config] - 10https://gerrit.wikimedia.org/r/291303 (owner: 10Legoktm) [13:54:34] (03Abandoned) 10Hashar: Allow all nodepool jobs to be run for untrusted users [integration/config] - 10https://gerrit.wikimedia.org/r/291301 (owner: 10Legoktm) [13:57:21] (03Abandoned) 10Hashar: dib: stop hardcoding image type [integration/config] - 10https://gerrit.wikimedia.org/r/347770 (owner: 10Hashar) [14:00:11] hashar ssh-slaves is deprecated now. And is being replaced with a new plugin. Though no timeline on when that will be release. [14:00:41] paladox: so it is deprecated with no replacement? 
:D [14:02:00] (PS2) Hashar: Switch mw-checks-test to extension-unittests-non-voting [integration/config] - https://gerrit.wikimedia.org/r/346552 [14:02:17] (PS5) Hashar: test: mw ext/skins have test templates in Zuul [integration/config] - https://gerrit.wikimedia.org/r/332890 [14:03:28] (CR) jerkins-bot: [V: -1] test: mw ext/skins have test templates in Zuul [integration/config] - https://gerrit.wikimedia.org/r/332890 (owner: Hashar) [14:03:44] (PS2) Rammanojpotla: Ruby gem documentation should state license [selenium] - https://gerrit.wikimedia.org/r/348222 [14:04:13] hashar there's a replacement [14:04:17] though not ready yet [14:04:46] hashar the ssh slaves will be updated for the time being [14:04:51] (PS3) Rammanojpotla: Ruby gem documentation should state license [selenium] - https://gerrit.wikimedia.org/r/348222 [14:05:04] then when the new plugin is ready ssh slaves will presumably be unmaintained. [14:05:22] but the new plugin will be based on apache mini [14:05:25] mini = mina [14:07:13] hashar it's a plugin that cloudbees have been using for years i think [14:07:18] they are open sourcing it [14:17:50] Browser-Tests-Infrastructure, Documentation, Easy, Software-Licensing, User-zeljkofilipin: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3182946 (zeljkofilipin) [14:20:59] (CR) Zfilipin: "recheck" [selenium] - https://gerrit.wikimedia.org/r/348222 (owner: Rammanojpotla) [14:21:08] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [selenium] - https://gerrit.wikimedia.org/r/330671 (owner: Hashar) [14:21:13] (PS1) Rammanojpotla: Ruby gem documentation should state license [ruby/api] - https://gerrit.wikimedia.org/r/348225 [14:24:15] (Abandoned) Hashar: (WIP) (WIP) PoolCounter-debian-glue (WIP) (WIP) [integration/config] - https://gerrit.wikimedia.org/r/325322 (https://phabricator.wikimedia.org/T152338) (owner: Hashar) 
[14:24:33] (03CR) 10Hashar: [C: 032] Switch mw-checks-test to extension-unittests-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/346552 (owner: 10Hashar) [14:24:49] 10Browser-Tests-Infrastructure, 07Documentation, 07Easy, 07Software-Licensing, 15User-zeljkofilipin: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3182974 (10Rammanojpotla) @hashar Is the change made is correct I have even pushed the readme.md of api now and why... [14:25:57] (03Merged) 10jenkins-bot: Switch mw-checks-test to extension-unittests-non-voting [integration/config] - 10https://gerrit.wikimedia.org/r/346552 (owner: 10Hashar) [14:28:56] (03Abandoned) 10Hashar: (WIP) Try MediaWiki code coverage on Jessie/php7 (WIP) [integration/config] - 10https://gerrit.wikimedia.org/r/314559 (https://phabricator.wikimedia.org/T147778) (owner: 10Hashar) [14:29:32] 10Browser-Tests-Infrastructure, 07Documentation, 07Easy, 07Software-Licensing, 15User-zeljkofilipin: Ruby gem documentation should state license - https://phabricator.wikimedia.org/T94001#3182982 (10Rammanojpotla) @ as both api and selenium are configured differently in git remote I have pushed them to g... [14:33:01] (03CR) 10Zfilipin: [C: 04-1] Ruby gem documentation should state license (031 comment) [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [14:34:21] (03CR) 10Zfilipin: [C: 04-1] "Rammanojpotla, good work, but I think this need to be updated before it can be merged." [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [14:35:34] (03PS4) 10Zfilipin: Ruby gem documentation should state license [selenium] - 10https://gerrit.wikimedia.org/r/348222 (https://phabricator.wikimedia.org/T94001) (owner: 10Rammanojpotla) [14:39:17] 10Gerrit, 13Patch-For-Review: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182992 (10Betacommand) >>! 
In T162884#3182781, @Paladox wrote: > @Betacommand hi, could you try removing the ones that is blocking you and give a full list of the ones to be removed please?... [14:39:43] 10Gerrit, 13Patch-For-Review: REL1_28 checkout of extensions.git is broken - https://phabricator.wikimedia.org/T162884#3182993 (10Paladox) Thanks. [14:44:25] 10Deployment-Systems, 10RESTBase, 06Services (done), 15User-mobrovac: RESTBase deployment process - https://phabricator.wikimedia.org/T103344#3182996 (10mobrovac) 05Open>03Resolved a:03mobrovac Yup, done, resolving. [14:44:31] 10Deployment-Systems, 06Release-Engineering-Team, 10RESTBase, 05Goal, 06Services (next): Create or improve the RESTBase deploy method - https://phabricator.wikimedia.org/T102667#3183000 (10mobrovac) [14:45:47] (03CR) 10Zfilipin: [C: 04-1] "Good work, but a minor fix is needed before this can be merged." (031 comment) [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [14:49:46] (03CR) 10EddieGP: Ruby gem documentation should state license (031 comment) [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [14:49:55] (03CR) 10EddieGP: [C: 04-1] Ruby gem documentation should state license [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [14:56:05] 10Gerrit, 07Upstream: "git review -d XXX" doesn't work for http gerrit - https://phabricator.wikimedia.org/T100987#1325879 (10cscott) Hm, I'm seeing this even with the `gerrit` remote set to `ssh://` -- it appears that your `origin` remote also has to be via ssh (not https) in order for gerrit to be happy. 
[15:03:50] hashar great news i think he managed to fix ssh slaves to use the new trilead version [15:07:34] it's a new plugin [15:07:37] called trilead-api [15:13:16] paladox: I receive the mail notifications :} [15:13:35] hashar oh, im helping him to test the new plugin :) [15:13:44] neat [15:13:50] yep [15:14:22] hashar the plugin is meant for them to remove trilead from jenkins core [15:14:39] but also for jenkins 1.x users to be able to use the new trilead version too [15:14:49] and it works [15:28:12] elukey: the jobrunner patches, I guess I will deploy them after the dc switchover week [15:29:54] hashar hi, what do i do if i get this 15:24:49 stderr: fatal: '/testing/test' does not appear to be a git repository error with zuul [15:30:05] (03CR) 10Hashar: "So I thought a bit more about it. We would need two different caching variables:" [integration/config] - 10https://gerrit.wikimedia.org/r/346152 (https://phabricator.wikimedia.org/T159591) (owner: 10Hashar) [15:30:11] please [15:30:24] paladox: where do you get that ? [15:30:34] on my test labs instance [15:30:36] gerrit-test [15:30:36] ahh [15:30:44] that is a git clone from the zuul merger isn't it ? [15:30:47] Yep [15:31:06] head on the zuul-merger , the repos should be under /srv/zuul/git iirc [15:31:30] oh, i delete it and zuul should recreate it? [15:31:42] * hashar which instance is that ? 
[15:31:47] (03PS2) 10Rammanojpotla: Ruby gem documentation should state license [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 [15:32:12] hashar gerrit-test [15:32:13] http://gerrit-jenkins.wmflabs.org/job/composer-gerrit-test/21/console [15:32:27] let me check [15:32:30] ok [15:32:32] thanks [15:33:38] Cloning repository 10.68.20.204/testing/test [15:33:53] that is the parameter ZUUL_URL= 10.68.20.204 [15:34:41] oh [15:34:42] (03PS5) 10Rammanojpotla: Ruby gem documentation should state license [selenium] - 10https://gerrit.wikimedia.org/r/348222 [15:34:46] /srv/zuul/git/testing/test/.git exists [15:34:52] but apparently there is no git-daemon running [15:35:38] oh [15:35:52] there should be one provided by role::zuul::merger [15:36:20] yeah it has: [15:36:24] class { 'contint::zuul::git_daemon': [15:36:24] zuul_git_dir => $conf_merger['git_dir'], [15:36:25] } [15:36:41] hashar: yeah +1 for the patches after the switchover [15:37:07] paladox: looks like you are not applying the puppet role do you? 
[15:37:11] yep [15:37:20] elukey: and most probably we could bump the timeout [15:37:30] elukey: and later on restore persisting connections [15:37:39] hashar but debian tests works [15:37:44] http://gerrit-jenkins.wmflabs.org/job/debian-glue-non-voting/16/console [15:39:03] (03CR) 10Zfilipin: [C: 04-1] "This is better, but you still did not implement the change I have requested here:" [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [15:39:06] (03CR) 10Zfilipin: [C: 04-1] "This is better, but you still did not implement the change I have requested here:" [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [15:39:29] paladox: eeek [15:41:25] jmm [15:41:26] hmm [15:41:27] doing [15:41:28] git clone git://10.68.20.204/testing/test [15:41:29] works [15:41:30] paladox: looks like zuul-cloner does not even attempt to hit the ZUUL_URL [15:41:31] ah [15:41:56] oh [15:42:03] maybe in the zuul-merger config change the url from 10.68.20.204 to git://10.68.20.204 ? [15:42:32] though git:// should be the default protocol [15:42:55] and there is Gerrit listening there [15:43:00] fixed [15:43:01] it [15:43:01] http://gerrit-jenkins.wmflabs.org/job/composer-gerrit-test/26/console [15:43:04] so maybe that clones from gerrit instead of the git repo from zuul? [15:43:22] ah no [15:43:29] it managed to fetch the refs/zuul/xxx [15:43:34] what have you changed? :} [15:43:45] changed it from a plain ip to git:// [15:43:56] in the /etc/zuul/zuul-merger* conf file [15:43:57] and spawned git-daemon right ? [15:44:11] yep [15:44:12] started git-daemon :) [15:44:22] well done hacker ™ ! 
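(Editor's note: the fix described above was to give the merger URL an explicit git:// scheme and to start git-daemon. The helper below only illustrates the URL part. It is a hypothetical function, not zuul-cloner's actual logic, and the default scheme is an assumption.)

```python
def clone_url(zuul_url, project, default_scheme="git"):
    """Build the URL a job would clone from, adding a scheme if none is given.

    Hypothetical helper for illustration; not part of Zuul.
    """
    if "://" not in zuul_url:
        # A bare host such as "10.68.20.204" carries no protocol information
        zuul_url = f"{default_scheme}://{zuul_url}"
    return zuul_url.rstrip("/") + "/" + project.lstrip("/")
```

For the values in the log, clone_url("10.68.20.204", "testing/test") yields git://10.68.20.204/testing/test, the URL that cloned successfully by hand.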
[15:45:50] :) [15:46:01] hashar: I tried to bump the timeout on the jobrunners to 0.8s (assumption that I had - changing the related php file would have been enough to trigger the change) but it didn't help [15:46:30] atm I think that having a pool of persistent connections would be the safest thing to do [15:46:46] it is alarming how many sockets we have in TIME_WAIT on appservers and jobrunners [15:46:51] and remember the value has to be changed at two places! [15:46:56] in the jobrunner and in mediawiki :( [15:47:17] well for the # of time wait, given we reuse such connections we could reduce the local port range [15:48:32] hashar: I tried this one https://gerrit.wikimedia.org/r/#/c/346508/1/wmf-config/jobqueue.php [15:49:11] hashar: reducing the local port range would be worst I think, not all of them can be reused (the netstat timings seems to confirm) [15:49:12] another gem I found: is to collect stats every 1 second with : nstat -d 1 Tcp* [15:49:25] then dump the result via nstat Tcp* [15:49:37] that shows bunch of metrics and the average per period [15:49:46] (have to kill nstat once done) [15:51:02] TcpExtTW 8153 [15:51:05] TcpExtTWRecycled 3008 [15:51:10] ymmv :D [15:51:22] I am not sure why there are so many connections being open though [15:51:43] hashar also i forgot to say ssh-slaves requires java 8 now too. [15:52:14] paladox: yeah you told me :} [15:52:20] ok [15:52:30] paladox: I am not sure how we will manage to ssh to Trusty instance (there is no java 8 there) [15:52:41] Oh. [15:53:03] we will find out :} [15:53:07] :) [15:54:22] hashar vote for this https://bugs.launchpad.net/trusty-backports/+bug/1368094 :) [16:01:25] hehe [16:01:36] most probably we will end up with Jenkins slaves that are all Jessie [16:01:40] with Docker running on it [16:01:51] then have Jenkins spawn a docker container and run tests in that [16:02:01] there are some experiments going on :} [16:02:24] well I am off [16:02:45] elukey: have a good week-end! 
[16:03:29] you too!! [16:14:01] (03CR) 10EddieGP: [C: 04-1] Ruby gem documentation should state license (032 comments) [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [16:14:09] (03CR) 10EddieGP: [C: 04-1] Ruby gem documentation should state license (031 comment) [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [16:14:54] (03PS6) 10Rammanojpotla: Ruby gem documentation should state license [selenium] - 10https://gerrit.wikimedia.org/r/348222 [16:16:10] (03CR) 10EddieGP: [C: 031] "recheck" [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [16:20:14] (03PS3) 10Rammanojpotla: Ruby gem documentation should state license [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 [16:21:46] (03PS7) 10Rammanojpotla: Ruby gem documentation should state license [selenium] - 10https://gerrit.wikimedia.org/r/348222 [16:23:25] (03CR) 10EddieGP: [C: 031] "recheck" [selenium] - 10https://gerrit.wikimedia.org/r/348222 (owner: 10Rammanojpotla) [16:25:32] (03CR) 10EddieGP: [C: 04-1] "I'm unable to find any difference between this and the patchset uploaded before." [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [16:28:42] (03PS4) 10Rammanojpotla: Ruby gem documentation should state license [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 [16:30:28] (03CR) 10EddieGP: [C: 031] "recheck" [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [16:33:32] (03PS5) 10Rammanojpotla: Ruby gem documentation should state license [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 [16:38:35] (03CR) 10EddieGP: [C: 031] "recheck" [ruby/api] - 10https://gerrit.wikimedia.org/r/348225 (owner: 10Rammanojpotla) [19:27:02] Before I forget on behalf of my Family, Happy Easter! [19:30:51] Zppix: best Easter for you! 
Hopefully the bells will bring a lot of chocolate :-} [19:31:07] i hope or I will not be a very happy camper :P [20:48:29] PROBLEM - Puppet run on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:15:48] Gerrit, Operations, LDAP: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#1676037 (bd808) We may be able to do something about this trivially after completing {T161859}. As @hashar points out in the summary, today we... [21:31:30] Continuous-Integration-Config, Continuous-Integration-Infrastructure (phase-out-trusty), Release-Engineering-Team: Switch MediaWiki coverage job from PHP 5 to PHP 7 - https://phabricator.wikimedia.org/T147778#3183918 (Krinkle) [23:28:39] Release-Engineering-Team, Operations, DC-Switchover-Prep-Q3-2016-17, Patch-For-Review: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3184078 (jcrespo) [23:28:42] Continuous-Integration-Infrastructure, Release-Engineering-Team, Operations, Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3184077 (jcrespo)