[00:00:47] Is there a page on wikitech or mw.o that answers the question "how do I get access to the Beta Cluster project?" [00:01:10] Context is T172040 [00:01:11] T172040: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040 [00:02:02] paladox: i already started, just got distracted by a security thig [00:02:08] ok :) [00:03:12] bd808: you can tell me to go for a walk instead -- maybe I'm asking too much lately; sorry for that [00:03:14] mutante you linked to the wrong bug [00:03:22] in your change [00:03:31] it should be T163938 and not the codfw one :) [00:03:31] T163938: replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938 [00:04:28] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3488152 (10bd808) >>! In T172040#3487685, @MarcoAurelio wrote: > @bd808 And could then be possible to run that... [00:04:45] TabbyCat: asking is never bad :) [00:04:55] * Reedy steals bd808 car [00:05:07] Reedy: the keys are in it [00:05:10] :D [00:05:25] lol [00:05:36] but we don't want no carrucha -- mercedes, lexus, audi... [00:05:53] those cars good [00:06:14] paladox: please fix, need to do one other thing really quick [00:06:18] thanks [00:06:22] ok [00:08:12] TabbyCat: I would add you to the deployment-prep project, but I haven't helped with it for more than a year so you should probably find someone else to help you try to figure out how to run that script. [00:09:19] bd808: yes, well, I'd not run anything w/o knowing how to and its results [00:09:28] don't want to ruin anyone's job [00:10:12] off to bed now, see you tomorrow [00:10:36] time flys by so much TabbyCat :) [00:10:38] already 1am [00:10:50] paladox: 2.10 AM [00:10:55] lol [00:11:05] wtf am I doing here yet xD [00:14:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [00:22:32] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:33:42] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [01:23:44] Project selenium-MinervaNeue » chrome,beta,Linux,BrowserTests build #40: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/40/ [01:26:32] mutante: I'm not sure what creates that directory, it may not be in puppet [01:27:40] nvm I see you figured it out [01:30:36] yea, we did.. 
just got distracted by security talk [01:31:02] i'll add the missing IPv6 record now [01:31:07] that will fix ferm issue [01:31:11] that will fix puppet run [01:32:02] Project selenium-MinervaNeue » firefox,beta,Linux,BrowserTests build #40: 04FAILURE in 29 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/40/ [01:34:59] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:35:21] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:42:42] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:10:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [02:15:00] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [02:22:41] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [04:07:33] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,BrowserTests build #472: 04FAILURE in 11 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=BrowserTests/472/ [04:22:40] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [04:25:17] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [04:25:55] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [04:36:22] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [04:37:12] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [05:00:17] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [05:00:57] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:02:41] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:11:21] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [05:12:13] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:22:41] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:23:41] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [05:33:13] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [05:44:25] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [05:44:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [06:02:40] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [06:03:43] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] 
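On mutante's 01:31 note above about adding the missing IPv6 record: ferm rules that use @resolve() on a hostname fail when the name has no record for the address family being resolved, and a failing ferm rule in turn fails the whole puppet run. A minimal way to verify that kind of fix, sketched here with phab1001.eqiad.wmnet as an assumed example host (the log does not name the host, and the exact ferm invocation is not confirmed either):

    dig +short A    phab1001.eqiad.wmnet    # existing v4 record
    dig +short AAAA phab1001.eqiad.wmnet    # the newly added v6 record should now resolve
    # then, on the affected host, re-test the ruleset and re-run the agent:
    sudo ferm --noexec /etc/ferm/ferm.conf  # parse and resolve the rules without applying them
    sudo puppet agent --test

If both lookups succeed and ferm parses cleanly, the next puppet run should go back to green on its own.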
[06:13:15] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:21:57] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [06:24:23] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [06:24:44] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [06:34:11] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [06:36:11] PROBLEM - Puppet errors on deployment-eventlog02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [06:52:46] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [07:01:56] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:11:10] RECOVERY - Puppet errors on deployment-eventlog02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:14:14] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:23:40] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [07:24:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [07:32:48] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:34:44] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [07:35:14] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:03:39] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [08:10:12] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:12:28] 10Release-Engineering-Team (Kanban), 10Operations, 10Phabricator: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839436 (10mmodell) p:05Normal>03High [08:14:29] 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Use pwstore (a shared gpg-encrypted password store) for Release Engineering related passwords - https://phabricator.wikimedia.org/T139093#3488596 (10mmodell) [09:00:11] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0] [09:02:22] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [09:05:14] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0] [09:42:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:56:11] I am going to stop CI for a few minutes to push a mass amount of changes to the mediawiki extensions [10:12:47] !log Stopped Zuul / CI for mass mediawiki extension changes [10:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:13:47] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [10:14:38] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python 
/usr/bin/zuul-server [10:15:08] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused amusso Stopped Zuul for mass spam to Gerrit [10:15:08] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server amusso Stopped Zuul for mass spam to Gerrit [10:21:47] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [10:22:38] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [10:24:37] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused amusso Mass spam to Gerrit and puppet disabled to prevent Zuul from coming back up [10:24:37] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server amusso Mass spam to Gerrit and puppet disabled to prevent Zuul from coming back up [10:25:02] ^^^ that is me [10:43:41] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [10:45:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:55:45] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [11:18:22] PROBLEM - Puppet errors on deployment-ms-be03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:20:15] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:20:33] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [11:21:13] PROBLEM - Puppet errors on deployment-sca02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:22:55] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [11:23:39] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [11:30:45] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [11:41:00] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Wikidata, and 3 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3488904 (10Aleksey_WMDE) Can I ask what the status is? [11:53:45] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:02:55] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:33:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:45:33] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3489003 (10MarcoAurelio) @bd808 I really don't feel confortable to ask people randomly. It can be disturbing. I... 
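A note on the two contint1001 alerts above (zuul_gearman_service and zuul_service_running): both are plain liveness checks, and when Zuul is stopped on purpose, as hashar did for the mass extension changes, they can be reproduced by hand to confirm things are in the expected state. A rough sketch, run on contint1001; the real checks presumably go through the standard Icinga/NRPE plugins, this just mimics what they test:

    nc -z 127.0.0.1 4730 && echo "gearman listening" || echo "gearman down"   # zuul_gearman_service
    pgrep -fc '^/usr/share/python/zuul/bin/python /usr/bin/zuul-server'       # zuul_service_running, prints the process count
    # the gearman admin protocol also answers a plain "status" query on the same port:
    echo status | timeout 2 nc 127.0.0.1 4730

With Zuul intentionally down and puppet disabled, the ACKNOWLEDGEMENT entries are what keep Icinga from escalating the two criticals in the meantime.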
[12:46:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:49:57] (03PS1) 10Aude: Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 [12:50:03] (03CR) 10Aude: [C: 032] Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 (owner: 10Aude) [12:50:39] (03Merged) 10jenkins-bot: Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 (owner: 10Aude) [13:15:25] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:21:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [13:50:15] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3489134 (10MarcoAurelio) 05Open>03declined Declined in favor or the task above. [13:53:01] bd808: I closed that ^ [13:53:19] given that there's that other solution [13:53:37] RainbowSprinkles: I left you a message via conph [13:54:47] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [13:55:27] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [14:12:26] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:12:34] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:15:46] 10Beta-Cluster-Infrastructure, 10Operations, 10Ops-Access-Requests, 10User-MarcoAurelio: Requesting access to deployment-prep for @MarcoAurelio - https://phabricator.wikimedia.org/T172182#3489233 (10MarcoAurelio) [14:33:13] 10Continuous-Integration-Config, 10Tool-stewardbots, 10Patch-For-Review, 10User-MarcoAurelio: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#3489258 (10MarcoAurelio) 05Open>03Resolved a:03MarcoAurelio I think this is resolved for now. In case we need mor... [14:41:29] hashar: weirdly jenkins tests +1 on https://gerrit.wikimedia.org/r/#/q/owner:geoffreytrang%2540gmail.com+status:open but refuses on wikimediamessages? [14:44:42] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:46:44] TabbyCat: because that author is not whitelisted, its patches only receives CR+1 [14:46:59] 10Gerrit, 10Release-Engineering-Team (Backlog), 10Patch-For-Review: Update gerrit to 2.14.2 - https://phabricator.wikimedia.org/T156120#3489354 (10Paladox) Upstream won't merge the sshd patch that includes the fix for edcsa so we will need to either cherry pick it when we build or wait until 2.15. 
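As an aside on the tools/release bump aude merged above (change 369346, patchset 1): any Gerrit change referenced in this log can be pulled down locally through Gerrit's standard refs/changes namespace. A sketch, using the same anonymous HTTPS URLs seen elsewhere in this log; the ref encodes the last two digits of the change number, then the change number, then the patchset:

    git clone https://gerrit.wikimedia.org/r/tools/release
    cd release
    git fetch origin refs/changes/46/369346/1   # 46 = last two digits of change 369346, patchset 1
    git checkout FETCH_HEAD                     # inspect the exact patchset that was reviewed and merged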
[14:47:02] TabbyCat: and the patch on WikimediaMessages received a "recheck" by a trusted user and thus got a V+2 [14:47:19] hashar: but the patch at wikimediamessages didn't event got checked [14:47:27] I had to "recheck" [14:47:33] MA = me [14:48:25] so I wonder why all their other patches were run basic tests except that one on wikimediamessages [14:48:50] because even when I first comented "recheck", he rebased the patch and the check was aborted [14:48:58] and no other job was created at zuul [14:52:27] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:33] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [15:03:57] Quick question, is there a "guest" account or something for icinga.wikimedia.org? Or could someone tell me what version of Icinga is currently installed? [15:04:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:12:09] Reception123 that's prod [15:12:26] you need to be in the ops group or the nda group. [15:12:48] Ok. Who can be asked about the version then? [15:12:58] Reception123 it's running 1.x [15:13:06] ok, thanks :) [15:13:12] also you can ask ops :) [15:13:16] they maintain it :) [15:15:21] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489483 (10Paladox) @hashar did you upgrade to npm3 during the morning? :) [15:16:46] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489489 (10hashar) Na I did a bunch of tests this morning and caught up a few more peer dependencies issues :/ [15:17:51] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10User-MarcoAurelio: Requesting access to deployment-prep for @MarcoAurelio - https://phabricator.wikimedia.org/T172182#3489491 (10bd808) @greg, can you find someone to work with @MarcoAurelio to get access to deployment-prep? I looked a bit for a "h... [15:19:40] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [15:29:59] hashar nodepool does not seem to be showing on https://integration.wikimedia.org/zuul/ [15:30:00] or [15:30:02] am i wrong [15:30:05] i see no blue [15:30:13] ah [15:30:15] was wrong [15:30:27] i did not scroll to the bottom [15:34:52] !log Refreshing nodepool Jessie image to bump npm from 2.x to 3.8.x T161861 [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:34:55] T161861: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861 [15:35:02] James_F: paladox lets unleash npm 3 :) [15:35:07] :) [15:35:15] i've already ran puppet on jenkins-slave-01 [15:35:18] it is going to take a dozen of minutes or so to update it [15:35:23] * James_F nods. [15:35:37] will report / !log here as needed :} [15:36:00] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:36:14] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:36:34] :) [15:37:25] James_F was there a reason why we picked npm 3.8 instead of 3.10? 
[15:42:23] paladox: that is the version that comes with node 6 [15:42:30] ah ok [15:43:09] hashar but 3.10 is in nodejs 6.9 whereas npm 3.8 is in nodejs 6.0 :) [15:44:00] !log Debug: Executing '/usr/bin/npm install -g npm@3.8.3' - T161861 [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:44:08] T161861: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861 [15:45:10] :) [15:45:21] it takes a while to publish the new image [15:45:25] ok [15:45:51] !log Image snapshot-ci-jessie-1501601670 in wmflabs-eqiad is ready && purging old instances T161861 [15:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:46:27] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:46:48] though they would purge by themselves [15:46:55] :) [15:54:48] James_F: so npm should be upgraded now [15:54:56] James_F: and really I am tempted to next switch to yarn :) [15:56:34] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489604 (10hashar) [15:56:37] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3489603 (10hashar) [16:00:19] James_F: would you be so kind as to send the announce to wikitech-l ? :} [16:02:23] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3489630 (10hashar) We now have npm 3.8.3 (the version that came with nodejs 6.0). I have rebuild the job: http... [16:03:07] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489631 (10hashar) npm 3.8.3 is now on the CI instances \O/ [16:04:52] 10Release-Engineering-Team (Watching / External), 10Scap, 10ORES, 10Scoring-platform-team: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3489639 (10Halfak) Gotcha. Thanks. I think we're interested in putting some energy behind this if you c... [16:05:54] hashar: I will once I’m out of meetings. [16:06:07] James_F: awesome. Thank you! [16:10:59] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:13] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:14:36] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:21:24] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [16:32:26] heyaaa [16:32:36] anyone know why I can't edit gerrit permissions for analytics/wikistats gerrit repO? 
[16:32:39] git clone https://gerrit.wikimedia.org/r/analytics/wikistats [16:32:40] oops [16:32:43] git clone https://gerrit.wikimedia.org/r/analytics/wikistats [16:32:47] AGh [16:33:01] Upload denied for project 'analytics/wikistats' [16:33:25] hmm [16:33:35] someone put the phabricator repo as parent to it [16:33:36] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/wikistats,access [16:33:41] yeah [16:33:45] it can be 'analytics' [16:33:47] that is fine [16:34:15] ah i see [16:34:27] can you change it? [16:34:33] 10Browser-Tests-Infrastructure, 10MinervaNeue, 10Reading-Web-Backlog: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3489734 (10Jdlrobson) #40 and #41 failed with: expected "" to match "This page... [16:34:39] I doint have acess has to be done by an admin [16:34:41] hm [16:34:41] ok [16:35:52] RainbowSprinkles would you be able to do ^^ please. [16:39:54] or twentyafterfour ^^ please [16:39:55] :) [16:40:21] twentyafterfour also could the gerrit link in https://phabricator.wikimedia.org/diffusion/ANWS/manage/uris/ be changed from mirror to cloning from that url please? :) [16:45:09] Fannnnntastic [16:45:26] I don't have permission to edit the ACL from the UI [16:45:27] hehe [16:46:09] 10Release-Engineering-Team (Kanban), 10Release, 10Train Deployments: 1.30.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T170631#3489754 (10mmodell) ^ I linked the wrong task. wmf.12 is over there: {icon arrow-left} T168053 [16:46:26] ottomata: Inherits from analytics now. [16:46:57] And phab swapped from mirror to observe [16:47:10] thanks RainbowSprinkles :) [16:47:11] thank you! [16:47:24] yw [16:48:30] RainbowSprinkles could you click the update now button at https://phabricator.wikimedia.org/diffusion/ANWS/manage/status/ please? :). [16:48:31] ottomata: Future reference, any gerrit admin can do the following: `ssh -p 29418 gerrit.wikimedia.org gerrit set-project-parent -p analytics analytics/wikistats` [16:48:50] ok, i will promptly say "cool!" and then forget that :) [16:48:54] thanks :) [16:48:55] Heh [16:48:55] am I a gerrit admin? [16:49:01] All ops are by default [16:49:03] ah ok [16:49:06] (admins includes ldap/ops) [16:49:18] ottomata lol [16:52:59] paladox: It updated [16:53:08] thanks RainbowSprinkles :) :) [16:54:40] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:48:05] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team (Kanban), 10JavaScript, 10Patch-For-Review, and 2 others: Port Echo Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T171848#3489998 (10Catrope) [17:54:17] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3490014 (10Jdforrester-WMF) Can we mark as Resolved then? [18:03:51] twentyafterfour: in https://gerrit.wikimedia.org/r/#/c/369001/4/modules/phabricator/manifests/vcs.pp $basedir is used, in: [18:04:03] file { "${basedir}/phabricator/scripts/ssh/": [18:04:48] you are changing $basedir above it [18:04:57] mutante he seems to fix it in that commit [18:05:09] eh? like it never worked? 
[18:05:13] /srv/phab/ links to /srv/phab/phabricator [18:05:16] mutante: yeah, basedir should be /srv/phab already so it's a no-op I think [18:05:32] well it's clearly changed from / to /srv/phab [18:05:41] just changing the default [18:05:49] but isn't it passed in from elsewhere? [18:05:52] /srv/phab is kind of a legacy accident prior to doing things Right [18:06:09] (copied over from initial test installs in labs) [18:06:21] RainbowSprinkles: I can't say the way we do it now is "right" [18:06:48] Lemme rephrase: before we knew this was gonna be A Thing and should've been Done Right [18:06:49] :) [18:06:56] but that's not really related to the IP addresses? [18:07:15] mutante: yeah the change is unnecessary and unrelated [18:07:47] ok [18:07:48] that's why it could work [18:07:50] mutante see https://gerrit.wikimedia.org/r/#/c/369001/4/modules/phabricator/manifests/init.pp [18:08:02] line 276 [18:08:06] * twentyafterfour has a bad habbit of fixing unrelated things [18:08:17] * twentyafterfour can't resist sometimes [18:10:31] i'll merge it per the compiler no-op [18:12:21] :) [18:34:26] Notice: /Stage[main]/Phabricator::Vcs/Notify[Warning: phabricator::vcs::listen_address is empty]/message: defined 'message' as 'Warning: phabricator::vcs::listen_address is empty' [18:34:29] Notice: Finished catalog run in 14.35 seconds [18:34:35] :) [18:34:36] twentyafterfour: ^ works as designed :) thx [18:34:46] :) [18:34:47] removing the access "hack" [18:34:54] and we can enable logmail and dumps i guess [18:35:00] or at least tomorrow [18:35:16] i mean this https://gerrit.wikimedia.org/r/#/c/369447/1/hieradata/hosts/phab1001.yaml [18:35:23] yeah [18:35:26] the first 2 lines were disabling stuff that is active on iridium [18:35:32] ok [18:39:03] :) [18:43:02] mutante: scap deploy even works, sort of [18:45:06] :) [18:49:03] :) [18:49:20] paladox: wasnt there a "support scap" change , heh [18:49:39] oh, for gerrit of course [18:49:44] yep [18:49:46] ok it actually works, not just sort-of [18:49:46] for gerrit [18:49:47] :) [18:49:52] nice, twentyafterfour [19:08:20] twentyafterfour: >>> UNRECOVERABLE FATAL ERROR <<< [19:08:23] Maximum execution time of 10 seconds exceeded [19:08:31] /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/libphutil/src/utils/utils.php:604 [19:08:38] at https://phabricator.wikimedia.org/maniphest/report/ [19:08:44] that's known [19:08:48] is it already? [19:08:53] some weeks ago it was working [19:08:58] I guess we are too big now? [19:09:12] it works sometimes and other times not [19:09:40] it's been doing that for over a year [19:09:42] https://phabricator.wikimedia.org/T125357 [19:09:44] Sagan ^^ [19:11:11] ah, ok [19:45:24] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Watching / External), 10Cloud-VPS, 10Nodepool, and 2 others: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492#3490615 (10hashar) @Andrew wrote: > This was at 6, which has historically... 
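For the nodepool questions that keep coming up in this log (T170492 above, and the stuck deletions later tonight), the pool can also be inspected from the nodepool CLI rather than the Grafana boards. A sketch; it assumes shell access to the host running the nodepool daemon, which the log itself does not confirm:

    nodepool list          # instances and their states: building / ready / used / delete
    nodepool image-list    # snapshot images, e.g. the snapshot-ci-jessie-* image refreshed earlier
    nodepool alien-list    # instances that exist in OpenStack but that nodepool no longer tracks

The Zuul status page linked throughout ( https://integration.wikimedia.org/zuul/ ) shows the same pool counts at the bottom, which is where SMalyshev gets pointed later in the evening.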
[19:48:30] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3490636 (10hashar) [19:48:33] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3490633 (10hashar) 05Open>03Resolved a:03hashar I have sent hundred of patches to fix up peerDependencies in repositories... [19:48:43] James_F: thank you for the assistance with the npm versio bump to 3. Much appreciated! [19:49:18] hashar: Thank *you*. [19:49:29] hashar: Now for npm 5! :-) [20:06:44] 10Continuous-Integration-Config, 10MediaWiki-General-or-Unknown: Make sure extensions using composer/npm for development dependencies have the right .gitignore rules - https://phabricator.wikimedia.org/T116434#3490712 (10hashar) I have added a few ignores for node_modules and was wondering which syntax to use... [20:09:44] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MediaWiki-General-or-Unknown: Make sure extensions using composer/npm for development dependencies have the right .gitignore rules - https://phabricator.wikimedia.org/T116434#3490716 (10hashar) [20:10:18] James_F: hopefully it is straightforward though I have looked at npm 4 / npm 5 change logs :/ [20:10:35] * paladox runs npm 5 [20:10:47] paladox: ahhh [20:10:54] paladox: have you tried setting up a jenkins slave with stretch ? [20:10:58] yes [20:10:59] seems you filled a few tasks about it [20:11:11] I have already upgraded jenkins-slave-01 to stretch [20:11:14] oh [20:11:20] and works [20:11:29] i had to cherry pick my fixes i did [20:11:31] I am not sure whether we will use a stretch jenkins slave though [20:11:33] including blocking hhvm [20:11:45] I am tempted to move the use cases directly to Docker instances [20:11:53] yeh [20:12:07] though I am not sure when we will actually need stretch or when we will consider Docker ready [20:12:17] yep [20:12:36] we will need stretch in a couple of years [20:12:41] (wishes we had a project manager to deal with the stretch switch) [20:12:44] jessie will be end of life in a couple of years [20:12:49] lol [20:12:53] yup [20:13:02] but I can guarantee you that production is going to switch to stretch [20:13:09] yeh [20:13:11] since developers will want more recent deb packages [20:13:21] and thus CI gotta follow (if not be ahead) of the move [20:13:37] though phabricator will never be on stretch until php 7.1 is in the debian repo [20:13:47] :] [20:13:59] or php 5.6 but that's eol soon [20:16:00] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490738 (10demon) [20:17:02] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490741 (10Dzahn) > I have a technical person who is in charge Is that person here on Phabrica... 
[20:17:10] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490742 (10demon) a:05bbogaert>03None [20:20:18] (03PS1) 10XZise: Decode text as UTF8 [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369472 [20:27:24] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490755 (10Dzahn) Hey, so from Operations point of view.. it's a little short notice but we can... [20:35:10] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3470314 (10Nuria) It did thank you. Our tests are failing cause it looks like a module is missing (cc @fdans )... [20:41:26] (03PS1) 10XZise: Not always is the last line empty [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369496 [20:42:51] (03PS1) 10XZise: Fix actual tested minimum required number of lines [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369499 [20:43:36] 10Release-Engineering-Team (Kanban), 10Patch-For-Review, 10Release, 10Train Deployments: 1.30.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T168053#3490809 (10mmodell) [20:54:29] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490872 (10eyoung) This was supposed to be something that our Volunteer team was going to handle... [20:58:13] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490687 (10MaxSem) > I thought I had added him on the original request. I've tried adding as a s... [21:11:50] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490973 (10Dzahn) @eyoung I have mailed Antoine and asked him to please join us here on the tick... [21:15:23] RainbowSprinkles wondering could you +1 or -1 https://gerrit.wikimedia.org/r/#/c/363726/ please? the new key is in the private repo + i've added back ssh:: call. Looks good to me (and tested) :) [21:17:55] is there a problem with zuul? I see the queue is growing longer... [21:24:11] probably not zuul (zuul is rarely the culprit) but more likely either nodepool or cloud vps, I see no active nodepool instances running tests: https://integration.wikimedia.org/ci/ [21:24:24] thcipriani: le sigh ^ [21:26:21] * thcipriani looks [21:28:32] greg-g: We chould move integration.wikimedia.org/zuul to integration.wikimedia.org/zuul-probably-isnt-to-blame [21:28:48] lol [21:29:19] RainbowSprinkles or put a red notice saying if you see no tests running blame nodepool heh [21:30:32] :/ [21:32:07] bah it is out of nodes :( [21:32:12] hrm, I'm able to delete nodes, but they do seem to be stuck in "delete". 
I see things in the log like: Exception: Timeout waiting for server 9de9696a-d001-4c13-b3b5-42311a6f4b5b deletion in wmflabs-eqiad [21:32:40] yeah nodepool delete FOO just flag the node for deletion int he mysql database [21:32:53] the actual deletion requests are throttled at 1 every 8 seconds :( [21:33:34] so if you have like 8 instances flagged for deletion, it takes 8 x 8 = 64 seconds to send the delete requests [21:33:53] and in between there are requests to list the servers so it actually takes more time [21:34:17] if some nodes are not more in the list of servers, they are considered as successfully deleted, the instance is removed from the mysql database [21:34:21] and new instances can be created [21:34:27] but there are also timeout messages in the logs? It seems to take ~30seconds to manually delete [21:34:27] at once again 1 every 8 seconds [21:34:51] manually delete? Do you mean via horizon or openstack server delete ? [21:35:23] nodepool delete --now [instanceid] [21:35:30] ah yeah nodepool.NodeDeleter: Exception deleting node 767482: [21:35:30] Instance 64ed0f80-c2be-44a3-93d2-63167f2a41e8 could not be found. (HTTP 404) [21:35:32] --now ! [21:35:34] man [21:35:51] I wasnt aware of that flag :D [21:36:43] doh it takes a while to delete one [21:36:58] nodepool looks more or less fine to me, what's the problem? [21:37:11] Lots of vms in 'active' state, none in 'error'... [21:37:19] andrewbogott: just a big backup on https://integration.wikimedia.org/zuul/ [21:37:35] is that nodepool misbehaving? Or just lots to do? [21:37:42] I concur with andrew in that operations seem normal [21:37:45] I'm starting to think it's just lots to do [21:37:50] lots of patches yeah [21:37:51] ok [21:38:00] ping me if you think something's broken :) [21:38:01] but the openstack side seems to behave properly [21:38:03] last time I looked at this it was a few loooong running things that ate up slots for too long [21:38:19] nothing was moving when I checked initially, and I saw some delete timeouts, freaked me out for a minute :) [21:38:56] something that track duration of test by type would be cool even in a new container based system [21:39:09] everything was just in "delete" or "building" so no movement on the zuul page. Seems like things are moving through now. [21:39:10] there will always be constraints and a few tests I suspect eat up slots for far too long for no good reason [21:39:38] ah it had some time out while trying to delete instances since starting at 21:25 [21:39:58] and some others at 21:00 - 21:10 [21:40:42] andrewbogott: chasemp: there have been some launch errors and delete timeout around 21:00 and 21:25-21:30 [21:40:44] https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=12&fullscreen&orgId=1 [21:41:06] shows nodepool could not reach spawned instances after X time (iirc 5 minutes) [21:41:22] so I guess something flapped / restarted at those marks [21:42:21] and the rabbitmq stats show nothing interesting ( https://grafana.wikimedia.org/dashboard/db/labs-queue-server-rabbitmq?orgId=1&from=now-12h&to=now ) :( [21:43:06] hard to know atm as nodepool cleaned up all the evidence [21:43:28] SMalyshev: seems it was transient. Some instances could not be spawned nor deleted and thus the pool of instances could not replenish. 
Seems it is fine again now [21:43:53] hashar: yeah my patch is finally being processed after 40 mins in the queue [21:44:03] chasemp: it has logs of the instance id, but there is little reason given for the actual timeout :\ [21:44:04] looks like it's recovering [21:44:22] the graphs also trending down [21:44:36] SMalyshev: and if you browse down at the bottom of the zuul page, there are three little buttons. One of them is Nodepool and shows the state of the pool [21:44:49] and ultimately https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&orgId=1 [21:45:00] hashar: yeah, if only I could understand what those graphs mean :) [21:45:29] SMalyshev: green = available to run tests, other colors: busy :D [21:46:03] green has been kinda thin looking at the graphs... 7 building, 11 deleting, 2 ready, 4 used [21:46:21] yup the queue is busy :/ [21:46:45] ok looks like it's moving along [21:52:05] thcipriani: that delete --now command is magic [21:52:20] thcipriani: the time to delete an instance is quite noticeable ( 30+ secs) [21:54:00] hashar: yeah, I don't often run that command except when instances are in "delete" for a long time, and it usually times out or hangs forever, so I'm not sure what an appropriate length of time for that command to complete is. [22:03:08] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Nodepool: Investigate nodepool slow deletion - https://phabricator.wikimedia.org/T172229#3491178 (10hashar) [22:03:46] thcipriani: I am not sure either. But it is definitely throttled by the rate setting ( https://phabricator.wikimedia.org/T172229 ) [22:04:04] seems it does a few preliminary queries before deleting [22:07:04] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Nodepool: Investigate nodepool slow deletion - https://phabricator.wikimedia.org/T172229#3491199 (10hashar) [22:14:19] chasemp: andrewbogott: looks like labcontrol1001 starts swapping, which more or less correlates with the past issues and the rises in nodepool rates [22:14:31] the long view for labcontrol1001 swap https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=16&fullscreen&orgId=1&var-server=labcontrol1001&var-network=bond0&from=now-30d&to=now [22:14:45] interesting hashar [22:14:52] https://phabricator.wikimedia.org/T170492 [22:15:02] we are very close to moving off the puppetmaster overhead there so maybe that gets us back some RAM [22:15:24] it is really just a speculation, but the issue on July 12th was correlated with the swap rising over 2 or 3 days [22:15:35] until it got full and whatever service exploded [22:16:44] good thought [22:16:54] what I don't get though [22:17:06] is that the machine apparently had 6G of free space [22:17:21] so maybe it is a process that takes 6+GBytes of memory that attempts to fork [22:18:05] maybe linux would try to allocate what it is missing from the swap and eventually bails out because there is not enough memory [22:18:45] the last time I had to debug puppetmasters and memory issues I began drinking heavily [22:18:53] let's see how separating those concerns affects things [22:18:59] nova + puppetmasters [22:19:06] well [22:19:18] ahhhhh [22:19:36] linux vm.overcommit_memory https://phabricator.wikimedia.org/T162166#3158827 [22:19:52] and it is you Chase that pointed it out to me when the Scribunto CI tests were randomly failing :] [22:21:00] chasemp: I am going to bed, but most probably a service on labcontrol1001 will fail soonish and cause some issue on openstack [22:21:15]
what makes you think that? [22:21:26] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=16&fullscreen&orgId=1&var-server=labcontrol1001&var-network=bond0&from=now-30d&to=now [22:21:28] the swap graph [22:21:35] ah [22:21:38] which more or less correlates with the bump of nodepool rates :] [22:22:38] but most probably it is a red herring! [22:22:53] or another effect of a common issue [22:23:01] I'd bet on one of those two [22:23:44] we will see :] [22:23:50] anyway for now I am heading to bed [22:24:04] later on hashar [23:03:40] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3491406 (10bd808) There is an existing [[https://tools.wmflabs.org/openstack-browser/project/wik... [23:04:56] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Watching / External), 10Cloud-VPS, 10Nodepool, and 2 others: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492#3491412 (10chasemp) I did a small bit of poking today. It seems we are u... [23:59:43] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team, 10MinervaNeue, 10Reading-Web-Backlog, and 2 others: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3446357 (10Jdlrobson) Not having mu...
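To close out the two threads the log ends on, the slow deletions tracked in T172229 and the labcontrol1001 swap/overcommit speculation around T162166 and T170492, here is a rough sketch of the checks being discussed. The config path is the stock nodepool default and is an assumption for illustration, not a fact from the log:

    # Deletions are throttled by the provider's rate setting (seconds between API calls),
    # so N instances flagged for deletion cost roughly N * rate seconds in requests alone;
    # hashar's example above: 8 instances at one call every 8 seconds is 8 * 8 = 64 seconds.
    grep -n 'rate' /etc/nodepool/nodepool.yaml    # assumed default config path

    # The memory angle on labcontrol1001:
    cat /proc/sys/vm/overcommit_memory            # 0 = heuristic, 1 = always overcommit, 2 = strict accounting
    free -m                                       # headline RAM and swap usage
    vmstat 5 3                                    # si/so columns show active swap-in/swap-out

A strict overcommit setting (2) can make fork() of a large process fail even when memory looks free, which would be consistent with the 6+ GB scenario hashar describes.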