[00:00:47] Is there a page on wikitech or mw.o that answers the question "how do I get access to the Beta Cluster project?" [00:01:10] Context is T172040 [00:01:11] T172040: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040 [00:02:02] paladox: i already started, just got distracted by a security thig [00:02:08] ok :) [00:03:12] bd808: you can tell me to go for a walk instead -- maybe I'm asking too much lately; sorry for that [00:03:14] mutante you linked to the wrong bug [00:03:22] in your change [00:03:31] it should be T163938 and not the codfw one :) [00:03:31] T163938: replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938 [00:04:28] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3488152 (10bd808) >>! In T172040#3487685, @MarcoAurelio wrote: > @bd808 And could then be possible to run that... [00:04:45] TabbyCat: asking is never bad :) [00:04:55] * Reedy steals bd808 car [00:05:07] Reedy: the keys are in it [00:05:10] :D [00:05:25] lol [00:05:36] but we don't want no carrucha -- mercedes, lexus, audi... [00:05:53] those cars good [00:06:14] paladox: please fix, need to do one other thing really quick [00:06:18] thanks [00:06:22] ok [00:08:12] TabbyCat: I would add you to the deployment-prep project, but I haven't helped with it for more than a year so you should probably find someone else to help you try to figure out how to run that script. [00:09:19] bd808: yes, well, I'd not run anything w/o knowing how to and its results [00:09:28] don't want to ruin anyone's job [00:10:12] off to bed now, see you tomorrow [00:10:36] time flys by so much TabbyCat :) [00:10:38] already 1am [00:10:50] paladox: 2.10 AM [00:10:55] lol [00:11:05] wtf am I doing here yet xD [00:14:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [00:22:32] RECOVERY - Puppet errors on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:33:42] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [01:23:44] Project selenium-MinervaNeue » chrome,beta,Linux,BrowserTests build #40: 04FAILURE in 21 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/40/ [01:26:32] mutante: I'm not sure what creates that directory, it may not be in puppet [01:27:40] nvm I see you figured it out [01:30:36] yea, we did.. 
just got distracted by security talk [01:31:02] i'll add the missing IPv6 record now [01:31:07] that will fix ferm issue [01:31:11] that will fix puppet run [01:32:02] Project selenium-MinervaNeue » firefox,beta,Linux,BrowserTests build #40: 04FAILURE in 29 min: https://integration.wikimedia.org/ci/job/selenium-MinervaNeue/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/40/ [01:34:59] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:35:21] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:42:42] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:10:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [02:15:00] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [02:22:41] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [04:07:33] Project selenium-MultimediaViewer » safari,beta,OS X 10.9,BrowserTests build #472: 04FAILURE in 11 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=safari,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=OS%20X%2010.9,label=BrowserTests/472/ [04:22:40] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [04:25:17] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [04:25:55] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [04:36:22] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [04:37:12] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [05:00:17] RECOVERY - Puppet errors on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [05:00:57] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:02:41] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:11:21] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [05:12:13] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:22:41] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:23:41] PROBLEM - Puppet errors on deployment-ores-redis-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [05:33:13] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [05:44:25] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [05:44:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [06:02:40] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [06:03:43] RECOVERY - Puppet errors on deployment-ores-redis-01 is OK: OK: Less than 1.00% above the threshold [0.0] 
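On mutante's 01:31 note above about adding the missing IPv6 record: ferm rules that use @resolve() on a hostname fail when the name has no record for the address family being resolved, and a failing ferm rule in turn fails the whole puppet run. A minimal way to verify that kind of fix, sketched here with phab1001.eqiad.wmnet as an assumed example host (the log does not name the host, and the exact ferm invocation is not confirmed either):

    dig +short A    phab1001.eqiad.wmnet    # existing v4 record
    dig +short AAAA phab1001.eqiad.wmnet    # the newly added v6 record should now resolve
    # then, on the affected host, re-test the ruleset and re-run the agent:
    sudo ferm --noexec /etc/ferm/ferm.conf  # parse and resolve the rules without applying them
    sudo puppet agent --test

If both lookups succeed and ferm parses cleanly, the next puppet run should go back to green on its own.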
[06:13:15] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:21:57] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [06:24:23] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [06:24:44] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [06:34:11] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [06:36:11] PROBLEM - Puppet errors on deployment-eventlog02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [06:52:46] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [07:01:56] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:11:10] RECOVERY - Puppet errors on deployment-eventlog02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:14:14] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:23:40] PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [07:24:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [07:32:48] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:34:44] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [07:35:14] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:03:39] RECOVERY - Puppet errors on deployment-secureredirexperiment is OK: OK: Less than 1.00% above the threshold [0.0] [08:10:12] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:12:28] 10Release-Engineering-Team (Kanban), 10Operations, 10Phabricator: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839436 (10mmodell) p:05Normal>03High [08:14:29] 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Use pwstore (a shared gpg-encrypted password store) for Release Engineering related passwords - https://phabricator.wikimedia.org/T139093#3488596 (10mmodell) [09:00:11] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0] [09:02:22] PROBLEM - Puppet errors on deployment-conf03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [09:05:14] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0] [09:42:23] RECOVERY - Puppet errors on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [09:56:11] I am going to stop CI for a few minutes to push a mass amount of changes to the mediawiki extensions [10:12:47] !log Stopped Zuul / CI for mass mediawiki extension changes [10:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:13:47] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [10:14:38] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python 
/usr/bin/zuul-server [10:15:08] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused amusso Stopped Zuul for mass spam to Gerrit [10:15:08] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server amusso Stopped Zuul for mass spam to Gerrit [10:21:47] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [10:22:38] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [10:24:37] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused amusso Mass spam to Gerrit and puppet disabled to prevent Zuul from coming back up [10:24:37] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server amusso Mass spam to Gerrit and puppet disabled to prevent Zuul from coming back up [10:25:02] ^^^ that is me [10:43:41] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [10:45:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:55:45] PROBLEM - Puppet errors on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [11:18:22] PROBLEM - Puppet errors on deployment-ms-be03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:20:15] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:20:33] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [11:21:13] PROBLEM - Puppet errors on deployment-sca02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:22:55] PROBLEM - Puppet errors on deployment-apertium02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [11:23:39] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [11:30:45] RECOVERY - Puppet errors on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [11:41:00] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Wikidata, and 3 others: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432#3488904 (10Aleksey_WMDE) Can I ask what the status is? [11:53:45] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [12:02:55] RECOVERY - Puppet errors on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:33:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:45:33] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3489003 (10MarcoAurelio) @bd808 I really don't feel confortable to ask people randomly. It can be disturbing. I... 
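A note on the two contint1001 alerts above (zuul_gearman_service and zuul_service_running): both are plain liveness checks, and when Zuul is stopped on purpose, as hashar did for the mass extension changes, they can be reproduced by hand to confirm things are in the expected state. A rough sketch, run on contint1001; the real checks presumably go through the standard Icinga/NRPE plugins, this just mimics what they test:

    nc -z 127.0.0.1 4730 && echo "gearman listening" || echo "gearman down"   # zuul_gearman_service
    pgrep -fc '^/usr/share/python/zuul/bin/python /usr/bin/zuul-server'       # zuul_service_running, prints the process count
    # the gearman admin protocol also answers a plain "status" query on the same port:
    echo status | timeout 2 nc 127.0.0.1 4730

With Zuul intentionally down and puppet disabled, the ACKNOWLEDGEMENT entries are what keep Icinga from escalating the two criticals in the meantime.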
[12:46:33] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [12:49:57] (03PS1) 10Aude: Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 [12:50:03] (03CR) 10Aude: [C: 032] Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 (owner: 10Aude) [12:50:39] (03Merged) 10jenkins-bot: Bump wikidata to wmf/1.30.0-wmf.12 [tools/release] - 10https://gerrit.wikimedia.org/r/369346 (owner: 10Aude) [13:15:25] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:21:34] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [13:50:15] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-WikimediaMaintenance, 10User-MarcoAurelio: Allow dumpInterwiki.php to be run on Cloud VPS/Toolforge - https://phabricator.wikimedia.org/T172040#3489134 (10MarcoAurelio) 05Open>03declined Declined in favor or the task above. [13:53:01] bd808: I closed that ^ [13:53:19] given that there's that other solution [13:53:37] RainbowSprinkles: I left you a message via conph [13:54:47] PROBLEM - Puppet errors on deployment-jobrunner02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [13:55:27] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [14:12:26] PROBLEM - Puppet errors on deployment-mediawiki04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:12:34] PROBLEM - Puppet errors on deployment-logstash2 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [14:15:46] 10Beta-Cluster-Infrastructure, 10Operations, 10Ops-Access-Requests, 10User-MarcoAurelio: Requesting access to deployment-prep for @MarcoAurelio - https://phabricator.wikimedia.org/T172182#3489233 (10MarcoAurelio) [14:33:13] 10Continuous-Integration-Config, 10Tool-stewardbots, 10Patch-For-Review, 10User-MarcoAurelio: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#3489258 (10MarcoAurelio) 05Open>03Resolved a:03MarcoAurelio I think this is resolved for now. In case we need mor... [14:41:29] hashar: weirdly jenkins tests +1 on https://gerrit.wikimedia.org/r/#/q/owner:geoffreytrang%2540gmail.com+status:open but refuses on wikimediamessages? [14:44:42] PROBLEM - Puppet errors on deployment-mathoid is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:46:44] TabbyCat: because that author is not whitelisted, its patches only receives CR+1 [14:46:59] 10Gerrit, 10Release-Engineering-Team (Backlog), 10Patch-For-Review: Update gerrit to 2.14.2 - https://phabricator.wikimedia.org/T156120#3489354 (10Paladox) Upstream won't merge the sshd patch that includes the fix for edcsa so we will need to either cherry pick it when we build or wait until 2.15. 
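As an aside on the tools/release bump aude merged above (change 369346, patchset 1): any Gerrit change referenced in this log can be pulled down locally through Gerrit's standard refs/changes namespace. A sketch, using the same anonymous HTTPS URLs seen elsewhere in this log; the ref encodes the last two digits of the change number, then the change number, then the patchset:

    git clone https://gerrit.wikimedia.org/r/tools/release
    cd release
    git fetch origin refs/changes/46/369346/1   # 46 = last two digits of change 369346, patchset 1
    git checkout FETCH_HEAD                     # inspect the exact patchset that was reviewed and merged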
[14:47:02] TabbyCat: and the patch on WikimediaMessages received a "recheck" by a trusted user and thus got a V+2 [14:47:19] hashar: but the patch at wikimediamessages didn't event got checked [14:47:27] I had to "recheck" [14:47:33] MA = me [14:48:25] so I wonder why all their other patches were run basic tests except that one on wikimediamessages [14:48:50] because even when I first comented "recheck", he rebased the patch and the check was aborted [14:48:58] and no other job was created at zuul [14:52:27] RECOVERY - Puppet errors on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0] [14:52:33] RECOVERY - Puppet errors on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [15:03:57] Quick question, is there a "guest" account or something for icinga.wikimedia.org? Or could someone tell me what version of Icinga is currently installed? [15:04:47] RECOVERY - Puppet errors on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:12:09] Reception123 that's prod [15:12:26] you need to be in the ops group or the nda group. [15:12:48] Ok. Who can be asked about the version then? [15:12:58] Reception123 it's running 1.x [15:13:06] ok, thanks :) [15:13:12] also you can ask ops :) [15:13:16] they maintain it :) [15:15:21] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489483 (10Paladox) @hashar did you upgrade to npm3 during the morning? :) [15:16:46] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489489 (10hashar) Na I did a bunch of tests this morning and caught up a few more peer dependencies issues :/ [15:17:51] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10User-MarcoAurelio: Requesting access to deployment-prep for @MarcoAurelio - https://phabricator.wikimedia.org/T172182#3489491 (10bd808) @greg, can you find someone to work with @MarcoAurelio to get access to deployment-prep? I looked a bit for a "h... [15:19:40] RECOVERY - Puppet errors on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [15:29:59] hashar nodepool does not seem to be showing on https://integration.wikimedia.org/zuul/ [15:30:00] or [15:30:02] am i wrong [15:30:05] i see no blue [15:30:13] ah [15:30:15] was wrong [15:30:27] i did not scroll to the bottom [15:34:52] !log Refreshing nodepool Jessie image to bump npm from 2.x to 3.8.x T161861 [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:34:55] T161861: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861 [15:35:02] James_F: paladox lets unleash npm 3 :) [15:35:07] :) [15:35:15] i've already ran puppet on jenkins-slave-01 [15:35:18] it is going to take a dozen of minutes or so to update it [15:35:23] * James_F nods. [15:35:37] will report / !log here as needed :} [15:36:00] PROBLEM - Puppet errors on deployment-eventlogging04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:36:14] PROBLEM - Puppet errors on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:36:34] :) [15:37:25] James_F was there a reason why we picked npm 3.8 instead of 3.10? 
[15:42:23] paladox: that is the version that comes with node 6 [15:42:30] ah ok [15:43:09] hashar but 3.10 is in nodejs 6.9 whereas npm 3.8 is in nodejs 6.0 :) [15:44:00] !log Debug: Executing '/usr/bin/npm install -g npm@3.8.3' - T161861 [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:44:08] T161861: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861 [15:45:10] :) [15:45:21] it takes a while to publish the new image [15:45:25] ok [15:45:51] !log Image snapshot-ci-jessie-1501601670 in wmflabs-eqiad is ready && purging old instances T161861 [15:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:46:27] PROBLEM - Puppet errors on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:46:48] though they would purge by themselves [15:46:55] :) [15:54:48] James_F: so npm should be upgraded now [15:54:56] James_F: and really I am tempted to next switch to yarn :) [15:56:34] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489604 (10hashar) [15:56:37] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3489603 (10hashar) [16:00:19] James_F: would you be so kind as to send the announce to wikitech-l ? :} [16:02:23] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3489630 (10hashar) We now have npm 3.8.3 (the version that came with nodejs 6.0). I have rebuild the job: http... [16:03:07] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3489631 (10hashar) npm 3.8.3 is now on the CI instances \O/ [16:04:52] 10Release-Engineering-Team (Watching / External), 10Scap, 10ORES, 10Scoring-platform-team: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3489639 (10Halfak) Gotcha. Thanks. I think we're interested in putting some energy behind this if you c... [16:05:54] hashar: I will once I’m out of meetings. [16:06:07] James_F: awesome. Thank you! [16:10:59] RECOVERY - Puppet errors on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [16:11:13] RECOVERY - Puppet errors on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:14:36] PROBLEM - Puppet errors on deployment-restbase01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:21:24] RECOVERY - Puppet errors on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [16:32:26] heyaaa [16:32:36] anyone know why I can't edit gerrit permissions for analytics/wikistats gerrit repO? 
[16:32:39] git clone https://gerrit.wikimedia.org/r/analytics/wikistats [16:32:40] oops [16:32:43] git clone https://gerrit.wikimedia.org/r/analytics/wikistats [16:32:47] AGh [16:33:01] Upload denied for project 'analytics/wikistats' [16:33:25] hmm [16:33:35] someone put the phabricator repo as parent to it [16:33:36] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/wikistats,access [16:33:41] yeah [16:33:45] it can be 'analytics' [16:33:47] that is fine [16:34:15] ah i see [16:34:27] can you change it? [16:34:33] 10Browser-Tests-Infrastructure, 10MinervaNeue, 10Reading-Web-Backlog: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3489734 (10Jdlrobson) #40 and #41 failed with: expected "" to match "This page... [16:34:39] I doint have acess has to be done by an admin [16:34:41] hm [16:34:41] ok [16:35:52] RainbowSprinkles would you be able to do ^^ please. [16:39:54] or twentyafterfour ^^ please [16:39:55] :) [16:40:21] twentyafterfour also could the gerrit link in https://phabricator.wikimedia.org/diffusion/ANWS/manage/uris/ be changed from mirror to cloning from that url please? :) [16:45:09] Fannnnntastic [16:45:26] I don't have permission to edit the ACL from the UI [16:45:27] hehe [16:46:09] 10Release-Engineering-Team (Kanban), 10Release, 10Train Deployments: 1.30.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T170631#3489754 (10mmodell) ^ I linked the wrong task. wmf.12 is over there: {icon arrow-left} T168053 [16:46:26] ottomata: Inherits from analytics now. [16:46:57] And phab swapped from mirror to observe [16:47:10] thanks RainbowSprinkles :) [16:47:11] thank you! [16:47:24] yw [16:48:30] RainbowSprinkles could you click the update now button at https://phabricator.wikimedia.org/diffusion/ANWS/manage/status/ please? :). [16:48:31] ottomata: Future reference, any gerrit admin can do the following: `ssh -p 29418 gerrit.wikimedia.org gerrit set-project-parent -p analytics analytics/wikistats` [16:48:50] ok, i will promptly say "cool!" and then forget that :) [16:48:54] thanks :) [16:48:55] Heh [16:48:55] am I a gerrit admin? [16:49:01] All ops are by default [16:49:03] ah ok [16:49:06] (admins includes ldap/ops) [16:49:18] ottomata lol [16:52:59] paladox: It updated [16:53:08] thanks RainbowSprinkles :) :) [16:54:40] RECOVERY - Puppet errors on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:48:05] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team (Kanban), 10JavaScript, 10Patch-For-Review, and 2 others: Port Echo Selenium tests from Ruby to Node.js - https://phabricator.wikimedia.org/T171848#3489998 (10Catrope) [17:54:17] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3490014 (10Jdforrester-WMF) Can we mark as Resolved then? [18:03:51] twentyafterfour: in https://gerrit.wikimedia.org/r/#/c/369001/4/modules/phabricator/manifests/vcs.pp $basedir is used, in: [18:04:03] file { "${basedir}/phabricator/scripts/ssh/": [18:04:48] you are changing $basedir above it [18:04:57] mutante he seems to fix it in that commit [18:05:09] eh? like it never worked? 
[18:05:13] /srv/phab/ links to /srv/phab/phabricator [18:05:16] mutante: yeah, basedir should be /srv/phab already so it's a no-op I think [18:05:32] well it's clearly changed from / to /srv/phab [18:05:41] just changing the default [18:05:49] but isn't it passed in from elsewhere? [18:05:52] /srv/phab is kind of a legacy accident prior to doing things Right [18:06:09] (copied over from initial test installs in labs) [18:06:21] RainbowSprinkles: I can't say the way we do it now is "right" [18:06:48] Lemme rephrase: before we knew this was gonna be A Thing and should've been Done Right [18:06:49] :) [18:06:56] but that's not really related to the IP addresses? [18:07:15] mutante: yeah the change is unnecessary and unrelated [18:07:47] ok [18:07:48] that's why it could work [18:07:50] mutante see https://gerrit.wikimedia.org/r/#/c/369001/4/modules/phabricator/manifests/init.pp [18:08:02] line 276 [18:08:06] * twentyafterfour has a bad habbit of fixing unrelated things [18:08:17] * twentyafterfour can't resist sometimes [18:10:31] i'll merge it per the compiler no-op [18:12:21] :) [18:34:26] Notice: /Stage[main]/Phabricator::Vcs/Notify[Warning: phabricator::vcs::listen_address is empty]/message: defined 'message' as 'Warning: phabricator::vcs::listen_address is empty' [18:34:29] Notice: Finished catalog run in 14.35 seconds [18:34:35] :) [18:34:36] twentyafterfour: ^ works as designed :) thx [18:34:46] :) [18:34:47] removing the access "hack" [18:34:54] and we can enable logmail and dumps i guess [18:35:00] or at least tomorrow [18:35:16] i mean this https://gerrit.wikimedia.org/r/#/c/369447/1/hieradata/hosts/phab1001.yaml [18:35:23] yeah [18:35:26] the first 2 lines were disabling stuff that is active on iridium [18:35:32] ok [18:39:03] :) [18:43:02] mutante: scap deploy even works, sort of [18:45:06] :) [18:49:03] :) [18:49:20] paladox: wasnt there a "support scap" change , heh [18:49:39] oh, for gerrit of course [18:49:44] yep [18:49:46] ok it actually works, not just sort-of [18:49:46] for gerrit [18:49:47] :) [18:49:52] nice, twentyafterfour [19:08:20] twentyafterfour: >>> UNRECOVERABLE FATAL ERROR <<< [19:08:23] Maximum execution time of 10 seconds exceeded [19:08:31] /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/libphutil/src/utils/utils.php:604 [19:08:38] at https://phabricator.wikimedia.org/maniphest/report/ [19:08:44] that's known [19:08:48] is it already? [19:08:53] some weeks ago it was working [19:08:58] I guess we are too big now? [19:09:12] it works sometimes and other times not [19:09:40] it's been doing that for over a year [19:09:42] https://phabricator.wikimedia.org/T125357 [19:09:44] Sagan ^^ [19:11:11] ah, ok [19:45:24] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Watching / External), 10Cloud-VPS, 10Nodepool, and 2 others: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492#3490615 (10hashar) @Andrew wrote: > This was at 6, which has historically... 
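For the nodepool questions that keep coming up in this log (T170492 above, and the stuck deletions later tonight), the pool can also be inspected from the nodepool CLI rather than the Grafana boards. A sketch; it assumes shell access to the host running the nodepool daemon, which the log itself does not confirm:

    nodepool list          # instances and their states: building / ready / used / delete
    nodepool image-list    # snapshot images, e.g. the snapshot-ci-jessie-* image refreshed earlier
    nodepool alien-list    # instances that exist in OpenStack but that nodepool no longer tracks

The Zuul status page linked throughout ( https://integration.wikimedia.org/zuul/ ) shows the same pool counts at the bottom, which is where SMalyshev gets pointed later in the evening.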
[19:48:30] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3490636 (10hashar) [19:48:33] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: Jobs with Node 6 should also have npm 3 - https://phabricator.wikimedia.org/T161861#3490633 (10hashar) 05Open>03Resolved a:03hashar I have sent hundred of patches to fix up peerDependencies in repositories... [19:48:43] James_F: thank you for the assistance with the npm versio bump to 3. Much appreciated! [19:49:18] hashar: Thank *you*. [19:49:29] hashar: Now for npm 5! :-) [20:06:44] 10Continuous-Integration-Config, 10MediaWiki-General-or-Unknown: Make sure extensions using composer/npm for development dependencies have the right .gitignore rules - https://phabricator.wikimedia.org/T116434#3490712 (10hashar) I have added a few ignores for node_modules and was wondering which syntax to use... [20:09:44] 10Continuous-Integration-Config, 10Release-Engineering-Team, 10MediaWiki-General-or-Unknown: Make sure extensions using composer/npm for development dependencies have the right .gitignore rules - https://phabricator.wikimedia.org/T116434#3490716 (10hashar) [20:10:18] James_F: hopefully it is straightforward though I have looked at npm 4 / npm 5 change logs :/ [20:10:35] * paladox runs npm 5 [20:10:47] paladox: ahhh [20:10:54] paladox: have you tried setting up a jenkins slave with stretch ? [20:10:58] yes [20:10:59] seems you filled a few tasks about it [20:11:11] I have already upgraded jenkins-slave-01 to stretch [20:11:14] oh [20:11:20] and works [20:11:29] i had to cherry pick my fixes i did [20:11:31] I am not sure whether we will use a stretch jenkins slave though [20:11:33] including blocking hhvm [20:11:45] I am tempted to move the use cases directly to Docker instances [20:11:53] yeh [20:12:07] though I am not sure when we will actually need stretch or when we will consider Docker ready [20:12:17] yep [20:12:36] we will need stretch in a couple of years [20:12:41] (wishes we had a project manager to deal with the stretch switch) [20:12:44] jessie will be end of life in a couple of years [20:12:49] lol [20:12:53] yup [20:13:02] but I can guarantee you that production is going to switch to stretch [20:13:09] yeh [20:13:11] since developers will want more recent deb packages [20:13:21] and thus CI gotta follow (if not be ahead) of the move [20:13:37] though phabricator will never be on stretch until php 7.1 is in the debian repo [20:13:47] :] [20:13:59] or php 5.6 but that's eol soon [20:16:00] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490738 (10demon) [20:17:02] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490741 (10Dzahn) > I have a technical person who is in charge Is that person here on Phabrica... 
[20:17:10] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490742 (10demon) a:05bbogaert>03None [20:20:18] (03PS1) 10XZise: Decode text as UTF8 [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369472 [20:27:24] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490755 (10Dzahn) Hey, so from Operations point of view.. it's a little short notice but we can... [20:35:10] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Analytics-Kanban, 10Analytics-Wikistats: Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3470314 (10Nuria) It did thank you. Our tests are failing cause it looks like a module is missing (cc @fdans )... [20:41:26] (03PS1) 10XZise: Not always is the last line empty [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369496 [20:42:51] (03PS1) 10XZise: Fix actual tested minimum required number of lines [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/369499 [20:43:36] 10Release-Engineering-Team (Kanban), 10Patch-For-Review, 10Release, 10Train Deployments: 1.30.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T168053#3490809 (10mmodell) [20:54:29] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490872 (10eyoung) This was supposed to be something that our Volunteer team was going to handle... [20:58:13] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490687 (10MaxSem) > I thought I had added him on the original request. I've tried adding as a s... [21:11:50] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490973 (10Dzahn) @eyoung I have mailed Antoine and asked him to please join us here on the tick... [21:15:23] RainbowSprinkles wondering could you +1 or -1 https://gerrit.wikimedia.org/r/#/c/363726/ please? the new key is in the private repo + i've added back ssh:: call. Looks good to me (and tested) :) [21:17:55] is there a problem with zuul? I see the queue is growing longer... [21:24:11] probably not zuul (zuul is rarely the culprit) but more likely either nodepool or cloud vps, I see no active nodepool instances running tests: https://integration.wikimedia.org/ci/ [21:24:24] thcipriani: le sigh ^ [21:26:21] * thcipriani looks [21:28:32] greg-g: We chould move integration.wikimedia.org/zuul to integration.wikimedia.org/zuul-probably-isnt-to-blame [21:28:48] lol [21:29:19] RainbowSprinkles or put a red notice saying if you see no tests running blame nodepool heh [21:30:32] :/ [21:32:07] bah it is out of nodes :( [21:32:12] hrm, I'm able to delete nodes, but they do seem to be stuck in "delete". 
I see things in the log like: Exception: Timeout waiting for server 9de9696a-d001-4c13-b3b5-42311a6f4b5b deletion in wmflabs-eqiad [21:32:40] yeah nodepool delete FOO just flag the node for deletion int he mysql database [21:32:53] the actual deletion requests are throttled at 1 every 8 seconds :( [21:33:34] so if you have like 8 instances flagged for deletion, it takes 8 x 8 = 64 seconds to send the delete requests [21:33:53] and in between there are requests to list the servers so it actually takes more time [21:34:17] if some nodes are not more in the list of servers, they are considered as successfully deleted, the instance is removed from the mysql database [21:34:21] and new instances can be created [21:34:27] but there are also timeout messages in the logs? It seems to take ~30seconds to manually delete [21:34:27] at once again 1 every 8 seconds [21:34:51] manually delete? Do you mean via horizon or openstack server delete ? [21:35:23] nodepool delete --now [instanceid] [21:35:30] ah yeah nodepool.NodeDeleter: Exception deleting node 767482: [21:35:30] Instance 64ed0f80-c2be-44a3-93d2-63167f2a41e8 could not be found. (HTTP 404) [21:35:32] --now ! [21:35:34] man [21:35:51] I wasnt aware of that flag :D [21:36:43] doh it takes a while to delete one [21:36:58] nodepool looks more or less fine to me, what's the problem? [21:37:11] Lots of vms in 'active' state, none in 'error'... [21:37:19] andrewbogott: just a big backup on https://integration.wikimedia.org/zuul/ [21:37:35] is that nodepool misbehaving? Or just lots to do? [21:37:42] I concur with andrew in that operations seem normal [21:37:45] I'm starting to think it's just lots to do [21:37:50] lots of patches yeah [21:37:51] ok [21:38:00] ping me if you think something's broken :) [21:38:01] but the openstack side seems to behave properly [21:38:03] last time I looked at this it was a few loooong running things that ate up slots for too long [21:38:19] nothing was moving when I checked initially, and I saw some delete timeouts, freaked me out for a minute :) [21:38:56] something that track duration of test by type would be cool even in a new container based system [21:39:09] everything was just in "delete" or "building" so no movement on the zuul page. Seems like things are moving through now. [21:39:10] there will always be constraints and a few tests I suspect eat up slots for far too long for no good reason [21:39:38] ah it had some time out while trying to delete instances since starting at 21:25 [21:39:58] and some others at 21:00 - 21:10 [21:40:42] andrewbogott: chasemp: there have been some launch errors and delete timeout around 21:00 and 21:25-21:30 [21:40:44] https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=12&fullscreen&orgId=1 [21:41:06] shows nodepool could not reach spawned instances after X time (iirc 5 minutes) [21:41:22] so I guess something flapped / restarted at those marks [21:42:21] and the rabbitmq stats show nothing interesting ( https://grafana.wikimedia.org/dashboard/db/labs-queue-server-rabbitmq?orgId=1&from=now-12h&to=now ) :( [21:43:06] hard to know atm as nodepool cleaned up all the evidence [21:43:28] SMalyshev: seems it was transient. Some instances could not be spawned nor deleted and thus the pool of instances could not replenish. 
Seems it is fine again now [21:43:53] hashar: yeah my patch is finally being processed after 40 mins in the queue [21:44:03] chasemp: it has logs of the instance id, but there is little reason given for the actual timeout :\ [21:44:04] looks like it's recovering [21:44:22] the graphs also trending down [21:44:36] SMalyshev: and if you browse down at the bottom of the zuul page, there are three little buttons. One of them is Nodepool and shows the state of the pool [21:44:49] and ultimately https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&orgId=1 [21:45:00] hashar: yeah, if only I could understand what those graphs mean :) [21:45:29] SMalyshev: green = available to run tests, other colors: busy :D [21:46:03] green has been kinda thin looking at the graphs... 7 building, 11 deleting, 2 ready, 4 used [21:46:21] yup the queue is busy :/ [21:46:45] ok looks like it's moving along [21:52:05] thcipriani: that delete --now command is magic [21:52:20] thcipriani: the time to delete an instance is quite noticeable ( 30+ secs) [21:54:00] hashar: yeah, I don't often run that command except when instances are in "delete" for a long time, and it usually times out or hangs forever, so I'm not sure what an appropriate length of time for that command to complete is. [22:03:08] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Nodepool: Investigate nodepool slow deletion - https://phabricator.wikimedia.org/T172229#3491178 (10hashar) [22:03:46] thcipriani: I am not sure either. But it is definitely throttled by the rate setting ( https://phabricator.wikimedia.org/T172229 ) [22:04:04] seems it does a few preliminary queries before deleting [22:07:04] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Nodepool: Investigate nodepool slow deletion - https://phabricator.wikimedia.org/T172229#3491199 (10hashar) [22:14:19] chasemp: andrewbogott: looks like labcontrol1001 starts swapping, which more or less correlates with the past issues and the rises in nodepool rates [22:14:31] the long view for labcontrol1001 swap https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=16&fullscreen&orgId=1&var-server=labcontrol1001&var-network=bond0&from=now-30d&to=now [22:14:45] interesting hashar [22:14:52] https://phabricator.wikimedia.org/T170492 [22:15:02] we are very close to moving off the puppetmaster overhead there so maybe that gets us back some RAM [22:15:24] it is really just a speculation, but the issue on July 12th was correlated with the swap rising over 2 or 3 days [22:15:35] until it got full and whatever service exploded [22:16:44] good thought [22:16:54] what I don't get though [22:17:06] is that the machine apparently had 6G of free space [22:17:21] so maybe it is a process that takes 6+GBytes of memory that attempts to fork [22:18:05] maybe linux would try to allocate what it is missing from the swap and eventually bails out because there is not enough memory [22:18:45] the last time I had to debug puppetmasters and memory issues I began drinking heavily [22:18:53] let's see how separating those concerns affects things [22:18:59] nova + puppetmasters [22:19:06] well [22:19:18] ahhhhh [22:19:36] linux vm.overcommit_memory https://phabricator.wikimedia.org/T162166#3158827 [22:19:52] and it is you Chase that pointed it out to me when the Scribunto CI tests were randomly failing :] [22:21:00] chasemp: I am going to bed, but most probably a service on labcontrol1001 will fail soonish and cause some issue on openstack [22:21:15]
what makes you think that? [22:21:26] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=16&fullscreen&orgId=1&var-server=labcontrol1001&var-network=bond0&from=now-30d&to=now [22:21:28] the swap graph [22:21:35] ah [22:21:38] which more or less correlates with the bump of nodepool rates :] [22:22:38] but most probably it is a red herring! [22:22:53] or another effect of a common issue [22:23:01] I'd bet on one of those two [22:23:44] we will see :] [22:23:50] anyway for now I am heading to bed [22:24:04] later on hashar [23:03:40] 10Release-Engineering-Team (Watching / External), 10Operations, 10Wikimania-Hackathon-2017-Organization: Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3491406 (10bd808) There is an existing [[https://tools.wmflabs.org/openstack-browser/project/wik... [23:04:56] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Watching / External), 10Cloud-VPS, 10Nodepool, and 2 others: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492#3491412 (10chasemp) I did a small bit of poking today. It seems we are u... [23:59:43] 10Browser-Tests-Infrastructure, 10Release-Engineering-Team, 10MinervaNeue, 10Reading-Web-Backlog, and 2 others: MinervaNeue browser test are flaking (waiting for {:class=>"mw-notification", :tag_name=>"div"} to become present ) - https://phabricator.wikimedia.org/T170890#3446357 (10Jdlrobson) Not having mu...
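To close out the two threads the log ends on, the slow deletions tracked in T172229 and the labcontrol1001 swap/overcommit speculation around T162166 and T170492, here is a rough sketch of the checks being discussed. The config path is the stock nodepool default and is an assumption for illustration, not a fact from the log:

    # Deletions are throttled by the provider's rate setting (seconds between API calls),
    # so N instances flagged for deletion cost roughly N * rate seconds in requests alone;
    # hashar's example above: 8 instances at one call every 8 seconds is 8 * 8 = 64 seconds.
    grep -n 'rate' /etc/nodepool/nodepool.yaml    # assumed default config path

    # The memory angle on labcontrol1001:
    cat /proc/sys/vm/overcommit_memory            # 0 = heuristic, 1 = always overcommit, 2 = strict accounting
    free -m                                       # headline RAM and swap usage
    vmstat 5 3                                    # si/so columns show active swap-in/swap-out

A strict overcommit setting (2) can make fork() of a large process fail even when memory looks free, which would be consistent with the 6+ GB scenario hashar describes.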